I am trying to filter some data that I have in R. It is formatted like this:
id config_id alpha begin end day
1 1 1 5 138 139 6
2 1 2 5 137 138 6
3 1 3 5 47 48 2
4 1 3 3 46 47 2
5 1 4 3 45 46 2
6 1 4 3 43 44 2
...
id config_id alpha begin end day
1 2 1 5 138 139 6
2 2 2 5 137 138 6
3 2 2 5 136 137 6
4 2 3 3 45 46 2
5 2 3 3 44 45 2
6 2 4 3 43 44 2
My goal is to remove any configuration that has beginnings and endings on the same day. For example, in the top example config_id 3 is not acceptable because both instances of config_id 3 occur on day 2. The same goes for config_id 4. In the bottom example, config_id 2 and config_id 3 are unacceptable for the same reason.
Basically, if a config_id is repeated AND any value from the day column shows up more than once for that config_id, then I want to remove that config_id from the list.
Right now I'm using a fairly complex lapply-based approach, but there must be an easier way.
Thanks!
You can do this several ways, assuming your data is stored in a data frame called my_data.
base R
# flag config_ids where any day value occurs more than once
same_day <- aggregate(my_data$day, my_data["config_id"], function(x) any(table(x) > 1))
names(same_day)[2] <- "same_day"
my_data <- merge(my_data, same_day, by = "config_id")
# keep only the config_ids that were not flagged
my_data <- my_data[!my_data$same_day, ]
dplyr
library(dplyr)
my_data <- my_data %>%
  group_by(config_id) %>%
  mutate(same_day = any(table(day) > 1)) %>%
  filter(!same_day)
data.table
library(data.table)
my_data <- data.table(my_data, key = "config_id")
same_day <- my_data[, .(same_day = any(table(day) > 1)), by = "config_id"]
my_data[!my_data[same_day]$same_day, ]
We can also use n_distinct from dplyr. Here, we group by 'id' and 'config_id', then remove rows using filter: if the number of rows within a group is greater than 1 (n() > 1) and (&) the number of distinct elements in 'day' is equal to 1 (n_distinct(day) == 1), those rows are removed.
library(dplyr)
df1 %>%
group_by(id, config_id) %>%
filter(!(n()>1 & n_distinct(day)==1))
#Source: local data frame [4 x 6]
#Groups: id, config_id [4]
# id config_id alpha begin end day
# (int) (int) (int) (int) (int) (int)
#1 1 1 5 138 139 6
#2 1 2 5 137 138 6
#3 2 1 5 138 139 6
#4 2 4 3 43 44 2
This also works correctly when the same 'config_id' occurs on different days, for example after changing
df1$day[4] <- 3
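Re-running the same pipeline after that change should keep both rows of that group, since they now fall on different days:
df1 %>%
  group_by(id, config_id) %>%
  filter(!(n() > 1 & n_distinct(day) == 1))
# rows 3 and 4 (id 1, config_id 3) are now retained because their days differ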
A similar option using data.table is uniqueN. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and then, grouped by 'id' and 'config_id', subset the dataset (.SD) using the same logical condition.
library(data.table) # v1.9.6+
setDT(df1)[, if(!(.N>1 & uniqueN(day) == 1L)) .SD, by = .(id, config_id)]
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), config_id = c(1L, 2L, 3L, 3L, 4L, 4L, 1L, 2L, 2L, 3L,
3L, 4L), alpha = c(5L, 5L, 5L, 3L, 3L, 3L, 5L, 5L, 5L, 3L, 3L,
3L), begin = c(138L, 137L, 47L, 46L, 45L, 43L, 138L, 137L, 136L,
45L, 44L, 43L), end = c(139L, 138L, 48L, 47L, 46L, 44L, 139L,
138L, 137L, 46L, 45L, 44L), day = c(6L, 6L, 2L, 2L, 2L, 2L, 6L,
6L, 6L, 2L, 2L, 2L)), .Names = c("id", "config_id", "alpha",
"begin", "end", "day"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
I have two vectors of id values associated with two different datasets. The two vectors correspond to the same individuals, but the id values are unrelated (and there are multiple observations for each individual in each dataset). My goal is to merge the two datasets by individual, but because the ids differ and the datasets are different lengths, there is no way to do that without first matching the ids. There's obviously a lot more data than what I included in the example.
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
So 4033 = 1; 4833 = 4...etc.
dummy dataset1:
id day y
1 1 10
1 2 4
1 3 2
4 1 9
4 2 10
4 3 6
dummy dataset2:
id day y1
4033 1 100
4033 1 120
4033 2 150
4033 3 200
4833 1 120
4833 2 100
4833 2 50
4833 3 100
4833 3 200
What I would like is an easy way to get:
dummy dataset1 output:
id day y id.2
1 1 10 4033
1 2 4 4033
1 3 2 4033
4 1 9 4833
4 2 10 4833
4 3 6 4833
I'm trying a solution with a for loop, like:
for (i in length(dataset)) {
dataset$id[dataset[[1]] %in% int] <- int1
}
But that's not working correctly (probably for an obvious reason I'm missing).
As we have two vectors, we can easily create a match with a named vector in base R
df1$id.2 <- setNames(a, b)[as.character(df1$id)]
df1
# id day y id.2
#1 1 1 10 4033
#2 1 2 4 4033
#3 1 3 2 4033
#4 4 1 9 4833
#5 4 2 10 4833
#6 4 3 6 4833
Or another base R option is match
df1$id.2 <- a[match(df1$id, b)]
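The same lookup also works in the other direction; for example, to attach the small ids to df2 (defined below) in a new column, here arbitrarily called id.1:
df2$id.1 <- b[match(df2$id, a)]  # maps 4033 -> 1, 4833 -> 4, etc.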
data
df1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(id = c(4033L, 4033L, 4033L, 4033L, 4833L, 4833L,
4833L, 4833L, 4833L), day = c(1L, 1L, 2L, 3L, 1L, 2L, 2L, 3L,
3L), y1 = c(100L, 120L, 150L, 200L, 120L, 100L, 50L, 100L, 200L
)), class = "data.frame", row.names = c(NA, -9L))
Another approach is to make a data.frame of the IDs and use merge.
datasetID <- data.frame(id = b, id.2 = a)
merge(dataset1,datasetID)
id day y id.2
1 1 1 10 4033
2 1 2 4 4033
3 1 3 2 4033
4 4 1 9 4833
5 4 2 10 4833
6 4 3 6 4833
Data
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
dataset1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)), class = "data.frame", row.names = c(NA,
-6L))
I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays for each combination of month and window into a new data table that I can export as a CSV.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like the new data table to contain the min and max agedays for each combination of window and month, as well as the count for each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month' and 'window', get the min and max of 'agedays' and the number of distinct (n_distinct) elements of 'pid'.
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
    function(x) cbind(month = x$month[1], window = x$window[1],
        min = min(x$agedays), max = max(x$agedays),
        N = length(unique(x$pid)))))
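One side note: because cbind() on plain vectors builds a matrix, the result above is a matrix rather than a data frame. If you need a data frame, you could wrap the whole call in as.data.frame() (res is just an illustrative name):
res <- as.data.frame(do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
    function(x) cbind(month = x$month[1], window = x$window[1],
        min = min(x$agedays), max = max(x$agedays),
        N = length(unique(x$pid))))))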
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
Using data.table, we can calculate min, max of agedays along with number of rows for each combination of month and window.
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))
What I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (the first 4 rows) as an example and looking at Jan_18:
- The observations of Adam, Char and Greg are valid since they have both scores and weights, so their scores and weights should be included.
- Fred does not have a Jan_18_Weight, therefore both his Jan_18_score and Jan_18_Weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150) + (5*30) + (2*80)] / [150 + 30 + 80]
= 2.92307
This calculation would be repeated for each class and each month.
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with the tidyverse. The main trick is to replace NA with 0 in the weight columns and then use weighted.mean() with na.rm = TRUE to ignore NA scores. To do so, gather the scores and weights into a single column, group by Class and month_abb (a calculated grouping field), and apply weighted.mean() within each group.
library(dplyr)
library(tidyr)

df %>%
  mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
  gather(month, value, -Student, -Class) %>%
  group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
  summarize(
    weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = TRUE)
  ) %>%
  ungroup() %>%
  spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
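As a sanity check of the NA handling: a zero weight simply drops that observation from weighted.mean(), so for Class 1 in January (with Fred's missing weight replaced by 0) we recover the hand calculation from the question:
weighted.mean(c(3, 5, -7, 2), c(150, 30, 0, 80))
# [1] 2.923077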
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
Maybe this could be solved in a much better way, but here is one base R option where we perform the aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70
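If you also want the column names from the desired output (Jan_18_Weight_Score, etc.), you could rename the combined result afterwards; res here is just a name for the result of the last line above:
res <- cbind(df1[1], df2[-1]/df1[-1])
names(res)[-1] <- sub("_score$", "_Weight_Score", names(res)[-1])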
I have a very messy dataset created by a research device. The data show a physiological measure ("Physio") every few milliseconds ("Time"). The output also lists several user messages, such as when a trial starts ("START_TRIAL n"), when a trial ends ("STOP_TRIAL"), and other random things that may be of interest to the researcher. Sometimes the "START_TRIAL n" message is repeated consecutively, and sometimes, when there is no message, a simple "0" is left in what would otherwise be a blank cell.
I am hoping to create a new column that will signify which trial the current case belongs to. (See example data below).
Is there a way to do this with dplyr and mutate? I am wondering if I may need to do an if-then statement that changes the values of a new column for every case, but surely there's a more elegant solution? (Thank you in advance for helping out this newbie!)
Time Physio Cond
1 34 START_TRIAL 1
2 33 0
3 25 RANDOM_MSG
4 43 STOP_TRIAL
5 27 START_TRIAL 2
6 54 START_TRIAL 2
7 32 0
8 54 RANDOM_MSG
9 23 STOP_TRIAL
structure(list(Time = 1:9, Physio = c(34L, 33L, 25L, 43L, 27L,
54L, 32L, 54L, 23L), Cond = structure(c(4L, 2L, 3L, 6L, 5L, 5L,
2L, 3L, 6L), .Label = c("", "0", "RANDOM_MSG", "START_TRIAL 1",
"START_TRIAL 2", "STOP_TRIAL"), class = "factor")), .Names = c("Time",
"Physio", "Cond"), row.names = c(NA, 9L), class = "data.frame")
into
Time Physio Trial Cond
1 34 1 START_TRIAL 1
2 33 1 0
3 25 1 RANDOM_MSG
4 43 1 STOP_TRIAL
5 27 2 START_TRIAL 2
6 54 2 START_TRIAL 2
7 32 2 0
8 54 2 RANDOM_MSG
9 23 2 STOP_TRIAL
structure(list(Time = 1:9, Physio = c(34L, 33L, 25L, 43L, 27L,
54L, 32L, 54L, 23L), Trial = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Cond = structure(c(4L, 2L, 3L, 6L, 5L, 5L, 2L, 3L, 6L), .Label = c("",
"0", "RANDOM_MSG", "START_TRIAL 1", "START_TRIAL 2", "STOP_TRIAL"
), class = "factor")), .Names = c("Time", "Physio", "Trial",
"Cond"), row.names = c(NA, 9L), class = "data.frame")
One option would be to identify the 'START_TRIAL' entries with grep, use match to get the trial index, and then fill the NA elements with the previous non-NA element.
library(dplyr)
library(tidyr)
df1 %>%
mutate(Trial = match(PhysioCond, unique(grep("START_TRIAL",
PhysioCond, value = TRUE)))) %>%
fill(Trial)
# Time PhysioCond Trial
#1 34 START_TRIAL 1 1
#2 33 0 1
#3 25 RANDOM_MSG 1
#4 43 STOP_TRIAL 1
#5 27 START_TRIAL 2 2
#6 54 START_TRIAL 2 2
#7 32 0 2
#8 54 RANDOM_MSG 2
#9 23 STOP_TRIAL 2
NOTE: I'm not sure about the actual column name, but the logic should work the same.
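If you would rather take the trial number from the message itself (instead of the order of first appearance), one variation, assuming the messages always have the form "START_TRIAL n", could be:
df1 %>%
  mutate(Trial = as.integer(sub("START_TRIAL\\s+", "",
      ifelse(grepl("START_TRIAL", PhysioCond), PhysioCond, NA_character_)))) %>%
  fill(Trial)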
data
df1 <- structure(list(Time = c(34L, 33L, 25L, 43L, 27L, 54L, 32L, 54L,
23L), PhysioCond = c("START_TRIAL 1", "0", "RANDOM_MSG", "STOP_TRIAL",
"START_TRIAL 2", "START_TRIAL 2", "0", "RANDOM_MSG", "STOP_TRIAL"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9"))