How to perform countif in R and export the data?

I have a set of data:
ID<-c(111,111,222,222,222,222,222,222)
TreatmentDate<-as.Date(c("2010-12-12","2011-12-01","2009-8-7","2010-5-7","2011-3-7","2011-8-5","2013-8-27","2016-9-3"))
Treatment<-c("AA","BB","CC","DD","AA","BB","BB","CC")
df<-data.frame(ID,TreatmentDate,Treatment)
df
   ID TreatmentDate Treatment
1 111    2010-12-12        AA
2 111    2011-12-01        BB
3 222    2009-08-07        CC
4 222    2010-05-07        DD
5 222    2011-03-07        AA
6 222    2011-08-05        BB
7 222    2013-08-27        BB
8 222    2016-09-03        CC
I also have another dataframe showing the test date for each subject:
UID<-c(111,222)
Testdate<-as.Date(c("2012-12-31","2014-12-31"))
SubjectTestDate<-data.frame(UID,Testdate)
I am trying to summarise the data such that, say, if I want to see how many treatments a subject had prior to their test date, I would get something like the table below, and I would like to export this to a spreadsheet.
ID Prior_to_date TreatmentAA TreatmentBB TreatmentCC TreatmentDD
111 31/12/2012 1 1 0 0
222 31/12/2014 1 2 1 1
Any help would be much appreciated!!

We could join the two datasets on 'ID', create a column that checks the condition ('indx'), and use dcast to convert from 'long' to 'wide' format.
library(data.table) # v1.9.5+
dcast(setkey(setDT(df), ID)[SubjectTestDate][,
        indx := sum(TreatmentDate <= Testdate), list(ID, Treatment)],
      ID + Testdate ~ paste0('Treatment', Treatment),
      value.var = 'indx', length)
# ID Testdate TreatmentAA TreatmentBB TreatmentCC TreatmentDD
#1: 111 2012-12-31 1 1 0 0
#2: 222 2014-12-31 1 2 2 1
Update
Based on the modified 'df', we join 'df' with 'SubjectTestDate', create the 'indx' column as before along with a sequence column 'Seq' grouped by 'ID' and 'Treatment', use dcast, and then remove the duplicated 'ID' rows with unique.
unique(dcast(setkey(setDT(df), ID)[SubjectTestDate][,
         c('indx', 'Seq') := list(sum(TreatmentDate <= Testdate), 1:.N),
         .(ID, Treatment)],
       ID + Seq + Testdate ~ paste0('Treatment', Treatment),
       value.var = 'indx', fill = 0), by = 'ID')
# ID Seq Testdate TreatmentAA TreatmentBB TreatmentCC TreatmentDD
#1: 111 1 2012-12-31 1 1 0 0
#2: 222 1 2014-12-31 1 2 1 1
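For the export step, one option (a minimal sketch; the file names here are just examples) is to assign the dcast() result to an object and write it out with base R's write.csv(), which Excel and similar tools open directly, or use the openxlsx package if a native .xlsx file is required:
# assuming the dcast() result above has been assigned to `res`
write.csv(res, "treatment_counts.csv", row.names = FALSE)
# or, for an Excel file (requires the openxlsx package):
# openxlsx::write.xlsx(res, "treatment_counts.xlsx")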

Related

Select data.table columns based on condition, within by

I want to extract data.table columns if their contents fulfil a criterion, and I need a method that will work with by (or in some other way within combinations of columns). I am not very experienced with data.table and have tried my best with .SDcols and whatever else I could think of.
Example: I often have datasets with observations at multiple time points for multiple subjects. They also contain covariates which do not vary within subjects.
library(data.table)
dt1 <- data.table(
  id   = c(1, 1, 2, 2, 3, 3),
  time = c(1, 2, 1, 2, 1, 2),
  meas = c(452, 23, 555, 33, 322, 32),
  age  = c(30, 30, 54, 54, 20, 20),
  bw   = c(75, 75, 81, 81, 69, 70)
)
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)? I'd like a function call that would return
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)? The function call should return:
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
Of course, I am interested if you know of a function that addresses the specific example above, but I am even more curious about how to do this generally: for example, columns that contain more than two values > 1000 within any combination of id and time in by = .(id, time), or whatever...
Thanks!
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)?
Maybe something like:
f <- function(DT, byChar) {
  # per group, list the columns with exactly one unique value, then keep
  # only the columns that satisfy this in every group
  cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) == 1])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f(dt1, "id")
f(dt1, "id")
output:
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)?
Similarly,
f2 <- function(DT, byChar, k) {
  # per group, list the columns with more than k unique values, then keep
  # only the columns that satisfy this in every group
  cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) > k])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f2(dt1, "id", 1)
output:
id time meas
1: 1 1 452
2: 1 2 23
3: 2 1 555
4: 2 2 33
5: 3 1 322
6: 3 2 32
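To generalize beyond these two cases, one option (a sketch, not part of the original answer) is to let the caller pass an arbitrary per-column predicate; a column is kept only when the predicate holds within every group:
# `pred` is any function that takes a column vector and returns a single logical;
# columns (besides byChar) are kept only if pred() is TRUE within every group
f_gen <- function(DT, byChar, pred) {
  cols <- Reduce(intersect,
                 DT[, .(.(names(.SD)[sapply(.SD, pred)])), by = byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f_gen(dt1, "id", function(x) uniqueN(x) == 1)  # same result as f(dt1, "id")
f_gen(dt1, "id", function(x) uniqueN(x) > 1)   # same result as f2(dt1, "id", 1)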
data:
library(data.table)
dt1 <- data.table(
id=c(1,1,2,2,3,3),
time=c(1,2,1,2,1,2),
meas=c(452,23,555,33,322,32),
age=c(30,30,54,54,20,20),
bw=c(75,75,81,81,69,70)
)
This might also be an option:
1. Count unique values per column, by ID (using data.table::uniqueN).
2. Check in which columns the sum of unique values (by group) equals the number of unique IDs (using colSums).
3. Only keep (or drop) the wanted columns.
library(data.table)
ids <- uniqueN(dt1$id)
#no variation
dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE]
id age
1: 1 30
2: 1 30
3: 2 54
4: 2 54
5: 3 20
6: 3 20
#variation
dt1[, c(TRUE, !colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE]
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
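If the deduplicated form shown in the question is wanted for the no-variation case, wrapping that selection in unique() should do it (a small follow-up sketch):
unique(dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE])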
Based on chinsoon12's suggestion, I managed to put something together. I need four steps, and I'm not sure how efficient it is, but at least it does the job. To recap, this is the dataset:
dt1
id time meas age bw
1: 1 1 452 30 75
2: 1 2 23 30 75
3: 2 1 555 54 81
4: 2 2 33 54 81
5: 3 1 322 20 69
6: 3 2 32 20 70
I put this together to get the columns that are constant within "id" (only age):
cols.id <- "id"
dt2 <- dt1[, .SD[, lapply(.SD, function(x) uniqueN(x) == 1)], by = cols.id]
ifkeep <- dt2[, sapply(.SD, all), .SDcols = !(cols.id)]
keep <- c(cols.id, setdiff(colnames(dt2), cols.id)[ifkeep])
unique(dt1[, keep, with = FALSE])
id age
1: 1 30
2: 2 54
3: 3 20
And to get the columns that vary within any value of "id" (age is dropped):
cols.id <- "id"
## different from above: ==1 -> >1
dt2 <- dt1[, .SD[, lapply(.SD, function(x) uniqueN(x) > 1)], by = cols.id]
## difference from above: all -> any
ifkeep <- dt2[, sapply(.SD, any), .SDcols = !(cols.id)]
keep <- c(cols.id, setdiff(colnames(dt2), cols.id)[ifkeep])
unique(dt1[, keep, with = FALSE])
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70

Lookup observations data based on another table

I have two tibbles that I am trying to reconcile. The first tibble has over a million observations; the first few rows are as follows:
data
ID Time(Converted to number)
1 23160
1 23161
1 23162
1 23163
1 23164
1 23165
2 24251
2 24252
The second tibble is a lookup table (containing information about a particular event that has occurred); a simplified version is as follows:
lookup_table
ID Event_Time Event_Indicator Number_of_Cumulative_Events
1 23162 1 1
1 23164 1 2
2 24255 1 1
2 24280 0 1
I would like to create a 3rd column in the first tibble, such that it shows the number of cumulative events at the time of the observation. The 3rd column in the above example would therefore be:
ID Time(Converted to number) Number
1 23160 0
1 23161 0
1 23162 1
1 23163 1
1 23164 2
1 23165 2
2 24251 0
2 24252 0
I am trying to avoid having to loop through the millions of observations to compare each observation's time to the Event_Time in the lookup table because of computation time.
However, I am not sure how to go about doing this without the use of a loop. The issue is that lookup_table contains some IDs multiple times; if every ID appeared in lookup_table only once, then I could do:
data$Event_Time <- lookup_table[match(data$ID, lookup_table$ID),"Event_Time"]
data$Number <- data %>% mutate(ifelse(Time >= Event_Time,1,0))
Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.
Edit: I am not trying to join the tables, but rather to compare the time columns in lookup_table and data to obtain my desired column. For example, if I were to write an inefficient loop, it would be:
for (i in 1:nrow(data)) {
  data$Number[i] <- subset(lookup_table, ID == data$ID[i])[max(which(
    data$Time[i] >= lookup_table$Event_Time)), "Number_of_Cumulative_Events"]
}
A possible solution is to count the cumulative events after the join. Note that an update on join is used.
library(data.table)
setDT(data)[, new := 0L][setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
, new := cumsum(new), by = ID][]
ID Time new
1: 1 23160 0
2: 1 23161 0
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 0
8: 2 24252 0
Alternatively,
setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
is.na(new), new := 0][
, new := cumsum(new), by = ID][]
will set missing entries to zero after the join.
A completely different approach is to use a rolling join:
lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
ID Event_Time Number_of_Cumulative_Events
1: 1 23160 NA
2: 1 23161 NA
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 NA
8: 2 24252 NA
(NA's have been left untouched for illustration)
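If zeros are preferred over NA here as well, they can be filled in after the rolling join (a sketch, assuming data and lookup_table are already data.tables as above and the result is assigned to res):
res <- lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
res[is.na(Number_of_Cumulative_Events), Number_of_Cumulative_Events := 0]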

Count number of observations in one data frame based on values from another data frame

I have two very large data frames (50 million and 1.5 million rows) where some of the variables are the same in both. I need to compare them and add another column to one data frame which gives the count of matching observations in the other data frame.
For example: DF1 and DF2 both contain id, date, age_grp and gender variables. I want to add another column (match_count) to DF1 which shows the count where DF1.id = DF2.id and DF1.date = DF2.date and DF1.age_grp = DF2.age_grp and DF1.gender = DF2.gender.
DF1
id date age_grp gender val
101 20140110 1 1 666
102 20150310 2 2 777
103 20160901 3 1 444
104 20160903 4 1 555
105 20010910 5 1 888
DF2
id date age_grp gender state
101 20140110 1 1 10
101 20140110 1 1 12
101 20140110 1 2 22
102 20150310 2 2 33
In the above example the combination "id = 101, date = 20140110, age_grp = 1, gender = 1" appears twice in DF2, hence the count 2, and the combination "id = 102, date = 20150310, age_grp = 2, gender = 2" appears once, hence the count 1.
Below is the resultant data frame I am looking for
Result
id date age_grp gender val match_count
101 20140110 1 1 666 2
102 20150310 2 2 777 1
103 20160901 3 1 444 0
104 20160903 4 1 555 0
105 20010910 5 1 888 0
Here is what I am doing at the moment and it works perfectly well for small data but does not scale well for large data. For this instance it did not return any results even after several hours.
Note: I have gone through this thread and it does not address the scale issue
with(DF1,
     mapply(
       function(arg_id, arg_agegrp, arg_gender, arg_date) {
         sum(arg_id == DF2$id
             & arg_agegrp == DF2$age_grp
             & arg_gender == DF2$gender
             & arg_date == DF2$date)
       },
       id, age_grp, gender, date)
)
UPDATE
The id column is not unique, hence there could be two observations where id, date, age_grp and gender are the same and only the val column differs.
Here is how I would solve this problem using dplyr:
library(dplyr)
df2$state <- NULL # column state is not needed here
Name <- names(df2)
df2 <- df2 %>% group_by_(.dots = names(df2)) %>% dplyr::summarise(match_count = n())
Target <- merge(df1, df2, by.x = Name, by.y = Name, all.x = TRUE)
Target[is.na(Target)] <- 0
Target
id date age_grp gender val match_count
1 101 20140110 1 1 666 2
2 102 20150310 2 2 777 1
3 103 20160901 3 1 444 0
4 104 20160903 4 1 555 0
5 105 20010910 5 1 888 0
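Note that group_by_() with .dots is deprecated in current dplyr; a sketch of the same idea with current verbs (assuming dplyr >= 1.0 and the question's DF1/DF2) would be:
library(dplyr)
# count matching key combinations in DF2, join the counts onto DF1,
# and replace the NA counts of unmatched rows with 0
counts <- DF2 %>% count(id, date, age_grp, gender, name = "match_count")
Target <- DF1 %>%
  left_join(counts, by = c("id", "date", "age_grp", "gender")) %>%
  mutate(match_count = coalesce(match_count, 0L))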
data.table might be helpful here too. Aggregate DF2 by the variables specified, then join this back to DF1.
library(data.table)
setDT(DF1)
setDT(DF2)
vars <- c("id","date","age_grp","gender")
DF1[DF2[, .N, by=vars], count := N, on=vars]
DF1
# id date age_grp gender val count
#1: 101 20140110 1 1 666 2
#2: 102 20150310 2 2 777 1
#3: 103 20160901 3 1 444 NA
#4: 104 20160903 4 1 555 NA
#5: 105 20010910 5 1 888 NA
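The join leaves NA for combinations that never occur in DF2; if zeros are preferred, as in the desired result, a quick follow-up fill works (a small sketch):
DF1[is.na(count), count := 0L]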

Replace value in data frame with value from other data frame based on set of conditions

In df1 I need to replace values for msec with corresponding values in df2.
df1 <- data.frame(ID=c('rs', 'rs', 'rs', 'tr','tr','tr'), cond=c(1,1,2,1,1,2),
block=c(2,2,4,2,2,4), correct=c(1,0,1,1,1,0), msec=c(456,678,756,654,625,645))
df2 <- data.frame(ID=c('rs', 'rs', 'tr','tr'), cond=c(1,2,1,2),
block=c(2,4,2,4), mean=c(545,664,703,765))
In df1, if correct==0, then reference df2 with the matching values of ID, cond, and block. Replace the value for msec in df1 with the corresponding value for mean in df2.
For example, the second row in df1 has correct==0. So, in df2 find the corresponding row where ID=='rs', cond==1, block==2 and use the value for mean (mean=545) to replace the value for msec (msec=678). Note that in df1 combinations of ID, block, and cond can repeat, but each combination occurs only once in df2.
Using the data.table package:
# load the 'data.table' package
library(data.table)
# convert the data.frame's to data.table's
setDT(df1)
setDT(df2)
# update df1 by reference with a join with df2
df1[df2[, correct := 0], on = .(ID, cond, block, correct), msec := i.mean]
which gives:
> df1
ID cond block correct msec
1: rs 1 2 1 456
2: rs 1 2 0 545
3: rs 2 4 1 756
4: tr 1 2 1 654
5: tr 1 2 1 625
6: tr 2 4 0 765
Note: The above code will update df1 instead of creating a new dataframe, which is more memory-efficient.
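One side effect worth noting: the join above adds a correct column to df2 by reference. If df2 should stay untouched, a copy can be joined instead (a sketch using data.table::copy):
df1[copy(df2)[, correct := 0], on = .(ID, cond, block, correct), msec := i.mean]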
One option would be to use base R with an interaction() and a match(). How about:
df1[which(df1$correct==0),"msec"] <- df2[match(interaction(df1[which(df1$correct==0),c("ID","cond","block")]),
interaction(df2[,c("ID","cond", "block")])),
"mean"]
df1
# ID cond block correct msec
#1 rs 1 2 1 456
#2 rs 1 2 0 545
#3 rs 2 4 1 756
#4 tr 1 2 1 654
#5 tr 1 2 1 625
#6 tr 2 4 0 765
We overwrite the msec values of the correct == 0 rows with their matched values from df2$mean.
Edit: Another option would be an SQL merge; this could look like:
library(sqldf)
merged <- sqldf('SELECT l.ID, l.cond, l.block, l.correct,
case when l.correct == 0 then r.mean else l.msec end as msec
FROM df1 as l
LEFT JOIN df2 as r
ON l.ID = r.ID AND l.cond = r.cond AND l.block = r.block')
merged
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765
With dplyr: this solution left_joins on all common columns and mutates msec when correct is 0.
library(dplyr)
left_join(df1, df2) %>%
  mutate(msec = ifelse(correct == 0, mean, msec)) %>%
  select(-mean)
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765

Subsetting rows in R

I have a huge data set in the following format:
ID Interaction Interaction_number
1 abc 1
1 xyz 2
1 pqr 3
1 ced 0
2 ab 0
2 efg 1
3 asdf 2
3 fgh 3
3 abc 0
4 sql 1
4 ghj 2
5 poi 2
6 pqr 1
Now I want to extract all data for the IDs that have an Interaction_number of 0. For example:
ID Interaction Interaction_number
1 abc 1
1 xyz 2
1 pqr 3
1 ced 0
2 ab 0
2 efg 1
3 asdf 2
3 fgh 3
3 abc 0
It's a huge dataset. I need to extract it using R.
I tried using the sqldf function.
x<-sqldf("select * from data where data$ID in (select data$ID from data where data$Interaction_number ==0)")
But the function didn't work. I was looking to add a flag column (1 for all IDs where there is an Interaction_number of 0) and then subset those rows, but I can't figure out exactly how to do it.
Can we create a data frame of the IDs and then use subset with that data frame to get all the rows?
Please help.
Thank You
I suggest using the data.table package; then you could obtain your result as follows. Say your data is in a data.frame df:
library(data.table)
dt <- data.table(df, key = 'ID')
tmp <- dt[, list(condition = any(Interaction_number == 0)), by = ID]
res <- dt[tmp[condition == TRUE, list(ID)]]
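A more compact data.table idiom along the same lines (a sketch, not from the original answer) keeps whole groups directly in j by returning .SD only when the group contains a zero:
# keep every row of each ID whose group has at least one Interaction_number == 0
dt[, if (any(Interaction_number == 0)) .SD, by = ID]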
Use this
sqldf("SELECT * FROM data WHERE ID IN (SELECT ID FROM data WHERE Interaction_number=0)")
You do not need the double equal in your test, and do not use data$ID and such to refer to the data columns in the SQL expression (you can use data.ID but it is unnecessary to use the dataframe name in this case).
It may be helpful to read up on SQL before using this function much. Keep in mind that what it will do is turn all your referenced dataframes into tables using the same name as the dataframe, and all of the columns into fields using the same name as the columns. Thus in this case, we are querying a table named data with fields named ID, Interaction, and Interaction_number.
We can do this with dplyr. Group the 'data' by 'ID', and filter if there are any 0 values in 'Interaction_number'.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(any(!Interaction_number))
# ID Interaction Interaction_number
# (int) (chr) (int)
#1 1 abc 1
#2 1 xyz 2
#3 1 pqr 3
#4 1 ced 0
#5 2 ab 0
#6 2 efg 1
#7 3 asdf 2
#8 3 fgh 3
#9 3 abc 0
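The !Interaction_number test works because 0 is the only value treated as FALSE; an equivalent, more explicit form (a small sketch) is:
df1 %>%
  group_by(ID) %>%
  filter(any(Interaction_number == 0))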
Or using ave from base R
df1[with(df1, ave(!Interaction_number, ID, FUN=any)),]
Or this can be done without any group by
df1[df1$ID %in%subset(df1, !Interaction_number)$ID,]
