Count number of observations in one data frame based on values from another data frame - r

I have two very large data frames (50 million and 1.5 million rows) in which some of the variables are the same. I need to compare both and add another column to one data frame which gives the count of matching observations in the other data frame.
For example: DF1 and DF2 both contain id, date, age_grp and gender variables. I want to add another column (match_count) in DF1 which shows the count where DF1.id = DF2.id and DF1.date = DF2.date and DF1.age_grp = DF2.age_grp and DF1.gender = DF2.gender.
DF1
id date age_grp gender val
101 20140110 1 1 666
102 20150310 2 2 777
103 20160901 3 1 444
104 20160903 4 1 555
105 20010910 5 1 888
DF2
id date age_grp gender state
101 20140110 1 1 10
101 20140110 1 1 12
101 20140110 1 2 22
102 20150310 2 2 33
In the above example the combination "id = 101, date = 20140110, age_grp = 1, gender = 1" appears twice in DF2, hence the count 2, and the combination "id = 102, date = 20150310, age_grp = 2, gender = 2" appears once, hence the count 1.
Below is the resultant data frame I am looking for
Result
id date age_grp gender val match_count
101 20140110 1 1 666 2
102 20150310 2 2 777 1
103 20160901 3 1 444 0
104 20160903 4 1 555 0
105 20010910 5 1 888 0
Here is what I am doing at the moment; it works perfectly well for small data but does not scale to large data. In this instance it did not return any results even after several hours.
Note: I have gone through this thread and it does not address the scale issue
with(DF1,
     mapply(
       function(arg_id, arg_agegrp, arg_gender, arg_date) {
         sum(arg_id == DF2$id
             & arg_agegrp == DF2$age_grp
             & arg_gender == DF2$gender
             & arg_date == DF2$date)
       },
       id, age_grp, gender, date)
)
UPDATE
The id column is not unique, hence there could be two observations where id, date, age_grp and gender are the same and only the val column differs.

Here is how I would solve this problem using dplyr:
library(dplyr)

df2$state <- NULL  # as noted, the state column is not needed
Name <- names(df2)
df2 <- df2 %>%
  group_by_(.dots = names(df2)) %>%  # group by every remaining column
  dplyr::summarise(match_count = n())
Target <- merge(df1, df2, by.x = Name, by.y = Name, all.x = TRUE)
Target[is.na(Target)] <- 0  # rows with no match get count 0
Target
id date age_grp gender val match_count
1 101 20140110 1 1 666 2
2 102 20150310 2 2 777 1
3 103 20160901 3 1 444 0
4 104 20160903 4 1 555 0
5 105 20010910 5 1 888 0
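group_by_() has since been deprecated in dplyr; a sketch of the same idea with current verbs (assuming dplyr >= 1.0 is available) would be:

```r
library(dplyr)

# Count each id/date/age_grp/gender combination in DF2 ...
match_counts <- DF2 %>%
  count(id, date, age_grp, gender, name = "match_count")

# ... then join the counts onto DF1, turning non-matches into 0
Result <- DF1 %>%
  left_join(match_counts, by = c("id", "date", "age_grp", "gender")) %>%
  mutate(match_count = coalesce(match_count, 0L))
```

count() aggregates before the join, so the join is against 1.5 million (or fewer) rows rather than row-by-row comparisons.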

data.table might be helpful here too. Aggregate DF2 by the variables specified, then join the result back to DF1.
library(data.table)
setDT(DF1)
setDT(DF2)
vars <- c("id","date","age_grp","gender")
DF1[DF2[, .N, by=vars], count := N, on=vars]
DF1
# id date age_grp gender val count
#1: 101 20140110 1 1 666 2
#2: 102 20150310 2 2 777 1
#3: 103 20160901 3 1 444 NA
#4: 104 20160903 4 1 555 NA
#5: 105 20010910 5 1 888 NA
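The join leaves NA where DF1 has no match in DF2; to reproduce the 0 counts asked for, one extra by-reference assignment (a small addition, not part of the original answer) suffices:

```r
# Replace the NA left by the update join with an explicit zero count
DF1[is.na(count), count := 0L]
```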

Related

Is there code to determine the amount of criteria met by a row in R?

I am trying to figure out a way to assign a column that would list out the number of criteria that is met by a certain row. For example, I am looking at how many risk factors for heart disease someone has met and trying to run an ordinal regression on those values. I have tried
cvd_status <- ifelse( data_tot$X5_A_01_d_Heart.Disease=="1"|data_tot$X5_A_01_e_Stroke=="1"|data_tot$X5_A_01_f_Chronic.Kidney.Disease==1, 1,0)
but that only gives me whether people have any risk factors, not how many risk factors they have. Is there any way to figure out how many risk factors someone would have?
Edit: The variables are not simply binary, but are either 1s or 2s or ranges of numbers.
If the variables contain only 0 or 1, then the following could be used:
with(data_tot,
     rowSums(cbind(X5_A_01_d_Heart.Disease,
                   X5_A_01_e_Stroke,
                   X5_A_01_f_Chronic.Kidney.Disease)))
Edit:
And if they are coded as 1 (yes) and 2 (no), plus if other risk factors such as blood pressure and cholesterol level are to be included, AND there are no missing values in these risk factor variables, then you can use something similar to the following:
library(dplyr)

data_tot %>%
  mutate(CVD_Risk.Factors =
           (Heart == 1) +
           (Stroke == 1) +
           (CKD == 1) +
           (Systolic_BP >= 130) + (Diastolic_BP >= 80) +
           (Cholesterol > 150))
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA NA
6 2 2 2 130 105 140 2
You can see that if there are any missing values, then this would not work. One solution is to use rowwise and then sum.
data_tot %>%
  rowwise() %>%  # this tells R to apply the computation row by row
  mutate(CVD_Risk.Factors = sum(  # sum() has an "na.rm" argument
    (Heart == 1),
    (Stroke == 1),
    (CKD == 1),
    (Systolic_BP >= 130), (Diastolic_BP >= 80),
    (Cholesterol > 150), na.rm = TRUE))  # omit NA in the summation
# A tibble: 6 x 7
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA 2 # not NA
6 2 2 2 130 105 140 2
Data:
data_tot <- data.frame(Heart = c(1, 2, 2, 1, 2, 2),
                       Stroke = c(1, 1, 1, 1, 2, 2),
                       CKD = c(2, 2, 1, 2, 2, 2),
                       Systolic_BP = c(118, 125, 133, 120, 155, 130),
                       Diastolic_BP = c(90, 65, 95, 87, 110, 105),
                       Cholesterol = c(200, 150, 190, 250, NA, 140))
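The same NA-tolerant count is also available in base R via rowSums(), which takes na.rm directly (a sketch assuming the column names above):

```r
data_tot$CVD_Risk.Factors <- rowSums(
  cbind(data_tot$Heart == 1,
        data_tot$Stroke == 1,
        data_tot$CKD == 1,
        data_tot$Systolic_BP >= 130,
        data_tot$Diastolic_BP >= 80,
        data_tot$Cholesterol > 150),
  na.rm = TRUE)  # NA comparisons are dropped instead of propagated
```

This avoids rowwise(), which can be slow on large frames.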

Subsetting data.frame to return first 200 rows for specific condition in r

I have a data.frame with 3.3 million rows and 9 columns. Below is an example with the 3 relevant columns.
StimulusName Subject Pupil Means
1 1 101 3.270000
2 1 101 3.145000
3 1 101 3.265000
4 2 101 3.015000
5 2 101 3.100000
6 2 101 3.051250
7 1 102 3.035000
8 1 102 3.075000
9 1 102 3.050000
10 2 102 3.056667
11 2 102 3.059167
12 2 102 3.060000
13 1 103 3.085000
14 1 103 3.125000
15 1 103 3.115000
I want to subset data based on stimulus name and subject and then take either the first few or the last few rows for that subset. So, for example returning row 10 and 11 by getting the first 2 rows where df$StimulusName == 2 & df$Subject == 102.
The actual data frame contains thousands of observations per Stimulus and Subject. I want to use it to plot the first and last 200 observations of the stimulus separately.
Have not tested this out, but should work.
First 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[1:200,]
Then plot df_filtered.
Last 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[(nrow(df_filtered)-199):nrow(df_filtered),]
Then plot df_filtered.
Perhaps you want something like this:
subCond <- function(x, r, c) {
  m <- x[x[, 1] == r & x[, 2] == c, ]
  return(m)
}
Yields e.g.:
> subCond(df, 1, 102)
StimulusName Subject PupilMeans
7 1 102 3.035
8 1 102 3.075
9 1 102 3.050
or
> subCond(df, 2, 101)
StimulusName Subject PupilMeans
4 2 101 3.01500
5 2 101 3.10000
6 2 101 3.05125
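Since the stated goal is the first and last 200 observations for every Stimulus/Subject pair, a grouped sketch may save repeating the subset per pair (assuming dplyr >= 1.0, which provides slice_head()/slice_tail()):

```r
library(dplyr)

# First 200 rows within each StimulusName/Subject group
first_200 <- df %>%
  group_by(StimulusName, Subject) %>%
  slice_head(n = 200) %>%   # use slice_tail(n = 200) for the last 200
  ungroup()
```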

Replace value in data frame with value from other data frame based on set of conditions

In df1 I need to replace values for msec with corresponding values in df2.
df1 <- data.frame(ID=c('rs', 'rs', 'rs', 'tr','tr','tr'), cond=c(1,1,2,1,1,2),
block=c(2,2,4,2,2,4), correct=c(1,0,1,1,1,0), msec=c(456,678,756,654,625,645))
df2 <- data.frame(ID=c('rs', 'rs', 'tr','tr'), cond=c(1,2,1,2),
block=c(2,4,2,4), mean=c(545,664,703,765))
In df1, if correct==0, then reference df2 with the matching values of ID, cond, and block. Replace the value for msec in df1 with the corresponding value for mean in df2.
For example, the second row in df1 has correct==0. So, in df2 find the corresponding row where ID=='rs', cond==1, block==2 and use the value for mean (mean=545) to replace the value for msec (msec=678). Note that in df1 combinations of ID, block, and cond can repeat, but each combination occurs only once in df2.
Using the data.table package:
# load the 'data.table' package
library(data.table)
# convert the data.frame's to data.table's
setDT(df1)
setDT(df2)
# update df1 by reference with a join with df2
df1[df2[, correct := 0], on = .(ID, cond, block, correct), msec := i.mean]
which gives:
> df1
ID cond block correct msec
1: rs 1 2 1 456
2: rs 1 2 0 545
3: rs 2 4 1 756
4: tr 1 2 1 654
5: tr 1 2 1 625
6: tr 2 4 0 765
Note: The above code will update df1 instead of creating a new dataframe, which is more memory-efficient.
One option would be to use base R with an interaction() and a match(). How about:
df1[which(df1$correct == 0), "msec"] <-
  df2[match(interaction(df1[which(df1$correct == 0), c("ID", "cond", "block")]),
            interaction(df2[, c("ID", "cond", "block")])),
      "mean"]
df1
df1
# ID cond block correct msec
#1 rs 1 2 1 456
#2 rs 1 2 0 545
#3 rs 2 4 1 756
#4 tr 1 2 1 654
#5 tr 1 2 1 625
#6 tr 2 4 0 765
We overwrite the msec values of the correct == 0 rows with their matched values from df2$mean.
Edit: Another option would be an SQL merge; this could look like:
library(sqldf)
merged <- sqldf('SELECT l.ID, l.cond, l.block, l.correct,
                        CASE WHEN l.correct = 0 THEN r.mean ELSE l.msec END AS msec
                 FROM df1 AS l
                 LEFT JOIN df2 AS r
                   ON l.ID = r.ID AND l.cond = r.cond AND l.block = r.block')
merged
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765
With dplyr: this solution left_joins on all shared columns and mutates msec when correct is 0.
library(dplyr)
left_join(df1, df2) %>%
  mutate(msec = ifelse(correct == 0, mean, msec)) %>%
  select(-mean)
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765

Counting observations for a given ID according to date in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
First of all:
Sorry if this question might sound stupid...or I have formatted it incorrectly, I'm new to this.
I have the following data:
Value Date IDnum
1 230 2010-02-01 1
2 254 2011-07-07 2
3 300 2011-12-14 1
4 700 2011-01-23 3
5 150 2010-08-31 3
6 100 2010-05-06 1
Created using the following code:
Value <- c(230, 254, 300, 700, 150, 100)
Date <- as.Date(c("01/02/2010", "07/07/2011", "14/12/2011",
                  "23/01/2011", "31/08/2010", "06/05/2010"),
                "%d/%m/%Y")
IDnum <- c(001, 002, 001, 003, 003, 001)
MyData <- data.frame(Value, Date, IDnum)
I need R to create a column which counts and enumerates every row according to whether it's the first, second, etc. observation for the given IDnum, by date. Thus giving me something similar to:
Value Date IDnum Obs
1 230 2010-02-01 1 1
2 254 2011-07-07 2 1
3 300 2011-12-14 1 3
4 700 2011-01-23 3 2
5 150 2010-08-31 3 1
6 100 2010-05-06 1 2
Thanks
library(data.table)
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
setDT(df)[, Obs := order(Date),by = .(IDnum)]
library(dplyr)
df %>% group_by(IDnum) %>% mutate(Obs = order(Date))
# Value Date IDnum Obs
#1: 230 2010-02-01 1 1
#2: 254 2011-07-07 2 1
#3: 300 2011-12-14 1 3
#4: 700 2011-01-23 3 2
#5: 150 2010-08-31 3 1
#6: 100 2010-05-06 1 2
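One caution not raised in the original answer: order() returns a sorting permutation, not a rank, and the two only coincide when the permutation happens to be its own inverse (as in this small example). For a robust 1st/2nd/3rd numbering, a rank-style function is the safer sketch:

```r
library(data.table)

# frank() with ties.method = "first" numbers observations by Date within each IDnum
setDT(MyData)[, Obs := frank(Date, ties.method = "first"), by = IDnum]
```

The dplyr equivalent would replace order(Date) with rank(Date) (or row_number(Date)) inside mutate().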

How to perform countif in R and export the data?

I have a set of data:
ID<-c(111,111,222,222,222,222,222,222)
TreatmentDate<-as.Date(c("2010-12-12","2011-12-01","2009-8-7","2010-5-7","2011-3-7","2011-8-5","2013-8-27","2016-9-3"))
Treatment<-c("AA","BB","CC","DD","AA","BB","BB","CC")
df<-data.frame(ID,TreatmentDate,Treatment)
df
ID TreatmentDate Treatment
111 12/12/2010 AA
111 01/12/2011 BB
222 07/08/2009 CC
222 07/05/2010 DD
222 07/03/2011 AA
222 05/08/2011 BB
222 27/08/2013 BB
222 03/09/2016 CC
I also have another dataframe showing the test date for each subject:
UID<-c(111,222)
Testdate<-as.Date(c("2012-12-31","2014-12-31"))
SubjectTestDate<-data.frame(UID,Testdate)
I am trying to summarise the data such that, if I want to see how many of each treatment a subject had prior to their test date, I would get something like this, and I would like to export it to a spreadsheet.
ID Prior_to_date TreatmentAA TreatmentBB TreatmentCC TreatmentDD
111 31/12/2012 1 1 0 0
222 31/12/2014 1 2 1 1
Any help would be much appreciated!!
We could join the two datasets on 'ID', create a column that checks the condition ('indx'), and use dcast to convert from 'long' to 'wide' format:
library(data.table)#v1.9.5+
dcast(setkey(setDT(df), ID)[SubjectTestDate][,
    indx := sum(TreatmentDate <= Testdate), list(ID, Treatment)],
  ID + Testdate ~ paste0('Treatment', Treatment), value.var = 'indx', length)
# ID Testdate TreatmentAA TreatmentBB TreatmentCC TreatmentDD
#1: 111 2012-12-31 1 1 0 0
#2: 222 2014-12-31 1 2 2 1
Update
Based on the modified 'df', we join 'df' with 'SubjectTestDate', create the 'indx' column as before along with a sequence column 'Seq' (grouped by 'ID' and 'Treatment'), use dcast, and then remove the duplicated 'ID' rows with unique:
unique(dcast(setkey(setDT(df), ID)[SubjectTestDate][,
    c('indx', 'Seq') := list(sum(TreatmentDate <= Testdate), 1:.N),
    .(ID, Treatment)],
  ID + Seq + Testdate ~ paste0('Treatment', Treatment),
  value.var = 'indx', fill = 0), by = 'ID')
# ID Seq Testdate TreatmentAA TreatmentBB TreatmentCC TreatmentDD
#1: 111 1 2012-12-31 1 1 0 0
#2: 222 1 2014-12-31 1 2 1 1
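A more readable sketch of the same count-then-widen idea uses dplyr and tidyr (assuming both are installed; pivot_wider() needs tidyr >= 1.0), filtering to treatments before the test date first:

```r
library(dplyr)
library(tidyr)

result <- df %>%
  inner_join(SubjectTestDate, by = c("ID" = "UID")) %>%
  filter(TreatmentDate <= Testdate) %>%   # keep only treatments before the test date
  count(ID, Testdate, Treatment) %>%      # count each treatment per subject
  pivot_wider(names_from = Treatment,
              values_from = n,
              names_prefix = "Treatment",
              values_fill = 0)            # treatments never received become 0
```

For the export step, write.csv(result, "treatment_counts.csv", row.names = FALSE) produces a file any spreadsheet program can open.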
