Replace value in data frame with value from other data frame based on set of conditions - r

In df1 I need to replace values for msec with corresponding values in df2.
df1 <- data.frame(ID=c('rs', 'rs', 'rs', 'tr','tr','tr'), cond=c(1,1,2,1,1,2),
block=c(2,2,4,2,2,4), correct=c(1,0,1,1,1,0), msec=c(456,678,756,654,625,645))
df2 <- data.frame(ID=c('rs', 'rs', 'tr','tr'), cond=c(1,2,1,2),
block=c(2,4,2,4), mean=c(545,664,703,765))
In df1, if correct==0, then reference df2 with the matching values of ID, cond, and block. Replace the value for msec in df1 with the corresponding value for mean in df2.
For example, the second row in df1 has correct==0. So, in df2 find the corresponding row where ID=='rs', cond==1, block==2 and use the value for mean (mean=545) to replace the value for msec (msec=678). Note that in df1 combinations of ID, block, and cond can repeat, but each combination occurs only once in df2.

Using the data.table package:
# load the 'data.table' package
library(data.table)
# convert the data.frame's to data.table's
setDT(df1)
setDT(df2)
# update df1 by reference with a join with df2
df1[df2[, correct := 0], on = .(ID, cond, block, correct), msec := i.mean]
which gives:
> df1
ID cond block correct msec
1: rs 1 2 1 456
2: rs 1 2 0 545
3: rs 2 4 1 756
4: tr 1 2 1 654
5: tr 1 2 1 625
6: tr 2 4 0 765
Note: The above code will update df1 instead of creating a new dataframe, which is more memory-efficient.
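One side effect to be aware of: df2[, correct := 0] also adds a correct column to df2 itself, by reference. A minimal variant that leaves df2 untouched joins on a copy instead:
# same join, but on a temporary copy so df2 keeps its original columns
df1[copy(df2)[, correct := 0], on = .(ID, cond, block, correct), msec := i.mean]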

One option would be to use base R with an interaction() and a match(). How about:
df1[which(df1$correct == 0), "msec"] <-
  df2[match(interaction(df1[which(df1$correct == 0), c("ID", "cond", "block")]),
            interaction(df2[, c("ID", "cond", "block")])),
      "mean"]
df1
# ID cond block correct msec
#1 rs 1 2 1 456
#2 rs 1 2 0 545
#3 rs 2 4 1 756
#4 tr 1 2 1 654
#5 tr 1 2 1 625
#6 tr 2 4 0 765
We overwrite the msec values in the rows where correct == 0 with their matched values from df2$mean.
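For illustration, here is a minimal look at the keys that match() compares; the as.character() wrapper is only there to drop the factor levels from the printout:
as.character(interaction(df1[df1$correct == 0, c("ID", "cond", "block")]))
# [1] "rs.1.2" "tr.2.4"
as.character(interaction(df2[, c("ID", "cond", "block")]))
# [1] "rs.1.2" "rs.2.4" "tr.1.2" "tr.2.4"
match() then maps each key on the left to its (unique) position on the right.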
Edit: Another option would be an SQL merge; it could look like this:
library(sqldf)
merged <- sqldf('SELECT l.ID, l.cond, l.block, l.correct,
case when l.correct == 0 then r.mean else l.msec end as msec
FROM df1 as l
LEFT JOIN df2 as r
ON l.ID = r.ID AND l.cond = r.cond AND l.block = r.block')
merged
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765

With dplyr: this solution left_joins the two data frames on all common columns and mutates msec when correct is 0.
library(dplyr)
left_join(df1, df2) %>%
  mutate(msec = ifelse(correct == 0, mean, msec)) %>%
  select(-mean)
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765
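If you prefer not to rely on left_join()'s default of joining by all common columns, a variant with the keys spelled out explicitly (assuming the three key columns are ID, cond and block) gives the same result:
left_join(df1, df2, by = c("ID", "cond", "block")) %>%
  mutate(msec = ifelse(correct == 0, mean, msec)) %>%
  select(-mean)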

Related

Retrieve all rows with same minimum value for a column with sqldf

I have to retrieve IDs for employees who have completed the minimum number of jobs. There are multiple employees who have completed 1 job. My current sqldf query retrieves only 1 row of data, while there are multiple employee IDs who have completed just 1 job. Why does it stop at the first minimum value? And how do I fetch all rows with the minimum value in a column? Here is a data sample:
ID TaskCount
1 74
2 53
3 10
4 5
5 1
6 1
7 1
The code I have used:
sqldf("select id, min(taskcount) as Jobscompleted
from (select id,count(id) as taskcount
from MyData
where id is not null
group by id order by id)")
Output is
ID Jobscompleted
5 1
While what I want is all the rows with minimum jobs completed.
ID Jobscompleted
5 1
6 1
7 1
min(...) always returns one row in SQL as do all SQL aggregate functions. Try this instead:
sqldf("select ID, TaskCount TasksCompleted from MyData
where TaskCount = (select min(TaskCount) from MyData)")
giving:
ID TasksCompleted
1 5 1
2 6 1
3 7 1
Note: The input in reproducible form is:
Lines <- "
ID TaskCount
1 74
2 53
3 10
4 5
5 1
6 1
7 1"
MyData <- read.table(text = Lines, header = TRUE)
As an alternative to sqldf, you could use data.table:
library(data.table)
dt <- data.table(ID=1:7, TaskCount=c(74, 53, 10, 5, 1, 1, 1))
dt[TaskCount==min(TaskCount)]
## ID TaskCount
## 1: 5 1
## 2: 6 1
## 3: 7 1
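For completeness, the same filter needs no packages at all; a base R equivalent on the MyData frame defined above would be:
MyData[MyData$TaskCount == min(MyData$TaskCount), ]
##   ID TaskCount
## 5  5         1
## 6  6         1
## 7  7         1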

Count number of observations in one data frame based on values from another data frame

I have two very large data frames (50 million and 1.5 million rows) where some of the variables in both are the same. I need to compare both and add another column to one data frame which gives the count of matching observations in the other data frame.
For example: DF1 and DF2 both contain id, date, age_grp and gender variables. I want to add another column (match_count) to DF1 which shows the count where DF1.id = DF2.id and DF1.date = DF2.date and DF1.age_grp = DF2.age_grp and DF1.gender = DF2.gender.
DF1
id date age_grp gender val
101 20140110 1 1 666
102 20150310 2 2 777
103 20160901 3 1 444
104 20160903 4 1 555
105 20010910 5 1 888
DF2
id date age_grp gender state
101 20140110 1 1 10
101 20140110 1 1 12
101 20140110 1 2 22
102 20150310 2 2 33
In the above example the combination "id = 101, date = 20140110, age_grp = 1, gender = 1" appears twice in DF2, hence the count 2, and the combination "id = 102, date = 20150310, age_grp = 2, gender = 2" appears once, hence the count 1.
Below is the resultant data frame I am looking for
Result
id date age_grp gender val match_count
101 20140110 1 1 666 2
102 20150310 2 2 777 1
103 20160901 3 1 444 0
104 20160903 4 1 555 0
105 20010910 5 1 888 0
Here is what I am doing at the moment; it works perfectly well for small data but does not scale to large data. For this instance it did not return any results even after several hours.
Note: I have gone through this thread and it does not address the scale issue
with(DF1,
     mapply(
       function(arg_id, arg_agegrp, arg_gender, arg_date) {
         sum(arg_id == DF2$id
             & arg_agegrp == DF2$age_grp
             & arg_gender == DF2$gender
             & arg_date == DF2$date)
       },
       id, age_grp, gender, date)
)
UPDATE
The Id column is not unique, hence there could be two observations where id, date, age_grp and gender are the same and only the val column differs.
Here is how I would solve this problem using dplyr:
library(dplyr)
DF2$state <- NULL  # noted you do not need the state column
Name <- names(DF2)
DF2 <- DF2 %>%
  group_by_(.dots = names(DF2)) %>%
  dplyr::summarise(match_count = n())
Target <- merge(DF1, DF2, by.x = Name, by.y = Name, all.x = TRUE)
Target[is.na(Target)] <- 0
Target
id date age_grp gender val match_count
1 101 20140110 1 1 666 2
2 102 20150310 2 2 777 1
3 103 20160901 3 1 444 0
4 104 20160903 4 1 555 0
5 105 20010910 5 1 888 0
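Note that group_by_() is deprecated in current dplyr releases; assuming a recent dplyr (>= 1.0), an equivalent aggregation step, starting from the original DF2, would be:
# count() groups by the listed columns and adds the size of each group;
# the state column is simply not listed, so it never needs to be dropped
DF2_counts <- DF2 %>% count(id, date, age_grp, gender, name = "match_count")
The merge() step then proceeds as above, with DF2_counts in place of DF2.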
data.table might be helpful here too. Aggregate DF2 by the variables specified, then join this back to DF1.
library(data.table)
setDT(DF1)
setDT(DF2)
vars <- c("id","date","age_grp","gender")
DF1[DF2[, .N, by=vars], count := N, on=vars]
DF1
# id date age_grp gender val count
#1: 101 20140110 1 1 666 2
#2: 102 20150310 2 2 777 1
#3: 103 20160901 3 1 444 NA
#4: 104 20160903 4 1 555 NA
#5: 105 20010910 5 1 888 NA
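The join leaves NA where DF1 has no match in DF2. If you want 0 instead, as in the expected output, one extra line fixes that up by reference:
DF1[is.na(count), count := 0]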

Delete following observations when goal has been reached

Given the dataframe:
df = data.frame(
ID = c(1,1,1,1,2,3,3),
Start = c(0,8,150,200,6,7,60),
Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the following observations of that ID.
In this example, the output should be
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up setting the NAs to a value that is easy to identify later, and using the following code:
df$Stop[is.na(df$Stop)] <- 10000
df$diff <- df$Start - c(0, df$Stop[1:length(df$Stop) - 1])
space <- with(df, unique(ID[diff < 28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)
Using data.table...
library(data.table)
setDT(df)
df[,{
w = which( shift(Start,type="lead") - Stop >= 28 )
if (length(w)) .SD[seq(w[1])] else .SD
}, by=ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
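To see what the lead shift contributes, a quick standalone check on the Start values of ID == 1 looks like this:
library(data.table)
# each value is replaced by the next one in the vector; the last becomes NA
shift(c(0, 8, 150, 200), type = "lead")
# [1]   8 150 200  NA
so the comparison lines up each row's Stop with the next row's Start within the group.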
Create a diff column:
df$diff <- df$Start - c(0, df$Stop[1:length(df$Stop) - 1])
Subset on the basis of this column:
df[df$diff < 28, ]
PS: I have converted the NA to 0. You would have to handle that anyway.
p <- which(df$Start[2:nrow(df)] - df$Stop[1:(nrow(df) - 1)] >= 28)
df <- df[p, ]
This assumes you want to keep entries where the next entry's start is higher than the given entry's stop by 28 or more.
The result is:
> p
[1] 2 3
> df[p, ]
  ID Start Stop
2  1     8   60
3  1   150  170
Row 2 is kept because start in row 3 (i + 1 = 3) is higher than stop in row 2 (i = 2) by 90.
Or, if by "until" you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)] - df$Stop[1:(nrow(df) - 1)] < 28), ]
The inclusion of NA in your data frame got me thinking: you have to be very careful how you word your condition. If you want to keep all the cases where the difference between the next start and the stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT those where the difference is 28 or more, then you should use
p <- which(df$Start[2:nrow(df)] - df$Stop[1:(nrow(df) - 1)] >= 28)
rp <- which(!is.element(1:nrow(df), p))
df <- df[rp, ]
as this will also keep the rows where the difference is unknown (NA).

Subset of a table that contains at least one element of another table

I have two tables of bp (base pair) intervals: Table1 has large intervals and Table2 has short ones (just 2 bp). I want to make a new table that contains only the Table1 ranges that have at least one element of Table2 contained in their "large" ranges. If no element of Table2 falls within a Table1 range, that range of Table1 should not be included.
In this example row 2 (1, 600, 1500) of Table1 (df1) should not be included:
df <- "Chromosome start end
1 1 450
1 600 1500
2 3500 3585
2 7850 10000"
df1 <- read.table(text=df1, header=T)
Table2 (df2)
df2 <- "Chromosome start end
1 5 6
1 598 599
2 3580 3581
2 7851 7852
2 7859 7860"
df2 <- read.table(text=df2, header=T)
NewTable (dfout):
dfout <- "Chromosome start end
1 1 450
2 3500 3585
2 7850 10000"
dfout <- read.table(text=dfout, header=T)
Try foverlaps from data.table
library(data.table)
setkey(setDT(df1), Chromosome, start, end)
setkey(setDT(df2), Chromosome, start, end)
setnames(unique(foverlaps(df1, df2, nomatch=0)[, c(1,4:5),
with=FALSE]), names(df1))[]
# Chromosome start end
#1: 1 1 450
#2: 2 3500 3585
#3: 2 7850 10000
Or, as #Arun commented, we can use which=TRUE (to extract the indices) and subset df1 using the yid column:
df1[unique(foverlaps(df2, df1, nomatch=0L, which=TRUE)$yid)]
# Chromosome start end
#1: 1 1 450
#2: 2 3500 3585
#3: 2 7850 10000
This also seems to solve your problem; note that it checks for full containment of the Table2 interval within the Table1 range, which works here because the Table2 intervals are only 2 bp wide:
ranges <- merge(df1, df2, by = "Chromosome", suffixes = c("A", "B"))
ranges <- ranges[with(ranges, startA <= startB & endA >= endB), ]
ranges <- ranges[, 1:3]
dfout <- unique(ranges)
dfout
# Chromosome startA endA
# 1 1 450
# 2 3500 3585
# 2 7850 10000

How to perform countif in R and export the data?

I have a set of data:
ID<-c(111,111,222,222,222,222,222,222)
TreatmentDate<-as.Date(c("2010-12-12","2011-12-01","2009-8-7","2010-5-7","2011-3-7","2011-8-5","2013-8-27","2016-9-3"))
Treatment<-c("AA","BB","CC","DD","AA","BB","BB","CC")
df<-data.frame(ID,TreatmentDate,Treatment)
df
   ID TreatmentDate Treatment
1 111    2010-12-12        AA
2 111    2011-12-01        BB
3 222    2009-08-07        CC
4 222    2010-05-07        DD
5 222    2011-03-07        AA
6 222    2011-08-05        BB
7 222    2013-08-27        BB
8 222    2016-09-03        CC
I also have another dataframe showing the test date for each subject:
UID<-c(111,222)
Testdate<-as.Date(c("2012-12-31","2014-12-31"))
SubjectTestDate<-data.frame(UID,Testdate)
I am trying to summarise the data such that, if I want to see how many of each treatment a subject had prior to their test date, I would get something like the table below, and I would like to export this to a spreadsheet.
ID Prior_to_date TreatmentAA TreatmentBB TreatmentCC TreatmentDD
111 31/12/2012 1 1 0 0
222 31/12/2014 1 2 1 1
Any help would be much appreciated!!
We could join the two datasets on 'ID', create a column that checks the condition ('indx'), and use dcast to convert from 'long' to 'wide' format:
library(data.table)  # v1.9.5+
dcast(setkey(setDT(df), ID)[SubjectTestDate][,
   indx := sum(TreatmentDate <= Testdate), list(ID, Treatment)],
   ID + Testdate ~ paste0('Treatment', Treatment), value.var = 'indx', length)
# ID Testdate TreatmentAA TreatmentBB TreatmentCC TreatmentDD
#1: 111 2012-12-31 1 1 0 0
#2: 222 2014-12-31 1 2 2 1
Update
Based on the modified 'df', we join 'df' with 'SubjectTestDate', create the 'indx' column as before along with a sequence column 'Seq' (grouped by 'ID' and 'Treatment'), use dcast, and then remove the duplicated 'ID' rows with unique:
unique(dcast(setkey(setDT(df), ID)[SubjectTestDate][,
c('indx', 'Seq') := list(sum(TreatmentDate <= Testdate), 1:.N) ,
.(ID, Treatment)], ID+ Seq+ Testdate ~ paste0('Treatment',
Treatment), value.var='indx', fill=0), by='ID')
# ID Seq Testdate TreatmentAA TreatmentBB TreatmentCC TreatmentDD
#1: 111 1 2012-12-31 1 1 0 0
#2: 222 1 2014-12-31 1 2 1 1
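Since the goal is to export this to a spreadsheet, a minimal last step (assuming CSV is acceptable) could be:
# 'res' is a placeholder for the result of the dcast() call above;
# fwrite() is from data.table, base write.csv() works equally well
fwrite(res, "treatment_summary.csv")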
