Remove groups based on multiple conditions in dplyr R

Remove groups based on multiple conditions in dplyr R - r

I have a data that looks like this
gene=c("A","A","A","A","B","B","B","B")
frequency=c(1,1,0.8,0.6,0.3,0.2,1,1)
time=c(1,2,3,4,1,2,3,4)
df <- data.frame(gene,frequency,time)
gene frequency time
1 A 1.0 1
2 A 1.0 2
3 A 0.8 3
4 A 0.6 4
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I want to remove each a gene group, in this case A or B when they have
frequency > 0.9 at time==1
In this case I want to remove A and my data to look like this
gene frequency time
1 B 0.3 1
2 B 0.2 2
3 B 1.0 3
4 B 1.0 4
Any hint or help are appreciated

We may use subset from base R i.e. create a logical vector with multiple expressions extract the 'gene' correspond to that, use %in% to create a logical vector, negate (!) to return the genes that are not. Or may also change the > to <= and remove the !
subset(df, !gene %in% gene[frequency > 0.9 & time == 1])
-ouptut
gene frequency time
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4

Related

R Data.Table: Dynamically Update a Different Column for each Row

I'm working on some code where I need to find the maximum value over a set of columns and then update that maximum value. Consider this toy example:
test <- data.table(thing1=c('AAA','BBB','CCC','DDD','EEE'),
A=c(9,5,4,2,5),
B=c(2,7,2,6,3),
C=c(6,2,5,4,1),
ttl=c(1,1,3,2,1))
where the resulting data.table looks like this:
thing1
A
B
C
ttl
AAA
9
2
6
1
BBB
5
7
2
1
CCC
4
2
5
3
DDD
2
6
4
2
EEE
5
3
1
1
The goal is to find the column (A, B, or C) with the maximum value and replace that value by the current value minus 0.1 times the value in the ttl column (i.e. new_value=old_value - 0.1*ttl). The other columns (not containing the maximum value) should remain the same. The resulting DT should look like this:
thing1
A
B
C
ttl
AAA
8.9
2
6
1
BBB
5
6.9
2
1
CCC
4
2
4.7
3
DDD
2
5.8
4
2
EEE
4.9
3
1
1
The "obvious" way of doing this is to write a for loop and loop through each row of the DT. That's easy enough to do and is what the code I'm adapting this from did. However, the real DT is much larger than my toy example and the for loop takes some time to run, which is why I'm trying to adapt the code to take advantage of vectorization and get rid of the loop.
Here's what I have so far:
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
which produces this DT:
thing1
A
B
C
ttl
max_position
newmax
AAA
9
2
6
1
A
8.9
BBB
5
7
2
1
B
6.9
CCC
4
2
5
3
C
4.7
DDD
2
6
4
2
B
5.8
EEE
5
3
1
1
A
4.9
The problem comes in assigning the value of the newmax column back to where it needs to go. I naively tried this, along with some other things, which tells me that "'max_position' not found":
test[,(max_position):=newmax,by=1:nrow(test)]
It's straightforward to solve the problem by reshaping the DT, which is the solution I have in place for now (see below), but I worry that with my full DT two reshapes will be slow as well (though presumably better than the for loop). Any suggestions on how to make this work as intended?
Reshaping solution, for reference:
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
test <- setDT(gather(test,idgroup,val,c(A,B,C)))
test[,maxval:=max(val),by='thing1']
test[val==maxval,val:=newmax][,maxval:=NULL]
test <- setDT(spread(test,idgroup,val))

With the OP's code, replace can work
test[, (2:4) := replace(.SD, which.max(.SD), max(.SD, na.rm = TRUE) - 0.1 * ttl),
by = 1:nrow(test),.SDcols = 2:4]
-output
> test
thing1 A B C ttl
1: AAA 8.9 2.0 6.0 1
2: BBB 5.0 6.9 2.0 1
3: CCC 4.0 2.0 4.7 3
4: DDD 2.0 5.8 4.0 2
5: EEE 4.9 3.0 1.0 1
In base R, this may be faster with row/column indexing
test1 <- as.data.frame(test)
m1 <- cbind(seq_len(nrow(test1)), max.col(test1[2:4], "first"))
test1[2:4][m1] <- test1[2:4][m1] - 0.1 * test1$ttl

Filter out a group of a data.frame based on multiple conditions that apply at a specific time point

My data frame looks like this.
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to emphasize my search at the time point 1 and based on the values on that
time point to filter out the groups that do not fulfil a condition from the later time points.
I would like to delete the values of the groups that on the time point 1 are bigger than 0.5
and smaller than 0.1.
I want my data.frame to look like this.
group time value
1 A 1 0.2
2 A 2 0.1
3 A 3 10.0
Any help is highly appreciated.

You can select groups where value at time = 1 is between 0.1 and 0.5.
library(dplyr)
data %>%
group_by(group) %>%
filter(between(value[time == 1], 0.1, 0.5))
# group time value
# <chr> <dbl> <dbl>
#1 A 1 0.2
#2 A 2 0.1
#3 A 3 10

apply distribution to new sample set

I have a data-frame dfu that holds for each id (id belongs to one team, team has many ids) the percentage samples where a bunch of properties prop1, prop2 and so on are observed based on some past studies - this is used as sort of reference table for future studies. Now there is data from new experiment which gives a new set of ids. I need to find the percentage samples where prop1, prop2 and so on are observed on per team basis by using the reference data in dfu. This could be done by counting the number of occurrences per id in dfi and then take a weighted average grouped by team.- not all ids in dfu may be present and one or more ids not present in dfu may be present in dfi. The ids not present in dfu may be excluded from the weighted average as no presence per property values are available for them.
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
> dfu
id team prop1 prop2
1 A 0.8 0.2
2 B 0.9 0.3
3 C 0.6 0.3
4 A 0.5 0.2
5 A 0.8 0.2
6 C 0.9 0.3
>
> dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
> dfi
id
2
3
2
1
4
3
7
The output format would be like below. For example the value for prop1 for group A would be (0.8*1 + 0.5*1)/2 = 0.65.
team prop1 prop2
A
B
C
prefer base R approach, other approaches welcome. The number of columns could be many.

I don't know exactly how to do it with base R.
With data.table it's should be pretty easy.
Let convert your data.frames into data.table.
library(data.table)
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
dfi <- data.table(dfi)
dfu <- data.table(dfu)
Then merge them like
dfu[dfi,on="id"]
## > dfu[dfi,on="id"]
## id team prop1 prop2
## 1: 2 B 0.9 0.3
## 2: 3 C 0.6 0.3
## 3: 2 B 0.9 0.3
## 4: 1 A 0.8 0.2
## 5: 4 A 0.5 0.2
## 6: 3 C 0.6 0.3
## 7: 7 NA NA NA
Then we just have to perform the mean by group. In fact we can to it one liner like
dfu[dfi,on="id"][,mean(prop1),team]
## > dfu[dfi,on="id"][,mean(prop1),team]
## team V1
## 1: B 0.90
## 2: C 0.60
## 3: A 0.65
## 4: NA NA
You can achieve the same thing in base R by merging the data.frame and using the function aggregate I guess.

taking cue from #DJJ's answer.
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"),
prop1=c(0.8,0.9,0.6,0.5,0.8,0.9),
prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
Merge by id
> dfx <- merge(dfi, dfu, by="id")
> dfx
id team prop1 prop2
1 1 A 0.8 0.2
2 2 B 0.9 0.3
3 2 B 0.9 0.3
4 3 C 0.6 0.3
5 3 C 0.6 0.3
6 4 A 0.5 0.2
Aggregate prop1 and prop2 by team with mean
> aggregate(cbind(prop1, prop2) ~ team, dfx, mean)
team prop1 prop2
1 A 0.65 0.2
2 B 0.90 0.3
3 C 0.60 0.3

Multiplying column value by another value depending on value in certain column R

in the following dataset, I would like to multiply value in column Size by value in column Month1, Month2 or Month3 depending on what number we have in column Month. So if in certain row the Month value is 2, I would like to multiply the value in column Size by the value in column Month2 and save the result in new column NewSize. Many thanks for your help in advance!
Orig = c("A","B","A","A","B","A","A","B","A")
Dest = c("B","A","C","B","A","C","B","A","C")
Month = c(1,1,1,2,2,2,3,3,3)
Size = c(30,20,10,10,20,20,30,50,20)
Month1 = c(1,0.2,0,1,0.2,0,1,0.2,0)
Month2 = c(0.6,1,0,0.6,1,0,0.6,1,0)
Month3 = c(0,1,0.6,0,1,0.6,0,1,0.6)
df <- data.frame(Orig,Dest,Month,Size,Month1,Month2,Month3)
df
Orig Dest Month Size Month1 Month2 Month3
1 A B 1 30 1.0 0.6 0.0
2 B A 1 20 0.2 1.0 1.0
3 A C 1 10 0.0 0.0 0.6
4 A B 2 10 1.0 0.6 0.0
5 B A 2 20 0.2 1.0 1.0
6 A C 2 20 0.0 0.0 0.6
7 A B 3 30 1.0 0.6 0.0
8 B A 3 50 0.2 1.0 1.0
9 A C 3 20 0.0 0.0 0.6

Here's one alternative using ifelse
> transform(df, NewSize=ifelse(Month==1, Size*Month1,
ifelse(Month==2, Size*Month2, Size*Month3)))
Orig Dest Month Size Month1 Month2 Month3 NewSize
1 A B 1 30 1.0 0.6 0.0 30
2 B A 1 20 0.2 1.0 1.0 4
3 A C 1 10 0.0 0.0 0.6 0
4 A B 2 10 1.0 0.6 0.0 6
5 B A 2 20 0.2 1.0 1.0 20
6 A C 2 20 0.0 0.0 0.6 0
7 A B 3 30 1.0 0.6 0.0 0
8 B A 3 50 0.2 1.0 1.0 50
9 A C 3 20 0.0 0.0 0.6 12

In base R, fully vectorized:
df$Size*df[,5:7][cbind(1:nrow(df),df$Month)]

Here's how I'd handle this using data.table.
require(data.table)
setkey(setDT(df),
Month)[.(mon = 1:3), ## i
NewSize := Size * get(paste0("Month", mon)), ## j
by=.EACHI] ## by
setDT converts df from data.frame to data.table by reference.
setkey reorders that data.table by the column specified, Month, in increasing order, and marks that column as key column, on which we'll perform a join.
We perform a join on the key column set in the previous set with the values 1:3. This can also be interpreted as a subset operation that extracts all rows matching 1,2 and 3 from the key column Month.
So, for each value of 1:3, we calculate the matching rows in i. And on those matching rows, we compute NewSize by extracting Size and MonthX for those matching rows, and multiplying them. We use get() to achieve extracting the right MonthX column.
by=.EACHI as the name implies, executes the expression in j for each i. As an example, i=1 matches (or joins) to rows 1:3 of df. For those rows, the j-expression extracts Size = 30,20,10 and Month1 = 1.0, 0.2, 0.0, and it gets evaluated to return 30, 4, 0. And then for i=2 and so on..
Hope this helps a bit even if you're looking for a dplyr only answer.

You can use apply:
apply(df, 1, function(u) as.numeric(u[paste0('Month', u['Month'])])*as.numeric(u['Size']))
#[1] 30 4 0 6 20 0 0 50 12
Or a vectorized solution:
bool = matrix(rep(df$Month, each=3)==rep(1:3, nrow(df)), byrow=T, ncol=3)
df[c('Month1', 'Month2', 'Month3')][bool] * df$Size
#[1] 30 4 0 6 20 0 0 50 12

Thresholding a data frame without removing values

I have a data frame that consisting of a non-unique identifier (ID) and measures of some property of the objects within that ID, something like this:
ID Sph
A 1.0
A 1.2
A 1.1
B 0.5
B 1.8
C 2.2
C 1.1
D 2.1
D 3.0
First, I get the number of instances of each ID as X using table(df$ID), i.e. A=3, B=2 ,C=2 and D=2. Next, I would like to apply a threshold in the "Sph" category after getting the number of instances, limiting to rows where the Sph value exceeds the threshold. With threshold 2.0, for instance, I would use thold=df[df$Sph>2.0,]. Finally I would like to replace the ID column with the X value that I computed using table above. For instance, with a threshold of 1.1 in the "Sph" columns I would like the following output:
ID Sph
3 1.0
2 1.8
2 2.2
2 2.1
2 3.0
In other words, after using table() to get an x value corresponding to the number of times an ID has occurred, say 3, I would like to then assign that number to every value in that ID, Y, that is over some threshold.

There are some inconsistencies in your question and you didn't give a reproducible example, however here's my attempt.
I like to use the dplyr library, in this case I had to break out an sapply, maybe someone can improve on my answer.
Here's the short version:
library(dplyr)
#your data
x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
#lookup table
y <- summarise(group_by(x,ID), IDn=n())
#fill in original table
x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
#filter for rows where Sph greater or equal to 1.1
x <- x %>% filter(Sph>=1.1)
#done
x
And here's the longer version with explanatory output:
> library(dplyr)
> #your data
> x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
> x
ID Sph
1 A 1.0
2 A 1.2
3 A 1.1
4 B 0.5
5 B 1.8
6 C 2.2
7 C 1.1
8 D 2.1
9 D 3.0
>
> #lookup table
> y <- summarise(group_by(x,ID), IDn=n())
> y
Source: local data frame [4 x 2]
ID IDn
1 A 3
2 B 2
3 C 2
4 D 2
>
> #fill in original table
> x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
> x
ID Sph IDn
1 A 1.0 3
2 A 1.2 3
3 A 1.1 3
4 B 0.5 2
5 B 1.8 2
6 C 2.2 2
7 C 1.1 2
8 D 2.1 2
9 D 3.0 2
>
> #filter for rows where Sph greater or equal to 1.1
> x <- x %>% filter(Sph>=1.1)
>
> #done
> x
ID Sph IDn
1 A 1.2 3
2 A 1.1 3
3 B 1.8 2
4 C 2.2 2
5 C 1.1 2
6 D 2.1 2
7 D 3.0 2

You can actually do this in one step after computing X and thold as you did in your question:
X <- table(df$ID)
thold <- df[df$Sph > 1.1,]
thold$ID <- X[as.character(thold$ID)]
thold
# ID Sph
# 2 3 1.2
# 5 2 1.8
# 6 2 2.2
# 8 2 2.1
# 9 2 3.0
Basically you look up the frequency of each ID value in the table X that you built.