I have data that looks like this:
gene=c("A","A","A","A","B","B","B","B")
frequency=c(1,1,0.8,0.6,0.3,0.2,1,1)
time=c(1,2,3,4,1,2,3,4)
df <- data.frame(gene,frequency,time)
gene frequency time
1 A 1.0 1
2 A 1.0 2
3 A 0.8 3
4 A 0.6 4
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I want to remove an entire gene group (A or B in this case) when it has
frequency > 0.9 at time == 1.
In this case I want to remove A, so that my data looks like this:
gene frequency time
1 B 0.3 1
2 B 0.2 2
3 B 1.0 3
4 B 1.0 4
Any hint or help is appreciated.
We can use subset from base R: build a logical vector from the two conditions, extract the 'gene' values that match it, use %in% to flag every row belonging to those genes, and negate (!) to keep only the genes that do not match. Alternatively, change the > to <= and drop the !, which gives the same result as long as every gene has exactly one row at time == 1.
subset(df, !gene %in% gene[frequency > 0.9 & time == 1])
-output
gene frequency time
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
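To make the one-liner easier to follow, here is a minimal sketch that splits it into two explicit steps using plain logical indexing. The names drop_genes and kept are introduced only for illustration; the logic is meant to mirror the subset() call above, not replace it.

```r
gene <- c("A", "A", "A", "A", "B", "B", "B", "B")
frequency <- c(1, 1, 0.8, 0.6, 0.3, 0.2, 1, 1)
time <- c(1, 2, 3, 4, 1, 2, 3, 4)
df <- data.frame(gene, frequency, time)

# Step 1: genes that have frequency > 0.9 at time == 1
# (drop_genes is a hypothetical helper name)
drop_genes <- unique(df$gene[df$frequency > 0.9 & df$time == 1])

# Step 2: keep every row whose gene is not in that set
kept <- df[!df$gene %in% drop_genes, ]
kept
```

If no gene meets the condition, drop_genes is empty and all rows are kept.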
I have some data, example here:
dat1 <- data.frame(a = c("5","10","15","20"), b = c("0.1","0.2","0.3","0.4"))
dat2 <- data.frame(a = c("15","20","25","30"), b = c("0.5","0.6","0.7","0.8"))
datalist <- list(dat1, dat2)
Giving me a format like this
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
[[2]]
a b
1 15 0.5
2 20 0.6
3 25 0.7
4 30 0.8
I want to be able to filter the list of data frames with the condition that the first value of column a should be <= 10. So in this scenario the output would be just the first data frame [[1]], and the second data frame would be ignored entirely.
Desired output
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
Any advice would be greatly appreciated!
Thanks
You can use sapply to get a vector of logicals indicating whether each element of a list meets a certain condition. This can then be applied to subset the list in the usual way with [. E.g:
datalist[sapply(datalist, function(x){as.numeric(x[[1,"a"]]) <= 10})]
will return only the first element in your example.
(Note the as.numeric is necessary because your numbers are stored as character strings here)
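A slightly more defensive variant of the same idea, as a sketch: vapply() guarantees that the result is a logical vector with one value per list element, and the hypothetical helper first_a_at_most() (not part of the original answer) also returns FALSE for an empty data frame instead of NA. The as.character() step is only there in case column a is stored as a factor.

```r
dat1 <- data.frame(a = c("5", "10", "15", "20"), b = c("0.1", "0.2", "0.3", "0.4"))
dat2 <- data.frame(a = c("15", "20", "25", "30"), b = c("0.5", "0.6", "0.7", "0.8"))
datalist <- list(dat1, dat2)

# Hypothetical helper: TRUE when the first value of column a is <= limit
first_a_at_most <- function(d, limit = 10) {
  nrow(d) > 0 && as.numeric(as.character(d$a[1])) <= limit
}

# One logical flag per data frame, then subset the list with it
keep_flags <- vapply(datalist, first_a_at_most, logical(1))
filtered <- datalist[keep_flags]
filtered
```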
We can also use the keep function from purrr. It takes a predicate function .p, applies it to every element of a list, and returns the elements for which the predicate evaluates to a single TRUE.
library(purrr)
datalist %>%
keep(~ .x[["a"]][1] %>% as.numeric() <= 10)
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
We may use Filter from base R
Filter(\(x) as.numeric(x$a[1]) <= 10, datalist)
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
I have a dataset that looks like this:
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
head(sample.data)
groups A B position
1 1 1 3 2
2 2 3 2 1
3 3 2 4 2
4 4 4 1 1
5 5 2 5 2
6 6 5 2 1
The "position" column always alternates between 2 and 1. I want to do this calculation in R: starting from the first row, if it's in position 1, ignore it. If it starts at 2 (as in this example), then calculate as follows:
Take the first 2 values of column A that are at position 2, average them, then subtract the first value that is at position 1 (in this example: (1+2)/2 - 3 = -1.5). Then repeat the calculation for the next set of values, using the last position 2 value as the starting point, i.e. the next calculation would be (2+2)/2 - 4 = -2.
So basically, in this example, the calculations are done for the values of these sets of groups: 1-2-3, 3-4-5, 5-6-7, etc. (the last value of the previous is the first value of the next set of calculation)
Repeat the calculation until the end. Also do the same for column B.
Since I need the original data frame intact, put the newly calculated values in a new data frame(s), with columns dA and dB corresponding to the calculated values of column A and B, respectively (if not possible then they can be created as separated data frames, and I will extract them into one afterwards).
Desired output (from the example):
dA dB
1 -1.5 1.5
2 -2 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
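# Index of the first position-2 row, then every second row after it
start <- match(2, sample.data$position)
twos <- seq(from = start, to = nrow(sample.data), by = 2)
# For each position-2 row i: average rows i and i+2, then subtract row i+1.
# The last row comes out as NA because row i+2 does not exist.
df <-
  sapply(c("A", "B"), function(l) {
    sapply(twos, function(i) {
      mean(sample.data[c(i, i+2), l]) - sample.data[i+1, l]
    })
  })
df <- setNames(as.data.frame(df), c('dA', 'dB'))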
As your values in position always alternate between 1 and 2, you can define an index of odd rows i1 and an index of even rows i2, and do your calculations:
## lead() is assumed to be dplyr::lead
library(dplyr)
## If the first row has position == 1, shift both indexes by one row
inc <- 0
if (sample.data$position[1] == 1) inc <- 1
i1 <- seq(1 + inc, nrow(sample.data), by = 2)  # position-2 rows
i2 <- seq(2 + inc, nrow(sample.data), by = 2)  # position-1 rows
res <- data.frame(
  dA = (lead(sample.data$A[i1]) + sample.data$A[i1]) / 2 - sample.data$A[i2],
  dB = (lead(sample.data$B[i1]) + sample.data$B[i1]) / 2 - sample.data$B[i2])
This returns:
dA dB
1 -1.5 1.5
2 -2.0 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4.0
7 -3.5 2.5
8 -3.0 3.0
9 -3.0 4.5
10 NA NA
The last row is NA because the final position-2 value has no successor; you can remove it if needed:
res=na.omit(res)
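The answer above relies on lead(), which is not a base R function. As a rough base R sketch of the same calculation, the hypothetical helper delta() below (not from either answer) indexes the position-2 rows directly and drops the incomplete final triple, so no NA row is produced. It assumes position strictly alternates between 2 and 1, as stated in the question.

```r
A <- c(1, 3, 2, 4, 2, 5, 1, 6, 2, 7, 3, 5, 2, 6, 3, 5, 1, 5, 3, 4)
B <- c(3, 2, 4, 1, 5, 2, 4, 1, 3, 2, 6, 1, 4, 2, 5, 3, 7, 1, 4, 2)
position <- rep(c(2, 1), times = 10)
sample.data <- data.frame(groups = 1:20, A, B, position)

# Hypothetical helper: mean of two consecutive position-2 values
# minus the position-1 value that sits between them
delta <- function(x, pos) {
  start <- match(2, pos)                  # first position-2 row
  twos <- seq(start, length(x), by = 2)   # all position-2 rows
  twos <- twos[twos + 2 <= length(x)]     # keep complete triples only
  (x[twos] + x[twos + 2]) / 2 - x[twos + 1]
}

res <- data.frame(dA = delta(sample.data$A, sample.data$position),
                  dB = delta(sample.data$B, sample.data$position))
res
```

The original sample.data is left untouched; only the new data frame res is created.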
I have a data frame dfu that holds, for each id (each id belongs to one team, and a team has many ids), the percentage of samples in which properties prop1, prop2 and so on were observed in past studies. It serves as a reference table for future studies. Now there is data from a new experiment, dfi, which gives a new set of ids. I need to find the percentage of samples where prop1, prop2 and so on are observed on a per-team basis, using the reference data in dfu. This could be done by counting the number of occurrences per id in dfi and then taking a weighted average grouped by team. Not all ids in dfu may be present in dfi, and dfi may contain ids that are not in dfu. Ids not present in dfu can be excluded from the weighted average, since no per-property values are available for them.
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
> dfu
id team prop1 prop2
1 A 0.8 0.2
2 B 0.9 0.3
3 C 0.6 0.3
4 A 0.5 0.2
5 A 0.8 0.2
6 C 0.9 0.3
>
> dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
> dfi
id
2
3
2
1
4
3
7
The output format would be like below. For example the value for prop1 for group A would be (0.8*1 + 0.5*1)/2 = 0.65.
team prop1 prop2
A
B
C
I prefer a base R approach, but other approaches are welcome. The number of columns could be large.
I don't know exactly how to do it with base R.
With data.table it should be pretty easy.
Let's convert your data.frames into data.tables.
library(data.table)
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
dfi <- data.table(dfi)
dfu <- data.table(dfu)
Then join them on id:
dfu[dfi,on="id"]
## > dfu[dfi,on="id"]
## id team prop1 prop2
## 1: 2 B 0.9 0.3
## 2: 3 C 0.6 0.3
## 3: 2 B 0.9 0.3
## 4: 1 A 0.8 0.2
## 5: 4 A 0.5 0.2
## 6: 3 C 0.6 0.3
## 7: 7 NA NA NA
Then we just have to compute the mean by group. In fact, we can do it in one line:
dfu[dfi,on="id"][,mean(prop1),team]
## > dfu[dfi,on="id"][,mean(prop1),team]
## team V1
## 1: B 0.90
## 2: C 0.60
## 3: A 0.65
## 4: NA NA
You can probably achieve the same thing in base R by merging the data frames and then using the aggregate function.
Taking a cue from #DJJ's answer:
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"),
prop1=c(0.8,0.9,0.6,0.5,0.8,0.9),
prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
Merge by id
> dfx <- merge(dfi, dfu, by="id")
> dfx
id team prop1 prop2
1 1 A 0.8 0.2
2 2 B 0.9 0.3
3 2 B 0.9 0.3
4 3 C 0.6 0.3
5 3 C 0.6 0.3
6 4 A 0.5 0.2
Aggregate prop1 and prop2 by team with mean
> aggregate(cbind(prop1, prop2) ~ team, dfx, mean)
team prop1 prop2
1 A 0.65 0.2
2 B 0.90 0.3
3 C 0.60 0.3
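Since the question mentions that the number of property columns could be large, here is a sketch that avoids naming each column by using the . shorthand in the aggregate formula. It assumes every column other than id and team is a numeric property column. Because merge() repeats a dfu row once per occurrence of its id in dfi, the plain mean over the merged rows is already the occurrence-weighted average.

```r
dfu <- data.frame(id = 1:6, team = c("A", "B", "C", "A", "A", "C"),
                  prop1 = c(0.8, 0.9, 0.6, 0.5, 0.8, 0.9),
                  prop2 = c(0.2, 0.3, 0.3, 0.2, 0.2, 0.3))
dfi <- data.frame(id = c(2, 3, 2, 1, 4, 3, 7))

# Inner join: ids missing from dfu (here id 7) are dropped
dfx <- merge(dfi, dfu, by = "id")

# Average every remaining column by team; id is excluded first
res <- aggregate(. ~ team, data = dfx[, names(dfx) != "id"], FUN = mean)
res
```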
In the following dataset, I would like to multiply the value in column Size by the value in column Month1, Month2 or Month3, depending on the number in column Month. So if the Month value in a given row is 2, I would like to multiply the value in column Size by the value in column Month2 and save the result in a new column NewSize. Many thanks in advance for your help!
Orig = c("A","B","A","A","B","A","A","B","A")
Dest = c("B","A","C","B","A","C","B","A","C")
Month = c(1,1,1,2,2,2,3,3,3)
Size = c(30,20,10,10,20,20,30,50,20)
Month1 = c(1,0.2,0,1,0.2,0,1,0.2,0)
Month2 = c(0.6,1,0,0.6,1,0,0.6,1,0)
Month3 = c(0,1,0.6,0,1,0.6,0,1,0.6)
df <- data.frame(Orig,Dest,Month,Size,Month1,Month2,Month3)
df
Orig Dest Month Size Month1 Month2 Month3
1 A B 1 30 1.0 0.6 0.0
2 B A 1 20 0.2 1.0 1.0
3 A C 1 10 0.0 0.0 0.6
4 A B 2 10 1.0 0.6 0.0
5 B A 2 20 0.2 1.0 1.0
6 A C 2 20 0.0 0.0 0.6
7 A B 3 30 1.0 0.6 0.0
8 B A 3 50 0.2 1.0 1.0
9 A C 3 20 0.0 0.0 0.6
Here's one alternative using ifelse
> transform(df, NewSize=ifelse(Month==1, Size*Month1,
ifelse(Month==2, Size*Month2, Size*Month3)))
Orig Dest Month Size Month1 Month2 Month3 NewSize
1 A B 1 30 1.0 0.6 0.0 30
2 B A 1 20 0.2 1.0 1.0 4
3 A C 1 10 0.0 0.0 0.6 0
4 A B 2 10 1.0 0.6 0.0 6
5 B A 2 20 0.2 1.0 1.0 20
6 A C 2 20 0.0 0.0 0.6 0
7 A B 3 30 1.0 0.6 0.0 0
8 B A 3 50 0.2 1.0 1.0 50
9 A C 3 20 0.0 0.0 0.6 12
In base R, fully vectorized:
df$Size*df[,5:7][cbind(1:nrow(df),df$Month)]
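The one-liner above works through matrix indexing: a two-column matrix of (row, column) pairs picks exactly one cell per row. The sketch below spells that out step by step and selects the month columns by name rather than by position; month_cols and idx are names introduced here for illustration only.

```r
df <- data.frame(
  Month  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Size   = c(30, 20, 10, 10, 20, 20, 30, 50, 20),
  Month1 = c(1, 0.2, 0, 1, 0.2, 0, 1, 0.2, 0),
  Month2 = c(0.6, 1, 0, 0.6, 1, 0, 0.6, 1, 0),
  Month3 = c(0, 1, 0.6, 0, 1, 0.6, 0, 1, 0.6)
)

# Numeric matrix holding only the month multipliers
month_cols <- as.matrix(df[, c("Month1", "Month2", "Month3")])

# One (row, column) pair per row: row i, column given by Month[i]
idx <- cbind(seq_len(nrow(df)), df$Month)

# Pick the matching multiplier for each row and scale Size by it
df$NewSize <- df$Size * month_cols[idx]
df$NewSize
```

Because each pair carries its own row number, this does not depend on the rows being sorted by Month.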
Here's how I'd handle this using data.table.
require(data.table)
setkey(setDT(df),
Month)[.(mon = 1:3), ## i
NewSize := Size * get(paste0("Month", mon)), ## j
by=.EACHI] ## by
setDT converts df from data.frame to data.table by reference.
setkey reorders that data.table by the column specified, Month, in increasing order, and marks that column as key column, on which we'll perform a join.
We perform a join on the key column set in the previous step with the values 1:3. This can also be interpreted as a subset operation that extracts all rows matching 1, 2 and 3 from the key column Month.
So, for each value of 1:3, we find the matching rows. On those matching rows, we compute NewSize by extracting Size and MonthX and multiplying them. We use get() to extract the right MonthX column.
by=.EACHI, as the name implies, executes the expression in j for each row of i. As an example, i=1 matches (or joins) to rows 1:3 of df. For those rows, the j-expression extracts Size = 30,20,10 and Month1 = 1.0, 0.2, 0.0, and it gets evaluated to return 30, 4, 0. The same then happens for i=2, and so on.
Hope this helps a bit even if you're looking for a dplyr only answer.
You can use apply:
apply(df, 1, function(u) as.numeric(u[paste0('Month', u['Month'])])*as.numeric(u['Size']))
#[1] 30 4 0 6 20 0 0 50 12
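Or a vectorized solution (logical-matrix indexing returns values in column-major order, so the result lines up with df$Size only when the rows are sorted by Month, as they are here):
bool = matrix(rep(df$Month, each=3)==rep(1:3, nrow(df)), byrow=TRUE, ncol=3)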
df[c('Month1', 'Month2', 'Month3')][bool] * df$Size
#[1] 30 4 0 6 20 0 0 50 12