Apply distribution to new sample set - R

I have a data frame dfu that holds, for each id (an id belongs to one team; a team has many ids), the percentage of samples in which a number of properties prop1, prop2 and so on were observed in past studies. It is used as a reference table for future studies. Now there is data from a new experiment, which gives a new set of ids (dfi below). I need to find the percentage of samples in which prop1, prop2 and so on are observed on a per-team basis, using the reference data in dfu. This could be done by counting the number of occurrences of each id in dfi and then taking a weighted average grouped by team. Note that not all ids in dfu may be present in dfi, and dfi may contain ids that are not in dfu; the latter should be excluded from the weighted average, since no per-property values are available for them.
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
> dfu
  id team prop1 prop2
1  1    A   0.8   0.2
2  2    B   0.9   0.3
3  3    C   0.6   0.3
4  4    A   0.5   0.2
5  5    A   0.8   0.2
6  6    C   0.9   0.3
> dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
> dfi
  id
1  2
2  3
3  2
4  1
5  4
6  3
7  7
The output format would be like below. For example, for team A only ids 1 and 4 appear in dfi (one occurrence each), so the value for prop1 for team A would be (0.8*1 + 0.5*1)/2 = 0.65.
team prop1 prop2
A
B
C
A base R approach is preferred, but other approaches are welcome. Note that the number of prop columns could be large.

I don't know exactly how to do it with base R, but with data.table it should be pretty easy.
Let's convert your data.frames into data.tables.
library(data.table)
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
dfi <- data.table(dfi)
dfu <- data.table(dfu)
Then merge them like this:
dfu[dfi,on="id"]
## > dfu[dfi,on="id"]
## id team prop1 prop2
## 1: 2 B 0.9 0.3
## 2: 3 C 0.6 0.3
## 3: 2 B 0.9 0.3
## 4: 1 A 0.8 0.2
## 5: 4 A 0.5 0.2
## 6: 3 C 0.6 0.3
## 7: 7 NA NA NA
Then we just have to take the mean by group. In fact we can do it as a one-liner:
dfu[dfi,on="id"][,mean(prop1),team]
## > dfu[dfi,on="id"][,mean(prop1),team]
## team V1
## 1: B 0.90
## 2: C 0.60
## 3: A 0.65
## 4: NA NA
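Since the number of prop columns could be large, here is a sketch that takes the mean of every prop column at once and drops the unmatched ids (the NA team) first; it assumes all columns to average start with "prop":
cols <- grep("^prop", names(dfu), value=TRUE)
dfu[dfi, on="id"][!is.na(team), lapply(.SD, mean), by=team, .SDcols=cols]
##    team prop1 prop2
## 1:    B  0.90   0.3
## 2:    C  0.60   0.3
## 3:    A  0.65   0.2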
You can achieve the same thing in base R by merging the data frames and using the function aggregate, I guess.

Taking a cue from @DJJ's answer:
dfu <- data.frame(id=1:6, team=c("A","B","C","A","A","C"),
                  prop1=c(0.8,0.9,0.6,0.5,0.8,0.9),
                  prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
Merge by id
> dfx <- merge(dfi, dfu, by="id")
> dfx
id team prop1 prop2
1 1 A 0.8 0.2
2 2 B 0.9 0.3
3 2 B 0.9 0.3
4 3 C 0.6 0.3
5 3 C 0.6 0.3
6 4 A 0.5 0.2
Aggregate prop1 and prop2 by team with mean
> aggregate(cbind(prop1, prop2) ~ team, dfx, mean)
team prop1 prop2
1 A 0.65 0.2
2 B 0.90 0.3
3 C 0.60 0.3
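If there are many prop columns, naming each one in the formula gets tedious; a sketch that averages every column except id and team (assuming all remaining columns are prop columns):
aggregate(dfx[setdiff(names(dfx), c("id", "team"))],
          by=list(team=dfx$team), FUN=mean)
  team prop1 prop2
1    A  0.65   0.2
2    B  0.90   0.3
3    C  0.60   0.3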

Related

Remove groups based on multiple conditions in dplyr R

I have data that looks like this:
gene <- c("A","A","A","A","B","B","B","B")
frequency <- c(1,1,0.8,0.6,0.3,0.2,1,1)
time <- c(1,2,3,4,1,2,3,4)
df <- data.frame(gene, frequency, time)
gene frequency time
1 A 1.0 1
2 A 1.0 2
3 A 0.8 3
4 A 0.6 4
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I want to remove a whole gene group, in this case A or B, when it has
frequency > 0.9 at time == 1.
In this case I want to remove A, so my data would look like this:
gene frequency time
1 B 0.3 1
2 B 0.2 2
3 B 1.0 3
4 B 1.0 4
Any hint or help is appreciated.
We may use subset from base R: build a logical condition from multiple expressions, extract the 'gene' values that correspond to it, use %in% to create a logical vector, and negate it (!) to return the genes that do not match. Alternatively, change the > to <= and remove the ! (see the sketch after the output below).
subset(df, !gene %in% gene[frequency > 0.9 & time == 1])
-output
gene frequency time
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
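The alternative mentioned above, flipping the comparison and dropping the negation (equivalent here, assuming every gene has a row at time == 1):
subset(df, gene %in% gene[frequency <= 0.9 & time == 1])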

Filter out a group of a data.frame based on multiple conditions that apply at a specific time point

My data frame looks like this.
data <- data.frame(group=c("A","B","C","A","B","C","A","B","C"),
                   time=c(rep(1,3), rep(2,3), rep(3,3)),
                   value=c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to focus on time point 1 and, based on the values at that time point,
filter out the groups that do not fulfil a condition, removing their later time points as well.
Specifically, I would like to delete all rows of any group whose value at time point 1 is bigger than 0.5
or smaller than 0.1.
I want my data.frame to look like this.
group time value
1 A 1 0.2
2 A 2 0.1
3 A 3 10.0
Any help is highly appreciated.
You can select groups where value at time = 1 is between 0.1 and 0.5.
library(dplyr)
data %>%
  group_by(group) %>%
  filter(between(value[time == 1], 0.1, 0.5))
# group time value
# <chr> <dbl> <dbl>
#1 A 1 0.2
#2 A 2 0.1
#3 A 3 10
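For completeness, a base R sketch of the same logic (again assuming each group has exactly one row at time == 1):
subset(data, group %in% group[time == 1 & value >= 0.1 & value <= 0.5])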

Thresholding a data frame without removing values

I have a data frame consisting of a non-unique identifier (ID) and measures of some property of the objects within that ID, something like this:
ID Sph
A 1.0
A 1.2
A 1.1
B 0.5
B 1.8
C 2.2
C 1.1
D 2.1
D 3.0
First, I get the number of instances of each ID as X using table(df$ID), i.e. A=3, B=2, C=2 and D=2. Next, I would like to apply a threshold to the "Sph" column, limiting the data to rows where the Sph value exceeds the threshold; with threshold 2.0, for instance, I would use thold=df[df$Sph>2.0,]. Finally, I would like to replace the ID column with the X value computed using table above. For instance, with a threshold of 1.1 on the "Sph" column I would like the following output:
ID Sph
3 1.0
2 1.8
2 2.2
2 2.1
2 3.0
In other words, after using table() to get a value X corresponding to the number of times an ID occurs, say 3, I would like to assign that number to every row of that ID whose Sph value is over some threshold.
There are some inconsistencies in your question and you didn't give a reproducible example; however, here's my attempt.
I like to use the dplyr library; in this case I had to break out an sapply, so maybe someone can improve on my answer.
Here's the short version:
library(dplyr)
#your data
x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
#lookup table
y <- summarise(group_by(x,ID), IDn=n())
#fill in original table
x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
#filter for rows where Sph greater or equal to 1.1
x <- x %>% filter(Sph>=1.1)
#done
x
And here's the longer version with explanatory output:
> library(dplyr)
> #your data
> x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
> x
ID Sph
1 A 1.0
2 A 1.2
3 A 1.1
4 B 0.5
5 B 1.8
6 C 2.2
7 C 1.1
8 D 2.1
9 D 3.0
>
> #lookup table
> y <- summarise(group_by(x,ID), IDn=n())
> y
Source: local data frame [4 x 2]
ID IDn
1 A 3
2 B 2
3 C 2
4 D 2
>
> #fill in original table
> x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
> x
ID Sph IDn
1 A 1.0 3
2 A 1.2 3
3 A 1.1 3
4 B 0.5 2
5 B 1.8 2
6 C 2.2 2
7 C 1.1 2
8 D 2.1 2
9 D 3.0 2
>
> #filter for rows where Sph greater or equal to 1.1
> x <- x %>% filter(Sph>=1.1)
>
> #done
> x
ID Sph IDn
1 A 1.2 3
2 A 1.1 3
3 B 1.8 2
4 C 2.2 2
5 C 1.1 2
6 D 2.1 2
7 D 3.0 2
You can actually do this in one step after computing X and thold as you did in your question:
X <- table(df$ID)
thold <- df[df$Sph > 1.1,]
thold$ID <- X[as.character(thold$ID)]
thold
# ID Sph
# 2 3 1.2
# 5 2 1.8
# 6 2 2.2
# 8 2 2.1
# 9 2 3.0
Basically you look up the frequency of each ID value in the table X that you built.
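As a side note, a more recent dplyr sketch of the same idea (add_count counts rows per ID into a column named n; x0 here is just a fresh copy of the example data, since x was filtered above):
library(dplyr)
x0 <- data.frame(ID=c(rep("A",3), rep("B",2), rep("C",2), rep("D",2)),
                 Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),
                 stringsAsFactors=FALSE)
x0 %>% add_count(ID) %>% filter(Sph >= 1.1)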

Multiplying values in same position in R

I am working in R and I have two datasets. One dataset contains a contribution amount, and the other includes an include/exclude flag. Below are the data:
> contr_df
asof_dt X Y
1 2014-11-03 0.3 1.2
2 2014-11-04 -0.5 2.3
3 2014-11-05 1.2 0.4
> inex_flag
asof_dt X Y
1 2014-11-03 1 0
2 2014-11-04 1 1
3 2014-11-05 0 0
I would like to create a third dataset that shows one multiplied by the other. For example, I want to see the following:
2014-11-03 0.3 * 1 1.2*0
2014-11-04 -0.5*1 2.3*1
2014-11-05 1.2*0 0.4*0
So far the only way that I've been able to accomplish this is by using a for loop that loops through the total number of columns. However, this is complicated and inefficient. I was wondering whether there was an easier way to do it. Does anyone know of a better solution?
This does the multiplication, but doesn't make sense for factors:
df1 * df2
# asof_dt X Y
#1 NA 0.3 0.0
#2 NA -0.5 2.3
#3 NA 0.0 0.0
#Warning message:
#In Ops.factor(left, right) : '*' not meaningful for factors
One option: you can cbind the first column and the multiplied values like this:
cbind(df1[1], df1[-1] * df2[-1])
# asof_dt X Y
#1 2014-11-03 0.3 0.0
#2 2014-11-04 -0.5 2.3
#3 2014-11-05 0.0 0.0
That is, you multiply df1 and df2 without the first column of each data frame, then prepend the first column of df1 with the dates.
The one-line answer is:
mapply(`*`, contr_df, inex_flag)
This will pair-wise apply the scalar multiplication function across the data.frame columns.
d = data.frame(a=c(1,2,3), b=c(0,2,-1))
e = data.frame(a=c(.2, 2, -1), b=c(0, 2, -2))
mapply(`*`, d, e)
a b
[1,] 0.2 0
[2,] 4.0 4
[3,] -3.0 2
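Note that mapply applied to the full data frames would again run into the date column, and it returns a matrix rather than a data frame. A sketch that multiplies only the numeric columns and binds the dates back on:
cbind(contr_df[1], mapply(`*`, contr_df[-1], inex_flag[-1]))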

Calculate the difference between pairs of consecutive rows in a data frame - R

I have a data.frame in which each gene name is repeated and contains values for 2 conditions:
df <- data.frame(gene=c("A","A","B","B","C","C"),
                 condition=c("control","treatment","control","treatment","control","treatment"),
                 count=c(10, 2, 5, 8, 5, 1),
                 sd=c(1, 0.2, 0.1, 2, 0.8, 0.1))
gene condition count sd
1 A control 10 1.0
2 A treatment 2 0.2
3 B control 5 0.1
4 B treatment 8 2.0
5 C control 5 0.8
6 C treatment 1 0.1
I want to determine whether "count" increases or decreases after treatment, mark the genes as such, and/or subset them. That is (pseudocode, where geneRow1 is the control row and geneRow2 the treatment row):
for each unique(gene) do
  if df[geneRow2,3] - df[geneRow1,3] > 0 then gene is "up"
  else gene is "down"
This is what it should look like in the end (the last column is optional):
up-regulated
gene condition count sd regulation
B control 5 0.1 up
B treatment 8 2.0 up
down-regulated
gene condition count sd regulation
A control 10 1.0 down
A treatment 2 0.2 down
C control 5 0.8 down
C treatment 1 0.1 down
I have been racking my brain with this, including playing with ddply, and I've failed to find a solution - please help a hapless biologist.
Cheers.
The plyr solution would look something like:
library(plyr)
reg.fun <- function(x) {
  # treatment minus control: a positive difference means the count went up after treatment
  reg.diff <- x$count[x$condition=='treatment'] - x$count[x$condition=='control']
  x$regulation <- ifelse(reg.diff > 0, 'up', 'down')
  x
}
ddply(df, .(gene), reg.fun)
  gene condition count  sd regulation
1    A   control    10 1.0       down
2    A treatment     2 0.2       down
3    B   control     5 0.1         up
4    B treatment     8 2.0         up
5    C   control     5 0.8       down
6    C treatment     1 0.1       down
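A dplyr sketch of the same grouped classification (assuming exactly one control and one treatment row per gene):
library(dplyr)
df %>%
  group_by(gene) %>%
  mutate(regulation = ifelse(count[condition == "treatment"] >
                             count[condition == "control"], "up", "down"))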
You could also think about doing this with a different package and/or with data in a different shape:
df.w <- reshape(df, direction='wide', idvar='gene', timevar='condition')
library(data.table)
DT <- data.table(df.w, key='gene')
DT[, regulation:=ifelse(count.treatment - count.control > 0, 'up', 'down'), by=gene]
   gene count.control sd.control count.treatment sd.treatment regulation
1:    A            10        1.0               2          0.2       down
2:    B             5        0.1               8          2.0         up
3:    C             5        0.8               1          0.1       down
Something like this:
# within each gene, diff(count) is treatment minus control, so a
# negative difference means the count went down after treatment
df$up.down <- with(df, ave(count, gene,
                           FUN=function(diffs) c("up", "down")[1 + (diff(diffs) < 0)]))
spltdf <- split(df, df$up.down)
> df
gene condition count sd up.down
1 A control 10 1.0 down
2 A treatment 2 0.2 down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
5 C control 5 0.8 down
6 C treatment 1 0.1 down
> spltdf
$down
gene condition count sd up.down
1 A control 10 1.0 down
2 A treatment 2 0.2 down
5 C control 5 0.8 down
6 C treatment 1 0.1 down
$up
gene condition count sd up.down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
