Column-specific arguments to lapply in data.table .SD when applying rbinom - r

I have a data.table to which I want to add columns of random binomial numbers, using one column as the number of trials and several other columns as the probabilities:
require(data.table)
DT = data.table(
  ID = letters[sample.int(26, 10, replace = TRUE)],
  Quantity = as.integer(100 * runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns Quantity_A, Quantity_B, Quantity_C, Quantity_D, Quantity_E,
each applying rbinom with the corresponding probability column and the Quantity from the second column.
So, for example, the first entry of Quantity_A would be:
set.seed(741)
sum(rbinom(66,1,0.05751550))
> 2
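As an aside, a sum of 66 independent Bernoulli(p) draws is distributionally the same as a single Binomial(66, p) draw, so the quantity above can also be generated in one call (the exact value under the same seed will differ, since fewer random numbers are consumed):
set.seed(741)
rbinom(1, 66, 0.05751550)  # one Binomial(66, p) draw instead of 66 Bernoulli draws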
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD?, but I cannot seem to make it work. My attempt:
DT[,(paste0("Quantity_", prob.vecs)):= mapply(function(x, Quantity){sum(rbinom(Quantity, 1 , x))}, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
argument "Quantity" is missing, with no default
Any ideas?

I seem to have found a workaround, though I am not quite sure why it works (probably something to do with how rbinom is vectorized over its arguments):
first define an index:
DT[,Index:=.I]
and then do it by index:
DT[,(paste0("Quantity_", prob.vecs)):= lapply(.SD,function(x){sum(rbinom(Quantity, 1 , x))}), .SDcols = prob.vecs, by=Index]
set.seed(789)
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me.
A solution without the index would still be appreciated.
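For reference, an index-free sketch: rbinom(n, size, prob) recycles both size and prob to length n, so each call below draws one Binomial(Quantity[i], p[i]) value per row, which is equivalent to summing Quantity[i] Bernoulli(p[i]) draws. (The mapply attempt above failed simply because Quantity was never passed to the function.)
# sketch: one binomial draw per row, no Index/by needed
DT[, (paste0("Quantity_", prob.vecs)) :=
     lapply(.SD, function(p) rbinom(.N, Quantity, p)),
   .SDcols = prob.vecs]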

Related

Comparing each element in two columns and set another column

I have a data frame (after fread from a file) with two columns (dep and label). I want to set another column (mark) with an id value depending on the match. If the 'dep' entry matches a 'label' entry, mark gets the 'id' of the matched 'label'. For no match, mark gets its own 'id'. Currently I have a workaround with loops, but I know there should be a neat way to do it in R specifically.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
label=c(99,40,43,45,47,42,48), mark=rep("",7))
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
I know loops are slow in R; just to give an example, the following naive for/while works for small sizes, but my data set is huge.
trace$mark <- trace$id
for (i in 1:length(trace$id)){
val <- trace$dep[i]
j <- 1
while(j<=i && val !=-1 && val!=0){ # don't compare if val is -1/0
if(val==trace$label[j]){
trace$mark[i] <- trace$id[j]
}
j <-j +1
}
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
# id dep label mark
# 1: 1 -1 99 1
# 2: 2 45 40 4
# 3: 3 -1 43 3
# 4: 4 47 45 5
# 5: 5 -1 47 5
# 6: 6 45 42 4
# 7: 7 43 48 3
Update:
To make sure you are not matching dep with 0 or -1 values, you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked:
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.character(i.id),allow.cartesian=TRUE]
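All three variants use data.table's update-on-join idiom, X[Y, col := i.val, on = ...], where the i. prefix refers to columns of the inner table Y; the accepted line additionally uses a non-equi condition (id <= id). A minimal standalone sketch of the plain equi-join form:
library(data.table)
# for each row of `lookup`, matching rows of `dt` are updated in place;
# `i.` prefixes columns taken from `lookup`
dt     <- data.table(key = c("a", "b", "c"), val = NA_integer_)
lookup <- data.table(key = c("b", "c"), newval = c(2L, 3L))
dt[lookup, val := i.newval, on = "key"]
dt
#    key val
# 1:   a  NA
# 2:   b   2
# 3:   c   3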

Adding multiple columns to data.table in R

I have a data.table with sequences and number of reads, like so:
sequence num_reads
1: AACCTGCCG 1
2: CGCGCTCAA 12
3: AGTGTGAGC 3
4: TGGGTACAC 11
5: GGCCGCGTG 15
6: CCTTAAGAG 2
7: GCGGAACTG 9
8: GCGTTGTAG 17
9: GTTGTAGCG 20
10: ACACGTGAC 16
I'd like to use data.table to add two new columns to this table, based on the results of applying dpois() with two weights and two lambdas. The correct output should be this (based on using data.frame):
sequence num_reads clus1 clus2
1 AACCTGCCG 1 2.553269503552647000377e-03 1.610220613932057849571e-03
2 CGCGCTCAA 12 1.053993989051599418361e-02 2.887608256917401083896e-02
3 AGTGTGAGC 3 2.085170094567994833468e-02 1.717568654860860896672e-02
4 TGGGTACAC 11 1.806846838374168498498e-02 4.331412385376097462508e-02
5 GGCCGCGTG 15 1.324248858039188620275e-03 5.415587646672919558410e-03
6 CCTTAAGAG 2 8.936443262434262332916e-03 6.440882455728230530922e-03
7 GCGGAACTG 9 4.056186780023639942838e-02 7.444615037365168164207e-02
8 GCGTTGTAG 17 2.385595369261770803265e-04 1.274255916864215588610e-03
9 GTTGTAGCG 20 1.196285397159046524451e-05 9.538289904012846548518e-05
10 ACACGTGAC 16 5.793588753921446012421e-04 2.707793823336458478163e-03
But when I try to use data.table I can't seem to get the right result. Here is what I tried (based on similar questions asked around this topic):
pois = function(n, p, l){return(dpois(as.numeric(as.character(n)), l)*p) }
x = x[, c(paste("clus", seq(1,2), sep = '')) := pois(num_reads, c(0.4,0.6), c(7,8)), by = seq_len(nrow(x))]
And here is the result:
sequence num_reads clus1 clus2
1: AACCTGCCG 1 2.553269503552647000377e-03 2.553269503552647000377e-03
2: CGCGCTCAA 12 1.053993989051599418361e-02 1.053993989051599418361e-02
3: AGTGTGAGC 3 2.085170094567994833468e-02 2.085170094567994833468e-02
4: TGGGTACAC 11 1.806846838374168498498e-02 1.806846838374168498498e-02
5: GGCCGCGTG 15 1.324248858039188620275e-03 1.324248858039188620275e-03
6: CCTTAAGAG 2 8.936443262434262332916e-03 8.936443262434262332916e-03
7: GCGGAACTG 9 4.056186780023639942838e-02 4.056186780023639942838e-02
8: GCGTTGTAG 17 2.385595369261770803265e-04 2.385595369261770803265e-04
9: GTTGTAGCG 20 1.196285397159046524451e-05 1.196285397159046524451e-05
10: ACACGTGAC 16 5.793588753921446012421e-04 5.793588753921446012421e-04
The reason I'm using data.table and not data.frame is that my real data has 100,000s of rows. I studied the answers to this and this but I haven't been able to come up with a solution.
Any tips you have would be much appreciated. Thanks!
We can try
x[, paste("clus", seq(1,2), sep = ''):=
as.list(pois(num_reads, c(0.4,0.6), c(7,8))), by = seq_len(nrow(x))]
x
# sequence num_reads clus1 clus2
#1: AACCTGCCG 1 0.00255326950 0.0016102206
#2: CGCGCTCAA 12 0.01053993989 0.0288760826
#3: AGTGTGAGC 3 0.02085170095 0.0171756865
#4: TGGGTACAC 11 0.01806846838 0.0433141239
#5: GGCCGCGTG 15 0.00132424886 0.0054155876
#6: CCTTAAGAG 2 0.00893644326 0.0064408825
#7: GCGGAACTG 9 0.04056186780 0.0744461504
#8: GCGTTGTAG 17 0.00023855954 0.0012742559
#9: GTTGTAGCG 20 0.00001196285 0.0000953829
#10: ACACGTGAC 16 0.00057935888 0.0027077938
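The as.list() is the key fix: when several columns are assigned with :=, the right-hand side must be a list with one element per target column; a plain length-2 vector is instead recycled within each one-row group, which is why both columns ended up identical above. Since dpois() is itself vectorized over x, a sketch that avoids the per-row grouping altogether (assuming num_reads is numeric; otherwise keep the OP's as.numeric(as.character(.)) conversion):
# vectorized sketch: one dpois() call per cluster column, no by needed
x[, `:=`(clus1 = 0.4 * dpois(num_reads, lambda = 7),
         clus2 = 0.6 * dpois(num_reads, lambda = 8))]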

Calculate mean of a proportion of the data.frame

I'm working with data that looks similar to this:
cat value n
1 100 18
2 0 19
3 -100 15
4 100 13
5 0 17
6 -100 18
In the real data, there are many cats and value can be any number between -100 and 100 (no NA).
What I want to do is to calculate the mean of value based on terciles defined by n
So, for example, since sum(n)=100, what I want is to get groups of n as close as possible to 33 and calculate the mean of value within each. For the first tercile, 18 isn't quite 33, so I need to take 15 values from cat=2; the mean for the first tercile is then (100*18 + 0*15)/(18+15). The second tercile takes the remaining n's from cat=2, then as many as are needed to get to 33: (0*4 + (-100)*15 + 100*13 + 0*1)/(4+15+13+1). Similarly for the last tercile.
I got started writing this, but ended up with lots of nasty for loops and if statements. I'm hoping that you see an easier way to deal with this than I do. Thanks in advance!
A solution with data.table:
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))
][,.(mean_val_indx=mean(value)),by=indx]
this gives:
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
Which are the means of value for the three parts of the data.
Broken down into the intermediate steps:
1: replicate the rows according to n
setDT(df)[rep(1:.N,n)]
this gives (shortened):
cat value n
1: 1 100 18
2: 1 100 18
....
17: 1 100 18
18: 1 100 18
19: 2 0 19
20: 2 0 19
....
36: 2 0 19
37: 2 0 19
38: 3 -100 15
....
99: 6 -100 18
100: 6 -100 18
2: create an index with [,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
this gives:
> dt
cat value n indx
1: 1 100 18 a
2: 1 100 18 a
....
17: 1 100 18 a
18: 1 100 18 a
19: 2 0 19 a
20: 2 0 19 a
....
32: 2 0 19 a
33: 2 0 19 a
34: 2 0 19 b
35: 2 0 19 b
....
99: 6 -100 18 c
100: 6 -100 18 c
3: summarise value by indx with [,.(mean_val_indx=mean(value)),by=indx]
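For reference, a sketch generalizing the hard-coded 33/33/34 index to any sum(n) and any number of groups m; cut(..., right = FALSE) on 1:100 with m = 3 reproduces exactly the a/b/c split above:
library(data.table)
m <- 3
setDT(df)[rep(1:.N, n)
  ][, indx := cut(seq_len(.N), breaks = m, labels = letters[1:m],
                  right = FALSE)   # groups of 33/33/34 for 100 rows
  ][, .(mean_val_indx = mean(value)), by = indx]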
You could try something like this, data being your example dataframe:
longData<-unlist(apply(data[,c("value","n")],1,function(x){
rep(x["value"],x["n"])
}))
aggregate(longData,list(cut(seq_along(longData),breaks=3,right=FALSE)),mean)
longData will be a vector of length 100 with, using your example, 18 repetitions of 100, 19 repetitions of 0, etc.
The cut in the aggregate will divide longData into three groups, and the mean of each group will be calculated.
If the data is already very long, replicating the rows by "n" is perhaps unwanted.
The following solution doesn't do this. Moreover, 1/3 of the sum of the
"n"-values is not rounded to the nearest integer.
"i" is the vector of row numbers where terciles end. Since it is possible
that several terciles end at the same row, those row numbers are replicated;
the result is the vector "k".
For each index "j", the cumulative sum of "data$value"*"data$n" up to "k[j]"
covers "ms[k[j]]" terciles, so "ms[k[j]]-j" terciles have to be subtracted
to get the cumulative sum up to the "j"th tercile.
m <- 3
sn <- sum(data$n)
ms <- m * cumsum(data$n) / sn
d <- diff(c(0,floor(ms)))
i <- which(d>0)
k <- rep(i,d[i])
vn <- data$value * data$n
sums <- cumsum(vn)[k] - (ms[k]-(1:m))*data$value[k]*sn/m
means <- m*diff(c(0,sums))/sn
The means of the terciles are:
> means
[1] 54 -6 -54
In this example "i" is equal to "k". But if terciles are replaced by deciles,
i.e. "m" is not 3 but 10, they are distinct:
> m
[1] 10
> i
[1] 1 2 3 4 5 6
> k
[1] 1 2 2 3 3 4 5 5 6 6
> means
[1] 100 80 0 -30 -100 60 50 0 -80 -100
I compared the speed of the 4 answers, using our small example with 8 rows:
> ##### "longData"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A1 <- f1(data) } )
   user  system elapsed
   3.48    0.00    3.49
> ##### "sapply"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A2 <- f2(data) } )
   user  system elapsed
   1.00    0.00    0.99
> ##### "data.table"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A3 <- f3(data) } )
   user  system elapsed
   4.73    0.00    4.79
> ##### this Answer #####
>
> system.time( for ( i in 1:1000 ) { A4 <- f4(data) } )
   user  system elapsed
   0.43    0.00    0.44
The "sapply"-Answer is even false:
> A1
Group.1 x
1 [0.901,34) 54.545455
2 [34,67) -6.060606
3 [67,100) -52.941176
> A2
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
> A3
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
> A4
[1] 54 -6 -54
>
This is basically the same as NicE's, although perhaps useful as a different way of assembling the rep and cutting operations:
sapply(split( sort(unlist( mapply(rep, res$value, res$n) )),
cut(seq(sum(res$n)), breaks=c(0,33,67,100) )),
mean)
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
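The wrong result flagged in the benchmark above comes from the sort() call, which reorders the 100 replicated values before grouping, so the groups no longer correspond to the original row order. A sketch without the sort() (and with the middle break moved to 66 so the split is 33/33/34, matching the data.table answer):
sapply(split( unlist( mapply(rep, res$value, res$n) ),
              cut(seq(sum(res$n)), breaks=c(0,33,66,100) )),
       mean)
#     (0,33]    (33,66]   (66,100]
#  54.545455  -6.060606 -52.941176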

Insert NA's in case there are no observations when using subset() and then dcast or tapply

I have the following data frame (this is only the head). The subject column is the ID (I have more subjects in the data frame, not only subject #99). I want to calculate the mean "rt" by "subject" and "condition", only for observations whose z.score is smaller than 1 in absolute value.
> b
subject rt ac condition z.score
1 99 1253 1 200_9 1.20862682
2 99 1895 1 102_2 2.95813507
3 99 1049 1 68_1 1.16862102
4 99 1732 1 68_9 2.94415384
5 99 765 1 34_9 -0.63991180
7 99 1016 1 68_2 -0.03191493
I know I can do it using tapply or dcast (from reshape2) after subsetting the data:
b1 <- subset(b, abs(z.score) < 1)
b2 <- dcast(b1, subject~condition, mean, value.var = "rt")
subject 34_1 34_2 34_9 68_1 68_2 68_9 102_1 102_2 102_9 200_1 200_2 200_9
1 99 1028.5714 957.5385 861.6818 837.0000 969.7222 856.4000 912.5556 977.7273 858.7800 1006.0000 1015.3684 913.2449
2 5203 957.8889 815.2500 845.7750 933.0000 893.0000 883.0435 926.0000 879.2778 813.7308 804.2857 803.8125 843.7200
3 5205 1456.3333 1008.4286 850.7170 1142.4444 910.4706 998.4667 935.2500 980.9167 897.4681 1040.8000 838.7917 819.9710
4 5306 1022.2000 940.5882 904.6562 1525.0000 1216.0000 929.5167 955.8571 981.7500 902.8913 997.6000 924.6818 883.4583
5 5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6 5308 659.8571 904.2857 966.7755 960.9091 1048.6000 904.5082 836.2000 1753.6667 926.0400 870.2222 1066.6667 930.7500
In the example above, every subject had observations that met the subset condition for b1.
However, it can happen that a certain subject has no observations left after subsetting. In that case I want b2 to contain NA for that subject in each condition for which no observations meet the subset condition. Does anyone have an idea for a way to do that?
Any help will be greatly appreciated.
Best,
Ayala
There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.
Here is a dataset with a second subject ID that has no values that meet your condition that the absolute value of z.score is less than one.
library(reshape2)
bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2),
ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))
If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 99 125 2 10 4
Make subject a factor.
bb$subject = factor(bb$subject)
Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 11 NaN NaN NaN NaN
2 99 125 2 10 4
To get NA instead of NaN you can use the fill argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE, fill = as.numeric(NA))
subject A B C D
1 11 NA NA NA NA
2 99 125 2 10 4
Is the following what you are after? I created a similar dataset "bb":
library("plyr") ###needed for . function below
bb<- data.frame(subject=c(99,99,99,99,99,11,11,11),rt=c(100,150,2,4,10,15,1,2), ac=rep(1,8) ,condition=c("A","A","B","D","C","C","D","D"), z.score=c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))
bb
subject rt ac condition z.score
#1 99 100 1 A 0.2
#2 99 150 1 A 0.3
#3 99 2 1 B 0.2
#4 99 4 1 D 0.3
#5 99 10 1 C 1.5
#6 11 15 1 C -0.3
#7 11 1 1 D 0.8
#8 11 2 1 D 0.7
Then you call dcast with subset included:
cc<-dcast(bb,subject~condition, mean, value.var = "rt",subset = .(abs(z.score)<1))
cc
subject A B C D
#1 11 NaN NaN 15 1.5
#2 99 125 2 NaN 4.0
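A sketch combining the two answers, in case you also want NA rather than NaN from the subset-based variant: make subject a factor and pass fill (bb as defined just above):
bb$subject <- factor(bb$subject)
dcast(bb, subject ~ condition, mean, value.var = "rt",
      subset = .(abs(z.score) < 1), fill = NA_real_)
#   subject   A  B  C   D
# 1      11  NA NA 15 1.5
# 2      99 125  2 NA 4.0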

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows for each unique value of column a, and I want to select the top 5 rows for each unique value of column a. Column c holds the values the data frame is already sorted by, in descending order, so that is not a problem. Can anyone please suggest how I can do this in R?
Stealing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element, subsetting the first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try:
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
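Since most of this page uses data.table, a one-line data.table sketch of the same selection (assuming, as the question states, the data is already sorted by c in descending order):
library(data.table)
setDT(data)[, head(.SD, 5), by = a]   # first 5 rows per unique value of a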
