Conditional data.table merge with .EACHI

I have been playing around with the newer data.table conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max or min for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.
Set up environment
## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")
Create two fake datasets
Create a "big" table with 50 rows (10 values for each ID).
library(data.table)
set.seed(1L)
# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]
ID ValueBig Rank
1: A 266 3
2: A 373 4
3: A 573 5
4: A 909 9
5: A 202 2
---
46: E 790 9
47: E 24 1
48: E 478 2
49: E 733 7
50: E 693 6
Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)
dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))
ID ValueSmall
1: A 478
2: A 862
3: B 439
4: B 245
5: C 71
6: C 100
7: D 317
8: D 519
9: E 663
10: E 407
Merge
I next want to perform a merge by ID that merges only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig. I tried doing this two different ways. Method 2 gives me the desired output, but I am unclear why the two outputs differ at all. It seems like it is just returning the last matched value.
## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
Results
ID ValueSmall RankSmall RankSmall2 DesiredRank
1: A 478 1 4 4
2: A 862 1 7 7
3: B 439 3 4 4
4: B 245 1 2 2
5: C 71 1 1 1
6: C 100 1 1 1
7: D 317 1 2 2
8: D 519 3 5 5
9: E 663 2 5 5
10: E 407 1 1 1
Is there a better data.table way of grabbing the max value in another data.table with multiple matches?

I next want to perform a merge by ID that merges only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), mult = "last", x.Rank]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row than to compute the max by .EACHI, but I am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards, as in the sketch below.
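A minimal sketch of saving and restoring the sort order; the helper column orig_order is illustrative, not part of the original answer:
dtBig[, orig_order := .I]            # remember the current row order
setorder(dtBig, ID, ValueBig, Rank)  # sort for the mult="last" join
## ... perform the join shown above ...
setorder(dtBig, orig_order)          # revert to the original order
dtBig[, orig_order := NULL]          # drop the helper column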
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works; just make sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), max(x.Rank), by = .EACHI]$V1
]
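An equivalent small variation (a sketch): name the aggregate in j so the $V1 lookup becomes a named column:
dtSmall[, r :=
  dtBig[.SD, on = .(ID, ValueBig <= ValueSmall), .(r = max(x.Rank)), by = .EACHI]$r
]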

Related

Column-specific arguments to lapply in data.table .SD when applying rbinom

I have a data.table to which I want to add columns of random binomial numbers, using one column as the number of trials and other columns as the probabilities:
require(data.table)
DT <- data.table(
  ID = letters[sample.int(26, 10, replace = TRUE)],
  Quantity = as.integer(100 * runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns, Quantity_A, Quantity_B, Quantity_C, Quantity_D, and Quantity_E, which apply rbinom with the corresponding probability and the quantity from the second column.
So for example the first entry for Quantity_A would be:
set.seed(741)
sum(rbinom(66, 1, 0.05751550))
[1] 2
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD? but I cannot seem to make it work. My try:
DT[, (paste0("Quantity_", prob.vecs)) := mapply(function(x, Quantity) { sum(rbinom(Quantity, 1, x)) }, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
  argument "Quantity" is missing, with no default
Any ideas?
I seem to have found a workaround, though I am not quite sure why it works (it probably has something to do with rbinom not being vectorized in both arguments). First define an index:
DT[, Index := .I]
and then do it by index:
DT[, (paste0("Quantity_", prob.vecs)) := lapply(.SD, function(x) { sum(rbinom(Quantity, 1, x)) }), .SDcols = prob.vecs, by = Index]
set.seed(789)
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me. A solution without the index would still be appreciated, though; one possibility is sketched below.
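One index-free possibility (a sketch, not from the original thread): wrap mapply inside lapply so that rbinom receives both the per-row Quantity and the per-row probability:
DT[, (paste0("Quantity_", prob.vecs)) := lapply(.SD, function(p)
  mapply(function(n, pr) sum(rbinom(n, 1, pr)), Quantity, p)),
  .SDcols = prob.vecs]
Each mapply call walks Quantity and one probability column in parallel, so no grouping by row index is needed.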

Add a row of zeros in a data frame created with ddply if there are no observations

I used the function ddply (package plyr) to calculate the mean of a response variable for each combination of "Trial" and "Treatment". I get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
This data frame suggests that for trial 4 and treatment B there are no observations of the response variable (no such row appears in the data frame). So, is it possible to automatically add a row of zeros to the data frame (built with ddply) when there are no observations for a given combination?
I would like to get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
4 B 0 0
We can merge the original dataset with another data.frame created from the full combination of unique values in 'Trial' and 'Treatment'. This gives an output with the missing combinations filled with NA. If needed, the NAs can be changed to 0 (though it is often better to keep the missing combinations as NA).
res <- merge(expand.grid(lapply(df1[1:2], unique)), df1, all.x = TRUE)
res[is.na(res)] <- 0  # optional: replace the NAs with 0
Or with dplyr/tidyr, we can use complete (from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
  complete(Trial, Treatment, fill = list(N = 0, Mean = 0))
# Trial Treatment N Mean
# (int) (chr) (dbl) (dbl)
#1 1 A 458 125.258
#2 1 B 459 168.748
#3 2 A 742 214.266
#4 2 B 142 475.786
#5 3 A 247 145.689
#6 3 B 968 234.129
#7 4 A 436 456.287
#8 4 B 0 0.000
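Since the rest of this page is data.table-centric, here is a sketch of the same idea using data.table's CJ (cross join); setDT(df1) assumes df1 is a plain data.frame:
library(data.table)
setDT(df1)
full <- CJ(Trial = unique(df1$Trial), Treatment = unique(df1$Treatment))
df1[full, on = .(Trial, Treatment)]  # missing combinations come back as NA rows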

Use previous calculated row value in r Continued 2

I have a data.table that looks like this:
library(data.table)
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to be able to calculate a new column "R" whose first value is
DT$R[1] <- tanh(DT$B[1]/400000)
and then I want to use each previously calculated value of R to help calculate the next row's value of R:
DT$R[2] <- 0.5*tanh(DT$B[2]/400000) + DT$R[1]*0.6
DT$R[3] <- 0.5*tanh(DT$B[3]/400000) + DT$R[2]*0.6
DT$R[4] <- 0.5*tanh(DT$B[4]/400000) + DT$R[3]*0.6
This will then look a bit like this
A B C R
1: 1 10 100 2.5e-05
2: 2 20 200 4e-05
3: 3 30 300 6.15e-05
4: 4 40 400 8.69e-05
5: 5 50 500 0.00011464
...
20: 20 200 2000 0.0005781274
Any ideas on how this could be done?
Is this what you are looking for?
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT$R <- 0
DT$R[1] <- tanh(DT$B[1]/400000)
for (i in 2:nrow(DT)) {
  DT$R[i] <- 0.5*tanh(DT$B[i]/400000) + DT$R[i-1]*0.6
}
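A loop-free alternative (a sketch): the recurrence is linear in R, so base R's Reduce with accumulate = TRUE can build the whole column in one call:
DT[, R := Reduce(function(prev, b) 0.5 * tanh(b / 400000) + prev * 0.6,
                 B[-1], init = tanh(B[1] / 400000), accumulate = TRUE)]
The init value reproduces the special first row, and each accumulated step applies the 0.5*tanh(...) + 0.6*previous update.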

nomatch in data.table ':=' not updating properly

I am using data.table 1.9.4 with R 3.1.2 on x86_64-pc-linux-gnu (64-bit). The nomatch argument with ':=' doesn't seem to work. The following is sample code:
library(data.table)
options(datatable.nomatch = 0)
dt1 <- data.table(
  a = c(rep(1, 2), rep(2, 2), 3),
  b = c(1:2, 1:2, 1),
  c = 101:105)
setkey(dt1, a, b)
dt1
dtw <- data.table(a = c(1, 3), w1 = c(201, 203), w2 = c(301, 303))
setkey(dtw, a)
dtw
dt1[dtw, ':='(w1 = i.w1, w2 = i.w2)]
dt1
It returns NA in the w1 and w2 columns instead of 0:
a b c w1 w2
1: 1 1 101 201 301
2: 1 2 102 201 301
3: 2 1 103 NA NA
4: 2 2 104 NA NA
5: 3 1 105 203 303
The correct output should be
a b c w1 w2
1: 1 1 101 201 301
2: 1 2 102 201 301
3: 2 1 103 0 0
4: 2 2 104 0 0
5: 3 1 105 203 303
What am I doing wrong, and how do I get 0 instead of NA?
As mentioned in the comments, you misunderstood the current nomatch behavior; nomatch=0 doesn't fill with 0.
I'm not sure nomatch affects := at all. It is used in joins without := to indicate whether the join should be inner or outer.
Be aware that the nomatch behavior is likely to change in 2.0.0, quite a distant future as that is 4 stable releases from now.
For reference, the current discussion of that change is in issue #857.
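If the goal is simply to end up with 0 instead of NA, one option (a sketch) is to replace the NAs right after the update join:
dt1[dtw, ':='(w1 = i.w1, w2 = i.w2)]
dt1[is.na(w1), c("w1", "w2") := .(0, 0)]  # unmatched rows get 0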

Is it possible to compute any window join in data.table

I came across a lot of posts asking about window joins:
rolling median
rolling regression
Since data.table 1.8.8 and the roll parameter, my understanding is that we can do those things. Say we have X and Y with the same keys, say x,y,t. We want, for each row of X, to get
all the rows of Y where x,y of Y match those of X AND where X$t is in [Y$t-w1,Y$t+w2]
Here is an example with (w1,w2)=(1,5)
library(data.table)
A <- data.table(x=c(1,1,1,2,2),y=c(F,F,T,T,T),t=c(407,286,788,882,942),key='x,y,t')
X <- copy(A)
Y <- data.table(x=c(1,1,1,2,2,2,2),y=c(F,F,T,T,T,T,T),u=c(417,285,788,882,941,942,945),IDX=1:7,key='x,y,u')
R) X
x y t
1: 1 FALSE 286
2: 1 FALSE 407
3: 1 TRUE 788
4: 2 TRUE 882
5: 2 TRUE 942
R) Y
x y u IDX
1: 1 FALSE 285 2 # match line 1 as (x,y) ok and 285 in [286-1,286+5]
2: 1 FALSE 417 1 # match no line as (x,y) ok against X[c(1,2),] but 417 is too big
3: 1 TRUE 788 3 # match row 3
4: 2 TRUE 882 4 # match row 4
5: 2 TRUE 941 5 # match row 5
6: 2 TRUE 942 6 # match row 5
7: 2 TRUE 945 7 # match row 5
We cannot do Y[setkey(X[,list(x,y,t)],x,y,t),roll=1] because, if we have a perfect match on (x,y,t), data.table will discard potential partial matches with X$t in [Y$t-w1,X$t[.
# get the lower and upper bounds for t
X[, `:=`(lowT = t - 1, upT = t + 5)]
# get the first row of Y where Y$u >= X$t - 1 (but Y$u <= X$t + 5)
X <- setnames(copy(Y), c('u','IDX'), c('lowT','lowIDX'))[setkey(X, x, y, lowT), roll = -6, rollends = TRUE]
# get the last row of Y where Y$u <= X$t + 5 ...
X <- setnames(copy(Y), c('u','IDX'), c('upT','upIDX'))[setkey(X, x, y, upT), roll = 6]
# get the matching IDX values
X[!is.na(lowIDX) & !is.na(upIDX), allIDX := mapply(seq, from = lowIDX, to = upIDX)]
R) X
x y upT upIDX lowT lowIDX t allIDX
1: 1 FALSE 291 2 285 2 286 2
2: 1 FALSE 412 NA 406 NA 407
3: 1 TRUE 793 3 787 3 788 3
4: 2 TRUE 887 4 881 4 882 4
5: 2 TRUE 947 7 941 5 942 5,6,7
My questions are:
Am I correct to think that window joins could not be achieved easily before roll?
Can we solve the problem if we want X$t in ]Y$t-w1,Y$t+w2[ (an open interval, so no longer a compact set)?
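For reference, later data.table versions (1.9.8+) support non-equi joins, which can express the window condition directly; a sketch, with illustrative bound columns lo and hi:
X[, `:=`(lo = t - 1, hi = t + 5)]
Y[X, on = .(x, y, u >= lo, u <= hi), .(t = i.t, IDX = x.IDX)]  # NA IDX where no match
Strict inequalities (u > lo, u < hi) in on would cover the open-interval variant asked about above.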
