nomatch in data.table ':=' not updating properly - r

I am using data.table 1.9.4 with R 3.1.2 (platform: x86_64-pc-linux-gnu, 64-bit). The nomatch argument doesn't seem to work with the ':=' functionality. Here is sample code:
library(data.table)
options(datatable.nomatch=0)
dt1 = data.table(
a=c(rep(1, 2), rep(2, 2), 3),
b=c(1:2, 1:2,1),
c=101:105)
setkey(dt1, a, b)
dt1
dtw = data.table(a=c(1,3), w1=c(201, 203), w2=c(301,303))
setkey(dtw, a)
dtw
dt1[dtw, ':='(w1=i.w1, w2=i.w2)]
dt1
It returns NA in the w1 and w2 columns instead of 0:
a b c w1 w2
1: 1 1 101 201 301
2: 1 2 102 201 301
3: 2 1 103 NA NA
4: 2 2 104 NA NA
5: 3 1 105 203 303
The output I expected is:
a b c w1 w2
1: 1 1 101 201 301
2: 1 2 102 201 301
3: 2 1 103 0 0
4: 2 2 104 0 0
5: 3 1 105 203 303
What am I doing wrong, and how do I get 0 instead of NA?

As mentioned in the comments, you misunderstood the current nomatch behavior: nomatch=0 doesn't fill non-matches with 0.
I'm not sure nomatch affects := at all. It is used on a join without := to indicate whether it should be an inner or an outer join.
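To see this on a plain join with the question's own data, join in the other direction: dt1 contains a key value (a == 2) that is absent from dtw, so nomatch controls whether those rows are kept. A small demonstration (spelling nomatch out explicitly, since the question changed the global default):
dtw[dt1, nomatch=NA]   # unmatched rows of dt1 kept, with NA in w1/w2 (outer-style join)
dtw[dt1, nomatch=0]    # unmatched rows (here a == 2) dropped - an inner join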
Be aware that the nomatch behavior is likely to change in 2.0.0 - quite a distant future, as that is 4 stable releases from now.
For reference, the current discussion of that change is in #857.
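As for getting 0 instead of NA: a straightforward sketch (my suggestion, not part of the discussion above) is to keep the update join as written and fill the unmatched rows explicitly afterwards:
dt1[dtw, ':='(w1=i.w1, w2=i.w2)]   # update join; rows with no match in dtw get NA
dt1[is.na(w1), w1 := 0]            # fill the non-matches
dt1[is.na(w2), w2 := 0]
Much newer data.table releases also provide setnafill() for the filling step, but that function does not exist in 1.9.4.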

Related

Column-specific arguments to lapply in data.table .SD when applying rbinom

I have a data.table to which I want to add columns of random binomial draws, using one column as the number of trials and other columns as the probabilities:
require(data.table)
DT = data.table(
ID = letters[sample.int(26,10, replace = T)],
Quantity=as.integer(100*runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns, Quantity_A, Quantity_B, Quantity_C, Quantity_D and Quantity_E, which apply rbinom with the corresponding probability and the quantity from the second column.
So, for example, the first entry for Quantity_A would be:
set.seed(741)
sum(rbinom(66,1,0.05751550))
> 2
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD? But I cannot seem to make it work. My attempt:
DT[,(paste0("Quantity_", prob.vecs)):= mapply(function(x, Quantity){sum(rbinom(Quantity, 1 , x))}, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
argument "Quantity" is missing, with no default
Any ideas?
I seem to have found a workaround, though I am not quite sure why it works (it probably has something to do with the function rbinom not being vectorized in both arguments):
first define an index:
DT[,Index:=.I]
and then do it by index:
DT[,(paste0("Quantity_", prob.vecs)):= lapply(.SD,function(x){sum(rbinom(Quantity, 1 , x))}), .SDcols = prob.vecs, by=Index]
set.seed(789)
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me.
A solution without the index would still be appreciated, though; one possibility is sketched below.
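For the record, a sketch of an index-free alternative (my own suggestion, not from the original thread). The by-row grouping was needed because rbinom's first argument is the number of draws, so sum(rbinom(Quantity, 1, x)) collapses the whole column to a single number. But rbinom() recycles its size and prob arguments across draws, and summing Quantity independent Bernoulli(p) draws is distributionally the same as one Binomial(Quantity, p) draw, so one draw per row suffices:
set.seed(123)
DT[, (paste0("Quantity_", prob.vecs)) :=
     lapply(.SD, function(p) rbinom(.N, size = Quantity, prob = p)),
   .SDcols = prob.vecs]
Note that this consumes the random number stream in a different order than the by-row version, so the two will not reproduce each other's numbers even with the same seed.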

Conditional data.table merge with .EACHI

I have been playing around with the newer data.table conditional (non-equi) merge feature, and it is very cool. I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when the conditional merge takes place. Is there a way to aggregate these matches using a function like max or min? Here is a reproducible example that tries to mimic what I am trying to accomplish.
Set up environment
## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")
Create two fake datasets
Create a "big" table with 50 rows (10 values for each ID).
library(data.table)
set.seed(1L)
# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]
ID ValueBig Rank
1: A 266 3
2: A 373 4
3: A 573 5
4: A 909 9
5: A 202 2
---
46: E 790 9
47: E 24 1
48: E 478 2
49: E 733 7
50: E 693 6
Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)
dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))
ID ValueSmall
1: A 478
2: A 862
3: B 439
4: B 245
5: C 71
6: C 100
7: D 317
8: D 519
9: E 663
10: E 407
Merge
I next want to perform a merge by ID, merging only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the maximum ranked value in dtBig. I tried doing this in two different ways. Method 2 gives me the desired output, but I am unclear why the outputs differ at all; it seems like Method 1 just returns the last matched value.
## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
Results
ID ValueSmall RankSmall RankSmall2 DesiredRank
1: A 478 1 4 4
2: A 862 1 7 7
3: B 439 3 4 4
4: B 245 1 2 2
5: C 71 1 1 1
6: C 100 1 1 1
7: D 317 1 2 2
8: D 519 3 5 5
9: E 663 2 5 5
10: E 407 1 1 1
Is there a better data.table way of grabbing the max value in another data.table with multiple matches?
I next want to perform a merge by ID, merging only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the maximum ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row than to compute the max by .EACHI, but I am not entirely sure. If you don't like sorting, just save the previous sort order so it can be restored afterwards, as sketched below.
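A minimal sketch of that save-and-restore idea (the helper column name orig_order is mine):
dtBig[, orig_order := .I]             # remember the current row order
setorder(dtBig, ID, ValueBig, Rank)   # sort for the mult="last" join
## ... run the join above ...
setorder(dtBig, orig_order)           # restore the original row order
dtBig[, orig_order := NULL]           # drop the helper column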
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works; just make sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]

Add a row of zeros in a data frame created with ddply if there are no observations

I used the function ddply (package plyr) to calculate the mean of a response variable for each group "Trial" and "Treatment". I get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
This data frame suggests that in trial 4, treatment B, there are no observations for the response variable (no row appears in the data frame). So, is it possible to automatically add a row of zeros to the data frame (built with ddply) when there are no observations for a given response variable?
I would like to get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
4 B 0 0
We can merge the original dataset with a data.frame created from the full combination of unique values in 'Trial' and 'Treatment'. This gives an output with the missing combinations filled with NA. If needed, the NAs can then be changed to 0 (though it is arguably better to keep the missing combinations as NA).
res <- merge(expand.grid(lapply(df1[1:2], unique)), df1, all.x=TRUE)
res[is.na(res)] <- 0
Or with dplyr/tidyr, we can use complete (from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
complete(Trial, Treatment, fill= list(N=0, Mean=0))
# Trial Treatment N Mean
# (int) (chr) (dbl) (dbl)
#1 1 A 458 125.258
#2 1 B 459 168.748
#3 2 A 742 214.266
#4 2 B 142 475.786
#5 3 A 247 145.689
#6 3 B 968 234.129
#7 4 A 436 456.287
#8 4 B 0 0.000

Use previously calculated row value in R - Continued 2

I have a data.table that looks like this:
library(data.table)
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to calculate a new column "R" whose first value is
DT$R[1]<-tanh(DT$B[1]/400000)
and then I want to use each previously calculated value of R to help calculate the next row's value of R:
DT$R[2] <- 0.5*tanh(DT$B[2]/400000) + DT$R[1]*0.6
DT$R[3] <- 0.5*tanh(DT$B[3]/400000) + DT$R[2]*0.6
DT$R[4] <- 0.5*tanh(DT$B[4]/400000) + DT$R[3]*0.6
This will then look a bit like this
A B C R
1: 1 10 100 2.5e-05
2: 2 20 200 4e-05
3: 3 30 300 6.15e-05
4: 4 40 400 8.69e-05
5: 5 50 500 0.00011464
...
20: 20 200 2000 0.0005781274
Any ideas on how this could be done?
Is this what you are looking for?
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT$R = 0
DT$R[1]<-tanh(DT$B[1]/400000)
for(i in 2:nrow(DT)) {
DT$R[i] <- 0.5*tanh(DT$B[i]/400000) + DT$R[i-1]*0.6
}
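Since the recurrence R[i] = 0.5*tanh(B[i]/400000) + 0.6*R[i-1] is a first-order linear recursion, a loop-free sketch (my own variant, not from the thread) is to build the input vector and let Reduce() accumulate it:
x <- 0.5 * tanh(DT$B / 400000)    # per-row input term
x[1] <- tanh(DT$B[1] / 400000)    # the first row has no 0.5 factor
DT[, R := Reduce(function(prev, cur) cur + 0.6 * prev, x, accumulate = TRUE)]
as.numeric(stats::filter(x, 0.6, method = "recursive")) should give the same result, if you prefer that formulation.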

Insert NA's in case there are no observations when using subset() and then dcast or tapply

I have the following data frame (this is only its head). The ID column is subject (I have more subjects in the data frame, not only subject #99). I want to calculate the mean rt by subject and condition, only for observations whose z.score is smaller than 1 in absolute value.
> b
subject rt ac condition z.score
1 99 1253 1 200_9 1.20862682
2 99 1895 1 102_2 2.95813507
3 99 1049 1 68_1 1.16862102
4 99 1732 1 68_9 2.94415384
5 99 765 1 34_9 -0.63991180
7 99 1016 1 68_2 -0.03191493
I know I can do it using tapply or dcast (from reshape2) after subsetting the data:
b1 <- subset(b, abs(z.score) < 1)
b2 <- dcast(b1, subject~condition, mean, value.var = "rt")
subject 34_1 34_2 34_9 68_1 68_2 68_9 102_1 102_2 102_9 200_1 200_2 200_9
1 99 1028.5714 957.5385 861.6818 837.0000 969.7222 856.4000 912.5556 977.7273 858.7800 1006.0000 1015.3684 913.2449
2 5203 957.8889 815.2500 845.7750 933.0000 893.0000 883.0435 926.0000 879.2778 813.7308 804.2857 803.8125 843.7200
3 5205 1456.3333 1008.4286 850.7170 1142.4444 910.4706 998.4667 935.2500 980.9167 897.4681 1040.8000 838.7917 819.9710
4 5306 1022.2000 940.5882 904.6562 1525.0000 1216.0000 929.5167 955.8571 981.7500 902.8913 997.6000 924.6818 883.4583
5 5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6 5308 659.8571 904.2857 966.7755 960.9091 1048.6000 904.5082 836.2000 1753.6667 926.0400 870.2222 1066.6667 930.7500
In the example above, every subject had observations in b1 that met the subset criteria.
However, a given subject may have no remaining observations after subsetting. In that case I want b2 to contain NA for that subject in each condition with no observations meeting the subset criteria. Does anyone have an idea how to do that?
Any help will be greatly appreciated.
Best,
Ayala
There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.
Here is a dataset with a second subject ID that has no values meeting your condition (absolute value of z.score less than one).
library(reshape2)
bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2),
ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))
If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 99 125 2 10 4
Make subject a factor.
bb$subject = factor(bb$subject)
Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 11 NaN NaN NaN NaN
2 99 125 2 10 4
To get NA instead of NaN you can use the fill argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE, fill = as.numeric(NA))
subject A B C D
1 11 NA NA NA NA
2 99 125 2 10 4
Is the following what you are after? I created a similar dataset, "bb":
library("plyr") ###needed for . function below
bb<- data.frame(subject=c(99,99,99,99,99,11,11,11),rt=c(100,150,2,4,10,15,1,2), ac=rep(1,8) ,condition=c("A","A","B","D","C","C","D","D"), z.score=c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))
bb
subject rt ac condition z.score
#1 99 100 1 A 0.2
#2 99 150 1 A 0.3
#3 99 2 1 B 0.2
#4 99 4 1 D 0.3
#5 99 10 1 C 1.5
#6 11 15 1 C -0.3
#7 11 1 1 D 0.8
#8 11 2 1 D 0.7
Then you call dcast with subset included:
cc<-dcast(bb,subject~condition, mean, value.var = "rt",subset = .(abs(z.score)<1))
cc
subject A B C D
#1 11 NaN NaN 15 1.5
#2 99 125 2 NaN 4.0
