I am trying to do a rolling join in data.table that brings in multiple columns, but rolls over both entirely missing rows and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
                t  = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
                key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
                t  = c(10,70,20,70,10,70,20,70,10,70,20,70),
                valA = c('a','a',NA,'a',NA,'a','b','a','b','b',NA,'b'),
                valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
                key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backward join), it rolls over all the points where a row cannot be found in B, but it still returns NA where the matching row exists and the value to be merged is NA:
B[A, roll = -Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like the rolling join to roll over these NAs as well. For a single column, I can subset B to remove the NAs and then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the intermediate result after each added column and repeating:
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?
Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
for (col in bcols) A[, (col) :=
  B[!.(as(NA, typeof(B[[col]]))), on = col][.SD, roll = -Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above attempts to generalize this (since the NA needs to match the type of the column).
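Run on its own against the B defined above, that anti-join keeps only the nine rows where valA is present:
B[!.(NA_character_), on="valA"]
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 70 a q
## 4: 2 4 70 a p
## 5: 2 6 20 b NA
## 6: 2 6 70 a p
## 7: 3 4 10 b NA
## 8: 3 4 70 b q
## 9: 3 6 70 b q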
I want to combine the result of lapply over .SD in j with further output columns in j. How can I do that in the same data.table call?
So far I'm creating two data.tables (example_summary1, example_summary2) and merging them, but surely there is a better way?
Maybe I don't fully understand the concept of .SD/.SDcols.
example <- data.table(id = rep(1:5, 3), numbers = rep(1:5, 3),
                      sample1 = sample(20, 15, replace = TRUE),
                      sample2 = sample(20, 15, replace = TRUE))
id numbers sample1 sample2
1: 1 1 17 18
2: 2 2 8 1
3: 3 3 17 12
4: 4 4 15 2
5: 5 5 14 18
6: 1 1 11 14
7: 2 2 12 12
8: 3 3 11 7
9: 4 4 16 13
10: 5 5 17 1
11: 1 1 10 3
12: 2 2 14 15
13: 3 3 13 3
14: 4 4 17 6
15: 5 5 1 5
example_summary1<-example[,lapply(.SD,mean),by=id,.SDcols=c("sample1","sample2")]
> example_summary1
id sample1 sample2
1: 1 12.66667 11.666667
2: 2 11.33333 9.333333
3: 3 13.66667 7.333333
4: 4 16.00000 7.000000
5: 5 10.66667 8.000000
example_summary2<-example[,.(example.sum=sum(numbers)),id]
> example_summary2
id example.sum
1: 1 3
2: 2 6
3: 3 9
4: 4 12
5: 5 15
This is the best you can do if you are using .SDcols:
example_summary1 <- example[, c(lapply(.SD, mean), .(example.sum = sum(numbers))),
                            by = id, .SDcols = c("sample1", "sample2", "numbers")
                           ][, numbers := NULL][]
If you don't include numbers in .SDcols it's not available in j.
Without .SDcols you can do this:
example_summary1 <- example[, c(lapply(.(sample1 = sample1, sample2 = sample2), mean),
                                .(example.sum = sum(numbers))),
                            by = id]
Or if you have a vector of column names:
cols <- c("sample1", "sample2")
example_summary1 <- example[, c(lapply(mget(cols), mean),
                                .(example.sum = sum(numbers))),
                            by = id]
But I suspect that you don't get the same data.table optimizations then.
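One way to check (an aside, assuming a reasonably recent data.table version): run the grouping with verbose = TRUE and look for a mention of "GForce", data.table's internal optimization of grouped j expressions, in the printed output:
example[, lapply(.SD, mean), by = id, .SDcols = cols, verbose = TRUE]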
Finally, a data.table join is so fast that I would use your approach.
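That is, compute the two summaries exactly as you already do, then join them on id:
example_summary1[example_summary2, on = "id"]
This gives one row per id, with the two means and example.sum side by side.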
Let's say I have a table like the following:
DT <- data.table(ID1 = rep(c("a","b","c"), 3), ID2 = rnorm(9, 4), var = 1:9)
## > DT
## ID1 ID2 var
## 1: a 2.630392 1
## 2: b 3.966620 2
## 3: c 4.002776 3
## 4: a 3.188372 4
## 5: b 4.735084 5
## 6: c 4.307198 6
## 7: a 2.830868 7
## 8: b 4.892684 8
## 9: c 3.429826 9
and I would like to perform a dcast where the columns are determined only by the number of times each ID1 value has appeared (its first, second, and third occurrence), rather than by the values of ID2.
undesired output:
dcast(DT, ID1 ~ ID2)
desired output:
## ID1 1 2 3
## 1: a 1 4 7
## 2: b 2 5 8
## 3: c 3 6 9
Try
dcast(DT[, N := 1:.N, by = ID1], ID1 ~ N, value.var = 'var')
# ID1 1 2 3
#1: a 1 4 7
#2: b 2 5 8
#3: c 3 6 9
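Note that := adds the counter column N to DT by reference. On newer data.table versions, rowid() builds the same within-group counter directly in the formula, without modifying DT:
dcast(DT, ID1 ~ rowid(ID1), value.var = 'var')
# ID1 1 2 3
#1: a 1 4 7
#2: b 2 5 8
#3: c 3 6 9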
Is there a way to drop groups (factor levels) that have fewer than N rows, say N = 5, from a data.table?
Data:
DT = data.table(x = rep(c("a","b","c"), each = 6), y = c(1,3,6), v = 1:9,
                id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove the rows of every id group that has fewer than 5 members. The variable "id" is the grouping variable; in DT, groups "1" and "4" have fewer than 5 rows each, so all of their rows should be removed, leaving:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach.
Get the size of each group, and flag which groups to keep:
nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data.table answer.
DT[DT[, .(I = .I[.N >= 5L]), by = id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
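Here .I holds the original row numbers within each group and .N the group size, so the inner call collects the row numbers of every group with at least 5 rows, and the outer DT[...] then subsets by them. A common alternative idiom (usually a touch slower than the .I approach) keeps whole groups via .SD:
DT[, if (.N >= 5L) .SD, by = id]
Note that this returns the grouping column id first in the result.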
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
  group_by(id) %>%
  filter(n() >= 5)
I've noticed some (to me) inconsistent behaviour in data.table when using different assignment operators. I have to admit I never quite got the difference between "=" and copy(), so maybe we can shed some light here. If you use "=" or "<-" instead of copy() below, then upon changing the copied data.table, the original data.table changes as well.
Please execute the following commands and you will see what I mean:
library(data.table)
example(data.table)
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
DT2 = DT
Now I'll change the v column of DT2:
DT2[, v := 3L]
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
But look what happened to DT:
DT
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
It changed as well.
So: changing DT2 changed the original DT. Not so if I use copy():
example(data.table) # reset DT
DT3 <- copy(DT)
DT3[, v := 3L]
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
Is this behaviour expected?
Yes. This is expected behaviour, and well documented.
Assigning with = or <- does not copy a data.table; it just creates a second name pointing to the same object. Because := modifies that object in place by reference (which is what makes data.table so fast), the change is visible through both names.
For this reason, if you really want an independent copy of the data, you need to use copy(DT).
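A minimal demonstration with a fresh toy table, using data.table's address() helper (the printed addresses differ from run to run):
library(data.table)
DT  <- data.table(v = 1:3)
DT2 <- DT          # plain assignment: a second name for the same table
DT3 <- copy(DT)    # copy(): an independent table
address(DT) == address(DT2)   # TRUE  -- same object
address(DT) == address(DT3)   # FALSE -- distinct object
DT2[, v := 0L]     # := modifies in place ...
DT$v               # ... so DT now holds 0 0 0 as well
DT3$v              # while the copy still holds 1 2 3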
From the documentation for ?copy:
The data.table is modified by reference, and returned (invisibly) so
it can be used in compound statements; e.g., setkey(DT,a)[J("foo")].
If you require a copy, take a copy first (using DT2=copy(DT)). copy()
may also sometimes be useful before := is used to subassign to a
column by reference. See ?copy.
See also this question:
Understanding exactly when a data.table is a reference to vs a copy of another