Persistent assignment in data.table with .SD - r

I'm struggling with .SD calls in data.table.
In particular, I'm trying to identify some logical characteristic within a grouping of data, and draw some identifying mark in another variable. Canonical application of .SD, right?
From FAQ 4.5, http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf, imagine the following table:
library(data.table) # 1.9.5
DT = data.table(a=rep(1:3,1:3),b=1:6,c=7:12)
DT[, { mySD = copy(.SD)
       mySD[1, b := 99L]
       mySD },
   by = a]
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
I've assigned these values to b (using the ':=' operator) and so when I re-call DT, I expect the same output. But, unexpectedly, I'm met with the original table:
DT
## a b c
## 1: 1 1 7
## 2: 2 2 8
## 3: 2 3 9
## 4: 3 4 10
## 5: 3 5 11
## 6: 3 6 12
Expected output was the original frame, with persistent modifications in 'b':
DT
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
Sure, I can copy this table into another one, but that doesn't seem consistent with the ethos.
DT2 <- copy(DT[, { mySD = copy(.SD)
                   mySD[1, b := 99L]
                   mySD },
                by = a])
DT2
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
It feels like I'm missing something fundamental here.

The FAQ you mention only shows a workaround for modifying (a temporary copy of) .SD; it won't update your original data in place. A possible solution for your problem would be something like
DT[DT[, .I[1L], by = a]$V1, b := 99L]
DT
# a b c
# 1: 1 99 7
# 2: 2 99 8
# 3: 2 3 9
# 4: 3 99 10
# 5: 3 5 11
# 6: 3 6 12

Rolling Join multiple columns independently to eliminate NAs

I am trying to do a rolling join in data.table that brings in multiple columns, but rolls over both entire missing rows, and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A, and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
                t  = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
                key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
                t  = c(10,70,20,70,10,70,20,70,10,70,20,70),
                valA = c('a','a',NA,'a',NA,'a','b','a','b','b',NA,'b'),
                valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
                key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backwards join), it rolls over all the points when a row cannot be found in B, but still includes points when the row exists but the data to be merged are NA:
B[A, , roll=-Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like to rolling join in such a way that it rolls over these NAs as well. For a single column, I can subset B to remove the NAs, then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the value for each added column and then repeat.
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?
Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
for (col in bcols) A[, (col) :=
  B[!.(as(NA, typeof(B[[col]]))), on = col][.SD, roll = -Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above attempts to generalize this (since the NA needs to match the type of the column).
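To make that concrete, here is what the anti-join looks like for valA alone (table and column names are from the example above; the loop merely builds the same expression with the right typed NA for each column):

```r
# Anti-join on a typed NA: keeps only the rows of B where valA is not NA.
# For the B above, this drops the three rows whose valA is NA.
B[!.(NA_character_), on = "valA"]

# as(NA, typeof(B[["valA"]])) evaluates to NA_character_, which is how
# the loop generalizes this to columns of any type.
```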

R data table combining lapply with other j arguments

I want to combine the result of lapply using .SD in j with further output columns in j. How can I do that in the same data table?
So far I'm creating two data.tables (example_summary1, example_summary2) and merging them, but there should be a better way?
Maybe I don't fully understand the concept of .SD/.SDcols.
example <- data.table(id = rep(1:5, 3), numbers = rep(1:5, 3),
                      sample1 = sample(20, 15, replace = TRUE),
                      sample2 = sample(20, 15, replace = TRUE))
id numbers sample1 sample2
1: 1 1 17 18
2: 2 2 8 1
3: 3 3 17 12
4: 4 4 15 2
5: 5 5 14 18
6: 1 1 11 14
7: 2 2 12 12
8: 3 3 11 7
9: 4 4 16 13
10: 5 5 17 1
11: 1 1 10 3
12: 2 2 14 15
13: 3 3 13 3
14: 4 4 17 6
15: 5 5 1 5
example_summary1<-example[,lapply(.SD,mean),by=id,.SDcols=c("sample1","sample2")]
> example_summary1
id sample1 sample2
1: 1 12.66667 11.666667
2: 2 11.33333 9.333333
3: 3 13.66667 7.333333
4: 4 16.00000 7.000000
5: 5 10.66667 8.000000
example_summary2<-example[,.(example.sum=sum(numbers)),id]
> example_summary2
id example.sum
1: 1 3
2: 2 6
3: 3 9
4: 4 12
5: 5 15
This is the best you can do if you are using .SDcols:
example_summary1 <- example[, c(lapply(.SD, mean), .(example.sum = sum(numbers))),
by = id, .SDcols = c("sample1", "sample2", "numbers")][, numbers := NULL][]
If you don't include numbers in .SDcols it's not available in j.
Without .SDcols you can do this:
example_summary1 <- example[, c(lapply(.(sample1 = sample1, sample2 = sample2), mean),
.(example.sum = sum(numbers))),
by=id]
Or if you have a vector of column names:
cols <- c("sample1","sample2")
example_summary1 <- example[, c(lapply(mget(cols), mean),
.(example.sum = sum(numbers))),
by=id]
But I suspect that you don't get the same data.table optimizations then.
Finally, a data.table join is so fast that I would use your approach.
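That join could be sketched as follows, combining the two summaries on id (assuming both tables are as computed above):

```r
# Join the per-id sum onto the per-id means; id is the common column.
# The result has columns id, example.sum, sample1, sample2.
example_summary2[example_summary1, on = "id"]
```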

dcast long to wide by transforming unique RHS

Let's say I have a table like the following:
DT <- data.table(ID1= rep(c("a","b","c"),3),ID2=rnorm(9,4),var = 1:9)
## > DT
## ID1 ID2 var
## 1: a 2.630392 1
## 2: b 3.966620 2
## 3: c 4.002776 3
## 4: a 3.188372 4
## 5: b 4.735084 5
## 6: c 4.307198 6
## 7: a 2.830868 7
## 8: b 4.892684 8
## 9: c 3.429826 9
and I would like to perform a dcast that spreads var into columns based only on the number of times each ID1 value appears.
undesired output:
dcast(DT,ID1~ID2)
desired output:
## ID1 1 2 3
## 1: a 1 4 7
## 2: b 2 5 8
## 3: c 3 6 9
Try
dcast.data.table(DT[, N := 1:.N, by = ID1], ID1 ~ N, value.var = 'var')
# ID1 1 2 3
#1: a 1 4 7
#2: b 2 5 8
#3: c 3 6 9
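In later data.table releases (1.9.8 onward, if memory serves; treat the version as an assumption), rowid() achieves the same thing without mutating DT with a helper N column:

```r
# rowid(ID1) numbers the occurrences of each ID1 value on the fly,
# so DT itself is left unmodified
dcast(DT, ID1 ~ rowid(ID1), value.var = "var")
```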

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows belonging to any id that occurs fewer than 5 times. The variable "id" is the grouping variable; a group should be deleted when it has fewer than 5 rows. In DT, that means determining which groups have fewer than 5 members (groups "1" and "4") and then removing their rows.
    x y v id
 1: a 3 5  2
 2: a 6 6  2
 3: b 1 7  2
 4: b 3 8  2
 5: b 6 9  2
 6: b 1 1  3
 7: b 3 2  3
 8: b 6 3  3
 9: c 1 4  3
10: c 3 5  3
11: c 6 6  3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
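Another common data.table idiom for the same filter, which returns the kept groups directly (same rows as the .I approach above, though the id column ends up first in the result):

```r
# For each id, return the group's rows (.SD) only when it has >= 5 of them;
# groups where the j expression returns NULL are dropped
DT[, if (.N >= 5L) .SD, by = id]
```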

data.table 1.8.1.: "DT1 = DT2" is not the same as DT1 = copy(DT2)?

I've noticed some inconsistent (inconsistent to me) behaviour in data.table when using different assignment operators. I have to admit I never quite got the difference between "=" and copy(), so maybe we can shed some light here. If you use "=" or "<-" instead of copy() below, upon changing the copied data.table, the original data.table will change as well.
Please execute the following commands and you will see what I mean
library(data.table)
example(data.table)
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
DT2 = DT
Now I'll change the v column of DT2:
DT2[ ,v:=3L]
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
but look what happened to DT:
DT
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
It changed as well.
So: changing DT2 changed the original DT. Not so if I use copy():
example(data.table) # reset DT
DT3 <- copy(DT)
DT3[, v:= 3L]
x y v
1: a 1 3
2: a 3 3
3: a 6 3
4: b 1 3
5: b 3 3
6: b 6 3
7: c 1 3
8: c 3 3
9: c 6 3
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
is this behaviour expected?
Yes. This is expected behaviour, and well documented.
Since data.table uses references to the original object to achieve modify-in-place, it is very fast.
For this reason, if you really want to copy the data, you need to use copy(DT).
From the documentation for ?copy:
The data.table is modified by reference, and returned (invisibly) so
it can be used in compound statements; e.g., setkey(DT,a)[J("foo")].
If you require a copy, take a copy first (using DT2=copy(DT)). copy()
may also sometimes be useful before := is used to subassign to a
column by reference. See ?copy.
See also this question :
Understanding exactly when a data.table is a reference to vs a copy of another
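One way to see the sharing directly is data.table::address(), which reports the memory address of the underlying object. A small sketch:

```r
library(data.table)
DT  <- data.table(x = 1:3)
DT2 <- DT        # plain assignment: two names for the same object
DT3 <- copy(DT)  # deep copy: a genuinely new object

address(DT) == address(DT2)  # TRUE  -- DT2 points at the same data as DT
address(DT) == address(DT3)  # FALSE -- DT3 is independent, := on it won't touch DT
```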
