Use of lapply .SD in data.table R - r

I am not very clear about use of .SD and by.
For instance, does the below snippet mean: 'change all the columns in DT to factor except A and B?' It also says in data.table manual: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so columns A and B are excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I also read that by means like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like colsum in SQL) over all the columns except A and B do I still use something similar? Or in this case, does the below code mean to take the sum and group by values in columns A and B? (take sum and group by A,B as in SQL)
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum over all the columns except A and B?

Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
You said "does the below snippet... change all the columns in DT..."
The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].
This is wasteful because it involves creating copies of DT which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using`:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
You mentioned comparing efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can note this is true from the output of adding the verbose = TRUE argument within []); to see this in action, let's beef up your DT a bit and run a benchmark:
Results:
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100

Related

Last observation of the previous group

I would like to know, if I have data that I can group by a variable, how can I get the last observation of the previous group?
I have the following data:
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))
I would like to create a new data.table that has the group ID and the difference between the last observation of the group from the last observation of the previous group. So that from the above I'd get:
a c
1: 1 NA
2: 2 9
3: 3 1
4: 4 -8
5: 5 5
Thanks!
We group by 'a', get the last element of 'b', then take the lag of 'c' by shifting
dt[, .(c = last(b)), a][, c:= shift(c)][]
Here is a way:
dt[, c := b * (1:.N == .N), by = a] ## get last row within the group
dt <- dt[b == c] ## filter data.table to get rows of interest
dt[, c := shift(c, type = "lag") - c][] ## getting difference using shift with lag argument
# a b c
#1: 1 11 NA
#2: 2 10 NA
#3: 3 18 9
#4: 4 19 -7
#5: 5 12 -8
data
set.seed(1)
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))

Row operations on selected columns based on substring in data.table

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with string v and subtract from the average of the values in all columns starting with f.
update: the proposed solution must (a) use the := operator to make the most of data.table fast performance, and (2) be flexible to other operation rather than mean and sum, which I used here just for the sake of simplicity
library(data.table)
# generate data
dt <- data.table(id= letters[1:5],
v1= 1:5,
v2= 1:5,
f1= 11:15,
f2= 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
We melt the dataset into 'long' format, by making use of the measure argument, get the difference between the sum of 'v' and mean of 'f', grouped by 'id', join on the 'id' column with the original dataset and assign (:=) the 'V1' as the 'Y' variable
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
, sum(v) - mean(f), id], Y :=V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Or another option is with Reduce after creating index or 'v' and 'f' columns
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmv)
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE])- (Reduce(`+`, .SD[, nmf, with = FALSE])/l1)]
rowSums and rowMeans combined with grep can accomplish this.
dt$Y <- rowMeans(dt[,grep("f", names(dt)),with=FALSE]) - rowSums(dt[,grep("v", names(dt)),with=FALSE])

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know the by=a or by="a" means grouped by a column and the N column is the sum of duplicated times of a. However, I don't use nrow() but I get the result. The .N is not just the column name? I can't find the document by ??".N" in R. I tried to use .K, but it doesn't work. What does .N means?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols

R data.table: calculating grouped frequencies

I'm trying to add columns to my data.table that essentially append a cumulative frequency table for each group that is aggregated. Unfortunately, my current solution is about ten times slower than I had hoped.
Here is what I'm using (apologies for the ugly one-liner):
DT[, c("bin1","bin2","bin3","bin4") := as.list(cumsum(hist(colx,c(lbound,bound1,bound2, bound3,ubound),plot=FALSE)$counts)), by=category]
If the bin boundaries are set at 0,25,50,75,100, I would like my table to look like:
id category colx bin1 bin2 bin3 bin4
1 a 5 1 2 2 3
2 a 30 1 2 2 3
3 b 21 1 2 3 4
4 c 62 0 1 3 3
5 b 36 1 2 3 4
6 a 92 1 2 2 3
7 c 60 0 1 3 3
8 b 79 1 2 3 4
9 b 54 1 2 3 4
10 c 27 0 1 3 3
In the actual dataset I'm grouping using 4 different columns and there are millions of rows and unique groups. When I try a simpler function, such as sum, it takes an acceptable amount of time to do the calculation. Is there any way to significantly speed up the counting process?
Okay, here's one way (here I use data.table v1.9.3). Remove the by=.EACHI if you're using versions <= 1.9.2.
dt[, ival := findInterval(colx, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, category, ival)
ans <- dt[CJ(unique(category), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="category"][, bin := "bin"]
ans <- dcast.data.table(ans, category ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
id category colx bin_1 bin_2 bin_3 bin_4
1: 1 a 5 1 2 2 3
2: 2 a 30 1 2 2 3
3: 6 a 92 1 2 2 3
4: 3 b 21 1 2 3 4
5: 5 b 36 1 2 3 4
6: 9 b 54 1 2 3 4
7: 8 b 79 1 2 3 4
8: 10 c 27 0 1 3 3
9: 4 c 62 0 1 3 3
10: 7 c 60 0 1 3 3
Benchmark on simulated large data:
I generate here a data.table with 20 million rows and a total of 1-million groups with 2 grouping columns (instead of 4 as you state in your question).
K = 1e3L
N = 20e6L
sim_data <- function(K, N) {
set.seed(1L)
ff <- function(K, N) sample(paste0("V", 1:K), N, TRUE)
data.table(x=ff(K,N), y=ff(K,N), val=sample(1:100, N, TRUE))
}
dt <- sim_data(K, N)
method1 <- function(x) {
dt[, ival := findInterval(val, seq(0, 100, by=25), rightmost.closed=TRUE)]
setkey(dt, x, y, ival)
ans <- dt[CJ(unique(x), unique(y), unique(ival)), .N, allow.cartesian=TRUE, by=.EACHI]
ans[, N := cumsum(N), by="x,y"][, bin := "bin"]
ans <- dcast.data.table(ans, x+y ~ bin+ival, value.var="N")
ans <- dt[ans][, ival := NULL]
}
system.time(ans1 <- method1(dt))
# user system elapsed
# 13.148 2.778 16.209
I hope this is faster than your original solution and scales well for your real data dimensions.
Update: Here's another version using data.table's rolling joins instead of findInterval from base. We've to modify the intervals slightly so that the rolling join finds the right match.
dt <- sim_data(K, N)
method2 <- function(x) {
ivals = seq(24L, 100L, by=25L)
ivals[length(ivals)] = 100L
setkey(dt, x,y,val)
dt[, ival := seq_len(.N), by="x,y"]
ans <- dt[CJ(unique(x), unique(y), ivals), roll=TRUE, mult="last"][is.na(ival), ival := 0L][, bin := "bin"]
ans <- dcast.data.table(ans, x+y~bin+val, value.var="ival")
dt[, ival := NULL]
ans2 <- dt[ans]
}
system.time(ans2 <- method2(dt))
# user system elapsed
# 12.538 2.649 16.079
## check if both methods give identical results:
setkey(ans1, x,y,val)
setnames(ans2, copy(names(ans1)))
setkey(ans2, x,y,val)
identical(ans1, ans2) # [1] TRUE
Edit: Some explanation on why OP's is very time consuming:
A huge reason, I suspect, for the difference in runtime between these solutions and hist is that both the answers here are vectorised (written entirely in C and will work on the whole data set directly), where as hist is a S3 method (which'll take time for dispatch to the .default method and added to that, it's written in R. So, basically you're executing about a million times hist, a function in R, where as the other two vectorised solutions are calling it once in C (no need to call for every group here).
And since that's the most complex part of your question, it obviously slows things down.

How to change the last value in each group by reference, in data.table

For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site=c(rep("A",5), rep("B",4)),t=c(1:5,1:4),a=as.double(c(11:15,21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?
Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
alternative approaches:
use replace...
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapped by {} and return the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented

Resources