Select last row by group for all columns with data.table

I was surprised doing the following:
R) system.time(lastOrder <- order[,lapply(.SD,tail,1),by="TRADER_ID,EXEC_IDATE"]);
user system elapsed
1.45 0.00 1.53
R) nrow(order)
[1] 75301
R) ncol(order)
[1] 23
I thought this was very slow, so I then tried:
R) system.time(lastOrder <- order[,list(test=tail(EXEC_IDATE,1)),by="TRADER_ID,EXEC_IDATE"]);
user system elapsed
0.14 0.00 0.14
As far as I understand, once you know which rows to select, most of the work is done, so I don't see why applying this to all columns should be 10x slower. Am I doing something wrong in the first bit of code? It is the only way I know to select the last row of each group.

Last row by group:
DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"] # (1)
or, faster (avoid use of .SD where possible, for speed):
w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"][[3]] # (2)
DT[w]
Note that the following feature request will make approach (1) as fast as approach (2) :
FR#2330 Optimize .SD[i] query to keep the elegance but make it faster.
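A minimal self-contained illustration of the two approaches, on a tiny made-up table (the columns px and qty are placeholders, not from the question):
library(data.table)
DT <- data.table(TRADER_ID  = c(1, 1, 2, 2, 2),
                 EXEC_IDATE = as.Date("2012-01-01") + c(0, 0, 0, 1, 1),
                 px = 1:5, qty = 6:10)
DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"]          # (1) last row of each group, all columns
w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"]$V1    # (2) row numbers of last rows ($V1 is the same column as [[3]] above)
DT[w]                                             # same rows as (1)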

How about something like this? (Synthetic data meant to mimic what I can infer about yours from the question)
tmp <- data.table(id = sample(1:20, 1e6, replace=TRUE),
date = as.Date(as.integer(runif(n=1e6, min = 1e4, max = 1.1e4)),
origin = as.Date("1970-01-01")),
data1 = rnorm(1e6),
data2 = rnorm(1e6),
data3 = rnorm(1e6))
> system.time(X <- tmp[, lapply(.SD, tail, 1), by = list(id, date)])
user system elapsed
1.95 0.00 1.95
> system.time(Y <- tmp[, list(tail(data1, 1)), by = list(id, date)])
user system elapsed
1.24 0.01 1.26
> system.time({
setkey(tmp, id, date)
Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"]
})
user system elapsed
0.90 0.02 0.92
X and Z are the same once the same ordering is ensured:
> identical(setkey(X, id, date), setkey(Z, id, date))
[1] TRUE
The difference between my lapply tail and 1-column tail isn't as drastic as yours, but without the structure of your data, it's hard to say more.
Also, note that most of the time in this method is setting the key. If the table is already sorted by the grouping columns, it goes really fast:
> system.time(Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"])
user system elapsed
0.03 0.00 0.03
Alternatively, you could translate the many column problem to the 1-column problem with a temporary column:
> system.time({
tmp[, row.num := seq_len(nrow(tmp))]
W <- tmp[tmp[, max(row.num), by = list(id, date)]$V1][, row.num := NULL]
tmp[, row.num := NULL]
})
user system elapsed
0.92 0.00 1.09
> identical(setkey(X, id, date), setkey(W, id, date))
[1] TRUE

Replacing impossible values with NA using R's data.table [duplicate]

I have a code that replaces impossible values in a dataset with NA.
I'm trying to convert the code to be based on data.table. As an example, I replace a height of 0 with NA.
(Dummy) data
DT <- data.table(id = 1:5e6,
height = sample(c(0, 100:240), 5e6, replace = TRUE))
My current solution is slower and at least as verbose as my data.frame version. I assume I am doing something wrong...
DT[height == 0, height := NA]
While researching this question I found another solution which is much faster (but uglier).
set(DT, which(DT[["height"]] == 0), "height", value = NA)
All suggestions appreciated.
Since v1.9.4, data.table by default automatically creates an index on columns during subsets of the form x == val and x %in% val used within a [.data.table call. This makes subsequent subsetting very fast, with only a slightly higher price to pay on the first subset (since data.table's radix ordering is quite fast). The first subset can be slower because it includes the time to:
create the index
and then subset.
To illustrate this (using #akrun's data):
require(data.table)
getOption("datatable.auto.index") # [1] TRUE ===> enabled
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.396 0.059 0.452 ## first run
# 0.003 0.000 0.004 ## second run is very fast
Now if we disable auto indexing:
require(data.table)
options(datatable.auto.index = FALSE)
getOption("datatable.auto.index") # [1] FALSE
set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
system.time(DT[height == 0L])
# 0.037 0.007 0.042 ## first run
# 0.039 0.010 0.045 ## second run (~ 10x slower than 2nd run above)
options(datatable.auto.index = TRUE) # restore auto indexing if necessary
But your case is special because you update the same column you subset on. In essence, this is what is happening:
The i expression is seen to be an expression that can be optimised for auto indexing. An index is created and saved for blazing fast subsets later on.
The j expression is seen and the column is updated.
The column on which the index has been set has just been updated, so the index is removed.
Auto indexing logic should detect this and skip creating the index altogether if any of the rows evaluate to TRUE, since the created index is essentially useless.
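A small sketch of that create-then-drop sequence (assuming auto indexing is enabled and a data.table version recent enough to export indices(); exact behaviour may vary by version):
library(data.table)
DT <- data.table(id = 1:1e6, height = sample(c(0, 100:240), 1e6, replace = TRUE))
DT[height == 0]                  # plain subset: the auto index on 'height' gets created
indices(DT)                      # "height"
DT[height == 0, height := NA]    # subset + update of the indexed column
indices(DT)                      # NULL -- the index was dropped because the column changed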
Could you please file an issue on the project issues page? Just linking to this SO Q should be sufficient.
To answer your Q, disable auto indexing and run the subset, and it should be more or less equal to the time you get with set().
A base R solution just cannot be faster here, since it copies the entire column just to update those entries; but that is because base R chose to do it that way.
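You can actually watch that copy happen with base R's tracemem (a quick sketch; the addresses in the output will differ on your machine):
df <- data.frame(height = c(0, 150, 0, 180))
tracemem(df$height)                 # start tracing the height column
df$height[df$height == 0] <- NA     # tracemem reports the column being duplicated here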
A speed test with one evaluation on 100 million rows:
library(data.table)
DT <- data.table(id = 1:1e8,
height = sample(c(0, 100:240), 1e8, replace = TRUE))
DT2 <- copy(DT);DT3 <- copy(DT); DT4 <- copy(DT); DT5 <- copy(DT); DT6 <- copy(DT);DT7 <- copy(DT)
library(microbenchmark)
microbenchmark(
David = set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA),
OP = DT2[height == 0, height := NA],
akrun = setkey(DT3, "height")[.(0), height := NA],
isna = {is.na(DT4$height) <- DT4$height == 0},
assignNA = {DT5$height[DT5$height == 0] <- NA},
indexset = {setindex(DT6, height); DT6[height==0, height := NA_real_]},
exponent = DT7[, height:= NA^(!height)*height],
times=1L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# David 585.9044 585.9044 585.9044 585.9044 585.9044 585.9044 1
# OP 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 1
# akrun 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 1
# isna 4843.3623 4843.3623 4843.3623 4843.3623 4843.3623 4843.3623 1
# assignNA 4797.0191 4797.0191 4797.0191 4797.0191 4797.0191 4797.0191 1
# indexset 6307.4564 6307.4564 6307.4564 6307.4564 6307.4564 6307.4564 1
# exponent 1054.6013 1054.6013 1054.6013 1054.6013 1054.6013 1054.6013 1
We can try
system.time(DT[, height:= NA^(!height)*height])
# user system elapsed
# 0.03 0.05 0.08
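To see why this one-liner works, here is a small worked example of the exponent trick:
h <- c(0, 150, 0, 180)
!h               # TRUE FALSE TRUE FALSE   (TRUE exactly where height is 0)
NA^(!h)          # NA 1 NA 1               (NA^1 is NA, NA^0 is 1)
NA^(!h) * h      # NA 150 NA 180           (zeros become NA, other heights are kept)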
OP's code
system.time(DT[height == 0, height := NA])
# user system elapsed
# 0.42 0.04 0.49
A base R option that should be faster:
system.time(DT$height[DT$height == 0] <- NA)
# user system elapsed
# 0.19 0.05 0.23
and the is.na route
system.time(is.na(DT$height) <- DT$height == 0)
# user system elapsed
# 0.22 0.06 0.28
#DavidArenburg's suggestion
system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
# user system elapsed
# 0.06 0.00 0.06
NOTE: All these benchmarks freshly create the dataset before each run so as to provide unbiased timings. I could use microbenchmark, but there could be some bias in the later runs, since the assignment already happens in the 1st run.
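If you do want microbenchmark, one way to avoid that bias is to work on a fresh copy inside every expression (a sketch; the names DT0, exponent and set_way are just placeholders, and the copy() overhead is then included in each timing):
library(data.table)
library(microbenchmark)
set.seed(24)
DT0 <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))
microbenchmark(
  exponent = { D <- copy(DT0); D[, height := NA^(!height) * height] },
  set_way  = { D <- copy(DT0); set(D, i = which(D[["height"]] == 0), j = "height", value = NA) },
  times = 5L
)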
Using a bigger dataset
set.seed(24)
DT <- data.table(id = 1:1e8,
height = sample(c(0, 100:240), 1e8, replace = TRUE))
system.time(DT[, height:= NA^(!height)*height])
# user system elapsed
# 0.58 0.24 0.81
system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
# user system elapsed
# 0.49 0.12 0.61
data
set.seed(24)
DT <- data.table(id = 1:1e7,
height = sample(c(0, 100:240), 1e7, replace = TRUE))

R, data.table: find all combinations of a list excluding each element paired with itself

I would like to efficiently find all combinations of a list excluding the combination of each element to itself. For example, with a list of A,B,C,D find all combinations excluding A-A, B-B, C-C, D-D.
I can do this in what seems to be an inefficient way using this code:
x <- c("A","B","C","D")
dt <- CJ(x,x)
dt <- dt[!V1==V2]
The problem is that the third line takes about 4 times as long to run as the second line. So for a large list like my real data, lines 2 and 3 together can take a very long time.
I am using data.table 1.9.6, R 3.2.2, and R Studio on Windows 7.
Thanks so much.
Well, this is something of an improvement:
n = 1e4; x = seq(n)
# combn (variant of #Psidom's answer)
system.time({
cn = transpose(combn(x, 2, simplify=FALSE))
r = rbind( setDT(cn), rev(cn) )
})
# takes forever, so i cut it off
# op's code
system.time({
r0 = CJ(x,x)[V1 != V2]
})
# user system elapsed
# 1.69 0.63 1.50
# use indices in the final step
system.time({
r1 = CJ(x,x)[-seq(1L, .N, by=length(x)+1L)]
})
# user system elapsed
# 1.17 0.42 0.96
And some more:
# build it manually
system.time({
xlen = length(x)
r2 = data.table(rep(x, each = xlen), V2 = x)[-seq(1L, .N, by=xlen+1L)]
})
# user system elapsed
# 3.03 0.60 2.79
# ... or ...
system.time({
xlen = length(x)
r2 = data.table(rep(x, each = xlen-1L), rep.int(x, xlen)[-seq(1L, xlen^2, by=xlen+1L)])
})
# user system elapsed
# 2.79 0.25 3.07
# build it manually special for the case of two cols
system.time({
r3 = setDT(list(x))[, .(V2 = x), by=V1][ -seq(1L, .N, by=length(x)+1L) ]
})
# user system elapsed
# 0.92 0.25 0.86
# ... or ...
system.time({
r4 = setDT(list(x))[, .(V2 = x[-.GRP]), by=V1]
})
# user system elapsed
# 0.85 0.32 1.19
# verify
identical(r0, r1) # TRUE
identical(setkey(r0, NULL), r2) # TRUE
identical(setkey(r0, NULL), r3) # TRUE
identical(setkey(r0, NULL), r4) # TRUE
Maybe you can do a little better by writing your own CJ with Rcpp. It might also be worth noting that everything is faster with integers (instead of characters):
x = rep(LETTERS, 5e2)
system.time(CJ(x,x))
# user system elapsed
# 7.06 1.81 6.61
x = rep(1:26, 5e2)
system.time(CJ(x,x))
# user system elapsed
# 3.39 0.88 2.95
So if x is a character vector, it might be best to use seq_along(x) for the combinatorial tasks and then map back to the character values like x[V1] afterwards.
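A small sketch of that integer-index idea (ix and res are just placeholder names):
library(data.table)
x   <- c("A", "B", "C", "D")
ix  <- seq_along(x)
res <- CJ(ix, ix)[V1 != V2]                 # combinations on integer indices
res[, `:=`(V1 = x[V1], V2 = x[V2])]         # map back to the character values
res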

rbind + setkey in data.table slower than xts::rbind which automatically indexes?

What is the reason behind data.table being almost 6x slower than xts when appending (rbind-ing) new rows?
library(quantmod); library(xts); library(data.table)
XTS = getSymbols("AAPL", from="2000-01-01", env = NULL)
# make corresponding `data.table`:
DT <- as.data.table(as.data.frame(XTS))
DT[, Date:=index(XTS)]
setkey(DT,Date)
setcolorder(DT,c("Date",names(XTS)))
# Note: rerun the above before running each test.
system.time(for(i in 1:10) XTS = rbind(XTS, XTS)) # reindexing is automatic
# user system elapsed
# 0.15 0.03 0.47
system.time(for(i in 1:10) DT = setkey(rbind(DT, DT), Date)) # need to manually reset key
# user system elapsed
# 0.64 0.02 2.30
system.time(for(i in 1:10) DT = setkey(rbindlist(list(DT, DT)), Date)) # ditto
# user system elapsed
# 0.60 0.02 2.20
The data.table (unlike xts) will even exhaust memory allocation for i>15 on my computer.
The common programming use case is when you are running a temporal simulation and want to collect intermediate measurements into a result table, which you later want to summarise.
Try
rbindlist( rep( list(DT), 10 ))
rbindlist should boost your runtime significantly.
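For the simulation use case mentioned in the question, a common pattern is to collect per-step results in a list and bind (and key) once at the end instead of rbind-ing inside the loop; a sketch with made-up per-step data:
library(data.table)
results <- vector("list", 10)
for (i in 1:10) {
  results[[i]] <- data.table(step = i, value = rnorm(5))   # one small result table per step
}
out <- rbindlist(results)
setkey(out, step)                                          # set the key once, at the end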

Why is "by" on a vector not from a data.table column very slow?

test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x
test[,.N, by=x] # fast
test[,.N, by=y] # extremely slow
Why is it so slow in the second case?
It is even faster to do this:
test[,y:=y]
test[,.N, by=y]
test[,y:=NULL]
It looks as if it is poorly optimized?
Seems like I forgot to update this post.
This was fixed a while back, in commit #1039 of v1.8.11. From NEWS:
Fixed #5106: DT[, .N, by=y] where y is a vector with length(y) == nrow(DT), but y is not a column in DT. Thanks to colinfang for reporting.
Testing on v1.8.11 commit 1187:
require(data.table)
test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x
system.time(ans1 <- test[,.N, by=x])
# user system elapsed
# 0.015 0.000 0.016
system.time(ans2 <- test[,.N, by=y])
# user system elapsed
# 0.015 0.000 0.015
setnames(ans2, "y", "x")
identical(ans1, ans2) # [1] TRUE

return type for j parameter in data.table

I have been using data.table for some computations and am wondering what the possible return types for the j parameter are, so that it stacks up my output correctly. I know a data.frame is acceptable, so a list must be as well? My function returns multiple rows and multiple columns for each id. So imagine:
dtb <- data.table(id=rep(1:5,20), a=1:100, b=sample(1:100, 100), c=sample(1:100, 100))
f <- function(dt) { return(c(dt$a+1, dt$b+1, dt$c+1))}
dtb[,f(.SD), by=id]
This clearly does not work properly. This does:
dtb <- data.table(id=rep(1:5,20), a=1:100, b=sample(1:100, 100), c=sample(1:100, 100))
f <- function(dt) { return(data.frame(a=dt$a+1, b=dt$b+1, c=dt$c+1))}
dtb[,f(.SD), by=id]
Constructing these data.frames seems like a really inefficient way to do things. What are some suggestions? The by must be used.
Your approach to the j component is not native data.table-speak.
It is worth reading the data.table wiki on dos and don'ts regarding data.table syntax (using data.frame is terrible in terms of performance!).
You may also refer to this question, and perhaps you will start to understand how using j with list works.
You are passing a list of expressions that will be evaluated within the data.table (or a grouped subset thereof).
These are unevaluated expressions, and (currently) the [ function relies on seeing list in order to evaluate them within the correct environment (the data.table, or .SD, the grouped subset).
This call will work
dtb[,list(a = a+1, b = b + 1, c = c+1), by = id]
As will this (passing an unevaluated expression which happens to be a call to list(...)):
library(plyr) # for as.quoted
my_list <- as.quoted(paste('list(',paste(letters[1:3], '=', letters[1:3], '+1',collapse= ','),')'))[[1]]
my_list
## list(a = a + 1, b = b + 1, c = c + 1)
dtb[,eval(my_list), by = id]
There is also the possibility of combining a call to lapply(.SD, a_function) with .SDcols. The .SDcols argument lets you pass a character vector of the column names on which you want the function to be evaluated, so this will work:
dtb[, lapply(.SD,base::'+',1),by= id, .SDcols = c('a','b','c')]
or
dtb[,lapply(.SD, .Primitive('+'),1), by= id, .SDcols = c('a','b','c')]
Note that I called base::'+' or .Primitive('+') instead of '+', as data.table cannot find '+' as a function.
Benchmarking
Benchmarking these solutions
benchmark(
lstdt = dtb[ , flst(.SD), by=id],
dfdt = dtb[ , fdf(.SD), by=id],
lapplySD = dtb[, lapply(.SD, base::'+', 1), by=id, .SDcols = c('a','b','c')],
lapplySD2 = dtb[, lapply(.SD, .Primitive('+'), 1), by=id, .SDcols = c('a','b','c')],
just_list = dtb[, list(a = a+1, b = b+1, c = c+1), by=id],
eval_mylist = dtb[, eval(my_list), by=id],
replications = 10^2
)
## test replications elapsed relative user.self
## 2 dfdt 100 0.36 4.000000 0.34
## 6 eval_mylist 100 0.09 1.000000 0.10
## 5 just_list 100 0.11 1.222222 0.10
## 3 lapplySD 100 0.14 1.555556 0.14
## 4 lapplySD2 100 0.11 1.1 0.11
## 1 lstdt 100 0.18 2.000000 0.17
The unevaluated expression (passing the list of expressions) is the fastest, which is consistent with Matthew Dowle's points in this previous question.
When you wrote c(dt$a+1, dt$b+1, dt$c+1) you should have expected a single vector (plus the group id column). Try this instead:
dtb <- data.table(id=rep(1:5,20), a=1:100, b=sample(1:100, 100), c=sample(1:100, 100))
f <- function(dt) { return(list(dt$a+1, dt$b+1, dt$c+1))}
dtb[,f(.SD), by=id]
EDIT2 (there was an error in my earlier edit that I only noticed when posting the full code). To the question about "cheaper": Here's a benchmark run that shows list construction to be 'cheaper':
flst <- function(dt) { return(list(dt$a+1, dt$b+1, dt$c+1))}
fdf <- function(dt) { return(data.frame(dt$a+1, dt$b+1, dt$c+1))}
require(rbenchmark)
benchmark(
lstdt=dtb[ , flst(.SD), by=id],
dfdt=dtb[ , fdf(.SD), by=id],
replications=10^2
)
test replications elapsed relative user.self sys.self user.child sys.child
2 dfdt 100 0.466 2.89441 0.457 0.010 0 0
1 lstdt 100 0.161 1.00000 0.159 0.003 0 0
