rbind + setkey in data.table slower than xts::rbind which automatically indexes?

What is the reason behind data.table being almost 6x slower than xts when appending (rbind) new rows?
library(quantmod); library(xts); library(data.table)
XTS = getSymbols("AAPL", from="2000-01-01", env = NULL)
# make corresponding `data.table`:
DT <- as.data.table(as.data.frame(XTS))
DT[, Date:=index(XTS)]
setkey(DT,Date)
setcolorder(DT,c("Date",names(XTS)))
# Note: rerun the above before running each test.
system.time(for(i in 1:10) XTS = rbind(XTS, XTS)) # reindexing is automatic
# user system elapsed
# 0.15 0.03 0.47
system.time(for(i in 1:10) DT = setkey(rbind(DT, DT), Date)) # need to manually reset key
# user system elapsed
# 0.64 0.02 2.30
system.time(for(i in 1:10) DT = setkey(rbindlist(list(DT, DT)), Date)) # ditto
# user system elapsed
# 0.60 0.02 2.20
The data.table (unlike xts) will even exhaust memory allocation for i>15 on my computer.
The common programming use case is when you are running a temporal simulation and want to collect intermediate measurements into a result table, which you later want to summarise.

Try
rbindlist( rep( list(DT), 10 ))
rbindlist should boost your runtime significantly.
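In the same spirit, for the simulation use case mentioned in the question, a common pattern is to accumulate each iteration's result in a list and call rbindlist() (plus setkey()) once at the end, instead of growing DT with rbind inside the loop. A minimal sketch, where simulate_step() is a hypothetical stand-in for the real per-step computation:
library(data.table)
# hypothetical per-step result; replace with the real simulation code
simulate_step <- function(i) data.table(Date = as.Date("2000-01-01") + i, value = rnorm(1))
results <- vector("list", 100)
for (i in seq_len(100)) {
  results[[i]] <- simulate_step(i)   # cheap: no copying of previously collected rows
}
DT <- rbindlist(results)             # one allocation instead of repeated rbind calls
setkey(DT, Date)                     # set the key once, after all rows are in place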

Related

set missing values to constant in R, computational speed

In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many NA values in both its rows and columns, and I am looking to replace these with zeros. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, for example, "why would you replace NAs with zeros". And, while I realize a similar Q has been asked (How do I replace NA values with zeros in an R dataframe?), the focus there has not been on computational speed for a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, changing d to a matrix before applying is.na sped up the computation by an order of magnitude.
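The edit does not show the code; a minimal sketch of that matrix round-trip, assuming every column of d is numeric, would be:
m <- as.matrix(d)       # numeric data frame -> matrix
m[is.na(m)] <- 0        # fast: one logical index over a single atomic vector
d <- as.data.frame(m)   # back to a data frame if needed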
You can get a considerable performance increase using the data.table package.
It is much faster, in general, for a lot of manipulations and transformations.
The downside is the learning curve of its syntax.
However, if you are looking for a performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
  require(tidyr)
  l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
  replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
  require(data.table)
  dt <- data.table(df)
  for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
  setDF(dt)  # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
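As an aside, more recent data.table releases (1.12.4+, so newer than these answers) also provide setnafill(), which fills NAs in integer/double columns by reference, avoiding the per-column which(is.na(...)) step. A small sketch under that version assumption:
library(data.table)
dt <- data.table(matrix(sample(c(NA, 1:5), 1e5, replace = TRUE), ncol = 10))
setnafill(dt, type = "const", fill = 0)  # replaces NAs in all numeric columns, in place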
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.

R, data.table: find all combinations of a list excluding each element paired with itself

I would like to efficiently find all combinations of a list excluding the combination of each element to itself. For example, with a list of A,B,C,D find all combinations excluding A-A, B-B, C-C, D-D.
I can do this in what seems to be an inefficient way using this code:
x <- c("A","B","C","D")
dt <- CJ(x,x)
dt <- dt[!V1==V2]
The problem is that the third line takes about 4 times as long to run as the second line. So for a large list like my real data, lines 2 and 3 together can take a very long time.
I am using data.table 1.9.6, R 3.2.2, and R Studio on Windows 7.
Thanks so much.
Well, this is something of an improvement:
n = 1e4; x = seq(n)
# combn (variant of @Psidom's answer)
system.time({
cn = transpose(combn(x, 2, simplify=FALSE))
r = rbind( setDT(cn), rev(cn) )
})
# takes forever, so I cut it off
# op's code
system.time({
r0 = CJ(x,x)[V1 != V2]
})
# user system elapsed
# 1.69 0.63 1.50
# use indices in the final step
system.time({
r1 = CJ(x,x)[-seq(1L, .N, by=length(x)+1L)]
})
# user system elapsed
# 1.17 0.42 0.96
And some more:
# build it manually
system.time({
xlen = length(x)
r2 = data.table(rep(x, each = xlen), V2 = x)[-seq(1L, .N, by=xlen+1L)]
})
# user system elapsed
# 3.03 0.60 2.79
# ... or ...
system.time({
xlen = length(x)
r2 = data.table(rep(x, each = xlen-1L), rep.int(x, xlen)[-seq(1L, xlen^2, by=xlen+1L)])
})
# user system elapsed
# 2.79 0.25 3.07
# build it manually special for the case of two cols
system.time({
r3 = setDT(list(x))[, .(V2 = x), by=V1][ -seq(1L, .N, by=length(x)+1L) ]
})
# user system elapsed
# 0.92 0.25 0.86
# ... or ...
system.time({
r4 = setDT(list(x))[, .(V2 = x[-.GRP]), by=V1]
})
# user system elapsed
# 0.85 0.32 1.19
# verify
identical(r0, r1) # TRUE
identical(setkey(r0, NULL), r2) # TRUE
identical(setkey(r0, NULL), r3) # TRUE
identical(setkey(r0, NULL), r4) # TRUE
Maybe you can do a little better by writing your own CJ with Rcpp. It might also be worth noting that everything is faster with integers (instead of characters):
x = rep(LETTERS, 5e2)
system.time(CJ(x,x))
# user system elapsed
# 7.06 1.81 6.61
x = rep(1:26, 5e2)
system.time(CJ(x,x))
# user system elapsed
# 3.39 0.88 2.95
So if x is a character vector, it might be best to use seq_along(x) for the combinatorial tasks and then map back to the character values like x[V1] afterwards.
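A sketch of that integer-then-map-back idea, applied to the original 4-letter example:
library(data.table)
x   <- c("A", "B", "C", "D")
idx <- CJ(V1 = seq_along(x), V2 = seq_along(x))[V1 != V2]  # do the combinatorial work on integers
res <- idx[, .(V1 = x[V1], V2 = x[V2])]                    # map back to the character labels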

replacement for na.locf.xts (extremely slow when used with a multicolumn xts)

The R function xts:::na.locf.xts is extremely slow when used with a multicolumn xts of more than a few columns.
There is indeed a loop over the columns in the code of na.locf.xts.
I am trying to find a way to avoid this loop.
Any idea?
The loop in na.locf.xts is slow because it creates a copy of the entire object for each column in the object. The loop itself isn't slow; the copies created by [.xts are slow.
There's an experimental (and therefore unexported) version of na.locf.xts on R-Forge that moves the loop over columns to C, which avoids copying the object. It's quite a bit faster for very large objects.
set.seed(21)
m <- replicate(20, rnorm(1e6))
is.na(m) <- sample(length(m), 1e5)
x <- xts(m, Sys.time()-1e6:1)
y <- x[1:1e5,1:3]
> # smaller objects
> system.time(a <- na.locf(y))
user system elapsed
0.008 0.000 0.008
> system.time(b <- xts:::.na.locf.xts(y))
user system elapsed
0.000 0.000 0.003
> identical(a,b)
[1] TRUE
> # larger objects
> system.time(a <- na.locf(x))
user system elapsed
1.620 1.420 3.064
> system.time(b <- xts:::.na.locf.xts(x))
user system elapsed
0.124 0.092 0.220
> identical(a,b)
[1] TRUE
Another workaround is to drop to the plain data, run na.locf column by column there, and rebuild the xts:
timeIndex <- index(x)
x <- apply(x, 2, na.locf)
x <- as.xts(x, order.by = timeIndex)
This avoids the column-by-column data copying. Without it, when filling the nth column you make a copy of columns 1:(n-1) and append the nth column to it, which becomes prohibitively slow when n is large.
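Wrapped as a reusable helper (a sketch only; na_locf_wide is a hypothetical name, and it assumes zoo's na.locf is used for the column-wise fill):
library(xts)   # for coredata(), index(), xts()
library(zoo)   # for na.locf()
# Hypothetical helper: fill a wide xts via its plain-matrix core, avoiding the
# per-column [.xts copies described above. na.rm = FALSE keeps leading NAs so
# every column keeps its original length and apply() returns a matrix.
na_locf_wide <- function(x, ...) {
  filled <- apply(coredata(x), 2, na.locf, na.rm = FALSE, ...)
  xts(filled, order.by = index(x))
}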

Transposing a data.table

What would be a good way to efficiently transpose a data.table after the data computation is over?
nrow=500e3
ncol=2000
m <- matrix(rnorm(nrow*ncol),nrow=nrow)
colnames(m) <- c('foo',seq(ncol-1))
dt <- data.table(m)
df <- as.data.frame(m)
dt <- t(dt) # takes a long time and converts the data.table to a matrix
compute time
1. to transpose the matrix
system.time(mt <- t(m))
user system elapsed
20.005 0.016 20.024
2. to transpose the dt
system.time(dt <- t(dt))
user system elapsed
32.722 15.129 47.855
3. to transpose a df
system.time(df <- t(df))
user system elapsed
32.414 15.357 47.775
This is quite an old question, and since then data.table has added/exported transpose for transposing lists. Performance-wise, it out-performs t() except on matrices (I think this is to be expected):
system.time(t(m))
# user system elapsed
# 23.990 23.416 85.722
system.time(t(dt))
# user system elapsed
# 31.223 53.197 195.221
system.time(t(df))
# user system elapsed
# 30.609 45.404 148.323
system.time(setDT(transpose(dt)))
# user system elapsed
# 42.135 38.478 116.599
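For reference, transpose() can also carry names along; in reasonably recent data.table versions (1.11.0+, if I remember right) the keep.names and make.names arguments do this. A small sketch:
library(data.table)
dt <- data.table(name = c("a", "b"), x = 1:2, y = 3:4)
# keep the original column names in a "col" column, and use `name` for the new headers
transpose(dt, keep.names = "col", make.names = "name")
#    col a b
# 1:   x 1 2
# 2:   y 3 4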

Select last row by group for all columns data.table

I was surprised doing the following:
R) system.time(lastOrder <- order[,lapply(.SD,tail,1),by="TRADER_ID,EXEC_IDATE"]);
user system elapsed
1.45 0.00 1.53
R) nrow(order)
[1] 75301
R) ncol(order)
[1] 23
I thought that was very long, so then I tried
R) system.time(lastOrder <- order[,list(test=tail(EXEC_IDATE,1)),by="TRADER_ID,EXEC_IDATE"]);
user system elapsed
0.14 0.00 0.14
As far as I understand, once you know which rows to select, most of the work is done, so I don't see why applying this to all columns should be 10x longer. Am I doing something wrong in the first bit of code? It is the only way I know to select the last rows by group.
Last row by group :
DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"] # (1)
or, faster (avoid use of .SD where possible, for speed) :
w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"][[3]] # (2)
DT[w]
Note that the following feature request will make approach (1) as fast as approach (2) :
FR#2330: Optimize .SD[i] query to keep the elegance but make it faster, with the syntax unchanged.
How about something like this? (Synthetic data meant to mimic what I can infer about yours from the question)
tmp <- data.table(id = sample(1:20, 1e6, replace=TRUE),
                  date = as.Date(as.integer(runif(n=1e6, min = 1e4, max = 1.1e4)),
                                 origin = as.Date("1970-01-01")),
                  data1 = rnorm(1e6),
                  data2 = rnorm(1e6),
                  data3 = rnorm(1e6))
> system.time(X <- tmp[, lapply(.SD, tail, 1), by = list(id, date)])
user system elapsed
1.95 0.00 1.95
> system.time(Y <- tmp[, list(tail(data1, 1)), by = list(id, date)])
user system elapsed
1.24 0.01 1.26
> system.time({
+   setkey(tmp, id, date)
+   Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"]
+ })
user system elapsed
0.90 0.02 0.92
X and Z are the same once the same order is ensured:
> identical(setkey(X, id, date), setkey(Z, id, date))
[1] TRUE
The difference between my lapply tail and 1-column tail isn't as drastic as yours, but without the structure of your data, it's hard to say more.
Also, note that most of the time in this method is spent setting the key. If the table is already sorted by the grouping columns, it goes really fast:
> system.time(Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"])
user system elapsed
0.03 0.00 0.03
Alternatively, you could translate the many column problem to the 1-column problem with a temporary column:
> system.time({
+   tmp[, row.num := seq_len(nrow(tmp))]
+   W <- tmp[tmp[, max(row.num), by = list(id, date)]$V1][, row.num := NULL]
+   tmp[, row.num := NULL]
+ })
user system elapsed
0.92 0.00 1.09
> identical(setkey(X, id, date), setkey(W, id, date))
[1] TRUE
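For completeness, a sketch of the first answer's .I[.N] idiom applied to the same synthetic tmp data (w and V are throwaway names here):
w <- tmp[, .I[.N], by = list(id, date)]$V1   # row number of each group's last row
V <- tmp[w]
identical(setkey(X, id, date), setkey(V, id, date))   # expected TRUE, as for Z and W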
