Transposing a data.table

What would be a good way to efficiently transpose a data.table after the data computation is over?
nrow=500e3
ncol=2000
m <- matrix(rnorm(nrow*ncol),nrow=nrow)
colnames(m) <- c('foo',seq(ncol-1))
dt <- data.table(m)
df <- as.data.frame(m)
dt <- t(dt) # takes a long time and converts the data.table to a matrix
Compute times:
1. To transpose the matrix:
system.time(mt <- t(m))
user system elapsed
20.005 0.016 20.024
2. To transpose the data.table:
system.time(dt <- t(dt))
user system elapsed
32.722 15.129 47.855
3. To transpose the data.frame:
system.time(df <- t(df))
user system elapsed
32.414 15.357 47.775

This is quite an old question, and since then data.table has added/exported transpose for transposing lists. Performance-wise, it outperforms t() except on matrices (I think this is to be expected):
system.time(t(m))
# user system elapsed
# 23.990 23.416 85.722
system.time(t(dt))
# user system elapsed
# 31.223 53.197 195.221
system.time(t(df))
# user system elapsed
# 30.609 45.404 148.323
system.time(setDT(transpose(dt)))
# user system elapsed
# 42.135 38.478 116.599
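For reference, a minimal sketch of calling transpose directly (the keep.names/make.names arguments assume a reasonably recent data.table; check ?transpose for your version):
library(data.table)
dtx <- data.table(id = c("a", "b"), v1 = 1:2, v2 = 3:4)
# make.names uses the values of an existing column as the new column names;
# keep.names stores the old column names in a new first column
transpose(dtx, keep.names = "col", make.names = "id")
#    col a b
# 1:  v1 1 2
# 2:  v2 3 4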

Related

Is there a fast way to extract elements in a list of data frames?

I'm trying to find a fast way to extract elements in a list of data frames.
To do this, I've tested the function lapply. Here is a reproducible example:
i <- 2
dat <- replicate(100000, data.frame(x=1:5000, y = 1:5000, z = 1:5000), simplify=FALSE)
system.time(test <- lapply(dat, function(y) y[i, c("x", "y")]))
user system elapsed
7.69 0.00 7.73
Ideally, the elapsed time should be <= 1 second.
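One way to get close (a sketch, not from the original thread) is to bypass [.data.frame, whose S3 dispatch and validity checks dominate the cost when called 100,000 times, and pull the columns out directly:
# .subset2 fetches a column without dispatch or checks, so each
# iteration reduces to two atomic vector lookups
system.time(test2 <- lapply(dat, function(d)
  c(x = .subset2(d, "x")[i], y = .subset2(d, "y")[i])))
Note this yields a named vector per list element rather than a one-row data frame, which is exactly what makes it cheap.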

Fast way to get topN elements for each matrix column

This is the code snippet from recommenderlab package, that takes matrix with ratings and returns top 5 elements for each user -
reclist <- apply(ratings, MARGIN=2, FUN=function(x)
head(order(x, decreasing=TRUE, na.last=NA), 5))
For a large matrix (>10K columns) it takes too long to run. Is there any way to rewrite it to make it more efficient, maybe using the dplyr or data.table packages? Writing C++ code is not an option for me.
An answer using data.table and base R:
# 10000 column dummy matrix
cols <- 10000
mat <- matrix(rnorm(100*cols), ncol=cols)
With data.table:
library(data.table)
dt1 <- data.table(mat)
# sort every column, return first 5 rows
dt1[, lapply(.SD, sort, decreasing=TRUE)][1:5]
system.time(dt1[, lapply(.SD, sort, decreasing=TRUE)][1:5])
result:
user system elapsed
2.904 0.013 2.916
In plain old base R, it's actually faster (thanks for the comment, Arun):
system.time(head(apply(mat, 2, sort, decreasing=TRUE), 5))
user system elapsed
0.473 0.002 0.475
However, both are faster than the original code sample, according to system.time():
system.time(
apply(mat, MARGIN=2, FUN=function(x) {
head(order(x, decreasing=TRUE, na.last=NA), 5)
}))
user system elapsed
3.063 0.031 3.094
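If only the top values are needed (rather than their row indices, which is what order returns), base R's partial sorting can cut the work further. A sketch, not from the thread; sort() does not allow partial= together with decreasing=TRUE, hence the sign flip:
# partial = 1:5 guarantees the five smallest of -x land, in order,
# in the first five positions; negating recovers the five largest of x
top5 <- apply(mat, 2, function(x) -sort(-x, partial = 1:5)[1:5])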

rbind + setkey in data.table slower than xts::rbind which automatically indexes?

What is the reason behind data.table being almost 6x slower than xts when updating (i.e. rbind-ing) new rows?
library(quantmod); library(xts); library(data.table)
XTS = getSymbols("AAPL", from="2000-01-01", env = NULL)
# make corresponding `data.table`:
DT <- as.data.table(as.data.frame(XTS))
DT[, Date:=index(XTS)]
setkey(DT,Date)
setcolorder(DT,c("Date",names(XTS)))
# Note: rerun the above before running each test.
system.time(for(i in 1:10) XTS = rbind(XTS, XTS)) # reindexing is automatic
# user system elapsed
# 0.15 0.03 0.47
system.time(for(i in 1:10) DT = setkey(rbind(DT, DT), Date)) # need to manually reset key
# user system elapsed
# 0.64 0.02 2.30
system.time(for(i in 1:10) DT = setkey(rbindlist(list(DT, DT)), Date)) # ditto
# user system elapsed
# 0.60 0.02 2.20
The data.table (unlike xts) will even exhaust memory allocation for i>15 on my computer.
The common programming use case is when you are running a temporal simulation and want to collect intermediate measurements into a result table, which you later want to summarise.
Try:
rbindlist(rep(list(DT), 10))
rbindlist should boost your runtime significantly.
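For the simulation use case mentioned above, the usual pattern is to accumulate per-step results in a pre-allocated list and bind once at the end, rather than rbind-ing inside the loop. A sketch with hypothetical step/value columns:
library(data.table)
n_steps <- 1000
results <- vector("list", n_steps)  # pre-allocate the list
for (i in seq_len(n_steps)) {
  results[[i]] <- data.table(step = i, value = rnorm(1))  # per-step measurement
}
res <- rbindlist(results)  # one allocation instead of n_steps copies
setkey(res, step)          # set the key once, at the end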

replacement for na.locf.xts (extremely slow when used with a multicolumn xts)

The R function
xts:::na.locf.xts
is extremely slow when used with a multicolumn xts of more than a few columns.
There is indeed a loop over the columns in the code of na.locf.xts
I am trying to find a way to avoid this loop.
Any idea?
The loop in na.locf.xts is slow because it creates a copy of the entire object for each column in the object. The loop itself isn't slow; the copies created by [.xts are slow.
There's an experimental (and therefore unexported) version of na.locf.xts on R-Forge that moves the loop over columns to C, which avoids copying the object. It's quite a bit faster for very large objects.
set.seed(21)
m <- replicate(20, rnorm(1e6))
is.na(m) <- sample(length(m), 1e5)
x <- xts(m, Sys.time()-1e6:1)
y <- x[1:1e5,1:3]
# smaller objects
system.time(a <- na.locf(y))
user system elapsed
0.008 0.000 0.008
system.time(b <- xts:::.na.locf.xts(y))
user system elapsed
0.000 0.000 0.003
identical(a,b)
[1] TRUE
# larger objects
system.time(a <- na.locf(x))
user system elapsed
1.620 1.420 3.064
system.time(b <- xts:::.na.locf.xts(x))
user system elapsed
0.124 0.092 0.220
identical(a,b)
[1] TRUE
An alternative workaround is to drop to a plain matrix, fill each column, and rebuild the xts object:
timeIndex <- index(x)
x <- apply(x, 2, na.locf)
x <- as.xts(x, order.by = timeIndex)
This avoids the column-by-column data copying. Without it, filling the nth column makes a copy of columns 1:(n-1) and appends the nth column to it, which becomes prohibitively slow when n is large.

Why is "by" on a vector not from a data.table column very slow?

test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x
test[,.N, by=x] # fast
test[,.N, by=y] # extremely slow
Why is it so slow in the second case?
It is even faster to do this:
test[,y:=y]
test[,.N, by=y]
test[,y:=NULL]
Is this case simply poorly optimized?
Seems like I forgot to update this post.
This was fixed a while back, in commit #1039 of v1.8.11. From NEWS:
Fixed #5106: DT[, .N, by=y] failed when y is a vector with length(y) == nrow(DT) but y is not a column in DT. Thanks to colinfang for reporting.
Testing on v1.8.11 commit 1187:
require(data.table)
test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x
system.time(ans1 <- test[,.N, by=x])
# user system elapsed
# 0.015 0.000 0.016
system.time(ans2 <- test[,.N, by=y])
# user system elapsed
# 0.015 0.000 0.015
setnames(ans2, "y", "x")
identical(ans1, ans2) # [1] TRUE
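On current data.table versions you can also name the grouping column in by directly, which avoids the setnames step; a sketch, assuming y is visible in the calling scope with length(y) == nrow(test):
test[, .N, by = .(x = y)]  # group by the external vector, naming the column x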
