data.table underperforming when applying a function to nested fields - R

I wrote a function that applies another function to each nested field in a data.table:
n <- 3000
nameslist <- paste0("NAME",1:n)
dt <- data.table(name_var = nameslist
, value_var = rnorm(1e7)
, car_color = c("B","B","B","G","G","G","G"))
dt <- dt[,.(.(.SD)), by = name_var] # nesting dt and finishing toy data creation
transform_value <- function(x, fun, campo, ...) {
  x[, match.fun(fun)(get(campo), ...)]
}
system.time({
  dt[, min_value := lapply(V1, transform_value, "min", "value_var")]
})
This runs correctly and is pretty fast: it takes about 0.36 seconds on my machine for n <- 3000.
However, when I make n <- 500000 (nameslist 500,000 elements long), it takes 217 seconds!
Thus it seems that data.table is not effectively optimized by the code I wrote in the call
dt[, min_value := lapply(V1, transform_value, "min", "value_var")]. My take is that there is overhead each time data.table enters one of the nested tables, but I am not sure.
What can I do to make it run faster?

Since you are interested in performance, you should definitely set a key. Once you have paid the fixed cost of setting the key (which reorders your rows into contiguous memory slots), you will see significant gains.
Taking your example again:
dt <- data.table(name_var = nameslist
, value_var = rnorm(1e7)
, car_color = c("B","B","B","G","G","G","G"))
dt2 <- data.table::copy(dt)
setkeyv(dt2, c("name_var"))
microbenchmark::microbenchmark(
dt[,.(.(.SD)), by = name_var],
dt2[,.(.(.SD)), by = name_var],
times = 20L
)
Unit: milliseconds
expr min lq mean median uq max neval
dt[, .(.(.SD)), by = name_var] 658.76452 676.22905 706.3578 699.46644 727.8368 793.1192 20
dt2[, .(.(.SD)), by = name_var] 91.62276 92.48002 131.2777 99.15238 135.1332 318.3719 20
Just for this step, on my laptop (which is not very fast), the computation time is divided by about 7 (the exact factor can vary a lot).
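Applied to your workflow, the advice amounts to keying once before the nesting step; a minimal sketch (the per-row lapply over the nested tables is unchanged, so its own overhead remains):
setkeyv(dt, "name_var")                      # one-time reordering cost
dt <- dt[, .(.(.SD)), by = name_var]         # the nesting step is now much cheaper
dt[, min_value := lapply(V1, transform_value, "min", "value_var")]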

Related

Performance properties of time-series operations in R at scale (mainly xts and data.table)

I am undertaking a new project encompassing large time-series datasets from which dependent calculations are fed into a shiny application. As such, efficiency is of interest to me. The operations are typically restricted to elementary period conversions and subsequent summary statistics for risk metrics.
I am investigating which library/approach to build the calculation scripts with. At present, I am OK with xts and data.table. Although I could resort to libraries such as quantmod and TTR, I am hesitant to deploy black-box functions in production and would prefer to maintain full traceability.
Thus far, I have carried out the following benchmarking exercise in which a data.frame of daily prices is converted into monthly returns. The packages used thus far are xts, data.table and quantmod (as reference). The code is pasted below but can also be found on GitHub.
Benchmark Code
# Simple return exercise: Daily Prices to Monthly Returns
# Input: Nx2 data.frame with columns (N days, price)
# Output: Mx2 object with columns (M months, return)
# Three different functions: 1. xts, 2. data.table, 3. quantmod
rm(list = ls()); gc()
library(data.table)
library(zoo)
library(xts)
library(ggplot2)
library(quantmod)
# Asset params
spot = 100
r = 0.01
sigma = 0.02
N = 1e5
# Input data: Nx2 data.frame (date, price)
pmat = data.frame(
date = seq.Date(as.Date('1970-01-01'), by = 1, length.out = N),
price = spot * exp(cumsum((r - 0.5 * sigma**2) * 1/N + (sigma * (sqrt(1/N)) * rnorm(N, mean = 0, sd = 1))))
)
# Output functions
# 1. xts standalone
xtsfun = function(mat){
xtsdf = as.xts(mat[, 2], order.by = mat[, 1])
eom_prices = to.monthly(xtsdf)[, 4]
mret = eom_prices/lag.xts(eom_prices) - 1; mret[1] = eom_prices[1]/xtsdf[1] - 1
mret
}
# 2. data.table standalone
dtfun = function(mat){
dt = setNames(as.data.table(mat), c('V1', 'V2'))
dt[, .(EOM = last(V2)), .(Month = as.yearmon(V1))][, .(Month, Return = EOM/shift(EOM, fill = first(mat[, 2])) - 1)]
}
# 3. quantmod (black box library)
qmfun = function(mat){
qmdf = as.xts(mat[, 2], order.by = mat[, 1])
monthlyReturn(qmdf)
}
# Check 1 == 2 == 3:
all.equal(
unlist(dtfun(pmat[1:1000,])[, Return]),
as.numeric(xtsfun(pmat[1:1000,])),
as.numeric(qmfun(pmat[1:1000,])),
scale = NULL
)
# Benchmark
library(microbenchmark)
gc()
mbm = microbenchmark(
xts = xtsfun(pmat),
data.table = dtfun(pmat),
quantmod = qmfun(pmat),
times = 50
)
mbm
Results
For N = 1e5, the three approaches perform similarly:
Unit: milliseconds
expr min lq mean median uq max neval
xts 20.62520 22.93372 25.14445 23.84235 27.25468 39.29402 50
data.table 21.23984 22.29121 27.28266 24.05491 26.25416 98.35812 50
quantmod 14.21228 16.71663 19.54709 17.19368 19.38106 102.56189 50
However, for N = 1e6 I observe a substantial performance divergence for data.table:
Unit: milliseconds
expr min lq mean median uq max neval
xts 296.8969 380.7494 408.7696 397.4292 431.1306 759.7227 50
data.table 1562.3613 1637.8787 1669.8513 1651.4729 1688.2312 1969.4942 50
quantmod 144.1901 244.2427 278.7676 268.4302 331.4777 418.7951 50
I am very curious as to what drives this result, particularly since data.table normally excels at large N. Of course, dtfun could just be poorly written (and I very much appreciate any code improvements), but I achieve similar results using other approaches including self-joins on EOM dates and cumprod on daily returns.
Do xts and/or quantmod benefit from any internal Rcpp or equivalent calls that improve their performance at scale? Lastly, if you are aware of any other competitive standalone solution (base?, dplyr?) for large-scale time series, I am all ears.
The answer resides in data.table's date handling. In essence, dtfun groups on as.yearmon() dates, which is comparatively slow. When imposing integer-based date grouping instead, the results shift in favor of data.table.
I have updated the TSBenchmark repository with updated solutions for xts and data.table. I much appreciate the improvements provided by Joshua Ulrich and Matt Dowle, who deserve full credit.
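For reference, here is a hedged sketch of what the integer-based grouping can look like for dtfun: group on a yyyymm integer built with data.table's year() and month() helpers instead of on as.yearmon() (the Month column then comes back as an integer, and the actual updated solutions live in the linked TSBenchmark repository):
dtfun_int = function(mat){
  dt = setNames(as.data.table(mat), c('V1', 'V2'))
  dt[, Month := year(V1) * 100L + month(V1)]   # integer yyyymm key
  dt[, .(EOM = last(V2)), by = Month][
    , .(Month, Return = EOM/shift(EOM, fill = first(mat[, 2])) - 1)]
}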

Is it possible to do your own efficient descriptive statistics function? - R

Usually, I find myself using a few summary functions or making my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable given a limit of distinct values:
library(magrittr) # provides the %>% and %T>% pipes used below
table_transposed <- function(vector){
  merge(as.data.frame(table(vector, dnn="values")),
        as.data.frame(round(prop.table(table(vector, dnn="values")),2)),
        by="values",
        all.x=TRUE) %>%
    data.table::transpose(keep.names = "values",
                          make.names = names(.)[1]) %T>%
    {.[,c("values")] <- c("Count", "Percentage")}
}
table_transposed_filter <- function(dataframe, max_number_categories) {
  (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
    as.vector() %>%
    {dataframe[,.]} %>%
    lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now I'm trying to figure out an efficient, fast, and simple way to do a combination of psych::describe(), Hmisc::describe(), others, and my own, for numeric and categorical variables (one descriptive function for each). Something like (for numerical):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
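For concreteness, here is a rough sapply()-based sketch of the kind of numeric table I mean (the statistics and column names are just placeholders, not a full psych/Hmisc replacement):
num_summary <- function(df) {
  nums <- df[sapply(df, is.numeric)]
  t(sapply(nums, function(x) c(
    mean = mean(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    skew = mean((x - mean(x, na.rm = TRUE))^3, na.rm = TRUE) / sd(x, na.rm = TRUE)^3,
    p25 = unname(quantile(x, 0.25, na.rm = TRUE))
  )))
}
num_summary(mtcars)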
PS: I thought about putting this question on Meta Stack Exchange or Cross Validated, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little on larger data.
This makes sense, as the most expensive operation you do is table(): your version calls it twice, my version calls it once. I didn't profile the code, but my guess is that table() is the bottleneck by more than one order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
  tab = table(x)
  prop_tab = prop.table(tab)
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}
t_transp_filter = function(data, n_max, ...) {
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
Benchmarking:
microbenchmark::microbenchmark(
gregor = t_transp_filter(mtcars, n_max = 4),
OP = table_transposed_filter(mtcars, 4),
times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394 20 a
# OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048 20 b
set.seed(47)
df = as.data.frame(matrix(
c(sample(letters[1:5], size = 1e5 * 20, replace = T))
, ncol = 20))
microbenchmark::microbenchmark(
gregor = t_transp_filter(df, n_max = 5),
OP = table_transposed_filter(df, 5),
times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 59.5466 59.95545 63.6825 61.14075 67.2167 75.4270 20 a
# OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651 20 b

roll applying multiple quantiles in data table to multiple columns

Background:
I can get multiple moments from my data using data.table (see appended), but it is taking a very long time. I was thinking that since getting one percentile already requires sorting the data, it should be possible to find several percentiles from that same work more efficiently.
A once-through statistic like median is taking 1.79 ms while the non-median quantile is taking 68x longer at 122.8 ms. There has to be a way to reduce the compute time.
Questions:
Is there a way to call several quantiles from the same data in a more efficient manner?
Can I pull the "lapply" out of the data.table and compose it like I do the name lists?
My example code with tiny synthetic data:
#libraries
library(data.table) #data.table
library(zoo) #roll apply
#reproducibility
set.seed(45L)
#make data
DT<-data.table(V1=c(1L,2L),
V2=LETTERS[1:3],
V3=round(rnorm(300),4),
V4=round(runif(150),4),
V5=1:1200)
DT
#get names
my_col_list <- names(DT)[c(3,4)]
#make new variable names
my_name_list1 <- paste0(my_col_list, "_", "33rd_pctile")
my_name_list2 <- paste0(my_col_list, "_", "77th_pctile")
#compute values
for (i in 1:length(my_col_list)) {
  #first
  DT[, (my_name_list1[i]) := unlist(lapply(.SD, function(x)
         rollapply(x, 7, quantile, fill = NA, probs = 1/3)),
       recursive = F),
     .SDcols = my_col_list[i]]
  #second
  DT[, (my_name_list2[i]) := unlist(lapply(.SD, function(x)
         rollapply(x, 7, quantile, fill = NA, probs = 7/9)),
       recursive = F),
     .SDcols = my_col_list[i]]
}
#display it
head(DT,10)
Microbenchmarking a once-through statistic vs. the many-through statistics says that the quantiles are expensive.
res2 <- microbenchmark(
  DT[, (my_name_list1[i]) := unlist(lapply(.SD, function(x)
         rollapply(x, 7, mean, fill = NA)),
       recursive = F),
     .SDcols = my_col_list[i]],
  times = 5)
This says it takes about 1.75 milliseconds for a mean (the median takes about 1.79 ms),
> res2
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, mean, fill = NA)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
1.465779 1.509114 1.754145 1.618591 1.712103 2.46514 5
but it takes roughly 70x that to compute a quantile
res3 <- microbenchmark(
  DT[, (my_name_list1[i]) := unlist(lapply(.SD, function(x)
         rollapply(x, 7, quantile, fill = NA, probs = 1/3)),
       recursive = F),
     .SDcols = my_col_list[i]],
  times = 5)
> res3
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, quantile, fill = NA, probs = 1/3)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
118.5833 119.2896 122.8432 124.0168 124.4183 127.9082 5
UPDATES:
The median from "quantile" takes ~128 seconds while the "median"
takes much less. They are not the same thing.
iterating through the 9 options for "type" of "quantile" gives mean
times between 129ms and 157ms. There is no "easy-win" here.
The package "WGCNA" requires "GO.db" from bioconductor, which is not
installed with the "install.packages" command. Also requires package
"impute" which is not installed with "WGCNA" or "GO.db". Also
"preprocessCore".
Using the (eventually working) WGCNA package reduced mean time for
rolling quantile to 68.1 ms. It is about half the time, but it is
not about 1/70th the time.
Using "RollingMedian" from the "RollingWindow" package I get 169.8
microseconds (aka 0.1698 milliseconds) which is a LOT faster, but is
not an arbitrary quantile.
Using "perccal" seems to drop computing a quantile down to ~145
microseconds. In the rollapply this drops compute time down to 15.3
milliseconds, which is an 8.5x boost. I'm not sure how much more
blood there is to squeeze out of this stone.
Thoughts:
The "perccal" approach seems to only be using a single core. There
may be some "parallel" process that allows me to split the summary
against different variables to different cores. This might give some
speedup.
As we add more terms to the data, the speedup starts reducing.
Increasing to 9600 rows reduces the speedup from ~8.5x to below 1x.
This suggests that the rollapply function may also have some issues.
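Here is a hedged sketch of that parallel idea, reusing the toy DT and name lists above (parallel::mclapply forks, so it is Unix-only, and the worker count is arbitrary):
library(parallel)
# compute each rolling 33rd-percentile column in its own forked worker,
# then assign all results back in a single := call
new_cols <- mclapply(my_col_list, function(col)
  rollapply(DT[[col]], 7, quantile, fill = NA, probs = 1/3),
  mc.cores = 2)
DT[, (my_name_list1) := new_cols]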
Data table optimization
You are correct that the median is especially fast in this case; this is because it runs specialized C code, unlike the quantile function, which is pure R code.
We can read about this optimization in the data.table documentation under
?datatable.optimize
There we have:
When expressions in j which contains only these functions min, max,
mean, median, var, sd, prod, for example, dt[, list(mean(x),
median(x), min(y), max(y)), by=z], they are very effectively optimised
using, what we call, GForce. These functions are replaced with gmean,
gmedian, gmin, gmax instead
And they give an example of the speed improvement for the median case:
# Generate a big data.table with a relatively many columns
set.seed(1L)
dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(dt)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(dt), units="Mb") # 400MB, not huge, but will do
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, median), by=id])
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
On my system the R internal version is about 44x slower than the data.table version.
Speeding up quantile
We can still try to improve the speed of the quantile function in R, for this my approach is basically "Use the source, Luke" and looking at the quantile function. Looking at the source we get standard generic function:
>> quantile
function (x, ...)
UseMethod("quantile")
<bytecode: 0x0000000009154c78>
<environment: namespace:stats>
We can trace this a bit more:
>> methods(quantile)
[1] quantile.default* quantile.ecdf* quantile.POSIXt* quantile.zoo
see '?methods' for accessing help and source code
and have a look at the default function.
>> stats:::quantile.default
function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
type = 7, ...)
{
...
}
Now we have the entire source, which is quite long, and we can compare it to the R median source in median.default. With the source we can copy it as a user-defined function and profile it (with the small addition of supplying the namespace for format_perc). From this it is clear that only two lines are relevant: the sorting and the output formatting. The sorting is very similar to the median function and likely hard to improve; the formatting, however, can be skipped entirely by commenting it out.
fast.quant <- function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
type = 7, ...)
{
if (is.factor(x)) {
...
...
if (names && np > 0L) {
#names(qs) <- stats:::format_perc(probs)
}
...
}
All in all this fix cuts the run time in half. It is still not as fast as the optimized median, but most likely it is hard to get better performance without leaving R.
It is possible, and maybe even probable, that the optimizations in data.table could be leveraged to help with the quantile calculations as well, since data.table also implements its sorting in C. One would, however, still like to exploit the fact that only partial sorting is required. Otherwise, the Rcpp package can be used to perform a similar optimization.
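To illustrate the point that only partial sorting is required, here is a minimal sketch of a type-7 quantile for a single probability that uses base R's partial sort and skips the name formatting entirely (it assumes a plain numeric vector with no NAs, and it is of course not the GForce-optimized path):
fast_q7 <- function(x, prob) {
  n  <- length(x)
  h  <- (n - 1) * prob + 1                    # type-7 interpolation position
  lo <- floor(h)
  hi <- ceiling(h)
  xs <- sort(x, partial = unique(c(lo, hi)))  # only the needed order statistics are placed
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}
fast_q7(rnorm(1e3), 1/3)
Separately, since quantile() accepts a vector of probs, both the 33rd and 77th percentiles can be obtained from a single rollapply pass (rollapply then returns one column per probability), which halves the number of rolling passes whatever quantile implementation is used.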

Optimising sapply() or for(), paste(), to efficiently transform sparse triplet matrix to a libsvm format

I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors to each other, and two nested paste statements for each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorize the sapply call that will make it an order of magnitude or three faster?
library(microbenchmark)
# create an example object like a simple_triplet_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
i = sample(1:n, nonzerovalues, replace = TRUE),
j = sample(1:ncols, nonzerovalues, replace = TRUE),
v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}
microbenchmark(expensive())
The output of expensive is a character vector, of n elements, that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
For what it's worth, the motivation is to efficiently write data from a sparse matrix format - either from slam or Matrix, but starting with slam - into libsvm format (which is the format above, but with each row beginning with a number representing a target variable for a support vector machine, omitted in this example as it's not part of the speed problem). I am trying to improve on the answers to this question. I forked one of the repositories referred to there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
Use the package data.table. Its by, combined with fast sorting, saves you from finding the indices of equal i values yourself.
res1 <- expensive()
library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
             ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
  setDF(stm) # clean-up which might not be necessary
  res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
microbenchmark(expensive(),
cheaper())
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417 100 b
# cheaper() 15.31835 15.66584 16.16267 15.98363 16.33637 18.35359 100 a
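As a hedged extension toward the libsvm use case from the question, one could prepend a target value per row; this assumes a hypothetical targets vector of length n and that every row id in 1:n appears at least once in stm (rows with no non-zero entries would otherwise be dropped by the grouping):
targets <- sample(0:1, n, replace = TRUE)     # placeholder target values
libsvm_lines <- paste(targets, cheaper())     # cheaper() returns rows ordered by i
writeLines(libsvm_lines, "sparse_data.libsvm")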

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~2M rows). I used to use plyr for that, but the table keeps growing and the current solution is starting to approach unacceptable runtimes. I thought I could simply switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)
dt = data.table("ID_1" = c(1:1000), # unique ID
"ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
"group" = sample(LETTERS[1:10], 1000, replace = T),
"value" = runif(1000),
"ballast1" = "X", # keeps unchanged in derive_dt
"ballast2" = "Y", # keeps unchanged in derive_dt
"ballast3" = "Z", # keeps unchanged in derive_dt
"value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if (tmprow$group %in% extra_arg) {return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if (new_group %in% unique(dt$group)) {return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(results_plyr <- x <- ddply(dt, .variable = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce runtimes for data.table and dplyr to ~0.6 and ~0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a bigger performance gain from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table; plyr converts the row to a data.frame before passing it to your derive function, and thus avoids that overhead:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
# expr min lq mean median uq max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406 100
# { df$group = df$group } 26.045 45.244 64.13909 61.6045 79.635 157.266 100
I can't suggest a good fix, since you say your example is not the real problem, so there is no point in solving it better. Some basic directions to look at are vectorizing the code and using := or set instead (depending on what exactly you end up doing); a sketch of the vectorized direction follows below.
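As a sketch of the vectorized direction, based on the toy derive() above (the real function isn't shown, so this may not transfer directly; note the per-row random draws become one vectorized draw, so results won't match the row-by-row version seed for seed):
vec_derive <- function(dt, extra_arg) {
  res <- dt[!group %in% extra_arg]                 # subsetting copies; dt itself is untouched
  idx <- match(res$group, LETTERS)
  idx <- ((idx + sample(1:26, nrow(res), replace = TRUE)) %% 25) + 1
  res[, `:=`(group = LETTERS[idx],
             value = runif(.N),
             value_derived = 1)]
  res[!group %in% unique(dt$group)]                # keep only rows that landed on a new group
}
system.time(result_vec <- vec_derive(dt, extra_arg))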
