Background:
I can get multiple moments from my data using data.table (see the appended code), but it is taking a very long time. My thinking was that since getting a single percentile already involves sorting the data, that sorting work could be reused to find several percentiles more efficiently.
A once-through statistic like median is taking 1.79 ms while the non-median quantile is taking 68x longer at 122.8 ms. There has to be a way to reduce the compute time.
Questions:
Is there a way to call several quantiles from the same data in a more efficient manner?
Can I pull the "lapply" out of the data.table and compose it like I do the name lists?
My example code with tiny synthetic data:
#libraries
library(data.table) #data.table
library(zoo) #roll apply
#reproducibility
set.seed(45L)
#make data
DT<-data.table(V1=c(1L,2L),
V2=LETTERS[1:3],
V3=round(rnorm(300),4),
V4=round(runif(150),4),
V5=1:1200)
DT
#get names
my_col_list <- names(DT)[c(3,4)]
#make new variable names
my_name_list1 <- paste0(my_col_list, "_", "33rd_pctile")
my_name_list2 <- paste0(my_col_list, "_", "77th_pctile")
#compute values
for(i in 1:length(my_col_list)){
#first
DT[, (my_name_list1[i]) := unlist(lapply(.SD,
function(x) rollapply(x,
7,
quantile,
fill = NA,
probs = 1/3)),
recursive = F),
.SDcols = my_col_list[i]]
#second
DT[, (my_name_list2[i]) := unlist(lapply(.SD,
function(x) rollapply(x,
7,
quantile,
fill = NA,
probs = 7/9)),
recursive = F),
.SDcols = my_col_list[i]]
}
#display it
head(DT,10)
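For what it's worth, quantile() accepts a vector of probs and rollapply() turns a vector-valued function into a multi-column result, so a sketch like the following should compute both percentiles in one rolling pass per column (I have not timed it against the loop above):
#both percentiles in a single pass for the first column
both <- rollapply(DT[[my_col_list[1]]], 7, quantile,
                  probs = c(1/3, 7/9), fill = NA)
DT[, c(my_name_list1[1], my_name_list2[1]) := as.data.table(both)]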
Microbenchmarking a once-through statistic against the rolling quantile shows that the quantiles are the expensive part.
library(microbenchmark)
res2 <- microbenchmark( DT[, (my_name_list1[i]) := unlist(lapply(.SD,
function(x) rollapply(x,
7,
mean,
fill = NA)),
recursive = F),
.SDcols = my_col_list[i]],
times = 5)
It says it takes about 1.75 milliseconds for a mean (the median takes 1.79 ms)
> res2
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, mean, fill = NA)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
1.465779 1.509114 1.754145 1.618591 1.712103 2.46514 5
but it takes roughly 70x that to compute a quantile
res3 <- microbenchmark( DT[, (my_name_list1[i]) := unlist(lapply(.SD,
function(x) rollapply(x,
7,
quantile,
fill = NA,
probs = 1/3)),
recursive = F),
.SDcols = my_col_list[i]],
times = 5)
res3
which gives:
> res3
Unit: milliseconds
expr
DT[, `:=`((my_name_list1[i]), unlist(lapply(.SD, function(x) rollapply(x, 7, quantile, fill = NA, probs = 1/3)), recursive = F)), .SDcols = my_col_list[i]]
min lq mean median uq max neval
118.5833 119.2896 122.8432 124.0168 124.4183 127.9082 5
UPDATES:
The median computed via "quantile" takes ~128 ms while "median"
takes much less. They are not the same code path.
Iterating through the 9 options for the "type" argument of "quantile" gives mean
times between 129 ms and 157 ms. There is no easy win there.
The package "WGCNA" requires "GO.db" from bioconductor, which is not
installed with the "install.packages" command. Also requires package
"impute" which is not installed with "WGCNA" or "GO.db". Also
"preprocessCore".
Using the (eventually working) WGCNA package reduced mean time for
rolling quantile to 68.1 ms. It is about half the time, but it is
not about 1/70th the time.
Using "RollingMedian" from the "RollingWindow" package I get 169.8
microseconds (aka 0.1698 milliseconds) which is a LOT faster, but is
not an arbitrary quantile.
Using "perccal" seems to drop computing a quantile down to ~145
microseconds. In the rollapply this drops compute time down to 15.3
milliseconds, which is an 8.5x boost. I'm not sure how much more
blood there is to squeeze out of this stone.
Thoughts:
The "perccal" approach seems to only be using a single core. There
may be some "parallel" process that allows me to split the summary
against different variables to different cores. This might give some
speedup.
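A minimal sketch of that idea (assuming a Unix-alike so that parallel::mclapply can fork, and reusing my_col_list and the rolling quantile from above; I have not benchmarked it):
#one core per variable for the rolling 33rd percentile
library(parallel)
roll_q <- function(col) rollapply(DT[[col]], 7, quantile, probs = 1/3, fill = NA)
res <- mclapply(my_col_list, roll_q, mc.cores = length(my_col_list))
DT[, (my_name_list1) := res]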
As the data grows, the speedup starts shrinking. Increasing to
9600 rows reduces the speedup from ~8.5x to below 1x.
This suggests that the rollapply wrapper itself may also be part of the problem.
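One thing I have not fully explored is replacing zoo::rollapply with data.table's own frollapply() (available in newer data.table versions); a sketch, assuming its align = "center" option matches rollapply's default centred alignment:
#rolling 33rd percentile via frollapply instead of rollapply
DT[, (my_name_list1) := lapply(.SD, function(x)
  frollapply(x, n = 7, FUN = quantile, probs = 1/3, align = "center")),
  .SDcols = my_col_list]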
Data table optimization
You are correct that the median is especially fast in this case; that is because it runs specialized C code, unlike the quantile function, which is pure R code.
We can read about this optimization in the data.table documentation under
?datatable.optimize
There we have:
When expressions in j which contains only these functions min, max,
mean, median, var, sd, prod, for example, dt[, list(mean(x),
median(x), min(y), max(y)), by=z], they are very effectively optimised
using, what we call, GForce. These functions are replaced with gmean,
gmedian, gmin, gmax instead
And they give an example for the speed improvement for the median case:
# Generate a big data.table with a relatively many columns
set.seed(1L)
dt = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(dt)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(dt), units="Mb") # 400MB, not huge, but will do
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- dt[, lapply(.SD, median), by=id])
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
On my system the R internal version is about 44x slower than the data.table version.
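As far as I know, you can confirm whether GForce actually kicked in for a given call by switching on data.table's verbose output, which reports the optimised j expression:
options(datatable.verbose = TRUE)
dt[, lapply(.SD, median), by = id] # the verbose output should mention GForce
options(datatable.verbose = FALSE)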
Speeding up quantile
We can still try to improve the speed of the quantile function in R. My approach here is basically "Use the source, Luke" and to look at the quantile function itself. Printing it gives the standard generic:
>> quantile
function (x, ...)
UseMethod("quantile")
<bytecode: 0x0000000009154c78>
<environment: namespace:stats>
We can trace this a bit more:
>> methods(quantile)
[1] quantile.default* quantile.ecdf* quantile.POSIXt* quantile.zoo
see '?methods' for accessing help and source code
and have a look at the default function.
>> stats:::quantile.default
function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
type = 7, ...)
{
...
}
Now we have the entire source, which is quite long, and we can compare it to the R median source in median.default. With the source in hand we can copy it as a user-defined function and profile it (with the small addition of supplying the namespace for format_perc). From that it is clear that only two parts are relevant: the sorting and the output formatting. The sorting is very similar to what median does and is likely hard to improve, but the formatting can be skipped entirely by commenting it out.
fast.quant <- function (x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
type = 7, ...)
{
if (is.factor(x)) {
...
...
if (names && np > 0L) {
#names(qs) <- stats:::format_perc(probs)
}
...
}
All in all this fix cuts the run time in half. It is still not the optimized median, but it is most likely hard to get much better performance without leaving R.
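A lighter-weight variant of the same idea, without copying any source code, is to pass names = FALSE to quantile, since that is the flag guarding the formatting line; in the rolling call from the question that would look something like this (I have not re-run the benchmark with it):
rollapply(DT[[my_col_list[1]]], 7, quantile, probs = 1/3, names = FALSE, fill = NA)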
It is possible, and maybe even probable, that the optimizations in data.table can be leveraged to help with the quantile calculations as well, since data.table also implements its sorting in C. One would, however, still like to exploit the fact that only partial sorting is required. Otherwise the Rcpp package can be used to perform a similar optimization.
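To sketch the partial-sorting idea in plain R: for a single probability a type-7 quantile only needs two order statistics, which sort(partial = ...) can deliver without fully sorting the vector. A rough, un-benchmarked illustration that assumes a numeric x with no NAs and skips all of quantile()'s input checking:
fast_q7 <- function(x, p) {
  n  <- length(x)
  h  <- (n - 1) * p + 1                       # type-7 index, possibly fractional
  lo <- floor(h); hi <- ceiling(h)
  xs <- sort(x, partial = unique(c(lo, hi)))  # only the needed order statistics are placed
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])       # linear interpolation between them
}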
Related
I am undertaking a new project encompassing large time-series datasets from which dependent calculations are fed into a shiny application. As such, efficiency is of interest to me. The operations are typically restricted to elementary period conversions and subsequent summary statistics for risk metrics.
I am investigating which library/approach to build the calculation scripts with. At present, I am OK with xts and data.table. Although I could resort to libraries such as quantmod and TTR, I am hesitant to deploy black-box functions in production and would prefer to maintain full traceability.
Thus far, I have carried out the following benchmarking exercise in which a data.frame of daily prices is converted into monthly returns. The packages used thus far are xts, data.table and quantmod (as reference). The code is pasted below but can also be found on GitHub.
Benchmark Code
# Simple return exercise: Daily Prices to Monthly Returns
# Input: Nx2 data.frame with columns (N days, price)
# Output: Mx2 object with columns (M months, return)
# Three different functions: 1. xts, 2. data.table, 3. quantmod
rm(list = ls()); gc()
library(data.table)
library(zoo)
library(xts)
library(ggplot2)
library(quantmod)
# Asset params
spot = 100
r = 0.01
sigma = 0.02
N = 1e5
# Input data: Nx2 data.frame (date, price)
pmat = data.frame(
date = seq.Date(as.Date('1970-01-01'), by = 1, length.out = N),
price = spot * exp(cumsum((r - 0.5 * sigma**2) * 1/N + (sigma * (sqrt(1/N)) * rnorm(N, mean = 0, sd = 1))))
)
# Output functions
# 1. xts standalone
xtsfun = function(mat){
xtsdf = as.xts(mat[, 2], order.by = mat[, 1])
eom_prices = to.monthly(xtsdf)[, 4]
mret = eom_prices/lag.xts(eom_prices) - 1; mret[1] = eom_prices[1]/xtsdf[1] - 1
mret
}
# 2. data.table standalone
dtfun = function(mat){
dt = setNames(as.data.table(mat), c('V1', 'V2'))
dt[, .(EOM = last(V2)), .(Month = as.yearmon(V1))][, .(Month, Return = EOM/shift(EOM, fill = first(mat[, 2])) - 1)]
}
# 3. quantmod (black box library)
qmfun = function(mat){
qmdf = as.xts(mat[, 2], order.by = mat[, 1])
monthlyReturn(qmdf)
}
# Check 1 == 2 == 3 (pairwise, since all.equal() compares two objects at a time):
all.equal(
unlist(dtfun(pmat[1:1000,])[, Return]),
as.numeric(xtsfun(pmat[1:1000,])),
scale = NULL
)
all.equal(
as.numeric(xtsfun(pmat[1:1000,])),
as.numeric(qmfun(pmat[1:1000,])),
scale = NULL
)
# Benchmark
library(microbenchmark)
gc()
mbm = microbenchmark(
xts = xtsfun(pmat),
data.table = dtfun(pmat),
quantmod = qmfun(pmat),
times = 50
)
mbm
Results
For N = 1e5, the three approaches perform similarly:
Unit: milliseconds
expr min lq mean median uq max neval
xts 20.62520 22.93372 25.14445 23.84235 27.25468 39.29402 50
data.table 21.23984 22.29121 27.28266 24.05491 26.25416 98.35812 50
quantmod 14.21228 16.71663 19.54709 17.19368 19.38106 102.56189 50
However, for N = 1e6 I observe a substantial performance divergence for data.table:
Unit: milliseconds
expr min lq mean median uq max neval
xts 296.8969 380.7494 408.7696 397.4292 431.1306 759.7227 50
data.table 1562.3613 1637.8787 1669.8513 1651.4729 1688.2312 1969.4942 50
quantmod 144.1901 244.2427 278.7676 268.4302 331.4777 418.7951 50
I am very curious as to what drives this result, particularly since data.table normally excels at large N. Of course, dtfun could just be poorly written (and I very much appreciate any code improvements), but I achieve similar results using other approaches including self-joins on EOM dates and cumprod on daily returns.
Do xts and/or quantmod benefit from any internal Rcpp or equivalent calls that improve their performance at scale? Lastly, if you are aware of any other competitive standalone solution (base? dplyr?) for large-scale time series, I am all ears.
The answer resides in data.table's date handling. In essence, it employs the ISOdate format, which is comparatively slow. When imposing integer-based date grouping instead, the results shift in favor of data.table.
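A sketch of what integer-based date grouping could look like for the dtfun above, using data.table's integer year() and month() helpers in place of zoo::as.yearmon (my own illustration; the exact fix adopted in the repository may differ):
dtfun_int = function(mat){
dt = setNames(as.data.table(mat), c('V1', 'V2'))
dt[, .(EOM = last(V2)), by = .(Month = year(V1) * 12L + month(V1))][, .(Month, Return = EOM/shift(EOM, fill = first(mat[, 2])) - 1)]
}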
I have updated the TSBenchmark repository with updated solutions for xts and data.table. I much appreciate the improvements provided by Joshua Ulrich and Matt Dowle, who deserve full credit.
I wrote a function that applies another function to each nested field in a data.table:
n <- 3000
nameslist <- paste0("NAME",1:n)
dt <- data.table(name_var = nameslist
, value_var = rnorm(1e7)
, car_color = c("B","B","B","G","G","G","G"))
dt <- dt[,.(.(.SD)), by = name_var] # nesting dt and finishing toy data creation
transform_value <- function(x, fun, campo, ...) {
x [, match.fun (fun)(get(campo), ...)] }
system.time({
dt[, min_value:=lapply(V1, transform_value, "min", "value_var")
]})
This runs correctly and pretty fast; it takes about 0.36 seconds on my machine for n <- 3000.
However, when I make n <- 500000 (nameslist 500,000 elements long), it takes 217 seconds!
Thus it seems that data.table is not effectively optimized by the code I wrote in the line
dt[, min_value := lapply(V1, transform_value, "min", "value_var")]. My take is that there is overhead each time data.table enters each nested data.table, but I am not sure.
What can I do to make it run faster?
Since you are interested in performance, you should definitely set a key. Once you have paid the fixed cost of setting the key (which reorders your rows into contiguous memory slots), you will see significant gains.
Taking your example again:
dt <- data.table(name_var = nameslist
, value_var = rnorm(1e7)
, car_color = c("B","B","B","G","G","G","G"))
dt2 <- data.table::copy(dt)
setkeyv(dt2, c("name_var"))
microbenchmark::microbenchmark(
dt[,.(.(.SD)), by = name_var],
dt2[,.(.(.SD)), by = name_var],
times = 20L
)
Unit: milliseconds
expr min lq mean median uq max neval
dt[, .(.(.SD)), by = name_var] 658.76452 676.22905 706.3578 699.46644 727.8368 793.1192 20
dt2[, .(.(.SD)), by = name_var] 91.62276 92.48002 131.2777 99.15238 135.1332 318.3719 20
For this step alone, on my laptop (which is not very fast), you see roughly a 7x reduction in computation time (the exact factor can vary a lot).
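Beyond the keying itself, if the end goal is just a per-name minimum, it may also be worth comparing against the unnested form, which lets data.table's GForce-optimised min kick in (a sketch, not benchmarked against the nested version):
dt[, .(min_value = min(value_var)), by = name_var]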
Usually, I find myself using a few summary functions or making my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable given a limit of distinct values:
library(magrittr) # for the %>% and %T>% pipes used below
table_transposed <- function(vector){
merge(as.data.frame(table(vector, dnn="values")),
as.data.frame(round(prop.table(table(vector, dnn="values")),2)),
by="values",
all.x=TRUE) %>%
data.table::transpose(keep.names = "values",
make.names = names(.)[1]) %T>%
{.[,c("values")] <- c("Count", "Percentage")}
}
table_transposed_filter <- function(dataframe, max_number_categories) {
(lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
as.vector() %>%
{dataframe[,.]} %>%
lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now, I'm trying to figure out an efficient, fast, and simple way to combine psych::describe(), Hmisc::describe(), others, and my own functions, for numeric and categorical variables (one descriptive function for each). Something like this (for numerical):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it there?
PS: I thought about putting this question on Meta Stack Exchange or Cross Validated, but neither seems to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little bit on larger data.
This makes sense, as the most expensive operation you do is table(): your version does it twice, my version does it once. I didn't profile the code, but my guess is that table() is the bottleneck by more than an order of magnitude on any sizeable data, so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
tab = table(x)
prop_tab = prop.table(tab)
df = data.frame(values = c("Count", "Percentage"))
df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
row.names(df) = NULL
df
}
t_transp_filter = function(data, n_max, ...) {
lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
Benchmarking:
microbenchmark::microbenchmark(
gregor = t_transp_filter(mtcars, n_max = 4),
OP = table_transposed_filter(mtcars, 4),
times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394 20 a
# OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048 20 b
set.seed(47)
df = as.data.frame(matrix(
c(sample(letters[1:5], size = 1e5 * 20, replace = T))
, ncol = 20))
microbenchmark::microbenchmark(
gregor = t_transp_filter(df, n_max = 5),
OP = table_transposed_filter(df, 5),
times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 59.5466 59.95545 63.6825 61.14075 67.2167 75.4270 20 a
# OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651 20 b
I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors to each other, and two nested paste statements for each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorize the sapply call that will make it an order of magnitude or three faster?
library(microbenchmark)
# create an example object like a simple_triplet_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
i = sample(1:n, nonzerovalues, replace = TRUE),
j = sample(1:ncols, nonzerovalues, replace = TRUE),
v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
sapply(1:n, function(k){
# microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
whichi <- which(i == k)
paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
})
}
microbenchmark(expensive())
The output of expensive is a character vector of n elements that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
For what it's worth, the motivation is to efficiently write data from a sparse matrix format - either from slam or Matrix, but starting with slam - into libsvm format (which is the format above, but with each row beginning with a number representing a target variable for a support vector machine - omitted in this example as it's not part of the speed problem). Trying to improve on the answers to this question, I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
Use the data.table package. Its by, combined with fast sorting, saves you from finding the indices of equal i values yourself.
res1 <- expensive()
library(data.table)
cheaper <- function() {
setDT(stm)
res <- stm[, .(i, jv = paste(j, v, sep = ":"))
][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
setDF(stm) #clean-up which might not be necessary
res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
microbenchmark(expensive(),
cheaper())
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417 100 b
# cheaper() 15.31835 15.66584 16.16267 15.98363 16.33637 18.35359 100 a
I have a dataframe that houses columns of numbers - I'd like to check the range across these columns by row and create a new column that contains this range.
tool1 tool2 tool3 range
1 34 12 33
NA 19 23 4
It has to be able to handle NAs too, by just ignoring them.
How could this be done?
I've decided to expand on this, because operating on rows in R is always a pain, so I've compared base R against the two very efficient packages data.table and dplyr.
(I'm not a dplyr expert, so if someone wants to modify my answer, please do)
Note:
Your case isn't a classic case of operating on rows because it can be solved using the vectorized pmax and pmin, which we won't always be able to use
So, creating somewhat bigger data than in your example
n <- 1e4
set.seed(123)
df <- data.frame(tool1 = sample(100, n, replace = T),
tool2 = sample(100, n, replace = T),
tool3 = sample(100, n, replace = T))
Loading the necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
Defining the functions
apply1 <- function(y) apply(y, 1, function(x) max(x, na.rm = T) - min(x, na.rm = T))
apply2 <- function(y) apply(y, 1, function(x) diff(range(x, na.rm = T)))
trans <- function(y) transform(y, range = pmax(tool1, tool2, tool3, na.rm = TRUE) - pmin(tool1, tool2, tool3, na.rm = TRUE))
DTfunc <- function(y) setDT(y)[, range := pmax(tool1, tool2, tool3, na.rm = TRUE) - pmin(tool1, tool2, tool3, na.rm = TRUE)]
DTfunc2 <- function(y) set(y, j = "range", value = with(y, pmax(tool1, tool2, tool3, na.rm = TRUE) - pmin(tool1, tool2, tool3, na.rm = TRUE))) # Thanks to #Arun for this
dplyrfunc <- function(y) mutate(y, range = pmax(tool1, tool2, tool3, na.rm = TRUE) - pmin(tool1, tool2, tool3, na.rm = TRUE))
df2 <- as.data.table(df) # This is in order to avoid overriding df by `setDT` during benchmarking
Running some benchmarks
microbenchmark(apply1(df), apply2(df), trans(df), DTfunc(df2), DTfunc2(df2), dplyrfunc(df), times = 100)
Unit: microseconds
expr min lq median uq max neval
apply1(df) 37221.513 40699.3790 44103.3495 46777.305 94845.463 100
apply2(df) 262440.581 278239.6460 287478.4710 297301.116 343962.869 100
trans(df) 1088.799 1178.3355 1234.9940 1287.503 1965.328 100
DTfunc(df2) 2068.750 2221.8075 2317.5680 2400.400 5935.883 100
DTfunc2(df2) 903.981 959.0435 986.3355 1026.395 1235.951 100
dplyrfunc(df) 1040.280 1118.9635 1159.9815 1200.680 1509.189 100
It seems like the second data.table approach is the most efficient. Base R's transform and dplyr are both pretty much the same, while being more efficient than the first data.table approach because of the overhead of calling [.data.table