R data table: strangely poor performance in subsetting

I was under the impression that data.table is extremely well optimized, so I was quite surprised to see this:
library(data.table)
SimData <- data.table(
  ID = sample(1:4e5, 4e6, replace = TRUE),
  DATE = sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
                4e6, replace = TRUE)
)
microbenchmark::microbenchmark(SimData[ID==1&DATE>="2005-01-01"])
microbenchmark::microbenchmark(SimData[ID==1][DATE>="2005-01-01"])
The two solutions obviously return the same result, yet there is more than an order of magnitude difference in runtime. Is it possible that data.table performs so poorly with the first form (i.e., that it can't automatically optimize this call)? Or am I overlooking something here?

The slow operation is SimData[DATE>="2005-01-01"], because it returns millions of rows.
microbenchmark::microbenchmark(SimData[DATE>="2005-01-01"],SimData[ID==1])
Unit: microseconds
expr min lq mean median uq max neval
SimData[DATE >= "2005-01-01"] 32542.8 44549.55 51323.53 47529.75 50258.10 117396.3 100
SimData[ID == 1] 820.0 1043.55 1397.79 1435.15 1688.25 2302.5 100
SimData[ID == 1] is much faster, as it returns only a few rows.
When you execute SimData[ID==1&DATE>="2005-01-01"], you force both conditions to be evaluated on all rows.
With SimData[ID==1][DATE>="2005-01-01"], the quick operation is done first, and the subsequent filter is also quick because it is applied to only a few rows.
As mentioned by @jangorecki, there is room for improvement in that matter.
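The effect described above can be sketched with plain vectors standing in for the two columns. This is a minimal base R illustration, not data.table's internals, and the names ids, dates and cutoff are made up for the example. The one-step form compares every element of both vectors, while the two-step form only looks at the dates of rows that already matched the ID:

```r
set.seed(42)
n <- 1e5
days <- seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day")
ids <- sample(1:1000, n, replace = TRUE)
dates <- sample(days, n, replace = TRUE)
cutoff <- as.Date("2005-01-01")

# One step: both comparisons run over all n elements before `&` combines them.
rows_one_step <- which(ids == 1L & dates >= cutoff)

# Two steps: the cheap, selective test runs first ...
id_matches <- which(ids == 1L)
# ... and the date comparison only touches the few matching rows.
rows_two_step <- id_matches[dates[id_matches] >= cutoff]

# Both approaches select exactly the same rows.
stopifnot(identical(rows_one_step, rows_two_step))
```

The saving comes purely from how many elements the second comparison has to visit, which is why the order of the chained filters matters.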

data.table optimizes queries of the form X==x, and ID==1 has this form. The first time you run such a query it takes a while, but after that, running the same query is very fast. In your case, the second and later runs of SimData[ID==1] are very fast, and the returned data table is very small, which makes the second filter fast as well.
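One way to look at this behaviour is to inspect the indices attached to a table after an equality subset. The sketch below assumes a reasonably recent data.table with its default options (in particular datatable.auto.index left switched on); the table dt and the column VALUE are invented for the example, and whether an index is listed may depend on your version and settings:

```r
library(data.table)

set.seed(1)
dt <- data.table(ID = sample(1:4e4, 4e5, replace = TRUE),
                 VALUE = rnorm(4e5))

first_run  <- dt[ID == 1L]  # may build an index on ID as a side effect
second_run <- dt[ID == 1L]  # can reuse that index if it exists

# Lists the indices currently attached to dt (NULL if there are none).
print(indices(dt))

# Whether or not an index is used, the result is the same.
stopifnot(isTRUE(all.equal(first_run, second_run)))
```

When benchmarking, this means the first evaluation of an equality subset can be much slower than the repetitions that follow it.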

Related

Are 'j'-expressions in 'data.table' automatically parallelised?

How should I understand the parallelism built into data.table objects? From the getDTthreads function documentation, it seems that shared memory parallelism is employed using OpenMP. That seems fairly low level, and I imagine that it only works for a certain subset of overloaded functions and operators.
Or, is data.table somehow smart enough to split work for even more complicated expressions? More specifically, to parallelize a j-expression, what restrictions do I need to take into account?
Not to run too much afoul of Stack Overflow's question policy, here is an example. I often want to apply a function to each object in a huge data.table. For example,
library(data.table)
n <- 100000L
dt <- data.table(a = rnorm(n), b = rnorm(n))
dt[, c := sapply(a, function(x) paste(x, 'silly example'))]
Would the sapply call in the j-expression work on chunks of column a in parallel? Or is it a plain old base R sapply, which works sequentially?
If the latter is the case, then is embedding one of R's many parallel computing frameworks inside the j-expression a good approach? For example, can I safely and efficiently call foreach, future, et al. in the j-expression?
From ?setDTthreads:
Internally parallelized code is used in the following places:
between.c - between()
cj.c - CJ()
coalesce.c - fcoalesce()
fifelse.c - fifelse()
fread.c - fread()
forder.c, fsort.c, and reorder.c - forder() and related
froll.c, frolladaptive.c, and frollR.c - froll() and family
fwrite.c - fwrite()
gsumm.c - GForce in various places, see GForce
nafill.c - nafill()
subset.c - Used in [.data.table subsetting
types.c - Internal testing usage
My understanding is that you should not expect data.table to make use of multithreading outside of the above use cases. Note that [.data.table uses multithreading for subsetting only, i.e., in i-expressions but not j-expressions. That is presumably just to speed up relational and logical operations, as in x[!is.na(a) & a > 0].
In a j-expression, sum and sapply are still just base::sum and base::sapply. You can test this with a benchmark:
library("data.table")
setDTthreads(4L)
x <- data.table(a = rnorm(2^25))
microbenchmark::microbenchmark(sum(x$a), x[, sum(a)], times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
sum(x$a) 51.61281 51.68317 51.95975 51.84204 52.09202 56.67213 1000
x[, sum(a)] 51.78759 51.89054 52.18827 52.07291 52.33486 61.11378 1000
x <- data.table(a = seq_len(1e+04L))
microbenchmark::microbenchmark(sapply(x$a, paste, "is a good number"), x[, sapply(a, paste, "is a good number")], times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(x$a, paste, "is a good number") 14.07403 15.7293 16.72879 16.31326 17.49072 45.62300 1000
x[, sapply(a, paste, "is a good number")] 14.56324 15.9375 17.03164 16.48971 17.69045 45.99823 1000
where it is clear that simply putting code into a j-expression does not improve performance.
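For the specific example in the question, a speed-up is available without any parallelism, because paste() is already vectorised over its arguments and the sapply() wrapper is not needed. A minimal base R check that the two forms agree (the variable names are illustrative only):

```r
a <- seq_len(1e4L)

# Element-by-element, as in the benchmark above.
looped <- sapply(a, paste, "is a good number")

# One vectorised call over the whole column.
vectorised <- paste(a, "is a good number")

stopifnot(identical(looped, vectorised))
```

Inside a data.table, the same idea would presumably read dt[, c := paste(a, 'silly example')]; this only helps when the function being applied is itself vectorised.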
data.table does recognize and handle certain constructs exceptionally. For instance, data.table uses its own radix-based forder instead of base::order when it sees x[order(...)]. (This feature is somewhat redundant now that users of base::order can request data.table's radix sort by passing method = "radix".) I haven't seen a "master list" of such exceptions.
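As a small illustration of the method argument mentioned above, the following base R sketch requests the radix method explicitly and checks that it produces the same ordering as the default on an integer vector. Note that for character vectors the radix method sorts in the C locale, so results there can differ from the default:

```r
set.seed(7)
v <- sample(1e5)  # a random permutation of 1..100000

by_default <- order(v)
by_radix <- order(v, method = "radix")

stopifnot(identical(by_default, by_radix))
```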
As for whether using, e.g., parallel::mclapply inside of a j-expression can have performance benefits, I think the answer (as usual) depends on what you are trying to do and the scale of your data. Ultimately, you'll have to do your own benchmarks and profiling to find out. For example:
library("parallel")
cl <- makePSOCKcluster(4L)
microbenchmark::microbenchmark(x[, sapply(a, paste, "is a good number")], x[, parSapply(cl, a, paste, "is a good number")], times = 1000L)
stopCluster(cl)
Unit: milliseconds
expr min lq mean median uq max neval
x[, sapply(a, paste, "is a good number")] 14.553934 15.982681 17.105667 16.585525 17.864623 48.81276 1000
x[, parSapply(cl, a, paste, "is a good number")] 7.675487 8.426607 9.022947 8.802454 9.334532 25.67957 1000
So it is possible to see speed-up, though sometimes you pay the price in memory usage. For small enough problems, the overhead associated with R-level parallelism can definitely outweigh the performance benefits.
You'll find a good thread about integrating parallel and data.table (including reasons not to) here.

Code performance: apply family or optimized alternatives

I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about within base and not dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point. I am now consistently seeing contradicting conclusions.
library(microbenchmark)
n = 10000
p = 100
test_matrix <- matrix(runif(n*p), n, p)
test_df <- as.data.frame(test_matrix)
benchmark <- microbenchmark(
  colMeans(test_df),
  colMeans(as.matrix(test_df)),
  sapply(test_df, mean),
  vapply(test_df, mean, 0),
  colMeans(test_matrix),
  apply(test_matrix, 2, mean)
)
Unit: microseconds
expr min lq mean median uq max neval
colMeans(test_df) 3099.941 3165.8290 3733.024 3241.345 3617.039 11387.090 100
colMeans(as.matrix(test_df)) 3091.634 3158.0880 3553.537 3241.345 3548.507 8531.067 100
sapply(test_df, mean) 2209.227 2267.3750 2723.176 2338.172 2602.289 10384.612 100
vapply(test_df, mean, 0) 2180.153 2228.2945 2611.982 2270.584 2514.123 7421.356 100
colMeans(test_matrix) 904.307 915.0685 1020.085 939.422 1002.667 2985.911 100
apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009 100
For a matrix, colMeans() torches apply(). That is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead of converting the data frame into a matrix.
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop off. Obviously this makes an assumption about the input the user pushes in, but that is not the point here.
colMeans2 <- function(myobject) {
  if (typeof(myobject) == "double") {
    colMeans(myobject)
  } else if (typeof(myobject) == "list") {
    vapply(myobject, mean, 0)
  } else {
    stop("what is this")
  }
}
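One possible shape for the "more formal version" is sketched below. It is only an illustration: col_means_sketch is a hypothetical name, and the choice to dispatch on is.data.frame() and is.matrix() rather than typeof() is an assumption (typeof() would, for instance, reject an integer matrix). The check at the end confirms that both branches give the same means for the same data:

```r
# Hypothetical helper: dispatch on the kind of input rather than on typeof().
col_means_sketch <- function(x) {
  if (is.data.frame(x)) {
    # A data frame is a list of columns; vapply() avoids the as.matrix() copy.
    stopifnot(all(vapply(x, is.numeric, logical(1))))
    vapply(x, mean, numeric(1))
  } else if (is.matrix(x) && is.numeric(x)) {
    colMeans(x)
  } else {
    stop("expected a numeric matrix or a data frame of numeric columns")
  }
}

set.seed(123)
test_matrix <- matrix(runif(1000 * 10), 1000, 10)
test_df <- as.data.frame(test_matrix)

from_matrix <- col_means_sketch(test_matrix)
from_df <- col_means_sketch(test_df)

# vapply() keeps the column names (V1, V2, ...), so drop them before comparing.
stopifnot(isTRUE(all.equal(from_matrix, unname(from_df))))
```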
For Reference
Here are two other posts I could find, both somewhat related and mentioning how colMeans() should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?

R - apply or for loop depending on results of previous iterations

I am currently working on a problem where multiple functions are executed iteratively. For each iteration, the input depends on the results of the previous run. Currently I use a for loop; however, to speed up the runs I am interested in replacing this loop with an apply function.
An apply function does not normally carry changes to variables in the global environment from one iteration to the next. However, global variables can be changed directly with <<-. Hence, the following two snippets are equivalent.
a <- 1
sapply(seq_len(5), function(x) {
  a <<- a + 1
})
a <- 1
for (i in seq_len(5)) {
  a <- a + 1
}
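A quick way to confirm that the two snippets leave a in the same state is to run both and compare. This is a small sketch; the names returned, a_after_sapply and a_after_loop are introduced only for the check:

```r
a <- 1
returned <- sapply(seq_len(5), function(x) {
  a <<- a + 1
})
a_after_sapply <- a

a <- 1
for (i in seq_len(5)) {
  a <- a + 1
}
a_after_loop <- a

# Both versions increment a five times.
stopifnot(a_after_sapply == 6, a_after_loop == 6)

# sapply() additionally collects the value of each assignment.
stopifnot(identical(returned, c(2, 3, 4, 5, 6)))
```

The one visible difference is that sapply() also builds and returns a result vector, which the for loop does not.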
Could a change from for loops to an apply function which makes direct changes to global variables result in a decrease in calculation time?
No, it will not be faster.
We can compare using the microbenchmark package:
n = 1e5
microbenchmark::microbenchmark(
  sapply = {
    a <- 1
    sapply(seq_len(n), function(x) {
      a <<- a + 1
    })
  },
  forloop = {
    a <- 1
    for (i in seq_len(n)) {
      a <- a + 1
    }
  }
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# sapply 55.081023 67.740821 86.924793 78.312672 100.079169 424.137078 100 b
# forloop 3.950579 4.267804 4.666161 4.492243 4.764634 8.714735 100 a
On average, the sapply version is almost 20x slower than the for loop version on input of length 100k. Global assignment is apparently expensive: when I also tried running the for loop with <<-, the difference was closer to 3x.
But this difference is basically meaningless. If we look per iteration, the sapply code takes 0.078 seconds / 100k iterations = 780 nanoseconds per iteration. The for loop takes about 45 nanoseconds per iteration. Your actual code is hopefully doing something more interesting than a single addition, so it's probably taking microseconds, or more probably milliseconds, maybe even seconds, per iteration.
If you want to speed up code, you need to speed up the part that actually takes time, not try to save a few hundred nanoseconds (still less than 1 microsecond) per iteration by changing how you are iterating. Look up "code profiling" (here's a good link to get you started) for guidance on how to identify the slow parts of your code.
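The per-iteration figures can be reproduced from the median timings in the benchmark table above. A tiny sketch of that arithmetic (the vector names are made up for the example):

```r
n <- 1e5

# Median timings in milliseconds, copied from the benchmark output above.
median_ms <- c(sapply = 78.312672, forloop = 4.492243)

# 1 millisecond = 1e6 nanoseconds; divide by the number of iterations.
per_iteration_ns <- median_ms * 1e6 / n
print(round(per_iteration_ns))
```

Both figures are well under a microsecond, which is the point being made: the looping construct itself is rarely where the time goes.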

Evaluate multiline codeblock with microbenchmark

Is it possible to evaluate a codeblock consisting of multiple lines of code with microbenchmark? If so, how?
Example:
We have some numeric data in character columns:
testdata <- tibble::tibble(col1 = runif(1000), col2 = as.character(runif(1000)), col3 = as.character(runif(1000)))
Now we can try different ways of converting these.
We can directly call as.numeric on the columns:
testdata$col2 <- as.numeric(testdata$col2)
testdata$col3 <- as.numeric(testdata$col3)
We could try doing it inside a dplyr mutate:
testdata <- dplyr::mutate(testdata, col2 = as.numeric(col2),
col3 = as.numeric(col3))
Or maybe we know all columns should be numeric so we can try something less explicit that does some checking:
testdata <- dplyr::mutate_if(testdata, .predicate = is.character, .funs = as.numeric)
Now we want to compare the performance of these 3 options.
The latter 2 options are individual calls, so these can easily be tested in microbenchmark, but the first option consists of two separate calls. We could wrap the two calls in a function and then evaluate that in microbenchmark, but this introduces the slight overhead of the function call, so it isn't technically evaluating the solution that we have now. We could include the calls separately in the microbenchmark and add the results up afterwards, which should do fine for the mean, but for things like the min or the max this doesn't necessarily give sensible results.
The examples in the docs for microbenchmark mostly use simple individual expressions and often use a simple function to wrap code.
Is it possible to directly input multiple lines of code into microbenchmark to be evaluated together?
Multiple lines of code can be evaluated as one block in microbenchmark by wrapping them in {} and separating them with ; (or newlines):
bench <- microbenchmark::microbenchmark(
  separate = {as.numeric(testdata$col2); as.numeric(testdata$col3)},
  mutate = dplyr::mutate(testdata, col2 = as.numeric(col2),
                         col3 = as.numeric(col3)),
  mutateif = dplyr::mutate_if(testdata, .predicate = is.character, .funs = as.numeric)
)
Which gives the following results:
> bench
Unit: microseconds
expr min lq mean median uq max neval
separate 477.014 529.708 594.8982 576.4275 611.6275 1109.762 100
mutate 3410.351 3633.070 4465.0583 3876.6975 4446.0845 34298.910 100
mutateif 5118.725 5365.126 7241.5727 5637.5520 6290.7795 118874.982 100
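The same braces idiom works with any function that takes an expression to time. The sketch below uses base R's system.time() so that it runs without extra packages, and only calls microbenchmark if it happens to be installed; col2 and col3 are stand-in character vectors rather than the tibble columns from the question:

```r
set.seed(1)
col2 <- as.character(runif(1000))
col3 <- as.character(runif(1000))

# Statements inside { } can be separated by newlines instead of semicolons.
timing <- system.time({
  col2_num <- as.numeric(col2)
  col3_num <- as.numeric(col3)
})
print(timing[["elapsed"]])

# The same block can be handed to microbenchmark as a single named expression.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  bench <- microbenchmark::microbenchmark(
    separate = {
      as.numeric(col2)
      as.numeric(col3)
    },
    times = 10L
  )
  print(bench)
}
```

Because the whole braced block is one expression, the reported min and max refer to the two conversions run together, which is what adding up separately timed calls cannot give you.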

Speed up the calculation of squared error for large arrays in R

Basically I am helping someone to write some code for their research, but my usual time saving tactics have not reduced the run time of her algorithm enough for it to be reasonable. I was hoping someone else might know a better way to make a function run quickly based on an example I have written to avoid including information about the research.
The object in the example is smaller than the one she is using (but can easily be made larger). For the actual algorithm, this piece takes about 3 minutes in a small case, but might take 8-10 in the full case, and needs to run probably 1000-10000 times. This is the reason I need to seriously reduce the run time.
How I am currently doing this (hopefully with enough comments to make my thought process obvious):
example<-array(rnorm(100000), dim=c(5, 25, 40, 20))
observation <- array(rnorm(600), dim=c(5, 5, 12))
calc.err <- function(value, observation) {
  #'This creates the squared error for each observation, and each point in the
  #'example array, across the five values in the first dimension of each
  sqError <- (value - observation)^2
  #'the apply function here sums up the squared error for each observation and
  #'point. This is the value returned
  return(apply(sqError, c(2,3), function(x) sum(x)))
}
run<-apply(example, c(2,3,4), function(x) calc.err(x, observation))
#'It isn't returned in the right format (small problem) but reformatting is fast
format<-array(run, dim=c(5, 12, 25, 40, 20))
Will clarify if necessary.
edit:
The data.table package appears to be very helpful. I will have to learn that package, but preliminary tests seem to be much faster. I guess I was working with arrays because the code she gave me to make faster had the objects formatted that way. I didn't even think about changing it.
Here are a couple of simple refactors along with the timings:
calc.err2 <- function(value, observation) {
  #'This creates the squared error for each observation, and each point in the
  #'example array, across the five values in the first dimension of each
  sqError <- (value - observation)^2
  #' getting rid of the anonymous function
  apply(sqError, c(2,3), sum)
}
calc.err3 <- function(value, observation) {
  #'This creates the squared error for each observation, and each point in the
  #'example array, across the five values in the first dimension of each
  sqError <- (value - observation)^2
  #' replacing with colSums
  colSums(sqError)
}
R>microbenchmark(times=8, apply(example, 2:4, calc.err, observation),
+ apply(example, 2:4, calc.err2, observation),
+ apply(example, 2:4, calc.err3, observation)
+ )
Unit: milliseconds
expr min lq mean median uq max neval
apply(example, 2:4, calc.err, observation) 2284.350162 2321.875878 2349.7524509 2336.6661645 2393.3452420 2409.894876 8
apply(example, 2:4, calc.err2, observation) 2194.316755 2257.007572 2301.7896566 2298.9346090 2362.5479790 2383.020177 8
apply(example, 2:4, calc.err3, observation) 645.004808 652.567611 681.3176878 667.9070175 720.7049605 723.177516 8
colSums is way faster than the corresponding apply.
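Going a step further than the answer above, the outer apply() could in principle be removed as well by expanding the square, since sum((e - o)^2) = sum(e^2) + sum(o^2) - 2 * sum(e * o), and the cross term for all pairs is a single matrix product. The following is a sketch of that idea, not something taken from the original answer; it uses smaller arrays than the question, and the names calc_err_ref, obs_mat, ex_mat and sq_err are made up for the example:

```r
set.seed(2024)
# Smaller arrays than in the question so the check runs quickly.
example <- array(rnorm(5 * 6 * 4 * 3), dim = c(5, 6, 4, 3))
observation <- array(rnorm(5 * 5 * 12), dim = c(5, 5, 12))

# Reference: the apply()/colSums() approach from above.
calc_err_ref <- function(value, observation) colSums((value - observation)^2)
reference <- array(apply(example, 2:4, calc_err_ref, observation),
                   dim = c(5, 12, 6, 4, 3))

# Flatten everything except the first dimension into columns.
obs_mat <- matrix(observation, nrow = 5)  # 5 x 60
ex_mat <- matrix(example, nrow = 5)       # 5 x 72

# sum((e - o)^2) = sum(o^2) + sum(e^2) - 2 * sum(o * e), for every pair of columns.
sq_err <- outer(colSums(obs_mat^2), colSums(ex_mat^2), "+") -
  2 * crossprod(obs_mat, ex_mat)
vectorised <- array(sq_err, dim = c(5, 12, 6, 4, 3))

# Agreement up to floating-point tolerance.
stopifnot(isTRUE(all.equal(reference, vectorised)))
```

The expanded form can lose a little precision through cancellation when the two vectors are nearly equal, and it materialises the full result matrix in one go, so it is worth benchmarking and checking against the apply() version on real data before relying on it.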

Resources