I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about within base and not dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point. I am now consistently seeing contradicting conclusions.
library(microbenchmark)
n <- 10000
p <- 100
test_matrix <- matrix(runif(n*p), n, p)
test_df <- as.data.frame(test_matrix)
benchmark <- microbenchmark(
colMeans(test_df),
colMeans(as.matrix(test_df)),
sapply(test_df, mean),
vapply(test_df, mean, 0),
colMeans(test_matrix),
apply(test_matrix, 2, mean)
)
Unit: microseconds
expr min lq mean median uq max neval
colMeans(test_df) 3099.941 3165.8290 3733.024 3241.345 3617.039 11387.090 100
colMeans(as.matrix(test_df)) 3091.634 3158.0880 3553.537 3241.345 3548.507 8531.067 100
sapply(test_df, mean) 2209.227 2267.3750 2723.176 2338.172 2602.289 10384.612 100
vapply(test_df, mean, 0) 2180.153 2228.2945 2611.982 2270.584 2514.123 7421.356 100
colMeans(test_matrix) 904.307 915.0685 1020.085 939.422 1002.667 2985.911 100
apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009 100
For a matrix, colMeans() torches apply(). That is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead of converting the data frame into a matrix.
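To make the conversion overhead concrete, here is a minimal check on a small data frame (the sizes are arbitrary). colMeans() coerces a data frame with as.matrix() before averaging, which is consistent with the first two benchmark rows being nearly identical.

```r
set.seed(1)
test_df <- as.data.frame(matrix(runif(20), 5, 4))

# colMeans() on a data frame first converts it with as.matrix(),
# so these two calls end up doing the same work.
via_df  <- colMeans(test_df)
via_mat <- colMeans(as.matrix(test_df))
identical(via_df, via_mat)

# Column-wise mean() skips the conversion; the values agree up to
# floating-point tolerance.
all.equal(via_df, vapply(test_df, mean, numeric(1)))
```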
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop off. Obviously this makes an assumption about the input the user pushes in, but that is not the point here.
colMeans2 <- function(myobject) {
if (typeof(myobject) == "double") {
colMeans(myobject)
} else if (typeof(myobject) == "list") {
vapply(myobject, mean, 0)
} else {
stop("what is this")
}
}
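For what it is worth, a slightly more formal version might dispatch on class rather than typeof(), since typeof() rejects an integer matrix. This is only a sketch: col_means_dispatch is a hypothetical name, and the column-type check and the na.rm argument are my own additions.

```r
col_means_dispatch <- function(x, na.rm = FALSE) {
  if (is.data.frame(x)) {
    # Refuse non-numeric columns up front instead of getting NA plus warnings
    ok <- vapply(x, function(col) is.numeric(col) || is.logical(col), logical(1))
    if (!all(ok)) stop("all columns must be numeric or logical")
    # Column-wise mean avoids the as.matrix() conversion
    vapply(x, mean, numeric(1), na.rm = na.rm)
  } else if (is.matrix(x) && (is.numeric(x) || is.logical(x))) {
    colMeans(x, na.rm = na.rm)
  } else {
    stop("expected a numeric matrix or a data frame of numeric columns")
  }
}

set.seed(1)
m  <- matrix(runif(50), 10, 5)
df <- as.data.frame(m)
all.equal(unname(col_means_dispatch(df)), col_means_dispatch(m))
```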
For Reference
Here are two other posts I could find, both somewhat related and mentioning how colMeans() should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?
Related
I was under the impression that data.table is extremely well optimized, so I was quite surprised to see this:
library(data.table)
SimData <- data.table(ID = sample(1:4e5, 4e6, replace = TRUE),
DATE = sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
4e6, replace = TRUE))
microbenchmark::microbenchmark(SimData[ID==1&DATE>="2005-01-01"])
microbenchmark::microbenchmark(SimData[ID==1][DATE>="2005-01-01"])
The two solutions are quite obviously the same, yet there is more than an order of magnitude difference in runtime. Is it possible that data.table performs so poorly with the first form (i.e., that it can't automatically optimize this call)? Or am I overlooking something here?
The long operation is SimData[DATE>="2005-01-01"] because it returns millions of rows.
microbenchmark::microbenchmark(SimData[DATE>="2005-01-01"],SimData[ID==1])
Unit: microseconds
expr min lq mean median uq max neval
SimData[DATE >= "2005-01-01"] 32542.8 44549.55 51323.53 47529.75 50258.10 117396.3 100
SimData[ID == 1] 820.0 1043.55 1397.79 1435.15 1688.25 2302.5 100
SimData[ID == 1] is much shorter as it returns only a few rows.
When you execute SimData[ID==1&DATE>="2005-01-01"], you force both evaluations on all rows.
With SimData[ID==1][DATE>="2005-01-01"] the quick operation is done first, and the subsequent filter is also quick because it is applied to only a few rows.
As mentioned by @jangorecki, there is room for improvement in that matter.
data.table optimizes queries of the form X == x, and ID == 1 is in this form. The first time you run such a query it takes a while, but after that the same query is very fast. In your case, the second run of SimData[ID==1] is very fast, and the data.table it returns is very small, which makes the subsequent DATE filter fast as well.
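The reasoning above can be sketched outside data.table with plain vectors as a stand-in. This is not how data.table is implemented; it only illustrates why testing the selective condition first leaves very little work for the second one. The vector names and sizes are made up for illustration.

```r
set.seed(42)
n  <- 1e5
id <- sample(1:1e4, n, replace = TRUE)
date <- sample(seq(as.Date("2000-01-01"), as.Date("2019-12-31"), by = "day"),
               n, replace = TRUE)
cutoff <- as.Date("2005-01-01")

# Combined form: both comparisons are evaluated on all n rows
rows_combined <- which(id == 1 & date >= cutoff)

# Chained form: the selective test runs first, and the date test
# only touches the handful of surviving rows
cand <- which(id == 1)
rows_chained <- cand[date[cand] >= cutoff]

identical(rows_combined, rows_chained)
```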
How should I understand the parallelism built into data.table objects? From the getDTthreads function documentation, it seems that shared memory parallelism is employed using OpenMP. That seems fairly low level, and I imagine that it only works for a certain subset of overloaded functions and operators.
Or, is data.table somehow smart enough to split work for even more complicated expressions? More specifically, to parallelize a j-expression, what restrictions do I need to take into account?
Not to run too much afoul of Stack Overflow's question policy, here is an example. I often want to apply a function to each object in a huge data.table. For example,
library(data.table)
n <- 100000L
dt <- data.table(a = rnorm(n), b = rnorm(n))
dt[, c := sapply(a, function(x) paste(x, 'silly example'))]
Would the sapply call in the j-expression work on chunks of column a in parallel? Or is it a plain old base R sapply, which works sequentially?
If the latter is the case, then is embedding one of R's many parallel computing frameworks inside the j-expression a good approach? For example, can I safely and efficiently call foreach, future, et al. in the j-expression?
From ?setDTthreads:
Internally parallelized code is used in the following places:
between.c - between()
cj.c - CJ()
coalesce.c - fcoalesce()
fifelse.c - fifelse()
fread.c - fread()
forder.c, fsort.c, and reorder.c - forder() and related
froll.c, frolladaptive.c, and frollR.c - froll() and family
fwrite.c - fwrite()
gsumm.c - GForce in various places, see GForce
nafill.c - nafill()
subset.c - Used in [.data.table subsetting
types.c - Internal testing usage
My understanding is that you should not expect data.table to make use of multithreading outside of the above use cases. Note that [.data.table uses multithreading for subsetting only, i.e., in i-expressions but not j-expressions. That is presumably just to speed up relational and logical operations, as in x[!is.na(a) & a > 0].
In a j-expression, sum and sapply are still just base::sum and base::sapply. You can test this with a benchmark:
library("data.table")
setDTthreads(4L)
x <- data.table(a = rnorm(2^25))
microbenchmark::microbenchmark(sum(x$a), x[, sum(a)], times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
sum(x$a) 51.61281 51.68317 51.95975 51.84204 52.09202 56.67213 1000
x[, sum(a)] 51.78759 51.89054 52.18827 52.07291 52.33486 61.11378 1000
x <- data.table(a = seq_len(1e+04L))
microbenchmark::microbenchmark(sapply(x$a, paste, "is a good number"), x[, sapply(a, paste, "is a good number")], times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(x$a, paste, "is a good number") 14.07403 15.7293 16.72879 16.31326 17.49072 45.62300 1000
x[, sapply(a, paste, "is a good number")] 14.56324 15.9375 17.03164 16.48971 17.69045 45.99823 1000
where it is clear that simply putting code into a j-expression does not improve performance.
data.table does recognize and handle certain constructs exceptionally. For instance, data.table uses its own radix-based forder instead of base::order when it sees x[order(...)]. (This feature is somewhat redundant now that users of base::order can request data.table's radix sort by passing method = "radix".) I haven't seen a "master list" of such exceptions.
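As a small base R illustration of the method argument mentioned above (no data.table needed), the radix method can be requested explicitly. For a plain numeric vector like this one the resulting permutation should match the default call.

```r
set.seed(1)
x <- runif(1e4)

# Explicitly request the radix method in base::order()
o_default <- order(x)
o_radix   <- order(x, method = "radix")

identical(o_default, o_radix)
```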
As for whether using, e.g., parallel::mclapply inside of a j-expression can have performance benefits, I think the answer (as usual) depends on what you are trying to do and the scale of your data. Ultimately, you'll have to do your own benchmarks and profiling to find out. For example:
library("parallel")
cl <- makePSOCKcluster(4L)
microbenchmark::microbenchmark(x[, sapply(a, paste, "is a good number")], x[, parSapply(cl, a, paste, "is a good number")], times = 1000L)
stopCluster(cl)
Unit: milliseconds
expr min lq mean median uq max neval
x[, sapply(a, paste, "is a good number")] 14.553934 15.982681 17.105667 16.585525 17.864623 48.81276 1000
x[, parSapply(cl, a, paste, "is a good number")] 7.675487 8.426607 9.022947 8.802454 9.334532 25.67957 1000
So it is possible to see speed-up, though sometimes you pay the price in memory usage. For small enough problems, the overhead associated with R-level parallelism can definitely outweigh the performance benefits.
You'll find a good thread about integrating parallel and data.table (including reasons not to) here.
I have seen many variations on the "split vector X into Y chunks in R" question on here. See for example: here and here for just two. So, when I realized I needed to split a vector into Y chunks of random size, I was surprised to find that the randomness requirement might be "new"--I couldn't find a way to do this on here.
So, here's what I've drawn up:
library(magrittr) # provides the %>% pipe used below
k.chunks = function(seq.size, n.chunks) {
break.pts = sample(1:seq.size, n.chunks, replace=F) %>% sort() #Get a set of break points chosen from along the length of the vector without replacement so no duplicate selections.
groups = rep(NA, seq.size) #Set up the empty output vector.
groups[1:break.pts[1]] = 1 #Set the first set of group affiliations because it has a unique start point of 1.
for (i in 2:(n.chunks)) { #For all other chunks...
groups[break.pts[i-1]:break.pts[i]] = i #Set the respective group affiliations
}
groups[break.pts[n.chunks]:seq.size] = n.chunks #Set the last group affiliation because it has a unique endpoint of seq.size.
return(groups)
}
My question is: Is this inelegant or inefficient somehow? It will get called 1000s of times in the code I plan to do, so efficiency is important to me. It'd be especially nice to avoid the for loop or having to set both the first and last groups "manually." My other question: Are there logical inputs that could break this? I recognize that n.chunks cannot > seq.size, so I mean other than that.
That should be pretty quick for smaller numbers. But here is a more concise way.
k.chunks2 = function(seq.size, n.chunks) {
break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
break.len <- diff(c(0, break.pts, seq.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
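One edge case that may matter: because the break points are drawn from 1:seq.size, a break point can land on seq.size itself, which leaves the last chunk with length zero. If every chunk must be non-empty, a variant along these lines might work. This is a sketch, and k.chunks2b is a hypothetical name, not something from the question.

```r
k.chunks2b <- function(seq.size, n.chunks) {
  stopifnot(n.chunks >= 1, n.chunks <= seq.size)
  # Draw break points from 1..(seq.size - 1) so none sits on the last element
  break.pts <- sort(sample.int(seq.size - 1L, n.chunks - 1L))
  # Consecutive differences are the chunk lengths, all at least 1
  break.len <- diff(c(0L, break.pts, seq.size))
  rep(seq_len(n.chunks), times = break.len)
}

set.seed(1)
g <- k.chunks2b(1000, 10)
tabulate(g, nbins = 10)  # chunk sizes
```

sample.int() is used instead of sample(1:n, ...) to sidestep the well-known surprise where sample() treats a length-one vector as an upper bound.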
If you really get a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks gets very large, the proportions will get very small. But it is faster.
k.chunks3 = function(seq.size, n.chunks) {
props <- runif(n.chunks)
grp.props <- props / sum(props)
chunk.size <- floor(grp.props[-n.chunks] * seq.size)
break.len <- c(chunk.size, seq.size - sum(chunk.size))
groups <- rep(1:n.chunks, times = break.len)
return(groups)
}
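To see the concern about small proportions, it can help to look at the chunk sizes alone. The helper below (chunk.sizes3, a hypothetical name) mirrors the size computation in k.chunks3 as a sketch: sizes always add up to seq.size, but floor() can round a chunk down to zero, and with more chunks than elements some chunks must be empty.

```r
chunk.sizes3 <- function(seq.size, n.chunks) {
  props <- runif(n.chunks)
  sizes <- floor(props / sum(props) * seq.size)
  # The last chunk absorbs whatever the rounding left over
  sizes[n.chunks] <- seq.size - sum(sizes[-n.chunks])
  sizes
}

set.seed(2)
s <- chunk.sizes3(20, 15)
sum(s)       # always equals seq.size
any(s == 0)  # empty chunks are possible

# With more chunks than elements, empty chunks are unavoidable
tiny <- chunk.sizes3(5, 10)
```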
Running a benchmark, I think any of these will be fast enough (unit is microseconds).
n <- 1000
y <- 10
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: microseconds
expr min lq mean median uq max neval
k.chunks(n, y) 49.9 52.05 59.613 53.45 58.35 251.7 100
k.chunks2(n, y) 46.1 47.75 51.617 49.25 52.55 107.1 100
k.chunks3(n, y) 8.1 9.35 11.412 10.80 11.75 44.2 100
But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).
n <- 1000000
y <- 100000
microbenchmark::microbenchmark(k.chunks(n, y),
k.chunks2(n, y),
k.chunks3(n, y))
Unit: milliseconds
expr min lq mean median uq max neval
k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038 100
k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510 69.5639 100
k.chunks3(n, y) 7.7354 8.62715 10.32754 9.07045 10.44675 58.2093 100
All said and done, I would probably use my k.chunks2() function.
Random is probably inefficient, but that is to be expected. Random suggests that all inputs should also be random. So, for a random selection from a vector Y, it seems the effort should go into an index of Y (and of successive Ys) that is, or at least seems, random. With enough sets of Ys you could discern how far from completely random the indexing is, but maybe that is not material, or perhaps mere thousands of repetitions are insufficient to demonstrate it.
Nonetheless, my sense is that both inputs to sample need to be 'random' in some way, as certainty in one reduces the randomness of the other.
my_vector <- 1:100000
sample_1 <- sample(my_vector, 50, replace = FALSE)
sample_2 <- sample(my_vector, 80, replace = FALSE)
# Combine both samples before de-duplicating; add the two endpoints
full_range <- c(1, sort(unique(c(sample_1, sample_2))), 100000)
# Elements in odd positions; see https://stackoverflow.com/questions/33257610/how-to-return-the-elements-in-the-odd-position
starts <- full_range[c(TRUE, FALSE)]
# Elements in even positions
ends <- full_range[c(FALSE, TRUE)]
# TRUE marks a zero-width gap (a sampled value equal to an endpoint)
!unique(diff(full_range))
And absent setting a seed, I think non-reproducible is as close as you get to a random selection upon Y(s). This answer is just to suggest an approach to indexing Y. The use of indices thereafter might follow @Adam's approach. And, of course, I could be completely wrong about all of this. Clearer random thinkers than me might well weigh in...
Basically I am helping someone to write some code for their research, but my usual time saving tactics have not reduced the run time of her algorithm enough for it to be reasonable. I was hoping someone else might know a better way to make a function run quickly based on an example I have written to avoid including information about the research.
The object in the example is smaller than the one she is using (but can easily be made larger). For the actual algorithm, this piece takes about 3 minutes in a small case, but might take 8-10 in the full case, and needs to run probably 1000-10000 times. This is the reason I need to seriously reduce the run time.
How I am currently doing this (hopefully with enough comments to make my thought process obvious):
example<-array(rnorm(100000), dim=c(5, 25, 40, 20))
observation <- array(rnorm(600), dim=c(5, 5, 12))
calc.err<-function(value, observation){
#'This creates the squared error for each observation, and each point in the
#'example array, across the five values in the first dimension of each
sqError<-(value-observation)^2
#'the apply function here sums up the squared error for each observation and
#'point. This is the value returned
return(apply(sqError, c(2,3), function(x) sum(x)))
}
run<-apply(example, c(2,3,4), function(x) calc.err(x, observation))
#'It isn't returned in the right format (small problem) but reformatting is fast
format<-array(run, dim=c(5, 12, 25, 40, 20))
Will clarify if necessary.
edit:
The data.table package appears to be very helpful. I will have to learn that package, but preliminary results seem to be much faster. I guess I was working with arrays because the code she gave me to make faster had the objects formatted that way. I didn't even think about changing it.
Here are a couple of simple refactors along with the timings:
calc.err2 <- function(value, observation){
#'This creates the squared error for each observation, and each point in the
#'example array, across the five values in the first dimension of each
sqError<-(value-observation)^2
#' getting rid of the anonymous function
apply(sqError, c(2,3), sum)
}
calc.err3 <- function(value, observation){
#'This creates the squared error for each observation, and each point in the
#'example array, across the five values in the first dimension of each
sqError<-(value-observation)^2
#' replacing with colSums
colSums(sqError)
}
microbenchmark(times=8, apply(example, 2:4, calc.err, observation),
apply(example, 2:4, calc.err2, observation),
apply(example, 2:4, calc.err3, observation)
)
Unit: milliseconds
expr min lq mean median uq max neval
apply(example, 2:4, calc.err, observation) 2284.350162 2321.875878 2349.7524509 2336.6661645 2393.3452420 2409.894876 8
apply(example, 2:4, calc.err2, observation) 2194.316755 2257.007572 2301.7896566 2298.9346090 2362.5479790 2383.020177 8
apply(example, 2:4, calc.err3, observation) 645.004808 652.567611 681.3176878 667.9070175 720.7049605 723.177516 8
colSums is way faster than the corresponding apply.
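If that is still too slow, it may be possible to drop the outer apply() altogether. The sketch below is a different technique from the refactors above: it expands the squared error as sum(e^2) + sum(o^2) - 2 * sum(e * o) and lets crossprod() do the heavy lifting. It uses smaller dimensions than the question for a quick check, the names E, O, sq and fast are my own, and the results match only up to floating-point tolerance.

```r
set.seed(1)
example     <- array(rnorm(5 * 4 * 3 * 2), dim = c(5, 4, 3, 2))
observation <- array(rnorm(5 * 5 * 12), dim = c(5, 5, 12))

# Flatten everything except the first dimension into columns
E <- matrix(example, nrow = dim(example)[1])          # 5 x 24
O <- matrix(observation, nrow = dim(observation)[1])  # 5 x 60

# Squared error for every (observation, point) pair in one shot
sq <- outer(colSums(O^2), colSums(E^2), "+") - 2 * crossprod(O, E)
fast <- array(sq, dim = c(dim(observation)[-1], dim(example)[-1]))

# Reference: the colSums-based version run through apply()
calc.err3 <- function(value, observation) colSums((value - observation)^2)
ref <- array(apply(example, 2:4, calc.err3, observation), dim = dim(fast))

all.equal(fast, ref)
```

Whether this pays off at full size is something to benchmark, since the intermediate matrices hold one value per observation and point pair.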
The Vectorize() and the apply() functions in R can often be used to accomplish the same goal. I usually prefer vectorizing a function for readability reasons, because the main calling function is related to the task at hand while sapply is not. It is also useful to Vectorize() when I am going to be using that vectorized function multiple times in my R code. For instance:
a <- 100
b <- 200
c <- 300
varnames <- c('a', 'b', 'c')
getv <- Vectorize(get)
getv(varnames)
vs
sapply(varnames, get)
However, at least on SO I rarely see examples with Vectorize() in the solution, only apply() (or one of its siblings). Are there any efficiency issues or other legitimate concerns with Vectorize() that make apply() a better option?
Vectorize is just a wrapper for mapply. It just builds you an mapply loop for whatever function you feed it. Thus there are often easier things to do than Vectorize() it, and the explicit *apply solutions end up being computationally equivalent or perhaps superior.
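A minimal sketch of that equivalence, using a made-up scalar function f (not from the question): the vectorized wrapper and a hand-written mapply() call should return the same thing.

```r
# A function that only works on scalars because of if/else
f <- function(x, y) if (x > y) x - y else 0

vf <- Vectorize(f)

res_vectorize <- vf(1:5, 3)
res_mapply    <- mapply(f, 1:5, 3)

all.equal(res_vectorize, res_mapply)
```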
Also, for your specific example, you've heard of mget, right?
To add to Thomas's answer. Maybe also speed?
# install.packages(c("microbenchmark", "stringr"), dependencies = TRUE)
require(microbenchmark)
require(stringr)
Vect <- function(x) { getv <- Vectorize(get); getv(x) }
sapp <- function(x) sapply(x, get)
mgett <- function(x) mget(x)
res <- microbenchmark(Vect(varnames), sapp(varnames), mget(varnames), times = 15)
## Print results:
print(res)
Unit: microseconds
expr min lq median uq max neval
Vect(varnames) 106.752 110.3845 116.050 122.9030 246.934 15
sapp(varnames) 31.731 33.8680 36.199 36.7810 100.712 15
mget(varnames) 2.856 3.1930 3.732 4.1185 13.624 15
### Plot results:
boxplot(res)