Why is mean() so slow?

Everything is in the question! I was doing a bit of optimization and nailing down the bottlenecks, and out of curiosity I tried this:
library(microbenchmark)

t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
times = 10000)
and the result is that mean() is 6+ times slower than the computation "by hand"!
Does it stem from the overhead in the code of mean() before the call to .Internal(mean), or is it the C code itself that is slower? Why? Is there a good reason, and thus a good use case?

It is due to the S3 lookup for the method, and then the necessary parsing of arguments in mean.default (and also the other code in mean).
sum and length are both primitive functions, so they will be fast (but how are you handling NA values? See the sketch below).
t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 10000)
Unit: nanoseconds
expr min lq median uq max neval
mean(t1) 10266 10951 11293 11635 1470714 10000
sum(t1)/length(t1) 684 1027 1369 1711 104367 10000
mean.default(t1) 2053 2396 2738 2739 1167195 10000
.Internal(mean(t1)) 342 343 685 685 86574 10000
The internal bit of mean is faster even than sum/length.
See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time (mirror) for more details (and a data.table solution that avoids .Internal).
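To make that NA question concrete, here is a minimal sketch (t2 is just an illustrative vector): the by-hand version has to handle missing values itself, while mean() exposes na.rm directly.
# The by-hand version propagates NA unless you strip missing values yourself.
t2 <- c(rnorm(9), NA)
sum(t2)/length(t2)                       # NA
mean(t2, na.rm = TRUE)                   # drops the NA
sum(t2, na.rm = TRUE)/sum(!is.na(t2))    # by-hand equivalent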
Note that if we increase the length of the vector, then the primitive approach is fastest
t1 <- rnorm(1e7)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 100)
Unit: milliseconds
expr min lq median uq max neval
mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137 100
sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824 100
mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896 100
.Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054 100
Now method dispatch is only a fraction of the overall "time" required.

mean is slower than computing "by hand" for several reasons:
S3 Method dispatch
NA handling
Error correction
Points 1 and 2 have already been covered. Point 3 is discussed in "What algorithm is R using to calculate mean?". Basically, mean makes two passes over the vector in order to correct for floating-point errors; sum only makes one pass over the vector.
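A rough R sketch of that two-pass idea (an illustration of the algorithm only; the real implementation is C code inside R):
# Illustration: the first pass computes a provisional mean, the second pass
# adds the mean of the residuals to correct the floating-point error.
two_pass_mean <- function(x) {
  n <- length(x)
  m <- sum(x)/n
  m + sum(x - m)/n
}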
Notice that identical(sum(t1)/length(t1), mean(t1)) may be FALSE, due to these precision issues.
> set.seed(21); t1 <- rnorm(1e7,,21)
> identical(sum(t1)/length(t1), mean(t1))
[1] FALSE
> sum(t1)/length(t1) - mean(t1)
[1] 2.539201e-16

Related

Function composition in R: compose versus %>%

I am working with code that makes extensive use of function composition. Usually I use the compose function from the purrr package, but it has some drawbacks; for instance, if you print the composed function you don't get useful information.
log_sqrt_compose <- purrr::compose(log, sqrt)
print(log_sqrt_compose)
function (...)
{
out <- last(...)
for (f in rev(rest)) {
out <- f(out)
}
out
}
With the pipe, instead, you get a nicer result:
library(magrittr)
log_sqrt_pipe <- . %>% sqrt %>% log
print(log_sqrt_pipe)
Functional sequence with the following components:
1. sqrt(.)
2. log(.)
Use 'functions' to extract the individual functions.
Benchmarking, it seems that the pipe solution is even faster:
microbenchmark::microbenchmark(log_sqrt_compose(10), log_sqrt_pipe(10), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
log_sqrt_compose(10) 2.531 2.872 3.201496 3.058 3.296 54.786 10000
log_sqrt_pipe(10) 2.007 2.411 2.767140 2.643 2.949 42.422 10000
Given the previous results, is there some other reason to prefer one of the two methods, in your experience?
EDIT:
Following mt1022's comment, I also add the base version:
log_sqrt_f <- function(x) log(sqrt(x))
print(log_sqrt_f)
function(x) log(sqrt(x))
microbenchmark::microbenchmark(log_sqrt_compose(10), log_sqrt_pipe(10), log_sqrt_f(10), times = 10000)
Unit: nanoseconds
expr min lq mean median uq max neval
log_sqrt_compose(10) 2529 2874 3151.7432 3052 3285 234039 10000
log_sqrt_pipe(10) 2051 2474 2770.7148 2708 2990 16194 10000
log_sqrt_f(10) 251 288 634.0649 341 420 2659789 10000
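For what it's worth, a hand-rolled composer shows where the overhead of the general-purpose versions comes from: the composed function has to loop over a captured list of functions at every call. (compose2 below is purely illustrative, not purrr's implementation.)
# A minimal right-to-left composer, similar in spirit to purrr::compose
# but without its argument handling.
compose2 <- function(...) {
  fns <- rev(list(...))        # apply the right-most function first
  function(x) {
    for (f in fns) x <- f(x)
    x
  }
}
log_sqrt_manual <- compose2(log, sqrt)
log_sqrt_manual(10)            # same as log(sqrt(10))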

What is the time complexity of name look-up in an R list?

I have tried desperately to find the answer via Google and failed. I am about to do the benchmark myself, but thought that maybe someone here knows the answer, or at least a reference where this is documented.
To expand on my question: suppose I have a list L in R of length N, where N is rather large (say, 10,000, 100,000, 1 million or more).
Assume my list has names for every element.
I wonder how long it takes to retrieve a single named entry, i.e. to do
L[[ "any_random_name" ]]
Is this time O(N), i.e. proportional to the length of the list, or is it O(1), that is, constant independent of the length of the list? Or is it maybe O(log N)?
The worst case for name lookup is O(n). Take a look here: https://www.refsmmat.com/posts/2016-09-12-r-lists.html .
Interestingly, the answer turns out to be both O(1) and O(n): the timing depends not so much on the length of the list as on how far into the list the named element you want sits.
Let's define some lists of different lengths:
library(microbenchmark)
makelist <- function(n){
  L <- as.list(runif(n))
  names(L) <- paste0("Element", 1:n)
  return(L)
}
L100 <- makelist(100)
L1000 <- makelist(1000)
LMillion <- makelist(10^6)
L10Million <- makelist(10^7)
If we try to extract the element named Element100 out of each of these - the 100th element - it takes basically the same length of time:
microbenchmark(
L10Million[["Element100"]],
LMillion[["Element100"]],
L1000[["Element100"]],
L100[["Element100"]]
)
Unit: nanoseconds
expr min lq mean median uq max neval cld
L10Million[["Element100"]] 791 791 996.59 792 1186 2766 100 a
LMillion[["Element100"]] 791 791 1000.56 989 1186 1976 100 a
L1000[["Element100"]] 791 791 1474.64 792 1186 28050 100 a
L100[["Element100"]] 791 791 1352.21 792 1186 17779 100 a
But if we try to get the last element, the time required is O(n)
microbenchmark(
L10Million[["Element10000000"]],
LMillion[["Element1000000"]],
L1000[["Element1000"]],
L100[["Element100"]]
)
Unit: nanoseconds
expr min lq mean median uq max neval cld
L10Million[["Element10000000"]] 154965791 165074635 172399030.13 169602043 175170244 235525230 100 c
LMillion[["Element1000000"]] 15362773 16540057 17379942.70 16967712 17734922 22361689 100 b
L1000[["Element1000"]] 9482 10668 17770.94 16594 20544 67557 100 a
L100[["Element100"]] 791 1186 3995.15 3556 6322 24100 100 a
library(ggplot2)
# approximate median timings (nanoseconds) from the benchmarks above
res <- data.frame(x = c(100, 1000, 10^6, 10^7),
                  y = c(3556, 16594, 16967715, 169602041))
ggplot(res, aes(x = x, y = y)) +
  geom_smooth(method = "lm") +
  geom_point(size = 3) +
  scale_x_log10() +
  scale_y_log10()
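To see directly that the cost tracks the element's position rather than the total length, you can time lookups at different positions within a single list (a sketch; the element names follow the makelist() pattern above):
# Same list, three positions: lookup time grows with the position of the name.
L <- makelist(10^6)
microbenchmark(
  first  = L[["Element1"]],
  middle = L[["Element500000"]],
  last   = L[["Element1000000"]],
  times = 100
)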

Why is this simplistic cpp function version slower?

Consider this comparison:
require(Rcpp)
require(microbenchmark)
cppFunction('int cppFun (int x) {return x;}')
RFun = function(x) x
x=as.integer(2)
microbenchmark(RFun = RFun(x), cppFun = cppFun(x), times = 1e6)
Unit: nanoseconds
expr min lq mean median uq max neval cld
RFun 302 357 470.2047 449 513 110152 1e+06 a
cppFun 2330 2598 4045.0059 2729 2879 68752603 1e+06 b
cppFun seems slower than RFun. Why is that? Do the times for calling the functions differ? Or is it the functions themselves that differ in running time? Is it the time for passing and returning arguments? Is there some data conversion or data copying I am unaware of when the data are passed to (or returned from) cppFun?
This simply is not a well-posed or thought-out question, as the comments above indicate.
The supposed baseline of an empty function simply is not one. Every function created via cppFunction() et al. will call one R function interfacing to some C++ function, so this simply cannot be equal.
Here is a slightly more meaningful comparison. For starters, let's make the R function complete with curlies. Second, let's call another compiled (internal) function:
require(Rcpp)
require(microbenchmark)
cppFunction('int cppFun (int x) {return x;}')
RFun1 <- function(x) { x }
RFun2 <- function(x) { .Primitive("abs")(x) }
print(microbenchmark(RFun1(2L), RFun2(2L), cppFun(2L), times=1e5))
On my box, I see a) a closer gap between versions 1 and 2 (or the C++ function) and b) little penalty over the internal function. But calling ANY compiled function from R has cost.
Unit: nanoseconds
expr min lq mean median uq max neval
RFun1(2L) 171 338 434.136 355 408 2659984 1e+05
RFun2(2L) 683 937 1334.046 1257 1341 7605628 1e+05
cppFun(2L) 721 1131 1416.091 1239 1385 8544656 1e+05
As we say in the real world: there ain't no free lunch.
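The flip side is that the fixed call overhead stops mattering as soon as the compiled function does real work. A hedged sketch (cppSum and the vector length are illustrative choices): for a million elements, both expressions spend their time on the summation itself rather than on the call.
# Illustrative only: a simple C++ loop sum versus R's built-in sum().
library(Rcpp)
cppFunction('double cppSum(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) total += x[i];
  return total;
}')
x <- rnorm(1e6)
microbenchmark::microbenchmark(rSum = sum(x), cppSum = cppSum(x), times = 100)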

Is there a shift operation in R?

In C++ there are shift operations:
>> right shift
<< left shift
These are considered to be very fast.
I tried to apply the same in R, but it doesn't seem to work.
Is there a similar operation in R that is as fast as this?
Thanks in advance.
You can use bitwShiftL and bitwShiftR:
bitwShiftL(16, 2)
#[1] 64
bitwShiftR(16, 2)
#[1] 4
Here is the source code. Judging by the amount of additional code in these functions, and the fact that * and / are primitives, it is unlikely that these will be faster than dividing/multiplying by the equivalent power of two. On one of my VMs:
microbenchmark::microbenchmark(
bitwShiftL(16, 2),
16 * 4,
times = 1000L
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftL(16, 2) 1167 1353.5 2336.779 1604 2067 117880 1000 b
# 16 * 4 210 251.0 564.528 347 470 51885 1000 a
microbenchmark::microbenchmark(
bitwShiftR(16, 2),
16 / 4,
times = 1000L
)
# Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftR(16, 2) 1161 1238.5 1635.131 1388.5 1688.5 39225 1000 b
# 16/4 210 240.0 323.787 280.0 334.0 14284 1000 a
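If you do replace shifts with arithmetic, a quick sanity check that the results agree (a sketch with integer literals: shifting left by k corresponds to multiplying by 2^k, and shifting right by k to integer division by 2^k, for non-negative integers that stay in range):
x <- 16L
c(bitwShiftL(x, 2L), x * 4L)      # both 64
c(bitwShiftR(x, 2L), x %/% 4L)    # both 4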
I should also point out that trying to micro-optimize an interpreted language is probably a waste of time. If performance is such a big concern that you are willing to split hairs over a couple of clock cycles, just write your program in C or C++ in the first place.

Why is it faster to evaluate in `j` than with `$` in a `data.table`?

Perhaps this is already answered and I missed it, but it's hard to search.
A very simple question: Why is dt[,x] generally a tiny bit faster than dt$x?
Example:
library(data.table)
library(microbenchmark)

dt <- data.table(id = 1:1e7, var = rnorm(1e6))
test <- microbenchmark(times = 100L,
  dt[sample(1e7, size = 200000), var],
  dt[sample(1e7, size = 200000), ]$var)
test[, "expr"] <- c("in j", "$")
Unit: milliseconds
expr min lq mean median uq max neval
$ 14.28863 15.88779 18.84229 17.23109 18.41577 53.63473 100
in j 14.35916 15.97063 18.87265 17.99266 18.37939 54.19944 100
I might not have chosen the best example, so feel free to suggest something perhaps more poignant.
Anyway, evaluating in j is faster at least 75% of the time (though there appears to be a fat upper tail as the mean is higher; side note, it would be nice if microbenchmark could spit me out some histograms).
Why is this the case?
With j, you are subsetting and selecting within a single call to [.data.table.
With $ (and your call), you are subsetting within [.data.table and then selecting with $.
You are in essence calling 2 functions, not 1, thus there is a negligible difference in timing.
In your current example you are also calling `sample(1e7, size = 200000)` inside each expression, so the two expressions are not even subsetting the same rows.
For a comparison in which every approach returns identical results:
dt <- data.table(id = 1:1e7, var = rnorm(1e6))
setkey(dt, id)
ii <- sample(1e7, size = 200000)
microbenchmark(
  "in j" = dt[.(ii), var],
  "$" = dt[.(ii)]$var,
  "[[" = dt[.(ii)][["var"]],
  .subset2(dt[.(ii)], "var"),
  dt[.(ii)][[2]],
  dt[["var"]][ii],
  dt$var[ii],
  .subset2(dt, "var")[ii]
)
Unit: milliseconds
expr min lq mean median uq max neval cld
in j 39.491156 40.358669 41.570057 40.860342 41.485622 70.202441 100 b
$ 39.957211 40.561965 41.587420 41.136836 41.634584 69.928363 100 b
[[ 40.046558 40.515480 42.388432 41.244444 41.750946 72.224827 100 b
.subset2(dt[.(ii)], "var") 39.772781 40.564077 41.561271 41.111630 41.635489 69.252222 100 b
dt[.(ii)][[2]] 40.004300 40.513669 41.682526 40.927503 41.492866 72.986995 100 b
dt[["var"]][ii] 4.432346 4.546898 4.946219 4.623416 4.755777 31.761115 100 a
dt$var[ii] 4.440496 4.539502 4.668361 4.597457 4.729214 5.425125 100 a
.subset2(dt, "var")[ii] 4.365939 4.508261 4.660435 4.598815 4.703858 6.072289 100 a
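As a quick sanity check (a sketch using a fixed index vector, so both spellings operate on the same rows), the two code paths do return the same vector:
ii <- sample(1e7, size = 200000)
identical(dt[ii, var], dt[ii]$var)   # TRUE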

Resources