Why is this simplistic cpp function version slower?

Consider this comparison:
require(Rcpp)
require(microbenchmark)
cppFunction('int cppFun (int x) {return x;}')
RFun = function(x) x
x=as.integer(2)
microbenchmark(RFun=RFun(x),cppFun=cppFun(x),times=1e5)
Unit: nanoseconds
expr min lq mean median uq max neval cld
RFun 302 357 470.2047 449 513 110152 1e+06 a
cppFun 2330 2598 4045.0059 2729 2879 68752603 1e+06 b
cppFun seems slower than RFun. Why is that? Does the time needed to call the two functions differ, or do the functions themselves differ in running time? Is it the time spent passing arguments and returning results? Is there some data conversion or copying I am unaware of when data are passed to (or returned from) cppFun?

This is not a well-posed or well thought-out question, as the comments above indicate.
The supposed baseline of an empty function is not a baseline at all. Every function created via cppFunction() et al. is an R function that in turn calls a compiled C++ function, so the two cannot cost the same.
Here is a slightly more meaningful comparison. For starters, let's make the R function complete with curly braces. Second, let's also call another compiled (internal) function:
require(Rcpp)
require(microbenchmark)
cppFunction('int cppFun (int x) {return x;}')
RFun1 <- function(x) { x }
RFun2 <- function(x) { .Primitive("abs")(x) }
print(microbenchmark(RFun1(2L), RFun2(2L), cppFun(2L), times=1e5))
On my box, I see a) a smaller gap between version 1 and version 2 (or the C++ function) than in the question and b) little penalty for the C++ function relative to the internal function. But calling ANY compiled function from R has a cost.
Unit: nanoseconds
expr min lq mean median uq max neval
RFun1(2L) 171 338 434.136 355 408 2659984 1e+05
RFun2(2L) 683 937 1334.046 1257 1341 7605628 1e+05
cppFun(2L) 721 1131 1416.091 1239 1385 8544656 1e+05
As we say in the real world: there ain't no free lunch.

Related

Rcpp rowMaxs vs. matrixStats rowMaxs

I am trying to efficiently compute rowMaxs in Rcpp. A very simple implementation is
arma::mat RcppRowmaxs(arma::mat x){
int N = x.n_rows;
arma::mat rm(N,1);
for(int nn = 0; nn < N; nn++){
rm(nn) = max(x.row(nn));
}
return(rm);
}
which works perfectly fine. However, when I compared this function with other packages, their implementations turned out to be far more efficient. Specifically, Rfast::rowMaxs is more than 6 times faster than the simple Rcpp implementation!
Naturally, I tried to mimic the behavior of Rfast.
However, as a beginner in Rcpp, I only tried to load Rfast::rowMaxs directly in Rcpp as described e.g. here. Unfortunately, using an Rcpp script to load an R function that again calls an Rcpp script seems pretty slow following my benchmark (see row "RfastinRcpp"):
m = matrix(rnorm(1000*1000),1000,1000)
microbenchmark::microbenchmark(
matrixStats = matrixStats::rowMaxs(m),
Rfast = Rfast::rowMaxs(m,value=T),
Rcpp = RcppRowmaxs(m),
RfastinRcpp = RfastRcpp(m),
apply = apply(m,1,max)
)
Unit: microseconds
expr min lq mean median uq max neval cld
matrixStats 1929.570 2042.8975 2232.1980 2086.5180 2175.470 4025.923 100 a
Rfast 666.711 727.2245 842.5578 795.2215 891.443 1477.969 100 a
Rcpp 5552.216 5825.4855 6186.9850 5997.8295 6373.737 8568.878 100 b
RfastinRcpp 7495.042 7931.2480 9471.8453 8382.6350 10659.672 19968.817 100 b
apply 12281.758 15145.7495 22015.2798 17202.9730 20310.939 136844.591 100 c
Any tips on how to improve performance in the function I've provided above? I've looked at the source code from Rfast and believe that this is the correct file. However, so far I did not manage to locate the important parts of the code.
Edit: Changed the post to focus on Rfast now, following the answer of Michail.

I just did some experiments on my laptop, a 5-year-old HP with two Intel i5 cores at 2.3 GHz. The results are shown below. Rfast's implementation is consistently faster than matrixStats' implementation, and the time difference grows as the matrix gets bigger.

library(Rfast)
library(microbenchmark)
library(matrixStats)
x <- matrnorm(100,100)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: microseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 20.5 20.9 242.64 21.50 23.2 2223.8 10
matrixStats::rowMaxs(x) 43.7 44.7 327.43 46.95 88.2 2776.8 10
x <- matrnorm(1000,1000)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: microseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 799.5 844.0 875.08 858.5 900.3 960.3 10
matrixStats::rowMaxs(x) 2229.8 2260.8 2303.04 2269.4 2293.3 2607.8 10
x <- matrnorm(10000,10000)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 82.1157 83.4288 85.81769 84.57885 86.2742 93.0031 10
matrixStats::rowMaxs(x) 216.0003 218.5324 222.46670 221.00330 225.3302 237.7666 10
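One possible reason the simple loop in the question is slow: R matrices and arma::mat are stored in column-major order, so x.row(nn) touches elements that are a full column length apart in memory. The following is a minimal plain C++ sketch of a traversal that walks the data column by column and keeps a running maximum per row. It is not the code used by Rfast or matrixStats; the function name row_maxs_colmajor and the flat-vector layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: row maxima of an nrow x ncol matrix stored
// column-major in a flat vector, i.e. element (i, j) is x[i + j * nrow].
std::vector<double> row_maxs_colmajor(const std::vector<double>& x,
                                      std::size_t nrow, std::size_t ncol) {
    assert(nrow > 0 && ncol > 0 && x.size() == nrow * ncol);
    // Initialise the running maxima with the first column.
    std::vector<double> out(x.begin(), x.begin() + nrow);
    // Visit the remaining columns in storage order.
    for (std::size_t j = 1; j < ncol; ++j) {
        const double* col = x.data() + j * nrow;  // contiguous block
        for (std::size_t i = 0; i < nrow; ++i) {
            if (col[i] > out[i]) out[i] = col[i];
        }
    }
    return out;
}
```

The inner loop reads each column as one contiguous block, so memory is touched in the order it is laid out. Whether this closes the gap to Rfast on a given machine would need to be benchmarked.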

Function composition in R: compose versus %>%

I am working with code that makes extensive use of function composition. Usually I use the compose function from the purrr package, but it has some drawbacks. For instance, printing the composed function gives no useful information.
log_sqrt_compose <- purrr::compose(log, sqrt)
print(log_sqrt_compose)
function (...)
{
out <- last(...)
for (f in rev(rest)) {
out <- f(out)
}
out
}
With the pipe, by contrast, you get a nicer result:
library(magrittr)
log_sqrt_pipe <- . %>% sqrt %>% log
print(log_sqrt_pipe)
Functional sequence with the following components:
1. sqrt(.)
2. log(.)
Use 'functions' to extract the individual functions.
A benchmark suggests that the pipe solution is even faster:
microbenchmark::microbenchmark(log_sqrt_compose(10), log_sqrt_pipe(10), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
log_sqrt_compose(10) 2.531 2.872 3.201496 3.058 3.296 54.786 10000
log_sqrt_pipe(10) 2.007 2.411 2.767140 2.643 2.949 42.422 10000
Given the previous results, is there any other reason, in your experience, to prefer one of the two methods?
EDIT:
Following @mt1022's comment, I have also added the base version:
log_sqrt_f <- function(x) log(sqrt(x))
print(log_sqrt_f)
function(x) log(sqrt(x))
microbenchmark(log_sqrt_compose(10), log_sqrt_pipe(10), log_sqrt_f(10), times = 10000)
Unit: nanoseconds
expr min lq mean median uq max neval
log_sqrt_compose(10) 2529 2874 3151.7432 3052 3285 234039 10000
log_sqrt_pipe(10) 2051 2474 2770.7148 2708 2990 16194 10000
log_sqrt_f(10) 251 288 634.0649 341 420 2659789 10000

Is there a shift operation in R?

In C++ there are shift operations:
>> right shift
<< left shift
These are considered to be very fast.
I tried to apply the same in R, but it does not seem to work.
Is there a similar operation in R that is as fast as this?
Thanks in advance.

You can use bitwShiftL and bitwShiftR:
bitwShiftL(16, 2)
#[1] 64
bitwShiftR(16, 2)
#[1] 4
Here is the source code. Judging by the amount of additional code in these functions, and the fact that * and / are primitives, it is unlikely that these will be faster than multiplying or dividing by the equivalent power of two. On one of my VMs:
microbenchmark::microbenchmark(
bitwShiftL(16, 2),
16 * 4,
times = 1000L
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftL(16, 2) 1167 1353.5 2336.779 1604 2067 117880 1000 b
# 16 * 4 210 251.0 564.528 347 470 51885 1000 a
microbenchmark::microbenchmark(
bitwShiftR(16, 2),
16 / 4,
times = 1000L
)
# Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftR(16, 2) 1161 1238.5 1635.131 1388.5 1688.5 39225 1000 b
# 16/4 210 240.0 323.787 280.0 334.0 14284 1000 a
I should also point out that trying to micro-optimize an interpreted language is probably a waste of time. If performance is such a big concern that you are willing to split hairs over a couple of clock cycles, just write your program in C or C++ in the first place.
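For reference, the equivalence between shifting and multiplying or dividing by a power of two that this answer relies on can be sketched in C++. The helper names times_pow2 and div_pow2 are hypothetical, and the sketch is restricted to unsigned 32-bit values to keep the behaviour well defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: shifting an unsigned integer left by k bits
// multiplies it by 2^k (high bits that do not fit are discarded).
inline std::uint32_t times_pow2(std::uint32_t x, unsigned k) {
    assert(k < 32);  // shifting by the full width or more is undefined
    return x << k;
}

// Hypothetical helper: shifting right by k bits divides by 2^k,
// discarding the remainder (integer division).
inline std::uint32_t div_pow2(std::uint32_t x, unsigned k) {
    assert(k < 32);
    return x >> k;
}
```

Note that a right shift is an integer division: div_pow2(17, 2) gives 4, which corresponds to 17 %/% 4 in R rather than 17 / 4.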

Why is it faster to evaluate in `j` than with `$` in a `data.table`?

Perhaps this is already answered and I missed it, but it's hard to search.
A very simple question: Why is dt[,x] generally a tiny bit faster than dt$x?
Example:
dt<-data.table(id=1:1e7,var=rnorm(1e6))
test<-microbenchmark(times=100L,
dt[sample(1e7,size=200000),var],
dt[sample(1e7,size=200000),]$var)
test[,"expr"]<-c("in j","$")
Unit: milliseconds
expr min lq mean median uq max neval
$ 14.28863 15.88779 18.84229 17.23109 18.41577 53.63473 100
in j 14.35916 15.97063 18.87265 17.99266 18.37939 54.19944 100
I might not have chosen the best example, so feel free to suggest something perhaps more poignant.
Anyway, evaluating in j is faster at least 75% of the time (though there appears to be a fat upper tail as the mean is higher; side note, it would be nice if microbenchmark could spit me out some histograms).
Why is this the case?

With j, you are subsetting and selecting within a single call to [.data.table.
With $ (and your call), you are subsetting within [.data.table and then selecting with $.
In essence you are calling two functions instead of one, hence the negligible difference in timing.
Note also that in your current example you call sample(1e7, size=200000) anew in each expression, so the two expressions do not operate on the same rows.
For the comparison to return identical results:
dt<-data.table(id=1:1e7,var=rnorm(1e6))
setkey(dt, id)
ii <- sample(1e7,size=200000)
microbenchmark("in j" = dt[.(ii),var], "$"=dt[.(ii)]$var, '[[' =dt[.(ii)][['var']], .subset2(dt[.(ii)],'var'), dt[.(ii)][[2]], dt[['var']][ii], dt$var[ii], .subset2(dt,'var')[ii] )
Unit: milliseconds
expr min lq mean median uq max neval cld
in j 39.491156 40.358669 41.570057 40.860342 41.485622 70.202441 100 b
$ 39.957211 40.561965 41.587420 41.136836 41.634584 69.928363 100 b
[[ 40.046558 40.515480 42.388432 41.244444 41.750946 72.224827 100 b
.subset2(dt[.(ii)], "var") 39.772781 40.564077 41.561271 41.111630 41.635489 69.252222 100 b
dt[.(ii)][[2]] 40.004300 40.513669 41.682526 40.927503 41.492866 72.986995 100 b
dt[["var"]][ii] 4.432346 4.546898 4.946219 4.623416 4.755777 31.761115 100 a
dt$var[ii] 4.440496 4.539502 4.668361 4.597457 4.729214 5.425125 100 a
.subset2(dt, "var")[ii] 4.365939 4.508261 4.660435 4.598815 4.703858 6.072289 100 a

Why is mean() so slow?

Everything is in the question! While doing a bit of optimization and tracking down bottlenecks, I tried this out of curiosity:
t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
times = 10000)
and the result is that mean() is 6+ times slower than the computation "by hand"!
Does it stem from the overhead in the code of mean() before the call to .Internal(mean(x)), or is it the C code itself which is slower? Why? Is there a good reason and thus a good use case?

It is due to the S3 lookup for the method, followed by the argument processing in mean.default (and the other code in mean).
sum and length are both primitive functions, so they will be fast (but how are you handling NA values?).
t1 <- rnorm(10)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 10000)
Unit: nanoseconds
expr min lq median uq max neval
mean(t1) 10266 10951 11293 11635 1470714 10000
sum(t1)/length(t1) 684 1027 1369 1711 104367 10000
mean.default(t1) 2053 2396 2738 2739 1167195 10000
.Internal(mean(t1)) 342 343 685 685 86574 10000
The internal bit of mean is faster even than sum/length.
See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time (mirror) for more details (and a data.table solution that avoids .Internal).
Note that if we increase the length of the vector, then the primitive approach is fastest
t1 <- rnorm(1e7)
microbenchmark(
mean(t1),
sum(t1)/length(t1),
mean.default(t1),
.Internal(mean(t1)),
times = 100)
Unit: milliseconds
expr min lq median uq max neval
mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137 100
sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824 100
mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896 100
.Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054 100
Now method dispatch is only a fraction of the overall "time" required.

mean is slower than computing "by hand" for several reasons:
1. S3 method dispatch
2. NA handling
3. Error correction
Points 1 and 2 have already been covered. Point 3 is discussed in "What algorithm is R using to calculate mean?". Basically, mean makes two passes over the vector in order to correct for floating-point errors, whereas sum makes only one pass.
Notice that identical(sum(t1)/length(t1), mean(t1)) may be FALSE because of these precision issues.
> set.seed(21); t1 <- rnorm(1e7,,21)
> identical(sum(t1)/length(t1), mean(t1))
[1] FALSE
> sum(t1)/length(t1) - mean(t1)
[1] 2.539201e-16
