Rcpp rowMaxs vs. matrixStats rowMaxs

I am trying to efficiently compute rowMaxs in Rcpp. A very simple implementation is
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat RcppRowmaxs(arma::mat x){
  int N = x.n_rows;
  arma::mat rm(N, 1);                  // one maximum per row
  for(int nn = 0; nn < N; nn++){
    rm(nn) = arma::max(x.row(nn));     // max over the nn-th row view
  }
  return rm;
}
which works perfectly fine. However, comparing this function to other packages showed that other implementations are by far more efficient. Specifically, Rfast::rowMaxs is more than 6 times faster than the simple Rcpp implementation!
Naturally, I tried to mimic the behavior of Rfast.
However, as a beginner in Rcpp, I only managed to load Rfast::rowMaxs directly in Rcpp as described e.g. here. Unfortunately, using an Rcpp script to call an R function that in turn calls another Rcpp routine seems pretty slow, judging by my benchmark (see the row "RfastinRcpp"):
m = matrix(rnorm(1000*1000),1000,1000)
microbenchmark::microbenchmark(
matrixStats = matrixStats::rowMaxs(m),
Rfast = Rfast::rowMaxs(m,value=T),
Rcpp = RcppRowmaxs(m),
RfastinRcpp = RfastRcpp(m),
apply = apply(m,1,max)
)
Unit: microseconds
expr min lq mean median uq max neval cld
matrixStats 1929.570 2042.8975 2232.1980 2086.5180 2175.470 4025.923 100 a
Rfast 666.711 727.2245 842.5578 795.2215 891.443 1477.969 100 a
Rcpp 5552.216 5825.4855 6186.9850 5997.8295 6373.737 8568.878 100 b
RfastinRcpp 7495.042 7931.2480 9471.8453 8382.6350 10659.672 19968.817 100 b
apply 12281.758 15145.7495 22015.2798 17202.9730 20310.939 136844.591 100 c
Any tips on how to improve performance in the function I've provided above? I've looked at the source code from Rfast and believe that this is the correct file. However, so far I have not managed to locate the important parts of the code.
Edit: Changed the post to focus on Rfast now, following the answer of Michail.

I just did some experiments on my laptop, a 5-year-old HP with 2 Intel i5 cores at 2.3 GHz. Attached is a picture with my results. Rfast's implementation is always faster than matrixStats', and as the matrix gets bigger the time difference increases.

library(Rfast)
library(microbenchmark)
library(matrixStats)
x <- matrnorm(100,100)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: microseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 20.5 20.9 242.64 21.50 23.2 2223.8 10
matrixStats::rowMaxs(x) 43.7 44.7 327.43 46.95 88.2 2776.8 10
x <- matrnorm(1000,1000)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: microseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 799.5 844.0 875.08 858.5 900.3 960.3 10
matrixStats::rowMaxs(x) 2229.8 2260.8 2303.04 2269.4 2293.3 2607.8 10
x <- matrnorm(10000,10000)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 82.1157 83.4288 85.81769 84.57885 86.2742 93.0031 10
matrixStats::rowMaxs(x) 216.0003 218.5324 222.46670 221.00330 225.3302 237.7666 10
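
As for speeding up the plain Rcpp version from the question: one common trick (my own sketch, not taken from Rfast's source) is to traverse the matrix in column-major order, which matches R's storage layout, instead of extracting one row at a time. A minimal sketch in plain Rcpp, compiled with Rcpp::sourceCpp():

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rowMaxsLoop(NumericMatrix x) {
  int n = x.nrow(), p = x.ncol();
  NumericVector out(n, R_NegInf);     // running maximum for each row
  for (int j = 0; j < p; j++) {       // outer loop over columns so that
    for (int i = 0; i < n; i++) {     // memory access stays contiguous
      double v = x(i, j);
      if (v > out[i]) out[i] = v;
    }
  }
  return out;
}

Note that this simplified loop silently skips NA/NaN values (comparisons with NaN are false), so it is only a drop-in replacement when the matrix has no missing values; whether it matches Rfast's timings will depend on your compiler and machine.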

Related

Function composition in R: compose versus %>%

I am working with code that makes extensive use of function composition. Usually I use the compose function from the purrr package, but it has some drawbacks; for instance, if you print the composed function you don't get useful information.
log_sqrt_compose <- purrr::compose(log, sqrt)
print(log_sqrt_compose)
function (...)
{
out <- last(...)
for (f in rev(rest)) {
out <- f(out)
}
out
}
With the pipe, instead, you get a nicer result:
library(magrittr)
log_sqrt_pipe <- . %>% sqrt %>% log
print(log_sqrt_pipe)
Functional sequence with the following components:
1. sqrt(.)
2. log(.)
Use 'functions' to extract the individual functions.
Performing a benchmark, it seems that the pipe solution is even faster:
microbenchmark::microbenchmark(log_sqrt_compose(10), log_sqrt_pipe(10), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
log_sqrt_compose(10) 2.531 2.872 3.201496 3.058 3.296 54.786 10000
log_sqrt_pipe(10) 2.007 2.411 2.767140 2.643 2.949 42.422 10000
Given the previous results, is there some other reason to prefer one of the two methods, in your experience?
EDIT:
Following @mt1022's comment, I also add the base version:
log_sqrt_f <- function(x) log(sqrt(x))
print(log_sqrt_f)
function(x) log(sqrt(x))
microbenchmark(log_sqrt_compose(10), log_sqrt_pipe(10), log_sqrt_f(10), times = 10000)
Unit: nanoseconds
expr min lq mean median uq max neval
log_sqrt_compose(10) 2529 2874 3151.7432 3052 3285 234039 10000
log_sqrt_pipe(10) 2051 2474 2770.7148 2708 2990 16194 10000
log_sqrt_f(10) 251 288 634.0649 341 420 2659789 10000
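
For comparison, here is what a minimal hand-rolled composition helper could look like (a hypothetical compose2, not part of purrr or magrittr). It shows what compose is doing under the hood, though like purrr the printed result does not name the composed functions, since they live in the closure environment:

# minimal sketch of a two-function composition helper
compose2 <- function(f, g) {
  function(...) f(g(...))       # apply g first, then f
}
log_sqrt_manual <- compose2(log, sqrt)
log_sqrt_manual(10)             # same value as log(sqrt(10))
print(log_sqrt_manual)          # body prints as f(g(...))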

Why is this simplistic cpp function version slower?

Consider this comparison:
require(Rcpp)
require(microbenchmark)
cppFunction('int cppFun (int x) {return x;}')
RFun = function(x) x
x=as.integer(2)
microbenchmark(RFun=RFun(x),cppFun=cppFun(x),times=1e5)
Unit: nanoseconds
expr min lq mean median uq max neval cld
RFun 302 357 470.2047 449 513 110152 1e+06 a
cppFun 2330 2598 4045.0059 2729 2879 68752603 1e+06 b
cppFun seems slower than RFun. Why is that so? Do the times for calling the functions differ? Or is it the functions themselves that differ in running time? Is it the time for passing and returning arguments? Is there some data conversion or data copying I am unaware of when the data are passed to (or returned from) cppFun?
This simply is not a well-posed or thought-out question as the comments above indicate.
The supposed baseline of an empty function simply is not one. Every function created via cppFunction() et al will call one R function interfacing to some C++ function. So this simply cannot be equal.
Here is a slightly more meaningful comparison. For starters, let's make the R function complete with curlies. Second, let's call another compiled (internal) function:
require(Rcpp)
require(microbenchmark)
cppFunction('int cppFun (int x) {return x;}')
RFun1 <- function(x) { x }
RFun2 <- function(x) { .Primitive("abs")(x) }
print(microbenchmark(RFun1(2L), RFun2(2L), cppFun(2L), times=1e5))
On my box, I see a) a closer gap between versions 1 and 2 (or the C++ function) and b) little penalty over the internal function. But calling ANY compiled function from R has cost.
Unit: nanoseconds
expr min lq mean median uq max neval
RFun1(2L) 171 338 434.136 355 408 2659984 1e+05
RFun2(2L) 683 937 1334.046 1257 1341 7605628 1e+05
cppFun(2L) 721 1131 1416.091 1239 1385 8544656 1e+05
As we say in the real world: there ain't no free lunch.
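
One way to see the interface cost being described (a side note, not part of the original answer): every function generated by cppFunction() is an ordinary R closure whose body is a .Call() into the compiled routine, which you can inspect directly. The exact printed form varies by Rcpp version:

print(cppFun)
# function (x)
# .Call(<pointer: 0x...>, x)
# The .Call() dispatch, argument checking and SEXP conversion are the
# fixed per-call overhead that the benchmark above is measuring.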

How to measure the execution time of a code without actually running the code in R?

Can I use microbenchmark to calculate the approximate time it would take to execute my code in R? I am running some code and I can see that it takes many hours to execute. I don't want to run my code for all that time; I want to estimate the execution time without actually running the code in R.
Try running your code on smaller problems to see how it scales
> fun0 = function(n) { x = integer(); for (i in seq_len(n)) x = c(x, i); x }
> p = microbenchmark(fun0(1000), fun0(2000), fun0(4000), fun0(8000), fun0(16000),
+ times=20)
> p
Unit: milliseconds
        expr        min         lq       mean     median         uq        max neval
  fun0(1000)   1.627601   1.697958   1.995438   1.723522   2.289424   2.935609    20
  fun0(2000)   5.691456   6.333478   6.745057   6.928060   7.056893   8.040366    20
  fun0(4000)  23.343611  24.487355  24.987870  24.854968  25.554553  26.088183    20
  fun0(8000)  92.517691  95.827525 104.900161  97.305930 112.924961 136.434998    20
 fun0(16000) 365.495320 369.697953 380.981034 374.456565 390.829214 411.203191    20
Each doubling of the problem size makes execution roughly four times slower, i.e., the implementation scales quadratically; visualize it as
library(ggplot2)
ggplot(p, aes(x=expr, y=log(time))) + geom_point() +
geom_smooth(method="lm", aes(x=as.integer(expr)))
This is terrible news for big problems!
Investigate alternative implementations that scale better while returning the same answer, both as the problem increases in size and at a given problem size. First make sure your algorithms / implementations get the same answer
> ## linear, ok
> fun1 = function(n) { x = integer(n); for (i in seq_len(n)) x[[i]] = i; x }
> identical(fun0(100), fun1(100))
[1] TRUE
then see how the new algorithm scales with problem size
> microbenchmark(fun1(100), fun1(1000), fun1(10000))
Unit: microseconds
expr min lq mean median uq max neval
fun1(100) 86.260 97.558 121.5591 102.6715 107.6995 1058.321 100
fun1(1000) 845.160 902.221 932.7760 922.8610 945.6305 1915.264 100
fun1(10000) 8776.673 9100.087 9699.7925 9385.8560 10310.6240 13423.718 100
Explore more algorithms, especially those that replace iteration with vectorization
> ## linear, faster -- *nano*seconds
> fun2 = seq_len
> identical(fun1(100), fun2(100))
[1] TRUE
> microbenchmark(fun2(100), fun2(1000), fun2(10000))
Unit: nanoseconds
expr min lq mean median uq max neval
fun2(100) 417 505.0 587.53 553 618 2247 100
fun2(1000) 2126 2228.5 2774.90 2894 2986 5511 100
fun2(10000) 19426 19741.0 25390.93 27177 28209 43418 100
Comparing algorithms at specific sizes
> n = 1000; microbenchmark(fun0(n), fun1(n), fun2(n), times=10)
Unit: microseconds
expr min lq mean median uq max neval
fun0(n) 1625.797 1637.949 2018.6295 1657.1195 2800.272 2857.400 10
fun1(n) 819.448 843.988 874.9445 853.9290 910.871 1006.582 10
fun2(n) 2.158 2.386 2.5990 2.6565 2.716 3.055 10
> n = 10000; microbenchmark(fun0(n), fun1(n), fun2(n), times=10)
Unit: microseconds
    expr        min         lq        mean      median         uq        max neval
 fun0(n) 157010.750 157276.699 169905.4745 159944.5715 192185.973 197389.965    10
 fun1(n)   8613.977   8630.599   9212.2207   9165.9300   9394.605  10299.821    10
 fun2(n)     19.296     19.384     20.7852     20.8595     21.868     22.435    10
shows the increasing importance of a sensible implementation as problem size increases.
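
Once the scaling is roughly clear, one hedged way to answer the original question of estimating run time without a full run (my own sketch, building on the timings above) is to fit the trend on small sizes and extrapolate; treat the result as a rough order-of-magnitude estimate only:

library(microbenchmark)
# time fun0() on a few small sizes and fit a power law: log(time) ~ log(n)
sizes <- c(1000, 2000, 4000, 8000)
times <- sapply(sizes, function(n)
  median(microbenchmark(fun0(n), times = 5L)$time))   # nanoseconds
fit <- lm(log(times) ~ log(sizes))
# extrapolated running time, in seconds, for a (hypothetical) size of 1e6
exp(predict(fit, data.frame(sizes = 1e6))) / 1e9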

Is there a shift operation in R?

In C++ there are shift operations:
>> right shift
<< left shift
These are considered to be very fast.
I tried to apply the same in R, but it seems to be wrong.
Is there a similar operation in R that is as fast as this?
Thanks in advance.
You can use bitwShiftL and bitwShiftR:
bitwShiftL(16, 2)
#[1] 64
bitwShiftR(16, 2)
#[1] 4
Here is the source code. Judging by the amount of additional code in these functions, and the fact that * and / are primitives, it is unlikely that these will be faster than multiplying or dividing by the equivalent power of two. On one of my VMs,
microbenchmark::microbenchmark(
bitwShiftL(16, 2),
16 * 4,
times = 1000L
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftL(16, 2) 1167 1353.5 2336.779 1604 2067 117880 1000 b
# 16 * 4 210 251.0 564.528 347 470 51885 1000 a
microbenchmark::microbenchmark(
bitwShiftR(16, 2),
16 / 4,
times = 1000L
)
# Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftR(16, 2) 1161 1238.5 1635.131 1388.5 1688.5 39225 1000 b
# 16/4 210 240.0 323.787 280.0 334.0 14284 1000 a
I should also point out that trying to micro-optimize an interpreted language is probably a waste of time. If performance is such a big concern that you are willing to split hairs over a couple of clock cycles, just write your program in C or C++ in the first place.
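
One caveat worth adding (my own note, not part of the answer above): bitwShiftL and bitwShiftR are vectorized, so when shifting a whole integer vector the call overhead is paid only once and the gap against * and / narrows; it is worth measuring on your own machine rather than assuming either way. A quick sketch:

# shift an entire integer vector and check it matches multiplication by 2^2
x <- sample.int(1000L, 1e5, replace = TRUE)
stopifnot(identical(bitwShiftL(x, 2L), x * 4L))
microbenchmark::microbenchmark(
  bitwShiftL(x, 2L),
  x * 4L,
  times = 1000L
)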

Why is it faster to evaluate in `j` than with `$` in a `data.table`?

Perhaps this is already answered and I missed it, but it's hard to search.
A very simple question: Why is dt[,x] generally a tiny bit faster than dt$x?
Example:
dt<-data.table(id=1:1e7,var=rnorm(1e6))
test<-microbenchmark(times=100L,
dt[sample(1e7,size=200000),var],
dt[sample(1e7,size=200000),]$var)
test[,"expr"]<-c("in j","$")
Unit: milliseconds
expr min lq mean median uq max neval
$ 14.28863 15.88779 18.84229 17.23109 18.41577 53.63473 100
in j 14.35916 15.97063 18.87265 17.99266 18.37939 54.19944 100
I might not have chosen the best example, so feel free to suggest something perhaps more poignant.
Anyway, evaluating in j is faster at least 75% of the time (though there appears to be a fat upper tail as the mean is higher; side note, it would be nice if microbenchmark could spit me out some histograms).
Why is this the case?
With j, you are subsetting and selecting within a call to [.data.table.
With $ (and your call), you are subsetting within [.data.table and then selecting with $
You are in essence calling 2 functions instead of 1, hence the (negligible) difference in timing.
In your current example you are also calling sample(1e7, size=200000) each time.
For the comparison to return identical results:
dt<-data.table(id=1:1e7,var=rnorm(1e6))
setkey(dt, id)
ii <- sample(1e7,size=200000)
microbenchmark("in j" = dt[.(ii),var], "$"=dt[.(ii)]$var, '[[' =dt[.(ii)][['var']], .subset2(dt[.(ii)],'var'), dt[.(ii)][[2]], dt[['var']][ii], dt$var[ii], .subset2(dt,'var')[ii] )
Unit: milliseconds
expr min lq mean median uq max neval cld
in j 39.491156 40.358669 41.570057 40.860342 41.485622 70.202441 100 b
$ 39.957211 40.561965 41.587420 41.136836 41.634584 69.928363 100 b
[[ 40.046558 40.515480 42.388432 41.244444 41.750946 72.224827 100 b
.subset2(dt[.(ii)], "var") 39.772781 40.564077 41.561271 41.111630 41.635489 69.252222 100 b
dt[.(ii)][[2]] 40.004300 40.513669 41.682526 40.927503 41.492866 72.986995 100 b
dt[["var"]][ii] 4.432346 4.546898 4.946219 4.623416 4.755777 31.761115 100 a
dt$var[ii] 4.440496 4.539502 4.668361 4.597457 4.729214 5.425125 100 a
.subset2(dt, "var")[ii] 4.365939 4.508261 4.660435 4.598815 4.703858 6.072289 100 a
