This question is related to two older questions on the same topic.
R has the nice wrapper-ish function anyNA() for quicker evaluation of any(is.na(x)). When working in Rcpp, a similar minimal implementation could be given by:
// CharacterVector example
#include <Rcpp.h>
using namespace Rcpp;
template <typename T, typename S>
bool any_na(S x){
    T xx = as<T>(x);
    for (auto i : xx) {
        if (T::is_na(i))
            return true;
    }
    return false;
}
// [[Rcpp::export(rng = false)]]
LogicalVector any_na(SEXP x){
    return any_na<CharacterVector>(x);
}
// [[Rcpp::export(rng = false)]]
SEXP overhead(SEXP x){
    CharacterVector xx = as<CharacterVector>(x);
    return wrap(xx);
}
/*** R
library(microbenchmark)
vec <- sample(letters, 1e6, TRUE)
vec[1e6] <- NA_character_
any_na(vec)
# [1] TRUE
*/
But comparing the performance of this to anyNA, I was surprised by the benchmark below:
library(microbenchmark)
microbenchmark(
    Rcpp = any_na(vec),
    R = anyNA(vec),
    overhead = overhead(vec),
    unit = "ms"
)
Unit: milliseconds
     expr      min        lq     mean    median       uq      max neval cld
     Rcpp 2.647901 2.8059500 3.243573 3.0435010 3.675051 5.899100   100   c
        R 0.800300 0.8151005 0.952301 0.8577015 0.961201 3.467402   100  b
 overhead 0.001300 0.0029010 0.011388 0.0122510 0.015751 0.048401   100 a
where the last line is the overhead incurred from converting back and forth between SEXP and CharacterVector (which turns out to be negligible). As is immediately evident, the Rcpp version is roughly 3.5 times slower than the R version. I was curious, so I checked the source of Rcpp's is_na and, finding no obvious reason there for the slow performance, went on to check the source of anyNA for R's own character vectors, and reimplemented the function using R's C API, hoping to speed things up:
// Added after SEXP overhead(SEXP x){ --- }
inline bool anyNA2(SEXP x){
    R_xlen_t n = Rf_length(x);
    for (R_xlen_t i = 0; i < n; i++) {
        if (STRING_ELT(x, i) == NA_STRING)
            return true;
    }
    return false;
}
// [[Rcpp::export(rng = false)]]
SEXP any_na2(SEXP x){
    bool xx = anyNA2(x);
    return wrap(xx);
}
// [[Rcpp::export(rng = false)]]
SEXP any_na3(SEXP x){
    Function anyNA("anyNA");
    return anyNA(x);
}
/*** R
microbenchmark(
    Rcpp = any_na(vec),
    R = anyNA(vec),
    R_C_api = any_na2(vec),
    Rcpp_Function = any_na3(vec),
    overhead = overhead(vec),
    unit = "ms"
)
# Unit: milliseconds
#           expr      min        lq       mean    median       uq      max neval cld
#           Rcpp 2.654901 2.8650515 3.54936501 3.2392510 3.997901 8.074201   100   d
#              R 0.803701 0.8303015 1.01017200 0.9400015 1.061751 2.019902   100  b
#        R_C_api 2.336402 2.4536510 3.01576302 2.7220010 3.314951 6.905101   100   c
#  Rcpp_Function 0.844001 0.8862510 1.09259990 0.9597505 1.120701 3.011801   100  b
#       overhead 0.001500 0.0071005 0.01459391 0.0146510 0.017651 0.101401   100 a
*/
Note that I've also included a simple wrapper calling anyNA through Rcpp::Function. Once again, this implementation of anyNA is not just a little but a lot slower than the base implementation.
So the question is twofold:
1. Why is the Rcpp code so much slower?
2. Derived from 1: how could it be changed to speed up the code?
The questions are not very interesting in themselves, but it is interesting if this affects multiple parts of Rcpp implementations that might, in aggregate, gain a significant performance boost.
Session info:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_Denmark.1252 LC_CTYPE=English_Denmark.1252 LC_MONETARY=English_Denmark.1252 LC_NUMERIC=C LC_TIME=English_Denmark.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 cmdline.arguments_0.0.1 glue_1.4.2 R6_2.5.0 Rcpp_1.0.6
loaded via a namespace (and not attached):
[1] codetools_0.2-18 lattice_0.20-41 mvtnorm_1.1-1 zoo_1.8-8 MASS_7.3-53 grid_4.0.3 multcomp_1.4-15 Matrix_1.2-18 sandwich_3.0-0 splines_4.0.3
[11] TH.data_1.0-10 tools_4.0.3 survival_3.2-7 compiler_4.0.3
Edit (not only a Windows problem):
I wanted to make sure this was not a "Windows problem", so I re-ran the benchmark inside a Docker container running Linux. The result, shown below, is very similar:
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rcpp 2.3399 2.62155 4.093380 3.12495 3.92155 26.2088 100
# R 0.7635 0.84415 1.459659 1.10350 1.42145 12.1148 100
# R_C_api 2.3358 2.56500 3.833955 3.11075 3.65925 14.2267 100
# Rcpp_Function 0.8163 0.96595 1.574403 1.27335 1.56730 11.9240 100
# overhead 0.0009 0.00530 0.013330 0.01195 0.01660 0.0824 100
Session info:
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 Rcpp_1.0.5
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2
This is an interesting question, but the answer is pretty simple: there are two versions of STRING_ELT, one used internally by R (or exposed if you set the USE_RINTERNALS macro), defined in Rinlinedfuns.h, and one for plebs, defined in memory.c.
Comparing the two versions, you can see that the pleb version has more checks, which fully accounts for the difference in speed.
If you really want speed and don't care about safety, you can usually beat R by at least a little bit.
// [[Rcpp::export(rng = false)]]
bool any_na_unsafe(SEXP x) {
    SEXP* ptr = STRING_PTR(x);
    R_xlen_t n = Rf_xlength(x);
    for (R_xlen_t i = 0; i < n; ++i) {
        if (ptr[i] == NA_STRING) return true;
    }
    return false;
}
Bench:
> microbenchmark(
+ R = anyNA(vec),
+ R_C_api = any_na2(vec),
+ unsafe = any_na_unsafe(vec),
+ unit = "ms"
+ )
Unit: milliseconds
    expr    min      lq     mean  median      uq     max neval
       R 0.5058 0.52830 0.553696 0.54000 0.55465  0.7758   100
 R_C_api 1.9990 2.05170 2.214136 2.06695 2.10220 12.2183   100
  unsafe 0.3170 0.33135 0.369585 0.35270 0.37730  1.2856   100
Although this is unsafe as written, adding a few checks before the loop would make it fine.
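For instance, here is a sketch of the kind of checks meant (any_na_checked is my own name; the ALTREP guard assumes R >= 3.5.0, where a character vector may not have directly accessible storage):

#include <Rcpp.h>

// A sketch: the same fast loop, but only after verifying the input really
// is a character vector, and falling back to the checked accessor for
// ALTREP objects where direct pointer access would be invalid.
// [[Rcpp::export(rng = false)]]
bool any_na_checked(SEXP x) {
    if (TYPEOF(x) != STRSXP)
        Rcpp::stop("expected a character vector");
    R_xlen_t n = Rf_xlength(x);
    if (ALTREP(x)) {
        // no guaranteed contiguous storage: use the safe accessor
        for (R_xlen_t i = 0; i < n; ++i)
            if (STRING_ELT(x, i) == NA_STRING) return true;
        return false;
    }
    SEXP* ptr = STRING_PTR(x);    // safe now: standard vector storage
    for (R_xlen_t i = 0; i < n; ++i)
        if (ptr[i] == NA_STRING) return true;
    return false;
}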
This question turns out to be a good example of why some people rail and rant against microbenchmarks.
Baseline is a built-in primitive
The function to be beaten here is actually a primitive, which makes things a little tricky already:
> anyNA
function (x, recursive = FALSE) .Primitive("anyNA")
>
ALTREP puts a performance floor down
Next, a little experiment shows that the baseline function anyNA() never loops. We define a very short vector srt and a long vector lng, both containing an NA value. It turns out ... R is optimised via ALTREP, keeping a matching bit in the data structure headers, so the cost of checking is independent of length:
> srt <- c("A",NA_character_); lng <- c(rep("A", 1e6), NA_character_)
> microbenchmark(short=function(srt) { anyNA(srt) },
+ long=function(lng) { anyNA(lng) }, times=1000)
Unit: nanoseconds
  expr min lq   mean median uq   max neval cld
 short  48 50 69.324     51 53  5293  1000   a
  long  48 50 92.166     51 52 15494  1000   a
>
Note the units here (nanoseconds) and the time spent: we appear to be measuring the cost of looking at a single bit.
(Edit: Scratch that. A thinko of mine in a rush; see the comments.)
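(For what it's worth, the thinko is presumably that the expressions above wrap anyNA() in anonymous function definitions, so the benchmark times the creation of a closure rather than the NA check itself. A corrected sketch would call the function directly:)

microbenchmark(short = anyNA(srt), long = anyNA(lng), times = 1000)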
Rcpp functions have some small overhead
This is not new, and it is documented. If you look at the code generated by Rcpp Attributes, which conveniently gives us an R function with the same name as the C++ function we designate, you see that at least one other function call is involved, plus a baked-in try/catch layer, RNG setting (here turned off), and so on. That cannot be zero, but amortized against anything reasonable it neither matters nor shows up in measurements.
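For illustration, the wrapper that Rcpp Attributes generates for a function such as rcpp_any_na() looks roughly like this (a paraphrase of a typical RcppExports.cpp, not verbatim; with rng = false the usual RNGScope line is omitted):

// paraphrase of generated RcppExports.cpp code, not verbatim;
// BEGIN_RCPP / END_RCPP expand to the try/catch layer that converts
// C++ exceptions into R errors
RcppExport SEXP sourceCpp_rcpp_any_na(SEXP xSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::traits::input_parameter< SEXP >::type x(xSEXP);
    rcpp_result_gen = Rcpp::wrap(rcpp_any_na(x));
    return rcpp_result_gen;
END_RCPP
}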
Here, however, the exercise was set up to match a primitive function looking at one bit. It's a race one cannot win. So here is my final table:
> microbenchmark(anyNA = anyNA(vec), Rcpp_plain = rcpp_c_api(vec),
+ Rcpp_tmpl = rcpp_any_na(vec), Rcpp_altrep = rcpp_altrep(vec),
+ times = .... [TRUNCATED]
Unit: microseconds
        expr      min      lq     mean   median      uq      max neval cld
       anyNA  643.993  658.43  827.773  700.729  819.78  6280.85  5000 a
  Rcpp_plain 1916.188 1952.55 2168.708 2022.017 2191.64  8506.71  5000    d
   Rcpp_tmpl 1709.380 1743.04 1933.043 1798.788 1947.83  8176.10  5000   c
 Rcpp_altrep 1501.148 1533.88 1741.465 1590.572 1744.74 10584.93  5000  b
It contains the primitive R function; the original (templated) C++ function, which still looks pretty good; a version using Rcpp (and its small overhead) with just C API calls (plus the automatic wrappers in/out), a little slower; and then, for comparison, a function from Michel's checkmate package which does look at the ALTREP bit. It is barely faster.
So really, what we are looking at here is overhead from function calls getting in the way of measuring a micro-operation. So no, Rcpp cannot be made faster than a highly optimised primitive. The question looked interesting but was, at the end of the day, somewhat ill-posed. Sometimes it is worth working through that.
My code version follows below.
// CharacterVector example
#include <Rcpp.h>
using namespace Rcpp;
template <typename T, typename S>
bool any_na(S x){
    T xx = as<T>(x);
    for (auto i : xx){
        if (T::is_na(i))
            return true;
    }
    return false;
}
// [[Rcpp::export(rng = false)]]
LogicalVector rcpp_any_na(SEXP x){
    return any_na<CharacterVector>(x);
}
// [[Rcpp::export(rng = false)]]
SEXP overhead(SEXP x){
    CharacterVector xx = as<CharacterVector>(x);
    return wrap(xx);
}
// [[Rcpp::export(rng = false)]]
bool rcpp_c_api(SEXP x) {
    R_xlen_t n = Rf_length(x);
    for (R_xlen_t i = 0; i < n; i++) {
        if (STRING_ELT(x, i) == NA_STRING)
            return true;
    }
    return false;
}
// [[Rcpp::export(rng = false)]]
SEXP any_na3(SEXP x){
    Function anyNA("anyNA");
    return anyNA(x);
}
// courtesy of the checkmate package
// [[Rcpp::export(rng = false)]]
R_xlen_t rcpp_altrep(SEXP x) {
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 5, 0)
    if (STRING_NO_NA(x))
        return 0;
#endif
    const R_xlen_t nx = Rf_xlength(x);
    for (R_xlen_t i = 0; i < nx; i++) {
        if (STRING_ELT(x, i) == NA_STRING)
            return i + 1;
    }
    return 0;
}
/*** R
library(microbenchmark)
srt <- c("A",NA_character_)
lng <- c(rep("A", 1e6), NA_character_)
microbenchmark(short = function(srt) { anyNA(srt) },
               long = function(lng) { anyNA(lng) },
               times = 1000)
N <- 1e6
vec <- sample(letters, N, TRUE)
vec[N] <- NA_character_
anyNA(vec) # to check
microbenchmark(
    anyNA = anyNA(vec),
    Rcpp_plain = rcpp_c_api(vec),
    Rcpp_tmpl = rcpp_any_na(vec),
    Rcpp_altrep = rcpp_altrep(vec),
    # Rcpp_Function = any_na3(vec),
    # overhead = overhead(vec),
    times = 5000
    # unit = "relative"
)
*/
Related
I am currently working with a very large array of dimension 5663x1000x100 in R. I would like to get 100 maximum values: the maximum of each individual 5663x1000 matrix.
big_array = array(data=rnorm(566300000),dim=c(5663,1000,100))
Two methods I have tried so far include a for loop and apply (which intuitively should not be the fastest methods).
maximas = rep(0,100)
# Method 1 - Runs in 17 seconds
for (i in seq(1, 100)) {
    maximas[i] = max(big_array[,,i])
}
# Method 2 - Runs in 36 seconds
apply(big_array,3,max)
I would think that, given the array data structure, there is an even faster way. I have considered pmax(), but from what I can see I would have to reshape my data, and since the array is almost 4 GB I do not want to create another object. This code is already part of code that is being parallelized, so I cannot parallelize it any further.
Any ideas would help greatly!
Why not just do that with Rcpp and RcppArmadillo? Try this
library(Rcpp)
library(RcppArmadillo)
cppFunction('NumericVector max_slice(const arma::cube& Q) {
    int n = Q.n_slices;
    NumericVector out(n);
    for (int i = 0; i < n; i++) {
        out[i] = Q.slice(i).max();
    }
    return out;
}', depends = "RcppArmadillo")
str(big_array)
max_slice(big_array)
Output
> str(big_array)
num [1:5663, 1:1000, 1:100] -0.282 -0.166 1.114 -0.447 -0.255 ...
> max_slice(big_array)
[1] 5.167835 4.837959 5.026354 5.211833 5.054781 5.785444 4.782578 5.169154 5.427360 5.271900 5.197460 4.994804 4.977396 5.093390 5.124796 5.221609
[17] 5.124122 4.857690 5.230277 5.217994 4.957608 5.060677 4.943275 5.382807 5.455486 5.226405 5.598238 4.942523 5.096521 5.000764 5.257607 4.843708
[33] 4.866905 5.125437 5.662431 5.224198 5.026749 5.349403 4.987372 5.228885 5.456373 5.576859 5.166118 5.124967 4.991101 5.210636 5.057471 5.005961
[49] 5.223063 5.182867 5.333683 5.528648 5.015871 4.837031 5.311825 4.981555 5.876951 5.145006 5.107017 5.252450 5.219044 5.310852 5.081958 5.210729
[65] 5.439197 5.034269 5.339251 5.567369 5.117237 5.382006 5.332199 5.032523 5.622024 5.008994 5.537377 5.279285 5.175870 5.056068 5.019422 5.616507
[81] 5.141175 4.948246 5.262170 4.961154 5.119193 4.908987 5.175458 5.328144 5.127913 5.816863 4.745966 5.507947 5.226849 5.247738 5.336941 5.134757
[97] 4.899032 5.067129 5.615639 5.118519
Benchmark
cppFunction('NumericVector max_slice(const arma::cube& Q) {
    int n = Q.n_slices;
    NumericVector out(n);
    for (int i = 0; i < n; i++) {
        out[i] = Q.slice(i).max();
    }
    return out;
}', depends = "RcppArmadillo")
max_vapply <- function(x) vapply(seq_len(dim(x)[3]), function(i) max(x[,,i]), numeric(1))
microbenchmark::microbenchmark(
    max_vapply(big_array), max_slice(big_array),
    times = 5L
)
Result
Unit: milliseconds
                  expr       min        lq      mean   median        uq       max neval cld
 max_vapply(big_array) 4735.7055 4789.6901 5159.8319 5380.784 5428.8319 5464.1480     5   b
  max_slice(big_array)  724.8582  742.0412  800.8939  747.811  833.2658  956.4935     5  a
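Since R stores arrays in column-major order, each 5663x1000 slice is one contiguous block of memory, so a plain-Rcpp variant without Armadillo is also possible. A sketch (max_slice_plain and its explicit nslice argument are my own; it assumes the data contain no NA values, as with rnorm output):

#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// Each slice is a contiguous block of length size/nslice in column-major
// order; scan each block for its maximum.
// [[Rcpp::export]]
NumericVector max_slice_plain(NumericVector a, int nslice) {
    R_xlen_t slice_len = a.size() / nslice;
    NumericVector out(nslice);
    for (int s = 0; s < nslice; ++s) {
        const double* p = a.begin() + s * slice_len;
        double m = R_NegInf;
        for (R_xlen_t j = 0; j < slice_len; ++j)
            m = std::max(m, p[j]);
        out[s] = m;
    }
    return out;
}

It would be called as max_slice_plain(big_array, 100); the dim attribute is ignored, only the memory layout is used.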
My Mac's R is linked against OpenBLAS. When I look at the "% CPU" usage while performing sparse-sparse multiplication in R, or in Rcpp via Armadillo, it doesn't seem like multithreading is being used, unlike for dense-dense multiplication. Speed-wise, single-threaded sparse-sparse multiplication in R or Armadillo also seems slower than Matlab.
To address this issue, I have implemented F.G. Gustavson's algorithm (https://dl.acm.org/citation.cfm?id=355796) for sparse-sparse matrix multiplication in Rcpp, using Armadillo's spMat container.
I can see an improvement (please see below) if I ignore the ordering of the rows, which is a direct implementation of the algorithm; however, enforcing the standard ordering makes it slower than R's (edited as per mtall's comment). I am not an expert in Rcpp/RcppArmadillo/C++, and I am looking for help with two specific things:
1. Programmatically, how can I make the sp_sp_gc_ord function more efficient and faster as a single-threaded implementation?
2. My lame attempt at multithreading sp_sp_gc_ord with OpenMP is causing R to crash. I have commented out the omp commands below. I have looked at the Rcpp Gallery discussions on OpenMP (http://gallery.rcpp.org/tags/openmp/) but couldn't figure out the problem. (A sketch addressing this follows after the session info at the end of this question.)
I would appreciate any help. Below is a reproducible example of the code and the corresponding microbenchmark:
#### Rcpp functions
#include <RcppArmadillo.h>
#include <omp.h>
#include <Rcpp.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
sp_mat sp_sp_gc_ord(const arma::sp_mat &A, const arma::sp_mat &B, double p){
    // This function evaluates A * B where both A & B are sparse and the
    // resultant product is also sparse

    // define matrix sizes
    const int mA = A.n_rows;
    const int nB = B.n_cols;

    // number of non-zeros in the resultant matrix
    const int nnzC = ceil(mA * nB * p);

    // initialize colptr, row_index and value vectors for the resultant sparse matrix
    urowvec colptrC(nB + 1);
    colptrC.zeros();
    uvec rowvalC(nnzC);
    rowvalC.zeros();
    colvec nzvalC(nnzC);

    //setenv("OMP_STACKSIZE","500M",1);

    // counters and other variables
    unsigned int i, jp, j, kp, k, vp;
    unsigned int ip = 0;
    double nzB, nzA;
    ivec xb(mA);
    xb.fill(-1);
    vec x(mA);

    // loop logic: outer loop over columns of B and inner loop over columns of A, then aggregate
    // #pragma omp parallel for shared(colptrC,rowvalC,nzvalC,x,xb,ip,A,B) private(j,nzA,nzB,kp,i,jp,kp,k,vp) default(none) schedule(auto)
    for (i = 0; i < nB; i++) {
        colptrC.at(i) = ip;
        for (jp = B.col_ptrs[i]; jp < B.col_ptrs[i+1]; jp++) {
            j = B.row_indices[jp];
            nzB = B.values[jp];
            for (kp = A.col_ptrs[j]; kp < A.col_ptrs[j+1]; kp++) {
                k = A.row_indices[kp];
                nzA = A.values[kp];
                if (xb.at(k) != i) {
                    rowvalC.at(ip) = k;
                    ip += 1;
                    // Rcpp::print(wrap(ip));
                    xb.at(k) = i;
                    x.at(k) = nzA * nzB;
                } else {
                    x.at(k) += nzA * nzB;
                }
            }
        }
        // put in the value vector of the resultant matrix
        if (ip > 0) {
            for (vp = colptrC.at(i); vp <= (ip - 1); vp++) {
                nzvalC.at(vp) = x(rowvalC.at(vp));
            }
        }
    }

    // resize and put in the spMat container
    colptrC.at(nB) = ip;
    sp_mat C(rowvalC.subvec(0, (ip-1)), colptrC, nzvalC.subvec(0, (ip-1)), mA, nB);

    // Gustavson's algorithm produces unordered rows within each column;
    // a standard way to address this is (X.t()).t()
    return (C.t()).t();
}

// [[Rcpp::export]]
sp_mat sp_sp_arma(const sp_mat &A, const sp_mat &B){
    return A * B;
}

// [[Rcpp::export]]
mat dense_dense_arma(const mat &A, const mat &B){
    return A * B;
}
#### End
The corresponding microbenchmark part in R:
#### Microbenchmark
library(Matrix)
library(microbenchmark)

## define two matrices
m <- 1000
n <- 6000
p <- 2000
A <- matrix(runif(m*n), m, n)
B <- matrix(runif(n*p), n, p)
A[abs(A) > .01] = B[abs(B) > .01] = 0
A <- as(A, 'dgCMatrix')
B <- as(B, 'dgCMatrix')
Adense <- as.matrix(A)
Bdense <- as.matrix(B)

## sp_sp_gc is the function without ordering
microbenchmark(sp_sp_gc(A, B, .5), sp_sp_gc_ord(A, B, .5), sp_sp_arma(A, B), A %*% B,
               dense_dense_arma(Adense, Bdense), Adense %*% Bdense, Adense %*% B, times = 100)
Unit: milliseconds
                             expr       min        lq      mean    median        uq       max neval
              sp_sp_gc(A, B, 0.5)  16.09809  21.75001  25.76436  24.44657  26.96300  99.30778   100
          sp_sp_gc_ord(A, B, 0.5)  36.78781  44.64558  49.82102  47.64348  51.87361 116.85013   100
                 sp_sp_arma(A, B)  47.45203  52.77132  59.37077  59.24010  62.41710  86.15647   100
                          A %*% B  23.64307  28.99649  32.88566  32.10017  35.21816  59.16251   100
 dense_dense_arma(Adense, Bdense) 286.22358 302.95170 345.66766 317.75786 340.50143 862.15116   100
                Adense %*% Bdense 292.32099 317.10795 342.48345 329.80950 342.21333 697.56468   100
                     Adense %*% B 167.87248 186.63499 219.11872 195.19197 212.50286 843.17172   100
####
sessionInfo():
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /usr/local/Cellar/openblas/0.3.3/lib/libopenblas_haswellp-r0.3.3.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.2-14 RcppArmadillo_0.8.500.0 Rcpp_0.12.18
loaded via a namespace (and not attached):
[1] compiler_3.5.1 grid_3.5.1 lattice_0.20-35
Rcpp and RcppArmadillo were installed from source after installing clang4 for macOS, following coatless's guide: https://github.com/coatless/r-macos-rtools
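Regarding question 2, a sketch of a likely explanation and a way out, under two assumptions: first, the shared ip counter makes the columns serially dependent (each column's writes start where the previous column's ended), so the loop cannot be parallelized as written; second, the commented-out Rcpp::print(wrap(ip)) would call into the R API from worker threads, which is never thread-safe. The usual remedy for Gustavson's algorithm is a two-pass scheme: a first parallel pass counts each column's non-zeros using only thread-private workspaces, a prefix sum over the counts yields colptrC, and a second parallel pass fills each column's rows and values independently. Pass 1 could look like this (count_nnz_per_col is my own name):

#include <RcppArmadillo.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#include <vector>
// [[Rcpp::plugins(openmp)]]
// [[Rcpp::depends(RcppArmadillo)]]

// Count the non-zeros of each column of C = A*B in parallel. Each thread
// keeps its own marker array, and column i writes only counts[i], so no
// state is shared and no R API is touched inside the parallel region.
// [[Rcpp::export]]
arma::uvec count_nnz_per_col(const arma::sp_mat& A, const arma::sp_mat& B) {
    const arma::uword mA = A.n_rows;
    const arma::uword nB = B.n_cols;
    arma::uvec counts(nB, arma::fill::zeros);
    #pragma omp parallel
    {
        std::vector<long long> mark(mA, -1);  // last column that touched each row
        #pragma omp for schedule(dynamic, 64)
        for (long long i = 0; i < (long long) nB; ++i) {
            arma::uword cnt = 0;
            for (arma::uword jp = B.col_ptrs[i]; jp < B.col_ptrs[i + 1]; ++jp) {
                const arma::uword j = B.row_indices[jp];
                for (arma::uword kp = A.col_ptrs[j]; kp < A.col_ptrs[j + 1]; ++kp) {
                    const arma::uword k = A.row_indices[kp];
                    if (mark[k] != i) { mark[k] = i; ++cnt; }
                }
            }
            counts[i] = cnt;
        }
    }
    return counts;
}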
I am wondering if I can apply lgamma to all entries of a large matrix using Rcpp. I tried using a vector:
// lgammaRcpp.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector lgammaRcpp(NumericVector v){
    NumericVector out;
    out = lgamma(v);
    return(out);
}
I did some simple microbenchmarking:
library("microbenchmark")
x <- round(runif(100000)+50000);
microbenchmark(
    lgammaRcpp(x),
    lgamma(x)
)
and the Rcpp version is slightly faster:
Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval
 lgammaRcpp(x) 5.405556 5.416283 5.810254 5.436139 5.511993 8.650419   100
     lgamma(x) 5.613717 5.628769 6.114942 5.644215 6.872677 9.947497   100
When I try using a "NumericMatrix", however:
// [[Rcpp::export]]
NumericMatrix lgammaRcpp(NumericMatrix v){
    NumericMatrix out;
    out = lgamma(v);
    return(out);
}
there are errors that I don't understand, e.g.
/home/canghel/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include/Rcpp/vector/Matrix.h:83:13: note: Rcpp::Matrix<RTYPE, StoragePolicy>& Rcpp::Matrix<RTYPE, StoragePolicy>::operator=(const Rcpp::Matrix<RTYPE, StoragePolicy>&) [with int RTYPE = 14; StoragePolicy = Rcpp::PreserveStorage]
     Matrix& operator=(const Matrix& other) {
My questions are: 1) Is there a way to modify my function to apply lgamma over all entries of a matrix? And 2) is it worth it, or is the underlying library called for the lgamma function the same for C++ and R?
It seems better (i.e. faster) to apply functions like lgamma/digamma to a matrix using the Rfast package.
library("microbenchmark");
library("RcppArmadillo");
library("Rfast");
sourceCpp("lgammaRcpp.cpp");
x <- matrix(round(runif(100000)+50000), 100, 1000);
microbenchmark(
    lgammaRcpp(x),
    lgamma(x),
    Rfast::Lgamma(x)
)
Unit: milliseconds
              expr      min       lq     mean   median       uq      max neval
 lgammaRcppArma(x) 4.654526 4.919831 5.577843 5.413790 5.888895 9.258325   100
         lgamma(x) 5.572671 5.840268 6.582007 6.131651 7.280895 8.779301   100
  Rfast::Lgamma(x) 4.450824 4.588596 5.128323 4.791287 5.608678 6.865331   100
where I had:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat lgammaRcpp(arma::mat m) {
    arma::mat out = lgamma(m);
    return(out);
}
Rcpp sugar tends to return vectors unless otherwise specified. Thus, in this case you will always get back a vector of type Numeric, i.e. a NumericVector. See my notes on the different sugar functions here: https://github.com/coatless/rcpp-api
Per the note above, the following compiles:
#include <Rcpp.h>

// [[Rcpp::export]]
NumericVector lgammaRcpp(NumericMatrix v) {
    NumericVector out;
    out = lgamma(v);
    return(out);
}
It is highly unlikely you will see a large speedup, as the underlying functions being called are the same. This is partially indicated by your benchmarks above and can be verified by looking at the Rcpp math defines. That is not to say no benefit is available. In particular, the main benefit comes when you encapsulate a routine completely in C++; in that case, your routine will be significantly quicker using sugar functions than calling an R function from C++.
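If a matrix result is desired at the R level, one way (a sketch; lgamma_mat is my own name) is to compute with the sugar function and then restore the dim attribute:

#include <Rcpp.h>
using namespace Rcpp;

// The lgamma() sugar call yields a plain vector; copying the input's
// dimensions onto it gives back a matrix of the same shape.
// [[Rcpp::export]]
NumericMatrix lgamma_mat(NumericMatrix v) {
    NumericVector out = lgamma(v);
    out.attr("dim") = Dimension(v.nrow(), v.ncol());
    return as<NumericMatrix>(out);
}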
I have been working on a package that uses Rcpp to apply arbitrary R code over a group of large medical imaging files. I noticed that my Rcpp implementation is considerably slower than the original pure C version. I traced the difference to calling a function via Function versus the original Rf_eval. My question is: why is there a close to 4x performance degradation, and is there a way to speed up the function call so it is closer in performance to Rf_eval?
Example:
library(Rcpp)
library(inline)
library(microbenchmark)
cpp_fun1 <-
'
Rcpp::List lots_of_calls(Function fun, NumericVector vec){
    Rcpp::List output(1000);
    for (int i = 0; i < 1000; ++i){
        output[i] = fun(NumericVector(vec));
    }
    return output;
}
'
cpp_fun2 <-
'
Rcpp::List lots_of_calls2(SEXP fun, SEXP env){
    Rcpp::List output(1000);
    for (int i = 0; i < 1000; ++i){
        output[i] = Rf_eval(fun, env);
    }
    return output;
}
'
lots_of_calls <- cppFunction(cpp_fun1)
lots_of_calls2 <- cppFunction(cpp_fun2)
microbenchmark(lots_of_calls(mean, 1:1000),
               lots_of_calls2(quote(mean(1:1000)), .GlobalEnv))
Results
Unit: milliseconds
                                            expr      min       lq     mean   median       uq      max neval
                     lots_of_calls(mean, 1:1000) 38.23032 38.80177 40.84901 39.29197 41.62786 54.07380   100
 lots_of_calls2(quote(mean(1:1000)), .GlobalEnv) 10.53133 10.71938 11.08735 10.83436 11.03759 18.08466   100
Rcpp is great because it makes things look absurdly clean to the programmer. That cleanliness has a cost in the form of templated responses and a set of assumptions that weigh down the execution time. But such is the case with generalized versus specific code.
Take, for instance, the call route for an Rcpp::Function. The initial construction, and then the outside call to a modified version of Rf_eval, requires the special Rcpp-specific eval function given in Rcpp_eval.h. In turn, this function is wrapped in protections to guard against a function error when calling into R, via a Shield associated with it. And so on...
In comparison, Rf_eval has none of that. If it fails, you will be up the creek without a paddle. (Unless, of course, you implement error catching via R_tryEval for it.)
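A minimal sketch of that error-catching route (eval_safely is a hypothetical name):

#include <Rcpp.h>

// Rf_eval with error catching: R_tryEval sets a flag instead of letting an
// R error longjmp through the C++ stack frames.
// [[Rcpp::export]]
SEXP eval_safely(SEXP expr, SEXP env) {
    int error_occurred = 0;
    SEXP res = R_tryEval(expr, env, &error_occurred);
    if (error_occurred)
        Rcpp::stop("evaluation failed");
    return res;
}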
With this being said, the best way to speed up the calculation is to simply write everything necessary for the computation in C++.
Besides the points made by @coatless, you aren't even comparing apples with apples. Your Rf_eval example does not pass the vector to the function and, more importantly, plays tricks on the function via quote().
In short, it is all a little silly.
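For a somewhat fairer apples-to-apples setup, one could build the call with the vector actually in it (a sketch; callRfEval2 is my own name), invoked as callRfEval2(mean, 1:1000, .GlobalEnv):

#include <Rcpp.h>
using namespace Rcpp;

// Build the call fun(vec) once via Rf_lang2, then evaluate it repeatedly;
// unlike the quote() version, the vector is now actually passed to fun.
// [[Rcpp::export]]
List callRfEval2(SEXP fun, SEXP vec, SEXP env) {
    SEXP call = PROTECT(Rf_lang2(fun, vec));
    List output(1000);
    for (int i = 0; i < 1000; ++i)
        output[i] = Rf_eval(call, env);
    UNPROTECT(1);
    return output;
}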
Below is a more complete example using the sugar function mean().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List callFun(Function fun, NumericVector vec) {
    List output(1000);
    for (int i = 0; i < 1000; ++i){
        output[i] = fun(NumericVector(vec));
    }
    return output;
}
// [[Rcpp::export]]
List callRfEval(SEXP fun, SEXP env){
    List output(1000);
    for (int i = 0; i < 1000; ++i){
        output[i] = Rf_eval(fun, env);
    }
    return output;
}
// [[Rcpp::export]]
List callSugar(NumericVector vec) {
    List output(1000);
    for (int i = 0; i < 1000; ++i){
        double d = mean(vec);
        output[i] = d;
    }
    return output;
}
/*** R
library(microbenchmark)
microbenchmark(callFun(mean, 1:1000),
               callRfEval(quote(mean(1:1000)), .GlobalEnv),
               callSugar(1:1000))
*/
You can just sourceCpp() this:
R> sourceCpp("/tmp/ch.cpp")
R> library(microbenchmark)
R> microbenchmark(callFun(mean, 1:1000),
+ callRfEval(quote(mean(1:1000)), .GlobalEnv),
+ callSugar(1:1000))
Unit: milliseconds
                                        expr      min       lq     mean   median       uq       max neval
                       callFun(mean, 1:1000) 14.87451 15.54385 18.57635 17.78990 18.29127 114.77153   100
 callRfEval(quote(mean(1:1000)), .GlobalEnv)  3.35954  3.57554  3.97380  3.75122  4.16450   6.29339   100
                           callSugar(1:1000)  1.50061  1.50827  1.62204  1.51518  1.76683   1.84513   100
R>
Using Rcpp, I am trying to test for NA in a POSIXct vector passed to C++ (class DatetimeVector). It seems that the Rcpp::is_na(.) function works for NumericVector, CharacterVector, etc., but not for DatetimeVector.
Here is C++ code that tests NA for a NumericVector and a CharacterVector, but fails to compile if you uncomment the DatetimeVector test:
#include <Rcpp.h>
using namespace std;
using namespace Rcpp;
//[[Rcpp::export]]
List testNA(DataFrame df){
    const int N = df.nrows();

    // Test for NA in an IntegerVector
    IntegerVector intV = df["intV"];
    LogicalVector resInt = is_na(intV);

    // Test for NA in a CharacterVector
    CharacterVector strV = df["strV"];
    LogicalVector resStr = is_na(strV);

    // Test for NA in a DatetimeVector
    DatetimeVector dtV = df["dtV"];
    LogicalVector resDT;
    // resDT = is_na(dtV); // UNCOMMENT => DOES NOT COMPILE

    return(List::create(_["df"] = df,
                        _["resInt"] = resInt,
                        _["resStr"] = resStr,
                        _["resDT"] = resDT));
}
/*** R
cat("testing for NA\n")
intV <- c(1,NA,2)
df <- data.frame(intV=intV, strV=as.character(intV), dtV=as.POSIXct(intV,origin='1970-01-01'))
str(df)
testNA(df)
*/
In R
library("Rcpp")
sourceCpp("theCodeAbove.cpp")
I've added (in rev 4405 of Rcpp) implementations of is_na for DateVector and DatetimeVector that don't need the cast to a NumericVector, which created a temporary object we don't actually need.
However, we don't gain much performance, because most of the time is spent constructing the DatetimeVector objects.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalVector isna_cast(DatetimeVector d){
    // version with the cast
    return is_na(as<NumericVector>(d));
}

// [[Rcpp::export]]
LogicalVector isna(DatetimeVector d){
    // without the cast
    return is_na(d);
}

// [[Rcpp::export]]
void do_nothing(DatetimeVector d){
    // just measuring the time it takes to
    // create a DatetimeVector from an R object
}
Benchmarking this with microbenchmark :
require(microbenchmark)
intV <- rep(c(1, NA, 2), 100000)
dtV <- as.POSIXct(intV, origin = '1970-01-01')
microbenchmark(
    isna_cast(dtV),
    isna(dtV),
    do_nothing(dtV)
)
# Unit: milliseconds
#            expr      min       lq   median       uq      max neval
#  isna_cast(dtV) 67.03146 68.04593 68.71991 69.39960 96.46747   100
#       isna(dtV) 65.71262 66.43674 66.77992 67.16535 95.93567   100
# do_nothing(dtV) 57.15901 57.72670 58.08646 58.39948 58.97939   100
About 85% of the time is used just to create the DatetimeVector object. This is because the DatetimeVector and DateVector classes don't use the proxy design we use everywhere else in Rcpp: a DatetimeVector is essentially a std::vector<Datetime>, and each of these Datetime objects is created from the corresponding element of the underlying R object.
It is probably too late to change the API of DatetimeVector and DateVector to make them proxy based, but maybe there is room for something like a POSIXct class.
In comparison, let's measure the time it takes to do nothing with a NumericVector:
// [[Rcpp::export]]
void do_nothing_NumericVector( NumericVector d){}
# Unit: microseconds
#                      expr      min         lq     median        uq       max
#            isna_cast(dtV) 66985.21 68103.0060 68960.7880 69416.227 95724.385
#                 isna(dtV) 65699.72 66544.9935 66893.5720 67213.064 95262.267
#           do_nothing(dtV) 57209.26 57865.1140 58306.8780 58630.236 69897.636
#  do_nothing_numeric(intV)     4.22     9.6095    15.2425    15.511    33.978
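That comparison also suggests a near-zero-cost workaround (a sketch; isna_numeric is my own name): since POSIXct is just a classed numeric vector, taking the argument as a NumericVector avoids constructing Datetime objects altogether, so is_na() should then run at roughly plain-numeric speed:

#include <Rcpp.h>
using namespace Rcpp;

// POSIXct is stored as doubles underneath; a NumericVector wraps the same
// memory without per-element Datetime construction.
// [[Rcpp::export]]
LogicalVector isna_numeric(NumericVector d) {
    return is_na(d);
}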
The compiler error suggests the method is not (yet?) available for DatetimeVector:
test.cpp:18:13: error: no matching function for call to 'is_na'
An easy workaround:
resDT = is_na( as<NumericVector>(dtV) ); // As per Dirk's suggestion