I have been working on a package that uses Rcpp to apply arbitrary R code over a group of large medical imaging files. I noticed that my Rcpp implementation is considerably slower than the original pure C version. I traced the difference to calling a function via Rcpp::Function versus the original Rf_eval. My question is: why is there a close-to-4x performance degradation, and is there a way to speed up the function call to be closer in performance to Rf_eval?
Example:
library(Rcpp)
library(inline)
library(microbenchmark)
cpp_fun1 <-
'
Rcpp::List lots_of_calls(Function fun, NumericVector vec){
    Rcpp::List output(1000);
    for(int i = 0; i < 1000; ++i){
        output[i] = fun(NumericVector(vec));
    }
    return output;
}
'
cpp_fun2 <-
'
Rcpp::List lots_of_calls2(SEXP fun, SEXP env){
    Rcpp::List output(1000);
    for(int i = 0; i < 1000; ++i){
        output[i] = Rf_eval(fun, env);
    }
    return output;
}
'
lots_of_calls <- cppFunction(cpp_fun1)
lots_of_calls2 <- cppFunction(cpp_fun2)
microbenchmark(lots_of_calls(mean, 1:1000),
               lots_of_calls2(quote(mean(1:1000)), .GlobalEnv))
Results
Unit: milliseconds
expr min lq mean median uq max neval
lots_of_calls(mean, 1:1000) 38.23032 38.80177 40.84901 39.29197 41.62786 54.07380 100
lots_of_calls2(quote(mean(1:1000)), .GlobalEnv) 10.53133 10.71938 11.08735 10.83436 11.03759 18.08466 100
Rcpp is great because it makes things look absurdly clean to the programmer. That cleanliness has a cost in the form of generalized template machinery and a set of assumptions that weigh down execution time. But such is the trade-off between generalized and specialized code.
Take, for instance, the call route for an Rcpp::Function. The initial construction and the subsequent call go through a modified version of Rf_eval, the Rcpp-specific eval function defined in Rcpp_eval.h. That function is in turn wrapped in protection machinery, via a Shield, to guard against errors raised when calling into R. And so on...
In comparison, Rf_eval has none of this. If it fails, you will be up the creek without a paddle. (Unless, of course, you implement error catching via R_tryEval.)
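For completeness, here is a sketch of what such error catching could look like, using the C-level R_tryEval in place of the bare Rf_eval (the name lots_of_calls3 is made up for illustration):
#include <Rcpp.h>
// Sketch only: same loop as lots_of_calls2, but each evaluation goes
// through R_tryEval so an R-level error is caught instead of aborting.
// [[Rcpp::export]]
Rcpp::List lots_of_calls3(SEXP fun, SEXP env) {
    Rcpp::List output(1000);
    for (int i = 0; i < 1000; ++i) {
        int error_occurred = 0;
        SEXP res = R_tryEval(fun, env, &error_occurred);
        if (error_occurred) {
            Rcpp::stop("error while evaluating the R expression");
        }
        output[i] = res;
    }
    return output;
}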
With this being said, the best way to speed up the calculation is to simply write everything necessary for the computation in C++.
Besides the points made by @coatless, you aren't even comparing apples with apples. Your Rf_eval example does not pass the vector to the function and, more importantly, plays tricks on the function via quote().
In short, it is all a little silly.
Below is a more complete example using the sugar function mean().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List callFun(Function fun, NumericVector vec) {
    List output(1000);
    for(int i = 0; i < 1000; ++i){
        output[i] = fun(NumericVector(vec));
    }
    return output;
}

// [[Rcpp::export]]
List callRfEval(SEXP fun, SEXP env){
    List output(1000);
    for(int i = 0; i < 1000; ++i){
        output[i] = Rf_eval(fun, env);
    }
    return output;
}

// [[Rcpp::export]]
List callSugar(NumericVector vec) {
    List output(1000);
    for(int i = 0; i < 1000; ++i){
        double d = mean(vec);
        output[i] = d;
    }
    return output;
}

/*** R
library(microbenchmark)
microbenchmark(callFun(mean, 1:1000),
               callRfEval(quote(mean(1:1000)), .GlobalEnv),
               callSugar(1:1000))
*/
You can just sourceCpp() this:
R> sourceCpp("/tmp/ch.cpp")
R> library(microbenchmark)
R> microbenchmark(callFun(mean, 1:1000),
+ callRfEval(quote(mean(1:1000)), .GlobalEnv),
+ callSugar(1:1000))
Unit: milliseconds
expr min lq mean median uq max neval
callFun(mean, 1:1000) 14.87451 15.54385 18.57635 17.78990 18.29127 114.77153 100
callRfEval(quote(mean(1:1000)), .GlobalEnv) 3.35954 3.57554 3.97380 3.75122 4.16450 6.29339 100
callSugar(1:1000) 1.50061 1.50827 1.62204 1.51518 1.76683 1.84513 100
R>
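A middle ground, sketched here as an assumption rather than taken from either answer, is to build the call object once with Rf_lang2() and then evaluate it repeatedly with Rf_eval; this passes the vector properly, unlike the quote() version, while avoiding the per-call overhead of Rcpp::Function:
#include <Rcpp.h>
// Sketch: construct the call once, shield it from the garbage
// collector, and evaluate it repeatedly; callLang is a hypothetical name.
// [[Rcpp::export]]
Rcpp::List callLang(SEXP fun, Rcpp::NumericVector vec) {
    Rcpp::Shield<SEXP> call(Rf_lang2(fun, vec));
    Rcpp::List output(1000);
    for (int i = 0; i < 1000; ++i) {
        output[i] = Rf_eval(call, R_GlobalEnv);
    }
    return output;
}
You would call it from R as callLang(mean, 1:1000).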
Related
I am using R to create a simulation that takes samples from a random normal distribution, and not surprisingly, it's fairly slow. So, I looked for some ways to speed it up using Rcpp, and came across the RcppZiggurat package for faster random normal samples, and the RcppParallel package for multi-threaded computation, and thought, why not use both a faster algorithm and draw the samples in parallel?
So I started prototyping, and finally ended up with three methods to compare:
Samples using RcppParallel and RcppZiggurat together
Samples using just RcppZiggurat
Samples using good old rnorm
Below are my implementations using RcppParallel + RcppZiggurat (the parallelDraws function), and just RcppZiggurat (the serialDraws function):
#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>
static Ziggurat::Ziggurat::Ziggurat zigg;
using namespace RcppParallel;
struct Norm : public Worker
{
    int input;
    // saved draws
    RVector<double> draws;
    // constructor
    Norm(const int input, Rcpp::NumericVector draws)
        : input(input), draws(draws) {}
    // draw standard normals for the assigned range
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; i++) {
            draws[i] = zigg.norm();
        }
    }
};
// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {
    // allocate the output vector
    Rcpp::NumericVector draws(x);
    // declare the Norm instance
    Norm norm(x, draws);
    // call parallelFor to start the work
    parallelFor(0, x, norm);
    // return the draws
    return draws;
}
// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {
    // allocate the output vector
    Rcpp::NumericVector draws(x);
    for (int i = 0; i < x; i++) {
        draws[i] = zigg.norm();
    }
    // return the draws
    return draws;
}
When I benchmarked them, I found some surprising results:
library(microbenchmark)
microbenchmark(parallelDraws(1e5), serialDraws(1e5), rnorm(1e5))
Unit: microseconds
expr min lq mean median uq max neval
parallelDraws(1e+05) 3113.752 3539.686 3687.794 3599.1540 3943.282 5058.376 100
serialDraws(1e+05) 695.501 734.593 2536.940 757.2325 806.135 175712.496 100
rnorm(1e+05) 6072.043 6264.030 6655.835 6424.0195 6661.739 18578.669 100
Using RcppZiggurat alone was about 8x faster than rnorm, but using RcppParallel and RcppZiggurat together was slower than using RcppZiggurat alone! I tried playing with the grain size for RcppParallel's parallelFor, but it didn't result in any noticeable improvement.
My question is: What could be the reason why adding parallelism is actually worse here? I know that "overhead" in parallel computations can outweigh the benefits depending on various factors. Is that what is happening here? Or am I completely misunderstanding how to effectively use the RcppParallel package?
As mentioned in the comments, overhead can be problematic, especially when the overall runtime is short. It is also better not to initialize the output vector with zeros and to use thread-local RNGs. An example implementation:
#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>
using namespace RcppParallel;
struct Norm : public Worker
{
    // saved draws
    RVector<double> draws;
    // constructor
    Norm(Rcpp::NumericVector draws)
        : draws(draws) {}
    void operator()(std::size_t begin, std::size_t end) {
        // thread-local RNG, seeded per work chunk
        Ziggurat::Ziggurat::Ziggurat zigg(end);
        for (std::size_t i = begin; i < end; i++) {
            draws[i] = zigg.norm();
        }
    }
};

// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {
    // allocate the output vector without zero-initialization
    Rcpp::NumericVector draws(Rcpp::no_init(x));
    Norm norm(draws);
    parallelFor(0, x, norm);
    return draws;
}

// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {
    // allocate the output vector without zero-initialization
    Rcpp::NumericVector draws(Rcpp::no_init(x));
    Ziggurat::Ziggurat::Ziggurat zigg(42);
    for (int i = 0; i < x; i++) {
        draws[i] = zigg.norm();
    }
    return draws;
}
Note that I am using "poor man's parallel RNG", i.e. distinct seeds for the different threads, and hope for the best. I am using end as seed, since begin might be zero and I am not sure if the RNG in RcppZiggurat likes that. Since it takes some time (and memory) to create a Ziggurat object, I also use a local one for the serial computation to be fair.
For 10^5 random draws, there is still no gain from using parallel computation:
> bench::mark(parallelDraws(1e5), serialDraws(1e5), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 parallelDraws(1e+05) 1.08ms 1.78ms 558. 784KB
2 serialDraws(1e+05) 624.16µs 758.6µs 1315. 784KB
But for 10^8 draws I get a nice speed-up on my dual core laptop:
> bench::mark(parallelDraws(1e8), serialDraws(1e8), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 parallelDraws(1e+08) 326ms 343ms 2.91 763MB
2 serialDraws(1e+08) 757ms 770ms 1.30 763MB
So whether it makes sense to use parallel computation depends heavily on the number of random draws you need.
BTW, in the comments my dqrng package is mentioned. This package also uses the Ziggurat method for normal (and exponential) draws, combined with very fast 64-bit RNGs, giving it serial speed comparable to RcppZiggurat for normal draws. In addition, the RNGs it uses are meant for parallel computation, i.e. there is no need to hope that different seeds produce non-overlapping random streams.
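As a quick illustration (a sketch assuming the current dqrng R API), drawing normals with dqrng looks like this:
library(dqrng)
dqset.seed(42)        # seed the underlying 64-bit RNG
x <- dqrnorm(1e5)     # Ziggurat-based normal draws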
I am wondering if I can apply lgamma on all entries of a large matrix using Rcpp. I tried using a vector:
// lgammaRcpp.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector lgammaRcpp(NumericVector v){
    NumericVector out;
    out = lgamma(v);
    return(out);
}
I did a simple microbenchmarking:
library("microbenchmark")
x <- round(runif(100000)+50000);
microbenchmark(
lgammaRcpp(x),
lgamma(x)
)
and the Rcpp version is slightly faster:
Unit: milliseconds
expr min lq mean median uq max neval
lgammaRcpp(x) 5.405556 5.416283 5.810254 5.436139 5.511993 8.650419 100
lgamma(x) 5.613717 5.628769 6.114942 5.644215 6.872677 9.947497 100
When I try using a "NumericMatrix", however:
// [[Rcpp::export]]
NumericMatrix lgammaRcpp(NumericMatrix v){
    NumericMatrix out;
    out = lgamma(v);
    return(out);
}
there are errors that I don't understand, e.g.
/home/canghel/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include/Rcpp/vector/Matrix.h:83:13: note: Rcpp::Matrix<RTYPE, StoragePolicy>& Rcpp::Matrix<RTYPE, StoragePolicy>::operator=(const Rcpp::Matrix<RTYPE, StoragePolicy>&) [with int RTYPE = 14; StoragePolicy = Rcpp::PreserveStorage]
Matrix& operator=(const Matrix& other) {
My questions are: 1) Is there a way to modify my function to apply lgamma over all entries of a matrix? And 2) is it worth it, or is the underlying library called for the lgamma function the same in C++ and R?
It seems better (i.e. faster) to apply functions like lgamma/digamma to a matrix using the Rfast package.
library("microbenchmark");
library("RcppArmadillo");
library("Rfast");
sourceCpp("lgammaRcpp.cpp");
x <- matrix(round(runif(100000)+50000), 100, 1000);
microbenchmark(
lgammaRcppArma(x),
lgamma(x),
Rfast::Lgamma(x)
)
Unit: milliseconds
expr min lq mean median uq max neval
lgammaRcppArma(x) 4.654526 4.919831 5.577843 5.413790 5.888895 9.258325 100
lgamma(x) 5.572671 5.840268 6.582007 6.131651 7.280895 8.779301 100
Rfast::Lgamma(x) 4.450824 4.588596 5.128323 4.791287 5.608678 6.865331 100
where I had:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat lgammaRcppArma(arma::mat m) {
    arma::mat out = lgamma(m);
    return(out);
}
Rcpp sugar functions tend to return vectors unless otherwise specified. Thus, in this case you will always get back a vector of numeric type, i.e. a NumericVector. See my notes on the different sugar functions here: https://github.com/coatless/rcpp-api
The following allows for a compilation under the above note:
#include <Rcpp.h>
// [[Rcpp::export]]
NumericVector lgammaRcpp(NumericMatrix v) {
    NumericVector out;
    out = lgamma(v);
    return(out);
}
It is highly unlikely you will see a large speed-up, as the underlying functions being used are the same. This is partially indicated by your benchmarks above and can be verified by looking at the Rcpp math definitions. That isn't to say no benefit is available: the main benefit comes when you encapsulate a routine completely in C++, in which case your routine will be significantly quicker using sugar functions than calling an R function from C++.
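If you do want the matrix shape back at the R level, one approach, sketched here as one possibility rather than the only way, is to compute the vector result with sugar and then restore the dim attribute (lgammaRcppMat is a made-up name):
#include <Rcpp.h>
using namespace Rcpp;
// Sketch: sugar lgamma() returns a plain vector; copying the
// dimensions back turns the result into a matrix again.
// [[Rcpp::export]]
NumericMatrix lgammaRcppMat(NumericMatrix v) {
    NumericVector out = lgamma(v);
    out.attr("dim") = Dimension(v.nrow(), v.ncol());
    return as<NumericMatrix>(out);
}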
In his "Advanced R" book, Hadley Wickham says "noNA(x) asserts that the vector x does not contain any missing values." However I still don't know how to use it. I can't do
if (noNA(x))
do this
so how am I supposed to use it?
http://adv-r.had.co.nz/Rcpp.html#rcpp-sugar
Many of the Rcpp sugar expressions are implemented through template classes which have specializations for the case when the input object is known to be free of missing values. This allows the underlying algorithm to avoid the extra work of dealing with NA values (e.g. calls to is_na). It is only possible because the VectorBase class has a boolean template parameter indicating whether the underlying object can (can, not that it necessarily does) have NA values.
noNA returns (when called on a VectorBase object) an instance of the Nona template class. Note that Nona itself derives from
Rcpp::VectorBase<RTYPE, false, Nona<RTYPE,NA,VECTOR>>
// ^^^^^
meaning that the returned object gets encoded with information that essentially says "you can assume that this data is free of NA values".
As an example, Rcpp::sum is implemented via the Sum class in the Rcpp::sugar namespace. In the default case, we see that there is extra work to manage the possibility of missing values:
STORAGE get() const {
    STORAGE result = 0 ;
    R_xlen_t n = object.size() ;
    STORAGE current ;
    for( R_xlen_t i=0; i<n; i++){
        current = object[i] ;
        if( Rcpp::traits::is_na<RTYPE>(current) ) // here
            return Rcpp::traits::get_na<RTYPE>() ; // here
        result += current ;
    }
    return result ;
}
On the other hand, there is also a specialization for cases when the input does not have missing values, in which the algorithm does less work:
STORAGE get() const {
    STORAGE result = 0 ;
    R_xlen_t n = object.size() ;
    for( R_xlen_t i=0; i<n; i++){
        result += object[i] ;
    }
    return result ;
}
To answer your question of "how do I apply this in practice?", here is an example:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int Sum(IntegerVector x) {
    return sum(x);
}

// [[Rcpp::export]]
int SumNoNA(IntegerVector x) {
    return sum(noNA(x));
}
Benchmarking these two functions,
set.seed(123)
x <- as.integer(rpois(1e6, 25))
all.equal(Sum(x), SumNoNA(x))
# [1] TRUE
microbenchmark::microbenchmark(
Sum(x),
SumNoNA(x),
times = 500L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Sum(x) 577.386 664.620 701.2422 677.1640 731.7090 1214.447 500
# SumNoNA(x) 454.990 517.709 556.5783 535.1935 582.7065 1138.426 500
the noNA version is indeed faster.
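One caveat worth adding, which follows from the specializations shown above: noNA() is a promise, not a check. If the vector does contain missing values, the specialized code sums the NA bit patterns as ordinary data and returns a wrong answer instead of NA:
y <- c(1L, NA, 3L)
Sum(y)      # NA, because the default path checks is_na
SumNoNA(y)  # wrong result: NA's payload is summed as a plain integer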
I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.
Here is an example:
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::VectorXd;
typedef Map<VectorXd> MapVecd;
'
body <- '
const MapVecd x(as<MapVecd>(xx)), y(as<MapVecd>(yy));
VectorXd x1(x), y1(y);
int k(0);
for (int i = 0; i < x.rows(); ++i) {
    if (x.coeff(i) == x.coeff(i) && y.coeff(i) == y.coeff(i)) {
        x1(k) = x.coeff(i);
        y1(k) = y.coeff(i);
        k++;
    }
}
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
                          Rcpp::Named("y") = y1);
'
na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy = "Vector"),
                           body, "RcppEigen", incl)
na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7
In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).
PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.
Perhaps I am not understanding the question correctly, but within Rcpp I don't see how you could possibly do this more efficiently than with a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery; that is not the case once you are down at the C++ level. Even natively vectorized R functions are ultimately implemented with for loops in C. So the only way I can think of to make this more efficient is to try to do it in parallel.
For example, here's a simple na.omit.cpp function that omits NA values from a single vector:
rcppfun <- "
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
    std::vector<double> r(x.size());
    int k = 0;
    for (int i = 0; i < x.size(); ++i) {
        if (x[i] == x[i]) {   // false only for NA/NaN
            r[k] = x[i];
            k++;
        }
    }
    r.resize(k);
    return Rcpp::wrap(r);
}"
na.omit.cpp <- cppFunction(rcppfun)
This runs even more quickly than R's built-in na.omit:
> set.seed(123)
> x<-1:10000
> x[sample(10000,1000)]<-NA
> y1<-na.omit(x)
> y2<-na.omit.cpp(x)
> all(y1==y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x),na.omit.cpp(x))
Unit: microseconds
expr min lq median uq max neval
na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447 100
na.omit.cpp(x) 107.524 168.1955 173.6035 210.524 222.564 100
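The same filtering logic can also be written with the standard library. A sketch using std::copy_if, relying on the fact that v == v is false only for NA/NaN (naomit_copyif is a made-up name):
#include <Rcpp.h>
#include <algorithm>
#include <iterator>
#include <vector>
// [[Rcpp::plugins(cpp11)]]
// Sketch: std::copy_if keeps exactly the elements for which the
// predicate holds; NA_real_ and NaN both fail the v == v test.
// [[Rcpp::export]]
Rcpp::NumericVector naomit_copyif(Rcpp::NumericVector x) {
    std::vector<double> r;
    r.reserve(x.size());
    std::copy_if(x.begin(), x.end(), std::back_inserter(r),
                 [](double v) { return v == v; });
    return Rcpp::wrap(r);
}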
I do not know if I understand the problem correctly, but at the R level you can simply subset out the missing values:
a = c(1.5, NaN, 7, NA)
a[-which(is.na(a))]
[1] 1.5 7.0
(Note that a[!is.na(a)] is the safer idiom, since -which(...) returns an empty result when there are no missing values.) It might be useful to use RInside if you want to embed R code like this in a C++ program.
Using Rcpp, I am trying to test for NA in a POSIXct vector passed to C++ (class DatetimeVector). The Rcpp::is_na() function works for NumericVector, CharacterVector, etc., but not for DatetimeVector.
Here is the C++ code, which tests for NA in an IntegerVector and a CharacterVector but fails to compile if you uncomment the DatetimeVector test:
#include <Rcpp.h>
using namespace std;
using namespace Rcpp;
//[[Rcpp::export]]
List testNA(DataFrame df){
    const int N = df.nrows();
    // Test for NA in an IntegerVector
    IntegerVector intV = df["intV"];
    LogicalVector resInt = is_na(intV);
    // Test for NA in a CharacterVector
    CharacterVector strV = df["strV"];
    LogicalVector resStr = is_na(strV);
    // Test for NA in a DatetimeVector
    DatetimeVector dtV = df["dtV"];
    LogicalVector resDT;
    //resDT = is_na(dtV); UNCOMMENT => DOES NOT COMPILE
    return(List::create(_["df"] = df,
                        _["resInt"] = resInt,
                        _["resStr"] = resStr,
                        _["resDT"] = resDT));
}
/*** R
cat("testing for NA\n")
intV <- c(1,NA,2)
df <- data.frame(intV=intV, strV=as.character(intV), dtV=as.POSIXct(intV,origin='1970-01-01'))
str(df)
testNA(df)
*/
In R
library("Rcpp")
sourceCpp("theCodeAbove.cpp")
I've added (rev 4405 of Rcpp) implementations of is_na for DateVector and DatetimeVector that don't need the cast to NumericVector, which creates a temporary object we don't actually need.
However, avoiding the cast does not gain us much, because most of the time is spent constructing the DatetimeVector objects.
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
LogicalVector isna_cast( DatetimeVector d){
    // version with the cast
    return is_na( as<NumericVector>( d ) ) ;
}

// [[Rcpp::export]]
LogicalVector isna( DatetimeVector d){
    // without the cast
    return is_na( d ) ;
}

// [[Rcpp::export]]
void do_nothing( DatetimeVector d){
    // just measuring the time it takes to
    // create a DatetimeVector from an R object
}
Benchmarking this with microbenchmark:
require(microbenchmark)
intV <- rep( c(1,NA,2), 100000 )
dtV <- as.POSIXct(intV,origin='1970-01-01')
microbenchmark(
    isna_cast( dtV ),
    isna( dtV ),
    do_nothing( dtV )
)
# Unit: milliseconds
# expr min lq median uq max neval
# isna_cast(dtV) 67.03146 68.04593 68.71991 69.39960 96.46747 100
# isna(dtV) 65.71262 66.43674 66.77992 67.16535 95.93567 100
# do_nothing(dtV) 57.15901 57.72670 58.08646 58.39948 58.97939 100
About 85% of the time is used to just create the DatetimeVector object. This is because the DatetimeVector and DateVector classes don't use the proxy design we used everywhere else in Rcpp. A DatetimeVector is essentially a std::vector<Datetime> and each of these Datetime objects is created from the corresponding element of the underlying object from R.
It is probably too late to change the API of DatetimeVector and DateVector to make them proxy-based, but maybe there is room for something like a dedicated POSIXct class.
In comparison, let's measure the time it takes to do nothing with a NumericVector:
// [[Rcpp::export]]
void do_nothing_numeric( NumericVector d){}
# Unit: microseconds
# expr min lq median uq max
# isna_cast(dtV) 66985.21 68103.0060 68960.7880 69416.227 95724.385
# isna(dtV) 65699.72 66544.9935 66893.5720 67213.064 95262.267
# do_nothing(dtV) 57209.26 57865.1140 58306.8780 58630.236 69897.636
# do_nothing_numeric(intV) 4.22 9.6095 15.2425 15.511 33.978
The compiler error suggests that the method is not (yet?) available for DatetimeVector:
test.cpp:18:13: error: no matching function for call to 'is_na'
An easy workaround:
resDT = is_na( as<NumericVector>(dtV) ); // As per Dirk's suggestion
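Wrapped up as a self-contained function, the workaround might look like this (a sketch; isNaDatetime is a made-up name):
#include <Rcpp.h>
using namespace Rcpp;
// Sketch: cast the DatetimeVector to a NumericVector first,
// then apply the regular is_na() to the numeric data.
// [[Rcpp::export]]
LogicalVector isNaDatetime(DatetimeVector dtV) {
    return is_na(as<NumericVector>(dtV));
}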