Rcpp Matrix row column permutations - r

I am trying to mimic the R function that allows to run column and row matrix permutations based on a vector of indices. Like in the following code:
m=matrix(sample(c(0:9),5*5,T),ncol=5,nrow=5)
diag(m)=0
rand=sample(c(1:5))
m[rand,rand]
I tried the following code in c++:
Library(Rcpp)
cppFunction(‘
NumericMatrix test(NumericMatrix& M, int col, IntegerVector& rand) {
NumericMatrix M2(col,col);
for(int a=0;a<col;a++){
for(int b=a+1;b<col;b++){
M2(b,a)=M(rand(b),rand(a));
M2(a,b)=M(rand(a),rand(b));
}
}
return M2;
}
‘)
But it is very slow:
microbenchmark::microbenchmark(test(m,5,(rand-1)),m2[rand,rand])
Any ideas how I could speed up the process?

Using a simpler loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix test(NumericMatrix& M, int col, IntegerVector& rand) {
NumericMatrix M2(col,col);
for(int a=0;a<col;a++){
for(int b=a+1;b<col;b++){
M2(b,a)=M(rand(b),rand(a));
M2(a,b)=M(rand(a),rand(b));
}
}
return M2;
}
// [[Rcpp::export]]
NumericMatrix test2(const NumericMatrix& M, const IntegerVector& ind) {
int col = M.ncol();
NumericMatrix M2(col, col);
for (int j = 0; j < col; j++)
for (int i = 0; i < col; i++)
M2(i, j) = M(ind[i], ind[j]);
return M2;
}
/*** R
N <- 500
m <- matrix(sample(c(0:9), N * N, TRUE), ncol = N, nrow = N)
diag(m) <- 0
rand <- sample(N)
all.equal(test(m, ncol(m), rand - 1), m[rand, rand], test2(m, rand - 1))
microbenchmark::microbenchmark(
test(m, ncol(m), rand - 1),
m[rand, rand],
test2(m, rand - 1)
)
*/
For N = 5, the R version is faster, but in terms of nanoseconds..
For example, with N = 500, you get:
Unit: microseconds
expr min lq mean median uq max neval
test(m, ncol(m), rand - 1) 2092.474 2233.020 2843.145 2360.654 2548.050 7412.057 100
m[rand, rand] 1422.352 1506.117 2064.500 1578.129 1718.345 6700.219 100
test2(m, rand - 1) 698.595 769.944 1161.747 838.811 928.535 5379.841 100

Related

Allocate Rcpp List of n NumericMatrix

Is there a way to allocate an Rcpp List of length n, where each element of the List will be filled with a NumericMatrix, but the size of each NumericMatrix can change?
I have an idea for doing this using std::list and push_back(), but the size of the list may be quite large and I want to avoid the overhead of creating an extra copy of the list when I return from the function.
The below R code gives an idea of what I hope to do:
myvec = function(n) {
x = vector("list", n)
for (i in seq_len(n)) {
nc = sample(1:3, 1)
nr = sample(1:3, 1)
x[[i]] = matrix(rbinom(nc * nr, size = 1, prob = 0.5),
nrow = nr, ncol = nc)
}
x
}
This could result in something like:
> myvec(2)
[[1]]
[,1]
[1,] 0
[2,] 1
[[2]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 1 1
Update: based on the comments of #Dirk and #Ralf, I created functions based on Rcpp::List and std::list with a wrap at the end. Speed comparisons don't seem to favor one version over the other, but perhaps there's an inefficiency I'm not aware of.
src = '
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List myvec(int n) {
Rcpp::RNGScope rngScope;
Rcpp::List x(n);
// Rcpp::IntegerVector choices = {1, 2 ,3};
Rcpp::IntegerVector choices = Rcpp::seq_len(50);
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x(i) = Rcpp::NumericMatrix(nc, nr, entries.begin());
}
return x;
}
// [[Rcpp::export]]
Rcpp::List myvec2(int n) {
Rcpp::RNGScope scope;
std::list< Rcpp::NumericMatrix > x;
// Rcpp::IntegerVector choices = {1, 2 ,3};
Rcpp::IntegerVector choices = Rcpp::seq_len(50);
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x.push_back( Rcpp::NumericMatrix(nc, nr, entries.begin()));
}
return Rcpp::wrap(x);
}
'
sourceCpp(code = src)
Resulting benchmarks on my computer are:
> library(microbenchmark)
> rcpp_list = function() {
+ set.seed(10);myvec(105)
+ }
> std_list = function() {
+ set.seed(10);myvec2(105)
+ }
> microbenchmark(rcpp_list(), std_list(), times = 1000)
Unit: milliseconds
expr min lq mean median uq
rcpp_list() 1.8901 1.92535 2.205286 1.96640 2.22380
std_list() 1.9164 1.95570 2.224941 2.00555 2.32315
max neval cld
7.1569 1000 a
7.1194 1000 a
The fundamental issue that Rcpp objects are R objects governed my R's memory management where resizing is expensive: full copies.
So when I have tasks similar to yours where sizes may change, or are unknown, I often work with different data structures -- the STL gives us plenty -- and only convert to R(cpp) at the return step at the end.
The devil in the detail here (as always). Profile, experiment, ...
Edit: And in the narrower sense of "can we return a List of NumericMatrix objects with varying sizes" the answer is of course we can because that is what List objects do. You can also insert other types.
As Dirk said, it is of course possible to create a list with matrices of different size. To make it a bit more concrete, here a translation of your R function:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
Rcpp::List myvec(int n) {
Rcpp::List x(n);
Rcpp::IntegerVector choices = {1, 2 ,3};
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x(i) = Rcpp::NumericMatrix(nc, nr, entries.begin());
}
return x;
}
/***R
myvec(2)
*/
The main difference to the R code are the explicitly named vectors choices and entries, which are only implicit in the R code.

Efficient way to calculate Hawk's process gradient

I am interested in calculating the following quantity
B(i) = \sum_{j < i}(x_i-x_j)exp^{-\beta(x_i - x_j)}
which is part of computing the gradient wrt one of the parameters of a Hawk's process likelihood (more information can be found here: http://www.ism.ac.jp/editsec/aism/pdf/031_1_0145.pdf).
Beta is just a constant for the shake of the problem and x_i is my i-th data point.
I am trying to calculate the above quantity in RCPP, using the following chunk of code:
for( int i = 1; i< x.size();i++) {
double temp=0;
for(int j=0; j<=i-1;j++){
temp+=(x[i]-x[j])*exp(-beta*(x[i]-x[j]));
}
but it is highly inefficient and slow. Any suggestion on how this formula could be speeded-up?
Standard operations are very fast in C++ (+, -, etc).
Yet, exp is more complicated to compute, so slower.
So, if we want some performance improvement, the more likely would be to be able to precompute the exp computations.
Here, B(i) = \sum_{j < i}(x_i-x_j)exp^{-\beta(x_i - x_j)} is equivalent to B(i) = \sum_{j < i}(x_i-x_j) / exp^{\beta x_i} * exp^{\beta x_j} so that you can precompute the exp for each index only (and also put the one depending on i out of the loop). By refactoring it, you can do other precomputations. So, I put here the two previous solutions then my incremental solutions:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_org(Rcpp::NumericVector x, double beta = 3) {
int n = x.size();
Rcpp::NumericVector B = Rcpp::no_init( n - 1);
for (int i = 1; i < n; i++) {
double temp = 0;
for (int j = 0; j <= i - 1; j++) {
temp += (x[i] - x[j]) * exp(-beta * (x[i] - x[j]));
}
B(i - 1) = temp;
}
return B;
}
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache(Rcpp::NumericVector x, double beta = 3) {
int n = x.size();
Rcpp::NumericVector B = Rcpp::no_init( n - 1);
double x_i;
for (int i = 1; i < n; ++i) {
double temp = 0;
x_i = x[i];
for (int j = 0; j <= i - 1; ++j) {
temp += (x_i - x[j]) * 1 / exp(beta * (x_i - x[j]));
}
B(i - 1) = temp;
}
return B;
}
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_2(Rcpp::NumericVector x,
double beta = 3) {
int i, j, n = x.size();
Rcpp::NumericVector B(n);
Rcpp::NumericVector x_exp = exp(beta * x);
double temp;
for (i = 1; i < n; i++) {
temp = 0;
for (j = 0; j < i; j++) {
temp += (x[i] - x[j]) * x_exp[j] / x_exp[i];
}
B[i] = temp;
}
return B;
}
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_3(Rcpp::NumericVector x,
double beta = 3) {
int i, j, n = x.size();
Rcpp::NumericVector B(n);
Rcpp::NumericVector x_exp = exp(beta * x);
double temp;
for (i = 1; i < n; i++) {
temp = 0;
for (j = 0; j < i; j++) {
temp += (x[i] - x[j]) * x_exp[j];
}
B[i] = temp / x_exp[i];
}
return B;
}
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_4(Rcpp::NumericVector x,
double beta = 3) {
Rcpp::NumericVector exp_pre = exp(beta * x);
Rcpp::NumericVector exp_pre_cumsum = cumsum(exp_pre);
Rcpp::NumericVector x_exp_pre_cumsum = cumsum(x * exp_pre);
return (x * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
}
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_5(Rcpp::NumericVector x,
double beta = 3) {
int n = x.size();
NumericVector B(n);
double exp_pre, exp_pre_cumsum = 0, x_exp_pre_cumsum = 0;
for (int i = 0; i < n; i++) {
exp_pre = exp(beta * x[i]);
exp_pre_cumsum += exp_pre;
x_exp_pre_cumsum += x[i] * exp_pre;
B[i] = (x[i] * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
}
return B;
}
/*** R
set.seed(111)
x = rnorm(1e3)
all.equal(
hawk_process_org(x),
hawk_process_cache(x)
)
all.equal(
hawk_process_org(x),
hawk_process_cache_2(x)[-1]
)
all.equal(
hawk_process_org(x),
hawk_process_cache_3(x)[-1]
)
all.equal(
hawk_process_org(x),
hawk_process_cache_4(x)[-1]
)
all.equal(
hawk_process_org(x),
hawk_process_cache_5(x)[-1]
)
microbenchmark::microbenchmark(
hawk_process_org(x),
hawk_process_cache(x),
hawk_process_cache_2(x),
hawk_process_cache_3(x),
hawk_process_cache_4(x),
hawk_process_cache_5(x)
)
*/
Benchmark for x = rnorm(1e3):
Unit: microseconds
expr min lq mean median uq max neval cld
hawk_process_org(x) 19801.686 20610.0365 21017.89339 20816.1385 21157.4900 25548.042 100 d
hawk_process_cache(x) 20506.903 21062.1370 21534.47944 21297.8710 21775.2995 26030.106 100 e
hawk_process_cache_2(x) 1895.809 2038.0105 2087.20696 2065.8220 2103.0695 3212.874 100 c
hawk_process_cache_3(x) 430.084 458.3915 494.09627 474.2840 503.0885 1580.282 100 b
hawk_process_cache_4(x) 50.657 55.2930 71.60536 57.6105 63.5700 1190.260 100 a
hawk_process_cache_5(x) 43.373 47.0155 60.43775 49.6640 55.6235 842.288 100 a
This is much more effective than trying to gain nanoseconds from small optimizations that are likely to get your code more difficult to read.
But still, let's try the optimizations proposed by #coatless on my very last solution:
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache_6(Rcpp::NumericVector x,
double beta = 3) {
int n = x.size();
NumericVector B = Rcpp::no_init(n);
double x_i, exp_pre, exp_pre_cumsum = 0, x_exp_pre_cumsum = 0;
for (int i = 0; i < n; ++i) {
x_i = x[i];
exp_pre = exp(beta * x_i);
exp_pre_cumsum += exp_pre;
x_exp_pre_cumsum += x_i * exp_pre;
B[i] = (x_i * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
}
return B;
}
Benchmark for x = rnorm(1e6):
Unit: milliseconds
expr min lq mean median uq max neval cld
hawk_process_cache_5(x) 42.52886 43.53653 45.28427 44.46688 46.74129 57.38046 100 a
hawk_process_cache_6(x) 42.14778 43.19054 45.93252 44.28445 46.51052 153.30447 100 a
Still not very convincing..
Interesting question. In my tests combining the two answers does give a further performance boost (benchmarks further down):
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector hawk_process_cache_combined(NumericVector x,
double beta = 3) {
int n = x.size();
NumericVector B = Rcpp::no_init(n-1);
double exp_pre(exp(beta * x[0]));
double exp_pre_cumsum(exp_pre);
double x_exp_pre_cumsum(x[0] * exp_pre);
double x_i;
for (int i = 1; i < n; ++i) {
x_i = x[i];
exp_pre = exp(beta * x_i);
exp_pre_cumsum += exp_pre;
x_exp_pre_cumsum += x_i * exp_pre;
B[i-1] = (x_i * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
}
return B;
}
all.equal(
hawk_process_org(x),
hawk_process_cache_combined(x)
)
#> [1] TRUE
Now while the original formulation is "embarrassingly parallel", this is no longer the case for this expression. However, prefix scan algorithms like cumsum can also be parallelized. And libraries like ArrayFire provide interfaces to such algorithms using the GPU. Using RcppArrayFire one can write based on F. Privé's hawk_process_cached_4:
// [[Rcpp::depends(RcppArrayFire)]]
#include <RcppArrayFire.h>
// [[Rcpp::export]]
af::array hawk_process_af(RcppArrayFire::typed_array<f32> x,
double beta = 3) {
af::array exp_pre = exp(beta * x);
af::array exp_pre_cumsum = af::accum(exp_pre);
af::array x_exp_pre_cumsum = af::accum(x * exp_pre);
af::array result = (x * exp_pre_cumsum - x_exp_pre_cumsum) / exp_pre;
return result(af::seq(1, af::end));
}
Here the results are not exactly equal, since my driver/card only supports single precision floats:
all.equal(
hawk_process_org(x),
hawk_process_af(x)
)
#> [1] "Mean relative difference: 3.437819e-07"
With double precision one would write f64 above and obtain identical results. Now for the benchmarks:
set.seed(42)
x <- rnorm(1e3)
microbenchmark::microbenchmark(
hawk_process_af(x),
hawk_process_cache_combined(x),
hawk_process_cache_5(x)[-1]
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> hawk_process_af(x) 245.281 277.4625 338.92232 298.5410 346.576 1030.045 100
#> hawk_process_cache_combined(x) 35.343 39.0120 43.69496 40.7770 45.264 84.242 100
#> hawk_process_cache_5(x)[-1] 52.408 57.8580 65.55799 60.5265 67.965 125.864 100
x <- rnorm(1e6)
microbenchmark::microbenchmark(
hawk_process_af(x),
hawk_process_cache_combined(x),
hawk_process_cache_5(x)[-1]
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> hawk_process_af(x) 27.54936 28.42794 30.93452 29.20025 32.40667 49.41888 100
#> hawk_process_cache_combined(x) 34.00380 36.84497 40.74862 39.03649 41.85902 111.51628 100
#> hawk_process_cache_5(x)[-1] 47.02501 53.24702 57.94747 55.35018 58.42097 130.89737 100
So for small vectors, the combined approach is faster, while for longer once offloading to the GPU pays off. All this not with some high power GPU but simple on-board graphics:
RcppArrayFire::arrayfire_info()
#> ArrayFire v3.5.1 (OpenCL, 64-bit Linux, build 0a675e8)
#> [0] BEIGNET: Intel(R) HD Graphics Skylake ULT GT2, 4096 MB
This is an O(N^2) operation without factoring in the cost of exp. Any tweaks are likely to yield minimal improvements.
A few quick suggestions:
cache the value of x[i] on the outer loop as you are repeatedly subsetting that in the inner loop.
switch from using exp(-beta * ..) to 1/exp(beta*(x ... ))
use ++i instead of i++ to avoid a slight performance hiccup since you avoid a copy of i that the latter does.
Original code:
#include<Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_org(Rcpp::NumericVector x, double beta = 3) {
int n = x.size();
Rcpp::NumericVector B = Rcpp::no_init( n - 1);
for (int i = 1; i < n; i++) {
double temp = 0;
for (int j = 0; j <= i - 1; j++) {
temp += (x[i] - x[j]) * exp(-beta * (x[i] - x[j]));
}
B(i - 1) = temp;
}
return B;
}
Modified code:
#include<Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector hawk_process_cache(Rcpp::NumericVector x, double beta = 3) {
int n = x.size();
Rcpp::NumericVector B = Rcpp::no_init( n - 1);
double x_i;
for (int i = 1; i < n; ++i) {
double temp = 0;
x_i = x[i];
for (int j = 0; j <= i - 1; ++j) {
temp += (x_i - x[j]) * 1 / exp(beta * (x_i - x[j]));
}
B(i - 1) = temp;
}
return B;
}
Test
set.seed(111)
x = rnorm(1e4)
all.equal(
hawk_process_org(x),
hawk_process_cache(x)
)
#> [1] TRUE
bench_func = microbenchmark::microbenchmark(
hawk_process_org(x),
hawk_process_cache(x)
)
bench_func
#> Unit:milliseconds
#> expr min lq mean median uq max neval
#> hawk_process_org(x) 436.5349 465.9674 505.9606 481.4703 500.6652 894.7477 100
#> hawk_process_cache(x) 446.0499 454.9098 485.3830 468.6580 494.9457 799.0940 100
So, you get marginally better results under the recommendations.

Fibonacci number in R vs Rcpp

I was just trying to check the execution speed of Fiboncci number generation in R vs Rcpp. To my surprise, my R function was faster(also, linearly growing) than my Rcpp function. What is wrong here.
The R code:
fibo = function (n){
x = rep(0, n)
x[1] = 1
x[2] = 2
for(i in 3:n){
x[i] = x[i-2] + x[i-1]
}
return(x)
}
The Rcpp code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fibo_sam(int n){
IntegerVector x;
x.push_back(1);
x.push_back(2);
for(int i =2; i < n; i++){
x.push_back(x[i - 2] + x[i-1]);
}
return(x);
}
The problem with your Rcpp code is that you are growing the vector instead of allocating the size at the beginning. Try with:
// [[Rcpp::export]]
IntegerVector fibo_sam2(int n) {
IntegerVector x(n);
x[0] = 1;
x[1] = 2;
for (int i = 2; i < n; i++){
x[i] = x[i-2] + x[i-1];
}
return(x);
}
Benchmark:
Unit: microseconds
expr min lq mean median uq max neval cld
fibo(1000) 99.989 102.6375 157.42543 103.962 106.9415 4806.395 100 a
fibo_sam(1000) 493.320 511.8615 801.39046 534.044 590.4945 2825.168 100 b
fibo_sam2(1000) 2.980 3.3110 10.18763 3.642 4.3040 573.443 100 a
PS1: check your first values
PS2: beware large numbers (see this)

Fast checking of missing values in Rcpp

This question is linked to NA values in Rcpp conditional.
I basically have some Rcpp code that loop over multiple (double) elements. And I need to check if there are missing values, for each element (and I can't use vectorization). Let's count the number of missing values in a vector, just as minimal reproducible example:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
return c;
}
// [[Rcpp::export]]
int nb_na3(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) if (x[i] == 3) c++;
return c;
}
// [[Rcpp::export]]
LogicalVector na_real(NumericVector x) {
return x == NA_REAL;
}
Then, in R, we get:
> x <- rep(c(1, 2, NA), 1e4)
> x2 <- replace(x, is.na(x), 3)
> microbenchmark::microbenchmark(
+ nb_na(x),
+ nb_na3(x2)
+ )
Unit: microseconds
expr min lq mean median uq max neval
nb_na(x) 135.633 135.982 153.08586 139.753 140.3115 1294.928 100
nb_na3(x2) 22.490 22.908 30.14005 23.188 23.5025 684.026 100
> all.equal(nb_na(x), nb_na3(x2))
[1] TRUE
> na_real(x[1:3])
[1] NA NA NA
As noted in the linked question, you can't just check x[i] == NA_REAL because it always returns a missing value. Yet, using R_IsNA(x[i]) is much slower that checking equality with a numeric value (e.g. 3).
Basically, I want a solution where I can check that a single value is a missing value. This solution should be as fast as checking equality with a numeric value.
Checking for missing value or any NaN specific variant is always going to be more expensive than checking for a specific value. That's just floating point arithmetic.
However there's still room for improvement in your code. I would encourage you to use NumericVector::is_na instead of R_IsNA but this is mostly cosmetic.
Then branching can be expensive, i.e. I'd replace if (R_IsNA(x[i])) c++; by c += NumericVector::is_na(x[i]). This gives this version:
// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
return c;
}
Then iterating on an int and accessing x[i] can be replaced by using the std::count_if algorithm. This is it's raison d'être. Leading to this version:
// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}
Now if the performance is still not good enough, you might want to try parallelization, for this I typically use the tbb library from the RcppParallel package.
// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
return tbb::parallel_reduce(
tbb::blocked_range<const double*>(x.begin(), x.end()),
0,
[](const tbb::blocked_range<const double*>& r, int init) -> int {
return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
},
[]( int x, int y){ return x+y; }
) ;
}
Benchmarking with this function:
library(microbenchmark)
bench <- function(n){
x <- rep(c(1, 2, NA), n)
microbenchmark(
nb_na = nb_na(x),
nb_na4 = nb_na4(x),
nb_na5 = nb_na5(x),
nb_na6 = nb_na6(x)
)
}
bench(1e5)
On my machine I get:
> bench(1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 84.358 94.6500 107.41957 110.482 118.9580 137.393 100 d
nb_na4 59.984 69.4925 79.42195 82.442 85.9175 106.567 100 b
nb_na5 65.047 75.2625 85.17134 87.501 93.0315 116.993 100 c
nb_na6 39.205 51.0785 59.20582 54.457 68.9625 97.225 100 a
> bench(1e5)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 730.416 732.2660 829.8440 797.4350 872.3335 1410.467 100 d
nb_na4 520.800 521.6215 598.8783 562.7200 657.1755 1059.991 100 b
nb_na5 578.527 579.3805 664.8795 626.5530 710.5925 1166.365 100 c
nb_na6 294.486 345.2050 368.6664 353.6945 372.6205 897.552 100 a
Another way is to totally circumvent floating point arithmetic and pretend the vector is a vector of long long, aka 64 bit integers and compare the values to the bit pattern of NA_REAL:
> devtools::install_github( "ThinkR-open/seven31" )
> seven31::reveal(NA, NaN, +Inf, -Inf )
0 11111111111 ( NaN ) 0000000000000000000000000000000000000000011110100010 : NA
0 11111111111 ( NaN ) 1000000000000000000000000000000000000000000000000000 : NaN
0 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : +Inf
1 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : -Inf
A serial solution using this hack:
// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;
return std::count(p, p + x.size(), na ) ;
}
And then a parallel version:
// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;
auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
return init + std::count( r.begin(), r.end(), na);
} ;
return tbb::parallel_reduce(
tbb::blocked_range<const long long*>(p, p + x.size()),
0,
count_chunk,
[]( int x, int y){ return x+y; }
) ;
}
> bench(1e5)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 730.346 762.5720 839.9479 857.5865 881.8635 1045.048 100 f
nb_na4 520.946 521.6850 589.0911 578.2825 653.4950 832.449 100 d
nb_na5 578.621 579.3245 640.9772 616.8645 701.8125 890.736 100 e
nb_na6 291.115 307.4300 340.1626 344.7955 360.7030 484.261 100 c
nb_na7 122.156 123.4990 141.1954 132.6385 149.7895 253.988 100 b
nb_na8 69.356 86.9980 109.6427 115.2865 126.2775 182.184 100 a
> bench(1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 7342.984 7956.3375 10261.583 9227.7450 10869.605 79757.09 100 d
nb_na4 5286.970 5721.9150 7659.009 6660.2390 9234.646 31141.47 100 c
nb_na5 5840.946 6272.7050 7307.055 6883.2430 8205.117 10420.48 100 c
nb_na6 2833.378 2895.7160 3891.745 3049.4160 4054.022 18242.26 100 b
nb_na7 1661.421 1791.1085 2708.992 1916.6055 2232.720 60827.63 100 ab
nb_na8 650.639 869.6685 1289.373 939.0045 1291.025 10223.29 100 a
This assumes there's only one bit pattern to represent NA.
Here's my entire file for reference:
#include <Rcpp.h>
#include <RcppParallel.h>
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;
// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
return c;
}
// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
return c;
}
// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}
// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
return tbb::parallel_reduce(
tbb::blocked_range<const double*>(x.begin(), x.end()),
0,
[](const tbb::blocked_range<const double*>& r, int init) -> int {
return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
},
[]( int x, int y){ return x+y; }
) ;
}
// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;
return std::count(p, p + x.size(), na ) ;
}
// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;
auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
return init + std::count( r.begin(), r.end(), na);
} ;
return tbb::parallel_reduce(
tbb::blocked_range<const long long*>(p, p + x.size()),
0,
count_chunk,
[]( int x, int y){ return x+y; }
) ;
}
/*** R
library(microbenchmark)
bench <- function(n){
x <- rep(c(1, 2, NA), n)
microbenchmark(
nb_na = nb_na(x),
nb_na4 = nb_na4(x),
nb_na5 = nb_na5(x),
nb_na6 = nb_na6(x),
nb_na7 = nb_na7(x),
nb_na8 = nb_na8(x)
)
}
bench(1e5)
bench(1e6)
*/
Checking for (IEEE) missing floating-point values is an expensive operating and there is no way around it. This is unrelated to R.
This is one reason why we're excited about the upcoming ALTREP in R - there we can for instance keep track of whether a double/real vector contains missing values or not - if it doesn't, then we don't have to waste time looking for them. Although not updated to mention ALTREP, you can get the gist from https://github.com/HenrikBengtsson/Wishlist-for-R/issues/12

Rcpp version of tabulate is slower; where is this from, how to understand

In the process of creating some sampling functions for already aggregated data I found that table was rather slow on the size data I am working with. I tried two improvements, first an Rcpp function as follows
// [[Rcpp::export]]
IntegerVector getcts(NumericVector x, int m) {
IntegerVector cts(m);
int t;
for (int i = 0; i < x.length(); i++) {
t = x[i] - 1;
if (0 <= t && t < m)
cts[t]++;
}
return cts;
}
And then while trying to understand why table was rather slow I found it being based on tabulate. Tabulate works well for me, and is faster than the Rcpp version. The code for tabulate is at:
https://github.com/wch/r-source/blob/545d365bd0485e5f0913a7d609c2c21d1f43145a/src/main/util.c#L2204
With the key line being:
for(R_xlen_t i = 0 ; i < n ; i++)
if (x[i] != NA_INTEGER && x[i] > 0 && x[i] <= nb) y[x[i] - 1]++;
Now the key parts of tabulate and my Rcpp version seem pretty close (I have not bothered dealing with NA).
Q1: why is my Rcpp version 3 times slower?
Q2: how can I find out where this time goes?
I would very much appreciate knowing where the time went, but even better would be a good way to profile the code. My C++ skills are only so so, but this seems simple enough that I should (cross my fingers) have been able to avoid any silly stuff that would triple my time.
My timing code:
max_x <- 100
xs <- sample(seq(max_x), size = 50000000, replace = TRUE)
system.time(getcts(xs, max_x))
system.time(tabulate(xs))
This gives 0.318 for getcts and 0.126 for tabulate.
Your function calls a length method in each loop iteration. Seems compiler don't cache it. To fix this store size of the vector in a separate variable or use range based loop. Also note that we don't really need explicit missing values check because in C++ all comparisons involving a NaN always return false.
Let's compare performance:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector tabulate1(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (std::size_t i = 0; i < x.size(); i++) {
if (x[i] > 0 && x[i] <= max)
counts[x[i] - 1]++;
}
return counts;
}
// [[Rcpp::export]]
IntegerVector tabulate2(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
std::size_t n = x.size();
for (std::size_t i = 0; i < n; i++) {
if (x[i] > 0 && x[i] <= max)
counts[x[i] - 1]++;
}
return counts;
}
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate3(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (auto& now : x) {
if (now > 0 && now <= max)
counts[now - 1]++;
}
return counts;
}
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (auto it = x.begin(); it != x.end(); it++) {
if (*it > 0 && *it <= max)
counts[*it - 1]++;
}
return counts;
}
/***R
library(microbenchmark)
x <- sample(10, 1e5, rep = TRUE)
microbenchmark(
tabulate(x, 10), tabulate1(x, 10),
tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
x[sample(10e5, 10e3)] <- NA
microbenchmark(
tabulate(x, 10), tabulate1(x, 10),
tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
*/
tabulate1 is the original version.
Benchmark results:
Without NA:
Unit: microseconds
expr min lq mean median uq max neval
tabulate(x, 10) 143.557 146.8355 169.2820 156.1970 177.327 286.370 100
tabulate1(x, 10) 390.706 392.6045 437.7357 416.5655 443.065 748.767 100
tabulate2(x, 10) 108.149 111.4345 139.7579 118.2735 153.118 337.647 100
tabulate3(x, 10) 107.879 111.7305 138.2711 118.8650 139.598 300.023 100
tabulate4(x, 10) 391.003 393.4530 436.3063 420.1915 444.048 777.862 100
With NA:
Unit: microseconds
expr min lq mean median uq max neval
tabulate(x, 10) 943.555 1089.5200 1614.804 1333.806 2042.320 3986.836 100
tabulate1(x, 10) 4523.076 4787.3745 5258.490 4929.586 5624.098 7233.029 100
tabulate2(x, 10) 765.102 931.9935 1361.747 1113.550 1679.024 3436.356 100
tabulate3(x, 10) 773.358 914.4980 1350.164 1140.018 1642.354 3633.429 100
tabulate4(x, 10) 4241.025 4466.8735 4933.672 4717.016 5148.842 8603.838 100
The tabulate4 function which uses an iterator also slower than tabulate. We can improve it:
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
auto start = x.begin();
auto end = x.end();
for (auto it = start; it != end; it++) {
if (*(it) > 0 && *(it) <= max)
counts[*(it) - 1]++;
}
return counts;
}

Resources