Fastest way to get nonnegative component

What is a faster way to get the nonnegative component of a double vector? That is,
pmax(x, 0)
My attempt is using Rcpp:
//' @title Parallel maximum
//' @description A faster \code{pmax()}.
//'
//' @name pmaxC
//' @param x A numeric vector.
//' @param a A single numeric value.
//' @return The parallel maximum of the input values.
//' @note This function will always be faster than \code{pmax(x, a)} when \code{a} is a single value, but can be slower than \code{pmax.int(x, a)} when \code{x} is short. Use this function when comparing a numeric vector with a single value.
//' @export pmaxC
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector pmaxC(NumericVector x, double a) {
  int n = x.length();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    double xi = x[i];
    if (xi < a) {
      out[i] = a;
    } else {
      out[i] = xi;
    }
  }
  return out;
}
This is a modest improvement:
set.seed(5)
x <- rnorm(1e6)
microbenchmark(pmax(x, 0), pmaxC(x, 0))
Unit: milliseconds
expr min lq mean median uq max neval cld
pmax(x, 0) 8.500419 8.621341 11.09672 10.132045 10.791020 58.44972 100 a
pmaxC(x, 0) 5.624480 5.709262 8.83968 7.598093 7.907853 53.91339 100 a
Neither is unacceptably slow, but given that this is a common scenario, I was wondering whether a package had developed a faster approach.

The operation you are performing is fairly simple, so I'm not sure there is much room for improvement with regard to your algorithm above. However, if you really need to squeeze out extra performance, this seems like a good candidate for parallelization. Here is a possible implementation using RcppParallel:
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
#include <algorithm>

struct Pmax : public RcppParallel::Worker {
    struct Apply {
        double mx;
        Apply(double mx_)
            : mx(mx_)
        {}

        double operator()(const double x) const
        {
            return x > mx ? x : mx;
        }
    };

    const RcppParallel::RVector<double> input;
    RcppParallel::RVector<double> output;
    Apply f;

    Pmax(const Rcpp::NumericVector input_,
         Rcpp::NumericVector output_,
         double mx_)
        : input(input_), output(output_), f(mx_)
    {}

    void operator()(std::size_t begin, std::size_t end)
    {
        std::transform(input.begin() + begin,
                       input.begin() + end,
                       output.begin() + begin,
                       f);
    }
};

// [[Rcpp::export]]
Rcpp::NumericVector par_pmax(Rcpp::NumericVector x, double y)
{
    Rcpp::NumericVector res = Rcpp::no_init_vector(x.size());
    Pmax p(x, res, y);
    RcppParallel::parallelFor(0, x.size(), p);
    return res;
}
Testing this with your example data, I get a reasonable improvement:
set.seed(5)
x <- rnorm(1e6)
all.equal(pmax(x, 0), par_pmax(x, 0))
#[1] TRUE
microbenchmark::microbenchmark(
pmax(x, 0),
pmaxC(x, 0),
par_pmax(x, 0),
times = 500L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# pmax(x, 0) 11.843528 12.193126 14.972588 13.030448 16.799250 102.09895 500
# pmaxC(x, 0) 7.804883 8.036879 10.462070 8.772635 12.407587 69.08290 500
# par_pmax(x, 0) 2.244691 2.443971 4.552169 2.624008 6.359027 65.99233 500
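If you need to tune this further, note that RcppParallel::parallelFor also accepts an optional grain size (its fourth argument) that controls the smallest chunk handed to each task, which can reduce scheduling overhead for per-element work this light. A possible variant reusing the Pmax worker above -- the grain size of 10000 is an illustrative value, not a tuned one:
// Variant of par_pmax with an explicit grain size; 10000 is illustrative
// and should be tuned against your own data and hardware.
// [[Rcpp::export]]
Rcpp::NumericVector par_pmax_grain(Rcpp::NumericVector x, double y)
{
    Rcpp::NumericVector res = Rcpp::no_init_vector(x.size());
    Pmax p(x, res, y);
    RcppParallel::parallelFor(0, x.size(), p, 10000); // chunks of at least 10k elements
    return res;
}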

Related

Translate outer() from base R to RcppArmadillo

Is there any way to efficiently translate the outer() function for multiplication of two vectors from R base to RcppArmadillo? I attempted to do so but it is not efficient at all.
Take the following example:
library(Rcpp)
library(RcppArmadillo)
library(microbenchmark)
# Outer attempt
cppFunction(depends = "RcppArmadillo",
'
arma::mat outer_rcpp(arma::vec x, arma::vec y) {
  int x_length = x.n_elem;
  int y_length = y.n_elem;
  arma::mat final(x_length, y_length);

  // And use loops instead of outer
  for(int i = 0; i < x_length; i++) {
    final.col(i) = x[i] * y;
  }
  return(final);
}
'
)
# Test for equal results
a <- rnorm(5)
base <- base::outer(a, a)
rcpp <- outer_rcpp(a, a)
all.equal(base, rcpp)

# Test for speed
b <- rnorm(5000)
microbenchmark(base = base::outer(b, b),
               rcpp = outer_rcpp(b, b), times = 10)
The Rcpp version is about 2 times slower than base R. I am sure that this can be done through matrix multiplication; any idea how?
As @thelatemail pointed out in the comments, the outer routine already uses a heavily optimized C routine.
src/library/base/R/outer.R: tcrossprod usage.
src/main/array.c: underlying C routine powering the tcrossprod computation.
Armadillo itself has its own optimization for matrix multiplication, using the dgemm and dgemv routines from BLAS:
armadillo_bits/mul_gemm.hpp: C := alpha*op(A)*op(B) + beta*C
armadillo_bits/mul_gemv.hpp: y := alpha*A*x + beta*y, or y := alpha*A**T*x + beta*y
Playing around with the outer-product calculations leads to a few optimizations. Mainly, we opt to express the outer product as Armadillo operations instead of loops:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat outer_rcpp(const arma::vec& x, const arma::vec& y) {
  int x_length = x.n_elem;
  int y_length = y.n_elem;
  arma::mat final(x_length, y_length);

  // And use loops instead of outer
  for(int i = 0; i < x_length; i++) {
    final.col(i) = x[i] * y;
  }
  return final;
}

// [[Rcpp::export]]
arma::mat outer_with_armadillo(const arma::vec& x, const arma::vec& y) {
  arma::mat final = x * y.t();
  return final;
}

// [[Rcpp::export]]
arma::mat outer_with_armadillo_transposed(const arma::vec& x, const arma::rowvec& y) {
  arma::mat final = x * y;
  return final;
}
Revisiting the benchmarking code, we have:
b = rnorm(5000)
b_transposed = t(b)

bench_results = microbenchmark::microbenchmark(
  base = base::outer(b, b),
  outer_armadillo_loop = outer_rcpp(b, b),
  outer_armadillo_optimized = outer_with_armadillo(b, b),
  outer_armadillo_optimized_transposed = outer_with_armadillo_transposed(b, b_transposed),
  times = 10)
bench_results
expr                                      min       lq     mean   median       uq      max neval
base                                 132.8601 141.3532 156.9979 146.7993 154.8954 234.2619    10
outer_armadillo_loop                 278.4115 279.9204 317.7907 288.4212 329.0769 451.6872    10
outer_armadillo_optimized            272.4348 283.3380 347.7913 304.1181 339.3282 728.2264    10
outer_armadillo_optimized_transposed 269.7855 270.7108 297.9580 279.8099 312.3488 386.4270    10
From the results, the fastest variant is the one that takes a pre-transposed b, i.e. the vector already converted from column form to row form, so the product is directly (n x 1) * (1 x m).

Most efficient way to return Eigen::VectorXi with more than 2^31-1 elements to R

I have a vector x of type Eigen::VectorXi with more than 2^31-1 entries, which I would like to return to R. I can do that by copying x entry-wise to a new vector of type Rcpp::IntegerVector, but that seems to be quite slow.
I am wondering:
whether there is a more efficient workaround;
why in the following reproducible example Rcpp::wrap(x) doesn't work.
test.cpp
#include <RcppEigen.h>

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
SEXP foo(const R_xlen_t size) {
  Eigen::VectorXi x(size);
  for (R_xlen_t i = 0; i < size; i++) {
    x(i) = 1;
  }
  return Rcpp::wrap(x);
}

// [[Rcpp::export]]
Rcpp::IntegerVector fooSlow(const R_xlen_t size) {
  Eigen::VectorXi x(size);
  for (R_xlen_t i = 0; i < size; i++) {
    x(i) = 1;
  }
  Rcpp::IntegerVector y(size);
  for (R_xlen_t i = 0; i < size; i++) {
    y(i) = x(i);
  }
  return y;
}
test.R
Rcpp::sourceCpp("./test.cpp")
a <- foo(2^(31)) # Error in foo(2^(31)) : negative length vectors are not allowed
a <- fooSlow(2^(31)) # runs fine but it's slow
Rcpp::wrap is dispatching to a method for Eigen matrices and vectors implemented in RcppEigen. That method doesn't appear to support long vectors, currently. (Edit: It now does; see below.)
The error about negative length is thrown by allocVector3 here. It arises when allocVector3 is called with a negative value for its argument length. My guess is that Rcpp::wrap tries to represent 2^31 as an int, resulting in integer overflow. Maybe this happens here?
In any case, you seem to have stumbled on a bug, so you might consider sharing your example with the RcppEigen maintainers on GitHub. (Edit: Never mind - I've just submitted a patch.) (Edit: Patched now, if you'd like to build RcppEigen from sources [commit 5fd125e or later] in order to update your Rcpp::wrap.)
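To make the overflow guess concrete, here is a minimal standalone sketch (an illustration of the suspected failure mode, not the confirmed RcppEigen internals). Narrowing a value above INT_MAX to int is implementation-defined, but on typical platforms it wraps to a negative number, which would explain the error message:
#include <climits>
#include <cstdio>

int main() {
    long long n = 1LL << 31;        // 2^31 elements, one more than INT_MAX
    int len = static_cast<int>(n);  // implementation-defined; typically wraps
    std::printf("n = %lld, as int = %d (INT_MAX = %d)\n", n, len, INT_MAX);
    // On common platforms 'as int' prints negative, matching the
    // "negative length vectors are not allowed" error from allocVector3.
    return 0;
}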
Attempting to answer your first question, I compared your two approaches with my own based on std::memcpy. The std::memcpy approach supports long vectors and is only slightly slower than Rcpp::wrap.
The std::memcpy approach
The C arrays beneath Eigen::VectorXi x and Rcpp::IntegerVector y have the same type (int) and length (n), so they contain the same number of bytes. You can use std::memcpy to copy that number of bytes from one's memory address to the other's without a for loop. The hard part is knowing how to obtain the addresses. Eigen::VectorXi has a member function data that returns the address of the underlying int array. For R objects of integer type, INTEGER from the R API does the same thing.
Tests
Rcpp::sourceCpp(code = '
#include <RcppEigen.h>
#include <Rinternals.h>
#include <cstring>

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Rcpp::IntegerVector f_for(const R_xlen_t n) {
  Eigen::VectorXi x(n);
  for (R_xlen_t i = 0; i < n; ++i) {
    x(i) = i % 10;
  }
  Rcpp::IntegerVector y(n);
  for (R_xlen_t i = 0; i < n; ++i) {
    y(i) = x(i);
  }
  return y;
}

// [[Rcpp::export]]
Rcpp::IntegerVector f_wrap(const R_xlen_t n) {
  Eigen::VectorXi x(n);
  for (R_xlen_t i = 0; i < n; ++i) {
    x(i) = i % 10;
  }
  return Rcpp::wrap(x);
}

// [[Rcpp::export]]
Rcpp::IntegerVector f_memcpy(const R_xlen_t n) {
  Eigen::VectorXi x(n);
  for (R_xlen_t i = 0; i < n; ++i) {
    x(i) = i % 10;
  }
  Rcpp::IntegerVector y(n);
  std::memcpy(INTEGER(y), x.data(), n * sizeof(int));
  return y;
}
')
n <- 100L
x <- rep_len(0:9, n)
identical(f_for(n), x) # TRUE
identical(f_wrap(n), x) # TRUE
identical(f_memcpy(n), x) # TRUE
b <- function(n) microbenchmark::microbenchmark(f_for(n), f_wrap(n), f_memcpy(n), setup = gc(FALSE))
b(2^10)
## Unit: microseconds
## expr min lq mean median uq max neval
## f_for(n) 6.806 8.5280 15.09497 10.332 11.8900 461.496 100
## f_wrap(n) 4.469 6.2115 12.60750 8.569 9.7170 435.420 100
## f_memcpy(n) 4.633 7.0520 13.64193 9.061 9.6965 465.924 100
b(2^20)
## Unit: microseconds
## expr min lq mean median uq max neval
## f_for(n) 3094.106 3118.2960 3160.2501 3132.4205 3171.329 3515.996 100
## f_wrap(n) 864.690 890.0485 912.7006 905.4440 929.593 988.797 100
## f_memcpy(n) 940.048 971.6590 1001.9805 987.3825 1009.195 1428.235 100
b(2^30)
## Unit: seconds
## expr min lq mean median uq max neval
## f_for(n) 3.527164 3.554672 3.575698 3.573021 3.593006 3.693711 100
## f_wrap(n) 1.119750 1.133130 1.143425 1.139702 1.149030 1.203602 100
## f_memcpy(n) 1.304877 1.330994 1.343253 1.339099 1.354286 1.422912 100

Fast checking of missing values in Rcpp

This question is linked to NA values in Rcpp conditional.
I basically have some Rcpp code that loops over multiple (double) elements, and I need to check for missing values element by element (I can't use vectorization). Let's count the number of missing values in a vector, just as a minimal reproducible example:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
  return c;
}

// [[Rcpp::export]]
int nb_na3(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (x[i] == 3) c++;
  return c;
}

// [[Rcpp::export]]
LogicalVector na_real(NumericVector x) {
  return x == NA_REAL;
}
Then, in R, we get:
> x <- rep(c(1, 2, NA), 1e4)
> x2 <- replace(x, is.na(x), 3)
> microbenchmark::microbenchmark(
+ nb_na(x),
+ nb_na3(x2)
+ )
Unit: microseconds
expr min lq mean median uq max neval
nb_na(x) 135.633 135.982 153.08586 139.753 140.3115 1294.928 100
nb_na3(x2) 22.490 22.908 30.14005 23.188 23.5025 684.026 100
> all.equal(nb_na(x), nb_na3(x2))
[1] TRUE
> na_real(x[1:3])
[1] NA NA NA
As noted in the linked question, you can't just check x[i] == NA_REAL because it always returns a missing value. Yet, using R_IsNA(x[i]) is much slower than checking equality with a numeric value (e.g. 3).
Basically, I want a solution where I can check that a single value is a missing value. This solution should be as fast as checking equality with a numeric value.
Checking for a missing value or any NaN-specific variant is always going to be more expensive than checking for a specific value. That's just floating-point arithmetic.
However, there's still room for improvement in your code. I would encourage you to use NumericVector::is_na instead of R_IsNA, but this is mostly cosmetic.
Then, branching can be expensive, so I'd replace if (R_IsNA(x[i])) c++; with c += NumericVector::is_na(x[i]);. This gives this version:
// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]);
  return c;
}
Then, iterating over an int index and accessing x[i] can be replaced by the std::count_if algorithm. This is its raison d'être. Leading to this version:
// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
  return std::count_if(x.begin(), x.end(), NumericVector::is_na);
}
Now, if the performance is still not good enough, you might want to try parallelization; for this I typically use the tbb library from the RcppParallel package.
// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
  return tbb::parallel_reduce(
    tbb::blocked_range<const double*>(x.begin(), x.end()),
    0,
    [](const tbb::blocked_range<const double*>& r, int init) -> int {
      return init + std::count_if(r.begin(), r.end(), NumericVector::is_na);
    },
    [](int x, int y) { return x + y; }
  );
}
Benchmarking with this function:
library(microbenchmark)

bench <- function(n) {
  x <- rep(c(1, 2, NA), n)
  microbenchmark(
    nb_na = nb_na(x),
    nb_na4 = nb_na4(x),
    nb_na5 = nb_na5(x),
    nb_na6 = nb_na6(x)
  )
}
bench(1e5)
On my machine I get:
> bench(1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 84.358 94.6500 107.41957 110.482 118.9580 137.393 100 d
nb_na4 59.984 69.4925 79.42195 82.442 85.9175 106.567 100 b
nb_na5 65.047 75.2625 85.17134 87.501 93.0315 116.993 100 c
nb_na6 39.205 51.0785 59.20582 54.457 68.9625 97.225 100 a
> bench(1e5)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 730.416 732.2660 829.8440 797.4350 872.3335 1410.467 100 d
nb_na4 520.800 521.6215 598.8783 562.7200 657.1755 1059.991 100 b
nb_na5 578.527 579.3805 664.8795 626.5530 710.5925 1166.365 100 c
nb_na6 294.486 345.2050 368.6664 353.6945 372.6205 897.552 100 a
Another way is to totally circumvent floating-point arithmetic and pretend the vector is a vector of long long, aka 64-bit integers, and compare the values to the bit pattern of NA_REAL:
> devtools::install_github( "ThinkR-open/seven31" )
> seven31::reveal(NA, NaN, +Inf, -Inf )
0 11111111111 ( NaN ) 0000000000000000000000000000000000000000011110100010 : NA
0 11111111111 ( NaN ) 1000000000000000000000000000000000000000000000000000 : NaN
0 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : +Inf
1 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : -Inf
A serial solution using this hack:
// [[Rcpp::export]]
int nb_na7(const NumericVector& x) {
  const long long* p = reinterpret_cast<const long long*>(x.begin());
  long long na = *reinterpret_cast<long long*>(&NA_REAL);
  return std::count(p, p + x.size(), na);
}
And then a parallel version:
// [[Rcpp::export]]
int nb_na8(const NumericVector& x) {
  const long long* p = reinterpret_cast<const long long*>(x.begin());
  long long na = *reinterpret_cast<long long*>(&NA_REAL);
  auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
    return init + std::count(r.begin(), r.end(), na);
  };
  return tbb::parallel_reduce(
    tbb::blocked_range<const long long*>(p, p + x.size()),
    0,
    count_chunk,
    [](int x, int y) { return x + y; }
  );
}
> bench(1e5)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 730.346 762.5720 839.9479 857.5865 881.8635 1045.048 100 f
nb_na4 520.946 521.6850 589.0911 578.2825 653.4950 832.449 100 d
nb_na5 578.621 579.3245 640.9772 616.8645 701.8125 890.736 100 e
nb_na6 291.115 307.4300 340.1626 344.7955 360.7030 484.261 100 c
nb_na7 122.156 123.4990 141.1954 132.6385 149.7895 253.988 100 b
nb_na8 69.356 86.9980 109.6427 115.2865 126.2775 182.184 100 a
> bench(1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 7342.984 7956.3375 10261.583 9227.7450 10869.605 79757.09 100 d
nb_na4 5286.970 5721.9150 7659.009 6660.2390 9234.646 31141.47 100 c
nb_na5 5840.946 6272.7050 7307.055 6883.2430 8205.117 10420.48 100 c
nb_na6 2833.378 2895.7160 3891.745 3049.4160 4054.022 18242.26 100 b
nb_na7 1661.421 1791.1085 2708.992 1916.6055 2232.720 60827.63 100 ab
nb_na8 650.639 869.6685 1289.373 939.0045 1291.025 10223.29 100 a
This assumes there's only one bit pattern to represent NA.
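If you want to guard that assumption at runtime, a small sanity check is possible; the helper below is hypothetical (not part of the original answer) and simply verifies that the bit-pattern comparison agrees with R_IsNA on a given vector before you trust nb_na7/nb_na8 on it:
// Hypothetical helper: returns true when the bit-pattern test agrees with
// R_IsNA for every element, i.e. the nb_na7/nb_na8 shortcut is safe for x.
// [[Rcpp::export]]
bool na_bitpattern_ok(const NumericVector& x) {
  const long long* p = reinterpret_cast<const long long*>(x.begin());
  long long na = *reinterpret_cast<long long*>(&NA_REAL);
  R_xlen_t n = x.size();
  for (R_xlen_t i = 0; i < n; i++) {
    // A disagreement means an NA was stored with a different NaN payload.
    if ((p[i] == na) != (R_IsNA(x[i]) != 0)) return false;
  }
  return true;
}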
Here's my entire file for reference:
#include <Rcpp.h>
#include <RcppParallel.h>
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
  return c;
}

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]);
  return c;
}

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
  return std::count_if(x.begin(), x.end(), NumericVector::is_na);
}

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
  return tbb::parallel_reduce(
    tbb::blocked_range<const double*>(x.begin(), x.end()),
    0,
    [](const tbb::blocked_range<const double*>& r, int init) -> int {
      return init + std::count_if(r.begin(), r.end(), NumericVector::is_na);
    },
    [](int x, int y) { return x + y; }
  );
}

// [[Rcpp::export]]
int nb_na7(const NumericVector& x) {
  const long long* p = reinterpret_cast<const long long*>(x.begin());
  long long na = *reinterpret_cast<long long*>(&NA_REAL);
  return std::count(p, p + x.size(), na);
}

// [[Rcpp::export]]
int nb_na8(const NumericVector& x) {
  const long long* p = reinterpret_cast<const long long*>(x.begin());
  long long na = *reinterpret_cast<long long*>(&NA_REAL);
  auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
    return init + std::count(r.begin(), r.end(), na);
  };
  return tbb::parallel_reduce(
    tbb::blocked_range<const long long*>(p, p + x.size()),
    0,
    count_chunk,
    [](int x, int y) { return x + y; }
  );
}

/*** R
library(microbenchmark)

bench <- function(n) {
  x <- rep(c(1, 2, NA), n)
  microbenchmark(
    nb_na = nb_na(x),
    nb_na4 = nb_na4(x),
    nb_na5 = nb_na5(x),
    nb_na6 = nb_na6(x),
    nb_na7 = nb_na7(x),
    nb_na8 = nb_na8(x)
  )
}
bench(1e5)
bench(1e6)
*/
Checking for (IEEE) missing floating-point values is an expensive operation and there is no way around it. This is unrelated to R.
This is one reason why we're excited about the upcoming ALTREP in R - there we can for instance keep track of whether a double/real vector contains missing values or not - if it doesn't, then we don't have to waste time looking for them. Although not updated to mention ALTREP, you can get the gist from https://github.com/HenrikBengtsson/Wishlist-for-R/issues/12

Rcpp version of tabulate is slower; where is this from, how to understand

In the process of creating some sampling functions for already aggregated data, I found that table was rather slow on data of the size I am working with. I tried two improvements: first, an Rcpp function as follows
// [[Rcpp::export]]
IntegerVector getcts(NumericVector x, int m) {
  IntegerVector cts(m);
  int t;
  for (int i = 0; i < x.length(); i++) {
    t = x[i] - 1;
    if (0 <= t && t < m)
      cts[t]++;
  }
  return cts;
}
And then, while trying to understand why table was rather slow, I found that it is based on tabulate. tabulate works well for me and is faster than the Rcpp version. The code for tabulate is at:
https://github.com/wch/r-source/blob/545d365bd0485e5f0913a7d609c2c21d1f43145a/src/main/util.c#L2204
With the key line being:
for(R_xlen_t i = 0 ; i < n ; i++)
    if (x[i] != NA_INTEGER && x[i] > 0 && x[i] <= nb) y[x[i] - 1]++;
Now the key parts of tabulate and my Rcpp version seem pretty close (I have not bothered dealing with NA).
Q1: why is my Rcpp version 3 times slower?
Q2: how can I find out where this time goes?
I would very much appreciate knowing where the time went, but even better would be a good way to profile the code. My C++ skills are only so-so, but this seems simple enough that I should (cross my fingers) have been able to avoid any silly stuff that would triple my time.
My timing code:
max_x <- 100
xs <- sample(seq(max_x), size = 50000000, replace = TRUE)
system.time(getcts(xs, max_x))
system.time(tabulate(xs))
This gives 0.318 for getcts and 0.126 for tabulate.
Your function calls the length method in each loop iteration, and it seems the compiler doesn't cache it. To fix this, store the size of the vector in a separate variable or use a range-based loop. Also note that we don't really need an explicit missing-value check here: NA_INTEGER is stored as the smallest representable int, so the x[i] > 0 test already excludes it.
Let's compare performance:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector tabulate1(const IntegerVector& x, const unsigned max) {
  IntegerVector counts(max);
  for (std::size_t i = 0; i < x.size(); i++) {
    if (x[i] > 0 && x[i] <= max)
      counts[x[i] - 1]++;
  }
  return counts;
}

// [[Rcpp::export]]
IntegerVector tabulate2(const IntegerVector& x, const unsigned max) {
  IntegerVector counts(max);
  std::size_t n = x.size();
  for (std::size_t i = 0; i < n; i++) {
    if (x[i] > 0 && x[i] <= max)
      counts[x[i] - 1]++;
  }
  return counts;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate3(const IntegerVector& x, const unsigned max) {
  IntegerVector counts(max);
  for (auto& now : x) {
    if (now > 0 && now <= max)
      counts[now - 1]++;
  }
  return counts;
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
  IntegerVector counts(max);
  for (auto it = x.begin(); it != x.end(); it++) {
    if (*it > 0 && *it <= max)
      counts[*it - 1]++;
  }
  return counts;
}
/*** R
library(microbenchmark)

x <- sample(10, 1e5, rep = TRUE)
microbenchmark(
  tabulate(x, 10), tabulate1(x, 10),
  tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)

x[sample(10e5, 10e3)] <- NA
microbenchmark(
  tabulate(x, 10), tabulate1(x, 10),
  tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
*/
tabulate1 is the original version.
Benchmark results:
Without NA:
Unit: microseconds
expr min lq mean median uq max neval
tabulate(x, 10) 143.557 146.8355 169.2820 156.1970 177.327 286.370 100
tabulate1(x, 10) 390.706 392.6045 437.7357 416.5655 443.065 748.767 100
tabulate2(x, 10) 108.149 111.4345 139.7579 118.2735 153.118 337.647 100
tabulate3(x, 10) 107.879 111.7305 138.2711 118.8650 139.598 300.023 100
tabulate4(x, 10) 391.003 393.4530 436.3063 420.1915 444.048 777.862 100
With NA:
Unit: microseconds
expr min lq mean median uq max neval
tabulate(x, 10) 943.555 1089.5200 1614.804 1333.806 2042.320 3986.836 100
tabulate1(x, 10) 4523.076 4787.3745 5258.490 4929.586 5624.098 7233.029 100
tabulate2(x, 10) 765.102 931.9935 1361.747 1113.550 1679.024 3436.356 100
tabulate3(x, 10) 773.358 914.4980 1350.164 1140.018 1642.354 3633.429 100
tabulate4(x, 10) 4241.025 4466.8735 4933.672 4717.016 5148.842 8603.838 100
The tabulate4 function, which uses an iterator, is also slower than tabulate, presumably because x.end() is re-evaluated on every iteration. We can improve it by hoisting the iterators out of the loop:
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
  IntegerVector counts(max);
  auto start = x.begin();
  auto end = x.end();
  for (auto it = start; it != end; it++) {
    if (*it > 0 && *it <= max)
      counts[*it - 1]++;
  }
  return counts;
}

sorting columns of Rcpp NumericMatrix for median calculations

I've been testing Rcpp and RcppArmadillo for calculating summary stats on big matrices. This was a lot faster (5 or 10 times) than base R's colMeans or the Armadillo version on ~4 million rows, 45 columns.
colMeansRcpp <- cxxfunction(signature(X_ = "integer"),
                            plugin = 'Rcpp',
                            body = '
  Rcpp::IntegerMatrix X = X_;
  int ncol = X.ncol(); int nrow = X.nrow();
  Rcpp::NumericVector out(ncol);
  for(int col = 0; col < ncol; col++){
    out[col] = Rcpp::sum(X(_, col));
  }
  return wrap(out/nrow);
')
I really want to calculate the median and maybe other quantiles for plotting, and because that requires a sort, it's even more in need of C++ outsourcing. The Armadillo version seems a bit slow, so I wanted to do an in-place sort in code similar to the above, but I just can't get the syntax right... here is what I'm trying:
# OK I'm aware this floor(nrow/2) is not **absolutely** correct
# I'm simplifying here
colMedianRcpp <- cxxfunction(signature(X_ = "integer"),
                             plugin = 'Rcpp',
                             body = '
  Rcpp::IntegerMatrix X = clone(X_);
  int ncol = X.ncol(); int nrow = X.nrow();
  Rcpp::NumericVector out(ncol);
  for(int col = 0; col < ncol; col++){
    X(_,col)= std::sort((X_,col).begin, (X_,col).end));
    out[col]=X(floor(nrow/2), col));
  }
  return wrap(out);
')
Basically it's the line
X(_,col)= std::sort((X_,col).begin, (X_,col).end));
I don't know how to express "sort a column in place" with this mixture of Rcpp sugar and std C++. Sorry, I can see that what I'm doing is wrong, but a hint on the right syntax would be lovely.
P.S. Am I right that I need to do this clone() so I don't change the R object?
EDIT
I've added the RcppArmadillo code and a benchmark comparison to address the answer/comment below. The benchmark was only on 50k rows for a quick reply, but I recall it was similar with many more. I realise you are the Rcpp author... so many thanks for your time!
The thought occurs that perhaps I'm doing something daft with the RcppArmadillo code to make it run so much slower than the base colMeans or Rcpp version?
colMeansRcppArmadillo <- cxxfunction(signature(X_ = "integer"),
                                     plugin = "RcppArmadillo",
                                     body = '
  arma::mat X = Rcpp::as<arma::mat>(X_);
  arma::rowvec MD = arma::mean(X, 0);
  return wrap(MD);
')
And the benchmark is ...
(mb = microbenchmark(
+ colMeans(fqSmallMatrix),
+ colMeansRcpp(fqSmallMatrix),
+ colMeansRcppArmadillo(fqSmallMatrix),
+ times=50))
Unit: milliseconds
expr min lq median uq max neval
colMeans(fqSmallMatrix) 10.620919 10.63289 10.640819 10.648882 10.907145 50
colMeansRcpp(fqSmallMatrix) 2.649038 2.66832 2.676709 2.700839 2.841012 50
colMeansRcppArmadillo(fqSmallMatrix) 25.687067 26.23488 33.168589 33.792489 113.832495 50
You can copy the column into a new vector with
NumericVector y = x(_,j);
Complete example:
library(Rcpp)
cppFunction('
  NumericVector colMedianRcpp(NumericMatrix x) {
    int nrow = x.nrow();
    int ncol = x.ncol();
    int position = nrow / 2; // Euclidean division
    NumericVector out(ncol);
    for (int j = 0; j < ncol; j++) {
      NumericVector y = x(_,j); // Copy the column -- the original will not be modified
      std::nth_element(y.begin(), y.begin() + position, y.end());
      out[j] = y[position];
    }
    return out;
  }
')
x <- matrix( sample(1:12), 3, 4 )
x
colMedianRcpp(x)
x # Unchanged
You are not actually showing RcppArmadillo code -- I have been quite happy with the performance of RcppArmadillo where I needed row/column subsetting.
You can instantiate Armadillo matrices via Rcpp just about as efficiently (no copy, re-using R object memory), so I would try that.
And yes: you want clone() for a distinct copy, and I think you'd get that for free if you use the default RcppArmadillo ctor (rather than the more efficient two-step).
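For reference, a minimal sketch contrasting the two constructions mentioned above (the function and variable names are illustrative); the advanced Armadillo constructor's fourth argument, copy_aux_mem, decides whether the R memory is copied or reused:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
Rcpp::List ctor_demo(Rcpp::NumericMatrix x) {
    // Default conversion: allocates and copies, so writes to X_copy leave x alone.
    arma::mat X_copy = Rcpp::as<arma::mat>(x);

    // Advanced constructor with copy_aux_mem = false: reuses R's memory (no copy),
    // but any write to X_view would modify the R matrix x in place.
    arma::mat X_view(x.begin(), x.nrow(), x.ncol(), false);

    return Rcpp::List::create(Rcpp::Named("copy_mean") = arma::mean(arma::mean(X_copy)),
                              Rcpp::Named("view_mean") = arma::mean(arma::mean(X_view)));
}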
Edit a few hours later
You had left an open question about why your Armadillo was slow. In the meantime, Vincent solved the issue for you but here is a revisited, cleaner solution using your code as well as Vincent's.
Note how it instantiates the Armadillo matrix without a copy -- so it is faster. It also avoids mixing integer and numeric matrices. The code first:
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
NumericVector colMedianRcpp(NumericMatrix x) {
  int nrow = x.nrow();
  int ncol = x.ncol();
  int position = nrow / 2; // Euclidean division
  NumericVector out(ncol);
  for (int j = 0; j < ncol; j++) {
    NumericVector y = x(_,j); // Copy column -- original will not be modified
    std::nth_element(y.begin(), y.begin() + position, y.end());
    out[j] = y[position];
  }
  return out;
}

// [[Rcpp::export]]
arma::rowvec colMeansRcppArmadillo(NumericMatrix x) {
  arma::mat X = arma::mat(x.begin(), x.nrow(), x.ncol(), false);
  return arma::mean(X, 0);
}

// [[Rcpp::export]]
NumericVector colMeansRcpp(NumericMatrix X) {
  int ncol = X.ncol();
  int nrow = X.nrow();
  Rcpp::NumericVector out(ncol);
  for (int col = 0; col < ncol; col++) {
    out[col] = Rcpp::sum(X(_, col));
  }
  return wrap(out/nrow);
}

/*** R
set.seed(42)
X <- matrix(rnorm(100*10), 100, 10)
library(microbenchmark)
mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
                     colMedianRcpp(X), times = 50)
print(mb)
*/
And here is the result on my machine, with the concise Armadillo version about as fast as yours, and median a little slower as it has to do more work:
R> sourceCpp("/tmp/stephen.cpp")
R> set.seed(42)
R> X <- matrix(rnorm(100*10), 100, 10)
R> library(microbenchmark)
R> mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
+ colMedianRcpp(X), times=50)
R> print(mb)
Unit: microseconds
expr min lq median uq max neval
colMeans(X) 9.469 10.422 11.5810 12.421 30.597 50
colMeansRcpp(X) 3.922 4.281 4.5245 5.306 18.020 50
colMeansRcppArmadillo(X) 4.196 4.549 4.9295 5.927 11.159 50
colMedianRcpp(X) 15.615 16.291 16.7290 17.971 27.026 50
R>
