I was trying to compare the execution speed of Fibonacci number generation in R and Rcpp. To my surprise, my R function was faster than my Rcpp function (and its run time grows linearly). What is wrong here?
The R code:
fibo = function (n){
x = rep(0, n)
x[1] = 1
x[2] = 2
for(i in 3:n){
x[i] = x[i-2] + x[i-1]
}
return(x)
}
The Rcpp code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fibo_sam(int n){
IntegerVector x;
x.push_back(1);
x.push_back(2);
for(int i =2; i < n; i++){
x.push_back(x[i - 2] + x[i-1]);
}
return(x);
}
The problem with your Rcpp code is that you are growing the vector instead of allocating the size at the beginning. Try with:
// [[Rcpp::export]]
IntegerVector fibo_sam2(int n) {
IntegerVector x(n);
x[0] = 1;
x[1] = 2;
for (int i = 2; i < n; i++){
x[i] = x[i-2] + x[i-1];
}
return(x);
}
Benchmark:
Unit: microseconds
expr min lq mean median uq max neval cld
fibo(1000) 99.989 102.6375 157.42543 103.962 106.9415 4806.395 100 a
fibo_sam(1000) 493.320 511.8615 801.39046 534.044 590.4945 2825.168 100 b
fibo_sam2(1000) 2.980 3.3110 10.18763 3.642 4.3040 573.443 100 a
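To make the cost of growing concrete, here is a minimal plain-C++ sketch (no Rcpp). The helper `copies_when_growing_by_one` is hypothetical, and reallocating on every append is an assumption standing in for what `push_back` on an Rcpp vector is understood to do. It counts the element copies needed to reach length n:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: build a length-n buffer by reallocating to exactly
// one more element on every append, and count the element copies this costs.
std::size_t copies_when_growing_by_one(std::size_t n) {
    std::vector<int> buf;          // current storage
    std::size_t copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<int> bigger(buf.size() + 1);   // fresh allocation
        for (std::size_t j = 0; j < buf.size(); ++j) {
            bigger[j] = buf[j];                    // copy every old element
            ++copies;
        }
        bigger[buf.size()] = static_cast<int>(i);  // append the new element
        buf.swap(bigger);
    }
    assert(buf.size() == n);
    return copies;
}
```

Under that assumption, n = 1000 costs 499500 copies (0 + 1 + ... + 999), while the preallocated version writes each of the 1000 elements once.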
PS1: check your first values (the sequence usually starts 1, 1, 2, not 1, 2).
PS2: beware large numbers (see this)
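As a rough illustration of the large-number warning, this plain-C++ sketch counts how many terms fit in a 32-bit signed integer, the element type of an IntegerVector. It assumes the starting values 1, 2 from the question, and `terms_fitting_in_int32` is a made-up name:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Count how many leading terms of 1, 2, 3, 5, 8, ... (the starting values
// used in the question) fit in a 32-bit signed integer.
int terms_fitting_in_int32() {
    const std::int64_t limit = std::numeric_limits<std::int32_t>::max();
    std::int64_t a = 1, b = 2;   // 64-bit so the running terms cannot overflow here
    int count = 0;
    while (a <= limit) {
        ++count;
        const std::int64_t next = a + b;
        a = b;
        b = next;
    }
    assert(count > 0);
    return count;
}
```

Only the first 45 terms fit, so a call with n = 1000 goes far beyond what an int can represent.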
I have a vector x of type Eigen::VectorXi with more than 2^31-1 entries, which I would like to return to R. I can do that by copying x entry by entry to a new vector of type Rcpp::IntegerVector, but that seems to be quite slow.
I am wondering:
whether there is a more efficient workaround;
why in the following reproducible example Rcpp::wrap(x) doesn't work.
test.cpp
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
SEXP foo(const R_xlen_t size) {
Eigen::VectorXi x(size);
for (R_xlen_t i = 0; i < size; i++) {
x(i) = 1;
}
return Rcpp::wrap(x);
}
// [[Rcpp::export]]
Rcpp::IntegerVector fooSlow(const R_xlen_t size) {
Eigen::VectorXi x(size);
for (R_xlen_t i = 0; i < size; i++) {
x(i) = 1;
}
Rcpp::IntegerVector y(size);
for (R_xlen_t i = 0; i < size; i++) {
y(i) = x(i);
}
return y;
}
test.R
Rcpp::sourceCpp("./test.cpp")
a <- foo(2^(31)) # Error in foo(2^(31)) : negative length vectors are not allowed
a <- fooSlow(2^(31)) # runs fine but it's slow
Rcpp::wrap is dispatching to a method for Eigen matrices and vectors implemented in RcppEigen. That method doesn't appear to support long vectors, currently. (Edit: It now does; see below.)
The error about negative length is thrown by allocVector3 here. It arises when allocVector3 is called with a negative value for its argument length. My guess is that Rcpp::wrap tries to represent 2^31 as an int, resulting in integer overflow. Maybe this happens here?
In any case, you seem to have stumbled on a bug, so you might consider sharing your example with the RcppEigen maintainers on GitHub. (Edit: Never mind - I've just submitted a patch.) (Edit: Patched now, if you'd like to build RcppEigen from sources [commit 5fd125e or later] in order to update your Rcpp::wrap.)
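The overflow guessed at above can be illustrated with plain arithmetic. This is not RcppEigen's code, only a sketch of what happens when a length of 2^31 is squeezed into a 32-bit signed int on a two's-complement machine; `as_wrapped_int32` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstdint>

// Value a non-negative 64-bit length takes if only its low 32 bits are kept
// and read back as a two's-complement signed 32-bit integer.
std::int64_t as_wrapped_int32(std::int64_t len) {
    assert(len >= 0);
    const std::int64_t two32 = std::int64_t{1} << 32;
    const std::int64_t two31 = std::int64_t{1} << 31;
    const std::int64_t low = len % two32;          // keep the low 32 bits
    return low >= two31 ? low - two32 : low;       // reinterpret as signed
}
```

A length of 2^31 comes back as -2^31, which would explain a "negative length vectors are not allowed" error from the allocator.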
Attempting to answer your first question, I compared your two approaches with my own based on std::memcpy. The std::memcpy approach supports long vectors and is only slightly slower than Rcpp::wrap.
The std::memcpy approach
The C arrays beneath Eigen::VectorXi x and Rcpp::IntegerVector y have the same element type (int) and length (n), so they contain the same number of bytes. You can use std::memcpy to copy that number of bytes from one array's memory address to the other's without a for loop. The hard part is knowing how to obtain the addresses. Eigen::VectorXi has a member function data that returns the address of the underlying int array. For R objects of integer type, INTEGER from the R API does the same thing.
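Stripped of Eigen and Rcpp, the idea looks like this in plain C++. It is a minimal sketch in which std::vector<int> stands in for both containers, and `copy_with_memcpy` is a made-up name:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Copy src into a freshly allocated destination with a single std::memcpy.
std::vector<int> copy_with_memcpy(const std::vector<int>& src) {
    std::vector<int> dst(src.size());
    if (!src.empty()) {
        // Both buffers hold src.size() ints, so the byte counts match.
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
    }
    assert(dst.size() == src.size());
    return dst;
}
```

The byte count is computed in std::size_t, so it does not overflow for lengths beyond 2^31-1 on a 64-bit platform.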
Tests
Rcpp::sourceCpp(code = '
#include <RcppEigen.h>
#include <Rinternals.h>
#include <cstring>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Rcpp::IntegerVector f_for(const R_xlen_t n) {
Eigen::VectorXi x(n);
for (R_xlen_t i = 0; i < n; ++i) {
x(i) = i % 10;
}
Rcpp::IntegerVector y(n);
for (R_xlen_t i = 0; i < n; ++i) {
y(i) = x(i);
}
return y;
}
// [[Rcpp::export]]
Rcpp::IntegerVector f_wrap(const R_xlen_t n) {
Eigen::VectorXi x(n);
for (R_xlen_t i = 0; i < n; ++i) {
x(i) = i % 10;
}
return Rcpp::wrap(x);
}
// [[Rcpp::export]]
Rcpp::IntegerVector f_memcpy(const R_xlen_t n) {
Eigen::VectorXi x(n);
for (R_xlen_t i = 0; i < n; ++i) {
x(i) = i % 10;
}
Rcpp::IntegerVector y(n);
std::memcpy(INTEGER(y), x.data(), n * sizeof(int));
return y;
}
')
n <- 100L
x <- rep_len(0:9, n)
identical(f_for(n), x) # TRUE
identical(f_wrap(n), x) # TRUE
identical(f_memcpy(n), x) # TRUE
b <- function(n) microbenchmark::microbenchmark(f_for(n), f_wrap(n), f_memcpy(n), setup = gc(FALSE))
b(2^10)
## Unit: microseconds
## expr min lq mean median uq max neval
## f_for(n) 6.806 8.5280 15.09497 10.332 11.8900 461.496 100
## f_wrap(n) 4.469 6.2115 12.60750 8.569 9.7170 435.420 100
## f_memcpy(n) 4.633 7.0520 13.64193 9.061 9.6965 465.924 100
b(2^20)
## Unit: microseconds
## expr min lq mean median uq max neval
## f_for(n) 3094.106 3118.2960 3160.2501 3132.4205 3171.329 3515.996 100
## f_wrap(n) 864.690 890.0485 912.7006 905.4440 929.593 988.797 100
## f_memcpy(n) 940.048 971.6590 1001.9805 987.3825 1009.195 1428.235 100
b(2^30)
## Unit: seconds
## expr min lq mean median uq max neval
## f_for(n) 3.527164 3.554672 3.575698 3.573021 3.593006 3.693711 100
## f_wrap(n) 1.119750 1.133130 1.143425 1.139702 1.149030 1.203602 100
## f_memcpy(n) 1.304877 1.330994 1.343253 1.339099 1.354286 1.422912 100
I am trying to mimic the R behaviour that permutes the rows and columns of a matrix according to a vector of indices, as in the following code:
m=matrix(sample(c(0:9),5*5,T),ncol=5,nrow=5)
diag(m)=0
rand=sample(c(1:5))
m[rand,rand]
I tried the following code in c++:
library(Rcpp)
cppFunction('
NumericMatrix test(NumericMatrix& M, int col, IntegerVector& rand) {
NumericMatrix M2(col,col);
for(int a=0;a<col;a++){
for(int b=a+1;b<col;b++){
M2(b,a)=M(rand(b),rand(a));
M2(a,b)=M(rand(a),rand(b));
}
}
return M2;
}
')
But it is very slow:
microbenchmark::microbenchmark(test(m,5,(rand-1)),m[rand,rand])
Any ideas how I could speed up the process?
Using a simpler loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix test(NumericMatrix& M, int col, IntegerVector& rand) {
NumericMatrix M2(col,col);
for(int a=0;a<col;a++){
for(int b=a+1;b<col;b++){
M2(b,a)=M(rand(b),rand(a));
M2(a,b)=M(rand(a),rand(b));
}
}
return M2;
}
// [[Rcpp::export]]
NumericMatrix test2(const NumericMatrix& M, const IntegerVector& ind) {
int col = M.ncol();
NumericMatrix M2(col, col);
for (int j = 0; j < col; j++)
for (int i = 0; i < col; i++)
M2(i, j) = M(ind[i], ind[j]);
return M2;
}
/*** R
N <- 500
m <- matrix(sample(c(0:9), N * N, TRUE), ncol = N, nrow = N)
diag(m) <- 0
rand <- sample(N)
all.equal(test(m, ncol(m), rand - 1), m[rand, rand])
all.equal(test2(m, rand - 1), m[rand, rand])
microbenchmark::microbenchmark(
test(m, ncol(m), rand - 1),
m[rand, rand],
test2(m, rand - 1)
)
*/
For N = 5, the R version is faster, but only by a matter of nanoseconds.
For example, with N = 500, you get:
Unit: microseconds
expr min lq mean median uq max neval
test(m, ncol(m), rand - 1) 2092.474 2233.020 2843.145 2360.654 2548.050 7412.057 100
m[rand, rand] 1422.352 1506.117 2064.500 1578.129 1718.345 6700.219 100
test2(m, rand - 1) 698.595 769.944 1161.747 838.811 928.535 5379.841 100
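For reference, the loop structure of test2 can be written without Rcpp. This is a sketch under stated assumptions: the matrix is a column-major std::vector<double> (the layout R uses), the indices are already zero-based, and `permute_square` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the n x n matrix out with out(i, j) = m(ind[i], ind[j]).
// Both matrices are stored column-major, as R does; ind is zero-based.
std::vector<double> permute_square(const std::vector<double>& m,
                                   const std::vector<int>& ind) {
    const std::size_t n = ind.size();
    assert(m.size() == n * n);
    std::vector<double> out(n * n);
    for (std::size_t j = 0; j < n; ++j) {        // columns outermost
        for (std::size_t i = 0; i < n; ++i) {    // rows innermost: contiguous writes
            out[i + j * n] = m[ind[i] + ind[j] * n];
        }
    }
    return out;
}
```

Unlike the original test function, this loop also fills the diagonal, so it does not rely on the diagonal of the input being zero.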
What is a faster way to get the nonnegative component of a double vector? That is,
pmax(x, 0)
My attempt is using Rcpp:
//' @title Parallel maximum
//' @description A faster \code{pmax()}.
//'
//' @name pmaxC
//' @param x A numeric vector.
//' @param a A single numeric value.
//' @return The parallel maximum of the input values.
//' @note This function will always be faster than \code{pmax(x, a)} when \code{a} is a single value, but can be slower than \code{pmax.int(x, a)} when \code{x} is short. Use this function when comparing a numeric vector with a single value.
//' @export pmaxC
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector pmaxC(NumericVector x, double a) {
int n = x.length();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
double xi = x[i];
if (xi < a) {
out[i] = a;
} else {
out[i] = xi;
}
}
return out;
}
This is a modest improvement:
set.seed(5)
x <- rnorm(1e6)
microbenchmark(pmax(x, 0), pmaxC(x, 0))
Unit: milliseconds
expr min lq mean median uq max neval cld
pmax(x, 0) 8.500419 8.621341 11.09672 10.132045 10.791020 58.44972 100 a
pmaxC(x, 0) 5.624480 5.709262 8.83968 7.598093 7.907853 53.91339 100 a
Neither is unacceptably slow, but given that this is a common scenario, I was wondering whether a package had developed a faster approach.
The operation you are performing is fairly simple, so I'm not sure there is much room for improvement with regard to your algorithm above. However, if you really need to squeeze out extra performance, this seems like a good candidate for parallelization. Here is a possible implementation using RcppParallel:
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
struct Pmax : public RcppParallel::Worker {
struct Apply {
double mx;
Apply(double mx_)
: mx(mx_)
{}
double operator()(const double x) const
{
return x > mx ? x : mx;
}
};
const RcppParallel::RVector<double> input;
RcppParallel::RVector<double> output;
Apply f;
Pmax(const Rcpp::NumericVector input_,
Rcpp::NumericVector output_,
double mx_)
: input(input_), output(output_), f(mx_)
{}
void operator()(std::size_t begin, std::size_t end)
{
std::transform(
input.begin() + begin,
input.begin() + end,
output.begin() + begin,
f
);
}
};
// [[Rcpp::export]]
Rcpp::NumericVector par_pmax(Rcpp::NumericVector x, double y)
{
Rcpp::NumericVector res = Rcpp::no_init_vector(x.size());
Pmax p(x, res, y);
RcppParallel::parallelFor(0, x.size(), p);
return res;
}
Testing this with your example data, I get a reasonable improvement:
set.seed(5)
x <- rnorm(1e6)
all.equal(pmax(x, 0), par_pmax(x, 0))
#[1] TRUE
microbenchmark::microbenchmark(
pmax(x, 0),
pmaxC(x, 0),
par_pmax(x, 0),
times = 500L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# pmax(x, 0) 11.843528 12.193126 14.972588 13.030448 16.799250 102.09895 500
# pmaxC(x, 0) 7.804883 8.036879 10.462070 8.772635 12.407587 69.08290 500
# par_pmax(x, 0) 2.244691 2.443971 4.552169 2.624008 6.359027 65.99233 500
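One detail worth checking before swapping one version for the other: the comparison in pmaxC and the one in par_pmax treat NaN differently. The plain-C++ sketch below isolates the two expressions. The function names are made up, and R's NA_real_ is assumed to behave like any other NaN in comparisons:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Comparison used in pmaxC: a NaN input fails (x < a) and is returned as is.
double clamp_keep_nan(double x, double a) { return x < a ? a : x; }

// Comparison used in par_pmax: a NaN input fails (x > a) and is replaced by a.
double clamp_drop_nan(double x, double a) { return x > a ? x : a; }

// True when both variants agree on x (two NaN results count as agreeing).
bool variants_agree(double x, double a) {
    assert(!std::isnan(a));   // the threshold itself must be a number
    const double u = clamp_keep_nan(x, a);
    const double v = clamp_drop_nan(x, a);
    return (std::isnan(u) && std::isnan(v)) || u == v;
}
```

Base R's pmax(NA, 0) returns NA, which matches the first variant. The all.equal check above uses rnorm data without missing values, so it would not reveal the difference.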
In the process of creating some sampling functions for already aggregated data, I found that table was rather slow on data of the size I am working with. I tried two improvements, the first being an Rcpp function as follows:
// [[Rcpp::export]]
IntegerVector getcts(NumericVector x, int m) {
IntegerVector cts(m);
int t;
for (int i = 0; i < x.length(); i++) {
t = x[i] - 1;
if (0 <= t && t < m)
cts[t]++;
}
return cts;
}
Then, while trying to understand why table was rather slow, I found that it is based on tabulate. tabulate works well for me and is faster than the Rcpp version. The code for tabulate is at:
https://github.com/wch/r-source/blob/545d365bd0485e5f0913a7d609c2c21d1f43145a/src/main/util.c#L2204
With the key line being:
for(R_xlen_t i = 0 ; i < n ; i++)
if (x[i] != NA_INTEGER && x[i] > 0 && x[i] <= nb) y[x[i] - 1]++;
Now the key parts of tabulate and my Rcpp version seem pretty close (I have not bothered dealing with NA).
Q1: why is my Rcpp version 3 times slower?
Q2: how can I find out where this time goes?
I would very much appreciate knowing where the time went, but even better would be a good way to profile the code. My C++ skills are only so so, but this seems simple enough that I should (cross my fingers) have been able to avoid any silly stuff that would triple my time.
My timing code:
max_x <- 100
xs <- sample(seq(max_x), size = 50000000, replace = TRUE)
system.time(getcts(xs, max_x))
system.time(tabulate(xs))
This gives 0.318 for getcts and 0.126 for tabulate.
Your function calls the length method in each loop iteration, and it seems the compiler does not cache the result. To fix this, store the size of the vector in a separate variable or use a range-based for loop. Also note that an explicit missing-value check is not really needed here: in an IntegerVector, NA is stored as the smallest int, so the x[i] > 0 test already rejects it (and for doubles, all comparisons involving a NaN return false).
Let's compare performance:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector tabulate1(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (std::size_t i = 0; i < x.size(); i++) {
if (x[i] > 0 && x[i] <= max)
counts[x[i] - 1]++;
}
return counts;
}
// [[Rcpp::export]]
IntegerVector tabulate2(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
std::size_t n = x.size();
for (std::size_t i = 0; i < n; i++) {
if (x[i] > 0 && x[i] <= max)
counts[x[i] - 1]++;
}
return counts;
}
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate3(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (auto& now : x) {
if (now > 0 && now <= max)
counts[now - 1]++;
}
return counts;
}
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (auto it = x.begin(); it != x.end(); it++) {
if (*it > 0 && *it <= max)
counts[*it - 1]++;
}
return counts;
}
/***R
library(microbenchmark)
x <- sample(10, 1e5, rep = TRUE)
microbenchmark(
tabulate(x, 10), tabulate1(x, 10),
tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
x[sample(10e5, 10e3)] <- NA
microbenchmark(
tabulate(x, 10), tabulate1(x, 10),
tabulate2(x, 10), tabulate3(x, 10), tabulate4(x, 10)
)
*/
tabulate1 is the original version.
Benchmark results:
Without NA:
Unit: microseconds
expr min lq mean median uq max neval
tabulate(x, 10) 143.557 146.8355 169.2820 156.1970 177.327 286.370 100
tabulate1(x, 10) 390.706 392.6045 437.7357 416.5655 443.065 748.767 100
tabulate2(x, 10) 108.149 111.4345 139.7579 118.2735 153.118 337.647 100
tabulate3(x, 10) 107.879 111.7305 138.2711 118.8650 139.598 300.023 100
tabulate4(x, 10) 391.003 393.4530 436.3063 420.1915 444.048 777.862 100
With NA:
Unit: microseconds
expr min lq mean median uq max neval
tabulate(x, 10) 943.555 1089.5200 1614.804 1333.806 2042.320 3986.836 100
tabulate1(x, 10) 4523.076 4787.3745 5258.490 4929.586 5624.098 7233.029 100
tabulate2(x, 10) 765.102 931.9935 1361.747 1113.550 1679.024 3436.356 100
tabulate3(x, 10) 773.358 914.4980 1350.164 1140.018 1642.354 3633.429 100
tabulate4(x, 10) 4241.025 4466.8735 4933.672 4717.016 5148.842 8603.838 100
The tabulate4 function, which uses an iterator, is also slower than tabulate. We can improve it by calling begin() and end() only once:
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector tabulate4(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
auto start = x.begin();
auto end = x.end();
for (auto it = start; it != end; it++) {
if (*(it) > 0 && *(it) <= max)
counts[*(it) - 1]++;
}
return counts;
}
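The same counting loop can be tried outside R. The sketch below uses std::vector<int> in place of IntegerVector and INT_MIN as a stand-in for R's integer NA (both are assumptions of this example, as is the name `tabulate_sketch`). The length is read once, and the NA stand-in is rejected by the v > 0 test alone:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Count occurrences of 1..nbins in x, ignoring everything else.
// INT_MIN stands in for R's NA_integer_ (an assumption of this sketch).
std::vector<int> tabulate_sketch(const std::vector<int>& x, int nbins) {
    assert(nbins >= 0);
    std::vector<int> counts(static_cast<std::size_t>(nbins), 0);
    const std::size_t n = x.size();          // read the length once
    for (std::size_t i = 0; i < n; ++i) {
        const int v = x[i];
        if (v > 0 && v <= nbins) {           // also rejects INT_MIN
            ++counts[static_cast<std::size_t>(v - 1)];
        }
    }
    return counts;
}
```

Copying x[i] into a local variable also avoids indexing the vector three times per iteration.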
I've been testing Rcpp and RcppArmadillo for calculating summary stats on big matrices. This was a lot faster (5 or 10 times faster) than the base R colMeans or the Armadillo version on ~4 million rows, 45 columns.
colMeansRcpp <- cxxfunction(signature(X_="integer"),
plugin='Rcpp',
body='
Rcpp::IntegerMatrix X = X_;
int ncol = X.ncol(); int nrow = X.nrow();
Rcpp::NumericVector out(ncol);
for(int col = 0; col < ncol; col++){
out[col]=Rcpp::sum(X(_, col));
}
return wrap(out/nrow);
')
I really want to calculate the median and maybe other quantiles for plotting, and because that requires a sort it is even more in need of C++ outsourcing. The Armadillo version seems a bit slow, so I wanted to do an in-place sort in code similar to the above, but I just can't get the syntax right. Here is what I'm trying:
# OK I'm aware this floor(nrow/2) is not **absolutely** correct
# I'm simplifying here
colMedianRcpp <- cxxfunction(signature(X_="integer"),
plugin='Rcpp',
body='
Rcpp::IntegerMatrix X = clone(X_);
int ncol = X.ncol(); int nrow = X.nrow();
Rcpp::NumericVector out(ncol);
for(int col = 0; col < ncol; col++){
X(_,col)= std::sort((X_,col).begin, (X_,col).end));
out[col]=X(floor(nrow/2), col));
}
return wrap(out);
')
Basically it's the line
X(_,col)= std::sort((X_,col).begin, (X_,col).end));
I don't know how to express "sort a column in place" with this mixture of Rcpp sugar and std C++. Sorry I can see what I'm doing is wrong but a hint on the right syntax would be lovely.
ps Am I right I need to do this clone() so I don't change the R object?
EDIT
I have added the RcppArmadillo code and a benchmark comparison to address the answer/comment below. The benchmark was only on 50k rows for a quick reply, but I recall it was similar with many more. I realise you are the Rcpp author, so many thanks for your time!
The thought occurs that perhaps I'm doing something daft with the RcppArmadillo code to make it run so much slower than the base colMeans or Rcpp version?
colMeansRcppArmadillo <- cxxfunction(signature(X_="integer"),
plugin="RcppArmadillo",
body='
arma::mat X = Rcpp::as<arma::mat > (X_);
arma::rowvec MD= arma::mean(X, 0);
return wrap(MD);
')
And the benchmark is ...
(mb = microbenchmark(
+ colMeans(fqSmallMatrix),
+ colMeansRcpp(fqSmallMatrix),
+ colMeansRcppArmadillo(fqSmallMatrix),
+ times=50))
Unit: milliseconds
expr min lq median uq max neval
colMeans(fqSmallMatrix) 10.620919 10.63289 10.640819 10.648882 10.907145 50
colMeansRcpp(fqSmallMatrix) 2.649038 2.66832 2.676709 2.700839 2.841012 50
colMeansRcppArmadillo(fqSmallMatrix) 25.687067 26.23488 33.168589 33.792489 113.832495 50
You can copy the column into a new vector with
NumericVector y = x(_,j);
Complete example:
library(Rcpp)
cppFunction('
NumericVector colMedianRcpp(NumericMatrix x) {
int nrow = x.nrow();
int ncol = x.ncol();
int position = nrow / 2; // Integer (Euclidean) division
NumericVector out(ncol);
for (int j = 0; j < ncol; j++) {
NumericVector y = x(_,j); // Copy the column -- the original will not be modified
std::nth_element(y.begin(), y.begin() + position, y.end());
out[j] = y[position];
}
return out;
}
')
x <- matrix( sample(1:12), 3, 4 )
x
colMedianRcpp(x)
x # Unchanged
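The question notes that taking the element at floor(nrow/2) is not quite the median when the number of rows is even. A plain-C++ sketch of one way to handle that with the same std::nth_element idea follows. It assumes column-major storage in a std::vector<double>, and `col_medians` is a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of each column of a column-major n_rows x n_cols matrix.
std::vector<double> col_medians(const std::vector<double>& m,
                                std::size_t n_rows, std::size_t n_cols) {
    assert(n_rows > 0 && m.size() == n_rows * n_cols);
    std::vector<double> out(n_cols);
    const std::size_t mid = n_rows / 2;
    for (std::size_t j = 0; j < n_cols; ++j) {
        // Work on a copy of column j so the input stays untouched.
        std::vector<double> y(m.begin() + j * n_rows, m.begin() + (j + 1) * n_rows);
        std::nth_element(y.begin(), y.begin() + mid, y.end());
        double med = y[mid];
        if (n_rows % 2 == 0) {
            // After nth_element everything left of mid is <= y[mid], so the
            // lower middle value is the largest element of that left part.
            const double lower = *std::max_element(y.begin(), y.begin() + mid);
            med = (lower + med) / 2.0;
        }
        out[j] = med;
    }
    return out;
}
```

The extra std::max_element pass is linear in the half column, so the whole computation stays linear on average rather than paying for a full sort.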
You are not actually showing RcppArmadillo code -- I have been quite happy with the performance of RcppArmadillo code where I needed row/column subsetting.
You can instantiate Armadillo matrices via Rcpp just about as efficiently (no copy, re-using R object memory) so I would try that.
And yes: you want clone() for a distinct copy, and I think you'd get that for free if you use the default RcppArmadillo ctor (rather than the more efficient two-step).
Edit a few hours later
You had left an open question about why your Armadillo was slow. In the meantime, Vincent solved the issue for you but here is a revisited, cleaner solution using your code as well as Vincent's.
Note how it instantiates the Armadillo matrix without a copy -- so it is faster. It also avoids mixing integer and numeric matrices. The code first:
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
NumericVector colMedianRcpp(NumericMatrix x) {
int nrow = x.nrow();
int ncol = x.ncol();
int position = nrow / 2; // Integer (Euclidean) division
NumericVector out(ncol);
for (int j = 0; j < ncol; j++) {
NumericVector y = x(_,j); // Copy column -- original will not be modified
std::nth_element(y.begin(), y.begin() + position, y.end());
out[j] = y[position];
}
return out;
}
// [[Rcpp::export]]
arma::rowvec colMeansRcppArmadillo(NumericMatrix x){
arma::mat X = arma::mat(x.begin(), x.nrow(), x.ncol(), false);
return arma::mean(X, 0);
}
// [[Rcpp::export]]
NumericVector colMeansRcpp(NumericMatrix X) {
int ncol = X.ncol();
int nrow = X.nrow();
Rcpp::NumericVector out(ncol);
for (int col = 0; col < ncol; col++){
out[col]=Rcpp::sum(X(_, col));
}
return wrap(out/nrow);
}
/*** R
set.seed(42)
X <- matrix(rnorm(100*10), 100, 10)
library(microbenchmark)
mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
colMedianRcpp(X), times=50)
print(mb)
*/
And here is the result on my machine, with the concise Armadillo version about as fast as yours, and median a little slower as it has to do more work:
R> sourceCpp("/tmp/stephen.cpp")
R> set.seed(42)
R> X <- matrix(rnorm(100*10), 100, 10)
R> library(microbenchmark)
R> mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
+ colMedianRcpp(X), times=50)
R> print(mb)
Unit: microseconds
expr min lq median uq max neval
colMeans(X) 9.469 10.422 11.5810 12.421 30.597 50
colMeansRcpp(X) 3.922 4.281 4.5245 5.306 18.020 50
colMeansRcppArmadillo(X) 4.196 4.549 4.9295 5.927 11.159 50
colMedianRcpp(X) 15.615 16.291 16.7290 17.971 27.026 50
R>
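The gain in the Armadillo version comes from reading R's memory in place instead of copying it first. A plain-C++ analogue of that idea, with no Armadillo involved, is to compute on a raw pointer to column-major data. This is only a sketch, and `col_means_view` is a made-up name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column means computed directly from a pointer to column-major data;
// the matrix values are read in place and never copied.
std::vector<double> col_means_view(const double* data,
                                   std::size_t n_rows, std::size_t n_cols) {
    assert(data != nullptr && n_rows > 0);
    std::vector<double> out(n_cols);
    for (std::size_t j = 0; j < n_cols; ++j) {
        const double* col = data + j * n_rows;   // start of column j
        double sum = 0.0;
        for (std::size_t i = 0; i < n_rows; ++i) sum += col[i];
        out[j] = sum / static_cast<double>(n_rows);
    }
    return out;
}
```

The caller must keep the underlying buffer alive while the function runs, which is the same lifetime caveat that applies to a no-copy matrix view.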