Parallel Addition of Vectors using RcppParallel

I am trying to parallelise the addition of (large) vectors using RcppParallel. That's what I've come up with.
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
#include <algorithm>  // std::transform
using namespace RcppParallel;
using namespace Rcpp;

// Vectorised addition via Rcpp sugar.
// Note: assert() is normally compiled out because R builds with -DNDEBUG,
// so the length check uses stop() instead.
// [[Rcpp::export]]
NumericVector directVectorAddition(NumericVector first, NumericVector second) {
    if (first.length() != second.length()) stop("Vectors must have the same length.");
    NumericVector results(first.length());
    results = first + second;
    return results;
}

// Element-wise addition in an explicit loop
// [[Rcpp::export]]
NumericVector loopVectorAddition(NumericVector first, NumericVector second) {
    if (first.length() != second.length()) stop("Vectors must have the same length.");
    NumericVector results(first.length());
    for (R_xlen_t i = 0; i < first.length(); i++)
        results[i] = first[i] + second[i];
    return results;
}

// Worker that adds the slice [a1, a2) of the two inputs
struct VectorAddition : public Worker
{
    const RVector<double> first, second;
    RVector<double> results;
    VectorAddition(const NumericVector one, const NumericVector two, NumericVector three)
        : first(one), second(two), results(three) {}
    void operator()(std::size_t a1, std::size_t a2) {
        std::transform(first.begin() + a1, first.begin() + a2,
                       second.begin() + a1,
                       results.begin() + a1,
                       [](double i, double j) {return i + j;});
    }
};

// Parallel addition: parallelFor splits the index range across threads
// [[Rcpp::export]]
NumericVector parallelVectorAddition(NumericVector first, NumericVector second) {
    if (first.length() != second.length()) stop("Vectors must have the same length.");
    NumericVector results(first.length());
    VectorAddition myVectorAddition(first, second, results);
    parallelFor(0, first.length(), myVectorAddition);
    return results;
}
It seems to work, but doesn't speed up things (at least not on a 4-core machine).
> v1 <- 1:1000000
> v2 <- 1000000:1
> all(directVectorAddition(v1, v2) == loopVectorAddition(v1, v2))
[1] TRUE
> all(directVectorAddition(v1, v2) == parallelVectorAddition(v1, v2))
[1] TRUE
> result <- benchmark(v1 + v2, directVectorAddition(v1, v2), loopVectorAddition(v1, v2), parallelVectorAddition(v1, v2), order="relative")
> result[,1:4]
                            test replications elapsed relative
1                        v1 + v2          100   0.206    1.000
4 parallelVectorAddition(v1, v2)          100   0.993    4.820
2   directVectorAddition(v1, v2)          100   1.015    4.927
3     loopVectorAddition(v1, v2)          100   1.056    5.126
Can this be implemented more efficiently?
Thanks a lot in advance,
mce

Rookie mistake :) You declare the arguments as Rcpp::NumericVector but create the data with the sequence operator. That produces integer values, so every one of your functions is forced to convert (and therefore copy) its inputs to double first!
Make it
v1 <- as.double(1:1000000)
v2 <- as.double(1000000:1)
instead, and on a machine with lots of cores (at work) I then see
R> result[,1:4]
                            test replications elapsed relative
4 parallelVectorAddition(v1, v2)          100   0.301    1.000
2   directVectorAddition(v1, v2)          100   0.424    1.409
1                        v1 + v2          100   0.436    1.449
3     loopVectorAddition(v1, v2)          100   0.736    2.445
The example is still not that impressive because the operation itself is "cheap", whereas the parallel approach carries overhead of its own: it needs to allocate memory, copy data to workers, collect the results again, and so on.
But the good news is that you wrote your parallel code correctly. Not a small task.
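To make the overhead argument a bit more concrete, here is a minimal sketch of what a parallelFor-style helper does with a worker. It is plain standard C++ rather than RcppParallel; AddWorker and chunked_for are hypothetical names, and the chunks are processed one after another instead of on separate threads, so it only illustrates how the range is split, not the real scheduler:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the VectorAddition worker: adds the slice [begin, end).
struct AddWorker {
    const std::vector<double>& first;
    const std::vector<double>& second;
    std::vector<double>& results;

    void operator()(std::size_t begin, std::size_t end) const {
        std::transform(first.begin() + begin, first.begin() + end,
                       second.begin() + begin,
                       results.begin() + begin,
                       std::plus<double>());
    }
};

// Hypothetical helper: splits [begin, end) into chunks of at most `grain`
// elements and hands each chunk to the worker. A real scheduler would run
// the chunks on different threads; here they run serially.
// Returns the number of chunks that were processed.
template <typename Worker>
std::size_t chunked_for(std::size_t begin, std::size_t end,
                        const Worker& worker, std::size_t grain) {
    if (grain == 0) grain = 1;  // avoid an endless loop on a zero grain size
    std::size_t chunks = 0;
    for (std::size_t lo = begin; lo < end; lo += grain) {
        worker(lo, std::min(lo + grain, end));
        ++chunks;
    }
    return chunks;
}
```

With five elements and a grain of 2, the worker is called three times, on [0, 2), [2, 4) and [4, 5). When each call does no more than a handful of additions, the bookkeeping around the calls can cost about as much as the additions themselves.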

Related

"These samplers cannot be used in parallelized code"

I was reading the vignette for the rgen package which provides headers for sampling from some common distributions. In the first paragraph, it says that:
Please note, these samplers, just like the ones in armadillo cannot be used in parallelized code as the underlying generation routines rely upon R calls that are single-threaded.
This was news to me, and I've been using RcppArmadillo for quite some time now. I was wondering if someone could elaborate on this point (or provide references to where I can read about the issue). I'm especially interested in learning what "cannot be used" means here; will results be wrong, or will it just not parallelize?
These functions use R's random number generator, which must not be used in parallelized code, since that leads to undefined behavior. Undefined behavior can lead to virtually anything. From my point of view you are lucky if the program crashes, since this clearly tells you that something is going wrong.
The HPC task view lists some RNGs that are suitable for parallel computation. But you cannot use them easily with the distributions provided by rgen or RcppDist. Instead, one could do the following:
Copy the function for the multivariate normal distribution from rgen and adjust its signature so that it takes a std::function<double()> as the source of N(0, 1) distributed random numbers.
Use a fast RNG instead of R's RNG.
Use the same fast RNG in parallel mode.
In code as a quick hack:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
#include <functional>  // std::function
#include <vector>

// Multivariate normal sampler; `rnorm` supplies the N(0, 1) draws
// and defaults to R's (single-threaded) norm_rand.
inline arma::mat rmvnorm(unsigned int n, const arma::vec& mu, const arma::mat& S,
                         std::function<double()> rnorm = norm_rand) {
    unsigned int ncols = S.n_cols;
    arma::mat Y(n, ncols);
    Y.imbue( rnorm );
    return arma::repmat(mu, 1, n).t() + Y * arma::chol(S);
}

// R's RNG, serial
// [[Rcpp::export]]
arma::mat defaultRNG(unsigned int n, const arma::vec& mu, const arma::mat& S) {
    return rmvnorm(n, mu, S);
}

// Fast RNG, serial
// [[Rcpp::export]]
arma::mat serial(unsigned int n, const arma::vec& mu, const arma::mat& S) {
    dqrng::normal_distribution dist(0.0, 1.0);
    dqrng::xoshiro256plus rng(42);
    return rmvnorm(n, mu, S, [&](){return dist(rng);});
}

// Fast RNG, one independent stream per thread
// [[Rcpp::export]]
std::vector<arma::mat> parallel(unsigned int n, const arma::vec& mu, const arma::mat& S, unsigned int ncores = 1) {
    dqrng::normal_distribution dist(0.0, 1.0);
    dqrng::xoshiro256plus rng(42);
    std::vector<arma::mat> res(ncores);
    #pragma omp parallel num_threads(ncores)
    {
        dqrng::xoshiro256plus lrng(rng);      // make thread local copy of rng
        lrng.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps
        res[omp_get_thread_num()] = rmvnorm(n, mu, S, [&](){return dist(lrng);});
    }
    return res;
}

/*** R
set.seed(42)
N <- 1000000
M <- 100
mu <- rnorm(M)
S <- matrix(rnorm(M*M), M, M)
S <- S %*% t(S)
system.time(defaultRNG(N, mu, S))
system.time(serial(N, mu, S))
system.time(parallel(N/2, mu, S, 2))
*/
Result:
> system.time(defaultRNG(N, mu, S))
user system elapsed
6.984 1.380 6.881
> system.time(serial(N, mu, S))
user system elapsed
4.008 1.448 3.971
> system.time(parallel(N/2, mu, S, 2))
user system elapsed
4.824 2.096 3.080
Here the real performance improvement comes from using a faster RNG, which is understandable since the focus here lies on many random numbers and not so much on matrix operations. If I shift more towards matrix operations by using N <- 100000 and M <- 1000 I get:
> system.time(defaultRNG(N, mu, S))
user system elapsed
16.740 1.768 9.725
> system.time(serial(N, mu, S))
user system elapsed
13.792 1.864 6.792
> system.time(parallel(N/2, mu, S, 2))
user system elapsed
14.112 3.900 5.859
Here we clearly see that in all cases user time is larger than elapsed time. The reason for this is the parallel BLAS implementation I am using (OpenBLAS). So there are quite a few factors to consider before deciding on a method.
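The "thread local copy plus jump" idea from parallel() can also be looked at in isolation. The following is only a sketch in standard C++, assuming std::mt19937_64 with discard() as a stand-in for xoshiro256plus with jump(); worker_draws and stride are made-up names. Unlike a real jump function, discard() steps through the skipped values one by one, and std::normal_distribution may consume a varying number of raw draws per value, so the streams only stay disjoint as long as stride is chosen much larger than what a single worker needs:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draws `n` standard-normal values for worker `id` from its own copy of `base`.
// The copy is advanced by id * stride raw draws so that the workers read
// different parts of the generator's sequence (assumption: no worker
// consumes more than `stride` raw draws).
std::vector<double> worker_draws(const std::mt19937_64& base, unsigned id,
                                 std::size_t n, unsigned long long stride) {
    std::mt19937_64 local(base);   // worker-local copy, never shared
    local.discard(id * stride);    // skip ahead to this worker's block
    std::normal_distribution<double> dist(0.0, 1.0);  // kept local as well
    std::vector<double> out(n);
    for (double& x : out) x = dist(local);
    return out;
}
```

Because every worker owns its generator and distribution, there is no shared mutable state, and the result for a given id is reproducible no matter how the workers are scheduled.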

Rcpp use outer with pmax

I have an R function which I need to evaluate approximately one million times for vectors of length ~ 5000. Is there any possibility to speed it up by implementing it in Rcpp? I have hardly worked with Rcpp before, and the code below does not work:
set.seed(1)
a <- rt(5e3, df = 2)
b <- rt(5e3, df = 2.5)
c <- rt(5e3, df = 3)
d <- rt(5e3, df = 3.5)
sum((1 - outer(a, b, pmax)) * (1 - outer(c, d, pmax)))
#[1] -69677.99
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double f_outer(NumericVector u, NumericVector v, NumericVector x, NumericVector y) {
    double result = sum((1 - Rcpp::outer(u, v, Rcpp::pmax)) * (1 - Rcpp::outer(x, y, Rcpp::pmax)));
    return(result);
}
Thank you very much!
F. Privé is right -- we'll want to go with loops here; I've got the following C++ code in a file so-answer.cpp:
#include <Rcpp.h>
#include <algorithm>  // std::max
using namespace Rcpp;
// [[Rcpp::export]]
double f_outer(NumericVector u, NumericVector v, NumericVector x, NumericVector y) {
    // We'll use the size of the first and second vectors for our for loops
    int n = u.size();
    int m = v.size();
    // Make sure the vectors are appropriately sized for what we're doing
    // (stop() raises an R error via a C++ exception, which is safer in
    // C++ code than calling Rf_error() directly)
    if ( (n != x.size() ) || ( m != y.size() ) ) {
        stop("Vectors not of compatible sizes.");
    }
    // Initialize a result variable
    double result = 0.0;
    // And use loops instead of outer
    for ( int i = 0; i < n; ++i ) {
        for ( int j = 0; j < m; ++j ) {
            result += (1 - std::max(u[i], v[j])) * (1 - std::max(x[i], y[j]));
        }
    }
    // Then return the result
    return result;
}
Then we see in R that the C++ code gives the same answer as your R code, but runs much faster:
library(Rcpp) # for sourceCpp()
library(microbenchmark) # for microbenchmark() (for benchmarking)
sourceCpp("so-answer.cpp") # compile our C++ code and make it available in R
set.seed(1) # for reproducibility
a <- rt(5e3, df = 2)
b <- rt(5e3, df = 2.5)
c <- rt(5e3, df = 3)
d <- rt(5e3, df = 3.5)
sum((1 - outer(a, b, pmax)) * (1 - outer(c, d, pmax)))
#> [1] -69677.99
f_outer(a, b, c, d)
#> [1] -69677.99
# Same answer, so looking good. Which one's faster?
microbenchmark(base = sum((1 - outer(a, b, pmax)) * (1 - outer(c, d, pmax))),
rcpp = f_outer(a, b, c, d))
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq        max neval cld
#>  base 3978.9201 4119.6757 4197.9292 4131.3300 4144.4524 10121.5558   100   b
#>  rcpp  118.8963  119.1531  129.4071  119.4767  122.5218   909.2744   100  a
Created on 2018-12-13 by the reprex package (v0.2.1)

change vector element by name in rcpp

I have a function where I need to make a table (tab), then change one value: the value where tab.names() == k, where k is given in the function call.
Looking at http://dirk.eddelbuettel.com/code/rcpp/Rcpp-quickref.pdf, I had hoped that the following code would work (replacing "foo" with a variable name), but I guess that requires the element name to be static, and mine won't be. I've tried using which, but that won't compile (invalid conversion from 'char' to 'Rcpp::traits::storage_type<16>::type {aka SEXPREC*}'), so I'm doing something wrong there.
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
    // count number of peptides per protein
    IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
    CharacterVector all_proteins = tab.names();
    char kc = '0' + k;
    // what I need a working version of:
    tab(kc) = 1; // gets ignored, as does a [] version of the same thing.
    // or
    tab('0' + k) = 1; // also ignored
    int ki = which(all_proteins == kc); // gives me compile errors
    // extra credit
    // tab.names(k-1) = "-1";
    return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
# same thing in R:
expected_output <- table(x)
expected_output # before modification
# x
# 3 4 5 6 7 9 10 12
# 2 4 3 3 4 2 1 1
expected_output[as.character(k)] <- 1 # this is what I need help with
expected_output
# x
# 3 4 5 6 7 9 10 12
# 2 4 1 3 4 2 1 1
# extra credit:
names(expected_output)[as.character(k)] <- -1
*/
I'm still learning rcpp, and more importantly, still learning how to read the manual pages and plug in the right search terms into google/stackoverflow. I'm sure this is basic stuff (and I'm open to better methods - I currently think like an R programmer in terms of initial approaches to problems, not a C++ programmer.)
(BTW - The use of arma::vec is used in other parts of the code which I'm not showing for simplicity - I realize it's not useful here. I debated on switching it, but decided against it on the principle that I've tested that part, it works, and the last thing I want to do is introduce an extra bug...)
Thanks!
You can use the .findName() method to get the relevant index:
#include <RcppArmadillo.h>
#include <algorithm>
#include <string>  // std::to_string
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fun(const arma::vec& assignment, int k) {
    // count number of peptides per protein
    IntegerVector tab = table(as<IntegerVector>(wrap(assignment)));
    CharacterVector all_proteins = tab.names();
    // position of the element named k; std::to_string also covers
    // multi-digit k, which std::string(1, '0' + k) would not
    int index = tab.findName(std::to_string(k));
    tab(index) = 1;
    // extra credit: rename that element
    all_proteins(index) = "-1";
    tab.names() = all_proteins;
    return tab;
}
/*** R
set.seed(23)
x <- rpois(20, 5)
k <- 5
fun(x, k)
*/
Output:
> Rcpp::sourceCpp('table-name.cpp')
> set.seed(23)
> x <- rpois(20, 5)
> k <- 5
> fun(x, k)
 3  4 -1  6  7  9 10 12
 2  4  1  3  4  2  1  1
You could write your own function (use String instead of char):
int first_which_equal(const CharacterVector& x, String y) {
    int n = x.size();
    for (int i = 0; i < n; i++) {
        if (x[i] == y) return(i);
    }
    return -1;
}
Also, it seems that tab(kc) is converting kc to an integer representation.
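The char arithmetic behind this can be checked without Rcpp. Below is a small sketch in standard C++ (the function names are made up for illustration, and the numeric values assume an ASCII-compatible character set): '0' + k is an int holding a character code, the one-character-string trick only gives the right label for a single digit, and std::to_string covers the general case.

```cpp
#include <string>

// '0' + k has type int: it is the character code, not a name.
int char_code_of_digit(int k) {
    return '0' + k;
}

// One-character label built from that code; only correct for 0 <= k <= 9.
std::string label_single_char(int k) {
    return std::string(1, static_cast<char>('0' + k));
}

// Decimal label for any int, e.g. for table names such as "10" or "12".
std::string label_any(int k) {
    return std::to_string(k);
}
```

For k = 5 the first function returns 53 rather than anything resembling "5", which fits the observation that tab(kc) ends up being treated as a numeric index. For k = 12 the single-character version yields "<", while label_any(12) gives "12".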

Rcpp and R: pass by reference

Working with Rcpp and R I observed the following behaviour, which I do not understand at the moment. Consider the following simple function written in Rcpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix hadamard_product(NumericMatrix & X, NumericMatrix & Y){
    unsigned int ncol = X.ncol();
    unsigned int nrow = X.nrow();
    int counter = 0;
    for (unsigned int j=0; j<ncol; j++) {
        for (unsigned int i=0; i<nrow; i++) {
            X[counter++] *= Y(i, j);
        }
    }
    return X;
}
This simply returns the component-wise product of two matrices. Now I know that the arguments to this function are passed by reference, i.e., calling
M <- matrix(rnorm(4), ncol = 2)
N <- matrix(rnorm(4), ncol = 2)
M_copy <- M
hadamard_product(M, N)
will overwrite the original M. However, it also overwrites M_copy, which I do not understand. I thought that M_copy <- M makes a copy of the object M and saves it somewhere in the memory and not that this assignment points M_copy to M, which would be the behaviour when executing
x <- 1
y <- x
x <- 2
for example. This does not change y but only x.
So why does the behaviour above occur?
No, R does not make a copy immediately, only if it is necessary, i.e., copy-on-modify:
x <- 1
tracemem(x)
#[1] "<0000000009A57D78>"
y <- x
tracemem(x)
#[1] "<0000000009A57D78>"
x <- 2
tracemem(x)
#[1] "<00000000099E9900>"
Since you modify M by reference outside R, R can't know that a copy is necessary. If you want to ensure a copy is made, you can use data.table::copy. Or avoid the side effect in your C++ code, e.g., make a deep copy there (by using clone).
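As a loose analogy in plain C++ (not how R is implemented internally), two R names bound to the same object behave like two handles to one shared buffer. In the sketch below, Buffer, multiply_in_place and multiply_copy are hypothetical names; the deep copy in multiply_copy plays the role that clone would play in the Rcpp function:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an R object that several names can point to.
using Buffer = std::shared_ptr<std::vector<double>>;

// Modifies the shared data in place, like the original hadamard_product:
// every handle to the same buffer sees the change.
Buffer multiply_in_place(Buffer x, const std::vector<double>& y) {
    for (std::size_t i = 0; i < x->size(); ++i) (*x)[i] *= y[i];
    return x;
}

// Makes a deep copy first, so the caller's data stays untouched.
Buffer multiply_copy(const Buffer& x, const std::vector<double>& y) {
    Buffer result = std::make_shared<std::vector<double>>(*x);
    return multiply_in_place(result, y);
}
```

If M and M_copy are two handles to the same buffer, multiply_in_place(M, N) changes what both of them see, whereas multiply_copy(M, N) leaves both alone and returns a fresh result.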

equivalent of 'which' function in Rcpp

I'm a newbie to C++ and Rcpp. Suppose, I have a vector
t1<-c(1,2,NA,NA,3,4,1,NA,5)
and I want to get a index of elements of t1 that are NA. I can write:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector retIdxNA(NumericVector x) {
    // Step 1: get the positions of NA in the vector
    LogicalVector y = is_na(x);
    // Step 2: count the number of NA
    int Cnt = 0;
    for (int i = 0; i < x.size(); i++) {
        if (y[i]) {
            Cnt++;
        }
    }
    // Step 3: create an output vector whose length equals the number of NA
    // and return the answer (1-based indices, as in R)
    NumericVector retIdx(Cnt);
    int Cnt1 = 0;
    for (int i = 0; i < x.size(); i++) {
        if (y[i]) {
            retIdx[Cnt1] = i + 1;
            Cnt1++;
        }
    }
    return retIdx;
}
then I get
retIdxNA(t1)
[1] 3 4 8
I was wondering:
(i) is there any equivalent of which in Rcpp?
(ii) is there any way to make the above function shorter/crisper? In particular, is there any easy way to combine the Step 1, 2, 3 above?
Recent versions of RcppArmadillo have functions to identify the indices of finite and non-finite values.
So this code
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec whichNA(arma::vec x) {
    return arma::find_nonfinite(x);
}
/*** R
t1 <- c(1,2,NA,NA,3,4,1,NA,5)
whichNA(t1)
*/
yields your desired answer (modulo the off-by-one, since C/C++ indices are zero-based):
R> sourceCpp("/tmp/uday.cpp")
R> t1 <- c(1,2,NA,NA,3,4,1,NA,5)
R> whichNA(t1)
[,1]
[1,] 2
[2,] 3
[3,] 7
R>
Rcpp can do it too if you first create the sequence to subset into:
// [[Rcpp::export]]
Rcpp::IntegerVector which2(Rcpp::NumericVector x) {
    Rcpp::IntegerVector v = Rcpp::seq(0, x.size()-1);
    return v[Rcpp::is_na(x)];
}
Added to code above it yields:
R> which2(t1)
[1] 2 3 7
R>
The logical subsetting is also somewhat new in Rcpp.
Try this:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector which4( NumericVector x) {
    int nx = x.size();
    std::vector<int> y;
    y.reserve(nx);
    for(int i = 0; i < nx; i++) {
        if (R_IsNA(x[i])) y.push_back(i+1);
    }
    return wrap(y);
}
which we can run like this in R:
> which4(t1)
[1] 3 4 8
Performance
Note that we have changed the above solution to reserve space for the output vector. This replaces which3 which is:
// [[Rcpp::export]]
IntegerVector which3( NumericVector x) {
    int nx = x.size();
    IntegerVector y;
    for(int i = 0; i < nx; i++) {
        // if (internal::Rcpp_IsNA(x[i])) y.push_back(i+1);
        if (R_IsNA(x[i])) y.push_back(i+1);
    }
    return y;
}
Then the performance on a vector 9 elements long is the following with which4 the fastest:
> library(rbenchmark)
> benchmark(retIdxNA(t1), whichNA(t1), which2(t1), which3(t1), which4(t1),
+ replications = 10000, order = "relative")[1:4]
          test replications elapsed relative
5   which4(t1)        10000    0.14    1.000
4   which3(t1)        10000    0.16    1.143
1 retIdxNA(t1)        10000    0.17    1.214
2  whichNA(t1)        10000    0.17    1.214
3   which2(t1)        10000    0.25    1.786
Repeating this for a vector 9000 elements long the Armadillo solution comes in quite a bit faster than the others. Here which3 (which is the same as which4 except it does not reserve space for the output vector) comes in worst while which4 comes second.
> tt <- rep(t1, 1000)
> benchmark(retIdxNA(tt), whichNA(tt), which2(tt), which3(tt), which4(tt),
+ replications = 1000, order = "relative")[1:4]
          test replications elapsed relative
2  whichNA(tt)         1000    0.09    1.000
5   which4(tt)         1000    0.79    8.778
3   which2(tt)         1000    1.03   11.444
1 retIdxNA(tt)         1000    1.19   13.222
4   which3(tt)         1000   23.58  262.000
All of the solutions above are serial. Although not trivial, it is quite possible to take advantage of threading when implementing which. See this write up for more details. For such small sizes, though, it would do more harm than good. It is like taking a plane for a short distance: you lose too much time at airport security.
R implements which by allocating a buffer as large as the input, doing a single pass to store the indices in that buffer, and finally copying the used part into a properly sized integer vector.
Intuitively this seems less efficient than a double pass loop, but not necessarily, as copying a data range is cheap. See more details here.
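A rough sketch of that buffer-then-copy strategy in standard C++ could look as follows. This is not R's actual source: which_nan_buffered is a made-up name, and std::isnan is used as the test, so unlike R_IsNA it does not tell R's NA apart from an ordinary NaN.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Single pass into a scratch buffer as large as the input, then one copy of
// the used prefix into a right-sized result. Indices are 1-based, as in R.
std::vector<int> which_nan_buffered(const std::vector<double>& x) {
    std::vector<int> buffer(x.size());  // worst case: every element matches
    std::size_t count = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::isnan(x[i])) buffer[count++] = static_cast<int>(i) + 1;
    }
    // Copy only the part of the buffer that was actually filled
    return std::vector<int>(buffer.begin(), buffer.begin() + count);
}
```

Compared with the count-then-fill approach of retIdxNA, the input is tested only once per element, at the price of a temporary buffer and one final block copy.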
Just write a function for yourself like:
which_1<-function(a,b){
return(which(a>b))
}
Then pass this function to your Rcpp code.
