Can an Rcpp::IntegerVector (ever) be used with OpenMP?

I am probably being greedy for performance, but I've observed significant performance gains when combining Rcpp and OpenMP in possibly illicit ways. I understand that "Calling any of the R API from threaded code is 'for experts only' and strongly discouraged," but I don't fully understand when C++ code may be implicitly doing this on Rcpp vectors. I understand RcppParallel has the RVector class, but I also understand this may involve taking a copy of the vector, which may wash away any performance gains.
I note the Rcpp gallery has an example (https://gallery.rcpp.org/articles/gerber-statistic/) that appears to access a NumericMatrix HIST_RETURN_RAW inside a parallel loop, so it seems "some" access is allowed, but I know/believe that some wrappers (like Rcpp::List) would not be permitted. Is atomicity the distinguishing characteristic?
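For comparison, my understanding of the RcppParallel idiom for something like example 2 below is roughly the following sketch; as far as I can tell, RVector only wraps the memory of the existing R vector rather than copying the elements (the worker and function names here are mine):
// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
using namespace Rcpp;

// Sketch only: RVector<int> stores a pointer into the R vector's data,
// so constructing it should not copy the elements.
struct AddWorker : public RcppParallel::Worker {
  const RcppParallel::RVector<int> x, y;
  RcppParallel::RVector<int> out;
  AddWorker(IntegerVector x, IntegerVector y, IntegerVector out)
    : x(x), y(y), out(out) {}
  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
      out[i] = x[i] + y[i];
    }
  }
};

// [[Rcpp::export]]
IntegerVector AddRcppParallel(IntegerVector x, IntegerVector y) {
  if (x.length() != y.length()) {
    stop("Lengths differ");
  }
  IntegerVector out = no_init(x.length());
  AddWorker w(x, y, out);
  RcppParallel::parallelFor(0, x.length(), w);
  return out;
}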
Concretely, are any of the following acceptable uses of OpenMP? (i.e. are they all thread-safe, compliant with R's memory management, and free from undefined behaviour?)
#include <Rcpp.h>
#ifdef _OPENMP
#include <omp.h>
#endif
using namespace Rcpp;

// 1. No 'outside' R vector, but IntegerVector created outside the omp region
// [[Rcpp::export]]
IntegerVector fastInitalize(int n, int nThread = 1) {
  IntegerVector out = no_init(n);
#pragma omp parallel for num_threads(nThread)
  for (int i = 0; i < n; ++i) {
    out[i] = 0;
  }
  return out;
}
// 2. Simple access
// [[Rcpp::export]]
IntegerVector AddOMP(IntegerVector x, IntegerVector y, int nThread = 1) {
  R_xlen_t N = x.length();
  if (N != y.length()) {
    stop("Lengths differ");
  }
  IntegerVector out = no_init(N);
#pragma omp parallel for num_threads(nThread)
  for (R_xlen_t i = 0; i < N; ++i) {
    out[i] = x[i] + y[i];
  }
  return out;
}
// 3. Access, copy inside Rcpp
// [[Rcpp::export]]
IntegerVector pmax0xy(IntegerVector x, IntegerVector y, int nThread = 1) {
  R_xlen_t N = x.length();
  if (N != y.length()) {
    stop("Lengths differ");
  }
  IntegerVector out = clone(y);
#pragma omp parallel for num_threads(nThread)
  for (R_xlen_t i = 0; i < N; ++i) {
    if (x[i] > 0) {
      out[i] = 0;
    }
  }
  return out;
}
// 4. Access with omp + reduction
// [[Rcpp::export]]
int firstNonzero(IntegerVector x, int nThread = 1) {
  R_xlen_t N = x.length();
  int out = N;
#pragma omp parallel for num_threads(nThread) reduction(min : out)
  for (R_xlen_t i = 0; i < N; ++i) {
    if (x[i] != 0) {
      out = (i < out) ? i : out;
    }
  }
  return out;
}
// 5. Access with omp array reduction
// [[Rcpp::export]]
IntegerVector count_one_to_ten(IntegerVector x, int nThread = 1) {
  R_xlen_t N = x.length();
  if (N >= INT_MAX) {
    stop("Possibly too many numbers.");
  }
  const int TEN = 10;
  int numbers[TEN] = {}; // what if 10 was large?
#if defined _OPENMP && _OPENMP >= 201511
#pragma omp parallel for num_threads(nThread) reduction(+:numbers[:TEN])
#endif
  for (R_xlen_t i = 0; i < N; ++i) {
    int xi = x[i];
    if (xi < 1 || xi > 10) {
      continue;
    }
    numbers[xi - 1] += 1;
  }
  IntegerVector out(TEN);
  for (int n = 0; n < TEN; ++n) {
    out[n] = numbers[n];
  }
  return out;
}
// You can include R code blocks in C++ files processed with sourceCpp
// (useful for testing and development). The R code will be automatically
// run after the compilation.
//
/*** R
x <- sample(1:1200, size = 1e6, replace = TRUE)
y <- sample(1:1200, size = 1e6, replace = TRUE)
fastInitalize(1e6)[1]
head(AddOMP(x, y))
head(AddOMP(x, y, 2))
head(pmax0xy(x, y))
head(pmax0xy(x, y, 2))
firstNonzero(x)
firstNonzero(x, 2)
count_one_to_ten(x, 2)
*/
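For comparison, a pattern I have seen recommended for keeping Rcpp objects out of the threaded region entirely is to hoist the raw data pointers before the pragma, so the loop body touches only plain C arrays. A sketch of example 2 rewritten that way (the function name is mine; I am assuming this is behaviourally identical):
// 2b. Same as AddOMP, but only raw pointers are used inside the omp region;
// INTEGER() is called once, on the main thread.
// [[Rcpp::export]]
IntegerVector AddOMPRaw(IntegerVector x, IntegerVector y, int nThread = 1) {
  R_xlen_t N = x.length();
  if (N != y.length()) {
    stop("Lengths differ");
  }
  IntegerVector out = no_init(N);
  const int* px = INTEGER(x);
  const int* py = INTEGER(y);
  int* pout = INTEGER(out);
#pragma omp parallel for num_threads(nThread)
  for (R_xlen_t i = 0; i < N; ++i) {
    pout[i] = px[i] + py[i];
  }
  return out;
}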

Related

Get matrix/vector type from C using Rcpp

I am trying to get some results by using Rcpp and this is the code.
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// [[Rcpp::export]]
double compssr(NumericMatrix dist, NumericVector x, int n, int p) {
  double ssr = 0; double del_sq = 0; double del_ij = 0;
  int i, j, ip;
  for (i = 0; i < n; i++) {
    for (j = 0; j < i; j++) {
      for (ip = 0; ip < p; ip++) {
        del_sq = del_sq + (x(i, ip) - x(j, ip)) * (x(i, ip) - x(j, ip));
        if (i == j) del_sq = 0;
      }
      del_ij = sqrt(del_sq);
      ssr = ssr + (dist(i, j) - del_ij) * (dist(i, j) - del_ij);
    }
  }
  return ssr;
}
NumericMatrix Scaling_X(NumericVector xbar, NumericMatrix x, double n, double p) {
  NumericMatrix Sig_x(p, p);
  int i, ii, ip, ip2;
  for (ii = 0; ii < n; ii++) {
    for (i = 0; i < p; i++) {
      x(ii, i) = x(ii, i) - xbar(i);
    }
  }
  for (i = 0; i < n; i++) {
    for (ip = 0; ip < p; ip++) {
      for (ip2 = 0; ip2 < p; ip2++) {
        Sig_x(ip, ip2) = Sig_x(ip, ip2) + x(i, ip) * x(i, ip2);
      }
    }
  }
  for (i = 0; i < Sig_x.ncol(); i++) {
    for (ii = 0; ii < Sig_x.nrow(); ii++) {
      Sig_x(i, ii) = Sig_x(i, ii) / n;
    }
  }
  return Sig_x;
}
In fact there are some more functions, and the file name of this code is "test.cpp".
I called this file in R by using
sourceCpp("test.cpp")
There was no error, and I could use the first function, compssr (return type: double).
But I couldn't call the function Scaling_X.
Is there any error in my code?
I made other functions as well, and I could use the ones with return type double, but not the ones returning NumericMatrix, NumericVector, or List.
You are missing the
// [[Rcpp::export]]
in front of the function Scaling_X, so the compileAttributes() function does as it has been told: compile both functions, but make just one available to R.
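That is, the declaration needs to read (only the attribute line is new; the body is unchanged):
// [[Rcpp::export]]
NumericMatrix Scaling_X(NumericVector xbar, NumericMatrix x, double n, double p) {
  // ... as before ...
}
After re-running sourceCpp("test.cpp"), Scaling_X will be callable from R just like compssr.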

R Weighted moving average with partial averages

I am trying to code in R a (centered) weighted moving average function that returns a vector of the same size as the input vector.
The following code almost gives me what I want, but it does not work for the first and last values of my vector:
set.seed(0)
len = 10
x = floor(len * runif(len))
weights = c(1, 3, 0, 3, 1)
weights = weights / sum(weights)
library(zoo)  # for rollapply
rollapply(x, width = length(weights), function(x) sum(x * weights), align = "center")
na.omit(filter(x, sides = 2, weights))
Setting partial=TRUE in the rollapply function is sort of what I want to do. Anyway, it does not work since my function does not support an x of changing sizes.
I could do the latter and manually add the side computations with a loop. It would work, but I would like to find a nicer (computationally faster) way to do it.
For a more rigorous description of my needs, here is a mathematical version, where r is the vector my function would return, given x and the weights w as inputs. Writing m = (length(w) - 1) / 2, the intent (reconstructed from the code below) is
r_i = ( sum_k w_k * x_(i + k - m - 1) ) / ( sum_k w_k ),
where both sums run only over those k in 1..length(w) for which the index i + k - m - 1 falls inside x, so that the denominator is a partial weight sum at the two edges.
With Rcpp, you can do:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector roll_mean(const NumericVector& x,
                        const NumericVector& w) {
  int n = x.size();
  int w_size = w.size();
  int size = (w_size - 1) / 2;
  NumericVector res(n);
  int i, ind_x, ind_w;
  double w_sum = Rcpp::sum(w), tmp_wsum, tmp_xwsum, tmp_w;
  // beginning
  for (i = 0; i < size; i++) {
    tmp_xwsum = tmp_wsum = 0;
    for (ind_x = i + size, ind_w = w_size - 1; ind_x >= 0; ind_x--, ind_w--) {
      tmp_w = w[ind_w];
      tmp_wsum += tmp_w;
      tmp_xwsum += x[ind_x] * tmp_w;
    }
    res[i] = tmp_xwsum / tmp_wsum;
  }
  // middle
  int lim2 = n - size;
  for (; i < lim2; i++) {
    tmp_xwsum = 0;
    for (ind_x = i - size, ind_w = 0; ind_w < w_size; ind_x++, ind_w++) {
      tmp_xwsum += x[ind_x] * w[ind_w];
    }
    res[i] = tmp_xwsum / w_sum;
  }
  // end
  for (; i < n; i++) {
    tmp_xwsum = tmp_wsum = 0;
    for (ind_x = i - size, ind_w = 0; ind_x < n; ind_x++, ind_w++) {
      tmp_w = w[ind_w];
      tmp_wsum += tmp_w;
      tmp_xwsum += x[ind_x] * tmp_w;
    }
    res[i] = tmp_xwsum / tmp_wsum;
  }
  return res;
}
I use this function in one of my packages.
Just put that in a .cpp file and source it with Rcpp::sourceCpp.
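For instance, assuming the code above is saved as roll_mean.cpp (the file name is mine):
Rcpp::sourceCpp("roll_mean.cpp")
x <- as.numeric(1:10)
w <- c(1, 3, 0, 3, 1) / 8
roll_mean(x, w)  # same length as x; edge values use partial weight sums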

Why is the mclapply function in R more efficient than Rcpp + OpenMP?

I have a function with a loop (EstimateUniques) that is parallelized with OpenMP. I expected multithreading to be more efficient than multiprocessing, but when I compared this function with a simple run of mclapply, it showed lower performance. What is the proper way to achieve the same level of parallelization in C++ as in R? Am I doing something wrong?
Performance comparison (time in seconds):
#Cores  CPP     R
1       1.721s  1.538s
2       1.945s  1.080s
3       2.858s  0.801s
R code:
Rcpp::sourceCpp('ReproducibleExample.cpp')
arr <- 1:10000
n_rep <- 150
n_iters <- 200
EstimateUniquesR <- function(arr, n_iters, n_rep, cores) {
  parallel::mclapply(1:n_iters, function(i)
    GetNumberOfUniqSamples(arr, i * 10, n_rep), mc.cores = cores)
}
cpp_times <- sapply(1:3, function(threads)
  system.time(EstimateUniques(arr, n_iters, n_rep, threads))['elapsed'])
r_times <- sapply(1:3, function(cores)
  system.time(EstimateUniquesR(arr, n_iters, n_rep, cores))['elapsed'])
data.frame(CPP = cpp_times, R = r_times)
ReproducibleExample.cpp file:
// [[Rcpp::plugins(openmp)]]
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>    // needed for the Rcpp attributes below
#include <algorithm>
#include <cmath>     // std::round
#include <cstdlib>   // rand
#include <vector>
#include <omp.h>

// [[Rcpp::export]]
int GetNumberOfUniqSamples(const std::vector<int> &bs_array, int size, unsigned n_rep) {
  unsigned long sum = 0;
  for (unsigned i = 0; i < n_rep; ++i) {
    std::vector<int> uniq_vals(size);
    for (int try_num = 0; try_num < size; ++try_num) {
      // NB: rand() draws from one global state shared by all threads
      uniq_vals[try_num] = bs_array[rand() % bs_array.size()];
    }
    std::sort(uniq_vals.begin(), uniq_vals.end());
    sum += std::distance(uniq_vals.begin(), std::unique(uniq_vals.begin(), uniq_vals.end()));
  }
  return std::round(double(sum) / n_rep);
}
// [[Rcpp::export]]
std::vector<int> EstimateUniques(const std::vector<int> &bs_array, const int n_iters,
                                 const int n_rep = 1000, const int threads = 1) {
  std::vector<int> uniq_counts(n_iters);
#pragma omp parallel for num_threads(threads) schedule(dynamic)
  for (int i = 0; i < n_iters; ++i) {
    uniq_counts[i] = GetNumberOfUniqSamples(bs_array, (i + 1) * 10, n_rep);
  }
  return uniq_counts;
}
I tried to use other types of scheduling in OpenMP, but they gave even worse results.
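One thing worth noting (my observation, not from the original thread): rand() draws from a single global state, and common implementations such as glibc guard that state with a lock, so calling it from many threads makes them contend for the same lock and can easily erase any gain from multithreading; mclapply sidesteps this because each forked process has its own state. A sketch of the sampling function rewritten with a per-thread C++11 generator (the helper name and seeding scheme are mine):
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Same sampling as GetNumberOfUniqSamples, but drawing from a generator
// passed in by the caller, so each thread can own a private one.
int GetNumberOfUniqSamplesRng(const std::vector<int> &bs_array, int size,
                              unsigned n_rep, std::mt19937 &gen) {
  std::uniform_int_distribution<std::size_t> pick(0, bs_array.size() - 1);
  unsigned long sum = 0;
  for (unsigned i = 0; i < n_rep; ++i) {
    std::vector<int> uniq_vals(size);
    for (int try_num = 0; try_num < size; ++try_num) {
      uniq_vals[try_num] = bs_array[pick(gen)];
    }
    std::sort(uniq_vals.begin(), uniq_vals.end());
    sum += std::distance(uniq_vals.begin(),
                         std::unique(uniq_vals.begin(), uniq_vals.end()));
  }
  return std::round(double(sum) / n_rep);
}

// Inside EstimateUniques, the loop would then create one generator per
// thread (the seeding scheme is arbitrary):
// #pragma omp parallel num_threads(threads)
// {
//   std::mt19937 gen(1234 + omp_get_thread_num());
//   #pragma omp for schedule(dynamic)
//   for (int i = 0; i < n_iters; ++i)
//     uniq_counts[i] = GetNumberOfUniqSamplesRng(bs_array, (i + 1) * 10, n_rep, gen);
// }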

RcppArmadillo on several CPU cores

I have the following RcppArmadillo function that runs fine if I execute it on one CPU core. But if I use several cores, then R will crash. All the other Rcpp functions I created so far run fine on several cores (with foreach); only RcppArmadillo seems to be problematic. Any ideas how to fix that?
cppFunction('double augmentedDickeyFullerCpp(NumericVector a, NumericVector b, double gamma, double mu, int lags) {
  if (gamma < 0) {
    return 0;
  }
  int n = a.size() - 1;
  int lags2 = lags + 1;
  // first rows, then columns
  NumericMatrix x(n - lags2, lags2);
  NumericMatrix zdifflag(n - lags2 + 1, lags2);
  NumericVector diff(n);
  NumericVector zdiff(n - lags2 + 1);
  NumericVector residuals(n + 1);
  residuals[0] = a[0] - gamma * b[0] - mu;
  // residuals: a is y and b is x
  for (int i = 1; i < n + 1; i++) {
    residuals[i] = a[i] - gamma * b[i] - mu;
    diff[i - 1] = residuals[i] - residuals[i - 1];
  }
  for (int i = 0; i < n - lags2 + 1; i++) {
    // note: originally written zdifflag[0,i]; the comma operator made that
    // the linear index i, i.e. column 0 anyway, so (i, 0) preserves behaviour
    zdifflag(i, 0) = residuals[i + lags2 - 1];
  }
  for (int j = 0; j < n - lags2 + 1; j++) {
    for (int i = 0; i < lags2; i++) {
      x(j, i) = diff[j + lags2 - 1 - i];
      if (i > 0) {
        zdifflag(j, i) = x(j, i);
      }
    }
    zdiff[j] = x(j, 0);
  }
  int length = zdifflag.nrow(), k = zdifflag.ncol();
  arma::mat X(zdifflag.begin(), length, k, false); // reuses memory and avoids extra copy
  arma::colvec y(zdiff.begin(), zdiff.size(), false);
  arma::colvec coef = arma::solve(X, y); // fit model y ~ X
  arma::colvec res = y - X * coef;       // residuals
  // std. errors of coefficients: sqrt(sum(residuals^2) / (length - k))
  double s2 = std::inner_product(res.begin(), res.end(), res.begin(), 0.0) / (length - k);
  arma::colvec std_err = arma::sqrt(s2 * arma::diagvec(arma::pinv(arma::trans(X) * X)));
  return coef[0] / std_err[0];
}', depends = "RcppArmadillo", includes = "#include <RcppArmadillo.h>")
I generally recommend putting the code into a small package, and having each parallel worker load the package. That is known to work, both in serial and parallel, whereas relying on cppFunction() for an ad-hoc function may be too fragile for parallel execution.
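A minimal sketch of that workflow (the package name, file layout, and data objects here are mine):
# one-time: generate a skeleton package and move the C++ code into it
RcppArmadillo::RcppArmadillo.package.skeleton("adfPkg")
# put the function above (with a // [[Rcpp::export]] attribute) into
# adfPkg/src/adf.cpp, then regenerate the bindings:
Rcpp::compileAttributes("adfPkg")
# build and install adfPkg, then have each parallel worker load it:
library(doParallel)
registerDoParallel(cores = 4)
res <- foreach(i = seq_along(aList), .packages = "adfPkg") %dopar%
  augmentedDickeyFullerCpp(aList[[i]], bList[[i]], 0.5, 0, 2)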

When calling the same Rcpp function several times, different results are returned

I have written a parallel implementation of sums in groups using RcppParallel.
// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
using namespace Rcpp;
using namespace RcppParallel;

struct SumsG : public Worker
{
  const RVector<double> v;
  const RVector<int> gi;
  RVector<double> sg;
  SumsG(const NumericVector v, const IntegerVector gi, NumericVector sg) : v(v), gi(gi), sg(sg) {}
  SumsG(const SumsG& p, Split) : v(p.v), gi(p.gi), sg(p.sg) {}
  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++) {
      sg[gi[i]] += v[i];
    }
  }
  void join(const SumsG& p) {
    for (std::size_t i = 0; i < sg.length(); i++) {
      sg[i] += p.sg[i];
    }
  }
};

// [[Rcpp::export]]
List sumsingroups(NumericVector v, IntegerVector gi, int ni) {
  NumericVector sg(ni);
  SumsG p(v, gi, sg);
  parallelReduce(0, v.length(), p);
  return List::create(_["sg"] = p.sg);
}
It compiles using Rcpp::sourceCpp. Now when I call it from R as sumsingroups(1:10, rep(0:1, each = 5), 2) several times, I get the right answer (15 40) at first and then something different (usually some multiple of the right answer). Running
res <- sumsingroups(1:10, rep(0:1, each = 5), 2)
for (i in 1:1000) {
  tmp <- sumsingroups(1:10, rep(0:1, each = 5), 2)
  if (res[[1]][1] != tmp[[1]][1]) break
  Sys.sleep(0.1)
}
breaks at random iteration returning
$sg
[1] 60 160
or
$sg
[1] 30 80
I am new to Rcpp and RcppParallel and do not know what could cause such behavior.
Update. Things that did not help:
- Adding for (std::size_t i = 0; i < sg.length(); i++) sg[i] = 0; to both constructors.
- Changing the names so that they differ between the Worker definition and the function implementation.
Try this.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
using namespace RcppParallel;

struct SumsInGroups5 : public Worker
{
  const RVector<double> v;
  const RVector<int> g;
  std::vector<double> s;
  SumsInGroups5(const NumericVector v, const IntegerVector g) : v(v), g(g), s(*std::max_element(g.begin(), g.end()) + 1, 0.0) {}
  SumsInGroups5(const SumsInGroups5& p, Split) : v(p.v), g(p.g), s(*std::max_element(g.begin(), g.end()) + 1, 0.0) {}
  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
      s[g[i]] += v[i];
    }
  }
  void join(const SumsInGroups5& rhs) {
    for (std::size_t i = 0; i < s.size(); i++) {
      s[i] += rhs.s[i];
    }
  }
};

// [[Rcpp::export]]
NumericVector sg5(NumericVector v, IntegerVector g) {
  SumsInGroups5 p(v, g);
  parallelReduce(0, v.length(), p);
  return wrap(p.s);
}

/*** R
a <- 1:10
g <- c(rep(0, 5), rep(1, 5))
bb <- lapply(1:10000, function(x) sg5(a, g))
cc <- do.call("rbind", bb)
unique(cc)
*/
Compared to my other tries, this code did not produce weird results in the cases where the other code did. Not very reassuring, but there is a plausible reason: the original split constructor copies the RVector sg, which still points at the same underlying R vector, so every thread accumulates into shared memory and join() then adds that shared buffer onto itself. In the version above, each split gets its own zero-initialized std::vector<double>, so the accumulators are never shared between threads.
