Rcpp updating data in base environment - R

I am working on a simple matching algorithm in Rcpp that takes the number of individuals (I), the number of schools (J), the number of submitted choices (nc), a priority ranking of individuals (pos), the number of vacancies at each school (emp), and the true choices (choices).
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
NumericVector gs2(int I, int J, int nc, NumericVector pos, NumericVector emp, NumericMatrix choices) {
NumericVector admits(J);
NumericVector out(I);
std::fill(out.begin(),out.end(),J+1);
for (int i=0;i<I;i++){
NumericVector apply = choices(pos(i),_)-1;
for (int j=0;j<nc;j++){
if (emp(apply(j))>0)
{
out(pos(i)) = apply(j)+1;
admits(apply(j)) = admits(apply(j)) + 1;
emp(apply(j)) = emp(apply(j)) - 1;
break;
}
}
}
return out;
}
The code works fine, except that it seems to be modifying my data: after running it, my size variable has changed. Am I missing something? Thanks.
set.seed(123)
rank = (1:20)-1
stuchoice = matrix(sample(1:3,6*20,replace=T),byrow=T,ncol=6,nrow=20)
size = c(7,11,4)
gs2(20,3,6,rank,size,stuchoice)
size

Your size variable is changing because you are changing it in your C++ code. In particular this line:
emp(apply(j)) = emp(apply(j)) - 1;
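Rcpp objects are thin wrappers around the underlying R objects, so they behave as if passed by reference: anything you do to them inside the function is reflected in the R variables you passed in. (A copy is only made when a type conversion is needed, e.g. when an integer vector is passed as a NumericVector; size is a double vector here, so no copy is made.) If you want to avoid this, clone the variable. Changing your code to the following corrects the problem.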
#include <Rcpp.h>
using namespace Rcpp;
// Note the change in the name of 'emp' to 'emp_'!!!
//[[Rcpp::export]]
NumericVector gs2(int I, int J, int nc, NumericVector pos, NumericVector emp_, NumericMatrix choices) {
NumericVector admits(J);
NumericVector out(I);
// clone your emp
NumericVector emp = clone(emp_);
std::fill(out.begin(),out.end(),J+1);
for (int i=0;i<I;i++){
NumericVector apply = choices(pos(i),_)-1;
for (int j=0;j<nc;j++){
if (emp(apply(j))>0)
{
out(pos(i)) = apply(j)+1;
admits(apply(j)) = admits(apply(j)) + 1;
emp(apply(j)) = emp(apply(j)) - 1;
break;
}
}
}
return out;
}
Test
library(Rcpp)
sourceCpp("test.cpp")
set.seed(123)
rank = (1:20)-1
stuchoice = matrix(sample(1:3,6*20,replace=T),byrow=T,ncol=6,nrow=20)
size = c(7,11,4)
gs2(20,3,6,rank,size,stuchoice)
size
[1] 7 11 4
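The difference between the two versions can be sketched in plain C++, without R. This is only an analogy: Handle, clone_handle, take_seat_shared and take_seat_cloned are hypothetical names, and Handle is a stand-in that mimics how copies of an Rcpp vector share one block of storage. It is not the Rcpp class itself.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Rcpp vector: a thin handle whose copies
// all point at the same underlying storage.
struct Handle {
    std::shared_ptr<std::vector<double>> data;
    explicit Handle(std::vector<double> v)
        : data(std::make_shared<std::vector<double>>(std::move(v))) {}
    double& operator()(std::size_t i) {
        assert(i < data->size());  // bounds check, like Rcpp's operator()
        return (*data)[i];
    }
};

// Deep copy, playing the role of Rcpp::clone().
Handle clone_handle(const Handle& h) { return Handle(*h.data); }

// Writes through the handle it received: the caller sees the change.
void take_seat_shared(Handle emp) { emp(0) -= 1; }

// Works on a private copy: the caller's data stays untouched.
void take_seat_cloned(Handle emp_) {
    Handle emp = clone_handle(emp_);
    emp(0) -= 1;
}
```

Calling take_seat_shared on a handle holding 7, 11, 4 leaves 6 in the first slot for the caller, while take_seat_cloned leaves the 7 alone. The price of the clone is one extra allocation and copy per call.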

Related

Rcpp compiler flags error with return value - clueless

I'm trying to speed up some R code with Rcpp functions. One function is giving me fits to compile, and I am clueless as to why the compiler complains about the return argument. I declared the function to return a NumericVector, the result is a NumericVector, and yet the compiler complains that the return argument is invalid.
Rcpp is version 0.12.18,
R is Microsoft Open R 3.5.3
cppFunction('NumericVector NNE(IntegerVector X, IntegerVector Y, IntegerVector XY, IntegerVector xy, NumericVector P, int radius ) {
int n = X.size();
NumericVector vN[n];
NumericVector vSum[n];
NumericVector vAvg[n];
// for each xy determine neighborhood Sum and count (N)
for(int i=0; i<n; i++) {
vN[i] = 0.0;
vSum[i] = 0.0;
// traverse neighborhood, if the xy exists in the input
// vector then accumulate the values, otherwise ignore
for(int dx=-1*radius; dx<=radius; dx++) {
for(int dy=-1*radius; dy<=radius; dy++) {
// construct an xy index for the neighborhood die
xy[0] = ( (X[i]+dx) * 10000 ) + (Y[i]+dy);
// check to see if index above exists in input set
IntegerVector m = Rcpp::match(xy, XY);
// if valid then accumulate and count
if(m[0] != NA_INTEGER) {
vN[i] = vN[i] + 1.0;
vSum[i] = vSum[i] + P[ m[0] ];
}
}
}
vAvg[i] = vSum[i] / vN[i];
}
return vAvg;
}')
The confusing compiler message is as follows:
C:/RBuildTools/3.5/mingw_64/bin/g++ -m64 -I"C:/PROGRA~1/MICROS~3/ROPEN~1/R-35~1.3/include" -DNDEBUG -I"D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include" -I"D:/Users/ka/AppData/Local/Temp/4/RtmpeGKfUg/sourceCpp-x86_64-w64-mingw32-0.12.18" -I"C:/a/w/1/s/vendor/extsoft/include" -O2 -Wall -mtune=core2 -c filefcc651c7fa9.cpp -o filefcc651c7fa9.o
filefcc651c7fa9.cpp: In function 'Rcpp::NumericVector NNE(Rcpp::IntegerVector, Rcpp::IntegerVector, Rcpp::IntegerVector, Rcpp::IntegerVector, Rcpp::NumericVector, int)':
filefcc651c7fa9.cpp:42:10: error: invalid conversion from 'Rcpp::NumericVector* {aka Rcpp::Vector<14, Rcpp::PreserveStorage>*}' to 'const int&' [-fpermissive]
return vAvg;
^
In file included from D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include/Rcpp/Vector.h:52:0,
from D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include/Rcpp.h:40,
from filefcc651c7fa9.cpp:1:
D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include/Rcpp/vector/Vector.h:128:5: note: initializing argument 1 of 'Rcpp::Vector<RTYPE, StoragePolicy>::Vector(const int&) [with int RTYPE = 14; StoragePolicy = Rcpp::PreserveStorage]'
Vector( const int& size ) {
^
make: *** [C:/PROGRA~1/MICROS~3/ROPEN~1/R-35~1.3/etc/x64/Makeconf:215: filefcc651c7fa9.o] Error 1
Error in sourceCpp(code = code, env = env, rebuild = rebuild, cacheDir = cacheDir, :
Error 1 occurred building shared library.
You had a minuscule error that rendered the variable "bad" as far as the compiler is concerned, and you then mistook the rejected return of the "bad" variable for a different issue.
It happens. We have all been there.
Here is the repaired code. In short, you needed NumericVector x(n); with round parentheses instead of square brackets (as the latter denote arrays in C and then C++).
Code
I also turned it into input for sourceCpp(), which is easier given the length of the function.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector NNE(IntegerVector X, IntegerVector Y, IntegerVector XY,
IntegerVector xy, NumericVector P, int radius ) {
int n = X.size();
NumericVector vN(n);
NumericVector vSum(n);
NumericVector vAvg(n);
// for each xy determine neighborhood Sum and count (N)
for(int i=0; i<n; i++) {
vN[i] = 0.0;
vSum[i] = 0.0;
// traverse neighborhood, if the xy exists in the input
// vector then accumulate the values, otherwise ignore
for(int dx=-1*radius; dx<=radius; dx++) {
for(int dy=-1*radius; dy<=radius; dy++) {
// construct an xy index for the neighborhood die
xy[0] = ( (X[i]+dx) * 10000 ) + (Y[i]+dy);
// check to see if index above exists in input set
IntegerVector m = Rcpp::match(xy, XY);
// if valid then accumulate and count
if(m[0] != NA_INTEGER) {
vN[i] = vN[i] + 1.0;
vSum[i] = vSum[i] + P[ m[0] - 1 ]; // Rcpp::match() returns 1-based positions, as in R
}
}
}
vAvg[i] = vSum[i] / vN[i];
}
return vAvg;
}
/*** R
cat("Built\n")
*/
Output
As we have no reference data, I can only show that it built:
R> sourceCpp("~/git/stackoverflow/61377960/answer.cpp")
R> cat("Built\n")
Built
R>
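The compiler message becomes easier to read once you see what the square-bracket declaration actually creates. The sketch below uses std::vector as a stand-in for NumericVector (an assumption, so that the snippet compiles without R), and the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Round parentheses call the size constructor: ONE vector of n zeros.
std::vector<double> one_vector_of(std::size_t n) {
    std::vector<double> v(n);
    return v;  // returned by value, which is what NNE() intends
}

// Square brackets declare a C-style ARRAY of vectors, each of them empty.
// Used as a value, such an array decays to a pointer to its first element,
// which is why `return vAvg;` was reported as a NumericVector* conversion.
std::size_t vectors_in_array() {
    std::vector<double> arr[3];  // three separate vectors
    return sizeof(arr) / sizeof(arr[0]);
}

// Every element of that array starts out with length zero.
std::size_t length_of_first_element() {
    std::vector<double> arr[3];
    return arr[0].size();
}
```

In the original code the array size was the run-time value n, which standard C++ does not allow at all; g++ accepts it only as an extension, so the first error surfaced at the return statement.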

"inner_product" was not declared in this scope

Hi, I am new to Rcpp and am computing the inner product of two variables, but I get the error "inner_product was not declared in this scope" for the following code:
#include <math.h>
#include <RcppCommon.h>
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector polynomial_kernel(NumericVector x, NumericMatrix Y, double scale = 1, double offset = 1, int d=1){
int n = Y.nrow();
NumericVector kernel(n);
for (int j = 0; j < n; j++){
NumericVector v = Y( j,_ );
double crossProd =innerProduct(x,v);
kernel[j]= pow((scale*crossProd+offset),2);
}
return kernel;
}
Please help me to resolve this problem.
Below is a simpler, repaired version of your code that actually compiles. It uses Armadillo types for consistency, and instead of calling a non-existent "inner_product" routine it computes the inner product of two vectors the standard way, via multiplication.
#include <RcppArmadillo.h> // also pulls in Rcpp.h and cmath
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::vec polynomial_kernel(arma::vec x, arma::mat Y,
double scale = 1, double offset = 1, int d=1) {
int n = Y.n_rows;
arma::vec kernel(n);
for (int j = 0; j < n; j++){
arma::rowvec v = Y.row(j);
double crossProd = arma::as_scalar(v * x);
kernel[j] = std::pow((scale*crossProd+offset),2);
}
return kernel;
}
Your example was not a minimal, complete, verifiable example, so I cannot show it on any data you might have supplied. On some made-up data it seems to work:
R> set.seed(123)
R> polynomial_kernel(runif(4), matrix(rnorm(16),4))
[,1]
[1,] 3.317483
[2,] 3.055690
[3,] 1.208345
[4,] 0.301834
R>
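For readers without Armadillo, the standard library does ship an inner-product routine: std::inner_product in the header numeric. The sketch below is plain C++ on std::vector rather than Rcpp types. The name poly_kernel is hypothetical, and unlike the code above, which hard-codes the exponent 2, it actually uses the degree parameter (defaulting to 2 here as an assumption).

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Polynomial kernel (scale * <x, y> + offset)^degree for two equal-length vectors.
double poly_kernel(const std::vector<double>& x, const std::vector<double>& y,
                   double scale = 1.0, double offset = 1.0, int degree = 2) {
    assert(x.size() == y.size());  // the inner product needs matching lengths
    // 0.0 (a double) as the initial value keeps the accumulation in double.
    const double dot = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
    return std::pow(scale * dot + offset, degree);
}
```

For x = (1, 2) and y = (3, 4) the inner product is 11, so the kernel with the default arguments is 12 squared, i.e. 144.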

Rcpp function complaining about uninitialized variables

In a very first attempt at creating a C++ function which can be called from R using Rcpp, I have a simple function to compute a minimum spanning tree from a distance matrix using Prim's algorithm. This function has been converted into C++ from a former version in ANSI C (which works fine).
Here it is:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame primlm(const int n, NumericMatrix d)
{
double const din = 9999999.e0;
long int i1, nc, nc1;
double dlarge, dtot;
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
for (int i=2; i <= n; i++) {
is(i) = 0;
}
for (int i=2; i <= n; i++) {
dlarge = din;
i1 = i - 1;
for (int j=1; j <= i1; j++) {
for (int k=1; k <= n; k++) {
if (l(j) == k)
continue;
if (d[l(j), k] > dlarge)
continue;
if (is(k) == 1)
continue;
nc = k;
nc1 = l(j);
dlarge = d(nc1, nc);
}
}
is(nc) = 1;
l(i) = nc;
lp(i) = nc1;
dist(i) = dlarge;
}
dtot = 0.e0;
for (int i=2; i <= n; i++){
dtot += dist(i);
}
return DataFrame::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
When I compile this function using Rcpp under RStudio, I get two warnings complaining that variables 'nc' and 'nc1' have not been initialized. Frankly, I could not understand that, as it seems to me that both variables are being initialized inside the third loop. Also, why is there no similar complaint about variable 'i1'?
Perhaps it comes as no surprise that, when attempting to call this function from R, using the below code, what I get is a crash of the R system!
# Read test data
df <- read.csv("zygo.csv", header=TRUE)
lonlat <- data.frame(df$Longitude, df$Latitude)
colnames(lonlat) <- c("lon", "lat")
# Compute distance matrix using geosphere library
library(geosphere)
d <- distm(lonlat, lonlat, fun=distVincentyEllipsoid)
# Calls Prim minimum spanning tree routine via Rcpp
library(Rcpp)
sourceCpp("Prim.cpp")
n <- nrow(df)
p <- primlm(n, d)
Here is the dataset I use for testing purposes:
"Scientific name",Locality,Longitude,Latitude
Zygodontmys,Bush Bush Forest,-61.05,10.4
Zygodontmys,Cerro Azul,-79.4333333333,9.15
Zygodontmys,Dividive,-70.6666666667,9.53333333333
Zygodontmys,Hato El Frio,-63.1166666667,7.91666666667
Zygodontmys,Finca Vuelta Larga,-63.1166666667,10.55
Zygodontmys,Isla Cebaco,-81.1833333333,7.51666666667
Zygodontmys,Kayserberg Airstrip,-56.4833333333,3.1
Zygodontmys,Limao,-60.5,3.93333333333
Zygodontmys,Montijo Bay,-81.0166666667,7.66666666667
Zygodontmys,Parcela 200,-67.4333333333,8.93333333333
Zygodontmys,Rio Chico,-65.9666666667,10.3166666667
Zygodontmys,San Miguel Island,-78.9333333333,8.38333333333
Zygodontmys,Tukuko,-72.8666666667,9.83333333333
Zygodontmys,Urama,-68.4,10.6166666667
Zygodontmys,Valledup,-72.9833333333,10.6166666667
Could anyone give me a hint?
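The assignments to nc and nc1 are never reached if, for every j and k, one of the three if statements is true and the loop continues. It might be that this is not possible with your data, but the compiler has no way of knowing that. (i1, by contrast, is assigned unconditionally at the top of each outer iteration before it is read.)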
However, this is not the reason for the crash. When I run your code I get:
Index out of bounds: [index=1; extent=0].
This comes from here:
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
When declaring a NumericVector you have to tell the required size if you want to assign values by index. In your case
NumericVector is(n), l(n), lp(n), dist(n);
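might work, but note that C++ indices are zero-based: a vector of length n has valid indices 0 to n-1, while your loops run from 1 to n. You have to analyze the C code carefully w.r.t. memory allocation and array boundaries.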
Alternatively you could use the C code as is and use Rcpp to build a wrapper function, e.g.
#include <Rcpp.h>
using namespace Rcpp;
// One possibility for the function signature ...
double prim(const int n, double *d, double *l, double *lp, double *dist) {
....
}
// [[Rcpp::export]]
List primlm(NumericMatrix d) {
int n = d.nrow();
// Rcpp vectors can be sized at run time (std::array needs a compile-time size)
NumericVector l(n), lp(n), dist(n); // adjust size as needed!
double dtot = prim(n, d.begin(), l.begin(), lp.begin(), dist.begin());
return List::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
Notes:
I am returning a List instead of a DataFrame since dtot is a scalar value.
The above code is meant to illustrate the idea. Most likely it will not work without adjustments!
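To make both points concrete (initializing the candidate variables up front and using zero-based indices), here is a minimal sketch of Prim's algorithm in plain C++. It is not a drop-in replacement for the code in the question: MstResult is a hypothetical result type, std::vector stands in for the Rcpp types, and the tree is assumed to start from vertex 0 with all distances finite.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical result type; the field names mirror l, lp, dist and dtot.
struct MstResult {
    std::vector<int> l;        // vertex added at each step
    std::vector<int> lp;       // tree vertex it was attached to (-1 for the root)
    std::vector<double> dist;  // length of the connecting edge
    double dtot = 0.0;         // total length of the tree
};

// d is a symmetric n x n distance matrix; vertices are numbered 0 .. n-1.
MstResult prim(const std::vector<std::vector<double>>& d) {
    const std::size_t n = d.size();
    MstResult r;
    if (n == 0) return r;
    std::vector<bool> in_tree(n, false);
    in_tree[0] = true;  // start from vertex 0
    r.l.push_back(0);
    r.lp.push_back(-1);
    r.dist.push_back(0.0);
    for (std::size_t step = 1; step < n; ++step) {
        // Initialized up front, so no "may be used uninitialized" warning.
        double best = std::numeric_limits<double>::infinity();
        int nc = -1, nc1 = -1;
        for (int from : r.l) {
            for (std::size_t k = 0; k < n; ++k) {
                if (in_tree[k] || d[from][k] >= best) continue;
                best = d[from][k];
                nc = static_cast<int>(k);
                nc1 = from;
            }
        }
        assert(nc >= 0);  // fails only if no finite edge leaves the tree
        in_tree[nc] = true;
        r.l.push_back(nc);
        r.lp.push_back(nc1);
        r.dist.push_back(best);
        r.dtot += best;
    }
    return r;
}
```

For the three-vertex matrix with distances d(0,1) = 1, d(1,2) = 2 and d(0,2) = 4, the tree uses the edges 0-1 and 1-2, for a total length of 3.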

Rcpp - generate multiple random observations from custom distribution

This question is related to a previous one on calling functions within functions in Rcpp.
I need to generate a large number of random draws from a custom distribution, in a way similar to rnorm() or rbinom(), with the additional complication that my function produces a vector output.
As a solution, I thought about defining a function that generates observations from the custom distribution, and then a main function that draws n times from the generating function via a for loop. Here below is a much simplified working version of the code:
#include <Rcpp.h>
using namespace Rcpp;
// generating function
NumericVector gen(NumericVector A, NumericVector B){
NumericVector out = no_init_vector(2);
out[0] = R::runif(A[0],A[1]) + R::runif(B[0],B[1]);
out[1] = R::runif(A[0],A[1]) - R::runif(B[0],B[1]);
return out;
}
// [[Rcpp::export]]
// draw n observations
NumericVector rdraw(int n, NumericVector A, NumericVector B){
NumericMatrix out = no_init_matrix(n, 2);
for (int i = 0; i < n; ++i) {
out(i,_) = gen(A, B);
}
return out;
}
I am looking for ways to speed up the draws. My questions are: is there any more efficient alternative to the for loop? Would parallelization help in this case?
Thank you for any help!
There are different ways to speed this up:
1. Use inline on gen(), reducing the number of function calls.
2. Use Rcpp::runif instead of a loop with R::runif to remove even more function calls.
3. Use a faster RNG that allows for parallel execution.
Here are points 1 and 2:
#include <Rcpp.h>
using namespace Rcpp;
// generating function
inline NumericVector gen(NumericVector A, NumericVector B){
NumericVector out = no_init_vector(2);
out[0] = R::runif(A[0],A[1]) + R::runif(B[0],B[1]);
out[1] = R::runif(A[0],A[1]) - R::runif(B[0],B[1]);
return out;
}
// [[Rcpp::export]]
// draw n observations
NumericVector rdraw(int n, NumericVector A, NumericVector B){
NumericMatrix out = no_init_matrix(n, 2);
for (int i = 0; i < n; ++i) {
out(i,_) = gen(A, B);
}
return out;
}
// [[Rcpp::export]]
// draw n observations
NumericVector rdraw2(int n, NumericVector A, NumericVector B){
NumericMatrix out = no_init_matrix(n, 2);
out(_, 0) = Rcpp::runif(n, A[0],A[1]) + Rcpp::runif(n, B[0],B[1]);
out(_, 1) = Rcpp::runif(n, A[0],A[1]) - Rcpp::runif(n, B[0],B[1]);
return out;
}
/*** R
set.seed(42)
system.time(rdraw(1e7, c(0,2), c(1,3)))
system.time(rdraw2(1e7, c(0,2), c(1,3)))
*/
Result:
> set.seed(42)
> system.time(rdraw(1e7, c(0,2), c(1,3)))
user system elapsed
1.576 0.034 1.610
> system.time(rdraw2(1e7, c(0,2), c(1,3)))
user system elapsed
0.458 0.139 0.598
For comparison, your original code took about 1.8s for 10^7 draws. For point 3, I am adapting code from the parallel vignette of my dqrng package:
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
Rcpp::NumericMatrix rdraw3(int n, Rcpp::NumericVector A, Rcpp::NumericVector B, int seed, int ncores) {
dqrng::uniform_distribution distA(A(0), A(1));
dqrng::uniform_distribution distB(B(0), B(1));
dqrng::xoshiro256plus rng(seed);
Rcpp::NumericMatrix res = Rcpp::no_init_matrix(n, 2);
RcppParallel::RMatrix<double> output(res);
#pragma omp parallel num_threads(ncores)
{
dqrng::xoshiro256plus lrng(rng); // make thread local copy of rng
lrng.jump(omp_get_thread_num() + 1); // advance rng by 1 ... ncores jumps
auto genA = std::bind(distA, std::ref(lrng));
auto genB = std::bind(distB, std::ref(lrng));
#pragma omp for
for (int i = 0; i < n; ++i) {
output(i, 0) = genA() + genB();
output(i, 1) = genA() - genB();
}
}
return res;
}
/*** R
system.time(rdraw3(1e7, c(0,2), c(1,3), 42, 2))
*/
Result:
> system.time(rdraw3(1e7, c(0,2), c(1,3), 42, 2))
user system elapsed
0.276 0.025 0.151
So with a faster RNG and moderate parallelism, we can gain an order of magnitude in execution time. The results will be different, of course, but summary statistics should be the same.
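The idea of giving each worker its own generator can be sketched with the standard library alone. Note what is swapped in here: std::mt19937_64 replaces xoshiro256+, and each chunk gets its own engine seeded from (seed, chunk index) instead of using jump(). Unlike jump-ahead, that seeding does not guarantee non-overlapping streams, so treat this as an illustration of the structure only. The names Draws and rdraw_chunked are hypothetical, and the chunks run sequentially here; each one could be handed to a thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Column 0 (sum) and column 1 (diff) of the n x 2 result.
struct Draws {
    std::vector<double> sum, diff;
};

Draws rdraw_chunked(std::size_t n, double a0, double a1, double b0, double b1,
                    std::uint32_t seed, std::size_t nchunks) {
    assert(nchunks > 0 && a0 < a1 && b0 < b1);
    Draws out{std::vector<double>(n), std::vector<double>(n)};
    for (std::size_t c = 0; c < nchunks; ++c) {
        // One engine per chunk, so the result does not depend on scheduling.
        std::seed_seq seq{seed, static_cast<std::uint32_t>(c)};
        std::mt19937_64 rng(seq);
        std::uniform_real_distribution<double> distA(a0, a1), distB(b0, b1);
        const std::size_t begin = c * n / nchunks;
        const std::size_t end = (c + 1) * n / nchunks;
        for (std::size_t i = begin; i < end; ++i) {
            // Named temporaries fix the order in which the draws are made.
            const double a_sum = distA(rng), b_sum = distB(rng);
            const double a_diff = distA(rng), b_diff = distB(rng);
            out.sum[i] = a_sum + b_sum;
            out.diff[i] = a_diff - b_diff;
        }
    }
    return out;
}
```

Because every chunk owns its engine and writes only to its own index range, two calls with the same seed and the same number of chunks give identical output, whichever order the chunks are processed in.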

RcppParallel RVector push_back or something similar?

I am using RcppParallel to speed up some calculations. However, I am running out of memory in the process, so I would like to save only those results within the parallel loop that pass some relevance threshold. Below is a toy example to illustrate my point:
#include <Rcpp.h>
#include <RcppParallel.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
struct Example : public RcppParallel::Worker {
RcppParallel::RVector<double> xvals, xvals_output, yvals;
Example(const NumericVector & xvals, NumericVector & yvals, NumericVector & xvals_output) :
xvals(xvals), xvals_output(xvals_output), yvals(yvals) {}
void operator()(std::size_t begin, size_t end) {
for(std::size_t i=begin; i < end; i++) {
double y = xvals[i] * (xvals[i] - 1);
// if(y < 0) {
// xvals_output.push_back(xvals[i]);
// yvals.push_back(y);
// }
xvals_output[i] = xvals[i];
yvals[i] = y;
}
}
};
// [[Rcpp::export]]
List find_values(NumericVector xvals) {
NumericVector xvals_output(xvals.size());
NumericVector yvals(xvals.size());
Example ex(xvals, yvals, xvals_output);
parallelFor(0, xvals.size(), ex);
List L = List::create(xvals_output, yvals);
return(L);
}
The R code would be:
find_values(seq(-10,10, by=0.5))
The commented out code is what I would like to do.
That is, I would like to initialize an empty vector, and append only the y-values that pass a certain threshold and also the associated x-values.
In my real usage, I am calculating a MxN matrix, so memory is an issue.
What is the correct way to approach this issue?
If anyone ever comes across a similar problem, here's a solution using "concurrent_vector" from TBB (which RcppParallel uses under the hood and is available as a header).
#include <Rcpp.h>
#include <RcppParallel.h>
#include <tbb/concurrent_vector.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
struct Example : public RcppParallel::Worker {
RcppParallel::RVector<double> xvals;
tbb::concurrent_vector< std::pair<double, double> > &output;
Example(const NumericVector & xvals, tbb::concurrent_vector< std::pair<double, double> > &output) :
xvals(xvals), output(output) {}
void operator()(std::size_t begin, size_t end) {
for(std::size_t i=begin; i < end; i++) {
double y = xvals[i] * (xvals[i] - 1);
if(y < 0) {
output.push_back( std::pair<double, double>(xvals[i], y) );
}
}
}
};
// [[Rcpp::export]]
List find_values(NumericVector xvals) {
tbb::concurrent_vector< std::pair<double, double> > output;
Example ex(xvals,output);
parallelFor(0, xvals.size(), ex);
NumericVector xout(output.size());
NumericVector yout(output.size());
for(int i=0; i<output.size(); i++) {
xout[i] = output[i].first;
yout[i] = output[i].second;
}
List L = List::create(xout, yout);
return(L);
}
Output:
> find_values(seq(-10,10, by=0.5))
[[1]]
[1] 0.5
[[2]]
[1] -0.25
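A possible alternative that needs no concurrent container is to let each chunk of the index range fill its own local buffer and to concatenate the buffers afterwards. The sketch below shows that pattern in plain C++; it is not RcppParallel code, filter_chunk and find_values are hypothetical stand-ins, and the chunks run sequentially here, although each call to filter_chunk touches only its own buffer and could run on a separate thread.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Pair = std::pair<double, double>;

// Work for one chunk: keep only the (x, y) pairs with y < 0 in a local buffer.
std::vector<Pair> filter_chunk(const std::vector<double>& xvals,
                               std::size_t begin, std::size_t end) {
    std::vector<Pair> local;
    for (std::size_t i = begin; i < end; ++i) {
        const double y = xvals[i] * (xvals[i] - 1);
        if (y < 0) local.emplace_back(xvals[i], y);
    }
    return local;
}

// Split the range into nchunks pieces and merge the per-chunk results in order.
std::vector<Pair> find_values(const std::vector<double>& xvals,
                              std::size_t nchunks) {
    std::vector<Pair> out;
    if (nchunks == 0) return out;
    const std::size_t n = xvals.size();
    for (std::size_t c = 0; c < nchunks; ++c) {
        const std::vector<Pair> local =
            filter_chunk(xvals, c * n / nchunks, (c + 1) * n / nchunks);
        out.insert(out.end(), local.begin(), local.end());
    }
    return out;
}
```

Merging in chunk order keeps the output in the same order as the input, whereas the order of elements pushed into a concurrent_vector from several threads depends on scheduling.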
