I'm trying to improve the performance of a row-wise custom distance measure over a large matrix using the parallelDist R package and RcppArmadillo. The example they provide works for me:
library(parallelDist)
library(RcppArmadillo)
library(RcppXPtrUtils)
euclideanFuncPtr <- cppXPtr("double customDist(const arma::mat &A, const arma::mat &B) {return sqrt(arma::accu(arma::square(A - B))); }",depends = c("RcppArmadillo"))
# distance matrix for user-defined euclidean distance function
# (note that method is set to "custom")
x = matrix(1:16,ncol=2)
parDist(x, method="custom", func = euclideanFuncPtr)
# same result as dist(x)
I want to do something similar with the following:
overlapSlow = function(x,y){sum(pmin(x,y))/sum(pmax(x,y))}
x = matrix(1:16,ncol=2)
res = matrix(NA,nrow=8,ncol=8)
for (i in 1:nrow(x)) {
for (j in 1:nrow(x)){
if (i>j)
{
res[i,j]=overlapSlow(x[i,],x[j,])
}
}
}
res=as.dist(res)
But the following Xptr object fails to compile:
overlap <- cppXPtr("double customDist(const arma::mat &A, const arma::mat &B) {arma::accu(pmin(A,B)) / arma::accu(pmax(A,B)); }",depends = c("RcppArmadillo"))
Throwing the error message
error: no matching function for call to ‘pmin(const mat&, const mat&)’
I never used Rcpp before, but I suppose this does not work because pmax and pmin require a Numeric vector, but I am struggling to find a way to convert the objects A and B internally.
NOTE: parallelDist vignette mentions that
"The user-defined function needs to have the following signature:
double customDist(const arma::mat &A, const arma::mat &B)
Note that the return value must be a double and the two parameters must be of type const
arma::mat &param." so I cannot simply make changes in the signature.
Thanks in advance for any help/tips!
E
UPDATE 1
I could, of course, do this entirely in Rcpp (without using the structure imposed by parallelDist). The following works fine:
// [[Rcpp::depends(RcppProgress)]]
#include <progress.hpp>
#include <progress_bar.hpp>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix overlapDistance(Rcpp::NumericMatrix x, bool display_progress=true) {
int n = x.nrow();
Progress p(n*n, display_progress);
NumericMatrix m( n );
for (int i=0; i < n; ++i) {
if (Progress::check_abort() )
return -1.0;
for (int j=0; j < n; ++j){
p.increment(); // update progress
if (i>j)
{m(i,j)= sum(pmin(x(i,_),x(j,_)))/sum(pmax(x(i,_),x(j,_)));}
}
}
return m;
}
and it is much faster than plain R, but still too slow for what I want to achieve...
UPDATE 2
I tried the following
library(parallelDist)
library(RcppArmadillo)
library(RcppXPtrUtils)
overlap <- cppXPtr("double customDist(const arma::mat &A, const arma::mat &B)
{Rcpp::NumericMatrix x = as<Rcpp::NumericMatrix>(wrap(A));
Rcpp::NumericMatrix y = as<Rcpp::NumericMatrix>(wrap(B));
return sum(pmin(x,y)) / sum(pmax(x,y)); }",depends = c("RcppArmadillo"))
x= matrix(1:16, ncol=2)
parDist(x, method="custom", func = overlap)
which compiles without error and runs. However when I re-ran the line parDist(x, method="custom", func = overlap) I get
Error in parDist(x, method = "custom", func = overlap) : Not a matrix.
and when I tried the third time I get
Error in parDist(x, method = "custom", func = overlap) : bad value
and eventually
*** caught segfault ***
address 0x140, cause 'memory not mapped'
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: parDist(x, method = "custom", func = overlap)
Traceback:
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1: parDist(x, method = "custom", func = overlap)
I'm trying to speed up some R code with Rcpp functions. One function is giving me fits when compiling, and I cannot figure out why the compiler complains about the return argument. I declared the function to return NumericVector, the result is a NumericVector, and yet the compiler complains that the return argument is invalid.
Rcpp is version 0.12.18,
R is Microsoft Open R 3.5.3
cppFunction('NumericVector NNE(IntegerVector X, IntegerVector Y, IntegerVector XY, IntegerVector xy, NumericVector P, int radius ) {
int n = X.size();
NumericVector vN[n];
NumericVector vSum[n];
NumericVector vAvg[n];
// for each xy determine neighborhood Sum and count (N)
for(int i=0; i<n; i++) {
vN[i] = 0.0;
vSum[i] = 0.0;
// traverse neighborhood, if the xy exists in the input
// vector then accumulate the values, otherwise ignore
for(int dx=-1*radius; dx<=radius; dx++) {
for(int dy=-1*radius; dy<=radius; dy++) {
// construct an xy index for the neighborhood die
xy[0] = ( (X[i]+dx) * 10000 ) + (Y[i]+dy);
// check to see if index above exists in input set
IntegerVector m = Rcpp::match(xy, XY);
// if valid then accumulate and count
if(m[0] != NA_INTEGER) {
vN[i] = vN[i] + 1.0;
vSum[i] = vSum[i] + P[ m[0] ];
}
}
}
vAvg[i] = vSum[i] / vN[i];
}
return vAvg;
}')
The confusing compiler message is as follows:
C:/RBuildTools/3.5/mingw_64/bin/g++ -m64 -I"C:/PROGRA~1/MICROS~3/ROPEN~1/R-35~1.3/include" -DNDEBUG -I"D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include" -I"D:/Users/ka/AppData/Local/Temp/4/RtmpeGKfUg/sourceCpp-x86_64-w64-mingw32-0.12.18" -I"C:/a/w/1/s/vendor/extsoft/include" -O2 -Wall -mtune=core2 -c filefcc651c7fa9.cpp -o filefcc651c7fa9.o
filefcc651c7fa9.cpp: In function 'Rcpp::NumericVector NNE(Rcpp::IntegerVector, Rcpp::IntegerVector, Rcpp::IntegerVector, Rcpp::IntegerVector, Rcpp::NumericVector, int)':
filefcc651c7fa9.cpp:42:10: error: invalid conversion from 'Rcpp::NumericVector* {aka Rcpp::Vector<14, Rcpp::PreserveStorage>*}' to 'const int&' [-fpermissive]
return vAvg;
^
In file included from D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include/Rcpp/Vector.h:52:0,
from D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include/Rcpp.h:40,
from filefcc651c7fa9.cpp:1:
D:/Users/ka/Documents/R/win-library/3.5/Rcpp/include/Rcpp/vector/Vector.h:128:5: note: initializing argument 1 of 'Rcpp::Vector<RTYPE, StoragePolicy>::Vector(const int&) [with int RTYPE = 14; StoragePolicy = Rcpp::PreserveStorage]'
Vector( const int& size ) {
^
make: *** [C:/PROGRA~1/MICROS~3/ROPEN~1/R-35~1.3/etc/x64/Makeconf:215: filefcc651c7fa9.o] Error 1
Error in sourceCpp(code = code, env = env, rebuild = rebuild, cacheDir = cacheDir, :
Error 1 occurred building shared library.
You had a minuscule error that rendered the variable "bad" as far as the compiler is concerned, and you then misread the rejected return of that "bad" variable as a different issue.
It happens. We have all been there.
Here is the repaired code. In short, you needed NumericVector x(n); with round parentheses instead of square brackets (as the latter denote arrays in C and then C++).
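The difference between the two declarations can be seen without Rcpp. The following is a minimal sketch that uses std::vector as a stand-in for NumericVector; the helper names make_zeros and count_in_array are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// Parentheses call the size constructor: ONE vector holding n zeros.
std::vector<double> make_zeros(std::size_t n) {
    std::vector<double> v(n);
    return v;
}

// Square brackets declare a C array: here, THREE separate (empty) vectors.
// Returning such an array by name would decay to a pointer, which is the
// kind of type the compiler rejected in the error message above.
std::size_t count_in_array() {
    std::vector<double> arr[3];
    return sizeof(arr) / sizeof(arr[0]);
}
```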
Code
I also turned it into input for sourceCpp() which is easier given the length of the functions.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector NNE(IntegerVector X, IntegerVector Y, IntegerVector XY,
IntegerVector xy, NumericVector P, int radius ) {
int n = X.size();
NumericVector vN(n);
NumericVector vSum(n);
NumericVector vAvg(n);
// for each xy determine neighborhood Sum and count (N)
for(int i=0; i<n; i++) {
vN[i] = 0.0;
vSum[i] = 0.0;
// traverse neighborhood, if the xy exists in the input
// vector then accumulate the values, otherwise ignore
for(int dx=-1*radius; dx<=radius; dx++) {
for(int dy=-1*radius; dy<=radius; dy++) {
// construct an xy index for the neighborhood die
xy[0] = ( (X[i]+dx) * 10000 ) + (Y[i]+dy);
// check to see if index above exists in input set
IntegerVector m = Rcpp::match(xy, XY);
// if valid then accumulate and count
if(m[0] != NA_INTEGER) {
vN[i] = vN[i] + 1.0;
vSum[i] = vSum[i] + P[ m[0] ];
}
}
}
vAvg[i] = vSum[i] / vN[i];
}
return vAvg;
}
/*** R
cat("Built\n")
*/
Output
As we have no reference data, I can only show that it built:
R> sourceCpp("~/git/stackoverflow/61377960/answer.cpp")
R> cat("Built\n")
Built
R>
I am trying to convert my R function to C++ using Rcpp, but I came around errors that I don't understand quite well.
The following code gives my R function, my (poor) attempt to translate it, and some example uses at the end (testing that the two functions return the same thing...)
My R Code function:
intersect_rectangles <- function(x_min, x_max, y_min, y_max) {
rez <- list()
rez$min <- pmax(x_min, y_min)
rez$max <- pmin(x_max, y_max)
if (any(rez$min > rez$max)) {
return(list(NULL))
}
return(rez)
}
My attempt to create the same function with Rcpp.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List Cpp_intersect_rectangles(NumericVector x_min,NumericVector
x_max,NumericVector y_min,NumericVector y_max) {
// Create a list :
NumericVector min = pmax(x_min,y_min);
NumericVector max = pmin(x_max,y_max);
List L = List::create(R_NilValue);
if (! any(min > max)) {
L = List::create(Named("min") = min , _["max"] = max);
}
return(L);
}
I receive the following error messages:
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/sugar/logical/SingleLogicalResult.h:36:2: error: implicit instantiation of undefined template 'Rcpp::sugar::forbidden_conversion<false>'
forbidden_conversion<x>{
^
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/sugar/logical/SingleLogicalResult.h:74:40: note: in instantiation of template class 'Rcpp::sugar::conversion_to_bool_is_forbidden<false>' requested here
conversion_to_bool_is_forbidden<!NA> x ;
^
file637e53281965.cpp:13:9: note: in instantiation of member function 'Rcpp::sugar::SingleLogicalResult<true, Rcpp::sugar::Negate_SingleLogicalResult<true, Rcpp::sugar::Any<true, Rcpp::sugar::Comparator<14, Rcpp::sugar::greater<14>, true, Rcpp::Vector<14, PreserveStorage>, true, Rcpp::Vector<14, PreserveStorage> > > > >::operator bool' requested here
if (! any(min > max))
If the Rcpp function is implemented correctly, then the following should work:
u = rep(0,4)
v = rep(1,4)
w = rep(0.3,4)
x = c(0.8,0.8,3,3)
all.equal(intersect_rectangles(u,v,w,x), Cpp_intersect_rectangles(u,v,w,x))
all.equal(intersect_rectangles(u,v,w,w), Cpp_intersect_rectangles(u,v,w,w))
What's wrong with my cpp code?
The reason the code isn't translating correctly is due to how the any() Rcpp sugar implementation was created. In particular, we have that:
The actual return type of any(X) is an instance of the
SingleLogicalResult template class, but the functions is_true
and is_false may be used to convert the return value to bool.
Per https://thecoatlessprofessor.com/programming/unofficial-rcpp-api-documentation/#any
Therefore, the solution is to add .is_true() to the any() function call, e.g. !any(condition).is_true().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List Cpp_intersect_rectangles(NumericVector x_min, NumericVector x_max,
NumericVector y_min, NumericVector y_max) {
// Create a list :
NumericVector min = pmax(x_min, y_min);
NumericVector max = pmin(x_max, y_max);
List L = List::create(R_NilValue);
if (! any(min > max).is_true()) {
// ^^^^^^^^^ Added
L = List::create(Named("min") = min , _["max"] = max);
}
return(L);
}
Then, through testing we get:
u = rep(0,4)
v = rep(1,4)
w = rep(0.3,4)
x = c(0.8,0.8,3,3)
all.equal(intersect_rectangles(u,v,w,x), Cpp_intersect_rectangles(u,v,w,x))
# [1] TRUE
all.equal(intersect_rectangles(u,v,w,w), Cpp_intersect_rectangles(u,v,w,w))
# [1] TRUE
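The reason any() does not convert to bool on its own is easier to see with a small model of R's three-valued logic. This is a sketch under stated assumptions, not Rcpp's implementation: the Tri enum and the any_greater and is_true functions are hypothetical names, and NaN stands in for NA.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical three-state logical, mirroring R's TRUE / FALSE / NA.
enum class Tri { False, True, NA };

// Models any(x > y): True if some pair compares greater; NA if no pair does
// but a missing value (NaN here) was seen; False otherwise.
Tri any_greater(const std::vector<double>& x, const std::vector<double>& y) {
    bool seen_na = false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::isnan(x[i]) || std::isnan(y[i])) {
            seen_na = true;
            continue;
        }
        if (x[i] > y[i]) return Tri::True;
    }
    return seen_na ? Tri::NA : Tri::False;
}

// Explicit conversion, playing the role of is_true(): NA counts as not true.
bool is_true(Tri t) { return t == Tri::True; }
```

Because a third state exists, an implicit conversion to bool would have to silently pick a side for NA, which is presumably why the caller is asked to choose with is_true() or is_false().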
In a very first attempt at creating a C++ function which can be called from R using Rcpp, I have a simple function to compute a minimum spanning tree from a distance matrix using Prim's algorithm. This function has been converted into C++ from a former version in ANSI C (which works fine).
Here it is:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame primlm(const int n, NumericMatrix d)
{
double const din = 9999999.e0;
long int i1, nc, nc1;
double dlarge, dtot;
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
for (int i=2; i <= n; i++) {
is(i) = 0;
}
for (int i=2; i <= n; i++) {
dlarge = din;
i1 = i - 1;
for (int j=1; j <= i1; j++) {
for (int k=1; k <= n; k++) {
if (l(j) == k)
continue;
if (d[l(j), k] > dlarge)
continue;
if (is(k) == 1)
continue;
nc = k;
nc1 = l(j);
dlarge = d(nc1, nc);
}
}
is(nc) = 1;
l(i) = nc;
lp(i) = nc1;
dist(i) = dlarge;
}
dtot = 0.e0;
for (int i=2; i <= n; i++){
dtot += dist(i);
}
return DataFrame::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
When I compile this function using Rcpp under RStudio, I get two warnings, complaining that variables 'nc' and 'nc1' have not been initialized. Frankly, I could not understand that, as it seems to me that both variables are being initialized inside the third loop. Also, why is there no similar complaint about variable 'i1'?
Perhaps it comes as no surprise that, when attempting to call this function from R, using the below code, what I get is a crash of the R system!
# Read test data
df <- read.csv("zygo.csv", header=TRUE)
lonlat <- data.frame(df$Longitude, df$Latitude)
colnames(lonlat) <- c("lon", "lat")
# Compute distance matrix using geosphere library
library(geosphere)
d <- distm(lonlat, lonlat, fun=distVincentyEllipsoid)
# Calls Prim minimum spanning tree routine via Rcpp
library(Rcpp)
sourceCpp("Prim.cpp")
n <- nrow(df)
p <- primlm(n, d)
Here is the dataset I use for testing purposes:
"Scientific name",Locality,Longitude,Latitude
Zygodontmys,Bush Bush Forest,-61.05,10.4
Zygodontmys,Cerro Azul,-79.4333333333,9.15
Zygodontmys,Dividive,-70.6666666667,9.53333333333
Zygodontmys,Hato El Frio,-63.1166666667,7.91666666667
Zygodontmys,Finca Vuelta Larga,-63.1166666667,10.55
Zygodontmys,Isla Cebaco,-81.1833333333,7.51666666667
Zygodontmys,Kayserberg Airstrip,-56.4833333333,3.1
Zygodontmys,Limao,-60.5,3.93333333333
Zygodontmys,Montijo Bay,-81.0166666667,7.66666666667
Zygodontmys,Parcela 200,-67.4333333333,8.93333333333
Zygodontmys,Rio Chico,-65.9666666667,10.3166666667
Zygodontmys,San Miguel Island,-78.9333333333,8.38333333333
Zygodontmys,Tukuko,-72.8666666667,9.83333333333
Zygodontmys,Urama,-68.4,10.6166666667
Zygodontmys,Valledup,-72.9833333333,10.6166666667
Could anyone give me a hint?
The assignments to nc and nc1 are skipped whenever one of the three if statements is true, so they may never be reached. It might be that this is not possible with your data, but the compiler has no way of knowing that.
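The same warning pattern can be reproduced and silenced in a few lines. This is a minimal standalone sketch; the function first_above and its sentinel value -1 are assumptions made for illustration and are not part of the original code.

```cpp
#include <vector>

// Returns the index of the first element above `threshold`, or -1 if none.
// Initializing `found` up front gives it a defined value on every path.
// Without the initializer, the "no match" path would return an
// indeterminate value, which is what the compiler warns about.
int first_above(const std::vector<double>& v, double threshold) {
    int found = -1;  // sentinel meaning "nothing found"
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        if (v[i] <= threshold) continue;  // skips the assignment below
        found = i;
        break;
    }
    return found;
}
```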
However, this is not the reason for the crash. When I run your code I get:
Index out of bounds: [index=1; extent=0].
This comes from here:
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
When declaring a NumericVector you have to tell the required size if you want to assign values by index. In your case
NumericVector is(n), l(n), lp(n), dist(n);
might work. You have to analyze the C code carefully w.r.t. memory allocation and array boundaries.
Alternatively you could use the C code as is and use Rcpp to build a wrapper function, e.g.
#include <array>
#include <Rcpp.h>
using namespace Rcpp;
// One possibility for the function signature ...
double prim(const int n, double *d, double *l, double *lp, double *dist) {
....
}
// [[Rcpp::export]]
List primlm(NumericMatrix d) {
int n = d.nrow();
// n is only known at run time, so use sized Rcpp vectors instead of std::array
NumericVector l(n), lp(n), dist(n); // adjust sizes as needed!
double dtot = prim(n, d.begin(), l.begin(), lp.begin(), dist.begin());
return List::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
Notes:
I am returning a List instead of a DataFrame since dtot is a scalar value.
The above code is meant to illustrate the idea. Most likely it will not work without adjustments!
I am using Rcpp to speed up some R code. However, I'm really struggling with types - since these are foreign in R. Here's a simplified version of what I'm trying to do:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
NumericVector fun(SEXP Pk, int k, int i, const vec& a, const mat& D) {
// this is dummy version of my actual function - with actual arguments.;
// I'm guessing SEXP is going to need to be replaced with something else when it's called from C++ not R.;
return D.col(i);
}
// [[Rcpp::export]]
NumericVector f(const arma::vec& assignment, char k, int B, const mat& D) {
uvec k_ind = find(assignment == k);
NumericVector output(assignment.size()); // for dummy output.
uvec::iterator k_itr = k_ind.begin();
for(; k_itr != k_ind.end(); ++k_itr) {
// this is R code, as I don't know the best way to do this in C++;
k_rep = sample(c(assignment[assignment != k], -1), size = B, replace = TRUE);
output = fun(k_rep, k, *k_itr, assignment, D);
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return output;
}
The part I'm struggling with is the random sampling of assignment. I know that sample has been implemented in RcppArmadillo. However, I can see two approaches to this, and I'm not sure which is more efficient/doable.
Approach 1:
Make a table of the assignment values. Replace assignment == k with -1 and set its "count" equal to 1.
sample from the table "names" with probability proportional to the count.
Approach 2:
Copy the relevant subset of the assignment vector into a new vector with an extra spot for -1.
Sample from the copied vector with equal probabilities.
I want to say that approach 1 would be more efficient, except that assignment is currently of type arma::vec, and I'm not sure how to make the table from that - or how much of a cost there is to converting it to a more-compatible format. I think I could implement Approach 2, but I'm hoping to avoid the expensive copy.
Thanks for any insights you can provide.
Several variable declarations are inconsistent with how the variables are used. For example, assignment == k is not a meaningful comparison, because assignment holds real values and k is a char. Since the question is poorly written, I took the liberty of changing the variable types. This compiles:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::export]]
arma::vec fun(const Rcpp::NumericVector& Pk, int k, unsigned int i, const arma::ivec& a, const arma::mat& D)
{
return D.col(i);
}
// [[Rcpp::export]]
Rcpp::NumericMatrix f(const arma::ivec& assignment, int k, unsigned int B, const arma::mat& D)
{
arma::uvec k_ind = find(assignment == k);
arma::ivec KK = assignment(find(assignment != k));
// The next two statements append the -1 from KK = c(assignment[assignment != k], -1).
// I don't know what this -1 is for; maybe you don't need it.
KK.insert_rows(KK.n_rows, 1);
KK(KK.n_rows - 1) = -1;
arma::uvec k_ind_not = find(assignment != k);
Rcpp::NumericVector k_rep(B);
arma::mat output(D.n_rows,k_ind.n_rows); // for dummy output.
for(unsigned int i =0; i < k_ind.n_rows ; i++)
{
k_rep = Rcpp::RcppArmadillo::sample(KK, B, true);
output(arma::span::all, i) = fun(k_rep, k, i, assignment, D);
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return Rcpp::wrap(output);
}
This is not optimized (the question is too vague for that) and it is badly written, because I think R would be fast enough at finding the indices of a vector (so do that part in R and implement only fun in Rcpp). It is not worth spending time here; there are other problems that need a solver implemented in Rcpp, not this index searching.
However, this is not a useful question, because you are asking for an algorithm rather than for something specific such as a function signature.
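On the concern about copying raised in the question, one option that avoids both the table and the copy is rejection sampling: draw an index over the whole vector plus one extra slot for -1, and redraw whenever the index lands on an element equal to k. The sketch below is standalone C++ under stated assumptions: it uses std::mt19937 rather than R's random number generator (so set.seed() would not control it), integer labels in a std::vector, and a hypothetical function name sample_excluding.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draws B values with replacement, uniformly from the elements of
// `assignment` that differ from k plus one extra value -1, without
// copying the subset into a new vector.
std::vector<int> sample_excluding(const std::vector<int>& assignment, int k,
                                  std::size_t B, std::mt19937& rng) {
    // Index assignment.size() is the extra slot that stands for -1.
    std::uniform_int_distribution<std::size_t> pick(0, assignment.size());
    std::vector<int> out;
    out.reserve(B);
    while (out.size() < B) {
        const std::size_t j = pick(rng);
        if (j == assignment.size()) {
            out.push_back(-1);
        } else if (assignment[j] != k) {
            out.push_back(assignment[j]);
        }
        // Otherwise assignment[j] == k: reject the draw and try again.
    }
    return out;
}
```

The expected number of redraws grows with the share of elements equal to k, so this trades the copy for extra random draws; whether that is a win depends on the data.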
I am trying to rewrite in Rcpp an R function that uses the gamma function (with double input). The original source is below. When compiling with sourceCpp, the following error is raised: "no matching function for call to 'gamma(Rcpp::traits::storage_type(<14>:.type)'"
The gamma function should be available in Rcpp sugar (like mean, which is used below), so I expected it to be easy to call.
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// original R function
// function (y_pred, y_true)
// {
// eps <- 1e-15
// y_pred <- pmax(y_pred, eps)
// Poisson_LogLoss <- mean(log(gamma(y_true + 1)) + y_pred -
// log(y_pred) * y_true)
// return(Poisson_LogLoss)
// }
// [[Rcpp::export]]
double poissonLogLoss(NumericVector predicted, NumericVector actual) {
NumericVector temp, y_pred_new;
double out;
const double eps=1e-15;
y_pred_new=pmax(predicted,eps);
long n = predicted.size();
for (long i = 0; i < n; ++i) {
temp[i] = log( gamma(actual[i]+1)+y_pred_new[i]-log(y_pred_new[i])*actual[i]);
}
out=mean(temp); // using sugar implementation
return out;
}
You are making this too complicated, as the point of Rcpp Sugar is to work vectorized. So the following compiles as well:
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// [[Rcpp::export]]
double poissonLogLoss(NumericVector predicted, NumericVector actual) {
NumericVector temp, y_pred_new;
double out;
const double eps=1e-15;
y_pred_new=pmax(predicted,eps);
temp = log(gamma(actual + 1)) + y_pred_new - log(y_pred_new)*actual;
out=mean(temp); // using sugar implementation
return out;
}
Now, you didn't supply any test data so I do not know if this computes correctly or not. Also, because your R expression is already vectorized, this will not be much faster.
Lastly, your compile error is likely due to the Sugar function gamma() expecting an Rcpp object whereas you provided a double.
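If an element-wise loop is still wanted, the scalar work can be done with the C++ standard library instead of the Sugar function. The following is a minimal standalone sketch, assuming std::vector inputs in place of NumericVector and the invented name poisson_log_loss; it uses std::lgamma, since log(gamma(y + 1)) equals lgamma(y + 1) for non-negative y and is less prone to overflow. Note also that the loop in the question wraps the whole expression in log() and indexes temp before giving it a size, both of which differ from the R original.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean Poisson log loss: mean(lgamma(y + 1) + p - log(p) * y),
// with predictions clamped below at eps as in the R function.
double poisson_log_loss(const std::vector<double>& predicted,
                        const std::vector<double>& actual) {
    const double eps = 1e-15;
    double total = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double p = std::max(predicted[i], eps);
        total += std::lgamma(actual[i] + 1.0) + p - std::log(p) * actual[i];
    }
    return total / static_cast<double>(predicted.size());
}
```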