efficiently sample from modified arma::vec object - r

I am using Rcpp to speed up some R code. However, I'm really struggling with types - since these are foreign in R. Here's a simplified version of what I'm trying to do:
#include <RcppArmadillo.h>
#include <algorithm>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
NumericVector fun(SEXP Pk, int k, int i, const vec& a, const mat& D) {
// this is dummy version of my actual function - with actual arguments.;
// I'm guessing SEXP is going to need to be replaced with something else when it's called from C++ not R.;
return D.col(i);
}
// [[Rcpp::export]]
NumericVector f(const arma::vec& assignment, char k, int B, const mat& D) {
uvec k_ind = find(assignment == k);
NumericVector output(assignment.size()); // for dummy output.
uvec::iterator k_itr = k_ind.begin();
for(; k_itr != k_ind.end(); ++k_itr) {
// this is R code, as I don't know the best way to do this in C++;
k_rep = sample(c(assignment[assignment != k], -1), size = B, replace = TRUE);
output = fun(k_rep, k, *k_itr, assignment, D);
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return output;
}
The part I'm struggling with is the random sampling of assignment. I know that sample has been implemented in RcppArmadillo. However, I can see two approaches to this, and I'm not sure which is more efficient/doable.
Approach 1:
Make a table of the assignment values. Replace assignment == k with -1 and set its "count" equal to 1.
sample from the table "names" with probability proportional to the count.
Approach 2:
Copy the relevant subset of the assignment vector into a new vector with an extra spot for -1.
Sample from the copied vector with equal probabilities.
I want to say that approach 1 would be more efficient, except that assignment is currently of type arma::vec, and I'm not sure how to make the table from that - or how much of a cost there is to converting it to a more-compatible format. I think I could implement Approach 2, but I'm hoping to avoid the expensive copy.
Thanks for any insights you can provide.

Many of the variable declarations are inconsistent with how you use them. For example, assignment == k cannot be compared meaningfully, because assignment holds real values and k is a char. Since the question is unclear, I took the liberty of changing the variable types. This compiles:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::export]]
arma::vec fun(const Rcpp::NumericVector& Pk, int k, unsigned int i, const arma::ivec& a, const arma::mat& D)
{
return D.col(i);
}
// [[Rcpp::export]]
Rcpp::NumericMatrix f(const arma::ivec& assignment, int k, unsigned int B, const arma::mat& D)
{
arma::uvec k_ind = find(assignment == k);
arma::ivec KK = assignment(find(assignment != k));
// The next two statements append -1, giving KK = c(assignment[assignment != k], -1).
// It is unclear what the -1 is for; you may not need it.
KK.insert_rows(KK.n_rows, 1);
KK(KK.n_rows - 1) = -1;
arma::uvec k_ind_not = find(assignment != k);
Rcpp::NumericVector k_rep(B);
arma::mat output(D.n_rows,k_ind.n_rows); // for dummy output.
for(unsigned int i =0; i < k_ind.n_rows ; i++)
{
k_rep = Rcpp::RcppArmadillo::sample(KK, B, true);
output(arma::span::all, i) = fun(k_rep, k, k_ind(i), assignment, D); // pass the matched index, as *k_itr did in the question
// do something with output;
}
// compile result, ultimately return a List (after I figure out how to do that. For right now, I'll cheat and return the last output);
return Rcpp::wrap(output);
}
This is not optimized, since the question is unclear. I suspect R is already fast enough at finding the indices of a vector, so I would do that part in R and implement only fun in Rcpp; Rcpp is better spent on problems that need a real solver than on index searching.
Note also that the question asks more about an algorithm than about, for example, a function signature.
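For reference, the question's Approach 1 (tabulate the values, then sample a label with probability proportional to its count) can be sketched in plain standard C++. This is only an illustration under some assumptions: sample_labels is a hypothetical helper, the labels are taken to be integers, and it uses std::mt19937 rather than R's random number generator, so it will not respond to set.seed().

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical helper (not part of Rcpp or Armadillo): draws B values with
// replacement from c(assignment[assignment != k], -1) without copying the
// subset. Labels are tabulated once and then sampled with probability
// proportional to their counts; -1 gets one extra count.
std::vector<int> sample_labels(const std::vector<int>& assignment, int k,
                               std::size_t B, std::mt19937& rng) {
    std::map<int, double> counts;
    for (int a : assignment) {
        if (a != k) counts[a] += 1.0;
    }
    counts[-1] += 1.0;  // the extra slot for -1

    std::vector<int> labels;
    std::vector<double> weights;
    for (const auto& kv : counts) {
        labels.push_back(kv.first);
        weights.push_back(kv.second);
    }

    // discrete_distribution returns index i with probability weights[i] / sum
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    std::vector<int> out(B);
    for (std::size_t b = 0; b < B; ++b) out[b] = labels[pick(rng)];
    return out;
}
```

The table costs one pass over assignment and memory proportional to the number of distinct labels, so whether it beats copying the subset depends on how many distinct labels there are relative to the vector length.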

Related

Rcpp function complaining about uninitialized variables

In a very first attempt at creating a C++ function which can be called from R using Rcpp, I have a simple function to compute a minimum spanning tree from a distance matrix using Prim's algorithm. This function has been converted into C++ from a former version in ANSI C (which works fine).
Here it is:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame primlm(const int n, NumericMatrix d)
{
double const din = 9999999.e0;
long int i1, nc, nc1;
double dlarge, dtot;
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
for (int i=2; i <= n; i++) {
is(i) = 0;
}
for (int i=2; i <= n; i++) {
dlarge = din;
i1 = i - 1;
for (int j=1; j <= i1; j++) {
for (int k=1; k <= n; k++) {
if (l(j) == k)
continue;
if (d[l(j), k] > dlarge)
continue;
if (is(k) == 1)
continue;
nc = k;
nc1 = l(j);
dlarge = d(nc1, nc);
}
}
is(nc) = 1;
l(i) = nc;
lp(i) = nc1;
dist(i) = dlarge;
}
dtot = 0.e0;
for (int i=2; i <= n; i++){
dtot += dist(i);
}
return DataFrame::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
When I compile this function using Rcpp under RStudio, I get two warnings, complaining that variables 'nc' and 'nc1' have not been initialized. Frankly, I could not understand that, as it seems to me that both variables are being initialized inside the third loop. Also, why there is no similar complaint about variable 'i1'?
Perhaps it comes as no surprise that, when attempting to call this function from R, using the below code, what I get is a crash of the R system!
# Read test data
df <- read.csv("zygo.csv", header=TRUE)
lonlat <- data.frame(df$Longitude, df$Latitude)
colnames(lonlat) <- c("lon", "lat")
# Compute distance matrix using geosphere library
library(geosphere)
d <- distm(lonlat, lonlat, fun=distVincentyEllipsoid)
# Calls Prim minimum spanning tree routine via Rcpp
library(Rcpp)
sourceCpp("Prim.cpp")
n <- nrow(df)
p <- primlm(n, d)
Here is the dataset I use for testing purposes:
"Scientific name",Locality,Longitude,Latitude
Zygodontmys,Bush Bush Forest,-61.05,10.4
Zygodontmys,Cerro Azul,-79.4333333333,9.15
Zygodontmys,Dividive,-70.6666666667,9.53333333333
Zygodontmys,Hato El Frio,-63.1166666667,7.91666666667
Zygodontmys,Finca Vuelta Larga,-63.1166666667,10.55
Zygodontmys,Isla Cebaco,-81.1833333333,7.51666666667
Zygodontmys,Kayserberg Airstrip,-56.4833333333,3.1
Zygodontmys,Limao,-60.5,3.93333333333
Zygodontmys,Montijo Bay,-81.0166666667,7.66666666667
Zygodontmys,Parcela 200,-67.4333333333,8.93333333333
Zygodontmys,Rio Chico,-65.9666666667,10.3166666667
Zygodontmys,San Miguel Island,-78.9333333333,8.38333333333
Zygodontmys,Tukuko,-72.8666666667,9.83333333333
Zygodontmys,Urama,-68.4,10.6166666667
Zygodontmys,Valledup,-72.9833333333,10.6166666667
Could anyone give me a hint?
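The assignments to nc and nc1 are skipped whenever one of the three if statements is true, so if that happens on every iteration the variables are never initialized. This may not be possible with your data, but the compiler has no way of knowing that. By contrast, i1 is assigned unconditionally at the top of every outer iteration, which is why there is no warning for it.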
However, this is not the reason for the crash. When I run your code I get:
Index out of bounds: [index=1; extent=0].
This comes from here:
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
When declaring a NumericVector you have to tell the required size if you want to assign values by index. In your case
NumericVector is(n), l(n), lp(n), dist(n);
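might work. Note, however, that Rcpp vectors are indexed from 0 while your loops run from 1 to n, so index n is still out of bounds for a vector of size n. You have to analyze the C code carefully w.r.t. memory allocation and array boundaries.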
Alternatively you could use the C code as is and use Rcpp to build a wrapper function, e.g.
#include <array>
#include <Rcpp.h>
using namespace Rcpp;
// One possibility for the function signature ...
double prim(const int n, double *d, double *l, double *lp, double *dist) {
....
}
// [[Rcpp::export]]
List primlm(NumericMatrix d) {
int n = d.nrow();
NumericVector l(n), lp(n), dist(n); // sized at run time; adjust size as needed!
double dtot = prim(n, d.begin(), l.begin(), lp.begin(), dist.begin());
return List::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
Notes:
I am returning a List instead of a DataFrame since dtot is a scalar value.
The above code is meant to illustrate the idea. Most likely it will not work without adjustments!
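To make the two points above concrete (size the containers up front, and give nc and nc1 a value on every path), here is a minimal sketch of Prim's algorithm in plain standard C++ with 0-based indices. It is not the poster's original C routine: PrimResult and prim_mst are hypothetical names, and the distance matrix is assumed to be dense, square and symmetric. The logic could be wrapped by an exported Rcpp function in the way shown above.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// For each tree edge: the node added (l), the node it attaches to (lp) and
// the edge length (dist), plus the total length of the tree (dtot).
struct PrimResult {
    std::vector<int> l, lp;
    std::vector<double> dist;
    double dtot = 0.0;
};

// Prim's algorithm on a dense n x n distance matrix, 0-based, starting at node 0.
PrimResult prim_mst(const std::vector<std::vector<double>>& d) {
    const std::size_t n = d.size();
    PrimResult res;
    if (n == 0) return res;

    std::vector<bool> in_tree(n, false);
    std::vector<int> tree{0};
    in_tree[0] = true;

    for (std::size_t step = 1; step < n; ++step) {
        double best = std::numeric_limits<double>::infinity();
        int nc = -1, nc1 = -1;  // always initialized, so no compiler warning
        for (int j : tree) {
            for (std::size_t k = 0; k < n; ++k) {
                if (in_tree[k] || d[j][k] >= best) continue;
                best = d[j][k];
                nc = static_cast<int>(k);
                nc1 = j;
            }
        }
        if (nc < 0) break;  // nothing reachable is left
        in_tree[nc] = true;
        tree.push_back(nc);
        res.l.push_back(nc);
        res.lp.push_back(nc1);
        res.dist.push_back(best);
        res.dtot += best;
    }
    return res;
}
```

Because the vectors grow with push_back, no index is written before it exists, which avoids the "Index out of bounds" failure described above.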

Allow C++ constants to be a default function parameter using Rcpp Attributes

I created a cumsum function in an R package with Rcpp which will cumulatively sum a vector until it hits the user-defined ceiling or floor. However, if one wants the cumsum to be bounded above, the user must still specify a floor.
Example:
a = c(1, 1, 1, 1, 1, 1, 1)
If I wanted to cumsum a and have an upper bound of 3, I could call cumsum_bounded(a, lower = 1, upper = 3). I would rather not have to specify the lower bound.
My code:
#include <Rcpp.h>
#include <float.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cumsum_bounded(NumericVector x, int upper, int lower) {
NumericVector res(x.size());
double acc = 0;
for (int i=0; i < x.size(); ++i) {
acc += x[i];
if (acc < lower) acc = lower;
else if (acc > upper) acc = upper;
res[i] = acc;
}
return res;
}
What I would like:
#include <Rcpp.h>
#include <float.h>
#include <cmath>
#include <climits> //for LLONG_MIN and LLONG_MAX
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cumsum_bounded(NumericVector x, long long int upper = LLONG_MAX, long long int lower = LLONG_MIN) {
NumericVector res(x.size());
double acc = 0;
for (int i=0; i < x.size(); ++i) {
acc += x[i];
if (acc < lower) acc = lower;
else if (acc > upper) acc = upper;
res[i] = acc;
}
return res;
}
In short, yes, it's possible, but it requires some finesse: either creating an intermediary function or embedding sorting logic within the main function.
In long, Rcpp attributes only support a limited set of default values. These values are listed in the Rcpp FAQ entry 3.12:
String literals delimited by quotes (e.g. "foo")
Integer and Decimal numeric values (e.g. 10 or 4.5)
Pre-defined constants including:
Booleans: true and false
Null Values: R_NilValue, NA_STRING, NA_INTEGER, NA_REAL, and NA_LOGICAL.
Selected vector types, which can be instantiated using the empty form of the ::create static member function: CharacterVector, IntegerVector, and NumericVector.
Matrix types instantiated using the rows, cols constructor Rcpp::Matrix n(rows, cols): CharacterMatrix, IntegerMatrix, and NumericMatrix.
If you were to specify numerical values for LLONG_MAX and LLONG_MIN, this would meet the criteria to directly use Rcpp attributes on the function. However, these values are implementation-specific, so it would not be ideal to hardcode them. Thus, we have to seek an outside solution: the Rcpp::Nullable<T> class, which enables a default NULL value. The reason we have to wrap the parameter type with Rcpp::Nullable<T> is that NULL is a very special value and can cause heartache if handled carelessly.
Unlike values on the real number line, NULL will never be used to bound your values, which makes it the perfect candidate for a default in the function call. There are two choices you then have to make: use Rcpp::Nullable<T> as the parameters on the main function, or create a "logic" helper function that has the correct parameters and can be used elsewhere within your application without worry. I've opted for the latter below.
#include <Rcpp.h>
#include <float.h>
#include <cmath>
#include <climits> //for LLONG_MIN and LLONG_MAX
using namespace Rcpp;
NumericVector cumsum_bounded_logic(NumericVector x,
long long int upper = LLONG_MAX,
long long int lower = LLONG_MIN) {
NumericVector res(x.size());
double acc = 0;
for (int i=0; i < x.size(); ++i) {
acc += x[i];
if (acc < lower) acc = lower;
else if (acc > upper) acc = upper;
res[i] = acc;
}
return res;
}
// [[Rcpp::export]]
NumericVector cumsum_bounded(NumericVector x,
Rcpp::Nullable<long long int> upper = R_NilValue,
Rcpp::Nullable<long long int> lower = R_NilValue) {
if(upper.isNotNull() && lower.isNotNull()){
return cumsum_bounded_logic(x, Rcpp::as< long long int >(upper), Rcpp::as< long long int >(lower));
} else if(upper.isNull() && lower.isNotNull()){
return cumsum_bounded_logic(x, LLONG_MAX, Rcpp::as< long long int >(lower));
} else if(upper.isNotNull() && lower.isNull()) {
return cumsum_bounded_logic(x, Rcpp::as< long long int >(upper), LLONG_MIN);
} else {
return cumsum_bounded_logic(x, LLONG_MAX, LLONG_MIN);
}
// Required to quiet compiler
return x;
}
Test Output
cumsum_bounded(a, 5)
## [1] 1 2 3 4 5 5 5
cumsum_bounded(a, 5, 2)
## [1] 2 3 4 5 5 5 5
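As a side note, if the bounds are taken as double rather than long long, positive and negative infinity can act as "no bound" sentinels, which removes the need for four branches in the logic function. The sketch below is plain standard C++ and only illustrates that idea; it is an assumption that such a function would sit behind the exported wrapper, since this kind of default expression is not in the list of defaults that Rcpp attributes supports.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Cumulative sum clamped to [lower, upper]; infinite defaults mean "unbounded".
std::vector<double> cumsum_bounded(
        const std::vector<double>& x,
        double upper = std::numeric_limits<double>::infinity(),
        double lower = -std::numeric_limits<double>::infinity()) {
    std::vector<double> res(x.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += x[i];
        if (acc < lower) acc = lower;        // never true when lower is -inf
        else if (acc > upper) acc = upper;   // never true when upper is +inf
        res[i] = acc;
    }
    return res;
}
```

Called on seven ones with an upper bound of 5, or with bounds 5 and 2, this reproduces the two results shown in the test output above.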

Rcpp gamma integral

I am trying to rewrite in Rcpp an original R function that makes use of the gamma function (with double input). The original source is below. When compiling with sourceCpp, the following error is raised: "no matching function for call to 'gamma(Rcpp::traits::storage_type(<14>:.type)'"
The gamma function should have been included in Rcpp sugar (like mean, which is used below), so I expected it to be easy to call.
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// original R function
// function (y_pred, y_true)
// {
// eps <- 1e-15
// y_pred <- pmax(y_pred, eps)
// Poisson_LogLoss <- mean(log(gamma(y_true + 1)) + y_pred -
// log(y_pred) * y_true)
// return(Poisson_LogLoss)
// }
// [[Rcpp::export]]
double poissonLogLoss(NumericVector predicted, NumericVector actual) {
NumericVector temp, y_pred_new;
double out;
const double eps=1e-15;
y_pred_new=pmax(predicted,eps);
long n = predicted.size();
for (long i = 0; i < n; ++i) {
temp[i] = log( gamma(actual[i]+1)+y_pred_new[i]-log(y_pred_new[i])*actual[i]);
}
out=mean(temp); // using sugar implementation
return out;
}
You are making this too complicated, as the point of Rcpp Sugar is to work vectorized. So the following compiles as well:
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// [[Rcpp::export]]
double poissonLogLoss(NumericVector predicted, NumericVector actual) {
NumericVector temp, y_pred_new;
double out;
const double eps=1e-15;
y_pred_new=pmax(predicted,eps);
temp = log(gamma(actual + 1)) + y_pred_new - log(y_pred_new)*actual;
out=mean(temp); // using sugar implementation
return out;
}
Now, you didn't supply any test data so I do not know if this computes correctly or not. Also, because your R expression is already vectorized, this will not be much faster.
Lastly, your compile error is likely due to the Sugar function gamma() expecting an Rcpp object whereas you provided a double.
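If an element-wise loop is preferred after all, the scalar functions from the C++ standard library work on plain doubles. The following is a minimal sketch in standard C++ without Rcpp, where poisson_log_loss is a hypothetical name and both inputs are assumed to have the same non-zero length. It swaps log(gamma(y + 1)) for std::lgamma(y + 1), which is mathematically the same for these arguments but does not overflow for large counts.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean Poisson log loss, following the R expression in the question:
// mean(log(gamma(y_true + 1)) + y_pred - log(y_pred) * y_true)
double poisson_log_loss(const std::vector<double>& predicted,
                        const std::vector<double>& actual) {
    const double eps = 1e-15;
    double total = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double p = std::max(predicted[i], eps);  // same role as pmax()
        total += std::lgamma(actual[i] + 1.0) + p - std::log(p) * actual[i];
    }
    return total / static_cast<double>(predicted.size());
}
```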

Rcpp memory management

I am trying to convert some character data to numeric as below. The data will come with special characters, so I have to strip them out. I convert the data to std::string to search for the special characters. Does this create a new variable in memory? I want to know if there is a better way to do it.
NumericVector converter_ra_(Rcpp::RObject x){
if(x.sexp_type() == STRSXP){
CharacterVector y(x);
NumericVector resultado(y.size());
for(unsigned int i = 0; i < y.size(); i++){
std::string ra_string = Rcpp::as<std::string>(y[i]);
//std::cout << ra_string << std::endl;
double t = 0;
int base = 0;
for(int j = (int)ra_string.size(); j >= 0; j--){
if(ra_string[j] >= 48 && ra_string[j] <= 57){
t += ((ra_string[j] - '0') * base_m[base]);
base++;
}
}
//std::cout << t << std::endl;
resultado[i] = t;
}
return resultado;
}else if(x.sexp_type() == REALSXP){
return NumericVector(x);
}
return NumericVector();
}
Does it create a new variable in memory?
If the input object actually is a numeric vector (REALSXP) and you are simply returning, e.g. as<NumericVector>(input), then no additional variables are created. In any other case new memory will, of course, need to be allocated for the returned object. For example,
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector demo(RObject x) {
if (x.sexp_type() == REALSXP) {
return as<NumericVector>(x);
}
return NumericVector::create();
}
/*** R
y <- rnorm(3)
z <- letters[1:3]
data.table::address(y)
# [1] "0x6828398"
data.table::address(demo(y))
# [1] "0x6828398"
data.table::address(z)
# [1] "0x68286f8"
data.table::address(demo(z))
# [1] "0x5c7eea0"
*/
I want to know if there is a better way to do it.
First you need to define "better":
Faster?
Uses less memory?
Fewer lines of code?
More idiomatic?
Personally, I would start with the last definition since it often entails one or more of the others. For example, in this approach we
Define a function object Predicate that relies on the standard library function isdigit rather than trying to implement this locally
Define another function object that uses the erase-remove idiom to eliminate characters as determined by Predicate and, if anything remains, uses std::atof to convert it into a double (again, instead of trying to implement this ourselves)
Uses an Rcpp idiom -- the as converter -- to convert the STRSXP to a std::vector<std::string>
Calls std::transform to convert this into the result vector
#include <Rcpp.h>
using namespace Rcpp;
struct Predicate {
bool operator()(char c) const
{ return !(c == '.' || std::isdigit(static_cast<unsigned char>(c))); } // cast avoids undefined behavior for negative char values
};
struct Converter {
double operator()(std::string s) const {
s.erase(
std::remove_if(s.begin(), s.end(), Predicate()),
s.end()
);
return s.empty() ? NA_REAL : std::atof(s.c_str());
}
};
// [[Rcpp::export]]
NumericVector convert(RObject obj) {
if (obj.sexp_type() == REALSXP) {
return as<NumericVector>(obj);
}
if (obj.sexp_type() != STRSXP) {
return NumericVector::create();
}
std::vector<std::string> x = as<std::vector<std::string> >(obj);
NumericVector res(x.size(), NA_REAL);
std::transform(x.begin(), x.end(), res.begin(), Converter());
return res;
}
Testing this for minimal functionality,
x <- c("123 4", "abc 1567.35 def", "abcdef", "")
convert(x)
# [1] 1234.00 1567.35 NA NA
(y <- rnorm(3))
# [1] 1.04201552 -0.08965042 -0.88236960
convert(y)
# [1] 1.04201552 -0.08965042 -0.88236960
convert(list())
# numeric(0)
Will this be as performant as something hand-written by a seasoned C or C++ programmer? Almost certainly not. However, since we used library functions and common idioms, it is reasonably concise, likely to be bug-free, and the intention is fairly evident even at a quick glance. If you need something faster then there are probably a handful of optimizations to be made, but there's no need to begin on that premise without benchmarking and profiling first.

is it possible to return two vectors from a function?

I'm trying to do a merge sort in C++ on a vector called x, which contains x coordinates. As the merge sort sorts the x coordinates, it's supposed to move the corresponding elements in a vector called y, containing the y coordinates. The only problem is that I don't know how to (or if I can) return both resulting vectors from the merge function.
Alternatively, if it's easier to implement, I could use a slower sort method.
No, you cannot return two results from a function the way this example tries to:
vector<int>, vector<int> merge_sort();
What you can do is pass the two vectors by reference to a function, so that the merge sort modifies both in place, e.g.
void merge_sort(vector<int>& x, vector<int>& y);
Ultimately, you can do what @JoshD mentioned and create a struct called Point and merge sort a vector of Point structs instead.
Try something like this:
struct Point {
int x;
int y;
bool operator<(const Point &rhs) const {return x < rhs.x;}
};
vector<Point> my_points;
mergesort(my_points);
Or if you want to sort Points with equal x value by the y coordinate:
bool operator<(const Point &rhs) const {return (x < rhs.x || (x == rhs.x && y < rhs.y));}
Also, I thought I'd add: if you ever really need to, you can always return a std::pair. A better choice is usually to return through the function parameters.
Yes, you can return a tuple, then use structured binding (since C++17).
Here's a full example:
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <string>
#include <tuple>
#include <vector>
using namespace std::string_literals;
auto twoVectors() -> std::tuple<std::vector<int>, std::vector<int>>
{
const std::vector<int> a = { 1, 2, 3 };
const std::vector<int> b = { 4, 5, 6 };
return { a, b };
}
auto main() -> int
{
auto [a, b] = twoVectors();
auto const sum = std::accumulate(a.begin(), a.end(), std::accumulate(b.begin(), b.end(), 0));
std::cout << "sum: "s << sum << std::endl;
return EXIT_SUCCESS;
}
You can have a vector of vectors
=> vector<vector<int>> points = {{a, b}, {c, d}};
now you can return points.
Returning vectors by value can be slow if they end up being copied (before C++11, or when copy elision and move semantics do not apply). Have a look at this implementation, for example.
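Putting the std::pair suggestion together with the original goal, the sketch below sorts x and applies the same reordering to y, then returns both vectors. It is a minimal illustration in standard C++11, not a hand-written merge sort: sort_by_x is a hypothetical name, x and y are assumed to have equal length, and std::stable_sort (commonly implemented as a merge sort) does the sorting on an index vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Sorts x in ascending order and moves the elements of y along with it.
// Assumes x.size() == y.size().
std::pair<std::vector<int>, std::vector<int>>
sort_by_x(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<std::size_t> order(x.size());
    std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });

    std::vector<int> xs, ys;
    xs.reserve(x.size());
    ys.reserve(y.size());
    for (std::size_t i : order) {
        xs.push_back(x[i]);
        ys.push_back(y[i]);
    }
    // The vectors are moved into the pair, so no element-wise copy happens here.
    return {std::move(xs), std::move(ys)};
}
```

The caller can read the results as result.first and result.second, or unpack them with structured bindings when C++17 is available.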

Resources