Dynamically increase size of list in Rcpp

I am trying to implement a "coupling to the past" algorithm in Rcpp. For this I need to store a matrix of random numbers, and if the algorithm did not converge create a new matrix of random numbers and store that as well. This might have to be done 10+ times or something until convergence.
I was hoping I could use a List and update it dynamically, much as I would in R. I was actually surprised that it worked at all, but I get errors whenever the list becomes large. This seems to make sense, as I did not allocate memory for the additional list elements, although I am not that familiar with C++ and am not sure whether that is the problem.
Here is an example of what I tried. Be aware, however, that this will probably crash your R session:
library("Rcpp")
cppFunction(
includes = '
NumericMatrix RandMat(int nrow, int ncol)
{
int N = nrow * ncol;
NumericMatrix Res(nrow,ncol);
NumericVector Rands = runif(N);
for (int i = 0; i < N; i++)
{
Res[i] = Rands[i];
}
return(Res);
}',
code = '
void foo()
{
// This is the relevant part, I create a list then update it and print the results:
List x;
for (int i=0; i<10; i++)
{
x[i] = RandMat(100,10);
Rf_PrintValue(wrap(x[i]));
}
}
')
foo()
Does anyone know a way to do this without crashing R? I guess I could initialize the list with a fixed number of elements here, but in my application the number of elements is random.

You have to "allocate" enough space for your list. Maybe you can use something like a resize function:
List resize( const List& x, int n ){
int oldsize = x.size() ;
List y(n) ;
for( int i=0; i<oldsize; i++) y[i] = x[i] ;
return y ;
}
and whenever you want your list to be bigger than it is now, you can do:
x = resize( x, n ) ;
Your initial list is of size 0, so it is expected that you get unpredictable behavior at the first iteration of your loop.
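As an alternative to resizing by hand, here is a minimal sketch of the "grow as you go" idea in plain C++, using std::vector and push_back so that the final number of elements need not be known up front. Matrix, make_matrix and collect are hypothetical names invented for this illustration, and the deterministic fill stands in for the random draws. In Rcpp the same pattern should carry over to a std::vector of NumericMatrix that is converted to a List once at the end, but that part is an assumption and is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an R matrix: values stored column-major.
struct Matrix {
    std::size_t nrow, ncol;
    std::vector<double> values;
};

// Hypothetical generator: fills the matrix with a deterministic sequence
// (a real implementation would draw random numbers instead).
Matrix make_matrix(std::size_t nrow, std::size_t ncol, double start) {
    Matrix m{nrow, ncol, std::vector<double>(nrow * ncol)};
    for (std::size_t i = 0; i < m.values.size(); ++i) m.values[i] = start + i;
    return m;
}

// Grow the collection one matrix at a time. push_back allocates the new
// slot itself, so nothing is ever written past the end of the container.
std::vector<Matrix> collect(std::size_t rounds) {
    std::vector<Matrix> store;
    for (std::size_t r = 0; r < rounds; ++r) {
        store.push_back(make_matrix(100, 10, static_cast<double>(r)));
        assert(store.size() == r + 1);  // the element really exists now
    }
    return store;
}
```

The loop condition could just as well be a convergence test; the container does not care how many rounds it takes.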


Is it possible to dynamically load files inside a cppFunction in R?

I'm working on a problem in which I would benefit greatly from being able to load vectors saved on disk dynamically inside a loop, as this lets me skip calculating the vectors on the fly (in my actual process one vector is used many times, and the collection of vectors as a matrix is too big to hold in memory all at once). As a simplified example, let's say that the vectors are stored in a directory with path prefix (each in its own file). The files are named vec0.txt, vec1.txt, vec2.txt, etc. We wish to sum all the numbers of all specified vectors in the inclusive range start-end. The size of all vectors is known and is always the same. I thought of something like:
library(Rcpp)
cppFunction('int sumvectors(int start, int end, string prefix, int size) {
int i;
int j;
int arr[size];
int sum=0;
for (i=start; i <= end; i++) {
// Here you would construct the path to the file paste0(prefix, vec, i, ".txt")
// Then load it and put it into an array
for (j=0; j <= size; j++) {
sum+=arr[j];
}
}
return sum;
}')
Is something like this even possible? I'm OK at R but have never worked with C or C++, so I don't really know whether this is doable with Rcpp.
Yes, this is certainly possible. If your numbers are written in plain text files separated by spaces like this:
C:/Users/Administrator/vec1.txt
5.1 21.4 563 -21.2 35.6
C:/Users/Administrator/vec2.txt
3 6 8 7 10 135
Then you can write the following function:
cppFunction("
std::vector<float> read_floats(const std::string& path)
{
std::vector<float> result;
for(int i = 1; i < 3; ++i)
{
std::string file_path = path + std::to_string(i) + \".txt\";
std::ifstream myfile(file_path.c_str(), std::ios_base::in);
float a, vec_sum = 0;
std::vector<float> vec;
while(myfile >> a)
{
vec.push_back(a);
}
for(std::vector<float>::iterator it = vec.begin(); it != vec.end(); ++it)
{
vec_sum += *it;
}
result.push_back(vec_sum);
}
return result;
}", include = c("#include<string>", "#include<fstream>", "#include<vector>"))
Which creates an R function that allows you to do this:
read_floats("c:/Users/Administrator/vec")
#> [1] 603.9 169.0
Which you can confirm is the sum of the numbers in each file.
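For the start-end range in the question, the same approach can be written with the range as parameters instead of the hard-coded 1 and 2. The following is a sketch in plain C++ rather than a drop-in cppFunction body; sum_file and sum_files are hypothetical names. Unlike the function above, it throws when a file cannot be opened instead of silently reporting a sum of 0.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Sum all whitespace-separated numbers in one text file.
double sum_file(const std::string& path) {
    std::ifstream in(path.c_str());
    if (!in) throw std::runtime_error("cannot open " + path);
    double value = 0.0, total = 0.0;
    while (in >> value) total += value;  // stops at end of file or bad token
    return total;
}

// One total per file for prefix<start>.txt ... prefix<end>.txt (inclusive).
std::vector<double> sum_files(const std::string& prefix, int start, int end) {
    assert(start <= end);
    std::vector<double> totals;
    for (int i = start; i <= end; ++i)
        totals.push_back(sum_file(prefix + std::to_string(i) + ".txt"));
    return totals;
}
```

Only one file is held open at a time and no vector is kept in memory, which matches the constraint that the full matrix does not fit in RAM.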

C code with openmp called from R gives inconsistent results

Below is a piece of C code run from R used to compare each row of a matrix to a vector. The number of identical values is stored in the first column of a two-column matrix.
I know it can easily be done in R (as done to check the results), but this is a first step for a more complex use case.
When openmp is not used, it works OK. When openmp is used, it gives correlated (0.99) but inconsistent results.
Question1: What am I doing wrong?
Question2: I use a double for loop to fill the output matrix (ret) with zeros. What would be a better solution?
Also, inconsistencies were observed when the code was used in a package. I tried to make the code reproducible using inline, but it does not recognize the openmp statements (I tried to include 'omp.h', in the parameters of cfunction, ...).
Question3: How can we make this code work with inline?
I'm (too?) far outside my comfort zone on this topic.
library(inline)
compare <- cfunction(c(x = "integer", vec = "integer"), "
const int I = nrows(x), J = ncols(x);
SEXP ret;
PROTECT(ret = allocMatrix(INTSXP, I, 2));
int *ptx = INTEGER(x), *ptvec = INTEGER(vec), *ptret = INTEGER(ret);
for (int i=0; i<I; i++)
for (int j=0; j<2; j++)
ptret[j * I + i] = 0;
int i, j;
#pragma omp parallel for default(none) shared(ptx, ptvec, ptret) private(i,j)
for (j=0; j<J; j++)
for (i=0; i<I; i++)
if (ptx[i + I * j] == ptvec[j]) {++ptret[i];}
UNPROTECT(1);
return ret;
")
N = 3e3
M = 1e4
m = matrix(sample(c(-1:1), N*M, replace = TRUE), nc = M)
v = sample(-1:1, M, replace = TRUE)
cc = compare(m, v)
cr = rowSums(t(t(m) == v))
all.equal(cc[,1], cr)
Thanks to the comments above, I reconsidered the data race issue.
IIUC, my loop was parallelized on j (the columns). Each thread then had its own private i (the rows), but different threads could hold the same value of i at the same time and so try to increment ptret[i] simultaneously.
To avoid this, I now loop on i first, so that only a single thread increments each row.
Then, I realized that I could move the zero-initialization of ptret within the first loop.
It seems to work. I get identical results, increased CPU usage, and 3-4x speedup on my laptop.
I guess that solves questions 1 and 2. I will have a closer look at the inline/openmp problem.
Code below, fwiw.
#include <omp.h>
#include <R.h>
#include <Rinternals.h>
#include <stdio.h>
SEXP c_compare(SEXP x, SEXP vec)
{
const int I = nrows(x), J = ncols(x);
SEXP ret;
PROTECT(ret = allocMatrix(INTSXP, I, 2));
int *ptx = INTEGER(x), *ptvec = INTEGER(vec), *ptret = INTEGER(ret);
int i, j;
#pragma omp parallel for default(none) shared(ptx, ptvec, ptret) private(i, j)
for (i = 0; i < I; i++) {
// init ptret to zero
ptret[i] = 0;
ptret[I + i] = 0;
for (j = 0; j < J; j++)
if (ptx[i + I * j] == ptvec[j]) {
++ptret[i];
}
}
UNPROTECT(1);
return ret;
}
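For anyone who wants to try the row-wise idea outside R, here is a self-contained sketch of the same loop structure on std::vector. count_matches is a hypothetical name, and the thread-local hits counter is a small variation on the code above rather than something taken from it. Without OpenMP enabled at compile time the pragma is simply ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each row of a column-major nrow x ncol matrix, count how many
// entries equal the corresponding entry of vec.
std::vector<int> count_matches(const std::vector<int>& x,
                               const std::vector<int>& vec,
                               int nrow) {
    assert(nrow > 0 && x.size() % nrow == 0);
    const int ncol = static_cast<int>(x.size() / nrow);
    assert(vec.size() == static_cast<std::size_t>(ncol));
    std::vector<int> counts(nrow, 0);
    const int* px = x.data();
    const int* pv = vec.data();
    int* pc = counts.data();
    // Parallelise over rows: each iteration writes only pc[i] for its own i,
    // so no two threads ever touch the same output element.
    #pragma omp parallel for
    for (int i = 0; i < nrow; ++i) {
        int hits = 0;  // local to the iteration, hence private to the thread
        for (int j = 0; j < ncol; ++j)
            if (px[i + nrow * j] == pv[j]) ++hits;
        pc[i] = hits;
    }
    return counts;
}
```

Parallelising over j instead would need an atomic update or a reduction on the counts, because several columns contribute to the same row.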

Rcpp function complaining about uninitialized variables

In a very first attempt at creating a C++ function which can be called from R using Rcpp, I have a simple function to compute a minimum spanning tree from a distance matrix using Prim's algorithm. This function has been converted into C++ from a former version in ANSI C (which works fine).
Here it is:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame primlm(const int n, NumericMatrix d)
{
double const din = 9999999.e0;
long int i1, nc, nc1;
double dlarge, dtot;
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
for (int i=2; i <= n; i++) {
is(i) = 0;
}
for (int i=2; i <= n; i++) {
dlarge = din;
i1 = i - 1;
for (int j=1; j <= i1; j++) {
for (int k=1; k <= n; k++) {
if (l(j) == k)
continue;
if (d[l(j), k] > dlarge)
continue;
if (is(k) == 1)
continue;
nc = k;
nc1 = l(j);
dlarge = d(nc1, nc);
}
}
is(nc) = 1;
l(i) = nc;
lp(i) = nc1;
dist(i) = dlarge;
}
dtot = 0.e0;
for (int i=2; i <= n; i++){
dtot += dist(i);
}
return DataFrame::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
When I compile this function using Rcpp under RStudio, I get two warnings, complaining that variables 'nc' and 'nc1' have not been initialized. Frankly, I could not understand that, as it seems to me that both variables are being initialized inside the third loop. Also, why is there no similar complaint about variable 'i1'?
Perhaps it comes as no surprise that, when attempting to call this function from R, using the below code, what I get is a crash of the R system!
# Read test data
df <- read.csv("zygo.csv", header=TRUE)
lonlat <- data.frame(df$Longitude, df$Latitude)
colnames(lonlat) <- c("lon", "lat")
# Compute distance matrix using geosphere library
library(geosphere)
d <- distm(lonlat, lonlat, fun=distVincentyEllipsoid)
# Calls Prim minimum spanning tree routine via Rcpp
library(Rcpp)
sourceCpp("Prim.cpp")
n <- nrow(df)
p <- primlm(n, d)
Here is the dataset I use for testing purposes:
"Scientific name",Locality,Longitude,Latitude
Zygodontmys,Bush Bush Forest,-61.05,10.4
Zygodontmys,Cerro Azul,-79.4333333333,9.15
Zygodontmys,Dividive,-70.6666666667,9.53333333333
Zygodontmys,Hato El Frio,-63.1166666667,7.91666666667
Zygodontmys,Finca Vuelta Larga,-63.1166666667,10.55
Zygodontmys,Isla Cebaco,-81.1833333333,7.51666666667
Zygodontmys,Kayserberg Airstrip,-56.4833333333,3.1
Zygodontmys,Limao,-60.5,3.93333333333
Zygodontmys,Montijo Bay,-81.0166666667,7.66666666667
Zygodontmys,Parcela 200,-67.4333333333,8.93333333333
Zygodontmys,Rio Chico,-65.9666666667,10.3166666667
Zygodontmys,San Miguel Island,-78.9333333333,8.38333333333
Zygodontmys,Tukuko,-72.8666666667,9.83333333333
Zygodontmys,Urama,-68.4,10.6166666667
Zygodontmys,Valledup,-72.9833333333,10.6166666667
Could anyone give me a hint?
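The assignments to nc and nc1 are skipped whenever one of the three if statements is true, so if that happens on every pass through the inner loops they are never initialized. This might not be possible with your data, but the compiler has no way of knowing that. i1, by contrast, is assigned unconditionally at the top of the loop, so there is no warning for it.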
However, this is not the reason for the crash. When I run your code I get:
Index out of bounds: [index=1; extent=0].
This comes from here:
NumericVector is, l, lp, dist;
l(1) = 1;
is(1) = 1;
When declaring a NumericVector you have to tell the required size if you want to assign values by index. In your case
NumericVector is(n), l(n), lp(n), dist(n);
might work. You have to analyze the C code carefully w.r.t. memory allocation and array boundaries.
Alternatively you could use the C code as is and use Rcpp to build a wrapper function, e.g.
#include <Rcpp.h>
using namespace Rcpp;
// One possibility for the function signature ...
double prim(const int n, double *d, double *l, double *lp, double *dist) {
....
}
// [[Rcpp::export]]
List primlm(NumericMatrix d) {
int n = d.nrow();
// The size must be known at run time, so use NumericVector rather than std::array
NumericVector l(n); // adjust size as needed!
NumericVector lp(n); // adjust size as needed!
NumericVector dist(n); // adjust size as needed!
double dtot = prim(n, d.begin(), l.begin(), lp.begin(), dist.begin());
return List::create(Named("l") = l,
Named("lp") = lp,
Named("dist") = dist,
Named("dtot") = dtot);
}
Notes:
I am returning a List instead of a DataFrame since dtot is a scalar value.
The above code is meant to illustrate the idea. Most likely it will not work without adjustments!
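To make the two points above concrete (size every vector before indexing into it, and give nc and nc1 an initial value), here is a sketch of Prim's algorithm in plain C++ with 0-based indices. It is a rewrite for illustration, not a port of your ANSI C routine; MstResult and the member names are hypothetical, and the distance matrix is assumed to be dense, symmetric and stored row-major.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct MstResult {
    std::vector<int> node;     // node added at each step (0-based)
    std::vector<int> parent;   // tree node it was attached to (-1 for the root)
    std::vector<double> dist;  // length of the connecting edge
    double total;              // total tree length
};

// Prim's algorithm on an n x n distance matrix d, stored row-major.
MstResult prim(const std::vector<double>& d, int n) {
    assert(n > 0 && d.size() == static_cast<std::size_t>(n) * n);
    // All outputs are sized up front, so indexing below stays in bounds.
    MstResult r{std::vector<int>(n, 0), std::vector<int>(n, -1),
                std::vector<double>(n, 0.0), 0.0};
    std::vector<bool> in_tree(n, false);
    in_tree[0] = true;  // start from node 0
    for (int step = 1; step < n; ++step) {
        int best = -1, from = -1;  // initialised: no compiler warning
        double shortest = std::numeric_limits<double>::infinity();
        for (int s = 0; s < step; ++s) {
            const int u = r.node[s];
            for (int k = 0; k < n; ++k) {
                if (in_tree[k]) continue;
                if (d[u * n + k] < shortest) {
                    shortest = d[u * n + k];
                    best = k;
                    from = u;
                }
            }
        }
        assert(best >= 0);  // holds as long as all distances are finite
        in_tree[best] = true;
        r.node[step] = best;
        r.parent[step] = from;
        r.dist[step] = shortest;
        r.total += shortest;
    }
    return r;
}
```

With Rcpp the four result members would map naturally onto the l, lp, dist and dtot entries of the returned List, after adding 1 to the indices if R-style numbering is wanted.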

Create a list of RcppArmadillo matrices

Deep inside an MCMC algorithm I need to multiply a user-provided list of matrices with a vector, i.e., the following piece of Rcpp and RcppArmadillo code is called multiple times per MCMC iteration:
List mat_vec1 (const List& Mats, const vec& y) {
int n_list = Mats.size();
Rcpp::List out(n_list);
for (int i = 0; i < n_list; ++i) {
out[i] = as<mat>(Mats[i]) * y;
}
return(out);
}
The user-provided list Mats remains fixed during the MCMC, vector y changes in each iteration. Efficiency is paramount and I'm trying to see if I can speed up the code by not having to convert the elements of Mats to arma::mat that many times (it needs to be done only once). I tried the following approach
List arma_Mats (const List& Mats) {
int n_list = Mats.size();
Rcpp::List res(n_list);
for (int i = 0; i < n_list; ++i) {
res[i] = as<mat>(Mats[i]);
}
return(res);
}
and then
List mat_vec2 (const List& Mats, const vec& y) {
int n_list = Mats.size();
Rcpp::List aMats = arma_Mats(Mats);
Rcpp::List out(n_list);
for (int i = 0; i < n_list; ++i) {
out[i] = aMats[i] * y;
}
return(out);
}
but this does not seem to work. Any pointers to alternative or better solutions are very welcome.
Ok, I basically wrote the answer in the comment but it then occurred to me that we already provide a working example in the stub created by RcppArmadillo.package.skeleton():
// [[Rcpp::export]]
Rcpp::List rcpparma_bothproducts(const arma::colvec & x) {
arma::mat op = x * x.t();
double ip = arma::as_scalar(x.t() * x);
return Rcpp::List::create(Rcpp::Named("outer")=op,
Rcpp::Named("inner")=ip);
}
This returns a list containing the outer product (a matrix) and the inner product (a scalar) of the given vector.
As for what is fast and what is not: I recommend to not conjecture but rather profile and measure as much as you can. My inclination would be to do more (standalone) C++ code in Armadillo and only return at the very end minimizing conversions.
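To illustrate the "convert once, reuse many times" structure without depending on Armadillo, here is a sketch in plain C++. Mat, multiply and MatVec are hypothetical names made up for this example; in a real RcppArmadillo project the stored type would presumably be arma::mat and the product would be the library's own operator. Whether this is actually faster for your sizes is something only profiling can tell.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical dense matrix type, column-major like R and Armadillo.
struct Mat {
    std::size_t nrow, ncol;
    std::vector<double> v;
};

// Plain matrix-vector product m * y.
std::vector<double> multiply(const Mat& m, const std::vector<double>& y) {
    assert(m.v.size() == m.nrow * m.ncol && m.ncol == y.size());
    std::vector<double> out(m.nrow, 0.0);
    for (std::size_t j = 0; j < m.ncol; ++j)
        for (std::size_t i = 0; i < m.nrow; ++i)
            out[i] += m.v[i + m.nrow * j] * y[j];
    return out;
}

// Holds the matrices in their native form: built once, reused every iteration.
class MatVec {
public:
    explicit MatVec(std::vector<Mat> mats) : mats_(std::move(mats)) {}

    // Called once per iteration with the current y; no conversion happens here.
    std::vector<std::vector<double>> apply(const std::vector<double>& y) const {
        std::vector<std::vector<double>> out;
        out.reserve(mats_.size());
        for (const Mat& m : mats_) out.push_back(multiply(m, y));
        return out;
    }

private:
    std::vector<Mat> mats_;
};
```

The point of the design is that the list-to-matrix conversion lives in the constructor, outside the MCMC loop, and only the final results need to be handed back to R.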

Rcpp Error: incompatible types (from %s to %s)

I have recently begun using the Rcpp package to write some segments of my R code into C++.
Given a matrix of data, I have the following Rcpp function which calculates a kernel reweighted estimate of the covariance for some observation.
cppFunction('
NumericVector get_cov_1obs(NumericMatrix cdata, int ID, float radius){
int nrow = cdata.nrow(), ncol = cdata.ncol();
float norm_ = 0;
float w;
NumericMatrix out(ncol, ncol);
NumericMatrix outer_prod(ncol, ncol);
for (int i=0; i<ncol;i++){
for (int j=0;j<ncol;j++){
out(i,j) = 0;
outer_prod(i,j) = 0;
}
}
for (int i=0; i<nrow;i++){
w = exp( -(i-ID)*(i-ID)/(2*radius));
norm_ += w;
for (int j=0; j<ncol;j++){
for (int k=0;k<ncol;k++){
outer_prod(j,k) = cdata(i,j) * cdata(i,k);
}
}
for (int j=0; j<ncol;j++){
for (int k=0;k<ncol;k++){
out(j,k) += outer_prod(j,k)*w;
}
}
}
for (int i=0; i<ncol;i++){
for (int j=0;j<ncol;j++){
out(i,j) /= norm_;
}
}
return out;
}')
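Written out, the estimate computed by the function above is, as far as I can read it from the loops, the following. The symbols are introduced here only for illustration: x_i is row i of cdata as a column vector, t is ID, r is radius, and n is the number of rows. Note that the rows are not centered, so this is a covariance only if the data already have mean zero.

```latex
w_i = \exp\!\left(-\frac{(i - t)^2}{2r}\right), \qquad
\hat{\Sigma}_t = \frac{\sum_{i=0}^{n-1} w_i \, x_i x_i^{\top}}{\sum_{i=0}^{n-1} w_i}
```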
I would like to quickly estimate the kernel reweighted covariance matrices for all observations in a dataset and store them as an array. Since Rcpp doesn't handle arrays I have written the following R function:
get_kern_cov_C = function(data, radius){
# data is data for which we wish to estimate covariances
# radius is the radius of the gaussian kernel
# calculate covariances:
kern_cov = array(0, c(ncol(data),ncol(data),nrow(data)))
for (i in 1:nrow(data)){
kern_cov[,,i] = get_cov_1obs(cdata=data, ID = i-1, radius=radius)
}
return(kern_cov)
}
This seems to work fine (and much, MUCH faster than R). However, every now and then (seemingly at random) I get an error of the following form:
Error in kern_cov[, , i] = get_cov_1obs(cdata = data, ID = i - 1, radius = radius) :
incompatible types (from X to Y)
where X is either builtin or NULL and Y is double.
I roughly understand why this is happening (I am trying to place a builtin/NULL variable into a double) but I am not sure where in the code the bug is. I suspect this might be something related to memory management as it only occurs every now and again.
You can test for NULL at the C(++) level too, and in this case probably should do that.
As to why it is occurring: I am afraid you will need to debug this.
