How to concatenate Lists in Rcpp

I want to c() 2 lists in Rcpp, but I'm struggling to get the same structure as I would in R.
Here is some simple data + example:
rlist = list(a = "123")
listadd = list(typ = "fdb")
c(rlist, listadd)
which gives me this:
$a
[1] "123"
$typ
[1] "fdb"
With Rcpp, the only thing I found that does more or less what I want is push_back, but the resulting structure is a bit different. I also tried emplace_back based on this reference, but it doesn't seem to be implemented in Rcpp.
cppFunction('
List cLists(List x, List y) {
x.push_back(y);
return(x);
}')
which gives me:
cLists(rlist, listadd)
$a
[1] "123"
[[2]]
[[2]]$typ
[1] "fdb"
Based on this question I know that I could use Language("c",x,y).eval(); to use R's c() function and get the correct result, but that doesn't seem to be the right way.
So I was wondering how can I concatenate lists in Rcpp correctly?
EDIT:
Based on @Dirk's comment, I tried to create a new list and fill it with the elements of the other lists, but then I lose the element names.
cppFunction('
List cLists(List x, List y) {
int nsize = x.size();
int msize = y.size();
List out(nsize + msize);
for(int i = 0; i < nsize; i++) {
out[i] = x[i];
}
for(int i = 0; i < msize; i++) {
out[nsize+i] = y[i];
}
return(out);
}')
Output:
cLists(rlist, listadd)
[[1]]
[1] "123"
[[2]]
[1] "fdb"

The performance hit in your implementation seems to come from copying the names attribute to STL string vectors. You can avoid it like so:
library(Rcpp)
library(microbenchmark)
cppFunction('
List cLists(List x, List y) {
int nsize = x.size();
int msize = y.size();
List out(nsize + msize);
CharacterVector xnames = x.names();
CharacterVector ynames = y.names();
CharacterVector outnames(nsize + msize);
out.attr("names") = outnames;
for(int i = 0; i < nsize; i++) {
out[i] = x[i];
outnames[i] = xnames[i];
}
for(int i = 0; i < msize; i++) {
out[nsize+i] = y[i];
outnames[nsize+i] = ynames[i];
}
return(out);
}')
x <- as.list(runif(1e6)); names(x) <- sample(letters, 1e6, T)
y <- as.list(runif(1e6)); names(y) <- sample(letters, 1e6, T)
microbenchmark(cLists(x,y), c(x,y), times=3)
Unit: milliseconds
expr min lq mean median uq max neval cld
cLists(x, y) 31.70104 31.86375 32.09983 32.02646 32.29922 32.57198 3 a
c(x, y) 47.31037 53.21409 56.41159 59.11781 60.96220 62.80660 3 b
Note: by copying to std::string you also lose any character encoding information, whereas working with R/Rcpp types only preserves it.

This is what I came up with. The output is correct, but unfortunately it is also much less performant than the R version.
library(Rcpp)
cppFunction('
List cLists(List x, List y) {
int nsize = x.size();
int msize = y.size();
List out(nsize + msize);
CharacterVector xnames = x.names();
CharacterVector ynames = y.names();
for(int i = 0; i < nsize; i++) {
out[i] = x[i];
}
for(int i = 0; i < msize; i++) {
out[nsize+i] = y[i];
}
std::vector<std::string> z(x.size() + y.size());
std::copy(xnames.begin(), xnames.end(), z.begin());
std::copy(ynames.begin(), ynames.end(), z.begin() + x.size());
out.attr("names") = z;
return(out);
}')
Output:
cLists(rlist, listadd)
$a
[1] "123"
$typ
[1] "fdb"

Related

RCPP and R discrepancy

I'm new to C++ programming and apologize if my solution is in plain sight. I am attempting to use RCPP to speed up a slow R function. I think I've narrowed down the issue to a nested for loop. I've simplified the function and provided one R and one RCPP version for comparison. Will someone please explain why my RCPP function yields different results? Thanks!
## Data ##
set.seed(666)
input <- rmultinom(10,2,c(.4,.5,.6)) + 1
## R ##
testR <- \(input){
M1 <- matrix(c(0.5,0.4,0.0,0.3,0.5,0.0,0.2,0.1,1.0),3,3)
M2 <- matrix(c(0.75,0.0,0.0,0.0,0.6,0.0,0.25,0.4,1.0),3,3)
Mrows <- nrow(M1)
tmsteps <- ncol(input)
N <- nrow(input)
alphas <- NULL; tmp <- NULL; out <- NULL
for(i in 1:N){
alphas = c(0,-1e6,-1e6)
for(j in 1:tmsteps){
for(k in 1:Mrows){
tmp[k] = sum(alphas + M1[,k] + M2[k, input[i,j] ])
}
alphas <- tmp
}
out[i] <- sum(alphas)
}
sum(out)
}
## RCPP ##
cppFunction('double testRCPP(IntegerMatrix input){
NumericVector v1 = {0.5,0.4,0.0,0.3,0.5,0.0,0.2,0.1,1.0};
v1.attr("dim") = Dimension(3, 3);
NumericMatrix M1 = as<NumericMatrix>(v1);
NumericVector v2 = {0.75,0.0,0.0,0.0,0.6,0.0,0.25,0.4,1.0};
v2.attr("dim") = Dimension(3, 3);
NumericMatrix M2 = as<NumericMatrix>(v2);
int Mrows = M1.nrow();
int tmsteps = input.ncol();
int N = input.nrow();
NumericVector alphas(3);
NumericVector tmp(3);
NumericVector out(N);
for(int i=0; i<N; i++){
alphas = {0,-1e6,-1e6};
for(int j=0; j<tmsteps; j++){
for(int k=0; k<Mrows; k++){
tmp[k] = sum(alphas + M1(_,k) + M2(k, (input(i,j) - 1) ));
}
alphas = tmp;
}
out += alphas;
}
return(sum(out));
}')
> testRCPP(input)
[1] -2.273726e+14
> testR(input)
[1] -354293536945
I have figured out how to get the Rcpp to behave like the R function. I think my issue has to do with C++ variable scoping.
I had previously been initializing the tmp variable outside the nested for loop.
NumericVector tmp(3);
for(int i=0; i<N; i++){
alphas = {0,-1e6,-1e6};
...
All is good when I declare the tmp variable inside the loop, although I don't understand why yet.
for(int i=0; i<N; i++){
alphas = {0,-1e6,-1e6};
for(int j=0; j<tmsteps; j++){
NumericVector tmp(3);
for(int k=0; k<Mrows; k++){
tmp[k] = sum(alphas + M1(_,k) + M2(k, (input(i,j) - 1) ));
}
alphas = tmp;
}
...
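One likely explanation, offered here as an assumption rather than something confirmed in the thread: an Rcpp vector is a lightweight handle to an R object, so alphas = tmp; makes both names refer to the same memory instead of copying the values. Every later write to tmp[k] then also changes alphas while the inner loop is still reading it. The plain C++ sketch below mimics this with std::shared_ptr. The names Handle, step, run_shared and run_fresh are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an Rcpp vector: a handle that shares its storage.
using Handle = std::shared_ptr<std::vector<double>>;

// One update step: tmp[k] = sum(alphas) + k, written in place into tmp.
void step(const Handle& alphas, const Handle& tmp) {
    for (std::size_t k = 0; k < tmp->size(); ++k) {
        double s = 0.0;
        for (double a : *alphas) s += a;
        (*tmp)[k] = s + static_cast<double>(k);
    }
}

// tmp is created once, outside the loop. After the first "alphas = tmp"
// both handles point at the same storage, so writes to tmp change alphas.
std::vector<double> run_shared(int steps) {
    Handle alphas = std::make_shared<std::vector<double>>(3, 1.0);
    Handle tmp = std::make_shared<std::vector<double>>(3, 0.0);
    for (int j = 0; j < steps; ++j) {
        step(alphas, tmp);
        alphas = tmp;  // copies the handle, not the values
    }
    return *alphas;
}

// tmp is created fresh on every iteration, so alphas keeps the old values
// while the new ones are being computed.
std::vector<double> run_fresh(int steps) {
    Handle alphas = std::make_shared<std::vector<double>>(3, 1.0);
    for (int j = 0; j < steps; ++j) {
        Handle tmp = std::make_shared<std::vector<double>>(3, 0.0);
        step(alphas, tmp);
        alphas = tmp;
    }
    return *alphas;
}
```

With two steps, run_fresh gives 12, 13, 14 while run_shared gives 12, 22, 41, because the second pass reads values it has just overwritten. In Rcpp itself, an alternative to moving the declaration is to request a deep copy, for example alphas = clone(tmp);.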

Allocate Rcpp List of n NumericMatrix

Is there a way to allocate an Rcpp List of length n, where each element of the List will be filled with a NumericMatrix, but the size of each NumericMatrix can change?
I have an idea for doing this using std::list and push_back(), but the size of the list may be quite large and I want to avoid the overhead of creating an extra copy of the list when I return from the function.
The below R code gives an idea of what I hope to do:
myvec = function(n) {
x = vector("list", n)
for (i in seq_len(n)) {
nc = sample(1:3, 1)
nr = sample(1:3, 1)
x[[i]] = matrix(rbinom(nc * nr, size = 1, prob = 0.5),
nrow = nr, ncol = nc)
}
x
}
This could result in something like:
> myvec(2)
[[1]]
[,1]
[1,] 0
[2,] 1
[[2]]
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 0 1 1
Update: based on the comments of @Dirk and @Ralf, I created functions based on Rcpp::List and std::list with a wrap at the end. Speed comparisons don't seem to favor one version over the other, but perhaps there's an inefficiency I'm not aware of.
src = '
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List myvec(int n) {
Rcpp::RNGScope rngScope;
Rcpp::List x(n);
// Rcpp::IntegerVector choices = {1, 2 ,3};
Rcpp::IntegerVector choices = Rcpp::seq_len(50);
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x(i) = Rcpp::NumericMatrix(nc, nr, entries.begin());
}
return x;
}
// [[Rcpp::export]]
Rcpp::List myvec2(int n) {
Rcpp::RNGScope scope;
std::list< Rcpp::NumericMatrix > x;
// Rcpp::IntegerVector choices = {1, 2 ,3};
Rcpp::IntegerVector choices = Rcpp::seq_len(50);
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x.push_back( Rcpp::NumericMatrix(nc, nr, entries.begin()));
}
return Rcpp::wrap(x);
}
'
sourceCpp(code = src)
Resulting benchmarks on my computer are:
> library(microbenchmark)
> rcpp_list = function() {
+ set.seed(10);myvec(105)
+ }
> std_list = function() {
+ set.seed(10);myvec2(105)
+ }
> microbenchmark(rcpp_list(), std_list(), times = 1000)
Unit: milliseconds
expr min lq mean median uq
rcpp_list() 1.8901 1.92535 2.205286 1.96640 2.22380
std_list() 1.9164 1.95570 2.224941 2.00555 2.32315
max neval cld
7.1569 1000 a
7.1194 1000 a
The fundamental issue is that Rcpp objects are R objects governed by R's memory management, where resizing is expensive because it makes full copies.
So when I have tasks similar to yours, where sizes may change or are unknown, I often work with different data structures (the STL gives us plenty) and only convert to R(cpp) at the return step at the end.
The devil is in the details here (as always). Profile, experiment, ...
Edit: And in the narrower sense of "can we return a List of NumericMatrix objects with varying sizes" the answer is of course we can because that is what List objects do. You can also insert other types.
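As a rough illustration of the "grow an STL container, convert once at the end" idea, here is a plain C++ sketch without any Rcpp types. The Mat struct and make_matrices function are hypothetical names, and the sketch uses the standard library's random number generators instead of R's, so it does not reproduce R's random streams.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical plain matrix type: dimensions plus column-major values.
struct Mat {
    int nrow;
    int ncol;
    std::vector<double> values;
};

// Build n matrices with 1 to 3 rows and columns and random 0/1 entries.
std::vector<Mat> make_matrices(int n, unsigned seed) {
    std::vector<Mat> out;
    if (n <= 0) return out;

    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dim(1, 3);
    std::bernoulli_distribution coin(0.5);

    out.reserve(static_cast<std::size_t>(n));  // avoid repeated reallocation
    for (int i = 0; i < n; ++i) {
        Mat m;
        m.nrow = dim(gen);
        m.ncol = dim(gen);
        m.values.resize(static_cast<std::size_t>(m.nrow) * m.ncol);
        for (double& v : m.values) v = coin(gen) ? 1.0 : 0.0;
        out.push_back(std::move(m));  // move, so the buffer is not copied
    }
    // In an Rcpp function, this is the point where the container would be
    // converted to an R list, once, just before returning.
    return out;
}
```

When the final length is known in advance, as it is here, allocating an Rcpp::List of that length up front avoids resizing as well, which may be why the two benchmarked versions perform about the same.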
As Dirk said, it is of course possible to create a list with matrices of different sizes. To make it a bit more concrete, here is a translation of your R function:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
Rcpp::List myvec(int n) {
Rcpp::List x(n);
Rcpp::IntegerVector choices = {1, 2 ,3};
for (int i = 0; i < n; ++i) {
int nc = Rcpp::sample(choices, 1).at(0);
int nr = Rcpp::sample(choices, 1).at(0);
Rcpp::NumericVector entries = Rcpp::rbinom(nc * nr, 1, 0.5);
x(i) = Rcpp::NumericMatrix(nr, nc, entries.begin()); // rows first, then columns, as in the R code
}
return x;
}
/***R
myvec(2)
*/
The main difference from the R code is the explicitly named vectors choices and entries, which are only implicit in the R code.

Interleaving results from many objects in Rcpp

I need to write the matrices and sparse matrices stored in a list to a file, row by row, and I am doing something like this:
#include <RcppArmadillo.h>
// [[Rcpp::export]]
bool write_rows (Rcpp::List data, Rcpp::CharacterVector clss, int n) {
int len = data.length();
for(int i = 0; i<n; i++) {
for(int j=0; j<len; j++) {
if (clss[j] == "matrix") {
Rcpp::NumericMatrix x = data[j];
auto row = x.row(i);
// do something with row i
} else if (clss[j] == "dgCMatrix") {
arma::sp_mat x = data[j];
auto row = x.row(i);
// do something different with row i
}
}
}
return true;
}
This function can be called in R with:
data <- list(
x = Matrix::rsparsematrix(nrow = 1000, ncol = 1000, density = 0.3),
y = matrix(1:10000, nrow = 1000, ncol = 10)
)
clss <- c("dgCMatrix", "matrix")
write_rows(data, clss, 1000)
The function receives a list of matrices or sparse matrices with the same number of rows and writes those matrices row by row, i.e. it first writes the first row of every element in data, then the second row of every element, and so on.
My problem is that the line arma::sp_mat x = data[j]; seems to have a huge impact on performance, since I am implicitly casting the list element data[j] to an Armadillo sparse matrix n times.
My question is: is there any way I could avoid this? Is there a more efficient solution? I tried to find a solution by looking into readr's source code, since they also write list elements row by row, but they also do a cast for each row (in this line, for example), although maybe this doesn't impact performance because they deal with SEXPs?
With the clarification, it seems that the result should interleave the rows from each matrix. You can still do this while avoiding multiple conversions.
This is the original code, modified to generate some actual output:
// [[Rcpp::export]]
arma::mat write_rows(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {
int len = data.length();
arma::mat result(nrows*len, ncols);
for (int i = 0, k = 0; i < nrows; i++) {
for (int j = 0; j < len; j++) {
arma::rowvec r;
if (clss[j] == "matrix") {
Rcpp::NumericMatrix x = data[j];
r = x.row(i);
}
else {
arma::sp_mat x = data[j];
r = x.row(i);
}
result.row(k++) = r;
}
}
return result;
}
The following code creates a vector of converted objects, and then extracts the rows from each object as required. The conversion is only done once per matrix. I use a struct containing a dense and a sparse matrix because it's a lot simpler than dealing with unions, and I don't want to drag in boost::variant or require C++17. Since there are only two classes we want to deal with, the overhead is minimal.
struct Matrix_types {
arma::mat m;
arma::sp_mat M;
};
// [[Rcpp::export]]
arma::mat write_rows2(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {
const int len = data.length();
std::vector<Matrix_types> matr(len);
std::vector<bool> is_dense(len);
arma::mat result(nrows*len, ncols);
// populate the structs
for (int j = 0; j < len; j++) {
is_dense[j] = (clss[j] == "matrix");
if (is_dense[j]) {
matr[j].m = Rcpp::as<arma::mat>(data[j]);
}
else {
matr[j].M = Rcpp::as<arma::sp_mat>(data[j]);
}
}
// populate the result
for (int i = 0, k = 0; i < nrows; i++) {
for (int j = 0; j < len; j++, k++) {
if (is_dense[j]) {
result.row(k) = matr[j].m.row(i);
}
else {
arma::rowvec r(matr[j].M.row(i));
result.row(k) = r;
}
}
}
return result;
}
Running on some test data:
data <- list(
a=Matrix(1.0, 1000, 1000, sparse=TRUE),
b=matrix(2.0, 1000, 1000),
c=Matrix(3.0, 1000, 1000, sparse=TRUE),
d=matrix(4.0, 1000, 1000)
)
system.time(z <- write_rows(data, sapply(data, class), 1000, 1000))
# user system elapsed
# 185.75 35.04 221.38
system.time(z2 <- write_rows2(data, sapply(data, class), 1000, 1000))
# user system elapsed
# 4.21 0.05 4.25
identical(z, z2)
# [1] TRUE

Warning when downcasting in Rcpp?

I have an Rcpp function that should take an IntegerVector as input (as toInt). I want to use it on vectors of integers, but also on vectors of doubles that hold whole numbers (e.g. 1:4 is of type integer but 1:4 + 1 is of type double).
Yet, when this is used on real floating point numbers (e.g. 1.5), I would like it to return a warning or an error instead of silently rounding all values (to make them integers).
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector toInt(RObject x) {
return as<IntegerVector>(x);
}
> toInt(c(1.5, 2.4)) # I would like a warning
[1] 1 2
> toInt(1:2 + 1) # No need of warning
[1] 2 3
Rcpp sugar has all you need. Here is one possible implementation:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector fprive(const RObject & x) {
NumericVector nv(x);
IntegerVector iv(x);
if (is_true(any(nv != NumericVector(iv)))) warning("Uh-oh");
return(iv);
}
/*** R
fprive(c(1.5, 2))
fprive(c(1L, 2L))
*/
Its output is as follows:
R> Rcpp::sourceCpp('/tmp/fprive.cpp')
R> fprive(c(1.5, 2))
[1] 1 2
R> fprive(c(1L, 2L))
[1] 1 2
Warning message:
In fprive(c(1.5, 2)) : Uh-oh
R>
Because it is a warning object, you can control via options("warn") whether you want to abort, print immediately, print at end, ignore, ...
The first solution I thought of was:
// [[Rcpp::export]]
IntegerVector toInt2(const NumericVector& x) {
for (int i = 0; i < x.size(); i++) {
if (x[i] != (int)x[i]) {
warning("Uh-oh");
break;
}
}
return as<IntegerVector>(x);
}
but I wondered if there wasn't an unnecessary copy when x was an IntegerVector, so I made this other solution:
// [[Rcpp::export]]
IntegerVector toInt3(const RObject& x) {
NumericVector nv(x);
for (int i = 0; i < nv.size(); i++) {
if (nv[i] != (int)nv[i]) {
warning("Uh-oh");
break;
}
}
return as<IntegerVector>(x);
}
But, maybe the best solution would be to test if the RObject is already of type int and to fill the resulting vector at the same time of checking the type:
// [[Rcpp::export]]
SEXP toInt4(const RObject& x) {
if (TYPEOF(x) == INTSXP) return x;
NumericVector nv(x);
int i, n = nv.size();
IntegerVector res(n);
for (i = 0; i < n; i++) {
res[i] = nv[i];
if (nv[i] != res[i]) {
warning("Uh-oh");
break;
}
}
for (; i < n; i++) res[i] = nv[i];
return res;
}
Some benchmarking:
x <- seq_len(1e7)
x2 <- x; x2[1] <- 1.5
x3 <- x; x3[length(x3)] <- 1.5
microbenchmark::microbenchmark(
fprive(x), toInt2(x), toInt3(x), toInt4(x),
fprive(x2), toInt2(x2), toInt3(x2), toInt4(x2),
fprive(x3), toInt2(x3), toInt3(x3), toInt4(x3),
times = 20
)
Unit: microseconds
expr min lq mean median uq max neval
fprive(x) 229865.629 233539.952 236049.68870 235623.390 238500.4335 244608.276 20
toInt2(x) 98249.764 99520.233 102026.44305 100468.627 103480.8695 114144.022 20
toInt3(x) 50631.512 50838.560 52307.34400 51417.296 52524.0260 58311.909 20
toInt4(x) 1.165 6.955 46.63055 10.068 11.0755 766.022 20
fprive(x2) 63134.534 64026.846 66004.90820 65079.292 66674.4835 74907.065 20
toInt2(x2) 43073.288 43435.478 44068.28935 43990.455 44528.1800 45745.834 20
toInt3(x2) 42968.743 43461.838 44268.58785 43682.224 44235.6860 51906.093 20
toInt4(x2) 19379.401 19640.198 20091.04150 19918.388 20232.4565 21756.032 20
fprive(x3) 254034.049 256154.851 258329.10340 258676.363 259549.3530 264550.346 20
toInt2(x3) 77983.539 79162.807 79901.65230 79424.011 80030.3425 87906.977 20
toInt3(x3) 73521.565 74329.410 76050.63095 75128.253 75867.9620 88240.937 20
toInt4(x3) 22109.970 22529.713 23759.99890 23072.738 23688.5365 30905.478 20
So, toInt4 seems the best solution.
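One caveat about the (int) casts above: in C++, converting a double to int is undefined behavior when the truncated value does not fit in an int, and NaN does not fit either. A defensive variant of the check could look like the plain C++ sketch below. It uses std::vector instead of Rcpp types, the name all_whole_ints is hypothetical, and it does not model R's NA conventions (R reserves the smallest int for NA_integer_).

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Returns true when every value is finite, whole, and fits in an int.
// Only in that case is narrowing each element to int well defined and lossless.
bool all_whole_ints(const std::vector<double>& x) {
    const double lo = static_cast<double>(std::numeric_limits<int>::min());
    const double hi = static_cast<double>(std::numeric_limits<int>::max());
    for (double v : x) {
        if (!std::isfinite(v)) return false;   // NaN or infinity
        if (v < lo || v > hi) return false;    // would overflow int
        if (std::trunc(v) != v) return false;  // has a fractional part
    }
    return true;
}
```

A caller would issue the warning when all_whole_ints returns false and narrow to int only after deciding how to handle the offending values.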

Problems with scale() and the Multidimensional Lp-Norm

Today I was trying to debug my code and stumbled across something that renders my solutions useless. What I am trying to calculate is the multidimensional L2-Norm for the following two matrices. As long as I am not using scale(), everything works fine. However, as soon as I scale the matrices, the three approaches no longer give the same result. What am I missing here?
set.seed(655)
df.a <- data.frame(A = sample(100:124, 24), B = sample(1:24, 24), C = sample(1:24, 24), D = rep(0, times=24))
df.b <- data.frame(A = sample(125:148, 24), B = sample(25:48, 24), C = sample(1:24, 24), D = sample(1:100, 24))
For this reason I have three different approaches:
sapply-function and sqrt of rowSums
sse <- function(x1, x2) sum((x1 - x2) ^ 2)
distanceChangeByTech <- function(x) {
sse(df.a[,x], df.b[,x])
}
help1 <- t(data.frame(sapply(colnames(df.a), distanceChangeByTech)))
dist_sap <- sqrt(rowSums(help1))
multidimensional Euclidean distance using RCPP:
multiEucl <- cxxfunction(signature(x="matrix", y="matrix"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
const int N = dx.nrow();
const int M = dx.ncol();
double sum = 0;
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
sum = sum + pow(dx(i,j) - dy(i,j), 2);
}
}
return wrap(sqrt(sum));
')
multidimensional Lp-Norm using RCPP:
multiPNorm <- cxxfunction(signature(x="matrix", y="matrix", p="numeric"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
double dp = Rcpp::as<double>(p);
const int N = dx.nrow();
const int M = dx.ncol();
double sum = 0;
double rsum = 0;
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
sum = sum + pow(abs(dx(i,j) - dy(i,j)), dp);
}
}
rsum = pow(sum, 1/dp);
return wrap(rsum);
')
When I tried this at first all worked well.
> multiEucl(as.matrix(df.a), as.matrix(df.b))
[1] 366.1543
> multiPNorm(as.matrix(df.a), as.matrix(df.b), 2)
[1] 366.1543
> sqrt(rowSums(help1))
sapply.colnames.df.a...distanceChangeByTech.
366.1543
But as soon as I scale the matrices, which I want to do because I will do a clustering based on these distance measures, something goes wrong: the solutions are not the same anymore. What is causing this? I am using these commands to scale:
df.a <- as.data.frame(scale(df.a))
df.a[is.na(df.a)] <- 0
df.b <- as.data.frame(scale(df.b))
df.b[is.na(df.b)] <- 0
> multiEucl(as.matrix(df.a), as.matrix(df.b))
[1] 12.51781
> multiPNorm(as.matrix(df.a), as.matrix(df.b), 2)
[1] 8.944272
> sqrt(rowSums(help1))
sapply.colnames.df.a...distanceChangeByTech.
12.51781
You used abs(), which is documented e.g. here, but you meant to use fabs(), which is documented here.
The <cmath> header provides an overloaded abs() (std::abs) for floating-point types as well, but you probably didn't include that.
It seems that abs() is not doing the right thing here. Instead I changed my coding of the multiPNorm and the changes seem to work.
multiPNorm <- cxxfunction(signature(x="matrix", y="matrix", p="numeric"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
double dp = Rcpp::as<double>(p);
const int N = dx.nrow();
const int M = dx.ncol();
double sum = 0;
double rsum = 0;
double help = 0;
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
help = dx(i,j) - dy(i,j);
if (help < 0) {
help = - help;
}
sum = sum + pow(help, dp);
}
}
rsum = pow(sum, 1/dp);
return wrap(rsum);
')
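As an alternative to the manual sign flip, the loop can call std::fabs from <cmath>. The plain C++ sketch below uses std::vector buffers instead of Rcpp matrices and contrasts that with what an integer-only abs effectively computes, assuming the original abs() call resolved to the C integer version as the answer above suggests. The function names lp_distance and lp_distance_truncating are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// L_p distance between two buffers, using std::fabs so the difference
// is never truncated to an integer.
double lp_distance(const std::vector<double>& x,
                   const std::vector<double>& y, double p) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::pow(std::fabs(x[i] - y[i]), p);
    }
    return std::pow(sum, 1.0 / p);
}

// Models the suspected bug: the difference is truncated toward zero
// before the absolute value is taken, as an integer abs would do.
double lp_distance_truncating(const std::vector<double>& x,
                              const std::vector<double>& y, double p) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const int d = static_cast<int>(x[i] - y[i]);
        sum += std::pow(static_cast<double>(std::abs(d)), p);
    }
    return std::pow(sum, 1.0 / p);
}
```

This would also explain why the unscaled data agreed: those differences are whole numbers, so truncation changes nothing, whereas the scaled values are mostly fractional and lose their fractional part.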
