Rcpp: extract subset of matrix using indexmatrix

I have a question about subsetting from a matrix to a vector. The user has the possibility to explicitly give the indexmatrix (which is a matrix of the same size as M, with 0 if the entry is not wanted, and 1 if the entry has to be extracted). If the indexmatrix is provided, then we just subset it, and if the indexmatrix is not provided (indexmatrix = NULL), then we build it using type1 (which takes true or false). Only two types of indexmatrices are possible.
I used the subsetting technique from the question "Subset of a Rcpp Matrix that matches a logical statement":
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::colvec extractElementsRcpp(arma::mat M,
Rcpp::Nullable<Rcpp::NumericMatrix> indexmatrix = R_NilValue,
bool type1 = false) {
unsigned int D = M.n_rows; // dimension of the data
arma::mat indmatrix(D, D); // initialize indexmatrix
if (indexmatrix.isNotNull()) {
// copy indexmatrix to numericmatrix
Rcpp::NumericMatrix indexmatrixt(indexmatrix);
// make indexmatrix into arma matrix indmatrix
indmatrix = Rcpp::as<arma::mat>(indexmatrixt);
} //else {
// get indexmatrix
// Rcpp::NumericMatrix indexmatrixt = getindexmatrix(D, type1)["indexmatrix"];
// // make indexmatrix into arma matrix
// indmatrix = Rcpp::as<arma::mat>(indexmatrixt);
// }
arma::colvec unM = M.elem(find(indmatrix == 1)); // extract wanted elements
return(unM);
}
It works, great! However, the speed is not what I was hoping for. Whenever the indexmatrix is provided, the C++ code is slower than the normal R code, while I was aiming for a nice improvement in speed. I have the feeling I'm copying the matrices around too much. But I am new to C++ and did not find a way to avoid it yet.
The speed comparison is as follows:
                                       test replications elapsed relative
2    extractElementsR(M, indexmatrix = ind)          100   0.084     1.00
1 extractElementsRcpp(M, indexmatrix = ind)          100   0.142     1.69
EDIT: The R function is defined as
extractElementsR <- function (M, indexmatrix, type1 = FALSE) {
D <- nrow(M)
# # get indexmatrix, if necessary
# if(is.null(indexmatrix)) indexmatrix <- getindexmatrix(D, type1 = type1)$indexmatrix
# extract wanted elements
return (M[which(indexmatrix > 0)])
}
One could for example take the matrices
M <- matrix(rnorm(1000^2), ncol = 1000)
indexmatrix <- matrix(1, 1000, 1000)
indexmatrix[lower.tri(indexmatrix)] <- 0
as M and indexmatrix.
EDIT2: I commented out the else statement in the Rcpp function and omitted the default NULL value in the R function, as neither is important for my question. I want to improve the speed of the Rcpp function when indexmatrix is provided. However, I want to keep the default NULL value (and create an indexmatrix when necessary).

Can you show the function extractElementsR() as well as example data, so that this becomes a reproducible example?
And at first blush, you are mixing Rcpp and RcppArmadillo types in order to subset with the latter. That will create lots of copies. We can now index with both Rcpp (and Kevin has some answers here) and RcppArmadillo (several older answers) so you could even try two different ways.
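To make the copying concern concrete, here is a minimal sketch in plain C++ of the extraction as a direct loop over both matrices. It is not the Rcpp or Armadillo API: std::vector stands in for R's column-major matrix storage, and the name extract_masked is hypothetical. The only allocation is the result vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy every M[k] whose mask entry is positive.
// Both vectors are assumed to hold matrices of identical shape in
// column-major order, which is how R stores them.
std::vector<double> extract_masked(const std::vector<double>& M,
                                   const std::vector<double>& mask) {
    assert(M.size() == mask.size());

    // First pass: count the selected entries so the result is allocated once.
    std::size_t n_keep = 0;
    for (double m : mask) {
        if (m > 0) ++n_keep;
    }

    std::vector<double> out;
    out.reserve(n_keep);

    // Second pass: copy the selected entries in storage order.
    for (std::size_t k = 0; k < M.size(); ++k) {
        if (mask[k] > 0) out.push_back(M[k]);
    }
    return out;
}
```

In an Rcpp function the same two loops could run over NumericMatrix arguments, which wrap the R objects without copying them, so no conversion to arma::mat would be needed for this step.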

Related

Using sample() from within Rcpp

I have a matrix containing probabilities, with each of the four columns corresponding to a score (an integer in sequence from 0 to 3). I want to sample a single score for each row using the probabilities contained in that row as sampling weights. In rows where some columns contain NAs instead of probabilities, the sampling frame is limited to the columns (and their corresponding scores) that do contain probabilities (e.g. for a row with 0.45,0.55,NA,NA, either 0 or 1 would be sampled). However, I get this error (followed by several others), so how can I make it work?
error: no matching function for call to 'as<Rcpp::IntegerVector>(Rcpp::Matrix<14>::Sub&)'
score[i] = sample(scrs,1,true,as<IntegerVector>(probs));
Existing answers suggest RcppArmadillo is the solution but I can't get that to work either. If I add:
require(RcppArmadillo)
before the cppFunction and
score[i] = Rcpp::RcppArmadillo::sample(scrs,1,true,probs);
in place of the existing sample() statement, I get:
error: 'Rcpp::RcppArmadillo' has not been declared
score[i] = Rcpp::RcppArmadillo::sample(scrs,1,true,probs);
Or if I also include,
#include <RcppArmadilloExtensions/sample.h>
at the top, I get:
fatal error: RcppArmadilloExtensions/sample.h: No such file or directory
#include <RcppArmadilloExtensions/sample.h>
Reproducible code:
p.vals <- matrix(c(0.44892077,0.55107923,NA,NA,
0.37111195,0.62888805,NA,NA,
0.04461714,0.47764478,0.303590351,1.741477e-01,
0.91741642,0.07968127,0.002826406,7.589714e-05,
0.69330800,0.24355559,0.058340934,4.795468e-03,
0.43516823,0.43483784,0.120895859,9.098067e-03,
0.73680809,0.22595438,0.037237525,NA,
0.89569365,0.10142719,0.002879163,NA),nrow=8,ncol=4,byrow=TRUE)
step.vals <- c(1,1,3,3,3,3,2,2)
require(Rcpp)
cppFunction('IntegerVector scores_cpp(NumericMatrix p, IntegerVector steps){
int prows = p.nrow();
IntegerVector score(prows);
for(int i=0;i<prows;i++){
int step = steps[i];
IntegerVector scrs = seq(0,step);
int start = 0;
int end = step;
NumericMatrix::Sub probs = p(Range(i,i),Range(start,end));
score[i] = sample(scrs,1,true,probs);
}
return score;
}')
test <- scores_cpp(p.vals,step.vals)
test
Note: the value of step.vals for each row is always equal to the number of columns containing probabilities in that row -1. So passing the step.values to the function may be extraneous.
You may be having a 'forest for the trees' moment here. The RcppArmadillo unit tests actually provide a working example. If you look at the source file inst/tinytest/test_sample.R, it has a simple
Rcpp::sourceCpp("cpp/sample.cpp")
and in that file, inst/tinytest/cpp/sample.cpp, we have the standard
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
to a) tell R to look at RcppArmadillo header directories and b) include the sampler extensions. This is how it works, and this has been documented to work for probably close to a decade.
As an example I can just do (in my $HOME directory containing git/rcpparmadillo)
> Rcpp::sourceCpp("git/rcpparmadillo/inst/tinytest/cpp/sample.cpp")
> set.seed(123)
> csample_integer(1:5, 10, TRUE, c(0.4, 0.3, 0.2, 0.05, 0.05))
[1] 1 3 2 3 4 1 2 3 2 2
>
The later Rcpp addition works the same way, but I find working with parts of matrices to be more expressive and convenient with RcppArmadillo.
Edit: Even simpler for anybody with the RcppArmadillo package installed:
> library(Rcpp)
> sourceCpp(system.file("tinytest","cpp","sample.cpp", package="RcppArmadillo"))
> set.seed(123)
> csample_integer(1:5, 10, TRUE, c(0.4, 0.3, 0.2, 0.05, 0.05))
[1] 1 3 2 3 4 1 2 3 2 2
>
Many thanks for the pointers. I also had some problems with indexing the matrix, so that part is changed, too. The following code works as intended (using sourceCpp()):
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector scores_cpp(NumericMatrix p, IntegerVector steps){
int prows = p.nrow();
IntegerVector score(prows);
for(int i=0;i<prows;i++){
int step = steps[i];
IntegerVector scrs = seq(0,step);
NumericMatrix probs = p(Range(i,i),Range(0,step));
IntegerVector sc = RcppArmadillo::sample(scrs,1,true,probs);
score[i] = sc[0];
}
return score;
}
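For readers who want to see the per-row sampling logic in isolation, here is a minimal sketch in plain C++. It uses the standard library's std::discrete_distribution rather than RcppArmadillo::sample, so it does not draw from R's random number generator and will not reproduce results under set.seed(). The name sample_score is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: draw one score in 0..step for a single row,
// using only the first step + 1 weights of that row. Entries after
// position step (the NA columns in the question) are never read.
int sample_score(const std::vector<double>& row_probs, int step,
                 std::mt19937& rng) {
    assert(step >= 0);
    assert(static_cast<std::size_t>(step) < row_probs.size());

    // discrete_distribution normalises the weights itself and returns
    // an index in [0, step], which is the sampled score.
    std::discrete_distribution<int> dist(row_probs.begin(),
                                         row_probs.begin() + step + 1);
    return dist(rng);
}
```

Calling sample_score once per row, with that row's value from step.vals, mirrors the loop in scores_cpp above.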

Matrix indexing via integer vector

I want to access non-consecutive matrix elements and then pass the sub-selection to (for instance) the sum() function. In the example below I get a compile error about invalid conversion.
I am relatively new to Rcpp, so I am sure the answer is simple. Perhaps I am missing some type of cast?
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins("cpp11")]]
double sumExample() {
// these are the matrix row elements I want to sum (the column in this example will be fixed)
IntegerVector a = {2,4,6};
// create 10x10 matrix filled with random numbers [0,1]
NumericVector v = runif(100);
NumericMatrix x(10, 10, v.begin());
// sum the row elements 2,4,6 from column 0
double result = sum( x(a,0) );
return(result);
}
You were close. Indexing uses [] only -- see this write-up at the Rcpp Gallery -- and you missed the export tag. The main issue is that compound expressions are sometimes too much for the compiler and the template programming. So it works if you take it apart.
Corrected Code
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::export]]
double sumExample() {
// these are the matrix row elements I want to sum
// (the column in this example will be fixed)
IntegerVector a = {2,4,6};
// create 10x10 matrix filled with random numbers [0,1]
NumericVector v = runif(100);
NumericMatrix x(10, 10, v.begin());
// sum the row elements 2,4,6 from column 0
NumericVector z1 = x.column(0);
NumericVector z2 = z1[a];
double result = sum( z2 );
return(result);
}
/*** R
sumExample()
*/
Demo
R> Rcpp::sourceCpp("~/git/stackoverflow/56739765/question.cpp")
R> sumExample()
[1] 0.758416
R>
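What the two-step version computes can also be written as an explicit loop. The sketch below is plain C++ rather than Rcpp sugar, with a flat std::vector standing in for the column-major NumericMatrix; the name sum_rows_in_column is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum x(rows[k], col) over all k for a matrix with
// nrow rows stored column-major in a flat vector.
double sum_rows_in_column(const std::vector<double>& x, std::size_t nrow,
                          std::size_t col,
                          const std::vector<std::size_t>& rows) {
    assert(nrow > 0 && x.size() % nrow == 0);
    assert(col < x.size() / nrow);

    const std::size_t offset = col * nrow;  // index where the column starts
    double total = 0.0;
    for (std::size_t r : rows) {
        assert(r < nrow);
        total += x[offset + r];  // zero-based row index, as in Rcpp
    }
    return total;
}
```

The first step (locating the column) corresponds to x.column(0) and the second (picking rows) to z1[a] in the corrected code.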

memory efficient method to calculate distance matrix [duplicate]

I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix and have as a result an object of class big.matrix. I would like to know if there is an optimal way of doing that.
The reason for my choice of using the class big.matrix is memory limitation. I could transform my big.matrix to an object of class matrix and calculate the Euclidean distance matrix using dist(). However, dist() would return an object of size that would not be allocated in the memory.
Edit
The following answer was given by John W. Emerson, author and maintainer of the bigmemory package:
You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.
Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) Euclidean distances. I like to wrap my big.matrix functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address and other initializations).
Note - as we are using bigmemory (and are therefore concerned with RAM usage), I have this example return the N-1 x N-1 matrix of only lower triangular elements. You could modify this, but this is what I threw together.
euc_dist.cpp
// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]
using namespace Rcpp;
using namespace arma;
// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>
// C++11 plugin
// [[Rcpp::plugins(cpp11)]]
template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {
int W = inBigMat.n_rows;
for(int i = 0; i < W - 1; i++){
for(int j=i+1; j < W; j++){
outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
}
}
}
// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
// First we tell Rcpp that the object we've been given is an external
// pointer.
XPtr<BigMatrix> xpMat(pInBigMat);
XPtr<BigMatrix> xpOutMat(pOutBigMat);
int type = xpMat->matrix_type();
switch(type) {
case 1:
BigArmaEuclidean(
arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 2:
BigArmaEuclidean(
arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 4:
BigArmaEuclidean(
arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 8:
BigArmaEuclidean(
arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
default:
// We should never get here, but it resolves compiler warnings.
throw Rcpp::exception("Undefined type for provided big.matrix");
}
}
My little wrapper
bigMatrixEuc <- function(bigMat){
zeros <- big.matrix(nrow = nrow(bigMat)-1,
ncol = nrow(bigMat)-1,
init = 0,
type = typeof(bigMat))
BigArmaEuc(bigMat@address, zeros@address)
return(zeros)
}
The test
library(Rcpp)
sourceCpp("euc_dist.cpp")
library(bigmemory)
set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)
# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]
# pull out the matrix elements for our purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]
# check if identical
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE
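The N-1 x N-1 output layout is easy to get wrong, so here is a minimal sketch of it in plain C++, with std::vector in place of big.matrix or Armadillo storage and a hypothetical function name. The distance between rows i < j lands in element (j - 1, i) of the result, matching outBigMat(j-1,i) in the code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: pairwise Euclidean distances between the rows of an
// n-by-d column-major matrix. The distance between rows i < j is written
// to element (j - 1, i) of an (n - 1)-by-(n - 1) column-major result;
// all other entries stay zero.
std::vector<double> lower_tri_distances(const std::vector<double>& in,
                                        std::size_t n, std::size_t d) {
    assert(n >= 2 && in.size() == n * d);
    const std::size_t m = n - 1;
    std::vector<double> out(m * m, 0.0);

    for (std::size_t i = 0; i + 1 < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double ss = 0.0;  // sum of squared coordinate differences
            for (std::size_t c = 0; c < d; ++c) {
                const double diff = in[c * n + i] - in[c * n + j];
                ss += diff * diff;
            }
            out[i * m + (j - 1)] = std::sqrt(ss);  // element (j - 1, i)
        }
    }
    return out;
}
```

The sketch accumulates in double regardless of the input type, which is a design choice of this example and not something the templated version above does.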

Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
Here is a revised version of a C++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure that someone might still be able to improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the selected columns of X without the rows that contain missing values. keep just takes one argument X and returns a logical vector.
Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
// keep only the selected columns (idx is 1-based, as in R)
X = X.cols(idx - 1);
// get dimensions
int n = X.n_rows,k = X.n_cols;
// create keep vector
vec keep = ones<vec>(n);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
// alternative with view for each row (slightly slower)
/*vec keep = zeros<vec>(n);
for (int i = 0; i < n; i++) {
keep(i) = is_finite(X.row(i));
}*/
return (X.rows(find(keep==1)));
}
// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
int n = X.nrow(), k = X.ncol();
// create keep vector
LogicalVector keep(n, true);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
return (keep);
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")
idx=sample(ncol(X),90)
microbenchmark(
X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
X[rowSums(is.na(X[,idx])) == 0, idx],
cleanmat(X,idx),
X[keep(X[,idx]),idx],
times=3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is ok but creates a large copy to hold the logical matrix, which has nrow(x) rows and length(varcols) columns. And the ==0 on the result of rowSums will need a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So, the very fastest (I guess), would be a C version like this (pseudo-code) :
keep = rep(TRUE,nrow(x))
for (j=0; j<length(varcols); j++)
for (i=0; i<nrow(x); i++)
if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.
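The pseudo-code above can be written as compilable C++ along these lines. This is only a sketch under stated assumptions: each selected column is a std::vector<double>, missing values are represented as NaN (R's NA_real_ is a NaN, so std::isnan catches it), and the name complete_rows is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: cols holds the selected columns, each of length n.
// Returns keep[i] == true when row i has no NaN in any of them.
std::vector<bool> complete_rows(const std::vector<std::vector<double>>& cols,
                                std::size_t n) {
    std::vector<bool> keep(n, true);  // the single allocation
    for (const std::vector<double>& col : cols) {  // outer loop: columns
        assert(col.size() == n);
        for (std::size_t i = 0; i < n; ++i) {      // inner loop: rows
            // keep[i] is tested first, so rows that are already dropped
            // do not read col[i] at all.
            if (keep[i] && std::isnan(col[i])) keep[i] = false;
        }
    }
    return keep;
}
```

Keeping the column loop on the outside means each column is read contiguously, which is the cache-efficiency point made above.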
Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04

Passing a `data.table` to c++ functions using `Rcpp` and/or `RcppArmadillo`

Is there a way to pass data.table objects to c++ functions using Rcpp and/or RcppArmadillo without manually transforming the data.table to a data.frame? In the example below test_rcpp(X2) and test_arma(X2) both fail with c++ exception (unknown reason).
R code
X=data.frame(c(1:100),c(1:100))
X2=data.table(X)
test_rcpp(X)
test_rcpp(X2)
test_arma(X)
test_arma(X2)
c++ functions
NumericMatrix test_rcpp(NumericMatrix X) {
return(X);
}
mat test_arma(mat X) {
return(X);
}
Building on top of other answers, here is some example code:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
double do_stuff_with_a_data_table(DataFrame df){
CharacterVector x = df["x"] ;
NumericVector y = df["y"] ;
IntegerVector z = df["v"] ;
/* do whatever with x, y, z */
double res = sum(y) ;
return res ;
}
So, as Matthew says, this treats the data.table as a data.frame (aka a Rcpp::DataFrame in Rcpp).
require(data.table)
DT <- data.table(
x=rep(c("a","b","c"),each=3),
y=c(1,3,6),
v=1:9)
do_stuff_with_a_data_table( DT )
# [1] 30
This completely ignores the internals of the data.table.
Try passing the data.table as a DataFrame rather than NumericMatrix. It is a data.frame anyway, with the same structure, so you shouldn't need to convert it.
Rcpp sits on top of native R types encoded as SEXP. This includes eg data.frame or matrix.
data.table is not native, it is an add-on. So someone who wants this (you?) has to write a converter, or provide funding for someone else to write one.
For reference, I think a good approach is to return a list from Rcpp, since data.table allows updating columns from a list.
Here is a dummy example:
cCode <-
'
DataFrame DT(DTi);
NumericVector x = DT["x"];
int N = x.size();
LogicalVector b(N);
NumericVector d(N);
for(int i=0; i<N; i++){
b[i] = x[i]<=4;
d[i] = x[i]+1.;
}
return Rcpp::List::create(Rcpp::Named("b") = b, Rcpp::Named("d") = d);
';
require("data.table");
require("Rcpp");
require("inline");
DT <- data.table(x=1:9,y=sample(letters,9)) #declare a data.table
modDataTable <- cxxfunction(signature(DTi="data.frame"), plugin="Rcpp", body=cCode)
DT_add <- modDataTable(DT) #here we get the list
DT[, names(DT_add):=DT_add] #here we update by reference the data.table
