Efficient subsetting in Rcpp (equivalent of the R "which" command)

In Rcpp, there are various "Rcpp sugar" commands that permit nice vectorised operations in the code. In the code below I move across a data frame, break it into vectors, then use the "ifelse" and "sum" sugar commands to compute the mean of v over the rows where x equals either y or y+1. All seems to work correctly.
Just wondering if there is a neater way than this - e.g. an equivalent of the "which" command that gives index points satisfying a particular condition? There seems to be a facility available as "find" in Armadillo but that means using incompatible object types (you can't use "find" and "ifelse" together).
On the same topic, is it possible to get "ifelse" to accept a compound logical condition? In the example below, for instance, the definition of indic is formed of two "ifelse" commands, and it would obviously be cleaner as one. Any thoughts would be much appreciated.
Look forward to hearing your responses :)
require(Rcpp)
require(inline)
set.seed(42)
df = data.frame(x = rpois(1000,3), y = rpois(1000,3), v = rnorm(1000),
stringsAsFactors=FALSE)
myfunc1 = cxxfunction(
signature(DF = "data.frame"),
plugin = "Rcpp",
body = '
using namespace Rcpp;
DataFrame df(DF);
IntegerVector x = df["x"];
IntegerVector y = df["y"];
NumericVector v = df["v"];
LogicalVector indic = ifelse(x==y,true,ifelse(x==y+1,true,false));
double subsum = sum(ifelse(indic,v,0));
int subsize = sum(indic);
double mn = ((subsize>0) ? subsum/subsize : 0.0);
return(Rcpp::List::create(_["subsize"] = subsize,
_["submean"] = mn
));
'
)
myfunc1(df)
### OUTPUT:
#
# $subsize
# [1] 300
#
# $submean
# [1] 0.1091555
#

Rcpp (>= 0.10.0) implements the | operator between two logical sugar expressions. So you can do:
require( Rcpp )
cppFunction( code = '
List subsum( IntegerVector x, IntegerVector y, NumericVector v){
using namespace Rcpp ;
LogicalVector indic = (x==y) | (x==y+1) ;
int subsize = sum(indic) ;
double submean = subsize == 0 ? 0.0 : sum(ifelse(indic,v,0)) / subsize ;
return List::create( _["subsize"] = subsize, _["submean"] = submean ) ;
}
' )
subsum( rpois(1000,3), rpois(1000,3), rnorm(1000) )
# $subsize
# [1] 320
#
# $submean
# [1] -0.05708866
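For anyone wondering what a "which"-style approach would look like underneath the sugar, here is a minimal sketch in plain standard C++ (no Rcpp types, so it compiles on its own). The names which_if and subset_mean are made up for this illustration and are not part of Rcpp; the sketch assumes all three vectors have the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Rcpp function): returns the 0-based indices i
// in [0, n) for which pred(i) is true, loosely analogous to R's which().
template <typename Pred>
std::vector<std::size_t> which_if(std::size_t n, Pred pred) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < n; ++i) {
        if (pred(i)) idx.push_back(i);
    }
    return idx;
}

// Mean of v over the rows where x == y or x == y + 1.
// Returns 0.0 when no row matches. Assumes x, y and v have equal length.
double subset_mean(const std::vector<int>& x, const std::vector<int>& y,
                   const std::vector<double>& v) {
    const std::vector<std::size_t> idx = which_if(
        x.size(),
        [&](std::size_t i) { return x[i] == y[i] || x[i] == y[i] + 1; });
    double total = 0.0;
    for (std::size_t i : idx) total += v[i];
    return idx.empty() ? 0.0 : total / static_cast<double>(idx.size());
}
```

Building the index vector costs an extra allocation, so if only the sum and count are needed, accumulating them directly in one loop is likely the cheaper option.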

Related

Is there any way in which to make an Infix function using sourceCpp()

I was wondering whether it is possible to make an infix function, e.g. A %o% B with Rcpp.
I know that this is possible using the inline package, but I have not yet been able to find a method for doing this when using sourceCpp().
I have made the following infix implementation of %o% / outer() when arguments are sure to be vectors using RcppEigen and inline:
`%op%` <- cxxfunction(signature(v1="NumericVector",
v2="NumericVector"),
plugin = "RcppEigen",
body = c("
NumericVector xx(v1);
NumericVector yy(v2);
const Eigen::Map<Eigen::VectorXd> x(as<Eigen::Map<Eigen::VectorXd> >(xx));
const Eigen::Map<Eigen::VectorXd> y(as<Eigen::Map<Eigen::VectorXd> >(yy));
Eigen::MatrixXd op = x * y.transpose();
return Rcpp::wrap(op);
"))
This can easily be adapted so that it is imported using sourceCpp(), but not as an infix function.
My current attempt is as follows:
#include <Rcpp.h>
using namespace Rcpp;
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
NumericMatrix outerProd(NumericVector v1, NumericVector v2) {
NumericVector xx(v1);
NumericVector yy(v2);
const Eigen::Map<Eigen::VectorXd> x(as<Eigen::Map<Eigen::VectorXd> >(xx));
const Eigen::Map<Eigen::VectorXd> y(as<Eigen::Map<Eigen::VectorXd> >(yy));
Eigen::MatrixXd op = x * y.transpose();
return Rcpp::wrap(op);
}
So, to summarize my question: is it possible to make an infix function available through sourceCpp?
Is it possible to make an infix function available through sourceCpp?
Yes.
As always, one should read the Rcpp vignettes!
In particular here, if you look in Section 1.6 of the Rcpp attributes vignette, you'd see you can modify the name of a function using the name parameter for Rcpp::export.
For example, we could do:
#include <Rcpp.h>
// [[Rcpp::export(name = `%+%`)]]
Rcpp::NumericVector add(Rcpp::NumericVector x, Rcpp::NumericVector y) {
return x + y;
}
/*** R
1:3 %+% 4:6
*/
Then we'd get:
Rcpp::sourceCpp("~/infix-test.cpp")
> 1:3 %+% 4:6
[1] 5 7 9
So, you still have to name C++ functions valid C++ names in the code, but you can export it to R through the name parameter of Rcpp::export without having to do anything further on the R side.
John Chambers states three principles on page four of the (highly recommended) "Extending R" book:
Everything that exists in R is an object.
Everything that happens in R is a function call.
Interfaces to other software are part of R.
So per point two, you can of course use sourceCpp() to create a compiled function and hang it on any odd infix operator you like.
Code Example
library(Rcpp)
cppFunction("std::string cc(std::string a, std::string b) { return a+b; }")
`%+%` <- function(a,b) cc(a,b)
cc("Hello", "World")
"hello" %+% "world"
Output
R> library(Rcpp)
R> cppFunction("std::string cc(std::string a, std::string b) { return a+b; }")
R> `%+%` <- function(a,b) cc(a,b)
R>
R> cc("Hello", "World")
[1] "HelloWorld"
R>
R> "hello" %+% "world"
[1] "helloworld"
R>
Summary
Rcpp is really just one cog in the machinery.
Edit
It also works with your initial function, with some minor simplification. For
`%op%` <- cppFunction("Eigen::MatrixXd op(Eigen::VectorXd x, Eigen::VectorXd y) { Eigen::MatrixXd op = x * y.transpose(); return op; }", depends="RcppEigen")
as.numeric(1:3) %op% as.numeric(3:1)
we get
R> `%op%` <- cppFunction("Eigen::MatrixXd op(Eigen::VectorXd x, Eigen::VectorXd y) { Eigen::MatrixXd op = x * y.transpose(); return op; }", depends="RcppEigen")
R> as.numeric(1:3) %op% as.numeric(3:1)
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 6 4 2
[3,] 9 6 3
R>
(modulo some line noise from the compiler).

Remove NA values efficiently

I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.
Here is an example:
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::VectorXd;
typedef Map<VectorXd> MapVecd;
'
body <- '
const MapVecd x(as<MapVecd>(xx)), y(as<MapVecd>(yy));
VectorXd x1(x), y1(y);
int k(0);
for (int i = 0; i < x.rows(); ++i) {
if (x.coeff(i)==x.coeff(i) && y.coeff(i)==y.coeff(i)) {
x1(k) = x.coeff(i);
y1(k) = y.coeff(i);
k++;
};
};
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
Rcpp::Named("y") = y1);
'
na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy= "Vector"),
body, "RcppEigen", incl)
na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7
In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).
PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.
Perhaps I am not understanding the question correctly, but within Rcpp, I don't see how you could possibly do this more efficiently than a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery. But this is not the case once you are down at the C++ level. Even natively vectorized R functions ultimately are implemented with for loops in C. So the only way I can think to make this more efficient is to try to do it in parallel.
For example, here's a simple na.omit.cpp function that omits NA values from a single vector:
rcppfun<-"
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
std::vector<double> r(x.size());
int k=0;
for (int i = 0; i < x.size(); ++i) {
if (x[i]==x[i]) {
r[k] = x[i];
k++;
}
}
r.resize(k);
return Rcpp::wrap(r);
}"
na.omit.cpp<-cppFunction(rcppfun)
This runs even more quickly than R's built-in na.omit:
> set.seed(123)
> x<-1:10000
> x[sample(10000,1000)]<-NA
> y1<-na.omit(x)
> y2<-na.omit.cpp(x)
> all(y1==y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x),na.omit.cpp(x))
Unit: microseconds
expr min lq median uq max neval
na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447 100
na.omit.cpp(x) 107.524 168.1955 173.6035 210.524 222.564 100
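Since the original question is about dropping positions where either of two vectors is missing, here is a minimal sketch of that paired version in plain standard C++ (std::vector instead of Eigen or Rcpp types, so it stands alone). The function name drop_na_pairs is made up for illustration. It relies on the fact that R's NA_real_ is a NaN with a particular payload, so std::isnan treats NA and NaN alike, and it assumes both vectors have the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Drop every position where x or y is NaN (which includes R's NA_real_),
// compacting both vectors in place in a single pass.
// Assumes x.size() == y.size(). Returns the number of pairs kept.
std::size_t drop_na_pairs(std::vector<double>& x, std::vector<double>& y) {
    std::size_t k = 0;  // next write position
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!std::isnan(x[i]) && !std::isnan(y[i])) {
            x[k] = x[i];
            y[k] = y[i];
            ++k;
        }
    }
    x.resize(k);
    y.resize(k);
    return k;
}
```

Compacting in place avoids allocating a result buffer on each call, which may matter if this really runs a million times inside a loop.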
I do not know if I understand the problem correctly, but you can use logical indexing (a[-which(is.na(a))] would return an empty vector when there are no NA values):
a = c(1.5, NaN, 7, NA)
a[!is.na(a)]
[1] 1.5 7.0
It might be useful to use RInside if you want to call this from C++.

Convert RcppArmadillo vector to Rcpp vector

I am trying to convert RcppArmadillo vector (e.g. arma::colvec) to a Rcpp vector (NumericVector). I know I can first convert arma::colvec to SEXP and then convert SEXP to NumericVector (e.g. as<NumericVector>(wrap(temp)), assuming temp is an arma::colvec object). But what is a good way to do that?
I want to do that simply because I am unsure if it is okay to pass arma::colvec object as a parameter to an Rcpp::Function object.
I was trying to evaluate an Rcpp::Function with an arma::vec argument. It seems to accept the argument in four forms without compilation errors. That is, if f is an Rcpp::Function and a is an arma::vec, then
f(a)
f(wrap(a))
f(as<NumericVector>(wrap(a)))
f(NumericVector(a.begin(),a.end()))
produce no compilation or runtime errors, at least apparently.
For this reason, I have conducted a little test of the four versions. Since I suspect that something could go wrong during garbage collection, I test them again under gctorture.
gctorture(on=FALSE)
Rcpp::sourceCpp(code = '
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
double foo1(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(a, b));
}
return sum;
}
// [[Rcpp::export]]
double foo2(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(wrap(a),wrap(b)));
}
return sum;
}
// [[Rcpp::export]]
double foo3(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(as<NumericVector>(wrap(a)),as<NumericVector>(wrap(b))));
}
return sum;
}
// [[Rcpp::export]]
double foo4(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(NumericVector(a.begin(),a.end()),NumericVector(b.begin(),b.end())));
}
return sum;
}
')
# note that when gctorture is on, the program will be very slow as it
# tries to perform GC for every allocation.
# gctorture(on=TRUE)
f = function(x,y) {
mean(x) + mean(y)
}
# all four functions should return 700
foo1(c(1,2,3), c(4,5,6), f) # error
foo2(c(1,2,3), c(4,5,6), f) # wrong answer (occasionally)!
foo3(c(1,2,3), c(4,5,6), f) # correct answer
foo4(c(1,2,3), c(4,5,6), f) # correct answer
As a result, the first method produces an error, the second method produces a wrong answer and only the third and the fourth method return the correct answer.
> # they should return 700
> foo1(c(1,2,3), c(4,5,6), f) # error
Error: invalid multibyte string at '<80><a1><e2>'
> foo2(c(1,2,3), c(4,5,6), f) # wrong answer (occasionally)!
[1] 712
> foo3(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
> foo4(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
Note that, if gctorture is set FALSE, then all functions will return a correct result.
> foo1(c(1,2,3), c(4,5,6), f) # error
[1] 700
> foo2(c(1,2,3), c(4,5,6), f) # wrong answer (occasionally)!
[1] 700
> foo3(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
> foo4(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
It means that methods 1 and 2 are liable to break when garbage is collected at runtime, and we don't know when that happens. Thus, it is dangerous not to wrap the parameter properly.
Edit: as of 2017-12-05, all four conversions produce the correct result.
f(a)
f(wrap(a))
f(as<NumericVector>(wrap(a)))
f(NumericVector(a.begin(),a.end()))
and this is the benchmark
> microbenchmark(foo1(c(1,2,3), c(4,5,6), f), foo2(c(1,2,3), c(4,5,6), f), foo3(c(1,2,3), c(4,5,6), f), foo4(c(1,2,3), c(4,5,6), f))
Unit: milliseconds
                            expr      min       lq     mean   median       uq      max neval
 foo1(c(1, 2, 3), c(4, 5, 6), f) 2.575459 2.694297 2.905398 2.734009 2.921552 4.186352   100
 foo2(c(1, 2, 3), c(4, 5, 6), f) 2.574565 2.677380 2.880511 2.731615 2.847573 5.336418   100
 foo3(c(1, 2, 3), c(4, 5, 6), f) 2.582574 2.701779 2.862598 2.753256 2.875745 4.611379   100
 foo4(c(1, 2, 3), c(4, 5, 6), f) 2.378309 2.469361 2.675188 2.538140 2.695720 3.734019   100
And f(NumericVector(a.begin(),a.end())) is marginally faster than other methods.
This should work with arma::vec, arma::rowvec and arma::colvec:
template <typename T>
Rcpp::NumericVector arma2vec(const T& x) {
return Rcpp::NumericVector(x.begin(), x.end());
}
I had the same question. I used wrap to do the conversion at the core of several layers of for loops and it was very slow. I think the wrap function is to blame for dragging the speed down so I wish to know if there is an elegant way to do this.
As for Raymond's question, you might want to try including the namespace like: Rcpp::as<Rcpp::NumericVector>(wrap(A)) instead or include a line using namespace Rcpp; at the beginning of your code.

Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
Here is a revised version of a c++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to c so I am sure that someone might still be able to improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The c++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the selected columns of the matrix without the rows that contain missing values. keep takes just one argument, X, and returns a logical vector.
Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
// keep only the selected columns
X = X.cols(idx - 1);
// get dimensions
int n = X.n_rows,k = X.n_cols;
// create keep vector
vec keep = ones<vec>(n);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
// alternative with view for each row (slightly slower)
/*vec keep = zeros<vec>(n);
for (int i = 0; i < n; i++) {
keep(i) = is_finite(X.row(i));
}*/
return (X.rows(find(keep==1)));
}
// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
int n = X.nrow(), k = X.ncol();
// create keep vector
LogicalVector keep(n, true);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
return (keep);
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")
idx=sample(ncol(X),90)
microbenchmark(
X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
X[rowSums(is.na(X[,idx])) == 0, idx],
cleanmat(X,idx),
X[keep(X[,idx]),idx],
times=3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is ok but creates a large copy: a logical matrix with nrow(x) rows and length(varcols) columns. And the ==0 on the result of rowSums will need a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So, the very fastest (I guess), would be a C version like this (pseudo-code) :
keep = rep(TRUE,nrow(x))
for (j=0; j<length(varcols); j++)
for (i=0; i<nrow(x); i++)
if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.
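To make the pseudo-code above concrete, here is a minimal sketch in plain standard C++, assuming the selected columns are stored as one column-major array (the layout R uses for matrices). The function name complete_rows is made up for illustration. Note one difference from the pseudo-code: std::isnan flags both NA and NaN, whereas R's ISNA macro flags only NA.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// X holds an n_rows x n_cols matrix in column-major order, so element (i, j)
// lives at X[i + j * n_rows].
// Returns keep, where keep[i] is true when row i has no NaN/NA in any column.
std::vector<bool> complete_rows(const std::vector<double>& X,
                                std::size_t n_rows, std::size_t n_cols) {
    std::vector<bool> keep(n_rows, true);  // the single allocation
    for (std::size_t j = 0; j < n_cols; ++j) {       // outer loop: columns
        const double* col = X.data() + j * n_rows;   // contiguous in memory
        for (std::size_t i = 0; i < n_rows; ++i) {   // inner loop: rows
            // Short-circuit: rows already dropped are not read again.
            if (keep[i] && std::isnan(col[i])) keep[i] = false;
        }
    }
    return keep;
}
```

The loop order matters here for the reason given above: the inner loop walks down one contiguous column, which is what keeps the access pattern cache friendly.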
Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04

Passing a `data.table` to c++ functions using `Rcpp` and/or `RcppArmadillo`

Is there a way to pass a data.table object to c++ functions using Rcpp and/or RcppArmadillo without manually transforming the data.table to a data.frame? In the example below test_rcpp(X2) and test_arma(X2) both fail with c++ exception (unknown reason).
R code
X=data.frame(c(1:100),c(1:100))
X2=data.table(X)
test_rcpp(X)
test_rcpp(X2)
test_arma(X)
test_arma(X2)
c++ functions
NumericMatrix test_rcpp(NumericMatrix X) {
return(X);
}
mat test_arma(mat X) {
return(X);
}
Building on top of other answers, here is some example code:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
double do_stuff_with_a_data_table(DataFrame df){
CharacterVector x = df["x"] ;
NumericVector y = df["y"] ;
IntegerVector z = df["v"] ;
/* do whatever with x, y, z */
double res = sum(y) ;
return res ;
}
So, as Matthew says, this treats the data.table as a data.frame (aka a Rcpp::DataFrame in Rcpp).
require(data.table)
DT <- data.table(
x=rep(c("a","b","c"),each=3),
y=c(1,3,6),
v=1:9)
do_stuff_with_a_data_table( DT )
# [1] 30
This completely ignores the internals of the data.table.
Try passing the data.table as a DataFrame rather than NumericMatrix. It is a data.frame anyway, with the same structure, so you shouldn't need to convert it.
Rcpp sits on top of native R types encoded as SEXP. This includes eg data.frame or matrix.
data.table is not native, it is an add-on. So someone who wants this (you?) has to write a converter, or provide funding for someone else to write one.
For reference, I think a good approach is to return a list from Rcpp, since data.table allows updating by reference via lists.
Here is a dummy example:
cCode <-
'
DataFrame DT(DTi);
NumericVector x = DT["x"];
int N = x.size();
LogicalVector b(N);
NumericVector d(N);
for(int i=0; i<N; i++){
b[i] = x[i]<=4;
d[i] = x[i]+1.;
}
return Rcpp::List::create(Rcpp::Named("b") = b, Rcpp::Named("d") = d);
';
require("data.table");
require("Rcpp");
require("inline");
DT <- data.table(x=1:9,y=sample(letters,9)) #declare a data.table
modDataTable <- cxxfunction(signature(DTi="data.frame"), plugin="Rcpp", body=cCode)
DT_add <- modDataTable(DT) #here we get the list
DT[, names(DT_add):=DT_add] #here we update by reference the data.table
