Remove NA values efficiently - r

I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.
Here is an example:
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::VectorXd;
typedef Map<VectorXd> MapVecd;
'
body <- '
const MapVecd x(as<MapVecd>(xx)), y(as<MapVecd>(yy));
VectorXd x1(x), y1(y);
int k(0);
for (int i = 0; i < x.rows(); ++i) {
if (x.coeff(i)==x.coeff(i) && y.coeff(i)==y.coeff(i)) {
x1(k) = x.coeff(i);
y1(k) = y.coeff(i);
k++;
};
};
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
Rcpp::Named("y") = y1);
'
na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy= "Vector"),
body, "RcppEigen", incl)
na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7
In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).
PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.

Perhaps I am not understanding the question correctly, but within Rcpp, I don't see how you could possibly do this more efficiently than a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery. But this is not the case once you are down at the C++ level. Even natively vectorized R functions ultimately are implemented with for loops in C. So the only way I can think to make this more efficient is to try to do it in parallel.
For example, here's a simple na.omit.cpp function that omits NA values from a single vector:
rcppfun<-"
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
std::vector<double> r(x.size());
int k=0;
for (int i = 0; i < x.size(); ++i) {
if (x[i]==x[i]) {
r[k] = x[i];
k++;
}
}
r.resize(k);
return Rcpp::wrap(r);
}"
library(Rcpp)
na.omit.cpp<-cppFunction(rcppfun)
This runs faster than R's built-in na.omit:
> set.seed(123)
> x<-1:10000
> x[sample(10000,1000)]<-NA
> y1<-na.omit(x)
> y2<-na.omit.cpp(x)
> all(y1==y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x),na.omit.cpp(x))
Unit: microseconds
expr min lq median uq max neval
na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447 100
na.omit.cpp(x) 107.524 168.1955 173.6035 210.524 222.564 100
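For the two-vector case in the question, the same single-pass idea can be sketched in plain C++ without Rcpp or Eigen. This is only an illustration: the helper name omit_nan_pairs and the use of std::vector are assumptions, not part of the original code. R's NA_real_ is a NaN with a special payload, so std::isnan catches both NA and NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): keep only the positions
// where neither x[i] nor y[i] is NaN, in one pass over both vectors.
std::pair<std::vector<double>, std::vector<double>>
omit_nan_pairs(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());  // both inputs must have the same length
    std::vector<double> xo, yo;
    xo.reserve(x.size());          // one allocation up front, no regrowth
    yo.reserve(y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!std::isnan(x[i]) && !std::isnan(y[i])) {
            xo.push_back(x[i]);
            yo.push_back(y[i]);
        }
    }
    return std::make_pair(std::move(xo), std::move(yo));
}
```

With the inputs from the question, (1.5, NaN, 7, NA) and (7, 1, NA, 3), only the first pair (1.5, 7) survives.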

I am not sure I understand the problem correctly, but in plain R you can drop missing values with logical indexing:
a = c(1.5, NaN, 7, NA)
a[!is.na(a)]
[1] 1.5 7.0
(a[-which(is.na(a))] gives the same result here, but it returns an empty vector when a contains no NA.)
RInside might be useful if you want to use this from C++.


How is noNA used in Rcpp?

In his "Advanced R" book, Hadley Wickham says "noNA(x) asserts that the vector x does not contain any missing values." However I still don't know how to use it. I can't do
if (noNA(x))
do this
so how am I supposed to use it?
http://adv-r.had.co.nz/Rcpp.html#rcpp-sugar
Many of the Rcpp sugar expressions are implemented through template classes which have specializations for cases when the input object is known to be free of missing values, thereby allowing the underlying algorithm to avoid having to perform the extra work of dealing with NA values (e.g. calls to is_na). This is only possible because the VectorBase class has a boolean parameter indicating whether the underlying object can (can, not that it necessarily does) have NA values, or not.
noNA returns (when called on a VectorBase object) an instance of the Nona template class. Note that Nona itself derives from
Rcpp::VectorBase<RTYPE, false, Nona<RTYPE,NA,VECTOR>>
// ^^^^^
meaning that the returned object gets encoded with information that essentially says "you can assume that this data is free of NA values".
As an example, Rcpp::sum is implemented via the Sum class in the Rcpp::sugar namespace. In the default case, we see that there is extra work to manage the possibility of missing values:
STORAGE get() const {
STORAGE result = 0 ;
R_xlen_t n = object.size() ;
STORAGE current ;
for( R_xlen_t i=0; i<n; i++){
current = object[i] ;
if( Rcpp::traits::is_na<RTYPE>(current) ) // here
return Rcpp::traits::get_na<RTYPE>() ; // here
result += current ;
}
return result ;
}
On the other hand, there is also a specialization for cases when the input does not have missing values, in which the algorithm does less work:
STORAGE get() const {
STORAGE result = 0 ;
R_xlen_t n = object.size() ;
for( R_xlen_t i=0; i<n; i++){
result += object[i] ;
}
return result ;
}
To answer your question of "how do I apply this in practice?", here is an example:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int Sum(IntegerVector x) {
return sum(x);
}
// [[Rcpp::export]]
int SumNoNA(IntegerVector x) {
return sum(noNA(x));
}
Benchmarking these two functions,
set.seed(123)
x <- as.integer(rpois(1e6, 25))
all.equal(Sum(x), SumNoNA(x))
# [1] TRUE
microbenchmark::microbenchmark(
Sum(x),
SumNoNA(x),
times = 500L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Sum(x) 577.386 664.620 701.2422 677.1640 731.7090 1214.447 500
# SumNoNA(x) 454.990 517.709 556.5783 535.1935 582.7065 1138.426 500
the noNA version is indeed faster.
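The dispatch idea behind this can be sketched in plain C++, independent of Rcpp. The names SumImpl and kNaInt below are hypothetical and do not mirror Rcpp's actual class layout. The only fact borrowed from R is that NA_integer_ is stored as INT_MIN. A boolean template parameter selects, at compile time, whether the loop checks for the NA sentinel.

```cpp
#include <climits>
#include <vector>

// Stand-in for R's NA_integer_, which R stores as INT_MIN.
const int kNaInt = INT_MIN;

// Generic case: the input may contain NA, so every element is checked.
template <bool CAN_HAVE_NA>
struct SumImpl {
    static int get(const std::vector<int>& v) {
        int result = 0;
        for (int value : v) {
            if (value == kNaInt) return kNaInt;  // propagate NA
            result += value;
        }
        return result;
    }
};

// Specialization for input asserted to be NA-free: no per-element check.
template <>
struct SumImpl<false> {
    static int get(const std::vector<int>& v) {
        int result = 0;
        for (int value : v) result += value;
        return result;
    }
};
```

Calling SumImpl<false>::get plays the role of sum(noNA(x)): the caller takes responsibility for the assertion, and the compiler emits the simpler loop.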

Rcpp Function slower than Rf_eval

I have been working on a package that uses Rcpp to apply arbitrary R code over a group of large medical imaging files. I noticed that my Rcpp implementation is considerably slower than the original pure C version. I traced the difference to calling a function via Function, vs the original Rf_eval. My question is why is there a close to 4x performance degradation, and is there a way to speed up the function call to be closer in performance to Rf_eval?
Example:
library(Rcpp)
library(inline)
library(microbenchmark)
cpp_fun1 <-
'
Rcpp::List lots_of_calls(Function fun, NumericVector vec){
Rcpp::List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = fun(NumericVector(vec));
}
return output;
}
'
cpp_fun2 <-
'
Rcpp::List lots_of_calls2(SEXP fun, SEXP env){
Rcpp::List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = Rf_eval(fun, env);
}
return output;
}
'
lots_of_calls <- cppFunction(cpp_fun1)
lots_of_calls2 <- cppFunction(cpp_fun2)
microbenchmark(lots_of_calls(mean, 1:1000),
lots_of_calls2(quote(mean(1:1000)), .GlobalEnv))
Results
Unit: milliseconds
expr min lq mean median uq max neval
lots_of_calls(mean, 1:1000) 38.23032 38.80177 40.84901 39.29197 41.62786 54.07380 100
lots_of_calls2(quote(mean(1:1000)), .GlobalEnv) 10.53133 10.71938 11.08735 10.83436 11.03759 18.08466 100
Rcpp is great because it makes things look absurdly clean to the programmer. The cleanliness has a cost in the form of templated responses and a set of assumptions that weigh down the execution time. But, such is the case with a generalized vs. specific code setup.
Take for instance the call route for an Rcpp::Function. After the initial construction, the call goes through a modified version of Rf_eval, an Rcpp-specific eval function given in Rcpp_eval.h. In turn, this function is wrapped in protections that guard against an R-level error when calling into R, with a Shield associated with it. And so on...
In comparison, Rf_eval has neither. If it fails, you will be up the creek without a paddle. (Unless, of course, you implement error catching via R_tryEval for it.)
With this being said, the best way to speed up the calculation is to simply write everything necessary for the computation in C++.
Besides the points made by @coatless, you aren't even comparing apples with apples. Your Rf_eval example does not pass the vector to the function, and, more importantly, plays tricks on the function via quote().
In short, it is all a little silly.
Below is a more complete example using the sugar function mean().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List callFun(Function fun, NumericVector vec) {
List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = fun(NumericVector(vec));
}
return output;
}
// [[Rcpp::export]]
List callRfEval(SEXP fun, SEXP env){
List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = Rf_eval(fun, env);
}
return output;
}
// [[Rcpp::export]]
List callSugar(NumericVector vec) {
List output(1000);
for(int i = 0; i < 1000; ++i){
double d = mean(vec);
output[i] = d;
}
return output;
}
/*** R
library(microbenchmark)
microbenchmark(callFun(mean, 1:1000),
callRfEval(quote(mean(1:1000)), .GlobalEnv),
callSugar(1:1000))
*/
You can just sourceCpp() this:
R> sourceCpp("/tmp/ch.cpp")
R> library(microbenchmark)
R> microbenchmark(callFun(mean, 1:1000),
+ callRfEval(quote(mean(1:1000)), .GlobalEnv),
+ callSugar(1:1000))
Unit: milliseconds
expr min lq mean median uq max neval
callFun(mean, 1:1000) 14.87451 15.54385 18.57635 17.78990 18.29127 114.77153 100
callRfEval(quote(mean(1:1000)), .GlobalEnv) 3.35954 3.57554 3.97380 3.75122 4.16450 6.29339 100
callSugar(1:1000) 1.50061 1.50827 1.62204 1.51518 1.76683 1.84513 100
R>

Converting String Versions of "Infinity" to Numeric in Rcpp

I have a JSON response that encodes Inf/-Inf/NaN as strings, so the JSON array it returns looks like [1.0, "Infinity", 2.0]. I parse this using a JSON library and end up with a list that looks like list(1.0, "Infinity", 2.0), and I want to convert it to list(1.0, Inf, 2.0). For performance reasons I need to do this with Rcpp. Here is the code I tried, but I can't get Rcpp to accept it:
library(Rcpp)
cppFunction('
NumericVector convertThings(List data) {
const size_t num_rows = data.size();
NumericVector rv(num_rows);
for (size_t i = 0; i < num_rows; ++i) {
if (as<String>(data[i]) == "Infinity") {
rv[i] = R_PosInf;
} else {
rv[i] = as<double>(data[i]);
}
}
return rv;
}
')
convertThings(list('Infinity', 1.0))
# expected output c(Inf, 1.0)
The error I am seeing is Error: not compatible with requested type. Help is much appreciated!
That is a basic C++ problem: how to convert text to numbers reliably.
One possible answer is provided by the Boost.Lexical_Cast library and illustrated in this Rcpp Gallery post. Just using the first example:
R> library(Rcpp)
R> sourceCpp("/tmp/boostLexicalCastExample.cpp") # from post
R> lexicalCast(c("Inf", "inf", "Infinity", "NA", 42))
[1] Inf Inf Inf NA 42
R>
As you can see, it matches at least three different ways of spelling infinity in text.
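If Boost is not available, a similar conversion can be sketched with the standard library alone. This is a different technique from the Boost example above: since C++11, std::strtod accepts "inf", "infinity" and "nan" regardless of case. The helper name parse_double is hypothetical, and mapping unparseable text to NaN is an assumption made for this sketch.

```cpp
#include <cmath>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical helper: convert text to double, returning NaN unless the
// whole string is a valid number (including "inf"/"infinity"/"nan").
double parse_double(const std::string& text) {
    const char* begin = text.c_str();
    char* end = nullptr;
    const double value = std::strtod(begin, &end);
    // Reject empty input and trailing garbage such as "12abc" or "NA".
    if (end == begin || *end != '\0') {
        return std::numeric_limits<double>::quiet_NaN();
    }
    return value;
}
```

In an Rcpp function, the string elements of the list could be passed through such a helper while numeric elements are copied as they are. Note that strtod follows the current C locale when it looks for the decimal separator.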

Fast index of lower upper bound in R

I'm trying to find the index of the lower upper bound in R.
This is the same problem that findInterval solves, but findInterval checks whether its argument is sorted, and I want to avoid that because I know that it is sorted.
I'm trying to call the underlying C function directly, but I'm confused on whether I should call findInterval or find_interv_vec.
Also, I try to make the call, but can't seem to find the function
findInterval2 <- function (x, vec, rightmost.closed = FALSE, all.inside = TRUE)
{
nx <- length(x)
index <- integer(nx)
.C('find_interv_vec', xt=as.double(vec), n=length(vec),
x=as.double(x), nx=nx, as.logical(rightmost.closed),
as.logical(all.inside), index, DUP = FALSE, NAOK=T,
PACKAGE='base')
index
}
I get
Error in .C("find_interv_vec", xt = as.double(vec), n = length(vec), x = as.double(x), :
"find_interv_vec" not available for .C() for package "base"
On the other hand, I read that it is better to use .Call than the old .C, especially because .C copies its arguments, and my vec is really big. How should I structure the call to .Call?
Thanks!
After some research and the very helpful answer of @MartinMorgan I decided to do something similar to his answer.
I created some functions which emulate findInterval without checking whether vec is sorted. Clearly this makes a big difference when x is of length 1 and you call it over and over again. If x is of length >> 1 and you can take advantage of vectorization, then findInterval only checks once whether vec is sorted.
In the following code chunk I created some variants of findInterval:
findInterval2, which is findInterval written in R as a binary search without the sortedness check
findInterval2Comp, which is findInterval2 compiled with cmpfun
findIntervalC, which is findInterval written in C as a binary search compiled with the inline package
After that, I created 2 functions to test
testByOne, which runs findInterval for x of length 1
testVec, which uses vectorization
For testVec, all the functions I created were vectorized in the x argument with Vectorize.
After that, I timed the execution with microbenchmark.
Code
require(inline)
require(compiler)        # for cmpfun
require(microbenchmark)  # for the timings below
# findInterval written in R as a binary search
findInterval2 <- function(x,v) {
n = length(v)
if (x<v[1])
return (0)
if (x>=v[n])
return (n)
i=1
k=n
while({j = (k-i) %/% 2 + i; !(v[j] <= x && x < v[j+1])}) {
if (x < v[j])
k = j
else
i = j+1
}
return (j)
}
findInterval2Vec = Vectorize(findInterval2,vectorize.args="x")
# findInterval2 compiled with cmpfun
findInterval2Comp <- cmpfun(findInterval2)
findInterval2CompVec <- Vectorize(findInterval2Comp,vectorize.args="x")
findInterval2VecComp <- cmpfun(findInterval2Vec)
findInterval2CompVecComp <- cmpfun(findInterval2CompVec)
sig <-signature(x="numeric",v="numeric",n="integer",idx="integer")
code <- "
if (*x < v[0]) {
*idx = -1;
return;
}
if (*x >= v[*n-1]) {
*idx = *n-1;
return;
}
int i,j,k;
i = 0;
k = *n-1;
while (j = (k-i) / 2 + i, !(v[j] <= *x && *x < v[j+1])) {
if (*x < v[j]) {
k = j;
}
else {
i = j+1;
}
}
*idx=j;
return;
"
fn <- cfunction(sig=sig,body=code,language="C",convention=".C")
# findInterval written in C
findIntervalC <- function(x,v) {
idx = as.integer(-1)
as.integer((fn(x,v,length(v),idx)$idx)+1)
}
findIntervalCVec <- Vectorize(findIntervalC,vectorize.args="x")
# The test case where x is of length 1 and you call findInterval several times
testByOne <- function(f,reps = 100, vlength = 300000, xs = NULL) {
if (is.null(xs))
xs = seq(from=1,to=vlength-1,by=vlength/reps)
v = 1:vlength
for (x in xs)
f(x,v)
}
# The test case where you can take advantage of vectorization
testVec <- function(f,reps = 100, vlength = 300000, xs = NULL) {
if (is.null(xs))
xs = seq(from=1,to=vlength-1,by=vlength/reps)
v = 1:vlength
f(xs,v)
}
Benchmarking
microbenchmark(fi=testByOne(findInterval),fi2=testByOne(findInterval2),fi2comp=testByOne(findInterval2Comp),fic=testByOne(findIntervalC))
Unit: milliseconds
expr min lq median uq max neval
fi 617.536422 648.19212 659.927784 685.726042 754.12988 100
fi2 11.308138 11.60319 11.734305 12.067857 71.98640 100
fi2comp 2.293874 2.52145 2.637388 5.036558 62.01111 100
fic 368.002442 380.81847 416.137318 424.250337 474.31542 100
microbenchmark(fi=testVec(findInterval),fi2=testVec(findInterval2Vec),fi2compVec=testVec(findInterval2CompVec),fi2vecComp=testVec(findInterval2VecComp),fic=testByOne(findIntervalCVec))
Unit: milliseconds
expr min lq median uq max neval
fi 4.218191 4.986061 6.875732 10.216228 68.51321 100
fi2 12.982914 13.786563 16.738707 19.102777 75.64573 100
fi2compVec 4.264839 4.650925 4.902277 9.892413 13.32756 100
fi2vecComp 13.000124 13.689418 14.072334 18.911659 76.19146 100
fic 840.446529 893.445185 908.549874 919.152187 1047.84978 100
Some observations
There must be something wrong in my C code, it can't be that slow
It's better to compile and then vectorize than to vectorize and then compile
It's weird that fi2comp runs faster than fi2
Compiling again a vectorized compiled function doesn't increase its performance
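For comparison, the same lookup can be sketched in plain C++ with std::upper_bound, which does a binary search and never checks sortedness. The helper name find_interval is hypothetical. The caller is assumed to guarantee that v is sorted in ascending order, as in the question. The result matches findInterval's default convention: the number of elements of v that are less than or equal to x, so 0 when x is below v[1] and n when x is at or above v[n].

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: 1-based interval index in the style of findInterval.
// Assumes v is sorted ascending; this is NOT checked.
std::size_t find_interval(double x, const std::vector<double>& v) {
    // upper_bound returns the first position whose element is > x,
    // so its offset equals the count of elements <= x.
    return static_cast<std::size_t>(
        std::upper_bound(v.begin(), v.end(), x) - v.begin());
}
```

Called from R through .Call or Rcpp, a function like this would avoid the argument copy that .C makes on each call.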

Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
Here is a revised version of a C++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C, so I am sure that someone might still be able to improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the selected columns of the matrix without the rows that have missing values. keep takes just one argument X and returns a logical vector.
Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
// select the requested columns (R indices are 1-based, Armadillo's are 0-based)
X = X.cols(idx - 1);
// get dimensions
int n = X.n_rows,k = X.n_cols;
// create keep vector
vec keep = ones<vec>(n);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
// alternative with view for each row (slightly slower)
/*vec keep = zeros<vec>(n);
for (int i = 0; i < n; i++) {
keep(i) = is_finite(X.row(i));
}*/
return (X.rows(find(keep==1)));
}
// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
int n = X.nrow(), k = X.ncol();
// create keep vector
LogicalVector keep(n, true);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
return (keep);
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")
idx=sample(ncol(X),90)
microbenchmark(
X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
X[rowSums(is.na(X[,idx])) == 0, idx],
cleanmat(X,idx),
X[keep(X[,idx]),idx],
times=3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is OK, but it creates a large temporary logical matrix with nrow(x) rows and length(varcols) columns. The ==0 on the result of rowSums needs a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So, the very fastest (I guess), would be a C version like this (pseudo-code) :
keep = rep(TRUE,nrow(x))
for (j=0; j<length(varcols); j++)
for (i=0; i<nrow(x); i++)
if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.
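The pseudo-code above can be sketched as plain C++. The helper name complete_rows and the flat column-major std::vector layout are assumptions made for this illustration. R stores numeric matrices column by column, which is why the column loop is the outer one. std::isnan is true for both NA and NaN in a double column.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: data holds an nrow x ncol matrix in column-major
// order (element (i, j) lives at data[j * nrow + i]).
// Returns keep[i] == true for rows without any NaN.
std::vector<bool> complete_rows(const std::vector<double>& data,
                                std::size_t nrow, std::size_t ncol) {
    assert(data.size() == nrow * ncol);
    std::vector<bool> keep(nrow, true);  // the single allocation
    for (std::size_t j = 0; j < ncol; ++j) {      // columns outside
        for (std::size_t i = 0; i < nrow; ++i) {  // contiguous reads inside
            // keep[i] is tested first, so rows already dropped skip the load.
            if (keep[i] && std::isnan(data[j * nrow + i])) keep[i] = false;
        }
    }
    return keep;
}
```

For a data.table the columns are separate vectors, so an Rcpp version would loop over the selected columns instead of indexing one flat buffer.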
Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04
