Improving performance of a loop with succeeding string replacements? - r

I have (HTML) texts and I want to change entities like &ouml; into the real characters (ä, ü, ö, and so on), because otherwise the XML package does not accept them.
So I wrote a little function that cycles through a replacement table (link1, link2) and replaces special character by special character by sp... The function looks like this (only looonger):
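library(stringr)  # provides str_replace_all()
html.charconv <- function(text){
  replacer <- matrix(c(
    "Á", "&Aacute;",
    "á", "&aacute;",
    "Â", "&Acirc;",
    "â", "&acirc;",
    "´", "&acute;"
    )
    ,ncol=2,byrow=TRUE)
  # column 2 holds the entity to search for, column 1 its replacement
  for(i in seq_len(nrow(replacer))){
    text <- str_replace_all(text,replacer[i,2],replacer[i,1])
  }
  text
}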
How might I speed this up? I thought about vectorization but did not come up with a workable solution, because each cycle starts from the result of the previous one.

You can get a significant speedup by constructing your function a bit differently and forgetting about the text tools. Basically you:
split the character string
match the pieces you want and replace them with the new characters
paste everything together again
You can do that with the following function:
html.fastconv <- function(x,old,new){
xs <- strsplit(x,"&|;")
old <- gsub("&|;","",old)
xs <- lapply(xs,function(i){
id <- match(i,old,0L)
i[id!=0] <- new[id]
return(i)
})
sapply(xs,paste,collapse="")
}
It works like this:
> sometext <- c("&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute problems et&acute; cetera",
+ "Ádd somá leÂtterâ acute p ..." ... [TRUNCATED]
> newchar <- c("Á","á","Â","â","´")
> oldchar <- c("&Aacute;","&aacute;","&Acirc;","&acirc;","&acute;")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"
For the record, some benchmarking:
require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
columns=c("test","elapsed","relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.79 5.643
1 html.fastconv(sometext, oldchar, newchar) 0.14 1.000
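The split / match / paste idea can also be written as a single left-to-right scan. Below is a minimal sketch in plain C++ (standard library only, not tied to R or Rcpp). The function name decode_entities and the lookup table layout are assumptions made for illustration, not part of the answer above. It looks up each name found between '&' and ';' in a hash map and copies everything else unchanged.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Decode named entities such as "&aacute;" in one pass over the text.
// `table` maps entity names without '&' and ';' (e.g. "aacute") to their
// UTF-8 replacements. Unknown entities and stray '&' are copied unchanged.
// (Hypothetical helper, written for illustration only.)
std::string decode_entities(
    const std::string& text,
    const std::unordered_map<std::string, std::string>& table) {
  std::string out;
  out.reserve(text.size());
  std::size_t i = 0;
  while (i < text.size()) {
    if (text[i] == '&') {
      std::size_t end = text.find(';', i + 1);
      if (end != std::string::npos) {
        auto hit = table.find(text.substr(i + 1, end - i - 1));
        if (hit != table.end()) {
          out += hit->second;  // known entity: emit its replacement
          i = end + 1;         // continue after the ';'
          continue;
        }
      }
    }
    out += text[i++];          // ordinary character or unknown entity
  }
  return out;
}
```

Unlike the strsplit version, this sketch leaves '&' and ';' characters that do not belong to a known entity in place. The cost grows with the length of the text rather than with text length times table size.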

Just for fun, here is a version based on Rcpp.
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
CharacterVector rcpp_conv(
CharacterVector text, CharacterVector old , CharacterVector new_){
int n = text.size() ;
int nr = old.size() ;
std::string buffer, current_old, current_new ;
size_t pos, current_size ;
CharacterVector res(n) ;
for( int i=0; i<n; i++){
buffer = text[i] ;
for( int j=0; j<nr; j++){
current_old = old[j] ;
current_size = current_old.size() ;
current_new = new_[j] ;
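            // resume each search after the text just inserted, so the buffer is
            // not rescanned from the start and a replacement is never re-matched
            pos = buffer.find( current_old ) ;
            while( pos != std::string::npos ){
                buffer.replace(
                    pos, current_size,
                    current_new
                ) ;
                pos = buffer.find( current_old, pos + current_new.size() ) ;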
}
}
res[i] = buffer ;
}
return res ;
}
With this I get a further performance gain:
> microbenchmark(
+ html.fastconv( sometext,oldchar,newchar),
+ html.fastconvJC(sometext, oldchar, newchar),
+ rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
                                         expr    min      lq   median      uq     max
1   html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072 256.061
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060  25.8110  28.134  40.647
3       rcpp_conv(sometext, oldchar, newchar)  4.047  5.1555   6.2340   9.275  25.763
Here is an implementation based on the Rcpp::String feature, available from Rcpp >= 0.10.2:
class StringConv{
public:
typedef String result_type ;
StringConv( CharacterVector old_, CharacterVector new__):
nr(old_.size()), old(old_), new_(new__){}
String operator()(String text) const {
for( int i=0; i<nr; i++){
text.replace_all( old[i], new_[i] ) ;
}
return text ;
}
private:
int nr ;
CharacterVector old ;
CharacterVector new_ ;
} ;
// [[Rcpp::export]]
CharacterVector test_sapply_string(
CharacterVector text, CharacterVector old , CharacterVector new_
){
CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
return res ;
}

I'm guessing that the 36,000 file reads and writes are your bottleneck, and the way you code in R can't help much with that. Some things just take a while. Your function looks like it will work correctly, so just let it run. There are a few small improvements you could make, though.
replacer <- matrix(c(
  "Á", "&Aacute;",
  "á", "&aacute;",
  "Â", "&Acirc;",
  "â", "&acirc;",
  "´", "&acute;"
  )
  ,ncol=2, byrow=TRUE)
html.fastconvJC <- function(x,old,new){
n <- length(new)
s <- x #make a copy cause I'm scared of scoping in R :)
for (i in 1:n) s <- gsub(old[i], new[i], s, fixed = TRUE)
s
}
# borrowing the strings from Joris Meys
benchmark(html.fastconvJC(sometext, replacer[,2], replacer[,1]),
html.charconv(sometext), columns = c("test", "elapsed", "relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.727 17.31
1 html.fastconvJC(sometext, replacer[, 2], replacer[, 1]) 0.042 1.00
And they increased the speed more than I expected. Note that a huge part of that speedup comes from setting fixed = TRUE; without it, Joris Meys' answer comes in at about the same speed.
If this doesn't get you far in overall speed, you know your bottleneck is elsewhere, likely in the file reads and writes. Unless you have solid-state or RAID drives, running this in parallel isn't going to speed anything up and might just slow it down.
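To illustrate what fixed = TRUE changes, here is a small sketch in plain C++ rather than R. The two helper names are made up for this example. A literal replacement compares the pattern byte for byte, while a regex replacement first compiles the pattern and treats characters such as '.' as metacharacters. For entity strings like "&aacute;" the two give the same result, so the literal route is the cheaper choice.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Literal replacement: no regex compilation and no metacharacters
// (the idea behind fixed = TRUE). Hypothetical helper for illustration.
std::string replace_literal(std::string s, const std::string& from,
                            const std::string& to) {
  if (from.empty()) return s;          // nothing sensible to match
  std::size_t pos = 0;
  while ((pos = s.find(from, pos)) != std::string::npos) {
    s.replace(pos, from.size(), to);
    pos += to.size();                  // resume after the inserted text
  }
  return s;
}

// Regex replacement: the pattern is compiled into a matcher first,
// and '.' matches any character.
std::string replace_regex(const std::string& s, const std::string& pattern,
                          const std::string& to) {
  return std::regex_replace(s, std::regex(pattern), to);
}
```

With the pattern "a.b" and the input "a.b axb", the literal version replaces only the first token, while the regex version replaces both.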

I will try with plyr:
input.data <- llply(input.files, html.charconv, .parallel=TRUE)

Related

Rcpp Function slower than Rf_eval

I have been working on a package that uses Rcpp to apply arbitrary R code over a group of large medical imaging files. I noticed that my Rcpp implementation is considerably slower than the original pure C version. I traced the difference to calling a function via Function, vs the original Rf_eval. My question is why is there a close to 4x performance degradation, and is there a way to speed up the function call to be closer in performance to Rf_eval?
Example:
library(Rcpp)
library(inline)
library(microbenchmark)
cpp_fun1 <-
'
Rcpp::List lots_of_calls(Function fun, NumericVector vec){
Rcpp::List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = fun(NumericVector(vec));
}
return output;
}
'
cpp_fun2 <-
'
Rcpp::List lots_of_calls2(SEXP fun, SEXP env){
Rcpp::List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = Rf_eval(fun, env);
}
return output;
}
'
lots_of_calls <- cppFunction(cpp_fun1)
lots_of_calls2 <- cppFunction(cpp_fun2)
microbenchmark(lots_of_calls(mean, 1:1000),
lots_of_calls2(quote(mean(1:1000)), .GlobalEnv))
Results
Unit: milliseconds
expr min lq mean median uq max neval
lots_of_calls(mean, 1:1000) 38.23032 38.80177 40.84901 39.29197 41.62786 54.07380 100
lots_of_calls2(quote(mean(1:1000)), .GlobalEnv) 10.53133 10.71938 11.08735 10.83436 11.03759 18.08466 100
Rcpp is great because it makes things look absurdly clean to the programmer. The cleanliness has a cost in the form of templated responses and a set of assumptions that weigh down the execution time. But, such is the case with a generalized vs. specific code setup.
Take, for instance, the call route for an Rcpp::Function. After the initial construction, the outside call goes through a modified version of Rf_eval: a special Rcpp-specific eval function given in Rcpp_eval.h. In turn, this function is wrapped in protections, with a Shield associated with it, to guard against an error in the function when calling into R. And so on...
In comparison, Rf_eval has neither. If it fails, you will be up the creek without a paddle. (Unless, of course, you implement error catching via R_tryEval for it.)
With this being said, the best way to speed up the calculation is to simply write everything necessary for the computation in C++.
Besides the points made by @coatless, you aren't even comparing apples with apples. Your Rf_eval example does not pass the vector to the function and, more importantly, plays tricks on the function via quote().
In short, it is all a little silly.
Below is a more complete example using the sugar function mean().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List callFun(Function fun, NumericVector vec) {
List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = fun(NumericVector(vec));
}
return output;
}
// [[Rcpp::export]]
List callRfEval(SEXP fun, SEXP env){
List output(1000);
for(int i = 0; i < 1000; ++i){
output[i] = Rf_eval(fun, env);
}
return output;
}
// [[Rcpp::export]]
List callSugar(NumericVector vec) {
List output(1000);
for(int i = 0; i < 1000; ++i){
double d = mean(vec);
output[i] = d;
}
return output;
}
/*** R
library(microbenchmark)
microbenchmark(callFun(mean, 1:1000),
callRfEval(quote(mean(1:1000)), .GlobalEnv),
callSugar(1:1000))
*/
You can just sourceCpp() this:
R> sourceCpp("/tmp/ch.cpp")
R> library(microbenchmark)
R> microbenchmark(callFun(mean, 1:1000),
+ callRfEval(quote(mean(1:1000)), .GlobalEnv),
+ callSugar(1:1000))
Unit: milliseconds
expr min lq mean median uq max neval
callFun(mean, 1:1000) 14.87451 15.54385 18.57635 17.78990 18.29127 114.77153 100
callRfEval(quote(mean(1:1000)), .GlobalEnv) 3.35954 3.57554 3.97380 3.75122 4.16450 6.29339 100
callSugar(1:1000) 1.50061 1.50827 1.62204 1.51518 1.76683 1.84513 100
R>

Remove NA values efficiently

I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.
Here is an example:
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::VectorXd;
typedef Map<VectorXd> MapVecd;
'
body <- '
const MapVecd x(as<MapVecd>(xx)), y(as<MapVecd>(yy));
VectorXd x1(x), y1(y);
int k(0);
for (int i = 0; i < x.rows(); ++i) {
if (x.coeff(i)==x.coeff(i) && y.coeff(i)==y.coeff(i)) {
x1(k) = x.coeff(i);
y1(k) = y.coeff(i);
k++;
};
};
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
Rcpp::Named("y") = y1);
'
na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy= "Vector"),
body, "RcppEigen", incl)
na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7
In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).
PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.
Perhaps I am not understanding the question correctly, but within Rcpp, I don't see how you could possibly do this more efficiently than a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery. But this is not the case once you are down at the C++ level. Even natively vectorized R functions ultimately are implemented with for loops in C. So the only way I can think to make this more efficient is to try to do it in parallel.
For example, here's a simple na.omit.cpp function that omits NA values from a single vector:
rcppfun<-"
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
std::vector<double> r(x.size());
int k=0;
for (int i = 0; i < x.size(); ++i) {
if (x[i]==x[i]) {
r[k] = x[i];
k++;
}
}
r.resize(k);
return Rcpp::wrap(r);
}"
na.omit.cpp<-cppFunction(rcppfun)
This runs even more quickly than R's built-in na.omit:
> set.seed(123)
> x<-1:10000
> x[sample(10000,1000)]<-NA
> y1<-na.omit(x)
> y2<-na.omit.cpp(x)
> all(y1==y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x),na.omit.cpp(x))
Unit: microseconds
expr min lq median uq max neval
na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447 100
na.omit.cpp(x) 107.524 168.1955 173.6035 210.524 222.564 100
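For the paired case in the question (drop position i when either x[i] or y[i] is missing), the same loop can be sketched in plain C++ with std::vector instead of Eigen or Rcpp types. The function name drop_nan_pairs is an assumption for this example. In R, NA_real_ is itself a NaN bit pattern, so a NaN test such as x == x or std::isnan covers both NA and NaN in double vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Keep only the positions where both x[i] and y[i] are not NaN.
// The inputs are assumed to have equal length.
// (Hypothetical helper, written for illustration only.)
std::pair<std::vector<double>, std::vector<double>>
drop_nan_pairs(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  std::vector<double> xo, yo;
  xo.reserve(x.size());  // one allocation up front, as in the answers above
  yo.reserve(y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (!std::isnan(x[i]) && !std::isnan(y[i])) {
      xo.push_back(x[i]);
      yo.push_back(y[i]);
    }
  }
  return {std::move(xo), std::move(yo)};
}
```

This is still a single pass over the data, which is about as cheap as the operation gets. If it runs a million times in a loop, reusing the output buffers between iterations would be the next thing to try.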
I do not know whether I understand the problem correctly, but in plain R you can use logical indexing:
a = c(1.5, NaN, 7, NA)
a[!is.na(a)]  # unlike a[-which(is.na(a))], this also works when there are no NAs
[1] 1.5 7.0
It might be useful to use RInside if you want to call this from C++.

Fast index of lower upper bound in R

I'm trying to find the index of the lower upper bound in R.
This is the same problem that findInterval solves, but findInterval checks whether its argument is sorted, and I want to avoid that because I know that it is sorted.
I'm trying to call the underlying C function directly, but I'm confused about whether I should call findInterval or find_interv_vec.
Also, when I try to make the call, I can't seem to find the function:
findInterval2 <- function (x, vec, rightmost.closed = FALSE, all.inside = TRUE)
{
nx <- length(x)
index <- integer(nx)
.C('find_interv_vec', xt=as.double(vec), n=length(vec),
x=as.double(x), nx=nx, as.logical(rightmost.closed),
as.logical(all.inside), index, DUP = FALSE, NAOK=T,
PACKAGE='base')
index
}
I get
Error in .C("find_interv_vec", xt = as.double(vec), n = length(vec), x = as.double(x), :
"find_interv_vec" not available for .C() for package "base"
On the other hand, I read that it is better to use .Call than the old .C, especially because .C copies its arguments, and my vec is really big. How should I structure the call to .Call?
Thanks!
After some research and the very helpful answer of @MartinMorgan, I decided to do something similar to his answer.
I created some functions that emulate findInterval without checking whether vec is sorted. Clearly this makes a big difference when x is of length 1 and you call it over and over again. If x is of length >> 1 and you can take advantage of vectorization, then findInterval checks only once whether vec is sorted.
In the following code chunk I created some variants of findInterval:
findInterval2, which is findInterval written in R as a binary search without the sortedness check
findInterval2Comp, which is findInterval2 compiled with cmpfun
findIntervalC, which is findInterval written in C as a binary search and compiled with the inline package
After that, I created 2 functions to test them:
testByOne, which runs findInterval for x of length 1
testVec, which uses vectorization
For testVec, all the functions I created were vectorized in the x argument with Vectorize.
After that, I timed the execution with microbenchmark.
Code
require(inline)
require(compiler)  # provides cmpfun
# findInterval written in R as a binary search
findInterval2 <- function(x,v) {
n = length(v)
if (x<v[1])
return (0)
if (x>=v[n])
return (n)
i=1
k=n
while({j = (k-i) %/% 2 + i; !(v[j] <= x && x < v[j+1])}) {
if (x < v[j])
k = j
else
i = j+1
}
return (j)
}
findInterval2Vec = Vectorize(findInterval2,vectorize.args="x")
# findInterval2 compiled with cmpfun
findInterval2Comp <- cmpfun(findInterval2)
findInterval2CompVec <- Vectorize(findInterval2Comp,vectorize.args="x")
findInterval2VecComp <- cmpfun(findInterval2Vec)
findInterval2CompVecComp <- cmpfun(findInterval2CompVec)
sig <-signature(x="numeric",v="numeric",n="integer",idx="integer")
code <- "
if (*x < v[0]) {
*idx = -1;
return;
}
if (*x >= v[*n-1]) {
*idx = *n-1;
return;
}
int i,j,k;
i = 0;
k = *n-1;
while (j = (k-i) / 2 + i, !(v[j] <= *x && *x < v[j+1])) {
if (*x < v[j]) {
k = j;
}
else {
i = j+1;
}
}
*idx=j;
return;
"
fn <- cfunction(sig=sig,body=code,language="C",convention=".C")
# findInterval written in C
findIntervalC <- function(x,v) {
idx = as.integer(-1)
as.integer((fn(x,v,length(v),idx)$idx)+1)
}
findIntervalCVec <- Vectorize(findIntervalC,vectorize.args="x")
# The test case where x is of length 1 and you call findInterval several times
testByOne <- function(f,reps = 100, vlength = 300000, xs = NULL) {
if (is.null(xs))
xs = seq(from=1,to=vlength-1,by=vlength/reps)
v = 1:vlength
for (x in xs)
f(x,v)
}
# The test case where you can take advantage of vectorization
testVec <- function(f,reps = 100, vlength = 300000, xs = NULL) {
if (is.null(xs))
xs = seq(from=1,to=vlength-1,by=vlength/reps)
v = 1:vlength
f(xs,v)
}
Benchmarking
microbenchmark(fi=testByOne(findInterval),fi2=testByOne(findInterval2),fi2comp=testByOne(findInterval2Comp),fic=testByOne(findIntervalC))
Unit: milliseconds
expr min lq median uq max neval
fi 617.536422 648.19212 659.927784 685.726042 754.12988 100
fi2 11.308138 11.60319 11.734305 12.067857 71.98640 100
fi2comp 2.293874 2.52145 2.637388 5.036558 62.01111 100
fic 368.002442 380.81847 416.137318 424.250337 474.31542 100
microbenchmark(fi=testVec(findInterval),fi2=testVec(findInterval2Vec),fi2compVec=testVec(findInterval2CompVec),fi2vecComp=testVec(findInterval2VecComp),fic=testByOne(findIntervalCVec))
Unit: milliseconds
expr min lq median uq max neval
fi 4.218191 4.986061 6.875732 10.216228 68.51321 100
fi2 12.982914 13.786563 16.738707 19.102777 75.64573 100
fi2compVec 4.264839 4.650925 4.902277 9.892413 13.32756 100
fi2vecComp 13.000124 13.689418 14.072334 18.911659 76.19146 100
fic 840.446529 893.445185 908.549874 919.152187 1047.84978 100
Some observations
There must be something wrong in my C code, it can't be that slow
It's better to compile and then vectorize than to vectorize and then compile
It's weird that fi2comp runs faster than fi2
Compiling again a vectorized compiled function doesn't increase its performance
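For comparison, the unchecked binary search can be sketched in plain C++ with std::upper_bound. This is a sketch with made-up function names, not the code benchmarked above, and it is not wired into R. The number of elements of a sorted v that are <= x is exactly the 1-based index that findInterval returns by default. One possible reason for the slow .C variant, which is an inference and not something measured here, is that the .C interface coerces and copies its arguments on every call, so each lookup would pay for the whole of v.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Number of elements of the sorted vector v that are <= x. This equals
// the 1-based index i with v[i] <= x < v[i+1] in R terms: 0 when x is
// below the first element, n when x is at or above the last one.
// v is assumed to be sorted in non-decreasing order; this is not checked.
int find_interval(double x, const std::vector<double>& v) {
  return static_cast<int>(std::upper_bound(v.begin(), v.end(), x) - v.begin());
}

// Vectorized form: one binary search per query, v is passed by reference
// and never copied.
std::vector<int> find_interval_vec(const std::vector<double>& xs,
                                   const std::vector<double>& v) {
  std::vector<int> out;
  out.reserve(xs.size());
  for (double x : xs) out.push_back(find_interval(x, v));
  return out;
}
```

Each lookup costs O(log n) comparisons and touches v only by reference. A .Call or Rcpp wrapper around such a function would need to preserve that property to beat the compiled R version.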

Convert RcppArmadillo vector to Rcpp vector

I am trying to convert an RcppArmadillo vector (e.g. arma::colvec) to an Rcpp vector (NumericVector). I know I can first convert the arma::colvec to SEXP and then convert the SEXP to NumericVector (e.g. as<NumericVector>(wrap(temp)), assuming temp is an arma::colvec object). But what is a good way to do that?
I want to do that simply because I am unsure if it is okay to pass arma::colvec object as a parameter to an Rcpp::Function object.
I was trying to evaluate an Rcpp::Function with an arma::vec argument, and it seems that it accepts the argument in four forms without compilation errors. That is, if f is an Rcpp::Function and a is an arma::vec, then
f(a)
f(wrap(a))
f(as<NumericVector>(wrap(a)))
f(NumericVector(a.begin(),a.end()))
produce no compilation or runtime errors, at least apparently.
To check, I ran a small test of the four argument forms. Since I suspected that something might go wrong during garbage collection, I also tested them under gctorture.
gctorture(on=FALSE)
Rcpp::sourceCpp(code = '
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
double foo1(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(a, b));
}
return sum;
}
// [[Rcpp::export]]
double foo2(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(wrap(a),wrap(b)));
}
return sum;
}
// [[Rcpp::export]]
double foo3(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(as<NumericVector>(wrap(a)),as<NumericVector>(wrap(b))));
}
return sum;
}
// [[Rcpp::export]]
double foo4(arma::vec a, arma::vec b, Function f){
double sum = 0.0;
for(int i=0;i<100;i++){
sum += as<double>(f(NumericVector(a.begin(),a.end()),NumericVector(b.begin(),b.end())));
}
return sum;
}
')
# note that when gctorture is on, the program will be very slow as it
# tries to perform GC for every allocation.
# gctorture(on=TRUE)
f = function(x,y) {
mean(x) + mean(y)
}
# all four functions should return 700
foo1(c(1,2,3), c(4,5,6), f) # error
foo2(c(1,2,3), c(4,5,6), f) # wrong answer (occasionally)!
foo3(c(1,2,3), c(4,5,6), f) # correct answer
foo4(c(1,2,3), c(4,5,6), f) # correct answer
As a result, the first method produces an error, the second method produces a wrong answer and only the third and the fourth method return the correct answer.
> # they should return 700
> foo1(c(1,2,3), c(4,5,6), f) # error
Error: invalid multibyte string at '<80><a1><e2>'
> foo2(c(1,2,3), c(4,5,6), f) # wrong answer (occasionally)!
[1] 712
> foo3(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
> foo4(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
Note that if gctorture is set to FALSE, all functions return the correct result.
> foo1(c(1,2,3), c(4,5,6), f) # error
[1] 700
> foo2(c(1,2,3), c(4,5,6), f) # wrong answer (occasionally)!
[1] 700
> foo3(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
> foo4(c(1,2,3), c(4,5,6), f) # correct answer
[1] 700
This means that methods 1 and 2 are liable to break when garbage is collected at runtime, and we don't know when that happens. Thus, it is dangerous not to wrap the parameter properly.
Edit: as of 2017-12-05, all four conversions produce the correct result.
f(a)
f(wrap(a))
f(as<NumericVector>(wrap(a)))
f(NumericVector(a.begin(),a.end()))
and this is the benchmark
> microbenchmark(foo1(c(1,2,3), c(4,5,6), f), foo2(c(1,2,3), c(4,5,6), f), foo3(c(1,2,3), c(4,5,6), f), foo4(c(1,2,3), c(4,5,6), f))
Unit: milliseconds
                            expr      min       lq     mean   median       uq      max neval
 foo1(c(1, 2, 3), c(4, 5, 6), f) 2.575459 2.694297 2.905398 2.734009 2.921552 4.186352   100
 foo2(c(1, 2, 3), c(4, 5, 6), f) 2.574565 2.677380 2.880511 2.731615 2.847573 5.336418   100
 foo3(c(1, 2, 3), c(4, 5, 6), f) 2.582574 2.701779 2.862598 2.753256 2.875745 4.611379   100
 foo4(c(1, 2, 3), c(4, 5, 6), f) 2.378309 2.469361 2.675188 2.538140 2.695720 3.734019   100
And f(NumericVector(a.begin(),a.end())) is marginally faster than other methods.
This should work with arma::vec, arma::rowvec and arma::colvec:
template <typename T>
Rcpp::NumericVector arma2vec(const T& x) {
return Rcpp::NumericVector(x.begin(), x.end());
}
I had the same question. I used wrap to do the conversion at the core of several layers of for loops and it was very slow. I think the wrap function is to blame for dragging the speed down, so I would like to know if there is an elegant way to do this.
As for Raymond's question, you might want to try including the namespace like: Rcpp::as<Rcpp::NumericVector>(wrap(A)) instead or include a line using namespace Rcpp; at the beginning of your code.

Efficient subsetting in Rcpp (equivalent of the R "which" command)

In Rcpp, there are various "Rcpp sugar" commands that permit nice vectorised operations in the code. In the code below I move across a data frame, break it into vectors, then use the "ifelse" and "sum" sugar commands to compute the mean of v over the rows where x equals either y or y+1. All seems to work correctly.
Just wondering if there is a neater way than this - e.g. an equivalent of the "which" command that gives index points satisfying a particular condition? There seems to be a facility available as "find" in Armadillo but that means using incompatible object types (you can't use "find" and "ifelse" together).
On the same topic, is it possible to get "ifelse" to accept a compound logical condition? In the example below, for instance, the definition of indic is formed of two "ifelse" commands, and it would obviously be cleaner as one. Any thoughts would be much appreciated.
Look forward to hearing your responses :)
require(Rcpp)
require(inline)
set.seed(42)
df = data.frame(x = rpois(1000,3), y = rpois(1000,3), v = rnorm(1000),
stringsAsFactors=FALSE)
myfunc1 = cxxfunction(
signature(DF = "data.frame"),
plugin = "Rcpp",
body = '
using namespace Rcpp;
DataFrame df(DF);
IntegerVector x = df["x"];
IntegerVector y = df["y"];
NumericVector v = df["v"];
LogicalVector indic = ifelse(x==y,true,ifelse(x==y+1,true,false));
double subsum = sum(ifelse(indic,v,0));
int subsize = sum(indic);
double mn = ((subsize>0) ? subsum/subsize : 0.0);
return(Rcpp::List::create(_["subsize"] = subsize,
_["submean"] = mn
));
'
)
myfunc1(df)
### OUTPUT:
#
# $subsize
# [1] 300
#
# $submean
# [1] 0.1091555
#
Rcpp (>= 0.10.0) implements the | operator between two logical sugar expressions. So you can do:
require( Rcpp )
cppFunction( code = '
List subsum( IntegerVector x, IntegerVector y, NumericVector v){
using namespace Rcpp ;
LogicalVector indic = (x==y) | (x==y+1) ;
int subsize = sum(indic) ;
double submean = subsize == 0 ? 0.0 : sum(ifelse(indic,v,0)) / subsize ;
return List::create( _["subsize"] = subsize, _["submean"] = submean ) ;
}
' )
subsum( rpois(1000,3), rpois(1000,3), rnorm(1000) )
# $subsize
# [1] 320
#
# $submean
# [1] -0.05708866
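If the logical vector itself is not needed, the same subset mean can also be accumulated in one pass, with no index vector in the style of which() and no intermediate ifelse result. The following is a plain C++ sketch under that assumption. The struct and function names are invented for the example, and std::vector stands in for the Rcpp vector types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SubsetMean {
  int size;     // number of rows that satisfy the condition
  double mean;  // mean of v over those rows, 0.0 if there are none
};

// Mean of v over the rows where x == y or x == y + 1, computed in a
// single loop. The three inputs are assumed to have the same length.
// (Hypothetical helper, written for illustration only.)
SubsetMean subset_mean(const std::vector<int>& x, const std::vector<int>& y,
                       const std::vector<double>& v) {
  assert(x.size() == y.size() && y.size() == v.size());
  int n = 0;
  double total = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (x[i] == y[i] || x[i] == y[i] + 1) {  // compound condition, one test
      ++n;
      total += v[i];
    }
  }
  return {n, n > 0 ? total / n : 0.0};
}
```

The sugar version reads more like the R original. The explicit loop trades that for avoiding the temporary vectors.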

Resources