The new find_finite and find_nonfinite functions in Armadillo 4.300 are great additions! In my tests using Rcpp, though, they are about 3x to 4x slower than a standard loop. Below is some code for calculating the sum and mean with case-wise deletion, corresponding to R's na.rm=TRUE option. The performance benchmarks from R show that the first version (sum_arma1 and mean_arma1, the plain loops) is about 3.5x faster than the second version based on find_finite. Am I doing everything correctly? Is there any way to improve the performance?
C++ code
#include <numeric>
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
double sum_arma1(arma::mat& X) {
double sum = 0;
for (int i = 0; i < X.size(); ++i) {
if (arma::is_finite(X(i)))
sum += X(i);
}
return sum;
}
// [[Rcpp::export]]
double sum_arma2(arma::mat& X) {
return arma::sum(X.elem(arma::find_finite(X)));
}
// [[Rcpp::export]]
double mean_arma1(arma::mat& X) {
double sum = 0;
int n = 0;
for (int i = 0; i < X.size(); ++i) {
if (arma::is_finite(X(i))) {
sum += X(i);
n += 1;
}
}
return sum/n;
}
// [[Rcpp::export]]
double mean_arma2(arma::mat& X) {
return arma::mean(X.elem(arma::find_finite(X)));
}
Benchmark results from R
# data
X = matrix(rnorm(1e6),1000,1000)
X[sample(1:1000,100),sample(1:1000,100)] = NA
# equal?
all.equal(sum(X, na.rm=TRUE),sum_arma1(X))
all.equal(sum(X, na.rm=TRUE),sum_arma2(X))
all.equal(mean(X, na.rm=TRUE),mean_arma1(X))
all.equal(mean(X, na.rm=TRUE),mean_arma2(X))
# benchmark
benchmark(
sum(X, na.rm=TRUE),
sum_arma1(X),
sum_arma2(X),
replications=100)
# test replications elapsed relative user.self sys.self
# 2 sum_arma1(X) 100 0.259 1.000 0.259 0.001
# 3 sum_arma2(X) 100 1.035 3.996 0.750 0.293
# 1 sum(X, na.rm = TRUE) 100 0.491 1.896 0.492 0.003
benchmark(
mean(X, na.rm=TRUE),
mean_arma1(X),
mean_arma2(X),
replications=100)
# test replications elapsed relative user.self sys.self
# 2 mean_arma1(X) 100 0.252 1.00 0.253 0.001
# 3 mean_arma2(X) 100 0.819 3.25 0.620 0.206
# 1 mean(X, na.rm = TRUE) 100 7.440 29.52 7.120 0.373
The general functions find_finite() and find_nonfinite() will always be slower than specialised summation loops. find_finite() was not designed specifically for summation, but for the general case of, well, finding the indices of finite values. What you do with those indices is up to you, and you've chosen to use them as input to the .elem() function.
In the code arma::sum(X.elem(arma::find_finite(X))), the function find_finite() has to go through X, looking for finite values, and then store the resulting indices of the finite values in a temporary vector. The .elem() member function then looks at the vector generated by find_finite() and creates another vector which contains only finite values. In turn, the vector generated by .elem() is then used by sum().
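To make the cost concrete, here is a minimal sketch in plain C++ (standard library only, no Armadillo) of the two strategies. The names find_finite_idx, gather, sum_general and sum_loop are hypothetical stand-ins and not Armadillo's internals; they only mimic the sequence of temporaries described above.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Pass 1: collect the indices of finite values into a temporary vector.
std::vector<std::size_t> find_finite_idx(const std::vector<double>& x) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (std::isfinite(x[i])) idx.push_back(i);
    return idx;
}

// Pass 2: copy the selected elements into a second temporary vector.
std::vector<double> gather(const std::vector<double>& x,
                           const std::vector<std::size_t>& idx) {
    std::vector<double> out;
    out.reserve(idx.size());
    for (std::size_t i : idx) out.push_back(x[i]);
    return out;
}

// General route: two temporaries and three passes over the data.
double sum_general(const std::vector<double>& x) {
    const std::vector<double> vals = gather(x, find_finite_idx(x));
    return std::accumulate(vals.begin(), vals.end(), 0.0);
}

// Specialised route: one pass, no temporaries.
double sum_loop(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x)
        if (std::isfinite(v)) s += v;
    return s;
}
```

Both functions return the same sum; the difference is only in how much memory is allocated and how often the data is traversed.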
C++ allows abstractions so that your code is quite compact, but sometimes you have to pay for such abstractions. General functions will always be slower than specialised loops.
However, for arithmetic functions such as addition, multiplication, etc, Armadillo will try to avoid the generation of temporary vectors/matrices, through the use of a smart delayed operations framework (based on template expressions) which queues up and combines several operations before executing them. This reduces the generation of temporaries.
The implementation of delayed operations is quite complex, which is why it's mainly done for the most important arithmetic functions. However, Armadillo has it in a few other cases as well, for example, find(X > 123) will avoid generating the temporary for X > 123.
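The idea behind such delayed evaluation can be illustrated with a toy example in plain C++. This is only a sketch under my own assumptions, not how Armadillo implements it: GreaterThan, gt and find_true are hypothetical names, and the proxy simply postpones the comparison until the indices are collected.

```cpp
#include <cstddef>
#include <vector>

// Lightweight proxy for "x > threshold": stores a reference and the
// threshold, but does not evaluate anything when constructed.
struct GreaterThan {
    const std::vector<double>& x;
    double threshold;
    std::size_t size() const { return x.size(); }
    bool operator[](std::size_t i) const { return x[i] > threshold; }
};

// Builds the proxy; no boolean vector is allocated here.
inline GreaterThan gt(const std::vector<double>& x, double threshold) {
    return GreaterThan{x, threshold};
}

// Evaluates the comparison element by element while collecting indices,
// so the intermediate result of the comparison is never materialised.
template <typename Expr>
std::vector<std::size_t> find_true(const Expr& e) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < e.size(); ++i)
        if (e[i]) idx.push_back(i);
    return idx;
}
```

A call such as find_true(gt(x, 123.0)) allocates only the index vector. The vector x has to outlive the proxy, since the proxy holds a reference to it.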
Related
I was reading the vignette for the rgen package which provides headers for sampling from some common distributions. In the first paragraph, it says that:
Please note, these samplers, just like the ones in armadillo cannot be used in parallelized code as the underlying generation routines rely upon R calls that are single-threaded.
This was news to me, and I've been using RcppArmadillo for quite some time now. I was wondering if someone could elaborate on this point (or provide references to where I can read about the issue). I'm especially interested in learning what "cannot be used" means here; will results be wrong, or will it just not parallelize?
These functions use R's random number generator, which must not be used in parallelized code, since that leads to undefined behavior. Undefined behavior can lead to virtually anything. From my point of view you are lucky if the program crashes, since this clearly tells you that something is going wrong.
The HPC task view lists some RNGs that are suitable for parallel computation. But you cannot use them easily with the distributions provided by rgen or RcppDist. Instead, one could do the following:
Copy the function for the multivariate normal distribution from rgen and adjust its signature such that it takes a std::function<double()> as source for N(0, 1) distributed random numbers.
Use a fast RNG instead of R's RNG.
Use the same fast RNG in parallel mode.
In code as a quick hack:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <functional> // std::function
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
inline arma::mat rmvnorm(unsigned int n, const arma::vec& mu, const arma::mat& S,
std::function<double()> rnorm = norm_rand){
unsigned int ncols = S.n_cols;
arma::mat Y(n, ncols);
Y.imbue( rnorm ) ;
return arma::repmat(mu, 1, n).t() + Y * arma::chol(S);
}
// [[Rcpp::export]]
arma::mat defaultRNG(unsigned int n, const arma::vec& mu, const arma::mat& S) {
return rmvnorm(n, mu, S);
}
// [[Rcpp::export]]
arma::mat serial(unsigned int n, const arma::vec& mu, const arma::mat& S) {
dqrng::normal_distribution dist(0.0, 1.0);
dqrng::xoshiro256plus rng(42);
return rmvnorm(n, mu, S, [&](){return dist(rng);});
}
// [[Rcpp::export]]
std::vector<arma::mat> parallel(unsigned int n, const arma::vec& mu, const arma::mat& S, unsigned int ncores = 1) {
dqrng::normal_distribution dist(0.0, 1.0);
dqrng::xoshiro256plus rng(42);
std::vector<arma::mat> res(ncores);
#pragma omp parallel num_threads(ncores)
{
dqrng::xoshiro256plus lrng(rng); // make thread local copy of rng
lrng.jump(omp_get_thread_num() + 1); // advance rng by 1 ... ncores jumps
res[omp_get_thread_num()] = rmvnorm(n, mu, S, [&](){return dist(lrng);});
}
return res;
}
/*** R
set.seed(42)
N <- 1000000
M <- 100
mu <- rnorm(M)
S <- matrix(rnorm(M*M), M, M)
S <- S %*% t(S)
system.time(defaultRNG(N, mu, S))
system.time(serial(N, mu, S))
system.time(parallel(N/2, mu, S, 2))
*/
Result:
> system.time(defaultRNG(N, mu, S))
user system elapsed
6.984 1.380 6.881
> system.time(serial(N, mu, S))
user system elapsed
4.008 1.448 3.971
> system.time(parallel(N/2, mu, S, 2))
user system elapsed
4.824 2.096 3.080
Here the real performance improvement comes from using a faster RNG, which is understandable since the focus here lies on many random numbers and not so much on matrix operations. If I shift more towards matrix operations by using N <- 100000 and M <- 1000 I get:
> system.time(defaultRNG(N, mu, S))
user system elapsed
16.740 1.768 9.725
> system.time(serial(N, mu, S))
user system elapsed
13.792 1.864 6.792
> system.time(parallel(N/2, mu, S, 2))
user system elapsed
14.112 3.900 5.859
Here we clearly see that in all cases user time is larger than elapsed time. The reason for this is the parallel BLAS implementation I am using (OpenBLAS). So there are quite a few factors to consider before deciding on a method.
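For readers without dqrng or OpenMP at hand, the core idea (an injectable N(0, 1) source, with one independent generator per worker) can be sketched with the standard library alone. The names make_normal_source and draw are hypothetical. Note the assumption: seeding each stream through std::seed_seq with a stream id does not give the non-overlap guarantee that jump() provides for xoshiro, so treat this as an illustration rather than a recommendation.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

// Returns a self-contained N(0, 1) source. Engine and distribution are
// captured by value, so every source owns its state and nothing is shared.
std::function<double()> make_normal_source(std::uint32_t seed,
                                           std::uint32_t stream) {
    std::seed_seq seq{seed, stream};  // assumption: seed plus stream id
    std::mt19937_64 engine(seq);
    std::normal_distribution<double> dist(0.0, 1.0);
    return [engine, dist]() mutable { return dist(engine); };
}

// Fills a vector from whatever source is passed in, mirroring how rmvnorm
// above takes its generator as a std::function<double()>.
std::vector<double> draw(std::size_t n, const std::function<double()>& src) {
    std::vector<double> out(n);
    for (double& v : out) v = src();
    return out;
}
```

Each worker would call make_normal_source with the same seed and its own stream id, then pass the result to draw. The same (seed, stream) pair reproduces the same numbers on a given standard library implementation.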
I am trying to parallelise the addition of (large) vectors using RcppParallel. That's what I've come up with.
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
#include <algorithm> // std::transform
#include <assert.h>
using namespace RcppParallel;
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector directVectorAddition(NumericVector first, NumericVector second) {
assert (first.length() == second.length());
NumericVector results(first.length());
results = first + second;
return results;
}
// [[Rcpp::export]]
NumericVector loopVectorAddition(NumericVector first, NumericVector second) {
assert (first.length() == second.length());
NumericVector results(first.length());
for(unsigned i = 0; i != first.length(); i++)
results[i] = first[i] + second[i];
return results;
}
struct VectorAddition : public Worker
{
const RVector<double> first, second;
RVector<double> results;
VectorAddition(const NumericVector one, const NumericVector two, NumericVector three) : first(one), second(two), results(three) {}
void operator()(std::size_t a1, std::size_t a2) {
std::transform(first.begin() + a1, first.begin() + a2,
second.begin() + a1,
results.begin() + a1,
[](double i, double j) {return i + j;});
}
};
// [[Rcpp::export]]
NumericVector parallelVectorAddition(NumericVector first, NumericVector second) {
assert (first.length() == second.length());
NumericVector results(first.length());
VectorAddition myVectorAddition(first, second, results);
parallelFor(0, first.length(), myVectorAddition);
return results;
}
It seems to work, but doesn't speed up things (at least not on a 4-core machine).
> v1 <- 1:1000000
> v2 <- 1000000:1
> all(directVectorAddition(v1, v2) == loopVectorAddition(v1, v2))
[1] TRUE
> all(directVectorAddition(v1, v2) == parallelVectorAddition(v1, v2))
[1] TRUE
> result <- benchmark(v1 + v2, directVectorAddition(v1, v2), loopVectorAddition(v1, v2), parallelVectorAddition(v1, v2), order="relative")
> result[,1:4]
test replications elapsed relative
1 v1 + v2 100 0.206 1.000
4 parallelVectorAddition(v1, v2) 100 0.993 4.820
2 directVectorAddition(v1, v2) 100 1.015 4.927
3 loopVectorAddition(v1, v2) 100 1.056 5.126
Can this be implemented more efficiently?
Thanks a lot in advance,
mce
Rookie mistake :) Your functions take Rcpp::NumericVector arguments, but you pass data created via the sequence operator. That creates integer values, so you are forcing a copy onto all your functions!
Make it
v1 <- as.double(1:1000000)
v2 <- as.double(1000000:1)
instead, and on a machine with lots of cores (at work) I then see
R> result[,1:4]
test replications elapsed relative
4 parallelVectorAddition(v1, v2) 100 0.301 1.000
2 directVectorAddition(v1, v2) 100 0.424 1.409
1 v1 + v2 100 0.436 1.449
3 loopVectorAddition(v1, v2) 100 0.736 2.445
The example is still not that impressive because the relevant operation is "cheap" whereas the parallel approach needs to allocate memory, copy data to workers, collect again etc pp.
But the good news is that you wrote your parallel code correctly. Not a small task.
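What "correctly" means here is that the worker only touches the index range it is handed, so disjoint ranges can be processed independently. A minimal sketch of that contract in plain C++ follows. It runs the chunks sequentially and uses no RcppParallel at all; AddWorker and run_in_chunks are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the VectorAddition worker: it only touches
// the half-open index range [begin, end) it is handed.
struct AddWorker {
    const std::vector<double>& first;
    const std::vector<double>& second;
    std::vector<double>& results;

    void operator()(std::size_t begin, std::size_t end) const {
        std::transform(first.begin() + begin, first.begin() + end,
                       second.begin() + begin, results.begin() + begin,
                       [](double a, double b) { return a + b; });
    }
};

// Sequential stand-in for a scheduler: splits [0, n) into chunks of at
// most `grain` elements (grain must be > 0) and calls the worker once
// per chunk.
template <typename Worker>
void run_in_chunks(std::size_t n, std::size_t grain, const Worker& w) {
    for (std::size_t begin = 0; begin < n; begin += grain)
        w(begin, std::min(begin + grain, n));
}
```

Because no chunk writes outside its own range, the result does not depend on the chunk size or on the order in which the chunks run, which is the property a parallel scheduler relies on.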
I got a document term matrix of ~1600 documents x ~120 words. I would like to compute the cosine similarity between all these vectors, but we are speaking about ~1,300,000 comparisons [n * (n - 1) / 2].
I used parallel::mclapply with 8 cores but it still takes forever.
Which other solution do you suggest?
Thanks
Here's my take on it.
If I define cosine similarity as
coss <- function(x) {crossprod(x)/(sqrt(tcrossprod(colSums(x^2))))}
(I think that is about as fast as I can make it with base R functions and the often overlooked crossprod, which is a little gem). I then compare it with an Rcpp function using RcppArmadillo (slightly updated as suggested by #f-privé)
NumericMatrix cosine_similarity(NumericMatrix x) {
arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
// Compute the crossprod
arma::mat res = X.t() * X;
int n = x.ncol();
arma::vec diag(n);
int i, j;
for (i=0; i<n; i++) {
diag(i) = sqrt(res(i,i));
}
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
res(i, j) /= diag(i)*diag(j);
return(wrap(res));
}
(this might possibly be optimised with some of the specialized functions in the armadillo library - just wanted to get some timing measurements).
Comparing those yields
> XX <- matrix(rnorm(120*1600), ncol=1600)
> microbenchmark::microbenchmark(cosine_similarity(XX), coss(XX), coss2(XX), times=50)
> microbenchmark::microbenchmark(coss(x), coss2(x), cosine_similarity(x), cosine_similarity2(x), coss3(x), times=50)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval cld
               coss(x) 173.0975 183.0606 192.8333 187.6082 193.2885 331.9206    50   a
              coss2(x) 162.4193 171.3178 183.7533 178.8296 184.9762 319.7934    50   b
 cosine_similarity2(x) 169.6075 175.5601 191.4402 181.3405 186.4769 319.8792    50   a
which is really not that bad. The gain in computing the cosine similarity using C++ is super small (with # f-privé's solution being fastest) so I'm guessing your timing issues are due to what you are doing to convert the text from the words to numbers and not when calculating the cosine similarity. Without knowing more about your specific code it is hard for us to help you.
I fully agree with #ekstroem on the use of crossprod, but I think there are unnecessary computations in his implementation. By the way, I think that coss is giving a wrong result.
To compare his answer with mine, you can use this cpp file:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix cosine_similarity(NumericMatrix x) {
arma::mat X(x.begin(), x.nrow(), x.ncol(), false);
arma::mat rowSums = sum(X % X, 0);
arma::mat res;
res = X.t() * X / sqrt(rowSums.t() * rowSums);
return(wrap(res));
}
// [[Rcpp::export]]
NumericMatrix& toCosine(NumericMatrix& mat,
const NumericVector& diag) {
int n = mat.nrow();
int i, j;
for (j = 0; j < n; j++)
for (i = 0; i < n; i++)
mat(i, j) /= diag(i) * diag(j);
return mat;
}
/*** R
coss <- function(x) {
crossprod(x)/(sqrt(crossprod(x^2)))
}
coss2 <- function(x) {
cross <- crossprod(x)
toCosine(cross, sqrt(diag(cross)))
}
XX <- matrix(rnorm(120*1600), ncol=1600)
microbenchmark::microbenchmark(
cosine_similarity(XX),
coss(XX),
coss2(XX),
times = 20
)
*/
Unit: milliseconds
expr min lq mean median uq max neval
cosine_similarity(XX) 172.1943 176.4804 181.6294 181.6345 185.7542 199.0042 20
coss(XX) 262.6167 270.9357 278.8999 274.4312 276.1176 337.0531 20
coss2(XX) 134.6742 137.6013 147.3153 140.4783 146.5806 204.2115 20
So, I will definitely go for computing the crossprod in base R and then do the scaling in Rcpp.
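The scaling step on its own can be sketched in plain C++ without Rcpp. This is a hypothetical helper (to_cosine on a flat std::vector), assuming the matrix is stored column-major as in R and that no column has zero norm:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major n x n matrix stored in a flat vector, as R stores matrices.
// On entry `gram` holds crossprod(x); on exit it holds cosine similarities.
void to_cosine(std::vector<double>& gram, std::size_t n) {
    // The diagonal of crossprod(x) holds the squared column norms.
    std::vector<double> norm(n);
    for (std::size_t i = 0; i < n; ++i)
        norm[i] = std::sqrt(gram[i + i * n]);

    // Walk column by column so memory access is contiguous.
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            gram[i + j * n] /= norm[i] * norm[j];
}
```

For example, the columns (1, 0) and (1, 1) give the cross-product matrix {1, 1, 1, 2} in column-major order, and after scaling the off-diagonal entries become 1/sqrt(2).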
PS: If you have a very sparse matrix, you could use package Matrix to convert your matrix to a sparse matrix. This new class of matrix also has the crossprod method, so you could use coss2 as well.
The coop package's coop::cosine function is probably the best way to do this now. It is implemented in Rcpp, but also has a different approach than lsa::cosine, and also has lower memory overhead. Its use is exactly the same as lsa::cosine, just switch out the package names.
For further speedups, you may want to change your BLAS library. The coop manual has a few basic details and suggestions.
This post is about speeding up R code using Rcpp package to avoid recursive loops.
My input is defined by the following example (length 7), which is part of the data.frame (length 51673) that I use:
S=c(906.65,906.65,906.65,906.65,906.65,906.65,906.65)
T=c(0.1371253,0.1457896,0.1248953,0.1261278,0.1156931,0.0985253,0.1332596)
r=c(0.013975,0.013975,0.013975,0.013975,0.013975,0.013975,0.013975)
h=c(0.001332596,0.001248470,0.001251458,0.001242143,0.001257921,0.001235755,0.001238440)
P=c(3,1,5,2,1,4,2)
A= data.frame(S=S,T=T,r=r,h=h,P=P)
S T r h Per
1 906.65 0.1971253 0.013975 0.001332596 3
2 906.65 0.1971253 0.013975 0.001248470 1
3 906.65 0.1971253 0.013975 0.001251458 5
4 906.65 0.1971253 0.013975 0.001242143 2
5 906.65 0.1971253 0.013975 0.001257921 1
6 906.65 0.1971253 0.013975 0.001235755 4
7 906.65 0.1971253 0.013975 0.001238440 2
The parameters are :
w=0.001; b=0.2; a=0.0154; c=0.0000052; neta=-0.70
I have the following code of the function that I want to use :
F<-function(x,w,b,a,c,neta,S,T,r,P){
u=1i*x
nu=(1/(neta^2))*(((1-2*neta)^(1/2))-1)
# Recursion back to time t
# Terminal condition for the A and B
A_Q=0
B_Q=0
steps<-round(T*250,0)
for (j in 1:steps){
A_Q= A_Q+ r*u + w*B_Q-(1/2)*log(1-2*a*(neta^4)*B_Q)
B_Q= b*B_Q+u*nu+ (1/neta^2)*(1-sqrt((1-2*a*(neta^4)*B_Q)*( 1- 2*c*B_Q - 2*u*neta)))
}
F= exp(log(S)*u + A_Q + B_Q*h[P])
return(F)
}
S = A$S ; r= A$r ; T= A$T ; P=A$P; h= A$h
Then I want to apply the previous function to my data set and a vector of length N = 100000:
Z=length(S); N=100000 ; alpha=2 ; delta= 0.25
lambda=(2*pi)/(N*delta)
res = matrix(nrow=N, ncol=Z)
for (i in 1:N){
for (j in 1:Z){
res[i,j]= Re(F(((delta*(i-1))-(alpha+1)*1i),w,b,a,c,neta,S[j],T[j],r[j],P[j]))
}
}
But it takes a lot of time: executing this code takes 20 seconds for N=100, and I want to run it for N=100000, so the overall run time could be hours. How can I tune the above code using Rcpp to reduce the execution time and obtain an efficient program?
Is it possible to reduce the execution time? If so, please suggest a solution, even without Rcpp.
Thanks.
Your function F can be converted to C++ pretty easily by taking advantage of the vec and cx_vec classes in the Armadillo library (accessed through the RcppArmadillo package) - which has great support for vectorized calculations.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::cx_vec Fcpp(const arma::cx_vec& x, double w, double b, double a, double c,
double neta, const arma::vec& S, const arma::vec& T,
const arma::vec& r, Rcpp::IntegerVector P, Rcpp::NumericVector h) {
arma::cx_vec u = x * arma::cx_double(0.0,1.0);
double nu = (1.0/std::pow(neta,2.0)) * (std::sqrt(1.0-2.0*neta)-1.0);
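// Terminal conditions: start both recursions from zero explicitly
arma::cx_vec A_Q(r.size(), arma::fill::zeros);
arma::cx_vec B_Q(r.size(), arma::fill::zeros);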
arma::vec steps = arma::round(T*250.0);
for (size_t j = 0; j < steps.size(); j++) {
for (size_t k = 0; k < steps[j]; k++) {
A_Q = A_Q + r*u + w*B_Q -
0.5*arma::log(1.0 - 2.0*a*std::pow(neta,4.0)*B_Q);
B_Q = b*B_Q + u*nu + (1.0/std::pow(neta,2.0)) *
(1.0 - arma::sqrt((1.0 - 2.0*a*std::pow(neta,4.0)*B_Q) *
(1.0 - 2.0*c*B_Q - 2.0*u*neta)));
}
}
arma::vec hP = Rcpp::as<arma::vec>(h[P-1]);
arma::cx_vec F = arma::exp(arma::log(S)*u + A_Q + B_Q*hP);
return F;
}
Just a couple of minor changes to note:
I'm using arma:: functions for vectorized calculations, such as arma::log, arma::exp, arma::round, arma::sqrt, and various overloaded operators (*, +, -); but using std::pow and std::sqrt for scalar calculations. In R, this is abstracted away from us, but here we have to distinguish between the two situations.
Your function F has one loop - for (j in 1:steps) - but the C++ version has two, just due to the differences in loop semantics between the two languages.
Most of the input vectors are arma:: classes (as opposed to using Rcpp::NumericVector and Rcpp::ComplexVector), the exception being P and h, since Rcpp vectors offer R-like element access - e.g. h[P-1]. Also notice that P needs to be offset by 1 (0-based indexing in C++), and then converted to an Armadillo vector (hP) using Rcpp::as<arma::vec>, since your compiler will complain if you try to multiply a cx_vec with a NumericVector (B_Q*hP).
I added a function parameter h - it's not a good idea to rely on the existence of a global variable h, which you were doing in F. If you need to use it in the function body, you should pass it into the function.
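To see the recursion without any vector machinery, here is a scalar sketch in plain C++ using std::complex. It is not the code I benchmarked; F_scalar and its argument hP (the already looked-up value of h[P]) are names I made up for illustration.

```cpp
#include <cmath>
#include <complex>

using cd = std::complex<double>;

// Scalar version of the recursion: one x, one (S, T, r) row, and the
// already looked-up value hP = h[P] (1-based in R, so h[P - 1] in C++).
cd F_scalar(cd x, double w, double b, double a, double c, double neta,
            double S, double T, double r, double hP) {
    const cd u = cd(0.0, 1.0) * x;
    const double neta2 = neta * neta;
    const double neta4 = neta2 * neta2;
    const double nu = (1.0 / neta2) * (std::sqrt(1.0 - 2.0 * neta) - 1.0);

    cd A_Q(0.0, 0.0);  // terminal conditions, explicitly zero
    cd B_Q(0.0, 0.0);

    // Note: std::lround rounds halves away from zero, whereas R's round()
    // rounds halves to even; the two differ only when T*250 ends in .5.
    const long steps = std::lround(T * 250.0);
    for (long j = 0; j < steps; ++j) {
        // A_Q is updated first so that it sees the previous B_Q.
        A_Q = A_Q + r * u + w * B_Q
              - 0.5 * std::log(1.0 - 2.0 * a * neta4 * B_Q);
        B_Q = b * B_Q + u * nu
              + (1.0 / neta2)
                    * (1.0 - std::sqrt((1.0 - 2.0 * a * neta4 * B_Q)
                                       * (1.0 - 2.0 * c * B_Q - 2.0 * u * neta)));
    }
    return std::exp(std::log(S) * u + A_Q + B_Q * hP);
}
```

Two quick sanity checks follow from the formulas: for x = 0 both recursions stay at zero, so the result is 1, and for zero steps the result reduces to exp(log(S) * u).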
I changed the name of your function to Fr, and to make benchmarking a little easier, I just wrapped your double loop that populates the matrix res into the functions loop_Fr and loop_Fcpp:
loop_Fr <- function(mat = res) {
for (i in 1:N) {
for (j in 1:Z) {
mat[i,j]= Re(Fr(((delta*(i-1))-(alpha+1)*1i),w,b,a,c,neta,S[j],T[j],r[j],P[j],h))
}
}
return(mat)
}
loop_Fcpp <- function(mat = res) {
for (i in 1:N) {
for (j in 1:Z) {
mat[i,j]= Re(Fcpp(((delta*(i-1))-(alpha+1)*1i),w,b,a,c,neta,S[j],T[j],r[j],P[j],h))
}
}
return(mat)
}
##
R> all.equal(loop_Fr(),loop_Fcpp())
[1] TRUE
I compared the two functions for N = 100, N = 1000, and N = 100000 (which took forever) - adjusting lambda and res accordingly, but keeping everything else the same. Generally speaking, Fcpp is about 10x faster than Fr on my computer:
N <- 100
lambda <- (2*pi)/(N*delta)
res <- matrix(nrow=N, ncol=Z)
##
R> microbenchmark::microbenchmark(loop_Fr(), loop_Fcpp(),times=50L)
Unit: milliseconds
expr min lq median uq max neval
loop_Fr() 142.44694 146.62848 148.97571 151.86318 186.67296 50
loop_Fcpp() 14.72357 15.26384 15.58604 15.85076 20.19576 50
N <- 1000
lambda <- (2*pi)/(N*delta)
res <- matrix(nrow=N, ncol=Z)
##
R> microbenchmark::microbenchmark(loop_Fr(), loop_Fcpp(),times=50L)
Unit: milliseconds
expr min lq median uq max neval
loop_Fr() 1440.8277 1472.4429 1491.5577 1512.5636 1565.6914 50
loop_Fcpp() 150.6538 153.2687 155.4156 158.0857 181.8452 50
N <- 100000
lambda <- (2*pi)/(N*delta)
res <- matrix(nrow=N, ncol=Z)
##
R> microbenchmark::microbenchmark(loop_Fr(), loop_Fcpp(),times=2L)
Unit: seconds
expr min lq median uq max neval
loop_Fr() 150.14978 150.14978 150.33752 150.52526 150.52526 2
loop_Fcpp() 15.49946 15.49946 15.75321 16.00696 16.00696 2
Other variables, as presented in your question:
S <- c(906.65,906.65,906.65,906.65,906.65,906.65,906.65)
T <- c(0.1371253,0.1457896,0.1248953,0.1261278,0.1156931,0.0985253,0.1332596)
r <- c(0.013975,0.013975,0.013975,0.013975,0.013975,0.013975,0.013975)
h <- c(0.001332596,0.001248470,0.001251458,0.001242143,0.001257921,0.001235755,0.001238440)
P <- c(3,1,5,2,1,4,2)
w <- 0.001; b <- 0.2; a <- 0.0154; c <- 0.0000052; neta <- (-0.70)
Z <- length(S)
alpha <- 2; delta <- 0.25
I'm a newbie to C++ and Rcpp. Suppose, I have a vector
t1<-c(1,2,NA,NA,3,4,1,NA,5)
and I want to get the indices of the elements of t1 that are NA. I can write:
NumericVector retIdxNA(NumericVector x) {
// Step 1: get the positions of NA in the vector
LogicalVector y=is_na(x);
// Step 2: count the number of NA
int Cnt=0;
for (int i=0;i<x.size();i++) {
if (y[i]) {
Cnt++;
}
}
// Step 3: create an output vector whose length equals the number of NAs
// and return the answer
NumericVector retIdx(Cnt);
int Cnt1=0;
for (int i=0;i<x.size();i++) {
if (y[i]) {
retIdx[Cnt1]=i+1;
Cnt1++;
}
}
return retIdx;
}
then I get
retIdxNA(t1)
[1] 3 4 8
I was wondering:
(i) is there any equivalent of which in Rcpp?
(ii) is there any way to make the above function shorter/crisper? In particular, is there any easy way to combine the Step 1, 2, 3 above?
Recent versions of RcppArmadillo have functions to identify the indices of finite and non-finite values.
So this code
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec whichNA(arma::vec x) {
return arma::find_nonfinite(x);
}
/*** R
t1 <- c(1,2,NA,NA,3,4,1,NA,5)
whichNA(t1)
*/
yields your desired answer (modulo the off-by-one in C/C++ as they are zero-based):
R> sourceCpp("/tmp/uday.cpp")
R> t1 <- c(1,2,NA,NA,3,4,1,NA,5)
R> whichNA(t1)
[,1]
[1,] 2
[2,] 3
[3,] 7
R>
Rcpp can do it too if you first create the sequence to subset into:
// [[Rcpp::export]]
Rcpp::IntegerVector which2(Rcpp::NumericVector x) {
Rcpp::IntegerVector v = Rcpp::seq(0, x.size()-1);
return v[Rcpp::is_na(x)];
}
Added to code above it yields:
R> which2(t1)
[1] 2 3 7
R>
The logical subsetting is also somewhat new in Rcpp.
Try this:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector which4( NumericVector x) {
int nx = x.size();
std::vector<int> y;
y.reserve(nx);
for(int i = 0; i < nx; i++) {
if (R_IsNA(x[i])) y.push_back(i+1);
}
return wrap(y);
}
which we can run like this in R:
> which4(t1)
[1] 3 4 8
Performance
Note that we have changed the above solution to reserve space for the output vector. This replaces which3 which is:
// [[Rcpp::export]]
IntegerVector which3( NumericVector x) {
int nx = x.size();
IntegerVector y;
for(int i = 0; i < nx; i++) {
// if (internal::Rcpp_IsNA(x[i])) y.push_back(i+1);
if (R_IsNA(x[i])) y.push_back(i+1);
}
return y;
}
Then the performance on a vector 9 elements long is the following with which4 the fastest:
> library(rbenchmark)
> benchmark(retIdxNA(t1), whichNA(t1), which2(t1), which3(t1), which4(t1),
+ replications = 10000, order = "relative")[1:4]
test replications elapsed relative
5 which4(t1) 10000 0.14 1.000
4 which3(t1) 10000 0.16 1.143
1 retIdxNA(t1) 10000 0.17 1.214
2 whichNA(t1) 10000 0.17 1.214
3 which2(t1) 10000 0.25 1.786
Repeating this for a vector 9000 elements long the Armadillo solution comes in quite a bit faster than the others. Here which3 (which is the same as which4 except it does not reserve space for the output vector) comes in worst while which4 comes second.
> tt <- rep(t1, 1000)
> benchmark(retIdxNA(tt), whichNA(tt), which2(tt), which3(tt), which4(tt),
+ replications = 1000, order = "relative")[1:4]
test replications elapsed relative
2 whichNA(tt) 1000 0.09 1.000
5 which4(tt) 1000 0.79 8.778
3 which2(tt) 1000 1.03 11.444
1 retIdxNA(tt) 1000 1.19 13.222
4 which3(tt) 1000 23.58 262.000
All of the solutions above are serial. Although not trivial, it is quite possible to take advantage of threading for implementing which. See this write up for more details. For such small sizes, though, it would do more harm than good. Like taking a plane for a short distance, you lose too much time at airport security.
R implements which by allocating memory as large as the input, doing a single pass to store the indices in this memory, and then copying them into a properly sized integer vector.
Intuitively this seems less efficient than a double pass loop, but not necessarily, as copying a data range is cheap. See more details here.
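The two strategies can be put side by side in plain C++. This is a sketch, not R's actual source: the function names are hypothetical, and a plain NaN stands in for R's NA (which is a NaN with a specific payload).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Strategy A (as described for R above): one pass into a buffer as long
// as the input, then copy the used prefix into a right-sized result.
std::vector<int> which_nan_buffer(const std::vector<double>& x) {
    std::vector<int> buf(x.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (std::isnan(x[i])) buf[count++] = static_cast<int>(i) + 1;
    return std::vector<int>(buf.begin(), buf.begin() + count);
}

// Strategy B: first pass counts the hits, second pass fills a result
// that was allocated at exactly the right size.
std::vector<int> which_nan_two_pass(const std::vector<double>& x) {
    std::size_t count = 0;
    for (double v : x)
        if (std::isnan(v)) ++count;
    std::vector<int> out(count);
    std::size_t k = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (std::isnan(x[i])) out[k++] = static_cast<int>(i) + 1;
    return out;
}
```

Both return 1-based positions, so the t1 example from the question gives 3, 4, 8 either way. Strategy A tests each element once and pays for a larger allocation plus a copy; strategy B tests each element twice but allocates only what it needs.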
Just write a function for yourself like:
which_1<-function(a,b){
return(which(a>b))
}
Then pass this function into Rcpp.