Speeding up a strangely slow Rcpp function

I want to rewrite an expensive R function using Rcpp. As I am new to this topic I experimented with some very simple stuff.
I wrote the following function:
Rcpp::cppFunction('
std::vector<int> test_C(double a) {
std::vector<int> indices;
indices.reserve(2);
indices.push_back(a);
indices.push_back(a);
return (indices);
}
')
That works as far as the result is concerned, but it takes 0.1 seconds (which for this task is of course far too much). Previously I had
Rcpp::cppFunction('
NumericVector test_C(double a) {
NumericVector indices(2);
indices[0] = a;
indices[1] = a;
return (indices);
}
')
which was equally slow. I doubt that this is my system's fault: I tried the Rcpp code from the answer to "R: Getting indices of elements in a sorted vector", which calculates which[v > a][1] for a numeric vector v (of length 10e7 in my test) and a double a, and it ran very fast.
Any hints as to what I am doing wrong?

Are you by chance measuring the compilation too?
R> library(rbenchmark)
R> benchmark(test_C(2))[1:4]
       test replications elapsed relative
1 test_C(2)          100   0.001        1
R>

Related

Finding the value of infinite sums in r

I'm very new to R and programming, so please bear with me :)
I am trying to use iteration to find the value of an infinite sum to the 4th decimal place, i.e. the point where the 4th decimal no longer changes. For example, with 1.4223, once the 3 no longer changes, the result to 3 decimal places is 1.422.
The link above shows an example of a similar problem to the one I am faced with. My question is: how do I create a loop that runs indefinitely and finds the value at which the 4th decimal place stops changing?
I have tried using while loops, but I am not sure how to stop them from looping forever. I need some if statement like the one below:
result <- 0
i <- 1
d <- 1e-4
while(TRUE)
{
result <- result + (1/(i^2))
if(abs(result) < d)
{
break
}
i <- i + 1
}
result
Here's an example: to do the infinite loop, use while(TRUE) {}, and as you suggested use an if clause and break to stop when necessary.
## example equation shown
## fun <- function(x,n) {
## (x-1)^(2*n)/(n*(2*n-1))
## }
## do it for f(x)=1/x^2 instead
## doesn't have any x-dependence, but leave it in anyway
fun <- function(x,n) {
1/n^2
}
n <- 1
## x <- 0.6
tol <- 1e-4
ans <- 0
while (TRUE) {
next_term <- fun(x,n)
ans <- ans + next_term
if (abs(next_term)<tol) break
n <- n+1
}
When run this gives ans=1.635082, n=101.
R also has a rarely used repeat { } keyword, but while(TRUE) will probably be clearer to readers.
There are more efficient ways to do this for the original example (e.g. calculating the numerator by multiplying it by (x-1)^2 each time).
It's generally a good idea to test for a maximum number of iterations as well, so that you don't set up a truly infinite loop if your series doesn't converge or if you have a bug in your code.
I haven't solved your exact problem (chose a smaller value of tol), but you should be able to adjust this to get an answer.
As discussed in the answer to your previous question, this isn't guaranteed to work, but should generally be OK; you can check (I haven't) that the particular series you want to evaluate has well-behaved convergence.
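The stopping rule and the iteration cap mentioned above can be sketched together. This is a minimal illustration in plain C++ rather than R; the names SeriesResult, sum_inverse_squares and max_iter are made up for the example, and the series is the same 1/n^2 used in the answer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Result of a truncated series sum: the partial sum, the index of the
// last term added, and whether the tolerance was reached before the cap.
struct SeriesResult {
    double sum;
    std::size_t last_n;
    bool converged;
};

// Sum 1/n^2 for n = 1, 2, ... until a term drops below `tol`,
// giving up after `max_iter` terms (a hypothetical safety cap).
SeriesResult sum_inverse_squares(double tol, std::size_t max_iter) {
    assert(tol > 0.0 && max_iter > 0);
    SeriesResult res{0.0, 0, false};
    for (std::size_t n = 1; n <= max_iter; ++n) {
        const double dn = static_cast<double>(n);
        const double term = 1.0 / (dn * dn);
        res.sum += term;      // the term is added before the test, as in the R loop
        res.last_n = n;
        if (std::fabs(term) < tol) {
            res.converged = true;
            break;
        }
    }
    return res;
}
```

With tol = 1e-4 this stops at the same n as the R loop above. The converged flag lets the caller tell "tolerance reached" apart from "gave up at the cap".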

Find the Hamming distance between string sequences

I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs), comprising the (usual) 5 symbols : A, C, G, T, N (gap).
What is the optimal way to find the pairwise Hamming distance between these sequences?
Note that for each sequence, I actually want to find the reciprocal of the number of sequences (including itself) whose per-site Hamming distance from it is less than some threshold (0.1 in this example).
So far, I have attempted the following:
library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
temp <- 1/sum(sapply(snpdat, function(x) sum(x != snpdat[[i]])/98290 < 0.1))
}
snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.
This takes around 36 minutes to run on a Core i7-4790 with 16 GB RAM.
I also tried using the stringdist package, which takes more time to generate the same result.
Any help is highly appreciated!
I am not sure if this is the most optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone might find it useful someday...
This is the C++ code (I have used Sugar operators here)...
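#include <Rcpp.h>
using namespace Rcpp;
// Count the sequences in C whose Hamming distance to sequence x (0-based)
// is below 9829 mismatches, i.e. 10% of the 98290 sites.
// [[Rcpp::export]]
double test5(const List& C, const int& x){
double HD = 0; // must be initialised before it is incremented
const CharacterVector ref(C[x]); // convert the reference sequence once
for(int i = 0; i < 3156; i++) if(sum(ref != CharacterVector(C[i])) < 9829) HD++;
return HD;
}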
After compiling:
library(Rcpp)
sourceCpp("hd_code.cpp")
I simply call this function from R:
library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t =Sys.time()
bla = foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat,i-1)
Sys.time() - t
Can anyone think of an even quicker way to do this?
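One direction that might be worth trying, sketched here in plain C++ rather than Rcpp: the Hamming distance is symmetric, so each pair only needs to be compared once, which roughly halves the work. The function names and the use of std::string for a sequence are assumptions for illustration, and this has not been timed against the version above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of positions at which two equal-length sequences differ.
std::size_t hamming(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t k = 0; k < a.size(); ++k) d += (a[k] != b[k]);
    return d;
}

// For each sequence, 1 / (number of sequences, itself included, whose
// per-site Hamming distance to it is below `threshold`).
std::vector<double> reciprocal_neighbour_counts(
        const std::vector<std::string>& seqs, double threshold) {
    assert(threshold > 0.0);
    const std::size_t n = seqs.size();
    std::vector<std::size_t> counts(n, 1);  // every sequence matches itself
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {  // each pair visited once
            const double per_site =
                static_cast<double>(hamming(seqs[i], seqs[j])) /
                static_cast<double>(seqs[i].size());
            if (per_site < threshold) { ++counts[i]; ++counts[j]; }
        }
    }
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = 1.0 / static_cast<double>(counts[i]);
    return out;
}
```

Because counts[i] and counts[j] are both updated inside the inner loop, the outer loop cannot be split across workers as-is; a parallel version would need per-worker counts that are summed at the end.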

Codewars R Challenge: for i from 1 to n, do i % m and return the sum | Looking to optimize the code

This is a rather simple question, but my code either takes too long or consumes too many resources. It is a problem from www.codewars.com, which I use for R programming practice.
Below are the two versions of the problem I coded:
Version 1 :
f <- function(n, m){
# Your code here
if(n<=0) return(0) else return((n%%m)+f((n-1),m))
}
Version 2:
#Function created to calculate the sum of the first half of the vector created
calculate_sum <- function(x,y){
sum = 0
for(i in x){
sum = sum + i%%y
}
return(sum)
}
#Main function to be called with a test call
f <- function(n, m){
# Your code here
#Trying to create two vectors from the number to calculate the sum
#separately for each half
if(n%%2==0){
first_half = 1:(n/2)
second_half = ((n/2)+1):n
} else {
first_half = 1:floor(n/2)
second_half = (ceiling(n/2)):n
}
sum_first_half = calculate_sum(first_half,m)
sum_second_half = 0
for(j in second_half){
sum_second_half = sum_second_half+(j%%m)
}
return(sum_first_half+sum_second_half)
}
I am trying to figure out a way to optimize the code. For the first version it gives the following error message:
Error: C stack usage 7971184 is too close to the limit
Execution halted
For the second version it says my code took more than 7000 ms and hence was killed.
Can someone give me a few pointers on how to optimize the code in R?
The optimisation is mathematical, not programmatical (though as others have mentioned, loops are slow in R.)
Firstly, note that sum(0:(m-1)) = m*(m-1)/2.
You are being asked to add this up n%/%m times (once per complete cycle of remainders), and then add a remainder of r*(r+1)/2 for the final partial cycle, where r = n - (n%/%m)*m, i.e. n%%m. So you might try
f <- function(n,m){
k <- n%/%m
r <- n - k*m
return(k*m*(m-1)/2 + r*(r+1)/2)
}
which is a much less complex calculation, and will not take very long regardless of how large n or m is.
There is a risk that, if n is greater than 2^53 and m does not have enough powers of 2 in its factorisation, there will not be enough precision to calculate k and r accurately enough.
EDIT: Just realized it is actually a trick question.
I would do
n%/%m * sum(0:(m-1)) + sum(0:(n%%m))
Loops are really slow in R. Also, in my experience, recursive functions in R don't help much with speed and use a lot of memory.
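To check the closed form against the obvious loop, here is a small sketch in plain C++ (not R). The function names are made up for the example, and 64-bit unsigned integers are assumed, so very large n or m can still overflow.

```cpp
#include <cassert>
#include <cstdint>

// Closed form for sum_{i=1..n} (i mod m): k full cycles 0..m-1 plus a
// partial cycle 1..r, where k = n / m and r = n % m.
std::uint64_t mod_sum_closed(std::uint64_t n, std::uint64_t m) {
    assert(m > 0);
    const std::uint64_t k = n / m;
    const std::uint64_t r = n % m;
    return k * (m * (m - 1) / 2) + r * (r + 1) / 2;
}

// Brute-force reference, only sensible for small inputs.
std::uint64_t mod_sum_loop(std::uint64_t n, std::uint64_t m) {
    assert(m > 0);
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) total += i % m;
    return total;
}
```

For n = 10 and m = 3 the remainders are 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, which sum to 10; the closed form gives 3*3 + 1 = 10 as well.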

Quickest distance computation between two large vectors in R

I wish to calculate the distance between each element in one vector and each element in another vector in the quickest possible way in R. A small example is:
distf<-function(a,b) abs(a-b)
x<-c(1,2,3)
y<-c(1,1,1)
result<-outer(x,y, distf)
The issue is that my x and y are now of length 30,000 each and R crashes while trying to do this computation. And this is only doing it once, but I have to repeat the process 1000 times in a simulation study. Are there any quicker functions to be able to achieve this?
I eventually need to identify which of these distances are less than a fixed number (a calliper). I will be studying many such callipers, so I need to save all these distances, especially since the computation is so demanding. A function called caliper in the R package optmatch does this process directly, but that cannot handle such a big computation either.
Here's an Rcpp version that returns a matrix of 1s and 0s (stored as a NumericMatrix) depending on whether each pairwise comparison is <= a threshold. On my machine it took 22.5 secs to do 30,000 by 30,000. The output matrix is a little under 7 GB in RAM though.
fast_cal.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix fast_cal(NumericVector x, NumericVector y, double threshold) {
const long nr=x.length();
const long nc=y.length();
NumericMatrix output(nr, nc);
for (long i=0; i<nr; i++) {
for (long j=0; j<nc; j++) {
output(i, j) = (fabs(x(i) - y(j)) <= threshold) ? 1 : 0;
}
}
return output;
}
Testing
library("Rcpp")
sourceCpp("fast_cal.cpp")
x <- rnorm(30000)
y <- rnorm(30000)
out <- fast_cal(x, y, 0.5)
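If the full 0/1 matrix is not needed, a different approach (not the one in the answer above) is to sort y once and binary-search it for each element of x, which avoids the multi-gigabyte output. A minimal sketch in plain C++, with a made-up function name, that only counts the pairs within the calliper:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count pairs (i, j) with |x[i] - y[j]| <= caliper without building the
// full distance matrix. `y` is taken by value so it can be sorted.
std::size_t count_within_caliper(const std::vector<double>& x,
                                 std::vector<double> y, double caliper) {
    assert(caliper >= 0.0);
    std::sort(y.begin(), y.end());
    std::size_t total = 0;
    for (double xi : x) {
        // First element >= xi - caliper, and first element > xi + caliper.
        // Rounding in xi - caliper / xi + caliper can differ from
        // fabs(xi - yj) <= caliper for values exactly on the boundary.
        auto lo = std::lower_bound(y.begin(), y.end(), xi - caliper);
        auto hi = std::upper_bound(y.begin(), y.end(), xi + caliper);
        total += static_cast<std::size_t>(hi - lo);
    }
    return total;
}
```

The same lo/hi range could be stored per element of x instead of counted, which would give the matching indices in sorted order of y for any one calliper at roughly O((n + m) log m) cost.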

How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)

TL;DR
What is the fastest method in R for reading and writing a subset of columns from a very large matrix? I attempted a solution with data.table but need a fast way to extract a sequence of columns.
Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples (if you are applying them to other problems, be sure to read the disclaimers in the solutions!). Scroll to the bottom of the question for some more lessons learned.
This is my first Stack Overflow question- I greatly appreciate your time in taking a look and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast but the sheer number of them requires it to be as fast as possible. The simplest version of the idea is a matrix of dimension K by V where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.
set.seed(94253)
K <- 100
V <- 100000
mat <- matrix(runif(K*V),nrow=K,ncol=V)
We then end up performing a calculation on a subset of the columns and adding the result into the full matrix.
Naively, it looks like this:
Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)
Because this is done so many times it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below: modification can actually happen in place in some circumstances).
For my problem the object need not be a matrix (and I'm sensitive to the difference as outlined here Assign a matrix to a subset of a data.table). I always want the full column and so the list structure of a data frame is fine. My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately getting the value is somewhat more complicated. We have to call the variables setting with=FALSE which dramatically slows things down.
library(data.table)
DT <- as.data.table(mat)
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))
Within the set() function using i=NULL to reference all rows is incredibly fast but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, col number, value) and use data.table's binary search to speed retrieval. I tested this manually and while it is quick, it does approximately triple the memory requirements for the matrix. I would like to avoid this if possible.
Following the question here: Time in getting single elemets from data.table and data.frame objects. Hadley Wickham gave an incredibly fast solution for a single index
Vone <- Vsub[1]
toinsert.one <- toinsert[,1]
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one))
however since the .subset2(DT,i) is just DT[[i]] without the methods dispatch there is no way (to my knowledge) to grab several columns at once although it certainly seems like it should be possible. As in the previous question, it seems like since we can overwrite the values quickly we should be able to read them quickly.
Any suggestions? Also please let me know if there is a better solution than data.table for this problem. I realize it's not really the intended use case in many respects, but I'm trying to avoid porting the whole series of operations to C.
Here is a sequence of timings of the elements discussed: the first two update all the selected columns, the second two just one column.
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),
mat[,Vone] <- mat[,Vone] + toinsert.one,
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),
times=1000L)
Unit: microseconds
                 expr      min       lq   median       uq       max neval
               Matrix   51.970   53.895   61.754   77.313   135.698  1000
           Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000
    Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000
Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000
Answer and Lessons Learned:
Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code which modifies the matrix in place, breaking the R convention of not modifying the arguments to a function but providing a much faster result.
Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This points to an interesting and subtle point. I was performing my evaluations in RStudio. RStudio, as Hadley notes in his book, creates an additional reference for every object not within a function. Thus, while inside a function (the case that matters for performance) the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and addresses of memory.
Fun with Rcpp:
You can use Eigen's Map class to modify an R object in place.
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::VectorXi;
typedef Map<MatrixXd> MapMatd;
typedef Map<VectorXi> MapVeci;
'
body <- '
MapMatd A(as<MapMatd>(AA));
const MapMatd B(as<MapMatd>(BB));
const MapVeci ix(as<MapVeci>(ind));
const int mB(B.cols());
for (int i = 0; i < mB; ++i)
{
A.col(ix.coeff(i)-1) += B.col(i);
}
'
funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"),
body, "RcppEigen", incl)
set.seed(94253)
K <- 100
V <- 100000
mat2 <- mat <- matrix(runif(K*V),nrow=K,ncol=V)
Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
invisible(funRcpp(mat2, toinsert, Vsub))
all.equal(mat, mat2)
#[1] TRUE
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
funRcpp(mat2, toinsert, Vsub))
# Unit: microseconds
#                                  expr    min     lq  median      uq       max neval
# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
#         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100
I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.
I do the addition in C++, but it is trivial to change the function to only do assignment.
Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.
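For readers without Eigen, the same in-place column update can be sketched in plain C++ on a column-major buffer, which is the layout R uses for matrices. The function name and the std::vector arguments are assumptions for illustration; an Rcpp or .Call wrapper around it is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Add the columns of `add` (nrow x cols.size(), column-major) into the
// selected columns of `mat` (nrow x ncol, column-major), in place.
// `cols` holds 1-based column indices, as R would pass them.
void add_columns_in_place(std::vector<double>& mat, std::size_t nrow,
                          const std::vector<double>& add,
                          const std::vector<int>& cols) {
    assert(nrow > 0 && mat.size() % nrow == 0);
    assert(add.size() == nrow * cols.size());
    const std::size_t ncol = mat.size() / nrow;
    for (std::size_t j = 0; j < cols.size(); ++j) {
        assert(cols[j] >= 1 && static_cast<std::size_t>(cols[j]) <= ncol);
        // Column c (1-based) starts at offset (c - 1) * nrow.
        double* dst = mat.data() + (static_cast<std::size_t>(cols[j]) - 1) * nrow;
        const double* src = add.data() + j * nrow;
        for (std::size_t i = 0; i < nrow; ++i) dst[i] += src[i];
    }
}
```

Each selected column is a contiguous run of nrow doubles, so the inner loop touches memory sequentially and no temporary copy of the submatrix is created.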
Here's what I had in mind. This could probably be much sexier with Rcpp and friends, but I'm not as familiar with those tools.
#include <R.h>
#include <Rinternals.h>
#include <Rdefines.h>
SEXP addCol(SEXP mat, SEXP loc, SEXP matAdd)
{
int i, nr = nrows(mat), nc = ncols(matAdd), ll = length(loc);
if(ll != nc)
error("length(loc) must equal ncol(matAdd)");
if(TYPEOF(mat) != TYPEOF(matAdd))
error("mat and matAdd must be the same type");
if(nr != nrows(matAdd))
error("mat and matAdd must have the same number of rows");
if(TYPEOF(loc) != INTSXP)
error("loc must be integer");
int *iloc = INTEGER(loc);
switch(TYPEOF(mat)) {
case REALSXP:
for(i=0; i < ll; i++)
memcpy(&(REAL(mat)[(iloc[i]-1)*nr]),
&(REAL(matAdd)[i*nr]), nr*sizeof(double));
break;
case INTSXP:
for(i=0; i < ll; i++)
memcpy(&(INTEGER(mat)[(iloc[i]-1)*nr]),
&(INTEGER(matAdd)[i*nr]), nr*sizeof(int));
break;
default:
error("unsupported type");
}
return R_NilValue;
}
Put the above function in addCol.c, then run R CMD SHLIB addCol.c. Then in R:
addColC <- dyn.load("addCol.so")$addCol
.Call(addColC, mat, Vsub, mat[,Vsub]+toinsert)
The slight advantage to this approach over Roland's is that this only does the assignment. His function does the addition for you, which is faster, but also means you need a separate C/C++ function for every operation you need to do.
