Find the Hamming distance between string sequences - r

I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs), comprising the (usual) 5 symbols : A, C, G, T, N (gap).
What is the optimal way to find the pairwise Hamming distance between these sequences?
Note that for each sequence, I actually want the reciprocal of the number of sequences (including itself) whose per-site Hamming distance from it is less than some threshold (0.1 in this example).
So far, I have attempted the following:
library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
temp <- 1/sum(sapply(snpdat, function(x) sum(x != snpdat[[i]])/98290 < 0.1))
}
snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.
This takes around 36 minutes to run on a Core i7-4790 with 16 GB of RAM.
I also tried using the stringdist package, which takes more time to generate the same result.
Any help is highly appreciated!

I am not sure if this is the optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone finds it useful someday...
This is the C++ code (I have used Sugar operators here)...
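#include <Rcpp.h>
using namespace Rcpp;
// Needed so that sourceCpp() makes test5 callable from R
// [[Rcpp::export]]
double test5(const List& C, const int& x){
  // Must start at 0; an uninitialized double holds an indeterminate value
  double HD = 0;
  // Count sequences that differ from sequence x at fewer than 9829 sites (10% of 98290)
  for(int i = 0; i < 3156; i++) if(sum(CharacterVector(C[x])!=CharacterVector(C[i])) < 9829) HD++;
  return HD;
}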
After compiling:
library(Rcpp)
sourceCpp("hd_code.cpp")
I simply call this function from R:
library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t =Sys.time()
bla = foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat,i-1)
Sys.time() - t
Can anyone think of an even quicker way to do this?
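Two things that might help further are stopping each comparison as soon as the threshold is reached, and comparing each unordered pair only once, since the Hamming distance is symmetric. Below is a minimal sketch of that idea in plain C++ (no Rcpp), assuming the sequences are available as equal-length std::string values. The names within_hamming and neighbour_counts are made up for illustration, and max_diff plays the role of the 9829 cutoff above. I have not timed it against the Rcpp version.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True if a and b (equal length) differ at fewer than max_diff positions.
// Scanning stops as soon as the number of differences reaches max_diff.
bool within_hamming(const std::string& a, const std::string& b, std::size_t max_diff) {
    assert(a.size() == b.size());
    std::size_t diff = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        if (a[k] != b[k] && ++diff >= max_diff) return false;
    }
    return true;
}

// For each sequence, count the sequences (including itself) within the threshold.
// Each unordered pair is compared once and credited to both sequences.
// Assumes max_diff > 0, so that every sequence counts itself.
std::vector<int> neighbour_counts(const std::vector<std::string>& seqs, std::size_t max_diff) {
    assert(max_diff > 0);
    std::vector<int> counts(seqs.size(), 1);  // start at 1: the sequence itself
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        for (std::size_t j = i + 1; j < seqs.size(); ++j) {
            if (within_hamming(seqs[i], seqs[j], max_diff)) {
                ++counts[i];
                ++counts[j];
            }
        }
    }
    return counts;
}
```

The quantity asked for in the question would then be 1.0 / counts[i] for each i. To call this from R, the two functions would still need an Rcpp wrapper that converts the list of sequences.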

Related

Dealing with very large vector in R

I am dealing with some large data in R:
I have a vector of normally distributed random numbers with length about 6400*50000, I need to sum every 4 elements in this vector to get a smaller one.
Is there any efficient way to do this in R?
My thoughts till now:
using a matrix with ncol=10 and the apply function: failed because the matrix size is too big;
tried the parallel and foreach packages, but no progress yet;
example code:
library(parallel)
library(RcppZiggurat)
library(doParallel)
library(foreach)
coreNums<-detectCores()
N1=6400
M=4
N2=N1/M
cl<-makeCluster(getOption("cl.cores", coreNums))
registerDoParallel(cl)
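K=50000  # not defined in the original snippet; taken from the stated length 6400*50000
vector1<-zrnorm(N1*K)
# Parentheses matter here: 1:M*i parses as (1:M)*i
vector2=foreach(i=1:(N2*K)) %dopar% {sum(vector1[(M*(i-1)+1):(M*i)])}
vector2=unlist(vector2)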
I think colSums is the function you are looking for.
vector1 = rnorm(1000*50000)
dim(vector1) = c(10, length(vector1)/10)
vector2 = colSums(vector1)
In my opinion, the task is too simple for parallelization.
Also, I did not get any problems with the matrix size.
If you want to use less memory, here is the code doing the same in parts of 10,000 values in vector1.
vector2 = double(length(vector1)/10)
# One iteration per 10,000-value part of vector1 (each part yields 1,000 sums)
for( i in seq_len(length(vector1)/10000) ){
part = vector1[((i-1)*10000+1):(i*10000)]
dim(part) = c(10, 1000)
vector2[((i-1)*1000+1):(i*1000)] = colSums(part)
}
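For comparison, the same reduction can be written as a single pass that needs no reshaped copy of the data. This is only a sketch in plain C++ (not Rcpp); block_sums is a made-up name, and block is the group size (4 in the question, 10 in the colSums example above). It assumes the input length is a multiple of block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum consecutive, non-overlapping groups of `block` elements.
// out[k] holds v[k*block] + ... + v[k*block + block - 1].
std::vector<double> block_sums(const std::vector<double>& v, std::size_t block) {
    assert(block > 0 && v.size() % block == 0);
    std::vector<double> out(v.size() / block, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i / block] += v[i];  // integer division maps element i to its group
    }
    return out;
}
```

The only extra memory is the output vector, which is 1/block the size of the input.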

Quickest distance computation between two large vectors in R

I wish to calculate the distance between each element in one vector and each element in another vector in the quickest possible way in R. A small example is:
distf<-function(a,b) abs(a-b)
x<-c(1,2,3)
y<-c(1,1,1)
result<-outer(x,y, distf)
The issue is that my x and y are now of length 30,000 each and R crashes while trying to do this computation. And this is only doing it once, but I have to repeat the process 1000 times in a simulation study. Are there any quicker functions to be able to achieve this?
I eventually need to identify which of these distances are less than a fixed number/calliper. I will be studying many such fixed callipers eventually, therefore, I need to save all these distances, especially if the computation is so demanding. A function called caliper in the R package optmatch does this process directly, but that cannot handle such a big computation as well.
Here's an Rcpp version that returns a matrix of 1s and 0s (stored as doubles) depending on whether each pairwise comparison is <= a threshold. On my machine it took 22.5 seconds to do 30,000 by 30,000. The output matrix is a little under 7 GB in RAM, though.
fast_cal.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix fast_cal(NumericVector x, NumericVector y, double threshold) {
const long nr=x.length();
const long nc=y.length();
NumericMatrix output(nr, nc);
for (long i=0; i<nr; i++) {
for (long j=0; j<nc; j++) {
output(i, j) = (fabs(x(i) - y(j)) <= threshold) ? 1 : 0;
}
}
return output;
}
Testing
library("Rcpp")
sourceCpp("fast_cal.cpp")
x <- rnorm(30000)
y <- rnorm(30000)
out <- fast_cal(x, y, 0.5)
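If only the number of matches per element is needed for a given caliper, the dense matrix can be avoided altogether. This is a different technique from the function above: sort y once, then find the window [x - threshold, x + threshold] with two binary searches per element of x. A sketch in plain C++ (not Rcpp), where caliper_counts is a made-up name. Note that the window test can disagree with fabs(x - y) <= threshold for values sitting exactly on the boundary, because of floating-point rounding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// For each x[i], count the y values with x[i] - threshold <= y <= x[i] + threshold.
// y is taken by value so it can be sorted without touching the caller's copy.
std::vector<std::size_t> caliper_counts(const std::vector<double>& x,
                                        std::vector<double> y,
                                        double threshold) {
    assert(threshold >= 0.0);
    std::sort(y.begin(), y.end());
    std::vector<std::size_t> counts;
    counts.reserve(x.size());
    for (double xi : x) {
        // First y that is >= xi - threshold
        auto lo = std::lower_bound(y.begin(), y.end(), xi - threshold);
        // First y that is > xi + threshold
        auto hi = std::upper_bound(y.begin(), y.end(), xi + threshold);
        counts.push_back(static_cast<std::size_t>(hi - lo));
    }
    return counts;
}
```

Memory use is proportional to the lengths of x and y rather than to their product, so repeating this for many calipers stays cheap. It does not keep the individual distances, so it only fits if the counts (or the index ranges lo and hi) are what you need.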

Speeding up a strangely slow Rcpp function

I want to rewrite an expensive R function using Rcpp. As I am new to this topic I experimented with some very simple stuff.
I wrote the following function:
Rcpp::cppFunction('
std::vector<int> test_C(double a) {
std::vector<int> indices;
indices.reserve(2);
indices.push_back(a);
indices.push_back(a);
return (indices);
}
')
Now that works all well regarding the result. But it takes 0.1 seconds (which for this task of course is way too much). Previously I had
Rcpp::cppFunction('
NumericVector test_C(double a) {
NumericVector indices(2);
indices[0] = a;
indices[1] = a;
return (indices);
}
')
which was equally slow. I doubt that this is my system's fault. I tried the Rcpp code from the answer to "R: Getting indices of elements in a sorted vector", which calculates which(v > a)[1] for a numeric vector v (of length 10e7 in my test) and a double a, and it ran very fast.
Any hint what I am doing wrong?
Are you by chance measuring the compilation too?
R> library(rbenchmark)
R> benchmark(test_C(2))[1:4]
       test replications elapsed relative
1 test_C(2)          100   0.001        1
R>

How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)

TL;DR
What is the fastest method in R for reading and writing a subset of
columns from a very large matrix? I attempted a solution with data.table
but need a fast way to extract a sequence of columns.
Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples. [For those applying this to other problems, be sure to read the disclaimers in the solutions!] Scroll to the bottom of the question for some more lessons learned.
This is my first Stack Overflow question- I greatly appreciate your time in taking a look and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast but the sheer number of them requires it to be as fast as possible. The simplest version of the idea is a matrix of dimension K by V where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.
set.seed(94253)
K <- 100
V <- 100000
mat <- matrix(runif(K*V),nrow=K,ncol=V)
We then end up performing a calculation on a subset of the columns and adding the result into the full matrix.
Naively, it looks like this:
Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)
Because this is done so many times, it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below: modification can actually happen in place in some circumstances).
For my problem the object need not be a matrix (and I'm sensitive to the difference as outlined here Assign a matrix to a subset of a data.table). I always want the full column and so the list structure of a data frame is fine. My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately getting the value is somewhat more complicated. We have to call the variables setting with=FALSE which dramatically slows things down.
library(data.table)
DT <- as.data.table(mat)
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))
Within the set() function, using i=NULL to reference all rows is incredibly fast, but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, col number, value) and use data.table's binary search to speed up retrieval. I tested this manually and, while it is quick, it approximately triples the memory requirements for the matrix. I would like to avoid this if possible.
Following the question here: Time in getting single elements from data.table and data.frame objects. Hadley Wickham gave an incredibly fast solution for a single index
Vone <- Vsub[1]
toinsert.one <- toinsert[,1]
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one))
however since the .subset2(DT,i) is just DT[[i]] without the methods dispatch there is no way (to my knowledge) to grab several columns at once although it certainly seems like it should be possible. As in the previous question, it seems like since we can overwrite the values quickly we should be able to read them quickly.
Any suggestions? Also please let me know if there is a better solution than data.table for this problem. I realize it's not really the intended use case in many respects, but I'm trying to avoid porting the whole series of operations to C.
Here is a sequence of timings of the elements discussed: the first two are for all columns, the second two are for just one column.
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),
mat[,Vone] <- mat[,Vone] + toinsert.one,
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),
times=1000L)
Unit: microseconds
                  expr      min       lq   median       uq       max neval
                Matrix   51.970   53.895   61.754   77.313   135.698  1000
            Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000
     Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000
 Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000
Answer and Lessons Learned:
Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code which modify the matrix in place breaking R convention of not modifying the argument to a function but providing a much faster result.
Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This points to an interesting and subtle point. I was performing my evaluations in RStudio. RStudio as Hadley notes in his book creates an additional reference for every object not within a function. Thus while in the performance case of a function the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and addresses of memory.
Fun with Rcpp:
You can use Eigen's Map class to modify an R object in place.
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::VectorXi;
typedef Map<MatrixXd> MapMatd;
typedef Map<VectorXi> MapVeci;
'
body <- '
MapMatd A(as<MapMatd>(AA));
const MapMatd B(as<MapMatd>(BB));
const MapVeci ix(as<MapVeci>(ind));
const int mB(B.cols());
for (int i = 0; i < mB; ++i)
{
A.col(ix.coeff(i)-1) += B.col(i);
}
'
funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"),
body, "RcppEigen", incl)
set.seed(94253)
K <- 100
V <- 100000
mat2 <- mat <- matrix(runif(K*V),nrow=K,ncol=V)
Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
invisible(funRcpp(mat2, toinsert, Vsub))
all.equal(mat, mat2)
#[1] TRUE
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
funRcpp(mat2, toinsert, Vsub))
# Unit: microseconds
#                                  expr    min     lq  median      uq       max neval
# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
#         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100
I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.
I do the addition in C++, but it is trivial to change the function to only do assignment.
Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.
Here's what I had in mind. This could probably be much sexier with Rcpp and friends, but I'm not as familiar with those tools.
#include <R.h>
#include <Rinternals.h>
#include <Rdefines.h>
SEXP addCol(SEXP mat, SEXP loc, SEXP matAdd)
{
int i, nr = nrows(mat), nc = ncols(matAdd), ll = length(loc);
if(ll != nc)
error("length(loc) must equal ncol(matAdd)");
if(TYPEOF(mat) != TYPEOF(matAdd))
error("mat and matAdd must be the same type");
if(nr != nrows(matAdd))
error("mat and matAdd must have the same number of rows");
if(TYPEOF(loc) != INTSXP)
error("loc must be integer");
int *iloc = INTEGER(loc);
switch(TYPEOF(mat)) {
case REALSXP:
for(i=0; i < ll; i++)
memcpy(&(REAL(mat)[(iloc[i]-1)*nr]),
&(REAL(matAdd)[i*nr]), nr*sizeof(double));
break;
case INTSXP:
for(i=0; i < ll; i++)
memcpy(&(INTEGER(mat)[(iloc[i]-1)*nr]),
&(INTEGER(matAdd)[i*nr]), nr*sizeof(int));
break;
default:
error("unsupported type");
}
return R_NilValue;
}
Put the above function in addCol.c, then run R CMD SHLIB addCol.c. Then in R:
addColC <- dyn.load("addCol.so")$addCol
.Call(addColC, mat, Vsub, mat[,Vsub]+toinsert)
The slight advantage to this approach over Roland's is that this only does the assignment. His function does the addition for you, which is faster, but also means you need a separate C/C++ function for every operation you need to do.

What is the easiest way to parallelize a vectorized function in R?

I have a very large list X and a vectorized function f. I want to calculate f(X), but this will take a long time if I do it with a single core. I have (access to) a 48-core server. What is the easiest way to parallelize the calculation of f(X)? The following is not the right answer:
library(foreach)
library(doMC)
registerDoMC()
foreach(x=X, .combine=c) %dopar% f(x)
The above code will indeed parallelize the calculation of f(X), but it will do so by applying f separately to every element of X. This ignores the vectorized nature of f and will probably make things slower as a result, not faster. Rather than applying f elementwise to X, I want to split X into reasonably-sized chunks and apply f to those.
So, should I just manually split X into 48 equal-sized sublists and then apply f to each in parallel, then manually put together the result? Or is there a package designed for this?
In case anyone is wondering, my specific use case is here.
Although this is an older question this might be interesting for everyone who stumbled upon this via google (like me): Have a look at the pvec function in the multicore package. I think it does exactly what you want.
Here's my implementation. It's a function chunkapply that takes a
vectorized function, a list of arguments that should be vectorized,
and a list of arguments that should not be vectorized (i.e.
constants), and returns the same result as calling the function on the
arguments directly, except that the result is calculated in parallel.
For a function f, vector arguments v1, v2, v3, and scalar
arguments s1, s2, the following should return identical results:
f(a=v1, b=v2, c=v3, d=s1, e=s2)
f(c=v3, b=v2, e=s2, a=v1, d=s1)
chunkapply(FUN=f, VECTOR.ARGS=list(a=v1, b=v2, c=v3), SCALAR.ARGS=list(d=s1, e=s2))
chunkapply(FUN=f, SCALAR.ARGS=list(e=s2, d=s1), VECTOR.ARGS=list(a=v1, c=v3, b=v2))
Since it is impossible for the chunkapply function to know which
arguments of f are vectorized and which are not, it is up to you to
specify when you call it, or else you will get the wrong results. You
should generally name your arguments to ensure that they get bound
correctly.
library(foreach)
library(iterators)
# Use your favorite doPar backend here
library(doMC)
registerDoMC()
get.chunk.size <- function(vec.length,
min.chunk.size=NULL, max.chunk.size=NULL,
max.chunks=NULL) {
if (is.null(max.chunks)) {
max.chunks <- getDoParWorkers()
}
size <- vec.length / max.chunks
if (!is.null(max.chunk.size)) {
size <- min(size, max.chunk.size)
}
if (!is.null(min.chunk.size)) {
size <- max(size, min.chunk.size)
}
num.chunks <- ceiling(vec.length / size)
actual.size <- ceiling(vec.length / num.chunks)
return(actual.size)
}
ichunk.vectors <- function(vectors=NULL,
min.chunk.size=NULL,
max.chunk.size=NULL,
max.chunks=NULL) {
## Calculate number of chunks
recycle.length <- max(sapply(vectors, length))
actual.chunk.size <- get.chunk.size(recycle.length, min.chunk.size, max.chunk.size, max.chunks)
num.chunks <- ceiling(recycle.length / actual.chunk.size)
## Make the chunk iterator
i <- 1
it <- idiv(recycle.length, chunks=num.chunks)
nextEl <- function() {
n <- nextElem(it)
ix <- seq(i, length = n)
i <<- i + n
vchunks <- foreach(v=vectors) %do% v[1+ (ix-1) %% length(v)]
names(vchunks) <- names(vectors)
vchunks
}
obj <- list(nextElem = nextEl)
class(obj) <- c("ichunk", "abstractiter", "iter")
obj
}
chunkapply <- function(FUN, VECTOR.ARGS, SCALAR.ARGS=list(), MERGE=TRUE, ...) {
## Check that the arguments make sense
stopifnot(is.list(VECTOR.ARGS))
stopifnot(length(VECTOR.ARGS) >= 1)
stopifnot(is.list(SCALAR.ARGS))
## Choose appropriate combine function
if (MERGE) {
combine.fun <- append
} else {
combine.fun <- foreach:::defcombine
}
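## Chunk and apply, and maybe merge
## NOTE: mcoptions is not defined anywhere in this snippet; it must exist before
## chunkapply is called, e.g. mcoptions <- list(preschedule=TRUE) (an assumed value)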
foreach(vchunk=ichunk.vectors(vectors=VECTOR.ARGS, ...),
.combine=combine.fun,
.options.multicore = mcoptions) %dopar%
{
do.call(FUN, args=append(vchunk, SCALAR.ARGS))
}
}
## Only do chunkapply if it will run in parallel
maybe.chunkapply <- function(FUN, VECTOR.ARGS, SCALAR.ARGS=list(), ...) {
if (getDoParWorkers() > 1) {
chunkapply(FUN, VECTOR.ARGS, SCALAR.ARGS, ...)
} else {
do.call(FUN, append(VECTOR.ARGS, SCALAR.ARGS))
}
}
Here are some examples showing that chunkapply(f,list(x)) produces identical results to f(x). I have set the max.chunk.size extremely small to ensure that the chunking algorithm is actually used.
> # Generate all even integers from 2 to 100 inclusive
> identical(chunkapply(function(x,y) x*y, list(1:50), list(2), max.chunk.size=10), 1:50 * 2)
[1] TRUE
> ## Sample from a standard normal distribution, then discard values greater than 1
> a <- rnorm(n=100)
> cutoff <- 1
> identical(chunkapply(function(x,limit) x[x<=limit], list(x=a), list(limit=cutoff), max.chunk.size=10), a[a<=cutoff])
[1] TRUE
If anyone has a better name than "chunkapply", please suggest it.
Edit:
As another answer points out, there is a function called pvec in the multicore package that has very similar functionality to what I have written. For simple cases, you should use that, and you should vote up Jonas Rauch's answer for it. However, my function is a bit more general, so if any of the following apply to you, you might want to consider using my function instead:
You need to use a parallel backend other than multicore (e.g. MPI). My function uses foreach, so you can use any parallelization framework that provides a backend for foreach.
You need to pass multiple vectorized arguments. pvec only vectorizes over a single argument, so you couldn't easily implement parallel vectorized addition with pvec, for example. My function allows you to specify arbitrary arguments.
The itertools package was designed to address this kind of problem. In this case, I would use isplitVector:
n <- getDoParWorkers()
foreach(x=isplitVector(X, chunks=n), .combine='c') %dopar% f(x)
For this example, pvec is undoubtedly faster and simpler, but this can be used on Windows with the doParallel package, for example.
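For anyone curious what an even split looks like in terms of index ranges, here is a rough sketch in plain C++. It is only an illustration of dividing n items into contiguous chunks whose sizes differ by at most one; chunk_bounds is a made-up name, and I am not claiming this is how isplitVector or pvec divide their input internally.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split indices [0, n) into `chunks` contiguous half-open ranges [start, end)
// whose lengths differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>> chunk_bounds(std::size_t n, std::size_t chunks) {
    assert(chunks > 0);
    std::vector<std::pair<std::size_t, std::size_t>> out;
    out.reserve(chunks);
    const std::size_t base = n / chunks;
    const std::size_t extra = n % chunks;  // the first `extra` chunks get one more element
    std::size_t start = 0;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t len = base + (c < extra ? 1 : 0);
        out.emplace_back(start, start + len);
        start += len;
    }
    return out;
}
```

Each worker would then apply the vectorized function to one range, and the results are concatenated in chunk order, which is what .combine='c' does in the foreach call above.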
Map-Reduce might be what you're looking for; it's been ported to R
How about something like this? R will take advantage of all the available memory and multicore will parallelize over all available cores.
library(multicore)
result = mclapply(X, f, mc.preschedule=FALSE, mc.set.seed=FALSE)

Resources