Quickest distance computation between two large vectors in R

I wish to calculate the distance between each element in one vector and each element in another vector in the quickest possible way in R. A small example is:
distf <- function(a, b) abs(a - b)
x <- c(1, 2, 3)
y <- c(1, 1, 1)
result <- outer(x, y, distf)
The issue is that my x and y are now of length 30,000 each, and R crashes while trying to do this computation: the result alone is a 30,000 × 30,000 double matrix, which needs about 7.2 GB of RAM. And this is only doing it once; I have to repeat the process 1000 times in a simulation study. Are there any quicker ways to achieve this?
I eventually need to identify which of these distances fall below a fixed number (a caliper). I will eventually be studying many such calipers, so I need to save all these distances, especially since the computation is so demanding. The caliper function in the R package optmatch does this directly, but it cannot handle such a big computation either.

Here's an Rcpp version that returns a matrix of 1s and 0s depending on whether each pairwise comparison is <= a threshold. On my machine it took 22.5 secs to do 30,000 by 30,000. The output matrix is a little under 7 GB in RAM, though.
fast_cal.cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fast_cal(NumericVector x, NumericVector y, double threshold) {
  const R_xlen_t nr = x.length();
  const R_xlen_t nc = y.length();
  NumericMatrix output(nr, nc);
  for (R_xlen_t i = 0; i < nr; i++) {
    for (R_xlen_t j = 0; j < nc; j++) {
      // 1 if this pair is within the caliper, 0 otherwise
      output(i, j) = (fabs(x(i) - y(j)) <= threshold) ? 1 : 0;
    }
  }
  return output;
}
Testing
library("Rcpp")
sourceCpp("fast_cal.cpp")
x <- rnorm(30000)
y <- rnorm(30000)
out <- fast_cal(x, y, 0.5)
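If the ~7 GB footprint is the binding constraint, one option is to return a LogicalMatrix instead: R stores logicals as 4-byte integers, so the same 0/1 result takes roughly half the memory (about 3.6 GB for 30,000 by 30,000). A minimal variant of the same loop; the name fast_cal_logical is mine:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalMatrix fast_cal_logical(NumericVector x, NumericVector y, double threshold) {
  const R_xlen_t nr = x.length();
  const R_xlen_t nc = y.length();
  // LogicalMatrix holds 4-byte ints rather than 8-byte doubles,
  // halving the memory for the same 0/1 comparison result
  LogicalMatrix output(nr, nc);
  for (R_xlen_t i = 0; i < nr; i++)
    for (R_xlen_t j = 0; j < nc; j++)
      output(i, j) = std::abs(x(i) - y(j)) <= threshold;
  return output;
}
It is sourced and called exactly like fast_cal above, e.g. out <- fast_cal_logical(x, y, 0.5).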

Related

Parallelize the other loop of a nested for loop using allgather

I am trying to parallelize a nested for loop below using allgather
for (int i = 0; i < N1; i++) {
    for (int j = 0; j < N0; j++)
        HS_1[i] += IN[j] * W0[j][i];
}
Here N1 is 1000 and N0 is 764.
I have four processes and I just want to parallelize the outer loop. Is there a way to do it?
This looks like a matrix-vector multiplication. Let's assume you have distributed the HS_1 output vector. Each component needs the full IN vector, so you do indeed need an allgather for that. You also need to distribute the W0 matrix: each process gets part of the i indices and all of the j indices.
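A minimal sketch of that layout, assuming the process count P divides both N0 and N1 evenly and that each rank stores its W0 column block flattened; all names here are illustrative:
#include <mpi.h>
#include <stdlib.h>

/* Each rank owns n1_local = N1/P entries of HS_1 and the matching
   N0 x n1_local column block of W0 (flattened row-major here). */
void matvec(const double *IN_local, int n0_local,
            const double *W0_local, double *HS_local, int n1_local,
            int N0, MPI_Comm comm)
{
    double *IN = (double *) malloc(N0 * sizeof(double));
    /* every output component needs the full input vector, so
       assemble it on every rank from the distributed slices */
    MPI_Allgather(IN_local, n0_local, MPI_DOUBLE,
                  IN, n0_local, MPI_DOUBLE, comm);
    for (int i = 0; i < n1_local; i++) {
        HS_local[i] = 0.0;
        for (int j = 0; j < N0; j++)
            HS_local[i] += IN[j] * W0_local[j * n1_local + i];
    }
    free(IN);
}
After the loop, each rank holds its own slice of HS_1; if every rank needs the full output vector afterwards, a second MPI_Allgather on HS_local does that.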

Find the Hamming distance between string sequences

I have a dataset of 3156 DNA sequences, each with 98290 characters (SNPs), made up of the usual 5 symbols: A, C, G, T, N (gap).
What is the optimal way to find the pairwise Hamming distance between these sequences?
Note that for each sequence, what I actually want is the reciprocal of the number of sequences (including itself) whose per-site Hamming distance from it is below some threshold (0.1 in this example).
So far, I have attempted the following:
library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
    1 / sum(sapply(snpdat, function(x) sum(x != snpdat[[i]]) / 98290 < 0.1))
}
snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.
This takes around 36 minutes to run on a Core i7-4790 with 16 GB RAM.
I also tried the stringdist package, which took even more time to produce the same result.
Any help is highly appreciated!
I am not sure this is the optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone finds it useful someday...
This is the C++ code (I have used Sugar operators here)...
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double test5(const List& C, const int& x) {
  double HD = 0;                 // must be initialised before counting
  CharacterVector ref = C[x];    // hoisted so it is not rebuilt on every iteration
  // 9829 = 0.1 * 98290: count sequences within the per-site threshold
  for (int i = 0; i < 3156; i++)
    if (sum(ref != CharacterVector(C[i])) < 9829) HD++;
  return HD;
}
After compiling:
library(Rcpp)
sourceCpp("hd_code.cpp")
I simply call this function from R:
library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t <- Sys.time()
bla <- foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat, i - 1)
Sys.time() - t
Can anyone think of an even quicker way to do this?
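One further idea, sketched here as my own untested variant rather than anything from the thread: collapse each sequence to a single string on the R side first (e.g. seqs <- sapply(snpdat, paste0, collapse = "")), then compare raw characters with an early exit once the caliper of 9829 mismatches is exceeded:
#include <Rcpp.h>
using namespace Rcpp;

// Hypothetical sketch: assumes all sequences have equal length and have been
// collapsed to single strings in R beforehand. Comparing raw chars avoids
// building a logical vector per pair, and the early exit stops counting as
// soon as a pair can no longer fall below the threshold.
// [[Rcpp::export]]
IntegerVector hd_neighbours(std::vector<std::string> seqs, int max_mismatch) {
  const int n = seqs.size();
  IntegerVector counts(n);
  for (int i = 0; i < n; i++) {
    int neighbours = 0;
    for (int k = 0; k < n; k++) {
      const std::string &a = seqs[i];
      const std::string &b = seqs[k];
      int mism = 0;
      for (size_t p = 0; p < a.size() && mism < max_mismatch; p++)
        if (a[p] != b[p]) mism++;
      if (mism < max_mismatch) neighbours++;   // pair stayed within the caliper
    }
    counts[i] = neighbours;
  }
  return counts;
}
Called as hd_neighbours(seqs, 9829); the reciprocals 1 / counts can then be taken in R.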

Iterator for columns of matrix: copying matrix column to std::vector?

Trying to return an std::vector from a designated column in a matrix. This is the code I have so far:
template <typename T>
vector<T> ExtractMatrixColAsVector(NumericMatrix x, NumericVector column) {
    vector<T> values = as<vector<T> >(NumericVector(x(_, as<int>(column))));
    return values;
}
I was wondering whether there was a better way of doing this if I wanted to convert the whole matrix into separate vectors? Is there an iterator for this purpose or some syntactic sugar that returns a vector of that column automatically?
Thanks for any help.
You could use a quick for loop to convert the whole matrix.
// [[Rcpp::export]]
vector< vector<double> > ExtractMatrixAsVectors(NumericMatrix x) {
    // one std::vector per column, so size by the number of columns (not rows)
    vector< vector<double> > values(x.ncol());
    for (int i = 0; i < x.ncol(); i++)
        values[i] = as< vector<double> >(NumericVector(x(_, i)));
    return values;
}
Also, I don't see much point in using a template here: a numeric matrix column will always hold double-precision values.
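As for the iterator question: the column proxy returned by x(_, j) exposes begin()/end(), so as a sketch (with an illustrative name) the intermediate NumericVector can be skipped entirely:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<double> ColumnViaIterators(NumericMatrix x, int j) {
    // NumericMatrix::Column is an iterator-compatible proxy, so the
    // std::vector range constructor can copy the column directly.
    NumericMatrix::Column col = x(_, j);
    return std::vector<double>(col.begin(), col.end());
}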

Speeding up a strangely slow Rcpp function

I want to rewrite an expensive R function using Rcpp. As I am new to this topic I experimented with some very simple stuff.
I wrote the following function:
Rcpp::cppFunction('
  std::vector<int> test_C(double a) {
    std::vector<int> indices;
    indices.reserve(2);
    indices.push_back(a);
    indices.push_back(a);
    return indices;
  }
')
Now that works perfectly well as far as the result goes, but it takes 0.1 seconds, which is of course far too much for such a task. Previously I had
Rcpp::cppFunction('
  NumericVector test_C(double a) {
    NumericVector indices(2);
    indices[0] = a;
    indices[1] = a;
    return indices;
  }
')
which was equally slow. I doubt this is my system's fault. I tried the Rcpp code from the answer to "R: Getting indices of elements in a sorted vector", which computes which(v > a)[1] for a numeric vector v (of length 10e7 in my test) and a double a, and it ran very fast.
Any hint as to what I am doing wrong?
Are you by chance measuring the compilation too? cppFunction compiles and links the C++ code at the moment the function is defined; once that is done, the call itself is practically instantaneous:
R> library(rbenchmark)
R> benchmark(test_C(2))[1:4]
       test replications elapsed relative
1 test_C(2)          100   0.001        1
R>

Using MPI for simple calculation, different numbers of processes get different results?

I am using MPI for a very simple computation of PI by numerical integration. Using some mathematical rules, I eventually convert the calculation into a summation of this form:
PI = ∑ f(i), where i runs from 1 to 100000 and f(i) is a function returning a double based on i.
It is quite straightforward to program: I convert the sum into a for loop iterating 100000 times. With MPI and p processes, I divide the loop into p segments, so each process gets 100000/p iterations (assuming 100000 % p == 0). Afterwards I use MPI_Reduce with MPI_SUM to collect the partial results and add them up into the final result.
However, when I use different numbers of processes, the final results differ slightly: my PI result has 12 significant digits, and the results start to diverge around the 7th digit.
I cannot see why the result should differ, since in my mind the program does exactly the same operations no matter how the tasks are distributed.
Any help would be appreciated very much!
The numerical result of floating point operations often depends on the order in which they were executed. To understand this, you first need to understand how floating point numbers are represented by a computer. One example is adding numbers of different magnitude: because of the different exponents, part of the smaller number is truncated (i.e. rounded away). You can see this in the following example:
#include <stdio.h>

int main(void) {
    double small, result1, result2;
    small = 1. / 3000.;

    result1 = 0.;
    for (int i = 0; i < 10000; i++)
        result1 += small;

    /* the same 10000 additions, but accumulated in blocks of 100 */
    result2 = 0.;
    for (int i = 0; i < 100; i++) {
        double tmp = 0.;
        for (int j = 0; j < 100; j++)
            tmp += small;
        result2 += tmp;
    }

    printf("result1 = %.17g, result2 = %.17g\n", result1, result2);
    return 0;
}
By adding the numbers to a temporary result first, less truncation happens. Something like this is very likely happening in your code: with a different number of processes, each rank accumulates a different number of terms, and MPI_Reduce combines the partial sums in a different order, so the rounding differs.
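If you want the local partial sums to be less sensitive to accumulation order, compensated (Kahan) summation is a standard mitigation. A minimal sketch, my own illustration rather than anything from the question:
/* Kahan (compensated) summation: carries along the low-order bits
   that a plain += would round away */
double kahan_sum(const double *v, int n) {
    double sum = 0.0, c = 0.0;      /* c holds the running compensation */
    for (int i = 0; i < n; i++) {
        double y = v[i] - c;        /* apply the stored correction */
        double t = sum + y;         /* low-order bits of y may be lost here */
        c = (t - sum) - y;          /* recover exactly what was lost */
        sum = t;
    }
    return sum;
}
Note that this only stabilises each rank's local sum; MPI_Reduce may still combine the p partial results in an implementation-defined order, so exact agreement across different process counts is still not guaranteed.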
