I am trying to parallelize the nested for loop below using allgather:
for (int i=0; i<N1; i++) {
    for (int j=0; j<N0; j++)
        HS_1[i] += IN[j]*W0[j][i];
}
Here N1 is 1000 and N0 is 764.
I have four processes and I just want to parallelize the outer loop. Is there a way to do it?
This looks like a matrix-vector multiplication. Let's assume that you've distributed the HS output vector. Each component needs the full IN vector, so you indeed need an allgather for that. You also need to distribute the W0 matrix: each process gets part of the i indices, and all of the j indices.
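A minimal sketch of that layout (variable names, the stand-in values, and the storage of the local W0 block are all illustrative; it assumes the sizes divide evenly across the processes, which they do for 4 processes here):

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Sizes from the question; with 4 processes 764/4 = 191 and 1000/4 = 250.
    const int N0 = 764, N1 = 1000;
    const int local_n0 = N0 / nprocs;   // block of j indices this rank contributes to IN
    const int local_n1 = N1 / nprocs;   // block of i indices this rank owns in W0 and HS_1

    // Each rank holds only its slice of IN, its i-slice of W0 (all j, local i),
    // and its i-slice of HS_1. The values here are stand-ins.
    std::vector<double> IN_local(local_n0, 1.0);
    std::vector<double> W0_local(N0 * local_n1, 1.0);  // W0_local[j * local_n1 + i]
    std::vector<double> HS1_local(local_n1, 0.0);

    // Every output component needs the full IN vector, so gather it on all ranks.
    std::vector<double> IN(N0);
    MPI_Allgather(IN_local.data(), local_n0, MPI_DOUBLE,
                  IN.data(), local_n0, MPI_DOUBLE, MPI_COMM_WORLD);

    // Local part of the matrix-vector product: i runs over this rank's block only.
    for (int i = 0; i < local_n1; i++)
        for (int j = 0; j < N0; j++)
            HS1_local[i] += IN[j] * W0_local[j * local_n1 + i];

    if (rank == 0) std::printf("HS_1 block on rank 0 starts with %g\n", HS1_local[0]);
    MPI_Finalize();
    return 0;
}

If you later need the full HS_1 on every rank, an MPI_Allgather on HS1_local (analogous to the one above) reassembles it.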
I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs), comprising the (usual) 5 symbols: A, C, G, T, N (gap).
What is the optimal way to find the pairwise Hamming distance between these sequences?
Note that for each sequence, what I actually want is the reciprocal of the number of sequences (including itself) whose per-site Hamming distance from it is less than some threshold (0.1 in this example).
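More precisely, writing d_H(s_i, s_j) for the number of mismatching sites between sequences i and j, what I want for each sequence i is:
result_i = 1 / |{ j : d_H(s_i, s_j) / 98290 < 0.1 }|
which is what the sapply call below tries to compute.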
So far, I have attempted the following:
library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
temp <- 1/sum(sapply(snpdat, function(x) sum(x != snpdat[[i]])/98290 < 0.1))
}
snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.
This takes around 36 minutes to run on a Core i7-4790 with 16 GB of RAM.
I also tried using the stringdist package, which takes more time to generate the same result.
Any help is highly appreciated!
I am not sure if this is the optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone finds it useful someday...
This is the C++ code (I have used Sugar operators here)...
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double test5(const List& C, const int& x) {
    double HD = 0;  // count of sequences within the threshold (must start at zero)
    for (int i = 0; i < 3156; i++) {
        // 0.1 * 98290 = 9829 mismatching sites is the threshold from the question
        if (sum(CharacterVector(C[x]) != CharacterVector(C[i])) < 9829) HD++;
    }
    return HD;
}
After compiling:
library(Rcpp)
sourceCpp("hd_code.cpp")
I simply call this function from R:
library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t = Sys.time()
bla = foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat,i-1)
Sys.time() - t
Can anyone think of an even quicker way to do this?
What's happening, folks.
So, I've done a fair amount of research on merge sort, and in spite of getting the "gist" of it, I am still baffled by how one is supposed to store the subarrays in order to merge them back together. In other words, where do you save them so that they "know" about each other? Otherwise, in classic recursive fashion, you'd have all these independent function calls returning data that I assume would go out of scope.
Here's what I first thought: create a new array named "subs" to store the subarrays in upon each division (I also considered using a closure to do this and would like to know whether this is advisable). But, as you proceed to the next division, what are you gonna do—replace each element in subs with its subarrays? Then, you would be facing more costly work, especially once you consider how you're gonna move things around in subs in order to ensure that each subarray has its own index.
Heh—I have a bad feeling that this might be a far cry from what's actually supposed to be done. I understand that this algorithm is a classic example of the divide-and-conquer approach, but it's just strange to me that one couldn't just cut to the chase by splitting the array into all of its elements right off the bat (after all, that's the base case, and what would be wrong with throwing in a greedy approach to solving the problem?).
Thanks!
EDIT:
Alright, so I figured it out.
To sum it up: I used indices to track where to place elements (and obviate the need for built-in list functions that may slow down runtime).
By using nested functions and a (hidden) pointer to the new array, I kept data in scope. An auxiliary array buffers data from the subarrays.
In retrospect, what I originally had in mind (which vaguely resembled insertion sort) was, in fact, bottom-up merge sort. Having previously questioned the efficiency and purpose of top-down merge sort, I now understand that breaking the problem down speeds up the comparisons and swaps (especially on larger lists, which insertion sort handles less efficiently). I did not attempt to implement my initial idea because I did not have a clear enough picture of recursion and of how data is passed around. (For contrast, I've put a rough sketch of the bottom-up variant after my code below.)
#!/bin/python
import sys

def merge_sort(arr):
    def merge(*indices):  # indices = first, last, and pivot indices, respectively
        head, tail = indices[0], indices[1]
        pivot = indices[2]
        i = head
        j = pivot + 1
        k = 0
        while (i <= pivot and j <= tail):
            if new[i] <= new[j]:
                aux[k] = new[i]
                i += 1
                k += 1
            else:
                aux[k] = new[j]
                j += 1
                k += 1
        while (i <= pivot):
            aux[k] = new[i]
            i += 1
            k += 1
        while (j <= tail):
            aux[k] = new[j]
            j += 1
            k += 1
        for x in xrange(head, tail + 1):
            new[x] = aux[x - head]
    # end merge

    def split(a, *indices):  # indices = first and last indices, respectively
        head, tail = indices[0], indices[1]
        pivot = (head + tail) / 2
        if head < tail:
            l_sub = a[head:pivot + 1]
            r_sub = a[pivot + 1:tail + 1]
            split(l_sub, head, pivot)
            split(r_sub, pivot + 1, tail)
            merge(head, tail, pivot)
    # end split

    new = arr
    aux = list(new)
    tail = len(new) - 1
    split(new, 0, tail)
    return new
# end merge_sort

if __name__ == "__main__":
    loops = int(raw_input().strip())
    for _ in xrange(loops):
        arr = map(int, raw_input().strip().split(' '))
        result = merge_sort(arr)
        print result
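And, as an aside, since I mentioned the bottom-up variant above: here is a rough, illustrative sketch of it (in C++ rather than Python, just to show the idea of merging runs of width 1, 2, 4, ... with no recursion at all):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Bottom-up merge sort: treat the array as sorted runs of width 1, 2, 4, ...
// and merge neighbouring runs through an auxiliary buffer until one run remains.
void merge_sort_bottom_up(std::vector<int>& a) {
    std::vector<int> aux(a.size());
    for (std::size_t width = 1; width < a.size(); width *= 2) {
        for (std::size_t lo = 0; lo < a.size(); lo += 2 * width) {
            std::size_t mid = std::min(lo + width, a.size());
            std::size_t hi  = std::min(lo + 2 * width, a.size());
            std::merge(a.begin() + lo, a.begin() + mid,
                       a.begin() + mid, a.begin() + hi,
                       aux.begin() + lo);
        }
        std::copy(aux.begin(), aux.end(), a.begin());
    }
}

int main() {
    std::vector<int> v = {5, 2, 9, 1, 7, 3};
    merge_sort_bottom_up(v);
    for (int x : v) std::printf("%d ", x);  // prints 1 2 3 5 7 9
    std::printf("\n");
    return 0;
}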
I wish to calculate the distance between each element in one vector and each element in another vector in the quickest possible way in R. A small example is:
distf<-function(a,b) abs(a-b)
x<-c(1,2,3)
y<-c(1,1,1)
result<-outer(x,y, distf)
The issue is that my x and y are now of length 30,000 each, and R crashes while trying to do this computation. And that is only doing it once, whereas I have to repeat the process 1000 times in a simulation study. Are there any quicker functions that would let me achieve this?
I eventually need to identify which of these distances are less than a fixed number (a caliper). I will be studying many such fixed calipers eventually, so I need to save all these distances, especially since the computation is so demanding. The caliper function in the R package optmatch does this directly, but it cannot handle such a big computation either.
Here's an Rcpp version that returns a matrix of 1s and 0s depending on whether each pairwise comparison is <= a threshold. On my machine it took 22.5 secs to do 30,000 by 30,000. The output matrix is a little under 7 GB in RAM, though.
fast_cal.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix fast_cal(NumericVector x, NumericVector y, double threshold) {
    const long nr = x.length();
    const long nc = y.length();
    NumericMatrix output(nr, nc);
    for (long i = 0; i < nr; i++) {
        for (long j = 0; j < nc; j++) {
            output(i, j) = (fabs(x(i) - y(j)) <= threshold) ? 1 : 0;
        }
    }
    return output;
}
Testing
library("Rcpp")
sourceCpp("fast_cal.cpp")
x <- rnorm(30000)
y <- rnorm(30000)
out <- fast_cal(x, y, 0.5)
I am using MPI for a very simple computation of PI by numerical integration. Using some mathematical identities, I eventually convert the calculation into a summation of this form:
PI = ∑ f(i), where i runs from 1 to 100000, and f(i) is a function that returns some double value based on i.
When programming, it is quite straightforward to convert the sum into a for loop that iterates 100000 times. With MPI and p processes, I divide the for loop into p segments, so each process gets 100000/p iterations (supposing 100000 % p == 0). Later on I use MPI_Reduce with MPI_SUM to collect those sub-results and add them up to get the final result.
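Roughly, my decomposition looks like the following (a minimal sketch; f(i) here is just a stand-in for the real term of my series):

#include <mpi.h>
#include <cstdio>

// Stand-in for the real term of the series; the actual f(i) comes from the
// integration rule, it just has to return a double based on i.
static double f(int i) {
    return 1.0 / (static_cast<double>(i) * i);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int N = 100000;                // total number of terms
    const int chunk = N / p;             // assumes N % p == 0
    const int begin = rank * chunk + 1;  // i runs from 1 to N
    const int end = begin + chunk;

    double local_sum = 0.0;
    for (int i = begin; i < end; i++)
        local_sum += f(i);

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum = %.12f\n", total);

    MPI_Finalize();
    return 0;
}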
However, when I use different numbers of processes, the final results are slightly different. My final PI result has 12 digits of precision, and the results start to differ after around the 7th digit.
I cannot work out why the results differ, since in my mind the program does exactly the same work no matter how the tasks are distributed.
Any help will be appreciated very much!
The numerical result of floating point operations often depends on the order in which they are executed. To understand this, you first need to understand how floating point numbers are represented by a computer. One example is adding numbers of different magnitude: due to the different exponents, one of them gets truncated (i.e. rounded). You can see this in the following example:
#include <stdio.h>

int main(void) {
    double small, result1, result2;
    small = 1. / 3000.;

    /* Add the small value 10000 times in one long running sum. */
    result1 = 0.;
    for (int i = 0; i < 10000; i++)
        result1 += small;

    /* Add the same 10000 values, but in 100 groups of 100. */
    result2 = 0.;
    for (int i = 0; i < 100; i++) {
        double tmp = 0.;
        for (int j = 0; j < 100; j++)
            tmp += small;
        result2 += tmp;
    }

    printf("result1= %.17g, result2= %.17g\n", result1, result2);
    return 0;
}
By adding the numbers to a temporary result first, less truncation happens. It is very likely that something like this is happening in your code.
I'm having a hard time reading an article that has lots of formulas. It has several summations, including nested ones (I mean like this: ∑h ∑i). Can I write that as two nested for loops?
Like:
for (h=1; h<=5; h++){
    for(i=1; i<=5; i++){
        sum+=i;
    }
}
Thanks for your patience :)
If you look at example 11 here:
∑x ∑y (x*y) = (∑x x) * (∑y y)
The left side can be written as nested for loops:
for(x goes 1 to n)
    for(y goes 1 to m)
        add to the result (x*y)
The right side can be written as two independent loops.
for(x goes 1 to n)
    add to the firstResult (x)
for(y goes 1 to m)
    add to the secondResult (y)
set result to firstResult * secondResult
The right side improves the time complexity from O(n*m) to O(n+m), but costs a little extra space (to hold on to the first and second results).
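As a quick, illustrative check that both sides agree (the bounds n and m are arbitrary here):

#include <cstdio>

int main() {
    const int n = 5, m = 7;  // arbitrary example bounds

    // Left side: nested loops, O(n*m) additions.
    long long nested = 0;
    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++)
            nested += x * y;

    // Right side: two independent loops, O(n+m) additions, then one multiply.
    long long sum_x = 0, sum_y = 0;
    for (int x = 1; x <= n; x++) sum_x += x;
    for (int y = 1; y <= m; y++) sum_y += y;
    long long factored = sum_x * sum_y;

    std::printf("nested = %lld, factored = %lld\n", nested, factored);  // both 420
    return 0;
}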