Running the same for loop on many GPU threads, using OpenCL

I need to subtract a 2D array, D, from many other 2D arrays. I have linearized (flattened) all the arrays: D is a 25-element array, and imges is a 1D array where 4 25-element arrays have been joined together. That is, to subtract D from 4 5x5 arrays, I turn each of those 5x5 arrays into a 25-element array and then append the 4 arrays; that is what imges is, and in this example it is a 100-element array. I believe I am capturing this properly in my kernel, index-wise.
The only way that has come to mind to do the subtraction is to run a for loop, so that every element of D gets subtracted from the corresponding element of the array handled by each thread. My idea was that this would work as follows:
Each thread would receive the D array and one of the arrays from which D has to be subtracted (in my example, one quarter of imges).
I would then iterate over the elements of both arrays with a for loop, doing the subtraction element by element.
However, it is not working as expected: it seems that only the last (or first) value of D gets used, and it is subtracted from all the elements of the other arrays.
I thought I had the hang of how indexing and threading work on a GPU, but now I am not so sure, since this has been challenging me for a while. The kernel is below.
Is there a better way to do this other than with a for loop? Thanks a lot in advance.
__kernel void reduce(__global float* D, __global float* imges, __global float* res)
{
    const int x = (int)get_global_id(0);
    const int y = (int)get_global_id(1);
    const int z = (int)get_global_id(2);
    int im_i = imges[x+25]; // Images are 5x5, i.e. 25-element arrays
    for(int j = 0; j < 25; j++){
        res[x+25] = im_i - D[j];
    }
}
Edit: I do not wish to parallelize the for loop itself, since the arrays will probably get bigger and I don't want to run into trouble with overhead.

If I understand what you are trying to do correctly, your kernel should look more like this:
__kernel void reduce(__global float* D, __global float* imges, __global float* res)
{
    const int x = (int)get_global_id(0);
    for(int j = 0; j < 25; j++){
        res[x*25 + j] = imges[x*25 + j] - D[j];
    }
}
This kernel will subtract the jth element of D from the jth element of each work-item's 25-element array in imges.
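For reference, here is a plain C++ sketch of the same indexing done on the CPU, under the question's setup of 4 flattened 5x5 images (the names numImages and imgSize and the fill values are illustrative). The outer loop plays the role of the work-item index x, so on the device you would launch the kernel with a 1-D global work size equal to the number of images (4 here):

#include <cstdio>
#include <vector>

int main() {
    const int numImages = 4;   // number of flattened 5x5 images in imges
    const int imgSize   = 25;  // elements per image

    std::vector<float> D(imgSize, 1.0f);                  // array to subtract
    std::vector<float> imges(numImages * imgSize, 3.0f);  // 4 images appended into one buffer
    std::vector<float> res(numImages * imgSize);

    // Outer loop = work-item index x (one per image),
    // inner loop = the per-element for loop inside the kernel.
    for (int x = 0; x < numImages; ++x)
        for (int j = 0; j < imgSize; ++j)
            res[x * imgSize + j] = imges[x * imgSize + j] - D[j];

    std::printf("res[0] = %f\n", res[0]);  // 3.0 - 1.0 = 2.0
    return 0;
}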

Related

Parallelize the other loop of a nested for loop using allgather

I am trying to parallelize a nested for loop below using allgather
for (int i=0; i<N1; i++) {
    for (int j=0; j<N0; j++)
        HS_1[i] += IN[j]*W0[j][i];
}
Here N1 is 1000 and N0 is 764.
I have four processes and I just want to parallelize the outer loop. Is there a way to do it?
This looks like a matrix-vector multiplication. Let's assume that you've distributed the HS output vector. Each component needs the full IN vector, so you indeed need an allgather for that. You also need to distribute the W0 matrix: each process gets part of the i indices, and all of the j indices.
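A minimal sketch of that decomposition, using the MPI C API from C++ (the names localIN, localW0, localHS and the assumption that N0 and N1 divide evenly over the ranks are mine, for illustration): each rank contributes its piece of IN to an allgather so everyone holds the full vector, owns only its share of the i indices, and computes the matching slice of HS_1.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N1 = 1000, N0 = 764;   // sizes from the question
    const int myN1 = N1 / size;      // this rank's share of the i indices
    const int myN0 = N0 / size;      // this rank's locally produced part of IN

    std::vector<double> localIN(myN0, 1.0);       // local piece of IN
    std::vector<double> IN(N0);                   // full IN after the allgather
    std::vector<double> localW0(N0 * myN1, 1.0);  // all j rows, only this rank's i columns
    std::vector<double> localHS(myN1, 0.0);       // this rank's slice of HS_1

    // Every component of the output needs the full IN vector.
    MPI_Allgather(localIN.data(), myN0, MPI_DOUBLE,
                  IN.data(), myN0, MPI_DOUBLE, MPI_COMM_WORLD);

    // The original loop, restricted to this rank's i indices.
    for (int i = 0; i < myN1; ++i)
        for (int j = 0; j < N0; ++j)
            localHS[i] += IN[j] * localW0[j * myN1 + i];

    MPI_Finalize();
    return 0;
}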

Check equality of a value in all MPI ranks

Say I have some int x. I want to check if all MPI ranks get the same value for x. What's a good way to achieve this using MPI collectives?
The simplest approach I could think of is to broadcast rank 0's x, do the comparison, and then allreduce-logical-and the comparison result. This requires two collective operations.
...
x = ...
x_bcast = comm.bcast(x, root=0)
all_equal = comm.allreduce(x==x_bcast, op=MPI.LAND)
if not all_equal:
raise Exception()
...
Is there a better way to do this?
UPDATE:
From the OpenMPI user list, I received the following response. And I think it's quite a nifty trick!
A pattern I have seen in several places is to allreduce the pair p = {-x, x} with MPI_MIN or MPI_MAX. If in the resulting pair p[0] == -p[1], then everyone has the same value. If not, at least one rank had a different value. Example:
bool is_same(int x) {
    int p[2];
    p[0] = -x;
    p[1] = x;
    MPI_Allreduce(MPI_IN_PLACE, p, 2, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    return (p[0] == -p[1]);
}
Solutions based on logical operators assume that you can convert between integers and logicals without any data loss. I think that's dangerous. You could do a bitwise AND where you make sure you use all the bytes of your int/real/whatever.
You could do two reductions: one max and one min, and see if they give the same result.
You could also write your own reduction operator: operate on two ints, and do a max on the first, min on the second. Then test if the two are the same.
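A minimal sketch of that last suggestion, using the MPI C API from C++ (the names min_max and all_ranks_equal, and the use of a contiguous pair datatype, are illustrative choices of mine): reduce the pair {x, x} with a user-defined operator that keeps the maximum in the first slot and the minimum in the second, then check that the two ends meet.

#include <mpi.h>
#include <algorithm>

// Reduction on a pair of ints: max of the first component, min of the second.
void min_max(void* in, void* inout, int* len, MPI_Datatype*) {
    int* a = static_cast<int*>(in);
    int* b = static_cast<int*>(inout);
    for (int k = 0; k < *len; ++k) {                           // one iteration per pair
        b[2 * k]     = std::max(a[2 * k],     b[2 * k]);       // running max
        b[2 * k + 1] = std::min(a[2 * k + 1], b[2 * k + 1]);   // running min
    }
}

bool all_ranks_equal(int x, MPI_Comm comm) {
    MPI_Datatype pair_t;
    MPI_Type_contiguous(2, MPI_INT, &pair_t);  // treat {max, min} as one element
    MPI_Type_commit(&pair_t);

    MPI_Op op;
    MPI_Op_create(&min_max, /*commute=*/1, &op);

    int p[2] = { x, x };
    MPI_Allreduce(MPI_IN_PLACE, p, 1, pair_t, op, comm);

    MPI_Op_free(&op);
    MPI_Type_free(&pair_t);
    return p[0] == p[1];  // max == min, so every rank had the same x
}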

Iterator for columns of matrix: copying matrix column to std::vector?

Trying to return an std::vector from a designated column in a matrix. This is the code I have so far:
template <typename T>
vector<T> ExtractMatrixColAsVector(NumericMatrix x, NumericVector column){
    vector<T> values = as<vector<T> >(NumericVector(x(_, as<int>(column))));
    return values;
}
I was wondering whether there was a better way of doing this if I wanted to convert the whole matrix into separate vectors? Is there an iterator for this purpose or some syntactic sugar that returns a vector of that column automatically?
Thanks for any help.
You could use a quick for loop to convert the whole matrix.
// [[Rcpp::export]]
vector< vector<double> > ExtractMatrixAsVectors(NumericMatrix x){
    vector< vector<double> > values(x.ncol());
    for(int i=0; i<values.size(); i++)
        values[i] = as< vector<double> >(NumericVector(x(_, i)));
    return values;
}
Also, I don't see too much point in using a template. The output of a numeric matrix column will always be a double precision float.
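If you do want something iterator-flavoured, Rcpp's column proxy exposes begin()/end(), so you can copy a column straight into a std::vector without going through as<>. A sketch along those lines (the function name MatrixColumnsViaIterators is mine):

#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector< std::vector<double> > MatrixColumnsViaIterators(NumericMatrix x){
    std::vector< std::vector<double> > cols(x.ncol());
    for (int i = 0; i < x.ncol(); ++i) {
        NumericMatrix::Column col = x(_, i);     // proxy object for column i
        cols[i].assign(col.begin(), col.end());  // copy via the column's iterators
    }
    return cols;
}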

Quickest distance computation between two large vectors in R

I wish to calculate the distance between each element in one vector and each element in another vector in the quickest possible way in R. A small example is:
distf<-function(a,b) abs(a-b)
x<-c(1,2,3)
y<-c(1,1,1)
result<-outer(x,y, distf)
The issue is that my x and y are now of length 30,000 each, and R crashes while trying to do this computation. And that is only doing it once, whereas I have to repeat the process 1000 times in a simulation study. Are there any quicker functions that can achieve this?
I eventually need to identify which of these distances are less than a fixed number (a caliper). I will eventually be studying many such calipers, so I need to save all these distances, especially since the computation is so demanding. The caliper function in the R package optmatch does this directly, but it cannot handle such a big computation either.
Here's an Rcpp version that returns a matrix of 1s and 0s depending on whether each pairwise comparison is <= a threshold. On my machine it took 22.5 secs to do 30,000 by 30,000. The output matrix is a little under 7 GB in RAM, though.
fast_cal.cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fast_cal(NumericVector x, NumericVector y, double threshold) {
    const long nr = x.length();
    const long nc = y.length();
    NumericMatrix output(nr, nc);
    for (long i = 0; i < nr; i++) {
        for (long j = 0; j < nc; j++) {
            output(i, j) = (fabs(x(i) - y(j)) <= threshold) ? 1 : 0;
        }
    }
    return output;
}
Testing
library("Rcpp")
sourceCpp("fast_cal.cpp")
x <- rnorm(30000)
y <- rnorm(30000)
out <- fast_cal(x, y, 0.5)
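Since only the pairs inside the caliper are ultimately needed, a variant that returns just those index pairs, rather than the full 30,000 by 30,000 matrix, keeps memory proportional to the number of matches. A sketch along the same lines (fast_cal_pairs is an illustrative name, not part of the answer above):

#include <Rcpp.h>
#include <cmath>
#include <vector>
using namespace Rcpp;

// Returns a two-column matrix of 1-based (i, j) index pairs with |x[i] - y[j]| <= threshold.
// [[Rcpp::export]]
IntegerMatrix fast_cal_pairs(NumericVector x, NumericVector y, double threshold) {
    const int nx = x.length();
    const int ny = y.length();
    std::vector<int> ii, jj;
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            if (std::fabs(x(i) - y(j)) <= threshold) {
                ii.push_back(i + 1);  // 1-based indices for R
                jj.push_back(j + 1);
            }
        }
    }
    IntegerMatrix out(ii.size(), 2);
    for (int k = 0; k < (int)ii.size(); k++) {
        out(k, 0) = ii[k];
        out(k, 1) = jj[k];
    }
    return out;
}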

Using MPI for simple calculation, different numbers of processes get different results?

I am using MPI for a very simple computation of PI by numerical integration. Using some mathematical rules, I eventually convert the calculation into a summation of this form:
PI = ∑ f(i), where i runs from 1 to 100000, and f(i) is a function that returns some double value based on i.
It is quite straightforward to turn the sum into a for loop iterating 100000 times. With MPI and p processes, I divide the loop into p segments, so each process gets 100000/p iterations (supposing 100000 % p = 0). Later I use MPI_Reduce with MPI_SUM to collect those sub-results and add them up to get the final result.
However, when I use different numbers of processes, the final results are slightly different: my PI result has 12 digits of precision, and the results start to differ around the 7th digit.
I cannot see why the results differ, since in my mind the program does exactly the same work no matter how the tasks are distributed.
Any help will be appreciated very much!
The numerical result of floating point operations often depends on the order in which they are executed. To understand this, you first need to understand how floating point numbers are represented by a computer. One example is adding numbers of different magnitude: because of the different exponents, one of them will be truncated (i.e. rounded). You can see this with this example:
#include <stdio.h>

int main(void) {
    double small, result1, result2;
    small = 1. / 3000.;

    /* One long running sum. */
    result1 = 0.;
    for (int i = 0; i < 10000; i++)
        result1 += small;

    /* The same additions, but accumulated in blocks of 100 first. */
    result2 = 0.;
    for (int i = 0; i < 100; i++) {
        double tmp = 0.;
        for (int j = 0; j < 100; j++)
            tmp += small;
        result2 += tmp;
    }

    printf("result1= %.17g, result2= %.17g\n", result1, result2);
    return 0;
}
By adding the numbers to a temporary result first, less truncation happens. It is very likely that something like this is happening in your code.
