If I have a for loop in C++ with some stride, how can I parallelize it in an OpenCL kernel?
For example:
for(int i = 0; i < 100; i += 4)
    for(int j = 0; j < 60; j += 4)
    {
        a[i] = b[j] + 2;
    }
In OpenCL, if I want to parallelize the loops, I can think of using "/" or "%", but is there another solution?
I am thinking something like this:
int id1 = get_global_id(0);
int id2 = get_global_id(1);
if((id1 % 4 == 0) && (id2 % 4 == 0))
{
    a[id1] = b[id2] + 2;
}
This is just an example; I want to know how to work with the stride. Is there any other way?
Multiply id1 and id2 by 4, and set the global size to 100/4 and 60/4 when you launch the kernel:
int id1 = get_global_id(0) * 4;
int id2 = get_global_id(1) * 4;
a[id1] = b[id2] + 2;
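On the host side that means launching with a reduced NDRange. A minimal sketch (queue and kernel stand for your existing handles, and error checking is omitted):

size_t global[2] = { 100 / 4, 60 / 4 };  // 25 x 15 work-items, one per strided (i, j) pair
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);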
I have a kernel which evaluates the interaction between all pairs of neighbors of an atom. Each atom has at most 4 neighbors, so I store their indices in an int4. But in order to loop over these neighbors I need to access them by index (neighs[0] rather than neighs.x).
The loop should look something like:
int iatom = get_global_id(0);
int4 ng = neighs[iatom];  // each atom has 4 neighbors
float4 p0 = atom_pos[iatom];
float4 force = (float4)(0.f, 0.f, 0.f, 0.f);
for(int i = 0; i < 4; i++){
    int ing = ng[i];      // HERE: index into vector
    float4 pi = atom_pos[ing];
    for(int j = i + 1; j < 4; j++){
        int jng = ng[j];  // HERE: index into vector
        float4 pj = atom_pos[jng];
        force += evalInteraction(p0, pi, pj);
    }
}
forces[iatom] = force;
I have some ideas for how it could be done, but I am not sure about them:
Unroll the loops
since there are just 4*3/2 = 6 pair interactions, it would probably be even more efficient. But it would be much less readable and more difficult to modify.
cast int4 to int*
but is it fine? Doesn't it break something? Doesn't it create a performance issue? I mean this:
int4 ng_ = neighs[iatom]; // make sure we copy it to local memory or a register
int* ng = (int*)&ng_;     // a pointer to local memory can be optimized out, right?
for(int i = 0; i < 4; i++){
    int ing = ng[i];
    ...
}
You can cast directly, but you can also declare a union for easier access:
union
{
    int  components[4];
    int4 vector;
} neighbors;

neighbors.vector = ng;
neighbors.components[i]; // works now
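Putting the union together with the loop from the question, the kernel body might look like this (a sketch only, untested, reusing the buffer names from above):

int iatom = get_global_id(0);

union { int components[4]; int4 vector; } ng;
ng.vector = neighs[iatom];  // one copy from global memory

float4 p0 = atom_pos[iatom];
float4 force = (float4)(0.f, 0.f, 0.f, 0.f);

for(int i = 0; i < 4; i++){
    float4 pi = atom_pos[ng.components[i]];
    for(int j = i + 1; j < 4; j++){
        float4 pj = atom_pos[ng.components[j]];
        force += evalInteraction(p0, pi, pj);
    }
}
forces[iatom] = force;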
Can we parallelize a recursive function using MPI?
I am trying to parallelize the quicksort function, but I don't know if it works in MPI because it is recursive. I also want to know where I should place the parallel region.
// quickSort.c
#include <stdio.h>

void quickSort(int[], int, int);
int partition(int[], int, int);

int main(void)
{
    int a[] = { 7, 12, 1, -2, 0, 15, 4, 11, 9 };
    int i;

    printf("\n\nUnsorted array is: ");
    for(i = 0; i < 9; ++i)
        printf(" %d ", a[i]);

    quickSort(a, 0, 8);

    printf("\n\nSorted array is: ");
    for(i = 0; i < 9; ++i)
        printf(" %d ", a[i]);

    return 0;
}

void quickSort(int a[], int l, int r)
{
    int j;
    if(l < r)
    {
        // divide and conquer
        j = partition(a, l, r);
        quickSort(a, l, j - 1);
        quickSort(a, j + 1, r);
    }
}

int partition(int a[], int l, int r)
{
    int pivot, i, j, t;
    pivot = a[l];
    i = l; j = r + 1;
    while(1)
    {
        do ++i; while(i <= r && a[i] <= pivot);  // check the bound before reading a[i]
        do --j; while(a[j] > pivot);
        if(i >= j) break;
        t = a[i]; a[i] = a[j]; a[j] = t;
    }
    t = a[l]; a[l] = a[j]; a[j] = t;
    return j;
}
I would also really appreciate it if there is a simpler version of the quicksort code.
Well, technically you can, but I'm afraid this would be efficient only on an SMP machine. And does the array fit on a single node? If not, then you cannot even perform the first pass of a quicksort.
If you really need to sort an array on a parallel system using MPI, you might want to consider using merge sort instead (of course you can still use quicksort for the individual blocks at each node before you begin merging them).
If you still want to use quicksort but are confused by the recursive version, here is a sketch of a non-recursive algorithm which can hopefully be parallelized a bit more easily, although it is essentially the same:
std::stack<std::pair<int, int> > unsorted;
unsorted.push(std::make_pair(0, size - 1));
while (!unsorted.empty()) {
    std::pair<int, int> u = unsorted.top();
    unsorted.pop();
    int m = partition(A, u.first, u.second);
    // Here you can send one of the intervals to another node instead of
    // pushing it onto the stack, so it would be processed in parallel.
    if (m + 1 < u.second) unsorted.push(std::make_pair(m + 1, u.second));
    if (u.first < m - 1) unsorted.push(std::make_pair(u.first, m - 1));
}
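To illustrate the comment inside the loop: a rough sketch of shipping one interval to another rank with MPI. Everything here is an assumption for illustration — dest, the tags, and the matching receives on the worker side are not part of the answer, and you would also need to gather the sorted slice back:

// Offload the right interval to rank `dest` instead of pushing it.
int bounds[2] = { m + 1, u.second };
int count = bounds[1] - bounds[0] + 1;
MPI_Send(bounds, 2, MPI_INT, dest, 0, MPI_COMM_WORLD);            // interval bounds
MPI_Send(A + bounds[0], count, MPI_INT, dest, 1, MPI_COMM_WORLD); // the data slice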
Theoretically "anything" can be parallelized using MPI, but remember that MPI isn't doing any parallelization itself. It's just providing the communication layer between processes. As long as all of your sends and receives (or collective calls) match up, it's a correct program for the most part. That being said, it may not be the most efficient thing to use MPI, depending on your algorithm. If you are going to be sorting lots and lots of data (more than can fit in the memory of one node) then it could be efficient to use MPI (you probably want to take a look at the RMA chapter in that case) or some other higher level library that might make things even simpler for this type of application (UPC, Co-array Fortran, SHMEM, etc.).
I am trying to implement a "coupling from the past" algorithm in Rcpp. For this I need to store a matrix of random numbers, and if the algorithm has not converged, create a new matrix of random numbers and store that as well. This might have to be done 10+ times until convergence.
I was hoping I could use a List and update it dynamically, similar to what I would do in R. I was actually very surprised that it worked to some extent, but I got errors whenever the list size became large. This seems to make sense, as I did not allocate the memory needed for the additional list elements, although I am not that familiar with C++ and not sure whether that is the problem.
Here is an example of what I tried; however, be aware that this will probably crash your R session:
library("Rcpp")
cppFunction(
includes = '
NumericMatrix RandMat(int nrow, int ncol)
{
int N = nrow * ncol;
NumericMatrix Res(nrow,ncol);
NumericVector Rands = runif(N);
for (int i = 0; i < N; i++)
{
Res[i] = Rands[i];
}
return(Res);
}',
code = '
void foo()
{
// This is the relevant part, I create a list then update it and print the results:
List x;
for (int i=0; i<10; i++)
{
x[i] = RandMat(100,10);
Rf_PrintValue(wrap(x[i]));
}
}
')
foo()
Does anyone know a way to do this without crashing R? I guess I could initialize the list with a fixed number of elements here, but in my application the number of elements is random.
You have to "allocate" enough space for your list. Maybe you can use something like a resizefunction:
List resize( const List& x, int n ){
int oldsize = x.size() ;
List y(n) ;
for( int i=0; i<oldsize; i++) y[i] = x[i] ;
return y ;
}
and whenever you want your list to be bigger than it is now, you can do:
x = resize( x, n ) ;
Your initial list is of size 0, so it is expected that you get unpredictable behavior on the first iteration of your loop.
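With that helper, the loop from the question can grow the list before each write. A minimal sketch:

List x(0);                 // start empty, as in the question
for (int i = 0; i < 10; i++)
{
    x = resize(x, i + 1);  // make room before writing element i
    x[i] = RandMat(100, 10);
}

Growing by one element copies the whole list on every iteration; in practice you would grow geometrically (e.g. double the size whenever it fills up) to avoid quadratic copying.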
I have an ATI FirePro V4800 graphics card which does not support cl_khr_int64_base_atomics. I am trying to adapt the radix sort algorithm to long integers. The algorithm uses atomic_inc, whose 64-bit counterpart is atom_inc, which I cannot use in the kernel. So my question is: is there a piece of code that performs the same function as 64-bit atomic_inc which I can use? The relevant kernel code is given below:
__kernel void histogram(__global uint* unsortedData,
                        __global uint* buckets,
                        uint shiftCount,
                        __local uint* sharedArray)
{
    size_t localId   = get_local_id(0);
    size_t globalId  = get_global_id(0);
    size_t groupId   = get_group_id(0);
    size_t groupSize = get_local_size(0);
    uint numGroups   = get_global_size(0) / get_local_size(0);

    // Initialize shared array to zero
    sharedArray[localId] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Calculate thread-histograms
    uint value = unsortedData[globalId];
    value = (value >> shiftCount) & 0xFFU;
    atomic_inc(sharedArray + value);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Copy calculated histogram bin to global memory
    uint bucketPos = groupId * groupSize + localId;
    //uint bucketPos = localId * numGroups + groupId;
    buckets[bucketPos] = sharedArray[localId];
}
Any suggestions? Thank you.
Edit:
Another way to do the same is described on this blog: http://suhorukov.blogspot.in/2011/12/opencl-11-atomic-operations-on-floating.html. It gives a very generic implementation of atomic increment.
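For reference, the pattern described there is a compare-and-swap loop. A sketch from memory, so treat the details as an assumption rather than a quote of the post (shown for a float add):

void atomicAddFloat(volatile __global float* addr, float val)
{
    union { uint u; float f; } expected, desired;
    do {
        expected.f = *addr;             // snapshot the current value
        desired.f  = expected.f + val;  // compute the updated value
        // retry if another work-item changed *addr between the read and the swap
    } while (atomic_cmpxchg((volatile __global uint*)addr,
                            expected.u, desired.u) != expected.u);
}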
You could try something like this:
void atomInc64(__local uint* counter)
{
    uint old, carry;

    old   = atomic_inc(&counter[0]);
    carry = (old == 0xFFFFFFFF);
    atomic_add(&counter[1], carry);
}
Where counter is an array of two 32-bit integers. While the two halves don't increment at exactly the same time, the total should be correct when the program completes.
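Once the kernel has finished, the two halves can be combined into the full 64-bit count (using OpenCL's ulong here; on the host you would use a uint64_t):

// counter[0] holds the low 32 bits, counter[1] the carry-accumulated high bits
ulong total = ((ulong)counter[1] << 32) | counter[0];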
I'm working on a GPU kernel and I have some problems copying data from global to local memory.
Here is my kernel function:
__kernel void nQueens( __global int * data, __global int * result, int board_size)
So I want to copy from __global int* data to __local int aux_data[OBJ_SIZE].
I tried to copy like a normal array:
for(int i = 0; i < OBJ_SIZE; ++i)
{
    aux_data[stack_size*OBJ_SIZE + i] = data[index*OBJ_SIZE + i];
}
and also with the built-in copy functions:
event_t e = async_work_group_copy ( aux_data, (data + (index*OBJ_SIZE)), OBJ_SIZE, 0);
wait_group_events (1, e);
In both cases I get different values in global and local memory.
I don't know what I'm doing wrong...
One of the problems with the way you are copying data in your first attempt is that you are writing to parts of an array that don't exist: aux_data[stack_size*OBJ_SIZE + i] will overflow whenever stack_size > 1.
The problem with your second attempt might be that you need to pass an array of events, not just a single event.
One thing to make sure of is what index refers to. I'm assuming in my solutions that it refers to the group ID and not the thread ID. If it is indeed the thread ID, then we have other problems.
Possible Solution 1:
int gid = get_group_id(0);
int lid = get_local_id(0);
int l_s = get_local_size(0);  // note: local *size*, used as the stride

for(int i = lid; i < OBJ_SIZE; i += l_s)
{
    aux_data[i] = data[gid*OBJ_SIZE + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
Possible Solution 2:
int gid = get_group_id(0);
event_t e = async_work_group_copy (aux_data, data + (gid*OBJ_SIZE), OBJ_SIZE, 0);
wait_group_events (1, &e);
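Note that async_work_group_copy must be encountered by all work-items in the group with identical arguments, and wait_group_events is itself the synchronization point, so an extra barrier before reading aux_data should not be needed.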