I am learning OpenCL for the first time and am currently modifying a shortest-path algorithm. I know that OpenCL usually solves problems with parallel computing, so I wonder whether I can also use that approach to find the minimum value in an array and its position.
This is my previous attempt. I assumed that, as long as the smallest value wins in the end, the result would be correct whether or not the operation is locked. Unfortunately, when I inspect the variables with printf, the valid nodes are detected but I still can't get the correct result.
__kernel void findWay(__global int* A, __global int* B, __global int* minNode, __global int* minDis, __global int* isFinish)
//A: weightMatrix , B: usedNode
//dijkstra algorithm , src node is 0
size_t dst = get_global_id(1);
size_t src = get_global_id(0);
size_t vCount = get_global_size(0);
int index = dst * vCount + src;
while(isFinish[0] != vCount){
if((src == minNode[0])&&(B[dst] == 0)&&(A[index] != INT_MAX)){
A[dst*vCount] = min(A[dst*vCount + 0],A[minNode[0]*vCount + 0] + A[index]);
minDis[0] = INT_MAX;
//here is the bug
if((src == 0) &&(B[dst] == 0)){
if(minDis[0] > A[index]){
minDis[0] = A[index];
minNode[0] = dst;
B[minNode[0]] = 1;
if(index == 0){
In the end, I can only use a normal way to achieve this operation.
if((src == 0) &&(dst == 0)){
for(int i = 0 ; i < vCount ;i++){
if(B[i] == 0 && minDis[0] > A[i *vCount]){
minDis[0] = A[i*vCount];
minNode[0] = i;
I would like to ask about this search process, can the looping step be omitted?

Horizontal operations across a parallelized array are difficult. The general approach is a series of binary-tree-like kernel passes. Start with the original array and have each GPU thread load 2 neighboring elements, choose the smaller one, and write it back to the same array at the position of the first of the two. The next kernel pass loads two elements from the list of every second element, compares them, and writes the smaller one to the first of the two positions. Repeat until only one element is left; for n elements this takes about log2(n) passes.
I will illustrate it below. Values that are no longer touched by the kernel are marked with *.
original array: 5|2|1|6|9|3|4|8
after 1st kernel pass: 2 *|1 *|3 *|4 *
after 2nd kernel pass: 1 * * *|3 * * *
after 3rd kernel pass: 1 * * * * * * *
smallest element is 1.
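The passes above only track the value, while the question also asks for the position. The following is a minimal host-side C sketch of the same tree reduction that carries the original index along with each value. It is an emulation, not a kernel: each iteration of the outer loop stands for one kernel pass, and each iteration of the inner loop stands for one independent work item. The names `min_reduce` and `MAX_N` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_N 1024  /* hypothetical upper bound for this sketch */

/* Returns the position of the smallest value in data[0..n-1] and stores
   the value in *min_out. Works on scratch copies, so data is unchanged. */
size_t min_reduce(const int *data, size_t n, int *min_out)
{
    int val[MAX_N];
    size_t pos[MAX_N];
    assert(n > 0 && n <= MAX_N);

    for (size_t i = 0; i < n; ++i) {
        val[i] = data[i];
        pos[i] = i;
    }

    /* One iteration of the outer loop = one kernel pass.
       The stride doubles each pass: 1, 2, 4, ... */
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* One iteration of the inner loop = one independent work item. */
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            if (val[i + stride] < val[i]) {
                val[i] = val[i + stride];   /* keep the smaller value ...  */
                pos[i] = pos[i + stride];   /* ... and where it came from  */
            }
        }
    }
    *min_out = val[0];
    return pos[0];
}
```

In a real kernel the work items of one pass must all finish before the next pass starts, which is why the passes are separate kernel launches (or are separated by barriers inside a single work-group).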


Portable vector shift/permutation in OpenCL?

I'm trying to write a trimmed mean kernel that takes as input a set of frames (~100). I'm thinking of using an insertion sort (of size ~8). This means that I'll need to read one float/ uint/ushort at a time from the input images and compare it against an 8-wide vector, shifting the elements up and inserting the new value at the correct spot (if necessary), with the largest value added to the mean.
I'm having difficulties finding a portable way of shifting the elements in the vector and inserting the new one at the correct spot. I know that AMD GPUs have ds_permute for example, but those are not portable, and I can't figure out a clever way of using arithmetic and relational operators to do it (since those operate only on their lane and AFAIK unaligned vector accesses are UB in OpenCL).
If you only have 8 items in your list then you could add some indirection and have an index table uchar[8]. You assign the pre-sorted elements values 0-7. As you perform the sort you don't rearrange those items, instead you insert their indices into the table.
To get the speedup you then need to store each index using 4 bits so that all 8 fit into a 32-bit word. Honestly, I don't think this will be faster in your case though.
float elements[8];
uint index_table = 0;
uint sorted_size = 0;
// insert elements[i]
void insert(uint i)
{
    uint temp = index_table;
    for (uint j = 0; j < sorted_size; ++j)
    {
        if (elements[i] < elements[temp & 0xf])
        {
            // Insert i: keep slots 0..j-1, then i, then the remaining slots
            temp = (temp << 4) | i;
            index_table = (index_table & ((1u << (4 * j)) - 1)) | (temp << (4 * j));
            return;
        }
        temp >>= 4;
    }
    // Insert at end
    index_table |= i << (4 * sorted_size);
}
void insertion_sort()
{
    // We can skip the first iteration since the 1st element is always inserted at the start
    for (sorted_size = 1; sorted_size < 8; ++sorted_size)
        insert(sorted_size);
}
float ith_smallest(uint i)
{
    return elements[(index_table >> (4 * i)) & 0xf];
}
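To check the bit manipulation on the host, here is a small self-contained C sketch of the same packed index table, written without global state. It assumes exactly 8 elements and 4 bits per index, as in the answer; the function names `insert_index`, `sort_indices` and `kth_smallest` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Inserts index i (0..7) into a packed table that already holds
   `sorted_size` indices ordered by ascending element value.
   Returns the new table. */
uint32_t insert_index(const float *elements, uint32_t table,
                      unsigned sorted_size, unsigned i)
{
    assert(i < 8 && sorted_size < 8);
    uint32_t rest = table;                       /* slots j.. of the table */
    for (unsigned j = 0; j < sorted_size; ++j) {
        if (elements[i] < elements[rest & 0xfu]) {
            uint32_t low  = table & ((1u << (4 * j)) - 1u); /* slots 0..j-1    */
            uint32_t high = ((rest << 4) | i) << (4 * j);   /* i, then the rest */
            return low | high;
        }
        rest >>= 4;
    }
    return table | ((uint32_t)i << (4 * sorted_size));      /* append at the end */
}

/* Sorts 8 elements indirectly and returns the packed table. */
uint32_t sort_indices(const float *elements)
{
    uint32_t table = 0;                          /* slot 0 holds index 0 */
    for (unsigned n = 1; n < 8; ++n)
        table = insert_index(elements, table, n, n);
    return table;
}

/* Value of the k-th smallest element (k = 0..7). */
float kth_smallest(const float *elements, uint32_t table, unsigned k)
{
    return elements[(table >> (4 * k)) & 0xfu];
}
```

The elements themselves never move; only the 32-bit table changes, so the whole sorted order fits in a single register.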

OpenCL Matrix Multiplication Altera Example

I am very new to OpenCL and am going through the Altera OpenCL examples.
In their matrix multiplication example, they have used the concept of blocks, where dimensions of the input matrices are multiple of block size. Here's the code:
void matrixMult( // Input and output matrices
__global float *restrict C,
__global float *A,
__global float *B,
// Widths of matrices.
int A_width, int B_width)
// Local storage for a block of input matrices A and B
__local float A_local[BLOCK_SIZE][BLOCK_SIZE];
__local float B_local[BLOCK_SIZE][BLOCK_SIZE];
// Block index
int block_x = get_group_id(0);
int block_y = get_group_id(1);
// Local ID index (offset within a block)
int local_x = get_local_id(0);
int local_y = get_local_id(1);
// Compute loop bounds
int a_start = A_width * BLOCK_SIZE * block_y;
int a_end = a_start + A_width - 1;
int b_start = BLOCK_SIZE * block_x;
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)
running_sum += A_local[local_y][k] * B_local[local_x][k];
// Store result in matrix C
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
Assume block size is 2, then: block_x and block_y are both 0; and local_x and local_y are both 0.
Then A_local[0][0] would be A[0] and B_local[0][0] would be B[0].
Sizes of A_local and B_local are 4 elements each.
In that case, how would A_local and B_local access other elements of the block in that iteration?
Also would separate threads/cores be assigned for each local_x and local_y?
There is definitely a barrier missing in your code sample. The outer for loop as you have it will only produce correct results if all work items are executing instructions in lockstep fashion, thus guaranteeing the local memory is populated before the for k loop.
Maybe this is the case for Altera and other FPGAs, but this is not correct for CPUs and GPUs.
You should add barrier(CLK_LOCAL_MEM_FENCE); if you are getting unexpected results, or if you want to be compatible with other types of hardware.
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
    A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
    B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until the whole block is loaded
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        running_sum += A_local[local_y][k] * B_local[local_x][k];
    barrier(CLK_LOCAL_MEM_FENCE); // wait before the block is overwritten
}
A_local and B_local are both shared by all work items of the work group, so all their elements are loaded in parallel (by all work items of the work group) at each step of the encompassing for loop.
Then each work item uses some of the loaded values (not necessarily the values the work item loaded itself) to do its share of the computation.
And finally, the work item stores its individual result into the global output matrix.
It is a classical tiled implementation of a matrix-matrix multiplication. However, I'm really surprised not to see any sort of call to a memory synchronisation function, such as work_group_barrier(CLK_LOCAL_MEM_FENCE) between the load of A_local and B_local and their use in the k loop... But I might very well have overlooked something here.
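The load-then-use pattern described above can be emulated on the host. The following C sketch is an illustration under assumptions, not the Altera code: `BLOCK` and `N` are hypothetical constants, the loops over `ly` and `lx` stand for the work items of one work-group, and the B tile is stored untransposed for readability. Running the load phase to completion before the compute phase is what the barrier guarantees on a real device.

```c
#include <assert.h>

#define BLOCK 2   /* tile edge, plays the role of BLOCK_SIZE */
#define N 4       /* hypothetical square matrix size, a multiple of BLOCK */

/* C = A * B for row-major N x N matrices, computed tile by tile. */
void tiled_matmul(const float *A, const float *B, float *C)
{
    assert(N % BLOCK == 0);
    for (int by = 0; by < N; by += BLOCK)           /* work-group row    */
    for (int bx = 0; bx < N; bx += BLOCK) {         /* work-group column */
        float sum[BLOCK][BLOCK] = {{0.0f}};         /* one running_sum per work item */
        for (int t = 0; t < N; t += BLOCK) {
            float A_local[BLOCK][BLOCK], B_local[BLOCK][BLOCK];
            /* Phase 1: every work item (ly, lx) loads one element of each tile. */
            for (int ly = 0; ly < BLOCK; ++ly)
                for (int lx = 0; lx < BLOCK; ++lx) {
                    A_local[ly][lx] = A[(by + ly) * N + (t + lx)];
                    B_local[ly][lx] = B[(t + ly) * N + (bx + lx)];
                }
            /* A barrier belongs here: phase 2 reads elements loaded by others. */
            /* Phase 2: every work item uses a full row and column of the tiles. */
            for (int ly = 0; ly < BLOCK; ++ly)
                for (int lx = 0; lx < BLOCK; ++lx)
                    for (int k = 0; k < BLOCK; ++k)
                        sum[ly][lx] += A_local[ly][k] * B_local[k][lx];
            /* A second barrier belongs here, before the tiles are overwritten. */
        }
        for (int ly = 0; ly < BLOCK; ++ly)
            for (int lx = 0; lx < BLOCK; ++lx)
                C[(by + ly) * N + (bx + lx)] = sum[ly][lx];
    }
}
```

With a block size of 2, work item (0, 0) loads only A_local[0][0] and B_local[0][0], but in phase 2 it also reads A_local[0][1] and B_local[1][0], which were loaded by its neighbours.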

Hough transform and OpenCL

I'm trying to implement the Hough transform for circles in OpenCL, but I've encountered a really weird problem. Every time I run the Hough kernel, I end up with a slightly different accumulator, even though the parameters are the same and the accumulator is always a freshly zeroed table. My kernel code is below:
#define BLOCK_LEN 256
__kernel void HoughCirclesKernel(
__global int* A,
__global int* imgData,
__global int* _width,
__global int* _height,
__global int* r
__local int imgBuff[BLOCK_LEN];
int localThreadIndex = get_local_id(0); //threadIdx.x
int globalThreadIndex = get_local_id(0) + get_group_id(0) * BLOCK_LEN; //threadIdx.x + blockIdx.x * Block_Len
int width = *_width; int height = *_height;
int radius = *r;
A[globalThreadIndex] = 0;
if(globalThreadIndex < width*height)
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
if(imgBuff[localThreadIndex] > 0)
float s1, c1;
for(int i = 0; i<180; i++)
s1 = sincos(i, &c1);
int centerX = globalThreadIndex % width + radius * c1;
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
if(centerX < width && centerY < height)
atomic_inc(A + centerX + centerY * width);
Could this be the fault of how I am incrementing the accumulator?
if(globalThreadIndex < width*height) {
    imgBuff[localThreadIndex] = imgData[globalThreadIndex];
    barrier(CLK_LOCAL_MEM_FENCE);
}
This is undefined behaviour whenever the work items of a group disagree on the condition, since the barrier is inside the branch.
Every work item in a work-group must reach the same barrier, or none of them may.
Try this:
if(globalThreadIndex < width*height) {
    imgBuff[localThreadIndex] = imgData[globalThreadIndex];
}
barrier(CLK_LOCAL_MEM_FENCE);
Also, there could be another issue if you are using multiple devices:
get_local_id(0) + get_group_id(0)
Here get_group_id(0) returns the group id per device, and it starts from 0 on every device, just as get_global_id does; so when using multiple devices you should add proper offsets in the "ndrange" call. Even when different devices meet the same floating-point accuracy requirements, one of them may be more accurate than another and give slightly different results. If it is a single device, try lowering the GPU frequencies, since the differences may come from a defect or be a side effect of overclocking.
I have managed to solve my problem by finding and correcting three issues.
First of all the kernel code, the line:
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
should be:
int centerY = (globalThreadIndex / width) + radius * s1;
The main change here was dividing by width, not height. This caused inaccuracy problems.
if(centerX < width && centerY < height)
The above condition was changed to:
if(x < width && x >= 0)
if(y < height && y >=0)
As for the accumulator problem, first I will post the code I used to create clBuffer (I am using library for C#):
int[] a = new int[width*height]; //image size
ErrorCode error;
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite, (IntPtr)(a.Length * sizeof(int)), out error);
CheckErr(error, "Cl.CreateBuffer");
The fix here was simple and pretty much self-explanatory:
int[] a = Enumerable.Repeat(0, width * height).ToArray();
ErrorCode error;
GCHandle accHandle = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr accPtr = accHandle.AddrOfPinnedObject();
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite | MemFlags.CopyHostPtr, (IntPtr)(a.Length * sizeof(int)), accPtr, out error);
CheckErr(error, "Cl.CreateBuffer");
I filled the accumulator table with zeros and then copied it to device buffer each time I executed the kernel.
The above errors caused the accumulator to look different and a bit malformed each time I executed the kernel.
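The first two kernel fixes in this answer come down to row-major index arithmetic plus a bounds check on both sides. The sketch below isolates that step in plain C so it can be checked on the host. It assumes the offsets `dx` and `dy` (radius times cosine and sine, already converted to integers) are computed elsewhere; the function name `vote_index` is hypothetical.

```c
#include <assert.h>

/* Accumulator index voted for by pixel `idx` of a row-major
   width x height image, shifted by (dx, dy).
   Returns -1 when the candidate centre falls outside the image. */
int vote_index(int idx, int width, int height, int dx, int dy)
{
    assert(width > 0 && height > 0 && idx >= 0 && idx < width * height);
    int x = idx % width + dx;   /* column: remainder of division by width */
    int y = idx / width + dy;   /* row: quotient of division by width     */
    if (x < 0 || x >= width || y < 0 || y >= height)
        return -1;              /* negative coordinates must be rejected too */
    return x + y * width;
}
```

Without the lower-bound checks, a negative x combined with a valid y can still produce an in-range index, so the vote silently lands in the wrong accumulator cell.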

How to share work roughly evenly between processes in MPI despite the array_size not being cleanly divisible by the number of processes?

Hi all, I have an array of length N, and I'd like to divide it as evenly as possible between 'size' processes. N/size has a remainder, e.g. 1000 array elements divided by 7 processes, or 14 elements by 3 processes.
I'm aware of at least a couple of ways of work sharing in MPI, such as:
for (i=rank; i<N;i+=size){ a[i] = DO_SOME_WORK }
However, this does not divide the array into contiguous chunks, which I'd like to do as I believe it is faster for IO reasons.
Another one I'm aware of is:
int count = N / size;
int start = rank * count;
int stop = start + count;
// now perform the loop
int nloops = 0;
for (int i=start; i<stop; ++i)
a[i] = DO_SOME_WORK;
However, with this method, for my first example we get count = 1000/7 = 142, so the last rank starts at 852 and stops before 994. The last 6 elements are ignored.
Would the best solution be to append something like this to the previous code?
int remainder = N%size;
int start = N-remainder;
if (rank == 0){
for (i=start;i<N;i++){
a[i] = DO_SOME_WORK;
This seems messy, and if it's the best solution I'm surprised I haven't seen it elsewhere.
Thanks for any help!
If I had N tasks (e.g., array elements) and size workers (e.g., MPI ranks), I would go as follows:
int count = N / size;
int remainder = N % size;
int start, stop;
if (rank < remainder) {
// The first 'remainder' ranks get 'count + 1' tasks each
start = rank * (count + 1);
stop = start + count;
} else {
// The remaining 'size - remainder' ranks get 'count' task each
start = rank * count + remainder;
stop = start + (count - 1);
}
for (int i = start; i <= stop; ++i) { a[i] = DO_SOME_WORK(); }
This is how it works:
ranks 0 .. remainder-1 ('remainder' ranks): count+1 tasks each, first task rank * (count+1), last task rank * (count+1) + count
ranks remainder .. size-1 ('size - remainder' ranks): count tasks each, first task rank * count + remainder, last task rank * count + remainder + count - 1
tasks held by the first group: remainder * count + remainder
Here's a closed-form solution.
Let N = array length and P = number of processors.
From j = 0 to P-1,
Starting point of array on processor j = floor(N * j / P)
Length of array on processor j = floor(N * (j + 1) / P) - floor(N * j / P)
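As a quick check of the closed form, here is a minimal C sketch. It assumes non-negative N and j, so that integer division is the floor; the function names `chunk_start` and `chunk_length` are made up for this example. The product is formed in `long long` to reduce the risk of overflow for large N.

```c
#include <assert.h>

/* First index owned by processor j (0 <= j <= P). Chunk j is the
   half-open range [chunk_start(N, j, P), chunk_start(N, j + 1, P)). */
long chunk_start(long N, long j, long P)
{
    assert(P > 0 && j >= 0 && j <= P && N >= 0);
    /* Integer division is the floor because all operands are non-negative. */
    return (long)(((long long)N * j) / P);
}

/* Number of elements owned by processor j (0 <= j < P). */
long chunk_length(long N, long j, long P)
{
    return chunk_start(N, j + 1, P) - chunk_start(N, j, P);
}
```

For N = 1000 and P = 7 the chunks start at 0, 142, 285, 428, 571, 714 and 857, so the first processor gets 142 elements and the other six get 143 each.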
Consider your "1000 steps and 7 processes" example.
Simple division won't work because integer division (in C) gives you the floor, and you are left with some remainder: 1000 / 7 is 142, and there will be 6 doodads hanging out.
Ceiling division has the opposite problem: ceil(1000/7) is 143, but then the last processor overruns the array, or ends up with less to do than the others.
You are asking for a scheme to evenly distribute the remainder over processors. Some processes should have 142, others 143. There must be a more formal approach but considering the attention this question's gotten in the last six months maybe not.
Here's my approach. Every process needs to do this algorithm, and just pick out the answer it needs for itself.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char ** argv)
{
#define NR_ITEMS 1000
    int i, rank, nprocs;
    int *bins;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bins = calloc(nprocs, sizeof(int));
    int nr_alloced = 0;
    for (i=0; i<nprocs; i++) {
        /* items not handed out yet, shared among the buckets still empty */
        int remainder = NR_ITEMS - nr_alloced;
        int buckets = (nprocs - i);
        /* if you want the "big" buckets up front, do ceiling division */
        bins[i] = remainder / buckets;
        nr_alloced += bins[i];
    }
    if (rank == 0) {
        for (i=0; i<nprocs; i++) printf("%d ", bins[i]);
        printf("\n");
    }
    free(bins);
    MPI_Finalize();
    return 0;
}
I know this is long since gone, but a simple way to do this is to give each process floor((number of items) / (number of processes)) + (1 if process_num < num_items mod num_procs else 0). In Python, an array with work counts:
# Number of items
# Number of processes
# Items per process
[NI//NP + (1 if P < NI%NP else 0) for P in range(0,NP)]
Improving on @Alexander's answer: make use of min to condense the logic.
int count = N / size;
int remainder = N % size;
int start = rank * count + min(rank, remainder);
int stop = (rank + 1) * count + min(rank + 1, remainder);
for (int i = start; i < stop; ++i) { a[i] = DO_SOME_WORK(); }
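Standard C has no built-in `min` for integers, so the snippet above needs a small helper to compile. The following is a self-contained sketch of the same formula, with a hypothetical helper `min_int` and a hypothetical wrapper `block_range` that returns the half-open range for one rank.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Half-open range [*start, *stop) of tasks owned by `rank`.
   The first N % size ranks get one extra task each. */
void block_range(int N, int size, int rank, int *start, int *stop)
{
    assert(size > 0 && rank >= 0 && rank < size && N >= 0);
    int count = N / size;
    int remainder = N % size;
    *start = rank * count + min_int(rank, remainder);
    *stop  = (rank + 1) * count + min_int(rank + 1, remainder);
}
```

For N = 14 and size = 3 this gives the ranges [0, 5), [5, 10) and [10, 14), so the ranks receive 5, 5 and 4 tasks.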
I think that the best solution is to write yourself a little function for splitting work across processes evenly enough. Here's some pseudo-code, I'm sure you can write C (is that C in your question ?) better than I can.
function split_evenly_enough(num_steps, num_processes)
return = repmat(0, num_processes) ! pseudo-Matlab for an array of num_processes 0s
steps_per_process = floor(num_steps/num_processes)
return = steps_per_process ! set all elements of the return vector to this number
return(1:mod(num_steps, num_processes)) = steps_per_process + 1 ! some processes have 1 more step
How about this?
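int* distribute(int total, int processes) {
    int* distribution = new int[processes]();  // zero-initialised counts
    int last = processes - 1;
    int remaining = total;
    int process = 0;
    while (remaining != 0) {
        ++distribution[process];  // hand one element to the current process
        --remaining;
        if (process != last) {
            ++process;
        }
        else {
            process = 0;
        }
    }
    return distribution;
}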
The idea is that you assign an element to the first process, then an element to the second process, then an element to the third process, and so on, jumping back to the first process whenever the last one is reached.
This method works even when the number of processes is greater than the number of elements. It uses only very simple operations and should therefore be very fast.
I had a similar problem, and here is my non-optimal solution with Python and the mpi4py API. An optimal solution would take into account how the processors are laid out; here the extra work is distributed to the lower ranks. The uneven workloads differ by only one task, so it should not be a big deal in general.
from mpi4py import MPI
import sys
def get_start_end(comm,N):
    """
    Distribute N consecutive things (rows of a matrix , blocks of a 1D array)
    as evenly as possible over a given communicator.
    Uneven workload (differs by 1 at most) is on the initial ranks.

    comm: MPI communicator
    N: int
        Total number of things to be distributed.

    rstart: index of first local row
    rend: 1 + index of last row

    Index is zero based.
    """
    P = comm.size
    rank = comm.rank
    rstart = 0
    rend = N
    if P >= N:
        if rank < N:
            rstart = rank
            rend = rank + 1
        else:
            rstart = 0
            rend = 0
    else:
        n = N//P # Integer division PEP-238
        remainder = N%P
        rstart = n * rank
        rend = n * (rank+1)
        if remainder:
            if rank >= remainder:
                rstart += remainder
                rend += remainder
            else:
                rstart += rank
                rend += rank + 1
    return rstart, rend
if __name__ == '__main__':
n = int(sys.argv[1])

OpenCL / try to understand Kernel Code

I am studying an OpenCL code which simulates the N-body problem from the following tutorial :
My main issue relies on the kernel code :
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
        float4 p2 = pblock[j]; /* Read a cached particle position */
        float4 d = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f = p2.w*invr*invr*invr;
        a += f*d; /* Accumulate acceleration */
    }
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
}
I don't understand exactly what happens at execution: the kernel code is executed n times, where n is the number of work-items (which is also the number of threads), but the part of the code above uses local memory for each work-group (there seem to be nb work-groups).
So, at execution, up to the first barrier, do I fill the local pblock array with the global values of pos_old?
Still up to the first barrier, will the pblock array of another work-group contain the same values as the arrays of the other work-groups, since jb=0 before the barrier?
It seems to be a way to share these arrays between all the work-groups, but this is not totally clear to me.
Any help is welcome.
Can you post the entire kernel code please? I have to make assumptions about the params and private variables.
It looks like there are nt work items in the group, and ti represents the current work item. When the loop executes, each item in the group copies only a single element. Usually this copy is from a global data source. The first barrier forces the work item to wait until the other items have made their copy. This is necessary because every work item in the group needs to read the data copied by every other work item. The values within one group should not be the same, because ti is different for each work item. (jb*nt would still equal zero for the first iteration though.)
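The copy step described above can be emulated on the host to see which global element ends up in which slot. This is a minimal C sketch under assumptions: positions are plain floats instead of float4, the loop over `ti` stands for the work items of one work-group, and the function names `cached_index` and `fill_tile` are hypothetical.

```c
#include <assert.h>

/* Global index that work item `ti` copies into pblock[ti] during tile `jb`,
   for a work-group of `nt` work items. The work-group id does not appear:
   every group caches the same tile, each in its own local memory. */
int cached_index(int jb, int nt, int ti)
{
    assert(nt > 0 && ti >= 0 && ti < nt && jb >= 0);
    return jb * nt + ti;
}

/* Emulates one work-group filling its pblock for tile jb
   (the part of the loop body before the first barrier). */
void fill_tile(const float *pos_old, float *pblock, int jb, int nt)
{
    for (int ti = 0; ti < nt; ++ti)   /* one iteration per work item */
        pblock[ti] = pos_old[cached_index(jb, nt, ti)];
}
```

With 100 particles and nt = 25, tile 0 caches elements 0 to 24 and tile 3 caches elements 75 to 99, regardless of which work-group runs the code.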
Here is the entire kernel code :
__global float4* pos ,
__global float4* vel,
int numBodies,
float deltaTime,
float epsSqr,
__local float4* localPos,
__global float4* newPosition,
__global float4* newVelocity)
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// Number of tiles we need to iterate
unsigned int numTiles = numBodies / localSize;
// position of this work-item
float4 myPos = pos[gid];
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for(int i = 0; i < numTiles; ++i)
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = pos[idx];
// Synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// calculate acceleration effect due to each body
// a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
for(int j = 0; j < localSize; ++j)
// Calculate acceleartion caused by particle j on particle i
float4 r = localPos[j] - myPos;
float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
float invDist = 1.0f / sqrt(distSqr + epsSqr);
float invDistCube = invDist * invDist * invDist;
float s = localPos[j].w * invDistCube;
// accumulate effect of all particles
acc += s * r;
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
float4 oldVel = vel[gid];
// updated position and velocity
float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
newPos.w = myPos.w;
float4 newVel = oldVel + acc * deltaTime;
// write to global memory
newPosition[gid] = newPos;
newVelocity[gid] = newVel;
There are "numTiles" work-groups with "localSize" work-items for each work-group.
"gid" is the global index and "tid" is the local index.
Let's start at the first iteration of the loop "for(int i = 0; i < numTiles; ++i)" with "i=0":
If I take for example :
numTiles = 4, localSize = 25 and numBodies = 100 = number of work-items.
Then, at execution, if I have gid = 80, then tid = 5, idx = 5 and the first assignment will be: localPos[5] = pos[5]
Now, if I take gid = 5, then tid = 5 and idx = 5, and I get the same assignment: localPos[5] = pos[5]
So, from what I understand, in the first iteration and after the first barrier, each work-item sees the same local array localPos, i.e. the sub-array of the first global block, which is pos[0:24].
Is this a good explanation of what happens ?
