I would like to use the local/shared memory optimization to reduce global memory access, so I basically have this function:
float __attribute__((always_inline)) test_unoptimized(const global float* data, ...) {
    // ...
    for(uint j=0; j<def_data_length; j++) {
        const float x = data[j];
        // do some computation with x, like finding the minimum value ...
    }
    // ...
    return x_min;
}
and do the usual local/shared memory optimization on it:
float __attribute__((always_inline)) test_optimized(const global float* data, ...) {
    // ...
    const uint lid = get_local_id(0); // shared memory optimization (only works with first ray)
    local float cache_x[def_ws];
    for(uint j=0; j<def_data_length; j+=def_ws) {
        cache_x[lid] = data[j+lid];
        barrier(CLK_LOCAL_MEM_FENCE); // the tile must be complete before any thread reads it
        #pragma unroll
        for(uint k=0; k<min(def_ws, def_data_length-j); k++) {
            const float x = cache_x[k];
            // do some computation with x, like finding the minimum value ...
        }
        barrier(CLK_LOCAL_MEM_FENCE); // all threads must finish reading before the tile is overwritten
    }
    // ...
    return x_min;
}
Now the difficulty is that test_optimized is called in the kernel in only one of two possible if/else branches. For the local memory optimization in test_optimized to work, all threads of a workgroup must take the same branch: if even one thread executes the else-branch, no other thread may take the if-branch. So I created a workaround: the condition of each thread in the workgroup is atomic_or-ed into a local integer, and that integer, which is the same for all threads, is then checked for branching. This ensures that if one or more threads in the workgroup choose the else-branch, all the others do too.
kernel void test_kernel(const global float* data, global float* result, ...) {
    const uint n = get_global_id(0);
    // ...
    const bool condition = ...; // here I get some condition based on the thread ID n and global data
    local uint condition_any; // make sure all threads within a workgroup are in the if/else part
    condition_any = 0u;
    barrier(CLK_LOCAL_MEM_FENCE); // finish the reset before any thread sets its bit
    atomic_or(&condition_any, condition);
    barrier(CLK_LOCAL_MEM_FENCE); // all votes must be in before the flag is read
    if(condition_any==0u) {
        // if-part is very short
        result[n] = 0;
    } else {
        // else-part calls test_optimized function
        const float x_min = test_optimized(data, ...);
        result[n] = condition ? x_min : 0;
    }
}
The above code works flawlessly and is about 25% faster than with the test_unoptimized function. But atomically jamming a bit into the same local memory location from all threads in the workgroup seems a bit like a hack to me, and it only runs efficiently for small workgroup sizes (def_ws) of 32, 64 or 128, but not 256 or greater.
Is this trick used in other codes and does it have a name?
If not: Is there a better way to do it?

With OpenCL 1.2 or older, I don't think there's a way to do this any faster. (I'm not aware of any relevant vendor extensions, but check your implementation's list for anything promising.)
With OpenCL 2.0+, you can use workgroup functions, in this case specifically work_group_any() for this sort of thing.
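To make the voting semantics concrete, here is a minimal host-side sketch in plain C (not OpenCL kernel code) of what both the atomic_or trick and work_group_any() compute: one flag per workgroup that is nonzero if any work-item's condition is nonzero. The names group_any and run_group, and the representation of a workgroup as plain arrays, are assumptions made for illustration only; on a device the runtime performs this reduction across the work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of the vote: returns 1 if any work-item
   in the group has a nonzero condition, 0 otherwise. */
static int group_any(const int *conditions, size_t group_size) {
    assert(conditions != NULL);
    int any = 0;
    for (size_t i = 0; i < group_size; i++) {
        any |= (conditions[i] != 0);
    }
    return any;
}

/* Hypothetical model of the kernel body: every work-item branches on the
   group-wide flag, so the whole group takes the same path. Work-items whose
   own condition is false still take the long path but write 0. */
static void run_group(const int *conditions, const float *x_min,
                      float *result, size_t group_size) {
    const int any = group_any(conditions, group_size);
    for (size_t i = 0; i < group_size; i++) {
        if (!any) {
            result[i] = 0.0f;                             /* short if-part */
        } else {
            result[i] = conditions[i] ? x_min[i] : 0.0f;  /* else-part */
        }
    }
}
```

In an OpenCL 2.0 kernel, the reset, the atomic and the synchronization around them would be replaced by a single call such as if(work_group_any(condition)) { ... }, which must be reached by all work-items of the workgroup.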


I have some OpenCL code that looks like this:
__kernel void calc(__global double* output) {
    size_t a = get_global_id(0);
    size_t b = get_global_id(1);
    double tot = 0.;
    if(a == b) {
        tot += f();
    }
    output[a * get_global_size(1) + b] = tot;
}
I.e., some work items take more time to execute than others. When I run this code on a GPU, everything works as expected. But when I run it on an Intel CPU, some of the output ends up incorrectly being 0. Could it be that some of the writes to global memory are overwriting others due to caching, etc.? Do I need to place a barrier before or after a write to global memory?

I wrote the following code for my test NVIDIA and AMD GPUs:
kernel void computeLayerOutput_Rolled(
    global Layer* layers,
    global float* weights,
    global float* output,
    constant int* restrict netSpec,
    int layer)
{
    const int n = get_global_size(0);
    const int nodeNumber = get_global_id(0); // There will be an offset depending on the layer we are operating on
    int numberOfWeights;
    float t;
    //getPosition(i, netSpec, &layer, &nodeNumber);
    numberOfWeights = layers[layer].nodes[nodeNumber].numberOfWeights;
    //if (sizeof(Layer) > 60000) // This is the extra code added for NVIDIA
    //    exit(0);
    t = 0;
    for (unsigned int j = 0; j != numberOfWeights; ++j)
        t += threeD_access(weights, layer, nodeNumber, j, MAXSIZE, MAXSIZE) *
             twoD_access(output, layer-1, j, MAXSIZE);
    twoD_access(output, layer, nodeNumber, MAXSIZE) = sigmoid(t);
}
At first I did not include the code that checks the size of Layer. That version works on an AMD Kalindi GPU, but on an NVIDIA Tesla C2075 it crashes and reports error code -36 (CL_INVALID_COMMAND_QUEUE).
Since I had previously rewritten the struct type Layer and reduced its size considerably, I decided to check the size of Layer to determine whether the struct is defined correctly in the kernel code. So I added this code:
if (sizeof(Layer) > 60000)
Then it runs fine on NVIDIA. The strange thing, however, is that when I comment these lines out with //, as in the code above, it still works. (I believe I do not need to run make clean && make after changing kernel code, but I did it anyway.) Nevertheless, when I roll back to the version that does not contain this comment, it fails and error code -36 appears again. This really puzzles me. The two versions of my code should be identical, shouldn't they?

I've looked all around this site and others, and nothing has worked. I'm resorting to posting a question for my specific case.
I have a bunch of matrices, and the goal is to use a kernel to let the GPU do the same operation on all of them. I'm pretty sure I can get the kernel to work, but I can't get cudaMalloc / cudaMemcpy to work.
I have a pointer to a Matrix structure, which has a member called elements that points to some floats. I can do all the non-CUDA mallocs just fine.
Thanks for any/all help.
typedef struct {
int width;
int height;
float* elements;
} Matrix;
int main(void) {
    int rows, cols, numMat = 2; // These are actually determined at run-time
    Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));
    // ... Successfully read from file into "data" ...
    Matrix* d_data;
    cudaMalloc(&d_data, numMat*sizeof(Matrix));
    for (int i=0; i<numMat; i++){
        // The next line doesn't work
        cudaMalloc(&(d_data[i].elements), rows*cols*sizeof(float));
        // Don't know if this works
        cudaMemcpy(d_data[i].elements, data[i].elements, rows*cols*sizeof(float), cudaMemcpyHostToDevice);
    }
    // ... Do other things ...
}
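You have to be aware of where your memory resides. malloc allocates host memory; cudaMalloc allocates memory on the device and returns a pointer to that memory. The host can store this pointer and pass it around, but it can only be dereferenced in device code. That is why cudaMalloc(&(d_data[i].elements), ...) fails: it asks the host to write the new pointer into device memory, which the host cannot access directly.
What you want can be achieved as follows: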
typedef struct {
int width;
int height;
float* elements;
} Matrix;
int main(void) {
    int rows, cols, numMat = 2; // These are actually determined at run-time
    Matrix* data = (Matrix*)malloc(numMat * sizeof(Matrix));
    // ... Successfully read from file into "data" ...
    // h_data: host-side copy of the structs; its elements pointers will hold device addresses
    Matrix* h_data = (Matrix*)malloc(numMat * sizeof(Matrix));
    memcpy(h_data, data, numMat * sizeof(Matrix));
    for (int i=0; i<numMat; i++){
        cudaMalloc(&(h_data[i].elements), rows*cols*sizeof(float));
        cudaMemcpy(h_data[i].elements, data[i].elements, rows*cols*sizeof(float), cudaMemcpyHostToDevice);
    } // matrix data is now on the gpu, now copy the "meta" data to gpu
    Matrix* d_data;
    cudaMalloc(&d_data, numMat*sizeof(Matrix));
    cudaMemcpy(d_data, h_data, numMat*sizeof(Matrix), cudaMemcpyHostToDevice);
    // ... Do other things ...
}
To make things clear:
Matrix* data contains the data on the host.
Matrix* h_data is an array on the host whose elements members point to device memory. These device pointers can be passed to kernels as parameters; the memory they point to is on the GPU.
Matrix* d_data is completely on the GPU and can be used in kernels the way data is used on the host.
In your kernel code you can now access the matrix values, e.g.:
__global__ void doThings(Matrix* matrices)
{
    const int i = blockIdx.x; // assumption for illustration: one thread block per matrix
    matrices[i].elements[0] = 42;
}
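The order of operations in this answer (stage the structs on the host, allocate and fill each payload, patch the pointers, then copy the struct array) can be sketched in plain C without a GPU. This is only a host-side analogy: fake_device_alloc and fake_copy are hypothetical stand-ins for cudaMalloc and cudaMemcpy, upload_matrices is an invented helper name, and the sizes are taken from each matrix's width and height rather than from rows and cols.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int width;
    int height;
    float *elements;
} Matrix;

/* Hypothetical stand-ins so the sketch runs on the host only. */
static void *fake_device_alloc(size_t bytes) { return malloc(bytes); }
static void fake_copy(void *dst, const void *src, size_t bytes) { memcpy(dst, src, bytes); }

/* Deep-copies numMat matrices. Returns the "device" struct array, or NULL on failure. */
static Matrix *upload_matrices(const Matrix *data, int numMat) {
    assert(data != NULL && numMat > 0);
    const size_t meta_bytes = (size_t)numMat * sizeof(Matrix);

    /* Step 1: staging copy of the structs in ordinary host memory. */
    Matrix *staging = malloc(meta_bytes);
    if (staging == NULL) return NULL;
    memcpy(staging, data, meta_bytes);

    /* Step 2: allocate each payload and patch the staged pointer. */
    int done = 0;
    for (; done < numMat; done++) {
        const size_t bytes =
            (size_t)data[done].width * (size_t)data[done].height * sizeof(float);
        staging[done].elements = fake_device_alloc(bytes);
        if (staging[done].elements == NULL) break;
        fake_copy(staging[done].elements, data[done].elements, bytes);
    }

    /* Step 3: copy the patched struct array in one go. */
    Matrix *d_data = (done == numMat) ? fake_device_alloc(meta_bytes) : NULL;
    if (d_data != NULL) {
        fake_copy(d_data, staging, meta_bytes);
    } else {
        /* Roll back the payloads that were already allocated. */
        for (int i = 0; i < done; i++) free(staging[i].elements);
    }
    free(staging); /* the payload pointers now live in d_data */
    return d_data;
}
```

The point of the staging array is that every pointer is written while it still lives in memory the host is allowed to touch; only the finished array crosses to the device.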

I have an algorithm that performs a two-stage parallel reduction on the GPU to find the smallest element in a string. I know that there is a hint on how to make it work faster, but I don't know what it is. Any ideas on how I can tune this kernel to speed my program up? It is not necessary to actually change the algorithm; maybe there are other tricks. All ideas are welcome.
Thank you!
__kernel void reduce(__global float* buffer,
                     __local float* scratch,
                     __const int length,
                     __global float* result) {
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    // Stage 1: each work-item strides over the input
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }
    // Stage 2: tree reduction in local memory
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE); // all partial results must be stored first
    for(int offset = get_local_size(0) / 2;
        offset > 0;
        offset = offset / 2) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE); // finish this level before starting the next
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}
accumulator = (accumulator < element) ? accumulator : element;
Use the fmin function here. It is exactly what you need, and it may result in faster code (a call to a built-in instruction, if available, instead of costly branching).
global_index += get_global_size(0);
What is your typical get_global_size(0)?
Although your access pattern is not bad (it is coalesced: 128-byte chunks for a 32-thread warp), it is better to access memory sequentially whenever possible. For instance, sequential access may aid memory prefetching (note that OpenCL code can be executed on any device, including a CPU).
Consider the following scheme: each thread processes the range
[ get_global_id(0)*delta , (get_global_id(0)+1)*delta )
where delta is the number of elements per thread. This results in fully sequential access within each thread. On a GPU, however, neighbouring threads then read addresses delta elements apart, which gives up coalescing, so measure both variants on your target device.
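As a rough illustration of the range scheme, here is a host-side sketch in plain C that partitions the input into one contiguous chunk per work-item and reduces each chunk to a partial minimum. The names chunk_min and min_chunked, the explicit gid parameter standing in for get_global_id(0), and the choice of delta as the rounded-up quotient of length and the number of work-items are all assumptions for the example, not part of the original kernel.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Partial minimum of one hypothetical work-item `gid` over its contiguous
   range [gid*delta, (gid+1)*delta), clipped to the buffer length. */
static float chunk_min(const float *buffer, size_t length, size_t gid, size_t delta) {
    assert(delta > 0);
    const size_t begin = gid * delta;
    size_t end = begin + delta;
    if (end > length) end = length;
    float acc = INFINITY;
    for (size_t i = begin; i < end; i++) {
        acc = (acc < buffer[i]) ? acc : buffer[i];
    }
    return acc; /* stays INFINITY if the range is empty */
}

/* Emulates num_items work-items one after another, then combines the partials. */
static float min_chunked(const float *buffer, size_t length, size_t num_items) {
    assert(num_items > 0);
    size_t delta = (length + num_items - 1) / num_items; /* round up */
    if (delta == 0) delta = 1;                           /* empty input */
    float result = INFINITY;
    for (size_t gid = 0; gid < num_items; gid++) {
        const float partial = chunk_min(buffer, length, gid, delta);
        result = (result < partial) ? result : partial;
    }
    return result;
}
```

Rounding delta up means the last work-items may get a short or empty range, which is why the end of each range is clipped. In the kernel itself, the ternary would be replaced by fmin as suggested above.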

I have a kernel which I am running on an NVIDIA GTX 680 whose execution time increased when I switched from using global memory to local memory.
My kernel, which is part of a finite element ray tracer, now loads each element into local memory before processing. The data for each element is stored in a struct fastTriangle, which has the following definition:
typedef struct fastTriangle {
float cx, cy, cz, cw;
float nx, ny, nz, nd;
float ux, uy, uz, ud;
float vx, vy, vz, vd;
} fastTriangle;
I pass an array of these objects to the kernel, which is written as follows (I have removed the irrelevant code for brevity):
__kernel void testGPU(int n_samples, const int n_objects, global const fastTriangle *objects, __local int *x_res, __global int *hits) {
// Get gid, lid, and lsize
// Set up random number generator and thread variables
// Local storage for the two triangles being processed
__local fastTriangle triangles[2];
for(int i = 0; i < n_objects; i++) { // Fire ray from each object
event_t evt = async_work_group_copy((local float*)&triangles[0], (global float*)&objects[i],sizeof(fastTriangle)/sizeof(float),0);
//Initialise local memory x_res to 0's
wait_group_events(1, &evt);
Vector wsNormal = { triangles[0].cw*triangles[0].nx, triangles[0].cw*triangles[0].ny, triangles[0].cw*triangles[0].nz};
for(int j = 0; j < n_samples; j+= 4) {
// generate a float4 of random numbers here (rands)
for(int v = 0; v < 4; v++) { // For each ray in ray packet
//load the first object to be intersected
evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[0],sizeof(fastTriangle)/sizeof(float),0);
// Some initialising code and calculate ray here
// Should have ray fully specified at this point;
for(int w = 0; w < n_objects; w++) { // Check for intersection against each ray
wait_group_events(1, &evt);
// Check for intersection against object w
float det = wsDir.x*triangles[1].nx + wsDir.y*triangles[1].ny + wsDir.z*triangles[1].nz;
float dett = triangles[1].nd - (triangles[0].cx*triangles[1].nx + triangles[0].cy*triangles[1].ny + triangles[0].cz*triangles[1].nz);
float detpx = det*triangles[0].cx + dett*wsDir.x;
float detpy = det*triangles[0].cy + dett*wsDir.y;
float detpz = det*triangles[0].cz + dett*wsDir.z;
float detu = detpx*triangles[1].ux + detpy*triangles[1].uy + detpz*triangles[1].uz + det*triangles[1].ud;
float detv = detpx*triangles[1].vx + detpy*triangles[1].vy + detpz*triangles[1].vz + det*triangles[1].vd;
// Interleaving the copy of the next triangle
evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[w+1],sizeof(fastTriangle)/sizeof(float),0);
// Complete intersection calculations
} // end for each object intersected
if(objectNo != -1) atomic_inc(&x_res[objectNo]);
} // end for sub rays
} // end for each ray
// Add all the local x_res to global array hits
} // end for each object
} // end of kernel
When I first wrote this kernel I did not buffer each object in local memory and instead just accessed it from global memory, i.e. instead of triangles[0].cx I would use objects[i].cx.
When setting out to optimise, I switched to using local memory as listed above, but then observed an increase in execution time of around 25%.
Why would performance be worse when using local memory to buffer the objects instead of directly accessing them in global memory?
Whether local memory helps your program run faster really depends on the program. There are two things to consider when using local memory:
1. You have additional work for copying the data from global to local memory and from local back to global.
2. I see that you have "barrier(...)" three times. These barriers are performance killers: all work-items in a work-group have to wait at the barrier for all the others. This hinders parallelism, because the work-items no longer run independently.
Local memory is great when you read the same data many times in your computation. But the gain from the fast reads and writes has to outweigh the cost of the copying and synchronizing.
