I have two functions:
void sum1(short * a, short * b, short * res, int size);
void sum2(float * a, float * b, float * res, int size);
and I have a single generic kernel
__kernel void sum(__global const T * a, __global const T * b, __global T * res, int size)
{
    int x = get_global_id(0);
    if (x < size) res[x] = a[x] + b[x];
}
Is it safe to invoke this generic kernel from the functions above, built with the compile options -D T=short and -D T=float respectively? Do I need to handle alignment myself, or does OpenCL automatically align the kernel arguments to 2 and 4 bytes respectively?
In general, when I pass a cl_mem object to a kernel, OpenCL does not know what data type is stored in it, and I don't understand how OpenCL "transforms" the cl_mem object into the appropriate pointer in the kernel argument. I need help with that.

1- Yes. It is safe to build with -D T=short or -D T=float; each build generates a separate, correctly typed kernel.
2- In OpenCL (as in other languages with pointers), the kernel parameter has a type, and the kernel sticks to that type when addressing memory.
In plain C this is rarely a problem, because incompatible pointer types are not converted implicitly and the compiler reports a mismatch.
In OpenCL, however, a buffer is untyped on the host side, much like a void pointer. When you set it as a kernel argument, the cast to the parameter type is implicit. That does not mean it is correct!
For example, if you create a buffer, fill it with floats, and pass it to the short kernel, the result will be wrong, because the kernel misinterprets the bytes. If you instead pass a buffer of shorts to the float kernel with the same element count, the kernel reads and writes past the end of the buffer, which is undefined behavior and may crash (for example with a segmentation fault).


How do I pass a device memory buffer with offset to my kernel

I have allocated a buffer on the device:
cl_mem buff;
I want to pass this buffer plus an offset to my kernel
buff + offset;
I find that this is not allowed. If I instead pass buff into my kernel and then
calculate the offset buffer inside the kernel, then this is fine. But it adds a needless calculation to each kernel run.
So, I get that the device memory space is different than the host, so I can't do simple pointer arithmetic. But, is there a way of taking an address to a device memory buffer,
calculating an offset, and passing this offset buffer into the kernel?
I think this may be possible with clCreateSubBuffer, but the offset needs to be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, and this is not always possible for my kernel.
Using clCreateSubBuffer
If the offset can be calculated statically, you can pass it as a macro when building the program that contains your kernel.
Assuming you are using C++:
std::string macro;
std::stringstream ss;
// e. g. let it be 2^10
std::size_t offset = 1024;
ss << offset;
macro = "-D offset=";
macro += ss.str();
// When building the program
clBuildProgram(..., macro.c_str(), ...);
// Inside your kernel, the macro "offset" is defined
__kernel void my(__global const uchar* data)
{
    __global const uchar* data_with_shift = data + offset;
}
That said, calculations inside the kernel are extremely cheap, so Marco13 gave you good advice.

clSetKernelArg arg_value other than memory object

In OpenCL, can I set a kernel argument as follows?
cl_uint a = 0;
kernel.setArg(0, sizeof(a), &a);
I want the kernel function to both read and write this one value, not only read it.
Setting a kernel argument in this manner can only be used for inputs to the kernel. Any output you want to read (either in a subsequent kernel or from the host program) must be written to a buffer or an image. In your case, that means you need to create a single-element buffer and pass the buffer to the kernel.
One way to think about this is that when you call setArg with the parameter &a, the OpenCL kernel is using the value of a, not the location of a. If the kernel were to write to kernel argument zero, your host program would have no way of recovering the value that was written.
Your code creates an argument of type unsigned int, not pointer to unsigned int.
clSetKernelArg takes a pointer to the argument value, not the value itself.
If you want to pass a pointer argument, you will have to create a buffer with clCreateBuffer (even if it's just one value in there) and call clSetKernelArg with the resulting cl_mem.
The following code creates a buffer for 1 cl_uint in __global memory, and copies the value of my_value to it. After running the kernel, it copies the (possibly modified) value back to my_value.
cl_uint my_value = 0;
const unsigned int count = 1;
// Allocate buffer
cl_mem hDeviceMem = clCreateBuffer(hContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, count * sizeof(cl_uint), &my_value, &nError);
// Set pointer to buffer as argument
clSetKernelArg(hKernel, 0, sizeof(cl_mem), &hDeviceMem);
// Run kernel
// Copy values back
clEnqueueReadBuffer(hCmdQueue, hDeviceMem, CL_TRUE, 0, count * sizeof(cl_uint), &my_value, 0, NULL, NULL);
Your kernel should then look like this:
__kernel void myKernel(__global unsigned int* value)
{
    // read/write to *value here
}
This should work the same as passing a one-element vector as a parameter. You might have to use __global uint* aParam in your kernel definition.

Avoiding data alignment in OpenCL

I need to pass a complex data type to OpenCL as a buffer and I want (if possible) to avoid the buffer alignment.
In OpenCL I need two structures so that I can interpret the data passed in the buffer by casting it to one of them:
typedef struct
{
    char a;
    float2 position;
} s1;
typedef struct
{
    char a;
    float2 position;
    char b;
} s2;
I define the kernel in this way:
__kernel void
__global const void* bufferData,
const int amountElements // in the buffer
// Now I cast to one of the structs depending on an extra value
__global const s1* x = (__global const s1*)bufferData;
And it works well only when I align the data passed in the buffer.
The question is: Is there a way to use __attribute__ ((packed)) or __attribute__ ((aligned(1))) to avoid the alignment in the data passed in the buffer?
If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since you have data types that are 9 and 10 bytes, it may be worth a try padding them both out to 12 bytes depending on how many of them you read within your kernel.
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
I didn't realize you were passing a mixed array; I thought it was uniform in type. If you want to track the type on a per-element basis, you should pass a separate list of the types (or type codes). Using float2 on its own in bufferData would probably be faster as well.
__kernel void
__global const float2* bufferData,
__global const char* bufferTypes,
const int amountElements // in the buffer

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here:
The kernel they have listed is this, which computes the sum of two numbers and stores it in the output variable:
__kernel void vector_add_gpu (__global const float* src_a,
                              __global const float* src_b,
                              __global float* res,
                              const int num)
{
    /* get_global_id(0) returns the ID of the thread in execution.
       As many threads are launched at the same time, executing the same kernel,
       each one will receive a different ID, and consequently perform a different computation.*/
    const int idx = get_global_id(0);
    /* Now each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding computation*/
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes, it is possible. Just remember that OpenCL C is based on C99 with some caveats. You can define other functions either in the same kernel file or in a separate file that you include at the beginning. Auxiliary functions do not need to be declared inline; however, keep in mind that the OpenCL compiler will typically inline them at the call site. Function pointers are also not available, so an auxiliary function cannot be passed around as a pointer.
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
    //logic to detect if the ray intersects a triangle
}
__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
    int gid = get_global_id(0);
    float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel; see "OpenCL user defined inline functions". You cannot pass function pointers into the kernel.

CUDA device pointer manipulation

I've used:
float *devptr;
cudaMalloc(&devptr, sizeofarray);
cudaMemcpy(devptr, hostptr, sizeofarray, cudaMemcpyHostToDevice);
in CUDA C to allocate and populate an array.
Now I'm trying to run a cuda kernel, e.g.:
__global__ void kernelname(float *ptr)
in that array but with an offset value.
In C/C++ it would be something like this:
kernelname<<<dimGrid, dimBlock>>>(devptr+offset);
However, this doesn't seem to work.
Is there a way to do this without sending the offset value to the kernel in a separate argument and use that offset in the kernel code?
Any ideas on how to do this?
Pointer arithmetic does work just fine in CUDA. You can add an offset to a CUDA pointer in host code and it will work correctly (remembering the offset isn't a byte offset, it is a plain word or element offset).
EDIT: A simple working example:
#include <cstdio>
int main(void)
{
    const int na = 5, nb = 4;
    float a[na] = { 1.2, 3.4, 5.6, 7.8, 9.0 };
    float *_a, b[nb];
    size_t sza = size_t(na) * sizeof(float);
    size_t szb = size_t(nb) * sizeof(float);
    cudaMalloc((void **)&_a, sza );
    cudaMemcpy( _a, a, sza, cudaMemcpyHostToDevice);
    // Copy back starting at the second element: _a+1 is an element offset
    cudaMemcpy( b, _a+1, szb, cudaMemcpyDeviceToHost);
    for(int i=0; i<nb; i++)
        printf("%d %f\n", i, b[i]);
    cudaFree(_a);
    return 0;
}
Here, you can see a word/element offset has been applied to the device pointer in the second cudaMemcpy call to start the copy from the second word, not the first.
Pointer arithmetic does work in host-side code; it is used fairly often in the example code provided by NVIDIA.
"Linear memory exists on the device in a 40-bit address space, so separately allocated entities can reference one another via pointers, for example, in a binary tree."
Read more at:
And from the NVIDIA Performance Primitives (NPP) documentation, a perfect example of pointer arithmetic:
"4.5.1 Select-Channel Source-Image Pointer
This is a pointer to the channel-of-interest within the first pixel of the source image. E.g. if pSrc is the
pointer to the first pixel inside the ROI of a three channel image. Using the appropriate select-channel copy
primitive one could copy the second channel of this source image into the first channel of a destination
image given by pDst by offsetting the pointer by one:
nppiCopy_8u_C3CR(pSrc + 1, nSrcStep, pDst, nDstStep, oSizeROI);"
*Note: this works without multiplying by the number of bytes per data element because the compiler is aware of the data type of the pointer, and calculates the address accordingly.
In C and C++, pointer arithmetic can be written as above or as &ptr[offset], which yields the device address of the element rather than its value (reading the value itself through a device pointer does not work in host-side code). With either notation the size of the data type is handled automatically, and the offset is specified as a number of data elements rather than bytes.
