How do I pass an array to an OpenCL kernel?

I have an array which I want to pass to an OpenCL kernel. Part of my code is:
cl_mem arr_cl;
unsigned int arr[4] = { 0 };
arr_cl = clCreateBuffer(ocl.context, CL_MEM_ALLOC_HOST_PTR, 4*sizeof(unsigned int), NULL, &status);
arr = (unsigned int*)clEnqueueMapBuffer(ocl.command_queue, arr_cl, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, 4*sizeof(unsigned int), 0, NULL, NULL, NULL);
status |= clSetKernelArg(ocl.kernel, 0, sizeof(cl_mem), &(arr_cl));
The above code compiles but crashes during run time. Please let me know what I'm doing wrong here.
I am using OpenCL 2.0.

Source: https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/clEnqueueMapBuffer.html
"CL_MAP_READ or CL_MAP_WRITE and CL_MAP_WRITE_INVALIDATE_REGION are mutually exclusive."
As of OpenCL 1.2 you should also only read within a region mapped for reading, and only write within a region mapped for writing.
Also, if you change CL_MEM_ALLOC_HOST_PTR to CL_MEM_USE_HOST_PTR, the host array must be aligned to the value reported by the CL_DEVICE_MEM_BASE_ADDR_ALIGN device query.
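For reference, a minimal sketch of the same sequence with those points applied (reusing the ocl.context, ocl.command_queue and ocl.kernel from the question; error checks abbreviated). Note that arr must be declared as a pointer, not a fixed-size array, to receive the mapped address:
cl_int status = CL_SUCCESS;
cl_mem arr_cl = clCreateBuffer(ocl.context, CL_MEM_ALLOC_HOST_PTR,
                               4 * sizeof(unsigned int), NULL, &status);
// Map for writing only, per the OpenCL >= 1.2 rule above.
unsigned int *arr = (unsigned int *)clEnqueueMapBuffer(
    ocl.command_queue, arr_cl, CL_TRUE, CL_MAP_WRITE,
    0, 4 * sizeof(unsigned int), 0, NULL, NULL, &status);
for (int i = 0; i < 4; ++i)
    arr[i] = 0;                      // initialize through the mapped pointer
// Unmap before any kernel touches the buffer.
status = clEnqueueUnmapMemObject(ocl.command_queue, arr_cl, arr, 0, NULL, NULL);
status |= clSetKernelArg(ocl.kernel, 0, sizeof(cl_mem), &arr_cl);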

Related

OpenCL: Why does clEnqueueMapBuffer crash without returning an error?

I have a problem when I use clEnqueueMapBuffer to get the calculation result from a kernel: it crashes without returning any error. My process is as follows:
...
// Create a buffer in Host, and use CL_MEM_USE_HOST_PTR in device
int out_data;
cl_mem cl_dst = clCreateBuffer(context_CL, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(int), &out_data, &err);
...
// Do something in a kernel (kernel details and other input data omitted; nothing is wrong there. The output is cl_dst (int))
err = clEnqueueNDRangeKernel(....)
...
// Mapping the result back to the host
clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_dst), 0, NULL, NULL, &err);
// And then my graphics card shuts down here at this command...
...
My graphics card is an Intel HD 5500 (supports OpenCL 2.0).
Do I have wrong flags somewhere, or am I missing some important concepts?
I think the size of the mapped region is not correct:
clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_dst), 0, NULL, NULL, &err);
It should be:
clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL, &err);
Per the OpenCL 2.0 spec:
"offset and size are the offset in bytes and the size of the region in
the buffer object that is being mapped."
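For completeness, a minimal sketch of the corrected map/read/unmap round trip (reusing queue_CL and cl_dst from the question; error checks abbreviated):
cl_int err = CL_SUCCESS;
int *mapped = (int *)clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ,
                                        0, sizeof(int), 0, NULL, NULL, &err);
if (err == CL_SUCCESS) {
    int result = *mapped;  // read the kernel's output
    clEnqueueUnmapMemObject(queue_CL, cl_dst, mapped, 0, NULL, NULL);
}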

OpenCL - adding to a single global value

I'm fighting a bug related to adding to a single global value from an OpenCL kernel.
Consider this (oversimplified) example:
__kernel void some_kernel(__global unsigned int *ops) {
    unsigned int somevalue = ...; // a non-zero value is assigned here
    *ops += somevalue;
}
I pass in an argument initialized as zero through clCreateBuffer and clEnqueueWriteBuffer. I assumed that after adding to the value, letting the queue finish and reading it back, I'd get a non-zero value.
Then I figured this might be some weird conflict, so I tried to do an atomic operation:
__kernel void some_kernel(__global unsigned int *ops) {
    unsigned int somevalue = ...; // a non-zero value is assigned here
    atomic_add(ops, somevalue);
}
Alas, no dice: after reading the value back to a host pointer, it's still zero. I've already verified that somevalue takes non-zero values during kernel execution, and I am at a loss.
By request, the code for creating the memory:
unsigned int *cpu_ops = new unsigned int;
*cpu_ops = 0;
cl_mem_flags flags = CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR;
cl_int error;
cl_mem buffer = clCreateBuffer(context, flags, sizeof(unsigned int), (void*)cpu_ops, &error);
// error code check snipped
error = clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, sizeof(unsigned int), (void*)cpu_ops, 0, NULL, NULL);
// error code check snipped
// snip: program setup - it checks out, no errors
cl_kernel some_kernel = clCreateKernel(program, "some_kernel", &error);
// error code check snipped
error = clSetKernelArg(some_kernel, 0, sizeof(cl_mem), &buffer);
// error code check snipped
// global_work_size and local_work_size set elsewhere
error = clEnqueueNDRangeKernel(queue, some_kernel, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
// error code check snipped
clFinish(queue);
error = clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(unsigned int), (void*)cpu_ops, 0, NULL, NULL);
// error code check snipped
// at this point, cpu_ops still has its initial value (whatever that value might have been set to)
I've skipped the error checking code since it does not error out. I'm actually using a bunch of custom helper functions for sending and receiving data, setting up the platform and context, compiling the program and so on, so the above is constructed of the bodies of the appropriate helpers with the parameters' names changed to make sense.
I'm fairly sure that this is a slip-up or lack of understanding on my part, but desperately need input on this.
Never mind. I was confused about my memory handles - just a stupid error. The code is probably fine.
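For reference, a minimal sketch of the atomic-accumulation pattern under discussion (hypothetical kernel name; each work-item adds 1, so after clFinish and a read-back the value should equal the global work size):
__kernel void count_items(__global unsigned int *ops) {
    atomic_add(ops, 1u);  // one atomic increment per work-item
}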

Base Address of Memory Object OpenCL

I want to traverse a tree on the GPU with OpenCL, so I assemble the tree in a contiguous block on the host and change the addresses of all pointers so they are consistent on the device, as follows:
TreeAddressDevice = (size_t)BaseAddressDevice + ((size_t)TreeAddressHost - (size_t)BaseAddressHost);
I want the base address of the memory buffer. On the host I allocate memory for the buffer as follows:
cl_mem tree_d = clCreateBuffer(...);
The problem is that cl_mems are objects that track an internal representation of the data. Technically they're pointers to an object, but they are not pointers to the data. The only way to access a cl_mem from within a kernel is to pass it in as an argument via clSetKernelArg.
Here https://www.khronos.org/message_boards/viewtopic.php?f=37&t=2900 I found the following solution, but it does not work:
__kernel void getPtr(__global void *ptr, __global void *out)
{
    *out = ptr;
}
which can be invoked as follows:
...
cl_mem auxBuf = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(void*), NULL, NULL );
void *gpuPtr;
clSetKernelArg( getterKernel, 0, sizeof(cl_mem), &myBuf );
clSetKernelArg( getterKernel, 1, sizeof(cl_mem), &auxBuf );
clEnqueueTask( commandQueue, getterKernel, 0, NULL, NULL );
clEnqueueReadBuffer( commandQueue, auxBuf, CL_TRUE, 0, sizeof(void*), &gpuPtr, 0, NULL, NULL );
clReleaseMemObject(auxBuf);
...
Now "gpuPtr" should contain the address of the beginning of "myBuf" in GPU memory space.
Is the solution obvious and I just can't see it? How can I get back a pointer to device memory when creating buffers?
It's because in the OpenCL model, host memory and device memory are disjoint. A pointer in device memory will have no meaning on the host.
You can map a device buffer to host memory using clEnqueueMapBuffer. The mapping will synchronize device to host, and unmapping will synchronize back host to device.
Update: as you explain in the comments, you want to send a tree structure to the GPU. One solution would be to store all tree nodes inside an array, replacing pointers to nodes with indices into the array, as sketched below.
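A minimal sketch of that index-based layout (hypothetical Node type; -1 marks a missing child). The same struct can be shared between host and kernel source, and traversal follows indices instead of pointers:
typedef struct {
    float value;  // payload
    int   left;   // index of left child in the node array, or -1
    int   right;  // index of right child in the node array, or -1
} Node;
// Device-side traversal by index, e.g. inside a kernel taking __global Node *nodes:
//     int cur = 0;                                   // root at index 0
//     while (cur != -1)
//         cur = go_left ? nodes[cur].left : nodes[cur].right;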
As Eric pointed out, there are two sets of memory to consider: host memory and device memory. Basically, OpenCL tries to hide the gritty details of this interaction by introducing the buffer object for us to interact with on the host side. Now, as you noted, the problem with this methodology is that it hides away the details of the device when we want to do something trickier than the OpenCL developers intended or allowed for in their scope. The solution here is to remember that OpenCL kernels use C99, and that the language lets us work with pointers without any issue. With this in mind, we can simply store the pointer in an unsigned integer variable to be referenced later.
Your implementation was on the right track, but it needed a little bit more C syntax to finish up the transfer.
OpenCL Kernel:
// Kernel used to obtain pointer from target buffer
__kernel void mem_ptr(__global char * buffer, __global ulong * ptr)
{
ptr[0] = &buffer[0];
}
// Kernel to demonstrate how to use that pointer again after we extract it.
__kernel void use_ptr(__global ulong * ptr)
{
char * print_me = (char *)ptr[0];
/* Code that uses all of our hard work */
/* ... */
}
Host Program:
// Create the buffer that we want the device pointer from (target_buffer)
// and a place to store it (ptr_buffer).
cl_mem target_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                      MEM_SIZE * sizeof(char), NULL, &ret);
cl_mem ptr_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   1 * sizeof(cl_ulong), NULL, &ret);
/* Setup the rest of our OpenCL program */
/* .... */
// Set up our kernel arguments from the host...
ret = clSetKernelArg(kernel_mem_ptr, 0, sizeof(cl_mem), (void *)&target_buffer);
ret = clSetKernelArg(kernel_mem_ptr, 1, sizeof(cl_mem), (void *)&ptr_buffer);
ret = clEnqueueTask(command_queue, kernel_mem_ptr, 0, NULL, NULL);
// Now it's just a matter of storing the pointer where we want to use it later.
ret = clEnqueueCopyBuffer(command_queue, ptr_buffer, dst_buffer, 0,
                          1 * sizeof(cl_ulong), sizeof(cl_ulong), 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, ptr_buffer, CL_TRUE, 0,
                          1 * sizeof(cl_ulong), buffer_ptrs, 0, NULL, NULL);
There you have it. Keep in mind that you don't have to use the char type I used; this works for any type. However, I'd recommend cl_ulong for storing pointers. This shouldn't matter for devices with less than 4 GB of accessible memory, but for devices with a larger address space you have to use cl_ulong. If you absolutely need to save space on the device and its memory is larger than 4 GB, you might be able to create a struct that stores the lower 32 bits of the address in a uint, with the remaining MSBs stored in a smaller type.
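If in doubt, you can query the device's pointer width first. A one-line sketch (real query; the device variable is whatever cl_device_id you already hold):
cl_uint addr_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(addr_bits), &addr_bits, NULL);
// 32 -> a uint can hold a device pointer; 64 -> use cl_ulong.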

Incorrect copying of memory in OpenCL

I am getting some unusual behaviour in my OpenCL program.
In the host part of the program I create an array of doubles and set all elements to zero. That array is copied to the GPU using:
memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(double) * I_numel, I, NULL);
Inside the kernel some elements are set to 1 depending on some condition and then I read it back to the host with:
errNum = clEnqueueReadBuffer(commandQueue, memObjects[4], CL_TRUE, 0,
I_numel * sizeof(double), I, 0, NULL, NULL);
However, some of the elements that were supposed to be zero have changed to very small (6.953267903e-310) or very large (2.0002319483e+161) numbers!?
I've tried changing double to float, but the results are similar. I am using the NVIDIA implementation of OpenCL, version 1.1. Does anyone know what the problem is?
I suspect there's something wrong with your kernel code. What happens if you do just the clEnqueueReadBuffer without running the kernel at all; do you then get all zeros? How about if you drop CL_MEM_COPY_HOST_PTR and clear the buffer with clEnqueueWriteBuffer instead?
I tried to reproduce the issue with this simplified kernel, but the output was just alternating zeros and ones, as expected:
kernel void enqueueReadBuffer(global float* outputValueArray) {
int gid = get_global_id(0);
if (gid % 2 == 0) {
outputValueArray[gid] = 1.0f;
}
}
I ran this on three different OpenCL drivers on Windows 7, including NVIDIA Quadro FX4800 (R307.45), and got the correct result on all of them.
Try replacing the shown code with this, and then post the error numbers:
cl_int err;
memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               sizeof(double) * I_numel, I, &err);
printf("Buffer creation error no = %d\n", err);
And for the copy back:
cl_int err2;
err2 = clEnqueueReadBuffer(commandQueue, memObjects[4], CL_TRUE, 0,
                           I_numel * sizeof(double), I, 0, NULL, NULL);
printf("Copy back error no = %d\n", err2);

OpenCL buffer and kernel execution

Taken from OpenCL in Action.
The following code achieves the transfer shown in the book's figure.
It creates two buffer objects and copies the content of Buffer 1 to Buffer 2 with clEnqueueCopyBuffer.
Then clEnqueueMapBuffer maps the content of Buffer 2 to host memory and memcpy transfers the mapped memory to an array.
My question is: will my code still work if I do not include the following lines?
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_one);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_two);
queue = clCreateCommandQueue(context, device, 0, &err);
err = clEnqueueTask(queue, kernel, 0, NULL, NULL);
The kernel is blank; it does nothing. What is the need of setting the kernel arguments and enqueueing the task?
...
float data_one[100], data_two[100], result_array[100];
cl_mem buffer_one, buffer_two;
void* mapped_memory;
...
buffer_one = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            sizeof(data_one), data_one, &err);
buffer_two = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            sizeof(data_two), data_two, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_one);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_two);
queue = clCreateCommandQueue(context, device, 0, &err);
err = clEnqueueTask(queue, kernel, 0, NULL, NULL);
err = clEnqueueCopyBuffer(queue, buffer_one, buffer_two, 0, 0, sizeof(data_one),
                          0, NULL, NULL);
mapped_memory = clEnqueueMapBuffer(queue, buffer_two, CL_TRUE, CL_MAP_READ, 0,
                                   sizeof(data_two), 0, NULL, NULL, &err);
memcpy(result_array, mapped_memory, sizeof(data_two));
err = clEnqueueUnmapMemObject(queue, buffer_two, mapped_memory, 0, NULL, NULL);
...
I believe the point of calling clEnqueueTask is to ensure that the data is actually resident on the device. It is possible that, when using the CL_MEM_COPY_HOST_PTR flag, the memory is still kept on the host side until it is needed by a kernel; enqueueing the task therefore ensures that the memory is brought to the device. This may happen on some devices but not others.
You could test this theory by instrumenting your code and measuring the time taken to run the task, both when the task has arguments and when it does not, as in the sketch below. If the task takes significantly longer with arguments, then this is likely what is going on.
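A minimal sketch of that measurement, reusing the context, device and kernel from the question (the queue must be created with CL_QUEUE_PROFILING_ENABLE for event timestamps to be available):
cl_event evt;
queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
err = clEnqueueTask(queue, kernel, 0, NULL, &evt);
clWaitForEvents(1, &evt);
cl_ulong t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
printf("clEnqueueTask took %llu ns\n", (unsigned long long)(t_end - t_start));
clReleaseEvent(evt);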
