Taken from OpenCL in Action
The following code achieves the target shown in the figure.
It creates two buffer objects and copies the content of Buffer 1 to Buffer 2 with clEnqueueCopyBuffer.
Then clEnqueueMapBuffer maps the content of Buffer 2 to host memory and memcpy transfers the mapped memory to an array.
My question is: will my code still work if I omit the following lines?
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_one);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_two);
queue = clCreateCommandQueue(context, device, 0, &err);
err = clEnqueueTask(queue, kernel, 0, NULL, NULL);
The kernel is blank; it does nothing. Why is it necessary to set the kernel arguments and enqueue the task at all?
...
float data_one[100], data_two[100], result_array[100];
cl_mem buffer_one, buffer_two;
void* mapped_memory;
...
buffer_one = clCreateBuffer(context,
    CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
    sizeof(data_one), data_one, &err);
buffer_two = clCreateBuffer(context,
    CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
    sizeof(data_two), data_two, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_one);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_two);
queue = clCreateCommandQueue(context, device, 0, &err);
err = clEnqueueTask(queue, kernel, 0, NULL, NULL);
err = clEnqueueCopyBuffer(queue, buffer_one, buffer_two,
    0, 0, sizeof(data_one), 0, NULL, NULL);
mapped_memory = clEnqueueMapBuffer(queue, buffer_two, CL_TRUE,
    CL_MAP_READ, 0, sizeof(data_two), 0, NULL, NULL, &err);
memcpy(result_array, mapped_memory, sizeof(data_two));
err = clEnqueueUnmapMemObject(queue, buffer_two, mapped_memory,
    0, NULL, NULL);
}
...
I believe the point of calling clEnqueueTask is to ensure that the data is actually resident on the device. With the CL_MEM_COPY_HOST_PTR flag, the implementation may keep the memory on the host side until a kernel needs it; enqueueing the task therefore forces the memory onto the device. This behavior may occur on some devices but not others.
You could test this theory by instrumenting your code and measuring the time taken to run the task, both when the task has arguments and when it does not. If the task takes significantly longer with arguments, that is likely what is going on.
I have an array which I want to pass to an OpenCL kernel. Part of my code is
cl_mem arr_cl;
unsigned int *arr = NULL;
arr_cl = clCreateBuffer(ocl.context, CL_MEM_ALLOC_HOST_PTR, 4*sizeof(unsigned int), NULL, &status);
arr = (unsigned int*)clEnqueueMapBuffer(ocl.command_queue, arr_cl, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, 4*sizeof(unsigned int), 0, NULL, NULL, NULL);
status |= clSetKernelArg(ocl.kernel, 0, sizeof(cl_mem), &(arr_cl));
The above code compiles but crashes during run time. Please let me know what I'm doing wrong here.
I am using OpenCL 2.0.
source: https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/clEnqueueMapBuffer.html
CL_MAP_READ or CL_MAP_WRITE and CL_MAP_WRITE_INVALIDATE_REGION are
mutually exclusive.
In other words, you should only read, or only write, within a single mapping if your OpenCL version is >= 1.2.
Also, if you change CL_MEM_ALLOC_HOST_PTR to CL_MEM_USE_HOST_PTR, the host array must be aligned to the value returned by the CL_DEVICE_MEM_BASE_ADDR_ALIGN device query.
I'm developing a program that implements a recursive ray tracing in OpenCL.
To run the kernel I have two options of devices: the Intel one that is integrated with the system, and the Nvidia GeForce graphics card.
When I run the project with the first device there's no problem; it runs correctly and shows the result of the algorithm just fine.
But when I try to run it with the Nvidia device, it crashes in the callback function that performs the synchronous buffer map.
The part of the code where it crashes is the following:
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);
// 7. Look at the results via synchronous buffer map.
cl_float4 *ptr = (cl_float4 *) clEnqueueMapBuffer( queue, buffer, CL_TRUE, CL_MAP_READ, 0, kWidth * kHeight * sizeof(cl_float4), 0, NULL, NULL, NULL );
cl_float *viewTransformPtr = (cl_float *) clEnqueueMapBuffer( queue, viewTransform, CL_TRUE, CL_MAP_WRITE, 0, 16 * sizeof(cl_float), 0, NULL, NULL, NULL );
cl_float *worldTransformsPtr = (cl_float *) clEnqueueMapBuffer( queue, worldTransforms, CL_TRUE, CL_MAP_WRITE, 0, 16 * sizeof(cl_float), 0, NULL, NULL, NULL );
memcpy(viewTransformPtr, viewMatrix, sizeof(float)*16);
memcpy(worldTransformsPtr, sphereTransforms, sizeof(float)*16);
clEnqueueUnmapMemObject(queue, viewTransform, viewTransformPtr, 0, 0, 0);
clEnqueueUnmapMemObject(queue, worldTransforms, worldTransformsPtr, 0, 0, 0);
unsigned char* pixels = new unsigned char[kWidth*kHeight*4];
for(int i=0; i < kWidth * kHeight; i++){
pixels[i*4] = ptr[i].s[0]*255;
pixels[i*4+1] = ptr[i].s[1]*255;
pixels[i*4+2] = ptr[i].s[2]*255;
pixels[i*4+3] = 1;
}
glBindTexture(GL_TEXTURE_2D, 1);
glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
glTexImage2D(GL_TEXTURE_2D, 0, 4, kWidth, kHeight, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
delete [] pixels;
The two last calls to clEnqueueMapBuffer return the error -5, which corresponds to CL_OUT_OF_RESOURCES, but I believe the buffer sizes are correct.
According to the CL spec, making blocking CL calls from a callback is undefined. Your code is likely correct, but you can't run it from a callback. On the Intel platform with integrated memory, the maps are no-ops, so they do not fail.
CL spec, clSetEventCallback:
The behavior of calling expensive system routines, OpenCL API calls to
create contexts or command-queues, or blocking OpenCL operations from
the following list below, in a callback is undefined.
clFinish
clWaitForEvents
blocking calls to clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueWriteBuffer, and clEnqueueWriteBufferRect
blocking calls to clEnqueueReadImage and clEnqueueWriteImage
blocking calls to clEnqueueMapBuffer and clEnqueueMapImage
blocking calls to clBuildProgram
If an application needs to wait for completion of a routine from the
above list in a callback, please use the non-blocking form of the
function, and assign a completion callback to it to do the remainder
of your work. Note that when a callback (or other code) enqueues
commands to a command-queue, the commands are not required to begin
execution until the queue is flushed. In standard usage, blocking
enqueue calls serve this role by implicitly flushing the queue. Since
blocking calls are not permitted in callbacks, those callbacks that
enqueue commands on a command queue should either call clFlush on the
queue before returning or arrange for clFlush to be called later on
another thread.
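Applied to the code above, the non-blocking pattern could look roughly like this (an untested sketch: `on_map_done` and the user-data plumbing are illustrative names I am introducing; the essential points are that the map call now passes CL_FALSE, and that the callback ends with clFlush instead of blocking):

    /* Completion callback attached to the non-blocking map. */
    void CL_CALLBACK on_map_done(cl_event ev, cl_int status, void *user_data) {
        cl_float4 *ptr = (cl_float4 *)user_data;
        /* ... read the mapped results here, then enqueue the unmap ... */
    }

    /* Inside the original callback: enqueue a NON-blocking map instead. */
    cl_event map_ev;
    cl_float4 *ptr = (cl_float4 *)clEnqueueMapBuffer(queue, buffer,
            CL_FALSE /* non-blocking */, CL_MAP_READ,
            0, kWidth * kHeight * sizeof(cl_float4), 0, NULL, &map_ev, NULL);
    clSetEventCallback(map_ev, CL_COMPLETE, on_map_done, ptr);
    clFlush(queue);  /* required: the callback must not block, so flush and return */

The mapped pointer is returned immediately, but its contents are only valid once the map event completes, which is exactly what the completion callback waits for.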
I have a problem when I use clEnqueueMapBuffer to get the calculation result back from a kernel: it crashes without returning any error. My process is below:
...
// Create a buffer in Host, and use CL_MEM_USE_HOST_PTR in device
int out_data;
cl_mem cl_dst = clCreateBuffer(context_CL, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(int), &out_data, &err);
...
// Do something in a kernel (kernel details and other input data omitted, because there is nothing wrong there; the output is cl_dst (int))
err = clEnqueueNDRangeKernel(....)
...
//Mapping the result back to the host
clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_dst), 0, NULL, NULL, &err);
// And then my graphics card shuts down here, at this command...
...
My graphics card is an Intel HD 5500 (supports OpenCL 2.0).
Do I have the wrong flags somewhere, or am I missing some important concept?
I think the size of the mapped region is not correct:
clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_dst), 0, NULL, NULL, &err);
It should be:
clEnqueueMapBuffer(queue_CL, cl_dst, CL_TRUE, CL_MAP_READ, 0, sizeof(int), 0, NULL, NULL, &err);
Per OpenCL 2.0 spec:
"offset and size are the offset in bytes and the size of the region in
the buffer object that is being mapped."
I am writing a program that needs to transfer data repeatedly between host and device. I will try to minimize the transfers as best I can, but is there any faster way to perform them? The array copied to the device changes on each iteration, so the device needs to be updated with the new values. Any suggestions/pointers/help will be appreciated.
for (i = 0; i <= SEVERALCALLS; i++) {
wrtBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(double) * num, NULL, &ret);
if (ret != 0) {
printf("clCreateBuffer wrtBuffer error: %d. couldn't load\n", ret);
exit(1);
}
// update cti array
ret = clEnqueueWriteBuffer(command_queue, wrtBuffer, CL_TRUE, 0, sizeof(double) * num, cti, 0, NULL, NULL);
if (ret != 0) {
printf("clEnqueueWriteBuffer wrtBuffer error: %d. couldn't load\n", ret);
exit(1);
}
// NDRange Kernel call
ret = clEnqueueReadBuffer(command_queue, readBuffer, CL_TRUE, 0, sizeof(double) * num, newcti, 0, NULL, NULL);
if (ret != 0) {
printf("clEnqueueReadBuffer readBuffer error: %d. couldn't load\n", ret);
exit(1);
}
}
Three ways to optimize this:
On an integrated GPU (like Intel, or an AMD APU), use "zero copy" buffers so you don't pay for any transfers.
On NVIDIA, use pinned host memory as the host-side source for clEnqueueWriteBuffer or the destination for clEnqueueReadBuffer. This is faster than ordinary malloc'd memory and, unlike pageable memory, permits genuinely non-blocking transfers.
Overlap transfer and compute. Use three command queues (one for upload, one for compute, one for download) and use events to enforce the dependencies. See NVIDIA's example, oclCopyComputeOverlap (although it is suboptimal; it can go slightly faster than they say it can).
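For the second point, pinned host memory is typically obtained by creating a CL_MEM_ALLOC_HOST_PTR buffer and mapping it once. A rough, untested sketch (the `pinned` and `staging` names are illustrative), reusing the mapped pointer as the source for every write:

    /* One-time setup: a pinned staging buffer, mapped for the whole run. */
    cl_mem pinned = clCreateBuffer(context,
            CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
            sizeof(double) * num, NULL, &ret);
    double *staging = (double *)clEnqueueMapBuffer(command_queue, pinned,
            CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
            0, sizeof(double) * num, 0, NULL, NULL, &ret);

    for (i = 0; i <= SEVERALCALLS; i++) {
        memcpy(staging, cti, sizeof(double) * num);   /* fill pinned memory */
        ret = clEnqueueWriteBuffer(command_queue, wrtBuffer, CL_FALSE, 0,
                sizeof(double) * num, staging, 0, NULL, NULL);
        /* ... NDRange kernel, then read newcti back the same way ... */
    }

Note also that wrtBuffer itself only needs to be created once, outside the loop, rather than on every iteration as in the code above.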
In OpenCL, I have 2 work groups with 100 work items each. So I do something like this:
....
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&hDeviceMemInput);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&hDeviceMemOutput);
clSetKernelArg(kernel, 2, sizeof(cl_float) * 100, NULL);
clSetKernelArg(kernel, 3, sizeof(cl_int) * 1, &mCount);
size_t global_work_size = 200, local_work_size = 100;
clEnqueueNDRangeKernel(CmdQueue, Kernel, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
....
OpenCL code :
__kernel void square(
__global float *input,
__global float *output,
__local float *temp,
const unsigned int count)
{
int gtid = get_global_id(0);
int ltid = get_local_id(0);
if (gtid < count)
{
temp[ltid] = input[gtid];
output[gtid] = temp[ltid] * temp[ltid];
}
}
As I understand it, each work group has its own float[100] local temp variable, so in my case there are two float[100] arrays on the device, and with n work groups there would be n of them. Is that right? Is __local float *temp only usable on the device, or can I access it from the host with something like:
clEnqueueReadBuffer(CmdQueue, ??, CL_TRUE, 0, 100* sizeof(cl_float),
host_temp, 0, 0, 0);
Is local memory much faster than global memory? Do you have any tips for using local memory?
Local memory is a very fast temporary memory, so no, you cannot access it from the host or read it back, because it is continuously overwritten. The memory is not permanently reserved per group on the device, so it can happen (and it will) that your 2 work groups use the same local memory at different times. If you had 100 groups and 2 compute units, imagine how many times the overwrite would occur.
If you want to read a result held in local memory, you first have to copy it to global memory, then read it from there.
The intention of local memory is to share something between the work-items of a group: fast-access storage for temporary intermediate results. After the kernel finishes, it is gone. This is useful for many things; one simple example is filtering an image.
EDIT:
You can think of local memory as being like a register, a hardware resource. Just as you can't use a register as RAM, you can't use local memory as global memory.
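To make the "copy it to global first" advice concrete, a minimal kernel sketch (untested; temp_out is an illustrative extra __global argument you would add to the kernel above, set with clSetKernelArg, and read back with clEnqueueReadBuffer):

    __kernel void square(__global float *input,
                         __global float *output,
                         __global float *temp_out,  /* added: dump of local temp */
                         __local  float *temp,
                         const unsigned int count)
    {
        int gtid = get_global_id(0);
        int ltid = get_local_id(0);
        if (gtid < count)
        {
            temp[ltid] = input[gtid];
            output[gtid] = temp[ltid] * temp[ltid];
            temp_out[gtid] = temp[ltid];  /* copy the local contents out to global */
        }
    }

The host then reads temp_out like any other global buffer; the local array itself never leaves the device.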