OpenCL overlap communication and computation

There is an example in the NVIDIA OpenCL SDK, oclCopyComputeOverlap, that uses 2 queues to alternately transfer buffers and execute kernels.
In this example mapped (pinned) memory is used.
**//pinned memory**
cmPinnedSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBuffBytes, NULL, &ciErrNum);
**//host pointer for pinned memory**
fSourceA = (cl_float*)clEnqueueMapBuffer(cqCommandQueue[0], cmPinnedSrcA, CL_TRUE, CL_MAP_WRITE, 0, szBuffBytes, 0, NULL, NULL, &ciErrNum);
...
**//normal device buffer**
cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, szBuffBytes, NULL, &ciErrNum);
**//write half the data from host pointer to device buffer**
ciErrNum = clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceA[0], 0, NULL, NULL);
I have 2 questions:
1) Is there any need to use pinned memory for the overlap to occur? Couldn't fSourceA be just a simple host pointer,
fSourceA = (cl_float *)malloc(szBuffBytes);
...
//write random data in fSourceA
2) cmPinnedSrcA is not used in the kernel; cmDevSrcA is used instead. Doesn't the space occupied by the buffers on the device still grow (the space required for cmPinnedSrcA added to the space required for cmDevSrcA)?
Thank you

If I understood your question properly:
1) Yes, you can use any kind of memory (pinned, host pointer, etc.) and the overlap will still occur, as long as you use two queues and the hardware/drivers support it.
But bear in mind that the queues are never synchronized with each other, and in this case events are needed to prevent the copy queue from transferring data that the running kernel is still using.
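For illustration, a minimal sketch of that event chaining with the C API (the queue, buffer, and size names below are placeholders, not taken from the SDK sample):
cl_event write_done;
//enqueue the transfer on the IO queue and get an event for it
clEnqueueWriteBuffer(ioQueue, devBuf, CL_FALSE, 0, halfBytes, hostPtr, 0, NULL, &write_done);
//the kernel on the compute queue waits for the transfer before it starts
clEnqueueNDRangeKernel(runQueue, kernel, 1, NULL, &globalSize, &localSize, 1, &write_done, NULL);
//flush both queues so the driver can begin overlapping the work
clFlush(ioQueue);
clFlush(runQueue);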
2) I think you use twice the memory if you use pinned memory: one allocation for the pinned buffer and another one for a temporary copy. But I am not 100% sure; maybe it is only a pointer.

Related

Equivalent of cudaSetDevice in OpenCL?

I have a function that I wrote for one GPU; it runs for 10 seconds with one set of args, and I have a very long list of args to go through. I would like to use both my AMD GPUs, so I have some wrapper code that launches 2 threads and runs my function on thread 0 with an argument gpu_idx 0 and on thread 1 with an argument gpu_idx 1.
I have a CUDA version for another machine, and I just run checkCudaErrors(cudaSetDevice((unsigned int)device_id)); to get my desired behavior.
With OpenCL I have tried the following:
void createDevice(int device_idx)
{
    cl_device_id *devices;
    ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    HANDLE_CLERROR_G(ret);
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &ret_num_devices);
    HANDLE_CLERROR_G(ret);
    devices = (cl_device_id*)malloc(ret_num_devices * sizeof(cl_device_id));
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, ret_num_devices, devices, &ret_num_devices);
    HANDLE_CLERROR_G(ret);
    if (device_idx >= ret_num_devices)
    {
        fprintf(stderr, "Found %i devices but asked for device at index %i\n", ret_num_devices, device_idx);
        exit(1);
    }
    device_id = devices[device_idx];
    free(devices); //the device handles stay valid after freeing the list
    // usleep(((unsigned int)(500000*(1-device_idx)))); // without this line multithreaded 2 gpu execution does not work.
    context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
    HANDLE_CLERROR_G(ret);
}
context is a static variable in my *.c file that I then use later when I create the kernel.
This code works when I run with only device_idx 0 or only device_idx 1, and even if I manually run the executable "simultaneously" in two terminal windows with device_idx 0 and device_idx 1.
BUT, there is something about the threads being "too" concurrent that prevents this code from working. In fact, depending on the amount of sleep (commented above), I get different behavior: sometimes both threads do work on GPU 0, sometimes both threads do work on GPU 1, sometimes the threads are balanced on both GPUs. If I sleep for too little time, I get CL_INVALID_CONTEXT, and if I don't sleep at all I get CL_INVALID_KERNEL_NAME.
Like I said, I don't get any errors when running on GPU 0 or GPU 1 alone, only when spawning multiple threads that call this code (as an *.so with an extern C function from Go) simultaneously with device_idx 0 in thread 0 and device_idx 1 in thread 1.
How can I solve my problem? I am attached to the idea of having one executable that works on one GPU, for which I specify which GPU to use, and that specification should be respected.
What is the proper way to pick the device when both devices need to be used, one completely separate from the other?
Whoops! Instead of saving device_id into a static variable, I started returning it from the above code and using it as a local variable; everything now works as expected and is thread-safe.
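For reference, a minimal sketch of that fix, with no shared static state (the function and variable names here are illustrative, not the original code):
cl_device_id pickDevice(int device_idx)
{
    cl_platform_id platform_id;
    cl_uint num_platforms, num_devices;
    cl_device_id *devices, dev;
    clGetPlatformIDs(1, &platform_id, &num_platforms);
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    devices = (cl_device_id*)malloc(num_devices * sizeof(cl_device_id));
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, num_devices, devices, &num_devices);
    dev = devices[device_idx]; //returned to the caller instead of stored in a static
    free(devices);
    return dev;
}
//each thread then keeps its own device_id and context as locals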

How to use OpenCL to write directly to linux framebuffer with zero-copy?

I am using OpenCL to do some image processing and want to use it to write an RGBA image directly to the framebuffer. The workflow is shown below:
1) Map the framebuffer to user space.
2) Create an OpenCL buffer using clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
3) Use clEnqueueMapBuffer to map the results to the framebuffer.
However, it doesn't work; nothing shows on the screen. I then found that the virtual address mapped from the framebuffer is not the same as the virtual address mapped by OpenCL. Has anybody done a zero-copy move of data from the GPU to the framebuffer? Any help on what approach I should use?
Some key codes:
if ((fd_fb = open("/dev/fb0", O_RDWR, 0)) < 0) {
    printf("Unable to open /dev/fb0\n");
    return -1;
}
fb0 = (unsigned char *)mmap(0, fb0_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd_fb, 0);
...
cmDevSrc4 = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(cl_uchar) * imagesize * 4, NULL, &status);
...
fb0 = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmDevSrc4, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_uchar) * imagesize * 4, 0, NULL, NULL, &ciErr);
For zero-copy with an existing buffer you need to use the CL_MEM_USE_HOST_PTR flag in the clCreateBuffer() call. In addition, you need to pass the pointer to the existing buffer as the second-to-last argument.
I don't know how the Linux framebuffer works internally, but it is possible that even with zero-copy from device to host it leads to an extra copy of the data to the GPU for rendering. So you might want to render the OpenCL buffer directly with OpenGL; check out the cl_khr_gl_sharing extension for OpenCL.
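For illustration, a minimal sketch of that call using the variables from the question (beware that implementations typically require the host pointer and size to satisfy the device's alignment rules, so true zero-copy is not guaranteed):
cmDevSrc4 = clCreateBuffer(cxGPUContext,
                           CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                           sizeof(cl_uchar) * imagesize * 4,
                           fb0, //the mmap'ed framebuffer pointer, passed as the second-to-last argument
                           &status);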
I don't know OpenCL yet; I was just doing a search to find out about writing to the framebuffer from it and hit your post. Opening it and mmapping it like in your code looks good.
I've done that with the CPU: https://sourceforge.net/projects/fbgrad/
That doesn't always work; it depends on the computer. I'm on an old Dell Latitude D530, and not only can't I write to the framebuffer, but there's no GPU, so there is no advantage to using OpenCL over the CPU. If you have a /dev/fb0 and you can get something on the screen with
cat /dev/random > /dev/fb0
then you might have a chance from OpenCL. With a Mali, at least, there's a way to pass a pointer from the CPU to the GPU. You may need to add some offset (true on a Raspberry Pi, I think). And it could be double-buffered by Xorg; there are lots of reasons why it might not work.

allocate memory using huge page and numa_tonode_memory giving "Bus Error"

I am trying to allocate a 2 GB buffer using huge TLB pages (1 GB) and bind the memory region to a specific NUMA node.
To allocate the buffer using huge TLB page, I am using the following code:
shmid = shmget(IPC_PRIVATE, buf_size,
               SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
buf = (uint64_t *) shmat(shmid, 0, 0);
Then, I called:
numa_tonode_memory (buf, buf_size, 3);
to move the buffer to a specific node.
When I run the program, as soon as I access a buffer offset larger than 1 GB, the program stops with "Bus error (core dumped)".
Removing numa_tonode_memory avoids the error, but it also defeats the purpose of allocating the memory on a specific node.
I am wondering if there is any workaround for this problem.
Thank you,

Parallel copy and opencl kernel execution

I would like to implement an image filtering algorithm using OpenCL, but the image size is very large (4096 x 4096). I understand that the copy time to the OpenCL device may take too long.
Do you think it makes sense to address this problem by using a parallel copy in combination with OpenCL kernel execution?
E.g., below is my approach:
1) Split the full image into 2 parts.
2) Copy the first half to the device.
3) Execute the image filtering kernel on the device, then copy the 2nd half of the image to the device.
4) Block the kernel execution until the first half completes, then call the kernel again to process the 2nd part.
5) Block until the 2nd part finishes.
Best regards,
OpenCL's thread of execution is completely independent of your application, so there is no need to "wait" after each call. Just flush all the commands to OpenCL and it should schedule them properly.
The only requirement is to have 2 queues, in order to be able to run commands in parallel. So you will need an IO queue and an execution queue. A single queue (even in out-of-order mode) can never run 2 operations in parallel.
Here you have one example approach with events; you can call clFlush() on the queues just after doing the enqueues in order to speed them up.
//Create 2 queues (at creation only!)
mQueueIO = cl::CommandQueue(context, device[0], 0);
mQueueRun = cl::CommandQueue(context, device[0], 0);
//Everytime you run your image filter
//Queue the 2 writes
cl::Event wev1; //Event to know when the write finishes
mQueueIO.enqueueWriteBuffer(ImageBufferCL, CL_FALSE, 0, size/2, imageCPU, NULL, &wev1);
cl::Event wev2; //Event to know when the write finishes
mQueueIO.enqueueWriteBuffer(ImageBufferCL, CL_FALSE, size/2, size/2, imageCPU+size/2, NULL, &wev2);
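//Optional: flush the IO queue right after the enqueues so the transfers start immediately, as mentioned above
mQueueIO.flush();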
//Queue the 2 runs (with the proper dependency)
std::vector<cl::Event> wait;
wait.push_back(wev1);
cl::Event ev1; //Event to track the finish of the run command
mQueueRun.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(size/2), cl::NDRange(localsize), &wait, &ev1);
wait[0] = wev2;
cl::Event ev2; //Event to track the finish of the run command
mQueueRun.enqueueNDRangeKernel(kernel, cl::NDRange(size/2), cl::NDRange(size/2), cl::NDRange(localsize), &wait, &ev2);
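//Optional: flush the run queue as well so the kernels are submitted without delay
mQueueRun.flush();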
//Read back the data when it has finished
std::vector<cl::Event> rev(2);
wait[0] = ev1;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_FALSE, 0, size/2, imageCPU, &wait, &rev[0]);
wait[0] = ev2;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_FALSE, size/2, size/2, imageCPU + size/2, &wait, &rev[1]);
rev[0].wait();
rev[1].wait();
Notice how I create 2 events for the writes - these are the wait events of the execution - and 2 events for the execution, which are the wait events for the reads.
In the last part I create another 2 events for the reads, but they are not really needed; you can use blocking reads instead, as shown below.
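For example, the final reads as blocking calls (CL_TRUE makes each call wait until the copy completes, so the rev events are unnecessary); this is a sketch built on the variables above:
wait[0] = ev1;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_TRUE, 0, size/2, imageCPU, &wait);
wait[0] = ev2;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_TRUE, size/2, size/2, imageCPU + size/2, &wait);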
Try using out-of-order queues - most implementations' hardware should support them. You'll want to use the global offset parameter in your kernels along with get_global_id() where applicable. At some point you will get diminishing returns with a division strategy like this, but there should exist a number of splits that gives a good payoff in latency reduction - I would guess [2, 100] is a good interval to brute-force profile.
Be aware that only one kernel can write to a memory buffer at a time, and make sure the input buffer is const (read-only). Be aware that you must also merge the results from the N buffer splits in one kernel to a single output - this means you will effectively write all pixels twice to global memory. OpenCL 2.0 may be able to save us all these divided writes with its image types, if you are able to use it.
cl::CommandQueue queue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Event last_event;
std::vector<cl::Event> events;
std::vector<cl::Buffer> output_buffers; //initialize with however many splits you have; ensure there is enough for what is written, and update the kernel to only write to its relative region
//you might approach finer granularity with even more splits
//just make sure the kernel is using the global offset -
//in which case adjust this code into a loop
set_args(kernel, image_input, output_buffers[0]);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(0, 0), cl::NDRange(cols * local_size[0], (rows/2) * local_size[1]), cl::NDRange(local_size[0], local_size[1]), NULL, &last_event);
events.push_back(last_event);
set_args(kernel, image_input, output_buffers[1]);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(0, (rows/2) * local_size[1]), cl::NDRange(cols * local_size[0], (rows - rows/2) * local_size[1]), cl::NDRange(local_size[0], local_size[1]), NULL, &last_event);
events.push_back(last_event);
//the merge kernel must wait for both halves
set_args(merge_buffers_kernel, output_buffers...);
queue.enqueueNDRangeKernel(merge_buffers_kernel, cl::NDRange(), cl::NDRange(cols * local_size[0], rows * local_size[1]), cl::NDRange(local_size[0], local_size[1]), &events, &last_event);
last_event.wait();

OpenCL/OpenGL Interop on OSX -- has anyone ever shared a Renderbuffer or Texture?

Problem: attempting to share a Renderbuffer (or a Texture) bound to a Framebuffer fails when clSetKernelArg() is called. Thorough error checking reports no problems until that call.
My program generates frames for a video projector that runs at 60fps (16.7ms frames).
My kernel runs in (typically) 24ms, but it's taking 50ms between frames. I assume some of the extra cost is because I'm using the GPU to calculate the pixels, then enqueuing a buffer read to pull the data off the GPU, then using glDrawPixels to put it back onto the GPU for display. A perfect situation to try OpenGL/OpenCL interoperation, right? - to avoid the two extra copy operations.
There are many examples, and I have succeeded in sharing a VBO with OpenCL and can write to it, but that doesn't help me: I don't want to write vertex data, just a 2-D image that's been calculated.
There are examples of two different ways to do this, and they both involve Framebuffer objects.
You can attach a Renderbuffer to a Framebuffer, or you can attach a Texture to a Framebuffer.
Then you should be able to write to that buffer in OpenCL and display it with OpenGL, with no extra copies.
I have found a few examples of this in code, and I think I'm doing everything exactly the way the examples say to do it, but maybe it is broken on OSX? ...because it doesn't work. The FBO is "complete", with no errors along the way, until I try to do the clSetKernelArg. That call returns error -38, CL_INVALID_MEM_OBJECT.
*note: I would rather use a Renderbuffer than a Texture, since all I'm doing is making a 2-D RGB image that I want to display. But I tried a Texture out of desperation. Still no help.
I do these steps, in this order, with some other stuff in between:
kCGLContext = CGLGetCurrentContext();
kCGLShareGroup = CGLGetShareGroup( kCGLContext );
glGenFramebuffers( 1, &fboid );
glBindFramebuffer( GL_FRAMEBUFFER, fboid );
glGenRenderbuffers( 1, &rboid );
glBindRenderbuffer( GL_RENDERBUFFER, rboid );
glRenderbufferStorage( GL_RENDERBUFFER, GL_RGBA, rb_wid, rb_hgt );
glboid = rboid;
glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rboid );
then:
cl_context_properties ourprops[] = { CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, (cl_context_properties)kCGLShareGroup, 0 };
contextZ = clCreateContext( ourprops, 1, &dev_idZ[0], clLogMessagesToStdoutAPPLE, NULL, &err );
clbo = clCreateFromGLRenderbuffer( contextZ, CL_MEM_WRITE_ONLY, glboid, &err );
then clCreateCommandQueue, clCreateProgramWithBinary, clBuildProgram, clCreateKernel, all with no errors
then later:
glFinish();
clEnqueueAcquireGLObjects( queueZ, 1, &clbo, 0,0,0 );
err = clSetKernelArg( kernelZ, 1, sizeof(cl_mem), &clbo );
... which fails with error -38, CL_INVALID_MEM_OBJECT.
clbo is a static cl_mem, just like the buffer object that's used when interop is not on. The difference is that it was created using clCreateFromGLRenderbuffer instead of clCreateBuffer, and it's in a context created in association with the GL sharegroup.
(I've tried adding a second Renderbuffer and attaching it to a Depth Attachment Point, in case that was needed. No help.)
(I've tried the same thing with a Texture bound to the FBO, and I get the same error in the same place.)
... does anybody have any ideas at all?
Ok; got it!
The problem was that the kernel specified a uint * as its output argument instead of an image2d_t. I didn't think this would matter, at least for the clSetKernelArg call; they are both cl_mem on the host side (image2d_t is #defined as cl_mem). However, once you call clCreateFromGLTexture2D or clCreateFromGLRenderbuffer, that object acquires properties OpenCL knows about. When the kernel was changed to specify image2d_t, more useful error messages appeared, and it now works.
Bonus fact: you can't write to a UNORM_INT8 image with write_imageui; you have to use write_imagef.
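For illustration, a minimal kernel signature showing the fix (the kernel name and first argument are made up):
__kernel void render(float t, write_only image2d_t out) //image2d_t, not uint *
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    //a UNORM_INT8 image takes normalized floats, hence write_imagef
    write_imagef(out, pos, (float4)(0.5f, 0.5f, 0.5f, 1.0f));
}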
