OpenCL - iteratively updating GPU-resident buffer? - opencl

I need to have an OpenCL kernel iteratively update a buffer and return the results. To clarify:
Send initial buffer contents to the kernel
Kernel/worker updates each element in the buffer
Host code reads the results - HOPEFULLY asynchronously, though I'm not sure how to do this without blocking the kernel.
Kernel runs again, again updating each element, but the new value depends on the previous value.
Repeat for some fixed number of iterations.
So far, I've been able to fake this by providing an input and output buffer, copying the output back to the input when the kernel finishes executing, and restarting the kernel. This seems like a huge waste of time and abuse of limited memory bandwidth as the buffer is quite large (~1GB).
Any suggestions/examples? I'm pretty new at OpenCL so this may have a very simple answer.
If it matters, I'm using Cloo/OpenCL.NET on an NVidia GTX460 and two GTX295s.

I recommend you create a cl_mem on the device and copy the data there once, then iterate with the kernel.
Use the same memory to store the results; that will be easier for you, as your kernel will have just one parameter.
Then you just need to copy the data to the cl_mem once and run the kernel. After that, extract the data from the device and run the kernel again.
If you don't mind one iteration reading some data from the next iteration, you can boost performance a lot by using events and an out-of-order queue (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE). That way the kernel can be running while you copy data back.

You can write your initial data to the device and change its content with your kernel. As soon as the kernel is finished with its iteration you can read the same memory buffer back and restart the kernel for its next iteration. The data can stay on the OpenCL device. There is no need to send it again to the device.
As far as I know, there is no fine-grained way to synchronize work between host and device while a kernel is running. You can only start the kernel, wait for it to return, read back the result, and start again. An asynchronous read during execution would be dangerous, because you could get inconsistent results.
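The single-buffer idea from both answers can be sketched on the CPU. This is a minimal illustration, not real OpenCL host code: the `std::vector` stands in for the device-resident `cl_mem`, `kernel_inplace` stands in for an enqueued kernel, and the point is that nothing is copied between iterations; the host only reads the result back once at the end.

```cpp
#include <cassert>
#include <vector>

// CPU sketch: the data stays in one allocation (standing in for the cl_mem)
// and the host loop just re-launches the "kernel" on it. No per-iteration
// host round trip is needed because each new value depends only on the
// previous value of the same element.
void kernel_inplace(std::vector<float>& buf) {
    for (float& x : buf)
        x = x * 2.0f + 1.0f;   // hypothetical per-element update
}

std::vector<float> iterate(std::vector<float> buf, int iterations) {
    for (int i = 0; i < iterations; ++i)
        kernel_inplace(buf);   // in real code: clEnqueueNDRangeKernel on the same cl_mem
    return buf;                // in real code: a single clEnqueueReadBuffer at the end
}
```

In real host code the loop body would just be a `clEnqueueNDRangeKernel` call on an in-order queue; the ~1 GB buffer never leaves the device until the final read.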

Related

OpenCL data dependency between kernels

I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input, and sums it by packets of 64. It writes the result to a buffer which is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back on the host and summed by the CPU.
I'm new to OpenCL, do I need to have a barrier between each kernel so that they are run sequentially, or is OpenCL smart enough to detect that the nth kernel pass is writing to an output buffer used as the input buffer of the n+1th kernel?
Or is there a smarter approach?
If you are using a single, in-order command queue for all of your kernel launches (i.e. you do not use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property), then each kernel invocation will run to completion before the next begins - you do not need any explicit barriers to enforce this behaviour.
If you are using an out-of-order command queue or multiple queues, you can enforce data dependencies via the use of OpenCL events. Each call to clEnqueueNDRangeKernel can optionally return an event object, which can be passed to subsequent commands as dependencies.

cuda unified memory: memory transfer behaviour

I am learning CUDA, but don't have access to a CUDA device yet and am curious about some unified memory behaviour. As far as I understood, the unified memory functionality transfers data from host to device on a need-to-know basis. So if the CPU accesses some data 100 times that is on the GPU, it transfers the data only during the first attempt and clears that memory space on the GPU. (Is my interpretation correct so far?)
1. Assuming this, if the data structures meant to fit on the GPU are too large for device memory, will UM evict some recently accessed data structures to make space for the next ones needed to complete the computation, or does this still have to be done manually?
2. Additionally, I would be grateful if you could clarify something else related to the memory transfer behaviour. It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer? For example, if I had two arrays of the same UM pointers (the data behind the pointers is currently on the GPU and the following code is executed from the CPU) and were to slice the first array, maybe to delete an element, would iterating over the pointers while placing them into a new array count as accessing the data and trigger a transfer? Surely not.
As far as I understood, the unified memory functionality transfers data from host to device on a need-to-know basis. So if the CPU accesses some data 100 times that is on the GPU, it transfers the data only during the first attempt and clears that memory space on the GPU. (Is my interpretation correct so far?)
The first part is correct: when the CPU tries to access a page that resides in device memory, it is transferred in main memory transparently. What happens to the page in device memory is probably an implementation detail, but I imagine it may not be cleared. After all, its contents only need to be refreshed if the CPU writes to the page and if it is accessed by the device again. Better ask someone from NVIDIA, I suppose.
Assuming this, if the data structures meant to fit on the GPU are too large for device memory, will UM evict some recently accessed data structures to make space for the next ones needed to complete the computation, or does this still have to be done manually?
Before CUDA 8, no, you could not allocate more (oversubscribe) than what could fit on the device. Since CUDA 8, it is possible: pages are faulted in and out of device memory (probably using an LRU policy, but I am not sure whether that is specified anywhere), which allows one to process datasets that would otherwise not fit on the device and would require manual streaming.
It seems obvious that data would be transferred back and forth upon access of the actual data, but what about accessing the pointer?
It works exactly the same. It makes no difference whether you're dereferencing the pointer that was returned by cudaMalloc (or even malloc), or some pointer within that data. The driver handles it identically.

What is the correct way in OpenCL to concatenate results of work-groups?

Suppose that in an OpenCL kernel, each work-group outputs unknown amount of data. Is there any efficient way to align that output on the global memory so that there are no holes in it?
One method might be to use atomic_add() to acquire an index into an array, once you know how large a chunk your workgroup requires. In OpenCL 1.0, this type of operation required an extension (cl_khr_global_int32_base_atomics). These operations may be very slow, possibly going as far as locking the whole global memory bus (whose latency we tend to avoid like the plague), so you probably don't want to use it on a per item basis. A downside to this scheme is that you don't know the order your results will be stored, as workgroups can (and will) execute out of order.
Another approach is to simply not store the data contiguously, but to allocate enough space for each workgroup. Once they finish, you can run a second batch of work to rearrange the data as required (most likely into a second buffer, as memmove-like tricks don't parallelize easily). If you're passing the data back to the CPU, feel free to just run all clEnqueueReadBuffer calls in one batch before waiting for them to complete. A command queue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE might help a bit; you can use the cl_event arguments to specify dependencies when they occur.
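The atomic-reservation scheme from the first suggestion can be sketched with host-side atomics. This is a CPU analogue, not OpenCL kernel code: `std::atomic::fetch_add` plays the role of `atomic_add` on a global counter, and each thread plays the role of one workgroup reserving a contiguous chunk before writing its results.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Each "workgroup" knows how many results it produced, reserves a
// contiguous region of the output with one fetch_add, then writes into
// that region. No holes appear, but the order of chunks is undefined,
// since workgroups can (and will) finish out of order.
void workgroup(int id, int count, std::atomic<int>& cursor, std::vector<int>& out) {
    int base = cursor.fetch_add(count);   // reserve [base, base + count)
    for (int i = 0; i < count; ++i)
        out[base + i] = id;               // hypothetical per-group results
}
```

Note that in a real kernel you would typically do one `atomic_add` per workgroup (e.g. from a single work-item after a local reduction), not per work-item, precisely because global atomics are slow.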

OpenCL Copy-Once Share a lot

I am implementing a solution using OpenCL and I want to do the following thing, say for example you have a large array of data that you want to copy in the GPU once and have many kernels process batches of it and store the results in their specific output buffers.
The actual question is: which way is faster? Enqueue each kernel with just the portion of the array it needs, or transfer the whole array beforehand and let each kernel (in the same context) process its batch, since they share the same address space and could each map the array concurrently? The array is read-only, but it is not constant, as it changes every time I execute the kernel(s), so I could cache it in a global memory buffer.
Also if the second way is actually faster could you point me with direction on how this could be implemented, as I haven't found anything concrete yet (although I am still searching :)).
Cheers.
I normally use the second method. Sharing the memory is easy: just pass the same buffer to each kernel. I do this in my real-time ray tracer. I render with one kernel and post-process (image process) with another.
Using the C++ bindings it looks something like this
cl_input_mem = cl::Buffer(context, CL_MEM_READ_WRITE, sizeof(cl_uchar4)*npixels, NULL, &err); // written by the render kernel, read by the post-process kernel
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want one kernel to operate on a different segment of the array/memory you can pass an offset value to the kernel arguments and add that to e.g. the global memory pointer for each kernel.
I would use the first method if the array (actually the sum of each buffer, including output) does not fit in memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when I render on multiple devices. For example, I have one GTX 580 render the upper half of the screen and the other GTX 580 render the lower half (actually I do this dynamically, so one device may render 30% while the other renders 70%, but that's beside the point). I have each device render only its fraction of the output and then I assemble the output on the CPU. With PCIe 3.0 the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate, even for 1920x1080 images.

Memory considerations when enqueing a long sequence of kernels and reads

I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to copied back to the host as well, so the idea would be something like (pseudo code):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
    // create output buffer
    outputdata = clCreateBuffer(...);
    // create kernel and set arguments
    kernel = clCreateKernel(...);
    kernel.setArg(0, inputdata);
    kernel.setArg(1, outputdata);
    enqueueNDRangeKernel(kernel);
    // read intermediate result
    enqueueReadBuffer(outputdata, hostBuffer[i]);
    // output of this step becomes input of the next
    inputdata = outputdata;
}
There are several ways to schedule these operations:
Simplest is to always wait for the event of previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
OR Make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
Your second case is the way to go.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects
in the long chain of actions and release them after
everything is complete?
Yes. But if all the data arrays are the same size, I would use just two buffers and overwrite one after the other each iteration.
Then you only need two memory zones, and allocation and release occur only at the beginning/end.
Don't worry about the data getting bad values: if you set proper events, the processing will wait for the I/O to finish, i.e.:
data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
-> I/O operation -> I/O operation
To do that, just set a dependency that forces kernel3 to start only once the first I/O has finished. You can chain all the events that way.
NOTE: Using 2 queues, one for I/O and one for processing, gives you I/O that overlaps with computation, which can be up to twice as fast.
Importantly, how does OpenCL handle the case when the sum
of all memory objects exceeds that of the total memory available on the
device?
The allocation fails with CL_OUT_OF_RESOURCES or a similar error, either at clCreateBuffer time or when the buffer is first used.
At any point a kernel only needs the input and output kernels
(which should fit in memory), but what if 4 or 5 of these buffers
exceed the total, how does OpenCL allocate/deallocate these memory
objects behind the scenes? How does this affect the reads?
It will not do this automatically, unless you created the memory with a host pointer (CL_MEM_USE_HOST_PTR). But I'm unsure whether the OpenCL driver will handle even that properly; I would not allocate more than the maximum if I were you.
I was under the impression (sorry, I was going to cite the specification but can't find it today, so I've downgraded the strength of my assertion) that when you enqueue a kernel with cl_mem references, the runtime takes a retain on those objects and releases them when the kernel is done.
This would allow you to release these objects after enqueuing a kernel without actually having to wait for the kernel to finish running. This is how the asynchronous "clEnqueue" operations are reconciled with the synchronous operations (i.e., memory release), and it prevents the runtime and kernel from using released memory objects.