Using OpenCL for multiple devices (multiple GPUs)

Hello fellow Stack Overflow users,
I have the following problem: I have one very big image that I want to work on. My first idea is to divide the big image into several sub-images and then send these sub-images to different GPUs. I don't use image objects, because I don't work with the RGB values; I only use the brightness value to manipulate the image.
My questions are:
Can I use a single context with multiple command queues (one per device), or should I create a separate context with one command queue for each device?
Can anyone give me an example or ideas for how to dynamically change the input memory data (the sub-image data) when setting the kernel arguments, so that each device receives different data? (I only know how to send the same input data to every device.)
For example, if I have more sub-images than GPUs, how can I distribute the sub-images across the GPUs?
Or is there perhaps a smarter approach altogether?
I'd appreciate any help and ideas.
Thank you very much.

Use 1 context, and many queues. The simple method is one queue per device.
Create 1 program, and a kernel for each device (created from the same program). Then create a separate buffer for each device and set each kernel's arguments to its own buffer. Now you have distinct kernel objects, and you can enqueue them in parallel with different arguments.
To distribute the jobs, simply use the event system: check whether a GPU is idle and enqueue the next job there.
I can provide a more detailed example with code (see the sketch below), but as a general outline that is the way to follow.
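A hedged sketch of that setup using the C++ bindings; the kernel name process_subimage, the kernelSource string, and the sub-image sizes are placeholders, not code from the question:
#include <CL/cl.hpp>
#include <cstring>
#include <vector>

// Distribute one sub-image per GPU and run them in parallel.
void processSubImages(const char* kernelSource,
                      const std::vector<unsigned char*>& subImages,
                      size_t subImageBytes) {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

    cl::Context context(devices);                 // one context for all GPUs
    cl::Program::Sources sources(1, std::make_pair(kernelSource, strlen(kernelSource)));
    cl::Program program(context, sources);
    program.build(devices);                       // one program, built for every device

    std::vector<cl::CommandQueue> queues;
    std::vector<cl::Kernel> kernels;
    std::vector<cl::Buffer> buffers;
    for (size_t i = 0; i < devices.size() && i < subImages.size(); ++i) {
        queues.push_back(cl::CommandQueue(context, devices[i]));
        kernels.push_back(cl::Kernel(program, "process_subimage")); // one kernel object per device
        buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, subImageBytes));
        kernels[i].setArg(0, buffers[i]);         // each kernel gets its own buffer
    }
    // upload a different sub-image to each device and launch them in parallel
    for (size_t i = 0; i < queues.size(); ++i) {
        queues[i].enqueueWriteBuffer(buffers[i], CL_FALSE, 0, subImageBytes, subImages[i]);
        queues[i].enqueueNDRangeKernel(kernels[i], cl::NullRange,
                                       cl::NDRange(subImageBytes), // one work-item per brightness byte
                                       cl::NullRange);
    }
    for (size_t i = 0; i < queues.size(); ++i) queues[i].finish(); // wait for all GPUs
}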

The AMD APP SDK has a few samples on multi-GPU handling. You should look at these two samples:
SimpleMultiDevice: shows how to create multiple command queues on a single context, with some performance results.
BinomialOptionMultiGPU: look at the loadBalancing method. It divides the buffer among the available GPUs based on their compute units and maximum clock frequency.
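A hedged sketch of that kind of load balancing; the weighting heuristic (compute units x clock frequency) is an assumption about the idea, not the SDK's exact code:
#include <CL/cl.hpp>
#include <vector>

// Split `totalWork` items across devices in proportion to
// max compute units * max clock frequency -- a rough capability estimate.
std::vector<size_t> splitWork(const std::vector<cl::Device>& devices, size_t totalWork) {
    std::vector<cl_ulong> weight(devices.size());
    cl_ulong sum = 0;
    for (size_t i = 0; i < devices.size(); ++i) {
        cl_uint cu  = devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
        cl_uint mhz = devices[i].getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>();
        weight[i] = (cl_ulong)cu * mhz;
        sum += weight[i];
    }
    std::vector<size_t> share(devices.size());
    size_t assigned = 0;
    for (size_t i = 0; i < devices.size(); ++i) {
        share[i] = (size_t)(((cl_ulong)totalWork * weight[i]) / sum);
        assigned += share[i];
    }
    share.back() += totalWork - assigned; // give any rounding remainder to the last device
    return share;
}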

Related

Can opencl chain multiple passes without returning to CPU?

I want to auto-scale some data. So I want to pass through all the data and find its maximum extents, then go through the data, do calculations, and send the results to OpenGL for rendering. Is this type of multipass processing possible in OpenCL? Or does the CPU have to direct the "find extents" calculation, get the results, and then direct the other calculation with that?
It sounds like you would need two OpenCL kernels, one for calculating the min and max and the other to actually scale the data. Using OpenCL command queues and events you can queue up these two kernels in order and store the results from the first in global memory, reading those results in the second kernel. The semantics of OpenCL command queues and events (assuming you don't have out-of-order execution enabled) will ensure that one completes before the other without any interaction from your host application (see clEnqueueNDRangeKernel).
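A minimal sketch of that pattern, assuming context, queue, n, and the two kernels (hypothetical names kernel_minmax and kernel_scale) already exist; an in-order queue alone would also guarantee the ordering, the event just makes it explicit:
// the min/max pass writes its result into `extents` in global memory;
// the scale pass reads `extents` from the same buffer -- no host round trip
cl::Buffer data(context, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
cl::Buffer extents(context, CL_MEM_READ_WRITE, sizeof(cl_float) * 2);

kernel_minmax.setArg(0, data);
kernel_minmax.setArg(1, extents);
kernel_scale.setArg(0, data);
kernel_scale.setArg(1, extents);

cl::Event done;
queue.enqueueNDRangeKernel(kernel_minmax, cl::NullRange, cl::NDRange(n), cl::NullRange, NULL, &done);
std::vector<cl::Event> wait(1, done);
queue.enqueueNDRangeKernel(kernel_scale, cl::NullRange, cl::NDRange(n), cl::NullRange, &wait, NULL);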

OpenCL Copy-Once Share a lot

I am implementing a solution using OpenCL, and I want to do the following: say you have a large array of data that you want to copy to the GPU once and have many kernels process batches of it, storing the results in their own output buffers.
The actual question is: which way is faster? Enqueue each kernel with only the portion of the array it needs, or transfer the whole array beforehand and let each kernel (in the same context) process its required batch, since they would share the same address space and could each map the array concurrently? The array in question is read-only, but it is not constant, as it changes every time I execute the kernel(s), so I could cache it in a global memory buffer.
Also, if the second way is actually faster, could you point me in the right direction on how it could be implemented, as I haven't found anything concrete yet (although I am still searching :)).
Cheers.
I normally use the second method. Sharing the memory is easy: just pass the same buffer to each kernel. I do this in my real-time ray tracer; I render with one kernel and post-process (image process) with another.
Using the C++ bindings, it looks something like this (note CL_MEM_READ_WRITE rather than CL_MEM_WRITE_ONLY, since the render kernel writes the buffer and the post-process kernel reads it):
// one buffer, created once on the shared context
cl_input_mem = cl::Buffer(context, CL_MEM_READ_WRITE, sizeof(cl_uchar4)*npixels, NULL, &err);
// both kernels receive the same buffer object
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want each kernel to operate on a different segment of the array/memory, you can pass an offset value as a kernel argument and add it to the global memory index inside each kernel, as in the sketch below.
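A minimal illustration; the kernel and the per-device segmentStart values are hypothetical:
__kernel void scale_segment(__global float* data, const int offset) {
    // each device processes its own contiguous segment of the shared array
    int i = get_global_id(0) + offset;
    data[i] *= 2.0f;
}
// host side: kernels[d].setArg(1, (cl_int)segmentStart[d]);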
I would use the first method if the data (actually the sum of all buffers, including output) does not fit in a single device's memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when rendering on multiple devices. For example, one GTX 580 renders the upper half of the screen and the other GTX 580 renders the lower half (actually I do this dynamically, so one device may render 30% while the other renders 70%, but that's beside the point). I have each device render only its fraction of the output, and then I assemble the output on the CPU. With PCIe 3.0, the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate, even for 1920x1080 images.

Translating C code to OpenCL

I am trying to translate a small program written in C into OpenCL. I am supposed to transfer some input data to the GPU and then perform ALL calculations on the device using successive kernel calls.
However, I am facing difficulties with the parts of the code that are not suitable for parallelization, since I must avoid transferring data back and forth between CPU and GPU because of the amount of data involved.
Is there a way to execute some kernels without parallel processing, so that I can replace these parts of the code with them? Is this achieved by setting the global work size to 1?
Yes, you can execute code serially on OpenCL devices. To do this, write your kernel code the same as you would in C and then execute it with the clEnqueueTask() function, which runs the kernel as a single work-item (see the sketch below).
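A minimal sketch, assuming program, queue, data_buffer, and err already exist and the kernel name serial_step is hypothetical; clEnqueueTask() is equivalent to clEnqueueNDRangeKernel() with a global and local work size of 1:
// serial_step is an ordinary C-like kernel with no get_global_id() indexing
cl_kernel serial = clCreateKernel(program, "serial_step", &err);
clSetKernelArg(serial, 0, sizeof(cl_mem), &data_buffer);
clEnqueueTask(queue, serial, 0, NULL, NULL); // runs as a single work-item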
You could manage two devices:
the GPU for highly parallelized code
the CPU for sequential code
This is a bit complex, as you must manage one command queue per device in order to schedule each kernel on the appropriate device.
If the devices are part of the same platform (typically the case with AMD) you can use the same context; otherwise you will have to create one more context for the CPU.
Moreover, if you want more fine-grained CPU task parallelism, you can use device fission (see the sketch below), if your CPU supports it.
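A hedged sketch of device fission with the OpenCL 1.2 API; cpu_device is assumed to exist and the partition counts are illustrative:
// Partition a CPU device into sub-devices of 2 compute units each (OpenCL 1.2).
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_EQUALLY, 2, 0
};
cl_device_id sub_devices[8];
cl_uint num_sub = 0;
clCreateSubDevices(cpu_device, props, 8, sub_devices, &num_sub);
// each sub-device can then get its own command queue for independent tasks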

OpenCL - Multiple GPU Buffer Synchronization

I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and it applied only to a single buffer across all GPUs. I would like to know what applies specifically to Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I run watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while and then go down while another GPU's memory usage goes up. Can anyone point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have separate buffers." Use the same cl_mem object for both devices by creating that cl_mem object from the same context, as in the sketch below.
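A minimal sketch of that pattern, assuming context, devices, nParticles, and the two kernels (hypothetical names kernel_forces and kernel_integrate) already exist: one cl_mem from the shared context, two queues, with an event enforcing the force-then-integrate ordering across devices:
// one buffer visible to both devices through the shared context
cl::Buffer positions(context, CL_MEM_READ_WRITE, sizeof(cl_float4) * nParticles);
cl::CommandQueue q0(context, devices[0]);
cl::CommandQueue q1(context, devices[1]);

kernel_forces.setArg(0, positions);
kernel_integrate.setArg(0, positions);

cl::Event forcesDone;
q0.enqueueNDRangeKernel(kernel_forces, cl::NullRange, cl::NDRange(nParticles), cl::NullRange, NULL, &forcesDone);
std::vector<cl::Event> wait(1, forcesDone);
// the integrator on the other device waits for the force pass to finish
q1.enqueueNDRangeKernel(kernel_integrate, cl::NullRange, cl::NDRange(nParticles), cl::NullRange, &wait, NULL);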
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try, in order to get a better (or at least different) buffer-sharing behavior, would be to make the particle data a mapped buffer (clEnqueueMapBuffer).
If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; at other times it exists in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan

SoundMixer.computeSpectrum with microphone

Flex has the SoundMixer.computeSpectrum function that lets you compute an FFT from the currently playing sound. What I'd like to do is compute an FFT without playing the sound. Since Flash 10.1 lets us access the microphone bytes directly, it seems like we should be able to compute the FFT directly off of what the user is speaking.
Unfortunately this doesn't work, as far as I know. As stated in the Adobe help pages:
The SoundMixer.computeSpectrum() method lets an application read the raw sound data for the waveform that is currently being played. If more than one SoundChannel object is currently playing, the SoundMixer.computeSpectrum() method shows the combined sound data of every SoundChannel object mixed together.
This implies two drawbacks:
It only works on the output (SoundChannel).
It only works on the mix of all outputs.
If you don't need the output channel at all, you might turn its volume down to zero or near zero!? I don't know whether that would work.
For myself, I currently see no other option than implementing the FFT on my own to compute a spectrum from the microphone data.
I'm not sure if there's a way to pass that data along, but if all else fails, you can always compute the FFT yourself, as sketched below.
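For reference, a minimal recursive radix-2 Cooley-Tukey FFT, shown here in C++ for brevity; the same logic ports directly to ActionScript, and you would feed it the raw microphone samples and take the magnitudes of the first n/2 outputs:
#include <cmath>
#include <complex>
#include <vector>

// In-place radix-2 Cooley-Tukey FFT; the input length must be a power of two.
void fft(std::vector<std::complex<double> >& a) {
    size_t n = a.size();
    if (n <= 1) return;
    std::vector<std::complex<double> > even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];     // even-indexed samples
        odd[i]  = a[2 * i + 1]; // odd-indexed samples
    }
    fft(even);
    fft(odd);
    for (size_t k = 0; k < n / 2; ++k) {
        std::complex<double> t = std::polar(1.0, -2.0 * M_PI * (double)k / (double)n) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
// magnitude spectrum of bin k: std::abs(a[k]), for k < n/2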
