Running an OpenCL kernel on multiple GPUs - opencl

I have an OpenCL kernel and I want to run it on all detected OpenCL-capable devices (for example, all available GPUs) on different systems. Is there a straightforward way to do this, such as creating a single command queue for all devices?
You can't create a single command queue for all devices; a given command queue is tied to a single device. However, you can create separate command queues for each OpenCL device and feed them work, which should execute concurrently.

As Dithermaster points out, you first create a separate command queue for each device; for instance, you might have multiple GPUs. You can then place these in an array, e.g., here is a pointer to an array that you can set up:
cl_command_queue* commandQueues;
However, in my experience, getting the various command queues to execute concurrently has not always been a "slam dunk". You can verify this using event timing information (checking for overlap), which you can get through your own profiling or with third-party profiling tools. You should do this step anyway to verify what does or does not work on your setup.
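One way to do the overlap check is to compare the start and end timestamps of one event per queue. The sketch below assumes you have already read those timestamps yourself (for example with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on queues created with CL_QUEUE_PROFILING_ENABLE). Interval and overlapNs are hypothetical names, not part of OpenCL. Timestamps from different devices are not guaranteed to share a time base, so treat a cross-device comparison as approximate.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical holder for one event's profiling timestamps in nanoseconds,
// e.g. as read with CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END.
struct Interval {
    std::uint64_t start;
    std::uint64_t end;
};

// Returns how many nanoseconds the two intervals overlap (0 if they do not).
std::uint64_t overlapNs(const Interval& a, const Interval& b) {
    const std::uint64_t lo = std::max(a.start, b.start);
    const std::uint64_t hi = std::min(a.end, b.end);
    return hi > lo ? hi - lo : 0;
}
```

For example, with made-up intervals {1000, 5000} and {3000, 7000}, overlapNs returns 2000; a result of 0 for every pair of kernels would suggest the queues are being serialized.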
An alternative approach which can work quite nicely is to use OpenMP to execute the command queues concurrently, e.g., you do something like:
#pragma omp parallel for default(shared)
for (int i = 0; i < numDevices; ++i) {
    someOpenCLFunction(commandQueues[i], ....);
}

Suppose you have N devices and 100 elements of work (jobs). What you should do is something like this:
#define SIZE 3
std::vector<cl::CommandQueue> queues(SIZE); //One queue for each device (same context)
std::vector<cl::Kernel> kernels(SIZE); //One kernel for each device (same context)
std::vector<cl::Buffer> buf_in(SIZE), buf_out(SIZE); //One buffer set for each device (same context)
// Initialize the queues, kernels, buffers etc....
//Create the kernel, buffers and queues, then set the kernels[0] args to point to buf_in[0] and buf_out[0], and so on...
// Create the events in a finished state
std::vector<cl::Event> events;
cl::UserEvent ev(context); ev.setStatus(CL_COMPLETE); //Assumes 'context' is the shared cl::Context created above
for(int i=0; i<queues.size(); i++)
    events.push_back(ev);
//Run all the elements (a "first empty, first run" scheduler)
for(int i=0; i<jobs.size(); i++){
    bool found = false;
    int x = -1;
    //Try all the queues until one has finished its previous job
    while(!found){
        for(int j=0; j<queues.size(); j++){
            if(events[j].getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>() == CL_COMPLETE){
                found = true;
                x = j;
                break;
            }
        }
        if(!found) Sleep(50); //Sleep a while if no queue is free; other options are possible (like assigning the job to a random one)
    }
    //Run it
    events[x] = cl::Event(); //Clean it
    queues[x].enqueueWriteBuffer(...); //Copy buf_in
    queues[x].enqueueNDRangeKernel(kernels[x], .... ); //Launch the kernel
    queues[x].enqueueReadBuffer(... , &events[x]); //Read buf_out
}
//Wait for completion
for(int i=0; i<queues.size(); i++)
    events[i].wait();
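To try the "first empty, first run" idea without any OpenCL devices, the same polling pattern can be sketched in plain C++. This is only an analogy: std::future stands in for cl::Event, each slot stands in for one command queue, and the hypothetical runJob function stands in for the write/kernel/read sequence. None of these names come from the answer above, and the function assumes numSlots is greater than zero.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for "write buffer, run kernel, read buffer" on one device.
int runJob(int job) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    return job * 2;
}

// Hands each job to the first idle slot; assumes numSlots > 0.
std::vector<int> scheduleFirstFree(const std::vector<int>& jobs, std::size_t numSlots) {
    std::vector<std::future<int>> pending(numSlots); // default-constructed means idle
    std::vector<std::size_t> jobOfSlot(numSlots, 0); // which job each slot is running
    std::vector<int> results(jobs.size(), 0);

    // Stores the result of a finished slot; get() leaves the future invalid (idle).
    auto collect = [&](std::size_t s) {
        results[jobOfSlot[s]] = pending[s].get();
    };

    for (std::size_t i = 0; i < jobs.size(); ++i) {
        std::size_t x = numSlots; // numSlots means "no free slot found yet"
        while (x == numSlots) {
            for (std::size_t s = 0; s < numSlots; ++s) {
                const bool idle = !pending[s].valid();
                const bool done = !idle &&
                    pending[s].wait_for(std::chrono::seconds(0)) == std::future_status::ready;
                if (done) collect(s);
                if (idle || done) { x = s; break; }
            }
            // Nothing free: back off briefly before polling again.
            if (x == numSlots) std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        jobOfSlot[x] = i;
        pending[x] = std::async(std::launch::async, runJob, jobs[i]);
    }

    // Wait for whatever is still running.
    for (std::size_t s = 0; s < numSlots; ++s)
        if (pending[s].valid()) collect(s);
    return results;
}
```

Polling with a short sleep keeps the sketch close to the event-status loop above; blocking on whichever job finishes first would avoid the busy wait, at the cost of more bookkeeping.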


Using OpenMP with GPU

I would like to ask the advice of the respected community about the use of GPU computing power instead of or together with the CPU.
I have a well-functioning program based on a recursive search over all combinations of some events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
    // read data from file
    // convert and analyze the data
    // fill in the variant variable holding the current best result (here, from a pre-analysis)
    #pragma omp parallel shared(variant)
    #pragma omp master
    {
        // call the recursive algorithm that searches all variants:
        PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
    }
    return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // find the bounds of the first loop
    for (int i = quantity; i < another_quantity; i++) {
        // Tabl_1 is processed and modified to determine the number of iterations of the following loop
        for (int k = 0; k < the_quantity_just_found; k++) {
            if the recursion depth is not 1, we go down further: {
                // add the descent to the next recursion level as a task:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
            }
            else (if we went down to the lowest level): {
                if (condition fulfilled) // condition check - READ variant variable
                    variant = it_is_equal_to_that_,_to_that...;
            }
        }
    }
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without this, the algorithm works for a very long time. At the place where I work, I was advised to think about using a GPU to speed up calculations. I learned that OpenMP can work with video cards (and especially with NVidia), but OpenACC also does it well.
In this regard, my main question is whether it is possible to run a recursive algorithm on a GPU both simply and efficiently. Can this give a noticeable speedup relative to the CPU? If so, would OpenACC do better? Is it possible to give instructions to the video card through "#pragma omp task", or are other directives REQUIRED? And how could calculations on the CPU and GPU be combined?

OpenCL MultiGPU slower than single GPU

I am developing an application which performs some processing on video frame data. To accelerate it I use 2 graphics cards and process the data with OpenCL. My idea is to send one frame to the first card and another one to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems that the computations are not executed in parallel, because the time required by the 2 cards is almost the same as the time required by only one graphics card.
Does anyone have a good example of using multiple devices on independent pieces of data simultaneously?
Here is the resulting code after switching to 2 separate contexts. However, the execution time with 2 graphics cards still remains the same as with 1 graphics card.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
    // Copy the input data to the device
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
    // Set kernel arguments
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++){
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++){
    // Read the modulus back to the host
    float* modulus = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
    // Do something with the modulus;
}
Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: each operation waits for the previous one to finish, so there is no parallelism at all (or very little). At the moment, the blocking write to the first device must complete before the write to the second device is even enqueued, and the same happens with the reads.
You should at least change your code so that the copies are enqueued without blocking, like this:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
    // Set kernel arguments
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
    // Copy the input data to the device (non-blocking)
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++){
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
float* modulus[numDevices];
for (int i = 0; i < numDevices; i++){
    // Read the modulus back to the host (non-blocking)
    modulus[i] = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
for (int i = 0; i < numDevices; i++){
    // Wait for this device's queue to drain before touching its results
    commandQueues[i].finish();
    // Do something with modulus[i];
}
Regarding the comments about using multiple contexts: it depends on whether the GPUs will ever need to communicate. As long as each GPU only uses its own memory, there will be no copy overhead. But if you set/unset kernel args constantly, that will trigger copies to the other GPUs, so be careful with that.
The safer approach when the GPUs do not communicate is to use different contexts.
I suspect your main problem is the memory copy and not the kernel execution; it is highly likely that 1 GPU will fulfil your needs if you hide the memory latency.
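To see why serialized operations gain little from a second card, it can help to put rough numbers on the two extremes. The sketch below is a simplified timing model, not a measurement: it assumes fixed per-device durations for the write, kernel and read, no contention for the bus, and instant enqueueing on the host. StageTimes, serializedTime and overlappedTime are hypothetical names, and real code with some blocking calls will land somewhere between the two results.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-device stage durations in milliseconds.
struct StageTimes {
    double write;
    double kernel;
    double read;
};

// Every stage of every device runs one after another (nothing overlaps).
double serializedTime(const std::vector<StageTimes>& devices) {
    double total = 0.0;
    for (const StageTimes& d : devices) total += d.write + d.kernel + d.read;
    return total;
}

// Each device runs its own write -> kernel -> read chain independently,
// so the total is set by the slowest device.
double overlappedTime(const std::vector<StageTimes>& devices) {
    double slowest = 0.0;
    for (const StageTimes& d : devices)
        slowest = std::max(slowest, d.write + d.kernel + d.read);
    return slowest;
}
```

With two identical devices at 4 ms write, 2 ms kernel and 1 ms read, the model gives 14 ms serialized against 7 ms overlapped. It also shows the copies dominating, which is the case where hiding transfer latency matters more than adding a GPU.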

segmentation fault when using shared memory created by shm_open on Xeon Phi

I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I call MPI_Init(2) before any other MPI calls, and I also obtain ntasks and rank using MPI calls. I have included all the required libraries. Still, I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
    recvbuff=(int **)malloc(sizeof(int *)*ntasks);
    recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
}
else{
    recvbuff=(int **)malloc(sizeof(int *)*1);
    recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
    int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
    if( share_mem == -1)
        return NULL;
    int rank;
    if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
        return NULL;
    int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
    if(shared == (void*)-1)
        printf("error in mem allocation (mmap)\n");
    *(shared+(rank)) = 0;
    return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
    int i=0;
    int k,j;
    sharedRegion[j] = sendbuff[i];
    if( rank == 0)
        recvbuff[k][i] = sharedRegion[j];
}
I then do some computation in file 1 on this recvbuff.
I get the segmentation fault when using the sharedRegion variable.
MPI represents the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processes, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers on the subject, but because the features are so new, there aren't many good tutorials yet. You'll have to dig around a bit to find something useful to you.
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
suggests that buffsize could have different values before and after the call to gInit. If buffsize as passed in the first argument to the program is larger than the value it had when gInit was called, then an out-of-bounds memory access would occur later and lead to a segmentation fault.
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.
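The size mismatch described above can be made concrete with a little index arithmetic. This sketch assumes each rank later writes its ints at offset rank * size into the shared region, which is a guess at the layout since the loops in call are not shown in full. regionInts, lastIndexTouched and accessesFit are hypothetical helpers, not part of the posted code.

```cpp
#include <cstddef>

// Number of ints the shared region holds when it is created
// with the buffer size known at gInit time.
std::size_t regionInts(std::size_t sizeAtInit, std::size_t numProc) {
    return sizeAtInit * numProc;
}

// Highest index touched if every rank later writes `sizeAtUse` ints
// starting at rank * sizeAtUse (assumed layout).
std::size_t lastIndexTouched(std::size_t sizeAtUse, std::size_t numProc) {
    return sizeAtUse * numProc - 1;
}

// True when all accesses stay inside the mapped region.
bool accessesFit(std::size_t sizeAtInit, std::size_t sizeAtUse, std::size_t numProc) {
    if (sizeAtUse == 0 || numProc == 0) return true; // nothing is accessed
    return lastIndexTouched(sizeAtUse, numProc) < regionInts(sizeAtInit, numProc);
}
```

If the region was created for 16 ints per rank and the buffers are later used with 1024 ints per rank, accessesFit returns false. A check like this before the copy loop would turn the crash into a readable error.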

Effect of using page-able memory for asynchronous memory copy?

In CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:
In contrast with cudaMemcpy(), the asynchronous transfer version
requires pinned host memory (see Pinned Memory), and it contains an
additional argument, a stream ID.
It means the cudaMemcpyAsync function should fail if I use simple memory.
But this is not what happened.
Just for testing purpose, I tried the following program:
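__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dst[tid] = src[tid] + 1.0f;
}
int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;
    const int n = 1000;
    size_t bytes = n * sizeof(float);
    cudaStream_t str1, str2;
    hPtr1 = new float[n];
    hPtr2 = new float[n];
    for(int i=0; i<n; i++)
        hPtr1[i] = static_cast<float>(i);
    // Assumed: the allocation, stream, copy, launch and cleanup calls were lost from the post;
    // this is one plausible sequence matching the description (pageable host memory, async copies)
    cudaMalloc(&dPtr1, bytes);
    cudaMalloc(&dPtr2, bytes);
    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);
    dim3 block(16);
    dim3 grid((n + block.x - 1)/block.x);
    cudaMemcpyAsync(dPtr1, hPtr1, bytes, cudaMemcpyHostToDevice, str1);
    kernel_increment<<<grid, block, 0, str1>>>(dPtr1, dPtr2, n);
    cudaMemcpyAsync(hPtr2, dPtr2, bytes, cudaMemcpyDeviceToHost, str1);
    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));
    for(int i=0; i<n; i++)
        printf("%f\n", hPtr2[i]);
    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);
    cudaFree(dPtr1);
    cudaFree(dPtr2);
    delete[] hPtr1;
    delete[] hPtr2;
    return 0;
}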
The program gave correct output. The array incremented successfully.
How did cudaMemcpyAsync execute without page locked memory?
Am I missing something here?
cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.
Optionally, if the call is launched into a non-default stream, and if the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions is not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, i.e. it serialises operations on the GPU, and no simultaneous copy-kernel execution or simultaneous multiple copies can occur. The only difference is that the operation doesn't block the calling host thread.
In your example code, the host source and destination memory are not pinned. So the memory transfers cannot overlap with kernel execution (i.e. they serialise operations on the GPU). The calls are still asynchronous on the host. So what you have is functionally equivalent to issuing the same copy, kernel launch and copy with plain cudaMemcpy calls, with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
This is absolutely expected behaviour.

Asynchronous execution of commands from two command queues in OpenCL

I am trying to write an application that uses both the CPU and the GPU at the same time through OpenCL. Specifically, I have two kernels, one to execute on the CPU and one on the GPU. The CPU kernel changes the content of a buffer, and the GPU kernel does other work once it detects that the buffer has been changed by the CPU.
__kernel void cpuKernel(__global uint * dst1,const uint size)
{
    uint tid = get_global_id(0);
    uint size = get_global_size(0);
    while(tid < size)
    {
        dst1[tid] = 10;
        tid += size;
    }
}
__kernel void gpuKernel(__global uint * dst1, __global uint * dst2, const uint size)
{
    uint tid = get_global_id(0);
    uint size = get_global_size(0);
    while(tid < vectorSize)
    {
        // spin until the CPU kernel has written this element
        while(dst1[vectorOffset + tid] != 10)
            ;
        dst2[vectorOffset + tid] = dst1[vectorOffset+tid];
        tid += size;
    }
}
As shown above, cpuKernel changes each element of the dst1 buffer to 10; once the GPU detects such a change, it assigns the element value (10) to the same position in another buffer, dst2. cpuKernel is queued in command1, which is associated with the CPU device, and gpuKernel is queued in command2, which is associated with the GPU device. Both command queues have the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag set.
Then I make two cases:
case 1:
case 2:
The results show that the time consumed in the two cases is nearly the same. I expected some overlap in case 1, but there is none. Can anyone help me? Thanks!
Or, can anyone explain how to run two kernels on two devices asynchronously in OpenCL?
You are asking too much. As you have probably noticed, buffer objects are relative to a context, while command queues are related to devices.
If a kernel operates on a buffer object, the corresponding data must be on that device. If you do not transfer it explicitly with clEnqueueWriteBuffer(), OpenCL will do that for you.
Hence, if you modify a buffer object with a kernel on one device (for example the CPU), and immediately afterwards use it on another device (for example the GPU), the OpenCL driver will wait for the first kernel to finish, transfer the data, and then run the second kernel.
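The ordering described above can be illustrated with a plain C++ analogy. This is not OpenCL code: std::future plays the role of the event the first kernel would signal, and cpuStage, gpuStage and runInOrder are hypothetical stand-ins for the two kernels and the driver's behaviour. The point of the sketch is that the second stage only ever sees the buffer after the first stage has completed, so polling inside the second kernel buys nothing.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Stand-in for cpuKernel: sets every element to 10.
void cpuStage(std::vector<unsigned>& dst1) {
    for (unsigned& v : dst1) v = 10;
}

// Stand-in for gpuKernel: copies dst1 into dst2.
void gpuStage(const std::vector<unsigned>& dst1, std::vector<unsigned>& dst2) {
    dst2 = dst1;
}

// Runs the first stage asynchronously, then the second stage strictly after it.
std::vector<unsigned> runInOrder(std::size_t n) {
    std::vector<unsigned> dst1(n, 0), dst2(n, 0);
    std::future<void> cpuDone =
        std::async(std::launch::async, cpuStage, std::ref(dst1));
    cpuDone.wait();       // the second stage cannot start before this returns
    gpuStage(dst1, dst2); // sees the finished buffer, never a partial one
    return dst2;
}
```

To get real overlap between two devices, the work would need to be split so that each device operates on data the other is not currently modifying.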
