OpenCL MultiGPU slower than single GPU

I am developing an application which performs some processing on video frame data. To accelerate it, I use two graphics cards and process the data with OpenCL. My idea is to send one frame to the first card and another to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems to me that the computations are not executed in parallel, because the time required by the two cards is almost the same as the time required by only one graphics card.
Does anyone have a good example of using multiple devices on independent data pieces simultaneously?
Thanks in advance.
EDIT:
Here is the resulting code after switching to two separate contexts. However, the execution time with two graphics cards still remains the same as with one graphics card.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++) {
    // Copy the input data to the device
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
    // Set kernel arguments
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++) {
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++) {
    // Read the modulus back to the host
    float* modulus = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
    // Do something with the modulus;
}

Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: you issue an operation and wait for it to finish, so there is no parallelization at all (or very little). This is what you are doing at the moment:
Wr:-Copy1--Copy2--------------------
G1:---------------RUN1--------------
G2:---------------RUN2--------------
Re:-------------------Read1--Read2--
You should change your code to do it like this at least:
Wr:-Copy1-Copy2-----------
G1:------RUN1-------------
G2:------------RUN2-------
Re:----------Read1-Read2--
With this code:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++) {
    // Set kernel arguments (do this once at the init stage; it is slow to do it in a loop)
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
    // Copy the input data to the device (non-blocking)
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++) {
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
std::vector<float*> modulus(numDevices);
for (int i = 0; i < numDevices; i++) {
    // Read the modulus back to the host (non-blocking)
    modulus[i] = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
// Wait for all queues to finish before using the results
for (int i = 0; i < numDevices; i++)
    commandQueues[i].finish();
// Do something with the modulus;
Regarding the comments about using multiple contexts: it depends on whether the two GPUs ever need to communicate. As long as each GPU only uses its own memory objects, there will be no copy overhead. But if you set/unset kernel arguments constantly, that can trigger copies to the other GPU, so be careful with that.
The safer approach when the GPUs do not need to communicate is to use separate contexts.
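If you go the separate-context route, a minimal sketch of that setup could look like the following (assuming a devices vector and numDevices as above; buffers, programs and kernels then have to be created once per context):
// One context and one queue per GPU, so no memory objects are ever shared
// between devices and no implicit cross-device copies can occur.
std::vector<cl::Context> contexts;
std::vector<cl::CommandQueue> commandQueues;
for (int i = 0; i < numDevices; i++) {
    contexts.emplace_back(std::vector<cl::Device>{ devices[i] });
    commandQueues.emplace_back(contexts[i], devices[i]);
}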
I suspect your main problem is the memory copy and not the kernel execution; it is highly likely that one GPU will fulfil your needs if you hide the memory latency:
Wr:-Copy1-Copy2-Copy3----------
G1:------RUN1--RUN2--RUN3------
Re:----------Read1-Read2-Read3-
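For example, a double-buffered pipeline on a single device could look roughly like this (a sketch, not drop-in code: frames and numFrames are hypothetical names, while the buffers, kernels and NDRanges are the ones from the code above):
// Two in-order queues on the same device: the copy of frame f can overlap the
// kernel / readback of frame f-1, while each queue's own ordering keeps the
// reuse of the slot buffers (inputDataBuffer[s], modulusBuffer[s]) safe.
cl::CommandQueue q[2] = { cl::CommandQueue(context, device),
                          cl::CommandQueue(context, device) };
std::vector<float*> modulus(numFrames);
for (int f = 0; f < numFrames; f++) {
    int s = f % 2;                              // buffer/queue slot for this frame
    modulus[f] = new float[imageSize/4];        // per-frame host result buffer
    q[s].enqueueWriteBuffer(inputDataBuffer[s], CL_FALSE, 0, imageSize*sizeof(float), frames[f].data);
    q[s].enqueueNDRangeKernel(kernel[s], cl::NullRange, globalws, localws);
    q[s].enqueueReadBuffer(modulusBuffer[s], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[f]);
}
q[0].finish();
q[1].finish();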

Related

Can MPI_Bcast work correctly while using multiple threads?

When I use MPI, there are multiple threads doing MPI_Bcast, like:
#pragma omp parallel for
for(int i = 0; i < k; i++)
{
    MPI_Bcast(&a[i], 1, MPI_INT32_T, TargetRank, MPI_COMM_WORLD);
}
If the data size and type are the same, it seems they will broadcast to the wrong places.
How could I fix it? (For now I use my own my_bcast with a tag.)
[Problem schematic image]
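A commonly suggested alternative to a tag-based my_bcast is to give each thread its own duplicated communicator, so concurrent broadcasts can never be matched against each other. A hedged sketch (names and sizes are illustrative; it assumes MPI_THREAD_MULTIPLE support and the same thread count and static schedule on every rank):
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdint>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    const int k = 8;              // illustrative size
    const int TargetRank = 0;
    std::vector<int32_t> a(k, 0);

    // One communicator per OpenMP thread, duplicated from MPI_COMM_WORLD,
    // so each thread's broadcasts are matched only against its own communicator.
    std::vector<MPI_Comm> comms(omp_get_max_threads());
    for (auto &c : comms) MPI_Comm_dup(MPI_COMM_WORLD, &c);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < k; i++)
        MPI_Bcast(&a[i], 1, MPI_INT32_T, TargetRank, comms[omp_get_thread_num()]);

    for (auto &c : comms) MPI_Comm_free(&c);
    MPI_Finalize();
    return 0;
}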

Using OpenMP with GPU

Good day, everyone!
I would like to ask the respected community for advice about using GPU computing power instead of, or together with, the CPU.
I have a well-functioning program based on a recursive search of all possible combinations of some events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function announcements
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")

int main() {
    // reads data from file
    // data are converted and analyzed
    // the variant variable containing the current best result is filled in (here - by pre-analysis)
    #pragma omp parallel shared(variant)
    #pragma omp master
    // call the recursive algorithm that searches all variants:
    PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
    return 0;
}

void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // looking for the boundaries of the first cycle for some reasons
    for (int i = quantity; i < another_quantity; i++) {
        // Tabl_1 is processed and modified to determine the number of steps in the subsequent for cycle
        for (int k = 0; k < the_quantity_just_found; k++) {
            if the recursion depth is not 1, we go down further: {
                // add descent to the next recursion level to the call stack:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
            }
            else (if we went down to the lowest level): {
                if (condition fulfilled) // condition check - READ variant variable
                    variant = it_is_equal_to_that_,_to_that...;
                else
                    continue;
            }
        }
    }
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without that the algorithm runs for a very long time. At the place where I work, I was advised to think about using a GPU to speed up the calculations. I learned that OpenMP can work with video cards (especially NVidia ones), and that OpenACC also does this well.
In this regard, my main question is whether it is possible to run a recursive algorithm on a GPU simply and, at the same time, effectively. Can this give a noticeable speedup relative to the CPU? If so, would OpenACC perhaps do it better? Is it possible to hand work to the video card through "#pragma omp task", or are other directives required? And how could calculations on the CPU and GPU be combined?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)
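For reference, a flat loop offloaded with OpenMP target directives looks roughly like the sketch below (placeholder evaluation, not the PEREBOR program; a recursive #pragma omp task tree does not map onto the GPU directly, so part of the search usually has to be flattened into an iteration space first):
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;                 // number of candidate variants (placeholder)
    std::vector<float> score(n);
    float *s = score.data();

    // Evaluate all candidates on the device; map the results back to the host.
    #pragma omp target teams distribute parallel for map(from: s[0:n])
    for (int i = 0; i < n; i++) {
        s[i] = static_cast<float>(i) * 0.5f;   // placeholder evaluation
    }

    // Pick the best variant on the host.
    int best = 0;
    for (int i = 1; i < n; i++)
        if (s[i] > s[best]) best = i;
    printf("best variant index: %d\n", best);
    return 0;
}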

OpenCL 'non-blocking' reads have higher cost than expected

Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>
#include <stdio.h>

static const int size = 100000;
int host_buf[size];

int main() {
    cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
    std::vector<cl::Device> devices;
    ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
    printf("Using OpenCL devices: \n");
    for (auto &dev : devices) {
        std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
        printf("  %s\n", dev_name.c_str());
    }
    cl::CommandQueue queue(ctx);
    cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);
    std::vector<int> values(size);

    // Warmup
    queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
    queue.finish();

    // Run from 1 to 100000 sized chunks
    for (int k = 1; k <= size; k *= 10) {
        auto cstart = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < k; j++)
            queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
        queue.finish();
        auto cend = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
        printf("%8d: %8.02f us\n", k, time);
    }
    return 0;
}
As always, there is some random variation but the typical output for me is like this:
       1:    10.03 us
      10:   107.93 us
     100:   794.54 us
    1000:  8301.35 us
   10000: 83741.06 us
  100000: 981607.26 us
Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.
Is this expected behavior?
Do other GPUs behave the same way?
Is there any workaround other than always doing random-access reads from inside a kernel?
You are using a single in-order command queue. Hence, all enqueued reads are performed sequentially by the hardware / driver.
The 'non-blocking' aspect simply means that the call itself is asynchronous and will not block your host code while the GPU is working.
In your code, you use queue.finish() (clFinish), which blocks until all reads are done.
So yes, this is the expected behavior. You will pay the full time penalty for each DMA transfer.
As long as you create an in-order command queue (the default), other GPUs will behave the same.
If your hardware / driver supports out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively, you could use multiple in-order queues. But the performance is of course hardware and driver dependent.
Using multiple queues / out-of-order queues is a bit more advanced. You should make sure to properly use events to avoid race conditions or undefined behavior.
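For instance, against the code above, an out-of-order queue with one event per read could look like this (a sketch; it assumes the device actually supports out-of-order execution):
// The driver is free to overlap or reorder the independent reads; the events
// are what you then wait on instead of finish().
cl::CommandQueue ooo_queue(ctx, devices[0], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
std::vector<cl::Event> read_events(k);
for (int j = 0; j < k; j++)
    ooo_queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int),
                                &host_buf[j], nullptr, &read_events[j]);
cl::Event::waitForEvents(read_events);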
To reduce the latency associated with GPU-host DMA transfers, it is recommended to use a pinned host buffer rather than a std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
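A sketch of that, reusing the ctx and queue from the code above:
// Allocate a pinned (page-locked) buffer and map it once; DMA reads can then
// target host-visible memory instead of a pageable std::vector / array.
cl::Buffer pinned(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int) * size);
int *pinned_ptr = static_cast<int *>(queue.enqueueMapBuffer(
    pinned, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(int) * size));
// ... use pinned_ptr as the destination of the enqueueReadBuffer calls ...
queue.enqueueUnmapMemObject(pinned, pinned_ptr);
queue.finish();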

Arduino IDE - global variables storage in RAM or flash memory

Conventional wisdom has it that global and static data is stored in the bottom of RAM along with other stuff. Somewhere above that is the heap, then free memory and at the top of RAM, is the stack. See the code below before reading on.
When I compile this with the Arduino IDE (1.8.10) on a MEGA2560 I get the following statistics:
Sketch uses 2750 bytes (1%) of program storage space. Maximum is 253952 bytes.
Global variables use 198 bytes (2%) of dynamic memory, leaving 7994 bytes for local variables. Maximum is 8192 bytes.
If I change ARRAY_SIZE from 1 to 7001, I get exactly the same numbers. I expected that dynamic memory should increase by 7000. When I do the same comparison with AtmelStudio V7, dynamic memory does indeed increase by 7000.
One more piece of information along these lines. If I do a malloc of 7000 which is pretty close to free memory, one would expect that the malloc should be successful when ARRAY_SIZE equals one and would fail when the ARRAY_SIZE equals 7001. I was dismayed to find that the malloc was successful with both the small and the large array sizes. With AtmelStudio this does not happen.
I suspect respective compiler/linker options somewhere could explain the difference (AtmelStudio - project properties and Arduino IDE - platform.txt perhaps?).
I also suspect that the Arduino IDE dynamically allocates global variables in flash memory.
I am not a newbie, but I am not a guru - comments anyone?
Thanks
Michèle
#define ARRAY_SIZE 7001

uint8_t globalArray[ARRAY_SIZE]{1};

void setup() {
    Serial.begin(115200);
    for (int i = 0; i < ARRAY_SIZE; i++) globalArray[i] = 1;
    Serial.print(F("Just initialized globalArray, size = ")); Serial.println(ARRAY_SIZE);
    uint8_t* testPointer = (uint8_t*) malloc(7000);
    Serial.print(F("Allocation of 7000 bytes "));
    if (testPointer != (uint8_t*) 0) {
        Serial.print(F("SUCCESSFUL"));
    } else {
        Serial.print(F("NOT SUCCESSFUL"));
    }
} // setup

void loop() {} // loop
I ran some more tests to figure out why AtmelStudio and the Arduino IDE report vastly different RAM usage values after declaring an array. The response from juraj (thank you) was that the compiler optimized unused code away. This answer was true; however, I had included an array initialization loop to make sure that the compiler would include the array in the code.
It turns out that AtmelStudio and the Arduino IDE have different criteria as to what it means for code to be "used". The outcome is that globalArray, in the initialization line,
for (int i = 0; i < ARRAY_SIZE; i++) globalArray[i] = 1;
is considered by AtmelStudio as being used and by the Arduino IDE as not being used.
The Arduino IDE requires that globalArray appear on the right of an assignment statement (i.e., actually be read) to consider it as being used, thus the need for the "a+=globalArray[i];" line. The exact same code below reports:
With "a+=globalArray[i];" not used:
Atmel Studio: Data Memory Usage of 7211 bytes
Arduino IDE: Global variables use 198 bytes
With "a+=globalArray[i];" used:
Atmel Studio: Data Memory Usage of 7211 bytes
Arduino IDE: Global variables use 7199 bytes
Q.E.D. Interesting how the two IDEs do not quite mean the same thing by "being used".
Thanks - My first time on this forum got my question answered rather quickly.
Michèle
#define ARRAY_SIZE 7001

uint8_t globalArray[ARRAY_SIZE];

void setup() {
    Serial.begin(115200);
    for (int i = 0; i < ARRAY_SIZE; i++) globalArray[i] = 1;
    Serial.print(F("Just initialized globalArray, size = "));
    Serial.println(ARRAY_SIZE);
    // uint16_t a {0};
    // for (int i = 0; i < ARRAY_SIZE; i++) a+=globalArray[i];
    // Serial.print(F("Value of a = ")); Serial.println(a);
    uint8_t* testPointer = (uint8_t*) malloc(7000);
    Serial.print(F("Allocation of 7000 bytes "));
    if (testPointer != (uint8_t*) 0) Serial.print(F("SUCCESSFUL"));
    else Serial.print(F("NOT SUCCESSFUL"));
} // setup

void loop() {} // loop
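As an aside (an assumption about the toolchain rather than something verified here): declaring the array volatile should also keep it from being discarded as unused, without needing the artificial read loop:
// Assumption: volatile forces the stores in the init loop to be kept,
// so the array can no longer be dropped as write-only/unused.
volatile uint8_t globalArray[ARRAY_SIZE];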

Running an OpenCL kernel on multiple GPUs

I have an OpenCL kernel and I want to run it on all detected OpenCL-capable devices (like all available GPUs) on different systems. I'd be thankful to know if there is any straightforward method, for example something like creating a single command queue for all devices.
Thanks in advance :]
You can't create a single command queue for all devices; a given command queue is tied to a single device. However, you can create separate command queues for each OpenCL device and feed them work, which should execute concurrently.
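A minimal sketch of that setup (one context spanning the GPUs of the first platform and one in-order queue per device; adjust the platform/device selection to your system):
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::vector<cl::Device> devs;
platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devs);
cl::Context context(devs);                       // one context shared by all GPUs
std::vector<cl::CommandQueue> commandQueues;
for (auto &d : devs)
    commandQueues.emplace_back(context, d);      // one in-order queue per GPU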
As Dithermaster points out, you first create a separate command queue for each device; for instance, you might have multiple GPUs. You can then place these in an array, e.g., here is a pointer to an array that you can set up:
cl_command_queue* commandQueues;
However, in my experience it has not always been a "slam dunk" to get the various command queues executing concurrently. You can verify what actually happens using event timing information (checking for overlap), which you can get through your own profiling or third-party profiling tools. You should do this step anyway to verify what does or does not work on your setup.
An alternative approach which can work quite nicely is to use OpenMP to execute the command queues concurrently, e.g., you do something like:
#pragma omp parallel for default(shared)
for (int i = 0; i < numDevices; ++i) {
    someOpenCLFunction(commandQueues[i], ....);
}
Suppose you have N devices and 100 elements of work (jobs). What you should do is something like this:
#define SIZE 3
std::vector<cl::CommandQueue> queues(SIZE);          // One queue for each device (same context)
std::vector<cl::Kernel> kernels(SIZE);               // One kernel for each device (same context)
std::vector<cl::Buffer> buf_in(SIZE), buf_out(SIZE); // One buffer set for each device (same context)

// Initialize the queues, kernels, buffers etc....
// Create the kernels, buffers and queues, then set the kernels[0] args to point to buf_in[0] and buf_out[0], and so on...

// Create the events in a finished state
std::vector<cl::Event> events;
cl::UserEvent ev(context); ev.setStatus(CL_COMPLETE); // context shared by all the queues
for(int i=0; i<queues.size(); i++)
    events.push_back(ev);

// Run all the elements (a "first empty, first run" scheduler)
for(int i=0; i<jobs.size(); i++){
    bool found = false;
    int x = -1;
    // Try all the queues
    while(!found){
        for(int j=0; j<queues.size(); j++)
            if(events[j].getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>() == CL_COMPLETE){
                found = true;
                x = j;
                break;
            }
        if(!found) Sleep(50); // Sleep a while if not all the queues have completed; other options are possible (like assigning the job to a random one)
    }
    // Run it
    events[x] = cl::Event();                            // Clean it
    queues[x].enqueueWriteBuffer(...);                  // Copy buf_in
    queues[x].enqueueNDRangeKernel(kernels[x], .... );  // Launch the kernel
    queues[x].enqueueReadBuffer(... , &events[x]);      // Read buf_out, last argument receives the completion event
}

// Wait for completion
for(int i=0; i<queues.size(); i++)
    queues[i].finish();
