When I use MPI, multiple OpenMP threads each call MPI_Bcast, like this:
#pragma omp parallel for
for(int i = 0; i < k; i++)
{
MPI_Bcast(&a[i], 1, MPI_INT32_T, TargetRank, MPI_COMM_WORLD);
}
If the data size and type are the same for every call, the broadcasts seem to deliver data to the wrong places.
How can I fix this? (For now I use a hand-written my_bcast that carries a tag.)
Problem Schematic
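One common fix, sketched below, is to give each thread its own duplicated communicator, because MPI does not allow several collectives to be in flight at the same time on the same communicator. This is not necessarily what my_bcast does, and everything beyond a, k and TargetRank is illustrative:
// Sketch only: one duplicated communicator per OpenMP thread, so that
// concurrent MPI_Bcast calls never share a communicator. Assumes MPI was
// initialized with MPI_Init_thread() requesting MPI_THREAD_MULTIPLE, and
// that every rank runs the same number of threads with the same static
// schedule (so iteration i is handled by the same thread on every rank).
#include <mpi.h>
#include <omp.h>
#include <cstdint>
#include <vector>
void broadcast_elements(std::int32_t *a, int k, int TargetRank)
{
    std::vector<MPI_Comm> comms(omp_get_max_threads());
    for (auto &c : comms)
        MPI_Comm_dup(MPI_COMM_WORLD, &c); // collective: called the same number of times on every rank
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < k; i++)
        MPI_Bcast(&a[i], 1, MPI_INT32_T, TargetRank, comms[omp_get_thread_num()]);
    for (auto &c : comms)
        MPI_Comm_free(&c);
}
The communicator duplication is a one-time cost and can be hoisted to an initialization step if this broadcast is performed repeatedly.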
Consider the following code, which enqueues between 1 and 100000 'non-blocking' random access buffer reads and measures the time:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include <chrono>
#include <stdio.h>
static const int size = 100000;
int host_buf[size];
int main() {
cl::Context ctx(CL_DEVICE_TYPE_DEFAULT, nullptr, nullptr, nullptr);
std::vector<cl::Device> devices;
ctx.getInfo(CL_CONTEXT_DEVICES, &devices);
printf("Using OpenCL devices: \n");
for (auto &dev : devices) {
std::string dev_name = dev.getInfo<CL_DEVICE_NAME>();
printf(" %s\n", dev_name.c_str());
}
cl::CommandQueue queue(ctx);
cl::Buffer gpu_buf(ctx, CL_MEM_READ_WRITE, sizeof(int) * size, nullptr, nullptr);
std::vector<int> values(size);
// Warmup
queue.enqueueReadBuffer(gpu_buf, false, 0, sizeof(int), &(host_buf[0]));
queue.finish();
// Run from 1 to 100000 sized chunks
for (int k = 1; k <= size; k *= 10) {
auto cstart = std::chrono::high_resolution_clock::now();
for (int j = 0; j < k; j++)
queue.enqueueReadBuffer(gpu_buf, false, sizeof(int) * (j * (size / k)), sizeof(int), &(host_buf[j]));
queue.finish();
auto cend = std::chrono::high_resolution_clock::now();
double time = std::chrono::duration<double>(cend - cstart).count() * 1000000.0;
printf("%8d: %8.02f us\n", k, time);
}
return 0;
}
As always, there is some random variation but the typical output for me is like this:
1: 10.03 us
10: 107.93 us
100: 794.54 us
1000: 8301.35 us
10000: 83741.06 us
100000: 981607.26 us
Whilst I did expect a relatively high latency for a single read, given the need for a PCIe round trip, I am surprised at the high cost of adding subsequent reads to the queue - as if there isn't really a 'queue' at all but each read adds the full latency penalty. This is on a GTX 960 with Linux and driver version 455.45.01.
Is this expected behavior?
Do other GPUs behave the same way?
Is there any workaround other than always doing random-access reads from inside a kernel?
You are using a single in-order command queue. Hence, all enqueued reads are performed sequentially by the hardware / driver.
The 'non-blocking' aspect simply means that the call itself is asynchronous and will not block your host code while the GPU is working.
In your code, you call queue.finish() (clFinish), which blocks until all reads are done.
So yes, this is the expected behavior. You will pay the full time penalty for each DMA transfer.
As long as you create an in-order command queue (the default), other GPUs will behave the same.
If your hardware / driver supports out-of-order queues, you could use them to potentially overlap DMA transfers. Alternatively you could use multiple in-order queues. But the performance is of course hardware & driver dependent.
Using multiple queues / out-of-order queues is a bit more advanced. You should make sure to use events properly to avoid race conditions and undefined behavior.
To reduce the latency associated with GPU-host DMA transfers, it is recommended to use a pinned host buffer rather than std::vector. Pinned host buffers are usually created via clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag.
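As a concrete illustration of the last two points, here is one possible setup using the C++ wrapper, reusing ctx, devices, gpu_buf and size from the question. Whether the out-of-order property is honored and how much the pinned path helps is hardware and driver dependent:
// Sketch: an out-of-order queue (if supported) plus a pinned host buffer,
// instead of reading into a plain host array / std::vector.
cl::CommandQueue ooo_queue(ctx, devices[0], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
// Pinned (page-locked) host memory is typically obtained by creating a
// buffer with CL_MEM_ALLOC_HOST_PTR and mapping it.
cl::Buffer pinned(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int) * size);
int *pinned_ptr = static_cast<int *>(
    ooo_queue.enqueueMapBuffer(pinned, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(int) * size));
// Device-to-host reads that target pinned_ptr can use the faster DMA path.
ooo_queue.enqueueReadBuffer(gpu_buf, CL_FALSE, 0, sizeof(int) * size, pinned_ptr);
ooo_queue.finish();
// Unmap once the pinned region is no longer needed.
ooo_queue.enqueueUnmapMemObject(pinned, pinned_ptr);
ooo_queue.finish();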
I am developing an application which performs some processing on video frame data. To accelerate it I use two graphics cards and process the data with OpenCL. My idea is to send one frame to the first card and another one to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems to me that the computations are not executed in parallel, because the time required by the two cards is almost the same as the time required by only one card.
Does anyone have a good example of using multiple devices on independent pieces of data simultaneously?
Thanks in advance.
EDIT:
Here is the resulting code after switching to two separate contexts. However, the execution time with two graphics cards still remains the same as with one graphics card.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
// Copy the input data to the device
commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
// Set kernel arguments
kernel[i].setArg(0, inputDataBuffer[i]);
kernel[i].setArg(1, modulusBuffer[i]);
kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++){
// Run kernel
commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++){
// Read the modulus back to the host
float* modulus = new float[imageSize/4];
commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
// Do something with the modulus;
}
Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: each operation waits for the previous one to finish, so there is no parallelism at all (or very little). At the moment you are doing this:
Wr:-Copy1--Copy2--------------------
G1:---------------RUN1--------------
G2:---------------RUN2--------------
Re:-------------------Read1--Read2--
You should change your code to do it like this at least:
Wr:-Copy1-Copy2-----------
G1:------RUN1-------------
G2:------------RUN2-------
Re:----------Read1-Read2--
With this code:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
// Set kernel arguments //YOU SHOULD DO THIS AT INIT STAGE, IT IS SLOW TO DO IT IN A LOOP
kernel[i].setArg(0, inputDataBuffer[i]);
kernel[i].setArg(1, modulusBuffer[i]);
kernel[i].setArg(2, imagewidth);
// Copy the input data to the device
commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++){
// Run kernel
commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
float* modulus[numDevices];
for (int i = 0; i < numDevices; i++){
// Read the modulus back to the host
modulus[i] = new float[imageSize/4];
commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
for (int i = 0; i < numDevices; i++)
    commandQueues[i].finish(); // finish each queue; clFinish() takes a queue argument
// Do something with the modulus;
Regarding the comments about using multiple contexts: it depends on whether the two GPUs will ever need to communicate. As long as each GPU only uses its own memory, there will be no copy overhead. But if you set/unset kernel arguments constantly, that will trigger copies to the other GPU, so be careful with that.
If the GPUs never need to communicate, the safer approach is separate contexts.
I suspect your main problem is the memory copies, not the kernel execution; quite likely one GPU will fulfil your needs if you hide the memory latency:
Wr:-Copy1-Copy2-Copy3----------
G1:------RUN1--RUN2--RUN3------
Re:----------Read1-Read2-Read3-
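As a rough sketch of that single-GPU pipelining, one option is two in-order queues on the same device and two buffer sets, alternating frames between them so the driver can overlap the transfer of one frame with the kernel of the other. All identifiers below are illustrative, and the actual overlap depends on the hardware and driver:
// Sketch: double-buffered pipeline; slot s owns queue q[s], buffers
// inBuf[s]/outBuf[s] and kernel object kern[s]. frames[] and results[]
// are host arrays of frameBytes bytes each.
cl::CommandQueue q[2] = { cl::CommandQueue(ctx, device), cl::CommandQueue(ctx, device) };
for (int f = 0; f < numFrames; f++) {
    int s = f % 2; // which queue / buffer set to use for this frame
    q[s].enqueueWriteBuffer(inBuf[s], CL_FALSE, 0, frameBytes, frames[f]); // non-blocking upload
    kern[s].setArg(0, inBuf[s]);
    kern[s].setArg(1, outBuf[s]);
    q[s].enqueueNDRangeKernel(kern[s], cl::NullRange, cl::NDRange(imageSize), cl::NullRange);
    q[s].enqueueReadBuffer(outBuf[s], CL_FALSE, 0, frameBytes, results[f]); // non-blocking read-back
}
q[0].finish();
q[1].finish(); // results[] is only valid after both queues have finished
Because each queue is in-order, the upload for frame f+2 cannot start overwriting buffer set s before frame f's read-back on that same queue has completed, so no extra events are needed for correctness.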
I'm new to OpenCL and in order to get a better grasp of a few concepts I contrived a simple example of a geometric progression as follows (emphasis on contrived):
An array of N values and an array of N coefficients (whose values could be
anything, but in the example they are all the same) are allocated.
M steps are performed in sequence where each value in the values array
is multiplied by its corresponding coefficient in the coefficients
array and assigned as the new value in the values array. Each step needs to fully complete before the next step can begin. I know this part is a bit contrived, but this is a requirement I want to enforce to help my understanding of OpenCL.
I'm only interested in the values in the values array after the final step has completed.
Here is the very simple OpenCL kernel (MultiplyVectors.cl):
__kernel void MultiplyVectors (__global float4* x, __global float4* y, __global float4* result)
{
int i = get_global_id(0);
result[i] = x[i] * y[i];
}
And here is the host program (main.cpp):
#include <CL/cl.hpp>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
int main ()
{
auto context = cl::Context (CL_DEVICE_TYPE_GPU);
auto *sourceFile = fopen("MultiplyVectors.cl", "r");
if (sourceFile == nullptr)
{
perror("Couldn't open the source file");
return 1;
}
fseek(sourceFile, 0, SEEK_END);
const auto sourceSize = ftell(sourceFile);
auto *sourceBuffer = new char [sourceSize + 1];
sourceBuffer[sourceSize] = '\0';
rewind(sourceFile);
fread(sourceBuffer, sizeof(char), sourceSize, sourceFile);
fclose(sourceFile);
auto program = cl::Program (context, cl::Program::Sources {std::make_pair (sourceBuffer, sourceSize + 1)});
delete[] sourceBuffer;
const auto devices = context.getInfo<CL_CONTEXT_DEVICES> ();
program.build (devices);
auto kernel = cl::Kernel (program, "MultiplyVectors");
const size_t vectorSize = 1024;
float coeffs[vectorSize] {};
for (size_t i = 0; i < vectorSize; ++i)
{
coeffs[i] = 1.000001;
}
auto coeffsBuffer = cl::Buffer (context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof (coeffs), coeffs);
float values[vectorSize] {};
for (size_t i = 0; i < vectorSize; ++i)
{
values[i] = static_cast<float> (i);
}
auto valuesBuffer = cl::Buffer (context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof (values), values);
kernel.setArg (0, coeffsBuffer);
kernel.setArg (1, valuesBuffer);
kernel.setArg (2, valuesBuffer);
auto commandQueue = cl::CommandQueue (context, devices[0]);
for (size_t i = 0; i < 1000000; ++i)
{
commandQueue.enqueueNDRangeKernel (kernel, cl::NDRange (0), cl::NDRange (vectorSize / 4), cl::NullRange);
}
printf ("All kernels enqueued. Waiting to read buffer after last kernel...");
commandQueue.enqueueReadBuffer (valuesBuffer, CL_TRUE, 0, sizeof (values), values);
return 0;
}
What I'm basically asking is for advice on how to best optimize this OpenCL program to run on a GPU. I have the following questions based on my limited OpenCL experience to get the conversation going:
Could I be handling the buffers better? I'd like to minimize any
unnecessary ferrying of data between the host and the GPU.
What's the optimal work group configuration (in general at least, I
know this can vary by GPU)? I'm not actually sharing any data
between work items and it doesn't seem like I'd benefit from work
groups much here, but just in case.
Should I be allocating and loading anything into local memory for a
work group (if that would make sense at all)?
I'm currently enqueueing one kernel for each step, which will create a
work item for every 4 floats to take advantage of a hypothetical GPU with a SIMD
width of 128 bits. I'm attempting to enqueue all of this
asynchronously (although I'm noticing the Nvidia implementation I have
seems to block each enqueue until the kernel is complete) at once
and then wait on the final one to complete. Is there an altogether better
approach to this that I'm missing?
Is there a design that would allow for only one call to
enqueueNDRangeKernel (instead of one call per step) while
maintaining the ability for each step to be efficiently processed in
parallel?
Obviously I know that the example problem I'm solving can be done in much better ways, but I wanted to have as simple an example as possible that illustrates a vector of values being operated on in a series of steps, where each step has to complete fully before the next. Any help and pointers on how best to go about this would be greatly appreciated.
Thanks!
I have an OpenCL kernel and I want to run it on all detected OpenCL capable devices (like all available GPUs) on different systems, I'd be thankful to know if there is any straightforward method. I mean like creating a single command queue for all devices.
Thanks in advance :]
You can't create a single command queue for all devices; a given command queue is tied to a single device. However, you can create separate command queues for each OpenCL device and feed them work, which should execute concurrently.
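For concreteness, here is a minimal sketch of that setup with the C++ wrapper: one context shared by all devices of a platform and one in-order queue per device (all identifiers are illustrative):
// Sketch: one queue per device; work submitted to different queues can
// then execute concurrently, hardware and driver permitting.
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
cl::Context context(devices); // one context spanning all devices of this platform
std::vector<cl::CommandQueue> queues;
for (auto &dev : devices)
    queues.emplace_back(context, dev); // one in-order queue per device
// Feed each queue its share of the work; the queues run independently.
Note that a single context can only span devices from one platform; devices on different platforms need their own contexts (and queues).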
As Dithermaster points out, you first create a separate command queue for each device (for instance, you might have multiple GPUs). You can then place these in an array, e.g., here is a pointer to an array that you can set up:
cl_command_queue* commandQueues;
However, in my experience getting the various command queues to execute concurrently has not always been a slam dunk. You can verify it using event timing information (checking for overlap), which you can obtain through your own profiling or with third-party profiling tools. You should do this step anyway to verify what does or does not work on your setup.
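One way to collect that event timing information, sketched with the C++ wrapper (the queue must be created with profiling enabled; context, device, kernel and globalRange are illustrative):
// Sketch: read start/end timestamps from the event of an enqueued command,
// then compare the ranges from different queues to see whether they overlapped.
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
cl::Event ev;
q.enqueueNDRangeKernel(kernel, cl::NullRange, globalRange, cl::NullRange, nullptr, &ev);
ev.wait();
cl_ulong start = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
printf("kernel ran for %.3f ms\n", (end - start) * 1e-6);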
An alternative approach which can work quite nicely is to use OpenMP to execute the command queues concurrently, e.g., you do something like:
#pragma omp parallel for default(shared)
for (int i = 0; i < numDevices; ++i) {
someOpenCLFunction(commandQueues[i], ....);
}
Suppose you have N devices and 100 elements of work (jobs). What you should do is something like this:
#define SIZE 3
std::vector<cl::CommandQueue> queues(SIZE); //One queue for each device (same context)
std::vector<cl::Kernel> kernels(SIZE); //One kernel for each device (same context)
std::vector<cl::Buffer> buf_in(SIZE), buf_out(SIZE); //One buffer set for each device (same context)
// Initialize the queues, kernels, buffers etc....
//Create the kernel, buffers and queues, then set the kernel[0] args to point to buf_in[0] and buf_out[0], and so on...
// Create the events in a finished state
std::vector<cl::Event> events;
cl::UserEvent ev(context); ev.setStatus(CL_COMPLETE); //User events need the shared context
for(int i=0; i<queues.size(); i++)
events.push_back(ev);
//Run all the elements (a "first empty, first run" scheduler)
for(int i=0; i<jobs.size(); i++){
bool found = false;
int x = -1;
//Try all the queues
while(!found){
for(int j=0; j<queues.size(); j++)
if(events[j].getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>() == CL_COMPLETE){
found = true;
x = j;
break;
}
if(!found) Sleep(50); //Sleep a while if not all the queues have completed; other options are possible (like assigning the job to a random one)
}
//Run it
events[x] = cl::Event(); //Clean it
queues[x].enqueueWriteBuffer(...); //Copy buf_in
queues[x].enqueueNDRangeKernel(kernels[x], .... ); //Launch the kernel
queues[x].enqueueReadBuffer(... , &events[x]); //Read buf_out and get its completion event
}
//Wait for completion
for(int i=0; i<queues.size(); i++)
queues[i].finish();
I have two questions regarding the multidimensional arrays. I declared a 3D array using two stars but when I try to access the elements I get a used-without-initializing error.
unsigned **(test[10]);
**(test[0]) = 5;
How come I get that error, while with the following code I don't get one? What's the difference?
unsigned test3[10][10][10];
**(test3[0]) = 5;
My second question is this: I'm trying to port a piece of code that was written for Unix to Windows. One of the lines is this:
unsigned **(precomputedHashesOfULSHs[nnStruct->nHFTuples]);
nHFTuples is of type int but it's not a constant, and this is the error that I'm getting:
error C2057: expected constant expression
Is it possible that I'm getting this error because I'm compiling it on Windows, not Unix? And how would I solve this problem? I can't make nHFTuples a constant because the user needs to provide its value!
In the first one, you didn't declare a 3D array, you declared an array of 10 pointers to pointers to unsigned ints. When you dereference it, you're dereferencing a garbage pointer.
In the second one, you declared the array correctly, and **(test3[0]) does compile because the array decays to a pointer, but it is needlessly obscure; indexing is clearer.
Do this:
unsigned test3[10][10][10];
test3[0][0][0] = 5;
To answer your second question: the size of an array must be a compile-time constant in MSVC. GCC supports variable-length arrays (a C99 feature, also available as a GCC extension in older modes), but Visual C++ does not, so that line isn't portable. To fix it, you'll have to use malloc and free:
int i, j;
// Outer dimension: nnStruct->nHFTuples pointers to 2D blocks
unsigned ***precomputedHashOfULSHs = malloc(nnStruct->nHFTuples * sizeof(unsigned **));
for (i = 0; i < nnStruct->nHFTuples; ++i) {
    // Middle dimension
    precomputedHashOfULSHs[i] = malloc(secondDimensionLength * sizeof(unsigned *));
    for (j = 0; j < secondDimensionLength; ++j)
        // Innermost dimension: the actual unsigned values
        precomputedHashOfULSHs[i][j] = malloc(thirdDimensionLength * sizeof(unsigned));
}
// then when you're done...
for (i = 0; i < nnStruct->nHFTuples; ++i) {
    for (j = 0; j < secondDimensionLength; ++j)
        free(precomputedHashOfULSHs[i][j]);
    free(precomputedHashOfULSHs[i]);
}
free(precomputedHashOfULSHs);
(Pardon me if that allocation/deallocation code is wrong, it's late :))
Although you don't specify it, I think you're using a compiler on Unix that supports C99 (such as GCC), whereas the compiler you use on Windows does not. (Visual Studio only supports C89 here.)
You have three options:
You can hard-code a suitable maximum array size.
You could allocate the array yourself using malloc or calloc. Don't forget to free it when you're done.
Port the program to C++, and use std::vector.
If you choose option 3, then you'll want something like:
std::vector<unsigned int> precomputedHashOfULSHs;
That gives a one-dimensional vector; for a two-dimensional vector, use:
std::vector<std::vector<unsigned int> > precomputedHashOfULSHs;
Do note that vectors default to being empty, of zero length, so you will need to add each element from the original set.
In the case of test3 as an example, you'll want:
std::vector<std::vector<std::vector<unsigned int> > > precomputedHashOfULSHs;
precomputedHashOfULSHs.resize(10);
for(int i = 0; i < 10; i++) {
precomputedHashOfULSHs[i].resize(10);
for(int ii=0; ii<10; ii++) {
precomputedHashOfULSHs[i][ii].resize(10);
}
}
I haven't tested this code, but it should be right. C++ will manage the memory of that vector for you automatically.