On Nvidia GPUs, when I call clEnqueueNDRangeKernel, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, cl::CommandQueue::enqueueNDRangeKernel, but that shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) accessed remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then too, but my memory of it is hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRangeKernel using std::async/std::future from the new C++11 spec, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is taken from the function parameters) */

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);

    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
        std::launch::async,
        &cl::CommandQueue::enqueueNDRangeKernel,
        &queues[i],
        kernels[kernel],
        offset,
        increment,
        local,
        (const std::vector<cl::Event>*)NULL,
        (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}

//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
    execError = enqueueReturns[i].get();
    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}
The threads definitely do start as soon as std::async is called - if I put a little dummy function in place of the enqueue, set a breakpoint on it in GDB, it gets hit the moment std::async runs. However, if I make a wrapper function for enqueueNDRangeKernel, call that instead, and put a print statement after the enqueue, I can see that a noticeable amount of time passes between prints.
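For reference, the wrapper I used to check this looks roughly like the following (reconstructed for illustration, not copied verbatim from my code) - if the enqueue were non-blocking, the two prints would appear almost back to back:

//Rough reconstruction of the wrapper used to check whether the enqueue blocks.
//If enqueueNDRangeKernel returned immediately, "after enqueue" would print right away.
cl_int enqueueWrapper(cl::CommandQueue& q, const cl::Kernel& k,
                      const cl::NDRange& offset, const cl::NDRange& global,
                      const cl::NDRange& local)
{
    std::cout << "before enqueue" << std::endl;
    cl_int err = q.enqueueNDRangeKernel(k, offset, global, local);
    std::cout << "after enqueue" << std::endl; //on the Teslas this only appears once the kernel has finished
    return err;
}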
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - the buffer that I'm passing to the kernel as an argument (and the one I mention above that seems to get passed between the GPUs) is created with CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE before, with the same effect.
I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows that, for each device, you need to create separate:
queues - So you can represent each device and enqueue orders to it
buffers - One buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
kernels - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call enqueueNDRangeKernel for each queue from a separate thread. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
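To make that concrete, here is a rough sketch of the per-device setup (devices, context, program, hostData and N are placeholders, not taken from my actual code; increment and local are the same as above):

std::vector<cl::CommandQueue> queues;
std::vector<cl::Buffer> buffers;
std::vector<cl::Kernel> kernels;
for(size_t d = 0; d < devices.size(); d++)
{
    //One queue per device, so each device can be fed independently
    queues.push_back(cl::CommandQueue(context, devices[d]));
    //One buffer per device, so the devices don't serialize on a shared buffer
    buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 N * sizeof(float), hostData.data()));
    //One kernel object per device, so setArg calls don't step on each other
    kernels.push_back(cl::Kernel(program, "myKernel"));
    kernels[d].setArg(0, buffers[d]);
}

//Since each enqueue blocks on Nvidia, give each one its own thread
std::vector<std::future<cl_int>> results;
for(size_t d = 0; d < devices.size(); d++)
{
    cl::NDRange offset(d*increment[0], d*increment[1], d*increment[2]);
    results.push_back(std::async(std::launch::async,
        [=, &queues, &kernels]() {
            return queues[d].enqueueNDRangeKernel(kernels[d], offset, increment, local);
        }));
}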
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
Yes, you're right. AFAIK the Nvidia implementation has a synchronous clEnqueueNDRangeKernel. I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, short of using a different implementation (and hence a different device).
Related
It might be a stupid question, but is there a way to return asynchronously from a kernel? For example, I have a kernel which does a first stream compaction that is output to the user, but before returning it must do a second stream compaction to update its internal structure.
Is there a way to return control to the user after the first stream compaction is done, while the GPU continues its second stream compaction in the background? Of course, the second stream compaction works only on shared memory and global memory - nothing the user needs to retrieve.
I can't use Thrust.
A GPU kernel does not, in itself, take control from the "user", i.e. from CPU threads on the system with the GPU.
However, with CUDA's runtime API, the plain way to invoke a GPU kernel puts it on the default stream, where all work is serialized - so the host effectively ends up waiting for the whole kernel to finish before it can retrieve any results:
my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size>>>(args,go,here);
but you can also use streams. These are hardware-supported execution queues on which you can enqueue work (memory copying, kernel execution etc.) asynchronously, just like you asked.
Your launch in this case may look like:
cudaStream_t my_stream;
cudaError_t result = cudaStreamCreateWithFlags(&my_stream, cudaStreamNonBlocking);
if (result != cudaSuccess) { /* error handling */ }
my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size,my_stream>>>(args,go,here);
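If you want the result of the first compaction back while the second one is still running, one option (sketched here with made-up names - d_output, h_output, nbytes, grid, block and second_compaction_kernel are not from the question) is to enqueue the copy and the second kernel on the same stream, record an event after the copy, and only wait on that event:

// Copy the first compaction's result back asynchronously (h_output should be
// pinned host memory for the copy to be truly asynchronous), then queue the
// second compaction behind it on the same stream.
cudaMemcpyAsync(h_output, d_output, nbytes, cudaMemcpyDeviceToHost, my_stream);
cudaEvent_t copy_done;
cudaEventCreate(&copy_done);
cudaEventRecord(copy_done, my_stream);
second_compaction_kernel<<<grid, block, 0, my_stream>>>(args, go, here);
// ... the host is free to do other work here ...
cudaEventSynchronize(copy_done); // waits only for the copy, not for the second kernel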
There are lots of resources on using streams; try this blog post for starters. The CUDA Programming Guide has a large section on asynchronous execution.
Streams and various libraries
Thrust has offered asynchronous functionality for a while, using thrust::future and other constructs. See here.
My own Modern-C++ CUDA API wrappers make it somewhat easier to work with streams, relieving you of the need to check for errors all the time and to remember to destroy streams and release memory before it goes out of scope. See this example; the syntax looks something like this:
auto stream = device.create_stream(cuda::stream::async);
stream.enqueue.copy(d_a.get(), a.get(), nbytes);
stream.enqueue.kernel_launch(my_kernel, launch_config, d_a.get(), more, args);
(and errors throw an exception)
I'd like the MPI function MPI_Sendrecv() to run on the GPU. Normally I use something like:
#pragma acc host_data use_device(send_buf, recv_buf)
{
MPI_Sendrecv (send_buf, N, MPI_DOUBLE, proc[0], 0,
recv_buf, N, MPI_DOUBLE, proc[0], 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
And it works fine. However now, I call MPI_Sendrecv() inside a loop. If I try to accelerate this loop (with #pragma acc parallel loop) or even accelerate the whole routine (#pragma acc routine) where the loop and the MPI call are situated, I get an error:
64, Accelerator restriction: loop contains unsupported statement type
78, Accelerator restriction: unsupported statement type: opcode=ACCHOSTDATA
How can I make the call run on the device if, as in this case, the call is inside an accelerated region?
An alternative might be to not accelerate the routine and the loop, and use #pragma acc host_data use_device(send_buf, recv_buf) alone, but then the goal of having everything on the GPU would be defeated.
EDIT
I removed the #pragma. Still, the application runs hundreds of times slower and I cannot figure out why.
I'm using Nsight Systems to check: do you have any idea why MPI_Sendrecv is slowing down the app? Now the whole routine where it's called runs on the host. If I move the mouse pointer over the NVTX (MPI) row, it prints "ranges on this row have been projected from the CPU on the GPU". What does this mean?
Sorry if this is not clear, but I have little experience with Nsight and I don't know how to analyze the results properly. If you need more details I'm happy to give them to you.
However, it seems weird to me that the MPI calls appear in the GPU section.
You can't make MPI calls from within device code.
Also, the "host_data" is saying to use a device pointer within host code so can't be used within device code. Device pointers are used by default in device code, hence no need for the "host_data" construct.
Questions after edit:
Do you have and idea why MPI_Sendrecv is slowing down the app?
Sorry, no idea. I don't know what you're comparing to, or anything about your app, so it's hard for me to tell. Though MPI_Sendrecv is a blocking call, so putting it in a loop will cause all the sends and receives to wait on the previous ones before proceeding. Are you able to rewrite the code to use MPI_Isend and MPI_Irecv instead?
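A minimal sketch of what that could look like, reusing the buffer names from the question (the request handling and the overlap comment are illustrative, not taken from the original code):

/* Non-blocking version: post the receive and the send, then overlap other
 * work before waiting. send_buf, recv_buf, N and proc follow the question. */
MPI_Request reqs[2];
#pragma acc host_data use_device(send_buf, recv_buf)
{
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, proc[0], 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, proc[0], 0, MPI_COMM_WORLD, &reqs[1]);
}
/* ... work that does not depend on recv_buf can go here ... */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);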
"ranges on this row have been projected from the CPU on the GPU". What
does this mean?
I haven't seen this before, but I presume it just means that even though these are host calls, the NVTX instrumentation is able to project them onto the GPU timeline - most likely so the CUDA-aware MPI device-to-device data transfers can be correlated with the MPI region.
I'm currently building a ray marcher to look at things like the Mandelbox, etc. It works great. However, my current program uses one work-item per ray projected from the eye, which means a large amount of execution per work-item. So when looking at an incredibly complex object, or trying to render with high enough precision, my display driver crashes because the kernel takes too long to execute on a single work-item. I'm trying to avoid changing my registry values to make the timeout longer, as I want this application to work on multiple computers.
Is there any way to remedy this? As it stands, the execution of each work-item is completely independent of the work-items nearby. I've contemplated passing a buffer to the GPU that would store the current progress of each ray and only executing a small number of iterations per launch. Then I would just call the program over and over, and the result would hopefully refine a bit more each time. The problem with this is that I am unsure how to deal with branching rays (e.g. reflection and refraction) unless I have a maximum number of each to anticipate.
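Purely as an illustration of that idea (the struct fields and names here are made up, not from my actual code), each work-item would resume its ray from a state buffer and advance it at most a bounded number of steps per launch:

typedef struct {
    float3 pos;    //current position along the ray
    float3 dir;    //ray direction
    float  dist;   //total distance marched so far
    int    done;   //set once the ray hits or escapes
} RayState;

__kernel void march_step(__global RayState* rays, const int max_steps)
{
    int gid = get_global_id(0);
    RayState r = rays[gid];
    for(int s = 0; s < max_steps && !r.done; s++)
    {
        //one marching step: estimate distance, advance pos, test for a hit
    }
    rays[gid] = r;   //save progress so the next launch can continue
}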
Anyone have any pointers on what I should do to remedy this problem? I'm quite the greenhorn with OpenCL and have been having this issue for quite some time. I feel as though I'm doing something wrong, or fundamentally misusing OpenCL, since my single work-items have a lot of logic behind them, but I don't know how to split the task, as it is just a series of steps, checks, and adjustments.
The crash you are experiencing is caused by nVIDIA's HW watchdog timer. Also, the OS may detect the GPU as unresponsive and reset it (at least Windows 7 does).
You can avoid it by many ways:
Improve/optimize your kernel code to take less time
Buy faster Hardware ($$$$)
Disable the watchdog timer (but it is not an easy task, and not all devices have the feature)
Reduce the amount of work queued to the device each time, by launching multiple small kernels (NOTE: There is a small overhead of doing it this way, introduced by the launch of each small kernel)
The easiest and most straightforward solution is the last one. But if you can, try the first one as well.
As an example, a call like this (1000x1000 = 1M work-items of global size):

size_t offset[2] = {0, 0};
size_t global[2] = {1000, 1000};
clEnqueueNDRangeKernel(queue, kernel, 2, offset /*Offset*/, global /*Global*/, ... );

can be split up into many small calls ((100x100) x (10x10) = 1M). Since each global size is now 100 times smaller, the watchdog should not be triggered:

size_t global[2] = {100, 100};
size_t offset[2];
for(int i = 0; i < 10; i++)
{
    for(int j = 0; j < 10; j++)
    {
        offset[0] = i*100;
        offset[1] = j*100;
        clEnqueueNDRangeKernel(queue, kernel, 2, offset /*Offset*/, global /*Global*/, ... );
    }
}
As the title says, when I run my OpenCL kernel, the entire screen stops redrawing: the image displayed on the monitor stays the same until my program is done with its calculations. This is true even if I unplug the monitor from my notebook and plug it back in - the same image is always displayed. The computer does not seem to react to mouse movement either - the cursor stays in the same position.
I am not sure why this happens. Could it be a bug in my program, or is this standard behaviour?
While searching on Google I found this thread on AMD's forum, where some people suggested it's normal, as the GPU can't refresh the screen while it is busy with computations.
If this is true, is there still any way to work around it?
My kernel computation can take up to several minutes, and having my computer practically unusable for that whole time is really painful.
EDIT1: this is my current setup:
graphics card is an ATI Mobility Radeon HD 5650 with 512 MB of memory and the latest Catalyst beta driver from AMD's website
the graphics is switchable (Intel integrated / ATI dedicated), but I have disabled switching in the BIOS, because otherwise I could not get the driver working on Ubuntu
the operating system is Ubuntu 12.10 (64-bit), but this happens on Windows 7 (64-bit) as well
my monitor is plugged in via HDMI (but the notebook screen freezes too, so this should not be the problem)
EDIT2: so after a day of playing with my code, I took the advice from your responses and changed my algorithm to something like this (in pseudocode):
for (cl_ulong chunk = 0; chunk < num_chunks; chunk += chunk_size)
{
    /* set kernel arguments that are different for each chunk */
    clSetKernelArg(/* ... */);

    /* schedule kernel for next execution */
    clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);

    /* read out the results from kernel and append them to output array on host */
    clEnqueueReadBuffer(cmd_queue, of_buf, CL_TRUE, 0, chunk_size, output + chunk, 0, NULL, NULL);
}
So now I split the whole workload on the host and send it to the GPU in chunks. For each chunk of data I enqueue a new kernel, and the results I get from it are appended to the output array at the correct offset.
Is this how you meant the calculation should be divided?
This seems to be the way to remedy the freeze problem, and as a bonus I am now able to process data much larger than the available GPU memory. I still have to do some proper performance measurements to find a good chunk size, though...
Whenever a GPU is running an OpenCL kernel, it is completely dedicated to OpenCL. Some modern Nvidia GPUs are the exception - I think from the GeForce GTX 500 series onwards - which can run multiple kernels if those kernels do not use all available compute units.
Your solutions are to divide your calculation into multiple short kernel calls, which is the best all-round solution since it will work even on single-GPU machines, or to invest in a cheap GPU for driving your display.
If you are going to run long kernels on your GPUs, then you must disable timeout detection and recovery for GPUs, or make the timeout delay longer than the maximum kernel runtime (the latter is better, as bugs can still be caught); see here for how to do this.
Every time I have had a display freeze or "Display driver stopped responding and has recovered" it's been due to a bug. It can freeze the whole system and the only thing I can do is reset. Instead, now I develop on the CPU first. This never crashes my whole system. It's easier to debug this way as well since I can use printf. Once I got my code working bug free on the CPU I try it on the GPU.
I am new to opencl and encountered a similar problem. I found a short calculation works OK, but a longer one freezes the mouse cursor. For my problem, Windows leaves a yellow triangle in the tray area, and puts a message in the event log about "Display driver stopped responding and has recovered". The solution I found is to break up the calculation into small parts that take no more than a couple of seconds each. These run back to back, yet apparently let the video driver in enough to keep it happy. If I set global_work_size to a value high enough to maximize throughput, the video response is painfully slow, but the driver restart/mouse freeze problem never occurs.
I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional for simplicity) kernel is executed with local workgroup size 500 while its physical maximum is 100, and the kernel looks for example like this:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(i);
}
then it could be converted automatically to an execution with work group size 100 on this kernel:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}
However, it seems that this is not done by default. Why not? Is there a way to make this process automated (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and can you give me one)?
I think the origin of CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.
Multiple threads run simultaneously on compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family, there is a hardware limit on the number of available stack entries (every stack entry has sub-entries). This in essence limits the number of threads each compute unit can handle simultaneously.
As for whether the compiler could do this to make it possible: it could work, but understand that it would mean recompiling the kernel all over again, which isn't always possible. I can imagine situations where developers dump the compiled kernel for each platform in a binary format and ship it with their software, just for "not so open-source" reasons.
Those constants are queried from the device by the compiler in order to determine a suitable work group size at compile time (where compiling of course refers to compiling the kernel). I might be getting you wrong, but it seems you're thinking of setting those values yourself, which isn't the case.
The responsibility is within your code to query the system's capabilities and be prepared for whatever hardware it will run on.
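For example (a minimal sketch - device and kernel here are placeholder handles, assumed to have been obtained already), the limits can be queried at runtime like this:

/* Device-wide upper bound on work-group size */
size_t max_wg_size;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);

/* The limit for a particular kernel can be lower still (e.g. due to register usage) */
size_t kernel_wg_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wg_size), &kernel_wg_size, NULL);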