As the title says, when I run my OpenCL kernel the entire screen stops redrawing (the image displayed on monitor remains the same until my program is done with calculations. This is true even in case I unplug it from my notebook and plug it back - allways the same image is displayed) and the computer does not seem to react to mouse moves either - the cursor stays in the same position.
I am not sure why this happens. Could it be a bug in my program, or is this a standard behaviour ?
While searching on Google I found this thread on AMD's forum and some people there suggested it's normal as the GPU can't refresh the screen, when it is busy with computations.
If this is true, is there still any way to work around this ?
My kernel computation can take up to several minutes and having my computer practically unusable for whole that time is really painful.
EDIT1: this is my current setup:
graphics card is ATI Mobility Radeon HD 5650 with 512 MB of memory and latest Catalyst beta driver from AMD's website
the graphics is switchable - Intel integrated/ATI dedicated card, but
I have disabled switching in BIOS, because otherwise I could not get
the driver working on Ubuntu.
the operating system is Ubuntu 12.10 (64-bits), but this happens on Windows 7 (64-bits) as well.
I have my monitor plugged in via HDMI (but the notebook screen freezes
too, so this should not be a problem)
EDIT2: so after a day of playing with my code, I took the advices from your responses and changed my algorithm to something like this (in pseudo code):
for (cl_ulong chunk = 0; chunk < num_chunks; chunk += chunk_size)
{
/* set kernel arguments that are different for each chunk */
clSetKernelArg(/* ... */);
/* schedule kernel for next execution */
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);
/* read out the results from kernel and append them to output array on host */
clEnqueueReadBuffer(cmd_queue, of_buf, CL_TRUE, 0, chunk_size, output + chunk, 0, NULL, NULL);
}
So now I split the whole workload on host and send it to GPU in chunks. For each chunk of data I enqueue a new kernel and the results that I get from it are appended to the output array at a correct offset.
Is this how you meant that the calculation should be divided ?
This seems like the way to remedy the freeze problem and even more now I am able to process data much larger than the available GPU memory, but I will yet have to make some good performance meassurements, to see what is the good chunk size...
Whenever a GPU is running an OpenCL kernel it is completely dedicated to OpenCL. Some modern Nvidia GPUs are the exception, I think from the GeForce GTX 500 series onwards, which could run multiple kernels if those kernels did not use all available compute units.
Your solutions are to divide your calculations into multiple short kernel calls, which is the best all round solution since it will work even on single GPU machines, or to invest in a cheap GPU for driving your display.
If you are going to run long kernels on your GPUs then you must disable timeout detection and recovery for GPUs or make the timeout delay longer than the maximum kernel runtime (better as bugs can still be caught), see here for how to do this.
Every time I have had a display freeze or "Display driver stopped responding and has recovered" it's been due to a bug. It can freeze the whole system and the only thing I can do is reset. Instead, now I develop on the CPU first. This never crashes my whole system. It's easier to debug this way as well since I can use printf. Once I got my code working bug free on the CPU I try it on the GPU.
I am new to opencl and encountered a similar problem. I found a short calculation works OK, but a longer one freezes the mouse cursor. For my problem, Windows leaves a yellow triangle in the tray area, and puts a message in the event log about "Display driver stopped responding and has recovered". The solution I found is to break up the calculation into small parts that take no more than a couple of seconds each. These run back to back, yet apparently let the video driver in enough to keep it happy. If I set global_work_size to a value high enough to maximize throughput, the video response is painfully slow, but the driver restart/mouse freeze problem never occurs.
Related
I'm currently building a ray marcher to look at things like the mandelbox, etc. It works great. However, with my current program, it uses each worker as a ray projected from the eye. This means that there is a large amount of execution per worker. So when looking at an incredibly complex object or trying to render with large enough precision it causes my display drivers to crash because the kernel was taking too long to execute on a single worker. I'm trying to avoid changing my registry values to make the timeout longer as I want this application to work on multiple computers.
Is there any way to remedy this? As it stands the executions of each work-item are completely independent of the work items nearby. I've contemplated subscribing a buffer to the GPU that would store the current progress on that ray and only execute a small amount of iterations. Then, I would just call the program over and over and the result would hopefully refine a bit more. The problem with this is that I am unsure how to deal with branching rays (eg. reflecting and refraction) unless I have a max number of each to anticipate.
Anyone have any pointers on what I should do to remedy this problem? I'm quite the greenhorn to OpenCL and have been having this issue for quite some time. I feel as though I'm doing something wrong or misusing OpenCL principally since my single workitems have a lot of logic behind them, but I don't know how to split the task as it is just a series of steps and checks and adjustments.
The crash you are experiencing is caused by the HW watchdog timer of nVIDIA. Also, the OS may as well detect the GPU as not responsive and reboot it (at least Windows7 does it).
You can avoid it by many ways:
Improve/optimize your kernel code to take less time
Buy faster Hardware ($$$$)
Disable the watchdog timer (but is not an easy task, and not all the devices have the feature)
Reduce the amount of work queued to the device each time, by launching multiple small kernels (NOTE: There is a small overhead of doing it this way, introduced by the launch of each small kernel)
The easier and straightforward solution is the last one. But if you can, try the first one as well.
As an example, a call like this (1000x1000 = 1M work items, Global size):
clEnqueueNDRangeKernel(queue, kernel, 2, NDRange(0,0)/*Offset*/, NDRange(1000,1000)/*Global*/, ... );
Can be split up in many small calls of ((100x100)x(10x10) = 1M ). Since the global size is now 100 times smaller the watchdog should not be triggered:
for(int i=0; i<10; i++)
for(int j=0; j<10; j++)
clEnqueueNDRangeKernel(queue, kernel, 2, NDRange(i*100,j*100)/*Offset*/, NDRange(100,100)/*Global*/, ... );
I am using AMD Radeon HD 7700 GPU. I want to use the following kernel to verify the wavefront size is 64.
__kernel
void kernel__test_warpsize(
__global T* dataSet,
uint size
)
{
size_t idx = get_global_id(0);
T value = dataSet[idx];
if (idx<size-1)
dataSet[idx+1] = value;
}
In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values:
dataSet[0]=0
dataSet[1]=0
dataSet[2]=1
...
dataSet[63]=62
dataSet[64]=63
dataSet[65]=63
dataSet[66]=65
...
dataSet[127]=126
However, I found dataSet[65] is 64, not 63, which is not as my expectation.
My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should get 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
You are invoking undefined behaviour. If you wish to access memory another thread in a workgroup is writing you must use barriers.
In addition assume that the GPU is running 2 wavefronts at once. Then dataSet[65] indeed contains the correct value, the first wavefront has simply not been completed yet.
Also the output of all items as 0 is also a valid result according to spec. It's because everything could also be performed completely serially. That's why you need the barriers.
Based on your comments I edited this part:
Install http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Optimizing branching within a certain amount of threads is only a small part of optimization. You should read on how AMD HW schedules the wavefronts within a workgroup and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup). The branching also affects the execution of the whole workgroup as the effective time to run it is basically the same as the time to execute the single longest running wavefront (It cannot free local memory etc until everything in the group is finished so it cannot schedule another workgroup). But this also depends on your local memory and register usage etc. To see what actually happens just grab CodeXL and run GPU profiling run. That will show exactly what happens on the device.
And even this applies only to just the hardware of current generation. That's why the concept is not on the OpenCL specification itself. These properties change a lot and depend a lot on the hardware.
But if you really want to know just what is AMD wavefront size the answer is pretty much awlways 64 (See http://devgurus.amd.com/thread/159153 for reference to their OpenCL programming guide). It's 64 for all GCN devices which compose their whole current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for nvidia it's 32 in general).
CUDA model - what is warp size?
I think this is a good answer which explains the warp briefly.
But I am a bit confused about what sharpneli said such as
" [If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it. ]".
I think the local size which means the group size is set by the programmer. But when the implement occurs, the subdivied group is set by hardware like warp.
I am implementing a solution using OpenCL and I want to do the following thing, say for example you have a large array of data that you want to copy in the GPU once and have many kernels process batches of it and store the results in their specific output buffers.
The actual question is here which way is faster? En-queue each kernel with the portion of the array it needs to have or pass out the whole array before hand an let each kernel (in the same context) process the required batch, since they would have the same address space and could each map the array concurrently. Of course the said array is read-only but is not constant as it changes every time I execute the kernel(s)... (so I could cache it using a global memory buffer).
Also if the second way is actually faster could you point me with direction on how this could be implemented, as I haven't found anything concrete yet (although I am still searching :)).
Cheers.
I use the second memory normally. Sharing the memory is easy. Just pass the same buffer to each kernel. I do this in my real-time ray-tracer. I render with one kernel and post-process (image process) with another.
Using the C++ bindings it looks something like this
cl_input_mem = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_uchar4)*npixels, NULL, &err);
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want one kernel to operate on a different segment of the array/memory you can pass an offset value to the kernel arguments and add that to e.g. the global memory pointer for each kernel.
I would use the first method if the array (actually the sum of each buffer - including output) does not fit in memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when I render on multiple devices. For example I have one GTX 580 render the upper half of the screen and the other GTX 580 rendering the lower half (actually I do this dynamically so one device may render 30% while the other 70% but that's besides the point). I have each device only render it's fraction of the output and then I assemble the output on the CPU. With PCI 3.0 the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate even for 1920x1080 images.
On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) remotely; on our office workstations with AMD GPUs, the call is nonblocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a bit hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRange using std::async/std::finish in the new C++11 spec, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, etc. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere
cl::NDrange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is gotten from the function parameters)*/
//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
//Update the offset for the current device
offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);
//Start a new thread for each call to enqueueNDRangeKernel
enqueueReturns.push_back(std::async(
std::launch::async,
&cl::CommandQueue::enqueueNDRangeKernel,
&queues[i],
kernels[kernel],
offset,
increment,
local,
(const std::vector<cl::Event>*)NULL,
(cl::Event*)NULL));
//Without those last two casts, the program won't even compile
}
//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
execError = enqueueReturns[i].get();
if(execError != CL_SUCCESS)
std::cerr << "Informative error omitted due to length" << std::endl
}
The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB and have it step into it the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put in a print statement after the run, I can see that it takes some time between prints.
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - The buffer that I'm passing to the kernel as an argment (and the one I mention, above, that seems to get passed between the GPUs) is declared as using CL_MEM_COPY_HOST_PTR. I had been using CL_READ_WRITE_BUFFER, with the same effect happening.
I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows, for each device you need to create seperate:
queues - So you can represent each device and enqueue orders to it
buffers - One buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
kernel - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call EnqueueNDRangeKernel for each queue in separate threads. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
Yes, you're right. AFAIK - the nvidia implementation has a synchronous "clEnqueueNDRange". I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, save using a different implementation (and hence device).
I have a thoroughly complex kernel processing audio input data. It will run for a couple of minutes, 60 times a second, and then hang. That's on the GPU; on the CPU it will run for hours. The input data are constantly changing, but each variable is always within proscribed ranges. I have inserted test code before uploading the inputs to the kernel each frame; in this test code, I can force these inputs to be well below their valid input range, but it still will eventually crash. (Say the valid range for a particular input is 0->400; I can force it to 0->1 and it will STILL eventually crash. I can force it to be below 0.1 and it will still ultimately bite the dust.) However, if I force the input variables to zero, the GPU will happily dance for hours. Of course, that input-free dance is not so particularly interesting.
I'm at a loss so far, though I have clues. I can make it crash much faster than 2 minutes if an input variable is high in its approved range. I can make it crash in less then 10 seconds under the right circumstances. BUT, I can't seem to _back_off_of_ those certain circumstances such that they go away. As said above, I can force the input vars into ridiculously small portions of their valid range, and the kernel (let's call him Harlan Sanders) will eventually go belly-up. BUT, if they're forced to actual zero, no problems puppy, we can run all day long.
To repeat, I'm a bit at a loss - although I have things that look like clues, I have not yet figured out what they are hinting at, though I've been trying for a few days. Frankly, I do not expect to find a real solution by asking here; whenever I stumble over a problem in opencl it seems that my fate is to be the first to articulate that particular problem. I guess this is part of the fun of being in on a technology during its infancy!!!!!!!!!! BUT, I want to do some serious, sustainable work with this "baby" (or, maybe, "toddler").
Op details: MacBook Pro 2010, OS 10.6.8, nv 330M GPU, xcode 3.2.5, shorts, teeshirt.
bonus P.S. for those who've read this far, including a related question:
My laptop, soldier that it has proved to be, is not powerful enough for the next stage. I must sell some stocks/bonds and purchase a Mac Pro. I'm looking at the ATI 5870. So, PERHAPS my problem will simply go away when I compile the .cl for the ATI??? Maybe I have run into a bug in the nV implementation. Maybe my kernel is so complex that I'm running into undetected resource limits (it's 1300 lines of code). So, SINCE I run fine on the CPU, perhaps I'll have no bugs, or different bugs, on the ATI card???
Any thoughts?
Thanks, guys & dolls --
Dave
Use "cl_" data types on the CPU side, because maybe you are not coping data the right way, or it is not being understood by the GPU. This could lead to GPU hangs on invalid pointers while handing the data.
You should also try -Werror, and read the error output. You can be doing smt wrong.
Without any code, we can only guess. But I haven't found any bug in the actual OpenCL NV or ATI implementations.
Make sure you release all resources. Events returned by Enqueue functions must be released. This error sometimes occurs after accessing buffers out of range.