I'm seeing massive delays between a kernel being submitted to an AMD GPU and actually being executed. My program does blocking writes/reads (with blocking=CL_TRUE) to ensure that I/O isn't interfering with the kernel. I then use clGetEventProfilingInfo to get info on when the kernel was queued, submitted, started, and ended. The data (and code) below show that the kernel spends about 5 seconds in the submitted state and then 5 seconds running. In general, it looks like the submitted time scales with the running time. I've looked at a number of forum posts about delays in kernel execution (for instance, http://devgurus.amd.com/thread/166587) but there doesn't seem to be a resolution there. I've checked that the GPU is not in low-power mode. Has anyone else seen this, or does anyone have suggestions for how to diagnose it?
write 131.000000 ms
kernel queued->submitted 0.022348 ms
kernel submitted->started 5553.957663 ms
kernel started->ended 5529.893060 ms
read 39.000000 ms
cl_ulong end, queued, start, submit;
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_SUBMIT, sizeof(submit), &submit, NULL);
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
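For reference, the profiling counters are device timestamps in nanoseconds, so the intervals printed above would be computed roughly like this (a minimal sketch, not the exact logging code):

// OpenCL profiling timestamps are in nanoseconds; convert the deltas to ms.
printf("kernel queued->submitted %f ms\n", (submit - queued) * 1e-6);
printf("kernel submitted->started %f ms\n", (start - submit) * 1e-6);
printf("kernel started->ended %f ms\n", (end - start) * 1e-6);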
Follow-up: after upgrading from the latest release drivers (v13.4, released 5/29/2013) to the latest beta drivers (released 11/22/2013) we're no longer seeing this performance issue. This problem occurred on 64-bit CentOS using an AMD A10-6700, but if you're seeing this issue and have a different chipset I'd recommend upgrading to the latest beta drivers and seeing if that fixes it.
Could anyone offer troubleshooting ideas, or pointers on where/how to get more information about the difference between the sys and real times in the output below?
My understanding is that the command finished processing in the OS in about 4 seconds, but that I/O was then queued and processed for the remaining 38.3 seconds (is that right?). At this point it is somewhat of a black box to me how to get additional details.
time prealloc /myfolder/testfile 2147483648
real 42.5
user 0.0
sys 4.2
You are writing 2 GB to disk on an HP-UX system; this is most likely using spinning disks (physical hard disks).
The system is writing 2 GiB / 42 s ≈ 51 MB/s, which doesn't sound slow to me. Note that user and sys count only CPU time; time the process spends blocked waiting on disk I/O shows up in real but not in sys, which is why real (42.5 s) is so much larger than sys (4.2 s).
On these systems you can use tools such as sar. Use sar -ud 5 to see CPU and disk usage during your prealloc command; you will likely see disk usage pegged at 100%.
I have implemented some image processing in OpenCL on a GPU. The host program launches the kernel 4 times; the total time is about 13 ms according to the AMD profiler, which I think is a good result. But if I measure the kernel execution time on the host with QueryPerformanceCounter, it shows about 26 ms, while the clEnqueueNDRangeKernel call itself takes less than 1 ms. Where do the missing 13 ms (26 - 13) go, and how can I fix it? I'm launching on GPU 1: AMD Radeon HD 6900 Series, using AMD SDK 3.0. If I launch the kernel once but loop 4 times inside the kernel, the result is the same.
clEnqueueNDRangeKernel, as the name says, is an "enqueue" call, so it only queues work onto a command queue. That does not mean the work is completed before the call returns; in fact, it may not even have started yet.
The API has probably just packed the work into a tidy structure of commands and added it to the queue (the submit phase).
You have to measure the kernel execution using the event timer (cl_event) with a profiling-enabled queue. That is the real execution time on the device.
Alternatively, you can measure the total "round-trip" time from the enqueue call to clFinish. But that will include all the overheads that are usually hidden in a pipelined scenario, so the first approach is normally preferred.
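For example, a minimal sketch of the event-based approach (an existing context, device, kernel, and global_size are assumed; error checking omitted):

// Create the queue with profiling enabled so event timestamps are recorded.
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
    CL_QUEUE_PROFILING_ENABLE, &err);

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
    0, NULL, &evt);
clWaitForEvents(1, &evt); // make sure the kernel has actually finished

cl_ulong t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
    sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
    sizeof(t_end), &t_end, NULL);
printf("device execution time: %f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);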
My OpenCL program (don't be scared, this is auto-generated code for 3D CFD) shows strange behavior: a lot of time is spent in the opencl_enq_job_* procedures (opencl_code.c), which contain only asynchronous OpenCL commands:
clEnqueueWriteBuffer(..,CL_FALSE,...,&event1);
clSetKernelArg(...);
...
clEnqueueNDRangeKernel(...,1,&event1,&event2);
clEnqueueReadBuffer(...,CL_FALSE,...,1,&event2,&event3);
clSetEventCallback(event3,...);
clFlush(...);
In the program output, the time spent in opencl_enq_job_* is shown as:
OCL waste: 0.60456248727985751
That means 60% of the time is wasted in those procedures.
Most of that time (92%) is spent in the clEnqueueReadBuffer call and ~5% in clSetEventCallback.
Why so much? What's wrong with this code?
My configuration:
Platform: NVIDIA CUDA
Device 0: Tesla M2090
Device 1: Tesla M2090
Nvidia cuda_6.0.37 SDK and drivers.
Linux localhost 3.12.0 #6 SMP Thu Apr 17 20:21:10 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux
Update: Nvidia accepted this as a bug.
Update 1: On my laptop (MBP15, AMD GPU, Apple OpenCL) the program shows similar behavior, but waits mostly in clFlush (>99%). With the CUDA SDK the program works without clFlush; on Apple, the program hangs without clFlush (submitted tasks never finish).
I tried memory pinning and it significantly improved the situation!
The problem was solved.
I think this is not really a bug; I just missed something in the documentation. My investigation led me to the conclusion that the driver simply cannot perform an asynchronous load/store of a non-pinned buffer, even if non-blocking calls are used. The driver just waits for an opportunity to store/load the data, which arrives only after the task finishes, and this breaks parallelism.
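For anyone hitting the same issue, the usual OpenCL pinning idiom looks roughly like this (a sketch under assumptions, not the exact auto-generated code; context, queue, device_buf, and size are assumed, error checking omitted):

cl_int err;
// Let the driver allocate pinned (page-locked) host memory...
cl_mem pinned = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
    size, NULL, &err);
// ...and map it to get a host pointer the device can DMA from/to.
float *host_ptr = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE,
    CL_MAP_WRITE | CL_MAP_READ, 0, size, 0, NULL, NULL, &err);

// Non-blocking transfers from/to host_ptr can now genuinely overlap
// with kernel execution instead of waiting for the task to finish.
cl_event event1;
clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0, size, host_ptr,
    0, NULL, &event1);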
I accidentally wrote a while loop in a kernel that would never break, and I sent it to the GPU. After 30 seconds my screens started flickering; I realised what I had done and terminated the application by force. The problem is that I had to shut down the computer afterwards to make sure the kernels were gone. Therefore my questions are:
If I forcefully terminate the program (the program that launches the kernels) without it freeing the GPU resources (releasing buffers, queues, and kernels, and destroying the CL objects), will the kernels still run?
If they are still running, can I do anything to stop them? For example, release resources I no longer have a handle to?
If you are using an NVIDIA card, then by terminating the application you will eventually free the resources on the card to allow it to run again. This is because NVIDIA has a watchdog monitor on the device (which you can turn off).
If you are using an AMD card, you are out of luck AFAIK and will have to restart the machine after every crash.
When I call an OpenCL function that I wouldn't expect to create new threads (in this case, simply getting the platform IDs), my program creates 8 new threads.
cl_platform_id platforms[10] = {0};
cl_uint numberofplatforms = 0;
clGetPlatformIDs(10, platforms, &numberofplatforms); // this call creates 8 threads
Since I'm not creating a context, just asking for platform IDs to see what is available, why does this function create all these threads? I'm using Windows 7 64-bit on an i7 920 with HT (I suspect it creates 8 threads because I have 8 logical cores), with both the Intel and Nvidia SDKs installed (I have a GTS 250 and a GTX 560), and I'm linking against the Nvidia OpenCL library and using its headers.
This isn't a big concern, but what if I decide not to use OpenCL after analyzing the devices, only to have 8 useless threads lying around? Does anyone know why this happens?
Many OpenCL functions are non-blocking, meaning they issue commands to the device through a queue, and the threads are most likely used to manage the device while the host program continues to run the rest of the code.
To illustrate: when you call clEnqueueNDRangeKernel, the kernel is not necessarily run at once; the program continues executing code after the call returns. So I would guess this function passes some information to separate threads that control the compute device and make sure the kernel is eventually run.
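A minimal sketch of that asynchrony (queue, kernel, and global_size are assumed; do_other_host_work is a hypothetical placeholder for independent host code):

cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
    0, NULL, &evt);        // returns immediately; the kernel may not have started
do_other_host_work();      // hypothetical: host work overlapping the kernel
clWaitForEvents(1, &evt);  // blocks until the kernel has actually finished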