I implement some image processing in OpenCL on a GPU. In the host program I launch this kernel 4 times; the total time is about 13 ms (according to the AMD profiler), which I think is a good result. But if I measure the kernel execution time on the host with QueryPerformanceTimer, it shows about 26 ms. The clEnqueueNDRangeKernel call itself takes less than 1 ms. Where do the missing 13 ms (26-13) go, and how can I fix it? I launch it on GPU 1: AMD Radeon HD 6900 Series, using AMD SDK 3.0. If I launch the kernel once but add a 4-iteration loop inside the kernel, the result is the same.
clEnqueueNDRangeKernel, as the name says, is an "enqueue" call: it only queues work on a command queue. That does not mean the work is completed before the call returns; in fact, it may not even have started.
The API has probably just packed the work into a tidy structure of commands and added it to the queue (the submit phase).
You have to measure the kernel execution using OpenCL events (cl_event) on a queue created with profiling enabled. That gives the real execution time on the device.
Alternatively, you can measure the total "round-trip" time by timing from the enqueue call to clFinish. But that includes all the overheads that are usually hidden in a pipelined scenario, so the first approach is normally preferred.
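For reference, here is a minimal sketch of the event-based approach (OpenCL 1.x host API); the context, device and kernel are assumed to come from the usual setup, and the work sizes are just examples:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Minimal sketch: time one kernel launch with event profiling. */
    static void time_kernel(cl_context context, cl_device_id device, cl_kernel kernel)
    {
        cl_int err;
        cl_command_queue queue =
            clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

        cl_event evt;
        size_t global = 8192, local = 64;            /* example sizes */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                               0, NULL, &evt);
        clWaitForEvents(1, &evt);                    /* or clFinish(queue) */

        cl_ulong start = 0, end = 0;                 /* timestamps in nanoseconds */
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        printf("kernel execution: %.3f ms\n", (end - start) * 1e-6);

        clReleaseEvent(evt);
        clReleaseCommandQueue(queue);
    }

The difference between this device-side number and what the host timer reports is the queueing, submission and other host-side overhead.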
I'm trying to see how the OpenCL programming model performs on GPUs. While testing it, I have to launch the kernel using clEnqueueNDRangeKernel(). I'm calling this function multiple times so that I can see how it performs when two or four concurrent kernels are launched.
I observe that the program takes the same amount of time as launching one kernel, so I'm assuming it just runs the kernel once, because there is no way it takes the same amount of time to run two or four concurrent kernels.
Now I want to know how to launch multiple kernels on one GPU.
e.g. I want to launch something like:
clEnqueueNDRangeKernel()
clEnqueueNDRangeKernel()
How can I do this?
First of all, check whether your device supports concurrent kernel execution. The latest AMD and Nvidia cards do.
Then, create multiple command queues. If you enqueue kernels into the same queue, they will be executed sequentially, one after another.
Finally, check that the kernels were indeed executed in parallel. Use the profilers from the SDK or OpenCL events to gather profiling info.
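A minimal host-side sketch of that setup, assuming a context, device and kernel already exist (the names and sizes are placeholders, and whether the two kernels actually run concurrently still depends on the hardware and driver):

    #include <CL/cl.h>

    /* Two in-order queues on the same device so the driver can overlap the kernels. */
    static void launch_two_queues(cl_context context, cl_device_id device,
                                  cl_kernel kernel)
    {
        cl_int err;
        cl_command_queue q0 = clCreateCommandQueue(context, device,
                                                   CL_QUEUE_PROFILING_ENABLE, &err);
        cl_command_queue q1 = clCreateCommandQueue(context, device,
                                                   CL_QUEUE_PROFILING_ENABLE, &err);

        size_t global = 4096;                        /* example size */
        cl_event e[2];
        clEnqueueNDRangeKernel(q0, kernel, 1, NULL, &global, NULL, 0, NULL, &e[0]);
        clEnqueueNDRangeKernel(q1, kernel, 1, NULL, &global, NULL, 0, NULL, &e[1]);

        clFlush(q0);                                 /* push the work to the device */
        clFlush(q1);
        clWaitForEvents(2, e);

        clReleaseEvent(e[0]);
        clReleaseEvent(e[1]);
        clReleaseCommandQueue(q0);
        clReleaseCommandQueue(q1);
    }

With profiling enabled on both queues you can then compare the CL_PROFILING_COMMAND_START/END timestamps of the two events to see whether the executions overlapped.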
I am running my OpenCL C code on our institution's GPU cluster, which has 8 nodes; each node has an Intel Xeon 8-core processor and 3 NVIDIA Tesla M2070 GPUs (24 GPUs in total). I need a way, from my host code, to identify which GPUs are already occupied and which are free, and to submit my jobs to the available GPUs. The closest answers I could find were in
How to programmatically discover specific GPU on platform with multiple GPUs (OpenCL 1.1)?
How to match OpenCL devices with a specific GPU given PCI vendor, device and bus IDs in a multi-GPU system?
Can anyone help me out with how to choose a node, and a GPU on that node, which is free for computation? I am writing in OpenCL C.
Gerald
Unfortunately, there is no standard way to do such a thing.
If you want to squeeze the full power of the GPUs for computation and your problem is not a memory hog, I can suggest using two contexts per device: while the kernels in the first one finish their computation, the kernels in the second one are still working, and you have time to fill the buffers with data and start the next task in the first context, and vice versa. In my case (AMD GPU, OpenCL 1.2) it saves from 0 to 20% of computation time. Three contexts sometimes give slower execution, sometimes faster, so I do not recommend it as a standard technique, but you can try. Four or more contexts are useless, in my experience.
Have a command queue for each device, then use OpenCL events with each kernel submission and check their state before submitting a new kernel for execution. Whichever command queue has the fewest unfinished kernels is the one you should enqueue to.
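A sketch of that bookkeeping, assuming you keep the event from the most recent enqueue on each queue in an array (the function and parameter names are placeholders):

    #include <stddef.h>
    #include <CL/cl.h>

    /* Return the index of a queue whose last kernel has already finished,
       or -1 if everything is still busy. last_evt[i] holds the event from
       the most recent enqueue on queue i, or NULL if that queue is unused. */
    static int pick_free_queue(const cl_event *last_evt, int num_queues)
    {
        for (int i = 0; i < num_queues; ++i) {
            if (last_evt[i] == NULL)
                return i;                            /* queue never used yet */
            cl_int status;
            clGetEventInfo(last_evt[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                           sizeof(status), &status, NULL);
            if (status == CL_COMPLETE)
                return i;                            /* last kernel finished */
        }
        return -1;                                   /* all queues still busy */
    }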
I am working on a program written in OpenCL and running on a Fusion APU (CPU+GPU on one die). I want to get some performance counters such as the number of instructions, the number of branches, and so on. I have two tools on hand: AMD APP Profiler and CodeAnalyst. When I used the APP Profiler, I found that it seems to provide instruction counters only for the GPU, not for the CPU. Then I used CodeAnalyst, but three confusions arose.
1. APP Profiler reports ALUInsts (the number of executed ALU instructions per work-item) as about 70000. The whole thread space on the GPU has 8192 threads, so I intuitively think 70000 * 8192 instructions were executed by the GPU. Is that right?
2. When I use CodeAnalyst to measure instructions for the same program on the CPU side, it only gives counters like "Ret inst" and "Ret branch", but I am not sure about one thing: this program runs on both the CPU and the GPU at the same time, so what are these counters for? The CPU only, the GPU only, or the sum?
3. Whatever these counters are for, I found that the value of Ret Inst (i.e. retired instructions) is about 40000, which seems too small for the whole program. I would guess a program's instruction count should be on the order of billions, so how can it be only 40,000? The attached pic shows the results.
Can anyone help me resolve these confusions? I am just a beginner here, so any help is appreciated. Thanks!
As far as I know, CUDA has streams, which make it possible for memory transfers and kernel execution to run at the same time (of course, the data involved in the transfer and in the kernel execution must be different). Can I do this with OpenCL? Because sometimes, when you do processing on video, the bottleneck is the memory transfer.
Yes, you can overlap memory operations and kernel execution in OpenCL. Just set the blocking_read parameter of the clEnqueueReadBuffer function to CL_FALSE. But you need to make sure that the transfer has been completed before you operate on the data. Use events for that.
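A sketch of that pattern, assuming two queues on the same device and already-created buffers and kernel (the second queue, the buffer names and the sizes are assumptions for illustration, not requirements of the API):

    #include <CL/cl.h>

    /* Enqueue a kernel on compute_q, read its result (buf_a) on copy_q without
       blocking, and immediately enqueue the next kernel working on buf_b. */
    static void overlap_copy_and_compute(cl_command_queue compute_q,
                                         cl_command_queue copy_q,
                                         cl_kernel kernel,
                                         cl_mem buf_a, cl_mem buf_b,
                                         void *host_ptr, size_t bytes)
    {
        size_t global = 8192;                        /* example size */
        cl_event k_done, r_done;

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
        clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &global, NULL,
                               0, NULL, &k_done);

        /* Non-blocking read: returns immediately; the copy starts once k_done fires. */
        clEnqueueReadBuffer(copy_q, buf_a, CL_FALSE, 0, bytes, host_ptr,
                            1, &k_done, &r_done);

        /* Meanwhile the next kernel (working on buf_b) can already be enqueued. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_b);
        clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);

        /* Only touch host_ptr after the read event has completed. */
        clWaitForEvents(1, &r_done);

        clReleaseEvent(k_done);
        clReleaseEvent(r_done);
    }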
I was curious as to how the GPU executes the same kernel multiple times.
I have a kernel which is being queued hundreds (possibly thousands) of times in a row, and using the AMD App Profiler I noticed that it would execute clusters of kernels extremely fast, then like clockwork every so often a kernel would "hang" (i.e. take orders of magnitude longer to execute). I think it's every 64th kernel that hangs.
This is odd because each time through the kernel performs the exact same operations with the same local and global sizes. I'm even re-using the same buffers.
Is there something about the execution model that I'm missing (perhaps other programs or the OS accessing the GPU, or the timing frequency of the GPU memory)? I'm testing this on an ATI HD5650 card under Windows 7 (64-bit), with AMD APP SDK 2.5 and in-order queue execution.
As a side note, if I don't have any global memory accesses in my kernel (a rather impractical prospect), the profiler puts gaps between the quickly executing kernels, and where the slow-executing kernels used to be there is now a large empty gap in which none of my kernels are executing.
As a follow-up question, is there anything that can be done to fix this?
It's probable you're seeing the effects of your GPU's maximum number of concurrent tasks. Each enqueued task is assigned to one or more multiprocessors, which are frequently capable of running hundreds of work-items at a time from the same kernel, enqueued in the same call. Perhaps what you're seeing is the OpenCL runtime waiting for one of the multiprocessors to free up. This relates directly to occupancy: if the work size can't keep a multiprocessor busy through memory latencies and so on, it has idle cycles. The limit here depends on how many registers (private memory) and how much local memory your kernel requires. In summary, you want to write your kernel to operate on multiple pieces of data rather than enqueueing it many times.
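As a rough sketch of that last point: instead of enqueueing the same kernel N times over one chunk each, you can have each work-item loop over the chunks inside a single launch. The kernel below is only a placeholder to show the structure, not your actual computation:

    /* One launch covers all chunks; each work-item handles its index in
       every chunk instead of the kernel being enqueued once per chunk. */
    __kernel void process_all(__global const float *in,
                              __global float *out,
                              const int items_per_chunk,
                              const int num_chunks)
    {
        int gid = get_global_id(0);
        for (int c = 0; c < num_chunks; ++c) {
            int i = c * items_per_chunk + gid;
            out[i] = in[i] * 2.0f;                   /* placeholder computation */
        }
    }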
Did your measurement include reading back results from the apparently fast executions?