My OpenCL program (don't be scared, this is auto-generated code for 3D CFD) shows strange behavior: a lot of time is spent in the opencl_enq_job_* procedures (opencl_code.c), which contain only asynchronous OpenCL commands:
clEnqueueWriteBuffer(..,CL_FALSE,...,&event1);
clSetKernelArg(...);
...
clEnqueueNDRangeKernel(...,1,&event1,&event2);
clEnqueueReadBuffer(...,CL_FALSE,...,1,&event2,&event3);
clSetEventCallback(event3,...);
clFlush(...);
In the program output, the time spent in opencl_enq_job_* is shown as:
OCL waste: 0.60456248727985751
That means 60% of the run time is wasted in those procedures. Most of that time (92%) is spent in the clEnqueueReadBuffer call and ~5% in clSetEventCallback.
Why so much? What's wrong with this code?
My configuration:
Platform: NVIDIA CUDA
Device 0: Tesla M2090
Device 1: Tesla M2090
Nvidia cuda_6.0.37 SDK and drivers.
Linux localhost 3.12.0 #6 SMP Thu Apr 17 20:21:10 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux
Update: Nvidia accepted this as a bug.
Update1: On my laptop (MBP15, AMD GPU, Apple OpenCL) the program shows similar behavior, but spends most of its time (>99%) in clFlush instead. With the CUDA SDK the program works without clFlush; on Apple, the program hangs without clFlush (submitted tasks never finish).
I have tried memory pinning and it significantly improved the situation!
Problem was solved.
I think this is not really a bug; I just missed something in the documentation. My investigation led me to the conclusion that the driver simply cannot perform an async load/store of a non-pinned buffer, even if non-blocking calls are used. The driver just waits for an opportunity to store/load the data, which can only happen after the task finishes, and this breaks the parallelism.
Related
In the case of CUDA, NSIGHT gives us detailed timelines of each kernel.
Is there a similar tool for Intel OpenCL? Basically I want to see whether my three kernels are running concurrently or not.
I'm seeing massive delays between a kernel being submitted to an AMD GPU and actually executed. My program is doing blocking writes/reads (with blocking=CL_TRUE) to ensure that I/O isn't interfering with the kernel. I then use clGetEventProfilingInfo to get info on kernel queueing, submitting, starting and ending. The data (and code) below show that the kernel spends about 5.5 seconds in the submitted state and then 5.5 seconds running. In general, it looks like the submitted time scales with the running time. I've looked at a number of forum posts about delays in kernel execution (for instance, http://devgurus.amd.com/thread/166587) but there doesn't seem to be a resolution there. I've checked that the GPU is not in low-power mode. Has anyone else seen this or have suggestions of how to diagnose it?
write 131.000000 ms
kernel queued->submitted 0.022348 ms
kernel submitted->started 5553.957663 ms
kernel started->ended 5529.893060 ms
read 39.000000 ms
cl_ulong end, queued, start, submit;
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_SUBMIT, sizeof(submit), &submit, NULL);
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(jniContext->exec_event,
    CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
Follow up: after upgrading from the latest release drivers (v13.4, released 5/29/2013) to the latest beta drivers (released 11/22/2013) we're no longer seeing this performance issue. This problem occurred on 64-bit CentOS using an AMD A10-6700, but if you're seeing this issue and have a different chipset I'd recommend upgrading to the latest beta drivers and seeing if that fixes it.
We know that the microcode in Intel processors is encrypted (as stated in the "Intel® 64 and IA-32 Architectures Software Developer's Manual"). One cannot program Intel microcode as one wishes.
So, does anyone know about AMD microcode? Is the microcode of AMD CPUs encrypted?
Does anyone know how to program microcode? The question is not limited to AMD or Intel CPUs.
Thank you in advance!
(PS: I mean the microcode in CPUs, not in GPUs.)
This article provides information on the microcode of AMD's Opteron (K8) family. It claims that it is not encrypted, and it describes the microcode format and how to update the microcode.
Does anyone know how to program microcode? The question is not limited to AMD or Intel CPUs.
Not too many people do that kind of work. It's often written with a C compiler tweaked to generate the necessary microcode.
To answer your question "are there other processors accepting microcode?": FPGAs are programmed with nothing but such code. They are not CPUs; what you program in them is written at the hardware level. The microcode configures the gates, and the result is your program. It can become very tedious, as everything runs in parallel (true hardware parallelism).
AMD microcode for recent processors is, indeed, encrypted and authenticated, much like Intel's. You need to have the proper crypto key to sign a microcode update the processor will accept.
Intel does it by embedding a hash of the valid key(s?) in the processor's mask (hardware read-only) microcode: the key itself is too large to be worth embedding in the processor, so it is present in the update data itself, as seen here. Also, the Intel microcode update is actually a unified processor-package update: it updates more than just the microcode for the decode unit. It can update all sorts of internal processor parameters, as well as control sequences for units other than the decoder... it also contains opcode (and likely microcode) that the processor runs before(?)/after applying the update.
I want to run heterogeneous kernels that execute asynchronously on a single GPU. I think this is possible on an Nvidia Kepler K20 (or any device with compute capability 3.5+) by launching each of these kernels into a different stream; the runtime then maps them to different hardware queues based on resource availability.
Is this feature accessible in OpenCL?
If it is so, what is the equivalent of a CUDA 'Stream' in OpenCL?
Do Nvidia drivers support such an execution on their K20 cards through OpenCL?
Is there any AMD GPU that has a similar feature (or is anything similar in development)?
An answer to any of these questions would help me a lot.
In principle, you can use multiple OpenCL command queues to achieve CKE (Concurrent Kernel Execution); you can launch them from different CPU threads. Here are a few links that might help you get started:
How do I know if the kernels are executing concurrently?
http://devgurus.amd.com/thread/142485
I am not sure how it would work with NVIDIA Kepler GPUs, as we are having strange issues using OpenCL on the K20 GPU.
Is there any equivalent of IO Completion Ports on Mac OS X for implementing asynchronous I/O on files?
Thank you.
Unfortunately, no.
kqueue is the mechanism for high-performance asynchronous I/O on OS X and FreeBSD. Like Linux epoll, it signals at the opposite end of the I/O compared to IOCPs (Solaris, AIX, Windows): kqueue and epoll signal when it is OK to attempt a read or a write, whereas an IOCP calls back when a read or a write has completed. Many find the signalling model used by epoll and kqueue difficult to understand compared to the IOCP model. So while kqueue and IOCPs are both mechanisms for high-performance asynchronous I/O, they are not directly comparable.
It is possible to implement IOCPs using epoll or kqueue and a thread pool. You can find an example of that in the Wine project.
Correction:
Mac OS X has an implementation of IOCP-like functions in Grand Central Dispatch. It uses the GCD thread pool and the kqueue APIs internally. The convenience functions are dispatch_read and dispatch_write. Like an IOCP, the asynchronous I/O functions in GCD signal at the completion of an I/O task, not when the file descriptor is ready as the raw kqueue API does.
Beware that the GCD APIs are not "fork safe" and cannot be used on both sides of a POSIX fork without an exec. If you do, the function call will never return.
Also beware that kqueue on Mac OS X is rumored to be less performant than kqueue on FreeBSD, so it might be better suited for development than for production. GCD (libdispatch) is open source, however, and can be used on other platforms as well.
Update Jan 3, 2015:
FreeBSD has had GCD since version 8.1. Wine has an epoll-based IOCP for Linux. It is therefore possible to use the IOCP design to write server code that runs on Windows, Linux, Solaris, AIX, FreeBSD and Mac OS X (and iOS, but not Android). This is different from using kqueue and epoll directly, where a Windows server must be restructured to use IOCPs and will very likely be less performant.
Since you asked for a Windows-specific feature on OS X: instead of using kqueue directly, you may try libevent. It's a thin wrapper over different AIO mechanisms, and it works on both platforms.
Use Kqueue
http://en.wikipedia.org/wiki/Kqueue