OpenCL kernel runs correctly on Intel GPU but fails with error -9999 on NVidia GPU

I'm working on a library that calculates various metrics of video streams.
It was implemented for CPU and GPU and validated successfully on the CPU and an Intel Xe GPU,
but recently I found an issue with an NVidia GPU.
In a few words: there are two kernels. The 1st kernel processes some input and writes intermediate results
to a global SVM buffer, and the 2nd kernel uses this data from the global buffer to calculate the results.
The global buffer was created with read/write access (CL_MEM_READ_WRITE) without any error and filled from the host with a call to clEnqueueFillBuffer().
It works correctly on the Intel Xe GPU, but on an NVidia card (GeForce 1030) I get error -9999 from clWaitForEvents() when
waiting for completion of the 1st kernel.
If I comment out the writes to the global buffer, no error is reported.
I checked the size of the buffer (it's correct, approximately 13 MB).
I checked the buffer's initial contents on the kernel side using printf and they are valid.
I checked the required access alignment (short, 2 bytes) and it is correct too.
What could be the reason for the above error?
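For reference, a minimal host-side sketch of the setup described above (the buffer size, kernel names, argument indices and work size are hypothetical, not the actual library code):

#include <CL/cl.h>

/* ctx, queue, kernel1, kernel2 and global_size are assumed to be created elsewhere. */
void run_two_pass(cl_context ctx, cl_command_queue queue,
                  cl_kernel kernel1, cl_kernel kernel2, size_t global_size)
{
    cl_int err;
    size_t buf_size = 13 * 1024 * 1024;              /* ~13 MB intermediate buffer */
    cl_mem inter = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buf_size, NULL, &err);

    cl_short fill = 0;                               /* pattern used to initialise the buffer */
    err = clEnqueueFillBuffer(queue, inter, &fill, sizeof(fill), 0, buf_size, 0, NULL, NULL);

    err = clSetKernelArg(kernel1, 0, sizeof(cl_mem), &inter);   /* 1st kernel writes here */
    err = clSetKernelArg(kernel2, 0, sizeof(cl_mem), &inter);   /* 2nd kernel reads it back */

    cl_event done1;
    err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &global_size, NULL, 0, NULL, &done1);
    err = clWaitForEvents(1, &done1);                /* this is the call that returns -9999 on the NVidia card */
    err = clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, &global_size, NULL, 0, NULL, NULL);
}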

Intel supports the more feature-rich OpenCL C 2.x; Nvidia only supports OpenCL C 1.2. "OpenCL 3.0" is only the platform version (3.0 is essentially a rebranded 1.2), and the kernel language standard is still 1.2.
If you are using any OpenCL C 2.x-specific features in the kernel, it will work on an Intel GPU but not on Nvidia. The solution is to replace these 2.x features with standard OpenCL C 1.2 equivalents.
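To confirm which kernel language a device actually accepts, here is a minimal sketch of the device query (the helper name is hypothetical):

#include <CL/cl.h>
#include <stdio.h>

/* Print the OpenCL C language version the device compiler supports. */
void print_opencl_c_version(cl_device_id dev)
{
    char ver[256];
    clGetDeviceInfo(dev, CL_DEVICE_OPENCL_C_VERSION, sizeof(ver), ver, NULL);
    printf("Device OpenCL C version: %s\n", ver);    /* e.g. "OpenCL C 1.2" on Nvidia */
}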

Related

OpenCL: Writing to pointer in main memory

Is it possible, using OpenCL's DMA capabilities, to write to a main memory address that is passed into the cl program? I understand doing so would likely break the program, but the intent here is to run a GPU process and then overwrite the address space of the CPU program used to run it, so breakage is expected.
Thanks!
Which version of the OpenCL API are you targeting?
In OpenCL 2.0 and above you can use Shared Virtual Memory (SVM) to share address between host and device(s) in platforms that support it.
You can get more information about it in the Intel OpenCL SVM overview.
If you are using previous versions, or your hardware does not support it, you can use pinned memory with the appropriate flags to clCreateBuffer. In particular, CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, see clCreateBuffer in Khronos.
Note that using CL_MEM_USE_HOST_PTR comes with some alignment restrictions.
In general, in OpenCL, when and how the DMA is used depends on the hardware platform, so you should refer to the vendor documentation for details.
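A minimal sketch of the pinned-memory route for OpenCL 1.x (names and sizes are illustrative, error checks omitted):

#include <CL/cl.h>
#include <stdlib.h>

void create_pinned_buffers(cl_context ctx, cl_command_queue queue)
{
    cl_int err;
    size_t size = 1024 * 1024;

    /* Option A: let the runtime allocate host-accessible (pinned) memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                size, NULL, &err);

    /* Option B: wrap an existing host allocation (note the alignment caveat above). */
    void *host_ptr = aligned_alloc(4096, size);      /* page-aligned host memory */
    cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                                 size, host_ptr, &err);

    /* Map a buffer to obtain a pointer the CPU can write through directly. */
    void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);
    /* ... fill 'mapped' ... */
    clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
}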

Checking if gpu is integrated or not

I couldn't find any query command that tells whether a device is integrated/embedded in the CPU, i.e. whether it uses system RAM or its own dedicated GDDR memory. I could benchmark mapping/unmapping versus reading/writing to reach a conclusion, but the device could be under load at that time and behave differently, and it would add complexity to the already complex load-balancing algorithm I'm using.
Is there a simple way to check if a GPU is using the same memory as the CPU, so I can choose mapping/unmapping directly instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE, which returns CL_GLOBAL or CL_LOCAL.
Is this an indication of integratedness?
OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
"Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise."
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.
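A minimal sketch of that query (the helper name is hypothetical):

#include <CL/cl.h>

/* Returns non-zero if host and device share a memory subsystem; in that case
   map/unmap is usually the better choice than read/write. */
int has_unified_memory(cl_device_id dev)
{
    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    return unified == CL_TRUE;
}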

OpenCL "cross"-compile x64 / 32-bit-pointer GPU

I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or as an x64 program. Depending on which platform I choose, I get different compiled kernels: one that uses 32-bit pointers and pointer arithmetic, and another that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall performance is about 37% worse in my case between the slower 64-bit and the faster 32-bit kernel code, which is, apart from the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL instructions, which in ISA code produce two instructions: V_ADD_I32 and V_ADDC_U32. And of course all pointers require double the private memory space (hence more VGPRs).
Now my question is: is there a way to "cross"-compile an OpenCL kernel so that an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory on the GPU, so addressing less than 4 GiB of memory space is fine. As my host also executes AVX-512 instructions with all 32 zmm registers, which are only available in x64 mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with a why-that-is-response. I'm completely aware why the x64-program and kernel behave that way.
I have a couple of ideas, but not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it isn't one)? On Intel GPUs, going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to block read and write? This can help if the compiler proves that the address is uniform (scalar). E.g. something like Intel Subgroups (which enable some block IO). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2? If the compiler knows that SVM is not being used (>=OpenCL 2.0) and were to run a conservative analysis on the program to prove that it's not doing something wild with pointer arithmetic, it could feasibly do arithmetic in 32-bit and implicitly add a 64-bit relative offset to all addresses (making the GPU program think that it's using 32-bit addresses).
Again, I know nothing about AMD specifics, but I feel your pain with this problem.
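If you want to try the last suggestion, here is a minimal sketch of requesting the 1.2 language standard at build time ('program' and 'device' are assumed to exist; whether the vendor compiler then emits 32-bit addressing is not guaranteed):

#include <CL/cl.h>

cl_int build_as_cl12(cl_program program, cl_device_id device)
{
    /* Ask the compiler for the OpenCL C 1.2 standard, ruling out SVM / 2.x pointer semantics. */
    return clBuildProgram(program, 1, &device, "-cl-std=CL1.2", NULL, NULL);
}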

HyperQ support in OpenCL

I want to run heterogeneous kernels that execute on a single GPU asynchronously. I think this is possible on an Nvidia Kepler K20 (or any device with compute capability 3.5+) by launching each of these kernels to a different stream; the runtime system then maps them to different hardware queues based on resource availability.
Is this feature accessible in OpenCL?
If so, what is the equivalent of a CUDA 'stream' in OpenCL?
Do Nvidia drivers support such execution on their K20 cards through OpenCL?
Is there any AMD GPU that has a similar feature (or is anything in development)?
An answer to any of these questions will help me a lot.
In principle, you can use OpenCL command queues to achieve CKE (Concurrent Kernel Execution). You can launch them from different CPU threads. Here are a few links that might help you get started:
How do I know if the kernels are executing concurrently?
http://devgurus.amd.com/thread/142485
I am not sure how it would work with NVIDIA Kepler GPUs, as we are having strange issues using OpenCL on the K20 GPU.
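A minimal sketch of the command-queue approach: the rough OpenCL counterpart of a CUDA stream is a cl_command_queue, and kernels enqueued to different queues on the same device may overlap if the driver supports it ('kernelA', 'kernelB' and 'gsize' are assumed to exist; concurrency is never guaranteed):

#include <CL/cl.h>

void launch_concurrently(cl_context ctx, cl_device_id dev,
                         cl_kernel kernelA, cl_kernel kernelB, size_t gsize)
{
    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Each enqueue returns immediately; any overlap happens on the device. */
    clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsize, NULL, 0, NULL, NULL);

    clFinish(q0);
    clFinish(q1);
}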

Can a GPU be the host of a OpenCL program?

Little disclaimer: This is more the kind of theoretical / academic question than an actual problem I've got.
The usual way of setting up a parallel program in OpenCL is to write a C/C++ program, which sets up the devices (GPU and/or other CPUs), kernel and data buffers for executing the kernel on the device.
This program gets launched from the host, which is usually a CPU.
Would it be possible to write an OpenCL program where the host is a GPU and the devices are other GPUs and/or CPUs?
What would be the prerequisites for such a scenario?
Does one need a special GPU, or would it be possible to use any OpenCL-capable GPU?
Are you looking for a complete host or just a kernel launcher?
The upcoming CUDA (v5.0) introduces a feature to launch a kernel from inside a kernel, so a device can be used to launch a kernel on itself. Maybe this feature will be supported by OpenCL too in the near future.
