OpenCL bytecode running on another card

I have a program that uses OpenCL for calculation. The OpenCL code is big and compile time is about 2 minutes with 100% CPU load, so of course I save the binary result of the compilation, and the second launch loads the OpenCL program from the binary. Can I use the same binary on another video card with the same chip but different characteristics (RAM, clock, etc.)?

As far as the OpenCL specification is concerned, you only have guarantees that a program binary can be re-used on the same device on which it was created.
In reality, the binaries that are returned by many OpenCL implementations are compatible with a wider range of devices available from that same vendor. For example, NVIDIA return PTX when you request binaries from their implementation, which is a reasonably high level intermediate representation (i.e. not native instructions). This is certainly compatible with other devices using the same architecture on which it was created (e.g. all GK110 devices, or all GF104 devices), and quite likely to be portable across a range of other NVIDIA GPU architectures too. Other vendors also return various types of intermediate representation (usually LLVM IR based) that allow this kind of binary compatibility.
So yes, you can probably re-use binaries across different devices that have the same architecture, but you'll really just have to try it and see. You could always implement a scheme that tries to use the binary and, if that fails, falls back to the source code.
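A minimal sketch of such a scheme, assuming you have already cached the bytes returned by clGetProgramInfo with CL_PROGRAM_BINARIES on an earlier run (error handling trimmed for brevity):

#include <CL/cl.h>

/* Try a cached binary first; fall back to compiling from source.
 * "binary" is the blob previously saved via clGetProgramInfo /
 * CL_PROGRAM_BINARIES; pass NULL on a first run. */
cl_program load_program(cl_context context, cl_device_id device,
                        const unsigned char *binary, size_t binary_size,
                        const char *source)
{
    cl_int err, status;
    cl_program program;

    if (binary != NULL) {
        program = clCreateProgramWithBinary(context, 1, &device,
                                            &binary_size, &binary,
                                            &status, &err);
        if (err == CL_SUCCESS &&
            clBuildProgram(program, 1, &device, NULL, NULL, NULL) == CL_SUCCESS)
            return program;            /* cached binary works on this device */
        if (err == CL_SUCCESS)
            clReleaseProgram(program); /* binary rejected: discard and rebuild */
    }

    /* Slow path: compile from OpenCL C source. */
    program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    return program;
}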
In the future, we will hopefully see a large number of vendors supporting the recently ratified SPIR specification, which is a platform-portable intermediate representation for OpenCL device programs. This would allow you to generate binaries that are not only compatible with devices from a single vendor's architecture, but also across devices from many other vendors that also support SPIR. There would clearly be some remaining compilation overhead to lower SPIR to the native instruction set, but this should still result in significant speed-ups compared to compiling raw OpenCL C code.

Related

OpenCL kernel compilation time

I am new to the FPGA world. I tried to compile some OpenCL programs, but I noticed that it takes a very long time to compile even the "Hello_World" program (a couple of hours). So I am wondering: why does compiling an OpenCL kernel on FPGAs take so long (hours)? In addition, does the FPGA get re-programmed when we compile/execute the OpenCL on it?
Converting sequential code to hardware is difficult, and in some cases the compiler tries multiple versions of things to find the best combination of hardware. It's not like compiling for CPUs and GPUs, so the workflow is quite different (you compile kernels at build time and not at runtime). The end result is often hardware that is faster and/or uses less energy than more general-purpose compute devices like CPUs and GPUs. There are some excellent "OpenCL on Altera" videos that explain how the compilation works, but a summary is: compile to an abstract machine, and for each instruction/step, remove the abstract-machine hardware not needed for that step, then merge all the remaining hardware into what gets programmed on the chip. The data "flows" through the hardware rather than living in memory and registers as it does on a CPU/GPU.

OpenCL, Vulkan, SYCL

I am trying to understand the OpenCL ecosystem and how Vulkan comes into play.
I understand that OpenCL is a framework to execute code on GPUs as well as CPUs, using kernels that may be compiled to SPIR.
Vulkan can also be used as a compute-API using the same SPIR language.
SYCL is a new specification that allows writing OpenCL code as proper standard-conforming C++14. It is my understanding that there are no free implementations of this specification yet.
Given that,
How does OpenCL relate to Vulkan? I understand that OpenCL is higher level and abstracts the devices, but does (or could) it use Vulkan internally? (Instead of relying on vendor-specific drivers.)
Vulkan is advertised as both a compute and graphics API, however I found very few resources for the compute part. Why is that?
Vulkan has performance advantages over OpenGL. Is the same true for Vulkan vs OpenCL? (OpenCL is sadly notorious for being slower than CUDA.)
Does SYCL use OpenCL internally, or could it use Vulkan? Or does it use neither and instead rely on low-level, vendor-specific APIs to be implemented?
How does OpenCL relate to Vulkan? I understand that OpenCL is higher level and abstracts the devices, but does (or could) it use Vulkan internally?
They're not related to each other at all.
Well, they do technically use the same intermediate shader language, but Vulkan forbids the Kernel execution model, and OpenCL forbids the Shader execution model. Because of that, you can't just take a shader meant for OpenCL and stick it in Vulkan, or vice-versa.
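For instance, free pointer arithmetic like the following is ordinary OpenCL C under the Kernel execution model, but has no counterpart in the Shader execution model (a hypothetical toy kernel):

/* OpenCL C: raw pointers into __global memory are part of the Kernel
 * execution model; the Shader execution model has no equivalent notion. */
__kernel void scale(__global float *data, float factor)
{
    __global float *p = data + get_global_id(0);  /* pointer arithmetic */
    *p *= factor;
}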
Vulkan is advertised as both a compute and graphics API, however I found very few resources for the compute part. Why is that?
Because the Khronos Group likes misleading marketing blurbs.
Vulkan is no more of a compute API than OpenGL. It may have Compute Shaders, but they're limited in functionality. The kind of stuff you can do in an OpenCL compute operation is just not available through OpenGL/Vulkan CS's.
Vulkan CS's, like OpenGL's CS's, are intended to be used for one thing: to support graphics operations. To do frustum culling, build indirect graphics commands, manipulate particle systems, and other such things. CS's operate at the same numerical precision as graphical shaders.
Vulkan has performance advantages over OpenGL. Is the same true for Vulkan vs OpenCL?
The performance of a compute system is based primarily on the quality of its implementation. It's not OpenCL that's slow; it's your OpenCL implementation that's slower than it possibly could be.
Vulkan CS's are no different in this regard. The performance will be based on the maturity of the drivers.
Also, there's the fact that, again, there's a lot of stuff you can do in an OpenCL compute operation that you cannot do in a Vulkan CS.
Does SYCL use OpenCL internally, or could it use Vulkan?
From the Khronos Group:
SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL...
So yes, it's built on top of OpenCL.
How does OpenCL relate to Vulkan?
They both can pipeline separable work from host to GPU and GPU to host, using queues and multiple threads to reduce communication overhead; DirectX/OpenGL cannot, or at least not as directly.
OpenCL: initial release August 28, 2009. Broader hardware support. Pointers are allowed, but only for use on the device. You can use local memory shared between threads (see the kernel sketch below). Much easier to get a hello world started. Has API overhead for commands unless they are device-side queued. You can choose implicit multi-device synchronization or explicit management. Bugs are mostly fixed for 1.2, but I don't know about version 2.0.
Vulkan: initial release February 16, 2016 (but in progress since 2014). Narrower hardware support. Can SPIR-V handle pointers? Maybe not. No local-memory option? Hard to get a hello world started. Less API overhead. Can you choose implicit multi-device management? Still buggy for the Dota 2 game and some other games. Using both the graphics and compute pipelines at the same time can hide even more latency.
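As an illustration of the local-memory point in the OpenCL summary above, here is a toy OpenCL C reduction kernel that stages data through __local memory shared within a work-group (a hypothetical example, assuming a power-of-two work-group size):

/* Partial sum per work-group, staged through __local memory.
 * "scratch" holds one float per work-item; the host sizes it with
 * clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL). */
__kernel void partial_sum(__global const float *in,
                          __global float *out,
                          __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction within the work-group. */
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}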
If OpenCL had Vulkan in it, it would have been hidden from the public for 7-9 years. If they could add it, why didn't they do it for OpenGL? (Maybe because of pressure from PhysX/CUDA?)
Vulkan is advertised as both a compute and graphics API, however I found very few resources for the compute part. Why is that?
It needs more time, just like OpenCL did.
You can check info about compute shaders here:
https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html#fundamentals-floatingpoint
Here is an example of a particle system managed by compute shaders:
https://github.com/SaschaWillems/Vulkan/tree/master/computeparticles
Below that, there are raytracer and image-processing examples too.
Vulkan has performance advantages over OpenGL. Is the same true for Vulkan vs OpenCL?
Vulkan doesn't need to synchronize with another API; it is about command-buffer synchronization between command queues.
OpenCL needs to synchronize with OpenGL or DirectX (or Vulkan?) before using a shared buffer (CL-GL or DX-CL interop buffers). This has an overhead, and you need to hide it using buffer swapping and pipelining. If no shared buffer exists, it can run concurrently with OpenGL or DirectX on modern hardware.
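With the cl_khr_gl_sharing extension, that synchronization looks roughly like the following sketch (buf is assumed to come from clCreateFromGLBuffer, and queue/kernel to be valid):

#include <CL/cl_gl.h>
#include <GL/gl.h>

/* Run a kernel on a buffer shared with OpenGL. The acquire/release
 * pair is the synchronization overhead mentioned above. */
void run_on_shared_buffer(cl_command_queue queue, cl_kernel kernel,
                          cl_mem buf, size_t global_size)
{
    glFinish();  /* make sure GL is done writing the buffer */
    clEnqueueAcquireGLObjects(queue, 1, &buf, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &buf, 0, NULL, NULL);
    clFinish(queue); /* make results visible to OpenGL again */
}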
OpenCL is sadly notorious for being slower than CUDA
It was, but it is now mature and challenges CUDA, especially with much wider hardware support, from all gaming GPUs to FPGAs, as of version 2.1. For example, in the future Intel could put an FPGA into a Core i3 and enable it as a (soft x86-core IP) many-core CPU, closing the gap between GPU and CPU performance to upgrade its CPU-PhysX gaming experience, or simply let an OpenCL physics implementation shape it and use at least 90% of the die area effectively, instead of a soft core's 10-20%.
At the same price, AMD GPUs compute faster in OpenCL, and at the same compute power, Intel iGPUs draw less power. (Edit: except when algorithms are sensitive to cache performance, where Nvidia has the upper hand.)
Besides, I wrote an SGEMM OpenCL kernel and ran it on an HD 7870 at 1.1 Tflops, then checked the internet and saw an SGEMM benchmark on a GTX 680 with the same performance, using a popular CUDA title! (The price ratio of GTX 680 to HD 7870 was 2.) (Edit: Nvidia's cc3.0 doesn't use the L1 cache when reading global arrays, and my kernel was purely local/shared memory plus some registers, "tiled".)
Does SYCL use OpenCL internally, or could it use Vulkan? Or does it use neither and instead rely on low-level, vendor-specific APIs to be implemented?
Here,
https://www.khronos.org/assets/uploads/developers/library/2015-iwocl/Khronos-SYCL-May15.pdf
says
Provides methods for dealing with targets that do not have OpenCL (yet!)
A fallback CPU implementation is debuggable!
so it can fall back to a pure threaded version (similar to Java's Aparapi).
Can access OpenCL objects from SYCL objects
Can construct SYCL objects from OpenCL object
Interop with OpenGL remains in SYCL
- Uses the same structures/types
So it uses OpenCL (maybe not directly, but through an upgraded driver communication?); it develops in parallel with OpenCL but can fall back to threads.
from the smallest OpenCL 1.2 embedded device to the most advanced OpenCL 2.2 accelerators

Checking if a GPU is integrated or not

I couldn't find any query command about whether a device is integrated/embedded in the CPU, or whether it uses system RAM or its own dedicated GDDR memory. I can benchmark mapping/unmapping versus reading/writing to reach a conclusion, but that device could be under load at the time and behave wrongly, and it would add complexity to the already complex load-balancing algorithm I'm using.
Is there a simple way to check if a GPU is using the same memory as the CPU, so I can choose mapping/unmapping directly instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE, which returns CL_GLOBAL or CL_LOCAL. Is this an indication of integratedness?
OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.
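A minimal sketch of that query (error handling omitted):

#include <CL/cl.h>

/* Returns nonzero if the device shares a memory subsystem with the host,
 * in which case mapping/unmapping is usually the better choice. */
int has_unified_memory(cl_device_id device)
{
    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    return unified == CL_TRUE;
}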

OpenCL "cross"-compile x64 / 32-bit-pointer GPU

I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or run as an x64 program. Depending on which platform I choose, I get different compiled kernels: one that uses 32-bit pointers and pointer arithmetic, and another that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit-pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall performance is about 37% worse between the slower 64-bit and the faster 32-bit kernel code, which is, other than the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL instructions, which in ISA code produce two instructions: V_ADD_I32 and V_ADDC_U32. And of course all pointers require double the private memory space (hence more VGPRs).
Now my question is: Is there a way to "cross"-compile an OpenCL kernel so that an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory in the GPU, so addressing less than 4 GiB of memory space is fine. As my host also executes AVX-512 instructions with all 32 zmm registers, which are only available in x64 mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with a why-that-is-response. I'm completely aware why the x64-program and kernel behave that way.
I have a couple of ideas, but not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it's not)? On Intel GPUs going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to block read and write? This can help if the compiler proves that the address is uniform (scalar). E.g. something like Intel Subgroups (which enable some block IO). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2? If the compiler knows that SVM is not being used (>=OpenCL 2.0) and were to run a conservative analysis on the program to prove that it's not doing something wild with pointer arithmetic, it could feasibly do arithmetic in 32-bit and implicitly add a 64-bit relative offset to all addresses (making the GPU program think that it's using 32-bit addresses).
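For what it's worth, trying that last idea costs only a build-options string; whether the AMD compiler actually narrows pointers in response is the speculative part:

/* program and device are assumed valid; request the OpenCL 1.2 dialect
 * so the compiler may assume no SVM-style pointer usage. */
cl_int build_as_cl12(cl_program program, cl_device_id device)
{
    return clBuildProgram(program, 1, &device, "-cl-std=CL1.2", NULL, NULL);
}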
Again, I know nothing about AMD specifics, but I feel your pain with this problem.

Atomic operations between multiple devices

I am developing something on heterogeneous systems with a CPU and a GPU (an AMD APU, in fact) with OpenCL. I will use atomic operations to guarantee the integrity of data, and the data is shared between the CPU device and the GPU device, each of which runs a kernel on the shared data. My question is: are atomic operations still valid between these two devices? Hope anyone can help me. Many thanks.
Appendix A of the OpenCL specification covers the synchronization of memory objects between different devices. There is no guarantee that both devices will access the memory objects at the same physical location: one of the devices may work on a copy of the buffer, and only synchronization as described in Appendix A will ensure the other device gets a copy of it.
Your implementation on the AMD APU may allow both CPU and GPU to share the same address space, and may not require the inter-device synchronization. I would suggest checking the AMD documentation and experimenting.
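If the APU's implementation reports OpenCL 2.0 fine-grained SVM with atomics, that is the mechanism designed for exactly this case. A hedged sketch of the capability check and allocation (names are illustrative):

#include <CL/cl.h>

/* Check for fine-grained SVM with atomics, then allocate memory that
 * CPU and GPU kernels may update atomically. Returns NULL if the
 * device cannot do it (fall back to Appendix A synchronization). */
void *alloc_shared_atomic_buffer(cl_context ctx, cl_device_id dev, size_t size)
{
    cl_device_svm_capabilities caps = 0;
    clGetDeviceInfo(dev, CL_DEVICE_SVM_CAPABILITIES, sizeof(caps), &caps, NULL);
    if (!(caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) ||
        !(caps & CL_DEVICE_SVM_ATOMICS))
        return NULL;

    return clSVMAlloc(ctx,
                      CL_MEM_READ_WRITE |
                      CL_MEM_SVM_FINE_GRAIN_BUFFER |
                      CL_MEM_SVM_ATOMICS,
                      size, 0 /* default alignment */);
}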