Calling external functions when using OpenCL for CPU Device - opencl

I am evaluating the possibility for using OpenCL for just-in-time compilation of performance-critical mathematical expressions for CPU devices. I am currently using LLVM directly (or rather, I have a working proof-of-concept), but would find the abstraction offered by OpenCL very useful going forward.
I am now trying to figure out if there is some way to call functions with external linkage when using OpenCL for CPU devices, equivalent to the following in LLVM:
... = llvm::Function::Create(..., llvm::Function::ExternalLinkage, "...", ...);
Since my OpenCL implementation at least is built on top of LLVM, I was hoping that this would be possible somehow.

Does this function http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNativeKernel.html
accomplish what you are after?
Edit: credit where credit is due: https://stackoverflow.com/a/10807728/717881

Related

Translate OpenCL SPIR-V to Vulkan SPIR-V

Is it possible to translate OpenCL-style SPIR-V to Vulkan-style SPIR-V?
I know that it is possible to use clspv to compile OpenCL C to Vulkan-style SPIR-V, but I haven't seen any indication that it also supports ingesting OpenCL-style SPIR-V.
Thank you for any suggestions if you know how to achieve this :)
I know that it is possible to use clspv to compile OpenCL C to
Vulkan-style SPIR-V, but I haven't seen any indication that it also
supports ingesting OpenCL-style SPIR-V.
clspv compiles to "Opencl-style SPIR-V". IOW, it uses OpenCL execution model and also OpenCL memory model. The answer to your question is no (in general). The problem is that e.g. GLSL uses logical memory model, which means pointers are abstract, so you can't have pointers to pointers. While OpenCL allows this, because it uses physical memory model. Plus there are other things in OpenCL which cannot be expressed in GLSL. You could try to write some translator, and it might work for some very simple code, but that's about it.

Alternative to clGetGLContextInfoKHR?

I am writing a N-body physics simulation. I would like to ask if there is an alternative to the OpenCL clGetGLContextInfoKHR() function ? I need to find out during runtime which GPU is used for OpenGL rendering so that I can use OpenCL for vertex manipulation on this same GPU (for performance reasons).
I have searched the OpenCL.dll for the function clGetGLContextInfoKHR() using Dependancy Walker, but it seems that the implementation that is installed on my computer does not support it sice this function is missing from the DLL. I have also tried glGetString(GL_RENDERER), but the name string it returns differres from the name string which clGetDeviceInfo(.., CL_DEVICE_NAME, ...) returns (not by much, but enough to make it for example difficult to destinguish two GPUs from the same manufacturer). Is there any other way except manually choosing the correct OpenCL device ?
Thanks for help !

Shipping reliable OpenCL applications - Tools/Techniques/Tips?

I want to ship OpenCL code that should work on all OpenCL 1.1 compatible GPUs. Rather than buying a bunch of GPUs and testing on them, are there any tools that can help ensure reliability?
If anyone has experience shipping OpenCL applications to a wide hardware base, I'd be interested in knowing about any other methods for testing reliability.
I've a bit of knowledge on this. Unfortunately, the answer is: depends on what the kernel is doing.
My biggest gripe is with NVIDIA and OpenCL, since they don't seem to support: vectors (float2, 4, etc) and global offsets. Kind of obnoxious. Intel and ATI are both solid, but even then vector sizes can differ. The above doesn't really matter if you are doing image convolution.
It matters if you want to run AMD FFT on an NVIDIA card, are doing matrix math, etc. To address the vector issue, you can write multiple kernels that each have a different vector size and call the right one: MatrixMult_float4(...).
You can check whether your code compiles by using the AMD KernelAnalyzer2, although this does need some component of the Catalyst drivers so it only works for me on PCs with AMD GPUs. There is also the Intel Kernel Builder, which works for devices with Intel OpenCL SDK support. Nvidia's implementation has bugs in it, especially on newer GPUs in my experience so there the best is to test one GPU from each generation.
To avoid extensions and validate CL language versions, one could try to test compile the code using the LLVM, or just getting the grammar for validation, e.g. as BNF.
There's a promising open source project, which probably contains useful stuff: http://bazaar.launchpad.net/~pocl/pocl/master/files/head:/lib/CL/
However, the problems I encountered were:
Newline characters caused build breakers on certain implementations (CR, LF, CRLF) in OpenCL source files. Specifying one of these as the only valid line ending would be just stupid. If one is editing source files on different platforms in conjunction with an SCM, it could get inconvenient. So I remove comments and clean up line breaks before compilation.
Performance: Feeding the GPU efficiently using multithreading; different hardware constellations have different bottlenecks. Here I needed a client side pipeline with multiple dispatcher threads. Of course, the amount of work that remains for the CPU depends on the task or capabilities, amount and resources of computing devices. Things that needed serialized execution or dynamic loop counts have been such candidates.

Does CUDA support recursion?

Does CUDA support recursion?
It does on NVIDIA hardware supporting compute capability 2.0 and CUDA 3.1:
New language features added to CUDA C
/ C++ include:
Support for function
pointers and recursion make it easier
to port many existing algorithms to
Fermi GPUs
http://developer.nvidia.com/object/cuda_3_1_downloads.html
Function pointers:
http://developer.download.nvidia.com/compute/cuda/sdk/website/CUDA_Advanced_Topics.html#FunctionPointers
Recursion:
I can't find a code sample on NVIDIA's website, but on the forum someone post this:
__device__ int fact(int f)
{
if (f == 0)
return 1;
else
return f * fact(f - 1);
}
Yes, see the NVIDIA CUDA Programming Guide:
device functions only support recursion in device code compiled for devices
of compute capability 2.0.
You need a Fermi card to use them.
Even though it only supports recursion for specific chips, you can sometimes get away with "emulated" recursion: see how I used compile-time recursion for my CUDA raytracer.
In CUDA 4.1 release CUDA supports recursion only for __device__ function but not for __global__ function.
Only after 2.0 compute capability on compatible devices
Any recursive algorithm can be implemented with a stack and a loop. It's way more of a pain, but if you really need recursion, this can work.
Sure it does, but it requires the Kepler architecture to do so.
Check out their latest example on the classic quick sort.
http://blogs.nvidia.com/2012/09/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/
As far as i know, only latest Kepler GK110 supports dynamic parallelism, which allow this kind of recursive call and spawning of new threads within the kernel. Before Kepler GK110, it was not possible. And note that not all Kepler architecture supports this, only GK110 does.
If you need recursion, you probably need the Tesla K20.
I'm not sure if Fermi does supports it,never read of it. :\
But Kepler sure does. =)
CUDA 3.1 supports recursion
If your algorithm invovles alot of recursions, then support or not, it is not designed for GPUs, either redesign your algorthims or get a better CPU, either way it will be better (I bet in many cases, maginitudes better) then do recurisons on GPUs.
Yeah, it is supported on the actual version. But despite the fact it is possible to execute recursive functions, you must have in mind that the memory allocation from the execution stack cannot be predicted (the recursive function must be executed in order to know the true depth of the recursion), so your stack could result being not enough for your purposes and it could need a manual increment of the default stack size
Yes, it does support recursion. However, it is not a good idea to do recursion on GPU. Because each thread is going to do it.
Tried just now on my pc with a NVIDIA GPU with 1.1 Compute capability. It says recursion not yet supported. So its not got anything to do with the runtime but the hardware itself

OpenCL vs. DirectCompute?

I'm looking for comparisons between OpenCL and DirectCompute, but I haven't found anything. OpenCL's advantages of being cross-platform and having a wider range of supported GPUs don't matter to me. I'm fine with coding on Windows against DX11 GPUs only. Assuming that, what are the pros and cons of each API?
I know this question was raised before, but I'm looking for more details.
I'm not interested in CUDA, since I don't want to restrict myself to only Nvidia hardware.
Probably the biggest difference for a coder is that DirectCompute is programmed by a language which is similar to HLSL, and OpenCL is programmed via a C-like language.
Another difference to consider is that, generally, for commodity level GPUs, the DirectX support is better (faster and less buggy) than OpenGL support on Windows. This may translate to more stable support for DirectCompute, but really, this is just speculation.
Well the major advantage of OpenCL is that it is not just limited to graphics cards. You can make use of your multicore CPU, Graphics Card and potentially any number of other hardware acceleration devices (DSPs etc) all from the same program.
I'm not sure if DirectCompute allows that freedom.
The OpenCL cross-platform-ness is not just a detail, as the host code (the one calling the OpenCL API and submitting kernels) can itself be cross-platform (see link text, link text...).
Write once, run on any GPGPU, anywhere.
Otherwise the OpenCL tooling is really getting better, with an ATI Stream plugin for Visual Studio, the NVidia & ATI SDKs that contains tons of samples, etc...
Another option now is C++ AMP which gives you modern C++ syntax without a need for a seperate compiler while still preserving hardware portability. Please follow links from here for more info and feel free to post questions as you have them: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/13/c-amp-in-a-nutshell.aspx
I use OpenCL because i can easily port my App to Linux but with DirectCompute this is not possible.
I think also that the performance of the OpenCL implementation will increase with time (that it comes at the same Level like CUDA for NVidia Cards) and also that the (driver)bugs will (hopefully ;) ) be eliminated with time.

Resources