OpenCL scan code - opencl

I'm looking for a fast implementation of scan (prefix sum) in OpenCL. The best one I have found is in the NVIDIA SDK, but it is old (2010).
Does anyone know of any other implementation of scan in OpenCL?

There are several open-source implementations of the scan operation in OpenCL:
CLOGS, a library for higher-level operations on top of the OpenCL C++ API.
Boost.Compute, a C++ GPU Computing Library for OpenCL.
VexCL, a C++ vector expression template library for OpenCL/CUDA.
Bolt, a C++ template library optimized for GPUs.
The author of CLOGS wrote a paper comparing performance of scan (and sort) operations in these implementations.
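Whichever library you pick, it helps to have a plain CPU reference to check the GPU results against. Here is a minimal sketch in C++ of what a scan with addition computes, in both the inclusive and the exclusive form. The function names are made up for illustration and do not come from any of the libraries above.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
// (Hypothetical helper name, for validation only.)
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];   // add the current element first
        out[i] = running;   // then store the running total
    }
    return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
// (Hypothetical helper name, for validation only.)
std::vector<int> exclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // store the total of all earlier elements
        running += in[i];   // then add the current element
    }
    return out;
}
```

For the input {3, 1, 7, 0, 4} the inclusive form gives {3, 4, 11, 11, 15} and the exclusive form gives {0, 3, 4, 11, 11}. Comparing a GPU result against one of these on small inputs is a cheap way to catch off-by-one mistakes between the two forms.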

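If your device supports OpenCL 2.0, use the built-in work-group scan functions, such as work_group_scan_inclusive_add and work_group_scan_exclusive_add. Note that these scan across the work-items of a single work-group only.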
https://stackoverflow.com/a/32394920/4877550
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/

Related

How does OpenMP differ from OpenCL when it comes to GPGPU?

When a program is run on a GPGPU, how would its execution differ if it were implemented with OpenMP rather than OpenCL?
Does OpenMP utilize GPGPUs through OpenCL?
If not, what is the common GPGPU API underneath them that I can use directly (without any OpenMP/OpenCL built on top of it)?
P.S. On Linux, OpenMP uses just pthreads to manage threads. I couldn't find any other GPGPU API besides OpenCL and CUDA, so it seems obvious (though rather painful to admit) that OpenMP, when it comes to GPGPU, uses OpenCL (or CUDA, if the GPGPU is from NVIDIA and OpenMP is that smart).
As far as I know, OpenMP is a set of compiler directives that provides parallelism on shared-memory architectures, and a GPGPU is generally not such an architecture.
You can use both together to achieve better performance, or you can use OpenACC, OpenHMPP, or C++ AMP, which can more or less substitute for them. You can also use libraries such as AMD Bolt or ArrayFire, which let you use a GPGPU without much effort.

Do Nvidia GPUs support pipe-like structures?

I am trying to write OpenCL code that takes advantage of the latest OpenCL 2.0 features, such as pipes. I have been working on AMD GPUs until now, and they support pipes, but the Nvidia driver doesn't support OpenCL 2.0. Are there any pipe-like structures available for Nvidia GPUs? My intention is to transfer data directly between two kernels instead of passing it through global memory, so anything that helps me do this would be useful.
I'm not aware of any. Do contact NVIDIA and let them know you'd like to see OpenCL 2.0 support.

OpenCL intermediate language info?

I cannot find anything that simply explains the syntax of the intermediate language. Does anybody know of any good documentation?
AFAIK, nothing called "OpenCL intermediate language" exists. There are vendor-specific intermediate languages used by some OpenCL implementations (such as NVIDIA's PTX and AMD's IL).
There is also the "Standard Portable Intermediate Representation" (SPIR) specification from Khronos which aims to be a cross-platform intermediate representation for OpenCL device code.

Finding number of copy engines using OpenCL

In OpenCL, is there any API for finding the number of copy engines in a GPU? In CUDA we can check this with asyncEngineCount. What is the alternative in OpenCL?
Within the OpenCL standard there is no alternative, because this is a hardware- and vendor-specific implementation detail. However, NVIDIA might update its cl_nv_device_attribute_query extension in the future. This extension already contains the CL_DEVICE_GPU_OVERLAP_NV device info, which returns true if asynchronous copy is possible.

How to handle OpenCL code on unsupported hardware in a C++ app

I've been doing some research into OpenCL and the possibility of using it on a project. My question is: is there a way to run OpenCL code in a C++ application on a CPU that is not supported by the OpenCL SDKs? I know Java has Aparapi, but I'm wondering how to run OpenCL code in a C++ application without hardware that the SDKs support. There is some code I would like to write as OpenCL kernels to take advantage of OpenCL parallelism where it is available, but I'm unsure whether I would be able to run it on older hardware (still x86, but not recent). Could anyone explain how this can be done, or whether running OpenCL code on older systems is even a problem at all?
Thanks,
Peter
I would say the best way to approach this is to check whether the machine supports OpenCL through OpenCL API calls such as clGetPlatformIDs. If it turns out there is no OpenCL device, run the required code as a normal C/C++ function; otherwise run it as an OpenCL kernel. For portability, though, you need to write the program logic twice: once in a .cl file and once as a normal C/C++ function.
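The dispatch described above could be sketched roughly as follows. This is only an illustration under stated assumptions: scale_cpu, scale_with_fallback, and the two callables passed in are hypothetical names, and the OpenCL side is left as an injected function so that the sketch compiles without an OpenCL SDK. In a real application the probe would wrap calls such as clGetPlatformIDs and clGetDeviceIDs, and the OpenCL path would build and enqueue the kernel from the .cl file.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Signature shared by the CPU and OpenCL versions of the same logic.
using ScaleFn =
    std::function<std::vector<float>(const std::vector<float>&, float)>;

// Plain C++ version of the computation (here: scale every element).
std::vector<float> scale_cpu(const std::vector<float>& in, float factor) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
    return out;
}

// Runs the OpenCL path only if the probe reports a usable device and an
// OpenCL implementation was supplied; otherwise falls back to the CPU code.
// Both callables are assumptions standing in for real OpenCL host code.
std::vector<float> scale_with_fallback(
    const std::vector<float>& in, float factor,
    const std::function<bool()>& opencl_available,
    const ScaleFn& scale_opencl) {
    if (opencl_available && opencl_available() && scale_opencl) {
        return scale_opencl(in, factor);
    }
    return scale_cpu(in, factor);
}
```

Keeping both versions behind one signature means the rest of the application never needs to know which path ran, and the CPU version doubles as a reference for testing the kernel.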
