When a program is run on a GPGPU, how would its execution differ if it were implemented with OpenMP vs. OpenCL?
Does OpenMP utilize GPGPUs through OpenCL?
If not, what is the common GPGPU API that I can use directly (without OpenMP/OpenCL built on top of it)?
P.S. On Linux, OpenMP just uses pthreads to manage threads. I couldn't find any GPGPU API besides OpenCL and CUDA, so it seems obvious (though pretty painful to admit) that OpenMP, when it comes to GPGPUs, must utilize OpenCL (or CUDA, if the GPGPU is from NVIDIA and OpenMP is that smart).
As far as I know, OpenMP is a set of compiler directives that provide parallelism on shared-memory architectures, and a GPGPU is generally NOT such an architecture.
You can use both of them together to achieve better performance, or you can use OpenACC, OpenHMPP, or C++ AMP, which can more or less substitute for them, or you can use libraries such as AMD Bolt or ArrayFire, which let you utilize a GPGPU without much effort.
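To illustrate the directive model, here is a minimal OpenMP sketch (the function name `scale` is just a placeholder): a single pragma asks the compiler to split the loop across CPU threads that all share the same memory, which is exactly the shared-memory model described above.

    // minimal OpenMP sketch: one pragma parallelizes the loop across
    // CPU threads that all share the memory behind `data`
    #include <omp.h>

    void scale(float *data, int n, float factor)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            data[i] *= factor;
    }

Compile with your compiler's OpenMP flag (e.g. -fopenmp for GCC); without the flag, the pragma is simply ignored and the loop runs serially.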
Do Nvidia GPUs support an OpenCL-aware MPI with functionality similar to their CUDA-aware MPI?
Also, does OpenCL on AMD GPUs provide such functionality?
I would have thought a simple Google search would have answered this, but surprisingly I did not find much info on it, while there is a ton of info on CUDA-aware MPI!
I'm looking for a fast implementation of scan (prefix sum) in OpenCL. The best thing I found is in the Nvidia SDK, but it's old (2010).
Does anyone know of any other implementations of scan in OpenCL?
There are several open-source implementations of the scan operation in OpenCL:
CLOGS, a library for higher-level operations on top of the OpenCL C++ API.
Boost.Compute, a C++ GPU Computing Library for OpenCL.
VexCL, a C++ vector expression template library for OpenCL/CUDA.
Bolt, a C++ template library optimized for GPUs.
The author of CLOGS wrote a paper comparing performance of scan (and sort) operations in these implementations.
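As a quick sketch of what using one of these looks like, here is an inclusive scan with Boost.Compute, assuming Boost and an OpenCL runtime are installed:

    #include <vector>
    #include <boost/compute/algorithm/copy.hpp>
    #include <boost/compute/algorithm/inclusive_scan.hpp>
    #include <boost/compute/container/vector.hpp>
    #include <boost/compute/core.hpp>

    namespace compute = boost::compute;

    int main()
    {
        // set up the default OpenCL device, context and queue
        compute::device device = compute::system::default_device();
        compute::context context(device);
        compute::command_queue queue(context, device);

        std::vector<int> host = {1, 2, 3, 4, 5};

        // copy the input to the device
        compute::vector<int> dev(host.size(), context);
        compute::copy(host.begin(), host.end(), dev.begin(), queue);

        // run the scan on the device, in place
        compute::inclusive_scan(dev.begin(), dev.end(), dev.begin(), queue);

        // copy the result back: {1, 3, 6, 10, 15}
        compute::copy(dev.begin(), dev.end(), host.begin(), queue);
    }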
If your device supports OpenCL 2.0, use the built-in work-group operations for that:
https://stackoverflow.com/a/32394920/4877550
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/
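For reference, a minimal sketch of what that looks like in an OpenCL 2.0 kernel (the kernel name is hypothetical; this scans within a single work-group only, so a larger array still needs a second pass over the per-group sums):

    // OpenCL 2.0 kernel: built-in inclusive scan within one work-group.
    // Build with -cl-std=CL2.0.
    __kernel void scan_block(__global const int *in, __global int *out)
    {
        int gid = get_global_id(0);
        out[gid] = work_group_scan_inclusive_add(in[gid]);
    }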
I'm looking to create Rust implementations of some small bioinformatics programs for my research. One of my main considerations is performance, and while I know that I could schedule the Rust program to run on a grid with qsub - the cluster I have access to uses Oracle's GridEngine - I'm worried that the fact that I'm not calling MPI directly will cause performance issues with the Rust program.
Will scheduling the program without using an MPI library hinder performance greatly? Should I use an MPI library in Rust, and if so, are there any known MPI libraries for Rust? I've looked for one but I haven't found anything.
I have used several supercomputing facilities (I'm an astrophysicist) and have often faced the same problem: I know C/C++ very well but prefer to work with other languages.
In general, any approach other than MPI will do, but consider that often such supercomputers have heavily optimised MPI libraries, often tailored for the specific hardware integrated in the cluster. It is difficult to tell how much the performance of your Rust programs will be affected if you do not use MPI, but the safest bet is to stay with the MPI implementation provided on the cluster.
There is no performance penalty in using a Rust wrapper around a C library like an MPI library, as the bottleneck is the time needed to transfer data (e.g. via MPI_Send) between nodes, not the negligible cost of an additional function call. (Moreover, in Rust's case there is not even an additional function call, as already stated above.)
However, despite the very good FFI provided by Rust, it is not going to be easy to create MPI bindings. The problem lies in the fact that MPI is not a library but a specification. Popular MPI libraries are OpenMPI (http://www.open-mpi.org) and MPICH (http://www.mpich.org). Each of them differs slightly in how it implements the standard, and they usually cover such differences using C preprocessor macros. Very few FFIs are able to deal with complex macros; I don't know how Rust scores here.
For instance, I am implementing an MPI program in Free Pascal, but I am not able to use the existing MPICH bindings (http://wiki.lazarus.freepascal.org/MPICH), as the cluster I am using provides its own MPI library and I prefer to use it for the reason stated above. I was unable to reuse the MPICH bindings because they assume that constants like MPI_BYTE are hardcoded integer constants. In my case, however, they are pointers to opaque structures that seem to be created when MPI_Init is called.
The Julia bindings to MPI (https://github.com/lcw/MPI.jl) solve this problem by running C and Fortran programs during installation that generate Julia code with the correct values for such constants. See e.g. https://github.com/lcw/MPI.jl/blob/master/deps/make_f_const.f
In my case, I preferred to implement a middleware, i.e., a small C library that wraps MPI calls with a more "predictable" interface. (This is more or less what the Python and OCaml bindings do too; see https://forge.ocamlcore.org/projects/ocamlmpi/ and http://mpi4py.scipy.org.) Things are running smoothly; so far I haven't had any problems.
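For a flavour of what such a middleware looks like, here is a minimal sketch (the function names mw_init and mw_send_bytes are made up). The point is that constants like MPI_BYTE and MPI_COMM_WORLD are resolved by the C compiler against the cluster's own mpi.h, so the foreign language never has to know their values:

    /* minimal middleware sketch; function names are hypothetical */
    #include <mpi.h>

    int mw_init(int *argc, char ***argv)
    {
        return MPI_Init(argc, argv);
    }

    /* fixed-signature send: the FFI never sees MPI_Datatype or MPI_Comm */
    int mw_send_bytes(const void *buf, int count, int dest, int tag)
    {
        return MPI_Send(buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    }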
Will scheduling the program without using an MPI library hinder performance greatly?
There are lots of ways to carry out parallel computing. MPI is one, and as comments to your question indicate, you can call MPI from Rust with a bit of gymnastics.
But there are other approaches, like the PGAS family (Chapel, OpenSHMEM, Co-array Fortran), or alternative messaging like what Charm++ uses.
MPI is "simply" providing a (very useful, highly portable, aggressively optimized) messaging abstraction, but as long as you have some way to manage the parallelism, you can run anything on a cluster.
I have seen that the AMD APP SDK samples work on a machine that has only an Intel CPU.
How can this happen? How does the compiler target a different machine architecture?
Do I not need Intel's compilers to run the code on the Intel CPU?
I thought that to run an OpenCL application on specific hardware, I had to (re)compile it with that device vendor's compiler.
Where is my understanding wrong?
Firstly, OpenCL is built to work on both CPUs and GPUs. You can compile and run the same source code on either type of device. However, it's very likely that CPU code will be sub-optimal for a GPU and vice versa.
AMD hardware is 7%-14% of all x86/x64 CPUs, so AMD must develop compilers for both AMD and Intel chips to stay relevant. AMD has a history of developing compilers for both sets of chips. Conversely, Intel has developed compilers that either don't work on AMD chips or don't work well. That's no surprise.
With OpenCL, the AMD APP SDK is the most flexible: it works well on AMD and Intel CPUs and on AMD GPUs. Intel's OpenCL SDK doesn't even install on AMD x86 hardware.
If you compile an OpenCL program to a binary, you can save and reuse it as long as it matches the OpenCL platform and device that created it; if you compile for one device and use the binary on another, you are very likely to get an error.
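As a sketch of how saving a binary works in practice (a fragment, assuming `program` has already been built for a single device; error checking and the file I/O omitted):

    // query the size of the compiled binary, then fetch it
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    unsigned char *binary = (unsigned char *)malloc(size);
    unsigned char *ptrs[1] = { binary };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(ptrs), ptrs, NULL);

    // write `binary` to disk; later, reload it with
    // clCreateProgramWithBinary(), which only succeeds on a
    // matching platform/device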
The power of OpenCL lies in abstracting the underlying hardware to offer massive, parallel, heterogeneous computing power.
Some SDKs and platforms offer specific features to "optimize" the code; I honestly think such features are just marketing, and they introduce boilerplate code that makes the application less portable.
There are also some pseudo-new technologies that are just wrappers around OpenCL, or that are really similar in concept, like Intel Quick Sync.
About Intel, I should say that at first they supported the whole Core i-series and even some Core 2 Duo chips; now the new SDK supports only the 3rd-generation Core i-series. I honestly don't get their strategy. Intel is probably the last option if you want to adopt OpenCL and target the biggest possible audience; also, their SDK doesn't seem to be very good at all.
Stick with the standard and you will avoid possible legal and performance issues, and your code will also be more portable.
The bottom line is that the AMD SDK includes an OpenCL compiler targeting x86 CPUs. That means that even though you are running an Intel CPU, the generated code will run on it. It's the same concept as compiling a C program to run on an x86 CPU: it works on Intel and AMD CPUs (or any that implement the x86 instruction set).
The vendor's compiler might have specific optimizations, like user827992 mentions, but in my experience the performance of AMD's CPU compiler isn't that bad when running on an Intel CPU. I haven't tried Intel's OpenCL implementation.
It is true that for some (maybe most in the future) hardware, only the vendor's compiler will support it. AMD's SDK won't build code that will run on an NVIDIA card, and vice-versa. CPUs happen to be a bit of a special case in that the basic instruction set is so widely deployed that the CPU compiler will work on most machines you're likely to come in contact with.
I've been doing some research into OpenCL and the possibility of using it in a project. The question I have is: is there a way to run OpenCL code in a C++ application on a CPU that is not supported by the OpenCL SDKs? I know Java has Aparapi, but I'm wondering how to run OpenCL code in a C++ application without hardware that is supported by the SDKs. There is some code I would like to write as OpenCL kernels to take advantage of OpenCL's parallelism where available, but I'm unsure whether I'd be able to run it on older hardware (still x86, but not recent). Could anyone explain to me how this can be done, or whether it is even a problem at all to run OpenCL code on older systems?
Thanks,
Peter
I would say the best way to approach this is to check whether the device supports OpenCL via API calls such as clGetPlatformIDs; if you find there is no OpenCL device, run the required code as a normal C/C++ function, and otherwise run it as an OpenCL kernel. For portability, though, you need to write the program logic twice: once in a .cl file and once as a normal C/C++ method/function.
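A minimal sketch of the detection step (assuming the OpenCL ICD loader, libOpenCL, is available to link against; run_kernel and run_cpu_fallback are hypothetical stand-ins for your two code paths):

    #include <CL/cl.h>

    int have_opencl_device(void)
    {
        // ask only for the platform count; no platform handles needed
        cl_uint num_platforms = 0;
        cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
        return err == CL_SUCCESS && num_platforms > 0;
    }

    /* usage:
       if (have_opencl_device())
           run_kernel();        // OpenCL path (.cl file)
       else
           run_cpu_fallback();  // plain C/C++ path
    */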