OpenCL vs OpenMP performance [closed]

Have there been any studies comparing OpenCL to OpenMP performance? Specifically, I am interested in the overhead cost of launching threads with OpenCL, e.g., if one were to decompose the domain into a very large number of individual work items (each run by a thread doing a small job) versus heavier-weight threads in OpenMP, where the domain is decomposed into subdomains whose number equals the number of cores.
It seems that the OpenCL programming model is more targeted towards massively parallel chips (GPUs, for instance), rather than CPUs that have fewer but more powerful cores.
Can OpenCL be an effective replacement for OpenMP?

The benchmarks I've seen indicate that OpenCL and OpenMP running on the same hardware are usually comparable in performance, or that OpenMP is slightly faster. However, I haven't seen any benchmarks that I would consider conclusive, because they've mostly lacked detailed explanations of their methodology. There are a few useful things to consider:
OpenCL will always have some extra overhead when compiling the kernel at runtime. Any benchmark either needs to list this time separately, use pre-compiled native kernels, or run long enough that the kernel compilation is insignificant.
OpenCL implementations vary. GPU vendors like NVIDIA have no incentive to make sure their CPU-based OpenCL implementation is as fast as possible. None of the OpenCL implementations are likely to be as mature as a good OpenMP implementation.
The OpenCL spec says basically nothing about how CPU-based implementations use threading under the hood, so any discussion of whether the threading is relatively lightweight or heavyweight will necessarily be implementation-specific.
When you're running OpenCL code on a CPU, your work items don't have to be tiny and numerous. You can break down the problem in the same way you would for OpenMP.
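To make that concrete, here is a minimal sketch (the kernel names and the chunking scheme are mine, not from any particular implementation) of the same vector addition decomposed both ways:

    // Fine-grained: one work item per element, the typical GPU decomposition.
    __kernel void add_fine(__global const float *a,
                           __global const float *b,
                           __global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

    // Coarse-grained: one work item per contiguous chunk, OpenMP-style.
    // Launch with a global size equal to the core count; n is the length.
    __kernel void add_coarse(__global const float *a,
                             __global const float *b,
                             __global float *c,
                             const uint n)
    {
        size_t chunk = (n + get_global_size(0) - 1) / get_global_size(0);
        size_t start = get_global_id(0) * chunk;
        size_t end   = start + chunk;
        if (end > n) end = n;
        for (size_t i = start; i < end; ++i)
            c[i] = a[i] + b[i];
    }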
Even if OpenCL has a bit more overhead, there may be other reasons to prefer it.
Obviously, if your code can make good use of a GPU, you will want to have an OpenCL implementation. OpenCL performance on a CPU may be good enough that it isn't worth it to also maintain an OpenMP fallback code path for users who don't have powerful GPUs.
A good CPU-based OpenCL implementation means that you will automatically get the benefit of whatever instruction set extensions the CPU and OpenCL implementation support. With OpenMP, you have to do extra work to make sure that your executable includes both SSEx and AVX code paths.
OpenCL vector primitives can help you express some explicit parallelism without the portability and readability sacrifices you get from using SSE intrinsics.
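For example (a hedged sketch; the kernel name is mine, and it assumes the buffer length is a multiple of 4), a SAXPY written with OpenCL's float4 type maps to SSE, AVX, or NEON as the implementation sees fit, with no intrinsics in the source:

    // Each work item processes four floats at once via the float4 type.
    // The OpenCL compiler lowers this to whatever vector ISA is available.
    __kernel void saxpy4(__global const float4 *x,
                         __global float4 *y,
                         const float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];  // one float4 op = four scalar ops
    }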

I have a program which has the option to use either OpenCL or OpenMP on some key bottlenecks, basically adding vectors and performing reductions.
In my case, OpenMP takes 13 seconds where OpenCL takes 10 seconds, on the CPU (an Intel i5).
The fastest configuration for me so far is to add the vectors using OpenCL on the GPU and do the reductions in OpenMP, getting me down to 7 seconds. When I do the reduction in the OpenCL kernel, on the GPU, it takes a total of 8 seconds.
So from my experience I would say it depends on the use case, and how much you can optimize your OpenCL kernel.
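For reference, the OpenMP side of such a comparison is typically just a parallel loop with a reduction clause, along these lines (a minimal sketch, not the poster's actual code):

    #include <stddef.h>

    /* Vector addition plus a sum reduction, the two bottlenecks named above. */
    double add_and_reduce(const double *a, const double *b, double *c, size_t n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < (long)n; ++i) {
            c[i] = a[i] + b[i];   /* vector addition */
            sum += c[i];          /* reduction */
        }
        return sum;
    }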

Related

Why doesn't Intel design its SIMD ISAs in a more compatible or universal way?

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512, and IMCI on Xeon Phi. These ISAs are supported on different processors: for example, AVX-512 BW, AVX-512 DQ, and AVX-512 VL are only supported on Skylake, not on Xeon Phi, while AVX-512 F and AVX-512 CDI are supported on both Skylake and Xeon Phi, and AVX-512 ERI and AVX-512 PFI only on Xeon Phi.
Why doesn't Intel design a more universal SIMD ISA that can run on all of its advanced processors?
Also, Intel removes some intrinsics and adds new ones when developing ISAs. A lot of intrinsics have many flavours. For example, some work on packed 8-bit while some work on packed 64-bit. Some flavours are not widely supported. For example, Xeon Phi is not going to have the capability to process packed 8-bit values. Skylake, however, will have this.
Why does Intel alter its SIMD intrinsics in such an inconsistent way?
If the SIMD ISAs were more compatible with each other, existing AVX code could be ported to AVX-512 with much less effort.
I see the reason why as three-fold.
(1) When they originally designed MMX they had very little die area to work with, so they made it as simple as possible. They also did it in a way that was fully compatible with the existing x86 ISA (precise interrupts + some state saving on context switches). They hadn't anticipated that they would continually enlarge the SIMD register widths and add so many instructions. Every generation, when they added wider SIMD registers and more sophisticated instructions, they had to maintain the old ISA for compatibility.
(2) The weird thing you're seeing with AVX-512 comes from the fact that they are trying to unify two disparate product lines. Skylake is from Intel's PC/server line, so its path can be seen as MMX -> SSE/2/3/4 -> AVX -> AVX2 -> AVX-512. The Xeon Phi was based on an x86-compatible graphics card called Larrabee that used the LRBni instruction set. This is more or less the same as AVX-512, but with fewer instructions and not officially compatible with MMX/SSE/AVX/etc.
(3) They have different products for different demographics. For example, (as far as I know) the AVX-512 CD instructions won't be available in the regular Skylake processors for PCs, just in the Skylake Xeon processors used for servers, in addition to the Xeon Phi used for HPC. I can understand this to an extent, since the CD extensions are targeted at things like parallel histogram generation; this case is more likely to be a critical hotspot in servers/HPC than in general-purpose PCs.
I do agree it's a bit of a mess. Intel is beginning to see the light and is planning better for additional expansions; AVX-512 is supposedly ready to scale to 1024 bits in a future generation. Unfortunately it's still not really good enough, and Agner Fog discusses this on the Intel forums.
Personally, I would have liked to see a model that can be upgraded without the user having to recompile their code each time. For example, instead of fixing the AVX register width at 512 bits in the ISA, this should be a parameter stored in the microarchitecture and retrievable by the programmer at runtime. The user asks "what is the maximum SIMD width available on this machine?", the architecture returns XYZ, and the user has generic control flow to cope with whatever that XYZ is. This would be much cleaner and more scalable than the current technique, which uses several versions of the same function for every possible SIMD width. :-/
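In code, the proposed model would look something like this (purely hypothetical: simd_width() and simd_add_f32() are stand-ins for an ISA that exposes its register width at runtime, not real APIs):

    #include <stddef.h>

    size_t simd_width(void);                       /* hypothetical query    */
    void   simd_add_f32(float *d, const float *a,
                        const float *b, size_t w); /* hypothetical wide add */

    void add(float *d, const float *a, const float *b, size_t n)
    {
        size_t w = simd_width();   /* e.g. 8 on AVX, 16 on AVX-512: the XYZ */
        size_t i = 0;
        for (; i + w <= n; i += w)
            simd_add_f32(d + i, a + i, b + i, w);  /* one generic wide op   */
        for (; i < n; ++i)
            d[i] = a[i] + b[i];                    /* scalar remainder      */
    }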
There is SIMD ISA convergence between Xeon and Xeon Phi, and ultimately they may become identical. But I doubt you will ever get the same SIMD ISA across the whole Intel CPU line; bear in mind that it stretches from the tiny Quark SoC to Xeon Phi. It will be a long time, possibly forever, before AVX-1024 migrates from Xeon Phi to Quark or a low-end Atom CPU.
To get better portability between different CPU families, including future ones, I advise you to use higher-level concepts than bare SIMD instructions or intrinsics: OpenCL, OpenMP, Cilk Plus, C++ AMP, or an autovectorizing compiler. Quite often, they will do a good job of generating platform-specific SIMD instructions for you.
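For instance, with OpenMP 4.0 the loop below carries no ISA-specific code at all; the compiler emits SSE, AVX, or AVX-512 depending on the build target (a minimal sketch):

    /* The pragma asks the compiler to vectorize; the source never names an ISA. */
    void scale(float *y, const float *x, float a, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }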

Rust on grid computing

I'm looking to create Rust implementations of some small bioinformatics programs for my research. One of my main considerations is performance, and while I know that I could schedule the Rust program to run on a grid with qsub - the cluster I have access to uses Oracle's GridEngine - I'm worried that the fact that I'm not calling MPI directly will cause performance issues with the Rust program.
Will scheduling the program without using an MPI library hinder performance greatly? Should I use an MPI library in Rust, and if so, are there any known MPI libraries for Rust? I've looked for one but I haven't found anything.
I have used several supercomputing facilities (I'm an astrophysicist) and have often faced the same problem: I know C/C++ very well but prefer to work with other languages.
In general, any approach other than MPI will do, but bear in mind that such supercomputers often have heavily optimised MPI libraries, frequently tailored to the specific hardware integrated in the cluster. It is difficult to tell how much the performance of your Rust programs will be affected if you do not use MPI, but the safest bet is to stay with the MPI implementation provided on the cluster.
There is no performance penalty in using a Rust wrapper around a C library like an MPI library, as the bottleneck is the time needed to transfer data (e.g., via MPI_Send) between nodes, not the negligible cost of an additional function call. (Moreover, in Rust's case there is no additional function call at all, as already stated above.)
However, despite the very good FFI provided by Rust, it is not going to be easy to create MPI bindings. The problem lies in the fact that MPI is not a library but a specification. Popular MPI libraries are OpenMPI (http://www.open-mpi.org) and MPICH (http://www.mpich.org). Each differs slightly in the way it implements the standard, and they usually cover such differences using C preprocessor macros. Very few FFIs are able to deal with complex macros; I don't know how Rust scores here.
For instance, I am implementing an MPI program in Free Pascal, but I am not able to use the existing MPICH bindings (http://wiki.lazarus.freepascal.org/MPICH), as the cluster I am using provides its own MPI library and I prefer to use this one for the reason stated above. I was unable to reuse the MPICH bindings because they assumed that constants like MPI_BYTE were hardcoded integer constants, but in my case they are pointers to opaque structures that seem to be created when MPI_Init is called.
Julia bindings to MPI (https://github.com/lcw/MPI.jl) solve this problem by running C and Fortran programs during the installation that generate Julia code with the correct values for such constants. See e.g. https://github.com/lcw/MPI.jl/blob/master/deps/make_f_const.f
In my case I preferred to implement a middleware, i.e., a small C library which wraps MPI calls with a more "predictable" interface. (This is more or less what the Python and OCaml bindings do too; see https://forge.ocamlcore.org/projects/ocamlmpi/ and http://mpi4py.scipy.org.) Things are running smoothly; so far I haven't had any problems.
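Such a middleware can be very small. A sketch of the idea (the function names are illustrative, not my actual library): only plain ints and raw buffers cross the boundary, so the foreign side never needs to know whether MPI_BYTE is an integer or an opaque pointer.

    #include <mpi.h>

    int mw_init(void)     { return MPI_Init(NULL, NULL); }
    int mw_finalize(void) { return MPI_Finalize(); }

    int mw_rank(void)
    {
        int r;
        MPI_Comm_rank(MPI_COMM_WORLD, &r);
        return r;
    }

    /* MPI_BYTE and MPI_COMM_WORLD are resolved here, at C compile time,
       against whatever MPI implementation the cluster provides. */
    int mw_send_bytes(const void *buf, int count, int dest, int tag)
    {
        return MPI_Send(buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    }

    int mw_recv_bytes(void *buf, int count, int source, int tag)
    {
        return MPI_Recv(buf, count, MPI_BYTE, source, tag,
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }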
Will scheduling the program without using an MPI library hinder performance greatly?
There are lots of ways to carry out parallel computing. MPI is one, and as comments to your question indicate you can call MPI from Rust with a bit of gymnastics.
But there are other approaches, like the PGAS family (Chapel, OpenSHMEM, Co-array Fortran), or alternative messaging like what Charm++ uses.
MPI is "simply" providing a (very useful, highly portable, aggressively optimized) messaging abstraction, but as long as you have some way to manage the parallelism, you can run anything on a cluster.

Optimize mathematical library (libm)

Has anyone tried to compile glibc with -march=corei7 to see if there's any performance improvement over the version that comes by default with any Linux x86_64 distribution? GCC is compiled with -march=i686, and I think (though I'm not sure) that the mathematical library is compiled the same way. Can anybody confirm this?
Most Linux distributions for x86 compile using only i686 instructions, while asking the compiler to schedule them for later processors. I haven't really followed later developments.
A long while back, different versions of system libraries for different processor lines were common, but the performance differences were soon deemed too small to be worth the cost. And machines have become more uniform in performance in the meantime.
One thing that always has to be remembered is that today's machines are memory bound: a memory access takes a few hundred times longer than an instruction, and the gap is growing. Not to mention that this machine (an oldish laptop, top-of-the-line some 2 years back) has 4 cores (8 threads), all battling to get data/instructions from memory. Making the code run a tiny bit faster, so the CPU just waits longer for RAM, isn't very productive.
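If you want to check whether a rebuilt libm would matter for your own workload, a quick micro-benchmark of a hot function is enough (a minimal sketch; compile with -O2 -lm against each library build and compare):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        volatile double sink = 0.0;  /* keeps the calls from being optimized away */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 10000000; ++i)
            sink += exp(1e-7 * i);   /* hot libm call under test */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.3f s\n",
               (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec));
        return 0;
    }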

Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL

The peak GFLOPS of the cores for the desktop i7-4770K @ 4 GHz is 4 GHz * 8 (AVX lanes) * 4 (two FMA units * 2 flops each) * 4 cores = 512 GFLOPS, but the latest Intel IGP (Iris Pro 5100/5200) has a peak of over 800 GFLOPS. Some algorithms will therefore run even faster on the IGP, and combining the cores with the IGP would be better still. Additionally, the IGP keeps eating up more silicon; the Iris Pro 5100 takes up over 30% of the die now. It seems clear which direction Intel desktop processors are headed.
As far as I have seen, however, the Intel IGP is mostly ignored by programmers, with the exception of OpenCL/OpenGL. I'm curious to know how one can program the Intel HD Graphics hardware for compute (e.g., SGEMM) without OpenCL.
Added comment:
There is no Intel support for OpenCL on HD Graphics under Linux. I found Beignet, which is an open-source attempt to add support on Linux, at least for Ivy Bridge HD Graphics. I have not tried it. Presumably the people developing Beignet know how to program the HD Graphics hardware without OpenCL, then.
Keep in mind that there is a performance hit to copy the data to the video card and back, so this must be taken into account. AMD is close to releasing APU chips that have unified memory for the CPU and GPU on the same die, which will go a long way towards alleviating this problem.
The way the GPU used to be utilized before CUDA and OpenCL was to represent the memory to be operated on as a texture, using DirectX or OpenGL. Thank goodness we don't have to do that anymore!
AMD is really pushing the APU / OpenCL model, so more programs should take advantage of the GPU via OpenCL - if the performance trade off is there. Currently, GPU computing is a bit of a niche market relegated to high performance computing or number crunching that just isn't needed for web browsing and word processing.
It doesn't make sense any more for vendors to let you program using low-level ISA.
It's very hard and most programmers won't use it.
It keeps them from adjusting the ISA in future revisions.
So programmers use a language (like C99 in OpenCL) and the runtime does ISA-specific optimizations right on the user's machine.
An example of what this enables: AMD switched from VLIW vector machines to scalar machines and existing kernels still ran (most ran faster). You couldn't do this if you wrote ISA directly.
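The host-side mechanics of that model are visible in the OpenCL API itself: the kernel ships as C99 source text, and code generation happens on the user's machine (a minimal sketch, error handling omitted):

    #include <CL/cl.h>

    static const char *src =
        "__kernel void twice(__global float *v) {"
        "    size_t i = get_global_id(0);"
        "    v[i] = 2.0f * v[i];"
        "}";

    cl_program build_for_this_machine(cl_context ctx, cl_device_id dev)
    {
        cl_program p = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(p, 1, &dev, "", NULL, NULL); /* ISA chosen at runtime */
        return p;
    }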
Programming a coprocessor like Iris without OpenCL is rather like driving a car without a steering wheel.
OpenCL is designed to expose the requisite parallelism that Iris needs to achieve its theoretical performance. You can't just spawn hundreds of threads or processes on it and expect performance. Having blocks of threads doing the same thing, at the same time, on similar memory addresses, is the whole crux of the matter.
Maybe you can think of a better paradigm than OpenCL for achieving that goal, but until you do, I suggest you try learning some OpenCL. If you are into Python, PyOpenCL is a great place to start.
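To illustrate what "blocks of threads doing the same thing on similar addresses" buys you (a hedged sketch with made-up kernel names): adjacent work items in the first kernel run in lockstep and their memory accesses coalesce, while the second kernel makes even and odd work items take different branches, so the hardware serializes the two paths and lanes sit idle.

    __kernel void uniform_work(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = 2.0f * in[i];        // same op, contiguous addresses: fast
    }

    __kernel void divergent_work(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        if (i % 2 == 0)               // neighbouring work items diverge:
            out[i] = 2.0f * in[i];    // both branches execute serially
        else
            out[i] = in[i] / 3.0f;
    }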

Will MVAPICH be substantially better than OpenMPI? And how? [closed]

I am a computational fluid dynamicist (CFD), but I don't know MPI very well.
Heavy CFD jobs require InfiniBand support, and people say that MVAPICH is usually much better than other MPI implementations. I want to know: is this true? Does anyone have real experience and references that I can look at? And how come MVAPICH is better than OpenMPI, etc.? Is it written by an InfiniBand company, or what?
Thanks a lot!
So the answer is "probably not, and it doesn't matter anyway".
You write your code using the MPI API, and you can always install multiple MPI libraries and test against each, as you might with several LAPACK implementations. If one is consistently faster for your application, use it. But the MPI community is very concerned with performance, and the free competitors are all open source, publish their methods in papers, and publish lots of benchmarks. The friendly rivalry, combined with the openness, tends to mean that no implementation has significant performance advantages for long.
On our big x86 cluster we've done "real world" and micro benchmarks with MPICH2, OpenMPI, MVAPICH2, and Intel MPI. Amongst the three open-source versions, there was no clear winner: in some cases one would win by 10-20%, in others it would lose by the same amount. On those few occasions where we were interested enough to dig into the details to find out why, it was often just a matter of defaults for things like eager limits or the crossover between different collective algorithms, and by setting a couple of environment variables we got performance within noise between them. In other cases a performance advantage persisted, but it was either not large enough or not consistent enough for us to investigate further.
(Intel MPI, which costs significant amounts of money, was noticeably and mostly consistently faster, although what we consider the big win there was substantially improved startup times for very large jobs.)
MVAPICH was one of the first MPI implementations to really go after InfiniBand performance, after having had lots of experience with Myrinet, and it did have a significant advantage there for quite some time; there are probably benchmarks in which it still wins. But ultimately there was no consistent and important performance win, and we went with OpenMPI as our main open-source MPI option.
I would agree with Jonathan regarding the answer and add a few points from a cluster administration perspective.
As a person who at times dips into cluster administration, I would add that tuning InfiniBand on a large cluster is not an easy task. You have to make sure that the OFED stack sits well with your kernel, that the hardware is not faulty, that the switches work as expected without sustained performance issues, that the application maps correctly onto the InfiniBand topology, and lots more.
The OpenMPI stack is considerably different from MPICH/MVAPICH. I find that OpenMPI's component architecture makes it easier to find and debug issues than the more monolithic architecture of MPICH/MVAPICH.
Speaking of vendors, recall that MPICH comes from the MCS division at Argonne.
Update: since version 3.1, MPICH has supported OFED InfiniBand via the ib network module, and since 3.2 it also supports the Mellanox MXM interface.
MVAPICH is built on top of MPICH sources by the people from the department of CS&E at Ohio State.
Many hardware vendors build either on top of MPICH or MVAPICH to provide InfiniBand support for their respective HW. One example is Intel MPI. Another is Voltaire MPI.
OpenMPI is developed by several teams supported by InfiniBand switch vendors like Cisco.
HP MPI used to be another very good MPI implementation for generic clusters; it is currently available from Platform.
CFD codes don't scale well.
I can't speak directly to MVAPICH2, but I would recommend using whatever MPI is native to your cluster. So if you are using a Cray machine, you would go with Cray's MPI; it works like magic. Using your vendor's recommended MPI makes a significant difference.
To directly answer your question: if your message size falls in the short range, MVAPICH2 has a sweet spot where it beats OpenMPI, and I think your CFD codes may fall in this range. On large clusters I have found that something goes wrong with MVAPICH2's latency values when operating on over 2k PEs, but people don't run CFD on 2k PEs.
Ultimately, there is sufficient motivation to test this theory. Which code are you running, OpenFOAM or Fluent?
