Differences between clBLAS and ViennaCL?

Differences between clBLAS and ViennaCL? - opencl

Looking at the OpenCL libraries out there I am trying to get a complete grasp of each one. One library in particular is clBLAS. Their website states that it implements BLAS level 1,2, & 3 methods. That is great but ViennaCL also has BLAS routines, linear algebra solvers, supports OpenCL and CUDA backends, and is header only. It seems to me, at the moment, that there doesn't appear to be a reason to use clBLAS over ViennaCL but I was wondering if anyone had any reasons why one would use clBLAS over ViennaCL?
Although similar, this is meant to be an extension of this previous question comparing VexCL, Thrust, and Boost.Compute.

clBlas is implemented by AMD, so one can hope that it would be faster on AMD hardware. That is usually the sole advantage of vendor BLAS implementations. Unfortunately, this seems to not be the case here.
In this talk ViennaCL authors report that due to their autotuning framework they are able to either outperform clBLAS, or show similar performance.

Related

Constraint handling, integer & parallel optimization

I have recently been assigned to a project where an optimization tool will be developed in python.
Various online search points out there are multiple libraries/platforms that come with pros and cons. As far as I have looked up with the existing openmdao framework we can not have an optimizer that can do constraint handling, mixed-integer, parallel optimization. Here with parallel it is meant that each iteration should be parallellized as in GADriver. I wanted to ask some advice from the developers considering the future possible improvements on openmdao:
Is it a good idea to look into writing a wrapper for an existing optimizer that can handle the aforementioned request or should one opt out from openmdao completely as openmdao may not be the strongest platform in this specific problem?
if writing a wrapper is a good idea i assume one should look for driver routines in the openmdao 2.2.X github. Do you have any advice for an optimizer type within python (paid or free) that can be easily compatible with openmdao.

There is an AIAA paper titled "Next generation aircraft design considering airline operations and economics", which described current state-of-the-art research into mixed integer programming problems. The approach here used a hybrid method that takes advantage of the efficient gradient based capabilities of OpenMDAO to handle larger numbers of continuous design variables.
In general, there is no limitation on mixed integer programming. You just need to write your own driver to handle it. These algorithsm are complex, but SimpleGADriver is a decent place to start to see how to run the model in parallel.

Shipping reliable OpenCL applications - Tools/Techniques/Tips?

I want to ship OpenCL code that should work on all OpenCL 1.1 compatible GPUs. Rather than buying a bunch of GPUs and testing on them, are there any tools that can help ensure reliability?
If anyone has experience shipping OpenCL applications to a wide hardware base, I'd be interested in knowing about any other methods for testing reliability.

I've a bit of knowledge on this. Unfortunately, the answer is: depends on what the kernel is doing.
My biggest gripe is with NVIDIA and OpenCL, since they don't seem to support: vectors (float2, 4, etc) and global offsets. Kind of obnoxious. Intel and ATI are both solid, but even then vector sizes can differ. The above doesn't really matter if you are doing image convolution.
It matters if you want to run AMD FFT on an NVIDIA card, are doing matrix math, etc. To address the vector issue, you can write multiple kernels that each have a different vector size and call the right one: MatrixMult_float4(...).

You can check whether your code compiles by using the AMD KernelAnalyzer2, although this does need some component of the Catalyst drivers so it only works for me on PCs with AMD GPUs. There is also the Intel Kernel Builder, which works for devices with Intel OpenCL SDK support. Nvidia's implementation has bugs in it, especially on newer GPUs in my experience so there the best is to test one GPU from each generation.

To avoid extensions and validate CL language versions, one could try to test compile the code using the LLVM, or just getting the grammar for validation, e.g. as BNF.
There's a promising open source project, which probably contains useful stuff: http://bazaar.launchpad.net/~pocl/pocl/master/files/head:/lib/CL/
However, the problems I encountered were:
Newline characters caused build breakers on certain implementations (CR, LF, CRLF) in OpenCL source files. Specifying one of these as the only valid line ending would be just stupid. If one is editing source files on different platforms in conjunction with an SCM, it could get inconvenient. So I remove comments and clean up line breaks before compilation.
Performance: Feeding the GPU efficiently using multithreading; different hardware constellations have different bottlenecks. Here I needed a client side pipeline with multiple dispatcher threads. Of course, the amount of work that remains for the CPU depends on the task or capabilities, amount and resources of computing devices. Things that needed serialized execution or dynamic loop counts have been such candidates.

BLAS sdot operation implementation in MKL library

I tested the BLAS sdot interface for single precise floating point dot operations. I found that the results of Intel MKL library are a little different from that of the BLAS fortran code given in http://netlib.org/blas/. The MKL ones appear more accurate.
I just wonder is there any optimization made by MKL? Or how does MKL implement it to make it more accurate?

Well, since the MKL is especially written by a specific CPU vendor for their own products, I guess they can use a bit more knowledge about the underlying machine than the reference implementation can.
First thoughts may be that they use optimized assembly and always keep the running sum on the x87 80bit floating point stack without rounding it down to 32bit in each iteration. Or maybe they use SSE(2) and compute the whole sum in double precision (which shouldn't make much of a difference for addition and multiplication, performance-wise). Or maybe they use a completely different computation or what black magic machine tricks ever.
The point is that these routines are far more optimized for a specific hardware than the basic reference implementation, but without seeing their implementation we cannot say in which way. The above mentioned ideas are just simple approaches.

OpenCL vs. DirectCompute?

I'm looking for comparisons between OpenCL and DirectCompute, but I haven't found anything. OpenCL's advantages of being cross-platform and having a wider range of supported GPUs don't matter to me. I'm fine with coding on Windows against DX11 GPUs only. Assuming that, what are the pros and cons of each API?
I know this question was raised before, but I'm looking for more details.
I'm not interested in CUDA, since I don't want to restrict myself to only Nvidia hardware.

Probably the biggest difference for a coder is that DirectCompute is programmed by a language which is similar to HLSL, and OpenCL is programmed via a C-like language.
Another difference to consider is that, generally, for commodity level GPUs, the DirectX support is better (faster and less buggy) than OpenGL support on Windows. This may translate to more stable support for DirectCompute, but really, this is just speculation.

Well the major advantage of OpenCL is that it is not just limited to graphics cards. You can make use of your multicore CPU, Graphics Card and potentially any number of other hardware acceleration devices (DSPs etc) all from the same program.
I'm not sure if DirectCompute allows that freedom.

The OpenCL cross-platform-ness is not just a detail, as the host code (the one calling the OpenCL API and submitting kernels) can itself be cross-platform (see link text, link text...).
Write once, run on any GPGPU, anywhere.
Otherwise the OpenCL tooling is really getting better, with an ATI Stream plugin for Visual Studio, the NVidia & ATI SDKs that contains tons of samples, etc...

Another option now is C++ AMP which gives you modern C++ syntax without a need for a seperate compiler while still preserving hardware portability. Please follow links from here for more info and feel free to post questions as you have them: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/13/c-amp-in-a-nutshell.aspx

I use OpenCL because i can easily port my App to Linux but with DirectCompute this is not possible.
I think also that the performance of the OpenCL implementation will increase with time (that it comes at the same Level like CUDA for NVidia Cards) and also that the (driver)bugs will (hopefully ;) ) be eliminated with time.

mpi under the hood

I need to deliver a presentation on programming in MPI. I need to add a segment on how MPI works under the hood. For Example What happens when I call MPI_Init?
Do you know of any good source from where I can learn these details?

The MPI Spec contains the description of the knobs, sliders, and displays that are on the outside of the "black box" of each API.
The interior details of the black boxes will be implementation dependent...and will also depend on the interconnect (e.g. TCP, IBV, DAPL, etc), the OS (e.g. is the implementation using LSB, or native libraries, etc), and on many other factors to a lesser degree (e.g. message size thresholds will trigger different code paths, and so on). Using "strace" and "ltrace" on the a.out may provide some insight into the actual goings on inside the blackbox.
The best recommendation is to pick an open source implementation and examine the code to determine the internal details.

MPI is a specification, not a particular implementation. The observable behavior is given in the MPI spec. How it works under the hood depends on the particular implementation. If you'd like to take a look at an example implementation, you might be interested in looking at MPICH2 and browsing their source code.

Complement your study of the source code of an implementation of MPI with consideration of how you would implement MPI_Init on your platform of choice. MPI sits on top of already available O/S functionality. I don't mean to suggest that you can figure out how a particular version of MPI is implemented by this approach, but to suggest that you can learn better what is going on under the hood by tackling the problem from another angle.

MPI is only a spec. MPI spec is implemented by various groups and organizations. You will want to pick one implementation, say, MPICH, and you can find their design documentation. That will tell you how the MPI spec is implemented by that group.
If you just want to describe what happens when an application written in MPI is started, you can read about MPI and MPI programming. I highly recommend http://www.citutor.org