I recently looked into the usage of GPU computation, where the usage of package seemed to be confusing.
For example, CuArrays and ArrayFire seemed to be doing the same thing, where ArrayFire seemed to be the "official" package on Nvidia developers' webpage.(https://devblogs.nvidia.com/gpu-computing-julia-programming-language )
Also, there were CUDAdrv and CUDAnative Packages..., which seemed to be confusing, as their functionality seemed to be not as straightforward as the others.
What does these packages do? Is there any difference between CuArrays and ArrayFire?
As explained in the blog post you shared, it is quite simply as given below
The Julia package ecosystem already contains quite a few GPU-related
packages, targeting different levels of abstraction as Figure 1 shows.
At the highest abstraction level, domain-specific packages like
MXNet.jl and TensorFlow.jl can transparently use the GPUs in your
system. More generic development is possible with ArrayFire.jl, and if
you need a specialized CUDA implementation of a linear algebra or deep
neural network algorithm you can use vendor-specific packages like
cuBLAS.jl or cuDNN.jl. All these packages are essentially wrappers
around native libraries, making use of Julia’s foreign function
interfaces (FFI) to call into the library’s API with minimal overhead.
CUDAdrv and CUDAnative packages are meant for directly using CUDA runtime API and writing kernels from Julia itself. I believe that is where CuArray come in handy - wrapping native Julia objects into CUDA accessible format, roughly speaking.
ArrayFire on the other hand is a generic library that wraps around all(cuBLAS, cuSparse, cuSolve, cuFFT) CUDA provided domain specific libraries into nice interface(functions). Apart from the interface to CUDA's domain specific libraries, ArrayFire by itself provides lot of other functions in the areas of statistics, image processing, computer vision etc. It has nice JIT feature where user's code is compiled to a runtime kernel - simply put. ArrayFire.jl is an language binding with some extra Julia specific improvements at wrapper level.
That's the general difference. From a developers perspective, using a library(like ArrayFire) basically takes out the burden of keeping up with CUDA API and maintaining/tweaking the kernels for optimum performance which I think takes lot of time.
PS. I am a member of ArrayFire development team.
Related
I started to learn OpenCl.
I read these links:
https://en.wikipedia.org/wiki/OpenCL
https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/os_tooling.md
https://www.khronos.org/opencl/
but I did not understand well that OpenCl is a library by including header file in source code or it is a compiler by using OpenCl C Compiler?!
It is both a library and a compiler.
The OpenCL C/C++ bindings that you include as header files and that you link against are a library. These provide the necessary functions and commands in C/C++ to control the device (GPU).
For the device, you write OpenCL C code. This language is not C or C++, but rather based on C99 and has some specific limitations (for example only 1D arrays) as well as extensions (math and vector functionality).
The OpenCL compiler sits in between the C/C++ bindings and the OpenCL C part. Using the C/C++ bindings, you run the compiler at runtime of the executable with the command clBuildProgram (C bindings) or program.build(...) (C++). It then at runtime once compiles the OpenCL C code to the device-specific assembly, which is different for every vendor. With an Nvidia GPU, you can for example look at the Compiler PTX assembly for the device.
If you know OpenGL then OpenCL works on the same principle.
Disclaimer: Self-learned knowledge ahead. I wanted to learn OpenCL and found the OpenCL term as confusing as you do. So I did some painful research until I got my first OpenCL hello-world program working.
Overview
OpenCL is an open standard - i.e. just API specification which targets heterogeneous computing hardware in particular.
The standard comprises a set of documents available here. It is up to the manufacturers to implement the standard for their devices and make OpenCL available e.g. through GPU drivers to users. Perhaps in the form of a shared library.
It is also up to the manufacturers to provide tools for developer to make applications using OpenCL.
Here's where it gets complicated.
SDKs
Manufacturers provide SDKs - software packages that contain everything the said developer needs. (See the link above). But they are specific for each - e.g. NVIDIA SDK won't work without their gpu.
ICD Loader
Because of SDKs being tied to a signle vendor, the most portable(IMHO) solution is to use what is known as Khronos' ICD loader. It is kind of "meta-driver" that will, during run-time, search for other ICDs present in the system by AMD, Intel, NVIDIA, and others; then forward them calls from our application. So, as a developer, we can develop against this generic driver and use clGetPlatformIDs to fetch the available platforms and devices. It is availble as libOpenCL.so, at least on Linux, and we should link against it.
Counterpart for OpenGL's libOpenGL, well almost, because the vast majority of OpenGL(1.1+) is present in the form of extensions and must be loaded separately with e.g. GLAD. In that sense, GLAD is very similar to the ICD loader.
Again, it does not contain any actual "computing" code, only stub implementations of the API which forward everything to the chosen platform's ICD.
Headers
We are still missing the headers, thankfully Khronos organization releases C headers and also C++ bindings. But nothing is stopping you from writing them yourself based on the official API documents. It would just be really tedious and error-prone.
Here we can find yet another parallel with OpenGL because the headers are also just the consequence of the Standard and GLAD generates them directly from its XML version! How cool is that?!
Summary
To write a simple OpenCL application we need to:
Download an ICD from the device's manufactures - e.g. up-to-date GPU drivers is enough.
Download the headers and place them in some folder.
Download, build, and install an ICD loader. It will likely need the headers too.
Include the headers, use API in them, and link against the ICD loader.
For Debian, maybe Ubuntu and others there is a simpler approach:
Download the drivers... look for <vendor>-opencl-icd, the drivers on Linux are usually not as monolithic as on Windows and might span many packages.
Install ocl-icd-opencl-dev which contains C, C++ headers + the loader.
Use the headers and link the library.
I'm about to convert some GPU kernels of my project from OpenCL/Cuda to Metal in order to run my application on Apple devices. Currently, my project was written completely in C/C++. After doing some research, I think I need to get my hand dirty with Swift or Objective-C. But to be honest, I'm not sure about this stuff because Metal language for computation and deep learning is quite new.
I know there's a library called "CoreML", but my app requires some custom kernels. My question: what is the best way to deal with low-level API of Apple devices in my situation?
The Metal Shading Language is a version of C++. I haven't had too much trouble with porting OpenCL or CUDA kernels to Metal.
Core ML only supports a limited set of layers. You can write your own custom layers, which involves writing a CPU version and optionally a GPU version (in the Metal language).
I wrote a blog post about this: http://machinethink.net/blog/coreml-custom-layers/
I am writing a machine learning toolkit to run algorithm with different settings in parallel (each process run the algorithm for one setting). I am thinking about either to use mpi4py or python's build-in multiprocessing ?
There are a few pros and cons I am considering about.
Easy-to-use:
mpi4py: It seems more concepts to learn and a bit more tricks to make it work well
multiprocessing: quite easy and clean API
Speed:
mpi4py: people say it is more low level, so I am expect it can be faster than python multiprocessing ?
multiprocessing: compared with mpi4py, much slower ?
Clean and short code:
mpi4py: seems more code to write
multiprocessing: preferred, easy to use API
The working context is I am aiming at running the code basically in one computer or a GPU server. Not really targeting at running in different machines in the network (which only MPI can do it).
And since the main goal is doing machine learning, so the parallelization is not really required to be very optimal, the key goal I want to have is to balance easy, clean and quick to maintain code base but at the same time like to exploit the benefits of parallelization.
With the background described above, is it recommended that using multiprocessing should just be enough ? Or is there a very strong reason to use mpi4py ?
By using mpi4py you can divide the task into multiple threads, but with a single computer with limited performance or number of cores the usability will be limited. However you might find it handy during training.
mpi4py is constructed on top of the MPI-1/2 specifications and provides an object oriented interface which closely follows MPI-2 C++ bindings.
MPI for Python provides MPI bindings for the Python language, allowing programmers to exploit multiple processor computing systems.
MPI for Python supports convenient, pickle-based communication of generic Python object as well as fast, near C-speed, direct array data communication of buffer-provider objects
I'm looking to create Rust implementations of some small bioinformatics programs for my research. One of my main considerations is performance, and while I know that I could schedule the Rust program to run on a grid with qsub - the cluster I have access to uses Oracle's GridEngine - I'm worried that the fact that I'm not calling MPI directly will cause performance issues with the Rust program.
Will scheduling the program without using an MPI library hinder performance greatly? Should I use an MPI library in Rust, and if so, are there any known MPI libraries for Rust? I've looked for one but I haven't found anything.
I have used several supercomputing facilities (I'm an astrophysicist) and have often faced the same problem: I know C/C++ very well but prefer to work with other languages.
In general, any approach other than MPI will do, but consider that often such supercomputers have heavily optimised MPI libraries, often tailored for the specific hardware integrated in the cluster. It is difficult to tell how much the performance of your Rust programs will be affected if you do not use MPI, but the safest bet is to stay with the MPI implementation provided on the cluster.
There is no performance penalty in using a Rust wrapper around a C library like a MPI library, as the bottleneck is the time needed to transfer data (e.g. via a MPI_Send) between nodes, not the negligible cost of an additional function call. (Moreover, this is not the case for Rust: there is no additional function call, as already stated above.)
However, despite the very good FFI provided by Rust, it is not going to be easy to create MPI bindings. The problem lies in the fact that MPI is not a library, but a specification. Popular MPI libraries are OpenMPI (http://www.open-mpi.org) and MPICH (http://www.mpich.org). Each of them differs slightly in the way they implement the standard, and they usually cover such differences using C preprocessor macros. Very few FFIs are able to deal with complex macros; I don't know how Rust scores here.
As an instance, I am implementing an MPI Program in Free Pascal but I am not able to use the existing MPICH bindings (http://wiki.lazarus.freepascal.org/MPICH), as the cluster I am using provides its own MPI library and I prefer to use this one for the reason stated above. I was unable to reuse MPICH bindings, as they assumed that constants like MPI_BYTE were hardcoded integer constants. But in my case they are pointers to opaque structures that seem to be created when MPI_Init is called.
Julia bindings to MPI (https://github.com/lcw/MPI.jl) solve this problem by running C and Fortran programs during the installation that generate Julia code with the correct values for such constants. See e.g. https://github.com/lcw/MPI.jl/blob/master/deps/make_f_const.f
In my case I preferred to implement a middleware, I.e., a small C library which wraps MPI calls with a more "predictable" interface. (This is more or less what the Python and Ocaml bindings do too, see https://forge.ocamlcore.org/projects/ocamlmpi/ and http://mpi4py.scipy.org.) Things are running smoothly, so far I haven't got any problem.
Will scheduling the program without using an MPI library hinder performance greatly?
There are lots of ways to carry out parallel computing. MPI is one, and as comments to your question indicate you can call MPI from Rust with a bit of gymnastics.
But there are other approaches, like the PGAS family (Chapel, OpenSHMEM, Co-array Fortran), or alternative messaging like what Charm++ uses.
MPI is "simply" providing a (very useful, highly portable, aggressively optimized) messaging abstraction, but as long as you have some way to manage the parallelism, you can run anything on a cluster.
The makeCluster function for the SNOW package has the different cluster types of "SOCK", "PVM", "MPI", and "NWS" but I'm not very clear on the differences among them, and more specifically which would be best for my program.
Currently I have a queue of tasks of different length going into a load balancing cluster with clusterApplyLB and am using a 64bit 32-core Windows machine.
I am looking for a brief description of the differences among the four cluster types, which would be best for my use and why.
Welcome to parallel programming. You may want to peruse the vignette of the excellent parallel package that comes with R as it gives a general introduction. It also gives you an idea of what you can or cannot do on Windows -- in short, PVM and MPI are standard parallel programming approaches supported by namesake libraries. These exists on Windows, but are less frequently used and often not as mature as their Unix counterparts.
If you want to stick with snow, your options are essentially limited to SOCK types clusters. Again, the package documentation will have pointers.