mpi4py or multiprocessing in Python ? - mpi

I am writing a machine learning toolkit to run algorithm with different settings in parallel (each process run the algorithm for one setting). I am thinking about either to use mpi4py or python's build-in multiprocessing ?
There are a few pros and cons I am considering about.
Easy-to-use:
mpi4py: It seems more concepts to learn and a bit more tricks to make it work well
multiprocessing: quite easy and clean API
Speed:
mpi4py: people say it is more low level, so I am expect it can be faster than python multiprocessing ?
multiprocessing: compared with mpi4py, much slower ?
Clean and short code:
mpi4py: seems more code to write
multiprocessing: preferred, easy to use API
The working context is I am aiming at running the code basically in one computer or a GPU server. Not really targeting at running in different machines in the network (which only MPI can do it).
And since the main goal is doing machine learning, so the parallelization is not really required to be very optimal, the key goal I want to have is to balance easy, clean and quick to maintain code base but at the same time like to exploit the benefits of parallelization.
With the background described above, is it recommended that using multiprocessing should just be enough ? Or is there a very strong reason to use mpi4py ?

By using mpi4py you can divide the task into multiple threads, but with a single computer with limited performance or number of cores the usability will be limited. However you might find it handy during training.
mpi4py is constructed on top of the MPI-1/2 specifications and provides an object oriented interface which closely follows MPI-2 C++ bindings.
MPI for Python provides MPI bindings for the Python language, allowing programmers to exploit multiple processor computing systems.
MPI for Python supports convenient, pickle-based communication of generic Python object as well as fast, near C-speed, direct array data communication of buffer-provider objects

Related

Confusion over Nvidia GPU packages in Julia, CuArrays and ArrayFire

I recently looked into the usage of GPU computation, where the usage of package seemed to be confusing.
For example, CuArrays and ArrayFire seemed to be doing the same thing, where ArrayFire seemed to be the "official" package on Nvidia developers' webpage.(https://devblogs.nvidia.com/gpu-computing-julia-programming-language )
Also, there were CUDAdrv and CUDAnative Packages..., which seemed to be confusing, as their functionality seemed to be not as straightforward as the others.
What does these packages do? Is there any difference between CuArrays and ArrayFire?
As explained in the blog post you shared, it is quite simply as given below
The Julia package ecosystem already contains quite a few GPU-related
packages, targeting different levels of abstraction as Figure 1 shows.
At the highest abstraction level, domain-specific packages like
MXNet.jl and TensorFlow.jl can transparently use the GPUs in your
system. More generic development is possible with ArrayFire.jl, and if
you need a specialized CUDA implementation of a linear algebra or deep
neural network algorithm you can use vendor-specific packages like
cuBLAS.jl or cuDNN.jl. All these packages are essentially wrappers
around native libraries, making use of Julia’s foreign function
interfaces (FFI) to call into the library’s API with minimal overhead.
CUDAdrv and CUDAnative packages are meant for directly using CUDA runtime API and writing kernels from Julia itself. I believe that is where CuArray come in handy - wrapping native Julia objects into CUDA accessible format, roughly speaking.
ArrayFire on the other hand is a generic library that wraps around all(cuBLAS, cuSparse, cuSolve, cuFFT) CUDA provided domain specific libraries into nice interface(functions). Apart from the interface to CUDA's domain specific libraries, ArrayFire by itself provides lot of other functions in the areas of statistics, image processing, computer vision etc. It has nice JIT feature where user's code is compiled to a runtime kernel - simply put. ArrayFire.jl is an language binding with some extra Julia specific improvements at wrapper level.
That's the general difference. From a developers perspective, using a library(like ArrayFire) basically takes out the burden of keeping up with CUDA API and maintaining/tweaking the kernels for optimum performance which I think takes lot of time.
PS. I am a member of ArrayFire development team.

Rust on grid computing

I'm looking to create Rust implementations of some small bioinformatics programs for my research. One of my main considerations is performance, and while I know that I could schedule the Rust program to run on a grid with qsub - the cluster I have access to uses Oracle's GridEngine - I'm worried that the fact that I'm not calling MPI directly will cause performance issues with the Rust program.
Will scheduling the program without using an MPI library hinder performance greatly? Should I use an MPI library in Rust, and if so, are there any known MPI libraries for Rust? I've looked for one but I haven't found anything.
I have used several supercomputing facilities (I'm an astrophysicist) and have often faced the same problem: I know C/C++ very well but prefer to work with other languages.
In general, any approach other than MPI will do, but consider that often such supercomputers have heavily optimised MPI libraries, often tailored for the specific hardware integrated in the cluster. It is difficult to tell how much the performance of your Rust programs will be affected if you do not use MPI, but the safest bet is to stay with the MPI implementation provided on the cluster.
There is no performance penalty in using a Rust wrapper around a C library like a MPI library, as the bottleneck is the time needed to transfer data (e.g. via a MPI_Send) between nodes, not the negligible cost of an additional function call. (Moreover, this is not the case for Rust: there is no additional function call, as already stated above.)
However, despite the very good FFI provided by Rust, it is not going to be easy to create MPI bindings. The problem lies in the fact that MPI is not a library, but a specification. Popular MPI libraries are OpenMPI (http://www.open-mpi.org) and MPICH (http://www.mpich.org). Each of them differs slightly in the way they implement the standard, and they usually cover such differences using C preprocessor macros. Very few FFIs are able to deal with complex macros; I don't know how Rust scores here.
As an instance, I am implementing an MPI Program in Free Pascal but I am not able to use the existing MPICH bindings (http://wiki.lazarus.freepascal.org/MPICH), as the cluster I am using provides its own MPI library and I prefer to use this one for the reason stated above. I was unable to reuse MPICH bindings, as they assumed that constants like MPI_BYTE were hardcoded integer constants. But in my case they are pointers to opaque structures that seem to be created when MPI_Init is called.
Julia bindings to MPI (https://github.com/lcw/MPI.jl) solve this problem by running C and Fortran programs during the installation that generate Julia code with the correct values for such constants. See e.g. https://github.com/lcw/MPI.jl/blob/master/deps/make_f_const.f
In my case I preferred to implement a middleware, I.e., a small C library which wraps MPI calls with a more "predictable" interface. (This is more or less what the Python and Ocaml bindings do too, see https://forge.ocamlcore.org/projects/ocamlmpi/ and http://mpi4py.scipy.org.) Things are running smoothly, so far I haven't got any problem.
Will scheduling the program without using an MPI library hinder performance greatly?
There are lots of ways to carry out parallel computing. MPI is one, and as comments to your question indicate you can call MPI from Rust with a bit of gymnastics.
But there are other approaches, like the PGAS family (Chapel, OpenSHMEM, Co-array Fortran), or alternative messaging like what Charm++ uses.
MPI is "simply" providing a (very useful, highly portable, aggressively optimized) messaging abstraction, but as long as you have some way to manage the parallelism, you can run anything on a cluster.

Howto compile MPI application in "serial" mode (without using MPI compiler)?

This question might sound a bit weird...
Imagine I have an MPI application, but I don't have a system with MPI installed.
So I want to compile the application with no MPI support (1-process, 1-thread) without modifying source code.
Is that possible?
I found somewhere a "mimic_mpi.h" wrapper which is supposed to do exactly what I want. But there were some MPI functions missing in there (e.g., MPI_Cart_create, MPI_Cart_get, etc.), so I didn't succeed.
mimic_mpi.h http://openmx.sourcearchive.com/documentation/3.2.4.dfsg-3/mimic__mpi_8h-source.html
mimic_mpi.c http://openmx.sourcearchive.com/documentation/3.2.4.dfsg-3/mimic__mpi_8c-source.html
Do you know any other approach I could use to compile MPI apps with no MPI support?
Thanks in advance!
You can run a "real" MPI application easily with a single process. In practice this even works without using mpiexec/mpirun although I'm not sure if that's officially supported. That said a full and confirming 1-process MPI "serial" implementation would probably become rather complex and its own library - so in that case, why not just use a real full MPI implementation?
I hope you see the circle I'm trying to draw:
If you want full MPI behavior, just use an MPI implementation - regardless if it's just limited to a single process.
In practice, applications that want to be able to function with or without MPI often seem to use their own MPI abstractions using domain specific communication wrappers, #ifdef HAVE_MPI or more complex macros.

Thread Building Block library or MPI ? which one is better for me?

I have planned to learn parallel computing. Now I'm thinking of MPI or TBB. In fact, I do not have much experience at this. I suppose I better start with something easy to manage. At first, I may try something like coarse-grained code. Which one could be easier for me? Thanks.
It depends on your definition of parallelism and what you are trying to acheive. The point of TBB is to take advantage of multi-core processors. In this scenario, parallelism means running on multiple cores simultaneously.
However, the main benefit of MPI is distributed memory parallel computing. In this scenario, running your application on a cluster of different physical machines and communicating with each other over TCP or another proprietary protocol.
In summary, depending on your problem space one or the other may be better. It really depends on the problem you are trying to solve.
MPI and TBB are very, very different in approach.
MPI is basically a message passing library. This is more difficult to use for developing a threaded application, since it enforces much stricter isolation rules. It does, however, allow you to scale up to multiple processes, running on multiple systems.
TBB, on the other hand, is really geared towards making threading within a single application much simpler and more approachable. It is very similar in approach to Microsoft's concurrency runtime, and even the TPL in the .NET world.

Is the PVM (parallel virtual machine) library widely used in HPC?

Has everyone migrated to MPI (message passing interface) or is PVM still widely used in supercomputers and HPC?
My experience is that PVM is not widely utilized in high-performance computing. MPI seems widely used and something like co-array Fortran might be the path forward for massively parallel systems of the future.
I use a library called InterComm to couple physics models together as separate executables. InterComm currently utilizes PVM for communication between these coupled models. PVM and InterComm boast that they work on homogeneous and heterogeneous network environments (I've been told MPI does not support heterogeneous compute/network environments). However, this is a feature that we've never used (and I highly doubt we ever will).
I have had a difficult time running PVM on academic compute environments. Some sys-admin/support-type people at reputable national computing centers have even suggested that we "simply" re-code our 20 year-old O(10^4) line code to use MPI because of issues we ran into while porting the code to a particular supercomputer in which the router/queing environment didn't like launching multiple parallel executables alongside PVM.
If you're at the architecture/design stage of a project, I'd recommend staying away from PVM unless you need to work on heterogeneous compute/network environments!
It may be highly site-dependent but in my experience MPI completely
dominates PVM in the (academic at least) HPC space. You can't
realistically launch a new HPC interconnect without MPI support but
PVM seems to be decidedly optional. Is there a PVM implementation for
Infiniband for instance?

Resources