Julia BenchmarkTools MPI

I want to benchmark an MPI function, specifically the main() function in https://github.com/mcreel/JuliaMPIMonteCarlo.jl. I am using
@benchmark main()
but each process reports its own time. Does BenchmarkTools have support for parallel MPI jobs?

Related

Disallow mpi4py to interfere with internal handling of mpi by a python API

I am using mpiexec on a cluster to run large-scale simulations with pyNEST (mpiexec -n $N python simulate.py). I export a large number of small files, which often exceeds my inode quota on the cluster. So I am trying to reduce the number of exported files by letting only one of the MPI processes (the "mother") export, after gather()-ing the data of interest. In theory this works fine. However, when I try it with pyNEST (v2.20.2), merely importing mpi4py (which implicitly calls MPI.Init()) interferes with the internal MPI handling of the pyNEST API. Instead of N processes, the API sees only one MPI process, which crashes the kernel because it expects N processes (you have to specify N explicitly in your pyNEST code).
Is there a way to prevent mpi4py from interfering with the API's internal MPI mechanism?
Alternatively, can you suggest a file format that supports parallel writes?
I've looked into zarr; however, parallel zarr writes are only optimal if the chunk sizes are uniform. In my case, the chunk sizes (data exported per MPI process) are never uniform. Sometimes their lengths differ by more than 10x, and I cannot predict what the chunk sizes will be.
You can avoid the implicit MPI_Init:
import mpi4py
mpi4py.rc.initialize = False
mpi4py.rc.finalize = False
from mpi4py import MPI
You then need to call MPI.Init() and MPI.Finalize() explicitly, or, I guess, let pyNEST do so.
File I/O: MPI has its own parallel file I/O (MPI-IO). Alternatively, check out HDF5, a very popular format that also supports parallel writes.
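Non-uniform chunk sizes need not prevent writing to a single shared file: if every process knows how many bytes the lower-ranked processes write, it can write at its own offset. The following is only a minimal sketch of that offset arithmetic in plain Python, with a list standing in for the MPI ranks. The function names chunk_offsets and write_shared are hypothetical, and no real MPI-IO or HDF5 call is made; in an actual MPI program the offsets would come from a collective prefix sum over the ranks.

```python
from itertools import accumulate


def chunk_offsets(sizes):
    """Return the starting byte offset of each rank's chunk in a shared file.

    sizes[i] is the number of bytes rank i wants to write. The offset of
    rank i is the sum of the sizes of all lower ranks (exclusive prefix sum).
    """
    return list(accumulate(sizes, initial=0))[:-1]


def write_shared(chunks):
    """Simulate every rank writing its chunk at its own offset in one buffer.

    chunks[i] is the bytes object produced by (hypothetical) rank i.
    """
    sizes = [len(c) for c in chunks]
    offsets = chunk_offsets(sizes)
    buf = bytearray(sum(sizes))
    for data, off in zip(chunks, offsets):
        buf[off:off + len(data)] = data  # each rank touches only its own slice
    return bytes(buf), offsets


if __name__ == "__main__":
    # Three ranks with very different chunk lengths.
    shared, offs = write_shared([b"aaa", b"b", b"ccccc"])
    print(offs, shared)
```

Because each rank writes to a disjoint byte range, the writes do not need to be the same length, which is the property the question asks for.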

Can we call arrayFire fft function inside opencl Kernel

I want to execute an FFT on the GPU, and I am using the ArrayFire library for that. Since whatever we write inside an OpenCL kernel executes on the GPU (the specified device), can we call the fft function inside an OpenCL kernel?
ArrayFire is a high level library that allows users to compute on GPUs without having to write kernels. ArrayFire provides a high level API for this purpose.
It is NOT possible to call an ArrayFire function from inside a kernel: ArrayFire functions are host-side calls that launch their own kernels on the device, so they cannot be invoked from device code.

Intel OpenCL: Tools for looking timeline of concurrent kernel execution

In the case of CUDA, Nsight gives a detailed timeline of each kernel.
Is there a similar tool for Intel OpenCL? Basically, I want to see whether my three kernels are running concurrently or not.

How to kill a doMC worker when it's done?

The documentation for doMC seems very sparse, listing only doMC-package and registerDoMC(). The problem I'm encountering is I'll spawn several workers via doMC/foreach, but then when the job is done they just sit there taking up memory. I can go and hunt their process IDs, but I often kill the master process by accident.
library(doMC)
library(foreach)
registerDoMC(32)
foreach(i=1:32) %dopar% foo()
##kill command here?
I've tried following with registerDoSEQ() but it doesn't seem to kill off the processes.
The doMC package is basically a wrapper around the mclapply function, and mclapply forks workers that should exit before it returns. It doesn't use persistent workers like the snow package or the snow-derived functions in the parallel package, so it doesn't need a function like stopCluster to shut down the workers.
Do you see the same problem when using mclapply directly? Does it work any better when you call registerDoMC with a smaller value for cores?
Are you using doMC from an IDE such as RStudio or R.app on a Mac? If so, you might want to try using R from a terminal to see if that makes a difference. There could be a problem calling fork in an IDE.
I never did find a suitable solution for doMC, so for a while I've been doing the following:
library(doParallel)
cl <- makePSOCKcluster(4) # number of cores to use
registerDoParallel(cl)
## computation
stopCluster(cl)
Works every time.
If you are using the doParallel package and registered it with a number, as in registerDoParallel(8),
you can use unloadNamespace("doParallel") to kill the worker processes.
And if you have a name for the cluster, you can use stopCluster(cl) to remove the extra workers.
By using registerDoSEQ() you simply register the sequential worker, so all parallel workers should stop. This is not a complete solution, but it should work in some cases.

How to implement a program in openCL using MPI on a single cpu machine

I'm new to GPU programming. I have a laptop without a graphics card, and I want to develop a matrix multiplication program with Intel OpenCL and implement this application using MPI.
Any guidelines and helpful links would be appreciated.
I'm confused about the MPI part: do we have to write code for MPI, or do we use an existing MPI implementation to run our application?
This is the project proposal for what I want to do:
GPU cluster computation (C++, OpenCL and MPI)
Study MPI for distributing the problem
Implement OpenCL apps on a single machine (matrix multiplication/ 2D image processing)
Implement apps with MPI (e.g. large 2D image processing)
So the thing to understand is that MPI and OpenCL for your purposes are completely orthogonal. MPI is for communicating between your GPU nodes; OpenCL is for accelerating your local computation on a single node by using the GPU (or multiple CPU cores). For any of these problems, you'd start by writing a serial C++ version of the code. The next step would be to (in any order) work on an OpenCL implementation for a single node, and work on an MPI version which decomposes the problem (you don't want to use master-slave for any of the problems listed above) onto multiple processes, with each process doing its local part of the computation, which contributes to the global solution. Once both of those parts are done, you'd merge the two and have a distributed-memory (the MPI part) GPU (the OpenCL part) version of a code to solve this problem.
It won't quite be that easy, of course, and combining the two will take a fair bit of work, but that's the basic approach to keep in mind. Start with one problem, get it working on a single processor in C++, then try it with one or the other. Don't try to do everything at once or you'll never get anywhere.
For problems like matrix multiplication, there are many many examples on the internet of both GPU and MPI implementations to learn from.
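As a rough illustration of the decomposition idea for matrix multiplication, the sketch below splits the rows of A into contiguous blocks, one per process, and has each block multiplied by the full matrix B. It is plain Python that loops over the "ranks" serially, so no MPI or OpenCL is involved; the function names row_ranges, local_matmul and distributed_matmul are hypothetical. In a real program the loop body would run on separate MPI processes, local_matmul would be the part accelerated with OpenCL, and the final concatenation would be a gather.

```python
def row_ranges(n_rows, n_procs):
    """Split n_rows into n_procs contiguous ranges whose sizes differ by at most one."""
    base, extra = divmod(n_rows, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def local_matmul(a_rows, b):
    """Multiply a block of rows of A by the full matrix B (the per-process work)."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def distributed_matmul(a, b, n_procs):
    """Simulate the decomposition serially: each 'rank' computes its own rows of C."""
    c = []
    for start, stop in row_ranges(len(a), n_procs):
        c.extend(local_matmul(a[start:stop], b))  # stands in for a gather
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    print(distributed_matmul(a, b, n_procs=2))
```

Every process does the same kind of work on its own rows, so there is no dedicated master handing out tasks.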
Simplified:
MPI is a library for communicating processes, but also a platform for running applications on a cluster. You write a program that uses the MPI library, and that program is then launched with MPI. MPI starts N instances of the application across the cluster and lets those instances communicate with each other through messages.
What tasks the instances perform, whether the workers are identical or different, and the topology are all up to you.
I can think of 3 ways to use OpenCL and MPI together:
MPI starts (k+1) instances, one master and k slaves. The master splits the data into chunks and the slaves process the data on the GPUs using OpenCL. All slaves are the same.
MPI starts (k+1) instances, one master and k slaves. Each slave computes a specialized problem (slave 1 matrix multiplication, slave 2 block compression, etc.) and the master routes the data through them as a workflow.
MPI starts (k+1) instances, one master and k slaves. Same as case 1, but the master also sends the slaves the OpenCL program used to process the data.
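The first of these layouts can be sketched as follows. This is only an illustration in plain Python: a thread pool stands in for the k MPI worker instances, and process_chunk is a hypothetical placeholder for whatever the OpenCL kernel would compute. None of the names below come from MPI or OpenCL.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, k):
    """Master side: cut data into k contiguous chunks of near-equal length."""
    base, extra = divmod(len(data), k)
    chunks, start = [], 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


def process_chunk(chunk):
    """Worker side: hypothetical placeholder for the OpenCL computation."""
    return [x * x for x in chunk]


def run_master(data, k):
    """Hand one chunk to each of k workers, then collect the results in order."""
    chunks = split_into_chunks(data, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(process_chunk, chunks))
    # Flatten the per-worker results back into one list.
    return [y for part in results for y in part]


if __name__ == "__main__":
    print(run_master(list(range(7)), k=3))
```

With real MPI, split_into_chunks and the final flattening would correspond to a scatter and a gather issued by the master process.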
