MPI OpenMP hybrid - mpi

I am trying to run a program written with MPI and OpenMP on a cluster of Linux dual-core machines.
When I try to set the OMP_NUM_THREADS variable
export OMP_NUM_THREADS=2
I get a message
OMP_NUM_THREADS: Undefined variable.
and I don't get better performance with OpenMP... I also tried:
mpiexec -n 10 -genv OMP_NUM_THREADS 2 ./binary
and omp_set_num_threads(2) inside the program, but it didn't get any better...
Any ideas?
Update: when I run mpiexec -n 1 ./binary with omp_set_num_threads(2), the execution time is 4 s, and when I run mpiexec -f machines -n 1 ./binary, the execution time is 8 s.

I would suggest doing an echo $OMP_NUM_THREADS first and further querying the number of threads inside the program to make sure that threads are being spawned. Use the omp_get_num_threads() function for this. Further, if you're using macOS, then this blog post can help:
https://whiteinkdotorg.wordpress.com/2014/07/09/installing-mpich-using-macports-on-mac-os-x/
The latter part of that post will help you to successfully compile and run hybrid programs. Whether a hybrid program gets better performance or not depends a lot on contention for resources. Excessive use of locks and barriers can slow the program down further. It would be great if you posted your code here for others to view and actually help you.
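A quick sanity check from the shell, as a minimal sketch. The "Undefined variable" message in your question looks like csh/tcsh output, where environment variables are set with setenv rather than export; whether that is actually your login shell is only a guess, so both syntaxes are shown:

# sh/bash syntax:
export OMP_NUM_THREADS=2
echo $OMP_NUM_THREADS                        # should print 2 on the local machine

# csh/tcsh equivalent, in case that is your login shell:
#   setenv OMP_NUM_THREADS 2

# check that the variable actually reaches the launched MPI processes:
mpiexec -n 2 sh -c 'echo $OMP_NUM_THREADS'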

Related

How do I use hybrid OpenMP/OpenMPI parallelization together with GNU compilers?

I am running a physics solver that was written to use hybrid OpenMP/MPI parallelization. The job manager on our cluster is SLURM. Everything goes as expected when I am running in a pure MPI mode. However, once I try to use hybrid parallelization strange things happen:
1) First I tried the following SLURM block:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
(hint: 16 is the number of physical cores on the processors on the cluster)
However, what happens is that the simulation runs on 4 nodes, and on each of them I see 4 cores in use (in htop). Moreover, the solver tells me it is started on 16 cores, which I do not really understand. It should be 8*16 = 128, I think.
2) As the above was not successful, I added the following loop to my SLURM script:
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
omp_threads=$SLURM_CPUS_PER_TASK
else
omp_threads=1
fi
export OMP_NUM_THREADS=$omp_threads
What happens is that the solver now tells me it is started on 128 cores. But when using htop on the respective nodes, it becomes obvious that these OpenMP threads use the same cores, so the solver is ultra slow. The developer of the code told me he never used the loop I added, so there might be something wrong with it, but I do not understand why the OpenMP threads use the same cores. However, in htop, the threads seem to be there. Another strange thing is that htop shows me 4 active cores per node... I would have expected either 2 (for the 2 MPI tasks per node) or, if everything went as planned, 32 (2 MPI tasks running 16 OMP threads each).
We already had an issue once because the developer uses the Intel Fortran compiler and I use the GNU Fortran compiler (mpif90 and mpifort, respectively).
Does anyone have an idea how I can make my OpenMP threads use all the available cores instead of only a few?
Some system / code info:
Linux distro: OpenSUSE Leap 15.0
Compiler: mpif90
Code: FORTRAN90
So, a few things. By using:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
you tell SLURM that you want 8 tasks (i.e. MPI workers) with two of them per node, so it is normal for the code to start on 4 nodes.
Then you tell each MPI worker to use 16 OMP threads. You say:
Moreover the solver tells me it is started on 16 cores
The solver probably looks at the OMP threads, so it is normal for it to report 16. I don't know the details of your code, but usually, if you solve a problem on a grid, you split the grid into subdomains (one per MPI task) and solve each subdomain with OMP. So in your case you have 8 solvers running in parallel, each of them using 16 cores.
The command export OMP_NUM_THREADS=$omp_threads and the if block you added are correct (by the way, this is not a loop).
If you have 16 cores per node on the cluster, your configuration should rather be:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
That gives one MPI task per node and one OMP thread per core, instead of the current two per core, which probably just slows the code down.
Finally, how do you get the htop output, do you log in to the compute node? That is usually not a good idea on clusters.
I know this is not a full answer, but without the actual code it is a bit hard to tell more, and this was too long to post as a comment.
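Putting the pieces above together, a minimal batch script could look like the sketch below. The solver name ./solver is a placeholder, and whether you launch with srun or mpirun depends on how your MPI library is integrated with SLURM on your cluster:

#!/bin/bash
#SBATCH --ntasks=8             # 8 MPI tasks in total
#SBATCH --ntasks-per-node=1    # one MPI task per node -> the job spans 8 nodes
#SBATCH --cpus-per-task=16     # 16 cores reserved for each task

# one OpenMP thread per reserved core
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
else
  export OMP_NUM_THREADS=1
fi

srun ./solver                  # or: mpirun -np $SLURM_NTASKS ./solver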

Override -j setting for one source file?

I have a test script that takes from hours to days to run. The test script repeatedly builds a library and runs its self tests under different configurations.
On desktops and servers, the script enjoys a speedup because it uses -j N, where N is the number of cores available. It takes about 2 hours to run the test script.
On dev boards like a LeMaker Hikey (8-core ARM64/2GB RAM) and CubieTruck (8-core ARM/2GB RAM), I can't use -j N (even for N=2 or N=4) because one file is a real monster and causes an OOM kill. In this case it can take days for the script to run.
My question is, how can I craft a make recipe that tells GNU make to handle this one source file with -j 1? Is it even possible?
I'm not sure if it is possible. It isn't clear how Make splits jobs amongst cores.
4.9 Special Built-in Target Names mentions
.NOTPARALLEL
If .NOTPARALLEL is mentioned as a target, then this invocation of make will be run serially, even if the -j option is given. Any recursively invoked make command will still run recipes in parallel (unless its makefile also contains this target). Any prerequisites on this target are ignored.
However, 5.7.3 Communicating Options to a Sub-make says:
The -j option is a special case (see Parallel Execution). If you set it to some numeric value N and your operating system supports it (most any UNIX system will; others typically won't), the parent make and all the sub-makes will communicate to ensure that there are only N jobs running at the same time between them all. Note that any job that is marked recursive (see Instead of Executing Recipes) doesn't count against the total jobs (otherwise we could get N sub-makes running and have no slots left over for any real work!)
If your operating system doesn't support the above communication, then -j 1 is always put into MAKEFLAGS instead of the value you specified. This is because if the -j option were passed down to sub-makes, you would get many more jobs running in parallel than you asked for. If you give -j with no numeric argument, meaning to run as many jobs as possible in parallel, this is passed down, since multiple infinities are no more than one.
This suggests to me there is no way to assign a specific job to a single core. It's worth giving it a shot though.
Make the large target first, then everything else afterwards in parallel.
.PHONY: all
all:
	⋮
.PHONY: all-limited-memory
all-limited-memory:
	${MAKE} -j1 bigfile
	${MAKE} all
So now:
$ make -j16 all works as expected.
$ make -j4 all-limited-memory builds bigfile serially (stopping on error), then carries on to do the rest in parallel.
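If you would rather not touch the makefile at all, the test script can do the equivalent directly from the shell; bigfile.o below is a hypothetical name for the monster file's object target:

# build the memory-hungry target alone, then everything else in parallel
make -j1 bigfile.o && make -j4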

How to use ltrace for mpi programs?

I want to know how to use ltrace to get the library function calls of an MPI application, but plain ltrace doesn't work and my mpirun cannot succeed.
Any idea?
You should be able to simply use:
$ mpiexec -n 4 -other_mpiexec_options ltrace ./executable
But that will create a huge mess since the outputs from the different ranks will merge. A much better option is to redirect the output of ltrace to a separate file for each rank. Getting the rank is easy with some MPI implementations. For example, Open MPI exports the world rank in the environment variable OMPI_COMM_WORLD_RANK. The following wrapper script would help:
#!/bin/sh
ltrace --output trace.$OMPI_COMM_WORLD_RANK "$@"
Usage:
$ mpiexec -n 4 ... ltrace_wrapper ./executable
This will produce 4 trace files, one for each rank: trace.0, trace.1, trace.2, and trace.3.
MPICH and other MPI implementations based on it that use the Hydra process manager export PMI_RANK instead, so the script given above has to be modified with OMPI_COMM_WORLD_RANK replaced by PMI_RANK. One could also write a universal wrapper that works with both families of MPI implementations, as sketched below.
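A sketch of such a universal wrapper, which simply picks whichever rank variable happens to be set (the fallback name is an arbitrary choice):

#!/bin/sh
# pick the rank variable exported by the MPI implementation
if [ -n "$OMPI_COMM_WORLD_RANK" ]; then      # Open MPI
    RANK=$OMPI_COMM_WORLD_RANK
elif [ -n "$PMI_RANK" ]; then                # MPICH / Hydra-based implementations
    RANK=$PMI_RANK
else
    RANK="unknown.$$"                        # fall back to the wrapper's PID
fi
exec ltrace --output "trace.$RANK" "$@"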

When using mpirun with R script, should I copy manually file/script on clusters?

I'm trying to understand how Open MPI's mpirun handles a script file associated with an external program, here an R process (doMPI/Rmpi).
I can't imagine that I have to copy my script to each host before running something like:
mpirun --prefix /home/randy/openmpi -H clust1,clust2 -n 32 R --slave -f file.R
But apparently it doesn't work until I copy the script 'file.R' to the cluster nodes and then run mpirun. When I do this, the results are written on the cluster nodes, but I expected that they would be returned to the working directory of the localhost.
Is there another way to send an R job from the localhost to multiple hosts, including the script to be evaluated?
Thanks!
I don't think it's surprising that mpirun doesn't know details of how scripts are specified to commands such as "R", but the Open MPI version of mpirun does include the --preload-files option to help in such situations:
--preload-files <files>
Preload the comma separated list of files to the current working directory of the remote machines where processes will be launched prior to starting those processes.
Unfortunately, I couldn't get it to work, which may be because I misunderstood something, but I suspect it isn't well tested, since very few people use that option: it is quite painful to do parallel computing without a distributed file system.
If --preload-files doesn't work for you either, I suggest that you write a little script that calls scp repeatedly to copy the script to the cluster nodes, along the lines of the sketch below. There are some utilities that do that, but none seem to be very common or popular, which I again think is because most people prefer to use a distributed file system. Another option is to set up an sshfs file system.
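A minimal sketch of such a copy script, reusing the host names from the question and assuming passwordless ssh and an identical directory layout on the remote hosts:

#!/bin/sh
# copy the R script to the same path on every compute host, then launch
for host in clust1 clust2; do
    scp file.R "$host:$PWD/" || exit 1     # assumes $PWD exists on the remote host
done
mpirun --prefix /home/randy/openmpi -H clust1,clust2 -n 32 R --slave -f file.R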

using strace with mpiexec

How do I strace all processes of MPI parallel job, started with mpiexec (MPICH2, linux)?
-o will mix up the outputs from different processes
PS, to some editors who may think that MPICH is the name of the library and MPICH2 is a particular version of it: MPICH2 is actually an all-new implementation of MPI, and I sometimes had to use both mpich and mpich2. So we can't replace mpich2 with mpich.
Create a wrapper around your program, which will be launched by mpiexec. Something like:
#!/bin/sh
# one log file per MPI process: the host name plus this wrapper shell's PID
LOGFILE="strace-$(hostname).$$"
# exec replaces the wrapper with strace, which logs to $LOGFILE while running the program
exec strace -o"$LOGFILE" my_mpi_program
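Assuming the wrapper is saved as trace_wrapper.sh (the name is arbitrary) and made executable, it is launched in place of the program itself:

$ chmod +x trace_wrapper.sh
$ mpiexec -n 8 ./trace_wrapper.sh
# one strace log per process, e.g. strace-node01.12345, strace-node02.23456, ...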
You may want to try STAT (Stack Trace Analysis Tool).
Check out the STAT Homepage.
It will give you a high-level overview of your process behavior, and works especially well in the case of a hung process.
