How to use ltrace for mpi programs? - mpi

I want to use ltrace to capture the library function calls of an MPI application, but plain ltrace doesn't work and my mpirun fails.
Any ideas?

You should be able to simply use:
$ mpiexec -n 4 -other_mpiexec_options ltrace ./executable
But that will create a huge mess since the outputs from the different ranks will merge. A much better option is to redirect the output of ltrace to a separate file for each rank. Getting the rank is easy with some MPI implementations. For example, Open MPI exports the world rank in the environment variable OMPI_COMM_WORLD_RANK. The following wrapper script would help:
#!/bin/sh
ltrace --output "trace.$OMPI_COMM_WORLD_RANK" "$@"
Usage:
$ mpiexec -n 4 ... ltrace_wrapper ./executable
This will produce 4 trace files, one for each rank: trace.0, trace.1, trace.2, and trace.3.
MPICH and other MPI implementations based on it that use the Hydra process manager export PMI_RANK instead, so the script above has to be modified by replacing OMPI_COMM_WORLD_RANK with PMI_RANK. One could also write a universal wrapper that works with both families of MPI implementations.
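A minimal sketch of such a universal wrapper, assuming only the two environment variables named above. The function name mpi_rank and the "unknown" fallback are hypothetical choices for illustration, not part of either MPI implementation:

```shell
#!/bin/sh
# Print this process's MPI rank: prefer Open MPI's variable, then the
# PMI_RANK exported by MPICH/Hydra, and fall back to "unknown" (an
# arbitrary placeholder) when neither is set.
mpi_rank() {
    echo "${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-unknown}}"
}

# Trace only when a command was passed to the wrapper.
# "$@" keeps arguments containing spaces intact.
if [ "$#" -gt 0 ]; then
    exec ltrace --output "trace.$(mpi_rank)" "$@"
fi
```

It would be used exactly like the single-implementation wrapper: mpiexec -n 4 ... ltrace_wrapper ./executable.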

Related

OpenMPI specify executable for specific nodes

I have a heterogeneous computing cluster that I would like to run parallel computing tasks on using OpenMPI. Since not all nodes in the cluster can run the same executable (by virtue of being heterogeneous) I would like for some of the nodes to compile their own version of the program and have Open MPI invoke that executable on those nodes. My first question is whether OpenMPI enables this kind of computing across heterogeneous architectures.
If so, my second question is how to specify which executables to run on which nodes. For example, let's say node0, node1, and node2 can run executable prog1, and node3, node4, and node5 can run executable prog2, where prog1 and prog2 are the same program compiled for different architectures using the mpicc or mpic++ wrapper compilers.
If I wanted to run this program in parallel across all nodes would I do the following:
mpirun -n 3 --hosts node0,node1,node2 prog1 : -n 3 --hosts node3,node4,node5 prog2
If not, what would I do to achieve this effect? This post indicates that heterogeneous cluster computing is supported by OpenMPI but I must build OpenMPI with the --enable-heterogeneous flag. I'm not sure how to do this since my cluster is running ArchLinux and I installed OpenMPI with pacman.
Note there is a typo (--host does not require an ending s), so your command should be
mpirun -n 3 --host node0,node1,node2 prog1 : -n 3 --host node3,node4,node5 prog2
--enable-heterogeneous is needed so that Open MPI can run on heterogeneous systems (for example, between little-endian Intel x86_64 nodes and big-endian sparcv9 nodes). If the Open MPI package that comes with ArchLinux was not configured with this flag, you should rebuild the package. Another option is to rebuild Open MPI yourself and install it into an alternate directory.
Last but not least, heterogeneous support is (very) lightly tested, and I strongly encourage you to use the latest Open MPI 3.0 series.
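To see whether an installed Open MPI was configured with this flag, one can inspect the output of ompi_info. The sketch below assumes that ompi_info prints a line of the form "Heterogeneous support: yes" or "Heterogeneous support: no"; the exact label may differ between versions, and the function name has_hetero_support is hypothetical:

```shell
#!/bin/sh
# Read ompi_info output on stdin and succeed only if it reports
# heterogeneous support as enabled. The "Heterogeneous support" label
# is an assumption about the ompi_info output format.
has_hetero_support() {
    grep -i 'heterogeneous support' | grep -qi 'yes'
}

# Run the check only where ompi_info is actually installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_hetero_support; then
        echo "heterogeneous support: enabled"
    else
        echo "heterogeneous support: not enabled (rebuild with --enable-heterogeneous)"
    fi
fi
```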

Override -j setting for one source file?

I have a test script that takes from hours to days to run. The test script repeatedly builds a library and runs its self tests under different configurations.
On desktops and servers, the script enjoys a speedup because it uses -j N, where N is the number of cores available. It will take about 2 hours to run the test script.
On dev-boards like a LeMaker Hikey (8-core ARM64/2GB RAM) and CubieTruck (8-core ARM/2GB RAM), I can't use -j N (for even N=2 or N=4) because one file is a real monster and causes an OOM kill. In this case it can take days for the script to run.
My question is: how can I craft a make recipe that tells GNU make to handle this one source file with -j 1? Is it even possible?
I'm not sure if it is possible. It isn't clear how Make splits jobs amongst cores.
4.9 Special Built-in Target Names mentions
.NOTPARALLEL
If .NOTPARALLEL is mentioned as a target, then this invocation of make will be run serially, even if the -j option is given. Any
recursively invoked make command will still run recipes in parallel
(unless its makefile also contains this target). Any prerequisites on
this target are ignored.
However, 5.7.3 Communicating Options to a Sub-make says:
The -j option is a special case (see Parallel
Execution).
If you set it to some numeric value N and your operating system
supports it (most any UNIX system will; others typically won’t), the
parent make and all the sub-makes will communicate to ensure that
there are only N jobs running at the same time between them all.
Note that any job that is marked recursive (see Instead of Executing
Recipes) doesn’t count against the total jobs (otherwise we could get
N sub-makes running and have no slots left over for any real work!)
If your operating system doesn’t support the above communication, then
-j 1 is always put into MAKEFLAGS instead of the value you
specified. This is because if the -j option were passed down to
sub-makes, you would get many more jobs running in parallel than you
asked for. If you give -j with no numeric argument, meaning to run
as many jobs as possible in parallel, this is passed down, since
multiple infinities are no more than one.
This suggests to me there is no way to assign a specific job to a single core. It's worth giving a shot though.
Make the large target first, then everything else afterwards in parallel.
.PHONY: all
all:
	⋮
.PHONY: all-limited-memory
all-limited-memory:
	${MAKE} -j1 bigfile
	${MAKE} all
So now
$ make -j16 all
works as expected, and
$ make -j4 all-limited-memory
builds bigfile serially (exiting on error), then carries on to build the rest in parallel.

MPI OpenMp hybrid

I am trying to run a program written for MPI and OpenMP on a cluster of Linux dual cores.
When I try to set the OMP_NUM_THREADS variable
export OMP_NUM_THREADS=2
I get a message
OMP_NUM_THREADS: Undefined variable.
I don't get better performance with OpenMP... I also tried:
mpiexec -n 10 -genv OMP_NUM_THREADS 2 ./binary
and omp_set_num_threads(2) inside the program, but it didn't get any better...
Any ideas?
update: when I run mpiexec -n 1 ./binary with omp_set_num_threads(2) execution time is 4s and when I run mpiexec -f machines -n 1 ./binary execution time is 8s.
I would suggest running echo $OMP_NUM_THREADS first, and then querying the number of threads inside the program to make sure that threads are actually being spawned. Use the omp_get_num_threads() function for this, called from inside a parallel region (outside one it returns 1). Also, if you're using macOS, this blog post can help:
https://whiteinkdotorg.wordpress.com/2014/07/09/installing-mpich-using-macports-on-mac-os-x/
The latter part of that post will help you compile and run hybrid programs successfully. Whether a hybrid program gets better performance depends a lot on resource contention; excessive use of locks and barriers can slow the program down further. It would be great if you posted your code here so that others can see it and actually help you.
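The environment check suggested above can also be run on every rank, which shows whether the variable actually reaches the launched processes. This is a minimal sketch; the function name omp_threads_setting and the script name show_omp.sh used below are hypothetical:

```shell
#!/bin/sh
# Print the OMP_NUM_THREADS value this process sees,
# or "unset" if the variable is missing or empty.
omp_threads_setting() {
    echo "OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"
}

omp_threads_setting
```

Saved as show_omp.sh, it could be launched with the same options as the real program, for example mpiexec -n 2 -genv OMP_NUM_THREADS 2 ./show_omp.sh. As a guess about the asker's setup: an "Undefined variable" message is typical of csh-family shells, where the equivalent of export would be setenv OMP_NUM_THREADS 2.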

When using mpirun with R script, should I copy manually file/script on clusters?

I'm trying to understand how openmpi/mpirun handles a script file associated with an external program, here an R process (doMPI/Rmpi).
I can't imagine that I have to copy my script to each host before running something like:
mpirun --prefix /home/randy/openmpi -H clust1,clust2 -n 32 R --slave -f file.R
But apparently it doesn't work until I copy the script 'file.R' to the cluster nodes and then run mpirun. And when I do this, the results are written on the cluster nodes, whereas I expected them to be returned to the working directory on localhost.
Is there another way to send an R job from localhost to multiple hosts, including the script to be evaluated?
Thanks!
I don't think it's surprising that mpirun doesn't know details of how scripts are specified to commands such as "R", but the Open MPI version of mpirun does include the --preload-files option to help in such situations:
--preload-files <files>
Preload the comma separated list of files to the current working
directory of the remote machines where processes will be
launched prior to starting those processes.
Unfortunately, I couldn't get it to work. That may be because I misunderstood something, but I suspect the option isn't well tested: very few people use it, since parallel computing without a distributed file system is quite painful.
If --preload-files doesn't work for you either, I suggest that you write a little script that calls scp repeatedly to copy the script to the cluster nodes. There are some utilities that do that, but none seem to be very common or popular, which I again think is because most people prefer to use a distributed file system. Another option is to set up an sshfs file system.
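A minimal sketch of such a copy script. It assumes passwordless ssh to each node and that the current working directory exists under the same path on every host; the function name distribute and the DRY_RUN switch are hypothetical additions for illustration:

```shell
#!/bin/sh
# Copy one file into the current working directory path on each host.
# Usage: distribute FILE HOST...
# Set DRY_RUN=1 to print the scp commands instead of running them.
distribute() {
    file=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp $file $host:$PWD/"
        else
            # Stop at the first host that cannot be reached.
            scp "$file" "$host:$PWD/" || return 1
        fi
    done
}
```

For the command in the question this would be called as distribute file.R clust1 clust2 before running mpirun. Results written on the nodes would still have to be copied back in a similar loop.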

using strace with mpiexec

How do I strace all processes of an MPI parallel job started with mpiexec (MPICH2, Linux)?
Using -o alone mixes up the output from the different processes.
PS to editors who may think that MPICH is the name of the library and MPICH2 is just a particular version: MPICH2 is actually an all-new implementation of MPI, and I have sometimes had to use both mpich and mpich2. So we can't replace mpich2 with mpich.
Create a wrapper around your program, which will be launched by mpiexec. Something like:
#!/bin/sh
LOGFILE="strace-$(hostname).$$"
exec strace -o"$LOGFILE" my_mpi_program
You may want to try STAT (Stack Trace Analysis Tool).
Check out the STAT Homepage.
It will give you a high level overview of your process behavior, and works
especially well in the case of a hung process.

Resources