I am running a physics solver written to use hybrid OpenMP/MPI parallelization. The job manager on our cluster is SLURM. Everything works as expected when I run in pure MPI mode. However, once I try hybrid parallelization, strange things happen:
1) First I tried the following SLURM block:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
(hint: 16 is the number of physical cores per node on the cluster)
However, the simulation then runs on 4 nodes, and on each of them I see only 4 busy cores (in htop). Moreover, the solver reports that it was started on 16 cores, which I do not really understand; I would expect 8*16 = 128.
2) As the above was not successful, I added the following loop to my SLURM script:
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
omp_threads=$SLURM_CPUS_PER_TASK
else
omp_threads=1
fi
export OMP_NUM_THREADS=$omp_threads
Now the solver reports that it was started on 128 cores. But htop on the respective nodes makes it obvious that these OpenMP threads share the same cores, so the solver is extremely slow. The developer of the code told me he never used the loop I added, so there might be something wrong with it, but I do not understand why the OpenMP threads use the same cores; in htop, the threads do seem to be there. Another strange thing is that htop shows me 4 active cores per node... I would have expected either 2 (for the 2 MPI tasks per node) or, if everything went as planned, 32 (2 MPI tasks running 16 OMP threads each).
We already had an issue once because the developer uses an Intel Fortran compiler and I use a GNU Fortran compiler (mpif90 and mpifort, respectively).
Does anyone have an idea how I can make my OpenMP threads use all the available cores instead of only a few?
Some system / code info:
Linux distro: OpenSUSE Leap 15.0
Compiler: mpif90
Code: FORTRAN90
So, a few things. By using:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
you tell SLURM you want 8 tasks (i.e. MPI workers) with two of them per node, so it is normal that the code starts on 4 nodes.
Then you tell each MPI worker to use 16 OMP threads. You say:
Moreover the solver tells me it is started on 16 cores
The solver probably looks at the OMP thread count, so it is normal for it to report 16. I don't know the details of your code, but usually if you solve a problem on a grid, you split the grid into subdomains (one per MPI task) and solve each subdomain with OpenMP. So in your case you have 8 solvers running in parallel, each of them using 16 cores.
The command export OMP_NUM_THREADS=$omp_threads and the if block you added are correct (by the way, that is an if block, not a loop).
If you have 16 cores per node on the cluster, your configuration should rather be:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
So one MPI task per node, and then one OMP thread per core, instead of the current two per core, which probably just slows the code down.
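Putting the pieces together, a minimal submission-script sketch under these assumptions (the binary name `solver` is a placeholder; `OMP_PROC_BIND`/`OMP_PLACES` are standard OpenMP 4.0 variables that may help if threads are being pinned onto the same cores, which would match the htop symptoms described above):

```shell
#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

# One OpenMP thread per allocated core; default to 1 if unset.
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
    export OMP_NUM_THREADS="$SLURM_CPUS_PER_TASK"
else
    export OMP_NUM_THREADS=1
fi

# Keep threads from piling onto the same cores (OpenMP 4.0+).
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

srun ./solver   # placeholder binary name; srun handles MPI placement
```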
Finally, how do you get the htop output — do you log in to the compute nodes? That is usually not a good idea on clusters.
I know this is not a full answer, but without the actual code it is a bit hard to say more, and this was too long to post as a comment.
Related
I call cvxopt.glpk.ilp in Python 3.6.6, cvxopt==1.2.3 for a boolean optimization problem with about 500k boolean variables. It is solved in 1.5 hours, but it seems to run on just one core! How can I make it run on all or a specific set of cores?
The server with Linux Ubuntu x86_64 has 16 or 32 physical cores. My process affinity is 64 cores (I assume due to hyperthreading).
> grep ^cpu\\scores /proc/cpuinfo | uniq
16
> grep -c ^processor /proc/cpuinfo
64
> taskset -cp <PID>
pid <PID> current affinity list: 0-63
However, top shows only 100% CPU for my process, and htop shows that only one core is 100% busy (some others are slightly loaded, presumably by other users).
I set OMP_NUM_THREADS=32 and started my program again, but it still uses one core. It's a bit difficult to restart the server itself, and I don't have root access to it.
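Before blaming the solver, it can help to confirm how many OS threads the process actually spawned; a small sketch (the PID here is a placeholder — on Linux, `ps`'s `nlwp` field reports the thread count):

```shell
# Hypothetical check: count the OS threads of a running process.
# NLWP = "number of light-weight processes", i.e. threads (Linux procps).
pid=$$                                   # substitute your solver's PID here
nthreads=$(ps -o nlwp= -p "$pid" | tr -d ' ')
echo "threads: $nthreads"
```

If this stays at 1 regardless of OMP_NUM_THREADS, the solver never spawned worker threads in the first place.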
I installed cvxopt from a company's internal repo which should be a mirror of PyPI. The following libs are installed in /usr/lib: liblapack, liblapack_atlas, libopenblas, libblas, libcblas, libatlas.
An SO user writes here that GLPK is not multithreaded. It is the solver used by default, as cvxopt has no MIP solver of its own.
As cvxopt only supports GLPK as open-source mixed-integer programming solver, you are out of luck.
Alternatively you can use CoinOR's Cbc, which is usually a much better solver than GLPK while still being open source. It can also be compiled with parallelization support. See some benchmarks, which also indicate that GLPK really has no parallel support.
But as there is no support in cvxopt, you will need some alternative access point:
- your own C/C++ based wrapper
- pulp (binary install available)
- python-mip (binary install available)
- Google's ortools (binary install available)
- cylp
- cvxpy + cylp (binary install available for cvxpy; without cylp-build)
Those:
- have very different modelling styles (from completely low-level, cylp, to very high-level, cvxpy)
- I'm not sure whether all of those builds are compiled with parallelization enabled (which is needed when compiling Cbc)
Furthermore: don't expect too much gain from multithreading. It's usually far from linear speedup (as for all combinatorial optimization problems that are not based on brute force).
(Imho the GIL does not matter, as all of those are C extensions where the GIL is not in the way.)
I have a heterogeneous computing cluster that I would like to run parallel computing tasks on using OpenMPI. Since not all nodes in the cluster can run the same executable (by virtue of being heterogeneous) I would like for some of the nodes to compile their own version of the program and have Open MPI invoke that executable on those nodes. My first question is whether OpenMPI enables this kind of computing across heterogeneous architectures.
If so, my second question is how to specify which executables to run on which nodes. For example, let's say node0, node1, and node2 can run executable prog1 and node3, node4, and node5 can run executable prog2, where prog1 and prog2 are the same program but compiled for different architectures using the mpicc or mpic++ wrapper compilers.
If I wanted to run this program in parallel across all nodes would I do the following:
mpirun -n 3 --hosts node0,node1,node2 prog1 : -n 3 --hosts node3,node4,node5 prog2
If not, what would I do to achieve this effect? This post indicates that heterogeneous cluster computing is supported by OpenMPI, but that I must build OpenMPI with the --enable-heterogeneous flag. I'm not sure how to do this, since my cluster runs Arch Linux and I installed OpenMPI with pacman.
Note there is a typo (--host does not take a trailing s), so your command should be
mpirun -n 3 --host node0,node1,node2 prog1 : -n 3 --host node3,node4,node5 prog2
--enable-heterogeneous is needed so Open MPI can be run on heterogeneous systems (for example between Intel x86_64 (little-endian) and SPARCv9 (big-endian) nodes). If the Open MPI package coming with Arch Linux was not configured with this flag, then you should rebuild that package. Another option is to rebuild Open MPI yourself and install it into an alternate directory.
Last but not least, heterogeneous support is (very) lightly tested, and I strongly encourage you to use the latest Open MPI 3.0 series.
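A minimal build sketch for the alternate-directory route, assuming you start from a source tarball (the version number and install prefix are placeholders):

```shell
# After downloading and unpacking an Open MPI 3.0.x source tarball:
cd openmpi-3.0.x
./configure --prefix="$HOME/openmpi-hetero" --enable-heterogeneous
make -j"$(nproc)" && make install

# Put the alternate install first in your environment before using mpirun:
export PATH="$HOME/openmpi-hetero/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/openmpi-hetero/lib:$LD_LIBRARY_PATH"
```

The same configure flag needs to be used on every architecture you build for.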
I have a test script that takes from hours to days to run. The test script repeatedly builds a library and runs its self tests under different configurations.
On desktops and servers, the script enjoys a speedup because it uses -j N, where N is the number of cores available. It will take about 2 hours to run the test script.
On dev-boards like a LeMaker Hikey (8-core ARM64/2GB RAM) and CubieTruck (8-core ARM/2GB RAM), I can't use -j N (for even N=2 or N=4) because one file is a real monster and causes an OOM kill. In this case it can take days for the script to run.
My question is, how can I craft a make recipe that tells GNU make to handle this one source file with -j 1? Is it even possible?
I'm not sure if it is possible. It isn't clear how Make splits jobs amongst cores.
4.9 Special Built-in Target Names mentions
.NOTPARALLEL
If .NOTPARALLEL is mentioned as a target, then this invocation of make will be run serially, even if the -j option is given. Any recursively invoked make command will still run recipes in parallel (unless its makefile also contains this target). Any prerequisites on this target are ignored.
However, 5.7.3 Communicating Options to a Sub-make says:
The -j option is a special case (see Parallel Execution). If you set it to some numeric value N and your operating system supports it (most any UNIX system will; others typically won't), the parent make and all the sub-makes will communicate to ensure that there are only N jobs running at the same time between them all. Note that any job that is marked recursive (see Instead of Executing Recipes) doesn't count against the total jobs (otherwise we could get N sub-makes running and have no slots left over for any real work!)

If your operating system doesn't support the above communication, then -j 1 is always put into MAKEFLAGS instead of the value you specified. This is because if the -j option were passed down to sub-makes, you would get many more jobs running in parallel than you asked for. If you give -j with no numeric argument, meaning to run as many jobs as possible in parallel, this is passed down, since multiple infinities are no more than one.
This suggests to me there is no way to assign a specific job to a single core. It's worth giving it a shot, though.
Make the large target first,
then everything else afterwards in parallel.
.PHONY: all
all:
⋮
.PHONY: all-limited-memory
all-limited-memory:
${MAKE} -j1 bigfile
${MAKE} all
So now:
$ make -j16 all works as expected.
$ make -j4 all-limited-memory builds bigfile serially (exiting on error), then carries on to do the rest in parallel.
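A self-contained toy version of the same trick can be run anywhere GNU make 3.82+ is available (the target names and echo payloads are made up; `.RECIPEPREFIX` is only used here to avoid embedding literal tab characters):

```shell
# Build a throwaway Makefile that serializes one target, then run it with -j4.
tmp=$(mktemp -d)
cat > "$tmp/Makefile" <<'EOF'
.RECIPEPREFIX := >
.PHONY: all-limited-memory bigfile rest
all-limited-memory:
> $(MAKE) -j1 bigfile
> $(MAKE) rest
bigfile:
> @echo building bigfile serially
rest:
> @echo building the rest in parallel
EOF
out=$(make -C "$tmp" -j4 all-limited-memory)
echo "$out"
```

The outer make may run with any -j value; the first recipe line forces the sub-make for bigfile down to a single job before the rest proceeds.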
I am trying to run a program written for MPI and OpenMP on a cluster of Linux dual cores.
When I try to set the OMP_NUM_THREADS variable
export OMP_NUM_THREADS=2
I get a message
OMP_NUM_THREADS: Undefined variable.
I don't get better performance with OpenMP... I also tried:
mpiexec -n 10 -genv OMP_NUM_THREADS 2 ./binary
and omp_set_num_threads(2) inside the program, but it didn't get any better...
Any ideas?
update: when I run mpiexec -n 1 ./binary with omp_set_num_threads(2) execution time is 4s and when I run mpiexec -f machines -n 1 ./binary execution time is 8s.
I would suggest first doing echo $OMP_NUM_THREADS, and further querying the number of threads inside the program to make sure threads are actually being spawned; use the omp_get_num_threads() function for this. Further, if you're using macOS, this blog post can help:
https://whiteinkdotorg.wordpress.com/2014/07/09/installing-mpich-using-macports-on-mac-os-x/
The latter part of that post will help you successfully compile and run hybrid programs. Whether a hybrid program gets better performance or not depends a lot on contention for resources; excessive use of locks and barriers can slow the program down further. It would be great if you posted your code here for others to view and actually help you.
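One hedged guess about the "OMP_NUM_THREADS: Undefined variable." message: that exact wording is what csh/tcsh print when given Bourne-shell export syntax, so the fix may simply be to use the syntax matching your login shell. A sketch:

```shell
# Bourne-style shells (sh/bash/ksh) use export:
export OMP_NUM_THREADS=2

# csh/tcsh reject the line above with "OMP_NUM_THREADS: Undefined
# variable." -- there you would write instead:
#   setenv OMP_NUM_THREADS 2

echo "$OMP_NUM_THREADS"
```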
How do I strace all processes of MPI parallel job, started with mpiexec (MPICH2, linux)?
-o will mix up the outputs from different processes.
PS To editors who may think that MPICH is the name of the library and MPICH2 a particular version of it: MPICH2 is actually an all-new implementation of MPI, and I sometimes had to use both mpich and mpich2, so we can't replace mpich2 with mpich.
Create a wrapper around your program, which will be launched by mpiexec. Something like:
#!/bin/sh
# One log per host and per process: hostname plus the wrapper's PID.
LOGFILE="strace-$(hostname).$$"
exec strace -o "$LOGFILE" my_mpi_program "$@"
Make the wrapper executable and pass it to mpiexec in place of the program itself.
You may want to try STAT (Stack Trace Analysis Tool).
Check out the STAT Homepage.
It will give you a high-level overview of your process behavior, and works especially well in the case of a hung process.