I have a heterogeneous computing cluster that I would like to run parallel computing tasks on using OpenMPI. Since not all nodes in the cluster can run the same executable (by virtue of being heterogeneous) I would like for some of the nodes to compile their own version of the program and have Open MPI invoke that executable on those nodes. My first question is whether OpenMPI enables this kind of computing across heterogeneous architectures.
If so, my second question is how to specify which executables to run on which nodes. For example, let's say node0, node1, and node2 can run executable prog1, and node3, node4, and node5 can run executable prog2, where prog1 and prog2 are the same program but compiled for different architectures using the mpicc or mpic++ wrapper compilers.
If I wanted to run this program in parallel across all nodes would I do the following:
mpirun -n 3 --hosts node0,node1,node2 prog1 : -n 3 --hosts node3,node4,node5 prog2
If not, what would I do to achieve this effect? This post indicates that heterogeneous cluster computing is supported by OpenMPI but I must build OpenMPI with the --enable-heterogeneous flag. I'm not sure how to do this since my cluster is running ArchLinux and I installed OpenMPI with pacman.
Note there is a typo in your command: the option is --host, with no trailing s. Your command should be
mpirun -n 3 --host node0,node1,node2 prog1 : -n 3 --host node3,node4,node5 prog2
--enable-heterogeneous is needed so Open MPI can be run on heterogeneous systems (for example between Intel x86_64 (little endian) nodes and sparcv9 (big endian) nodes). If the Open MPI package shipped with Arch Linux was not configured with this flag, then you should rebuild that package. Another option is to rebuild Open MPI yourself and install it into an alternate directory.
Last but not least, heterogeneous support is (very) lightly tested, and I strongly encourage you to use the latest Open MPI 3.0 series.
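If the MPMD command line above gets unwieldy, Open MPI also accepts an application context file via --app, with one line per program block (the host and program names below are the ones from the question):

```shell
# Write an app-context file with one line per architecture-specific build
cat > appfile <<'EOF'
-n 3 --host node0,node1,node2 prog1
-n 3 --host node3,node4,node5 prog2
EOF

# A single launch then starts both executables as one MPI job:
#   mpirun --app appfile
cat appfile
```

This is equivalent to the colon-separated command line, just easier to maintain when the number of program blocks grows.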
Related
I call cvxopt.glpk.ilp in Python 3.6.6 (cvxopt==1.2.3) for a boolean optimization problem with about 500k boolean variables. It is solved in 1.5 hours, but it seems to run on just one core! How can I make it run on all cores, or on a specific set of cores?
The server with Linux Ubuntu x86_64 has 16 or 32 physical cores. My process affinity is 64 cores (I assume due to hyperthreading).
> grep ^cpu\\scores /proc/cpuinfo | uniq
16
> grep -c ^processor /proc/cpuinfo
64
> taskset -cp <PID>
pid <PID> current affinity list: 0-63
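For reference, the same affinity information is available from inside the process via Python's standard library (a Linux-specific call):

```python
import os

# Set of CPUs the current process is allowed to run on (Linux-only API);
# its size should match the taskset affinity list shown above.
allowed = os.sched_getaffinity(0)
print(len(allowed))
```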
However top shows only 100% CPU for my process, and htop shows that only one core is 100% busy (some others are slightly loaded presumably by other users).
I set OMP_NUM_THREADS=32 and started my program again, but still one core. It's a bit difficult to restart the server itself. I don't have root access to the server.
I installed cvxopt from a company's internal repo which should be a mirror of PyPI. The following libs are installed in /usr/lib: liblapack, liblapack_atlas, libopenblas, libblas, libcblas, libatlas.
Here an SO user writes that GLPK is not multithreaded. GLPK is the solver used by default, as cvxopt has no MIP solver of its own.
As cvxopt only supports GLPK as open-source mixed-integer programming solver, you are out of luck.
Alternatively you can use CoinOR's Cbc, which is usually a much better solver than GLPK while still being open-source, and which can be compiled with parallelization. Some benchmarks also indicate that GLPK really has no parallel support.
But as there is no support in cvxopt, you will need some alternative access-point:
own C/C++ based wrapper
pulp (binary install available)
python-mip (binary install available)
Google's ortools (binary install available)
cylp
cvxpy + cylp (binary install available for cvxpy; without cylp-build)
Those:
have very different modelling styles (from completely low-level, cylp, to very high-level, cvxpy)
I'm not sure if all those builds are compiled with parallel support enabled (which has to be chosen when compiling Cbc)
Furthermore: don't expect too much gain from multithreading. The speedup is usually far from linear, as for all combinatorial-optimization methods that are not based on brute force.
(Imho the GIL does not matter as all those are C-extensions where the GIL is not in the way)
I have an MPI application which has a command line option -ss to specify an argument. I've been running this successfully on various Cray machines, including ARCHER (www.archer.ac.uk) an XC30, for years. The OS was recently upgraded and as part of this ALPS was upgraded to version 5.1.1-2.0501.8507.1.1
Now when I launch the program on the compute nodes with aprun, the program receives the option as --ss.
Checking with a shell script instead of a full application
#!/bin/bash
echo $*
confirms that this option is getting double-dashed by aprun.
Clearly there is a bug in aprun (I've reported it) but how can I work around the issue until this is patched?
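Until the fix is available, one possible workaround (a sketch; my_mpi_program stands in for the real binary) is to let aprun launch a small wrapper that restores the mangled option before handing off to the program:

```shell
#!/bin/bash
# Sketch of a wrapper for aprun: undo the spurious extra dash that the
# buggy ALPS version prepends to the -ss option.
fix_args() {
    local a
    for a in "$@"; do
        case "$a" in
            --ss) printf '%s\n' '-ss' ;;   # restore the single-dash form
            *)    printf '%s\n' "$a" ;;
        esac
    done
}

# In the real wrapper you would finish with something like:
#   exec ./my_mpi_program $(fix_args "$@")
fix_args --ss 0.5
```

If any arguments can contain whitespace, collect the rewritten values into a bash array instead and pass them as "${fixed[@]}" to preserve word boundaries.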
I am trying to run a program written for MPI and OpenMP on a cluster of Linux dual cores.
When I try to set the OMP_NUM_THREADS variable
export OMP_NUM_THREADS=2
I get a message
OMP_NUM_THREADS: Undefined variable.
I don't get a better performance with OpenMP... I also tried:
mpiexec -n 10 -genv OMP_NUM_THREADS 2 ./binary
and omp_set_num_threads(2) inside the program, but it didn't get any better...
Any ideas?
Update: when I run mpiexec -n 1 ./binary with omp_set_num_threads(2), the execution time is 4 s; when I run mpiexec -f machines -n 1 ./binary, the execution time is 8 s.
I would suggest doing an echo $OMP_NUM_THREADS first, and further querying the number of threads inside the program to make sure that threads are actually being spawned. Use the omp_get_num_threads() function for this. Furthermore, if you're using macOS then this blog post can help:
https://whiteinkdotorg.wordpress.com/2014/07/09/installing-mpich-using-macports-on-mac-os-x/
The latter part of that post will help you successfully compile and run hybrid programs. Whether a hybrid program gets better performance or not depends a lot on contention for resources; excessive use of locks and barriers can further slow the program down. It would be great if you posted your code here for others to view and to actually help you.
I want to know how to use ltrace to get the library function calls of an MPI application, but simply prefixing the command with ltrace doesn't work and my mpirun cannot succeed.
Any idea?
You should be able to simply use:
$ mpiexec -n 4 -other_mpiexec_options ltrace ./executable
But that will create a huge mess since the outputs from the different ranks will be interleaved. A much better option is to redirect the output of ltrace to a separate file for each rank. Getting the rank is easy with some MPI implementations. For example, Open MPI exports the world rank in the environment variable OMPI_COMM_WORLD_RANK. The following wrapper script would help:
#!/bin/sh
exec ltrace --output "trace.$OMPI_COMM_WORLD_RANK" "$@"
Usage:
$ mpiexec -n 4 ... ltrace_wrapper ./executable
This will produce 4 trace files, one for each rank: trace.0, trace.1, trace.2, and trace.3.
MPICH and other MPI implementations based on it that use the Hydra process manager export PMI_RANK instead, so the script above has to be modified with OMPI_COMM_WORLD_RANK replaced by PMI_RANK. One could also write a universal wrapper that works with both families of MPI implementations.
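Such a universal wrapper could be sketched as follows (the no-argument guard is only there so the rank-selection logic can be exercised without actually invoking ltrace):

```shell
#!/bin/sh
# Universal per-rank ltrace wrapper: use whichever rank variable the MPI
# implementation exports (Open MPI or Hydra-based MPICH), defaulting to 0.
rank="${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-0}}"

if [ "$#" -gt 0 ]; then
    exec ltrace --output "trace.$rank" "$@"
fi
echo "rank=$rank"
```

Invoked as mpiexec -n 4 ./ltrace_wrapper ./executable, it writes trace.0 through trace.3 regardless of which of the two MPI families launched it.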
How do I strace all processes of MPI parallel job, started with mpiexec (MPICH2, linux)?
-o will mess outputs from different processes
PS To editors who may think that MPICH is the name of the library and MPICH2 just a particular version: MPICH2 is actually an all-new implementation of MPI, and I sometimes had to use both mpich and mpich2. So we can't replace mpich2 with mpich.
Create a wrapper around your program, which will be launched by mpiexec. Something like:
#!/bin/sh
LOGFILE="strace-$(hostname).$$"
exec strace -o "$LOGFILE" my_mpi_program "$@"
You may want to try STAT (Stack Trace Analysis Tool).
Check out the STAT Homepage.
It will give you a high-level overview of your process behavior, and works especially well in the case of a hung process.