Unable to use all cores with mpirun - mpi

I'm testing a simple MPI program on my desktop (Ubuntu 16.04 LTS / Intel® Core™ i3-6100U CPU @ 2.30GHz × 4 / gcc 4.8.5 / Open MPI 3.0.0) and mpirun won't let me use all four of the cores on my machine. When I run:
$ mpirun -n 4 ./test2
I get the following error:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
./test2
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
But if I run with:
$ mpirun -n 2 ./test2
everything works fine.
I've seen from other answers that I can check the number of processors with
cat /proc/cpuinfo | grep processor | wc -l
and this tells me that I have 4 processors. I'm not interested in oversubscribing, I'd just like to be able to use all my processors. Can anyone help?

Your processor has 4 hyperthreads but only 2 cores (see the specs here).
By default, Open MPI does not run more than one MPI task per core.
You can have Open MPI run up to one MPI task per hyperthread with the following option:
mpirun --use-hwthread-cpus ...
FWIW
The command you mentioned reports the number of hyperthreads.
A better way to figure out the topology of a machine is via the lstopo command from the hwloc package.
MPI tasks are not bound to cores or threads on OS X, so if you are running on a Mac, --oversubscribe -np 4 would lead to the same result.

To resolve your problem, you can pass the --use-hwthread-cpus command line option to mpirun, as already pointed out by Gilles Gouaillardet. With this option, Open MPI treats each thread provided by hyperthreading (a "hardware thread" in Open MPI's terminology) as an Open MPI processor. Without it, Open MPI treats each CPU core as an Open MPI processor, which is the default behavior.
When using --use-hwthread-cpus, Open MPI correctly determines the total number of processors available to you, that is, all processors available on all hosts specified in the Open MPI hostfile. Therefore, you do not need to specify the "-n" parameter.
With this technique you will not oversubscribe, and if some Open MPI processes run on a virtual machine, they will use the correct number of threads assigned to that virtual machine. And if your processor has more than two threads per core, as on a Xeon Phi (Knights Mill, Knights Landing, etc.), Open MPI will count all four threads per core as Open MPI processors.

Use $ lscpu. The number of cores per socket * the number of sockets gives you the number of physical cores (the ones that you can use for MPI by default), whereas the number of cores per socket * the number of sockets * the threads per core gives you the number of logical cores (the number you get from the command $ cat /proc/cpuinfo | grep processor | wc -l).
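The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names parse_lscpu and core_counts are made up for this example, and the sample text is hand-written to match the 2-core, 4-thread CPU from the question rather than captured from a real machine. The field labels are the ones lscpu normally prints.

```python
def parse_lscpu(text):
    """Return the 'Key: value' pairs of lscpu output as a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def core_counts(text):
    """Return (physical_cores, logical_cpus) computed from lscpu output."""
    fields = parse_lscpu(text)
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    physical = sockets * cores_per_socket
    return physical, physical * threads_per_core


# Hand-written sample (an assumption), shaped like lscpu output for a
# CPU with 1 socket, 2 cores and 2 hardware threads per core.
SAMPLE = """\
CPU(s):              4
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
"""

if __name__ == "__main__":
    physical, logical = core_counts(SAMPLE)
    print(f"physical cores: {physical}, logical CPUs: {logical}")
```

On a real machine you would feed it the text printed by lscpu instead of SAMPLE. The first number is what Open MPI uses as the default slot count, the second is what it uses with --use-hwthread-cpus.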

Related

"no enough slots" error of running Open MPI on databricks cluster with Linux

I am trying to use MPI to run a C application on a Databricks cluster.
I downloaded Open MPI from
https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.3.tar.gz
and installed it on the Databricks cluster.
It was built on the Databricks cluster, which runs Ubuntu.
Operating system/version: Linux 4.4.0 Ubuntu
Computer hardware: x86_64
Network type: databricks
I am trying to run it from a Python notebook on Databricks:
%sh
mpirun --allow-run-as-root -np 20 MY_c_Application
MY_c_Application is written in C and was compiled on the Databricks Linux machine.
My Databricks cluster has 21 nodes, one of which is the driver. Each node has 32 cores.
When I run the above command, I get the error below.
Could you please let me know what could cause this?
Or am I missing something?
Thanks
There are not enough slots available in the system to satisfy the 20
slots that were requested by the application:
MY_c_application
Either request fewer slots for your application, or make more slots available for use.
A "slot" is the Open MPI term for an allocatable unit where we can launch a process.
The number of slots available are defined by the environment in which Open MPI processes are run:
Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
The --host command line parameter, via a ":N" suffix on the hostname
(N defaults to 1 if not provided)
Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
If none of a hostfile, the --host command line parameter, or an RM
is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use
the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to launch.
UPDATE
After adding a hostfile, the problem is gone.
sudo mpirun --allow-run-as-root -np 25 --hostfile my_hostfile ./MY_C_APP
thanks
Sharing the answer provided by the original poster:
After adding a hostfile, the problem was resolved.
sudo mpirun --allow-run-as-root -np 25 --hostfile my_hostfile ./MY_C_APP
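For anyone wondering what such a hostfile might contain, here is a minimal sketch in Python that writes one "hostname slots=N" line per node, which is the hostfile syntax the error message above refers to. The host names worker-01 and worker-02 and the function names are hypothetical; the original poster did not share the actual hostfile, so treat the contents as an assumption.

```python
def build_hostfile(hosts, slots_per_host):
    """Return Open MPI hostfile text with one 'host slots=N' line per host."""
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def total_slots(hostfile_text):
    """Sum the explicit slots=N values in hostfile text.

    Lines without a slots= clause are not counted here, although Open MPI
    would default them to the number of processor cores on that host.
    """
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        for token in line.split()[1:]:
            if token.startswith("slots="):
                total += int(token[len("slots="):])
    return total


if __name__ == "__main__":
    # Hypothetical worker names; 32 slots matches the 32 cores per node
    # mentioned in the question.
    text = build_hostfile(["worker-01", "worker-02"], 32)
    with open("my_hostfile", "w") as f:
        f.write(text)
    print(f"wrote my_hostfile with {total_slots(text)} slots")
```

As long as the total number of slots is at least the value passed to -np, mpirun should no longer report that there are not enough slots.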

why does mpirun behave as it does when used with slurm?

I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with slurm.
If I run (in a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get as output the expected 0 and 1 printed out.
If however I salloc --time=30 --nodes=1 and run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still in compute node), I get a large error from slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other stuff), but I am not sure how to explain this either...
Now, based on this Open MPI page, it seems these kinds of operations should be supported, at least for Open MPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a slurm environment (?), but I am still wondering: what is the underlying nature of mpirun and slurm (salloc) that this is the behavior produced? Why would it print two 0s in the first "case," and what are the inconsistent task counts it talks about in the second "case"?

How to list available resources per node in MPI?

I have access to an MPI cluster. It is a pure, clean LAN cluster, with no SLURM or anything else installed except OpenMP, mpicc, and mpirun. I have sudo rights. The accessible and configured MPI nodes are all listed in /etc/hosts. I can compile and run MPI programs, but how do I get information on the MPI cluster's capabilities: total cores available, processor info, total memory, currently running tasks?
Generally, I am looking for an analog of sinfo and squeue that would work in an MPI environment.
total cores available:
total memory:
You can try Portable Hardware Locality (hwloc) to see the hardware topology and get information about total cores and total memory.
Additionally, you can get information about the CPU using lscpu or cat /proc/cpuinfo.
currently running tasks:
You can use the monitoring software nmon from IBM (it's free).
The -t option of nmon reports the top running processes (like the top command). You can use nmon in online or offline mode.
The following example is from IBM developerWorks:
nmon -fT -s 30 -c 120
This takes one "snapshot" every 30 seconds until it has collected 120 snapshots. You can then examine the output.
If you run it without -f, you will see the results live.
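For the total cores and total memory part, a rough per-node summary can also be put together with nothing but the Python standard library. This is a sketch that assumes a Linux node with /proc/meminfo; the function names are made up for illustration, and running it on every host listed in /etc/hosts (for example over ssh) is left out.

```python
import os


def mem_total_kib(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def node_summary(meminfo_path="/proc/meminfo"):
    """Return logical CPU count and total memory for the local node."""
    with open(meminfo_path) as f:
        mem = mem_total_kib(f.read())
    # os.cpu_count() counts logical CPUs (hardware threads), not cores.
    return {"logical_cpus": os.cpu_count(), "mem_total_kib": mem}


if __name__ == "__main__":
    print(node_summary())
```

Note that os.cpu_count() reports hardware threads, so on a hyperthreaded machine it will be larger than the number of physical cores that lstopo or lscpu show.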

mpi job submission on lsf cluster

I usually process data on the university's cluster. Most of my previous jobs were parallel batch shell jobs (divide the job into several batches, then submit them in parallel). An example of such a script is shown below:
#! /bin/bash
#BSUB -J model_0001
#BSUB -o z_output_model_0001.o
#BSUB -n 8
#BSUB -e z_output_model_0001.e
#BSUB -q general
#BSUB -W 5:00
#BSUB -B
#BSUB -N
some command
This time, I am testing an MPI job (based on mpi4py). The code has been tested on my laptop with a single task (1 task using 4 processors). Now I need to submit multiple tasks (30) on the cluster (1 task using 8 processors). My design is this: prepare 30 shell files similar to the one above. The command in each shell file is my MPI command (something like "mpiexec -n 8 mycode.py args"), and each shell file reserves 8 processors.
I submitted the jobs, but I am not sure if I am doing this correctly. They are running, but I am not sure whether they are running under MPI. How can I check? Here are 2 more questions:
1) For normal parallel jobs, there is usually a limit on the number of processors I can reserve for a single task: 16. Above 16, I have never succeeded. If I use MPI, can I reserve more? MPI is different, since I basically do not need continuous memory.
2) I think there is a priority rule on the cluster. For normal parallel jobs, reserving more processors per task (say 10 tasks with 16 processors per task) usually means a much longer wait in the queue than reserving fewer processors per task (say dividing each task into 8 sub-tasks (80 sub-tasks in total) with 2 processors per sub-task). If I can reserve more processors for MPI, does that affect this rule? I worry that I am going to wait forever...
Well, increasing "#BSUB -n" is exactly what you need to do. That option tells how many execution "slots" you are reserving. So if you want to run an MPI job with 20 ranks, you need
#BSUB -n 20
IIRC the execution slots do not need to be allocated on the same node; LSF will allocate slots from as many nodes as are required for the request to be satisfied. But it's been a while since I've used LSF, and I currently don't have access to a system using it, so I could be wrong (and it might depend on the local cluster's LSF configuration).
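If you go with the plan of preparing 30 similar job scripts, generating them is less error-prone than editing them by hand, because the value of "#BSUB -n" and the value of "mpiexec -n" then always match. Here is a minimal sketch in Python. The directives, queue name and file naming are copied from the script in the question; the function name and the explicit "python" in front of mycode.py are my assumptions.

```python
JOB_TEMPLATE = """\
#!/bin/bash
#BSUB -J {name}
#BSUB -o z_output_{name}.o
#BSUB -e z_output_{name}.e
#BSUB -n {ranks}
#BSUB -q general
#BSUB -W 5:00
{command}
"""


def make_job_script(index, ranks, args=()):
    """Return LSF job script text for one MPI task using `ranks` slots."""
    name = f"model_{index:04d}"
    # Assumption: mycode.py is started through the python interpreter.
    command = " ".join(
        ["mpiexec", "-n", str(ranks), "python", "mycode.py", *args]
    )
    return JOB_TEMPLATE.format(name=name, ranks=ranks, command=command)


if __name__ == "__main__":
    # Write 30 scripts, each reserving 8 slots and starting 8 ranks.
    for i in range(1, 31):
        with open(f"job_model_{i:04d}.sh", "w") as f:
            f.write(make_job_script(i, 8, ["args"]))
```

Each generated file would then be submitted in the usual way for the cluster, typically with bsub < job_model_0001.sh.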

MPI: mpiexec third parameter not clear

What exactly is the third parameter in the following MPI command?
mpiexec -n 2 cpi
Is it the number of cores? So if I am running on a Pentium 4, should I set it to 1?
-n 2: spawn two processes.
cpi: the executable.
Experiment with what is faster, one or two or more processes. Some codes run best with one process per core, some codes benefit from oversubscription.

Resources