"A request was made to bind to that would result in binding more processes than cpus on a resource" mpirun command (for mpi4py)

"A request was made to bind to that would result in binding more processes than cpus on a resource" mpirun command (for mpi4py) - mpi

I am running OpenAI baselines, specifically the Hindsight Experience Replay code. (However, I think this question is independent of the code and is an MPI-related one, hence why I'm posting on StackOverflow.)
You can see the README there but the point is, the command to run is:
python -m baselines.her.experiment.train --num_cpu 20
where the number of CPUs can vary and is for MPI.
I am successfully running the HER training script with 1-4 CPUs (i.e., --num_cpu x for x=1,2,3,4) on a single machine with:
Ubuntu 16.04
Python 3.5.2
TensorFlow 1.5.0
One TitanX GPU
The number of CPUs seems to be 8 as I have a quad-core i7 Intel processor with hyperthreading, and Python confirms that it sees 8 CPUs.
(py3-tensorflow) daniel#titan:~/baselines$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import os, multiprocessing
In [2]: os.cpu_count()
Out[2]: 8
In [3]: multiprocessing.cpu_count()
Out[3]: 8
Unfortunately, when I run with 5 or more CPUs, I get this message blocking the code from running:
(py3-tensorflow) daniel#titan:~/baselines$ python -m baselines.her.experiment.train --num_cpu 5
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: titan
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
And here's where I got lost. There's no error message or line of code that I need to fix. I am therefore unsure about where I even add overload-allowed in the code?
The way this code works at a high level is that it takes in this argument and uses the python subprocess module to run an mpirun command. However, checking mpirun --help on the command line doesn't reveal overload-allowed as a valid argument.
Googling this error message leads to questions in the openmpi repository, for instance:
https://github.com/open-mpi/ompi/issues/626 (seems to have died out without resolving issue)
https://github.com/open-mpi/ompi/issues/2158 (not sure how this relates to my issue, didn't get clear resolution)
But I'm not sure if it's an OpenMPI thing or an mpi4py thing?
Here's pip list in my virtual environment if it helps:
(py3.5-mpi-practice) daniel#titan:~$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
decorator (4.2.1)
ipython (6.2.1)
ipython-genutils (0.2.0)
jedi (0.11.1)
line-profiler (2.1.2)
mpi4py (3.0.0)
numpy (1.14.1)
parso (0.1.1)
pexpect (4.4.0)
pickleshare (0.7.4)
pip (9.0.1)
pkg-resources (0.0.0)
pprintpp (0.3.0)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
setuptools (20.7.0)
simplegeneric (0.8.1)
six (1.11.0)
traitlets (4.3.2)
wcwidth (0.1.7)
So, TL;DR:
How do I fix this error in my code?
If I add the "overload-allowed" thing, what happens? Is it safe?
Thanks!

overload-allowed is a qualifier that is passed to --bind-to parameter of mpirun (source).
mpirun ... --bind-to core:overload-allowed
Beware that hyperthreading thing is more about marketing than about performance bonuses.
Your i7 can actually have four silicon cores and four "logical" ones. The logical ones basically try to use resources of the silicon cores that are currently unused. The problem is that a good HPC program will use 100% of the CPU hardware, and hyperthreading won't have resources to successfully operate.
So, it is safe to "overload" "cores", but it's not a performance boost candidate #1.
Regarding the advice that the paper authors give about reproducing the results. In the best case less cpus just means slow learning. However, if learning doesn't converge to an expected value no matter how hyperparams are tweaked, then it is a reason to look closer at the proposed algorithm.
While IEEE754 computations do differ if done in different order, this difference should not play the crucial role.

The error message suggests that mpi4py is built on top of Open MPI.
By default, a slot is a core, but if you want a slot to be an hyperthread, then you should
mpirun --use-hwthread-cpus ...

Related

mpirun error of oneAPI with Slurm (and PBS) in old cluster

Recently I installed Intel OneAPI including c compiler, FORTRAN compiler and mpi library and complied VASP with it.
Before presenting the question, there are some tricks I need to clarify during the installation of VASP:
GLIBC2.14: the cluster is an old machine with a glibc version of 2.12, where OneAPI needs a version of 2.14. So I compile the GLIBC2.14 and export the ld_path: export LD_LIBRARY_PATH="~/mysoft/glibc214/lib:$LD_LIBRARY_PATH"
ld 2.24: The ld version is 2.20 in the cluster, while a higher version is needed. So I installed binutils 2.24.
There is one master computer connected with 30 calculating nodes in the cluster. The calculation is executed with 3 ways:
When I do the calculation in the master, it's totally OK.
When I login the nodes manually with rsh command, the calculation in the logged node is also no problem.
But usually I submit the calculation script from the master (with slurm or pbs), and then do the calculation in the node. In that case, I met following error message:
[mpiexec#node3.alineos.net] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec#node3.alineos.net] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec#node3.alineos.net] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec#node3.alineos.net] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec#node3.alineos.net] Possible reasons:
[mpiexec#node3.alineos.net] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec#node3.alineos.net] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec#node3.alineos.net] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec#node3.alineos.net] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I only met this error with oneAPI compiled codes but Intel® Parallel Studio XE compiled. Do you have any idea of this error? Your response will be highly appreciated.
Best,
Léon

Could it be a permissions error with the Slurm agent not having the correct permissions or library path?

Why is MKL in parallel not faster than serial in R 3.6?

I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.
It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.
I've been adapting Dirk Eddelbuettel's guide for centos, but I may have missed some flag or config somewhere.
Here is a simplified version of how I am testing how number of threads relates to job time. I do get expected results when using OpenBlas.
require(callr)
#> Loading required package: callr
f <- function(i) r(function() crossprod(matrix(1:1e9, ncol=1000))[1],
env=c(rcmd_safe_env(),
R_LD_LIBRARY_PATH=MKL_R_LD_LIBRARY_PATH,
MKL_NUM_THREADS=as.character(i),
OMP_NUM_THREADS="1")
)
system.time(f(1))
#> user system elapsed
#> 14.675 2.945 17.789
system.time(f(4))
#> user system elapsed
#> 54.528 2.920 19.598
system.time(f(8))
#> user system elapsed
#> 115.628 3.181 20.364
system.time(f(32))
#> user system elapsed
#> 787.188 7.249 36.388
Created on 2020-05-13 by the reprex package (v0.3.0)
EDIT 5/18
Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout which shows it properly calling lapack:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
for f(8), it shows NThr:8
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
I still am not getting any expected performance increase from extra cores.
EDIT 2
I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS is using a GNU threading library; could the problem be in the threading library and not in blas/lapack itself?

Only seeing this now: Did you check the obvious one, ie whether R on CentOS actually picks up the MKL?
As I recall, R on CentOS it is built in a more, ahem, "restricted" mode with the shipped-with-R reference BLAS. And if and when that is the case you simply cannot switch and choose another one as we have done within Debian and Ubuntu for 20+ years as that requires a different initial choice when R is compiled.
Edit: Per subsequent discussions (see comments below) we all re-realized that it is important to have the threading libraries / models aligned. The MKL is an Intel product and defaults to using their threading library, on Linux the GNU compiler is closer to the system and has its own. That latter one needs to be selected. In my writeup / script for the MKL on .deb systems I use
echo "MKL_THREADING_LAYER=GNU" >> /etc/environment
so set this "system-wide" on the machine, one can also add it just to the R environment files.

I am not sure exactly how R call MKL but if the crossprod function calls mkl's gemm underneath then we have to see very good scalability results with such inputs. What is the input problem sizes?
MKL supports the verbose mode. This option could help to see the many useful runtime info when dgemm will be running. Could you try to export the MKL_VERBOSE=1 environment and see the log file?
Though, I am not pretty sure if R will not suppress the output.

Change the number of cores OpenMPI can "see"

I am running an executable (I don't have access to source code) that calls mpirun. I'm getting the following error which is common if more cores are requested than available on the CPU:
There are not enough slots available in the system to satisfy the 12
slots that were requested by the application:
/Users/me/Library/app/executable
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
My problem is that I cannot change the command-line options for mpirun e.g. using --oversubscribe. Rather, I need to change the default number of cores that OpenMPI "sees". (This would be an easy fix otherwise as in this case)
Is there an environment variable or something I can update to trick openMPI into working?

Ah. I found the default OpenMPI hostfile at /usr/local/etc/openmpi-default-hostfile (on Mac) and added at the end (in a new line):
localhost slots=12
So OpenMPI was reading a default slots count of 6, since I have 6 cores on my system (The error only occurred for >6 requested CPUs). But I have 12 threads and wanted to use the CPU at full capacity.
This worked for me since I was not running mpirun in the command line.

How to fix modprobe error from MPSS 4.4.1

Reaching out to the wider group as I am totally stumped trying to install Intel MPSS 4.x for my Xeon Phi 7220P.
I followed the precise steps in this link: Intel MPSS Linux User Guide Rev 4.4.1, and did it 3 times to make sure I wasn't missing any steps or making mistakes, but keep getting the following error readout:
modprobe: WARNING: Module mic_x200_dma not found.
modprobe: WARNING: Module scif_bus not found.
modprobe: WARNING: Module vop_bus not found.
modprobe: WARNING: Module cosm_bus not found.
modprobe: WARNING: Module scif not found.
modprobe: WARNING: Module vop not found.
modprobe: WARNING: Module mic_cosm not found.
modprobe: WARNING: Module mic_x200 not found.
As a result of this, I cant do the basic MPSS commands such as micctrl -s etc nor use the Xeon Phi whatsoever.
I am running Centos 7 (862 kernel), and know its not listed in the Intel pdf, but did not think this should be causing an issue as it seems that the above kernel modules are simply not being installed seemingly from Intel MPSS - but not sure if this diagnosis is correct.
Would appreciate your help - many thanks in advance!

It is complaining about that because at one point your kernel got updated from 3.10.0-514.el7 to a later version, (it happens automatically when you do a yum update, annoying I know)
Check your kernel version by running
uname -r
When you installed/compiled all the modules they were placed into /lib/modules/3.10.0-514.el7.x86_64 which is where the source code you have exported to.
You have 2 options:
Recompile the source code to work for your current kernel version (which is a pain and has it's own problems)
Revert your host kernel back to 3.10.0-514.el7 via grub config (example here) and everything will work nicely
I too struggled with this very much in the beginning and I had to read pretty much every line of source code and spend countless hours debugging until I found out. At this point there is nothing I do not know about the Xeon Phi x100/x200.
The documentation is not bad, but it didn't cover this bare essential, which is frustrating.

How to use Intel Pin on MPI code

I am pretty new to MPI and Intel Pin. I already installed pin-2.13-62732-gcc.4.4.7-linux on my linux environment, I need t use this tool on MPI codes. foreexample I want to get the number of instruction(such as inscount0 which is already existed in pin) of MPI code (like imul.c). Would you tell me what I can do?

The least painful way I found is to use tau_pin. https://www.cs.uoregon.edu/research/tau/docs/old/re39.html

You can start analysis of your MPI application following way:
mpirun –np $NPROCS pin -t $PIN_TOOL -- $APP
It the same as in case of Valgrind: Using valgrind to spot error in mpi code

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex