I have a program running with MPI (OpenMPI) that is filling the output with many instances of this:
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Other than that, the program seems to be running fine.
Is there a way to suppress this kind of message?
-q didn't do it.
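One crude option, in case nothing cleaner exists, is filtering the banner out of stderr in the shell; a rough sketch (./my_app and the exact grep patterns are placeholders, not something from the OpenMPI docs):
# Crude workaround: strip the known banner lines from stderr, keep everything else.
# ./my_app is a placeholder; the patterns match the banner text quoted above.
mpirun -np 4 ./my_app \
    2> >(grep -v -e '^----------' \
                 -e 'mpirun noticed that the job aborted' \
                 -e 'that caused that situation' >&2)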
I have a machine learning Python program that makes some calculations and then runs a C++ finite-volume code (OpenFOAM) repeatedly (thousands of times). The Python code uses multiprocessing for parallelism, which means running several instances of that C++ solver at the same time. Additionally, the C++ solver itself is parallelized with MPI.
The whole framework works just fine on my local computer. But when I use SLURM clusters (Vera and Tetralith) the procedure becomes extremely slow. Although each instance of the finite-volume solver runs rather fast, when one instance finishes, the code waits a significant amount of time before starting the next one. It appears that the code needs to wait until some specific cores are freed, which is strange, as I reserve the required number of cores through an sbatch script.
Let's say I run the Python code on 8 cores and each core runs a C++ solver using 50 cores with MPI. Thus, I reserve 400 (8 times 50) cores for the whole job through the following script (I even tried requesting twice that number of cores, but it did not help):
#!/bin/bash
#SBATCH -A MY_PROJECT_NAME
#SBATCH -p MY_CLUSTER_NAME
#SBATCH -J MY_CASE_NAME
#SBATCH -n 400
#SBATCH -t 100:00:00
#SBATCH -o slurm-%j.out
#SBATCH --exclusive
#-----------------------------------------------------------
module load ALL_THE_REQUIRED_MODULES
#-----------------------------------------------------------
python3 -u training.py &> log.training
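For context, each of the 8 Python workers essentially shells out to something like the following for every solver run (the solver name, case path and output redirection are placeholders; the real call lives inside training.py):
# Roughly what one Python worker runs for a single solver instance; simpleFoam
# and the case path are placeholders for the actual OpenFOAM solver and case.
srun -n 50 simpleFoam -parallel -case /path/to/case_0001 > log.case_0001 2>&1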
Sometimes, after one C++ solver finishes and before the next one starts, I get the following messages, which I guess indicate that the code is waiting for cores to be freed. But the cores should already be free, as many reserved cores sit unused.
srun: Job 21234173 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 21234173 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 21234173
Any help or idea would be much appreciated.
Saeed
I am trying to run WRF (real.exe, wrf.exe) through crontab on compute nodes, but the compute nodes are not able to run the Slurm job. I think there is some issue with the MPI library when it runs in the cron environment.
I tried to replicate the terminal's PATH and environment variables when running the crontab job.
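Roughly, that attempt amounted to setting the variables at the top of the crontab, something like this (every value below is a placeholder copied from env in a working terminal session, and run_wrf.sh is a hypothetical wrapper script):
# Environment variables set at the top of the crontab to mimic the terminal session;
# all paths here are placeholders for what `env` shows in a working shell.
SHELL=/bin/bash
PATH=/opt/intel/oneapi/mpi/latest/bin:/usr/local/bin:/usr/bin:/bin
LD_LIBRARY_PATH=/opt/intel/oneapi/mpi/latest/lib:/opt/intel/oneapi/mpi/latest/lib/release
I_MPI_ROOT=/opt/intel/oneapi/mpi/latest

# The job itself: run the WRF wrapper script and capture its output.
0 3 * * * /home/user/wrf/run_wrf.sh >> /home/user/wrf/cron_wrf.log 2>&1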
The log files generated when running the job from the terminal and from crontab are attached as with_terminal and with_crontab, respectively, at the following link:
https://drive.google.com/drive/folders/1YE9OchSB8alpZSdRl-8uIbBPm6lI-0DQ
The error when running the job from crontab is as follows:
#################################
compute-dy-c5n18xlarge-1
#################################
Processes 4
[mpiexec#compute-dy-c5n18xlarge-1] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on compute-dy-c5n18xlarge-1 (pid 23070, exit code 65280)
[mpiexec#compute-dy-c5n18xlarge-1] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec#compute-dy-c5n18xlarge-1] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec#compute-dy-c5n18xlarge-1] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:772): error waiting for event
[mpiexec#compute-dy-c5n18xlarge-1] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1938): error setting up the boostrap proxies
Thanks for looking into the issue.
I am using Intel MPI and have encountered some confusing behavior when using mpirun in conjunction with slurm.
If I run (on a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get the expected output: 0 and 1 printed out.
If, however, I salloc --time=30 --nodes=1 and run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
Then, if I change -n 2 to -n 3 (still on the compute node), I get a large error from Slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other stuff), but I am not sure how to explain this either...
Now, based on this OpenMPI page, it seems these kinds of operations should be supported, at least for OpenMPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a Slurm environment(?), but I am still wondering: what is it about the interaction between mpirun and Slurm (salloc) that produces this behavior? Why would it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second?
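For reference, the "direct" launch route mentioned in that quote would look roughly like this with Intel MPI inside the same allocation (whether I_MPI_PMI_LIBRARY needs to be set at all, and the library path itself, are site-specific assumptions):
# Let Slurm start the ranks itself instead of going through mpirun's process manager.
# The PMI library path below is only an illustration; adjust it to your site.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"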
Normally I'd use mpiexec to run a process on multiple hosts like:
mpiexec -n 8 --hostfile hosts.txt python my_mpi_script.py
where my_mpi_script.py depends on mpi4py.
Supposing I couldn't run mpiexec or mpirun, how would I be able to run my_mpi_script.py on multiple hosts -- would this be possible by changing my script or execution environment?
Edit: I'm working with a system that runs the same command on many hosts. Normally, processes would discover each other on the local network rather than all being spawned by MPI. My current solution involves checking which host I'm on and running mpiexec on exactly one of the hosts. This doesn't work well due to some networking limitations.
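If the hosts happen to be managed by a scheduler such as Slurm (as in the other questions here), one option is to let the scheduler do the launching and wire-up instead of mpiexec; a hedged sketch, assuming an MPI/mpi4py build with PMI support:
# Sketch: srun starts one copy per task on the allocated hosts and provides the
# PMI bootstrap that mpiexec would normally supply. Assumes a Slurm allocation
# and an MPI build with PMI support; -n 8 mirrors the original mpiexec call.
srun -n 8 python my_mpi_script.py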
Thanks.
I would like to write Slurm batch scripts (sbatch) to run several MPI applications. Thus I would like to be able to run something like this:
salloc --nodes=1 mpirun -n 6 hostname
But I get this message:
There are not enough slots available in the system to satisfy the 6 slots
that were requested by the application:
hostname
Either request fewer slots for your application, or make more slots available for use.
The node actually has 4 CPUs. I am therefore looking for an option that allows more than one task per CPU, but I cannot find it. I know that MPI alone is able to run several processes when physical resources are missing, so I think the problem is on the Slurm side.
Do you have any suggestions/comments?
Use srun and supply the option --overcommit, e.g. like this:
test.job:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --overcommit
srun hostname
Run it with sbatch test.job.
From man srun:
Normally, srun will not allocate more than one process per CPU. By specifying --overcommit you are explicitly allowing more than one process per CPU.
Note that, depending on your cluster configuration, this may or may not also work with mpirun, but I'd stick with srun unless you have a good reason not to.
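If you do need the mpirun route, Open MPI (which the "not enough slots" message in the question comes from) has its own switch for this; a minimal sketch against the same kind of allocation:
# Open MPI's oversubscription flag: allow more ranks than allocated slots.
salloc --nodes=1 mpirun --oversubscribe -n 6 hostname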
An important warning: most MPI implementations have terrible performance by default when running overcommitted. How to address that is a different, much more difficult, question.