Assign MPI Processes to Nodes

I have an MPI program that uses a master process and multiple worker processes. I want the master process to run alone on a single compute node, while the worker processes run on another node. The worker processes should be assigned by socket (for example, as is done with the --map-by socket option). Is there any option to assign the master process and the worker processes to different nodes, or to assign them manually, perhaps based on rank?
Thanks

Assignment of ranks to hosts simultaneously with binding is possible via the use of rankfiles. In your case, assuming that each node has two 4-core CPUs, something like this should do it (for Open MPI 1.7 and newer):
rank 0=host1 slots=0-7
rank 1=host2 slots=0:0-3
rank 2=host2 slots=1:0-3
For older versions, instead of slots=0:0-3 and slots=1:0-3, one should use slots=0-3 and slots=4-7 respectively (assuming that cores are numbered linearly, which might not be the case). The rankfile is then supplied to mpiexec via the --rankfile option; it supersedes the hostfile.
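To tie the pieces together, here is a minimal sketch of using such a rankfile; the file name rankfile.txt and ./program are hypothetical placeholders:

```shell
# Save the rankfile from the answer above to disk (hypothetical file name).
cat > rankfile.txt <<'EOF'
rank 0=host1 slots=0-7
rank 1=host2 slots=0:0-3
rank 2=host2 slots=1:0-3
EOF
# The actual launch would then be (requires an MPI installation, so shown
# commented out here):
# mpiexec --rankfile rankfile.txt -n 3 ./program
cat rankfile.txt
```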
Another option would be to do a MIMD launch. In that case one can split the MPI job into several parts and provide different distribution and binding arguments for each part:
mpiexec -H host1 -n 1 --bind-to none ./program : \
-H host2 -n 2 --bind-to socket --map-by socket ./program

The easiest way I am aware of to do this is by using the --hostfile option of Open MPI.
If you are using any decent batch system, you should have a list of your hosts and slots in some simple file or environment variable, which you can parse into a hostfile.
If you run your application "by hand", you can generate such a list on your own.
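A minimal sketch of such parsing, assuming a plain node list with one hostname per line (for example the output of scontrol show hostnames under Slurm); the host names and slot counts below are made up for illustration, giving the first node 1 slot for the master and the rest 8 slots for workers:

```shell
# Hypothetical node list, as a batch system might provide it.
nodelist="host1
host2
host3"
# First node gets 1 slot (master), remaining nodes get 8 slots (workers).
echo "$nodelist" | awk 'NR==1 {print $1, "slots=1"; next} {print $1, "slots=8"}' > hostfile
cat hostfile
# Then launch with: mpiexec --hostfile hostfile ./program
```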

Related

Multiple processes launched by mpirun at different times within a single slurm allocation

I am trying to launch multiple mpirun commands from within a single slurm allocation, launching at different times and with different core counts.
At the moment the script launches something like
mpirun -np 2 python parallel_script.py input1
mpirun -np 3 python parallel_script.py input2
Doing this naively leads to all the processes being launched on the same (0-n) cores. This can be fixed for a single node by turning off process binding, but it seems there is no way to extend this to multi-node jobs.
Is there any easy mechanism to tell it to keep track of which cores are in use? Alternatively, is there a way to specify an offset to the MPI mapping, so that I can have a process launch on logical core X rather than logical core 0?
I'm trying to avoid relying too heavily on the scheduler implementation here, so that this is useful on more than one system.

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: Is it possible and if yes then how, to get the list of cores on which my mpi job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that slurm does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check whether the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question: I am asking whether it is possible to get this information from slurm, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same as) another similar question.
Well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun), and on your SLURM config.
If you direct launch, SLURM is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
If you mpirun, SLURM only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless SLURM is configured to restrict the available cores (for example, if you configured cpuset and nodes are not in exclusive mode), all the cores can be used by the MPI tasks. It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
Or you can ask MPI to report that; IIRC, with Open MPI you can run
mpirun --report-bindings ...
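The Cpus_allowed_list field used in the commands above is not MPI-specific: every Linux process can inspect its own CPU affinity mask through /proc. This can be run locally, with no MPI at all, just to see the format that the mpirun/srun one-liners above would print per task:

```shell
# Print this process's allowed-CPU list, exactly the field the answer above
# greps for in each MPI task. Works on any Linux system.
grep Cpus_allowed_list /proc/self/status
```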

How are MPI processes started?

When starting an MPI job with mpirun or mpiexec, I can understand how one might go about starting each individual process. However, without any compiler magic, how do these wrapper executables communicate the arrangement (MPI communicator) to the MPI processes?
I am interested in the details, or a pointer on where to look.
Details on how individual processes establish the MPI universe are implementation specific. You should look into the source code of the specific library in order to understand how it works. There are two almost universal approaches though:
command line arguments: the MPI launcher can pass arguments to the spawned processes indicating how and where to connect in order to establish the universe. That's why MPI has to be initialised by calling MPI_Init() with argc and argv in C - thus the library can get access to the command line and extract all arguments that are meant for it;
environment variables: the MPI launcher can set specific environment variables whose content can indicate where and how to connect.
Open MPI, for example, sets environment variables and also writes some universe state in a disk location known to all processes that run on the same node. You can easily see the special variables that its run-time component ORTE (Open Run-Time Environment) uses by executing a command like mpiexec -np 1 printenv:
$ mpiexec -np 1 printenv | grep OMPI
... <many more> ...
OMPI_MCA_orte_hnp_uri=1660944384.0;tcp://x.y.z.t:43276;tcp://p.q.r.f:43276
OMPI_MCA_orte_local_daemon_uri=1660944384.1;tcp://x.y.z.t:36541
... <many more> ...
(IPs changed for security reasons)
Once a child process is launched remotely and MPI_Init() or MPI_Init_thread() is called, ORTE kicks in and reads those environment variables. It then connects back over the specified network address to the "home" mpirun/mpiexec process, which coordinates all spawned processes into establishing the MPI universe.
Other MPI implementations work in a similar fashion.

Internalize creation of MPI processes

Is there a way to internalize the creation of MPI processes? Instead of specifying the number of processes on the command line ("mpiexec -np 2 ./[PROG]"), I would like the number of processes to be specified internally.
Cheers
Yes. You're looking for MPI_Comm_spawn() from MPI-2, which launches a (possibly different) program with a number of processes that can be specified at runtime, and creates a new intercommunicator which you can use in place of MPI_COMM_WORLD to communicate amongst both the original and the new processes.

Spreading a job over different nodes of a cluster in sun grid engine (SGE)

I'm trying to get Sun Grid Engine (SGE) to run the separate processes of an MPI job over all of the nodes of my cluster.
What is happening is that each node has 12 processors, so SGE is assigning 12 of my 60 processes to each of 5 separate nodes.
I'd like it to assign 2 processes to each of the 30 nodes available, because with 12 processes (DNA sequence alignments) running on each node, the nodes are running out of memory.
So I'm wondering if it's possible to explicitly get SGE to assign the processes to a given node?
Thanks,
Paul.
Check out "allocation_rule" in the configuration for the parallel environment; either with that alone, or by specifying $pe_slots for allocation_rule and then using the -pe option to qsub, you should be able to do what you ask for above.
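In particular, a fixed integer allocation_rule (one of the values the sge_pe(5) format allows, alongside $pe_slots, $fill_up and $round_robin) places exactly that many slots on each host, which matches the 2-per-node layout asked for. A sketch of such a parallel environment in qconf -sp format follows; the PE name and all values here are hypothetical:

```
pe_name            mpi_2perhost
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    2
control_slaves     TRUE
job_is_first_task  FALSE
```

A job submitted with qsub -pe mpi_2perhost 60 job.sh would then receive 60 slots spread 2 per host across 30 hosts.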
You can do it by creating a queue that uses only 2 of the 12 processors on each node.
You can see the configuration of the current queue with the command
qconf -sq queuename
You will see something like the following in the queue configuration. This queue is set up so that it uses 5 execution hosts with 4 slots (processors) each.
....
slots 1,[master=4],[slave1=4],[slave2=4],[slave3=4],[slave4=4]
....
Use the following command to change the queue configuration:
qconf -mq queuename
Then change each 4 to 2.
From an admin host, run "qconf -msconf" to edit the scheduler configuration. It will bring up a list of configuration options in an editor. Look for the one called "load_formula". Set its value to "-slots" (without the quotes).
This tells the scheduler that a machine is least loaded when it has the fewest slots in use. If your exec hosts each have a similar number of slots, you will get an even distribution. If some exec hosts have more slots than the others, they will be preferred, but your distribution will still be more even than with the default load_formula (which I don't remember, having changed this in my cluster quite some time ago).
You may need to set the slots on each host. I have done this myself because I need to limit the number of jobs on a particular set of boxes to less than their maximum, because they don't have as much memory as some of the others. I don't know if it is required for this load_formula configuration, but if it is, you can add a slots consumable to each host. Do this with "qconf -me hostname" and add a value to the "complex_values" line that looks like "slots=16", where 16 is the number of slots you want that host to use.
This is what I learned from our sysadmin. Put this SGE resource request in your job script:
#$ -l nodes=30,ppn=2
This requests 30 nodes with 2 MPI processes per node (ppn). I think there is no guarantee that this 30x2 layout will work on a 30-node cluster if other users also run lots of jobs, but perhaps you can give it a try.
