Initializing MPI cluster with snowfall R - r

I've been trying to run Rmpi and snowfall on my university's clusters but for some reason no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.

To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system, otherwise you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that your script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system which is usually supplied via an environment variable and/or a file.
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out what nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on their own node which is not what you want. The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.

Related

Control specific node where an MPI process executes using PBS script?

The setup: A single-processor executable and two parallel mpi-based codes that can run on 100s of processors each. All on an HPC cluster that uses a PBS-based job scheduler.
The problem: Using a shared memory communication between single-processor executable and the parallel codes requires that rank 0 of the parallel codes all be located physically on the same node in an HPC cluster that uses a PBS job scheduler.
The question: Can a PBS script be created that can specify that rank 0 of the two parallel codes must start on a specific node( the same node that the other single-processor executable is running on)?
Example:
ExecA --Single processor
ExecB -- 100 processors
ExecC -- 100 processors
I want a situation where ExecA, ExecB(Rank0), and ExecC(Rank0) all start up on the same node. I need this so that the Rank 0 processors can communicate with the single-processor using a shared memory paradigm and then broadcast that information out to the rest of their respective MPI processes.
From this post, it does appear that specification of the number of cores to use on a code can be controlled using the PBS script. From my reading of the MPI manual, it also appears that if given a hostfile, MPI will sequentially go down the hostfile until it has allocated all the processors that were requested. So theoretically if I had a hostfile/machinefile that contained the host name of a particular node, and had a specification of 1 processor on that node being used, then I believe rank 0 would likely reside on that node.
I know that most cluster-based job schedulers do provide node names for users that they can use to specify a particular node to execute on, but I can't quite determine if ability to generally tell a scheduler "Hey, for this parallel job put the first process on this node, and put the rest elsewhere" is possible.

Mulitple processes launched by mpirun at different times within a single slurm allocation

I am trying to launch multiple mpirun commands from within a single slurm allocation but launching at different times and with different core counts.
At the minute the script launches something like
mpirun -np 2 python parallel_script.py input1
mpirun -np 3 python parallel_script.py input2
Doing this naively leads to all the processes being launched on the same (0-n) cores. This can be fixed for a single node by turning off process binding, but it seems there is no way to extend this to multiple node jobs.
Is there any easy mechanism to tell it to keep track of which cores are in use? Alternatively, is there a way to specify an offset to the mpi mapping so as to specify that I want the process to actually launch on logical core X rather than logical core 0.
I'm trying to avoid having to rely too heavily on the scheduler implementation here so as this is useful on more than one system.

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: Is it possible and if yes then how, to get the list of cores on which my mpi job is running at a given moment?
It is easy to list nodes to which the job has been assigned, but after few hours spent surveying the internet I start to suspect that slurm expose the cores list in any way (why wouldn't it tough?).
The thing is, i want to double check if the cluster i am working on is really spreading the processes of my job across nodes, cores (and if possible, sockets) as I ask it to do (call me paranoid if You will).
Please note that hwloc is not an answer to my question, i ask if it is possible to get this information from slurm, not from inside of my program (call me curious if You will).
Closely related to (but definitely not the same thing) other similar question
well, that depends on your MPI library (MPICH-based, Open MPI-based, other), on how you run your MPI app (via mpirun or direct launch via srun) and your SLURM config.
if you direct launch, SLURM is the one that may do the binding.
srun --cpu_bind=verbose ...
should report how each task is bound.
if you mpirun, SLURM only spawns one proxy on each node.
in the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless SLURM is configured to restrict the available cores (for example if you configured cpuset and nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
and then it is up to the MPI library to bind the MPI tasks within the available cores.
if you want to know what the available cores are, you can
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
if you want to know how the tasks are bound, you can
mpirun grep Cpus_allowed_list /proc/self/status
or you can ask MPI to report that
iirc, with Open MPI you can
mpirun --report-bindings ...

mpiexec vs mpirun

As per my little knowledge mpirun and mpiexec both are launcher. Can anybody tell the exact difference between mpiexec and mpirun?
mpiexec is defined in the MPI standard (well, the recent versions at least) and I refer you to those (your favourite search engine will find them for you) for details.
mpirun is a command implemented by many MPI implementations. It has never, however, been standardised and there have always been, often subtle, differences between implementations. For details see the documentation of the implementation(s) of your choice.
And yes, they are both used to launch MPI programs, these days mpiexec is generally preferable because it is standardised.
I know the question's been answered, but I think the answer isn't the best. I ran into a few issues on the cluster here with mpirun and looked to see if there was a difference between mpirun and mpiexec. This is what I found:
Description
Mpiexec is a replacement program for the script mpirun, which is part
of the mpich package. It is used to initialize a parallel job from
within a PBS batch or interactive environment. Mpiexec uses the task
manager library of PBS to spawn copies of the executable on the nodes
in a PBS allocation.
Reasons to use mpiexec rather than a script (mpirun) or an external
daemon (mpd):
Starting tasks with the TM interface is much faster than invoking a separate rsh or ssh once for each process.
Resources used by the spawned processes are accounted correctly with mpiexec, and reported in the PBS logs, because all the processes
of a parallel job remain under the control of PBS, unlike when using
startup scripts such as mpirun.
Tasks that exceed their assigned limits of CPU time, wallclock time, memory usage, or disk space are killed cleanly by PBS. It is
quite hard for processes to escape control of the resource manager
when using mpiexec.
You can use mpiexec to enforce a security policy. If all jobs are required to startup using mpiexec and the PBS execution environment,
it is not necessary to enable rsh or ssh access to the compute nodes
in the cluster.
Ref: https://www.osc.edu/~djohnson/mpiexec/

How are MPI processes started?

When starting an MPI job with mpirun or mpiexec, I can understand how one might go about starting each individual process. However, without any compiler magic, how do these wrapper executables communicate the arrangement (MPI communicator) to the MPI processes?
I am interested in the details, or a pointer on where to look.
Details on how individual processes establish the MPI universe are implementation specific. You should look into the source code of the specific library in order to understand how it works. There are two almost universal approaches though:
command line arguments: the MPI launcher can pass arguments to the spawned processes indicating how and where to connect in order to establish the universe. That's why MPI has to be initialised by calling MPI_Init() with argc and argv in C - thus the library can get access to the command line and extract all arguments that are meant for it;
environment variables: the MPI launcher can set specific environment variables whose content can indicate where and how to connect.
Open MPI for example sets environment variables and also writes some universe state in a disk location known to all processes that run on the same node. You can easily see the special variables that its run-time component ORTE (OpenMPI Run-Time Environment) uses by executing a command like mpirun -np 1 printenv:
$ mpiexec -np 1 printenv | grep OMPI
... <many more> ...
OMPI_MCA_orte_hnp_uri=1660944384.0;tcp://x.y.z.t:43276;tcp://p.q.r.f:43276
OMPI_MCA_orte_local_daemon_uri=1660944384.1;tcp://x.y.z.t:36541
... <many more> ...
(IPs changed for security reasons)
Once a child process is launched remotely and MPI_Init() or MPI_Init_thread() is called, ORTE kicks in and reads those environment variables. Then it connects back to the specified network address with the "home" mpirun/mpiexec process which then coordinates all spawned processes into establishing the MPI universe.
Other MPI implementations work in a similar fashion.

Resources