Does MPI have its own way of transferring files to remote nodes?

I have a cluster of machines that are configured to talk to each other via ssh. I would like to run an MPI job on this cluster. It seems like any executable I wish to run must be present in every node of the cluster. I could manually transfer the executable from the master node to all the others, but is there a more natural way to do this from within MPI?
For instance, if I run mpirun --hostfile hosts a.out it requires me to manually copy a.out into the home directories of every single node. Doesn't MPI have its own way to automate this?


Control specific node where an MPI process executes using PBS script?

The setup: A single-processor executable and two parallel mpi-based codes that can run on 100s of processors each. All on an HPC cluster that uses a PBS-based job scheduler.
The problem: Using shared-memory communication between the single-processor executable and the parallel codes requires that rank 0 of each parallel code be located physically on the same node as that executable.
The question: Can a PBS script specify that rank 0 of the two parallel codes must start on a specific node (the same node that the single-processor executable is running on)?
ExecA --Single processor
ExecB -- 100 processors
ExecC -- 100 processors
I want a situation where ExecA, ExecB(Rank0), and ExecC(Rank0) all start up on the same node. I need this so that the Rank 0 processors can communicate with the single-processor using a shared memory paradigm and then broadcast that information out to the rest of their respective MPI processes.
From this post, it does appear that specification of the number of cores to use on a code can be controlled using the PBS script. From my reading of the MPI manual, it also appears that if given a hostfile, MPI will sequentially go down the hostfile until it has allocated all the processors that were requested. So theoretically if I had a hostfile/machinefile that contained the host name of a particular node, and had a specification of 1 processor on that node being used, then I believe rank 0 would likely reside on that node.
I know that most cluster job schedulers provide node names that users can use to request a particular node, but I can't quite determine whether it is generally possible to tell a scheduler, "Hey, for this parallel job put the first process on this node, and put the rest elsewhere."
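The hostfile idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name build_hostfile and the node names are made up, the "host slots=N" syntax is Open MPI's hostfile format (other launchers use different formats), and whether rank 0 really lands on the first listed host depends on the launcher's default mapping policy.

```python
from collections import Counter


def build_hostfile(allocated_hosts, first_host):
    """Return hostfile lines with first_host listed first, limited to one slot.

    allocated_hosts is a list with one entry per allocated core, which is
    how a PBS node file usually looks (hostnames repeated once per slot).
    The remaining slots on first_host are deliberately left unused.
    """
    counts = Counter(allocated_hosts)
    if first_host not in counts:
        raise ValueError(f"{first_host} is not part of the allocation")

    lines = [f"{first_host} slots=1"]
    seen = {first_host}
    for host in allocated_hosts:  # keep the scheduler's original host order
        if host not in seen:
            seen.add(host)
            lines.append(f"{host} slots={counts[host]}")
    return lines


if __name__ == "__main__":
    # Hypothetical allocation: three nodes with two cores each.
    nodes = ["n01", "n01", "n02", "n02", "n03", "n03"]
    print("\n".join(build_hostfile(nodes, "n02")))
```

Inside a PBS job, the list of allocated hosts would typically be read from the file named by $PBS_NODEFILE, and the generated file would then be passed to mpirun via --hostfile.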

Is it possible and how to get a list of cores on which my mpi job is running from slurm?

The question: Is it possible, and if so how, to get the list of cores on which my MPI job is running at a given moment?
It is easy to list the nodes to which the job has been assigned, but after a few hours spent surveying the internet I am starting to suspect that SLURM does not expose the core list in any way (why wouldn't it, though?).
The thing is, I want to double-check whether the cluster I am working on is really spreading the processes of my job across nodes, cores (and, if possible, sockets) as I ask it to (call me paranoid if you will).
Please note that hwloc is not an answer to my question: I am asking whether it is possible to get this information from SLURM, not from inside my program (call me curious if you will).
Closely related to (but definitely not the same as) another similar question.
That depends on your MPI library (MPICH-based, Open MPI-based, or other), on how you run your MPI application (via mpirun or by direct launch via srun), and on your SLURM configuration.
If you direct launch, SLURM is the one that may do the binding:
srun --cpu_bind=verbose ...
should report how each task is bound.
If you use mpirun, SLURM only spawns one proxy on each node.
In the case of Open MPI, the spawn command is
srun --cpu_bind=none orted ...
so unless SLURM is configured to restrict the available cores (for example, if you configured cpuset and the nodes are not in exclusive mode), all the cores can be used by the MPI tasks.
It is then up to the MPI library to bind the MPI tasks within the available cores.
If you want to know what the available cores are, you can run
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none grep Cpus_allowed_list /proc/self/status
If you want to know how the tasks are bound, you can run
mpirun grep Cpus_allowed_list /proc/self/status
Alternatively, you can ask the MPI library to report the bindings.
If I recall correctly, with Open MPI you can use
mpirun --report-bindings ...
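The Cpus_allowed_list lines printed by the grep commands above use a compact range notation such as 0-3,8-11. A minimal Python sketch for expanding such a line into explicit core numbers might look like the following; the function names are made up for illustration, and the sample line is hypothetical rather than real output from any cluster.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a list of ints."""
    cores = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        if "-" in part:
            low, high = part.split("-")
            cores.extend(range(int(low), int(high) + 1))
        else:
            cores.append(int(part))
    return cores


def parse_status_line(line):
    """Parse one 'Cpus_allowed_list:<tab>...' line from /proc/self/status."""
    key, _, value = line.partition(":")
    if key.strip() != "Cpus_allowed_list":
        raise ValueError(f"unexpected line: {line!r}")
    return parse_cpu_list(value)


if __name__ == "__main__":
    # Hypothetical line, in the format grep would print for one task.
    print(parse_status_line("Cpus_allowed_list:\t0-3,8-11"))
    # prints [0, 1, 2, 3, 8, 9, 10, 11]
```

Piping the output of the srun or mpirun commands above through something like this makes it easier to compare the binding of each task against what was requested.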

Assign MPI Processes to Nodes

I have an MPI program that uses a master process and multiple worker processes. I want the master process to run alone on a single compute node, while the worker processes run on another node. The worker processes should be assigned by socket (for example, as is done with the --map-by socket option). Is there any option to assign the master process and the worker processes to different nodes, or to assign them manually, perhaps by rank?
Assignment of ranks to hosts simultaneously with binding is possible via the use of rankfiles. In your case, assuming that each node has two 4-core CPUs, something like this should do it (for Open MPI 1.7 and newer):
rank 0=host1 slots=0-7
rank 1=host2 slots=0:0-3
rank 2=host2 slots=1:0-3
For older versions, instead of slots=0:0-3 and slots=1:0-3 one should use slots=0-3 and slots=4-7 respectively (assuming that cores are numbered linearly which might not be the case). Then the rankfile is supplied to mpiexec via the --rankfile option. It supersedes the hostfile.
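For more than a handful of ranks, writing the rankfile by hand gets tedious. The following Python sketch generates the three lines shown above under the same assumptions (two 4-core sockets per node, the Open MPI 1.7+ socket:core slot syntax). The function name and its parameters are hypothetical and not part of Open MPI.

```python
def make_rankfile(master_host, cores_per_node, worker_host,
                  sockets, cores_per_socket):
    """Build rankfile lines: rank 0 on master_host, one worker per socket."""
    # The master may use every core of its node.
    lines = [f"rank 0={master_host} slots=0-{cores_per_node - 1}"]
    # Each worker is restricted to the cores of one socket on worker_host.
    for socket in range(sockets):
        lines.append(
            f"rank {socket + 1}={worker_host} "
            f"slots={socket}:0-{cores_per_socket - 1}"
        )
    return lines


if __name__ == "__main__":
    for line in make_rankfile("host1", 8, "host2", 2, 4):
        print(line)
```

The printed lines can be saved to a file and supplied via the --rankfile option.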
Another option would be to do an MIMD launch. In that case one could split the MPI job into several parts and provide different distribution and binding arguments for each part:
mpiexec -H host1 -n 1 --bind-to none ./program : \
-H host2 -n 2 --bind-to socket --map-by socket ./program
The easiest way I am aware of to do this is the --hostfile option of Open MPI.
If you are using any decent batch system you should have a list of your hosts and slots in some simple file or environment variable and you can parse that into a hostfile.
If you run your application "by hand" you can generate such a list on your own.

How to deploy MPI program?

MPI requires that I deploy the MPI program to each machine. Currently I put the program on NFS, but this has two issues: NFS has latency problems, and NFS is not suitable for large clusters. I know I could use Linux shell commands to sync my program to each node, but that is not very convenient, especially when I change the program frequently. Is there an easy way to do that?
There's nothing wrong with NFS or any other network filing system in large clusters. It just means your file server isn't sized for the job. If you replace NFS with anything like ssh, ftp, scripts, or whatever and change nothing else, I don't think that'll make any significant difference. Also, if the loading time is a significant and bothersome component of the overall runtime then why use MPI in the first place?
OK, enough of playing devil's advocate. One thing you can do is have nodes load your program onto other nodes in a binary-tree arrangement. You'll need a script that copies the executable to two other nodes along with a copy of the script itself, starts that script running asynchronously on those nodes, and then runs the executable locally. The result is a chain reaction of copying and running that spreads across the network. The only difficult bit is choosing which nodes to copy to so that each one is visited just once. This should be a lot faster than copying to every node sequentially from one machine.
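One way to handle the "which nodes do I copy to" part is to treat the node list as an array-backed binary tree, where the node at index i copies to the nodes at indices 2i+1 and 2i+2. The Python sketch below only computes that plan; the actual copying and remote start (scp, ssh, or similar) are left out, and the function and node names are made up for illustration.

```python
def children(index, node_count):
    """Indices that the node at `index` should copy to (at most two)."""
    return [c for c in (2 * index + 1, 2 * index + 2) if c < node_count]


def copy_plan(nodes):
    """Map each node name to the list of nodes it is responsible for."""
    return {
        nodes[i]: [nodes[c] for c in children(i, len(nodes))]
        for i in range(len(nodes))
    }


if __name__ == "__main__":
    # Hypothetical node names; nodes[0] is where the executable starts out.
    nodes = [f"node{i:02d}" for i in range(7)]
    for source, targets in copy_plan(nodes).items():
        print(source, "->", targets)
```

Because every index other than 0 has exactly one parent, each node receives the executable exactly once, and the number of copy rounds grows with the logarithm of the node count rather than linearly.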
Depending on the nature of the application and the nature of the NFS network, using a shared file system for both the MPI implementation and the application "should" be able to scale with reasonable performance, to a point. Keep in mind that there is some NFS caching at the node level, so multiple ranks on the same node will not each have to traverse the network to reach the files.
In general terms, I tend to advise that NFS be discontinued at about 128 nodes or 1024 ranks in favor of local installations. That advice changes if the NFS is delivered with 10GigE, IPoIB, or if a high performance file system like SFS or GPFS is used.
If you are committed to local installations, then tools like rsync or scp are good candidates to distribute the bits. Script the final result. You can even tar to the shared file system and then use a remote command (e.g. ssh, clush) to un-tar to local disk. The "solution" only needs to be robust, not polished or elegant.
I'll also chime in to say that NFS should be just fine in this use case, unless you have a cluster of over 100-200 nodes.
If you just want a lightweight tool for doing many-node parallel operations, I'd suggest pdsh. pdsh is a very common tool on HPC clusters. It includes a command called pdcp for doing parallel node copies, i.e.
pdcp -w node[00-99] myfile /path/to/destination/myfile
Where the nodenames are node00, node01, ... node99.
Similarly, you use the pdsh command to run a command in parallel across all the nodes. I.e.,
pdsh -w node[00-99] /path/to/my/executable
Alternatively, if you're looking for something a little less ad-hoc for doing these operations, I can recommend Ansible as an easy and lightweight configuration management and deployment tool. It's not as simple to get started as pdsh, but might be more manageable in the long run...
For example, a simple Ansible playbook to copy a tarball to all nodes, extract it, and then execute a binary might look like:
- hosts: computenodes
  user: myname
  vars:
    num_procs: 32
  tasks:
    - name: copy and extract tarball to deployment location
      action: unarchive src=myapp.tar.gz dest=/path/to/deploy/
    - name: execute app
      action: command mpirun -np {{num_procs}} /path/to/deploy/myapp.exe

Initializing MPI cluster with snowfall R

I've been trying to run Rmpi and snowfall on my university's clusters but for some reason no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system, otherwise you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that the script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system, which is usually provided via an environment variable and/or a file.
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out what nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but it is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument, which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each copy will spawn 10 workers on its own node, which is not what you want. The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.
