How to deploy an MPI program?

MPI requires that I deploy the MPI program to each machine. Currently I put the MPI program on NFS, but this method has two issues: one is that NFS has latency problems, and the other is that NFS is not suitable for a large cluster. I know I could use some Linux shell commands to sync my program to each node, but that looks inconvenient, especially when I change the program frequently. Is there any easier method to do that?

There's nothing wrong with NFS or any other network file system in large clusters; if it's slow, it just means your file server isn't sized for the job. If you replace NFS with something like ssh, ftp, scripts, or whatever and change nothing else, I don't think that'll make any significant difference. Also, if the loading time is a significant and bothersome component of the overall runtime, then why use MPI in the first place?
OK, enough playing devil's advocate. One thing you can do is have nodes load your program onto other nodes in a binary-tree arrangement. You'll need a script that copies the executable to two other nodes along with a copy of the script, starts that script running asynchronously on those nodes, and then runs the executable locally. The result is a chain reaction of copying and running spreading across the network. The only difficult bit is choosing which nodes to copy to so that each one is visited just once. It will be a lot faster.
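A rough sketch of such a script is below. The script name, node names, and paths are made up, and it assumes passwordless ssh/scp and the same directory layout on every node:

#!/bin/bash
# treecopy.sh -- binary-tree fan-out sketch: copy the app and this script to
# the head of each half of the remaining node list, let those heads carry on
# in the background, then run the app locally.
# Usage: ./treecopy.sh /path/to/app node1 node2 ... nodeN
app=$1; shift
nodes=("$@")
dir=$(dirname "$app")

fanout() {                                # fanout node1 [node2 ...]
    local head=$1
    [ -z "$head" ] && return
    shift
    scp -q "$app" "$0" "$head:$dir/"      # copy program + script to the branch head
    ssh "$head" "$dir/$(basename "$0") $app $*" &   # branch head continues asynchronously
}

n=${#nodes[@]}
if (( n > 0 )); then
    half=$(( (n + 1) / 2 ))
    fanout "${nodes[@]:0:half}"           # left half of the remaining nodes
    fanout "${nodes[@]:half}"             # right half
fi

"$app"                                    # finally, run the executable locally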

Depending on the nature of the application and the nature of the NFS network, using a shared file system for both the MPI implementation and the application "should" be able to scale with reasonable performance, to a point. Keep in mind that there is some NFS caching at the node level, so multiple ranks on the same node will not each have to traverse the network to reach the files.
In general terms, I tend to advise that NFS be discontinued at about 128 nodes or 1024 ranks in favor of local installations. That advice changes if the NFS is delivered over 10GigE or IPoIB, or if a high-performance file system like SFS or GPFS is used.
If you are committed to local installations, then tools like rsync or scp are good candidates to distribute the bits. Script the final result. You can even tar to the shared file system, then use a remote command (e.g. ssh, clush) to untar to local disk. The "solution" only needs to be robust, not polished or elegant.
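For example, a rough sketch of that tar-to-shared, untar-to-local approach (node names and paths are made up; it assumes /shared is the NFS mount visible on every node and that clush is installed):

tar -czf /shared/myapp.tar.gz -C /home/me/myapp .
clush -w node[00-99] 'mkdir -p /local/myapp && tar -xzf /shared/myapp.tar.gz -C /local/myapp'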

I'll also chime in to say that NFS should be just fine in this use case, unless you have a cluster of over 100-200 nodes.
If you just want a lightweight tool for doing many-node parallel operations, I'd suggest pdsh. pdsh is a very common tool on HPC clusters. It includes a command called pdcp for doing parallel node copies, i.e.
pdcp -w node[00-99] myfile /path/to/destination/myfile
where the node names are node00, node01, ..., node99.
Similarly, you use the pdsh command to run a command in parallel across all the nodes. I.e.,
pdsh -w node[00-99] /path/to/my/executable
Alternatively, if you're looking for something a little less ad-hoc for doing these operations, I can recommend Ansible as an easy and lightweight configuration management and deployment tool. It's not as simple to get started as pdsh, but might be more manageable in the long run...
For example, a simple Ansible playbook to copy a tarball to all nodes, extract it, and then execute a binary might look like:
---
- hosts: computenodes
  user: myname
  vars:
    num_procs: 32
  tasks:
    - name: copy and extract tarball to deployment location
      action: unarchive src=myapp.tar.gz dest=/path/to/deploy/
    - name: execute app
      action: command mpirun -np {{num_procs}} /path/to/deploy/myapp.exe
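Running it could then be as simple as the line below, assuming the playbook is saved as deploy.yml and an inventory file named hosts defines the computenodes group (both file names are assumptions):

ansible-playbook -i hosts deploy.yml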

Related

How to set up bidirectional rsync?

I tend to run simulations on a cluster that produce files larger than 100 MB, and I can't keep my computer in sync with the cluster. So I considered setting up rsync between the two by following this link.
However, I believe that is just a cron job to sync the backup server with the main server, and it doesn't work in both directions. What would be the step-by-step instructions to set up a bidirectional rsync?
Both systems run Linux.
Rsync isn't really the right tool for this job. You can sort of get it to work, using cron jobs and extremely carefully chosen parameters, but there's significant danger of data loss, especially if you want file deletion to propagate.
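For illustration only, the cron-based approach usually boils down to a pair of opposing one-way syncs like the sketch below (host name and paths are made up); neither direction can safely propagate deletions, and conflicting edits are silently resolved by whichever copy is newer:

rsync -au /home/me/simulations/ cluster:/home/me/simulations/
rsync -au cluster:/home/me/simulations/ /home/me/simulations/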
I'd recommend a tool like Syncthing for bidirectional sync. You want something that maintains an independent database of what's changed and what hasn't, and real-time updates are nice to have too.

Is using the -L flag and an addprocs script a more powerful version of -p and --machine-file?

So I have a moderately complex set of requirements for my worker processes.
I want to use the master-slave topology and a non-default working directory.
I also want to mix both local and remote workers.
As far as I can tell from reading the --machine-file section of the documentation, it will not let me do that.
So I am looking at the -L <file> parameter:
>julia -h
...
-L, --load <file>  Load <file> immediately on all processors
...
So if I do not use the -p or --machine-file flags, then there is initially only one processor, so "all processors" just means on that single processor.
So I tried this out
start_workers.jl
addprocs([
        ("cluster_c4_1", :auto),
        ("cluster_c4_2", :auto)
    ],
    dir="/mnt/",
    topology=:master_slave
)

addprocs(
    dir="/mnt/",
    topology=:master_slave
)
test.jl
println("*************")
println(workers())
println("-------------")
Running it:
>julia -L start_workers.jl test.jl
*************
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]
-------------
So it looks all good, got my 20 workers.
Have I done anything unreasonable? Is this the best way?
That's exactly how I'm deploying it on an HPC cluster under the Torque scheduler. In fact, I'm in the process of rewriting the cluster manager to support more options when adding processes through the Torque scheduling system in particular, so I've spent quite a bit of time looking into this.
You might also want to be aware that there are various cluster managers, Pkg.add("ClusterManagers"), that extend the ability of addprocs in a variety of environments, such as when you need to request the resources from a scheduler. It looks like passwordless ssh is possible for you, so the default cluster manager is sufficient in your case.
I don't believe there is any way of defining the extra topology and directory parameters on the command line, so your approach is correct.

How can I see detailed work of nodes on a Rocks Cluster?

I built a Rocks cluster for my school project, which is matrix multiplication, with one frontend and five other computers as nodes. Over MPI I send them partitions of the matrix, which they use for the multiplication, and then they send the data back. The command I run is:
mpirun -hostfile myhostfile ./myprogram
where myhostfile is a file of node names and their slot (thread) counts.
My program is working and I'm trying to analyze it now.
My question is: how can I see the work of each node's cores/processors on its task? Are all the processors working? Is there some kind of overload?
I tried to install the Vampir profiler and Intel's VTune Amplifier, but I have some problems attaching them to my program with the command above (other commands don't let me run my program on all threads of a node). All I have accomplished so far (to see my nodes working, besides Ganglia) is to log in to a node from the frontend; with the command "top" I could see when my program was executing by the number of threads and the nearly 100% CPU usage on each thread.
Take a look at mpstat
With no parameters it will show the aggregated load for all cores.
mpstat -P ALL shows load for each core
This will give you realtime stats for your nodes:
watch pdsh -w compute-01-[01-10] mpstat
(use your compute node names)

Open MPI master node setup configuration

I'm trying to set up a relatively small cluster (36 cores) with Open MPI, and I've run into a small problem. I have all the Open MPI libraries and dependencies installed and running correctly (I can run a hello-world MPI program on each computer as the localhost). The problem is that I can't find much documentation on how to get the computers to execute a program together. I can use the mpirun --hostfile command, but I don't want to have to specify the host file every time I execute a job. Plus, future users won't always have access to all the IP addresses on the cluster. They and I expect to be able to execute mpirun -np 20 programFile with no problem. Can someone provide some guidance on what I need to do from this point? To be fair, I've only taken one class in college where we wrote parallel programs with MPI, but they never showed us how to SET UP a new cluster with Open MPI. I appreciate any advice you can give. I've found this guide through my searches, MPICH_Cluster_Setup, which would be great if it were for Open MPI. Is there a similar guide out there that pertains to Open MPI?
You should use a cluster scheduler like Torque, SLURM, or SGE (all are free/FOSS). These let users reserve nodes for their use, and they all "talk" to Open MPI to tell it which nodes to use for that user's job (so users don't have to use a hostfile).
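For example, with SLURM (and an Open MPI built with SLURM support), a user's job script might look like the sketch below; the node/task counts and program name are placeholders. The scheduler allocates the nodes and mpirun picks them up from the allocation, so no hostfile is needed; the user just submits it with sbatch job.sh:

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=5
#SBATCH --ntasks=20
mpirun -np 20 ./programFile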
As for the MPICH cluster setup doc: it's just about exactly what you need for Open MPI too, but there's no need to set up MPD at the end (MPICH has since deprecated MPD, anyway).

A program to kill long-running runaway programs

I manage Unix systems where, sometimes, programs like CGI scripts run forever, sometimes eating a lot of CPU time and wasting resources.
I want a program (typically invoked from cron) which can kill these runaways, based on the following criteria (combined with AND and OR):
Name (given by a regexp)
CPU time used
elapsed time (for programs which are blocked on I/O)
I do not really know what to type into a search engine for this sort of program. I could certainly write it myself in Python, but I'm lazy, and maybe a good program already exists?
(I did not tag my question with a language name since a program in Perl or Ruby or whatever would work as well)
Try using system-level quota enforcement instead. Most systems will let you set a per-process CPU time limit for different users.
Examples:
Linux: /etc/security/limits.conf
FreeBSD: /etc/login.conf
CGI scripts can usually be run under their own user ID, for example using mod_suid for Apache.
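For instance, on Linux a hard per-process CPU-time cap (the limit is in minutes) for a dedicated CGI user could be a single line in /etc/security/limits.conf; the user name and value here are made up:

cgi-user    hard    cpu    5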
This might be something more like what you were looking for:
http://devel.ringlet.net/sysutils/timelimit/
Most of the watchdog-like programs or libraries just check whether a given process is running, so I'd say you'd be better off writing your own, using existing libraries that give out process information.
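If you do roll your own, the core can be quite small. A rough cron-able sketch in shell (the pattern and thresholds are placeholders, and it assumes a procps-ng ps that supports the cputimes/etimes format specifiers):

#!/bin/bash
# Kill processes whose name matches PATTERN and that have either used more
# than MAX_CPU seconds of CPU time or been alive longer than MAX_ELAPSED seconds.
PATTERN='^(my_cgi|convert)'
MAX_CPU=300          # seconds of CPU time
MAX_ELAPSED=3600     # seconds of wall-clock time

ps -eo pid=,cputimes=,etimes=,comm= | while read -r pid cpu elapsed comm; do
    if [[ $comm =~ $PATTERN ]] && { (( cpu > MAX_CPU )) || (( elapsed > MAX_ELAPSED )); }; then
        kill "$pid"
    fi
done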
