How to set up cluster slave nodes (on Windows)

I need to run thousands* of models on 15 machines (each with 4 cores), all running Windows. I started learning the parallel, snow and snowfall packages and read a bunch of intros, but they mainly focus on setting up the master. There is only a little information on how to set up the worker (slave) nodes on Windows, and it is often contradictory: some say that a SOCK cluster is practically the easiest way to go, others claim that SOCK cluster setup is complicated on Windows (sshd setup) and that the best way to go is MPI.
So, what is the easiest way to set up slave nodes on Windows: MPI, PVM, SOCK or NWS? My possibly naive requirements are (listed by priority):
1. To use all 4 cores on the slave nodes (required).
2. Ideally, I need only R with some packages and a slave R script or function that listens on some port and waits for tasks from the master.
3. Ideally, nodes can be added to or removed from the cluster dynamically.
4. Ideally, the slaves would connect to the master, so I wouldn't have to list all the slaves' IPs in the master's configuration.
Only point 1 is 100% required; points 2-4 are "would be good". Is that too naive a request?
I am sorry, but I have not been able to figure this out from the available docs and tutorials. I would be grateful if you could point me to the right source.
* Note that each of those thousands of models will take at least 7 minutes, so there won't be a big communication overhead.

It's a shame that all these APIs (like parallel/snow/snowfall) are so complex to work with: lots of docs, but not what you actually need... I have found an API which is very simple and goes straight to the ideas I sketched: Redis and the doRedis R package (as recommended here). Finally, a very simple tutorial! I just modified it a bit and got this:
The workers need only R, the doRedis package and this script:
require(doRedis)
redisWorker('jobs', '10.0.0.7') # IP of the server
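If I read the doRedis docs correctly, there is also a startLocalWorkers() helper that spawns several worker processes on one machine, which would cover requirement 1 (using all 4 cores of a slave) without launching the script four times by hand. A minimal sketch, assuming the same 'jobs' queue and server IP as above:

require(doRedis)
# Spawn one background R worker per core, all listening on the 'jobs' queue.
startLocalWorkers(n = 4, queue = 'jobs', host = '10.0.0.7')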
The master needs a running Redis server (I installed the experimental Windows binaries) and this R code:
require(doRedis)
registerDoRedis('jobs')
foreach(j=1:10,.combine=sum,.multicombine=TRUE) %dopar%
... # whatever you need to run
removeQueue('jobs')
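To make this concrete, here is what a complete master script might look like; the toy body (a sleep plus a square) is just a stand-in for whatever model you actually fit, and 'jobs' must match the queue name the workers listen on:

library(doRedis)
library(foreach)

registerDoRedis('jobs')                  # register the 'jobs' queue as the foreach backend

results <- foreach(j = 1:40, .combine = c) %dopar% {
  Sys.sleep(1)                           # pretend this is a 7-minute model fit
  j^2                                    # return the "model result" for task j
}

removeQueue('jobs')
print(results)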
Adding/removing workers is fully dynamic, there is no need to specify IPs at the master, load balancing is automatic, it is simple, and there is no need for tons of docs! This solution fulfills all the requirements and even more - as stated in ?registerDoRedis:
The doRedis parallel back end tolerates faults among the worker processes and automatically resubmits failed tasks.
I don't know how complex this would be to achieve with parallel/snow/snowfall over SOCK/MPI/PVM/NWS, or whether it would be possible at all, but I guess very complex...
The only disadvantages of using Redis that I found:
It is a database server. I wonder if this API exists somewhere without the need to install a database server, which I don't need at all. I guess it must exist!
There is a bug in the current doRedis package ("object '.doRedisGlobals' not found") with no solution yet, and I am not able to install the old working doRedis 1.0.5 package on R 3.0.1.

Related

How to see how many nodes a process is using on a cluster with Sun grid engine?

I am trying to run R on a multicore computing cluster managed by Sun Grid Engine. I would like to run R in parallel using the MPI environment and the snow / snowfall parLapply() functions. My code works, at least on my laptop, but to be sure it does what it is supposed to on the cluster as well, I have the following questions.
If I request a number of slots / nodes, say 4, how can I check whether a running process actually uses the full number of requested CPUs? Is there a command that can show details about the CPU usage on the requested nodes for a process?
In order to verify that the cluster workers really started on the appropriate nodes, I often use the following command right after creating the cluster object:
clusterEvalQ(cl, Sys.info()['nodename'])
This should match the list of allocated nodes reported by the qstat command.
To actually get details on the CPU usage, I often ssh to each node and use commands like top and ps, but that can be painful if there are many nodes to check. We have the Ganglia monitoring system set up on our clusters, so I can use Ganglia's web interface to check various node statistics. You might want to check with your system administrators to see if they have set anything up for monitoring.
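As a rough sketch of that verification step (assuming Rmpi is installed and the job was submitted with an MPI parallel environment; the worker count 4 is just the number of slots requested from SGE), you can also ask each worker how many cores its node exposes:

library(snow)

cl <- makeCluster(4, type = "MPI")    # 4 = slots requested from SGE

# Which node did each worker land on, and how many cores does that node report?
clusterEvalQ(cl, list(node  = Sys.info()[["nodename"]],
                      cores = parallel::detectCores()))

stopCluster(cl)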

How to deploy MPI program?

MPI requires me to deploy the MPI program to each machine. Currently I put the MPI program on NFS, but this method has 2 issues: one is that NFS has latency problems, and the other is that NFS is not suitable for a large cluster. I know I could use some Linux shell commands to sync my program to each node, but that doesn't look very convenient, especially when I change the program frequently. Is there any easy method to do that?
There's nothing wrong with NFS or any other network file system in large clusters; if it is slow, it just means your file server isn't sized for the job. If you replace NFS with something like ssh, ftp, scripts, or whatever and change nothing else, I don't think that'll make any significant difference. Also, if the loading time is a significant and bothersome component of the overall runtime, then why use MPI in the first place?
OK, enough playing devil's advocate. One thing you can do is have nodes load your program onto other nodes in a binary-tree arrangement. You'll need a script that copies the executable to two other nodes along with a copy of the script, starts that script running asynchronously on those nodes, and then runs the executable locally. The result is a chain reaction of copying and running spreading across the network. The only difficult bit is choosing which nodes to copy to so that each one is visited just once. It will be a lot faster.
Depending on the nature of the application and the nature of the NFS network, using a shared file system for both the MPI implementation and the application "should" be able to scale with reasonable performance, to a point. Keep in mind that there is some NFS caching at the node level, so multiple ranks on the same node will not each have to traverse the network to reach the files.
In general terms, I tend to advise that NFS be discontinued at about 128 nodes or 1024 ranks in favor of local installations. That advice changes if NFS is delivered over 10GigE or IPoIB, or if a high-performance file system like SFS or GPFS is used.
If you are committed to local installations, then tools like rsync or scp are good candidates to distribute the bits. Script the final result. You can even tar to the shared file system and use a remote command (e.g. ssh, clush) to un-tar to local disk. The "solution" only needs to be robust, not polished or elegant.
I'll also chime in to say the NFS should be just fine in this use-case, unless you have a cluster of over 100-200 nodes.
If you just want a lightweight tool for doing many-node parallel operations, I'd suggest pdsh. pdsh is a very common tool on HPC clusters. It includes a command called pdcp for doing parallel node copies, e.g.
pdcp -w node[00-99] myfile /path/to/destination/myfile
Where the nodenames are node00, node01, ... node99.
Similarly, you use the pdsh command to run a command in parallel across all the nodes, e.g.,
pdsh -w node[00-99] /path/to/my/executable
Alternatively, if you're looking for something a little less ad hoc for these operations, I can recommend Ansible as an easy and lightweight configuration management and deployment tool. It's not as simple to get started with as pdsh, but it might be more manageable in the long run...
For example, a simple Ansible playbook to copy a tarball to all nodes, extract it, and then execute a binary might look like:
---
- hosts: computenodes
  user: myname
  vars:
    num_procs: 32
  tasks:
    - name: copy and extract tarball to deployment location
      action: unarchive src=myapp.tar.gz dest=/path/to/deploy/
    - name: execute app
      action: command mpirun -np {{num_procs}} /path/to/deploy/myapp.exe

openMPI Master Node Setup Configuration

I'm trying to set up a relatively small cluster (36 cores) with openMPI and I've run into a small problem. I have all the openMPI libraries and dependencies installed and running correctly (I can run a hello-world MPI program on each computer as the localhost). The problem is that I can't seem to find much documentation on how to get the computers to execute a program together. I can use mpirun --hostfile, but I don't want to have to specify the host file every time I execute a job. Plus, future users won't always have access to all the IP addresses on the cluster. They and I expect to be able to execute mpirun -np 20 programFile with no problem. Can someone provide some guidance on what I need to do from this point? To be fair, I've only taken one class in college where we wrote parallel programs with MPI, and we were never shown how to SET UP a new cluster with openMPI. I appreciate any advice you can give. Through my searches I've found the MPICH_Cluster_Setup guide, which would be great if it were for openMPI. Is there a similar guide out there that pertains to openMPI?
You should use a cluster scheduler like Torque, SLURM, or SGE (all are free/FOSS). These let users reserve nodes for their use, and they all "talk" to openMPI to tell it which nodes to use for that user's job (so they don't have to use a hostfile).
As for the MPICH cluster setup doc, it's just about exactly what you need for openMPI as well, but there's no need to set up MPD at the end (MPICH has since deprecated MPD, anyway).

How to fork interactive programs

I have an interactive program with a high start-up cost. After start-up, I'd like to fork the process into separate concurrent sessions. Ideally each separate session would become a GNU screen window but being able to individually telnet/ssh to each session would be fine too.
It shouldn't be too hard to write this from scratch but it seems like something that should have been done/considered before and maybe there are reasons why this is a bad idea...
I know that an alternative approach is to use shared memory for the data that's expensive to initialize. The reason I'm reluctant to go down that path is that the shared data uses C++ data structures with pointers, which makes it hard to mmap it into an unrelated process.
This is what any database does - the startup is phenomenally expensive, but the DB provides several different means of connecting, for example Oracle's BEQ protocol.
Telnet has issues; consider ssh. Either way, consider a daemon that answers requests to connect on a port (you would use AF_UNIX sockets, I guess) and then creates a separate session.
Stevens' Advanced Programming in the UNIX Environment or Rochkind's Advanced UNIX Programming have discussions and complete examples. Since my Stevens book seems to have gone on extensive holiday, see Rochkind 4.3 and 4.10.
And no, there is no pending doom for using this approach.

How does doRedis work?

I've been playing around with the R interface to the redis database, as well as the doRedis parallel backend for foreach. I have a couple of questions, to help me better apply this tool:
doMC, doSMP, doSnow, etc. all seem to work by calling up worker processes on the same computer, passing them an element from a list and a function to apply, and then gathering the results. In the case of doMC, the workers share memory. However, I'm a little bit confused as to how a database can provide this same functionality.
When I add an additional slave computer to the doRedis job queue (as in this video), is the entire doRedis database being sent to the slave computer? Or is each slave just sent the data it needs at a particular moment (i.e. one element of a list and a function to apply)?
How do I explicitly pass additional data and functions to the doRedis job queue that each slave will need to perform its computations?
When using doRedis and foreach, are there any additional 'gotchas' that might not apply to other parallel backends?
I know this is a lot of questions, but I've been running into situations where my limited understanding of how parallel processing works has been hindering my ability to implement it. For example, I recently tried to parallelize a computation on a large database and caught myself passing the entire database to each node in my cluster, an operation which completely destroyed any advantage I'd gained from parallelizing.
Thank you!
One piece of the puzzle is rredis.
1 - doRedis uses rredis. Specifically, doRedis.R uses Redis RPUSH (as it iterates over the foreach items), and each redisWorker uses Redis BRPOP to grab something from the Redis list (which you named in your doRedis "job").
Redis is not just a database. Here it is being used as a queue!
2 - You have one Redis instance, (remotely) accessible to all your R workers. Think of the Redis server as a distributed queue. Your job master pushes items to a list, and workers grab an item, process it, and push the result to the result list. You can have m workers for N items; it depends on what you want to do.
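To see those queue mechanics without the doRedis wrapper, here is a rough sketch using rredis directly (the function names are as I recall them from the rredis package, and 'jobs'/'results' are just illustrative key names):

library(rredis)
redisConnect('10.0.0.7')                 # connect to the Redis server

# Master side: push tasks onto a list that acts as the job queue.
for (j in 1:10) redisRPush('jobs', j)

# Worker side (possibly on another machine): block until a task arrives,
# process it, and push the result onto a second list.
task <- redisBRPop('jobs')               # blocking pop; returns a one-element named list
val  <- task[[1]]
redisRPush('results', val^2)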
3 - Use the env parameter. That uses Redis SET, which all workers have access to (via Redis GET). You pass a delimited expression on the foreach side, and that is set in a string key in Redis to which the workers have access.
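On the foreach side, the generic way to name the objects workers will need is the .export (and .packages) arguments; a hedged sketch, where the lookup table and helper function are made up purely for illustration:

library(doRedis)
library(foreach)

lookup  <- data.frame(id = 1:3, weight = c(0.2, 0.3, 0.5))   # hypothetical shared data
scoreIt <- function(i, tbl) tbl$weight[tbl$id == i] * 100     # hypothetical helper

registerDoRedis('jobs')
# .export ships the named objects to every worker along with the tasks.
foreach(i = 1:3, .combine = c, .export = c('lookup', 'scoreIt')) %dopar%
  scoreIt(i, lookup)
removeQueue('jobs')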
4 - None that I know of (but that is hardly authoritative, so do ask around). I also suggest you read the provided source code. The answers above come straight from reading doRedis.R and redisWorker.R.
Hope this helps.
[P.S. telnet to your Redis server and issue the MONITOR command to watch the chatter back and forth.]
