Passing parameters to parallel R jobs

I am trying to run parallel R jobs using the multicore package. Every job executes the same script.R with different arguments.
A general idea is to define a function that takes the arguments and then calls source("script.R"). The problem is that I cannot pass the arguments to script.R: since I am running in parallel, they cannot be defined in the global scope.
Any help is welcome.

Since you are running parallel R instances, possibly even on different nodes/computers, using an external database to store the parameters might be a good option.
I would use Redis, as it is extremely fast and fully accessible from R, and for parallel runs its companion: doRedis.
So you could have a Redis server (or even a replicated slave database on every host) from which the workers fetch their parameters. You could instantly update the parameters for all workers, even from outside of R, and could easily add new workers to the task with doRedis.
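A rough sketch of that setup (the details here are assumptions, not from the answer: a Redis server reachable at "redis-host", the rredis package installed, and a key name "job:params" chosen purely for illustration):

library(rredis)

# Master (or any process outside R): publish or update the parameters.
redisConnect(host = "redis-host")
redisSet("job:params", list(n = 500, method = "glm"))
redisClose()

# Each worker, possibly on another machine: fetch the current parameters
# before running the script.
redisConnect(host = "redis-host")
params <- redisGet("job:params")
redisClose()
source("script.R")  # script.R can now read `params` from this session's workspace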

Related

How to create cluster for multiple processes?

I have a 31 CPU machine available for parallel computations. I would like to create a single 31-node cluster which would then serve for parallel computations to several different R processes. How can this be done?
I am currently using makeCluster in a way like this:
library(doParallel)  # also loads the parallel package, which provides makeCluster()
cl <- makeCluster(5)
registerDoParallel(cl)
but this will only serve the current R process. How can I connect to a cluster created in a different R process?
PS: The reason why I want multiple processes to access one cluster is that I want to be constantly adding new sets of computations which will be waiting in the queue for the running processes to finish. I hope it will work this way? I have used doRedis for this in the past, but there were some problems and I would like to use a simple cluster for the purpose.

How can OpenCPU run computationally expensive commands simultaneously?

I am currently creating an application which needs to run millions of statistical regressions in a short time. Parallelizing these calculations is one way to accelerate the process.
The OpenCPU server doesn’t seem to scale well with parallel executed commands. All commands are executed in a sequential manner.
Is it possible to spawn multiple R sessions using OpenCPU, or do I need to run multiple instances of the server? Am I missing something here about how OpenCPU can process multiple computationally expensive commands simultaneously?
The OpenCPU cloud server executes all HTTP requests in parallel, so the first observation is false. Of course, you must make simultaneous requests to take advantage of this.
If your code consists of a single R function or script, OpenCPU won't magically parallelize things for you, if that is what you are after. In that case you would need to use something like snow or mcparallel in your R function. But that is unrelated to OpenCPU, which only provides an http interface to your R function or script.
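For instance, a minimal sketch (not from the answer) of parallelizing inside the R function that OpenCPU exposes; the function name run_regressions and the data layout are made up for illustration:

library(parallel)

# Hypothetical package function called through the OpenCPU HTTP API:
# it fits one regression per data set, spreading the work over the local cores.
run_regressions <- function(datasets) {
  mclapply(datasets,
           function(d) coef(lm(y ~ x, data = d)),
           mc.cores = detectCores())
}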

Initializing MPI cluster with snowfall R

I've been trying to run Rmpi and snowfall on my university's clusters but for some reason no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
library(snowfall)
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect which nodes have been allocated by the batch queueing system; otherwise, you will need to use an mpirun argument such as --hostfile to tell it which nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that your script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system which is usually supplied via an environment variable and/or a file.
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out which nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but it is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument, which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on its own node, which is not what you want.
The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.
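As a rough illustration of that last point (assuming the script is launched with mpirun -np 1 as shown above and that Rmpi is installed), the cluster size can be derived from the MPI universe rather than hard-coded:

library(Rmpi)
library(snowfall)

# One slot in the MPI universe is taken by this master process.
nworkers <- mpi.universe.size() - 1
sfInit(parallel = TRUE, cpus = nworkers, type = "MPI")

# ... parallel work with sfLapply(), sfExport(), etc. ...

sfStop()
mpi.quit()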

How can I directly pass a process from local R to an Amazon EC-2 Instance?

I've been looking into running R on EC2, but I'm wondering what the deal is with parallel/cluster computing in this setup. I've had a look around but I haven't been able to find a tutorial for this.
Basically what I'm looking to do is have R (Rstudio) running on my laptop, and do most of the work on that, but then when I have a big operation to run, explicitly pass it to an AWS slave instance to do all the heavy lifting.
As far as I can see, snow/snowfall packages seem to be the answer... but I'm not really sure how.
I'm using the tutorial at http://bioconductor.org/help/bioconductor-cloud-ami/ (the ssh one) to get R running. This tutorial does mention parallel/cluster computing, but it seems to be between different AWS instances.
Any help would be great. Cheers.
If you need only one slave instance, I've found it's easiest to just run everything in parallel on the instance rather than using your PC as a master.
You can write the script on your PC, push it up to a multicore server with R running on it, and then run it there using all cores in parallel.
For example upload this to a 4 core AWS instance:
library(snowfall)
sfInit(parallel = TRUE, cpus = 4, slaveOutfile = "log.txt")
vars <- 1:100
# send variables to all processors
sfExportAll()
# run this in parallel
results <- sfLapply(vars, exp)
# stop parallel processing
sfStop()
# save results
save(results, file = "results.RData")

How does doRedis work?

I've been playing around with the R interface to the redis database, as well as the doRedis parallel backend for foreach. I have a couple of questions, to help me better apply this tool:
doMC, doSMP, doSnow, etc. all seem to work by calling up worker processes on the same computer, passing them an element from a list and a function to apply, and then gathering the results. In the case of doMC, the workers share memory. However, I'm a little bit confused as to how a database can provide this same functionality.
When I add an additional slave computer to the doRedis job queue (as in this video), is the entire Redis database sent to the slave computer, or is each slave sent just the data it needs at a particular moment (i.e. one element of a list and a function to apply)?
How do I explicitly pass additional data and functions to the doRedis job queue that each slave will need to perform its computations?
When using doRedis and foreach, are there any additional 'gotchas' that might not apply to other parallel backends?
I know this is a lot of questions, but I've been running into situations where my limited understanding of how parallel processing works has been hindering my ability to implement it. For example, I recently tried to parallelize a computation on a large database, and caught myself passing the entire database to each node on my cluster, an operation which completely destroyed any advantage I'd gained from parallelizing.
Thank you!
One piece of the puzzle is rredis
1 - doRedis uses rredis. Specifically, doRedis.R uses redis:RPush (as it iterates over the foreach items), and each redisWorker uses redis:BRPop to grab an item from the Redis list (the queue you named for your doRedis job).
Redis is not just a database. Here it is being used as a queue!
2 - You have one instance (remotely) accessible to all your R workers. Think of the Redis server as a distributed queue. Your job master pushes items to a list, and workers grab an item, process it, and push the result to the result list. You can have m workers for N items; it depends on what you want to do (see the sketch at the end of this answer).
3 - Use the env parameter. That uses Redis:Set, which all workers have access to (via Redis:Get). You pass a delimited expression on the foreach side, and it is stored in a string key in Redis to which the workers have access.
4 - None that I know of (but that is hardly authoritative, so do ask around). I also suggest you read the provided source code. The answers above come straight from reading doRedis.R and redisWorker.R.
Hope this helps.
[p.s. telnet to your redis and issue the Redis:monitor command to monitor the chatter back and forth.]
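A rough sketch of the queue model described in point 2 (assumptions: a Redis server on localhost and a queue name "jobs" chosen purely for illustration):

library(doRedis)   # also attaches foreach

# Master side: register a named work queue and start two local workers.
# Workers on other machines would instead run: redisWorker("jobs", host = "master-ip")
registerDoRedis("jobs")
startLocalWorkers(n = 2, queue = "jobs")

# Each iteration is pushed onto the "jobs" list; an idle worker pops it,
# evaluates the body, and pushes the result back to the result list.
results <- foreach(i = 1:100, .combine = c) %dopar% sqrt(i)

removeQueue("jobs")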
