I am currently building an application that needs to run millions of statistical regressions in a short time. Parallelizing these calculations is one way to accelerate the process.
The OpenCPU server doesn't seem to scale well with commands executed in parallel; all commands appear to run sequentially.
Is it possible to spawn multiple R sessions using OpenCPU, or do I need to run multiple instances of the server? Am I missing something about how OpenCPU can process multiple computationally expensive commands simultaneously?
The OpenCPU cloud server executes all HTTP requests in parallel, so your first observation is incorrect. Of course, you must make simultaneous requests to benefit from this.
If your code consists of a single R function or script, OpenCPU won't magically parallelize things for you, if that is what you are after. In that case you would need to use something like snow or mcparallel inside your R function. But that is unrelated to OpenCPU, which only provides an HTTP interface to your R function or script.
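For instance, a package function that OpenCPU exposes could parallelize internally with the parallel package. A minimal sketch, assuming a hypothetical function run_regressions with made-up data (OpenCPU would simply call it over HTTP):

```r
# Hypothetical package function that OpenCPU could expose at
# POST /ocpu/library/mypkg/R/run_regressions -- the parallelism
# happens inside the function, not in OpenCPU itself.
run_regressions <- function(n_models = 100, cores = 2) {
  # Made-up example data: n_models small regression datasets
  datasets <- replicate(n_models, {
    x <- rnorm(50)
    data.frame(x = x, y = 2 * x + rnorm(50))
  }, simplify = FALSE)
  # mclapply forks the session (Unix only; falls back to serial on Windows)
  fits <- parallel::mclapply(datasets,
                             function(d) coef(lm(y ~ x, data = d)),
                             mc.cores = cores)
  do.call(rbind, fits)
}
```

Each HTTP request then still occupies one R process, but the work inside that process is spread across cores.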
Related
I'm pretty new to R Plumber. I'm trying to deploy an R function as an API so that a web app can do live calculations. I understand R is single-threaded by default and, hence, Plumber inherits the same limitation when handling requests. The function I'm trying to deploy is not costly, but it will probably be called multiple times in a single session.
I'm also quite a newbie at serving/deploying web applications, but I do know how to set up an Apache server. I've noticed that Apache can receive and process multiple requests by opening new threads (I honestly treat this as a black-box, magical thing and have zero knowledge of how Apache does it). Would serving the Plumber API through Apache allow me to bypass the single-thread limitation?
Alternatively, would it be possible to bypass the single-thread limitation by using doParallel (or something similar)?
Plumber is single-threaded and, as such, can only handle one request at a time. I found out that, with enough resources, you can deploy several plumber servers listening on different ports. It's hacky and nasty, but it got the job done.
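That multi-port setup can be fronted by Apache so clients see a single endpoint. A sketch, assuming four plumber workers already running on ports 8000-8003 (the ports and balancer name are arbitrary):

```apache
# Hypothetical Apache config: round-robin requests across four
# plumber workers (requires mod_proxy, mod_proxy_http and
# mod_proxy_balancer to be enabled).
<Proxy "balancer://plumber">
    BalancerMember "http://127.0.0.1:8000"
    BalancerMember "http://127.0.0.1:8001"
    BalancerMember "http://127.0.0.1:8002"
    BalancerMember "http://127.0.0.1:8003"
</Proxy>
ProxyPass        "/api" "balancer://plumber"
ProxyPassReverse "/api" "balancer://plumber"
```

Note that this does not make any single R process multi-threaded; it just spreads concurrent requests across independent processes.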
We are running R Services for SQL Server on an Azure VM to create a web interface to a modeling algorithm. We use T-SQL to call a very complicated and memory-intensive R script. It runs fine when we submit a single job, but we get a memory-allocation error when we submit a second job before the first has finished. Eventually we will need to queue hundreds of jobs that will run over many hours. We assume the server is initiating two R processes, which together exhaust the available resources.

What is the best way to force the R jobs to run sequentially rather than simultaneously? We have looked at setting MAX_PROCESSES = 0 in the resource pool, with no success (we have already adjusted memory resources this way). We are considering the Azure Service Bus queue, but we are hoping there might be simpler options. We would appreciate any advice on how others have dealt with this sort of issue.
Thanks 10^6.
I've been trying to run Rmpi and snowfall on my university's clusters but for some reason no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system, otherwise you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that your script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system which is usually supplied via an environment variable and/or a file.
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out which nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but it is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument, which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on its own node, which is not what you want.

The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
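Putting those pieces together, par.R might look like the following sketch. It only runs under mpirun with an actual MPI allocation, so treat it as an outline rather than something to run locally:

```r
# par.R -- launched from the job script via: mpirun -np 1 R --slave -f par.R
library(Rmpi)
library(snowfall)

# mpi.universe.size() reports the total slot count mpirun knows about;
# one slot is taken by this master process, so spawn one fewer worker.
sfInit(parallel = TRUE,
       cpus = mpi.universe.size() - 1,
       type = "MPI")

# Placeholder workload distributed across the spawned workers
results <- sfLapply(1:100, function(i) i^2)

sfStop()
mpi.exit()
```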
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.
I've been looking into running R on EC2, but I'm wondering what the deal is with parallel/cluster computing in this setup. I've had a look around, but I haven't been able to find a tutorial for it.
Basically what I'm looking to do is have R (RStudio) running on my laptop and do most of the work there, but then, when I have a big operation to run, explicitly pass it to an AWS slave instance to do the heavy lifting.
As far as I can see, the snow/snowfall packages seem to be the answer... but I'm not really sure how.
I'm using the tutorial at http://bioconductor.org/help/bioconductor-cloud-ami/ (the SSH one) to get R running. The tutorial does mention parallel/cluster computing, but it seems to be about connecting different AWS instances.
Any help would be great. Cheers.
If you only need one slave instance, I've found it's easiest to just run the work in parallel on that instance rather than using your PC as the master.
You can write the script on your PC, push it up to a multicore server with R installed, and then run it there using all cores in parallel.
For example, upload this to a 4-core AWS instance:
library(snowfall)
# Start a local cluster with 4 workers; worker output goes to log.txt
sfInit(parallel = TRUE, cpus = 4, slaveOutfile = "log.txt")
vars <- 1:100
# Send all global variables to the workers
sfExportAll()
# Run in parallel
results <- sfLapply(vars, exp)
# Stop parallel processing
sfStop()
# Save the results
save(results, file = "results.RData")
I am trying to run parallel R jobs using the multicore package. Every job executes the same script.R with different arguments.
The general idea is to define a function that takes the args and then calls source("script.R"). The problem is that I cannot pass the args to script.R: since I am running in parallel, the args cannot be defined in the global scope.
Any help is welcome.
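One way to sketch the idea is to give each call its own environment via sys.source, so the "arguments" never touch the global scope. This is a hedged illustration: the stand-in script, the run_one helper, and the convention that the script assigns `result` are all assumptions.

```r
# For demonstration, write a stand-in script.R that reads a and b
# from its evaluation environment and assigns `result`.
script <- tempfile(fileext = ".R")
writeLines("result <- a + b", script)

# Run the script once with its own "arguments": each call gets a
# fresh environment, so nothing is defined in the global scope.
run_one <- function(path, a, b) {
  env <- new.env()
  env$a <- a
  env$b <- b
  sys.source(path, envir = env)
  get("result", envir = env)
}

# Each parallel job passes independent arguments (multicore's
# mclapply now lives in the base parallel package).
results <- parallel::mclapply(1:4, function(i) run_one(script, i, i^2),
                              mc.cores = 2)
unlist(results)  # 2, 6, 12, 20
```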
Since you are running parallel R instances, possibly even on different nodes/computers, using an external database to store the parameters might be a good option.
I would use Redis, as it is extremely fast and fully accessible from R, and for parallel runs its companion package: doRedis.
So you could have a Redis server (or even a replicated slave database on every host) from which the parameters are fetched. You could update the parameters instantly, even from outside R, making them available to all workers, and with doRedis you could easily add new workers to the task.
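A minimal sketch of that setup, assuming a Redis server is reachable on localhost (the queue name "jobs" and the toy parameter lists are arbitrary):

```r
library(foreach)
library(doRedis)

# Register against a shared Redis work queue; any machine that can
# reach the Redis server can add workers to the same queue.
registerDoRedis("jobs")
startLocalWorkers(n = 2, queue = "jobs")

# Parameters travel through Redis to whichever worker picks up the task.
params <- list(list(a = 1, b = 2), list(a = 3, b = 4))
results <- foreach(p = params) %dopar% {
  p$a + p$b   # stand-in for sourcing script.R with these parameters
}

removeQueue("jobs")
```

Because the queue lives outside any single R session, you can start additional workers on other hosts at any time and they will begin pulling tasks immediately.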