I would like to use the plumber package to carry out some flexible parallel processing and was hoping it would work within a node.js framework such that it is non-blocking...
I have the following plumber file.
# myfile.R

#* @get /mean
normalMean <- function(samples = 10){
  Sys.sleep(5)
  data <- rnorm(samples)
  mean(data)
}
I have also installed pm2 as suggested here http://plumber.trestletech.com/docs/hosting/
I have also made the same run-myfile.sh file i.e.
#!/bin/bash
R -e "library(plumber); pr <- plumb('myfile.R'); pr\$run(port=4000)"
and made it executable as suggested...
I have started up pm2 using
pm2 start /path/to/run-myfile.sh
and wanted to test whether it behaves in a non-blocking, node.js-like fashion...
by opening up another R console and running the following...
foo <- function(){
  con <- curl::curl('http://localhost:4000/mean?samples=10000', handle = curl::new_handle())
  on.exit(close(con))
  return(readLines(con, n = 1, ok = FALSE, warn = FALSE))
}

system.time(for (i in seq(5)){
  print(foo())
})
Perhaps it is my misunderstanding of how a node.js non-blocking framework is meant to work, but in my head the last loop should take only a bit over 5 seconds. Instead it takes about 25 seconds, suggesting everything is handled sequentially rather than in parallel.
How could I use the plumber package to carry out that non-blocking nature?
pm2 can't load-balance R processes for you, unfortunately. R is single-threaded and doesn't really have libraries that allow it to behave in asynchronous fashion like NodeJS does (yet), so there aren't many great ways to parallelize code like this in plumber today. The best option would be to run multiple plumber R back-ends and distribute traffic across them. See the "load balancing" section here: http://plumber.trestletech.com/docs/docker-advanced
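As a rough illustration of that multiple-back-end approach, you could start several plumber processes on different ports under pm2 and put a reverse proxy in front of them. Everything below is a sketch, not something plumber does for you: it assumes run-myfile.sh is modified to take the port as its first argument ($1) instead of hardcoding 4000, and the ports and process names are made up.

```shell
#!/bin/bash
# Start three independent plumber back-ends under pm2,
# each on its own (hypothetical) port
pm2 start ./run-myfile.sh --name plumber-4000 -- 4000
pm2 start ./run-myfile.sh --name plumber-4001 -- 4001
pm2 start ./run-myfile.sh --name plumber-4002 -- 4002
```

A reverse proxy such as nginx or HAProxy would then distribute incoming requests across localhost:4000-4002; the docker-advanced page linked above shows the equivalent setup with containers. Each back-end still blocks on Sys.sleep(5), but three concurrent requests can now be served by three different R processes.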
Basically, concurrent requests are queued by httpuv, so a single plumber process is not performant by itself. The author recommends multiple Docker containers, but that can be complicated as well as resource-demanding.
There are other technologies, e.g. Rserve and rApache. Rserve forks processes, and rApache can be configured to pre-fork so as to handle concurrent requests.
See the following posts for comparison
https://www.linkedin.com/pulse/api-development-r-part-i-jaehyeon-kim/
https://www.linkedin.com/pulse/api-development-r-part-ii-jaehyeon-kim/
Related
I am queueing and running an R script on an HPC cluster via sbatch and mpirun; the script is meant to use foreach in parallel. To do this I've used several useful questions & answers from StackOverflow: R Running foreach dopar loop on HPC MPIcluster, Single R script on multiple nodes, Slurm: Use cores from multiple nodes for R parallelization.
It seems that the script completes, but a couple of strange things happen. The most important is that the slurm job keeps on running afterwards, doing nothing(?). I'd like to understand if I'm doing things properly. I'll first give some more specific information, then explain the strange things I'm seeing, then I'll ask my questions.
– Information:
R is loaded as a module, which also calls an OpenMPI module. The packages Rmpi, doParallel, snow, foreach were already compiled and included in the module.
The cluster has nodes with 20 CPUs each. My sbatch file books 2 nodes and 20 CPUs per node.
The R script myscript.R is called in the sbatch file like this:
mpirun -np 1 Rscript -e "source('myscript.R')"
My script calls several libraries in this order:
library('snow')
library('Rmpi')
library('doParallel')
library('foreach')
and then sets up parallelization as follows at the beginning:
workers <- mpi.universe.size() - 1
cl <- makeMPIcluster(workers, outfile='', type='MPI')
registerDoParallel(cl)
Then several foreach-dopar are called in succession – that is, each starts after the previous has finished. Finally
stopCluster(cl)
mpi.quit()
are called at the very end of the script.
mpi.universe.size() correctly gives 40, as expected. Also, getDoParName() gives doParallelSNOW. The slurm log encouragingly says
39 slaves are spawned successfully. 0 failed.
starting MPI worker
starting MPI worker
...
Also, calling print(clusterCall(cl, function() Sys.info()[c("nodename","machine")])) from within the script correctly reports the node names shown in the slurm queue.
– What's strange:
The R script completes all its operations, the last one being saving a plot as a pdf, which I do see and which is correct. But the slurm job doesn't end; it remains in the queue indefinitely with status "running".
The slurm log shows very many lines with Type: EXEC. I can't find any relation between their number and the number of foreach calls. At the very end the log shows 19 lines with Type: DONE (which make sense to me).
– My questions:
Why does the slurm job run indefinitely after the script has finished?
Why the numerous Type: EXEC messages? Are they normal?
There is some masking between packages snow and doParallel. Am I calling the right packages and in the right order?
Some answers to the StackOverflow questions mentioned above recommend to call the script with
mpirun -np 1 R --slave -f 'myscript.R'
instead of using Rscript as I did. What's the difference? Note that the problems I mentioned remain even if I call the script this way, though.
I thank you very much for your help!
I am using R and spark to run a simple example to test spark.
I have a spark master running locally using the following:
spark-class org.apache.spark.deploy.master.Master
I can see the status page at http://localhost:8080/
Code:
system("spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --master local[*]")
suppressPackageStartupMessages(library(SparkR)) # Load the library
sc <- sparkR.session(master = "local[*]")
df <- as.DataFrame(faithful)
head(df)
Now this runs fine when I do the following (code is saved as 'sparkcode'):
Rscript sparkcode.R
Problem:
But what happens is that a new Spark instance is created each time; I want R to use the existing master instance (it should appear as a completed job at http://localhost:8080/#completed-app).
P.S: using Mac OSX , spark 2.1.0 and R 3.3.2
A number of things:
If you use a standalone cluster, use the correct URL, which should be sparkR.session(master = "spark://hostname:port"). Both hostname and port depend on the configuration, but the standard port is 7077 and hostname should default to the machine's hostname. This is the main problem.
Avoid using spark-class directly. That is what the $SPARK_HOME/sbin/ scripts are for (like start-master.sh). They are not crucial, but they handle small and tedious tasks for you.
The standalone master is only a resource manager. You have to start worker nodes as well (start-slave.sh).
It is usually better to use bin/spark-submit though it shouldn't matter much here.
spark-csv is no longer necessary in Spark 2.x, and even if it were, Spark 2.1 uses Scala 2.11 by default. Not to mention that 1.0.3 is extremely old (from around Spark 1.3 or so).
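Putting those points together, a minimal sketch might look like the following. The hostname localhost and the default port 7077 are assumptions to adapt to your configuration:

```r
# Start the standalone master and one worker with the sbin/ scripts
# (run these in a terminal, not from R):
#   $SPARK_HOME/sbin/start-master.sh
#   $SPARK_HOME/sbin/start-slave.sh spark://localhost:7077

suppressPackageStartupMessages(library(SparkR))

# Connect to the existing standalone master instead of local[*]
sc <- sparkR.session(master = "spark://localhost:7077")

df <- as.DataFrame(faithful)  # spark-csv is not needed in Spark 2.x
head(df)
```

Run as Rscript sparkcode.R, the application should now show up on the master's status page at http://localhost:8080/ instead of spinning up a private local[*] instance.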
I am now dealing with a large dataset and I want to use parallel calculation to accelerate the process. WestGird is a Canadian computing system which has clusters with interconnect.
I use two packages doSNOW and parallel to do parallel jobs. My question is how I should write the pbs file. When I submit the job using qsub, an error occurs: mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
Here is the R script code:
install.packages("fume_1.0.tar.gz", repos = NULL, type = "source")
library(fume)
library(foreach)
library(doSNOW)
load("spei03_df.rdata",.GlobalEnv)
cl <- makeCluster(mpi.universe.size(), type='MPI' )
registerDoSNOW(cl)
MK_grid <-
foreach(i=1:6000, .packages="fume",.combine='rbind') %dopar% {
abc <- mkTrend(as.matrix(spei03_data)[i,])
data.frame(P_value=abc$`Corrected p.value`, Slope=abc$`Sen's Slope`*10,Zc=abc$Zc)
}
stopCluster(cl)
save(MK_grid,file="MK_grid.rdata")
mpi.exit()
The "fume" package is download from https://cran.r-project.org/src/contrib/Archive/fume/ .
Here is the pbs file:
#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -l walltime=2:00:00
module load application/R/3.3.1
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
mpirun -np 1 -hostfile $PBS_NODEFILE R CMD BATCH Trend.R
Can anyone help? Thanks a lot.
It's difficult to give advice on how to use a compute cluster that I've never used since each cluster is setup somewhat differently, but I can give you some general advice that may help.
Your job script looks reasonable to me. It's very similar to what I use on one of our Torque/Moab clusters. It's a good idea to verify that you're able to load all of the necessary R packages interactively because sometimes additional module files may need to be loaded. If you need to install packages yourself, make sure you install them in the standard "personal library" which is called something like "~/R/x86_64-pc-linux-gnu-library/3.3". That often avoids errors loading packages in the R script when executing in parallel.
I have more to say about your R script:
You need to load the Rmpi package in your R script using library(Rmpi). It isn't automatically loaded when loading doSNOW, so you will get an error when calling mpi.universe.size().
I don't recommend installing R packages in the R script itself. That will fail if install.packages needs to prompt you for the CRAN repository, for example, since you can't execute interactive functions from an R script executed via mpirun.
I suggest starting mpi.universe.size() - 1 cluster workers when calling makeCluster. Since mpirun starts one worker, it may not be safe for makeCluster to spawn mpi.universe.size() additional workers, since that would result in a total of mpi.universe.size() + 1 MPI processes. That works on some clusters, but it fails on at least one of our clusters.
While debugging, try using the makeCluster outfile='' option. Depending on your MPI installation, that may let you see error messages that would otherwise be hidden.
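Pulling that advice together, the head of the R script might look like this sketch (the file and object names are kept from the question; the explicit Rmpi load, the reduced worker count, and outfile='' are the changes being suggested):

```r
# Load Rmpi explicitly so mpi.universe.size() is available;
# doSNOW does not load it for you
library(Rmpi)
library(foreach)
library(doSNOW)
library(fume)  # install beforehand into your personal library, not in the script

load("spei03_df.rdata", .GlobalEnv)

# mpirun already started one process, so spawn one fewer worker;
# outfile = '' surfaces worker error messages while debugging
cl <- makeCluster(mpi.universe.size() - 1, type = "MPI", outfile = "")
registerDoSNOW(cl)
```

The rest of the script (the foreach loop, stopCluster, and mpi.exit) can stay as it is.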
Is it possible for processes spawned by Rserve to share some common libraries loaded once into memory?
Imagine that I need to execute the code below on 100 different RConnections concurrently.
library(libraryOfSize40MB)
fun()
It means that I need about 3.9GB of memory just to load the library. I would prefer to load the library once and then execute fun() one hundred times, so that I can run this on a cheap host.
Maybe this is helpful?
https://github.com/s-u/Rserve/blob/master/NEWS#L40-L48
It is possible. You have to run Rserve from the R shell using run.Rserve, preceded by the library calls:
library(Rserve)
#load libraries so all connections will share them
library("yaml")
library("reshape")
library("rjson")
library("zoo")
(...)
library("stringr")
run.Rserve(debug = TRUE, port = 6311, remote=TRUE, auth=FALSE, args="--no-save", config.file = "/etc/Rserve.conf")
Every new connection will be able to see these libraries:
library(RSclient)
con = RS.connect(host='10.1.2.3')
RS.eval(con, quote(search()))
> #lots of libraries available
I've been fighting this problem for the second day straight, after a completely sleepless night, and I'm really starting to lose my patience and strength. It all started after I decided to provision another (paid) AWS EC2 instance in order to test my R code for dissertation data analysis. Previously I was using a single free tier t1.micro instance, which is painfully slow, especially when testing/running particular code. Time is much more valuable than the reasonable number of cents per hour that Amazon is charging.
Therefore, I provisioned a m3.large instance, which I hope should have enough power to crunch my data comfortably fast. After EC2-specific setup, which included selecting Ubuntu 14.04 LTS as an operating system and some security setup, I installed R and RStudio Server per instructions via sudo apt-get install r-base r-base-dev as ubuntu user. I also created ruser as a special user for running R sessions. Basically, the same procedure as on the smaller instance.
The current situation is that any command I issue at the R session command line results in messages like this: Error: could not find function "sessionInfo". The only function that works is q(). I suspect a permissions problem here; however, I'm not sure how to approach investigating permission-related problems in an R environment. I'm also curious what the reasons for such a situation could be, considering that I was following recommendations from R Project and RStudio sources.
I was able to pinpoint the place that I think caused all that horror: it was just a small configuration file, /etc/R/Rprofile.site, which I had previously updated with directives borrowed from R experts' posts here on StackOverflow. After removing the questionable contents, I was able to run R commands successfully. Out of curiosity, and to share this hard-earned knowledge, here are the removed contents:
local({
# add DISS_FLOSS_PKGS to the default packages, set a CRAN mirror
DISS_FLOSS_PKGS <- c("RCurl", "digest", "jsonlite",
"stringr", "XML", "plyr")
#old <- getOption("defaultPackages")
r <- getOption("repos")
r["CRAN"] <- "http://cran.us.r-project.org"
#options(defaultPackages = c(old, DISS_FLOSS_PKGS), repos = r)
options(defaultPackages = DISS_FLOSS_PKGS, repos = r)
#lapply(list(DISS_FLOSS_PKGS), function() library)
library(RCurl)
library(digest)
library(jsonlite)
library(stringr)
library(XML)
library(plyr)
})
Any comments on this will be appreciated!
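For what it's worth, the likely culprit is that options(defaultPackages = DISS_FLOSS_PKGS, ...) replaces the default package list rather than appending to it, so utils (home of sessionInfo) and friends are never attached, while the library() calls run at a point in startup where a single missing package aborts the whole profile. A more defensive variant, keeping the same package list from the question as a sketch, would be:

```r
local({
  DISS_FLOSS_PKGS <- c("RCurl", "digest", "jsonlite",
                       "stringr", "XML", "plyr")
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.us.r-project.org"
  # Append to the default packages instead of replacing them,
  # and let R attach everything after startup completes;
  # no library() calls in the profile itself
  old <- getOption("defaultPackages")
  options(defaultPackages = c(old, DISS_FLOSS_PKGS), repos = r)
})
```

This preserves utils, stats, graphics, etc. in the default list, and a package that fails to attach produces a startup warning instead of a half-initialized session.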