Is it possible for processes spawned by Rserve to share common libraries that are loaded into memory only once?
Imagine that I need to execute the code below on 100 different RConnections concurrently.
library(libraryOfSize40MB)
fun()
It means that I need about 3.9GB of memory just to load the library. I would prefer to load the library once and then execute fun() one hundred times, so that I can run this on a cheap host.
Maybe this is helpful?
https://github.com/s-u/Rserve/blob/master/NEWS#L40-L48
It is possible. You have to run Rserve from the R shell using run.Rserve, loading the libraries beforehand:
library(Rserve)
#load libraries so all connections will share them
library("yaml")
library("reshape")
library("rjson")
library("zoo")
(...)
library("stringr")
run.Rserve(debug = TRUE, port = 6311, remote=TRUE, auth=FALSE, args="--no-save", config.file = "/etc/Rserve.conf")
Every new connection will be able to see these libraries:
library(RSclient)
con = RS.connect(host='10.1.2.3')
RS.eval(con, quote(search()))
> #lots of libraries available
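For completeness, here is a rough sketch of how a client might fan fun() out over many connections (the host, port, worker count and fun() itself are placeholders taken from the question; on a unix host each connection is forked from the running R process, so the preloaded packages are shared copy-on-write rather than loaded again):
library(RSclient)
library(parallel)

run_one <- function(i) {
  con <- RS.connect(host = "10.1.2.3", port = 6311)
  on.exit(RS.close(con))
  # fun() is provided by the library loaded before run.Rserve()
  RS.eval(con, quote(fun()))
}

# 100 evaluations spread over a handful of local worker processes
results <- mclapply(1:100, run_one, mc.cores = 8)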
I have written some SparkR code and am wondering whether I can submit it using spark-submit or sparkR on an EMR cluster.
I have tried several ways, for example:
sparkR mySparkRScript.r or sparkR --no-save mySparkRScript.r etc., but every time I get the error below:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Sample Code:
#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node
sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
Note: I am able to run my code in the sparkR shell by pasting it in directly or by using source("mySparkRScript.R").
Ref:
Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR (Spark documentation)
R on Spark
Executing existing R scripts from Spark (Rutger de Graaf, GitHub)
I was able to get this running via Rscript. There are a few things you need to do, and this may be a bit process intensive. If you are willing to give it a go, I would recommend:
Figure out how to do an automated SparkR or sparklyR build. Via: https://github.com/UrbanInstitute/spark-social-science
Use the AWS CLI to first create a cluster with the EMR template and bootstrap script you will create by following Step 1. (Make sure to put the EMR template and rstudio_sparkr_emrlyr_blah_blah.sh scripts into an S3 bucket.)
Place your R code into a single file and put this in another S3 bucket...the sample code you have provided would work just fine, but I would recommend actually doing some operation, say reading in data from S3, adding a value to it, then writing it back out (just to confirm it works before getting into the 'heavy' code you might have sitting around)
Create another .sh file that copies the R file from the S3 bucket you have to the cluster, and then execute it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file might look like this:
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
In the AWS CLI, at the time of cluster creation, insert a --step into your cluster creation call. Use the custom JAR runner (script-runner.jar) provided by Amazon to run the shell script that copies and executes the R code.
Make sure to stop the Spark session at the end of your R code (a sketch of such an R file follows this list).
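To make the R-file step and the "stop the Spark session" point concrete, a hypothetical theNameOfTheFileToRun.R might look roughly like this (the bucket paths and column name are made up; only the general shape matters):
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "simple-s3-roundtrip")
# read from S3, do a trivial operation, write the result back out
df  <- read.df("s3://my-input-bucket/some.csv", source = "csv", header = "true")
df2 <- withColumn(df, "one_more", lit(1))
write.df(df2, path = "s3://my-output-bucket/out", source = "csv", mode = "overwrite")
# stop the session so the EMR step terminates cleanly
sparkR.session.stop()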
An example of the AWS CLI command might look like this (I'm using the us-east-1 region on Amazon in my example, and putting a 100GB disk on each worker in the cluster; just put your region in wherever you see 'us-east-1' and pick whatever disk size you want instead):
aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
Good luck! Cheers, Nate
I am using R and spark to run a simple example to test spark.
I have a spark master running locally using the following:
spark-class org.apache.spark.deploy.master.Master
I can see the status page at http://localhost:8080/
Code:
system("spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --master local[*]")
suppressPackageStartupMessages(library(SparkR)) # Load the library
sc <- sparkR.session(master = "local[*]")
df <- as.DataFrame(faithful)
head(df)
Now this runs fine when I do the following (code is saved as 'sparkcode'):
Rscript sparkcode.R
Problem:
But what happens is that a new Spark instance is created; I want R to use the existing master instance (which I should then see as a completed application at http://localhost:8080/#completed-app).
P.S.: I am using macOS, Spark 2.1.0 and R 3.3.2.
A number of things:
If you use a standalone cluster, use the correct URL, which should be sparkR.session(master = "spark://hostname:port"). Both hostname and port depend on the configuration, but the standard port is 7077 and hostname should default to the machine's hostname. This is the main problem; a sketch follows this list.
Avoid using spark-class directly. This is what the $SPARK_HOME/sbin/ scripts are for (like start-master.sh). They are not crucial, but they handle small and tedious tasks for you.
The standalone master is only a resource manager. You have to start worker nodes as well (start-slave.sh).
It is usually better to use bin/spark-submit, though it shouldn't matter much here.
spark-csv is no longer necessary in Spark 2.x, and even if it were, Spark 2.1 uses Scala 2.11 by default, so the _2.10 artifact wouldn't match. Not to mention that 1.0.3 is extremely old (from around the Spark 1.3 era).
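Putting the first point into code, a minimal sketch of the connection (the hostname and the default port 7077 are assumptions; use the spark:// URL shown at http://localhost:8080/, and make sure at least one worker has been started with start-slave.sh against that URL):
suppressPackageStartupMessages(library(SparkR))
# replace "localhost" with whatever the master's status page reports
sc <- sparkR.session(master = "spark://localhost:7077")
df <- as.DataFrame(faithful)
head(df)
sparkR.session.stop()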
I am now dealing with a large dataset and I want to use parallel computation to accelerate the process. WestGrid is a Canadian computing system which has clusters with interconnect.
I use two packages, doSNOW and parallel, to do parallel jobs. My question is how I should write the PBS file. When I submit the job using qsub, an error occurs: mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
Here is the R script code:
install.packages("fume_1.0.tar.gz")
library(fume)
library(foreach)
library(doSNOW)
load("spei03_df.rdata",.GlobalEnv)
cl <- makeCluster(mpi.universe.size(), type='MPI' )
registerDoSNOW(cl)
MK_grid <-
foreach(i=1:6000, .packages="fume",.combine='rbind') %dopar% {
abc <- mkTrend(as.matrix(spei03_data)[i,])
data.frame(P_value=abc$`Corrected p.value`, Slope=abc$`Sen's Slope`*10,Zc=abc$Zc)
}
stopCluster(cl)
save(MK_grid,file="MK_grid.rdata")
mpi.exit()
The "fume" package was downloaded from https://cran.r-project.org/src/contrib/Archive/fume/.
Here is the pbs file:
#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -l walltime=2:00:00
module load application/R/3.3.1
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
mpirun -np 1 -hostfile $PBS_NODEFILE R CMD BATCH Trend.R
Can anyone help? Thanks a lot.
It's difficult to give advice on how to use a compute cluster that I've never used since each cluster is setup somewhat differently, but I can give you some general advice that may help.
Your job script looks reasonable to me. It's very similar to what I use on one of our Torque/Moab clusters. It's a good idea to verify that you're able to load all of the necessary R packages interactively because sometimes additional module files may need to be loaded. If you need to install packages yourself, make sure you install them in the standard "personal library" which is called something like "~/R/x86_64-pc-linux-gnu-library/3.3". That often avoids errors loading packages in the R script when executing in parallel.
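For example, the archived fume tarball from the question can be installed once from an interactive session on a login node (with no lib= argument the package lands in the first writable library, normally that personal library):
# run interactively before submitting the job, not inside Trend.R
install.packages("fume_1.0.tar.gz", repos = NULL, type = "source")
library(fume)   # quick check that it loads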
I have more to say about your R script (a revised sketch follows this list):
You need to load the Rmpi package in your R script using library(Rmpi). It isn't automatically loaded when loading doSNOW, so you will get an error when calling mpi.universe.size().
I don't recommend installing R packages in the R script itself. That will fail if install.packages needs to prompt you for the CRAN repository, for example, since you can't execute interactive functions from an R script executed via mpirun.
I suggest starting mpi.universe.size() - 1 cluster workers when calling makeCluster. Since mpirun starts one worker, it may not be safe for makeCluster to spawn mpi.universe.size() additional workers, since that would result in a total of mpi.universe.size() + 1 MPI processes. That works on some clusters, but it fails on at least one of our clusters.
While debugging, try using the makeCluster outfile='' option. Depending on your MPI installation, that may let you see error messages that would otherwise be hidden.
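Putting those points together, a sketch of how Trend.R might look after the changes (mkTrend, spei03_data and the file names are taken unchanged from the question; fume is assumed to be installed in the personal library already):
library(Rmpi)        # needed for mpi.universe.size() and mpi.exit()
library(fume)
library(foreach)
library(doSNOW)
load("spei03_df.rdata", .GlobalEnv)
# leave one MPI slot for the master process started by mpirun -np 1
cl <- makeCluster(mpi.universe.size() - 1, type = "MPI", outfile = "")
registerDoSNOW(cl)
MK_grid <-
  foreach(i = 1:6000, .packages = "fume", .combine = "rbind") %dopar% {
    abc <- mkTrend(as.matrix(spei03_data)[i, ])
    data.frame(P_value = abc$`Corrected p.value`, Slope = abc$`Sen's Slope` * 10, Zc = abc$Zc)
  }
stopCluster(cl)
save(MK_grid, file = "MK_grid.rdata")
mpi.exit()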
I would like to use the plumber package to carry out some flexible parallel processing and was hoping it would work within a node.js framework such that it is non-blocking...
I have the following plumber file.
# myfile.R
#* @get /mean
normalMean <- function(samples=10){
Sys.sleep(5)
data <- rnorm(samples)
mean(data)
}
I have also installed pm2 as suggested here http://plumber.trestletech.com/docs/hosting/
I have also made the run-myfile.sh file suggested there, i.e.
#!/bin/bash
R -e "library(plumber); pr <- plumb('myfile.R'); pr\$run(port=4000)"
and made it executable as suggested...
I have started up pm2 using
pm2 start /path/to/run-myfile.sh
and wanted to test whether it behaves in a non-blocking, node.js-like fashion by opening up another R console and running the following:
foo <- function(){
con <- curl::curl('http://localhost:4000/mean?samples=10000',handle = curl::new_handle())
on.exit(close(con))
return(readLines(con, n = 1, ok = FALSE, warn = FALSE))
}
system.time(for (i in seq(5)){
print(foo())
})
Perhaps it is my misunderstanding of how a node.js non-blocking framework is meant to work, but in my head the loop above should take only a bit over 5 seconds. Instead it seems to take 25 seconds, suggesting everything runs sequentially rather than in parallel.
How could I use the plumber package to achieve that non-blocking behaviour?
pm2 can't load-balance R processes for you, unfortunately. R is single-threaded and doesn't really have libraries that allow it to behave in an asynchronous fashion like NodeJS does (yet), so there aren't many great ways to parallelize code like this in plumber today. The best option would be to run multiple plumber R back-ends and distribute traffic across them. See the "load balancing" section here: http://plumber.trestletech.com/docs/docker-advanced
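As a rough sketch of that approach (the PLUMBER_PORT variable and the port numbers are made up; in practice a reverse proxy such as nginx or HAProxy, or the Docker setup from the link, would distribute requests over the ports):
# run-backend.R -- start one plumber back-end per process, each on its own port, e.g.
#   PLUMBER_PORT=4000 Rscript run-backend.R
#   PLUMBER_PORT=4001 Rscript run-backend.R
library(plumber)
port <- as.integer(Sys.getenv("PLUMBER_PORT", "4000"))
pr <- plumb("myfile.R")   # the file from the question
pr$run(port = port)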
Basically, concurrent requests are queued by httpuv, so plumber is not performant for concurrent traffic by itself. The author recommends running multiple Docker containers, but that can be complicated as well as resource-demanding.
There are other technologies, e.g. Rserve and rApache. Rserve forks processes, and rApache can be configured to pre-fork so as to handle concurrent requests (a sketch of the Rserve route follows the links below).
See the following posts for comparison
https://www.linkedin.com/pulse/api-development-r-part-i-jaehyeon-kim/
https://www.linkedin.com/pulse/api-development-r-part-ii-jaehyeon-kim/
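For the Rserve route, a minimal sketch (it reuses normalMean() from the question; because Rserve forks a separate process per connection on unix, the five calls can genuinely overlap):
# server side, in one R session: define the function, then start Rserve
normalMean <- function(samples = 10) {
  Sys.sleep(5)
  mean(rnorm(samples))
}
library(Rserve)
run.Rserve(port = 6311)

# client side, in another R session: fire five calls over separate connections
library(RSclient)
library(parallel)
call_one <- function(i) {
  con <- RS.connect(port = 6311)
  on.exit(RS.close(con))
  RS.eval(con, quote(normalMean(10000)))
}
system.time(res <- mclapply(1:5, call_one, mc.cores = 5))   # roughly 5s rather than 25s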
My question is about the feasibility of running a SparkR program in Spark without an R dependency.
In other words, can I run the following program in Spark when there is no R interpreter installed on the machine?
#set env var
Sys.setenv(SPARK_HOME="/home/fazlann/Downloads/spark-1.5.0-bin-hadoop2.6")
#Tell R where to find sparkR package
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
#load sparkR into this environment
library(SparkR)
#create the sparkcontext
sc <- sparkR.init(master = "local")
#to work with DataFrames we will need a SQLContext, which can be created from the SparkContext
sqlContext <- sparkRSQL.init(sc)
name <- c("Nimal","Kamal","Ashen","lan","Harin","Vishwa","Malin")
age <- c(23,24,12,25,31,22,43)
child <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE)
localdf <- data.frame(name,age,child)
#convert R dataframe into spark DataFrame
sparkdf <- createDataFrame(sqlContext,localdf);
#since we are passing a spark DataFrame into head function, the method gets executed in spark
head(sparkdf)
No, you can't. You'll need to install R and also the needed packages; otherwise your machine won't know that it needs to interpret R.
Don't try to ship your R interpreter inside the application you are submitting, as the uber application will be excessively heavy to distribute across your cluster.
You'll need a configuration management system that allows you to define the state of your IT infrastructure, then automatically enforces the correct state.
No. SparkR works by having an R process communicate with Spark through a JVM backend. You will still need R installed on your machine, just as you need a JVM installed.