External Scripting and R (Kognitio)

I have created the R script environment (using this command to create it: "create script environment RSCRIPT command '/usr/local/R/bin/Rscript --vanilla --slave'") and tried running an R script, but it fails with the error message below.
ERROR: RS 10 S 332659 R 31A004F LO:Script stderr: external script vfork child: No such file or directory
Is it because of the lines below, which I am using in the script?
mydata <- read.csv(file=file("stdin"), header=TRUE)
if (nrow(mydata) > 0){
I am not sure what it is expecting.
I have one more question to ask:
1) Do we need to install R on our Unix box, or does the Kognitio package already include it?

I suspect the problem here is that you have not installed the R environment on ALL the database nodes in your system. It must be installed on every DB node involved in processing (as explained in chapter 10 of the Kognitio Guide, which you can download from http://www.kognitio.com/forums/viewtopic.php?t=3), or you will see errors like "external script vfork child: No such file or directory".
You would normally use a remote deployment tool (e.g. HP's RDP) to ensure the installation is identical on all DB nodes. Alternatively, you can use the Kognitio wxsync tool to synchronise files across nodes.
Section 10.6 of the Kognitio Guide also explains how to constrain which DB nodes are involved in processing - this is appropriate if your script environment should not run on all nodes for some reason (e.g. it has an expensive per-node/per-core licence). That does not seem relevant for R, though.
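For reference, once R is present on every DB node, a minimal external script for this kind of environment might look like the sketch below. This is only a sketch: the added column is illustrative, and the exact input/output format (header row, separators) depends on how the external script and its table are declared.
# Read the rows streamed to the script on stdin
mydata <- read.csv(file = file("stdin"), header = TRUE)
if (nrow(mydata) > 0) {
  # Illustrative processing step: add a constant column to every input row
  mydata$processed <- 1
  # Write the result rows back to stdout for the database to collect
  write.csv(mydata, file = stdout(), row.names = FALSE)
}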

Related

R CMD check note: unable to verify current time

When running R CMD check I get the following note:
checking for future file timestamps ... NOTE
unable to verify current time
I have seen this discussed here, but I am not sure which files it is checking for timestamps, so I'm not sure which files I should look at. This happens locally on my Windows machine and remotely on different systems (using GitHub Actions).
Take a look at https://svn.r-project.org/R/trunk/src/library/tools/R/check.R
The check command relies on an external web resource:
now <- tryCatch({
foo <- suppressWarnings(readLines("http://worldclockapi.com/api/json/utc/now",
warn = FALSE))
This resource http://worldclockapi.com/ is currently not available.
Hence the following happens (see same package source):
if (is.na(now)) {
any <- TRUE
noteLog(Log, "unable to verify current time")
See also references:
https://community.rstudio.com/t/r-devel-r-cmd-check-failing-because-of-time-unable-to-verify-current-time/25589
So, unfortunately, this requires a fix in the check function by the R development team ... or the web resource coming back online.
To add to qasta's answer, you can silence this check by setting the _R_CHECK_SYSTEM_CLOCK_ environment variable to zero, e.g. Sys.setenv('_R_CHECK_SYSTEM_CLOCK_' = 0)
To silence this in a persistent manner, you can set this environment variable on R startup. One way to do so is through the .Renviron file, in the following manner:
install.packages("usethis") (If not installed already)
usethis::edit_r_environ()
Add _R_CHECK_SYSTEM_CLOCK_=0 to the file
Save, close file, restart R
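After the restart you can confirm the variable was picked up, for example:
# Should print "0" if .Renviron was read at startup
Sys.getenv("_R_CHECK_SYSTEM_CLOCK_")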

mlflow R installation MLFLOW_PYTHON_BIN

I am trying to install mlflow in R and I'm getting this error message:
mlflow::install_mlflow()
Error in mlflow_conda_bin() :
Unable to find conda binary. Is Anaconda installed?
If you are not using conda, you can set the environment variable MLFLOW_PYTHON_BIN to the path of your python executable.
I have tried the following
export MLFLOW_PYTHON_BIN="/usr/bin/python"
source ~/.bashrc
echo $MLFLOW_PYTHON_BIN -> this prints /usr/bin/python.
or in R,
Sys.setenv(MLFLOW_PYTHON_BIN = "/usr/bin/python")
Sys.getenv() -> shows MLFLOW_PYTHON_BIN is set to /usr/bin/python.
However, it still does not work.
I do not want to use a conda environment.
How do I get past this error?
The install_mlflow() command only works with conda right now, sorry about the confusing message. You can either:
install conda - this is the recommended way of installing and using mlflow
or
install the mlflow python package yourself via pip
To install mlflow yourself, pip install the correct Python version of mlflow (matching the R package version) and set the MLFLOW_PYTHON_BIN environment variable as well as the MLFLOW_BIN environment variable, e.g.:
library(mlflow)
system(paste("pip install -U mlflow==", mlflow:::mlflow_version(), sep=""))
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Just ran across this, and the accepted answer by @Tomas was very helpful. I added a comment above but, for some additional context, I wanted to write a more thorough response in case any other Enterprise Databricks R users run across this post while trying to use the MLflow package for R on Databricks.
The Databricks MLflow quickstart guide will tell you that you need to run the following:
library(mlflow)
install_mlflow()
However, for Enterprise Databricks users, the install_mlflow() function will fail if your cluster doesn't have outside connectivity privileges (as most probably don't) and can't connect to the Anaconda repo to download the necessary packages. You'll likely get an error like this:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/linux-64/current_repodata.js
The good news is that MLflow should already be installed on your Databricks runtime. So you can reference that install instead, and then, as @Tomas mentioned, use it to set your R environment variables for MLFLOW_BIN and MLFLOW_PYTHON_BIN. From there, the R MLflow API works as specified (in my experience, but ymmv).
The only catch with the above solution is that when you use the system() function in R, you need to set intern=TRUE in order to capture the output of the command. The default behavior of system() is intern=FALSE. Thus, if you do not explicitly set intern=TRUE, the exit code 0 will be returned from your system() call (or perhaps another exit code upon an error) and Sys.setenv() will set the environment variable to 0!
### intern=TRUE missing ###
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Example output (you can see that the environment variables did not get set correctly):
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN 0
MLFLOW_CONDA_HOME /databricks/conda
MLFLOW_PYTHON_BIN 0
MLFLOW_PYTHON_EXECUTABLE
/databricks/python/bin/python
MLFLOW_TRACKING_URI databricks
However, when intern=TRUE, you'll get the correct environment variables:
### intern=TRUE set ###
Sys.setenv(MLFLOW_BIN=system("which mlflow", intern=TRUE))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python", intern=TRUE))
Example output:
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN /databricks/python3/bin/mlflow
MLFLOW_CONDA_HOME /databricks/conda
MLFLOW_PYTHON_BIN /databricks/python3/bin/python
MLFLOW_PYTHON_EXECUTABLE
/databricks/python/bin/python
MLFLOW_TRACKING_URI databricks
Note: This was using Databricks runtime 9.1 LTS ML. This may or may not work on other Databricks runtime configurations.
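As a side note (not from the answers above), base R's Sys.which() resolves executable paths without shelling out, which sidesteps the intern= pitfall entirely:
# Sys.which() returns the full path of a program on the PATH (or "" if not found)
Sys.setenv(MLFLOW_BIN = Sys.which("mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN = Sys.which("python"))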

Load h5 keras model file in R

I'm building an R package for binary classification and I'm using OpenCPU to host it. Currently I've saved the h5 file as an .RData file (serialized), which is then loaded into the environment using the .onLoad() function in R. This enables the R script to use the environment variable to load the keras model using keras::unserialize_model().
I've tried directly using keras::load_model_hdf5() in the code, but after building and deploying on OpenCPU, when I try to hit the prediction API I get this error:
ioerror: unable to open file (unable to open file: name = '/home/modelfile_26feb.h5', errno = 13, error message = 'permission denied', flags = 0, o_flags = 0)
I have changed the permissions on the file (777) and even the groups, but I am still getting the error.
I even tried putting the file in the inst/extdata folder so that it gets included in the package, but I still get the same error.
Can anyone help on this, or suggest some alternative to load the h5 model directly?
Which OS does OpenCPU run on? Why does it try to write in /home/? This is very unusual. The best solution is to adapt your code to write to getwd() or tempdir(). Even better is to store the data in a local database or a redis server and let R read it from there, so you don't need disk access at all.
If you run on Ubuntu Server, reading from /home/ is not permitted by default. If you want to allow this, you need to add AppArmor rules; see section 3.5 of the server manual.
Some relevant topics from the opencpu mailing list:
writing in the home dir: https://groups.google.com/d/msg/opencpu/5vRvgSKY-qE/4xMzZCGJBAAJ
keras in OpenCPU: https://groups.google.com/d/msg/opencpu/HhRzFVVFdaA/n5Nu1sxyFgAJ
writing to the tmp folder: https://groups.google.com/d/msg/opencpu/Y1tYhaQUzwU/ubSEd_CDCgAJ
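If the model has to ship inside the package anyway (as in the inst/extdata attempt above), one alternative sketch, not from the answer above, is to resolve the installed copy with system.file() instead of hard-coding a path under /home/; the package name "mypackage" is a placeholder, and this assumes the OpenCPU worker can read its own package library:
# Locate the h5 file inside the installed package and load it
model_path <- system.file("extdata", "modelfile_26feb.h5", package = "mypackage")
model <- keras::load_model_hdf5(model_path)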

How to run SparkR script using spark-submit or sparkR on an EMR cluster?

I have written some SparkR code and am wondering if I can submit it using spark-submit or sparkR on an EMR cluster.
I have tried several ways, for example:
sparkR mySparkRScript.r or sparkR --no-save mySparkRScript.r, etc., but every time I get the error below:
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, :
JVM is not ready after 10 seconds
Sample Code:
#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node
sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
Note: I am able to run my code in sparkr-shell by pasting directly or using source("mySparkRScript.R").
Ref:
Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR (Spark documentation)
R on Spark
Executing existing R scripts from Spark (Rutger de Graaf)
Github
I was able to get this running via Rscript. There are a few things you need to do, and this may be a bit process intensive. If you are willing to give it a go, I would recommend:
Figure out how to do an automated SparkR or sparklyr build, via: https://github.com/UrbanInstitute/spark-social-science
Use the AWS CLI to first create a cluster with the EMR template and bootstrap script you will create by following Step 1. (Make sure to put the EMR template and rstudio_sparkr_emrlyr_blah_blah.sh scripts into an S3 bucket.)
Place your R code into a single file and put this in another S3 bucket... the sample code you have provided would work just fine, but I would recommend actually doing some operation, say reading in data from S3, adding a value to it, then writing it back out (just to confirm it works before getting into the 'heavy' code you might have sitting around); a sketch of such a file appears after the CLI example below.
Create another .sh file that copies the R file from the S3 bucket you have to the cluster, and then execute it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file might look like this:
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
In the AWS CLI, at the time of cluster creation, insert a --step into your cluster creation call. Use the custom JAR runner provided by Amazon to run the shell script that copies and executes the R code.
Make sure to stop the Spark session at the end of your R code.
An example of the AWS CLI command might look like this (I'm using the us-east-1 region on Amazon in my example, and throwing a 100GB disk on each worker in the cluster... just put your region in wherever you see 'us-east-1' and pick whatever size disk you want instead):
aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
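For step 3, a sketch of what that single R file might contain (the bucket paths and the added column are placeholders, and the master/deploy settings are left to the cluster's spark-defaults):
# Point R at the Spark installation on EMR
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "rscript-smoke-test")
# Read a small CSV from S3, add a constant column, write it back out
df <- read.df("s3://my-bucket/input/sample.csv", source = "csv", header = "true", inferSchema = "true")
df <- withColumn(df, "processed_flag", lit(1))
write.df(df, path = "s3://my-bucket/output/sample_processed", source = "parquet", mode = "overwrite")
# Stop the session so the EMR step (and an --auto-terminate cluster) can finish cleanly
sparkR.session.stop()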
Good luck! Cheers, Nate

Rserve setup for connecting to Tableau

This is the first time I have tried to connect R and Tableau.
I have downloaded and installed Rserve successfully, but every time I try to start Rserve I get this warning:
Starting Rserve...
"C:\Users\SIMON~1.HAR\DOCUME~1\R\WIN-LI~1\3.1\Rserve\libs\x64\Rserve.exe"
Warning message:
running command '"C:\Users\SIMON~1.HAR\DOCUME~1\R\WIN-LI~1\3.1\Rserve\libs\x64\Rserve.exe" ' had status 127
I have been searching for days and couldn't find any fix.
The Rserve() function is trying to start an application (Rserve.exe) and failing. There are a couple of things you can do.
Go to the "C:\Users\SIMON~1.HAR\DOCUME~1\R\WIN-LI~1\3.1\Rserve\libs\x64\" directory and try launching the exe yourself, then troubleshoot from there.
Use the run.Rserve() function instead of Rserve(). This will use your current R session to start the Rserve server, which means the exe doesn't need to be run. This worked well for me because I am working in an environment where I don't have enough privileges to run the exe. It does mean that your R session can't do anything else while the server is running, but you can always load up two sessions at the same time.
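For the second option, the minimal sequence is just the following (Rserve's default port, 6311, is the one Tableau expects):
library(Rserve)
# Start Rserve inside the current R session; this session then serves requests until interrupted
run.Rserve()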
