Why does R code work locally, but not in Docker run?

I have a Docker container that is set up to run an R script weekly via an Airflow DAG. The DAG has three tasks: one upstream of the Docker code, which pulls data from several databases, computes various features, and writes the result to S3. The R script then reads that data from an S3 bucket, formats the data frame, runs a model to score the records, and writes the scores back to S3. Finally, downstream code formats the output so it can be loaded into Salesforce. The script worked while I was writing, building, and testing it in December. Recently the run has failed several times with this error:
Error in as.character(x) :
cannot coerce type 'closure' to vector of type 'character'
Calls: %>% ... mutate_impl -> ymd -> .parse_xxx -> unlist -> lapply -> FUN
Execution halted
OK, so that seems to mean that a date being read in as a character is failing to be parsed as a date. Since ymd is in the call chain, I believe the culprit is the lubridate function used in the R script below; a minimal reproduction of the error follows.
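For context, here is a minimal way this exact error can arise (a hedged sketch, not necessarily what happened in my pipeline): if the data frame lacks the column being parsed, the name falls through to the enclosing environment, and lubridate happens to export a function called period, so ymd() receives a closure:

require("dplyr")
require("lubridate")

df <- data.frame(x = "2019-01-01", stringsAsFactors = FALSE)

# df has no "period" column, so `period` resolves to lubridate's
# period() function (a closure), which ymd() cannot parse:
df %>% mutate(period = ymd(period))
# Error in as.character(x) :
#   cannot coerce type 'closure' to vector of type 'character'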
The Dockerfile (code below) is based on an R image that includes the tidyverse, because my code uses dplyr and lubridate. I could likely get by without lubridate and use a base function to parse the date, but more on that below.
Dockerfile:
FROM rocker/tidyverse
RUN mkdir -p /model
RUN apt-get update -qq && apt-get install -y \
libssl-dev \
libcurl4-gnutls-dev
RUN R -e "install.packages('caret')"
RUN R -e "install.packages('randomForest')"
RUN R -e "install.packages('lubridate')"
RUN R -e "install.packages('aws.s3')"
EXPOSE 80
EXPOSE 8787
COPY / /
ENTRYPOINT ["Rscript", "account_health_scoring.R"]
R script: I have to exclude the first few lines because of identifying info and credentials, but the code first just reads my S3 credentials from a file. Then this code block runs and fails. There is a good deal of code downstream, but it all works in the container:
require("dplyr")
require("caret")
require("aws.s3")
require("randomForest")
require("lubridate")
#set credentials
Sys.setenv("AWS_ACCESS_KEY_ID" = "key",
"AWS_SECRET_ACCESS_KEY" = "key")
#read in model file
s3load("rf_gridsearch.RData", bucket = "account-model")
#read in data
data<-read.csv(text = rawToChar(get_object((paste0("account_health_data_",
gsub("-", "_", as.character(Sys.Date()),
fixed=TRUE),".csv")),
bucket = "account-health-model-input")),
stringsAsFactors = FALSE)%>%
mutate(period=ymd(period))%>%
mutate_if(is.integer,as.numeric)
The reason for the two mutate lines is that, despite the source data being formatted as a POSIX timestamp, read.csv coerces the date to a string AND coerces floats to integers. Perhaps I am missing something in my read.csv call, or there is a better function for reading the data properly, but this is what I have always used.
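As an aside on that last point, a possible alternative (a sketch, assuming the readr package that ships in the tidyverse image): read_csv() guesses column types on read, so an ISO date column typically comes back as a Date and numeric columns as doubles, which would make both mutate lines unnecessary:

library(readr)

# read_csv() accepts a literal string (containing newlines) as input
# and guesses column types, unlike read.csv(stringsAsFactors = FALSE)
data <- read_csv(rawToChar(get_object(
  paste0("account_health_data_",
         gsub("-", "_", as.character(Sys.Date()), fixed = TRUE),
         ".csv"),
  bucket = "account-health-model-input")))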
Questions:
What is the error message referring to, and am I correct to think the ymd() function is the culprit?
If so, how can I rewrite my code, potentially using base functions, to accomplish the same goal and avoid relying on a package?
Could it be package dependencies? Reviewing the logs, that doesn't seem to be the case: lubridate imports and uses several base functions, and the package has not been updated since I wrote and tested this code.

Well, the answer seems to be simple, although I do not understand it. I changed
require(lubridate)
to
library(lubridate)
and it builds. I found the post What is the difference between require() and library()? and decided to just try the change; I rebuilt the container, and it worked. I'm still trying to understand the "why".
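The mechanical difference, for anyone else landing here (a sketch of the general behavior, not a full diagnosis of this particular build): require() only warns and returns FALSE when a package fails to load, so the script keeps running and fails later in some downstream call, while library() throws an error at the failed load itself, stopping an Rscript run at the true point of failure:

# "notapackage" is a hypothetical name assumed not to be installed
ok <- require("notapackage")  # warning only; returns FALSE
print(ok)                     # FALSE -- execution continues

library("notapackage")        # error: "there is no package called ...";
                              # under Rscript, execution halts here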

Related

responding yes to terminal prompt via system2() in R

tl;dr: How can I invoke the system command y | conda create --name gee_interface from an R console, e.g. via system2()? I'm comfortable enough with system2('conda', c('create', '--name', 'gee_interface')), but I don't know how to handle piping in the 'y' via system2().
Details
I am trying to use an R console to run the bash command conda create --name gee_interface (macOS Mojave with Anaconda installed).
In the terminal, that command executes just fine but prompts me with Proceed ([y]/n)? (I answer 'y' and everything works smoothly).
In R, I run
Sys.setenv(PATH = paste(c("/Applications/anaconda3/bin", Sys.getenv("PATH")), collapse = .Platform$path.sep)) # ensures that system2() finds conda
system2('conda', c('create', '--name', 'gee_interface')) # This is the key line for the purposes of this question
When running the second line [i.e. system2('conda', c('create', '--name', 'gee_interface'))], the process never finishes but quickly falls to zero CPU usage. Presumably the system is waiting for my response to the prompt, but I don't know how to provide it from an R script. Note also that in my particular case the number of times I need to respond 'y' is variable, depending on whether an environment named gee_interface already exists.
The fix to your first problem is to tell conda not to ask for confirmation using -y:
system2('conda', c('create', '--name', 'gee_interface', '-y'))
As to the second part (the variable number of times your input is required), I'm guessing that's because you want to overwrite the environment if it already exists? In that case, you could check for its existence first with conda info --envs and run conda remove --name gee_interface --all if necessary before creating it; see the sketch after the links below.
See:
https://docs.conda.io/projects/conda/en/latest/commands/create.html
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#removing-an-environment
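A sketch of that check-and-remove flow from R, assuming conda is on the PATH as set in the question:

# List environments and check whether gee_interface already exists
envs <- system2("conda", c("info", "--envs"), stdout = TRUE)
if (any(grepl("gee_interface", envs, fixed = TRUE))) {
  # Remove the existing environment, suppressing the confirmation prompt
  system2("conda", c("remove", "--name", "gee_interface", "--all", "-y"))
}
# Recreate it, again with -y so no prompt is issued
system2("conda", c("create", "--name", "gee_interface", "-y"))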
You could also try your system2() call with the argument input = "y", but that doesn't fix your second problem of needing to confirm multiple times.
See: Invoke a system command and pipe a variable as an argument
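For reference, that variant looks like this; input feeds the string to the command's standard input, so it can answer a single prompt only:

system2("conda", c("create", "--name", "gee_interface"), input = "y")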

mlflow R installation MLFLOW_PYTHON_BIN

I am trying to install mlflow in R and I'm getting this error message:
mlflow::install_mlflow()
Error in mlflow_conda_bin() :
Unable to find conda binary. Is Anaconda installed?
If you are not using conda, you can set the environment variable MLFLOW_PYTHON_BIN to the path of your python executable.
I have tried the following:
export MLFLOW_PYTHON_BIN="/usr/bin/python"
source ~/.bashrc
echo $MLFLOW_PYTHON_BIN   # prints /usr/bin/python
or, in R:
Sys.setenv(MLFLOW_PYTHON_BIN = "/usr/bin/python")
Sys.getenv()   # shows MLFLOW_PYTHON_BIN set to /usr/bin/python
However, it still does not work. I do not want to use a conda environment. How do I get past this error?
The install_mlflow command only works with conda right now; sorry about the confusing message. You can either:
install conda - this is the recommended way of installing and using mlflow
or
install the mlflow python package yourself via pip
To install mlflow yourself, pip install the correct python version of mlflow (matching the R package) and set the MLFLOW_PYTHON_BIN environment variable as well as the MLFLOW_BIN env variable, e.g.:
library(mlflow)
system(paste("pip install -U mlflow==", mlflow:::mlflow_version(), sep=""))
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Just ran across this, and the accepted answer by @Tomas was very helpful. I added a comment above but, for some additional context, I wanted to write a more thorough response in case any other Enterprise Databricks R users run across this post while trying to use the MLflow package for R on Databricks.
The Databricks MLflow quickstart guide will tell you that you need to run the following:
library(mlflow)
install_mlflow()
However, for Enterprise Databricks users, the install_mlflow() function will fail if your cluster doesn't have outside connectivity privileges (as most probably don't) and can't connect to the Anaconda repo to download the necessary packages. You'll likely get an error like this:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/linux-64/current_repodata.js
The good news is that MLflow should already be installed on your Databricks runtime, so you can reference that install instead and, as @Tomas mentioned, use it to set your R environment variables MLFLOW_BIN and MLFLOW_PYTHON_BIN. From there, the R MLflow API works as specified (in my experience, but YMMV).
The only catch with the above solution is that when you use the system() function in R, you need to set intern = TRUE to capture the output of the command. The default is intern = FALSE, so if you do not set it explicitly, system() returns the exit code instead (0 on success, perhaps another code on error), and Sys.setenv() sets the environment variable to 0!
### intern=TRUE missing ###
Sys.setenv(MLFLOW_BIN=system("which mlflow"))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python"))
Example output (you can see that the environment variables did not get set correctly):
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN 0
MLFLOW_CONDA_HOME /databricks/conda
MLFLOW_PYTHON_BIN 0
MLFLOW_PYTHON_EXECUTABLE
/databricks/python/bin/python
MLFLOW_TRACKING_URI databricks
However, when intern=TRUE, you'll get the correct environment variables:
### intern=TRUE set ###
Sys.setenv(MLFLOW_BIN=system("which mlflow", intern=TRUE))
Sys.setenv(MLFLOW_PYTHON_BIN=system("which python", intern=TRUE))
Example output:
s <- Sys.getenv()
s[grep("MLFLOW", names(s))]
MLFLOW_BIN /databricks/python3/bin/mlflow
MLFLOW_CONDA_HOME /databricks/conda
MLFLOW_PYTHON_BIN /databricks/python3/bin/python
MLFLOW_PYTHON_EXECUTABLE
/databricks/python/bin/python
MLFLOW_TRACKING_URI databricks
Note: This was using Databricks runtime 9.1 LTS ML. This may or may not work on other Databricks runtime configurations.

R Script running successfully on local machine, not on EC2 instance

I have an R script (an R plumber API) that I have deployed to an EC2 instance and am managing with pm2, and I am running into a frustrating issue. I have pinpointed the exact location of the error and am hoping to understand it a bit better.
When I run the script on my local machine (RStudio on my Mac), it works okay. When I run the script using Rscript myrfile.R from the EC2 instance command line, it breaks.
I have pinpointed the line of code that breaks the script on the EC2 instance, along with its error:
my_df <- my_df %>%
  dplyr::mutate(AwayScore = ifelse(dplyr::row_number() == 1, 0, AwayScore),
                HomeScore = ifelse(dplyr::row_number() == 1, 0, HomeScore))
# with the following error
<Rcpp::eval_error in mutate_impl(.data, dots): Evaluation error: argument "x" is missing, with no default.>
I am 100% sure that dplyr is installed on the EC2 instance, since my script uses it throughout. I am also 100% sure that the my_df data frame has the columns AwayScore and HomeScore, and that my_df doesn't have any other issues.
I am left to assume that this error is specifically due to the dplyr::row_number() function, which the EC2 instance does not seem to be able to handle, although I am not positive about this.
Any thoughts / help / things I should try / etc. would be greatly appreciated on this, thanks!!
While I appreciate that you have avoided the problem by not requiring the library, at some point you may find you want to run code in a similar way where loading a library is necessary.
I ran into a similar problem using Rscript: it could not find the libraries I had installed. It is possible to use R.exe instead of Rscript.exe, but this causes other headaches. I found that the environment when using Rscript doesn't contain the R_LIBS_USER path.
If you append the following code to the top of your R script, it should work:
p <- "/path/to/your/local/R/packages"   # placeholder: your local library folder
.libPaths(c(p, .libPaths()))
substituting the folder path where your libraries are found on the machine. This is the path that Sys.getenv("R_LIBS_USER") returns when running R in the GUI; see the diagnostic sketch below.
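A quick way to see the difference is a two-line diagnostic, run once via Rscript and once inside the GUI (the file name here is just an example):

# diagnose_libs.R -- compare the output of `Rscript diagnose_libs.R`
# with the same two lines run in the R GUI or RStudio
cat("R_LIBS_USER:", Sys.getenv("R_LIBS_USER"), "\n")
print(.libPaths())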
It was easy enough for me to simply change my code to the following:
if (is.na(my_df$AwayScore[1])) { my_df$AwayScore[1] <- 0 }
if (is.na(my_df$HomeScore[1])) { my_df$HomeScore[1] <- 0 }
... so I will likely not waste too much more time trying to debug this.

Executing a SAS program in R using system() Command

My company recently converted to SAS and did not buy the SAS SHARE license, so I cannot use ODBC to connect to the server. I am not a SAS user, but I am writing a program that needs to query data from the server, and I want my R script to call a .sas program to retrieve the data. I think this is possible using
df <- system("sas -SYSIN path/to/sas/script.sas")
but I can't seem to make it work. I have spent a few hours on Google and decided to ask here.
error message:
running command 'sas -SYSIN C:/Desktop/test.sas' had status 127
Thanks!
Assuming your SAS program generates a SAS dataset, you'll need to do two things:
Through shell or system, make SAS run the program, but first cd into the directory containing the SAS executable, in case that directory isn't in your PATH environment variable.
setwd("c:\\Program Files\\SASHome 9.4\\SASFoundation\\9.4\\")
return.code <- shell("sas.exe -SYSIN c:\\temp\\myprogram.sas")
Note that what this returns is NOT the data itself, but the code issued by the OS telling you whether the task succeeded. A code of 0 means the task succeeded.
In the SAS program, all I did was create a copy of sashelp.baseball in the c:\temp directory.
Import the generated dataset into R using one of the packages written for that; haven is the most recent and, IMO, the most reliable one.
# Install haven from CRAN:
install.packages("haven")
# Load it and import the dataset:
library(haven)
myData <- read_sas("c:\\temp\\baseball.sas7bdat")
And there you should have it!
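Putting both steps together in one script (a sketch reusing the example paths from this answer; adjust them to your SAS install and output location):

library(haven)

# Step 1: run the SAS program; shell() returns the OS exit code
setwd("c:\\Program Files\\SASHome 9.4\\SASFoundation\\9.4\\")
return.code <- shell("sas.exe -SYSIN c:\\temp\\myprogram.sas")
if (return.code != 0) stop("SAS run failed with code ", return.code)

# Step 2: import the dataset the SAS program wrote out
myData <- read_sas("c:\\temp\\baseball.sas7bdat")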

Using python to execute .R script

After seven hours of googling and rereading through somewhat similar questions, and then lots of trial and error, I'm now comfortable asking for some guidance.
To simplify my actual task, I created a very basic R script (named test_script):
x <- c(1,2,3,4,5)
avg <- mean(x)
write.csv(avg, file = "output.csv")
This works as expected.
I'm new to Python and I'm just trying to figure out how to execute the R script so that the same .csv file is created.
Notable results come from:
subprocess.call(["C:/Program Files/R/R-2.15.2/bin/R", 'C:/Users/matt/Desktop/test_script.R'])
This opens a cmd window with the typical R start-up verbiage, except there is a message which reads, "ARGUMENT 'C:/Users/matt/Desktop/test_script.R' __ ignored __"
And:
subprocess.call(['C:/Program Files/R/R-2.15.2/bin/Rscript', 'C:/Users/matt/Desktop/test_script.r'])
This flashes a cmd window and returns a 0, but no .csv file is created.
Otherwise, I've tried every suggestion I could identify on this site or any other. Any insight will be greatly appreciated. Thanks in advance for your time and efforts.
Running R --help at the command prompt prints:
Usage: R [options] [< infile] [> outfile]
or: R CMD command [arguments]
Start R, a system for statistical computation and graphics, with the
specified options, or invoke an R tool via the 'R CMD' interface.
Options:
-h, --help Print short help message and exit
--version Print version info and exit
...
-f FILE, --file=FILE Take input from 'FILE'
-e EXPR Execute 'EXPR' and exit
FILE may contain spaces but not shell metacharacters.
Commands:
BATCH Run R in batch mode
COMPILE Compile files for use with R
...
Try
subprocess.call(["C:/Program Files/R/R-2.15.2/bin/R", '-f', 'C:/Users/matt/Desktop/test_script.R'])
There are also some other command-line arguments you can pass to R that may be helpful. Run R --help to see the full list.
It might be too late, but I hope it helps others: just add --vanilla to the argument list.
subprocess.call(['C:/Program Files/R/R-2.15.2/bin/Rscript', '--vanilla', 'C:/Users/matt/Desktop/test_script.r'])
