Sparklyr fails to download Spark from Apache in Dockerfile

I am trying to create a Dockerfile that builds an image from rocker/tidyverse and includes Spark installed through sparklyr. Previously, in this post: Unable to install spark with sparklyr in Dockerfile, I was trying to figure out why Spark wouldn't download from my Dockerfile. After playing with it for the past 5 days I think I have found the reason but have no idea how to fix it.
Here is my Dockerfile:
# start with the most up-to-date tidyverse image as the base image
FROM rocker/tidyverse:latest
# install openjdk 8 (Java)
RUN apt-get update \
&& apt-get install -y openjdk-8-jdk
# Install devtools
RUN Rscript -e 'install.packages("devtools")'
# Install sparklyr
RUN Rscript -e 'devtools::install_version("sparklyr", version = "1.5.2", dependencies = TRUE)'
# Install spark
RUN Rscript -e 'sparklyr::spark_install(version = "3.0.0", hadoop_version = "3.2")'
RUN mv /root/spark /opt/ && \
chown -R rstudio:rstudio /opt/spark/ && \
ln -s /opt/spark/ /home/rstudio/
RUN apt-get install -y unixodbc unixodbc-dev --install-suggests
RUN apt-get install -y odbc-postgresql
RUN install2.r --error --deps TRUE DBI
RUN install2.r --error --deps TRUE RPostgres
RUN install2.r --error --deps TRUE dbplyr
It has no problem downloading everything up until this line:
RUN Rscript -e 'sparklyr::spark_install(version = "3.0.0", hadoop_version = "3.2")'
Which then gives me the error:
Step 5/11 : RUN Rscript -e 'sparklyr::spark_install(version = "3.0.0", hadoop_version = "3.2")'
---> Running in 739775db8f12
Error in download.file(installInfo$packageRemotePath, destfile = installInfo$packageLocalPath, :
download from 'https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz' failed
Calls: <Anonymous>
Execution halted
ERROR: Service 'rocker_sparklyr' failed to build : The command '/bin/sh -c Rscript -e 'sparklyr::spark_install(version = "3.0.0", hadoop_version = "3.2")'' returned a non-zero code: 1
After doing some research I thought that it was a timeout error, so I added this line before the install step:
RUN Rscript -e 'options(timeout=600)'
This did not increase the time it took to error out. I installed everything on my personal machine through RStudio with no problems, so I think the problem is specific to Docker: it isn't able to download from https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
I have found very little documentation on this problem and am relying heavily on this post to figure it out. Thank you in advance to anyone with this knowledge for reaching out.
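One thing worth checking about the timeout attempt above: options() only applies to the R session that sets it, and each RUN Rscript call starts a fresh session, so a standalone options(timeout=600) line has no effect on the later spark_install() step. A minimal sketch, assuming the default 60-second download.file() timeout is what cuts the download off (the thread does not confirm this), sets the option in the same call:

```dockerfile
# Raise R's download timeout and install Spark in one R session.
# Assumption: the failure is a download.file() timeout; 600 seconds is an arbitrary value.
RUN Rscript -e 'options(timeout = 600); sparklyr::spark_install(version = "3.0.0", hadoop_version = "3.2")'
```

If the build still fails after roughly the same time, the timeout is probably not the cause.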

Download the version yourself and then install it with this function:
sparklyr::spark_install_tar(tarfile ="~/spark/spark-3.0.1-bin-hadoop3.2.tgz")
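For a Docker build, that suggestion could look roughly like the sketch below. It assumes the archive was downloaded on the host beforehand and saved next to the Dockerfile; the file name and the /tmp location are illustrative choices, not something from the answer.

```dockerfile
# Assumption: spark-3.0.0-bin-hadoop3.2.tgz was downloaded on the host
# and sits in the build context next to this Dockerfile.
COPY spark-3.0.0-bin-hadoop3.2.tgz /tmp/
# Install from the local archive instead of downloading during the build.
RUN Rscript -e 'sparklyr::spark_install_tar(tarfile = "/tmp/spark-3.0.0-bin-hadoop3.2.tgz")'
```

These two lines would stand in for the spark_install() step in the Dockerfile from the question.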

Related

Installing R in a docker container

I'm trying to install Miniconda in an ubuntu:20.04-based container and then, using conda, R 4.0.5.
The Dockerfile I'm using is this:
FROM ubuntu:20.04
USER root
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get -y install libcurl4-openssl-dev
RUN apt-get install -y wget
RUN mkdir -p ~/miniconda3
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O ~/miniconda3/miniconda.sh
RUN bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
RUN export PATH=~/miniconda3/bin:$PATH
RUN rm -rf ~/miniconda3/miniconda.sh
RUN ~/miniconda3/bin/conda init bash
RUN ~/miniconda3/bin/conda init zsh
RUN ~/miniconda3/bin/conda config --add channels conda-forge
RUN ~/miniconda3/bin/activate
RUN ~/miniconda3/bin/conda install -y -c conda-forge r-base
RUN R -e "install.packages('BiocManager')"
RUN R -e "BiocManager::install('DESeq2')"
From lines 8 to 16 I download miniconda and run it in ~/miniconda3
In line 17:
RUN R -e "install.packages('BiocManager')"
I try to use R and install the BiocManager package from the command line, but I receive this error:
> [16/17] RUN R -e "install.packages('BiocManager')":
#19 2.767 /bin/sh: 1: R: not found
------
executor failed running [/bin/sh -c R -e "install.packages('BiocManager')"]: exit code: 127
I've also tried starting from the official Rocker distribution, but I would prefer the approach shown in this post, since I would end up with an image that has both miniconda and R.
Can someone help me?
Thanks a lot!
Each RUN command runs in a separate shell, so your export command sets the path, but then the shell exits and the path is reset for the next RUN command.
You also have to use the absolute path, because tilde expansion doesn't work in an ENV instruction.
Instead of
RUN export PATH=~/miniconda3/bin:$PATH
try
ENV PATH=/root/miniconda3/bin:$PATH
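A small throwaway Dockerfile illustrates the difference. DEMO is a made-up variable name used only for this sketch:

```dockerfile
FROM ubuntu:20.04
# The variable exists only in the shell started for this RUN instruction.
RUN export DEMO=/root/miniconda3/bin && echo "same RUN: $DEMO"
# A new shell starts here, so DEMO is not set and the fallback text is printed.
RUN echo "next RUN: ${DEMO:-unset}"
# ENV is stored in the image and is visible to every later instruction.
ENV DEMO=/root/miniconda3/bin
RUN echo "after ENV: $DEMO"
```

With BuildKit, building with --progress=plain shows the echo output of each step.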

Docker does not find R despite pre-built layers

I create a docker image to run R scripts on a VM server with no access to the internet.
For the first layer I load R and all libraries
Dockerfile1
FROM r-base
## Needed to access R
ENV R_HOME /usr/lib/R
## install required libraries
RUN apt-get update
RUN apt-get -y install libgdal-dev
## install R-packages
RUN R -e "install.packages('dplyr',dependencies=TRUE, repos='http://cran.rstudio.com/')"
...
and create it
docker build -t mycreate_od/libraries -f Dockerfile1 .
Then I use this library layer to load the R script
Dockerfile2
FROM mycreate_od/libraries
## Create directory
RUN mkdir -p /home/analysis/
## Copy files
COPY my_script_dir /home/analysis/
## Run the script
CMD R -e "source('/home/analysis/my_script_dir/main.R')"
Create the analysis layer
docker build -t mycreate_od/analysis -f vault/Dockerfile2 .
On my master VM, this runs and succeeds, but on the fresh VM I get
docker run mycreate_od/analysis
ERROR: R_HOME ('/usr/lib/R') not found
From a previous bug search I have set the ENV variable in the Dockerfile (see Dockerfile1),
but it looks like Docker installs R in some other place.
Thanks to Dirk's advice I managed to get it done using r-apt (see Dockerfile below).
The image then builds and can be run without the R_HOME error.
BTW, the build is much faster and the resulting image is significantly smaller.
FROM rocker/r-apt:bionic
RUN apt-get update && \
apt-get -y install libgdal-dev && \
apt-get install -y -qq \
r-cran-dplyr \
r-cran-rjson \
r-cran-data.table \
r-cran-crayon \
r-cran-geosphere \
r-cran-lubridate \
r-cran-sp \
r-cran-R.utils
RUN R -e 'install.packages("tools")'
RUN R -e 'install.packages("rgdal", repos="http://R-Forge.R-project.org")'
This is unfortunately a cargo-cult solution, as I am unable to explain why the previous Dockerfile failed.

Unable to install spark with sparklyr in Dockerfile

We are trying to build our own docker image to use R and the tidyverse with Spark. However, we are getting an error in the build when trying to install Spark.
Here is our Dockerfile:
# start with the most up-to-date tidyverse image as the base image
FROM rocker/tidyverse:latest
# install openjdk 8 (Java)
RUN apt-get update \
&& apt-get install -y openjdk-8-jdk
# Install sparklyr
RUN install2.r --error --deps TRUE sparklyr
# Install spark
RUN Rscript -e 'sparklyr::spark_install("2.4.3")'
RUN mv /root/spark /opt/ && \
chown -R rstudio:rstudio /opt/spark/ && \
ln -s /opt/spark/ /home/rstudio/
RUN install2.r --error --deps TRUE DBI
RUN install2.r --error --deps TRUE RPostgres
RUN install2.r --error --deps TRUE dbplyr
We are using Docker compose up to build and then create the container.
When building, it throws the error:
=> ERROR [4/8] RUN Rscript -e 'sparklyr::spark_install("2.3.0")' 62.0s
------
> [4/8] RUN Rscript -e 'sparklyr::spark_install("2.3.0")':
#7 61.85 Error in download.file(installInfo$packageRemotePath, destfile = installInfo$packageLocalPath, :
#7 61.85 download from 'https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz' failed
#7 61.85 Calls: <Anonymous>
#7 61.85 Execution halted
------
failed to solve: rpc error: code = Unknown desc = executor failed running [/bin/sh -c Rscript -e 'sparklyr::spark_install("2.3.0")']: exit code: 1
We have also tried running it as:
RUN R -e 'sparklyr::spark_install("2.4.3")'
Instead of:
RUN Rscript -e 'sparklyr::spark_install("2.4.3")'
but it still throws an error. We have also tried installing different versions of Spark to see if that would work but still no luck. Does anyone know why I am getting this error and how I can properly install Spark with sparklyr in Docker? Thank you.

Adding DT R packages prevents Docker from running

I use Docker to run a Shiny app, and I install some R packages in the Dockerfile (here is a portion of the Dockerfile; I omitted some lines and marked them with <...>):
FROM r-base:latest
RUN apt-get update && apt-get install -y -t unstable \
sudo \
gdebi-core \
make \
git \
gcc \
<...>
R -e "install.packages(c('shiny', 'rmarkdown'), repos='https://cran.rstudio.com/')" && \
R -e "install.packages(c('ada','bsplus','caret','ddalpha','diptest','doMC','dplyr','e1071','evtree','fastAdaboost','foreach','GGally','ggplot2','gridExtra','iterators','kernlab','lattice','markdown','MASS','mboost','nnet','optparse','partykit','plyr','pROC','PRROC','randomForest','recipes','reshape2','RSNNS','scales','shinyBS','shinyFiles','shinythemes'))"
This works fine. But if I add one more R package (DT), the container still builds fine (and I can see that the package gets installed properly) but when I try to run the container I get:
Loading required package: shiny
Error in dir.exists(lib) : invalid filename argument
Calls: <Anonymous> ... load_libraries -> get_package -> install.packages -> dir.exists
Execution halted
This error is not informative at all and I can't figure out what possibly can be wrong. I will appreciate any ideas! Thank you.

R App in Docker Container: Unable to download PDF report (Error: No such file or directory) Knitr/Rmarkdown

I have been building a Docker container for an R application and have continually run into an error when downloading a PDF report. The PDF report function works fine in R on a local machine, but when containerized, it throws the error below. I have tried forcing the install of specific packages, namely knitr and rmarkdown as other questions have mentioned, but it is still showing the same error. The file in Chrome downloads simply says "Failed - Server Problem". I have tested the download of a CSV file using the app, which works fine, so I believe it's an issue with generating and downloading a markdown PDF report.
I have included the build Dockerfile to assist. Any suggestions would be amazing!
Thanks!
DOCKERFILE:
FROM openanalytics/r-base
MAINTAINER ________
# system libraries of general use
RUN apt-get update && apt-get install -y \
sudo \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev \
libxt-dev \
libssl-dev \
libssh2-1-dev \
libxml2-dev \
libssl1.0.0 \
libpq-dev \
git \
texlive-full \
html-xml-utils \
libv8-3.14-dev
# system library dependency for the app
RUN apt-get update
# install packages for R
RUN R -e "install.packages(c('hms','devtools'), repos='https://cloud.r-project.org/')"
RUN R -e "require(devtools)"
RUN R -e "install.packages(c('car'), repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('readxl', version = '1.0.0', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('DT', version = '0.2', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('shinydashboard', version = '0.6.1', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('knitr', version = '1.18', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('magrittr', version = '1.5', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('ggrepel', version = '0.7.0', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('dplyr', version = '0.7.4', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('Rcpp', version = '0.12.14', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('rhandsontable', version = '0.3.4', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('shinyjs', version = '0.9.1', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('V8', version = '1.5', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('data.table', version = '1.10.4-3', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('packrat', version = '0.4.8-1', repos='https://cloud.r-project.org/')"
RUN R -e "devtools::install_version('zoo', version = '1.8-1', repos='https://cloud.r-project.org/')"
RUN R -e "install.packages('shiny', repos='https://cloud.r-project.org/')"
RUN wget https://github.com/rstudio/rmarkdown/archive/v1.8.tar.gz
RUN R CMD INSTALL v1.8.tar.gz
RUN R -e "install.packages('xml2', repos='https://cloud.r-project.org/')"
RUN R -e "install.packages('rvest', repos='https://cloud.r-project.org/')"
RUN wget https://cran.r-project.org/src/contrib/Archive/kableExtra/kableExtra_0.3.0.tar.gz
RUN R CMD INSTALL kableExtra_0.3.0.tar.gz
# copy the app to the image
RUN mkdir /root/tsk
COPY tsk /root/tsk
COPY Rprofile.site /usr/lib/R/etc/
EXPOSE 3838
CMD ["R", "-e", "shiny::runApp('/root/tsk')"]
ERROR FROM DOCKER:
Listening on http://0.0.0.0:3838
Warning in normalizePath(path, winslash = winslash, mustWork = mustWork) :
path[1]="/tmp/RtmpMu8ezy/TSK.Rmd": No such file or directory
Warning: Error in tools::file_path_as_absolute: file '/tmp/RtmpMu8ezy/TSK.Rmd'
does not exist
[No stack trace available]
Warning in normalizePath(path, winslash = winslash, mustWork = mustWork) :
path[1]="/tmp/RtmpMu8ezy/TSK.Rmd": No such file or directory
Warning: Error in tools::file_path_as_absolute: file '/tmp/RtmpMu8ezy/TSK.Rmd'
does not exist
[No stack trace available]
Warning in normalizePath(path, winslash = winslash, mustWork = mustWork) :
path[1]="/tmp/RtmpMu8ezy/TSK.Rmd": No such file or directory
Warning: Error in tools::file_path_as_absolute: file '/tmp/RtmpMu8ezy/TSK.Rmd'
does not exist
[No stack trace available]
The fix was simply changing the filename case from tsk.Rmd to TSK.Rmd. Testing had always been done on OSX in an IDE, which didn't throw any errors, but when building a container with Ubuntu, which is case sensitive, the app was unable to find the markdown file.
When building with different operating systems, be sure to check whether the system is case sensitive! An easy mistake!
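One way to catch this at build time rather than at download time is a check right after the COPY step. This is only a sketch: it assumes the report template lives directly in /root/tsk, which the post does not state.

```dockerfile
COPY tsk /root/tsk
# Fail the build early if the template is missing under the exact,
# case-sensitive name the app asks for.
# Assumption: TSK.Rmd sits at the top level of /root/tsk.
RUN test -f /root/tsk/TSK.Rmd || { echo "TSK.Rmd not found in /root/tsk (check letter case)"; ls /root/tsk; exit 1; }
```

The ls output in the failure message makes a case mismatch such as tsk.Rmd easy to spot.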

Resources