How to speed up R package installation in Docker

Say you have the following list of packages you would like to install in a Docker image:
("jsonlite","dplyr","stringr","tidyr","lubridate",
"knitr","purrr","tm","cba","caret",
"plumber","httr")
It actually takes around 1 hour to install these!
Any suggestions on how to speed this up? (Or on how to prevent the re-installation at every new image build?)
Side note
I do not install these packages from the dockerfile like this:
RUN Rscript -e "install.packages('stringr')"
...
Instead I create an R script Requirements.R which installs these packages and simply do:
RUN Rscript Requirements.R
Is this less efficient than installing the packages directly from the Dockerfile?

Use binary packages where you can, as we often do in the Rocker Project, which provides multiple Dockerfiles for R, including the official r-base one.
If you start from Ubuntu, you get Michael Rutter's PPAs with over 3000 packages; if you start from Debian you get fewer from the distro but still many essential ones. (There are some efforts to bring more binary packages to Debian but nothing is up right now.)
Lastly, building an image from a Dockerfile is of course compile time too. You spend the time once (per container creation) and reuse the result potentially many times afterwards. Also, by using the Docker Hub you can avoid spending your local CPU cycles.
Edit in Sep 2020: The (updated) Ubuntu PPA now has over 4600 packages for the three most recent LTS releases. Still highly, highly recommended.
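As a rough sketch of the binary route for the package list in the question: Debian and Ubuntu name their R binaries r-cran-<lowercase package name>, so the list can be translated mechanically. The helper name to_apt_name is made up for illustration, and whether every package (cba, for instance) is actually available as a binary in your configured apt sources is an assumption you would need to check.

```shell
#!/bin/sh
# Hypothetical helper: map a CRAN package name to the Debian/Ubuntu
# binary naming scheme r-cran-<lowercase name>.
to_apt_name() {
    printf 'r-cran-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

# Build the apt package list from the CRAN names in the question.
# Availability of each binary package is assumed, not verified.
apt_names=""
for pkg in jsonlite dplyr stringr tidyr lubridate knitr purrr tm cba caret plumber httr; do
    apt_names="$apt_names $(to_apt_name "$pkg")"
done

# Print the command instead of running it, so the list can be reviewed first.
echo "apt-get install -y --no-install-recommends$apt_names"
```

Anything apt cannot find under that name would still have to be installed from source with install.packages().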

I found an article that described how to install R packages from precompiled binaries. It reduced the build time on our Jenkins server from 45 minutes down to 3 minutes.
Here is my Dockerfile:
FROM rocker/r-apt:bionic
WORKDIR /app
RUN apt-get update && \
apt-get install -y libxml2-dev
# Install binaries (see https://datawookie.netlify.com/blog/2019/01/docker-images-for-r-r-base-versus-r-apt/)
COPY ./requirements-bin.txt .
RUN cat requirements-bin.txt | xargs apt-get install -y -qq
# Install remaining packages from source
COPY ./requirements-src.R .
RUN Rscript requirements-src.R
# Clean up package registry
RUN rm -rf /var/lib/apt/lists/*
COPY ./src /app
EXPOSE 5000
CMD ["Rscript", "Server.R"]
You can add a file requirements-bin.txt with package names:
r-cran-plumber
r-cran-quanteda
r-cran-irlba
r-cran-lsa
r-cran-caret
r-cran-stringr
r-cran-dplyr
r-cran-magrittr
r-cran-randomforest
And finally, a requirements-src.R for packages that are not available as binaries:
pkgs <- c(
'otherpackage'
)
install.packages(pkgs)

I ended up using rocker/r-base as @DirkEddelbuettel suggested. Also, thanks to the question "How to avoid reinstalling packages when building Docker image for Python projects?", I wrote my Dockerfile in a way that doesn't reinstall packages every time I rebuild my Docker image: Requirements.R is copied and run before the rest of the code is copied, so Docker reuses the cached package layer until Requirements.R itself changes.
Here is what my Dockerfile looks like now; hopefully it will be of help to others:
FROM rocker/r-base
# install system libraries (update and install in one layer so the package index is never stale)
RUN apt-get update && apt-get -y install libcurl4-openssl-dev libssl-dev
# set work directory
WORKDIR /myapp
# copy requirements R script
COPY ./Requirements.R /myapp/Requirements.R
# run requirements R script
RUN Rscript Requirements.R
COPY . /myapp
EXPOSE 8094
ENV NAME R-test-service
CMD ["Rscript", "my_R_api.R"]

Related

Problem building R api with plumber, RPostgreSQL, and docker

I'm trying to install plumber and RPostgreSQL into my Docker image. Here's my Dockerfile:
FROM rocker/r-base
RUN R -e "install.packages('plumber')"
RUN R -e "install.packages('RPostgreSQL')"
RUN mkdir -p /code
COPY ./plumber.R /code/plumber.R
CMD Rscript --no-save /code/plumber.R
The only thing my plumber script does is try to reference the RPostgreSQL package:
library('RPostgreSQL')
When I build, it appears to successfully install both packages, but when my script runs, it complains that RPostgreSQL doesn't exist. I've tried other base images, I've tried many things.
Any help appreciated. Thanks!
You are trying to install RPostgres and then trying to load RPostgreSQL -- these are different packages. Hence the error.
Next, as you are on r-base, the latter is installed more easily as sudo apt install r-cran-rpostgresql (maybe after an initial sudo apt update). While you're at it, you can also install plumber as a pre-made binary (along with its dependencies). So
RUN apt update -qq \
&& apt install --yes --no-install-recommends \
r-cran-rpostgresql \
r-cran-plumber
is easier and faster.

How to install newer version of R on Amazon Linux 2

For whatever reason, Amazon moved R to the so-called "Extras Library" so you can't install R using sudo yum install -y R anymore. Instead, you have to do sudo amazon-linux-extras install R3.4. As a result, I can only install R 3.4.3 when the newest stable release is 3.6.1, and so many R libraries can't even be installed because the version is too low. Is there any good and clean way to install the latest version of R and skip Amazon's package manager? Thanks!
Use amazon-linux-extras, which installs R 4.0.2:
amazon-linux-extras install R4
You may need root:
sudo amazon-linux-extras install R4
I've tried setting up R 3.6.x in a Docker container that uses the amazonlinux image. My approach was to get the R source file from the link below and install from source:
cd /tmp/
wget https://cloud.r-project.org/src/base/R-3/R-3.6.3.tar.gz
tar -zxf R-3.6.3.tar.gz
cd /tmp/R-3.6.3
./configure --without-libtiff --without-lapack --without-ICU --disable-R-profiling --disable-nls
make
make install
You will need to yum install some build dependencies, like 'make', which doesn't seem to come with the AWS amazonlinux Docker image (which I think mirrors the EC2 AMI you are referring to).
The above kind of worked for me, in that I had a working R 3.6 installation, but it didn't allow me to use it with Shiny Server, so I'm reverting to the shipped 3.4.3 version.
tl;dr: you'll probably have to manually download the source files and install the desired R version from source, and throw in some build dependencies as well.
Try this on Amazon Linux 2
yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install R
The Amazon Linux 2 image contains an extras library that can be used as well. Follow the guide here:
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-install-extras-library-software/
sudo amazon-linux-extras enable R3.4
sudo yum clean metadata && sudo yum install R3.4

Using docker buildkit caching with R-packages

I'm trying to use the docker buildkit approach to caching packages to speed up adding packages to docker containers. I learned about it from the instructions for both python and apt-get packages and useful Stackexchange answer on caching python packages while building Docker. For Python and apt-get I am able to get this to work, but I can't get it to work for R packages.
In a Dockerfile for Python I'm able to change:
RUN pip install -r requirements.txt
to the following (the comment-like line at the top of the Dockerfile is required):
# syntax=docker/dockerfile:experimental
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
And then when I add a package to the requirements.txt file, rather than re-downloading and building the packages, pip is able to re-use all the work it has done. So buildkit cache mounts add a level of caching beyond the image layers of docker. It's a massive timesaver. I'm hoping to set up something similar for r-packages.
Here is what I've tried; it works for apt-get but not for R packages. I've also tried with the install2.r script.
# syntax=docker/dockerfile:experimental
FROM rocker/tidyverse
RUN rm -f /etc/apt/apt.conf.d/docker-clean; echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt --mount=type=cache,target=/var/lib/apt \
apt update && apt install -y gcc \
zsh \
vim
COPY ./requirements.R .
RUN --mount=type=cache,target=/usr/local/lib/R/site-library Rscript ./requirements.R
I think I don't understand:
How buildkit works. Does it build containers inside a container, i.e. is the cache path on the 'build container'?
What one needs to specify as the target for R to notice that it has already downloaded (and possibly built) the packages.
I suspect that it has something to do with the keep.source command when installing an R package, as discussed in this question

Install packages from source failing - Dockerfile

I am trying to install some packages from source (including a package that I created, which installs fine from the R console or with R CMD INSTALL).
However, while building a Docker image from a Dockerfile, I get an error for this line in the Dockerfile:
RUN R -e 'install.packages("RcppDIUtilsPackage_1.0.tar.gz",repos=NULL,type="source")'
I also tried many other commands, including R CMD INSTALL. They all install the package fine, except within the Docker image build.
Here is the error I am encountering:
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning: invalid package ‘RcppDIUtilsPackage_1.0.tar.gz’
Error: ERROR: no packages specified
Warning message:
In install.packages("RcppDIUtilsPackage_1.0.tar.gz", repos = NULL, :
installation of package ‘RcppDIUtilsPackage_1.0.tar.gz’ had non-zero exit status
Thanks!!
Edit: The Dockerfile
FROM rocker/r-ver:3.4.4
WORKDIR /home/ubuntu/projects/DService
RUN apt-get update -qq && apt-get install -y \
libssl-dev \
libcurl4-gnutls-dev
RUN R -e "install.packages('plumber')"
RUN R -e "install.packages('Rcpp')"
RUN R -e 'install.packages("RcppDIUtilsPackage_1.0.tar.gz",repos=NULL,type="source")'
COPY / /
EXPOSE 8000
CMD ["Rscript", "DService.R"]
command: sudo docker build --no-cache -t dservice-docker-image .
This is an indirect solution to your problem, because I was not able to resolve the same issue.
The root of the issue may have something to do with the host environment that created the Docker image from the Dockerfile. Specifically, the R instance that is spun up to install the local packages may not be able to access the path where your local packages are stored.
The solution for me was to just avoid local packages. Move any local repositories to remote repositories, and reference them in the Dockerfile instead. e.g.
RUN R -e "devtools::install_github('dmanuge/shinyFilesWidget') ; system('echo 14')"
After that, rebuild your Docker image and run it accordingly. While this is not a direct solution, I reached the critical threshold of debugging and needed to move on. :)
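One thing worth checking, offered as a guess at the cause rather than a confirmed diagnosis: in the Dockerfile from the question, the RUN line that installs RcppDIUtilsPackage_1.0.tar.gz comes before any COPY instruction, so the tarball may simply not exist inside the image at that point. A minimal sketch of a guard that makes this failure mode explicit (the function name install_local_pkg is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: refuse to install when the tarball is not present
# in the current filesystem, instead of letting R report "invalid package".
install_local_pkg() {
    tarball="$1"
    if [ ! -f "$tarball" ]; then
        echo "missing tarball: $tarball (was it copied into the image before this step?)" >&2
        return 1
    fi
    # R CMD INSTALL is the same command the question reports working outside Docker.
    R CMD INSTALL "$tarball"
}
```

If the guard fires during the build, copying the tarball into the image ahead of the install step would be the thing to try.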

MXNet R package on an Amazon Linux Deep Learning EC2 instance

I'm attempting to set up an Amazon Linux EC2 instance with MXNet and R (with the MXNet R package available as well). Unfortunately this has been a lot harder than I expected.
I've attempted to follow the instructions from MXNet using Amazon's deep learning AMI with CUDA 8.0 on a p2.xlarge (https://mxnet.incubator.apache.org/get_started/install.html)
However I get the same error when attempting to compile the mxnet r package from this SO post:
Issues installing mxnet GPU R package for Amazon deep learning AMI
The solutions discussed in that post are somewhat beyond my abilities to fully test/debug, i.e. I'm not particularly familiar with Linux environment variables and how to modify them. I've also reviewed some issues raised on the apache-incubator GitHub for MXNet and those were pretty unhelpful as well.
So my questions are:
Is anyone aware of any available AMIs which come pre-packaged with R and MXNet? The ones I see seem to only include Python.
Does anyone have a working set of instructions (or a script) to run on an Amazon Linux EC2 instance to install the required dependencies (assuming I'm using some type of deep learning AMI that comes with CUDA 8.0 at least) for the MXNet R package?
Right so I was the guy on the other post and I DID eventually get it working. Took 50+ hours and I'm not 100% sure where the issue was because...linux.
sudo yum install R
sudo yum install libxml2-devel
sudo yum install cairo-devel
sudo yum install giflib-devel
sudo yum install libXt-devel
sudo R
install.packages("devtools")
library(devtools)
install_github("igraph/rigraph")
install.packages(c("DiagrammeR", "roxygen2", "rgexf", "influenceR", "Cairo", "imager"))
cd
cd /src/mxnet
cp make/config.mk .
echo "USE_BLAS=openblas" >>config.mk
echo "ADD_CFLAGS += -I/usr/include/openblas" >>config.mk
echo "ADD_LDFLAGS += /usr/local/lib" >>config.mk
echo "USE_CUDA=1" >>config.mk
echo "USE_CUDA_PATH=/usr/local/cuda-9.0/lib64" >>config.mk
echo "USE_CUDNN=1" >>config.mk
*add another LD flag for /usr/local/lib
cd /etc/ld.so.conf.d/
sudo nano cuda.conf
Insert  /usr/local/cuda-9.0/lib64
cd
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
sudo ldconfig
cd R-package
Rscript -e "install.packages('devtools', repo = 'https://cran.rstudio.com')"
Rscript -e "library(devtools); library(methods);options(repos=c(CRAN='https://cran.rstudio.com'));install_deps(dependencies = TRUE)"
cd ..
sudo make rpkg
THEN you gotta make sure R/Rstudio can actually find those libraries:
cd /etc/rstudio
sudo nano rserver.conf
You can add elements to the default LD_LIBRARY_PATH for R sessions (as determined by the R ldpaths script) by adding an rsession-ld-library-path entry to the server config file. This might be useful for ensuring that packages can locate external library dependencies that aren't installed in the system standard library paths. For example:
rsession-ld-library-path=/opt/local/lib:/usr/local/cuda/lib64
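The two export LD_LIBRARY_PATH lines in the steps above can be wrapped in a small helper that avoids duplicate entries and avoids leaving a trailing colon when the variable starts out empty (an empty entry makes the loader search the current directory). This is only a sketch; the function name prepend_ld_path is made up, and the directories are the ones used earlier in this answer.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already listed, without producing an empty path element.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Same directories as in the steps above.
prepend_ld_path /usr/local/lib
prepend_ld_path /usr/local/cuda-9.0/lib64
```

This only affects the current shell session; RStudio sessions still need the rsession-ld-library-path entry described above.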

Resources