Reduce Docker image size with an R installation and its dependencies

I am using Docker images to run base R with certain packages and their dependencies installed. For this, I create an intermediary (basic) image, on which I subsequently build the final image.
Below is a Dockerfile that builds this intermediary image (not the final image).
Goal
I want to reduce the total size AND the number of layers (and thus reduce the time needed to 'pull' the image from the registry).
Question
What can I do to further reduce size and layers in the final image?
Current approach
Please look at the Dockerfile enclosed:
I used a multi-stage approach to remove as many build-dependent libraries in stage 2 as I could find.
Some R packages are available as binaries, but some are not. That's why some packages are installed using the 'Rscript -e' command, and some are installed using the apt-get update / install commands. The latter is faster and takes up less space.
I only install the libraries the image needs to actually run the R session using all the packages provided.
Multistage Dockerfile to create an intermediary image
# Base image
FROM rocker/r-base:latest AS stage1
#install binary & build dependencies
RUN apt-get update && apt-get install -y -qq --no-install-recommends --purge \
r-cran-pdftools \
r-cran-dplyr \
r-cran-knitr \
r-cran-magick \
r-cran-purrr \
r-cran-tidyr \
r-cran-tm \
r-cran-lubridate \
r-cran-ggplot2 \
r-cran-readxl \
pandoc \
libxml2 \
libssl1.1 \
tesseract-ocr-eng \
tesseract-ocr-nld \
liblept5 \
libgit2-dev \
libtesseract4 && \
apt-get autoclean && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /tmp/*
##Build the second stage
FROM stage1 AS stage2
RUN apt-get update && apt-get install -y -qq --no-install-recommends --purge \
libtesseract-dev \
libleptonica-dev \
libxml2-dev \
libgit2-dev \
libssl-dev && \
Rscript -e "install.packages(c('rmarkdown', 'forcats', 'tesseract', 'AzureStor', 'AzureKeyVault', 'stopwords', 'SnowballC', 'NbClust', 'flexdashboard', 'formattable', 'htmlwidgets', 'xgboost'))"
#use stage 1 as base and copy run time libraries needed for final version
FROM stage1
COPY --from=stage2 /usr/local/lib/R/site-library /usr/local/lib/R/site-library
Any help is much appreciated!
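One possible direction for the source-installed packages, offered as a sketch rather than a tested recipe: install.packages() accepts an INSTALL_opts argument that is passed on to R CMD INSTALL, and options such as --no-docs, --no-html and (on recent R versions) --strip skip help pages and strip compiled shared objects. The fragment below is a variant of stage 2 only and assumes stage1 is defined earlier in the same Dockerfile. The package list is shortened for illustration, and the amount of space saved is an assumption that would need to be measured on the real image.

```dockerfile
# Sketch only: stage 2 with leaner source installs (stage1 is defined above this fragment).
# --no-docs and --no-html skip help pages; --strip strips installed shared objects.
# These options apply to source installs only, not to the r-cran-* apt binaries.
FROM stage1 AS stage2
RUN apt-get update && apt-get install -y -qq --no-install-recommends \
    libtesseract-dev \
    libleptonica-dev \
    libxml2-dev \
    libssl-dev && \
    Rscript -e "install.packages(c('rmarkdown', 'tesseract', 'AzureStor'), INSTALL_opts = c('--no-docs', '--no-html', '--strip'))"
```

Skipping help pages means ?topic lookups will not work inside the container, which is usually acceptable for a batch image.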

Related

Unable to install archive version of WeibullR package on docker

I have created the following dockerfile to deploy an application
# get shiny server plus tidyverse packages image
FROM rocker/r-ver:latest
RUN apt-get update && apt-get install -y \
sudo \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev \
libxt-dev \
libssl-dev \
libxml2-dev \
libssh2-1-dev
RUN R -e "install.packages('devtools')"
RUN R -e "require(devtools)"
RUN R -e 'install_version("WeibullR", version = "1.1.10", repos="http://cran.us.r-project.org")'
The build fails and I am unable to get it to work. Could someone guide me?
I have found a solution that works. I am posting it here so that others can use it.
# get shiny server plus tidyverse packages image
FROM rocker/r-ver:latest
FROM rocker/shiny-verse:latest
RUN apt-get update && apt-get install -y \
sudo \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev \
libxt-dev \
libssl-dev \
libxml2-dev \
libssh2-1-dev
RUN R -e "install.packages('devtools')"
RUN R -e "require(devtools)"
RUN R -e 'install_version("WeibullR", version = "1.1.10", repos="http://cran.us.r-project.org")'
This works neatly.
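A note on why the original attempt may have failed, as a hedged sketch: each RUN instruction starts a new R process, so a require(devtools) in one RUN does not carry over to the next, and install_version would then not be found. Calling the function with its namespace avoids relying on that state. The sketch assumes the remotes package (which provides install_version) is acceptable in place of devtools, uses cloud.r-project.org as the CRAN mirror, and leaves out the system libraries from the Dockerfiles above.

```dockerfile
# Sketch only: install an archived package version without relying on library() state.
FROM rocker/r-ver:latest
# Namespace-qualified call, so nothing needs to survive between R processes.
RUN R -e "install.packages('remotes', repos = 'https://cloud.r-project.org')" \
 && R -e "remotes::install_version('WeibullR', version = '1.1.10', repos = 'https://cloud.r-project.org')"
```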

Importing R CRAN, Bioconductor and github R packages in one Dockerfile

My apologies because I think this may be a simple question but it is something that I am really struggling to understand!
As a background, I am trying to create a Dockerfile which installs a lot of R CRAN and R Bioconductor packages as well as some R packages from Github. I want to do this as quickly as possible so I'm using rocker's base image to install binary files, see here for a great, quick tutorial: https://datawookie.dev/blog/2019/01/docker-images-for-r-r-base-versus-r-apt/
My approach is first to install all my necessary packages as binaries and, if any are not available install them from source. After this, I use the Bioconductor base image to install the necessary Bioconductor packages.
However, the packages I installed through the rocker base image aren't available after I import the Bioconductor base image. This is where I feel I don't have a clear understanding of creating Dockerfiles but I can't seem to find an answer in any documentation. Is there some way to copy these over after importing another image? I didn't know if this is necessary, I have seen others do it the same way, such as the question poster here: Minimizing the size of docker image R shiny app
To note, I import the Bioconductor base image as I thought this would help deal with dependency issues. I guess I could just install the Bioconductor packages like the R packages that weren't available as binaries but I want to do this as quickly and cleanly as possible and I thought that this would slow things down.
Essentially, I want to know the quickest way to install R binaries, R non-binaries, R Bioconductor packages and GitHub packages all in one Dockerfile.
An example of my approach is below with a very small subset of the packages I need. Note, I have shown my full approach to install R binaries, R non-binaries, R bioconductor and github packages but for the issue I am having see what happens to the tidyverse R package before and after I import the Bioconductor image; the call library(tidyverse) runs before but fails after:
Dockerfile
## Use r-ubuntu, prev r-apt:bionic to enable the use of binary r packages for speed for R 4.0
FROM rocker/r-ubuntu:18.04
## Install available binaries - for speed
RUN apt-get update && \
apt-get install -y -qq \
r-cran-tidyverse \
r-cran-ids \
r-cran-snow
## Install remaining packages from source
COPY ./requirements-src.R .
RUN Rscript requirements-src.R
## This works
RUN R -e 'library(tidyverse)'
## Install Bioconductor packages
# Docker inheritance
FROM bioconductor/bioconductor_docker:RELEASE_3_12
COPY ./requirements-bioc.R .
#Don't bother running for speed but this will run
#RUN R -e 'BiocManager::install(ask = F)' && Rscript requirements-bioc.R
#This will fail - can't find the package
RUN R -e 'library(tidyverse)'
## Install from GH the following
#Don't bother running for speed but this will run
#RUN installGithub.r mojaveazure/loomR
EXPOSE 8787
## Make R the default
CMD ["R"]
requirements-src.R
pkgs <- c(
'spelling',
'english',
'DT'
)
install.packages(pkgs)
requirements-bioc.R
bioc_pkgs<-c(
'biomaRt',
'DropletUtils',
'rhdf5'
)
BiocManager::install(bioc_pkgs,ask=F)
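The behaviour described in the question follows from how multi-stage builds work: every FROM starts a new stage with the filesystem of the named base image, and nothing from an earlier stage is present unless it is copied in explicitly. A minimal sketch of that effect is below. The stage name cran is hypothetical, and the install.packages call assumes the Bioconductor image has a CRAN repository configured.

```dockerfile
# Stage 1: packages installed here exist only in this stage's filesystem.
FROM rocker/r-ubuntu:18.04 AS cran
RUN apt-get update && apt-get install -y -qq r-cran-tidyverse

# Stage 2: a second FROM starts over from the Bioconductor image.
# tidyverse from the "cran" stage is not available here, so it has to be
# installed again in THIS stage before it can be loaded.
FROM bioconductor/bioconductor_docker:RELEASE_3_12
RUN R -e 'install.packages("tidyverse")'
RUN R -e 'library(tidyverse)'
```

Copying an R library between stages with COPY --from is possible in principle, but with two different base images the R versions and system libraries may not match, so installing everything on a single base is the safer design.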
Just in the interest of anyone else who is facing a similar problem, I will post my solution. I am not suggesting that this is the only solution so if others find better alternatives, I'll update to it.
In the end, my approach to creating a Docker image which installs a lot of R CRAN and R Bioconductor packages, as well as some R packages from GitHub, was:
1. Use the latest Rocker RStudio image - to get packages installed as binaries and to enable easy debugging of your package with the correct dependencies, since you can run your image interactively
2. Install all libraries from the latest Bioconductor image - to ensure you can install any Bioconductor package without issue
3. Install CRAN binaries
4. Install CRAN packages from source - where binaries aren't available
5. Install Bioconductor packages
6. Install GitHub packages
My solution uses these steps in this order and should prove fast and efficient (my use case was an R package which required >80 other packages from CRAN, Bioconductor and GitHub as dependencies! This solution reduced the runtime to a fraction of the original). Also, since we are using the latest version of Rocker RStudio and packages, this should stay up to date with the latest versions of software and packages.
The Dockerfile looks like this:
#LABEL maintainer="John Doe"
## Use rocker/rstudio, which installs binaries from RStudio's RSPM service by default
## and uses the latest stable Ubuntu, R and Bioconductor versions
FROM rocker/rstudio
## Add packages dependencies - from Bioconductor
RUN apt-get update \
&& apt-get install -y --no-install-recommends apt-utils \
&& apt-get install -y --no-install-recommends \
## Basic deps
gdb \
libxml2-dev \
python3-pip \
libz-dev \
liblzma-dev \
libbz2-dev \
libpng-dev \
libgit2-dev \
## sys deps from bioc_full
pkg-config \
fortran77-compiler \
byacc \
automake \
curl \
## This section installs libraries
libpcre2-dev \
libnetcdf-dev \
libhdf5-serial-dev \
libfftw3-dev \
libopenbabel-dev \
libopenmpi-dev \
libxt-dev \
libudunits2-dev \
libgeos-dev \
libproj-dev \
libcairo2-dev \
libtiff5-dev \
libreadline-dev \
libgsl0-dev \
libgslcblas0 \
libgtk2.0-dev \
libgl1-mesa-dev \
libglu1-mesa-dev \
libgmp3-dev \
libhdf5-dev \
libncurses-dev \
libbz2-dev \
libxpm-dev \
liblapack-dev \
libv8-dev \
libgtkmm-2.4-dev \
libmpfr-dev \
libmodule-build-perl \
libapparmor-dev \
libprotoc-dev \
librdf0-dev \
libmagick++-dev \
libsasl2-dev \
libpoppler-cpp-dev \
libprotobuf-dev \
libpq-dev \
libperl-dev \
## software - perl extensions and modules
libarchive-extract-perl \
libfile-copy-recursive-perl \
libcgi-pm-perl \
libdbi-perl \
libdbd-mysql-perl \
libxml-simple-perl \
libmysqlclient-dev \
default-libmysqlclient-dev \
libgdal-dev \
## new libs
libglpk-dev \
## Databases and other software
sqlite \
openmpi-bin \
mpi-default-bin \
openmpi-common \
openmpi-doc \
tcl8.6-dev \
tk-dev \
default-jdk \
imagemagick \
tabix \
ggobi \
graphviz \
protobuf-compiler \
jags \
## Additional resources
xfonts-100dpi \
xfonts-75dpi \
biber \
libsbml5-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
#install R CRAN binary packages
RUN install2.r -e \
testthat
## Install remaining packages from source
COPY ./requirements-src.R .
RUN Rscript requirements-src.R
## Install Bioconductor packages
COPY ./requirements-bioc.R .
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
libfftw3-dev \
gcc && apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN Rscript -e 'requireNamespace("BiocManager"); BiocManager::install(ask=F);' \
&& Rscript requirements-bioc.R
## Install from GH the following
RUN installGithub.r theislab/kBET \
chris-mcginnis-ucsf/DoubletFinder
Note that the CRAN packages from source and the Bioconductor packages are held in separate scripts in the same folder as your Dockerfile.
requirements-src.R:
pkgs <- c(
'spelling',
'english',
'Seurat')
install.packages(pkgs)
requirements-bioc.R:
bioc_pkgs<-c(
'biomaRt',
'SingleCellExperiment',
'SummarizedExperiment')
requireNamespace("BiocManager")
BiocManager::install(bioc_pkgs,ask=F)

where to write and save the dockerfile?

I am a beginner in Docker. My question may seem somewhat obvious, but where do I write and save the Dockerfile?
Example:
FROM openanalytics/r-base
LABEL maintainer "Tobias Verbeke <tobias.verbeke@openanalytics.eu>"
# system libraries of general use
RUN apt-get update && apt-get install -y \
sudo \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev \
libxt-dev \
libssl-dev \
libssh2-1-dev \
libssl1.0.0
# system library dependency for the euler app
RUN apt-get update && apt-get install -y \
libmpfr-dev
# basic shiny functionality
RUN R -e "install.packages(c('shiny', 'rmarkdown'), repos='https://cloud.r-project.org/')"
# install dependencies of the euler app
RUN R -e "install.packages('Rmpfr', repos='https://cloud.r-project.org/')"
# copy the app to the image
RUN mkdir /root/euler
COPY euler /root/euler
COPY Rprofile.site /usr/lib/R/etc/
EXPOSE 3838
CMD ["R", "-e", "shiny::runApp('/root/euler')"]
Where specifically does this file have to be written and saved? In what format does it have to be saved?
The pattern I follow is to include the Dockerfile in the root of the project directory.
I also store it in the same repo as the project.
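To make that concrete, here is a shortened sketch of the example above with the placement spelled out in comments. The directory name my-project and the image tag euler-app are hypothetical. The file itself is plain text, conventionally named Dockerfile with no extension.

```dockerfile
# Save this as a plain-text file named "Dockerfile" (no extension)
# in the project root, next to the files it copies:
#
#   my-project/          <- hypothetical project directory (the build context)
#     Dockerfile
#     Rprofile.site
#     euler/
#
# Build from inside that directory, for example:
#   docker build -t euler-app .
FROM openanalytics/r-base
# COPY source paths are resolved relative to the build context.
COPY euler /root/euler
COPY Rprofile.site /usr/lib/R/etc/
```

Keeping the Dockerfile in the project root means the trailing dot in the build command makes the whole project available to COPY.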

How to silently install r-base in an ubuntu docker image

I need to install r-base within an ubuntu:18.04 Docker image. I am doing this while building my image via
RUN apt-get update; apt-get install -y r-base [many other packages]
along with many other package installations. The problem is that, while setting up r-base at the end of the installation process, it asks for user input: a timezone, followed by a city within that timezone. I obviously cannot enter this data while building the image. How can I install r-base anyway?
From AskUbuntu, you should set
ENV DEBIAN_FRONTEND=noninteractive
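A minimal sketch of how that setting could be applied to the image in the question. Using ARG instead of ENV is a design choice assumed here, not part of the quoted advice: it limits the setting to build time, so it does not remain in the environment of running containers.

```dockerfile
FROM ubuntu:18.04
# Build-time only: suppresses debconf prompts such as the tzdata timezone question.
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
 && apt-get install -y --no-install-recommends r-base \
 && rm -rf /var/lib/apt/lists/*
```

With the prompt suppressed, tzdata is configured with its default timezone. If a specific timezone matters, it has to be set separately.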
Directly taken from: https://github.com/rocker-org/rocker/blob/master/r-ubuntu/Dockerfile
FROM ubuntu:bionic
LABEL org.label-schema.license="GPL-2.0" \
org.label-schema.vcs-url="https://github.com/rocker-org/r-apt" \
org.label-schema.vendor="Rocker Project" \
maintainer="Dirk Eddelbuettel <edd@debian.org>"
## Set a default user. Available via runtime flag `--user docker`
## Add user to 'staff' group, granting them write privileges to /usr/local/lib/R/site.library
## User should also have & own a home directory (for rstudio or linked volumes to work properly).
RUN useradd docker \
&& mkdir /home/docker \
&& chown docker:docker /home/docker \
&& addgroup docker staff
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
software-properties-common \
ed \
less \
locales \
vim-tiny \
wget \
ca-certificates \
&& add-apt-repository --enable-source --yes "ppa:marutter/rrutter3.5" \
&& add-apt-repository --enable-source --yes "ppa:marutter/c2d4u3.5"
## Configure default locale, see https://github.com/rocker-org/rocker/issues/19
RUN echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
&& locale-gen en_US.utf8 \
&& /usr/sbin/update-locale LANG=en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
## This was not needed before but we need it now
ENV DEBIAN_FRONTEND noninteractive
# Now install R and littler, and create a link for littler in /usr/local/bin
# Default CRAN repo is now set by R itself, and littler knows about it too
# r-cran-docopt is not currently in c2d4u so we install from source
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
littler \
r-base \
r-base-dev \
r-recommended \
&& ln -s /usr/lib/R/site-library/littler/examples/install.r /usr/local/bin/install.r \
&& ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r \
&& ln -s /usr/lib/R/site-library/littler/examples/installGithub.r /usr/local/bin/installGithub.r \
&& ln -s /usr/lib/R/site-library/littler/examples/testInstalled.r /usr/local/bin/testInstalled.r \
&& install.r docopt \
&& rm -rf /tmp/downloaded_packages/ /tmp/*.rds \
&& rm -rf /var/lib/apt/lists/*
CMD ["bash"]

Dockerfile for minimum image size on R-base

I'm trying to minimize the size of my Docker image. It is based on r-base from the Rocker project. It needs to be as small as possible, since it is used as a container instance in a cloud-based workflow.
The image needs some extra packages (dplyr, pdftools, stringr and AzureStor). Some are available as binaries, but I could not find AzureStor as one.
I already used some recommended commands to minimize size. What more can I do? Please read the Dockerfile below. A few options I'm considering now:
Can I save space using 'no cache'? How do I 'implement' this?
Is there a binary version of an R package like AzureStor? I can't find one.
Are there any other build commands or Dockerfile lines I can use to reduce any excess size?
Any help would be much appreciated!
Here is my current Dockerfile:
FROM rocker/r-base:latest
## install binary, build and dependent packages from a single run command
RUN apt-get update && apt-get install -y -qq --no-install-recommends --purge \
r-cran-pdftools \
r-cran-dplyr \
r-cran-stringr \
libxml2-dev \
libssl-dev && \
## install non-binary packages (from the same run command)
echo "r <- getOption('repos');r['CRAN'] <- 'http://cran.us.r-project.org'; options(repos = r);" > ~/.Rprofile && \
Rscript -e "install.packages(c('AzureStor'))" && \
mkdir -p /scripts && \
## remove and clean what we can (still the same run command)
apt-get autoclean && \
apt-get -y autoremove libssl-dev && \
rm -rf /var/lib/apt/lists/*
## copy code
COPY script /scripts
## Set workdir
WORKDIR /scripts
## command line for autorunning the entire rscript
CMD [ "Rscript", "runscript.R"]
Right now, the size is around 800 MB. I am hoping to get this down.
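On the 'no cache' option: docker build --no-cache only bypasses the layer cache during the build and does not make the resulting image smaller. What does affect size is removing build-only packages and temporary files in the same RUN instruction that created them, so they never end up in a layer. The following is a hedged sketch of that idea applied to the Dockerfile above. It assumes the local directory is named script and that the -dev packages are needed only to compile AzureStor, and the actual saving is untested.

```dockerfile
FROM rocker/r-base:latest
# Install, build and clean up in a single layer so removed files do not persist.
RUN apt-get update \
 && apt-get install -y -qq --no-install-recommends \
      r-cran-pdftools r-cran-dplyr r-cran-stringr \
      libxml2-dev libssl-dev \
 && Rscript -e "install.packages('AzureStor', repos = 'https://cloud.r-project.org')" \
 && apt-get purge -y libxml2-dev libssl-dev \
 && apt-get autoremove -y \
 && rm -rf /var/lib/apt/lists/* /tmp/downloaded_packages
# Assumed layout: a local directory "script" containing runscript.R.
COPY script /scripts
WORKDIR /scripts
CMD ["Rscript", "runscript.R"]
```

The runtime libraries (for example libxml2 and libssl) stay installed because other packages depend on them. Only the headers and build-only files are removed.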

Resources