Precompile R packages for specific Linux flavor

We have established a simple local CRAN-like repository for R packages. There are many users, all of whom use the same version of Linux.
Is there a way of convincing R to provide pre-compiled Linux packages instead of just source ones? The compilation step takes a considerable amount of time for anyone using our repository. It should be possible to precompile and reuse the same binaries, since we can guarantee that the Linux version is consistent across all users.
How could one hack something like this together?

In the very narrow sense of "all of whom use the same version of Linux" you actually have an option (one that happens to be relatively little known). Create binary packages using e.g.
R CMD INSTALL --build nameOfDirectoryWithSources
As R CMD INSTALL --help puts it:
--build build binaries of the installed package(s)
and these are not .deb- or .rpm-style packages: no dependency information or the like is added. But they do exactly what you ask for: they save on compilation time.
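A minimal sketch of the round trip, assuming a package source directory ./dplyr and the same R version and Linux flavour on every machine (the tarball name below is illustrative):
# On the build machine (shell):
#   R CMD INSTALL --build dplyr
# This produces a binary tarball such as dplyr_1.1.4_R_x86_64-pc-linux-gnu.tar.gz.
# Ship that file to the identical machines and install it there, either with
#   R CMD INSTALL dplyr_1.1.4_R_x86_64-pc-linux-gnu.tar.gz
# or from within R (R CMD INSTALL recognises that the tarball is a binary build):
install.packages("dplyr_1.1.4_R_x86_64-pc-linux-gnu.tar.gz", repos = NULL)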
I am not aware of a repository structure one can build from this, though.

Related

Air-gapped environment - Installing R package source vs binaries

We have an Ubuntu Linux server in our office which is an air-gapped environment; there is no internet access to any external network.
However, I would like to install a few R packages like ggplot2, DatabaseConnector, dplyr, tidyverse, etc. I have 10-15 packages to download.
Since I cannot run the usual command install.packages("DatabaseConnector"), I have to download the zipped folders from CRAN as shown here.
I am new to R, so can you help me with the questions given below?
a) Why are there no files for Linux systems? I only see Windows binaries and macOS binaries. Which one should I download?
b) Should I download binaries or the package source? Which one is easier to install?
c) When I download packages from CRAN as zipped files like shown here, will the dependencies be downloaded automatically as well? Or should I look at the error messages and keep downloading them one by one?
d) Since I work in an air-gapped environment, what would be the best way to do this process efficiently?
Under Linux, packages are always installed from source; there are no official binary packages for Linux. However, your distro might offer some of them in its official repositories, and Ubuntu does, but these tend to be quite old versions and are usually limited to a handful of the most important packages. So for Linux you have to download the source packages. The zip files are for Windows and will not work.
You will also need to download all of the dependencies of the packages. For something like tidyverse this is a huge number, and tracking them by hand is a lot of work. Easiest is probably to use a package like miniCRAN outside of your air-gapped system to build a selective copy of CRAN: you specify the packages you want and miniCRAN downloads all of their dependencies. You can then copy the downloaded directories to your server, point install.packages in the right direction, and install as usual. For details see https://andrie.github.io/miniCRAN/articles/miniCRAN-introduction.html.
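A minimal miniCRAN sketch of that workflow, run on a machine with internet access (the package names and the /tmp path are just examples):
library(miniCRAN)
pkgs <- c("ggplot2", "dplyr", "DatabaseConnector")
# resolve the full dependency tree for source installation
deps <- pkgDep(pkgs, repos = "https://cloud.r-project.org", type = "source")
# download everything into a local CRAN-like directory tree
dir.create("/tmp/local-cran", recursive = TRUE)
makeRepo(deps, path = "/tmp/local-cran", repos = "https://cloud.r-project.org", type = "source")
# after copying that directory to the air-gapped server:
install.packages(pkgs, repos = "file:///path/to/local-cran", type = "source")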
You might also run into the problem that your system does not have all of the system libraries needed to build the packages. Under Ubuntu, for example, you need to install libxml2-dev to be able to install the xml package. For that you need to use Ubuntu's package manager; how to do that on an air-gapped system is another issue.

R package build failing on Unix machines due to missing GSL - GNU Scientific Library

I am facing a particularly vexing problem with R package development. My own package, called ggstatsplot (https://github.com/IndrajeetPatil/ggstatsplot), depends on userfriendlyscience, which depends on another package called MBESS, which itself ultimately depends on another package called gsl. There is no problem at all for installation of ggstatsplot on a Windows machine (as assessed by AppVeyor continuous integration platform: https://ci.appveyor.com/project/IndrajeetPatil/ggstatsplot).
But whenever the package is to be installed on Unix machines, it throws the error that ggstatsplot can't be downloaded because userfriendlyscience and MBESS can't be downloaded because gsl can't be downloaded. The same thing also shows up on the Travis continuous integration platform with its virtual Unix machines, where the package build fails (https://travis-ci.org/IndrajeetPatil/ggstatsplot).
Now one way to solve this problem for the user on the Unix machine is to configure GSL (as described here:
installing R gsl package on Mac), but I can't possibly expect every user of ggstatsplot to go through the arduous process of configuring GSL. I want them to just run install.packages("ggstatsplot") and be done with it.
So I would really appreciate if anyone can offer me any helpful advice as to how I can make my package user's life simpler by removing this problem at its source. Is there something I should include in the package itself that will take care of this on behalf of the user?
This may not have a satisfying solution via changes to your R package (I'm not sure either way). If the gsl package authors (who include a former R Core member) didn't configure it to avoid a prerequisite installation of a Linux package, there's probably a good reason not to.
But it may be some consolation that most R+Linux users understand that some R packages first require installing the underlying Linux libraries (eg, through apt or dnf/yum).
Primary Issue: making it easy for the users to install
Try to be super clear in the GitHub README and the CRAN INSTALL file. The gsl package has decent CRAN directions, which lead to the following bash code:
sudo apt-get install libgsl0-dev
The best example of clear (linux pre-req package) documentation I've seen is from the curl and sf packages. sf's CRAN page lists only the human names of the 3 libraries, but the GitHub page provides the exact bash commands for three major distribution branches. The curl package does this very well too (eg, CRAN and GitHub). For example, it provides the following explanation and bash code:
Installation from source on Linux requires libcurl. On Debian or Ubuntu use libcurl4-openssl-dev:
sudo apt-get install -y libcurl-dev
Ideally your documentation would describe how to install the gsl Linux library on multiple distributions.
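For example (package names vary by distribution and release; libgsl0-dev was the name on older Ubuntu releases):
sudo apt-get install libgsl-dev    # Debian/Ubuntu
sudo dnf install gsl-devel         # Fedora/RHEL/CentOS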
Disclaimer: I've never developed a package that directly requires a Linux package, but I use them a lot. In case more examples would help, this doc includes a script I use to install stuff on new Ubuntu machines. Some commands were stated explicitly in the package documentation; some had little or no documentation, and required research.
edit 2018-04-07:
I encountered my new favorite example: the sys package uses a config file to produce the following message in the R console. While installing 100+ packages on a new computer, it was nice to see this direct message, and not have to track down the R package and the documentation about its dependencies.
On Debian/Ubuntu this package requires AppArmor.
Please run: sudo apt-get install libapparmor-dev
Another good one is pdftools, that also uses a config file (and is also developed by Jeroen Ooms).
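As a rough sketch of what such a configure check can do (modeled loosely on the Anticonf-style scripts used in Jeroen Ooms's packages, not copied from sys itself):
#!/bin/sh
# Probe for the system library; if it is missing, print the hint and abort.
if ! pkg-config --exists libapparmor; then
  echo "On Debian/Ubuntu this package requires AppArmor."
  echo "Please run: sudo apt-get install libapparmor-dev"
  exit 1
fi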
Secondary Issue: installing on Travis
The userfriendlyscience Travis config file apparently installs a lot of binaries directly (including gsl), unlike the current ggstatsplot version.
Alternatively, I'm more familiar with telling travis to install the linux package, as demonstrated by curl's config file. As a bonus, this probably more closely replicates what typical users do on their own machines.
addons:
  apt:
    packages:
      - libcurl4-openssl-dev
Follow-up 2018-03-13: Indrajeet and I tweaked the Travis file so it's working. Two sections were changed in the yaml file:
The libgsl0-dev entry was added under the packages section (similar to the libcurl4-openssl-dev entry above).
Packages were listed in the r_binary_packages section so they install as binaries. The build was timing out after 50 minutes, and now it's under 10 min. In this particular package, the r_binary_packages section was nested in the Linux part of the Travis matrix so it wouldn't interfere with his two OS X jobs on Travis.

R install packages from Shell

I am trying to implement a reducer for Hadoop Streaming using R. However, I need to figure out a way to access certain libraries that are not built into R, such as dplyr. Based on my research, it seems like there are two approaches:
(1) In the reducer code, install the required libraries into a temporary folder that will be disposed of when the session is done, like this:
# append a temporary directory to the library search path, install there, then load
.libPaths(c(.libPaths(), temp <- tempdir()))
install.packages("dplyr", lib = temp, repos = "http://cran.us.r-project.org")
library(dplyr)
...
However, this approach has a dramatic overhead depending on how many libraries you are trying to install, so most of the time will be wasted on installing libraries (sophisticated libraries like dplyr have tons of dependencies, which take minutes to install in a vanilla R session).
So it sounds like I need to install them beforehand, which leads us to approach 2.
(2) My cluster is fairly big, and I have to use some tool like Ansible to make it work. So I prefer to have one Linux shell command to install the libraries. I have seen R CMD INSTALL before; however, it seems to install packages only from a source file, rather than doing what install.packages() does in the R console: figure out the mirror, pull the source file, and install it, all in one command.
Can anyone show me how to use one shell command line to non-interactively install an R package?
(Sorry for all the background. If anyone thinks I am not even following the right philosophy, feel free to leave a comment on how R packages on a cluster like this should be managed.)
tl;dr
Rscript -e 'install.packages("drat", repos="https://cloud.r-project.org")'
You mentioned you are trying to install dplyr into a custom lib location on your disk. Be aware that the dplyr package does not support that; you can read more in dplyr#4641.
Moreover, if you are installing a private package published in an internal CRAN-like repository (created by drat or tools::write_PACKAGES), you can easily combine repositories in the repos argument and resolve dependencies from CRAN automatically.
Rscript -e 'install.packages("priv.pkg", repos=c("cran.priv","https://cloud.r-project.org"))'
This is a very handy feature of R repositories, although for production use I would recommend caching packages from CRAN locally and using those, so you will never be surprised by breaking changes in your dependencies. For good information about handling R in production, I suggest the talk by Wit Jakuczun at WhyR2019, "How to make R great for machine learning in (not only) the Enterprise": slides, video.
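A small sketch of how such an internal repository can be populated, assuming you have already built a source tarball priv.pkg_0.1.0.tar.gz and use /srv/cran.priv as the repository root (both names are just examples):
library(drat)
# copy the tarball into the repo and update its PACKAGES index
drat::insertPackage("priv.pkg_0.1.0.tar.gz", repodir = "/srv/cran.priv")
# or, without drat, maintain the index yourself:
tools::write_PACKAGES("/srv/cran.priv/src/contrib", type = "source")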
You may find littler useful. It is a command-line front-end / variant of R (which uses the R-embedding interface).
I use the install.r script all the time to install packages from the shell. There is a second variant with more command-line argument parsing, but it has an added dependency.
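Usage is a one-liner once littler's example scripts are on your PATH (the package names here are only examples):
install.r drat dplyr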

most common config flags for installing R from source on 64-bit Ubuntu

I'm convinced that using Dirk's package is the best way to install and maintain R on an Ubuntu system. But I want to have some fun and get used to installing R from source.
What are the most common configure flags to use when installing?
Also, if I want to install 2.14.1 and I have 2.14.0 currently installed (which was installed from source), should I first uninstall 2.14.0?
There was a recent thread somewhere about having several versions---one from the apt-get repo, one in /usr/local. Try to find that...
Otherwise, I will roll up 2.14.1 on Friday morning, Michael will do his magic and the repo will have .deb packages of 2.14.1 'real soon', sometimes within a day.
Lastly, you can see which flags are used by getting the package sources, for which you just do apt-get source r-base (and that works for any Debian/Ubuntu package, provided you have source references in apt's sources list).
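If you just want a reasonable starting point, a from-source build commonly looks something like the following; these flags are illustrative, and the authoritative list is in the debian/rules file you get via apt-get source r-base:
./configure --prefix=/usr/local \
            --enable-R-shlib \
            --with-blas --with-lapack \
            --with-readline --with-x
make -j4
sudo make install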
Edit: By the way, regarding the '64-bit' aspect of your question: Nada. We don't do anything differently. It is "merely" the host OS being more generous with resources. But R finds all it needs to know on its own via its configure etc logic.

How to install and manage many versions of R packages

I am developing a framework for reproducible computing with R. One problem that I am struggling with is that some R code might run perfectly in version X.Y-Z of a package, but when you try to reproduce it 3 years later, the packages have been updated, some functions have changed, and the code doesn't run anymore. This problem also affects, for example, Sweave documents that use packages.
The only way to confidently reproduce the results is by installing the R version and the package versions that were used by the original author. If this were a single case, one could pull stuff from the CRAN archives and install the appropriate versions. But for my framework this is impractical, and I need to have the package versions preinstalled.
Assume for now that I restrict myself to a single version of R, e.g. 2.14. What would be a practical way to install many versions of R packages, so that I can load them on the fly? I suppose I can do something like creating separate library directories for every version of every package and then using custom lib.loc arguments while loading them. This is going to be messy though. Any tips or previous attempts to do something similar?
My framework runs on Ubuntu server.
You could install packages into version-named directories (e.g. rename the installed directory to foo_1.0 instead of foo) and symlink the versions you want into one library to re-create a given R + packages snapshot. Obviously, the packages could actually live in a separate tree, so you could have library.projectX/foo -> library.all/foo/1.0.
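A minimal sketch of the per-version idea using plain lib.loc (the version number and paths are just examples; the symlink layout above would simply point a project library at directories like these):
# install an archived CRAN version into its own library directory
url <- "https://cran.r-project.org/src/contrib/Archive/pls/pls_2.3-0.tar.gz"
download.file(url, destfile = "pls_2.3-0.tar.gz")
libdir <- path.expand("~/library.all/pls/2.3-0")
dir.create(libdir, recursive = TRUE)
install.packages("pls_2.3-0.tar.gz", repos = NULL, type = "source", lib = libdir)
# later, load exactly that version
library(pls, lib.loc = libdir)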
The operating system gives you even more handles for complete separation, and the Debian / Ubuntu stack has a ton of those available. Two I have played with are
chroot environments: We use these to completely separate build environments from host machines. For example, all Debian uploads I produce are built in an i386 pbuilder chroot hosted on my amd64 Ubuntu server. chroot is a very powerful Unix system call. Chroots, and particularly the pbuilder system built on top of it (for Debian package building), are meant to operate headless.
Virtual machines: This gives you full generality. My not-so-powerful box easily handles three virtual machines: Debian i386, Ubuntu i386 as well as Windoze XP. For this, I currently use KVM along with libvirt; this is Linux specific. I have also used VirtualBox and VMware in the past.
I would try to modify the DESCRIPTION file and change the "Package" field there by adding the version number.
For example, you download the package source from the CRAN page (http://cran.r-project.org/web/packages/pls/) and unpack the compressed file (pls_2.3-0.tar.gz) to a directory ("pls/"). The remaining steps are to change the package name in DESCRIPTION ("pls/DESCRIPTION") and install with the shell command R CMD INSTALL pls/, where pls/ is the path to the package source with the modified DESCRIPTION file.
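A small sketch of that edit done from R rather than by hand (the versioned name pls.2.3.0 is only an example; you would then load it with library(pls.2.3.0)):
# read the DESCRIPTION, rewrite the Package field, and write it back
desc <- read.dcf("pls/DESCRIPTION")
desc[1, "Package"] <- "pls.2.3.0"
write.dcf(desc, file = "pls/DESCRIPTION")
# then, from the shell: R CMD INSTALL pls/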
Playing with R library paths seems a dangerous thing to me.
