I'm working on a package "xyz" that uses Rcpp with several cpp files.
When I'm only updating the R code, I would like to run R CMD INSTALL xyz on the package directory without having to recompile all the shared libraries that haven't changed. That works fine if I specify the --no-multiarch flag: the source directory src gets populated the first time with the compiled objects, and if the sources don't change they are re-used the next time. With multiarch on, however, R makes two copies of src, src-i386 and src-x86_64. This seems to confuse R CMD INSTALL, which then always re-runs all the compilation. Is there any workaround?
(I'm aware that there are alternative ways, e.g. devtools::load_all, but I'd rather stick to R CMD INSTALL if possible.)
The platform is Mac OS X 10.7, and I have the latest version of R.
I have a partial answer for you. One really easy speed-up is provided by ccache, which you can enable for all R compilation (e.g. via R CMD whatever, thereby also covering inline, attributes, RStudio use, ...) globally through ~/.R/Makevars:
edd@max:~$ tail -10 .R/Makevars
VER=4.6
CC=ccache gcc-$(VER)
CXX=ccache g++-$(VER)
SHLIB_CXXLD=g++-$(VER)
FC=ccache gfortran
F77=ccache gfortran
MAKE=make -j8
edd@max:~$
It takes care of all caching of compilation units.
Now, that does not explicitly address the --no-multiarch aspect, which I don't play much with as we are still mostly 'single arch' on Linux. This will change eventually, but hasn't yet. Still, I suspect that by letting the compiler handle the caching you too will get the net effect.
Other aspects can be controlled too; e.g., ~/.R/check.Renviron can be used to turn certain tests on or off. I tend to keep 'em all on -- better to waste a few seconds here than to get a nastygram from Vienna.
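For illustration, a ~/.R/check.Renviron can hold entries such as the following (this particular variable is documented for R CMD check; treat the value shown as just an example):

## example ~/.R/check.Renviron entry: do not require packages listed
## in Suggests to be installed when running R CMD check
_R_CHECK_FORCE_SUGGESTS_=false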
Related
I can only find information on how to install a ready-made R extension package, but it is nowhere mentioned which commands a developer of an extension package has to use during daily development. I am using Rcpp and I am on Windows.
If this were a typical C++ project, it would go like this:
edit
make # oops, typo
edit # fix typo
make # oops, forgot an #include
edit
make # good; updates header dependencies for subsequent 'make' automatically
./fooreader # test it
make install # only now I'm ready
Which commands do I need for daily development of an Rcpp package project?
I've created a skeleton project using these commands from the R command line:
library(Rcpp)
Rcpp.package.skeleton("FooReader", example_code=FALSE,
author="My Name", email="my.email#example.com")
This created 3 files:
DESCRIPTION
NAMESPACE
man/FooReader-package.Rd
Now I dropped source code into
src/readfoo.cpp
with these contents:
#include <Rcpp.h>
#error here
I know I can run this from the R command line:
Rcpp::sourceCpp("D:/Projects/FooReader/src/readfoo.cpp")
(this does run the compiler and indicates the #error).
But I want to develop a package ultimately.
There is no universal answer for everybody, I guess.
For some people, RStudio is everything, and with some reason. One can use the package creation facility to create an Rcpp package, then edit and just hit the buttons (or keyboard shortcuts) to compile and re-load and test.
I also work a lot in a shell, so I do a fair amount of editing in Emacs/ESS, along with R CMD INSTALL (where, thanks to ccache, recompilation of unchanged code is immediate) and command-line use via r from the littler package -- this allows me to write compact expressions that load the new package and evaluate something, e.g. r -lnewpackage -e'someFunc(somearg)' to test newpackage::someFunc() with somearg.
You can also launch the build and test from Emacs. As I said, it all depends.
Both those answers are for packages, where I do real work. When I just test something in a single file, I do that in one Emacs buffer and sourceCpp() in an R session in another buffer of the same Emacs. Or sometimes I edit in Emacs and run sourceCpp() in RStudio.
There is no one answer. Find what works for you.
Also, the first part of your question describes the initial setup of a package. That is not part of the edit/compile/link/test cycle, as it is a one-off. And for that, too, we have different approaches, many of which have been discussed here.
Edit: The other main misunderstanding in your question is that once you have a package you generally do not use sourceCpp() anymore.
In order to test an R package, it has to be installed into a (temporary) library such that it can be attached to a running R process. So you will typically need:
R CMD build . to build package_version.tar.gz
R CMD check <package_version.tar.gz> to test your package, including the tests placed in the tests folder
R CMD INSTALL <package_version.tar.gz> to install it into a library
After that you can attach the package and test it. Quite often I try to use a more TDD (test-driven development) approach, which means I do not have to INSTALL the package. Running the unit tests (e.g. via R CMD check) is enough.
All that is independent of Rcpp. For a package using Rcpp you need to call Rcpp::compileAttributes() before these steps, e.g. with Rscript -e 'Rcpp::compileAttributes()'.
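Putting it together, a typical cycle looks like this (the package name and version number below are placeholders, taken from the FooReader example above):

Rscript -e 'Rcpp::compileAttributes()'  # regenerate RcppExports files
R CMD build .                           # produces FooReader_1.0.tar.gz
R CMD check FooReader_1.0.tar.gz        # runs checks, including tests/
R CMD INSTALL FooReader_1.0.tar.gz      # install into a library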
If you use RStudio for package development, it offers a lot of automation via the devtools package. I still find it useful to know what goes on under the hood, and RStudio is by no means required.
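For reference, the devtools calls behind RStudio's buttons are roughly the following (a sketch using the standard devtools functions):

library(devtools)
load_all()   # compile and load the current package into this session
document()   # rebuild roxygen2 documentation and NAMESPACE
check()      # roughly equivalent to R CMD check on the sources
install()    # install the package into your library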
The BLAS section in the R Installation and Administration manual says that when R is built from source with the configuration parameter --without-blas, it will build Netlib's reference BLAS into a standalone shared library at R_HOME/lib/libRblas.so, alongside the standard R shared library R_HOME/lib/libR.so. This makes it easier for users to switch between and benchmark different tuned BLAS libraries in the R environment. The guide suggests that researchers might use a symbolic link to libRblas.so to achieve this, and this article gives more details.
By contrast, when simply installing a pre-compiled binary version of R, either from one of CRAN's mirrors or from Ubuntu's repository (for Linux users like me), in theory it should be more difficult to switch between different BLAS libraries without rebuilding R, because a pre-compiled R is configured with --with-blas = (some BLAS library). We can easily check this, either by reading the configuration file at R_HOME/etc/Makeconf, or by checking the result of R CMD config BLAS_LIBS. For example, on my machine it returns -lblas, so it is linked to the reference BLAS at build time. As a result, there is no R_HOME/lib/libRblas.so, only R_HOME/lib/libR.so.
However, this R-blog says that it is possible to switch between different BLAS libraries even if R is not installed from source. The author tried ATLAS and OpenBLAS from Ubuntu's repository, and then used update-alternatives --config to switch between them. It is also possible to configure and install a tuned BLAS from source, add it to the "alternatives" through update-alternatives --install, and later switch to it in the same way. The BLAS library (a symbolic link) in this case will be found at /usr/lib/libblas.so.3, which is on the library search path for both Ubuntu and R. I have tested this and it does work! But I am very surprised at how R achieves this. As I said, R should have been tied to the BLAS library configured at build time, i.e., I would expect all BLAS routines to be integrated into R_HOME/lib/libR.so. So why is it still possible to change BLAS via /usr/lib/libblas.so.3?
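For reference, what I did looks roughly like this (the name of the alternative varies by Ubuntu release, e.g. libblas.so.3 or libblas.so.3gf, and the OpenBLAS path is only an example from my setup):

# list the registered BLAS implementations and pick one interactively
sudo update-alternatives --config libblas.so.3
# register a self-built BLAS as an additional alternative (priority 40)
sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 \
    /opt/OpenBLAS/lib/libopenblas.so 40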
Thanks if someone can please explain this.
I've found several posts about best practice, reproducibility and workflow in R, for example:
How to increase longer term reproducibility of research (particularly using R and Sweave)
Complete substantive examples of reproducible research using R
One of the major preoccupations is ensuring portability of code, in the sense that moving it to a new machine (possibly running a different OS) is relatively straightforward and gives the same results.
Coming from a Python background, I'm used to the concept of a virtual environment. When coupled with a simple list of required packages, this goes some way to ensuring that the installed packages and libraries are available on any machine without too much fuss. Sure, it's no guarantee - different OSes have their own foibles and peculiarities - but it gets you 95% of the way there.
Does such a thing exist within R? Even something less sophisticated would help: for example, simply maintaining a plain text list of required packages and a script that will install any that are missing?
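Something like this minimal sketch is what I have in mind (requirements.txt here is a hypothetical file with one package name per line):

# install any packages from a plain text list that are not yet present
pkgs <- readLines("requirements.txt")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)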
I'm about to start using R in earnest for the first time, probably in conjunction with Sweave, and would ideally like to start in the best way possible! Thanks for your thoughts.
I'm going to use the comment posted by @cboettig in order to resolve this question.
Packrat
Packrat is a dependency management system for R. It gives you three important advantages (all of them focused on your portability needs):
Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. That’s because packrat gives each project its own private package library.
Portable: Easily transport your projects from one computer to another, even across different platforms. Packrat makes it easy to install the packages your project depends on.
Reproducible: Packrat records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.
What's next?
Walkthrough guide: http://rstudio.github.io/packrat/walkthrough.html
Most common commands: http://rstudio.github.io/packrat/commands.html
Using Packrat with RStudio: http://rstudio.github.io/packrat/rstudio.html
Limitations and caveats: http://rstudio.github.io/packrat/limitations.html
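The core workflow boils down to a handful of commands (see the walkthrough linked above for details):

packrat::init()            # give the current project a private library
install.packages("dplyr")  # installs into the project library as usual
packrat::snapshot()        # record exact package versions in the lockfile
packrat::restore()         # on another machine, reinstall those versions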
Update: Packrat has been soft-deprecated and is now superseded by renv, so you might want to check this package instead.
The Anaconda package manager conda supports creating R environments.
conda create -n r-environment r-essentials r-base
conda activate r-environment
I have had a great experience using conda to maintain different Python installations, both user-specific and several versions for the same user. I have tested R with conda and the jupyter-notebook and it works great. At least for my needs, which include RNA-sequencing analyses using the DESeq2 and related packages, as well as data.table and dplyr. There are many Bioconductor packages available in conda via bioconda, and according to the comments on this SO question, it seems like install.packages() might work as well.
It looks like there is another option from RStudio devs, renv. It's available on CRAN and supersedes Packrat.
In short, you use renv::init() to initialize your project library, and use renv::snapshot() / renv::restore() to save and load the state of your library.
I prefer this option to conda r-environments because here everything is stored in the file renv.lock, which can be committed to a Git repo and distributed to the team.
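In code, the basic cycle is:

renv::init()              # create a project-local library and renv.lock
install.packages("dplyr") # goes into the project library
renv::snapshot()          # record the exact versions in renv.lock
renv::restore()           # on another machine, reinstall from renv.lock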
To add to this:
Note:
1. Have Anaconda installed already
2. Assume your working directory is "C:\"
To create the desired environment (here named "r_environment_name"):
C:\>conda create -n "r_environment_name" r-essentials r-base
To see available environments
C:\>conda info --envs
.
..
...
To activate environment
C:\>conda activate "r_environment_name"
(r_environment_name) C:\>
Launch Jupyter Notebook and let the party begin:
(r_environment_name) C:\> jupyter notebook
For an analogue of Python's requirements.txt, perhaps this link will help -> Is there something like requirements.txt for R?
Check out roveR, the R container management solution. For details, see https://www.slideshare.net/DavidKunFF/ownr-technical-introduction, in particular slide 12.
To install roveR, execute the following command in R:
install.packages("rover", repos = c("https://lair.functionalfinances.com/repos/shared", "https://lair.functionalfinances.com/repos/cran"))
To make full use of the power of roveR (including installing specific versions of packages for reproducibility), you will need access to a laiR. For CRAN, you can use our laiR instance at https://lair.ownr.io; for uploading your own packages and sharing them with your organization, you will need a laiR license. You can contact us at the email address in the presentation linked above.
I'm trying to build my own package on Windows. I installed everything I need, and both R CMD build mypackage and R CMD INSTALL mypackage seem to run fine. When I run the build command, though, I get a warning from cygwin:
cygwin warning:
MS-DOS style path detected: C:/Documents and Settings/e_sander/My Documents/mypackage_1.0.tar.gz
Preferred POSIX equivalent is: /cygdrive/c/Documents and Settings/e_sander/My Documents/mypackage_1.0.tar.gz
CYGWIN environment variable option "nodosfilewarning" turns off this warning. Consult the user's guide for more details about POSIX paths:
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
I did go to the recommended website, but I don't know much about cygwin or Linux, so I'm not sure there's anything I need to do. I realize that using the MS-DOS style path is deprecated and not recommended in cygwin, but I'm not sure how to change that, since I'm running Windows and that's the path I need. I also haven't noticed any problems with my package, at least when I install it on my computer (and although I haven't used the tarball, I've opened it and everything looks fine). So here's what I'm trying to figure out:
Does leaving the path as is affect my package in any way?
If so, how could it adversely affect my package?
How do I change the path to make cygwin happy?
This is just a warning and it tells you how to disable it. It doesn't affect anything. If you want it to go away, run this from your shell:
export CYGWIN="nodosfilewarning"
Or you could mount C: to /c/ (see man mount).
Does leaving the path as is affect my package in any way?
- No, it is just a warning.
If so, how could it adversely affect my package?
- N/A
How do I change the path to make cygwin happy?
- Set the environment variable as the output states. There are multiple ways to do this. I chose a solution that handles the issue across any invoked environment that parses or inherits the Windows environment, by using the "Rapid Environment Editor" program to add a User Variable named CYGWIN with the value nodosfilewarning. But if you wanted, you could add it through the Control Panel using "Edit environment variables for your account".
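For example, from a Windows command prompt (setx is the standard Windows tool for persisting user environment variables; open a new shell afterwards for it to take effect):

setx CYGWIN nodosfilewarning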
I have an R package which is easily sped up by using OpenMP. If your compiler supports it then you get the win, if it doesn't then the pragmas are ignored and you get one core.
My problem is how to get the package build system to use the right compiler options and libraries. Currently I have:
PKG_CPPFLAGS=-fopenmp
PKG_LIBS=-fopenmp
hardcoded into src/Makevars on my machine, and this builds it with OpenMP support. But it produces a warning about non-standard compiler flags on check, and will probably fail hard on a machine with no OpenMP capabilities.
The solution seems to be to use configure and autoconf. There's some information around here:
http://cran.r-project.org/doc/manuals/R-exts.html#Using-Makevars
including a complex example that compiles in ODBC functionality. But I can't see how to begin tweaking that to check for OpenMP and libgomp.
None of the R packages I've looked at that talk about using OpenMP seem to have this set up either.
So does anyone have a walkthrough for setting up an R package with OpenMP?
[EDIT]
I may have cracked this now. I have a configure.ac script and a Makevars.in with @FOO@ substitutions for the compiler options. But now I'm not sure of the workflow. Is it:
Run "autoconf configure.in > configure; chmod 755 configure" if I change the configure.in file.
Do a package build.
On package install, the system runs ./configure for me and creates Makevars from Makevars.in
But just to be clear, "autoconf configure.in > configure" doesn't run on package install - it's purely a developer process to create the configure script that is distributed - amirite?
Methinks you have the library option wrong; please try
## -- compiling for OpenMP
PKG_CXXFLAGS=-fopenmp
##
## -- linking for OpenMP
PKG_LIBS= -fopenmp -lgomp
In other words, -lgomp gets you the OpenMP library linked. And I presume you know that this library is not part of the popular Rtools kit for Windows. On a modern Linux you should be fine.
In an unreleased test package I have here, I also add the following to PKG_LIBS, but that is mostly due to my use of Rcpp:
$(shell $(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()") \
$(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
Lastly, I think the autoconf business is not really needed unless you feel you need to test for OpenMP via configure.
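If you do go the configure route, a minimal sketch might look like this (AC_OPENMP is a standard autoconf macro that probes the compiler and sets OPENMP_CXXFLAGS when the current language is C++; the package name and version are placeholders):

# configure.ac
AC_INIT([mypackage], [1.0])
AC_LANG([C++])
AC_PROG_CXX
AC_OPENMP
AC_SUBST(OPENMP_CXXFLAGS)
AC_CONFIG_FILES([src/Makevars])
AC_OUTPUT

# src/Makevars.in
PKG_CXXFLAGS = @OPENMP_CXXFLAGS@
PKG_LIBS = @OPENMP_CXXFLAGS@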
Edit: SpacedMan is correct. Per the beginning of the libgomp-4.4 manual:
1 Enabling OpenMP
To activate the OpenMP extensions for C/C++ and Fortran, the compile-time flag `-fopenmp' must be specified. This enables the OpenMP directive [...] The flag also arranges for automatic linking of the OpenMP runtime library.
So I stand corrected. Seems that it doesn't hurt to manually add what would get added anyway, just for clarity...
Just addressing your question regarding the usage of autoconf: no, you do not want to run autoconf with any arguments, nor should you redirect its output. You are correct that running autoconf to build the configure script is something that the package maintainer does, and the resulting configure script is distributed. Normally, to generate the configure script from configure.ac (older packages use the name configure.in, but that name has been discouraged for several years), the developer simply runs autoconf with no arguments. Before running autoconf, it may be necessary to run aclocal, autoheader, libtoolize, etc. There is also a tool (autoreconf) which simplifies the process and invokes all the required programs in the correct order. It is now more typical to run autoreconf instead of autoconf.
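So the developer-side step is simply something like this (run in the package's top-level directory; the build command is the usual one from above):

autoreconf        # regenerates configure from configure.ac, invoking
                  # aclocal, autoconf, etc. in the correct order
R CMD build .     # the generated configure ships in the tarball, and
                  # R CMD INSTALL runs it to create src/Makevars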