When is R's `ByteCompile` counter-productive?

The R docs describe the ByteCompile field in the "DESCRIPTION file" section as:
The ‘ByteCompile’ logical field controls if the package code is to be byte-compiled on installation: the default is currently not to, so this may be useful for a package known to benefit particularly from byte-compilation (which can take quite a long time and increases the installed size of the package)
I infer that the only detrimental side-effects of byte-compiling are (a) time-to-install and (b) installed size. I haven't found a package that takes too long to install/byte-compile, and the general consensus is that GBs are cheap (for storage).
Q: When should I choose to not byte-compile packages I write? (Does anybody have anecdotal or empirical limits beyond which they choose against it?)
Edit: As noted in the comments of an older question, the rationale that debugging is not possible with byte-compiled code has been debunked. Other related questions on SO have discussed how to do it (either manually with R CMD INSTALL --byte-compile ... or with install.packages(..., type="source", INSTALL_opts="--byte-compile")), but have not discussed the ramifications of or arguments against doing so.

I have yet to find a downside for byte-compiling, other than the ones you mention: slightly increased file size and installation time.
In the past, compiling certain code could cause slowdowns, but in recent versions of R (> 3.3.0) this no longer seems to be a problem.
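If you want to gauge the benefit for your own code, here is a minimal sketch using the base compiler package (note: on R >= 3.4 the JIT byte-compiles most functions on first use anyway, so the gap may be small):

    # Compare an interpreted function with its byte-compiled counterpart.
    library(compiler)
    old <- enableJIT(0)               # disable the JIT so f() stays interpreted
    f  <- function(n) { s <- 0; for (i in seq_len(n)) s <- s + i; s }
    fc <- cmpfun(f)                   # explicitly byte-compiled version
    system.time(f(1e6))
    system.time(fc(1e6))              # typically faster for loop-heavy code
    enableJIT(old)                    # restore the previous JIT level
    # To opt a whole package in, add this line to its DESCRIPTION:
    #   ByteCompile: true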

Related

What is the difference between :: and library? [duplicate]

I'm writing some R functions that use helpful functions from other packages such as stringr and base64enc. Is it good practice not to call library(...) or require(...) to load these packages first, but instead to use :: to refer directly to the function I need, like stringr::str_match(...)?
Is that good practice in the general case, or what problems might it cause?
It all depends on context.
:: is primarily necessary if there are namespace collisions, functions from different packages with the same name. When I load the dplyr package, it provides a function filter, which collides with (and masks) the filter function loaded by default in the stats package. So if I want to use the stats version of the function after loading dplyr, I'll need to call it with stats::filter.
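For instance (assuming dplyr is installed):

    library(dplyr)                  # dplyr::filter now masks stats::filter
    x <- ts(rnorm(10))
    # filter(x, rep(1/3, 3))        # fails: the bare name resolves to dplyr
    stats::filter(x, rep(1/3, 3))   # the moving-average filter from stats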
This also gives motivation for not loading lots of packages. If you really only want one function from a package, it can be better to use :: than load the whole package, especially if you know the package will mask other functions you want to use.
Not in code, but in text, I do find :: very useful. It's much more concise to type stats::filter than "the filter function from the stats package".
From a performance perspective, there is a (very) small price for using ::. Long-time R-Core development team member Martin Maechler wrote on the r-devel mailing list (Sept 2017):
Many people seem to forget that every use of :: is an R
function call and using it is inefficient compared to just using
the already imported name.
The performance penalty is very small, on the order of a few microseconds, so it's only a concern when you need highly optimized code. Running a line of code that uses :: one million times will take a second or two longer than code that doesn't use ::.
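You can get a rough measurement yourself (assuming the microbenchmark package is available):

    # Each stats::rnorm() call pays for the `::` lookup; the difference
    # is on the order of a microsecond or two per call:
    library(microbenchmark)
    microbenchmark(
      qualified = stats::rnorm(1),
      bare      = rnorm(1),
      times     = 10000
    )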
As far as portability goes, it's nice to explicitly load packages at the top of a script: it makes it easy to glance at the first few lines, see what packages are needed, and install them if necessary before getting too deep into anything else (say, halfway through a long process that now can't be completed without starting over).
Aside: a similar argument can be made to prefer library() over require(). library() will cause an error and stop if the package isn't there, whereas require() will warn but continue. If your code has a contingency plan for a missing package, then by all means use if (require(package)) ..., but if your code will fail without a package, use library(package) at the top so it fails early and clearly.
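A minimal sketch of both patterns (somePkg and do_thing are hypothetical names):

    # require() returns FALSE and warns when the package is missing,
    # which suits a contingency branch:
    if (require(somePkg)) {
      result <- somePkg::do_thing()
    } else {
      message("somePkg not available; falling back to base R")
    }

    # library() stops with an error immediately if the package is missing,
    # so a script that cannot work without it fails early and clearly:
    library(somePkg)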
Within your own package
The general solution is to make your own package that imports the other packages you need to use in the DESCRIPTION file. Those packages will be automatically installed when your package is installed, so you can use pkg::fun internally. Or, by also importing them in the NAMESPACE file, you can import an entire package or selectively importFrom specific functions and not need ::. Opinions differ on this. Martin Maechler (same r-devel source as above) says:
Personally I've got the impression that :: is
much "overused" nowadays, notably in packages where I'd strongly
advocate using importFrom() in NAMESPACE, so all this happens
at package load time, and then not using :: in the package
sources itself.
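The NAMESPACE directive Maechler describes looks like this (using stringr::str_match from the question as the example):

    # In the package's NAMESPACE file, generated by roxygen2's
    # @importFrom tag or written by hand:
    importFrom(stringr, str_match)
    # Package code can then call str_match() directly; the lookup is
    # resolved once at load time instead of on every call.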
On the other hand, RStudio Chief Scientist Hadley Wickham says in his R Packages book:
It's common for packages to be listed in Imports in DESCRIPTION, but not in NAMESPACE. In fact, this is what I recommend: list the package in DESCRIPTION so that it’s installed, then always refer to it explicitly with pkg::fun(). Unless there is a strong reason not to, it's better to be explicit.
With two esteemed R experts giving opposite recommendations, I think it's fair to say that you should pick whichever style suits you best and meets your needs for clarity, efficiency, and maintainability.
If you frequently find yourself using just one function from another package, you can copy the code and add it to your own package. For example, I have a package for personal use that borrows %nin% from the Hmisc package because I think it's a great function, but I don't often use anything else from Hmisc. With roxygen2, it's easy to add @author and @references to properly attribute the code for a borrowed function. Also make sure the package licenses are compatible when doing this.
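A sketch of what that borrowed, attributed function might look like with roxygen2 (the one-line body mirrors Hmisc's definition):

    #' Negated value matching, borrowed from Hmisc
    #'
    #' @author Frank E. Harrell Jr (original author, in the Hmisc package)
    #' @references Hmisc package, https://cran.r-project.org/package=Hmisc
    `%nin%` <- function(x, table) match(x, table, nomatch = 0) == 0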


lineprof equivalent for Rcpp

The lineprof package in R is very useful for profiling which parts of function take up time and allocate/free memory.
Is there a lineprof() equivalent for Rcpp?
I currently use std::chrono::steady_clock and such to get chunk timings out of an Rcpp function. Alternatives? Does the RStudio IDE provide some help here?
To supplement @Dirk's answer...
If you are working on OS X, the Time Profiler Instrument, part of Apple's Instruments set of instrumentation tools, is an excellent sampling profiler.
Just to fix ideas:
A sampling profiler lets you answer the question, what code paths does my program spend the most time executing?
A (full) cache profiler lets you answer the question, which are the most frequently executed code paths in my program?
These are different questions -- it's possible that your hottest code paths are already optimized enough that, even though the total number of instructions executed in that path is very high, the amount of time required to execute them might be relatively low.
If you want to use Instruments to profile C++ code / routines used in an R package, the easiest way to go about it is:
1. Create a target pointed at your R executable, with appropriate command-line arguments to run whatever functions you wish to profile.
2. Set the command-line arguments to run the code that will exercise your C++ routines; for example, running Rcpp:::test() instruments all of the Rcpp test code.
3. Click the big red Record button, and off you go!
I'll leave the rest of the instructions for understanding Instruments + the Time Profiler to your google-fu + the documentation, but (if you're on OS X) you should be aware of this tool.
See any decent introduction to high(er)-performance computing, e.g., some slides from (older) presentations on my talks page, which include worked examples for both KCacheGrind (part of the KDE frontend to Valgrind) and Google Perftools.
In a more abstract sense, you need to come to terms with the fact that C++ != R, and not all tools have identical counterparts. In particular Rprof, the R profiler that several CRAN profiling packages build on top of, relies on the fact that R is interpreted. C++ is not, so things will be different. But profiling compiled code is about as old as compiling and debugging itself, so you will find numerous tutorials.
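To illustrate the limitation, a minimal sketch (the replicate() call is just a stand-in for an Rcpp-backed function):

    # Rprof() samples the R call stack, so time spent inside compiled code
    # is attributed to the calling wrapper, not to individual C++ lines:
    Rprof("prof.out")
    x <- replicate(100, sum(sort(rnorm(1e5))))  # stand-in for an Rcpp call
    Rprof(NULL)
    summaryRprof("prof.out")$by.self  # compiled internals appear as opaque entries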

pairwise.wilcox.test() after friedman.test() in R

Can I use pairwise.wilcox.test() as a post hoc test, since my friedman.test() came back significant?
I can't install pgirmass for the friedmanmc() function as it's not compatible with my R version.
Does pairwise.wilcox.test() make sense for more than two samples?
Thank you for your help!
You haven't offered a specific example or an explanation of the study design and hypotheses being tested, but the documentation does say that "corrections for multiple testing" are made, so you should be reasonably safe on statistical grounds. (There is some debate about the need for multiple-comparisons tests.)
On the topic of the other package: you are misspelling its name, and there is a current version available from CRAN as pkg:pgirmess. After reading the documentation of the two tests, I would probably trust pairwise.wilcox.test more than friedmanmc, because it is in a core R package, while friedmanmc appears to have undesirable behavior that gets suppressed in an awkward fashion, leading me to think it uses something of a statistical hack. I'm not encouraging you to do so, but if your unstated R version is somewhat older, there may be suitable package versions, since I see versions going back to 2005 in the Archives.
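As a minimal sketch with made-up data (three treatments measured on the same ten subjects, the blocked design friedman.test() expects):

    set.seed(42)
    scores <- data.frame(
      subject   = factor(rep(1:10, each = 3)),
      treatment = factor(rep(c("A", "B", "C"), times = 10)),
      value     = rnorm(30) + rep(c(0, 0.5, 1), times = 10)
    )

    # Global test across the three related samples
    friedman.test(value ~ treatment | subject, data = scores)

    # Post hoc pairwise comparisons: paired = TRUE matches the blocked design,
    # and p.adjust.method supplies the multiple-testing correction noted above
    with(scores, pairwise.wilcox.test(value, treatment,
                                      p.adjust.method = "holm", paired = TRUE))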

How do I ensure R / Rcpp code is reproducible ("distributable")?

I've written some R code for a dissertation, relying on some external packages (e.g., plyr and reshape) and writing a couple of relatively simple inline C++ functions using inline and RcppArmadillo.
I would like to ensure it can be performed "as is" on computers others than my own (Win64), for research reproducibility purposes.
My question: suppose I included code for installing the required packages, would the RcppArmadillo (and Rcpp and inline) packages be sufficient to compile the functions written with RcppArmadillo, or would the end user need to change system paths for compilation on his Windows machine? If not, is it possible/recommended to save the compiled functions on my end and include them in the R code I'm shipping?
Also, in the unlikely case that the code should be run some time later (say, a couple of years), is it sufficient to include a full R installation with the relevant packages in their current versions to make the code "future-proof"?
I hope the question is clear.
I think you mean your code to be "distributable" and "executable by someone else", which is a looser requirement. Being "reproducible" implies that the previous requirement is true, and adds the restriction that the results are identical (up to an epsilon of your choice).
And the usual answer for 'how can I let others run my R code' is to create a package.
For Rcpp-related code, we have an entire vignette devoted to building your own package with your Rcpp-using code. The vignette is called 'Rcpp-package'.
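As a starting point, a minimal sketch (myThesisPkg is a hypothetical name):

    # Scaffold a package to carry the C++ code; end users then get the
    # compilation step handled for them by install.packages():
    library(Rcpp)
    Rcpp.package.skeleton("myThesisPkg", example_code = TRUE)
    # Next, add RcppArmadillo to the LinkingTo: and Imports: fields of
    # DESCRIPTION, move your C++ sources into src/, and build with
    # R CMD build / R CMD check.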
