What is furrr's "black magic"?

I use the R package furrr for most of my parallelization needs, and basically never have issues with exporting things from my global environment to the cluster. Today I did and I have no idea why. The package documentation seems to describe the process by which global variables are sent to the clusters as "black magic." What is the black magic?
The furrr::future_options documentation says:
Global variables and packages
By default, the future package will perform black magic to look up the global variables and packages that your furrr call requires, and it will export these to each worker. However, it is not always perfect, and can be refined with the globals and packages arguments.
As a secondary question: is there an elegant way to tell it to do its black magic, but also to export something it missed? Or, are the choices a) all black magic, or b) hard code everything in the .options argument?

This doesn't totally answer the question, but I think it points you in the right direction. From the "Globals" section of this intro vignette:
It does this with help of the globals package, which uses static-code inspection to identify global variables. If a global variable is identified, it is captured and made available to the evaluating process.
There's also the "Common Issues with Solutions" vignette (which @michael helpfully linked above) that discusses some common "gotchas" that result from the static code inspection done by the globals package.
I found my way here because my future_map() code was failing to find the variables I referenced inside a glue() call. That vignette explained exactly why this happens.
As to why your code was sometimes working and sometimes not, it's hard to say. But as you can see, there is enough complexity going on under the hood that I'm not surprised a seemingly unrelated change broke something. (For me that change was cleaning up my code and using glue instead of paste :shrug:)
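To make the glue() gotcha and the workaround concrete, here is a minimal sketch. The variable names are illustrative, it assumes the older furrr API where options are passed via future_options(), and it uses the future convention globals = structure(TRUE, add = ...) (documented by the future package) to keep the automatic lookup while also exporting a name it missed, which I believe also answers the secondary question.

library(future)
library(furrr)
library(glue)

plan(multisession, workers = 2)

first_name <- "Ada"

# May fail on a worker: static inspection can't see that `first_name` is
# needed, because it only appears inside glue()'s string template.
# future_map(1:2, ~ glue("Hello {first_name}, item {.x}"))

# Keep the automatic global detection, but also export the missed name:
future_map(
  1:2,
  ~ glue("Hello {first_name}, item {.x}"),
  .options = future_options(globals = structure(TRUE, add = "first_name"))
)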

What is the difference between :: and library? [duplicate]

I'm writing some R functions that employ some useful functions in other packages like stringr and base64enc. Is it good not to call library(...) or require(...) to load these packages first but to use :: to directly refer to the function I need, like stringr::str_match(...)?
Is it a good practice in the general case? Or what problems might it induce?
It all depends on context.
:: is primarily necessary if there are namespace collisions, functions from different packages with the same name. When I load the dplyr package, it provides a function filter, which collides with (and masks) the filter function loaded by default in the stats package. So if I want to use the stats version of the function after loading dplyr, I'll need to call it with stats::filter.
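As a minimal sketch of that masking (mtcars is just a convenient built-in data set):

library(dplyr)

filter(mtcars, cyl == 6)          # dplyr::filter, which masks stats::filter
stats::filter(1:10, rep(1/3, 3))  # the moving-average filter from stats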
This also gives motivation for not loading lots of packages. If you really only want one function from a package, it can be better to use :: than load the whole package, especially if you know the package will mask other functions you want to use.
Not in code, but in text, I do find :: very useful. It's much more concise to type stats::filter than "the filter function from the stats package".
From a performance perspective, there is a (very) small price for using ::. Long-time R Core development team member Martin Maechler wrote on the r-devel mailing list (Sept 2017):
Many people seem to forget that every use of :: is an R
function call and using it is inefficient compared to just using
the already imported name.
The performance penalty is very small, on the order of a few microseconds, so it's only a concern when you need highly optimized code. Running a line of code that uses :: one million times will take a second or two longer than code that doesn't use ::.
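If you want to see the overhead on your own machine, here is a rough sketch using the microbenchmark package; any cheap function works as the test case, and exact timings will vary.

library(microbenchmark)

microbenchmark(
  attached   = sum(1:10),        # name resolved through the search path
  namespaced = base::sum(1:10),  # extra `::` lookup on every call
  times = 1e5
)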
As far as portability goes, it's nice to explicitly load packages at the top of a script because it makes it easy to glance at the first few lines, see which packages are needed, and install any that are missing before getting too deep into anything else, rather than getting halfway through a long process that now can't be completed without starting over.
Aside: a similar argument can be made to prefer library() over require(). library() will throw an error and stop if the package isn't there, whereas require() will warn and continue. If your code has a contingency plan for when the package isn't there, then by all means use if (require(package)) ..., but if your code will fail without the package, you should use library(package) at the top so it fails early and clearly.
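A small sketch of the two styles, using the packages from the question as stand-ins:

# Fail early and clearly if the dependency is essential:
library(stringr)

# Branch if there is a sensible fallback:
if (require(base64enc)) {
  encoded <- base64enc::base64encode(charToRaw("hello"))
} else {
  encoded <- NA_character_
}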
Within your own package
The general solution is to make your own package and list the packages you need under Imports in its DESCRIPTION file. Those packages will be automatically installed when your package is installed, and you can then use pkg::fun internally. Alternatively, by also importing them in the NAMESPACE file, you can import an entire package or selectively importFrom() specific functions and avoid :: altogether. Opinions differ on this. Martin Maechler (same r-devel source as above) says:
Personally I've got the impression that :: is
much "overused" nowadays, notably in packages where I'd strongly
advocate using importFrom() in NAMESPACE, so all this happens
at package load time, and then not using :: in the package
sources itself.
On the other hand, RStudio Chief Scientist Hadley Wickham says in his R Packages book:
It's common for packages to be listed in Imports in DESCRIPTION, but not in NAMESPACE. In fact, this is what I recommend: list the package in DESCRIPTION so that it’s installed, then always refer to it explicitly with pkg::fun(). Unless there is a strong reason not to, it's better to be explicit.
With two esteemed R experts giving opposite recommendations, I think it's fair to say that you should pick whichever style suits you best and meets your needs for clarity, efficiency, and maintainability.
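For concreteness, here is a minimal sketch of the two styles written as roxygen2 comments; str_match() from the question stands in for the imported function, and the surrounding function names are made up:

# Style 1: import once into the NAMESPACE (here via roxygen2), then call
# the bare name inside the package.
#' @importFrom stringr str_match
extract_digits <- function(x) str_match(x, "[0-9]+")

# Style 2: list stringr under Imports in DESCRIPTION only, and qualify
# the call at every point of use.
extract_digits2 <- function(x) stringr::str_match(x, "[0-9]+")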
If you frequently find yourself using just one function from another package, you can copy the code and add it to your own package. For example, I have a package for personal use that borrows %nin% from the Hmisc package because I think it's a great function, but I don't often use anything else from Hmisc. With roxygen2, it's easy to add @author and @references tags to properly attribute the code for a borrowed function. Also make sure the package licenses are compatible when doing this.

Common functions that don't behave as expected with magrittr pipe?

There are some cases where the magrittr pipe (%>%) doesn't behave quite as might be expected, and that therefore take a little time to debug, turn into reprexes, discuss, and eventually refactor.
E.g.
piping to ls() and
piping to return()
Question
In the interest of proactively educating oneself, is there some compilation of known base (and perhaps tidy) R functions with which the pipe doesn't behave as one may expect?
Current solutions
We can periodically read the magrittr issues (and probably issues with other related tidy packages), and
Read relevant stack overflow questions.
Although these sources are somewhat disparate.
Is there a way to generate a list, or perhaps one is maintained somewhere, perhaps in documentation, on github, or elsewhere, so that it can be periodically reviewed?
There isn’t a way to generate such a list automatically, since you effectively need to know the semantics of a function to know why (and therefore whether) it would work inside a pipeline.
As a heuristic, any function with an envir parameter that has a default value (as in the case of ls(), but also get(), exists(), etc.) will behave oddly with pipes.
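Here is a minimal sketch of why, using ls(): inside a pipe the call is evaluated in a temporary environment set up by magrittr, so an envir default that means "the calling environment" no longer refers to the environment you probably had in mind. The exact output depends on the magrittr version, but it will not be "local_var".

library(magrittr)

f <- function() {
  local_var <- 1
  ls()             # "local_var": envir defaults to f()'s own frame
}

g <- function() {
  local_var <- 1
  1 %>% {ls()}     # lists magrittr's evaluation environment instead
}

f()
g()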
That said, if you understand how function evaluation works in R it is usually pretty obvious for most functions whether they will work. I therefore suggest reading up on the R model of evaluation. In particular, read the relevant chapters in Hadley Wickham’s Advanced R:
Chapter 6 Functions
Chapter 7 Environments
And, potentially, parts of
Part IV Metaprogramming
To be fair, that's a lot of material. But having a good handle on how function evaluation and scoping (environments) work in R is crucial for a solid understanding of R anyway. Metaprogramming is more advanced, and it's more important to be aware of its existence than to immediately have a solid understanding of it.

How long should I keep a deprecated function?

For R packages, is there a guideline for how long I should keep around an obsolete function once I deprecate it? What about demoting to defunct? De-thoughts?
I admit that this question is mostly opinion based, but here are some thoughts.
Hadley Wickham ran a poll on Twitter asking a very similar question, and the majority of people voted that he should wait 4-5 years before removing a deprecated function. Granted, that was the shortest available option in the poll, but the next most popular answer was 10+ years.
Now, it of course depends on how you are deprecating it. The documentation of the {lifecycle} package gathers some thoughts around function deprecation and is worth a read. The {lifecycle} article suggests first issuing a warning only when the function is called from the global environment or from testthat tests (i.e. not when other developers use your soft-deprecated function in their packages), then issuing a warning on every call, and finally having the function throw an error on every call.
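A minimal sketch of those three stages using lifecycle's deprecate_soft(), deprecate_warn(), and deprecate_stop() helpers; the function names and version numbers are invented for illustration:

# Stage 1: soft-deprecated -- warns only when called directly by the user
# (from the global environment) or from tests.
old_fun <- function(x) {
  lifecycle::deprecate_soft("1.1.0", "old_fun()", "new_fun()")
  new_fun(x)
}

# Stage 2: deprecated -- warn on every call:
#   lifecycle::deprecate_warn("1.2.0", "old_fun()", "new_fun()")

# Stage 3: defunct -- error on every call:
#   lifecycle::deprecate_stop("1.3.0", "old_fun()", "new_fun()")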
Things change a little when you have released very similar functionality under a different function name, which is called 'superseding' a function. I think this is fine for any update. My advice would be to use something along the following lines:
# Your function that supersedes the older one
plus <- function(x, y) x + y

# Your deprecated function
add <- function(x, y) {
  warning("`add()` is deprecated. Use `plus()` instead.")
  plus(x, y)
}
Another thing you could do is scour the internet (or GitHub) with a verbatim search for library(your_package) or your_package::my_to_be_deprecated_function to see if and how other people are using it in code that they share (and whether that code is just a one-off analysis script or intended to be re-used by others). If you find someone has a use case that cannot easily be superseded by another function, you might think about keeping it around for longer, or soft-deprecating it instead of hard-deprecating it.
If you find that people barely use your function, your package isn't on a centralised repository (CRAN/Bioconductor), or you haven't advertised the function anywhere, I personally think you're allowed to take more liberties than if you were the developer behind base::data.frame().

R code examples/best practices

I'm new to R and having a hard time piecing together information from various online sources about what is considered "good" practice when writing R code. I've read basic guides, but it's been hard to find information that is definitely up to date.
What are some examples of well written/documented S3 classes?
How about corresponding S4 classes?
What conventions do you use when commenting .R classes/functions? Do you put all of your comments in both .Rd files and .R files? Is synchronization of these files tiresome?
Whether to use S3, S4, or a package at all is mostly a style issue (as Dirk says), but I would suggest using one of them if you want a very well structured object (just as you would in any OOP language). For instance, the time series packages all provide time series objects (I believe they're all S3, with the exception of the its package) because that allows them to enforce certain behavior around the construction and usage of those objects. Similarly with the question about creating a package: it's a good idea to do this if you will be re-using your code frequently or if the code will be useful to someone else. It requires a little more effort, but the added organizational structure can easily make up for the cost.
Regarding S3 vs. S4 (discussed on R-Help here and here), the basic guideline is that S3 classes are more "quick and dirty", while S4 classes impose more rigid control over objects and types. If you're working on Bioconductor, you typically will use S4 (see, for instance, "S4 classes and methods").
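As a minimal sketch of the difference, here is the same toy class in both systems (the class and function names are invented for illustration):

# S3: "quick and dirty" -- the class is just an attribute, and method
# dispatch follows a naming convention.
new_temperature <- function(value) {
  structure(list(value = value), class = "temperature")
}
print.temperature <- function(x, ...) cat(x$value, "degrees\n")

# S4: a formal class definition with typed slots.
setClass("Temperature", slots = c(value = "numeric"))
setMethod("show", "Temperature", function(object) {
  cat(object@value, "degrees\n")
})

print(new_temperature(21))       # S3 dispatch to print.temperature()
new("Temperature", value = 21)   # S4: printing goes through show()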
I would recommend reading some of the following:
"A (Not So) Short Introduction to S4" by Christophe Genolini
"Programmers' niche: A simple class, in S3 and S4" by Thomas Lumley
"Brobdingnag: a ''hello world'' package using S4 methods" by Robin K. S. Hankin
"Converting packages to S4" by Douglas Bates
"How S4 Methods Work" by John Chambers
For documentation, Hadley's suggestion is spot on: Roxygen will make life easier and puts the documentation right next to the code. That aside, you may still want to provide other comments in your code beyond what Roxygen or the man files require, in which case it's a good practice to comment your code for other developers. Those comments will not end up in your package; they will only be visible in the source code.
For 3: use roxygen. It works like javadoc, taking comments in your source files and building the Rd files from them.
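As a minimal sketch of what that looks like in a .R source file (the function is invented for illustration), roxygen2 turns the #' block into the corresponding .Rd file, so the two never have to be synchronized by hand:

#' Mean absolute deviation from the mean
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be dropped first?
#' @return A single numeric value.
#' @export
mean_abs_dev <- function(x, na.rm = FALSE) {
  mean(abs(x - mean(x, na.rm = na.rm)), na.rm = na.rm)
}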
That's half a dozen or more questions bundled into one, which makes it difficult to answer.
So let's try from the inside out: first try to solve your RODBC wrapper problem, and a code representation will suggest itself. I would start with simple functions, and then maybe build a package around them. That already gives you some encapsulation.
Much of the rest is style. Some prominent R coders swear by S4, while others swear at it. You can always read other people's packages, as well as the code in R itself. And you can always re-implement your RODBC wrapper in different ways and then compare your own approaches.
Edit: Reflecting your updated and much shortened question: pick some packages from CRAN, in particular among those you use. I think you will quickly find some more interesting than others, according to your own style.
Somewhat more related to style than substance, but the Google R style guide is worth reading.
