How long should I keep a deprecated function?

For R packages, is there a guideline for how long I should keep around an obsolete function once I deprecate it? What about demoting to defunct? De-thoughts?

I admit that this question is mostly opinion based, but here are some thoughts.
Hadley Wickham ran a poll on twitter asking a very similar question, and the majority of people voted that he should wait 4-5 years before removing a deprecated function. Granted, it was the shortest available option in the poll, but the next most popular answer was 10+ years.
Now, it of course depends on how you are deprecating it. The documentation of the {lifecycle} package gathers some thoughts around function deprecation and is worth a read. The {lifecycle} article suggests first issuing a warning only when the function is called from the global environment or from testthat tests (i.e. not when other developers use your soft-deprecated function in their packages), then issuing a warning regardless of where the function is called from, and finally having the function throw an error.
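For concreteness, here is a minimal sketch of what those three stages can look like with {lifecycle}'s helpers; the function names and version numbers are invented for illustration:
# Stage 1: soft-deprecated -- warns only when called from the global
# environment or from the package's own tests
old_fun <- function(x, y) {
  lifecycle::deprecate_soft("1.1.0", "old_fun()", "new_fun()")
  new_fun(x, y)
}
# Stage 2: deprecated -- warns wherever it is called from
old_fun <- function(x, y) {
  lifecycle::deprecate_warn("1.2.0", "old_fun()", "new_fun()")
  new_fun(x, y)
}
# Stage 3: defunct -- errors when called
old_fun <- function(x, y) {
  lifecycle::deprecate_stop("1.3.0", "old_fun()", "new_fun()")
}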
Things change a little when you have released very similar functionality under a different function name, which is called 'superseding' a function. I think this is fine for any update. My advice would be to use something along the following lines:
# Your function that supersedes the older one
plus <- function(x, y) x + y

# Your deprecated function
add <- function(x, y) {
  warning("`add()` is deprecated. Use `plus()` instead.")
  plus(x, y)
}
Another thing you could do is scour the internet (or GitHub) with a verbatim search for library(your_package) or your_package::my_to_be_deprecated_function to see if and how other people are using it in code that they share (and whether that code is just a one-off analysis script or intended to be re-used by others). If you find someone has a use case that cannot easily be superseded with another function, you might think about keeping the function around for longer, or soft-deprecating it instead of hard-deprecating it.
If you find that people barely use your function, your package isn't on a centralised repository (CRAN / Bioconductor), or you haven't advertised the function anywhere, I personally think you're allowed to take more liberties than if you're the developer behind the base::data.frame() function.

R Development: Use of `::` Operator for `base` Package

TLDR
Does rigorous best practice recommend that a proactive R developer explicitly disambiguate all base functions — even the ubiquitously common functions like c() or cat() — within the .R files of their package, using the package::function() convention?
Context
Though a novice developer, I need to create a (proprietary) package in R. The text R Packages (by the authoritative authors Hadley Wickham and Jenny Bryan) has proven extremely helpful (if occasionally deprecated).
I am keen on following best practices from the start, so as to save myself time and effort down the road. As described in the text, the use of the :: operator can prevent current and
future conflicts, by disambiguating functions whose names are overloaded. To wit, the authors are careful to introduce each function with the package::function() convention, and they recommend its general use within the .R files of one's package.
However, their code examples often call functions that hail from the base package yet are unaccompanied by base::. Many base functions, like the ubiquitous c() or cat(), are used by R programmers in their sleep and (I imagine) are unlikely to ever be overloaded by a presumptuous developer. Nonetheless, it is confusing to see (for example) the juxtaposition of base::with() against (the base function) print(), all within a few lines of text.
...(These functions are inspired by how base::with() works.)
f <- function(x, sig_digits) {
  # imagine lots of code here
  withr::with_options(
    list(digits = sig_digits),
    print(x)
  )
  # ... and a lot more code here
}
I understand that the purpose of base::with() is to unambiguously introduce the with() function to the reader. However, the absence of base:: (within the code itself) seems to stick out like a sore thumb when the package is explicitly named for any function called from any other package. Given my inexperience, I don't feel comfortable assuming the authors' intent.
Question
Are the names of base functions sufficiently unique that using this convention — of calling base::function() for every function() within the base package — would not be worth it? That the risk of overloading the functions (at some point in the future) is far outweighed by the inconvenience (and sheer ugliness) of
my_vector <- base::c(1, 2, 3)
throughout one's .R files? If not, is there an established convention that would balance unambiguity with elegance?
As always, I am grateful for any help, especially on this, my first post to Stack Overflow.

Using functions from other packages - when to use package::function?

When making your own package for R, one often wants to make use of functions from a different package.
Maybe it's a plotting library like ggplot2, a data-manipulation package like dplyr, or some niche function.
However, when making a function that depends on functions in other packages, what is the appropriate way to call them? In particular, I am looking for examples of when to use
myFunction <- function(x) {
  example_package::example_function(x)
}
or
require(example_package)
myFunction <- function(x) {
  example_function(x)
}
When should I use one over the other?
If you're actually creating an R package (as opposed to a script to source, an R Project, or some other method), you should NEVER use library() or require(). Loading a package this way is not an alternative to using package::function(). You are essentially choosing between package::function() and function(), and as highlighted by @Bernhard, explicitly calling the package ensures consistency if there are conflicting names in two or more packages.
Rather than require(package), you need to worry about properly defining your DESCRIPTION and NAMESPACE files. There are many posts about that on SO and elsewhere, so I won't go into details; see here for example.
Using package::function() can help with the above if you are using roxygen2 to generate your package documentation (it will automatically generate a proper NAMESPACE file).
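As a rough sketch of that route (package and function names are just placeholders): declare the dependency in DESCRIPTION, then either qualify calls with :: or import specific functions with roxygen2 tags, which devtools::document() turns into NAMESPACE directives.
# In DESCRIPTION (not R code):
#   Imports:
#       withr

# In R/my_summary.R, either qualify each call ...
my_summary <- function(x, digits = 3) {
  withr::with_options(list(digits = digits), print(summary(x)))
}

# ... or import a specific function via a roxygen2 tag, which
# devtools::document() translates into importFrom(withr, with_options)
# in the NAMESPACE file:

#' Summarise with a fixed number of digits
#' @importFrom withr with_options
#' @export
my_summary2 <- function(x, digits = 3) {
  with_options(list(digits = digits), print(summary(x)))
}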
The double-colon variant :: has a clear advantage in the rare situations when the same function name is used by two packages. There is a function psych::alpha to calculate Cronbach's alpha as a measure of internal consistency and a function scales::alpha to modify color transparency. There are not that many examples, but then again, there are examples. dplyr even masks functions from the stats and base packages! (And the tidyverse keeps producing more and more entries in our namespaces. Should you use dplyr, you do not know whether a base function you use today will be masked by a future version of dplyr, leading to an unexpected runtime problem in your package in the future.)
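To make the masking point concrete, here is the kind of collision meant (both functions genuinely exist; the data is made up):
library(dplyr)  # attaching dplyr masks stats::filter(), among others

x <- c(1, 4, 2, 8, 5)

# stats::filter() does linear filtering on a numeric series
stats::filter(x, rep(1/2, 2))

# dplyr::filter() subsets rows of a data frame
dplyr::filter(data.frame(x = x), x > 3)

# A bare filter() call now silently means dplyr::filter(),
# which is exactly the ambiguity that :: avoids inside a package.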
None of that is a problem if you use the :: variant. Without it, you are only safe as long as the package attached last happens to be the one you mean, which you cannot rely on.
The require() (or library()) variant leads to shorter code overall, and it makes it obvious at what point in the code a missing package will trigger an error and thus become visible.
In general, both work well, and you are free to choose which of these admittedly small differences appears more important to you.

What is furrr's "black magic?"

I use the R package furrr for most of my parallelization needs, and basically never have issues with exporting things from my global environment to the cluster. Today I did and I have no idea why. The package documentation seems to describe the process by which global variables are sent to the clusters as "black magic." What is the black magic?
The furrr::future_options documentation says:
Global variables and packages
By default, the future package will perform black magic to look up the global variables and packages that your furrr call requires, and it will export these to each worker. However, it is not always perfect, and can be refined with the globals and packages arguments.
As a secondary question: is there an elegant way to tell it to do its black magic, but also to export something it missed? Or, are the choices a) all black magic, or b) hard code everything in the .options argument?
This doesn't totally answer the question, but I think it points you in the right direction. From the "Globals" section of this intro vignette:
It does this with help of the globals package, which uses static-code inspection to identify global variables. If a global variable is identified, it is captured and made available to the evaluating process.
There's also this "Common Issues with Solutions" vignette (which @michael helpfully linked above) that discusses some common "gotcha" things that result from the static code evaluation in the globals package.
I found my way here because my future_map() code was failing to find the variables I referenced inside a glue() call. That vignette explained exactly why this happens.
As to why your code was sometimes working and sometimes not, hard to say. But as you can see, there is sufficient complexity going on under the hood that I'm not surprised if some seemingly unrelated change broke something. (For me this change was cleaning up my code and using glue instead of paste :shrug:)
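For the secondary question, here is a sketch of the kind of workaround typically suggested (variable names invented; newer furrr versions use furrr_options() where the question's quote mentions future_options(), and the exact globals semantics can vary by version):
library(future)
library(furrr)
library(glue)
plan(multisession, workers = 2)

user <- "alice"  # only referenced inside a glue() template string below

# Static code inspection can miss `user`, because it never appears as a
# bare symbol in the code, only inside the glue string:
res <- future_map(1:3, ~ glue("{user} ran task {.x}"))

# If that fails on a worker with an error along the lines of
# "object 'user' not found", one fix is to name the missing global
# explicitly. Depending on the furrr/future version, an explicit
# character vector may replace rather than extend the automatic lookup,
# so you may need to list everything the call relies on.
res <- future_map(
  1:3,
  ~ glue("{user} ran task {.x}"),
  .options = furrr_options(globals = "user")
)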

Common functions that don't behave as expected with magrittr pipe?

There are some cases where the magrittr pipe (%>%) doesn't behave quite as one might expect, and which therefore take a little time to debug, generate reprexes for, discuss, and eventually refactor.
E.g.
piping to ls() and
piping to return()
Question
In the interest of proactively educating oneself, is there some compilation of known base (and perhaps tidy) R functions with which the pipe doesn't behave as one may expect?
Current solutions
We can periodically read the magrittr issues (and probably issues with other related tidy packages), and
Read relevant stack overflow questions.
Although these sources are somewhat disparate.
Is there a way to generate a list, or perhaps one is maintained somewhere, perhaps in documentation, on github, or elsewhere, so that it can be periodically reviewed?
There isn’t a way to generate such a list automatically, since you effectively need to know the semantics of a function to know why (and therefore whether) it would work inside a pipeline.
As a heuristic, any function that takes an envir parameter that's set to a default (as is the case for ls(), but also get(), exists(), etc.) will behave oddly with pipes.
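To illustrate that heuristic with assign(), whose envir argument defaults to the environment it is called from, which inside a magrittr pipeline is a temporary evaluation environment rather than your workspace (a small sketch, assuming magrittr is attached):
library(magrittr)

# Works as expected: creates x in the global environment
assign("x", 42)

# Looks equivalent, but the assignment happens inside the pipeline's
# temporary environment and is discarded, so no `y` appears afterwards
"y" %>% assign(42)
exists("y")  # FALSE

# Being explicit about the target environment restores the expected behaviour
"z" %>% assign(42, envir = globalenv())
exists("z")  # TRUE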
That said, if you understand how function evaluation works in R it is usually pretty obvious for most functions whether they will work. I therefore suggest reading up on the R model of evaluation. In particular, read the relevant chapters in Hadley Wickham’s Advanced R:
Chapter 6 Functions
Chapter 7 Environments
And, potentially, parts of
Part IV Metaprogramming
To be fair, that’s a lot of material. But having a good handle of how function evaluation and scoping (environments) work in R is crucial for a solid understanding of R anyway. Metaprogramming is more advanced and it’s more important to be aware of its existence than to immediately have a solid understanding of it.

Coding practice in R: what are the advantages and disadvantages of different styles?

The recent questions regarding the use of require versus :: raised the question about which programming styles are used when programming in R, and what their advantages/disadvantages are. Browsing through the source code or browsing on the net, you see a lot of different styles displayed.
The main trends in my code:
heavy vectorization: I play a lot with the indices (and nested indices), which results in rather obscure code sometimes but is generally a lot faster than other solutions.
e.g.: x[x < 5] <- 0 instead of x <- ifelse(x < 5, 0, x)
I tend to nest functions to avoid overloading the memory with temporary objects that I need to clean up. Especially with functions manipulating large datasets this can be a real burden. e.g.: y <- cbind(x, as.numeric(factor(x))) instead of f <- as.numeric(factor(x)); y <- cbind(x, f)
I write a lot of custom functions, even if I use the code only once in, e.g., an sapply() call. I believe it keeps things more readable without creating objects that remain lying around.
I avoid loops at all costs, as I consider vectorization to be a lot cleaner (and faster)
Yet, I've noticed that opinions on this differ, and some people tend to back away from what they would call my "Perl" way of programming (or even "Lisp", with all those brackets flying around in my code. I wouldn't go that far though).
What do you consider good coding practice in R?
What is your programming style, and how do you see its advantages and disadvantages?
What I do will depend on why I am writing the code. If I am writing a data analysis script for my research (day job), I want something that works but that is readable and understandable months or even years later. I don't care too much about compute times. Vectorizing with lapply et al. can lead to obfuscation, which I would like to avoid.
In such cases, I would use loops for a repetitive process if lapply made me jump through hoops to construct the appropriate anonymous function for example. I would use the ifelse() in your first bullet because, to my mind at least, the intention of that call is easier to comprehend than the subset+replacement version. With my data analysis I am more concerned with getting things correct than necessarily with compute time --- there are always the weekends and nights when I'm not in the office when I can run big jobs.
For your other bullets: I would tend not to inline/nest calls unless they are very trivial. If I spell out the steps explicitly, I find the code easier to read and therefore less likely to contain bugs.
I write custom functions all the time, especially if I am going to be calling the code equivalent of the function repeatedly in a loop or similar. That way I have encapsulated the code out of the main data analysis script into its own .R file, which helps keep the intention of the analysis separate from how the analysis is done. And if the function is useful, I have it for use in other projects etc.
If I am writing code for a package, I might start with the same attitude as my data analysis (familiarity) to get something I know works, and only then go for the optimisation if I want to improve compute times.
The one thing I try to avoid doing, is being too clever when I code, whatever I am coding for. Ultimately I am never as clever as I think I am at times and if I keep things simple, I tend not to fall on my face as often as I might if I were trying to be clever.
I write functions (in standalone .R files) for various chunks of code that conceptually do one thing. This keeps things short and sweet. I find debugging somewhat easier, because traceback() tells you which function produced the error.
I too tend to avoid loops, except when it's absolutely necessary. I feel somewhat dirty if I use a for() loop. :) I try really hard to do everything vectorized or with the apply family. This is not always the best practice, especially if you need to explain the code to another person who is not as fluent in apply or vectorization.
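As a toy illustration of the trade-off (not from the original post), the same column summary can be written either way; the vapply() version is compact, but the loop may be easier to explain to someone unfamiliar with the apply family:
# a small toy data frame
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# for() loop version
col_means_loop <- numeric(ncol(df))
for (i in seq_along(df)) {
  col_means_loop[i] <- mean(df[[i]])
}
names(col_means_loop) <- names(df)

# apply-family version
col_means_vapply <- vapply(df, mean, numeric(1))

all.equal(col_means_loop, col_means_vapply)  # TRUE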
Regarding the use of require vs ::, I tend to use both. If I only need one function from a certain package I use it via ::, but if I need several functions, I load the entire package. If there's a conflict in function names between packages, I try to remember and use ::.
I try to find a function for every task I'm trying to achieve. I believe someone before me has thought of it and made a function that works better than anything I can come up with. This sometimes works, sometimes not so much.
I try to write my code so that I can understand it. This means I comment a lot and construct chunks of code so that they somehow follow the idea of what I'm trying to achieve. I often overwrite objects as the function progresses. I think this keeps the transparency of the task, especially if you're referring to these objects later in the function. I think about speed when computing time exceeds my patience. If a function takes so long to finish that I start browsing SO, I see if I can improve it.
I have found that a good syntax editor with code folding and syntax coloring (I use Eclipse + StatET) saves me a lot of headaches.
Based on VitoshKa's post, I am adding that I use capitalizedWords (sensu Java) for function names and fullstop.delimited for variables. I see that I could have another style for function arguments.
Naming conventions are extremely important for the readability of the code. Inspired by R's S4 internal style, here is what I use (see the short sketch after this list):
camelCase for global functions and objects (like doSomething, getXyyy, upperLimit)
functions start with a verb
not exported and helper functions always start with "."
local variables and functions are all lower case and in "_" syntax (do_something, get_xyyy); this makes it easy to distinguish local from global and therefore leads to cleaner code
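Illustrated with invented names, that scheme looks roughly like this:
# Global/exported function: camelCase, starting with a verb
computeUpperLimit <- function(x, conf_level = 0.95) {
  clean_x <- .drop_missing(x)                   # "." prefix marks a helper
  upper_limit <- quantile(clean_x, conf_level)  # local variables in "_" syntax
  unname(upper_limit)
}

# Non-exported helper, prefixed with "."
.drop_missing <- function(x) x[!is.na(x)]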
For data juggling I try to use as much SQL as possible, at least for the basic things like GROUP BY averages. I like R a lot, but sometimes it's not much fun to realize that your research strategy was not good enough to find yet another function hidden in yet another package. For my cases SQL dialects do not differ much and the code is really transparent. Most of the time the threshold (when to start using R syntax) is rather intuitive to discover, e.g.
require(RMySQL)
# con is an open DBI connection to the MySQL database

# selection of variables alongside conditions in SQL is really transparent
# even if conditional variables are not part of the selection
statement <- "SELECT id, v1, v2, v3, v4, v5 FROM mytable
              WHERE this = 5
              AND that != 6"
mydf <- dbGetQuery(con, statement)

# some simple things get really tricky (at least in MySQL), but are simple in R:
# e.g. the standard deviation of each row of a numeric data frame `dframe`
dframe$rowsd <- apply(dframe, 1, sd)
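Since GROUP BY averages were mentioned above, a hypothetical query of that kind (table and column names invented) might look like:
# group-level averages are one-liners in SQL, again via the open connection con
group_means <- dbGetQuery(con, "
  SELECT grp, AVG(v1) AS mean_v1, AVG(v2) AS mean_v2, COUNT(*) AS n
  FROM mytable
  GROUP BY grp
")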
So I consider it good practice, and really recommend using an SQL database for your data in most use cases. I am also looking into TSdbi and saving time series in a relational database, but cannot really judge that yet.

Resources