dplyr 0.7.0 tidyeval in packages

Preamble
I commonly use dplyr in my packages. Prior to 0.7.0, I was using the underscored versions of dplyr verbs to avoid NOTEs during R CMD CHECK. For example, the code:
x <- tibble::tibble(v = 1:3, w = 2)
y <- dplyr::filter(x, v > w)
would have yielded the R CMD CHECK NOTE:
* checking R code for possible problems ... NOTE
no visible binding for global variable ‘v’
By comparison, using the standard evaluation version:
y <- dplyr::filter_(x, ~v > w)
yielded no such NOTE.
However, in dplyr 0.7.0, the vignette Programming with dplyr says that the appropriate syntax for including dplyr functions in packages (to avoid NOTEs) is:
y <- dplyr::filter(x, .data$v > .data$w)
Consequently, the news file says that "the underscored version of each main verb is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility)."
Question
The vignette says that the above new syntax will not yield R CMD check NOTEs, "provided that you’ve also imported rlang::.data with @importFrom rlang .data." However, when I run the following code, I get an evaluation error:
y <- dplyr::filter(x, rlang::.data$v > rlang::.data$w)
Evaluation error: Object `From` not found in data.
Is this error similar to the following?
y <- dplyr::filter(x, v == dplyr::n())
Evaluation error: This function should not be called directly.
Namely, for some functions, calling them prefixed with the package yields errors? (Something to do with whether or not they've been exported, perhaps?)
Comment
As an aside, is there a less verbose way of writing package-friendly dplyr functions with the new syntax in 0.7.0? In particular, the syntax for dplyr >=0.7.0:
y <- dplyr::filter(x, .data$v > .data$w)
is more verbose than the syntax for dplyr <0.7.0:
y <- dplyr::filter_(x, ~v > w)
and the verbosity increases as more variables are referenced. However, I don't want to use the less verbose syntax with the underscored version, as it is deprecated.

for some functions, calling them prefixed with the package yields errors?
That's right, but we could make them work to make things more predictable. You can file a github issue for this feature.
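For reference, the pattern the vignette describes can be sketched as below: the pronoun is imported once through a roxygen2 tag and then used without the rlang:: prefix. The function name keep_v_above_w is made up for illustration, and calling it assumes dplyr is installed.

```r
# Hypothetical package function; the roxygen2 tag below adds
# importFrom(rlang, .data) to the package NAMESPACE.

#' @importFrom rlang .data
keep_v_above_w <- function(x) {
  # .data is written unqualified; it is resolved inside the data mask
  dplyr::filter(x, .data$v > .data$w)
}
```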
is there a less verbose way of writing package-friendly dplyr functions with the new syntax in 0.7.0?
The alternative is to declare all your column symbols to R, e.g. within a globalVariables(c("v", "w")) statement somewhere in your package.
Ideally, R should know about NSE functions and never warn for unknown symbols in those cases.
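A minimal sketch of that alternative, assuming the declaration lives in its own file of a hypothetical package. The file name and the helper filter_bare are invented for illustration; utils::globalVariables() and dplyr::filter() are the real functions involved.

```r
# R/globals.R (hypothetical file name)
# Declare the column names used through non-standard evaluation so that
# R CMD check does not report them as undefined global variables.
utils::globalVariables(c("v", "w"))

# With the declaration in place, bare column names can be used again.
# (Calling this helper requires dplyr to be installed.)
filter_bare <- function(x) {
  dplyr::filter(x, v > w)
}
```

The declaration applies to the whole package, so a genuinely misspelled variable named v or w elsewhere in the package would no longer be flagged either.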

Another work-around is to add lines such as
v <- NULL; # mark as not an unbound global reference
just above the expressions that are generating the R CMD check NOTEs. Compared with globalVariables(), it is no less accurate (column names are not in fact global variables), and its scope is limited to the function where the assignment appears.
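A small sketch of this work-around follows. To keep it runnable without extra packages, base subset() stands in for an NSE verb such as dplyr::filter(); the helper name keep_large_v is made up.

```r
# Hypothetical helper; subset() evaluates its condition inside the data
# frame, much like dplyr::filter() does.
keep_large_v <- function(x) {
  # Local dummy bindings so the code checker sees v and w as defined.
  # The data frame's columns still take precedence inside subset().
  v <- w <- NULL
  subset(x, v > w)
}

x <- data.frame(v = 1:3, w = 2)
keep_large_v(x)   # keeps only the row where v is 3
```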

Related

How does the PACKAGE argument to .Call work?

.Call seems rather poorly documented; ?.Call gives an explanation of the PACKAGE argument:
PACKAGE: if supplied, confine the search for a character string .NAME to the DLL given by this argument (plus the conventional extension, ‘.so’, ‘.dll’, ...).
This argument follows ... and so its name cannot be abbreviated.
This is intended to add safety for packages, which can ensure by using this argument that no other package can override their external symbols, and also speeds up the search (see ‘Note’).
And in the Note:
If one of these functions is to be used frequently, do specify PACKAGE (to confine the search to a single DLL) or pass .NAME as one of the native symbol objects. Searching for symbols can take a long time, especially when many namespaces are loaded.
You may see PACKAGE = "base" for symbols linked into R. Do not use this in your own code: such symbols are not part of the API and may be changed without warning.
PACKAGE = "" used to be accepted (but was undocumented): it is now an error.
But there are no usage examples.
It's unclear how the PACKAGE argument works. For example, in answering this question, I thought the following should have worked, but it doesn't:
.Call(C_BinCount, x, breaks, TRUE, TRUE, PACKAGE = "graphics")
Instead this works:
.Call(graphics:::C_BinCount, x, breaks, TRUE, TRUE)
Is this simply because C_BinCount is unexported? I.e., if the internal code of hist.default had added PACKAGE = "graphics", this would have worked?
This seems simple, but usage of this argument is really hard to find; none of the sources I found give it more than a passing mention (1, 2, 3, 4, 5). Examples of this actually working would be appreciated (even if it's just citing code found in an existing package).
(for self-containment purposes, if you don't want to copy-paste code from the other question, here are x and breaks):
x = runif(100000000, 2.5, 2.6)
nB <- 99
delt <- 3/nB
fuzz <- 1e-7 * c(-delt, rep.int(delt, nB))
breaks <- seq(0, 3, by = delt) + fuzz
C_BinCount is an object of class "NativeSymbolInfo", rather than a character string naming a C-level function, hence PACKAGE (which "confine(s) the search for a character string .NAME") is not relevant. C_BinCount is made a symbol by its mention in useDynLib() in the graphics package NAMESPACE.
As an R symbol, C_BinCount's resolution is subject to the same rules as other symbols -- it's not exported from the NAMESPACE, so it is only accessible via graphics:::C_BinCount, and for that reason it is also off-limits for robust code development. Since the C entry point is imported as a symbol, it is not available as a character string, so .Call("C_BinCount", ...) will not work.
Using a NativeSymbolInfo object tells R where the C code is located, so there is no need to do so again via PACKAGE; the choice to use the symbol rather than a character string is made by the package developer, and I think would generally be considered good practice. Many packages developed before the invention of NativeSymbolInfo use the PACKAGE argument; if I grep the Bioconductor source tree there are 4379 lines with .Call.*PACKAGE, e.g., here.
Additional information, including examples, is in Writing R Extensions section 1.5.4.
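To make the distinction concrete, the sketch below inspects the object from the question. It reaches into graphics with ::: purely for illustration, which is exactly what robust code should avoid; the package and routine names in the final comment are hypothetical.

```r
# C_BinCount is an R object created by useDynLib() in the graphics
# NAMESPACE, not a character string, so PACKAGE plays no role in finding it.
sym <- graphics:::C_BinCount
inherits(sym, "NativeSymbolInfo")

# Small inputs in place of the question's 1e8-element vector.
x <- c(0.5, 1.5, 1.7)
breaks <- c(0, 1, 2)
counts <- .Call(sym, x, breaks, TRUE, TRUE)  # right = TRUE, include.lowest = TRUE
counts

# PACKAGE only matters when .NAME is a character string, e.g.
# (hypothetical names, not run):
# .Call("my_routine", x, PACKAGE = "mypkg")
```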

packaging issues with a function that uses arules

I'm using R and trying to assemble a bunch of functions into a package. One of the functions uses the package arules to mine rules from a dataset, subset them and get other interest measures.
I'm having a problem with the line that subsets them.
rules <- apriori(trainingTrans, parameter = list(support = 0.005, confidence = 0.0, maxlen = 6))
rulesCases <- subset(rules, subset = rhs %in% "event")
The function works outside of the package as long as I've loaded arules, but doesn't work in the package whether I've set arules as a Depends, an Imports, or had the function call it with library(arules). The error displayed is 'match' requires vector arguments. I thought arules had its own version of match to get around that, so I tried arules::match(rhs, "event"), but I still have the same problem.
The issue is that it does not find the correct version of %in%. Possibly this works:
rulesCases <- subset(rules, subset = arules::"%in%"(rhs, "event"))
This should not be necessary if you import arules, but there seems to be something weird going on. I hope this will be resolved in a future release of arules.
I had the same problem in my package and was able to fix it:
The syntax subset(rules, subset = arules::"%in%"(rhs, "event")) forces the use of the correct version of %in% in the package, as Michael Hahsler noticed.
But rhs is then no longer resolved within rules, so it has to be specified explicitly as rules@rhs.
So the correct syntax should be subset(rules, subset = arules::"%in%"(rules@rhs, "event")).
It does the job for my package, with the DESCRIPTION file containing
LinkingTo: arules
Imports: arules
And no further uses of library(arules).

R Package development: overriding a function from one package with a function from another?

I am currently working on developing two packages, below is a simplified
version of my problem:
In package A I have some functions (say "sum_twice"), one of which calls
another function inside the package (say "slow_sum").
However, in package B, I wrote another function (say "fast_sum"), with
which I wish to replace the slow function in package A.
Now, how do I manage this "overriding" of the "slow_sum" function with the
"fast_sum" function?
Here is a simplified example of such functions (just to illustrate):
############################
##############
# Functions in package A
slow_sum <- function(x) {
  sum_x <- 0
  for(i in seq_along(x)) sum_x <- sum_x + x[i]
  sum_x
}
sum_twice <- function(x) {
  x2 <- rep(x,2)
  slow_sum(x2)
}
##############
# A function in package B
fast_sum <- function(x) { sum(x) }
############################
If I only do something like slow_sum <- fast_sum, this would not work, since "sum_twice" uses "slow_sum" from the NAMESPACE
of package A.
I tried using the following function when loading package "B":
assignInNamespace(x = "slow_sum", value = B:::fast_sum, ns = "A")
This indeed works, however, it makes the CRAN checks return both a NOTE on
how I should not use ":::", and also a warning for using assignInNamespace
(since it is supposed to not be very safe).
However, I am at a loss.
What would be a way to have "sum_twice" use "fast_sum" instead of
"slow_sum"?
Thank you upfront for any feedback or suggestion,
With regards,
Tal
p.s: this is a double post from here.
UPDATE: motivation for this question
I am developing two packages. One is based solely on R and works fine (but is a bit slow); it is dendextend (which is now on CRAN). The other is meant to speed up the first package by using Rcpp (this is dendextendRcpp, which is on GitHub). The second package speeds up the first by overriding some basic functions the first package uses. But in order for the higher-level functions in the first package to use the lower-level functions in the second package, I have to use assignInNamespace, which leads CRAN to throw warnings and NOTEs, which ended up getting the package rejected from CRAN (until these warnings are avoided).
The problem is that I have no idea how to approach this issue. I can think of only two solutions. One is merging the two packages (making the result harder to maintain, and automatically requiring a larger dependency structure for people who want to use the package). The other is to copy-paste the higher-level functions from dendextend to dendextendRcpp, and thus have them mask the other functions. But I find this to be MUCH less elegant (because it means I will need to copy-paste MANY functions, forcing more duplicated code maintenance). Any other ideas? Thanks.
We could put this in sum_twice:
my_sum_ch <- getOption("my_sum", if ("package:fastpkg" %in% search())
"fast_sum" else "slow_sum")
my_sum <- match.fun(my_sum_ch)
If the "my_sum" option were set then that version of my_sum would be used and if not it would make the decision based on whether or not fastpkg had been loaded.
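Put together with the functions from the question, the idea can be sketched as follows. "fastpkg" is a stand-in name for package B, and fast_sum is defined locally here only so the example runs on its own.

```r
slow_sum <- function(x) {
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]
  total
}

# Stand-in for the function package B would provide.
fast_sum <- function(x) sum(x)

sum_twice <- function(x) {
  # Default depends on whether the (hypothetical) fast package is attached;
  # an explicitly set "my_sum" option always wins.
  default <- if ("package:fastpkg" %in% search()) "fast_sum" else "slow_sum"
  my_sum <- match.fun(getOption("my_sum", default))
  my_sum(rep(x, 2))
}

sum_twice(1:3)                 # uses slow_sum unless the option is set
options(my_sum = "fast_sum")
sum_twice(1:3)                 # now dispatches to fast_sum
options(my_sum = NULL)         # restore the default behaviour
```

Because match.fun() looks the name up from the caller's environment, the replacement function has to be visible there, which is why the default checks that the fast package is on the search path.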
The solution I ended up using (thanks to Uwe and Kurt), is using "local" to create a localized environment with the package options. If you're curious, the function is called "dendextend_options", and is here:
https://github.com/talgalili/dendextend/blob/master/R/zzz.r
Here is an example for its use:
dendextend_options <- local({
  options <- list()
  function(option, value) {
    # ellipsis <- list(...)
    if(missing(option)) return(options)
    if(missing(value))
      options[[option]]
    else options[[option]] <<- value
  }
})
dendextend_options("a")
dendextend_options("a", 1)
dendextend_options("a")
dendextend_options("a", NULL)
dendextend_options("a")
dendextend_options()
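One way such a closure could address the original overriding problem is sketched below. This is an assumption about usage rather than a description of how dendextend wires it up: the names pkg_options and "sum_fun" are invented here, and base sum stands in for the faster function package B would register.

```r
# Package A: private option store, same shape as dendextend_options.
pkg_options <- local({
  options <- list()
  function(option, value) {
    if (missing(option)) return(options)
    if (missing(value)) {
      options[[option]]
    } else {
      options[[option]] <<- value
    }
  }
})

slow_sum <- function(x) {
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]
  total
}

sum_twice <- function(x) {
  # Use the registered replacement if there is one, else the slow default.
  f <- pkg_options("sum_fun")
  if (is.null(f)) f <- slow_sum
  f(rep(x, 2))
}

sum_twice(1:3)                 # computed by slow_sum

# Package B (for example in its .onLoad) registers a faster implementation.
pkg_options("sum_fun", sum)
sum_twice(1:3)                 # computed by sum
```

Nothing in package A's namespace is reassigned, so neither ::: nor assignInNamespace is needed.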

How is J() function implemented in data.table?

I recently learned about the elegant R package data.table. I am very curious to know how the J function is implemented there. This function is bound to the function [.data.table, it doesn't exist in the global environment.
I downloaded the source code but I cannot find the definition for this J function anywhere there. I found lockBind(".SD", ...), but not J. Any idea how this function is implemented?
Many thanks.
J() used to be exported before, but not since 1.8.8. Here's the note from 1.8.8:
o The J() alias is now removed outside DT[...], but will still work inside DT[...]; i.e., DT[J(...)] is fine. As warned in v1.8.2 (see below in this file) and deprecated with warning() in v1.8.4. This resolves the conflict with function J() in package XLConnect (#1747) and rJava (#2045). Please use data.table() directly instead of J(), outside DT[...].
Using R's lazy evaluation, J(.) is detected and simply replaced with list(.) using the (invisible) non-exported function .massagei.
That is, when you do:
require(data.table)
DT = data.table(x=rep(1:5, each=2L), y=1:10, key="x")
DT[J(1L)]
i (= J(1L)) is checked for its type and this line gets executed:
i = eval(.massagei(isub), x, parent.frame())
where isub = substitute(i) and .massagei is simply:
.massagei = function(x) {
  if (is.call(x) && as.character(x[[1L]]) %chin% c("J","."))
    x[[1L]] = quote(list)
  x
}
Basically, data.table:::.massagei(quote(J(1L))) gets executed which returns list(1L), which is then converted to data.table. And from there, it's clear that a join has to happen.
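The same rewriting step can be reproduced in base R. This is only a sketch of the mechanism: massage_i and lookup are made-up names, and %in% replaces data.table's internal %chin%.

```r
# Swap a leading J or . for list in an unevaluated call.
massage_i <- function(x) {
  if (is.call(x) && as.character(x[[1L]])[1L] %in% c("J", ".")) {
    x[[1L]] <- quote(list)
  }
  x
}

# Mimics what `[.data.table` does with its unevaluated i argument.
lookup <- function(i) {
  isub <- substitute(i)            # capture the call without evaluating it
  eval(massage_i(isub), parent.frame())
}

massage_i(quote(J(1L)))   # the call list(1L)
lookup(J(1L, "a"))        # evaluates list(1L, "a"); J itself is never called
```

Since the call is captured before evaluation, no function named J needs to exist anywhere, which is why it cannot be found as a definition in the source.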

Require minimum version of R package

I just noticed that there's no version argument to R's require() or library() functions. What do people do when they need to ensure they have at least some minimum version of a package, so that e.g. they know some bug is fixed, or some feature is available, or whatever?
I'm aware of the Depends stuff for package authors, but I'm looking for something to use in scripts, interactive environments, org-mode files, code snippets, etc.
You could use packageVersion():
packageVersion("stats")
# [1] ‘2.14.1’
if(packageVersion("stats") < "2.15.0") {
stop("Need to wait until package:stats 2.15 is released!")
}
# Error: Need to wait until package:stats 2.15 is released!
This works because packageVersion() returns an object of class package_version for which < behaves as we'd like it to (which < will not do when comparing two character strings using their lexicographical ordering).
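The difference is easiest to see with a two-digit component. The version strings below are arbitrary examples, and stats is used in the guard only because it is always installed.

```r
# Character comparison is lexicographic, so "0.10.0" sorts before "0.9.0".
"0.10.0" > "0.9.0"

# package_version objects are compared component by component.
package_version("0.10.0") > "0.9.0"

# The same comparison as a guard at the top of a script.
if (packageVersion("stats") < "2.15.0") {
  stop("stats >= 2.15.0 is required")
}
```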
After reading Paul's pseudocode, here's the function I've written.
use <- function(package, version=0, ...) {
  package <- as.character(substitute(package))
  library(package, ..., character.only=TRUE)
  pver <- packageVersion(package)
  if (compareVersion(as.character(pver), as.character(version)) < 0)
    stop("Version ", version, " of '", package,
         "' required, but only ", pver, " is available")
  invisible(pver)
}
It functions basically the same as library(), but takes an extra version argument:
> use(plyr, 1.6)
> use(ggplot2, '0.9')
Error in use(ggplot2, "0.9") :
Version 0.9 of 'ggplot2' required, but only 0.8.9 is available
I am not aware of such a function, but it should be quite easy to make one. You can base it on sessionInfo() or packageVersion(). After loading the packages required for the script, you can harvest the package numbers from there. A function that checks the version number would look like (in pseudo code, as I don't have time right now):
check_version <- function(pkg_name, min_version) {
  cur_version <- packageVersion(pkg_name)
  # package_version() parses the string so that < compares by component
  min_version <- package_version(min_version)
  if (cur_version < min_version)
    stop(sprintf("Package %s needs a newer version, found %s, need at least %s",
                 pkg_name, format(cur_version), format(min_version)))
}
Calling it would be like:
library(ggplot2)
check_version("ggplot2", "0.8-9")
The version strings need to be parsed into something that allows the comparison cur_version < min_version; package_version() does that here, so "0.8-9" and "0.8.9" are treated alike.

Resources