defining custom dplyr methods in R package - r

I have a package with custom summary(), print() methods for objects that have a particular class. This package also uses the wonderful dplyr package for data manipulation - and I expect my users to write scripts that use both my package and dplyr.
One roadblock, which has been noted by others in earlier questions, is that dplyr verbs don't preserve custom classes, meaning that an ungroup command can strip my data.frames of their custom classes and thus break method dispatch for summary, etc.
Hadley says "doing this correctly is up to you - you need to define a method for your class for each dplyr method that correctly restores all the classes and attributes" and I'm trying to take the advice - but I can't figure out how to correctly wrap the dplyr verbs.
Here's a simple toy example. Let's say I've defined a cars class, and I have a custom summary for it.
this works
library(tidyverse)
class(mtcars) <- c('cars', class(mtcars))
summary.cars <- function(x, ...) {
#gather some summary stats
df_dim <- dim(x)
quantile_sum <- map(x, quantile)  # use the object passed in, not the global mtcars
cat("A cars object with:\n")
cat(df_dim[[1]], 'rows and ', df_dim[[2]], 'columns.\n')
print(quantile_sum)
}
summary(mtcars)
here's the problem
small_cars <- mtcars %>% filter(cyl < 6)
summary(small_cars)
class(small_cars)
that summary call for small_cars just gives me the generic summary, not my custom method, because small_cars no longer retains the cars class after dplyr filtering.
what I tried
First I tried writing a custom method around filter (filter.cars). That didn't work, because filter is actually a wrapper around filter_ that allows for non-standard evaluation.
So I wrote a custom filter_ method for cars objects, attempting to implement @jwdink's advice
filter_.cars <- function(df, ...) {
old_classes <- class(df)
out <- dplyr::filter_(df, ...)
new_classes <- class(out)
class(out) <- c(new_classes, old_classes) %>% unique()
out
}
That doesn't work - I get an infinite recursion error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
All I want to do is grab the classes on the incoming df, hand off to dplyr, then return the object with the same classnames as it had before the dplyr call. How do I change my filter_ wrapper to accomplish that? Thanks!

UPDATE:
Some things have changed since my original answer:
Many dplyr verbs no longer remove custom classes; for example, dplyr::filter keeps the class. However, some — like dplyr::group_by — still remove the class, so this question lives on.
With R 3.5 and beyond, method lookup changed its scoping rules.
The trailing-underscore versions of the verbs are deprecated.
I recently ran into a hard-to-figure-out issue due to the second bullet, so I wanted to give a fuller example. Let's say you're using a custom class named custom_class, and you want to add a group_by method. Assuming you're using roxygen:
#' group_by.custom_class
#'
#' @description Preserve the class of a `custom_class` object.
#' @inheritParams dplyr::group_by
#'
#' @importFrom dplyr group_by
#'
#' @export
#' @method group_by custom_class
group_by.custom_class <- function(.data, ...) {
result <- NextMethod()
return(reclass(.data, result))
}
(see original answer for definition of reclass function)
Highlights:
You need @method group_by custom_class to add S3method(group_by,custom_class) to NAMESPACE
You need @importFrom dplyr group_by to add importFrom(dplyr,group_by) to your NAMESPACE
I believe in R < 3.5 you could get away with just that second one, but now you need both.
OLD ANSWER:
Further suggestions were offered in the thread so I thought I'd update with what seems to be best practice, which is to use NextMethod().
filter_.cars <- function(.data, ...) {
result <- NextMethod()
reclass(.data, result)
}
Where reclass is written by you; it's just a generic that (at least) adds the original class back on:
reclass <- function(x, result) {
UseMethod('reclass')
}
reclass.default <- function(x, result) {
class(result) <- unique(c(class(x)[[1]], class(result)))
result
}
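To see the NextMethod() plus reclass pattern in isolation, here is a minimal base-R sketch. The generic keep_rows and its data.frame method are hypothetical stand-ins for a dplyr verb that drops custom classes; only reclass follows the definition above.

```r
# Hypothetical generic standing in for a dplyr verb (illustration only).
keep_rows <- function(.data, rows) UseMethod("keep_rows")

# Stand-in for the "upstream" method: it deliberately drops custom classes,
# the way some dplyr verbs do.
keep_rows.data.frame <- function(.data, rows) {
  class(.data) <- "data.frame"
  .data[rows, , drop = FALSE]
}

# reclass as defined in the answer: put the original leading class back.
reclass <- function(x, result) UseMethod("reclass")
reclass.default <- function(x, result) {
  class(result) <- unique(c(class(x)[[1]], class(result)))
  result
}

# The method for the custom class delegates, then restores the class.
keep_rows.cars <- function(.data, rows) {
  result <- NextMethod()
  reclass(.data, result)
}

cars_df <- mtcars
class(cars_df) <- c("cars", class(cars_df))
small <- keep_rows(cars_df, cars_df$cyl < 6)
class(small)
# "cars" "data.frame"
```

Because NextMethod() moves on to the next class in the dispatch chain, the cars method is never re-entered, which is what avoids the recursion seen in the question.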

Your new filter_ method tries to apply to the new class within the definition, hence the recursion.
Following the advice in the issue you linked, try removing that new class prior to filter_ in your updated method.
class(out) <- class(out)[-1]
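One reading of that advice, sketched in base R so it runs without dplyr: strip the custom class from the input before calling the generic again, then put the saved classes back. The generic keep_rows is a hypothetical stand-in for filter_, not part of any package.

```r
# Hypothetical generic standing in for dplyr::filter_ (illustration only).
keep_rows <- function(df, rows) UseMethod("keep_rows")
keep_rows.data.frame <- function(df, rows) df[rows, , drop = FALSE]

keep_rows.cars <- function(df, rows) {
  old_classes <- class(df)
  # Remove the custom class first; otherwise the call below would dispatch
  # straight back to keep_rows.cars and recurse forever.
  class(df) <- setdiff(old_classes, "cars")
  out <- keep_rows(df, rows)
  # Restore the original classes on the result.
  class(out) <- unique(c(old_classes, class(out)))
  out
}

cars_df <- mtcars
class(cars_df) <- c("cars", class(cars_df))
small <- keep_rows(cars_df, cars_df$cyl < 6)
class(small)
# "cars" "data.frame"
```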

Related

Referring to package and function as arguments in another function

I am trying to find methods for specific functions across different packages in R. For example methods(broom::tidy) will return all methods for the function tidy in the package broom. For my current issue it would be better if I could have the methods function in another function like so:
f1 <- function(x,y){
methods(x::y)
}
(I removed other parts of the code that are not relevant to my issue.)
However when I run the function like this:
f1 <- function(x,y){ methods(x::y)}
f1(broom,tidy)
I get the error
Error in loadNamespace(name) : there is no package called ‘x’
If I try to modify it as to only change the function but keep the package the same I get a similar error :
f2 <- function(y){ methods(broom::y)}
f2(tidy)
Error: 'y' is not an exported object from 'namespace:broom'
How can I get the package and function name to evaluate properly in the function? Does this current issue have to do with when r is trying to evaluate/substitute values in the function?
Both the :: and methods() functions use non-standard evaluation in order to work. This means you need to be a bit more clever with passing values to the functions in order to get it to work. Here's one method
f1 <- function(x,y){
do.call("methods", list(substitute(x::y)))
}
f1(broom,tidy)
Here we use substitute() to expand the x and y values we pass in into the namespace lookup. That solves the :: part, which you can see with
f2 <- function(x,y){
substitute(x::y)
}
f2(broom,tidy)
# broom::tidy
We need the substitute because there could very well be a package x with function y. For this reason, variables are not expanded when using ::. Note that :: is just a wrapper to getExportedValue() should you otherwise need to extract values from namespaces using character values.
But there is one other catch: methods() doesn't evaluate its parameters; it uses the raw expression to find the methods. This means we don't actually need the value of broom::tidy; we need to pass that literal expression. Since we have to evaluate the substitute() call to get that expression, we build the call with do.call(), which evaluates the substitute() and passes the resulting expression on to methods().
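If character strings are acceptable as inputs, non-standard evaluation can be sidestepped entirely. The sketch below uses getExportedValue(), mentioned above, together with the fact that methods() also accepts the name of a generic as a character string. The helper names get_exported and list_methods are made up for this example, and stats/median are used only because they ship with R.

```r
# Programmatic equivalent of pkg::fun, taking both parts as strings.
get_exported <- function(pkg, fun) {
  getExportedValue(pkg, fun)
}

# methods() accepts a character string naming the generic, so the argument
# can be passed along like any ordinary value.
list_methods <- function(fun) {
  methods(fun)
}

med <- get_exported("stats", "median")  # same object as stats::median
m <- list_methods("median")             # methods found for the generic "median"
```

Note that the string form of methods() looks the generic up by name only; it does not pin the lookup to one package the way broom::tidy does.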

Dispatch of `rbind` and `cbind` for a `data.frame`

Background
The dispatch mechanism of the R functions rbind() and cbind() is non-standard. I explored some possibilities of writing rbind.myclass() or cbind.myclass() functions when one of the arguments is a data.frame, but so far I do not have a satisfactory approach. This post concentrates on rbind, but the same holds for cbind.
Problem
Let us create an rbind.myclass() function that simply echoes when it has been called.
rbind.myclass <- function(...) "hello from rbind.myclass"
We create an object of class myclass, and the following calls to rbind all
properly dispatch to rbind.myclass()
a <- "abc"
class(a) <- "myclass"
rbind(a, a)
rbind(a, "d")
rbind(a, 1)
rbind(a, list())
rbind(a, matrix())
However, when one of the arguments is a data frame (this need not be the first one), rbind() will call base::rbind.data.frame() instead:
rbind(a, data.frame())
This behavior is a little surprising, but it is actually documented in the
dispatch section of rbind(). The advice given there is:
If you want to combine other objects with data frames,
it may be necessary to coerce them to data frames first.
In practice, this advice may be difficult to implement. Conversion to a data frame may remove essential class information. Moreover, a user who is unaware of the advice may be stuck with an error or an unexpected result after issuing the command rbind(a, x).
Approaches
Warn the user
A first possibility is to warn the user that the call to rbind(a, x) should not be made when x is a data frame. Instead, the user of package mypackage should make an explicit call to a hidden function:
mypackage:::rbind.myclass(a, x)
This can be done, but the user has to remember to make the explicit call when needed. Calling the hidden function is something of a last resort, and should not be regular policy.
Intercept rbind
Alternatively, I tried to shield the user by intercepting dispatch. My first try was to provide a local definition of base::rbind.data.frame():
rbind.data.frame <- function(...) "hello from my rbind.data.frame"
rbind(a, data.frame())
rm(rbind.data.frame)
This fails, as rbind() is not fooled into calling rbind.data.frame from the .GlobalEnv, and calls the base version as usual.
Another strategy is to override rbind() by a local function, which was suggested in S3 dispatching of `rbind` and `cbind`.
rbind <- function (...) {
if (inherits(list(...)[[1]], "myclass")) return(rbind.myclass(...))  # inherits() also copes with missing or multiple classes
else return(base::rbind(...))
}
This works perfectly for dispatching to rbind.myclass(), so the user can now type rbind(a, x) for any type of object x.
rbind(a, data.frame())
The downside is that after library(mypackage) we get the message The following objects are masked from ‘package:base’: rbind .
While technically everything works as expected, there should be better ways than a base function override.
Conclusion
None of the above alternatives is satisfactory. I have read about alternatives using S4 dispatch, but so far I have not located any implementations of the idea. Any help or pointers?
As you mention yourself, using S4 would be one good solution that works nicely. I have not investigated this recently with data frames, as I am much more interested in other generalized matrices in both of my long-standing CRAN packages: 'Matrix' ("recommended", i.e. part of every R distribution) and 'Rmpfr'.
There are actually two different ways:
1) Rmpfr uses the new way to define methods for the '...' in rbind()/cbind().
This is well documented in ?dotsMethods (mnemonic: '...' = dots) and implemented in Rmpfr/R/array.R line 511 ff (e.g. https://r-forge.r-project.org/scm/viewvc.php/pkg/R/array.R?view=annotate&root=rmpfr)
2) Matrix uses the older approach by defining (S4) methods for rbind2() and cbind2(): If you read ?rbind it does mention that and when rbind2/cbind2 are used. The idea there: "2" means you define S4 methods with a signature for two ("2") matrix-like objects and rbind/cbind uses them for two of its potentially many arguments recursively.
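As a rough sketch of the second approach, an S4 method on rbind2() for a two-argument signature might look as follows. The class name myS4 and the returned string are invented for illustration and are not taken from Matrix. The example calls rbind2() directly; how rbind() routes S4 arguments to it is described in ?rbind and is not exercised here.

```r
library(methods)

# Hypothetical S4 class used only for this illustration.
setClass("myS4", representation(x = "character"))

# rbind2() is an S4 generic in package 'methods'; define a method for the
# case where the first argument is a myS4 object.
setMethod("rbind2", signature(x = "myS4", y = "ANY"),
          function(x, y, ...) "hello from rbind2 for myS4")

b <- new("myS4", x = "b")
res <- rbind2(b, data.frame())
res
# "hello from rbind2 for myS4"
```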
The dotsMethod approach was suggested by Martin Maechler and implemented in the Rmpfr package. We need to define a new generic, class and a method using S4.
setGeneric("rbind", signature = "...")
mychar <- setClass("myclass", slots = c(x = "character"))
b <- mychar(x = "b")
rbind.myclass <- function(...) "hello from rbind.myclass"
setMethod("rbind", "myclass",
function(..., deparse.level = 1) {
args <- list(...)
if(all(vapply(args, is.atomic, NA)))
return( base::rbind(..., deparse.level = deparse.level) )  # rbind, not cbind, inside the rbind method
else
return( rbind.myclass(..., deparse.level = deparse.level))
})
# these work as expected
rbind(b, "d")
rbind(b, b)
rbind(b, matrix())
# this fails in R 3.4.3
rbind(b, data.frame())
Error in rbind2(..1, r) :
no method for coercing this S4 class to a vector
I haven't been able to resolve the error. See
R: Shouldn't generic methods work internally within a package without it being attached?
for a related problem.
As this approach overrides rbind(), we get the warning The following objects are masked from 'package:base': rbind.
I don't think you're going to be able to come up with something completely satisfying. The best you can do is export rbind.myclass so that users can call it directly without doing mypackage:::rbind.myclass. You can call it something else if you want (dplyr calls its version bind_rows), but if you choose to do so, I'd use a name that evokes rbind, like rbind_myclass.
Even if you can get r-core to agree to change the dispatch behavior, so that rbind dispatches on its first argument, there are still going to be cases when users will want to rbind multiple objects together with a myclass object somewhere other than the first. How else can users dispatch to rbind.myclass(df, df, myclass)?
The data.table solution seems dangerous; I would not be surprised if the CRAN maintainers put in a check and disallow this at some point.
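The idea of an exported function with an rbind-like name could be sketched roughly as below. The wrapper name rbind_any is hypothetical; it checks every argument rather than only the first, so a myclass object in any position triggers the custom method, and it leaves base::rbind unmasked.

```r
rbind.myclass <- function(...) "hello from rbind.myclass"

# Hypothetical exported helper: dispatch by hand on *any* argument.
rbind_any <- function(..., deparse.level = 1) {
  args <- list(...)
  has_myclass <- any(vapply(args, inherits, logical(1), what = "myclass"))
  if (has_myclass) {
    rbind.myclass(...)
  } else {
    base::rbind(..., deparse.level = deparse.level)
  }
}

a <- structure("abc", class = "myclass")
r1 <- rbind_any(data.frame(), a)  # myclass is not first, still handled
r2 <- rbind_any(1:2, 3:4)         # no myclass argument: plain base::rbind
```

The cost is the one already noted in the answer: users have to remember to call the helper instead of rbind().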

I cannot find unique function from data.table package [duplicate]

help(unique) shows that unique function is present in two packages - base and data.table. I would like to use this function from data.table package. I thought that the following syntax - data <- data.table::unique(data) indicates the package to be used. But I get the following error -
'unique' is not an exported object from 'namespace:data.table'
But data <- unique(data) works well.
What is wrong here?
The function in question is really unique.data.table, an S3 method defined in the data.table package. That method is not really intended to be called directly, so it isn't exported. This is typically the case with S3 methods. Instead, the package registers the method as an S3 method, which then allows the S3 generic, base::unique in this case, to dispatch on it. So the right way to call the function is:
library(data.table)
irisDT <- data.table(iris)
unique(irisDT)
We use base::unique, which is exported, and it dispatches to data.table:::unique.data.table, which is not exported. The function data.table:::unique does not actually exist (nor does it need to).
As eddi points out, base::unique dispatches based on the class of the object it is called on. So base::unique will call data.table:::unique.data.table only if the object is a data.table. You can force a call to that method directly with something like data.table:::unique.data.table(iris), but internally that will most likely result in the next method getting called unless your object is actually a data.table.
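The "registered but not exported" situation can be imitated in a few lines of base R. In this sketch the generic describe and the class widget are invented for illustration; registerS3method() plays the role that the S3method() directive in a package NAMESPACE plays for data.table.

```r
# A toy generic with a default method.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"

# Register a method for class "widget" without creating a visible
# object called describe.widget.
local({
  hidden_method <- function(x, ...) "widget method"
  registerS3method("describe", "widget", hidden_method)
})

w <- structure(list(), class = "widget")
via_generic <- describe(w)                  # dispatch finds the registered method
is_visible  <- exists("describe.widget")    # no such object on the search path
```

Calling the generic is therefore the intended entry point: dispatch consults the registry even though the method cannot be reached by name.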
There are actually two infix operators in R that pull functions from particular package namespaces. You used :: but there is also a ::: that retrieves "unexported" functions. The unique-function is actually a family of functions and its behavior will depend on both the class of its argument and the particular packages that have been loaded. The R term for this is "generic". Since the unexported method is named unique.data.table, try:
data <- data.table:::unique.data.table(data) # assuming 'data' is a data.table
The other tool that lets you peek behind the curtain that the lack of "exportation" is creating is the getAnywhere-function. It lets you see the code at the console:
> unique.data.table
Error: object 'unique.data.table' not found
> getAnywhere(unique.data.table)
A single object matching ‘unique.data.table’ was found
It was found in the following places
registered S3 method for unique from namespace data.table
namespace:data.table
with value
function (x, incomparables = FALSE, fromLast = FALSE, by = key(x),
...)
{
if (!cedta())
return(NextMethod("unique"))
dups <- duplicated.data.table(x, incomparables, fromLast,
by, ...)
.Call(CsubsetDT, x, which_(dups, FALSE), seq_len(ncol(x)))
}
<bytecode: 0x2ff645950>
<environment: namespace:data.table>

Method initialisation in R reference classes

I've noticed some strange behaviour in R reference classes when trying to implement an optimisation algorithm. There seems to be some behind-the-scenes parsing magic involved in initialising methods, which makes it difficult to work with anonymous functions.
Here's an example that illustrates the difficulty: I define a function to optimise (f_opt), a function that runs optim on it, and a reference class that has these two as methods. The odd behaviour will be clearer in the code:
f_opt <- function(x) (t(x)%*%x)
do_optim_opt <- function(x) optim(x,f)
do_optim2_opt <- function(x)
{
f(x) #Pointless extra evaluation
optim(x,f)
}
optClass <- setRefClass("optClass",methods=list(do_optim=do_optim_opt,
do_optim2=do_optim2_opt,
f=f_opt))
oc <- optClass$new()
oc$do_optim(rep(0,2)) #Doesn't work: Error in function (par) : object 'f' not found
oc$do_optim2(rep(0,2)) #Works.
oc$do_optim(rep(0,2)) #Parsing magic has presumably happened, and now this works too.
Is it just me, or does this look like a bug to other people too?
This post in R-devel seems relevant, with workaround
do_optim_opt <- function(x, f) optim(x, .self$f)
Seems worth a post to R-devel.
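For completeness, here is a self-contained sketch of the .self$f workaround with the methods defined inline. The objective sum(x^2) and the starting point c(1, 1) are choices made for this example, not taken from the question.

```r
library(methods)

optClass <- setRefClass("optClass",
  methods = list(
    # Function to minimise: squared Euclidean norm.
    f = function(x) sum(x^2),
    # Refer to the method through .self so it is looked up on the object
    # rather than relied on as a free variable.
    do_optim = function(x) optim(x, .self$f)
  )
)

oc <- optClass$new()
res <- oc$do_optim(c(1, 1))
res$convergence  # 0 indicates optim() reported successful convergence
```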
