I'm implementing an S4 class that contains a data.table, and attempting to implement [ subsetting of the object (as described here) such that it also subsets the data.table. For example (defining just i subsetting):
library(data.table)
.SuperDataTable <- setClass("SuperDataTable", representation(dt="data.table"))
setMethod("[", c("SuperDataTable", "ANY", "missing", "ANY"),
function(x, i, j, ..., drop=TRUE)
{
initialize(x, dt=x@dt[i])
})
d = data.table(a=1:4, b=rep(c("x", "y"), each=2))
s = new("SuperDataTable", dt=d)
At this point, subsetting with a numeric vector (s[1:2]) works as desired (it subsets the data.table in the slot). However, I'd like to add the ability to subset using an expression. This works for the data.table itself:
s@dt[b == "x"]
# a b
# 1: 1 x
# 2: 2 x
But not for the S4 [ method:
s[b == "x"]
# Error: object 'b' not found
The problem appears to be that arguments in the signature of the S4 method are not evaluated using R's traditional lazy evaluation; see here:
All arguments in the signature of the generic function will be
evaluated when the function is called, rather than using the
traditional lazy evaluation rules of S. Therefore, it's important to
exclude from the signature any arguments that need to be dealt with
symbolically (such as the first argument to function substitute).
This explains why it doesn't work, but not how one can implement this kind of subsetting, since i and j are included in the signature of the generic. Is there any way to have the i argument not be evaluated immediately?
You may be out of luck on this one. From the R developer notes,
Arguments appearing in the signature of the generic will be evaluated as soon as the generic function
is called; therefore, any arguments that need to take advantage of lazy evaluation must not be in
the signature. These are typically arguments treated literally, often via the substitute() function.
For example, if one wanted to turn substitute() itself into a generic, the first argument, expr,
would not be in the signature since it must not be evaluated but rather treated as a literal.
Furthermore, due to method caching,
All the arguments in the full signature are evaluated as described above, not just the active
ones. Otherwise, in special circumstances the behavior of the function could change for one
method when another method was cached, definitely undesirable.
I would follow the example from the data.table package writers and use an S3 object (see line 304 of R/data.table.R in their source code). Your S3 object can still create and manipulate an S4 object underneath to maintain the semi-static typing feature.
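A minimal sketch of that S3 route, assuming a plain data.frame in place of the data.table so that it runs without extra packages. The constructor new_super_df and the class name super_df are made up for illustration; this is not how data.table itself is implemented, only the same idea of capturing i with substitute() inside an S3 `[` method.

```r
# Hypothetical S3 wrapper around a data.frame (names are illustrative only)
new_super_df <- function(df) structure(list(df = df), class = "super_df")

`[.super_df` <- function(x, i, ...) {
  i_expr <- substitute(i)                     # capture i without evaluating it
  # Evaluate i with the columns in scope first, then the caller's variables
  rows <- eval(i_expr, x$df, parent.frame())
  new_super_df(x$df[rows, , drop = FALSE])
}

s <- new_super_df(data.frame(a = 1:4, b = rep(c("x", "y"), each = 2)))
res <- s[b == "x"]   # expression subsetting: b is looked up inside the data
res$df
s[3:4]$df            # ordinary numeric subsetting still works
```

Because only x is evaluated for S3 dispatch on `[`, the i argument reaches the method as an unevaluated promise, which is what makes substitute() usable here.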
We can't get extraordinarily clever:
‘[’ is a primitive function; methods can be defined, but the generic function is implicit, and cannot be changed.
Defining both an S3 and S4 method will dispatch the S3 method, which makes it seem like we should be able to route around the S4 call and dispatch it manually, but unfortunately the argument evaluation still occurs! You can get close by borrowing plyr::., which would give you syntax like:
s <- new('SuperDataTable', dt = as.data.table(iris))
s[.(Sepal.Length > 4), 2]
Not ideal, but closer than anything else.
I have seen the use of %||% within the Seurat package (e.g. line 1662) and was wondering what the meaning of this expression is.
You can define custom operators in R. Their names can be pretty much arbitrary, but they need to be delimited by %…%.
%||% is such an operator. It wasn't predefined in core R until version 4.4.0, which added it to base; before that you had to define it yourself, and Seurat did that, in R/utilities.R.
Its definition is however quite a common one, and can be found in many packages, not just Seurat. Its semantics are effectively this:
`%||%` = function (lhs, rhs) {
if (is.null(lhs)) rhs else lhs
}
That is: use the first operand, unless that is NULL. In that case, use the second operand.
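A small usage sketch of that definition. The list opts and its element names are invented for illustration; the typical use is supplying a default for a missing list element.

```r
# Same semantics as described above: fall back to rhs only when lhs is NULL
`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

opts <- list(colour = "red")         # hypothetical options list

colour <- opts$colour %||% "black"   # element exists, so it is used
size   <- opts$size   %||% 1         # opts$size is NULL, so the default is used
```

Note that only NULL triggers the fallback; values such as NA, FALSE or an empty string are returned unchanged.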
I recently thought about the ... argument of a function and noticed that R does not let me check the class of that object directly.
f <- function(...) {
class(...)
}
f(1, 2, 3)
## Error in class(...) : 3 arguments passed to 'class' which requires 1
Now with the quote
“To understand computations in R, two slogans are helpful:
• Everything that exists is an object.
• Everything that happens is a function call.”
— John Chambers
in my head I'm wondering: What kind of object is ...?
What an interesting question!
Dot-dot-dot ... is an object (John Chambers is right!) and it's a type of pairlist. Well, I searched the documentation, so I'd like to share it with you:
R Language Definition document says:
The ‘...’ object type is stored as a type of pairlist. The components of ‘...’ can be accessed in the usual pairlist manner from C code, but is not easily accessed as an object in interpreted code. The object can be captured as a list.
Another chapter defines pairlists in detail:
Pairlist objects are similar to Lisp’s dotted-pair lists.
Pairlists are handled in the R language in exactly the same way as generic vectors (“lists”).
Help on Generic and Dotted Pairs says:
Almost all lists in R internally are Generic Vectors, whereas traditional dotted pair lists (as in LISP) remain available but rarely seen by users (except as formals of functions).
And a nice summary is here at Stack Overflow!
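As the Language Definition quote says, the practical way to inspect ... from interpreted code is to capture it as a list. A minimal sketch; the function names dots_classes and count_dots are made up for illustration:

```r
# Capture ... as an ordinary list, then inspect each element
dots_classes <- function(...) {
  args <- list(...)                                    # evaluates and collects the arguments
  vapply(args, function(a) class(a)[1], character(1))  # first class of each argument
}

# ...length() counts the arguments without evaluating them (R >= 3.5.0)
count_dots <- function(...) ...length()

classes <- dots_classes(1, "a", TRUE)
n_args  <- count_dots(1, 2, 3)
```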
I read in several places that pipes in Julia only work with functions that take only one argument. This is not true, since I can do the following:
function power(a, b = 2) a^b end
3 |> power
> 9
and it works fine.
However, I can't completely get my head around the pipe. E.g., why does this not work?
3 |> power()
> MethodError: no method matching power()
What I would actually like to do is using a pipe and define additional arguments, e.g. keyword arguments so that it is actually clear which argument to pass when piping (namely the only positional one):
function power(a; b = 2) a^b end
3 |> power(b = 3)
Is there any way to do something like this?
I know I could use a workaround with the Pipe package, but to be honest it feels kind of clunky to write @pipe at the start of half of the lines.
In R, the magrittr package has convincing logic (in my opinion): it passes what's left of the pipe by default as the first argument to the function on the right. I'm looking for something similar.
power as defined in the first snippet has two methods. One with one argument, one with two. So the point about |> working only with one-argument methods still holds.
The kind of thing you want to do is called "partial application", and very common in functional languages. You can always write
3 |> (a -> power(a, 3))
but that gets clunky quickly. Other languages have syntax like power(%1, 3) to denote that lambda. There's discussion about adding something similar to Julia, but it's difficult to get right. The Pipe package is exactly the macro-based fix for it.
If you have control over the defined method, you can also implement methods with an interface that return partially applied versions as you like -- many predicates in Base do this already, e.g., ==(1). There's also the option of Base.Fix2(power, 3), but that's not really an improvement, if you ask me (apart from maybe being nicer to the compiler).
And note that magrittr's pipes are also "macro"-based. The difference is that argument passing in R is way more complicated, and you can't see from outside whether an argument is used as a value or as an expression (essentially, R passes a thunk containing the expression and a pointer to the parent environment, and automatically evaluates and caches it if you use it as a value; see substitute).
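The R side of that remark can be seen with a small sketch in R. The function names show_expr and ignore_arg are invented for illustration:

```r
# The same argument can be read as an expression and as a value
show_expr <- function(x) {
  expr  <- substitute(x)   # the unevaluated expression stored in the promise
  value <- x               # using x as a value forces (and caches) the promise
  list(expr = deparse(expr), value = value)
}

# An argument that is never used is never evaluated
ignore_arg <- function(x) "never touched"

res     <- show_expr(1 + 2)
skipped <- ignore_arg(stop("this error is never raised"))
```

From the call site alone there is no way to tell which of the two treatments a function applies to its argument.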
If I need to treat R objects in different ways according to their class, I can either use if and else within a single function:
foo <- function (x) {
if (inherits(x, 'list')) {
# Foo the list
} else if (inherits(x, 'numeric')) {
# Foo the numeric
} else {
# Throw an error
}
}
Or I can define a method:
foo <- function (x) UseMethod('foo')
foo.list <- function (x) {
# Foo the list
}
foo.numeric <- function (x) {
# Foo the numeric
}
What are the advantages to each approach? Are there performance implications?
OK, there is some background to be covered to answer this question (in my view)...
Within R, the class of an object is explicit in situations where you have user-defined object structures or an object such as a factor vector or data frame where other attributes play an important part in the handling of the object itself—for example, level labels of a factor vector, or variable names in a data frame, are modifiable attributes that play a primary role in accessing the observations of each object.
Note, however, that elementary R objects such as vectors, matrices, and arrays, are implicitly classed, which means the class is not identified with the attributes function. Whether implicit or explicit, the class of a given object can always be retrieved using the attribute-specific function class.
When a generic function foo is applied to an object with class attribute c("first", "second"), the system searches for a function called foo.first and, if it finds it, applies it to the object. If no such function is found, a function called foo.second is tried. If no class name produces a suitable function, the function foo.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.
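The lookup order described above can be sketched as follows. The generic describe and its methods are hypothetical names chosen so as not to clash with foo:

```r
describe <- function(x) UseMethod("describe")
describe.second  <- function(x) "second method"
describe.default <- function(x) "default method"

# Class attribute c("first", "second"): describe.first does not exist,
# so dispatch moves on to describe.second
obj <- structure(list(), class = c("first", "second"))
from_obj <- describe(obj)

# No class attribute: the implicit class of 1:3 is tried (integer, then numeric);
# neither has a method here, so describe.default is used
from_int <- describe(1:3)
```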
The function class prints the vector of names of classes an object inherits from.
class <- sets the classes an object inherits from.
inherits() indicates whether its first argument inherits from any of the classes specified in the what argument. Method dispatch takes place based on the class of the first argument to the generic function. If which is TRUE then an integer vector of the same length as what is returned. Each element indicates the position in the class(x) matched by the element of what; zero indicates no match. If which is FALSE then TRUE is returned by inherits if any of the names in what match with any class.
All but inherits() are primitive functions.
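A short illustration of inherits() with and without which, using a made-up object whose class attribute is c("first", "second"):

```r
x <- structure(list(), class = c("first", "second"))

any_match <- inherits(x, c("third", "second"))                   # TRUE: "second" matches
positions <- inherits(x, c("second", "third", "first"), which = TRUE)
# positions gives, for each name in `what`, its position in class(x), or 0 if absent
```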
Considerations
OK, so let us now consider your examples in reverse order...
foo <- function (x) UseMethod('foo')
foo.list <- function (x) {
# Foo the list
}
foo.numeric <- function (x) {
# Foo the numeric
}
Now, if we use the function methods():
methods(foo)
[1] foo.list foo.numeric
see '?methods' for accessing help and source code
> getS3method('foo','list')
function (x) {
# Foo the list
}
Thus we have a generic function foo with two associated methods, foo.list and foo.numeric; that is, foo knows how to handle list and numeric arguments.
OK, now let's consider your first example...
function (x) {
if (inherits(x, 'list')) {
# Foo the list
print(paste0("List: ", x))
} else if (inherits(x, 'numeric')) {
# Foo the numeric
print(paste0("Numeric: ", x))
} else {
# Throw an error
print(paste0("Unhandled - Sorry!"))
}
}
The problem is that this is not an S3 generic; it is an ordinary R function. If you run methods() against foo, it returns "no methods found":
> methods(foo)
no methods found
> getS3method('foo','list')
Error in getS3method("foo", "list") : no function 'foo' could be found
So what is happening in the if/else example? inherits() simply checks the class of its first argument against the class names you supply; no method dispatch is involved.
So your first example is simply looking up the class of the function argument x; no S3 generic or method is created or exists.
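One practical consequence is that the two versions need not agree on every input. A sketch, using the hypothetical names foo_if and foo_s3 for the two variants: inherits() compares against class(x) only, whereas S3 dispatch also tries the implicit class, so an integer vector is handled differently.

```r
foo_if <- function(x) {
  if (inherits(x, "list")) "list"
  else if (inherits(x, "numeric")) "numeric"
  else "unhandled"
}

foo_s3 <- function(x) UseMethod("foo_s3")
foo_s3.list    <- function(x) "list"
foo_s3.numeric <- function(x) "numeric"

dbl_if <- foo_if(1.5)   # class(1.5) is "numeric", so both versions agree
dbl_s3 <- foo_s3(1.5)
int_if <- foo_if(1L)    # class(1L) is "integer": inherits(1L, "numeric") is FALSE
int_s3 <- foo_s3(1L)    # implicit class is c("integer", "numeric"): foo_s3.numeric runs
```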
What are the advantages to each approach? Are there performance implications?
OK, I am biased here but an object’s class is one of the most useful attributes for describing an entity in R. Every object you create is identified, either implicitly or explicitly, with at least one class. R is an object-oriented programming language, meaning entities are stored as objects and have methods that act upon them.
So the second approach is the way to go in my opinion. Why? Because you are truly using the language construct as intended. The first approach where you use inherits() explicitly feels like a hack. Readability is key to comprehension from my personal perspective, thus I worry that a person reading the first example might be led to ask the question "Why did they (the programmer) take said approach, what am I missing?". My concern then is that complexity is to be avoided as it can impede code comprehension. Thus, keep it simple is advantageous to code comprehension.
As for performance, an if/else chain is likely to have slightly less overhead than S3 method dispatch, but the two approaches are not equivalent (one tests the class explicitly, the other delegates the lookup to the language), so the question is hard to answer in general. If it matters for your code, measure it.
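A rough way to check this on your own machine, as a sketch only: the names foo_if and foo_s3 are hypothetical, the loop count is arbitrary, and the timings will vary between machines and R versions, so no result is claimed here.

```r
foo_if <- function(x) {
  if (inherits(x, "list")) "list"
  else if (inherits(x, "numeric")) "numeric"
  else stop("unhandled class")
}

foo_s3 <- function(x) UseMethod("foo_s3")
foo_s3.list    <- function(x) "list"
foo_s3.numeric <- function(x) "numeric"

x <- 1.5
n <- 1e5   # arbitrary number of repetitions

t_if <- system.time(for (k in seq_len(n)) foo_if(x))
t_s3 <- system.time(for (k in seq_len(n)) foo_s3(x))

# Elapsed seconds for each variant
timings <- c(if_else = t_if[["elapsed"]], s3 = t_s3[["elapsed"]])
timings
```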
I hope the above points you in the right direction. Stay safe, good karma flying your way.
A couple of Book recommendations here:
R Inferno by Patrick Burns
Advanced R by Hadley Wickham
R for Everyone: Advanced Analytics and Graphics
In a by() function, I will use cor (correlation) to be the FUN there. However, I'd like to setup use="complete.obs" too.
I don't know how to pass this argument in the FUN = cor part.
For example,
by(data, INDICES=list(data$Age), FUN=cor)
Probably
by(data, INDICES=list(data$Age), FUN=cor, use = "complete.obs")
will work: any additional arguments to by are passed on to FUN.
If you start looking around at various R help files for functions like by, you may start to notice a curious 'argument' popping up over and over again: .... You're going to see an ellipsis listed along with all the other arguments to a function.
This is actually an argument itself. It will collect any other arguments you pass and hand them off to subsequent functions called later. The documentation will usually tell you what function these arguments will be handed to.
In this case, in ?by we see this:
... further arguments to FUN.
This means that any other arguments you pass to by that don't match the ones listed will be handed off to the function you pass to FUN.
Another common instance can be found in plot, where the documentation only lists two specific arguments, x and y. Then there's the ... which gathers up anything else you pass to plot and hands it off to methods or to par to set graphical parameter settings.
So in @kohske's example, use = "complete.obs" will be automatically passed on to cor, since it doesn't match any of the other arguments for by.
@kohske and @joran give equivalent answers showing built-in features of by (which are also present in apply and the entire plyr family) for passing additional arguments to the supplied function, since this is a common application/problem. @Tomas also shows another way: specifying an anonymous function, which is just a function that calls the "real" function with certain parameters fixed. Fixing parameters to a function call (to effectively make a function with fewer arguments) is a common approach, especially in functional approaches to programming; in that context it is called currying or partial application.
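The forwarding mechanism itself can be sketched with a small function of your own. apply_by_group is a hypothetical helper, not part of base R; it only shows how ... hands unmatched arguments on to FUN:

```r
# Anything not matched by values, groups or FUN is collected in ... and passed to FUN
apply_by_group <- function(values, groups, FUN, ...) {
  lapply(split(values, groups), FUN, ...)
}

# na.rm = TRUE does not match a named argument of apply_by_group,
# so it travels through ... and ends up in mean()
res <- apply_by_group(c(1, NA, 3, 4), c("a", "a", "b", "b"), mean, na.rm = TRUE)
```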
library("functional")
by(data, INDICES=list(data$Age), FUN=Curry(cor, use = "complete.obs"))
This approach can be used when one function does not use ... to "pass along" arguments, and you want to indicate the only reason that an anonymous function is needed is to specify certain arguments.
In general, you have 2 possibilities:
1) specify the arguments in the calling function (tapply() or by() in this case). This also works even if the key argument to fun() is not the first one:
fun <- function(arg1, arg2, arg3) { ... } # just to show what fun() looks like
tapply(var1, var2, fun, arg1 = something, arg3 = something2)
# arg2 will be filled by tapply
2) you may write your wrapper function (sometimes this is needed):
tapply(var1, var2, function (x) { fun(something, x, something2) })
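A runnable sketch of both possibilities, with a made-up body for fun() and made-up data (the original leaves both unspecified):

```r
# Hypothetical fun(): the data argument is arg2, not the first argument
fun <- function(arg1, arg2, arg3) arg1 + sum(arg2) * arg3

var1 <- c(1, 2, 3, 4)
var2 <- c("a", "a", "b", "b")

# 1) Name the extra arguments; tapply fills the remaining one (arg2) with each group
r1 <- tapply(var1, var2, fun, arg1 = 10, arg3 = 2)

# 2) Wrap fun() in an anonymous function that fixes the other arguments
r2 <- tapply(var1, var2, function(x) fun(10, x, 2))
```

Both calls compute 10 + sum(group) * 2 for each group, so they give the same result.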