How does one correctly do the following:
I have a class SpectraSet with slots parentSpectrum, childSpectra, name (to keep it simple)
name is character()
parentSpectrum should contain one object of class ParentSpec (so it is of type ParentSpec)
childSpectra should contain n objects of class ChildSpec. However I can't make it of type ChildSpec because vectors can only contain atomic types. What is best practice in this case? I can make it a list() and type check in the validity check, but is there anything better?
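For concreteness, here is a sketch of what I mean by the list-plus-validity approach (the slot contents of ParentSpec and ChildSpec are invented here just for illustration):

setClass("ParentSpec", representation(mz = "numeric"))  # invented slot
setClass("ChildSpec",  representation(mz = "numeric"))  # invented slot

setClass("SpectraSet",
         representation(name           = "character",
                        parentSpectrum = "ParentSpec",
                        childSpectra   = "list"),
         validity = function(object) {
             ok <- vapply(object@childSpectra, is, logical(1), "ChildSpec")
             if (all(ok)) TRUE
             else "all elements of 'childSpectra' must be ChildSpec objects"
         })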
Here are related answers I've provided in the past.
It's usually better to re-think the class design so ChildSpec is intrinsically a vector -- minimally, supports length() and subsetting [, [[. Your problem above then goes away, the design is consistent with R's vectorized orientation, and likely common operations are efficient.
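As a rough sketch of what that vector-like design could look like (the slots mz and intensity are made up for illustration), a single ChildSpecs object holds the data for all n children in parallel vectors and supports length() and [:

.ChildSpecs <- setClass("ChildSpecs",
                        representation(mz = "numeric", intensity = "numeric"))

setMethod("length", "ChildSpecs", function(x) length(x@mz))

setMethod("[", "ChildSpecs", function(x, i, j, ..., drop = TRUE) {
    # subsetting returns another ChildSpecs holding the selected children
    initialize(x, mz = x@mz[i], intensity = x@intensity[i])
})

cs <- .ChildSpecs(mz = c(100.1, 200.2, 300.3), intensity = c(10, 20, 30))
length(cs)   # 3
cs[2:3]      # a ChildSpecs holding children 2 and 3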
An alternative to implementing your own type-checked list (really the only other option) is to re-use the infrastructure from Bioconductor's S4Vectors package
library(S4Vectors)   # provides SimpleList and the typed-list machinery

.X = setClass("X", representation(x="numeric"))
.XList = setClass("XList", contains="SimpleList",
    prototype=prototype(elementType="X"))
And in action
> xl = .XList(listData=list(.X(x=1), .X(x=2)))
> xl
XList of length 2
> xl[[2]]
An object of class "X"
Slot "x":
[1] 2
Related
If I need to treat R objects in different ways according to their class, I can either use if and else within a single function:
foo <- function (x) {
if (inherits(x, 'list')) {
# Foo the list
} else if (inherits(x, 'numeric')) {
# Foo the numeric
} else {
# Throw an error
}
}
Or I can define a method:
foo <- function (x) UseMethod('foo')
foo.list <- function (x) {
# Foo the list
}
foo.numeric <- function (x) {
# Foo the numeric
}
What are the advantages to each approach? Are there performance implications?
OK, there is some background to be covered to answer this question (in my view)...
Within R, the class of an object is explicit in situations where you have user-defined object structures or an object such as a factor vector or data frame where other attributes play an important part in the handling of the object itself—for example, level labels of a factor vector, or variable names in a data frame, are modifiable attributes that play a primary role in accessing the observations of each object.
Note, however, that elementary R objects such as vectors, matrices, and arrays, are implicitly classed, which means the class is not identified with the attributes function. Whether implicit or explicit, the class of a given object can always be retrieved using the attribute-specific function class.
When a generic function foo is applied to an object with class attribute c("first", "second"), the system searches for a function called foo.first and, if it finds it, applies it to the object. If no such function is found, a function called foo.second is tried. If no class name produces a suitable function, the function foo.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.
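For example, a minimal sketch (class names made up) showing that dispatch order, with results as comments:

bar <- function(x) UseMethod("bar")
bar.first   <- function(x) "dispatched to bar.first"
bar.second  <- function(x) "dispatched to bar.second"
bar.default <- function(x) "dispatched to bar.default"

obj <- structure(list(), class = c("first", "second"))
bar(obj)                                   # "dispatched to bar.first"
bar(structure(list(), class = "second"))   # "dispatched to bar.second"
bar(42)                                    # no bar.numeric, so bar.default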
The function class prints the vector of names of classes an object inherits from.
class<- sets the classes an object inherits from.
inherits() indicates whether its first argument inherits from any of the classes specified in the what argument. Method dispatch takes place based on the class of the first argument to the generic function. If which is TRUE then an integer vector of the same length as what is returned. Each element indicates the position in the class(x) matched by the element of what; zero indicates no match. If which is FALSE then TRUE is returned by inherits if any of the names in what match with any class.
All but inherits() are primitive functions.
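A quick illustration of the which= behaviour described above (results shown as comments):

x <- data.frame(a = 1:3)
class(x)                                              # "data.frame"
inherits(x, "data.frame")                             # TRUE
inherits(x, c("list", "data.frame"), which = TRUE)    # 0 1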
Considerations
OK, so let us now consider your examples in reverse order...
foo <- function (x) UseMethod('foo')
foo.list <- function (x) {
# Foo the list
}
foo.numeric <- function (x) {
# Foo the numeric
}
now if we use the function methods()
methods(foo)
[1] foo.list foo.numeric
see '?methods' for accessing help and source code
> getS3method('foo','list')
function (x) {
# Foo the list
}
Thus we have a generic function foo and two associated methods, foo.list and foo.numeric, so we now know that the generic foo has methods to support list and numeric operations.
OK, now let's consider your first example...
foo <- function (x) {
if (inherits(x, 'list')) {
# Foo the list
print(paste0("List: ", x))
} else if (inherits(x, 'numeric')) {
# Foo the numeric
print(paste0("Numeric: ", x))
} else {
# Throw an error
print(paste0("Unhandled - Sorry!"))
}
}
The problem is that this is not an S3 generic, it is just a plain R function. If you run methods() against foo it returns "no methods found":
> methods(foo)
no methods found
> getS3method('foo','list')
Error in getS3method("foo", "list") : no function 'foo' could be found
So what is happening in this first example? The inherits() call is simply checking the class of the argument x against the supplied class names. Your first example is therefore just looking up the class of the function argument x; no S3 generic or method is created or exists, and no method dispatch takes place.
What are the advantages to each approach? Are there performance implications?
OK, I am biased here but an object’s class is one of the most useful attributes for describing an entity in R. Every object you create is identified, either implicitly or explicitly, with at least one class. R is an object-oriented programming language, meaning entities are stored as objects and have methods that act upon them.
So the second approach is the way to go in my opinion. Why? Because you are truly using the language construct as intended. The first approach, where you use inherits() explicitly, feels like a hack. Readability is key to comprehension from my personal perspective, so I worry that a person reading the first example might be led to ask "Why did they (the programmer) take this approach, what am I missing?". Unnecessary complexity impedes code comprehension, so keeping it simple is the advantageous choice.
In reference to code performance, an if-else chain is generally going to be faster than S3 method dispatch, but the two mechanisms are not doing quite the same work (a method lookup is not equivalent to a simple condition test), so I feel the performance question is tricky to answer in this context: the two approaches are simply different.
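If you do want to measure it, something along these lines is a rough sketch of how I would compare the two (this assumes the microbenchmark package is installed; the exact numbers will vary by machine and R version):

library(microbenchmark)

foo_if <- function(x) {
    if (inherits(x, "list")) "list"
    else if (inherits(x, "numeric")) "numeric"
    else stop("unhandled class")
}

foo_s3 <- function(x) UseMethod("foo_s3")
foo_s3.list    <- function(x) "list"
foo_s3.numeric <- function(x) "numeric"

x <- c(1, 2, 3)
microbenchmark(foo_if(x), foo_s3(x), times = 1000L)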
I hope the above points you in the right direction. Stay safe, good karma flying your way.
A couple of Book recommendations here:
The R Inferno by Patrick Burns
Advanced R by Hadley Wickham
R for Everyone: Advanced Analytics and Graphics by Jared P. Lander
How does one apply a function to all slots of an S4 object?
Of course, it can be done with a for-loop over slotNames(), but I'm curious whether it can be done in a vectorized way.
In general it isn't possible to operate on slots in a vectorised way, because the slots might have any class. If a class has structure
slotA = "factor"
slotB = "integer"
slotC = "numeric"
then even though you might be applying the same (generic) function to all of them (say, summary) the actual methods that get called will be different. The task just isn't vectorisable, any more than the set of commands "mop the floor, wash the car and vacuum the carpet" could be vectorised even though they might all share the generic function clean — you need a mop for one task, a sponge for another and a vacuum cleaner for the third. (Contrast that with the set of commands "vacuum the three carpets in the bedroom, hallway and lounge" which can be vectorised to an extent — you don't have to get the vacuum cleaner out of the box three times and put it away three times, you can do it just once)
If you can guarantee that all the slots will be of the same class, then it becomes easier to vectorise, but if that is the case, why does this object have the structure that it does? If it needs to be S4 then just define a simple class that contains a list, matrix or array and then use sapply or apply as needed.
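If the slots do happen to be summariseable by the same generic, a minimal sketch of the slotNames() loop looks like this (the class Foo and its slots are invented for illustration); note that each slot still dispatches to its own method:

setClass("Foo", representation(slotA = "factor",
                               slotB = "integer",
                               slotC = "numeric"))

obj <- new("Foo",
           slotA = factor(c("a", "b", "a")),
           slotB = 1:5,
           slotC = rnorm(3))

# loop over the slot names; sapply hides the loop but cannot remove the
# per-slot method dispatch
sapply(slotNames(obj), function(nm) summary(slot(obj, nm)), simplify = FALSE)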
In reading the documentation for lists, I found references to pairlists, but it wasn't clear to me how they were different from lists.
Pairlists in day-to-day R
There are two places that pairlists will show up commonly in day-to-day R. One is as function formals:
str(formals(var))
The other is as language objects. For example:
quote(1 + 1)
produces a pairlist of type language (LANGSXP internally). The principal reason why you would even care about being aware of this is that operations such as length(<language object>) or language_object[[x]] can be slow because of how pairlists are stored internally (though long pairlist language objects are somewhat rare; note that expressions are not pairlists).
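A few quick checks make the distinctions above visible (results shown as comments):

typeof(formals(var))        # "pairlist"
is.pairlist(formals(var))   # TRUE
typeof(quote(1 + 1))        # "language"
typeof(expression(1 + 1))   # "expression" -- not a pairlist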
Note that empty elements are just zero length symbols, and you can actually store them in lists if you cheat a bit (though you probably shouldn't do this):
list(x=substitute(x, alist(x=))) # hack alert
All that said, for the most part, OP is correct that you don't need to worry about pairlists too much unless you are writing C code for use in R.
Internal differences between lists and pairlists
Pairlists and lists are different principally in their storage structure. Pairlists are stored as a chain of nodes, where each node points to the location of the next node in addition to the node's contents and the node's "name" (see the CAR/CDR wiki article for a generic discussion). Among other things this means you cannot know how many elements a pairlist has unless you start from the first node and traverse the entire chain.
Pairlists are used extensively in the R internals, and do exist in normal R use, but most of the time are disguised by the print or access methods and/or coerced to lists when accessed.
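You can construct one directly to see this (results shown as comments):

pl <- pairlist(a = 1, b = 2)
typeof(pl)            # "pairlist"
pl$b                  # 2 -- access looks just like an ordinary list
typeof(as.list(pl))   # "list" -- easily coerced back and forth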
Lists are also a list of addresses, but unlike pairlists, all the addresses are stored in one contiguous memory location and the total length is tracked. This makes it easy to access any arbitrary member of the list by location since you can just look up the address in the memory table. With a pairlist, you would have to jump from node to node until you eventually got to the desired node. Names are also stored as attributes of the list proper, instead of being attached to each node of a pairlist.
Benefits of pairlists
One (generally small) benefit of pairlists is that you can add to them with minimal overhead since you only need modify at most two nodes (the node ahead of the new node, and the new node itself), whereas with a list you may need to re-allocate the entire address table with an increase in size (this is typically not much of an issue since the address table is usually very small compared to the size of the data the table points to). There are also many algorithms that specialize in pairlist manipulation (e.g. sorting, indexing, etc.), but those can be ported to normal lists as well.
Less relevant for day-to-day use (since you can only do this in the internals): it is very easy to modify a pairlist from a programming perspective by changing what any arbitrary node points to.
Loosely related to the above, pairlists are likely to be more efficient when you have highly nested objects. Lists can easily replicate this structure, but each list and nested list will be saddled with the extra memory address table. This is likely the reason pairlists are used for language objects, which very likely have a high nesting / element ratio.
For more details see R Internals (look for LISTSXP and VECSXP, pairlists and lists respectively, in the linked location).
edit: interestingly an experiment to compare the memory footprint of a list to a pairlist shows the pairlist to be larger, so the storage efficiency argument may be incorrect (not sure if object.size can be trusted here):
> plist_to_list <- function(x) {
+ if(is.call(x)) x <- as.list(x)
+ if(length(x) > 1) for(i in 2:length(x)) x[[i]] <- Recall(x[[i]])
+ x
+ }
> add_quote <- function(x, y) call("+", x, y)
> x <- Reduce(add_quote, lapply(letters, as.name))
> object.size(x)
7056 bytes
> y <- plist_to_list(x)
> object.size(y)
4656 bytes
First of all, pairlists are deprecated
pairlists are deprecated for normal use because "generic vectors" are typically more efficient. You won't ever need to worry about them unless you are working on R internals.
lists can contain named elements
Each element in a list in R can have a name. You can access each element in a list either by name or by its numerical index.
Here is an example of a list in which the second element is named 'second':
> my.list <- list('A',second='B','C')
> my.list
[[1]]
[1] "A"
$second
[1] "B"
[[3]]
[1] "C"
All elements can be indexed by their position in the list. Named elements can additionally be accessed by name:
> my.list[[2]]
[1] "B"
> my.list$second
[1] "B"
Also, each element in a list is a vector, even if it is only a vector containing a single element. For more about lists, see How to Correctly Use Lists in R?.
pairlists can contain empty named elements
A pairlist is basically the same as a list, except that a pairlist can contain an empty named element, but a list cannot. Also, a pairlist is constructed using the alist function.
> list('A',second=,'C')
Error in as.pairlist(list(...)) : argument is missing, with no default
> alist('A',second=,'C')
[[1]]
[1] "A"
$second
[[3]]
[1] "C"
But, as mentioned earlier, they are deprecated. They do not have any benefit or advantage over lists that I know of.
I'm implementing an S4 class that contains a data.table, and attempting to implement [ subsetting of the object (as described here) such that it also subsets the data.table. For example (defining just i subsetting):
library(data.table)
.SuperDataTable <- setClass("SuperDataTable", representation(dt="data.table"))
setMethod("[", c("SuperDataTable", "ANY", "missing", "ANY"),
function(x, i, j, ..., drop=TRUE)
{
initialize(x, dt=x@dt[i])
})
d = data.table(a=1:4, b=rep(c("x", "y"), each=2))
s = new("SuperDataTable", dt=d)
At this point, subsetting with a numeric vector (s[1:2]) works as desired (it subsets the data.table in the slot). However, I'd like to add the ability to subset using an expression. This works for the data.table itself:
s@dt[b == "x"]
# a b
# 1: 1 x
# 2: 2 x
But not for the S4 [ method:
s[b == "x"]
# Error: object 'b' not found
The problem appears to be that arguments in the signature of the S4 method are not evaluated using R's traditional lazy evaluation; see here:
All arguments in the signature of the generic function will be
evaluated when the function is called, rather than using the
traditional lazy evaluation rules of S. Therefore, it's important to
exclude from the signature any arguments that need to be dealt with
symbolically (such as the first argument to function substitute).
This explains why it doesn't work, but not how one can implement this kind of subsetting, since i and j are included in the signature of the generic. Is there any way to have the i argument not be evaluated immediately?
You may be out of luck on this one. From the R developer notes,
Arguments appearing in the signature of the generic will be evaluated as soon as the generic function
is called; therefore, any arguments that need to take advantage of lazy evaluation must not be in
the signature. These are typically arguments treated literally, often via the substitute() function.
For example, if one wanted to turn substitute() itself into a generic, the first argument, expr,
would not be in the signature since it must not be evaluated but rather treated as a literal.
Furthermore, due to method caching,
All the arguments in the full signature are evaluated as described above, not just the active
ones. Otherwise, in special circumstances the behavior of the function could change for one
method when another method was cached, definitely undesirable.
I would follow the example from the data.table package writers and use an S3 object (see line 304 of R/data.table.R in their source code). Your S3 object can still create and manipulate an S4 object underneath to maintain the semi-static typing feature.
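As a hedged sketch of that S3 route (the names new_superdt/superdt are made up, and this relies on data.table's [ seeing the unevaluated arguments forwarded through ...; expressions that reference variables from the calling environment rather than columns may need extra care):

library(data.table)

new_superdt <- function(dt) structure(list(dt = dt), class = "superdt")

"[.superdt" <- function(x, ...) {
    # the arguments in ... are passed on unevaluated, so data.table's own
    # non-standard evaluation can handle expressions such as b == "x"
    x$dt[...]
}

s3 <- new_superdt(data.table(a = 1:4, b = rep(c("x", "y"), each = 2)))
s3[b == "x"]   # rows where column b equals "x"
s3[1:2]        # positional subsetting still works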
We can't get extraordinarily clever:
‘[’ is a primitive function; methods can be defined, but the generic function is implicit, and cannot be changed.
Defining both an S3 and S4 method will dispatch the S3 method, which makes it seem like we should be able to route around the S4 call and dispatch it manually, but unfortunately the argument evaluation still occurs! You can get close by borrowing plyr::., which would give you syntax like:
s <- new('SuperDataTable', dt = as.data.table(iris))
s[.(Sepal.Length > 4), 2]
Not ideal, but closer than anything else.
I define list a & ask for the class of the first element alpha:
a <- list(alpha=c(1,2,3), beta=c("cat","dog","duck"), gamma=factor("a","b","a"))
class(a$alpha)
[1] "numeric"
I then ask for a summary of a, which reports class -none- for alpha:
summary(a)
Length Class Mode
alpha 3 -none- numeric
beta 3 -none- character
gamma 1 factor numeric
Questions: (1) why is this? (2) I am a novice to R and to programming. What references would you recommend for the beginner who really wants to understand how R works (besides the R language definition)? I find it hard to understand things like the difference between mode, class, & type. Thank you in advance.
I don't claim to fully understand the why here, but as best I can tell, this is what's going on.
summary.default actually calls oldClass rather than class. Why I'm not sure, although I'm sure there's a good reason.
Somewhat cryptically in ?class we find the following passages:
Many R objects have a class attribute, a character vector giving the
names of the classes from which the object inherits. If the object
does not have a class attribute, it has an implicit class, "matrix",
"array" or the result of mode(x) (except that integer vectors have
implicit class "integer"). (Functions oldClass and oldClass<- get and
set the attribute, which can also be done directly.)
So what's going on here is that class returns the implicit class (numeric). Note that attr(a$alpha,"class") returns NULL. Since the attribute doesn't exist, oldClass faithfully returns NULL.
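You can see this directly with an unclassed vector (results shown as comments):

x <- c(1, 2, 3)
class(x)           # "numeric" -- the implicit class
oldClass(x)        # NULL      -- no class attribute is actually set
attr(x, "class")   # NULL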
As for the differences between mode, type and class, the first two are related, the third is sort of a separate idea. Mode and type are (I think) actually fairly well explained in the documentation. mode tells you the storage mode of an object, but it is relying on the result of typeof, so they are (mostly) the same. Or connected, at least. But the different values that typeof returns are simply collapsed down to a smaller subset.
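A small illustration of how the three differ for basic vectors:

x <- 1:3            # an integer vector
typeof(x)           # "integer"
mode(x)             # "numeric" -- typeof values collapsed to a coarser set
class(x)            # "integer" -- the implicit class

y <- c(1, 2, 3)     # a double vector
typeof(y)           # "double"
mode(y)             # "numeric"
class(y)            # "numeric"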