How can I apply a function to all slots of an S4 object?
Of course, it can be done with a for-loop over slotNames(), but I'm curious whether it can be done in a vectorized way.
In general it isn't possible to operate on slots in a vectorised way, because the slots might have any class. If a class has structure
slotA = "factor"
slotB = "integer"
slotC = "numeric"
then even though you might be applying the same (generic) function to all of them (say, summary), the actual methods that get called will be different. The task just isn't vectorisable, any more than the set of commands "mop the floor, wash the car and vacuum the carpet" could be vectorised even though they might all share the generic function clean: you need a mop for one task, a sponge for another and a vacuum cleaner for the third. (Contrast that with the set of commands "vacuum the three carpets in the bedroom, hallway and lounge", which can be vectorised to an extent: you don't have to get the vacuum cleaner out of the box three times and put it away three times; you can do it just once.)
If you can guarantee that all the slots will be of the same class, then it becomes easier to vectorise, but if that is the case, why does this object have the structure that it does? If it needs to be S4 then just define a simple class that contains a list, matrix or array and then use sapply or apply as needed.
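If you do end up iterating over heterogeneous slots anyway, the slotNames() loop from the question can at least be written compactly with lapply. A minimal sketch follows; the class Mixed and its slot values are made up purely for illustration, and note that each slot still dispatches to its own summary method, so this is a loop, not true vectorisation:

library(methods)

setClass("Mixed", representation(slotA = "factor",
                                 slotB = "integer",
                                 slotC = "numeric"))

obj <- new("Mixed",
           slotA = factor(c("a", "b", "a")),
           slotB = 1:5,
           slotC = c(0.5, 1.5, 2.5))

# Apply summary() to every slot; each call dispatches to the method
# for that particular slot's class
res <- lapply(slotNames(obj), function(nm) summary(slot(obj, nm)))
names(res) <- slotNames(obj)
res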
Here is a piece of R code that writes to each element of a matrix in a reference class. It runs incredibly slowly, and I’m wondering if I’ve missed a simple trick that will speed this up.
nx <- 2000
ny <- 10

ref_matrix <- setRefClass(
  "ref_matrix",
  fields = list(data = "matrix")
)

out <- ref_matrix(data = matrix(0.0, nx, ny))
# tracemem(out$data)

for (iy in 1:ny) {
  for (ix in 1:nx) {
    out$data[ix, iy] <- ix + iy
  }
}
It seems that each write to an element of the matrix triggers a check that involves a copy of the entire matrix. (Uncommenting the tracemem() call shows this.) Now, I’ve found a discussion that seems to confirm this:
https://r-devel.r-project.narkive.com/8KtYICjV/rd-copy-on-assignment-to-large-field-of-reference-class
and this also seems to be covered by "Speeding up field access in R reference classes",
but in both of these the behaviour can be bypassed by not declaring a class for the field, and this works for the example in the first link, which uses a 1D vector, b, which can just be set as b <<- 1:10000. But I’ve not found an equivalent way of creating a 2D array without using an explicit “matrix” instance.
Am I just missing something simple, or is this actually not possible?
Let me add a couple of things. First, I’m very new to R, so could easily have missed something. Second, I’m really just curious about the way reference classes work in this case and whether there’s a simple way to use them efficiently; I’m not looking for a really fast way to set the elements of a matrix - I can do that by not having the matrix in a reference class at all, and if I really care about speed I can write a C routine to do it and call it from R.
Here’s some background that might explain why I’m interested in this, which you’re welcome to ignore.
I got here by wanting to see how different languages, and even different compiler options and different ways of coding the same operation, compared for efficiency when accessing 2D rectangular arrays. I’ve been playing with a test program that creates two 2D arrays of the same size, and calls a subroutine that sets the first to the elements of the second plus their index values. (Almost any operation would do, but this one isn’t completely trivial to optimise.) I have this in a number of languages now, C, C++, Julia, Tcl, Fortran, Swift, etc., even hand-coded assembler (spoiler alert: assembler isn’t worth the effort any more) and thought I’d try R. The obvious implementation in R passes the two arrays to a subroutine that does the work, but because R doesn’t normally pass by reference, that routine has to make a copy of the modified array and return that as the function value. I thought using a reference class would avoid the relatively minor overhead of that copy, so I tried that and was surprised to discover that, far from speeding things up, it slowed them down enormously.
Use outer:
out$data <- outer(1:nx, 1:ny, `+`)
Also, don't use reference classes (or R6 classes) unless you actually need reference semantics. KISS and all that.
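For reference, here is a minimal sketch that puts the single assignment in place of the element-wise loop, using the same ref_matrix class as in the question; the loop version is kept only to check that the two approaches agree:

nx <- 2000
ny <- 10
ref_matrix <- setRefClass("ref_matrix", fields = list(data = "matrix"))

# Element-wise writes: every assignment copies the whole field (this is the slow part)
slow <- ref_matrix(data = matrix(0.0, nx, ny))
for (iy in 1:ny) for (ix in 1:nx) slow$data[ix, iy] <- ix + iy

# One assignment of a matrix built outside the object (fast)
fast <- ref_matrix(data = matrix(0.0, nx, ny))
fast$data <- outer(1:nx, 1:ny, `+`)

# all.equal() rather than identical(): the loop stores doubles, while
# outer() on integer sequences produces an integer matrix
all.equal(slow$data, fast$data)
# [1] TRUE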
Suppose A is an abstract type and I have a function f{T<:A}(x::Vector{T}). So x could be of type Vector{A} or Vector{B}, where B <: A. In the middle of the function I would like to cast x to Vector{A} so it can be consumed by another function that requires that signature.
What's the best way to do that? At the moment I am doing x = collect(A, x). Is there a way to avoid copying if possible?
If at all possible, I'd just change your second function definition to be parametric like f. Enforcing this kind of container structure in method signatures doesn't gain you any performance or functionality… it just makes the functions much harder to use.
That said, the best way to do this kind of conversion where you don't care if the output aliases the input is with convert(Vector{A}, x). This will be a no-op if x already isa Vector{A}, but otherwise it'll be just like collect. That's as good as it gets.
Here's why: two containers of types Vector{A} and Vector{B} cannot share the same memory if A !== B since it'd be possible to corrupt the data in the Vector{B} by assigning a non-B element to the array through the Vector{A}.
"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B is in some way depending on the parameters with which A was called. The dependency is not as easy to see as in the above example, since calculations are complex and there are multiple parameters.
Overlooking such an issue leads to difficult-to-debug problems, since all calculations run smoothly; it is just that the result is incorrect. Only an explicit validation of the results reveals the problem.
On top of that, even if I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into, because Curry forces the evaluation of its arguments (there is a lazy variant that doesn't, but it wouldn't help here).
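For completeness, the direct form mentioned above (currying print itself) looks like this; output is shown as comments:

library(functional)

funs_direct <- lapply(1:10, function(i) Curry(print, i))
funs_direct[[3]]()
# [1] 3

Because Curry evaluates its extra arguments when the curried function is created, each element of funs_direct remembers its own value of i.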
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
  1:10, function(i) {
    fun.res <- function() print(i)
    environment(fun.res) <- list2env(as.list(environment()))  # force parent env copy
    fun.res
  }
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force; just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun <- function(x, y, z) {
  x; y; z
  ## code
}
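If you prefer to be explicit about it, the same habit with force() in a function factory looks like this (make_scaler is just a made-up example name):

make_scaler <- function(k) {
  # force(k) evaluates the argument now, so the returned closure
  # captures k's current value rather than an unevaluated promise
  force(k)
  function(x) k * x
}

times2 <- make_scaler(2)
times2(10)
# [1] 20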
There is some work in progress to improve R's higher order functions, like the apply functions and Reduce, in handling situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. It should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
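Here is a minimal sketch of how forceAndCall is typically used when writing such a higher order function yourself; my_lapply is a made-up name, and since R 3.2.0 lapply itself does the equivalent internally:

my_lapply <- function(X, FUN, ...) {
  out <- vector("list", length(X))
  for (i in seq_along(X)) {
    # forceAndCall(1, ...) forces FUN's first argument as part of the call,
    # so a closure returned by FUN captures the element's value instead of
    # a promise that would later evaluate to the loop's final element
    out[[i]] <- forceAndCall(1, FUN, X[[i]], ...)
  }
  out
}

funs3 <- my_lapply(1:10, function(i) function() print(i))
funs3[[2]]()
# [1] 2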
How does one correctly do the following:
I have a class SpectraSet with slots parentSpectrum, childSpectra, name (to keep it simple)
name is character()
parentSpectrum should contain one object of class ParentSpec (so it is of type ParentSpec)
childSpectra should contain n objects of class ChildSpec. However, I can't make it of type ChildSpec because vectors can only contain atomic types. What is best practice in this case? I can make it a list() and type-check in the validity check, but is there anything better?
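For concreteness, here is a minimal sketch of the list-plus-validity approach mentioned above; ParentSpec and ChildSpec are defined only as stand-ins so the example is self-contained:

library(methods)

# Stand-in spectrum classes (the real ones would come from your package)
setClass("ParentSpec", representation(mz = "numeric"))
setClass("ChildSpec", representation(mz = "numeric"))

setClass("SpectraSet",
         representation(name = "character",
                        parentSpectrum = "ParentSpec",
                        childSpectra = "list"),
         validity = function(object) {
             ok <- vapply(object@childSpectra, is, logical(1), class2 = "ChildSpec")
             if (!all(ok))
                 return("all elements of 'childSpectra' must be ChildSpec objects")
             TRUE
         })

ss <- new("SpectraSet",
          name = "example",
          parentSpectrum = new("ParentSpec", mz = 100),
          childSpectra = list(new("ChildSpec", mz = 50), new("ChildSpec", mz = 75)))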
Here are related answers I've provided in the past.
It's usually better to re-think the class design so ChildSpec is intrinsically a vector -- minimally, supports length() and subsetting [, [[. Your problem above then goes away, the design is consistent with R's vectorized orientation, and likely common operations are efficient.
An alternative to implementing your own type-checked list (which is really the other alternative) is to re-use the infrastructure from Bioconductor's S4Vectors package:
library(S4Vectors)  # Bioconductor package providing SimpleList

.X = setClass("X", representation(x = "numeric"))
.XList = setClass("XList", contains = "SimpleList",
    prototype = prototype(elementType = "X"))
And in action
> xl = .XList(listData=list(.X(x=1), .X(x=2)))
> xl
XList of length 2
> xl[[2]]
An object of class "X"
Slot "x":
[1] 2
I am quite new to Clojure, although I am familiar with functional languages, mainly Scala.
I am trying to figure out what is the idiomatic way to operate on collections in Clojure. I am particularly confused by the behaviour of functions such as map.
In Scala, great care is taken to make sure that map always returns a collection of the same type as the original collection, as long as this makes sense:
List(1, 2, 3) map (2 *) == List(2, 4, 6)
Set(1, 2, 3) map (2 *) == Set(2, 4, 6)
Vector(1, 2, 3) map (2 *) == Vector(2, 4, 6)
Instead, in Clojure, as far as I understand, most operations such as map or filter are lazy, even when invoked on eager data structures. This has the weird result of making
(map #(* 2 %) [1 2 3])
a lazy list instead of a vector.
While I prefer, in general, lazy operations, I find the above confusing. In fact, vectors guarantee certain performance characteristics that lists do not.
Say I take the result from above and append to its end. If I understand correctly, the result is not evaluated until I try to append to it; then it is evaluated and I get a list instead of a vector, so I have to traverse it to append at the end. Of course I could turn it into a vector afterwards, but this gets messy and can be overlooked.
If I understand correctly, map is polymorphic and it would not be a problem to implement it so that it returns a vector on vectors, a list on lists, a stream on streams (this time with lazy semantics) and so on. I think I am missing something about the basic design of Clojure and its idioms.
What is the reason basic operations on Clojure data structures do not preserve the structure?
In Clojure many functions are based on the Seq abstraction.
The benefit of this approach is that you don't have to write a function for every different collection type: as long as your collection can be viewed as a sequence (things with a head and possibly a tail), you can use it with all of the seq functions. Functions that take seqs and output seqs are much more composable, and thus more re-usable, than functions that limit their use to a certain collection type. When writing your own function on seqs you don't need to handle special cases like "if the user gives me a vector, I have to return a vector". Your function will fit into the seq pipeline just as well as any other seq function.
The reason that map returns a lazy seq is a design choice. In Clojure laziness is the default for many of these functional constructions. If you want to have other behavior, like parallelism without intermediate collections, take a look at the reducers library: http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
As far as performance goes, map always has to apply a function n times on a collection, from the first to the last element, so its performance will always be O(n) or worse. In this case vector or list makes no difference. The possible benefit that laziness would give you is when you would only consume the first part of the list. If you must append something to the end of map's output, a vector is indeed more efficient. You can use mapv (added in Clojure 1.4) in this case: it takes in a collection and will output a vector. I would say, only worry about these performance optimizations if you have a very good reason for it. Most of the time it's not worth it.
Read more about the seq abstraction here: http://clojure.org/sequences
Another vector-returning higher order function that was added in Clojure 1.4 is filterv.