Ensuring reproducibility in an R environment
I work in a computational biology lab, where we have several folks working on multiple projects, mostly in R (which is what I care about for this post). In the past, people would simply develop their code for each project, which may or may not involve boilerplate code copied over from previous projects. One thing that I've pushed over the years was to bring some centralized structure to this mess and have people identify common patterns such that we can turn these repeated/common blocks of code into packages for all of the many reasons one might think that is a good thing to do. So now our folks are using a mix of centralized packages/routines within their project specific scripts.
There's one gotcha here. We have a mandate from the powers that be that every script for every project needs to be 100% reproducible over time to the best of our ability (and this includes 100% of all code we have direct access to, including our packages). That is, if I call function foo in package bar with parameter A to get result X today, 4 years from now I should get the exact same result. (Erroneous output due to bugs is excepted here.)
The topic of reproducibility has come up now and then in R within various circles, but typically it seems to be discussed in terms of reproducibility of process (e.g. vignettes). This is not the same thing - I can run a vignette today and then run the same code 6 months from now using updated packages and receive wildly different results.
The solution that's been agreed upon (which I'm not a fan of) is that if a function or package needs to be changed in a non-backwards compatible way, it simply gets a new name. Thus, if we needed to radically change function foo(), it'd be called foo2(), and if that needs a radical change it gets called foo3(). This ensures that any script that called foo() will always get the original result, while allowing things to march forward within the package repository. It works, but I really dislike this - it seems aesthetically extremely cluttered, and I worry that it will lead to mass confusion over time having packages bar, bar2, bar3, bar4 ... functions foo1, foo2, foo3, etc.
The problem is that I haven't come up with an alternate solution that's really better. One possibility would be to note version numbers of packages, R, etc and make sure those are loaded, but that has multiple problems - not the least of which is that it relies on proper package versioning discipline and that's prone to error. Also, this alternative was already rejected ;) Ideally what we'd have is some sort of notion of devel & release as most of these changes tend to happen earlier on and then level off with changes happening much less frequently. OTOH what devel really means here is "not actually in a package yet" (which we do), but it can be hard to determine exactly at what point is the right one to transport stuff over. Invariably the moment you think you're safe, that's when you realize you're not.
So with all this in mind, I'm curious if anyone else out there has dealt with similar situations, and how they might have resolved things.
edit: just to be clear, by non-backwards compatible, I'm not just talking about APIs and such, but also outputs for a given set of inputs.
This is indeed an important thing to think about and I think ultimately requires the institutionalization of a couple of different processes.
Version Control (svn, git, bzr, cvs, etc)
Unit Tests
My first reaction is that you need to institutionalize some sort of code management system. This will make it easier, because the old version of foo() is still available, if you really want it. From what you have said, it sounds like you need to package up your common functions and institute some sort of a release schedule. Scripts which require backward compatibility must include the package name and release information. This way it is possible to ALWAYS obtain foo() exactly as it was when the script was written. You should also make sure people only use official release versions in their work, because otherwise this could become quite a pain.
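As a rough sketch of what "include the package name and release information" could look like at the top of a script: the helper name require_version() is made up for illustration, and the stats package is used only because it ships with every R installation.

```r
# Hypothetical helper: stop unless the installed package matches the
# exact release the script was written against.
require_version <- function(pkg, version) {
  installed <- as.character(utils::packageVersion(pkg))
  if (installed != version) {
    stop(sprintf("%s %s required, but %s is installed", pkg, version, installed))
  }
  invisible(TRUE)
}

# Illustration only: pinning 'stats' to whatever is installed always passes.
# A real script would hard-code the release string instead.
require_version("stats", as.character(utils::packageVersion("stats")))
```

A script that calls this before doing any work fails loudly when the wrong release is loaded, rather than silently producing different numbers.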
I agree, having a collection of foo:foo99 is doomed to failure. But at least it will be a gloriously confusing failure. Aesthetics aside, it will drive you all bonkers. If foo2() is an improvement (more accurate, faster, etc) of foo(), then it should be called foo() and released for use according to your company-wide release schedule. If it does something different, it is no longer foo(). It might be fooo() or superFoo() or fooMe(), but it ain't foo().
Finally, you need to start testing your functions. (Unit Tests) For each function that is published and made available for others, you should have a clearly defined test suite. Unless someone fixes a bug in foo(), the results should stay the same. If someone fixes a bug, then the results should be more accurate and will probably be more desirable in most cases. If you do need to reproduce the old, incorrect, results, you can dig out an old version of foo() from your version control system. By instituting rigorous unit tests, you will know if/when the results of foo have changed. This knowledge should help minimize the number of foo() functions you need. Rather than create a version every time someone tweaks something, you can test the new version to see whether or not the results conform to expectations. But, this is tricky, because you have to make sure that your tests cover anything the function is ever likely to see, including bizarre edge cases. In a research setting, I would imagine that could become a challenge.
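A minimal sketch of such a regression check in base R, assuming a stand-in foo() and reference values recorded when the function was released (both are invented here; a real suite would more likely live in a test framework):

```r
# Stand-in for a released package function.
foo <- function(x, y) x + y

# Reference results recorded at release time (assumed values).
reference <- list(
  list(args = list(x = 10, y = 20), expected = 30),
  list(args = list(x = -1, y = 1),  expected = 0)
)

# Re-run every recorded call and fail if any result has drifted.
check_foo <- function() {
  for (case in reference) {
    got <- do.call(foo, case$args)
    stopifnot(isTRUE(all.equal(got, case$expected)))
  }
  invisible(TRUE)
}

check_foo()
```

If someone changes foo() so that foo(10, 20) no longer returns 30, check_foo() stops with an error, which is the signal that the change is not backwards compatible.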
I'm not sure about integrating it with R, but Sumatra might be worth looking into. It appears to allow you to keep track of code and results. So if you need to go back and re-run that simulation from 4 years ago, the code should be there.
Well, ask yourself how you would do that in any other language. There's really nothing more to it than good bookkeeping I'm afraid:
record version numbers of all software involved
put the code in manageable chunks, say in packages.
make sure you have all software/packages involved still available in 5 years.
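The first bookkeeping step can be automated. Here is a small sketch, where record_versions() and the file name in the comment are hypothetical, that collects the R version and package versions into a data frame you could store next to the analysis:

```r
# Hypothetical helper: one row per package with its installed version.
record_versions <- function(pkgs = loadedNamespaces()) {
  data.frame(
    package = pkgs,
    version = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

manifest <- record_versions(c("stats", "utils"))
manifest$r_version <- R.version.string
print(manifest)

# To keep it with the analysis (file name is an assumption):
# write.csv(manifest, "versions.csv", row.names = FALSE)
```

Calling sessionInfo() gives similar information in a form meant for reading rather than for storing.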
R can easily be made portable, including all installed packages. Keep a portable version of R together with the used packages, the code and the data on a CD-ROM for each analysis, and you're sure you can reproduce whenever you want. OK, you miss the OS, but can't have them all. In any case, if the OS makes a difference important enough to call the analysis not reproducible, the problem is very likely your analysis. You don't want to tell anybody your result is dependent on the version of Windows you use, do you?
PS: please get it into people's heads that they should never ever in their life copy-paste code. They should wrap it in functions and use those. It's a whole lot easier and far less error-prone. I mean, what's the difference between copying
x <- read.table("sometable")
y <- colSums(x)/4.3
and adjusting the values, or typing
myfun <- function(i,j){
  x <- read.table(i)
  colSums(x)/j
}
Saves you and a lot of other people a whole lot of copy-paste trouble. (How so, object not found? What object?)
Whenever you want to freeze your code in a way that needs to be reproducible "forever", e.g., when your paper has been published, the safest way to do this is to create a virtual machine containing all your code and data and the software needed to run it (including the operating system). There's an example here on the University of Washington site.
This is exactly the kind of thinking that causes Microsoft to maintain bug compatibility in Excel. Rather than attempting to conform to such a request you should be doing your best to show that it's not a good idea.
This thinking means that all errors remain errors in order to maintain consistency. It's thinking transferred from corporate bureaucracy and has no business in a science lab.
The only way to do this is to save the copy of all your packages and version of R with your code. There's no central corporation beholden to bug compatibility that's going to take care of that for you.
What if a change in result is due to a change in your operating system? Perhaps Microsoft fix a bug in Windows XP for Windows 7 and then when you upgrade - all your outputs are different.
If you want to handle this then I think the best way of working is to keep snapshots of virtual machines when you close out an analysis, and store the VM images for later use. Of course in five years time you won't have a license to run Windows XP so that's another problem - one solved by using an open-source operating system, such as Linux.
I would go with docker images.
This is a pretty convenient way to reproduce the OS and all dependencies.
You build an image and later can deploy it any time to docker, it will be fully configured.
You can find multiple R docker images available, so you can easily build your image upon them.
With the image already built, you can deploy it to a test environment and later to production.
This may be a late answer, but I have found it useful to create a generic wrapper like the following, especially when iterating quickly in my development of a new function:
myFunction <- function(..., version = "latest"){
if((version == "latest") || (version == 6)){
return(myFunction06(...))
} ...
if((version == 1)){
return(myFunction01(...))
}
}
Then, code should simply state which version it wants. Once the actual function stabilizes, I remove support for the older versions of the function, and a quick search through my code lets me find any offending calls. Use of "latest" means I can assure that the caller and the function match some fairly fixed definitions.
Naturally, all code is maintained in a version control system, so even when I remove the earlier code, it is only from the currently available source. I can reproduce any behavior from any point in time, including errors, as long as the data from that point in time is obtainable.
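The same wrapper idea can be written with a lookup table instead of a chain of if statements. This is only a sketch: the two implementations and their version numbers are invented for illustration.

```r
# Hypothetical implementations kept side by side.
myFunction01 <- function(x) x + 1
myFunction02 <- function(x) x + 2

myFunction <- function(..., version = "latest") {
  # Map version labels to implementations; the last entry is "latest".
  impls <- list("1" = myFunction01, "2" = myFunction02)
  key <- if (identical(version, "latest")) {
    names(impls)[length(impls)]
  } else {
    as.character(version)
  }
  impl <- impls[[key]]
  if (is.null(impl)) stop("unsupported version: ", key)
  impl(...)
}

myFunction(10, version = 1)  # 11
myFunction(10)               # 12
```

Dropping support for an old version then means deleting one entry from the table, and callers that still ask for it get an explicit error.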
A solution might be to use S4 methods and letting R's internal dispatcher do the work for you (see example below). That way, you're somewhat "bulletproof" with respect to being able to systematically update your code without running the risk of breaking something.
Key benefits
The key thing here is that S4 methods support multiple dispatch.
That way your function will always be foo (as opposed to having to keep track of foo1, foo2 etc.) while new functionality can be easily implemented (by adding respective methods) without touching "old" methods (that other people/packages might rely on).
Key functions you'll need:
setGeneric
setMethod
setRefClass (S4 Reference Classes; personal recommendation) or setClass (S4 Class; I wouldn't use them for the reason described in the "Additional remarks" at the very end)
The "downsides"
You need to switch from an S3 to an S4 logic
This implies that you need to write a bit more code than what you might be used to (generic method definitions, method definitions and possibly your own class definitions; see example below). But this "buys" yourself and your code much more structure and makes it more robust.
It might also imply that you'll eventually dig deeper and deeper into the world of Object-Oriented Programming or Object-Oriented Design. While I personally consider this to be a good thing (my personal rule of thumb: the more complex/distributed your application, the better you're off using OOP), some would consider these approaches to be R-untypic (I strongly disagree as R does have superb OO-features that are maintained by the Core Team) or "unsuited" for R (this might be true depending on how much you rely on "non-OOP" packages/code). If you're willing to go that way, you might want to familiarize yourself with the SOLID principles of Object-Oriented Design. You also might want to check out the following books: Clean Coder and The Pragmatic Programmer.
If computational efficiency (e.g. when estimating statistical models) is really critical, using S4 methods and S4 Reference Classes might slow you down a bit. After all, there's more code involved compared to S3. But I'd recommend testing the impact of this from case to case via system.time() and/or microbenchmark::microbenchmark() instead of picking "ideological" sides (S3 vs. S4).
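A quick way to run that comparison yourself is sketched below, with toy classes and generics (area, area4, Square are all invented names). The timings depend on the machine and R version, so no numbers are shown.

```r
library(methods)

# S3 version: a generic with one method.
area <- function(shape) UseMethod("area")
area.square <- function(shape) shape$side^2

# S4 version of the same thing.
setClass("Square", representation(side = "numeric"))
setGeneric("area4", function(shape) standardGeneric("area4"))
setMethod("area4", "Square", function(shape) shape@side^2)

s3 <- structure(list(side = 2), class = "square")
s4 <- new("Square", side = 2)

# Time many dispatches of each kind.
n <- 1e4
t_s3 <- system.time(for (i in seq_len(n)) area(s3))
t_s4 <- system.time(for (i in seq_len(n)) area4(s4))
print(c(S3 = t_s3[["elapsed"]], S4 = t_s4[["elapsed"]]))
```

For finer resolution the same two calls can be passed to microbenchmark::microbenchmark(), if that package is installed.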
Example
Initial function
Let's suppose you're in department A and someone in your team started out with creating a function called foo()
foo <- function(x, y) {
x + y
}
foo(x=10, y=20)
First change request
You would like to be able to extend it without breaking "old" code that relies on foo().
Now, I think we all agree that this can be quite hard to do.
You either need to explicitly modify the source code of foo() (each time running the risk that you break something that already used to work; this violates the "O" in SOLID: the Open-Closed Principle) or you need to come up with alternative names such as foo1, foo2 etc. (really hard to keep track of which function is doing what).
foo <- function(x, y, type=c("old", "new")) {
type <- match.arg(type, choices=c("old", "new"))
if (type == "old") {
x + y
} else if (type == "new") {
x * y
}
}
foo(x=10, y=20)
[1] 30
foo(x=10, y=20, type="new")
[1] 200
foo1 <- function(x, y) {
x * y
}
foo1(x=10, y=20)
[1] 200
Let's see how S4 methods and multiple dispatch can really help us out here.
Generic method
You need to start out by turning foo() into a generic method.
setGeneric(
name="foo",
signature=c("x", "y", ".ctx", ".ns"),
def=function(x, y, ..., .ctx, .ns) {
standardGeneric("foo")
}
)
In simplified words: a generic method itself doesn't do anything yet. It's simply a precondition in order to be able to specify "actual" methods for its signature arguments that do something useful.
Signature arguments
The degree of flexibility with respect to the original problem is directly linked to the number of signature arguments that you declare (signature=c("x", "y", ".ctx", ".ns")): the more signature arguments, the more flexibility you have but the more complex your code might get as well (with respect to how much code you have to write).
Again, in simplified words: signature arguments (and their classes) are used by the method dispatcher to retrieve the correct method that's doing the actual work.
Think of the method dispatcher as being like the clerk in a ski rental business: you present him with an arbitrarily large set of signature information (i.e. information that "clearly distinguishes you from others": your age, height, shoe size and skill level) and he uses that information to provide you with the right equipment to hit the slopes. Think of R's method dispatcher as being the clerk that has access to the storage room of the ski rental. But instead of ski equipment it will return methods.
Notice that we said that our "old" arguments x and y are from now on supposed to be signature arguments while there are also two new arguments: .ctx and .ns. I'll get to these in a minute. It's those arguments that will provide us with the flexibility that we're after.
Initial method definition
We now define a "variant" (a method) of the generic method for the following "signature scenario":
x is numeric
y is numeric
.ctx will just not be provided when calling the method and is thus missing
.ns will just not be provided when calling the method and is thus missing
Think of it as registering your signature information with explicit equipment of the ski rental. Once you did that and ask for your equipment, the only thing the clerk has to do is to go to the storage room and look up which equipment is linked to your personal information.
setMethod(
f="foo",
signature=signature(x="numeric", y="numeric", .ctx="missing", .ns="missing"),
definition=function(x, y, ..., .ctx, .ns) {
x + y
}
)
When we call foo with this "signature scenario" (asking for the method that we registered for this scenario), the method dispatcher knows exactly which actual method it needs to get out of the storage room:
foo(x=10, y=20)
[1] 30
First update
Now someone from department B comes along, looks at foo(), likes it but decides that foo() needs to be updated (x * y instead of x + y) if it is to be used in his department.
That's when .ctx (short for context) comes into play: it's an argument by which we are able to distinguish application contexts.
Defining a class that represents the new application context
setRefClass("ApplicationContextDepartmentB")
When calling foo(), we'll provide it with an instance of this class
(.ctx=new("ApplicationContextDepartmentB"))
Defining a new method for the new application context
Notice how we register signature argument .ctx to our new class ApplicationContextDepartmentB:
setMethod(
f="foo",
signature=signature(x="numeric", y="numeric",
.ctx="ApplicationContextDepartmentB", .ns="missing"),
definition=function(x, y, ..., .ctx, .ns) {
out <- x * y
attributes(out)$description <- "I'm different from the original foo()"
return(out)
}
)
That way, the method dispatcher knows exactly that it should return the "new" method instead of the "old" method when we call foo() like this:
foo(x=1, y=10, .ctx=new("ApplicationContextDepartmentB"))
[1] 10
attr(,"description")
[1] "I'm different from the original foo()"
The "old" method is not affected at all:
foo(x=1, y=10)
[1] 11
Second update
Suppose that someone from department C comes along and suggests yet another "configuration" or version for foo(). You can easily provide that without breaking anything that you've realized for departments A and B so far by following the same routine as for department B.
But we'll even take it one step further here: we'll define two additional classes that let us distinguish different "namespaces" (that's where .ns comes into play).
Think of namespaces as a way of distinguishing different runtime scenarios for a specific method for a specific application context (i.e. "testing" and "productive mode").
Defining the classes
setRefClass("ApplicationContextDepartmentC")
setRefClass("TestNamespace")
setRefClass("ProductionNamespace")
Defining a new method for the new application context and a "test" scenario
Notice how we register signature arguments .ctx to our new class ApplicationContextDepartmentC and .ns to our new class TestNamespace:
setMethod(
f="foo",
signature=signature(x="character", y="numeric",
.ctx="ApplicationContextDepartmentC", .ns="TestNamespace"),
definition=function(x, y, ..., .ctx, .ns) {
data.frame(x, y, test.ok=rep(TRUE, length(x)))
}
)
Again, the method dispatcher will look up the correct method when calling foo() like this:
foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"),
.ns=new("TestNamespace"))
x y test.ok
1 a 11 TRUE
2 b 12 TRUE
3 c 13 TRUE
4 d 14 TRUE
5 e 15 TRUE
Defining a new method for the new application context and a "productive" scenario
setMethod(
f="foo",
signature=signature(x="character", y="numeric",
.ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"),
definition=function(x, y, ..., .ctx, .ns) {
data.frame(x, y)
}
)
We tell the method dispatcher that we now want the method registered for this scenario or namespace like this:
foo(x=letters[1:5], y=11:15, .ctx=new("ApplicationContextDepartmentC"),
.ns=new("ProductionNamespace"))
x y
1 a 11
2 b 12
3 c 13
4 d 14
5 e 15
Notice that you're free to use the classes TestNamespace and ProductionNamespace anywhere you'd like. These classes are not bound to ApplicationContextDepartmentC in any way, so you can, for example, also use them for all your other application scenarios.
Additional remarks for method definitions
Something that's often quite useful is to start out with a method that accepts ANY classes for its signature arguments and define more restrictive methods as your software evolves:
setMethod(
f="foo",
signature=signature(x="ANY", y="ANY", .ctx="missing", .ns="missing"),
definition=function(x, y, ..., .ctx, .ns) {
message("Value of x:")
print(x)
message("Value of y:")
print(y)
}
)
foo(x="Hello World!", y=rep(TRUE, 3))
Value of x:
[1] "Hello World!"
Value of y:
[1] TRUE TRUE TRUE
Additional remarks for class definitions
I prefer S4 Reference Classes over S4 Classes because of the self-referencing capabilities of S4 Reference Classes:
setRefClass(
Class="A",
fields=list(
x1="numeric",
x2="logical"
),
methods=list(
getX1=function() {
.self$x1
},
getX2=function() {
.self$x2
},
setX1=function(x) {
.self$x1 <- x
},
setX2=function(x) {
.self$field("x2", x)
},
addX1AndX2=function() {
.self$getX1() + .self$getX2()
}
)
)
x <- new("A", x1=10, x2=TRUE)
x$getX1()
[1] 10
x$getX2()
[1] TRUE
x$addX1AndX2()
[1] 11
S4 Classes don't have that feature.
Subsequent modifications of field values:
x$setX1(100)
x$addX1AndX2()
[1] 101
x$x1 <- 1000
x$addX1AndX2()
[1] 1001
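One practical consequence of this difference, sketched here with two invented counter classes: a Reference Class object is modified in place when a function changes a field, while an S4 object behaves like any other R value and the caller only sees the change if the modified copy is returned and reassigned.

```r
library(methods)

# Plain S4 class (value semantics) and a Reference Class (reference semantics).
setClass("ValueCounter", representation(n = "numeric"))
RefCounter <- setRefClass("RefCounter", fields = list(n = "numeric"))

# Each function increments the counter it is given.
bump_value <- function(obj) { obj@n <- obj@n + 1; obj }
bump_ref   <- function(obj) { obj$n <- obj$n + 1; invisible(obj) }

v <- new("ValueCounter", n = 0)
r <- RefCounter$new(n = 0)

v2 <- bump_value(v)  # v is untouched; only the returned copy changed
bump_ref(r)          # r itself is changed

v@n   # 0
v2@n  # 1
r$n   # 1
```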
Additional remarks for documenting methods and classes
I strongly recommend using packages roxygen2 and devtools to document your methods and classes. You possibly might also want to look into package roxygen3.
Documenting generic methods with roxygen2:
#' Foo
#'
#' This method takes \code{x} and \code{y} and adds them.
#'
#' Some details here
#'
#' @param x \strong{Signature argument}.
#' @param y \strong{Signature argument}.
#' @param ... Further arguments to be passed to subsequent functions.
#' @param .ctx \strong{Signature argument}.
#' Application context.
#' @param .ns \strong{Signature argument}.
#' Application namespace. Usually used to distinguish different context
#' versions or configurations.
#' @author Janko Thyson \email{john.doe@@something.com}
#' @references \url{http://www.something.com/}
#' @example inst/examples/foo.R
#' @docType methods
#' @rdname foo-methods
#' @export
setGeneric(
name="foo",
signature=c("x", "y", ".ctx", ".ns"),
def=function(x, y, ..., .ctx, .ns) {
standardGeneric("foo")
}
)
Documenting methods with roxygen2:
#' @param x \code{\link{character}}. Character vector.
#' @param y \code{\link{numeric}}. Numerical vector.
#' @param .ctx \code{\link{ApplicationContextDepartmentC}}.
#' @param .ns \code{\link{ProductionNamespace}}.
#' @return \code{\link{data.frame}}. Some data frame.
#' @rdname foo-methods
#' @aliases foo,character,numeric,ApplicationContextDepartmentC,ProductionNamespace-method
#' @export
setMethod(
f="foo",
signature=signature(x="character", y="numeric",
.ctx="ApplicationContextDepartmentC", .ns="ProductionNamespace"),
definition=function(x, y, ..., .ctx, .ns) {
data.frame(x, y)
}
)
Related
How to speed up writing to a matrix in a reference class in R
Here is a piece of R code that writes to each element of a matrix in a reference class. It runs incredibly slowly, and I’m wondering if I’ve missed a simple trick that will speed this up.
nx = 2000
ny = 10
ref_matrix <- setRefClass(
  "ref_matrix", fields = list(data = "matrix")
)
out <- ref_matrix(data = matrix(0.0,nx,ny))
#tracemem(out$data)
for (iy in 1:ny) {
  for (ix in 1:nx) {
    out$data[ix,iy] <- ix + iy
  }
}
It seems that each write to an element of the matrix triggers a check that involves a copy of the entire matrix. (Uncommenting the tracemem() call shows this.) Now, I’ve found a discussion that seems to confirm this: https://r-devel.r-project.narkive.com/8KtYICjV/rd-copy-on-assignment-to-large-field-of-reference-class and this also seems to be covered by Speeding up field access in R reference classes but in both of these this behaviour can be bypassed by not declaring a class for the field, and this works for the example in the first link which uses a 1D vector, b, which can just be set as b <<- 1:10000. But I’ve not found an equivalent way of creating a 2D array without using an explicit “matrix” instance. Am I just missing something simple, or is this actually not possible?
Let me add a couple of things. First, I’m very new to R, so could easily have missed something. Second, I’m really just curious about the way reference classes work in this case and whether there’s a simple way to use them efficiently; I’m not looking for a really fast way to set the elements of a matrix - I can do that by not having the matrix in a reference class at all, and if I really care about speed I can write a C routine to do it and call it from R.
Here’s some background that might explain why I’m interested in this, which you’re welcome to ignore. I got here by wanting to see how different languages, and even different compiler options and different ways of coding the same operation, compared for efficiency when accessing 2D rectangular arrays.
I’ve been playing with a test program that creates two 2D arrays of the same size, and calls a subroutine that sets the first to the elements of the second plus their index values. (Almost any operation would do, but this one isn’t completely trivial to optimise.) I have this in a number of languages now, C, C++, Julia, Tcl, Fortran, Swift, etc., even hand-coded assembler (spoiler alert: assembler isn’t worth the effort any more) and thought I’d try R. The obvious implementation in R passes the two arrays to a subroutine that does the work, but because R doesn’t normally pass by reference, that routine has to make a copy of the modified array and return that as the function value. I thought using a reference class would avoid the relatively minor overhead of that copy, so I tried that and was surprised to discover that, far from speeding things up, it slowed them down enormously.
Use outer:
out$data <- outer(1:nx, 1:ny, `+`)
Also, don't use reference classes (or R6 classes) unless you actually need reference semantics. KISS and all that.
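As a small check, using a plain matrix rather than the reference class, that a vectorised outer() call fills the matrix exactly as the double loop from the question does. Note the argument order: the row index range comes first so the result is nx by ny.

```r
nx <- 2000
ny <- 10

# Element-by-element fill, as in the question (plain matrix, no reference class).
looped <- matrix(0.0, nx, ny)
for (iy in 1:ny) {
  for (ix in 1:nx) {
    looped[ix, iy] <- ix + iy
  }
}

# Vectorised equivalent: entry [ix, iy] is ix + iy.
vectorised <- outer(1:nx, 1:ny, `+`)

stopifnot(identical(dim(vectorised), dim(looped)))
stopifnot(all(vectorised == looped))
```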
R copies for no apparent reason
Some R functions will make R copy the object AFTER the function call, like nrow, while some others don't, like sum. For example, the following code:
x = as.double(1:1e8)
system.time(x[1] <- 100)
y = sum(x)
system.time(x[1] <- 200) ## Fast (takes 0s), after calling sum
foo = function(x) {
  return(sum(x))
}
y = foo(x)
system.time(x[1] <- 300) ## Slow (takes 0.35s), after calling foo
Calling foo is NOT slow, because x isn't copied. However, changing x again is very slow, as x is copied. My guess is that calling foo will leave a reference to x, so when changing it after, R makes another copy.
Does anyone know why R does this, even when the function doesn't change x at all? Thanks.
I definitely recommend Hadley's Advanced R book, as it digs into some of the internals that you will likely find interesting and relevant. Most relevant to your question (and as mentioned by @joran and @lmo), the reason for the slow-down was an additional reference that forced copy-on-modify.
An excerpt that might be beneficial from Memory#Modification:
There are two possibilities:
R modifies x in place.
R makes a copy of x to a new location, modifies the copy, and then uses the name x to point to the new location.
It turns out that R can do either depending on the circumstances. In the example above, it will modify in place. But if another variable also points to x, then R will copy it to a new location. To explore what’s going on in greater detail, we use two tools from the pryr package. Given the name of a variable, address() will tell us the variable’s location in memory and refs() will tell us how many names point to that location.
Also of interest are the sections on R's C interface and Performance. The pryr package also has tools for working with these sorts of internals in an easier fashion.
One last note from Hadley's book (same Memory section) that might be helpful:
While determining that copies are being made is not hard, preventing such behaviour is. If you find yourself resorting to exotic tricks to avoid copies, it may be time to rewrite your function in C++, as described in Rcpp.
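The copy-on-modify rule from the excerpt can be seen with base R alone. A minimal sketch: tracemem() only reports copies when R was built with memory profiling, so it is guarded here, and the exact message it prints varies.

```r
x <- as.double(1:5)

# Ask R to report when this vector is duplicated, if the build supports it.
if (capabilities("profmem")) invisible(tracemem(x))

y <- x        # a second name now points at the same vector; nothing is copied yet
x[1] <- 100   # modifying x forces a copy, so y keeps the old values

if (capabilities("profmem")) untracemem(x)

x[1]  # 100
y[1]  # 1
```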
How to not fall into R's 'lazy evaluation trap'
"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B is in some way depending on the parameters with which A was called. The dependency is not as easy to see as in the above example, since calculations are complex and there are multiple parameters.
Overlooking such an issue leads to difficult to debug problems, since all calculations run smoothly - except that the result is incorrect. Only an explicit validation of the results reveals the problem.
What comes on top is that even if I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
  1:10, function(i) {
    fun.res <- function() print(i)
    environment(fun.res) <- list2env(as.list(environment())) # force parent env copy
    fun.res
  }
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But, one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force, just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun<-function(x,y,z){
  x;y;z;
  ## code
}
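To make the effect of forcing concrete, here is a small sketch with two hypothetical function factories built in a for loop, where the trap still shows up: one leaves its argument as an unevaluated promise, the other forces it.

```r
# Lazy factory: n stays an unevaluated promise until the closure is first called.
make_adder_lazy <- function(n) function(x) x + n

# Forcing factory: n is evaluated while the factory itself runs.
make_adder <- function(n) {
  force(n)
  function(x) x + n
}

lazy <- list()
forced <- list()
for (i in 1:3) {
  lazy[[i]] <- make_adder_lazy(i)
  forced[[i]] <- make_adder(i)
}

lazy[[1]](10)    # 13: the promise is evaluated now, when i is already 3
forced[[1]](10)  # 11
```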
There is some work in progress to improve R's higher order functions like the apply functions, Reduce, and such in handling situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. Should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall(). From the online R help documentation: forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
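A sketch of how forceAndCall() might be used in a hand-written higher order function; my_lapply() is a made-up name and only mimics the idea, it is not how base lapply is implemented.

```r
# Hypothetical apply-style helper: the first argument of FUN is forced
# at call time, so closures returned by FUN capture the current value.
my_lapply <- function(X, FUN, ...) {
  out <- vector("list", length(X))
  for (k in seq_along(X)) {
    out[[k]] <- forceAndCall(1, FUN, X[[k]], ...)
  }
  out
}

funs <- my_lapply(1:3, function(i) function() i)
funs[[1]]()  # 1
funs[[3]]()  # 3
```

With a plain FUN(X[[k]], ...) call in the loop, every returned closure would see the last element instead, because X[[k]] would only be evaluated once k had reached its final value.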
Analog to utility classes in R?
I have a couple of functions that convert between coordinate systems, and they all rely on constants from the WGS84 ellipsoid, etc. I'd rather not have these constants pollute the global namespace. Similarly, not all of the functions need to be visible globally.
In Java, I'd encapsulate all the coordinate stuff in a utility class and only expose the coordinate transformation methods.
What's a low-overhead way to do this in R? Ideally, I could:
source("coordinateStuff.R")
at the top of my file and call the "public" functions as needed. It might make a nice package down the road, but that's not a concern right now.
Edit for initial approach:
I started coords.R with:
coords <- new.env()
with(coords, {
  ## Semi-major axis (center to equator)
  a <- 6378137.0
  ## And so on...
})
The with statement and indentation clearly indicate that something is different about the assignment variables. And it sure beats typing a zillion assign statements.
The first cut at functions looked like:
ecef2geodetic <- function (x,y,z) {
  attach(coords)
  on.exit(detach(coords))
The on.exit() ensures that we'll leave coords when the function exits. But the attach() statements caused trouble when one function in coords called another in coords. See this question to see how things went from there.
Utility classes in Java are a code smell. This is not what you want in R. There are several ways of solving this in R.

For medium / large scale things, the way to go is to put your stuff into a package and use it in the remaining code. That encapsulates your “private” variables nicely and exposes a well-defined interface.

For smaller things, an excellent way of doing this is to put your code into a local call which, as the name suggests, executes its argument in a local scope:

    x <- 23
    result <- local({
      foo <- 42
      bar <- x
      foo * bar
    })

Finally, you can put your objects into a list or environment (there are differences but you may ignore them for now), and then just access them via listname$objname:

    coordinateStuff <- list(
      foo = function () { cat('42\n') },
      bar = 23
    )
    coordinateStuff$foo()

If you want something similar to your source statement, take a look at my xsource command which solves this to some extent (although it’s work in progress and has several issues!). This would allow you to write

    cs <- xsource(coordinateStuff)
    # Use cs as if it were an environment, e.g.
    cs$public_function()
    # or even:
    cs::public_function()
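Applied to the coordinate case, the local() and list ideas can be combined. This is only a sketch: the object name coords comes from the question, but geodetic2ecef and prime_vertical are names I made up, and only one direction of the conversion is shown.

```r
# Sketch only: geodetic2ecef and prime_vertical are illustrative names.
coords <- local({
  # "Private" WGS84 constants, visible only inside this block
  a  <- 6378137.0             # semi-major axis in metres
  f  <- 1 / 298.257223563     # flattening
  e2 <- f * (2 - f)           # first eccentricity squared

  # "Private" helper: prime vertical radius of curvature
  prime_vertical <- function(lat) a / sqrt(1 - e2 * sin(lat)^2)

  # "Public" function: geodetic (radians, metres) to ECEF (metres)
  geodetic2ecef <- function(lat, lon, h = 0) {
    n <- prime_vertical(lat)
    c(x = (n + h) * cos(lat) * cos(lon),
      y = (n + h) * cos(lat) * sin(lon),
      z = (n * (1 - e2) + h) * sin(lat))
  }

  # Only what is listed here is exposed to callers
  list(geodetic2ecef = geodetic2ecef)
})

equator <- coords$geodetic2ecef(lat = 0, lon = 0)
```

The constants and the helper live in the environment created by local(), so the functions can see each other without attach(), and nothing but coords lands in the global workspace.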
A package is the solution... But for a fast solution you could use Environments http://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html
Examples of the perils of globals in R and Stata
In recent conversations with fellow students, I have been advocating for avoiding globals except to store constants. This is a sort of typical applied statistics-type program where everyone writes their own code and project sizes are on the small side, so it can be hard for people to see the trouble caused by sloppy habits. In talking about avoidance of globals, I'm focusing on the following reasons why globals might cause trouble, but I'd like to have some examples in R and/or Stata to go with the principles (and any other principles you might find important), and I'm having a hard time coming up with believable ones.

Non-locality: Globals make debugging harder because they make understanding the flow of code harder.
Implicit coupling: Globals break the simplicity of functional programming by allowing complex interactions between distant segments of code.
Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions.

A useful answer to this question would be a reproducible and self-contained code snippet in which globals cause a specific type of trouble, ideally with another code snippet in which the problem is corrected. I can generate the corrected solutions if necessary, so the example of the problem is more important.

Relevant links: Global Variables are Bad; Are global variables bad?
I also have the pleasure of teaching R to undergraduate students who have no experience with programming. The problem I found was that most examples of when globals are bad are rather simplistic and don't really get the point across. Instead, I try to illustrate the principle of least astonishment. I use examples where it is tricky to figure out what is going on. Here are some examples:

I ask the class to write down what they think the final value of i will be:

    i = 10
    for(i in 1:5)
      i = i + 1
    i

Some of the class guess correctly. Then I ask: should you ever write code like this? In some sense i is a global variable that is being changed.

What does the following piece of code return?

    x = 5:10
    x[x=1]

The problem is what exactly we mean by x.

Does the following function return a global or local variable?

    z = 0
    f = function() {
      if(runif(1) < 0.5)
        z = 1
      return(z)
    }

Answer: both. Again, discuss why this is bad.
Oh, the wonderful smell of globals... All of the answers in this post gave R examples, and the OP wanted some Stata examples, as well. So let me chime in with these.

Unlike R, Stata does take care of locality of its local macros (the ones that you create with the local command), so the issue of "Is this a global z or a local z that is being returned?" never comes up. (Gosh... how can you R guys write any code at all if locality is not enforced???) Stata has a different quirk, though, namely that a non-existent local or global macro is evaluated as an empty string, which may or may not be desirable.

I have seen globals used for several main reasons:

Globals are often used as shortcuts for variable lists, as in

    sysuse auto, clear
    regress price $myvars

I suspect that the main usage of such a construct is for someone who switches between interactive typing and storing the code in a do-file as they try multiple specifications. Say they try regression with homoskedastic standard errors, heteroskedastic standard errors, and median regression:

    regress price mpg foreign
    regress price mpg foreign, robust
    qreg price mpg foreign

And then they run these regressions with another set of variables, then with yet another one, and finally they give up and set this up as a do-file myreg.do with

    regress price $myvars
    regress price $myvars, robust
    qreg price $myvars
    exit

to be accompanied with an appropriate setting of the global macro. So far so good; the snippet

    global myvars mpg foreign
    do myreg

produces the desirable results. Now let's say they email their famous do-file that claims to produce very good regression results to collaborators, and instruct them to type

    do myreg

What will their collaborators see? In the best case, the mean and the median of price, if they started a new instance of Stata (failed coupling: myreg.do did not really know you meant to run this with a non-empty variable list).
But if the collaborators had something in the works, and they too had a global myvars defined (name collision)... man, would that be a disaster.

Globals are used for directory or file names, as in:

    use $mydir\data1, clear

God only knows what will be loaded. In large projects, though, it does come in handy. You would want to define global mydir somewhere in your master do-file, maybe even as

    global mydir `c(pwd)'

Globals can be used to store unpredictable crap, like a whole command:

    capture $RunThis

God only knows what will be executed; let's just hope it is not ! format c:\. This is the worst case of implicit strong coupling, but since I am not even sure that RunThis will contain anything meaningful, I put a capture in front of it, and will be prepared to treat the non-zero return code _rc. (See, however, my example below.)

Stata's own use of globals is for God settings, like the type I error probability/confidence level: the global $S_level is always defined (and you must be a total idiot to redefine this global, although of course it is technically doable). This is, however, mostly a legacy issue with code of version 5 and below (roughly), as the same information can be obtained from a less fragile system constant:

    set level 90
    display $S_level
    display c(level)

Thankfully, globals are quite explicit in Stata, and hence are easy to debug and remove. In some of the above situations, and certainly in the first one, you'd want to pass parameters to do-files, which are seen as the local `0' inside the do-file. Instead of using globals in the myreg.do file, I would probably code it as

    unab varlist : `0'
    regress price `varlist'
    regress price `varlist', robust
    qreg price `varlist'
    exit

The unab thing will serve as an element of protection: if the input is not a legal varlist, the program will stop with an error message. In the worst cases I've seen, the global was used only once after having been defined.
There are occasions when you do want to use globals, because otherwise you'd have to pass the bloody thing to every other do-file or a program. One example where I found the globals pretty much unavoidable was coding a maximum likelihood estimator where I did not know in advance how many equations and parameters I would have. Stata insists that the (user-supplied) likelihood evaluator will have specific equations. So I had to accumulate my equations in the globals, and then call my evaluator with the globals in the descriptions of the syntax that Stata would need to parse: args lf $parameters where lf was the objective function (the log-likelihood). I encountered this at least twice, in the normal mixture package (denormix) and confirmatory factor analysis package (confa); you can findit both of them, of course.
One R example of a global variable that divides opinion is the stringsAsFactors issue on reading data into R or creating a data frame.

    set.seed(1)
    str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
                   DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
    options("stringsAsFactors" = FALSE)
    set.seed(1)
    str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
                   DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
    options("stringsAsFactors" = TRUE) ## reset

This can't really be corrected because of the way options are implemented in R: anything could change them without you knowing it, and thus the same chunk of code is not guaranteed to return exactly the same object. John Chambers bemoans this feature in his recent book.
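The option itself can't be locked down, but for data.frame() specifically one defensive habit is to pass stringsAsFactors explicitly, so the result no longer depends on whatever the option happens to be (its default also changed to FALSE in R 4.0.0). A small sketch, where make_frame and its argument are made-up names:

```r
# Sketch: pass the setting explicitly instead of relying on options().
# make_frame and strings_as_factors are hypothetical names.
make_frame <- function(strings_as_factors = FALSE) {
  data.frame(A = c("x", "y", "x"),
             DATES = as.character(as.Date("2020-01-01") + 0:2),
             stringsAsFactors = strings_as_factors)
}

as_text <- make_frame()
as_factor <- make_frame(strings_as_factors = TRUE)

class(as_text$A)     # character column, whatever the global option says
class(as_factor$A)   # factor column, because we asked for it
```

The same chunk now returns the same kind of object on any machine, at the cost of a slightly longer call.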
A pathological example in R is the use of one of the globals available in R, pi, to compute the area of a circle.

    > r <- 3
    > pi * r^2
    [1] 28.27433
    >
    > pi <- 2
    > pi * r^2
    [1] 18
    >
    > foo <- function(r) {
    +     pi * r^2
    + }
    > foo(r)
    [1] 18
    >
    > rm(pi)
    > foo(r)
    [1] 28.27433
    > pi * r^2
    [1] 28.27433

Of course, one can write the function foo() defensively by forcing the use of base::pi, but such recourse may not be available in normal user code unless packaged up and using a NAMESPACE:

    > foo <- function(r) {
    +     base::pi * r^2
    + }
    > foo(r = 3)
    [1] 28.27433
    > pi <- 2
    > foo(r = 3)
    [1] 28.27433
    > rm(pi)

This highlights the mess you can get into by relying on anything that is not solely in the scope of your function or passed in explicitly as an argument.
Here's an interesting pathological example involving replacement functions, the global assign, and x defined both globally and locally...

    x <- c(1,NA,NA,NA,1,NA,1,NA)
    local({
      #some other code involving some other x begin
      x <- c(NA,2,3,4)
      #some other code involving some other x end

      #now you want to replace NAs in the global/parent frame x with 0s
      x[is.na(x)] <<- 0
    })
    x
    [1] 0 NA NA NA 0 NA 1 NA

Instead of returning [1] 1 0 0 0 1 0 1 0, the replacement function uses the index returned by the local value of is.na(x) (recycled along the longer global vector), even though you're assigning to the global value of x. This behavior is documented in the R Language Definition.
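For contrast, a sketch of one way to sidestep the problem: do the replacement in a small function that takes the vector as an argument and returns the result, so there is only one x in play. replace_na is a hypothetical helper name.

```r
# Hypothetical helper: works only on what it is given, returns a new vector.
replace_na <- function(v, value = 0) {
  v[is.na(v)] <- value
  v
}

x <- c(1, NA, NA, NA, 1, NA, 1, NA)

local({
  x <- c(NA, 2, 3, 4)   # unrelated local x, no longer able to interfere
  invisible(x)
})

x <- replace_na(x)      # index and target are the same vector
```

After this, x is 1 0 0 0 1 0 1 0, which is what the original code was trying to get.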
One quick but convincing example in R is to run a line like:

    .Random.seed <- 'normal'

I chose 'normal' as something someone might choose, but you could use anything there. Now run any code that uses generated random numbers, for example:

    rnorm(10)

Then you can point out that the same thing could happen for any global variable. I also use the example of:

    x <- 27
    z <- somefunctionthatusesglobals(5)

Then ask the students what the value of x is; the answer is that we don't know.
Through trial and error I've learned that I need to be very explicit in naming my function arguments (and ensure enough checks at the start and along the function) to make everything as robust as possible. This is especially true if you have variables stored in the global environment, but then you try to debug a function with custom variables, and something doesn't add up! This is a simple example that combines bad checks and calling a global variable.

    glob.arg <- "snake"
    customFunction <- function(arg1) {
      if (is.numeric(arg1)) {
        glob.arg <- "elephant"
      }
      return(strsplit(glob.arg, "n"))
    }

    customFunction(arg1 = 1) #argument correct, expected results
    customFunction(arg1 = "rubble") #works, but may have unexpected results
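One possible tightened-up version, as a sketch rather than the only fix: the string to split becomes an explicit argument and is checked up front, so nothing falls through to a global. split_on_n is a made-up name.

```r
# Hypothetical rewrite: no global lookup, input validated before use.
split_on_n <- function(text) {
  stopifnot(is.character(text), length(text) == 1)
  strsplit(text, "n", fixed = TRUE)[[1]]
}

ok <- split_on_n("snake")   # "s" "ake"

# A wrong argument type now fails loudly instead of silently using a global.
bad <- tryCatch(split_on_n(1), error = function(e) "rejected")
```

The call with a bad argument stops with an error, which is easier to debug than a result quietly computed from glob.arg.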
An example sketch that came up while trying to teach this today. Specifically, this focuses on trying to give intuition as to why globals can cause problems, so it abstracts away as much as possible in an attempt to state what can and cannot be concluded just from the code (leaving the function as a black box).

The set up

Here is some code. Decide whether it will return an error or not based on only the criteria given.

The code

    stopifnot( all( x!=0 ) )
    y <- f(x)
    5/x

The criteria

Case 1: f() is a properly-behaved function, which uses only local variables.
Case 2: f() is not necessarily a properly-behaved function, which could potentially use global assignment.

The answer

Case 1: The code will not return an error, since line one checks that there are no x's equal to zero and line three divides by x.
Case 2: The code could potentially go wrong, since f() could e.g. subtract 1 from x and assign it back to the x in the parent environment, where any x element equal to 1 would then be set to zero and the third line would divide by zero. (Strictly speaking, R returns Inf for 5/0 rather than raising an error, so the failure would be silent, which is arguably worse.)
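A concrete version of Case 2, as a sketch. The body of f() here is invented purely to show the kind of misbehaviour described; the original leaves f() as a black box.

```r
# Hypothetical badly-behaved f(): it overwrites the caller's x via <<-.
f <- function(v) {
  x <<- v - 1
  sum(v)
}

x <- c(1, 2, 3)
stopifnot(all(x != 0))   # passes: no zeros yet
y <- f(x)                # looks harmless, but x is now 0, 1, 2
ratio <- 5 / x           # first element is a division by zero
```

The check on line one was correct when it ran, yet ratio still starts with Inf, and nothing in the three visible lines explains why.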
Here's one attempt at an answer that would make sense to statisticsy types.

Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions

First we define a log likelihood function,

    logLik <- function(x) {
      y <<- x^2+2
      return(sum(sqrt(y+7)))
    }

Now we write an unrelated function to return the sum of squares of an input. Because we're lazy we'll do this passing it y as a global variable,

    sumSq <- function() {
      return(sum(y^2))
    }

    y <<- seq(5)
    sumSq()
    [1] 55

Our log likelihood function seems to behave exactly as we'd expect, taking an argument and returning a value,

    > logLik(seq(12))
    [1] 88.40761

But what's up with our other function?

    > sumSq()
    [1] 63358

Of course, this is a trivial example, as will be any example that doesn't exist in a complex program. But hopefully it'll spark a discussion about how much harder it is to keep track of globals than locals.
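For the corrected counterpart, a sketch with the same arithmetic but no shared state: y stays local inside the likelihood, and the sum of squares takes its input as an argument. The names log_lik and sum_sq are mine, chosen to avoid masking stats::logLik.

```r
# Same computations as above, but nothing is written to or read from globals.
log_lik <- function(x) {
  y <- x^2 + 2            # local: invisible outside this call
  sum(sqrt(y + 7))
}

sum_sq <- function(y) {
  sum(y^2)
}

y <- seq(5)
before <- sum_sq(y)       # 55
ll <- log_lik(seq(12))    # does not touch the y defined above
after <- sum_sq(y)        # still 55
```

Calling log_lik() in between no longer changes what sum_sq() returns.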
In R you may also try to show them that there is often no need to use globals, as you can reach a variable in the function's enclosing environment from within the function itself just by changing the environment in which it is looked up. For example, in the code below, calling x() prints bbb (the local zz) and then aaa (the zz from the environment where x was defined):

    zz="aaa"
    x = function(y) {
      zz="bbb"
      cat("value of zz from within the function: \n")
      cat(zz , "\n")
      cat("value of zz from the function scope: \n")
      with(environment(x),cat(zz,"\n"))
    }