Recently I learned that in R there are no references; rather, all objects are immutable and each assignment makes a copy.
Uh-oh.
Copying large matrices over and over seems pretty horrible...
Now I'm paranoid, copy-pasting code all the time because I'm afraid of writing helper functions (is passing parameters an assignment? is returning a value an assignment?), and afraid of making helper variables if I'm not 100% sure the object would be copied anyway...
Example:
What I would love to make:
foo = function(someGivenLargeObject) {
  returnedMatrix = someGivenLargeObject$someLargeMatrix # <- BAD?!?!?!?!
  if(someCondition)
    returnedMatrix = operateOn(returnedMatrix)
  if(otherCondition)
    returnedMatrix = operateOn(returnedMatrix)
  returnedMatrix
}
What I'm making instead:
foo = function(someGivenLargeObject) { # <- still BAD?!?!?!
  returnedMatrix = NULL # <- No copy of someLargeMatrix is made!
  if(someCondition)
    returnedMatrix = operateOn(someGivenLargeObject$someLargeMatrix)
  if(otherCondition)
    returnedMatrix = operateOn(
      if(is.null(returnedMatrix))
        someGivenLargeObject$someLargeMatrix
      else
        returnedMatrix
    ) # <- ^ Incredible clutter! Unreadable!
  if(is.null(returnedMatrix))
    return(someGivenLargeObject$someLargeMatrix)
  else
    return(returnedMatrix) # <- does return copy stuff?!?!?!?!
}
The readability loss in the second version of the function is pretty striking IMO; yet is this the price to avoid the unnecessary copying of someLargeMatrix in case neither someCondition nor otherCondition holds? Because the line returnedMatrix = someGivenLargeObject$someLargeMatrix would necessitate this copying?
Or am I being paranoid, and may I safely go with the more readable version of the function because binding a name to someLargeMatrix doesn't necessitate copying? (BUT THERE ARE NO REFERENCES IN R!!!)
Also, I hope that a function call / function return doesn't copy stuff either?
Side note: just so that it is clear: I haven't yet run into an issue where I knew an object was copied unnecessarily in a situation like the one I described above. I'm just perplexed by having read that "there are no references in R", so this question is based on my worries about what the implications of this lack of references might be, rather than on any empirical observation.
Donald Knuth famously said "premature optimization is the root of all evil":
http://wiki.c2.com/?PrematureOptimization
It is good to be aware of this, but code clarity is in most cases more important.
R is usually smart enough to figure out when a copy is needed
(not all assignments cause a copy; only objects that are later modified get copied).
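You can watch this yourself with base R's tracemem(), which prints a message whenever the traced object is duplicated; a minimal sketch:
x <- matrix(rnorm(1e6), nrow = 1e3)
tracemem(x)    # start reporting whenever x gets duplicated
y <- x         # plain assignment: no copy is reported, x and y share memory
y[1, 1] <- 0   # a duplication is reported only now, when y is modified
Passing such an object to a function or returning it behaves the same way: the copy only happens once something modifies it.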
Using top, I manually measured the following memory usages at the specific points designated in the comments of the following code block:
x <- matrix(rnorm(1e9),nrow=1e4)
#~15gb
gc()
# ~7gb after gc()
y <- as.vector(x)
gc()
#~15gb after gc()
It's pretty clear that rnorm(1e9) is a ~7gb vector that's then copied to create the matrix. gc() removes the original vector since it's not assigned to anything. as.vector(x) then coerces and copies the data to a vector.
My question is, why can't these three objects all point to the same memory block (at least until one is modified)? Isn't a matrix really just a vector with some additional metadata?
This is in R version 3.6.2
edit: also tested in 4.0.3, same results.
The question you're asking is about the reasoning. That seems more suited for R-devel, and I am assuming the answer in return would be "no one knows". The relevant function in the R source is the do_asvector function.
Going down the source code of a call to as.vector(matrix(...)), it is important to note that the default argument for mode is "any". This translates to ANYSXP (see R Internals). This lets us find the evil culprit (line 1524) of the copy behaviour.
// source reference: do_asvector
...
if (type == ANYSXP || TYPEOF(x) == type) {
    switch(TYPEOF(x)) {
    case LGLSXP:
    case INTSXP:
    case REALSXP:
    case CPLXSXP:
    case STRSXP:
    case RAWSXP:
        if (ATTRIB(x) == R_NilValue) return x;
        ans = MAYBE_REFERENCED(x) ? duplicate(x) : x; // <== evil culprit
        CLEAR_ATTRIB(ans);
        return ans;
    case EXPRSXP:
    case VECSXP:
        return x;
    default:
        ;
    }
...
Going one step further, we can find the definition of MAYBE_REFERENCED in src/include/Rinternals.h, and by digging a bit we can find that it checks whether sxpinfo.named is equal to 0 (false) or not (true). My guess here is that the assignment operator <- increments the sxpinfo.named counter, and thus MAYBE_REFERENCED(x) returns TRUE and we get a duplicate (a deep copy).
However, Is this behaviour necessary?
That is a great question. If we had given mode an argument other than "any" or class(x) (the same as our input class), we would skip the duplicate line and continue down the function until we hit a call to ascommon. So I dug a bit further and took a look at the source code for ascommon: if we were to convert to a list manually (setting mode = "list"), ascommon only calls shallow_duplicate.
// Source reference: ascommon
...
    if ((type == LISTSXP) &&
        !(TYPEOF(u) == LANGSXP || TYPEOF(u) == LISTSXP ||
          TYPEOF(u) == EXPRSXP || TYPEOF(u) == VECSXP)) {
        if (MAYBE_REFERENCED(v)) v = shallow_duplicate(v); // <=== ascommon duplication behaviour
        CLEAR_ATTRIB(v);
    }
    return v;
}
...
So one could imagine that the call to duplicate in do_asvector could be replaced by a call to shallow_duplicate. Perhaps a "better safe than sorry" strategy was chosen when the code was originally implemented (prior to R-2.13.0, according to a comment in the source code), or perhaps there is a scenario in one of the types not handled by ascommon that requires a deep copy.
For now, I would test whether the function does a deep copy when we set mode = "list" or when we pass the matrix without assigning it to a name first. In either case it might not be a bad idea to send a follow-up question to the R-devel mailing list.
Edit: <- behaviour
I took the liberty of confirming my suspicion and looked at the source code for <-. I previously stated that I assumed <- increments sxpinfo.named, and we can confirm this by looking at do_set (the C source code for <-). When assigning as in x <- ..., x is a SYMSXP, and we can see that the source code calls INCREMENT_NAMED, which in turn calls SET_NAMED(x, NAMED(x) + 1). So, everything else equal, we should see copy behaviour for x <- matrix(...); y <- as.vector(x), while we shouldn't for y <- as.vector(matrix(...)).
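A quick way to check this from the prompt is tracemem(), which reports duplications; a small sketch based on the reading of the source above:
x <- matrix(rnorm(1e6), nrow = 1e3)
tracemem(x)          # report any duplication of x
y <- as.vector(x)    # x is bound to a name, so a duplicate is reported here
# With no intermediate binding there is nothing for MAYBE_REFERENCED to flag,
# so the conversion can, in principle, reuse the freshly created vector:
y2 <- as.vector(matrix(rnorm(1e6), nrow = 1e3))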
At the final gc(), you have x pointing to a vector with a dim attribute, and y pointing to a vector without any dim attribute. The data is an intrinsic part of the object, it's not an attribute, so those two vectors have to be different.
If matrices had been implemented as lists, e.g.
x <- list(data = rnorm(1e9), dim = c(1e4, 1e5))
then a shallow copy would be possible, but that's not how it was done. You can read the details of the internal structure of objects in the R Internals manual. For the current release, that's here: https://cloud.r-project.org/doc/manuals/r-release/R-ints.html#SEXPs .
You may wonder why things were implemented this way. I suspect it's intended to be efficient for the common use cases. Converting a matrix to a vector isn't generally necessary (you can treat x as a vector already, e.g. x[100000] and y[100000] will give the same value), so there's no need for "convert to vector" to be efficient. On the other hand, extracting elements is very common, so you don't want to have an extra pointer dereference slowing that down.
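To illustrate that last point, linear indexing already works on the matrix itself, so no conversion is needed just to read elements:
x <- matrix(1:12, nrow = 3)
y <- as.vector(x)
x[5]                   # 5: the matrix is indexed column-major, like a plain vector
identical(x[5], y[5])  # TRUE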
I have the situation where I have written an R function, ComplexResult, that computes a computationally expensive result that two other separate functions will later use, LaterFuncA and LaterFuncB.
I want to store the result of ComplexResult somewhere so that both LaterFuncA and LaterFuncB can use it, and it does not need to be recalculated. The result of ComplexResult is a large matrix that only needs to be calculated once, then re-used later on.
R is my first foray into the world of functional programming, so I'm interested to understand what is considered good practice. My first line of thinking is as follows:
# run ComplexResult and get the result
cmplx.res <- ComplexResult(arg1, arg2)
# store the result in the global environment.
# NB this would not be run from a function
assign("CachedComplexResult", cmplx.res, envir = .GlobalEnv)
Is this at all the right thing to do? The only other approach I can think of is having a large "wrapper" function, e.g.:
MyWrapperFunction <- function(arg1, arg2) {
  cmplx.res <- ComplexResult(arg1, arg2)
  res.a <- LaterFuncA(cmplx.res)
  res.b <- LaterFuncB(cmplx.res)
  # do more stuff here ...
}
Thoughts? Am I heading at all in the right direction with either of the above? Or is there an Option C which is more cunning? :)
The general answer is that you should serialize/deserialize your big object for further use. The R way to do this is saveRDS/readRDS:
## save a single object to file
saveRDS(cmplx.res, "cmplx.res.rds")
## restore it under a different name
cmplx2.res <- readRDS("cmplx.res.rds")
This assigns to the global environment:
CachedComplexResult <- ComplexResult(arg1, arg2)
To store I would use:
write.table(CachedComplexResult, file = "complex_res.txt")
And then to use it directly:
LaterFuncA(read.table("complex_res.txt"))
Your approach works for saving to local memory; other answers have explained saving to global memory or a file. Here are some thoughts on why you would do one or the other.
Save to file: this is slowest, so only do it if your process is volatile and you expect it to crash hard and need to pick up the pieces where it left off, or if you just need to save the state once in a while and speed/performance is not a concern.
Save to global: if you need access from multiple spots in a large R program.
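If you want an "Option C" that avoids both the global environment and a file, one sketch is to keep the cached result in a closure; ComplexResult, LaterFuncA and LaterFuncB are the functions from the question:
make_cached_complex <- function(arg1, arg2) {
  cache <- NULL
  function() {
    if (is.null(cache)) cache <<- ComplexResult(arg1, arg2)  # computed once
    cache                                                    # reused afterwards
  }
}
get.cmplx.res <- make_cached_complex(arg1, arg2)
res.a <- LaterFuncA(get.cmplx.res())
res.b <- LaterFuncB(get.cmplx.res())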
For a small function, it is trivial to just write a conditional statement based on the argument value. For example, I have a function that extracts variable labels from an ex-Stata data frame. There are two options for output: environment and df.
f_extract_stata_label <- function(df, output = "environment") {
  if (output == "environment") {
    lab_env <- new.env()
    for (i in seq_along(names(df))) {
      lab_env[[names(df)[i]]] <- attr(df, "var.labels")[i]
    }
    return(lab_env)
  } else if (output == "df") {
    lab_df <- data.frame(var.name = names(df),
                         var.label = attr(df, "var.labels"))
    return(lab_df)
  }
}
However, I suspect that this is not good R idiom. First, how the function depends on output is not clear: the reader has to read halfway through the code to find out. Second, adding options to output in the future would make the function very hard to read.
So how should I rewrite this function?
R uses this kind of pattern in its core stats libraries where "label" strings make sense. These are functions where R's dispatch system is not that useful. That said, what you want is still dispatch-like.
You could refactor it to use a switch that calls a function dedicated to a specific output type. Two things happen then. First, the extra function call makes it clear what context you're in when using the traceback. Second, it makes the functions smaller and easier to read.
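For example, the refactor could look like the sketch below; label_as_env and label_as_df are just illustrative helper names:
f_extract_stata_label <- function(df, output = c("environment", "df")) {
  output <- match.arg(output)   # valid options are visible in the signature
  switch(output,
         environment = label_as_env(df),
         df          = label_as_df(df))
}
label_as_env <- function(df) {
  lab_env <- new.env()
  labels <- attr(df, "var.labels")
  for (i in seq_along(names(df))) lab_env[[names(df)[i]]] <- labels[i]
  lab_env
}
label_as_df <- function(df) {
  data.frame(var.name = names(df), var.label = attr(df, "var.labels"))
}
Each helper stays small, and adding a new output type means adding one line to the switch and one new helper.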
I would question whether you really want to use a dispatch function though, and why separate direct functions are not appropriate.
I changed the for loop into an sapply call, but it failed.
I want to know why.
list.files("c:/",pattern="mp3$",recursive=TRUE,full.names=TRUE)->z
c()->w
left<-basename(z)
for (i in 1:length(z)){
if (is.element(basename(z[i]),left))
{
append(w,values=z[i])->w;
setdiff(left,basename(z[i]))->left
}}
print(w)
list.files("c:/",pattern="mp3$",recursive=TRUE,full.names=TRUE)->z
c()->w
left<-basename(z)
sapply(z,function(y){
if (is.element(basename(y),left))
{ append(w,values=y)->w;
setdiff(left,basename(y))->left
}})
print(w)
My rule for selecting music is: if the basename of a music file is the same as another's, save only one full.name, so unique cannot be used directly. There are two concepts here, full.name and basename of a file path, which can confuse people.
The problem you have here is that you want your function to have two side effects. By side effect, I mean modifying objects that are outside its scope: w and left.
Currently, w and left are only modified within the function's scope, and those modifications are lost as the function call ends.
Instead, you want to modify w and left outside the function's environment. For that you can use <<- instead of <-:
sapply(z, function(y) {
if (is.element(basename(y),left)) {
w <<- append(w, values = y)
left <<- setdiff(left, basename(y))
}
})
Note that I have been saying "you want" and "you can", but this is not what "you should" do. Functions with side effects are generally considered bad programming practice. Try reading about it.
Also, it is good to reserve the *apply tools to functions that can run their inputs independently. Here instead, you have an algorithm where the outcome of an iteration depends on the outcome of the previous ones. These are cases where you're better off using a for loop, unless you can rethink the algorithm in a framework that better suits *apply or can make use of functions that can handle such dependent situations: filter, unique, rle, etc.
For example, using unique, your code can be rewritten as:
base.names <- basename(z)
left <- unique(base.names)
w <- z[match(left, base.names)]
It also has the advantage of not growing an object incrementally inside a loop, another no-no in your current code.
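For completeness, the same selection can be written with duplicated(), which keeps the first full name seen for each basename:
w <- z[!duplicated(basename(z))]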
In PHP we can do error_reporting(E_ALL) or error_reporting(E_ALL|E_STRICT) to have warnings about suspicious code. In g++ you can supply -Wall (and other flags) to get more checking of your code. Is there some similar in R?
As a specific example, I was refactoring a block of code into some functions. In one of those functions I had this line:
if(nm %in% fields$non_numeric)...
Much later I realized that I had overlooked adding fields to the parameter list, but R did not complain about an undefined variable.
(Posting as an answer rather than a comment)
How about ?codetools::checkUsage (codetools is a built-in package) ... ?
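As a small illustration of what that reports for the situation in the question (the function body is just a sketch), checkUsage flags fields as a global with no visible binding:
library(codetools)
f <- function(nm) {
  if (nm %in% fields$non_numeric) "non-numeric" else "numeric"
}
checkUsage(f)
# note: no visible binding for global variable 'fields' (wording approximate)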
This is not really an answer; I just can't resist showing how you could declare globals explicitly. @Ben Bolker should post his comment as the answer.
To avoid seeing globals, you can move a function "up" one environment: it will still be able to see all the standard functions and such (mean, etc.), but not anything you put in the global environment:
explicit.globals = function(f) {
  name = deparse(substitute(f))
  env = parent.frame()
  enclos = parent.env(.GlobalEnv)  # the environment just above the global one
  environment(f) = enclos          # f can no longer see global variables
  env[[name]] = f
}
Then getting a global is just retrieving it from .GlobalEnv:
global = function(n) {
  name = deparse(substitute(n))
  env = parent.frame()
  env[[name]] = get(name, .GlobalEnv)
}
assign('global', global, envir = baseenv())
And it would be used like
a = 2
b = 3
f = function() {
  global(a)
  a
  b
}
explicit.globals(f)
And called like
> f()
Error in f() : object 'b' not found
I personally wouldn't go for this but if you're used to PHP it might make sense.
Summing up, there is really no correct answer: as Owen and gsk3 point out, R functions will use globals if a variable is not in the local scope. This may be desirable in some situations, so how could the "error" be pointed out?
checkUsage() does nothing that R's built-in error-checking does not (in this case). checkUsageEnv(.GlobalEnv) is a useful way to check a file of helper functions (and might be great as a pre-hook for svn or git; or as part of an automated build process).
I feel the best solution when refactoring is to start by moving all global code into a function (e.g. call it main()), so that the only remaining global code is the call to that function. Do this first, then start extracting functions, etc.
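A tiny sketch of that pattern, with hypothetical helper names:
main <- function() {
  fields  <- load_fields()              # hypothetical helper created by the refactor
  results <- summarise_fields(fields)   # hypothetical helper
  print(results)
}
main()   # the only statement left at the top level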