R attribute ".Environment" consuming large amounts of RAM in nnet package

I have a piece of code that is using the nnet package. I am interested in calculating a number of different neural network models and then saving all the models to disk (with save()).
The issue I am running into is that the "terms" element of the neural network has an attribute ".Environment" that ends up being hundreds of megabytes, whereas the rest of the model is only a few kilobytes (once the fitted values and residuals are deleted).
Further, deleting the ".Environment" attribute doesn't appear to cause a problem in terms of using the model with 'predict'.
Does anyone have any idea what either R or nnet is doing with this attribute? Has anyone seen anything like this?

tl;dr: removing the .Environment attribute is OK, except in some very special cases
Background
The .Environment attribute in R contains a reference to the context in which an R closure (usually a formula or a function) was defined. An R environment is a store holding values of variables, similarly to a list. This allows the formula to refer to these variables, for example:
> f = function(g) return(y ~ g(x))
> form = f(exp)
> lm(form, list(y=1:10, x=log(1:10)))
...
Coefficients:
(Intercept) g(x)
3.37e-15 1.00e+00
In this example, the formula form is defined as y~exp(x), by giving g the value of exp. In order to be able to find the value of g (which is an argument to function f), the formula needs to hold a reference to the environment constructed inside the call to function f.
You can see the environment attached to a formula by using the attributes() or environment() functions as follows:
> attributes(form)
$class
[1] "formula"
$.Environment
<environment: R_GlobalEnv>
> environment(form)
<environment: R_GlobalEnv>
Your question
I believe you are using the nnet() function variant with a formula (rather than matrices), i.e.
> nnet(y ~ x1 + x2, ...)
Unfortunately, R keeps the entire environment (including all the variables defined where your formula is defined) allocated, even if your formula does not refer to any of it. There is no way for the language to easily tell what you may or may not be using from the environment.
One solution is to explicitly retain only the required parts of the environment. In particular, if your formula does not refer to anything in the environment (which is the most common case), it is safe to remove it.
I would suggest removing the environment from your formula before you call nnet, something like this:
form = y ~ x + z
environment(form) = NULL  # drop the captured environment; safe when the formula refers to nothing defined there
...
result = nnet(form, ...)
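If you want to see the effect before committing to save(), a quick check along these lines (a sketch, not from the original post; make_form and the size figures are illustrative) compares the serialized size of the formula with and without its captured environment:
make_form <- function() {
    big <- rnorm(1e6)             # large object captured in the formula's environment
    y ~ x1 + x2
}
form <- make_form()
length(serialize(form, NULL))     # several megabytes: 'big' is serialized along with the formula
environment(form) <- globalenv()  # or NULL as above, if the formula needs nothing from it
length(serialize(form, NULL))     # now only a few hundred bytes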

Related

what data structure does model formula operator in R create?

The R Language Definition makes several mentions of the model formula operator, but fails to define or explain anything about the formula class.
I am having a hard time finding anything that documents the semantics of the ~ operator from either official or unofficial sources.
In particular, I am not interested in information like is provided in the formula function documentation ("An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.") or usage scenarios, I'd like to understand what kind of data structure I am creating when using it and how I could inspect and dissect it on the REPL.
Don't know if this helps, but: it's a language object — i.e. R parses the input but doesn't try to evaluate any of the components — with class "formula"
> f <- a ~ b + (c + d)
> str(f)
Class 'formula' language a ~ b + (c + d)
..- attr(*, ".Environment")=<environment: R_GlobalEnv>
If you want to work with these objects, you need to know that a formula is essentially stored as a tree: the parent node, an operator or function (~, +, ( in this example), can be extracted as the first element, and the child nodes (as many as the 'arity' of the function/operator) can be extracted as elements 2..n, i.e.
f[[1]] is ~
f[[2]] is a (the first argument, i.e. the LHS of the formula)
f[[3]] is b + (c+d)
f[[3]][[1]] is +
f[[3]][[2]] is b
... and so on.
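A short interactive sketch (using only the pieces described above) of how you might dissect such an object on the REPL:
f <- a ~ b + (c + d)   # same f as above
f[[1]]                 # `~`
f[[2]]                 # a  (the LHS)
f[[3]]                 # b + (c + d)  (the RHS)
f[[3]][[1]]            # `+`
as.list(f[[3]])        # splits the RHS call into operator and arguments
class(f[[2]])          # "name"
class(f[[3]])          # "call"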
The chapter on Expressions in Hadley Wickham's Advanced R gives a more complete description.
This is also discussed (more opaquely) in the R Language Manual, e.g.
Expression objects
Direct manipulation of language objects
@user2554330 points out that formulas also typically have associated environments; that is, they carry along a pointer to the environment in which they were created.

Differences between different types of functions in R

I would appreciate help understanding the main differences between several types of functions in R.
I'm somewhat overwhelmed among the definitions of different types of functions and it has become somewhat difficult to understand how different types of functions might relate to each other.
Specifically, I'm confused about the relationships and differences between the following types of functions:
Generic vs. Method: based on the class of the input argument(s), a generic function uses method dispatch to call an appropriate method function.
Invisible vs. Visible
Primitive vs. Internal
I'm confused about how these different types of functions relate to each other (if at all) and what the various differences and overlaps are between them.
Here's some documentation about primitive vs internal: http://www.biosino.org/R/R-doc/R-ints/_002eInternal-vs-_002ePrimitive.html
Generics are functions that dispatch on the class of the object they are applied to; each class can supply its own methods, and the generic picks the appropriate one. You can see the specific methods associated with a generic with the "methods" function:
methods(print)
This will list all the methods associated with the generic "print".
Alternatively, you can see all the methods defined for a given class with this call:
methods(, "lm")
where "lm" is the class of linear model objects.
Here's an example:
x <- rnorm(100)
y <- 1 + .4*x + rnorm(100,0,.1)
mod1 <- lm(y~x)
print(mod1)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.002 0.378
print.lm(mod1)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.002 0.378
Both print(mod1) (the generic call) and print.lm(mod1) (the direct call to the class method) do the same thing. (In recent versions of R, print.lm is not exported from stats, so the direct call would need stats:::print.lm(mod1) or getS3method("print", "lm").) Why does R allow both? I don't really know, but that's the difference between method and generic as I understand it.
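To make the generic/method relationship concrete, here is a minimal sketch with a made-up generic area() and a hypothetical class "circle" (none of this comes from the original post):
area <- function(x, ...) UseMethod("area")         # the generic: it only dispatches
area.circle <- function(x, ...) pi * x$r^2         # method for class "circle"
area.default <- function(x, ...) stop("no area method for this class")
shp <- structure(list(r = 2), class = "circle")
area(shp)       # dispatches to area.circle: 12.56637
methods(area)   # lists area.circle and area.default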

R: temporarily overriding functions and scope/namespace

Consider the following R code:
local({
    lm <- function(x) x^2
    lm(10)
})
This temporarily overrides the lm function, but once local has been executed it will "be back to normal". I am wondering why the same approach does not seem to work in this next simple example:
require(car)
model <- lm(len ~ dose, data=ToothGrowth)
local({
    vcov <- function(x) hccm(x) # robust var-cov matrix
    confint(model) # confint will call vcov, but not the above one.
})
The confint function uses the vcov function to obtain standard errors for the coefficients, and the idea is to use a robust var-cov matrix by temporarily overriding vcov, without doing things "manually" or altering functions.
Both vcov and confint are generic functions; I don't know if that is the reason it does not work as intended. It is not the specific example I am interested in as such, but rather the conceptual lesson. Is this a namespace or a scope "issue"?
We show how to do this using proxy objects (see Proxies section of this document), first using the proto package and then without:
1) proto. Since confint.lm is calling vcov we need to ensure that (a) our new replacement for vcov is in the revised confint.lm's environment and (b) the revised confint.lm can still access the objects from its original environment. (For example, confint.lm calls the hidden function format.perc in stats, so if we did not arrange for the second point to be true, that hidden function could not be accessed.)
To perform the above we make a new confint.lm which is the same except that it has a new environment (the proxy environment) which contains our replacement vcov and whose parent in turn is the original confint.lm environment. Below, the proxy environment is implemented as a proto object, where the key things to know are: (a) proto objects are environments and (b) placing a function in a proto object in the way shown changes its environment to be that proto object. Also, to avoid any problems from S3 dispatch of confint to confint.lm, we call the confint.lm method directly.
Although hccm does not give a visibly different result here, we can verify that it was run from the output of trace:
library(car)
library(proto)
trace(hccm)
model <- lm(len ~ dose, data=ToothGrowth)
proto(environment(stats:::confint.lm),   # set parent
      vcov = function(x) hccm(x),        # robust var-cov matrix
      confint.lm = stats:::confint.lm)[["confint.lm"]](model)
For another example, see example 2 here.
2) environments. The code is a bit more onerous without proto (in fact it roughly doubles the code size) but here it is:
library(car)
trace(hccm)
model <- lm(len ~ dose, data=ToothGrowth)
local({
    vcov <- function(x) hccm(x)               # robust var-cov matrix
    confint.lm <- stats:::confint.lm
    environment(confint.lm) <- environment()  # proxy environment containing our vcov
    confint.lm(model)                         # now picks up the local vcov (trace shows hccm runs)
}, envir = new.env(parent = environment(stats:::confint.lm)))
This is because the functions confint and vcov are both in the namespace "stats". The confint() you call here effectively gets stats::vcov, pretty much because that is what namespaces are for - you are allowed to write your own versions of things but not to the detriment of otherwise expected behaviour.
In your first example, you can still safely call other functions that rely on stats::lm and that will not get upset by your local modification.
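As a small sketch of the same point, reusing nothing beyond lm() and confint(): a plain local redefinition of vcov is simply invisible to confint.lm, because that method looks symbols up through the stats namespace rather than through our calling environment.
m <- lm(dist ~ speed, data = cars)
local({
    vcov <- function(x) stop("this is never called")
    confint(m)    # works fine and uses stats::vcov, ignoring the local one
})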

S3 and order of classes

I've always had trouble understanding the documentation on how S3 methods are called, and this time it's biting me back.
I'll apologize up front for asking more than one question, but they are all closely related. Deep in the heart of a complex set of functions, I create a lot of glmnet fits, in particular logistic ones. Now, glmnet documentation specifies its return value to have both classes "glmnet" and (for logistic regression) "lognet". In fact, these are specified in this order.
However, looking at the end of the implementation of glmnet, right after the call to the internal function lognet (which sets the class of fit to "lognet"), I see this line of code just before the return (of the variable fit):
class(fit) = c(class(fit), "glmnet")
From this, I would conclude that the order of the classes is in fact "lognet", "glmnet".
Unfortunately, the fit I had, had (like the doc suggests):
> class(myfit)
[1] "glmnet" "lognet"
The problem with this is the way S3 methods are dispatched for it, in particular predict. Here's the code for predict.lognet:
function (object, newx, s = NULL, type = c("link", "response",
    "coefficients", "class", "nonzero"), exact = FALSE, offset,
    ...)
{
    type = match.arg(type)
    nfit = NextMethod("predict") #<- supposed to call predict.glmnet, I think
    switch(type, response = {
        pp = exp(-nfit)
        1/(1 + pp)
    }, class = ifelse(nfit > 0, 2, 1), nfit)
}
I've added a comment to explain my reasoning. Now, when I call predict on this myfit with a new data matrix mydata and type="response", like this:
predict(myfit, newx=mydata, type="response")
I do not, as per the documentation, get the predicted probabilities, but the linear combinations, which is exactly the result of calling predict.glmnet directly.
I've tried reversing the order of the classes, like so:
orgclass<-class(myfit)
class(myfit)<-rev(orgclass)
And then doing the predict call again: lo and behold: it works! I do get the probabilities.
So, here come some questions:
1) Am I right in 'having learned' that S3 methods are dispatched in the order of appearance of the classes?
2) Am I right in assuming the code in glmnet would cause the wrong order for correct dispatching of predict?
3) In my code there is nothing that manipulates classes explicitly/visibly to my knowledge. What could cause the order to change?
For completeness' sake: here's some sample code to play around with (as I'm doing myself now):
library(glmnet)
y<-factor(sample(2, 100, replace=TRUE))
xs<-matrix(runif(100), ncol=1)
colnames(xs)<-"x"
myfit<-glmnet(xs, y, family="binomial")
mydata<-matrix(runif(10), ncol=1)
colnames(mydata)<-"x"
class(myfit)
predict(myfit, newx=mydata, type="response")
class(myfit)<-rev(class(myfit))
class(myfit)
predict(myfit, newx=mydata, type="response")
class(myfit)<-rev(class(myfit))#set it back
class(myfit)
Depending on the data generated, the difference is more or less obvious (in my true dataset I noticed negative values in the so called probabilities, which is how I picked up the problem), but you should indeed see a difference.
Thanks for any input.
Edit:
I just found out the horrible truth: either order worked in glmnet 1.5.2 (which is present on the server where I ran the actual code, resulting in the fit with the class order reversed), but the code from 1.6 requires the order to be "lognet", "glmnet". I have yet to check what happens in 1.7.
Thanks to @Aaron for reminding me of the basics of informatics (besides 'if all else fails, restart': 'check your versions'; I had mistakenly assumed that a package by the gods of statistical learning would be protected from this type of error), and to @Gavin for confirming my reconstruction of how S3 works.
Yes, the order of dispatch is in the order in which the classes are listed in the class attribute. In the simple, every-day case, yes, the first stated class is the one chosen first by method dispatch, and only if it fails to find a method for that class (or NextMethod is called) will it move on to the second class, or failing that search for a default method.
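As a quick illustration of that dispatch order (a toy sketch with made-up classes "child" and "parent", not glmnet):
print.child  <- function(x, ...) { cat("child method, then: "); NextMethod() }
print.parent <- function(x, ...) cat("parent method\n")
obj <- structure(list(), class = c("child", "parent"))
print(obj)                     # "child method, then: parent method"
class(obj) <- rev(class(obj))
print(obj)                     # "parent method" only: dispatch now starts at "parent"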
No, I do not think you are right that the order of the classes is wrong in the code. The documentation appears wrong. The intent is clearly to call predict.lognet() first, use the workhorse predict.glmnet() to do the basic computations for all types of lasso/elastic net models fitted by glmnet, and finally do some post-processing of those general predictions. The fact that predict.glmnet() is not exported from the glmnet NAMESPACE, whilst the other methods are, is perhaps telling, too.
I'm not sure why you think the output from this:
predict(myfit, newx=mydata, type="response")
is wrong? I get a matrix of 10 rows and 21 columns, with the columns relating to the intercept-only model prediction plus predictions at 20 values of lambda at which model coefficients along the lasso/elastic net path have been computed. These do not seem to be linear combinations and are on the response scale, as you requested.
The order of the classes is not changing. I think you are misunderstanding how the code is supposed to work. There is a bug in the documentation, as the ordering is stated wrong there. But the code is working as I think it should.

Examples of the perils of globals in R and Stata

In recent conversations with fellow students, I have been advocating for avoiding globals except to store constants. This is a sort of typical applied statistics-type program where everyone writes their own code and project sizes are on the small side, so it can be hard for people to see the trouble caused by sloppy habits.
In talking about avoidance of globals, I'm focusing on the following reasons why globals might cause trouble, but I'd like to have some examples in R and/or Stata to go with the principles (and any other principles you might find important), and I'm having a hard time coming up with believable ones.
Non-locality: Globals make debugging harder because they make understanding the flow of code harder
Implicit coupling: Globals break the simplicity of functional programming by allowing complex interactions between distant segments of code
Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions
A useful answer to this question would be a reproducible and self-contained code snippet in which globals cause a specific type of trouble, ideally with another code snippet in which the problem is corrected. I can generate the corrected solutions if necessary, so the example of the problem is more important.
Relevant links:
Global Variables are Bad
Are global variables bad?
I also have the pleasure of teaching R to undergraduate students who have no experience with programming. The problem I found was that most examples of when globals are bad are rather simplistic and don't really get the point across.
Instead, I try to illustrate the principle of least astonishment. I use examples where it is tricky to figure out what was going on. Here are some examples:
I ask the class to write down what they think the final value of i will be:
i = 10
for(i in 1:5)
    i = i + 1
i
Some of the class guess correctly. Then I ask should you ever write code like this?
In some sense i is a global variable that is being changed.
What does the following piece of code return:
x = 5:10
x[x=1]
The problem is: what exactly do we mean by x?
Does the following function return a global or local variable:
z = 0
f = function() {
    if(runif(1) < 0.5)
        z = 1
    return(z)
}
Answer: both. Again discuss why this is bad.
Oh, the wonderful smell of globals...
All of the answers in this post gave R examples, and the OP wanted some Stata examples, as well. So let me chime in with these.
Unlike R, Stata does take care of the locality of its local macros (the ones that you create with the local command), so the issue of "Is this a global z or a local z that is being returned?" never comes up. (Gosh... how can you R guys write any code at all if locality is not enforced???) Stata has a different quirk, though, namely that a non-existent local or global macro is evaluated as an empty string, which may or may not be desirable.
I have seen globals used for several main reasons:
Globals are often used as shortcuts for variable lists, as in
sysuse auto, clear
regress price $myvars
I suspect that the main usage of such a construct is by someone who switches between interactive typing and storing the code in a do-file while trying multiple specifications. Say they try regression with homoskedastic standard errors, heteroskedastic standard errors, and median regression:
regress price mpg foreign
regress price mpg foreign, robust
qreg price mpg foreign
And then they run these regressions with another set of variables, then with yet another one, and finally they give up and set this up as a do-file myreg.do with
regress price $myvars
regress price $myvars, robust
qreg price $myvars
exit
to be accompanied with an appropriate setting of the global macro. So far so good; the snippet
global myvars mpg foreign
do myreg
produces the desirable results. Now let's say they email their famous do-file that claims to produce very good regression results to collaborators, and instruct them to type
do myreg
What will their collaborators see? In the best case, the mean and the median of price, if they started a new instance of Stata (failed coupling: myreg.do did not really know you meant to run it with a non-empty variable list). But if the collaborators had something in the works, and they, too, had a global myvars defined (name collision)... man, would that be a disaster.
Globals are used for directory or file names, as in:
use $mydir\data1, clear
God only knows what will be loaded. In large projects, though, it does come in handy. You would want to define global mydir somewhere in your master do-file, maybe even as
global mydir `c(pwd)'
Globals can be used to store unpredictable crap, like a whole command:
capture $RunThis
God only knows what will be executed; let's just hope it is not ! format c:\. This is the worst case of implicit strong coupling, but since I am not even sure that RunThis will contain anything meaningful, I put a capture in front of it, and will be prepared to treat the non-zero return code _rc. (See, however, my example below.)
Stata's own use of globals is for God settings, like the type I error probability/confidence level: the global $S_level is always defined (and you must be a total idiot to redefine this global, although of course it is technically doable). This is, however, mostly a legacy issue with code of version 5 and below (roughly), as the same information can be obtained from a less fragile system constant:
set level 90
display $S_level
display c(level)
Thankfully, globals are quite explicit in Stata, and hence are easy to debug and remove. In some of the above situations, and certainly in the first one, you'd want to pass parameters to do-files which are seen as the local `0' inside the do-file. Instead of using globals in the myreg.do file, I would probably code it as
unab varlist : `0'
regress price `varlist'
regress price `varlist', robust
qreg price `varlist'
exit
The unab thing will serve as an element of protection: if the input is not a legal varlist, the program will stop with an error message.
In the worst cases I've seen, the global was used only once after having been defined.
There are occasions when you do want to use globals, because otherwise you'd have to pass the bloody thing to every other do-file or a program. One example where I found the globals pretty much unavoidable was coding a maximum likelihood estimator where I did not know in advance how many equations and parameters I would have. Stata insists that the (user-supplied) likelihood evaluator will have specific equations. So I had to accumulate my equations in the globals, and then call my evaluator with the globals in the descriptions of the syntax that Stata would need to parse:
args lf $parameters
where lf was the objective function (the log-likelihood). I encountered this at least twice, in the normal mixture package (denormix) and confirmatory factor analysis package (confa); you can findit both of them, of course.
One R example of a global variable that divides opinion is the stringsAsFactors issue on reading data into R or creating a data frame.
set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
               DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
options("stringsAsFactors" = FALSE)
set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
               DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
options("stringsAsFactors" = TRUE) ## reset
This can't really be corrected because of the way options are implemented in R - anything could change them without you knowing it and thus the same chunk of code is not guaranteed to return exactly the same object. John Chambers bemoans this feature in his recent book.
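The usual defensive move (a sketch, not a fix for the option mechanism itself) is to pass the argument explicitly, so the result no longer depends on whatever options() happens to hold; incidentally, R 4.0.0 changed the default to FALSE and deprecated the global option, which is arguably the ultimate fix:
set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
               DATES = as.character(seq(Sys.Date(), length = 100, by = "days")),
               stringsAsFactors = FALSE))   # explicit, so the global option is irrelevant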
A pathological example in R is the use of one of the globals available in R, pi, to compute the area of a circle.
> r <- 3
> pi * r^2
[1] 28.27433
>
> pi <- 2
> pi * r^2
[1] 18
>
> foo <- function(r) {
+ pi * r^2
+ }
> foo(r)
[1] 18
>
> rm(pi)
> foo(r)
[1] 28.27433
> pi * r^2
[1] 28.27433
Of course, one can write the function foo() defensively by forcing the use of base::pi but such recourse may not be available in normal user code unless packaged up and using a NAMESPACE:
> foo <- function(r) {
+ base::pi * r^2
+ }
> foo(r = 3)
[1] 28.27433
> pi <- 2
> foo(r = 3)
[1] 28.27433
> rm(pi)
This highlights the mess you can get into by relying on anything that is not solely in the scope of your function or passed in explicitly as an argument.
Here's an interesting pathological example involving replacement functions, global assignment (<<-), and x defined both globally and locally...
x <- c(1,NA,NA,NA,1,NA,1,NA)
local({
    #some other code involving some other x begin
    x <- c(NA,2,3,4)
    #some other code involving some other x end
    #now you want to replace NAs in the global/parent frame x with 0s
    x[is.na(x)] <<- 0
})
x
[1] 0 NA NA NA 0 NA 1 NA
Instead of returning [1] 1 0 0 0 1 0 1 0, the replacement function uses the index returned by the local value of is.na(x), even though you're assigning to the global value of x. This behavior is documented in the R Language Definition.
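One way to correct it (a sketch; there are several ways) is to compute the index against the object you actually intend to modify, by referring to the global x explicitly:
x <- c(1,NA,NA,NA,1,NA,1,NA)
local({
    x <- c(NA,2,3,4)                        # unrelated local x
    gx <- get("x", envir = globalenv())     # fetch the x we really mean
    assign("x", replace(gx, is.na(gx), 0), envir = globalenv())
})
x
[1] 1 0 0 0 1 0 1 0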
One quick but convincing example in R is to run the line like:
.Random.seed <- 'normal'
I chose 'normal' as something someone might choose, but you could use anything there.
Now run any code that uses generated random numbers, for example:
rnorm(10)
Then you can point out that the same thing could happen for any global variable.
I also use the example of:
x <- 27
z <- somefunctionthatusesglobals(5)
Then ask the students what the value of x is; the answer is that we don't know.
Through trial and error I've learned that I need to be very explicit in naming my function arguments (and to ensure enough checks at the start of, and throughout, the function) to make everything as robust as possible. This is especially true if you have variables stored in the global environment but then try to debug a function with custom values - and something doesn't add up! This is a simple example that combines bad checks with calling a global variable.
glob.arg <- "snake"
customFunction <- function(arg1) {
    if (is.numeric(arg1)) {
        glob.arg <- "elephant"
    }
    return(strsplit(glob.arg, "n"))
}
customFunction(arg1 = 1) #argument correct, expected results
customFunction(arg1 = "rubble") #works, but may have unexpected results
Here is an example sketch that came up while trying to teach this today. It focuses on giving intuition as to why globals can cause problems, abstracting away as much as possible and asking what can and cannot be concluded just from the code (leaving the function as a black box).
The set up
Here is some code. Decide whether it will return an error or not based on only the criteria given.
The code
stopifnot( all( x!=0 ) )
y <- f(x)
5/x
The criteria
Case 1: f() is a properly-behaved function, which uses only local variables.
Case 2: f() is not necessarily a properly-behaved function, which could potentially use global assignment.
The answer
Case 1: The code will not return an error, since line one checks that there are no x's equal to zero and line three divides by x.
Case 2: The code could potentially misbehave or error, since f() could, e.g., subtract 1 from x and assign it back to the x in the parent environment; any x element equal to 1 would then become zero, and the guarantee established by the check on the first line would no longer hold for the division on the third line.
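To make Case 2 concrete, here is one badly-behaved f() of the kind described (a sketch; note that in R dividing by zero yields Inf rather than an error, but the invariant established by stopifnot() is still silently broken):
f <- function(x) {
    x <<- x - 1        # global assignment: silently rewrites the caller's x
    sum(x)
}
x <- c(1, 2, 3)
stopifnot( all( x!=0 ) )
y <- f(x)
5/x                    # first element is now Inf: the check no longer protects the division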
Here's one attempt at an answer that would make sense to statisticsy types.
Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions
First we define a log likelihood function,
logLik <- function(x) {
    y <<- x^2 + 2             # global assignment: clobbers any y in the workspace
    return(sum(sqrt(y + 7)))
}
Now we write an unrelated function to return the sum of squares of its input. Because we're lazy, we'll do this by passing it y as a global variable:
sumSq <- function() {
    return(sum(y^2))
}
y <<- seq(5)
sumSq()
[1] 55
Our log likelihood function seems to behave exactly as we'd expect, taking an argument and returning a value,
> logLik(seq(12))
[1] 88.40761
But what's up with our other function?
> sumSq()
[1] 633538
Of course, this is a trivial example, as will be any example that doesn't exist in a complex program. But hopefully it'll spark a discussion about how much harder it is to keep track of globals than locals.
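The fix, as the question's framing suggests, is simply to pass the data explicitly (a one-line sketch):
sumSqSafe <- function(y) sum(y^2)   # y is an explicit argument, not a global
sumSqSafe(seq(5))
[1] 55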
In R you may also try to show them that there is often no need to use globals, since you can reach variables in a function's enclosing environment from within the function itself simply by referring to that environment explicitly. For example, the code below prints both the local zz and the zz from the environment in which x was defined:
zz="aaa"
x = function(y) {
zz="bbb"
cat("value of zz from within the function: \n")
cat(zz , "\n")
cat("value of zz from the function scope: \n")
with(environment(x),cat(zz,"\n"))
}

Resources