R: temporarily overriding functions and scope/namespace

Consider the following R code:
local({
  lm <- function(x) x^2
  lm(10)
})
This temporarily overrides the lm function, but once local has been executed it will "be back to normal". I am wondering why the same approach does not seem to work in this next simple example:
require(car)
model <- lm(len ~ dose, data=ToothGrowth)
local({
vcov <- function(x) hccm(x) #robust var-cov matrix
confint(model) # confint will call vcov, but not the above one.
})
The confint function uses the vcov function to obtain standard errors for the coefficients, and the idea is to use a robust var-cov matrix by temporarily overriding vcov, without doing things "manually" or altering functions.
Both vcov and confint are generic functions; I don't know if that is the reason it does not work as intended. It is not the specific example I am interested in as such, but rather the conceptual lesson. Is this a namespace or a scope "issue"?

We show how to do this using proxy objects (see Proxies section of this document), first using the proto package and then without:
1) proto. Since confint.lm calls vcov, we need to ensure that (a) our replacement for vcov is in the revised confint.lm's environment and (b) the revised confint.lm can still access the objects from its original environment. (For example, confint.lm calls the hidden function format.perc in stats, so if we did not arrange for the second point to hold, that hidden function could not be accessed.)
To do this we make a new confint.lm which is the same as the original except that it has a new environment (the proxy environment) containing our replacement vcov, whose parent in turn is the original confint.lm environment. Below, the proxy environment is implemented as a proto object; the key things to know here are: (a) proto objects are environments, and (b) placing a function in a proto object in the way shown changes its environment to be that proto object. Also, to avoid any problems with S3 dispatch of confint to confint.lm, we call the confint.lm method directly.
Although hccm does not give a visibly different result here, we can verify that it was run from the output of trace:
library(car)
library(proto)
trace(hccm)
model <- lm(len ~ dose, data=ToothGrowth)
proto(environment(stats:::confint.lm), # set parent
      vcov = function(x) hccm(x), # robust var-cov matrix
      confint.lm = stats:::confint.lm)[["confint.lm"]](model)
For another example, see example 2 here.
2) environments. The code is a bit more onerous without proto (in fact it roughly doubles the code size) but here it is:
library(car)
trace(hccm)
model <- lm(len ~ dose, data=ToothGrowth)
local({
  vcov <- function(x) hccm(x) # robust var-cov matrix
  confint.lm <- stats:::confint.lm
  environment(confint.lm) <- environment()
  confint.lm(model) # this confint.lm now picks up our vcov
}, envir = new.env(parent = environment(stats:::confint.lm)))

This is because the functions confint and vcov are both in the "stats" namespace. The confint() you call here effectively gets stats::vcov, pretty much because that is what namespaces are for: you are allowed to write your own versions of things, but not to the detriment of otherwise expected behaviour.
In your first example, you can still safely call other functions that rely on stats::lm and that will not get upset by your local modification.
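As a small illustration of my own (not from the answer above): sd() in stats calls var() internally, and shadowing var() in the global environment does not affect it, because functions in a package namespace resolve names inside that namespace first:
# Shadow var() at the top level; sd() is unaffected because it resolves
# var in the stats namespace, not in the global environment:
var <- function(x, ...) stop("shadowed var(), never reached by sd()")
sd(1:10)  # still works, using stats::var
rm(var)   # remove the shadow again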

Related

Why do I have to define the top-level parameter in JAGS, and how?

According to the user manual of r-jags (section Compilation):
Any node that is used on the right hand side of a relation, but is not defined on the left hand side of any relation, is assumed to be a constant node. Its value must be supplied in the data file.
But this seems odd: many probabilistic graphical models contain top-level parameters that are to be inferred, and that is what Bayesian networks (BNs) are meant to do, isn't it? So why do I need to supply the value of the top-level parameters first? And what should I do when I want to implement a model like LDA, which has a topic-distribution prior alpha and a word-distribution prior beta that are unknown? Please tell me if I have said anything wrong.
If you want to make inference about a parameter, then by definition this is NOT a top-level parameter. If you want to infer something about a parameter then you have to put a prior on it, in which case the hyper-parameters in the prior are the top-level parameters. For example:
Count ~ dpois(lambda)
lambda <- 10
Means that lambda is the top-level parameter, and cannot be inferred.
Count ~ dpois(lambda)
lambda ~ dgamma(0.001, 0.001)
Means that lambda is inferred, and the hyper-parameters of the gamma prior are the top-level parameters. To see this more explicitly, notice that this syntax is equivalent:
Count ~ dpois(lambda)
lambda ~ dgamma(shape, rate)
shape <- 0.001
rate <- 0.001
The shape and rate parameters could also be specified in the data if you prefer, but that would be a bit unusual.
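To make that concrete, here is a minimal rjags sketch of my own (not part of the original answer; the count data are made up) that supplies the hyper-parameters through the data list:
library(rjags)

model_string <- "model {
  for (i in 1:N) {
    Count[i] ~ dpois(lambda)
  }
  lambda ~ dgamma(shape, rate)  # shape and rate are the top-level parameters
}"

# shape and rate are passed as data rather than hard-coded in the model:
dat <- list(Count = c(4, 2, 7, 3, 5), N = 5, shape = 0.001, rate = 0.001)
m <- jags.model(textConnection(model_string), data = dat, n.chains = 2)
samp <- coda.samples(m, variable.names = "lambda", n.iter = 1000)
summary(samp)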
Choice of a reasonable prior distribution for these parameters is not always straightforward, but is an integral part of any Bayesian analysis. Don't just assume that a prior with large variance is minimally informative without thinking about it and/or testing it.
Matt

R attribute ".Environment" consuming large amounts of RAM in nnet package

I have a piece of code that is using the nnet package, and I am interested in calculating a number of different neural network models and then saving all the models to disk (with save()).
The issue that I am running into is that the "terms" element in the neural network has a ".Environment" attribute that ends up being hundreds of megabytes, whereas the rest of the model is only a few kilobytes (once the fitted values and residuals are deleted).
Further, deleting the ".Environment" attribute doesn't appear to cause a problem in terms of using the model with 'predict'.
Does anyone have any idea what either R or nnet is doing with this attribute? Has anyone seen anything like this?
tl;dr: this is OK, except for some very special cases
Background
The .Environment attribute in R contains a reference to the context in which an R closure (usually a formula or a function) was defined. An R environment is a store holding the values of variables, similar to a list. This allows the formula to refer to these variables, for example:
> f = function(g) return(y ~ g(x))
> form = f(exp)
> lm(form, list(y=1:10, x=log(1:10)))
...
Coefficients:
(Intercept) g(x)
3.37e-15 1.00e+00
In this example, the formula form is defined as y~exp(x), by giving g the value of exp. In order to be able to find the value of g (which is an argument to the function f), the formula needs to hold a reference to the environment constructed inside the call to f.
You can see the environment attached to a formula by using the attributes() or environment() functions as follows:
> attributes(form)
$class
[1] "formula"
$.Environment
<environment: R_GlobalEnv>
> environment(form)
<environment: R_GlobalEnv>
Your question
I believe you are using the nnet() function variant with a formula (rather than matrices), i.e.
> nnet(y ~ x1 + x2, ...)
Unfortunately, R keeps the entire environment (including all the variables defined where your formula is defined) allocated, even if your formula does not refer to any of it. There is no way for the language to easily tell what you may or may not be using from the environment.
One solution is to explicitly retain only the required parts of the environment. In particular, if your formula does not refer to anything in the environment (which is the most common case), it is safe to remove it.
I would suggest removing the environment from your formula before you call nnet, something like this:
form = y~x + z
environment(form) = NULL
...
result = nnet(form, ...)
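As a sketch of my own (not from the answer above) demonstrating the capture: a formula created inside a local scope keeps that scope's entire environment alive:
form <- local({
  big <- rnorm(1e6)  # roughly 8 MB, captured alongside the formula
  y ~ x
})
ls(environment(form))  # "big" is still reachable through the formula
print(object.size(environment(form)$big), units = "MB")
environment(form) <- globalenv()  # drop the reference before calling save()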

Passing a list to a function in R for use in optimization

I want to program the maximum likelihood of a gamma distribution in R; until now I have done the following:
library(stats4)
x<-scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
loglike2<-function(LL){
  alpha<-LL$a
  beta<-LL$b
  (alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}
mle(loglike2,start=list(a=0.5,b=0.5))
but when I want to run it, the following message appear:
Error in mle(loglike2, start = list(a = 0.5, b = 0.5)) :
some named arguments in 'start' are not arguments to the supplied log-likelihood
What am I doing wrong?
From the error message it sounds like mle needs to be able to see the variable names listed in start= in the function call itself.
loglike2<-function(a, b){
  alpha<-a
  beta<-b
  (alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}
mle(loglike2,start=list(a=0.5,b=0.5))
If that doesn't work you should post a reproducible example with all variables defined and also explicitly indicate which package the mle function is coming from.
The error message is unfortunately cryptic; the underlying problem (missing values) comes from the fact that alpha and beta have to be positive while mle optimizes over the real numbers. Hence, you need to transform the parameters over which the function is being optimized, like so:
library(stats4)
x<-scan("http://www.cmc.edu/pages/faculty/MONeill/Math152/Handouts/gamma-arrivals.txt")
loglike<-function(alpha,beta){
  (alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}
fit <- mle(function(alpha,beta)
             # transform the parameters so they are positive
             loglike(exp(alpha),exp(beta)),
           start=list(alpha=log(.5),beta=log(.5)))
# of course you would have to exponentiate the estimates too
exp(coef(fit))
Note that the error now is that you are using n in loglike(), which you have not defined. If you define n, then you get an error stating "Lapack routine dgesv: system is exactly singular: U[1,1] = 0.", which is caused either by a poor guess for the starting values of alpha and beta or (more likely) by the fact that loglike() does not have a minimum; note that mle() minimizes the function you give it, so you would normally pass the negative log-likelihood, and as written the expression is also missing the -sum(x)/beta term of the gamma log-likelihood. (I think your deleted post from last night had a slightly different formula, which I was able to get working, but was not able to respond to because the post was deleted...)
FYI, if you want to inspect the alpha and beta values that cause the errors, you can use the scoping assignment operator <<- to post the most recently used parameters to the environment in which loglike() is defined, as in:
loglike<-function(alpha,beta){
  g <<- c(alpha,beta)
  (alpha-1)*sum(log(x))-n*alpha*log(beta)-n*lgamma(alpha)
}
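For reference, here is a self-contained sketch of my own (not from the answers above; the data are simulated) using the built-in dgamma() density, so the log-likelihood need not be transcribed by hand:
library(stats4)

set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)  # simulated stand-in for the real data

# mle() minimizes, so supply the negative log-likelihood; optimizing on the
# log scale keeps shape and rate positive:
nll <- function(logshape, lograte)
  -sum(dgamma(x, shape = exp(logshape), rate = exp(lograte), log = TRUE))

fit <- mle(nll, start = list(logshape = 0, lograte = 0))
exp(coef(fit))  # back-transform to the shape and rate estimates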

Differences between different types of functions in R

I would appreciate help understanding the main differences between several types of functions in R.
I'm somewhat overwhelmed by the definitions of the different types of functions, and it has become difficult to understand how they relate to each other.
Specifically, I'm confused about the relationships and differences between the following types of functions:
Generic vs. Method: based on the class of the input argument(s), a generic function uses method dispatch to call an appropriate method function.
Invisible vs. Visible
Primitive vs. Internal
I'm confused about how these different types of functions relate to each other (if at all) and what the various differences and overlaps are between them.
Here's some documentation about primitive vs internal: http://www.biosino.org/R/R-doc/R-ints/_002eInternal-vs-_002ePrimitive.html
Generic functions dispatch on the class of the object they are applied to; each class can supply its own method for a generic. You can see the specific methods associated with a generic with the methods() function:
methods(print)
This will list all the methods associated with the generic print.
Alternatively, you can see all the methods defined for a given class with this call:
methods(class = "lm")
where "lm" is the linear model class.
Here's an example:
x <- rnorm(100)
y <- 1 + .4*x + rnorm(100,0,.1)
mod1 <- lm(y~x)
print(mod1)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.002 0.378
stats:::print.lm(mod1)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.002 0.378
Both print(mod1) (the generic call) and stats:::print.lm(mod1) (calling the "lm" method directly) do the same thing. Why does R do this? I don't really know, but that's the difference between a method and a generic as I understand it.
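As a minimal sketch of my own (not from the answer above), defining a generic and a method from scratch shows the relationship directly:
# A generic is a function whose body calls UseMethod(); dispatch then picks
# the function named <generic>.<class> for the class of the first argument:
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2
area.default <- function(shape, ...) stop("no area() method for this class")

circ <- structure(list(r = 2), class = "circle")
area(circ)  # dispatches to area.circle and returns 12.56637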

S3 and order of classes

I've always had trouble understanding the documentation on how S3 methods are called, and this time it's biting me back.
I'll apologize up front for asking more than one question, but they are all closely related. Deep in the heart of a complex set of functions, I create a lot of glmnet fits, in particular logistic ones. Now, glmnet documentation specifies its return value to have both classes "glmnet" and (for logistic regression) "lognet". In fact, these are specified in this order.
However, looking at the end of the implementation of glmnet, right after the call to the internal function lognet, which sets the class of fit to "lognet", I see this line of code just before the return (of the variable fit):
class(fit) = c(class(fit), "glmnet")
From this, I would conclude that the order of the classes is in fact "lognet", "glmnet".
Unfortunately, the fit I had showed (as the doc suggests):
> class(myfit)
[1] "glmnet" "lognet"
The problem with this is the way S3 methods are dispatched for it, in particular predict. Here's the code for predict.lognet:
function (object, newx, s = NULL, type = c("link", "response",
    "coefficients", "class", "nonzero"), exact = FALSE, offset,
    ...)
{
  type = match.arg(type)
  nfit = NextMethod("predict") #<- supposed to call predict.glmnet, I think
  switch(type, response = {
    pp = exp(-nfit)
    1/(1 + pp)
  }, class = ifelse(nfit > 0, 2, 1), nfit)
}
I've added a comment to explain my reasoning. Now when I call predict on this myfit with a new data matrix mydata and type="response", like this:
predict(myfit, newx=mydata, type="response")
I do not, as per the documentation, get the predicted probabilities, but the linear combinations, which is exactly the result of calling predict.glmnet directly.
I've tried reversing the order of the classes, like so:
orgclass<-class(myfit)
class(myfit)<-rev(orgclass)
And then doing the predict call again: lo and behold: it works! I do get the probabilities.
So, here come some questions:
1) Am I right in 'having learned' that S3 methods are dispatched in order of appearance of the classes?
2) Am I right in assuming the code in glmnet would cause the wrong order for correct dispatching of predict?
3) In my code there is nothing that manipulates classes explicitly/visibly to my knowledge. What could cause the order to change?
For completeness' sake: here's some sample code to play around with (as I'm doing myself now):
library(glmnet)
y<-factor(sample(2, 100, replace=TRUE))
xs<-matrix(runif(100), ncol=1)
colnames(xs)<-"x"
myfit<-glmnet(xs, y, family="binomial")
mydata<-matrix(runif(10), ncol=1)
colnames(mydata)<-"x"
class(myfit)
predict(myfit, newx=mydata, type="response")
class(myfit)<-rev(class(myfit))
class(myfit)
predict(myfit, newx=mydata, type="response")
class(myfit)<-rev(class(myfit))#set it back
class(myfit)
Depending on the data generated, the difference is more or less obvious (in my true dataset I noticed negative values in the so called probabilities, which is how I picked up the problem), but you should indeed see a difference.
Thanks for any input.
Edit:
I just found out the horrible truth: either order worked in glmnet 1.5.2 (which is present on the server where I ran the actual code, resulting in the fit with the class order reversed), but the code from 1.6 requires the order to be "lognet", "glmnet". I have yet to check what happens in 1.7.
Thanks to @Aaron for reminding me of the basics of informatics (besides 'if all else fails, restart': 'check your versions'); I had mistakenly assumed that a package by the gods of statistical learning would be protected from this type of error. And thanks to @Gavin for confirming my reconstruction of how S3 works.
Yes, methods are dispatched in the order in which the classes are listed in the class attribute. In the simple, everyday case the first listed class is the one chosen first by method dispatch; only if no method is found for that class (or NextMethod is called) will it move on to the second class, or, failing that, search for a default method.
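A tiny sketch of my own (the describe() generic is hypothetical, not from glmnet) makes the dispatch order and the NextMethod() fall-through concrete:
describe <- function(x) UseMethod("describe")
describe.lognet <- function(x) paste("lognet, then", NextMethod())
describe.glmnet <- function(x) "glmnet"

obj <- structure(list(), class = c("lognet", "glmnet"))
describe(obj)  # "lognet, then glmnet": lognet's method runs, then falls through

class(obj) <- rev(class(obj))  # now c("glmnet", "lognet")
describe(obj)  # "glmnet": the lognet post-processing step is skipped entirely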
No, I do not think you are right that the order of the classes is wrong in the code. The documentation appears wrong. The intent is clearly to call predict.lognet() first, use the workhorse predict.glmnet() to do the basic computations for all types of lasso/elastic net models fitted by glmnet, and finally do some post processing of those general predictions. That predict.glmnet() is not exported from the glmnet NAMESPACE whilst the other methods are is perhaps telling, also.
I'm not sure why you think the output from this:
predict(myfit, newx=mydata, type="response")
is wrong? I get a matrix of 10 rows and 21 columns, with the columns relating to the intercept-only model prediction plus predictions at 20 values of lambda at which model coefficients along the lasso/elastic net path have been computed. These do not seem to be linear combinations, and they are on the response scale as you requested.
The order of the classes is not changing. I think you are misunderstanding how the code is supposed to work. There is a bug in the documentation, as the ordering is stated wrong there. But the code is working as I think it should.

Resources