I am trying to apply cross validation to a list of linear models and getting an error.
Here is my code:
library(MASS)
library(boot)
glm.fits = lapply(1:10,function(d) glm(nox~poly(dis,d),data=Boston))
cvs = lapply(1:10,function(i) cv.glm(Boston,glm.fits[[i]],K=10)$delta[1])
I get the error:
Error in poly(dis, d) : object 'd' not found
I then tried the following code:
library(MASS)
library(boot)
cvs=rep(0,10)
for (d in 1:10){
  glmfit = glm(nox~poly(dis,d),data=Boston)
  cvs[d] = cv.glm(Boston,glmfit,K=10)$delta[1]
}
and this worked.
Can anyone explain why my first attempt did not work, and suggest a fix?
Also, assuming a fix to the first attempt can be obtained, which way of writing code is better practice? (assume that I want a list of the various fits and that I would edit the latter code to preserve them) To me, the first attempt is more elegant.
In order for your first attempt to work, cv.glm (and maybe glm) would have to be written differently to take much more care about where evaluations are taking place.
The function cv.glm basically re-evaluates the model formula a bunch of times. It takes that model formula directly from the fitted glm object. So put yourself in R's shoes (as it were), and consider you're deep in the function cv.glm and you've been instructed to refit this model:
glm(formula = nox ~ poly(dis, d), data = Boston)
The fitted glm object has Boston in it, and glm knows to look first in Boston for variables, so it finds nox and dis easily. But where is d? It's not in Boston. It's not in the fitted glm object (and glm wouldn't know to look there anyway). In fact, it isn't anywhere. That d value existed only in the context of the lapply iterations and then disappeared.
In the second case, since d is currently an active variable in your for loop, after R fails to find d in the data frame Boston, it looks in the parent frame, in this case the global environment and finds your for loop index d and merrily keeps going.
If you need to use glm and cv.glm in this way I would just use the for loop; it might be possible to work around the evaluation issues, but it probably wouldn't be worth the time and hassle.
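For what it's worth, one possible workaround for the lapply version is to substitute the value of d into the call before glm sees it, so that the stored call contains a literal degree rather than the symbol d. A minimal sketch using bquote() follows; the seed is arbitrary and I have only tried the idea on this toy case, so treat it as an assumption rather than a guaranteed fix:

```r
library(MASS)   # Boston data
library(boot)   # cv.glm

set.seed(1)     # arbitrary seed; the K-fold split is random

# bquote() replaces .(d) with the current value of d, so each fit's stored
# call reads e.g. glm(formula = nox ~ poly(dis, 3L), data = Boston)
glm.fits <- lapply(1:10, function(d) {
  eval(bquote(glm(nox ~ poly(dis, .(d)), data = Boston)))
})

# cv.glm re-evaluates the stored call, which no longer refers to d
cvs <- sapply(glm.fits, function(fit) cv.glm(Boston, fit, K = 10)$delta[1])
```

This keeps the list of fits in glm.fits, at the cost of a slightly less readable model call.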
I'm running conditional logistic regression models in R as part of a discordant sibling pair analysis and I need to isolate the total n for each model. Also, I need to isolate the number and % of cases of the disease in the exposed and unexposed groups.
In Stata the e(sample) == 1 command gives this info. Is there an equivalent function for accomplishing this in R?
In R, if you run a regression you create a regression object.
RegOb <- lm(y ~ x1 + x2, data)
Often people just call "RegOb", which uses the internal print method for this type of object. Alternatively, "summary(RegOb)" is popular (and people often assign its result).
However, RegOb contains a lot of information about the regression. In Stata you could use -ereturn list- to see what is saved; in R I would recommend "str(RegOb)" or "View(RegOb)", which show everything that is saved. I have forgotten the exact syntax at the moment, but for an lm fit the data actually used should be something like:
RegOb$model
And since you have the original data, you can simply use a logical statement based on the original and the used data which will give you the estimation sample.
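A rough sketch of that idea, using the built-in airquality data as a stand-in (it has missing values, so lm() silently drops rows). The names aq, n_used, used_rows and in_sample are made up for illustration, and whether the same accessors behave identically for a conditional logistic fit is something you would need to check:

```r
aq <- airquality   # built-in data with NAs in Ozone and Solar.R

fit <- lm(Ozone ~ Solar.R + Wind, data = aq)

n_used    <- nobs(fit)                    # total n actually used in the fit
used_rows <- rownames(model.frame(fit))   # row names of the estimation sample

# Logical flag on the original data, similar in spirit to Stata's e(sample)
aq$in_sample <- rownames(aq) %in% used_rows

table(aq$in_sample)   # rows kept vs. dropped
```

Once the flag exists, counts and percentages by group can be taken from the flagged rows with table() and prop.table().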
I am fitting a model with many random effects using the bam() function within the mgcv package for R. My basic model structure looks like:
fit <- bam(y ~ s(x1) + s(x2) + s(xn) + s(plot, bs = 're'), data = dat)
This function works for 4 subsets of my data, but not the fifth, which is surprising. Instead, it throws this error:
Error in qr.qty(qrx, f) :
right-hand side should have 14195 not 14196 rows
This error goes away if I switch to using gam() rather than bam(). It also goes away if I drop the random effect from the model. I am really unsure what's causing this error, or what to do about it. Unfortunately, generating a reproducible example would require passing along a very large dataset, as it's not clear why this error is thrown on this particular dataset but not on the 4 other datasets fitted with the exact same model.
Any idea why this error is being thrown, and how to overcome it, would be greatly appreciated.
I had the same question and I found this r-help mail which tries to solve the same problem:
[R] bam (mgcv) not using the specified number of cores
After reading the mail, I deleted all the cluster-related code, such as the cluster argument to bam(). The error message then went away.
I don't know the details but I hope this trick will help you.
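To make the change concrete, here is a sketch on simulated data. The data frame dat, the 20 plots and the effect sizes are all invented, and this toy example will not reproduce the original error; it only shows the serial call with the cluster argument left out:

```r
library(mgcv)

set.seed(1)   # arbitrary seed for the simulated data
n <- 400
dat <- data.frame(
  x1   = runif(n),
  x2   = runif(n),
  plot = factor(sample(1:20, n, replace = TRUE))   # grouping factor for the random effect
)
plot_eff <- rnorm(20, sd = 0.5)
dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 +
  plot_eff[as.integer(dat$plot)] + rnorm(n, sd = 0.3)

# Parallel version that was removed (shown only as a comment):
#   cl  <- parallel::makeCluster(2)
#   fit <- bam(y ~ s(x1) + s(x2) + s(plot, bs = "re"), data = dat, cluster = cl)

# Serial version: same model, no cluster argument
fit <- bam(y ~ s(x1) + s(x2) + s(plot, bs = "re"), data = dat)
```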
One possible cause of
Error in qr.qty(qrx, f) :
right-hand side should have 14195 not 14196 rows
is running out of RAM. This may explain why you have seen the error for some datasets but not others. This is especially common when using a large cluster size.
Ran into this problem while trying to get the empirical distribution of the K-R degrees of freedom...
This seems like fairly dangerous behaviour? Does it constitute a bug?
Reproducible example:
## import lmerTest package
library(lmerTest)
## an object of class merModLmerTest
m <- lmer(Informed.liking ~ Gender+Information+Product +(1|Consumer), data=ham)
# simulate data from fitted model
simData=ham
simData$Informed.liking=unlist(simulate(m))
# fit model to simulated data
m1 <- lmer(Informed.liking ~ Gender+Information+Product +(1|Consumer), data=simData)
stats:::anova(m1)
lmerTest:::anova(m1)
# simulate again, WITHOUT refitting
simData$Informed.liking=unlist(simulate(m))
stats:::anova(m1) # same as before
lmerTest:::anova(m1) # not same as before!
My response is not a solid answer, more of an extended comment.
This looks pretty bad. In fact, I discovered today that almost all the analyses I conducted in a project that was on the verge of submission have to be redone because of a related behavior of lmerTest.
The problem arose when I used a short function that fits a model with lmer and then returns coef(summary(model)): simple stuff, two lines of code. However, the input to this function was named data, and I also had a data frame called data in the workspace. It seems that although the local variable from the function scope was correctly used during fitting with lmer, the workspace variable data was used during summary (and it was often not the same as the data frame passed to the function). This led to invalid t values and degrees of freedom, and therefore to incorrect p values (the estimates and their standard errors were fine, however).
So, answering your question:
This seems like fairly dangerous behaviour? Does it constitute a bug?
It seems dangerous indeed and I would definitely call this a bug.
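The mechanism I suspect can be mimicked in base R. The sketch below is not lmerTest's actual code; it only shows how re-evaluating a stored model call in a different environment can silently pick up another object with the same name. The environment workspace stands in for the global workspace, and all names and numbers are made up:

```r
# Stand-in for the user's workspace, holding a data frame called 'data'
workspace <- new.env()
assign("data", data.frame(x = 1:10, y = 2 * (1:10)), envir = workspace)

# A short fitting function whose argument is also called 'data'
fit_model <- function(data) lm(y ~ x, data = data)

# The data actually passed in differ from the workspace copy
passed_in <- data.frame(x = 1:10, y = -3 * (1:10))
fit <- fit_model(passed_in)
slope_fit <- unname(coef(fit)["x"])       # slope from the data that were passed in

# The stored call is lm(formula = y ~ x, data = data); evaluating it in the
# workspace finds the other 'data' and quietly fits a different model
refit <- eval(fit$call, envir = workspace)
slope_refit <- unname(coef(refit)["x"])   # slope from the workspace copy
```

Here the first slope comes from passed_in and the second from the workspace copy, with no warning in between. Giving function arguments names that do not clash with workspace objects avoids this particular trap.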
I'm trying to run the following regression:
m1=glm(y~x1+x2+x3+x4,data=df,family=binomial())
m2=glm(y~x1+x2+x3+x4+x5,data=df,family=binomial())
m3=glm(y~x1+x2+x3+x4+x5+x6,data=df,family=binomial())
m4=glm(y~x1+x2+x3+x4+x5+x6+x7,data=df,family=binomial())
and then to print them using the stargazer package:
stargazer(m1,m2,m3,m4, type="html", out="models.html")
Thing is, the data frame df is rather big (~600MB) and thus each glm object I create is at least ~1.5GB.
This creates a memory issue which prevents me from creating all the regressions I need to print in stargazer.
I've tried 2 approaches in order to decrease the size of the glm objects:
1. Trim the glm object using this tutorial. This indeed trims the glm object to <1MB, though I get the following error from the stargazer function:
Error in Qr$qr[p1, p1, drop = FALSE] : incorrect number of dimensions
2. Use the package speedglm. However, it's not supported by stargazer.
Any suggestions?
stargazer calls summary, which requires the qr component of the fitted object (see the source code). So, as far as I know, it is not possible to trim the object that way.
BUT I think that it should be easy to rewrite stargazer to handle a list of summaries as an input. It would be extremely handy.
An option that has worked well for me is to first convert the large *lm objects to "coeftest" class using the lmtest package. A "coeftest" object is really just a matrix of your summarised regression results and hardly takes up any space as a result. Moreover, Stargazer readily accepts the "coeftest" class as an input, so your code doesn't need to change much at all.
Using your example:
library(lmtest)
m1 <- glm(y~x1+x2+x3+x4,data=df,family=binomial())
m1 <- coeftest(m1)
m2 <- glm(y~x1+x2+x3+x4+x5,data=df,family=binomial())
m2 <- coeftest(m2)
m3 <- glm(y~x1+x2+x3+x4+x5+x6,data=df,family=binomial())
m3 <- coeftest(m3)
m4 <- glm(y~x1+x2+x3+x4+x5+x6+x7,data=df,family=binomial())
m4 <- coeftest(m4)
stargazer(m1,m2,m3,m4, type="html", out="models.html")
Apart from taking care of the memory problem, this approach has the added benefit of the coeftest() transformation itself being extremely quick. (Well, with the notable exception of times when you ask it to produce robust/clustered standard errors on a particularly large *lm object by invoking the "vcov = vcovHC" option. However, even then, the coeftest() transformation is a necessary step to exporting the robust regression results in the first place.)
A minor downside to this approach is that it doesn't save some regression statistics that may be of interest for your Stargazer table (e.g. R-squared or N). However, you could easily obtain these from the *lm object before converting it.
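As a sketch of that last step, the statistics can be captured with base R accessors before the large fit is thrown away. The mtcars model and the names stats_keep and coef_table are stand-ins for illustration only; adding the saved values back into the table (for example through stargazer's add.lines argument) is left out here:

```r
# Small stand-in model; in practice this would be the large glm fit
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

# Save the statistics that the summarised coefficient table does not carry
stats_keep <- list(
  n      = nobs(fit),
  aic    = AIC(fit),
  loglik = as.numeric(logLik(fit))
)

# Coefficient matrix (estimate, std. error, z value, p value)
coef_table <- coef(summary(fit))

rm(fit)   # the large object can now be dropped
```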
Using lmer I get the following warning:
Warning messages:
1: In optwrap(optimizer, devfun, x@theta, lower = x@lower) :
convergence code 3 from bobyqa: bobyqa -- a trust region step failed to reduce q
This warning is generated after using anova(model1, model2). I tried to make this reproducible, but if I dput the data and try again, the warning does not reproduce on the dput data, even though the original and new data frames have the exact same str.
I have tried again in a clean session, and the warning reproduces, and again it is lost with a dput.
I know I am not giving people much to work with here; like I said, I would love to reproduce the problem. Can anyone shed light on this warning?
(I'm not sure whether this is a comment or an answer, but it's a bit long and might be an answer.)
The proximal cause of your difficulty with reproducing the result is that lme4 uses both environments and reference classes: these are tricky to "serialize", i.e. to translate to a linear stream that can be saved via dput() or save(). (Can you please try save() and see if it works better than dput()?)
In addition, both environments and reference classes use "pass-by-reference" semantics, so operating on the saved model can change it. anova() automatically refits the model, which makes some tiny but non-zero changes in the internal structure of the saved model object (we are still trying to track this down).
@alexkeil's comment is wrong: the nonlinear optimizers used within lme4 do not use any calls to the pseudo-random number generator. They are deterministic (but the two points above explain why things might look a bit weird).
To allay your concerns with the fit, I would check the fit by computing the gradient and Hessian at the final fit, e.g.
library(lme4)
library(numDeriv)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
dd <- update(fm1,devFunOnly=TRUE)
params <- getME(fm1,"theta") ## also need beta for glmer fits
grad(dd,params)
## all values 'small', say < 1e-3
## [1] 0.0002462423 0.0003276917 0.0003415010
eigen(solve(hessian(dd,params)),only.values=TRUE)$values
## all values positive and of similar magnitude
## [1] 0.029051631 0.002757233 0.001182232
We are in the process of implementing similar checks to run automatically within lme4.
That said, I would still love to see your example, if there's a way to reproduce it relatively easily.
PS: in order to be using bobyqa, you must either be using glmer or have used lmerControl to modify the default optimizer choice ... ??
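For reference, a sketch of how the optimizer can be chosen explicitly and then inspected on the fitted object. It reuses the sleepstudy example from above; the default optimizer has differed between lme4 versions, so take the comments as assumptions about your installed version:

```r
library(lme4)

# Explicitly request bobyqa instead of relying on the version-specific default
fm_bobyqa <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
                  control = lmerControl(optimizer = "bobyqa"))

fm_bobyqa@optinfo$optimizer   # which optimizer was used
fm_bobyqa@optinfo$conv$opt    # convergence code reported by the optimizer
```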