lmerTest:::anova uses lazy loading of data sets?

Ran into this problem while trying to get the empirical distribution of the K-R degrees of freedom...
This seems like fairly dangerous behaviour? Does it constitute a bug?
Reproducible example:
## import lmerTest package
library(lmerTest)
## an object of class merModLmerTest
m <- lmer(Informed.liking ~ Gender+Information+Product +(1|Consumer), data=ham)
# simulate data from fitted model
simData=ham
simData$Informed.liking=unlist(simulate(m))
# fit model to simulated data
m1 <- lmer(Informed.liking ~ Gender+Information+Product +(1|Consumer), data=simData)
stats:::anova(m1)
lmerTest:::anova(m1)
# simulate again, WITHOUT refitting
simData$Informed.liking=unlist(simulate(m))
stats:::anova(m1) # same as before
lmerTest:::anova(m1) # not same as before!

My response does not constitute a solid answer, rather an extended comment:
This looks pretty bad. In fact, I discovered today that almost all the analyses I conducted in a project that was on the verge of submission have to be redone because of a related behaviour of lmerTest.
The problem I ran into occurred when I used a short function that fits a model with lmer and then returns coef(summary(model)) - simple stuff, two lines of code. However, the argument to this function was named data, and I also had a data frame called data in the workspace. It seems that although the local variable from the function scope was correctly used during fitting with lmer, during summary the workspace data variable was used (which often was not the same as the data frame passed to the function), leading to invalid t values and degrees of freedom and hence to incorrect p values (the estimates and their standard errors were fine, however).
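To illustrate the pattern (a hypothetical sketch reusing the ham / simData objects from the question above; the helper function is made up and is not the original project code):
## a short helper whose argument is deliberately named `data`,
## shadowing a data frame of the same name in the workspace
fit_and_summarise <- function(data) {
  m <- lmer(Informed.liking ~ Gender + Information + Product + (1|Consumer),
            data = data)
  coef(summary(m))            ## with lmerTest, this recomputes df and p values
}
data <- ham                   ## an unrelated data frame named `data` in the workspace
fit_and_summarise(simData)    ## estimates come from simData, but the t/df/p
                              ## computation can pick up the workspace `data`,
                              ## as described above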
So, answering your question:
This seems like fairly dangerous behaviour? Does it constitute a bug?
It seems dangerous indeed and I would definitely call this a bug.

Related

Unexpected behavior in R using lapply() with glm() and cv.glm()

I am trying to apply cross validation to a list of linear models and getting an error.
Here is my code:
library(MASS)
library(boot)
glm.fits = lapply(1:10,function(d) glm(nox~poly(dis,d),data=Boston))
cvs = lapply(1:10,function(i) cv.glm(Boston,glm.fits[[i]],K=10)$delta[1])
I get the error:
Error in poly(dis, d) : object 'd' not found
I then tried the following code:
library(MASS)
library(boot)
cvs=rep(0,10)
for (d in 1:10){
glmfit = glm(nox~poly(dis,d),data=Boston)
cvs[d] = cv.glm(Boston,glmfit,K=10)$delta[1]
}
and this worked.
Can anyone explain why my first attempt did not work, and suggest a fix?
Also, assuming a fix to the first attempt can be obtained, which way of writing code is better practice? (assume that I want a list of the various fits and that I would edit the latter code to preserve them) To me, the first attempt is more elegant.
In order for your first attempt to work, cv.glm (and maybe glm) would have to be written differently to take much more care about where evaluations are taking place.
The function cv.glm basically re-evaluates the model formula a bunch of times. It takes that model formula directly from the fitted glm object. So put yourself in R's shoes (as it were), and consider you're deep in the function cv.glm and you've been instructed to refit this model:
glm(formula = nox ~ poly(dis, d), data = Boston)
The fitted glm object has Boston in it, and glm knows to look first in Boston for variables, so it finds nox and dis easily. But where is d? It's not in Boston. It's not in the fitted glm object (and glm wouldn't know to look there anyway). In fact, it isn't anywhere. That d value existed only in the context of the lapply iterations and then disappeared.
In the second case, since d is currently an active variable in your for loop, after R fails to find d in the data frame Boston, it looks in the parent frame, in this case the global environment and finds your for loop index d and merrily keeps going.
If you need to use glm and cv.glm in this way I would just use the for loop; it might be possible to work around the evaluation issues, but it probably wouldn't be worth the time and hassle.
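That said, if you really want the lapply version, one possible workaround (a sketch, not a guarantee that it covers every use of cv.glm) is to substitute the value of d into each call before fitting, so that the call stored in the glm object contains a literal degree rather than a variable that only existed inside lapply:
library(MASS)
library(boot)
## bquote() splices the current value of d into the call, so the stored call is
## e.g. glm(nox ~ poly(dis, 3), data = Boston) and can be re-evaluated by cv.glm
glm.fits <- lapply(1:10, function(d) {
  eval(bquote(glm(nox ~ poly(dis, .(d)), data = Boston)))
})
cvs <- sapply(glm.fits, function(fit) cv.glm(Boston, fit, K = 10)$delta[1])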

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality and the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year*Treatment, ~1|Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year*Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (note the + before the random-effects term, which is missing in your original call). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects - so this might be correct.
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general, log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
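For what it's worth, here is a minimal sketch of that kind of likelihood-ratio comparison for two nested glmer fits (the formulas are illustrative, built from the variables in your question; note that PQL fits don't provide a true likelihood, so this applies to nested glmer models rather than to a glmer-vs-glmmPQL comparison):
library(lme4)
## nested Poisson GLMMs: with and without the Year:Treatment interaction
m_add <- glmer(Number ~ Year + Treatment + (1|Year/Treatment),
               data = data, family = poisson)
m_int <- glmer(Number ~ Year * Treatment + (1|Year/Treatment),
               data = data, family = poisson)
anova(m_add, m_int)   ## likelihood ratio test for the interaction term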

R - How to get one "summary" prediction map instead for 5 when using 5-fold cross-validation in maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that results from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
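In code, that workflow might look roughly like this (a sketch using the objects from your script; cv_model and final_model are just illustrative names, and output paths are omitted):
library(dismo)
## 1) cross-validated run: used only to judge how well the modelling procedure works
cv_model <- maxent(x = predvars, p = presence_points, a = target_group_absence,
                   args = c("replicates=5"))
## 2) final model fitted once to all of the data (no replicates)
final_model <- maxent(x = predvars, p = presence_points, a = target_group_absence)
## 3) a single prediction map from the final model
final_map <- predict(final_model, predvars)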
I know I am a couple of years late, but you could do something like this:
xm  <- maxent(predictors, pres_train)                       # maxent model for the first partition (xm2, ... fitted the same way)
px  <- predict(predictors, xm,  ext = ext, progress = '')   # prediction from model 1
px2 <- predict(predictors, xm2, ext = ext, progress = '')   # prediction from model 2
models    <- stack(px, px2)   # stack the predictions from all the models
final_map <- mean(models)     # cell-wise mean of all the predictions
plot(final_map)               # plot the averaged map
xm, xm2, ... would be the maxent models for each partition of the cross-validation, and px, px2, ... would be the corresponding predicted maps.

obscure warning lme4 using lmer in optwrap

Using lmer I get the following warning:
Warning messages:
1: In optwrap(optimizer, devfun, x@theta, lower = x@lower) :
convergence code 3 from bobyqa: bobyqa -- a trust region step failed to reduce q
This warning is generated after using anova(model1, model2). I tried to make this reproducible, but if I dput the data and try again, the warning does not reproduce on the dput data, despite the original and new data frames having exactly the same str.
I have tried again in a clean session, and the warning reproduces, and again is lost with a dput.
I know I am not giving people much to work with here; like I said, I would love to reproduce the problem. Can anyone shed light on this warning?
(I'm not sure whether this is a comment or an answer, but it's a bit long and might be an answer.)
The proximal cause of your difficulty with reproducing the result is that lme4 uses both environments and reference classes: these are tricky to "serialize", i.e. to translate to a linear stream that can be saved via dput() or save(). (Can you please try save() and see if it works better than dput()?)
In addition, both environments and reference classes use "pass-by-reference" semantics, so operating on the saved model can change it. anova() automatically refits the model, which makes some tiny but non-zero changes in the internal structure of the saved model object (we are still trying to track this down).
@alexkeil's comment is wrong: the nonlinear optimizers used within lme4 do not use any calls to the pseudo-random number generator. They are deterministic (but the two points above explain why things might look a bit weird).
To allay your concerns with the fit, I would check the fit by computing the gradient and Hessian at the final fit, e.g.
library(lme4)
library(numDeriv)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
dd <- update(fm1,devFunOnly=TRUE)
params <- getME(fm1,"theta") ## also need beta for glmer fits
grad(dd,params)
## all values 'small', say < 1e-3
## [1] 0.0002462423 0.0003276917 0.0003415010
eigen(solve(hessian(dd,params)),only.values=TRUE)$values
## all values positive and of similar magnitude
## [1] 0.029051631 0.002757233 0.001182232
We are in the process of implementing similar checks to run automatically within lme4.
That said, I would still love to see your example, if there's a way to reproduce it relatively easily.
PS: in order to be using bobyqa, you must either be using glmer or have used lmerControl to modify the default optimizer choice ... ??
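(For reference, a minimal sketch of what that optimizer switch looks like, shown on the sleepstudy example above rather than on your data:)
library(lme4)
## lmer with bobyqa instead of the default optimizer
fm_bobyqa <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
                  control = lmerControl(optimizer = "bobyqa"))
## the analogous control function for glmer fits is glmerControl()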

Test for Multicollinearity in Panel Data R

I am running a panel data regression using the plm package in R and want to control for multicollinearity between the explanatory variables.
I know there is the vif() function in the car package; however, as far as I know, it cannot deal with panel-data model output.
The plm can do other diagnostics such as a unit root test but I found no method to calculate for multicollinearity.
Is there a way to calculate a similar test to vif, or can I just regard each variable as a time-series, leaving out the panel information and run tests using the car package?
I cannot disclose the data, but the problem should be relevant to all panel data models.
The dimension is roughly 1,000 observations, over 50 time-periods.
The code I use looks like this:
pdata <- plm.data(RegData, index=c("id","time"))
fixed <- plm(Y~X, data=pdata, model="within")
and then
vif(fixed)
returns an error.
Thank you in advance.
This question has been asked with reference to other statistical packages such as SAS https://communities.sas.com/thread/47675 and Stata http://www.stata.com/statalist/archive/2005-08/msg00018.html, and the common answer has been to use a pooled model to get the VIF. The logic is that since multicollinearity is only about the independent variables, there is no need to control for individual effects using panel methods.
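Following that logic, one simple check (a sketch with hypothetical regressor names x1, x2, x3 standing in for your explanatory variables) is to ignore the panel structure, fit a plain lm() with the same regressors, and run car::vif() on that:
library(car)
## the panel identifiers (id, time) are deliberately ignored here,
## since the VIF only involves the explanatory variables
pooled_lm <- lm(Y ~ x1 + x2 + x3, data = RegData)
vif(pooled_lm)   # rule of thumb: values well above ~10 suggest problematic collinearity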
Here's some code extracted from another site:
library(plm)   # for plm.data() and plm()
library(car)   # for vif()
mydata <- read.csv("US Panel Data.csv")
attach(mydata)       # not sure if that's really needed
Y <- cbind(Return)   # not sure what that is doing
pdata <- plm.data(mydata, index = c("id", "t"))
model <- plm(Y ~ 1 + ESG + Beta + Market.Cap + PTBV + Momentum +
               Dummy1 + Dummy2 + Dummy3 + Dummy4 + Dummy5 +
               Dummy6 + Dummy7 + Dummy8 + Dummy9,
             data = pdata, model = "pooling")
vif(model)
