Mixed effect model not returning all coefficients - r

I am running some mixed effect models in R. The code is:
m <- lmer(DV ~ I(FE^2) + FE + (I(FE^2) | ME) + (FE | ME), data = data, REML = FALSE)
When extracting the coefficients with:
coefs <- data.frame(coef(m)[1])
I get a data frame of coefficients. However, when I count the number of groups, a few are missing. Why might this be? The model fit failed, but my assumption is that it should still always produce a coefficient table with a row for each level of the grouping factor.
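One thing worth checking (a hypothetical diagnostic, not from the original post): lmer silently drops incomplete rows, so any level of ME whose rows all contain an NA in DV or FE will be absent from coef(). A minimal sketch, assuming the variable names above:

fitted_levels <- rownames(coef(m)[[1]])  # group levels present in the fitted model
all_levels <- levels(factor(data$ME))    # group levels present in the raw data
setdiff(all_levels, fitted_levels)       # groups lost, e.g. to NA handling
nobs(m); nrow(data)                      # rows used in the fit vs. rows supplied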

Related

Ordinal logistic regression with random variable and quantitative predictor variable

I want to use an ordinal logistic regression (my response variable is ordinal) that works with two random variables and a quantitative predictor variable with an interaction (my formula is: ordinal_variable ~ quantitative_variable:habitat + (1|community) + (1|species)).
I was analyzing my data with clmm (see the script below) and got the results I expected; however, I noticed that clmm was designed to be used when the response and predictor variables are factors.
I then tried polmer (see the script below), but I did not get any results.
Would someone have any suggestions on how to analyze this data?
library(ordinal)
model1 <- clmm(as.factor(ordinal_variable) ~
                 quantitative_variable:habitat + (1|community) + (1|species),
               data = baseline)
summary(model1)
library(MPDiR)
library(lme4)
model2 <- polmer(as.factor(ordinal_variable) ~
                   quantitative_variable:habitat + (community - 1 | Obs) +
                   (species - 1 | Obs), data = baseline)
summary(model2)
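For what it's worth, clmm only requires the response to be a factor; quantitative predictors are fine. A small simulated check (made-up data and names, not the asker's baseline set):

library(ordinal)
set.seed(1)
# ordinal response driven by a continuous predictor x, random intercept per group g
d <- data.frame(x = rnorm(200), g = factor(rep(1:20, each = 10)))
latent <- 0.8 * d$x + rnorm(20)[d$g] + rnorm(200)
d$y <- cut(latent, breaks = c(-Inf, -1, 0, 1, Inf), ordered_result = TRUE)
m <- clmm(y ~ x + (1 | g), data = d)
summary(m)  # recovers a positive slope for the continuous x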

plm() versus lm() with multiple fixed effects

I am attempting to run a model with county, year, and state:year fixed effects. The lm() approach looks like this:
lm <- lm(data = mydata, formula = y ~ x + county + year + state:year)
where county, year, and state:year are all factors.
Because I have a large number of counties, running the model with lm() is very slow. More frustrating, given the number of models I need to produce, is that lm() produces a much larger object than plm(). The following plm() command yields the same coefficients and levels of significance for my main variables:
plm <- plm(data = mydata, formula = y ~ x + year + state:year, index = "county", model = "within")
However, these produce substantially different R-squared, Adj. R-squared, etc. I thought I could solve the R-squared problem by calculating the R-squared for plm by hand:
SST <- sum((mydata$y - mean(mydata$y))^2)
fit <- (mydata$y - plm$residuals)
SSR <- sum((fit - mean(mydata$y))^2)
R2 <- SSR / SST
I tested the R-squared code with lm and got the same result reported by summary(lm). However, when I calculated R-squared for plm I got a different R-squared (and it was greater than 1).
At this point I checked the coefficients of my fixed effects in plm, and they were different from the coefficients in lm.
Can someone please 1) help me understand why I'm getting these differing results and 2) suggest the most efficient way to construct the models I need and obtain correct R-squareds? Thanks!
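A sanity check that may help with part 1 (a sketch using plm's bundled Produc data rather than mydata): the within estimator reproduces the slopes of the dummy-variable lm fit, but plm's reported R-squared is computed on the demeaned data, so it excludes the variation absorbed by the fixed effects and will not match lm's.

library(plm)
data("Produc", package = "plm")
# dummy-variable (LSDV) fit vs. within fit of the same model
m_lm <- lm(log(gsp) ~ log(pc) + factor(state), data = Produc)
m_plm <- plm(log(gsp) ~ log(pc), data = Produc,
             index = c("state", "year"), model = "within")
coef(m_lm)["log(pc)"]   # same slope ...
coef(m_plm)["log(pc)"]  # ... but the summary() R-squareds differ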

Application of a multi-way cluster-robust function in R

Hello (first timer here),
I would like to estimate a "two-way" cluster-robust variance-covariance matrix in R. I am using a particular canned routine from the "multiwayvcov" library. My question relates solely to the set-up of the cluster.vcov function in R. I have panel data of various crime outcomes. My cross-sectional unit is the "precinct" (over 40 precincts) and I observe crime in those precincts over several "months" (i.e., 24 months). I am evaluating an intervention that 'turns on' (dummy coded) for only a few months throughout the year.
I include "precinct" and "month" fixed effects (i.e., a full set of precinct and month dummies enter the model). I have only one independent variable I am assessing. I want to cluster on "both" dimensions but I am unsure how to set it up.
Do I estimate all the fixed effects with lm first? Or do I simply run a model regressing crime on the independent variable (excluding the fixed effects), and then use cluster.vcov with ~ precinct + month_year?
This seems like it would provide the wrong standard error though. Right? I hope this was clear. Sorry for any confusion. See my set up below.
library(multiwayvcov)
model <- lm(crime ~ as.factor(precinct) + as.factor(month_year) + policy, data = DATASET_full)
boot_both <- cluster.vcov(model, ~ precinct + month_year)
coeftest(model, boot_both)
### What the documentation offers as an example
### https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf
library(lmtest)
data(petersen)
m1 <- lm(y ~ x, data = petersen)
### Double cluster by firm and year using a formula
vcov_both_formula <- cluster.vcov(m1, ~ firmid + year)
coeftest(m1, vcov_both_formula)
Is it appropriate to first estimate a model that ignores the fixed effects?
First, the answer: you should estimate your lm model with the fixed effects included. This will give you asymptotically correct parameter estimates. The standard errors are incorrect, however, because they are calculated from a vcov matrix that assumes iid errors.
To replace the iid covariance matrix with a cluster-robust vcov matrix, you can use cluster.vcov, i.e. my_new_vcov_matrix <- cluster.vcov(model, ~ precinct + month_year).
Then a recommendation: I warmly recommend the felm function from the lfe package for both multi-way fixed effects and cluster-robust standard errors.
The syntax is as follows:
library(multiwayvcov)
library(lfe)
data(petersen)
my_fe_model <- felm(y ~ x | firmid + year | 0 | firmid + year, data = petersen)
summary(my_fe_model)

Mixed Modelling - Different Results between lme and lmer functions

I am currently working through Andy Field's book, Discovering Statistics Using R. Chapter 14 is on Mixed Modelling and he uses the lme function from the nlme package.
The model he creates, using speed dating data, is such:
speedDateModel <- lme(dateRating ~ looks + personality +
                        gender + looks:gender + personality:gender +
                        looks:personality,
                      random = ~1|participant/looks/personality,
                      data = speedData)
I tried to recreate a similar model using the lmer function from the lme4 package; however, my results are different. I thought I had the proper syntax, but maybe not?
speedDateModel.2 <- lmer(dateRating ~ looks + personality + gender +
                           looks:gender + personality:gender +
                           (1|participant) + (1|looks) + (1|personality),
                         data = speedData, REML = FALSE)
Also, when I inspect the coefficients of these models, I notice that they only produce random intercepts for each participant. I was trying to then create a model that produces both random intercepts and slopes, but I can't seem to get the syntax right in either function. Any help would be greatly appreciated.
The only difference between the lme and the corresponding lmer formula should be that the random and fixed components are aggregated into a single formula:
dateRating ~ looks + personality +
  gender + looks:gender + personality:gender +
  looks:personality + (1|participant/looks/personality)
Using (1|participant) + (1|looks) + (1|personality) is only equivalent if looks and personality have unique values at each nested level; (1|participant/looks/personality) expands to (1|participant) + (1|participant:looks) + (1|participant:looks:personality).
It's not clear what continuous variable you want to define your slopes with: if you have a continuous variable x and a grouping factor g, then (x|g), or equivalently (1+x|g), will give you a random-slopes model. (x should also be included in the fixed-effects part of the model, i.e. the full formula should be y ~ x + (x|g) ...)
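A concrete, runnable version of that syntax, using the sleepstudy data bundled with lme4 rather than the speed-dating data:

library(lme4)
# random intercept and random slope of Days for each Subject;
# Days also appears in the fixed-effects part
m_slopes <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
summary(m_slopes)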
update: I got the data, or rather a script file that allows one to reconstruct the data, from here. Field makes a common mistake in his book, which I have made several times in the past: since there is only a single observation in the data set for each participant/looks/personality combination, the three-way interaction has one level per observation. In a linear mixed model, this means the variance at the lowest level of nesting will be confounded with the residual variance.
You can see this in two ways:
lme appears to fit the model just fine, but if you try to calculate confidence intervals via intervals(), you get
intervals(speedDateModel)
## Error in intervals.lme(speedDateModel) :
## cannot get confidence intervals on var-cov components:
## Non-positive definite approximate variance-covariance
If you try this with lmer you get:
## Error: number of levels of each grouping factor
## must be < number of observations
In both cases, this is a clue that something's wrong. (You can overcome this in lmer if you really want to: see ?lmerControl.)
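For completeness, a hedged sketch of that override (option names from ?lmerControl; the lowest-level variance will still be confounded with the residual variance):

ctrl <- lmerControl(check.nobs.vs.nlev = "ignore",
                    check.nobs.vs.nRE = "ignore")
sd_full <- lmer(dateRating ~ looks + personality + gender +
                  looks:gender + personality:gender + looks:personality +
                  (1 | participant/looks/personality),
                data = speedData, control = ctrl)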
If we leave out the lowest grouping level, everything works fine:
sd2 <- lmer(dateRating ~ looks + personality +
              gender + looks:gender + personality:gender +
              looks:personality +
              (1|participant/looks),
            data = speedData)
Compare lmer and lme fixed effects:
all.equal(fixef(sd2),fixef(speedDateModel)) ## TRUE
The starling example here gives another example and further explanation of this issue.

Do I need to set refit=FALSE when testing for random effects in lmer() models with anova()?

I am currently testing whether I should include certain random effects in my lmer model or not. I use the anova() function for that. My procedure so far is to fit the models with calls to lmer() with REML=TRUE (the default option). Then I call anova() on two models, where one of them includes the random effect to be tested and the other does not. However, it is well known that anova() refits the models with ML; in newer versions of anova() you can prevent this by setting the option refit=FALSE. In order to test for random effects, should I set refit=FALSE in my call to anova() or not? (If I do set refit=FALSE the p-values tend to be lower. Are the p-values anti-conservative when I set refit=FALSE?)
Method 1:
mod0_reml <- lmer(x ~ y + z + (1 | w), data=dat)
mod1_reml <- lmer(x ~ y + z + (y | w), data=dat)
anova(mod0_reml, mod1_reml)
This will result in anova() refitting the models with ML instead of REML. (Newer versions of anova() will also print a message about this.)
Method 2:
mod0_reml <- lmer(x ~ y + z + (1 | w), data=dat)
mod1_reml <- lmer(x ~ y + z + (y | w), data=dat)
anova(mod0_reml, mod1_reml, refit=FALSE)
This will result in anova() performing its calculations on the original models, i.e. with REML=TRUE.
Which of the two methods is correct in order to test whether I should include a random effect or not?
Thanks for any help
In general I would say that it would be appropriate to use refit=FALSE in this case, but let's go ahead and try a simulation experiment.
First fit a model without a random slope to the sleepstudy data set, then simulate data from this model:
library(lme4)
mod0 <- lmer(Reaction ~ Days + (1|Subject), data=sleepstudy)
## also fit the full model for later use
mod1 <- lmer(Reaction ~ Days + (Days|Subject), data=sleepstudy)
set.seed(101)
simdat <- simulate(mod0, nsim = 1000)
Now refit the null data with the full and the reduced model, and save the distribution of p-values generated by anova() with and without refit=FALSE. This is essentially a parametric bootstrap test of the null hypothesis; we want to see if it has the appropriate characteristics (i.e., uniform distribution of p-values).
sumfun <- function(x) {
  m0 <- refit(mod0, x)
  m1 <- refit(mod1, x)
  a_refit <- suppressMessages(anova(m0, m1)["m1", "Pr(>Chisq)"])
  a_no_refit <- anova(m0, m1, refit = FALSE)["m1", "Pr(>Chisq)"]
  c(refit = a_refit, no_refit = a_no_refit)
}
I like plyr::laply for its convenience, although you could just as easily use a for loop or one of the other *apply approaches.
library(plyr)
pdist <- laply(simdat,sumfun,.progress="text")
library(ggplot2); theme_set(theme_bw())
library(reshape2)
ggplot(melt(pdist), aes(x = value, fill = Var2)) +
  geom_histogram(aes(y = ..density..),
                 alpha = 0.5, position = "identity", binwidth = 0.02) +
  geom_hline(yintercept = 1, lty = 2)
ggsave("nullhist.png", height = 4, width = 5)
Type I error rate for alpha=0.05:
colMeans(pdist<0.05)
## refit no_refit
## 0.021 0.026
You can see that in this case the two procedures give practically the same answer, and that both are strongly conservative, for well-known reasons: the null value of the hypothesis test lies on the boundary of its feasible space. For the specific case of testing a single simple random effect, halving the p-value gives an appropriate answer (see Pinheiro and Bates 2000, among others). That actually appears to give reasonable answers here too, although it is not really justified, because here we are dropping two random-effects parameters (the random slope and the correlation between the slope and intercept random effects):
colMeans(pdist/2<0.05)
## refit no_refit
## 0.051 0.055
Other points:
You might be able to do a similar exercise with the PBmodcomp function from the pbkrtest package.
The RLRsim package is designed precisely for fast randomization (parametric bootstrap) tests of null hypotheses about random-effects terms, but it doesn't appear to work in this slightly more complex situation.
see the relevant GLMM FAQ section for similar information, including arguments for why you might not want to test the significance of random effects at all ...
for extra credit you could redo the parametric bootstrap runs using the deviance (-2 log-likelihood) differences rather than the p-values as output, and check whether the results conform to a mixture between a chi^2_0 (point mass at 0) and a chi^2_n distribution (where n is probably 2, but I wouldn't be sure for this geometry); a sketch follows below
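A sketch of that last exercise, reusing mod0, mod1, and simdat from above (the Chisq column of the anova table is the deviance difference):

devfun <- function(x) {
  m0 <- refit(mod0, x)
  m1 <- refit(mod1, x)
  suppressMessages(anova(m0, m1)["m1", "Chisq"])
}
devdist <- laply(simdat, devfun, .progress = "text")
mean(devdist < 1e-6)                  # estimated point mass at zero
mean(devdist > qchisq(0.95, df = 2))  # tail of the positive part vs. chi^2_2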
