Residual variance extracted from glm and lmer in R

I am trying to take what I have read about multilevel modelling and merge it with what I know about glm in R. I am now using the height growth data from here.
I have done some coding shown below:
library(lme4)
library(ggplot2)
setwd("~/Documents/r_code/multilevel_modelling/")
rm(list=ls())
oxford.df <- read.fwf("oxboys/OXBOYS.DAT",widths=c(2,7,6,1))
names(oxford.df) <- c("stu_code","age_central","height","occasion_id")
oxford.df <- oxford.df[!is.na(oxford.df[,"age_central"]),]
oxford.df[,"stu_code"] <- factor(as.character(oxford.df[,"stu_code"]))
oxford.df[,"dummy"] <- 1
chart <- ggplot(data=oxford.df,aes(x=occasion_id,y=height))
chart <- chart + geom_point(aes(colour=stu_code))
# see if lm and glm give the same estimate
glm.01 <- lm(height~age_central+occasion_id,data=oxford.df)
glm.02 <- glm(height~age_central+occasion_id,data=oxford.df,family="gaussian")
summary(glm.02)
vcov(glm.02)
var(resid(glm.02))
(logLik(glm.01)*-2)-(logLik(glm.02)*-2)
1-pchisq(-2.273737e-13,1)
# lm and glm give the same estimation
# so glm.02 will be used from now on
# see if lmer without level2 variable give same result as glm.02
mlm.03 <- lmer(height~age_central+occasion_id+(1|dummy),data=oxford.df,REML=FALSE)
(logLik(glm.02)*-2)-(logLik(mlm.03)*-2)
# 1-pchisq(-3.408097e-07,1)
# glm.02 and mlm.03 give the same estimation, only if REML=FALSE
mlm.03 gives me the following output:
> mlm.03
Linear mixed model fit by maximum likelihood
Formula: height ~ age_central + occasion_id + (1 | dummy)
Data: oxford.df
AIC BIC logLik deviance REMLdev
1650 1667 -819.9 1640 1633
Random effects:
Groups Name Variance Std.Dev.
dummy (Intercept) 0.000 0.0000
Residual 64.712 8.0444
Number of obs: 234, groups: dummy, 1
Fixed effects:
Estimate Std. Error t value
(Intercept) 142.994 21.132 6.767
age_central 1.340 17.183 0.078
occasion_id 1.299 4.303 0.302
Correlation of Fixed Effects:
(Intr) ag_cnt
age_central 0.999
occasion_id -1.000 -0.999
You can see that there is a variance for the residual in the random effects section. I have read in Applied Multilevel Analysis - A Practical Guide by Jos W.R. Twisk that this represents the amount of "unexplained variance" in the model.
I wondered if I could arrive at the same residual variance from the single-level fit (glm.01 and glm.02 are the same model), so I tried the following:
> var(resid(glm.01))
[1] 64.98952
> sd(resid(glm.01))
[1] 8.061608
The results are slightly different from the mlm.03 output. Does this refer to the same "residual variance" stated in mlm.03?

Your glm.01 and glm.02 fit the same simple linear regression model, estimated by least squares. mlm.03, on the other hand, is a linear mixed model estimated through maximum likelihood.
I don't know your dataset, but it looks like you used the dummy variable to create a level-2 cluster structure with zero variance.
So your question basically has two answers, but only the second one is important in your case. The models glm.02 and mlm.03 do not contain the same residual variance estimate, because...
The models are usually different (mixed effects vs. classical regression). In your case, however, the dummy variable seems to suppress the additional variance component in the mixed model, so the two models appear to be equivalent.
The method used to estimate the residual variance is different: in your code, glm uses least squares and lmer uses ML. ML estimates of the residual variance are biased downward (they divide the residual sum of squares by n rather than by n minus the number of estimated fixed-effect parameters), which results in slightly smaller variance estimates. This can be avoided by using REML instead of ML to estimate the variance components.
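To make the difference concrete, here is a minimal numeric check (a sketch, reusing the fits from your question; the object names are yours):
n <- nobs(glm.02)            # 234 observations
ssr <- sum(resid(glm.02)^2)  # residual sum of squares
ssr / n                      # ML estimate: ~64.712, the "Residual" variance in mlm.03
ssr / (n - 1)                # what var(resid(glm.02)) computes: ~64.990
sigma(glm.02)^2              # unbiased LS estimate, SSR/(n - p): slightly larger again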
Using classic ML (instead of REML), however, is still necessary and correct for the likelihood-ratio test; comparing two REML likelihoods would not be valid.
Cheers!

Related

Is it normal that glmer returns no variance of intercept?

I am trying to run a multi-level model to account for the fact that votes for a country's presidential elections may be nested within groups (depending on voters' mother tongues, places of residence, etc.). In order to do so, I use the glmer function of the lme4 package.
m1<-glmer(vote_DPP ~ 1 + (1 | county_city),
family = binomial(link="logit"), data = d3)
Here, my vote variable is binary, representing whether people vote for a given party (1) or not (0). Since I believe results may change depending on people's state of residence, I want to allow intercepts to vary across states. However, I see no variation of intercept when I run my code.
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: vote_DPP ~ 1 + (1 | county_city)
Data: d3
AIC BIC logLik deviance df.resid
1746.7918 1757.2001 -871.3959 1742.7918 1343
Random effects:
Groups Name Std.Dev.
county_city (Intercept) 0.2559
Number of obs: 1345, groups: county_city, 17
Fixed Effects:
(Intercept)
0.5937
What puzzles me here is the complete absence of variance column. I have seen other forums on the web regarding problems with variance = 0, but I cannot seem to find anything about the complete disappearance of this column (which makes me think it's probably something very simple I missed). First time posting in here, and quite a beginner in R and Stats, so any help would be appreciated :)
If you're concerned about whether the variance is zero, that's equivalent to checking whether the standard deviation is zero (and similarly for "is the std. dev./variance small?", although the two are on different scales). Furthermore, if the std. dev./variance is zero or nearly zero, you should get a "singular fit" message as well.
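As an aside, lme4 also has a helper for exactly this check (a sketch, assuming the m1 fit from the question):
isSingular(m1)  # TRUE if any variance component is estimated at (or very near) zero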
@Roland's comment is correct that summary() will print both the standard deviation and the variance by default. You can ask for both (or either) in the output of print() as well by specifying the ranef.comp (random-effect component) argument:
library(lme4)
gm1 <- glmer(incidence/size ~ period + (1 | herd),
             data = cbpp,
             weights = size,
             family = binomial)
print(gm1, ranef.comp = c("Std.Dev.", "Variance"))
## ...
## Random effects:
## Groups Name Std.Dev. Variance
## herd (Intercept) 0.6421 0.4123
## ...
(You can similarly modify which components are shown in the summary printout: for example if you only want to see the variance, you can specify print(summary(gm1), ranef.comp = c("Variance")).)
For a bit more context: the standard deviation and variance are essentially redundant information (standard errors of the random-effect estimates are not shown because they can be unreliable measures of uncertainty in this case). Which form is more useful depends on the application: standard deviations are easier to compare to the corresponding fixed effects, while variances can sometimes be used to draw conclusions about the partitioning of variance across effects (although doing so is more complicated than in the classic linear, balanced ANOVA case).
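If you want the numbers themselves rather than a printout, one option (a sketch, reusing the gm1 fit above) is to convert the VarCorr() object to a data frame, which carries both forms:
vc <- as.data.frame(VarCorr(gm1))  # columns include vcov (variance) and sdcor (std. dev.)
vc[, c("grp", "var1", "vcov", "sdcor")]
all.equal(vc$vcov, vc$sdcor^2)     # TRUE: the variance is just the squared std. dev.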

lme4::lmer() With A -1?

I am looking at an lmer model that's been coded, and I don't quite understand what the -1 is / is doing. The code looks like fit = lmer(resids ~ -1 + (1|loc/time))
I believe the (1|loc/time) piece can be equivalently written as (1|loc) + (1|loc:time), which is a random intercept of loc, and a random intercept of time varying within loc.
Now the part I don't quite get: the -1, which I think has to do with the mean. The only place I have found that says anything about using a -1 in that spot (as opposed to 1 or leaving it blank) is page 7 of Fitting Linear Mixed-Effects Models using lme4. The table there shows it in conjunction with offset(o), which is used "to specify that a random intercept has a priori known means". So my gut says that leaving the offset(o) out would be the same as using offset(0) (the number 0, not the letter o), which would mean the a priori means are all 0.
Is this correct?
Yes, this sets the fixed-effect component of the model to exactly zero. While it is legal, this is a moderately strange model; I hope that whoever has coded it knows what they're doing. I can only think of two reasons you would fit a model with an empty fixed-effect component:
for some reason you want to do a likelihood ratio test for the significance of the intercept (this is unusual, in most cases the intercept is not of particular statistical interest)
you have a particular experimental design where the mean is known a priori to be zero (e.g. your response variable is some kind of difference between two elements that have been randomized to be exchangeable).
lmer(Reaction ~ -1 + (Days|Subject), sleepstudy)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: Reaction ~ -1 + (Days | Subject)
Data: sleepstudy
AIC BIC logLik deviance df.resid
1840.7814 1853.5532 -916.3907 1832.7814 176
Random effects:
Groups Name Std.Dev. Corr
Subject (Intercept) 252.53
Days 11.93 0.88
Residual 25.59
Number of obs: 180, groups: Subject, 18
No fixed effect coefficients
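As a side note, in R model formulas -1 and 0 are interchangeable ways of suppressing the intercept, so the fit above could equally be written as:
lmer(Reaction ~ 0 + (Days | Subject), sleepstudy)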

Opposite directions of exponential hazard model coefficients ( with survreg and glm poisson)

I want to estimate an exponential hazards model with one predictor in R. For some reason, I am getting coefficients with opposite signs when I estimate it using a glm poisson with offset log(t) and when I just use the survreg function from the survival package. I am sure the explanation is perfectly obvious but I cannot figure it out.
Example
t <- c(89,74,23,74,53,3,177,44,28,43,25,24,31,111,57,20,19,137,45,48,9,17,4,59,7,26,180,56,36,51,6,71,23,6,13,28,16,180,16,25,6,25,4,5,32,94,106,1,69,63,31)
d <- c(0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,0,1,0,1,1,1,0,1,0,1,1,0,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1)
p <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,1)
df <- data.frame(d,t,p)
# exponential hazards model using poisson with offset log(t)
summary(glm(d ~ offset(log(t)) + p, data = df, family = "poisson"))
Produces:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.3868 0.7070 -7.619 2.56e-14 ***
p 1.3932 0.7264 1.918 0.0551 .
Compared to
# exponential hazards model using survreg exponential
require(survival)
summary(survreg(Surv(t,d) ~ p, data = df, dist = "exponential"))
Produces:
Value Std. Error z p
(Intercept) 5.39 0.707 7.62 2.58e-14
p -1.39 0.726 -1.92 5.51e-02
Why are the coefficients in opposite directions and how would I interpret the results as they stand?
Thanks!
The two models parameterize the same relationship on different scales. In the second model (survreg), the coefficients are on the scale of log expected survival time, so an increased value of p is associated with decreased expected survival time. In the first model (glm poisson with offset log(t)), the coefficients are on the log-hazard scale, so an increased value of p implies a higher risk and hence a lower chance of survival. Risk and mean survival time necessarily vary in opposite directions, and in exponential models the risk is (exactly) inversely proportional to the mean lifetime, so the absolute values are the same by the mathematical identity log(1/x) = -log(x).
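A quick numeric check of that identity (a sketch; glm_fit and sr_fit are assumed names for refits of the two models above):
library(survival)
glm_fit <- glm(d ~ offset(log(t)) + p, data = df, family = "poisson")
sr_fit <- survreg(Surv(t, d) ~ p, data = df, dist = "exponential")
coef(glm_fit) + coef(sr_fit)  # essentially zero: each coefficient is the negative of the other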

Confidence Interval Based on Asymptotic Normality in lmer model

Why does confint.default, which is based on asymptotic normality, not work for an lmer model?
fit <- lmer(y~(1|operator)+(1|part),data=dat)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ (1 | operator) + (1 | part)
Data: dat
REML criterion at convergence: 409.3913
Random effects:
Groups Name Std.Dev.
part (Intercept) 3.2018
operator (Intercept) 0.1031
Residual 0.9398
Number of obs: 120, groups: part, 20; operator, 3
Fixed Effects:
(Intercept)
22.39
confint.default(fit)
Error in as.integer(x) :
cannot coerce type 'S4' to vector of type 'integer'
What is the error saying? How can I get confidence intervals based on asymptotic normality for an lmer model?
Don't use confint.default(), just use confint(). The methods to calculate confidence intervals are different for the different model types. You can see the different methods with methods(confint). The "correct" version of the function is called based on the class of the first object you pass to the function. Directly calling one of the methods usually isn't a good idea.
There are options for how to calculate the bounds for the lmer objects. Look at the help page for ?confint.merMod to see the options unique to that model type.
@MrFlick is correct, but it may be worth adding that confint.merMod() gives likelihood profile CIs by default; confint(., method="Wald") will give the confidence intervals based on asymptotic normality:
β€˜"Wald"’: approximating the confidence intervals (of fixed-effect
parameters only; all variance-covariance parameters CIs will
be returned as β€˜NA’) based on the estimated local curvature
of the likelihood surface;
(this is obvious from the help page, but is probably worth restating here).
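In other words (a minimal sketch, reusing the fit object from the question):
confint(fit)                   # default: likelihood profile CIs, including variance parameters
confint(fit, method = "Wald")  # asymptotic-normality CIs: fixed effects only, NA elsewhere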

Random slope for time in subject not working in lme4

I cannot insert a random slope in this model with lme4 (1.1-7):
> difJS<-lmer(JS~Tempo+(Tempo|id),dat,na.action=na.omit)
Error: number of observations (=274) <= number of random effects (=278) for term
(Tempo | id); the random-effects parameters and the residual variance (or scale
parameter) are probably unidentifiable
With nlme it is working:
> JSprova<-lme(JS~Tempo,random=~1+Tempo|id,data=dat,na.action=na.omit)
> summary(JSprova)
Linear mixed-effects model fit by REML Data: dat
AIC BIC logLik
769.6847 791.3196 -378.8424
Random effects:
Formula: ~1 + Tempo | id
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.1981593 (Intr)
Tempo 0.5409468 -0.692
Residual 0.5597984
Fixed effects: JS ~ Tempo
Value Std.Error DF t-value p-value
(Intercept) 4.116867 0.14789184 138 27.837013 0.0000
Tempo -0.207240 0.08227474 134 -2.518874 0.0129
Correlation:
(Intr)
Tempo -0.837
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.79269550 -0.39879115 0.09688881 0.41525770 2.32111142
Number of Observations: 274
Number of Groups: 139
I think it is a problem of missing data, as I have a few cases with missing data at time two of the DV, but with na.action=na.omit shouldn't the two packages behave in the same way?
It is "working" with lme, but I'm 99% sure that your random slopes are indeed confounded with the residual variation. The problem is that you only have two measurements per subject (or only one measurement per subject in 4 cases -- but that's not important here), so that a random slope plus a random intercept for every individual gives one random effect for every observation.
If you try intervals() on your lme fit, it will give you an error saying that the variance-covariance matrix is unidentifiable.
You can force lmer to do it by disabling some of the identifiability checks (see below).
library("lme4")
library("nlme")
library("plyr")
Restrict the data to only two points per individual:
sleepstudy0 <- ddply(sleepstudy,"Subject",
function(x) x[1:2,])
m1 <- lme(Reaction~Days,random=~Days|Subject,data=sleepstudy0)
intervals(m1)
## Error ... cannot get confidence intervals on var-cov components
lmer(Reaction~Days+(Days|Subject),data=sleepstudy0)
## error
If you want you can force lmer to fit this model:
m2B <- lmer(Reaction~Days+(Days|Subject),data=sleepstudy0,
control=lmerControl(check.nobs.vs.nRE="ignore"))
## warning messages
The estimated variances are different from those estimated by lme, but that's not surprising since some of the parameters are jointly unidentifiable.
If you're only interested in inference on the fixed effects, it might be OK to ignore these problems, but I wouldn't recommend it.
The sensible thing to do is to recognize that the variation among slopes is unidentifiable; there may be among-individual variation among slopes, but you just can't estimate it with this model. Don't try; fit a random-intercept model and let the implicit/default random error term take care of the variation among slopes.
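For example (a sketch, using the variable names from the question):
difJS <- lmer(JS ~ Tempo + (1 | id), data = dat, na.action = na.omit)
This keeps the fixed effect of Tempo but drops the unidentifiable random slope.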
There's a recent related question on CrossValidated; there I also refer to another example.
