Interpreting output from emmeans::contrast - r

I have data from a longitudinal study and calculated the regression using the lme4::lmer function. After that I calculated the contrasts for these data but I am having difficulty interpreting my results, as they were unexpected. I think I might have made a mistake in the code. Unfortunately I couldn't replicate my results with an example, but I will post both the failed example and my actual results below.
My results:
library(lme4)
library(lmerTest)
library(emmeans)
#regression
regmemory <- lmer(memory ~ as.factor(QuartileConsumption)*Age+
(1 + Age | ID) + sex + education +
HealthScore, CognitionData)
#results
summary(regmemory)
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) -7.981e-01 9.803e-02 1.785e+04 -8.142 4.15e-16 ***
#as.factor(QuartileConsumption)2 -8.723e-02 1.045e-01 2.217e+04 -0.835 0.40376
#as.factor(QuartileConsumption)3 5.069e-03 1.036e-01 2.226e+04 0.049 0.96097
#as.factor(QuartileConsumption)4 -2.431e-02 1.030e-01 2.213e+04 -0.236 0.81337
#Age -1.709e-02 1.343e-03 1.989e+04 -12.721 < 2e-16 ***
#sex 3.247e-01 1.520e-02 1.023e+04 21.355 < 2e-16 ***
#education 2.979e-01 1.093e-02 1.061e+04 27.266 < 2e-16 ***
#HealthScore -1.098e-06 5.687e-07 1.021e+04 -1.931 0.05352 .
#as.factor(QuartileConsumption)2:Age 1.101e-03 1.842e-03 1.951e+04 0.598 0.55006
#as.factor(QuartileConsumption)3:Age 4.113e-05 1.845e-03 1.935e+04 0.022 0.98221
#as.factor(QuartileConsumption)4:Age 1.519e-03 1.851e-03 1.989e+04 0.821 0.41174
#contrasts
emmeans(regmemory, poly ~ QuartileConsumption * Age)$contrast
#$contrasts
# contrast estimate SE df z.ratio p.value
# linear 0.2165 0.0660 Inf 3.280 0.0010
# quadratic 0.0791 0.0289 Inf 2.733 0.0063
# cubic -0.0364 0.0642 Inf -0.567 0.5709
The interaction terms in the regression results are not significant, but the linear contrast is. Shouldn't the p-value for the contrast be non-significant?
Below is the code I wrote to try to recreate these results, but failed:
library(dplyr)
library(lme4)
library(lmerTest)
library(emmeans)
data("sleepstudy")
#create quartile column
sleepstudy$Quartile <- sample(1:4, size = nrow(sleepstudy), replace = T)
#regression
model1 <- lmer(Reaction ~ Days * as.factor(Quartile) + (1 + Days | Subject), data = sleepstudy)
#results
summary(model1)
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) 258.1519 9.6513 54.5194 26.748 < 2e-16 ***
#Days 9.8606 2.0019 43.8516 4.926 1.24e-05 ***
#as.factor(Quartile)2 -11.5897 11.3420 154.1400 -1.022 0.308
#as.factor(Quartile)3 -5.0381 11.2064 155.3822 -0.450 0.654
#as.factor(Quartile)4 -10.7821 10.8798 154.0820 -0.991 0.323
#Days:as.factor(Quartile)2 0.5676 2.1010 152.1491 0.270 0.787
#Days:as.factor(Quartile)3 0.2833 2.0660 155.5669 0.137 0.891
#Days:as.factor(Quartile)4 1.8639 2.1293 153.1315 0.875 0.383
#contrast
emmeans(model1, poly ~ Quartile*Days)$contrast
#contrast estimate SE df t.ratio p.value
# linear -1.91 18.78 149 -0.102 0.9191
# quadratic 10.40 8.48 152 1.227 0.2215
# cubic -18.21 18.94 150 -0.961 0.3379
In this example, the p-value for the linear contrast is non-significant, just like the interactions in the regression. Did I do something wrong, or are these results to be expected?

Look at the emmeans() call for the original model:
emmeans(regmemory, poly ~ QuartileConsumption * Age)
This requests that we obtain marginal means for combinations of QuartileConsumption and Age, and obtain polynomial contrasts from those results. It appears that Age is a quantitative variable, so in computing the marginal means, we just use the mean value of Age (see documentation for ref_grid() and vignette("basics", "emmeans")). So the marginal means display, which wasn't shown in the OP, will be in this general form:
QuartileConsumption Age emmean
------------------------------------
1 <mean> <est1>
2 <mean> <est2>
3 <mean> <est3>
4 <mean> <est4>
... and the contrasts shown will be the linear, quadratic, and cubic trends of those four estimates, in the order shown.
Note that these marginal means have nothing to do with the interaction effect; they are just predictions from the model for the four levels of QuartileConsumption at the mean Age (and mean education, mean health score), averaged over the two sexes, if I understand the data structure correctly. So essentially the polynomial contrasts estimate polynomial trends of the 4-level factor at the mean age. And note in particular that age is held constant, so we certainly are not looking at any effects of Age.
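To make concrete what those three contrasts measure, here is a minimal base-R sketch. The four estimates are made-up numbers standing in for <est1> to <est4>, and the integer coefficients are the usual orthogonal polynomial weights for four equally spaced levels (I am assuming these are what the poly method applies, possibly up to scaling). Age never enters the calculation.

```r
# Hypothetical marginal means for the four quartiles at the mean Age
# (made-up numbers, not taken from the fitted model)
emm <- c(q1 = -0.10, q2 = -0.05, q3 = 0.02, q4 = 0.08)

# Orthogonal polynomial contrast coefficients for four equally spaced levels
coefs <- rbind(
  linear    = c(-3, -1,  1,  3),
  quadratic = c( 1, -1, -1,  1),
  cubic     = c(-1,  3, -3,  1)
)

# Each contrast is just a weighted sum of the four means
trend <- drop(coefs %*% emm)
trend
```

A significant linear contrast therefore says the four quartile means rise or fall steadily at that fixed age; it says nothing about whether the Age slope differs between quartiles.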
I am guessing what you want to be doing to examine the interaction is to assess how the age trend varies over the four levels of that factor. If that is the case, one useful thing to do would be something like
slopes <- emtrends(regmemory, ~ QuartileConsumption, var = "Age")
slopes # display the estimated slope at each level
pairs(slopes) # pairwise comparisons of these slopes
See vignette("interactions", "emmeans") and the section on interactions with covariates.
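As a rough illustration of what per-level slopes are, the following base-R sketch uses the built-in ToothGrowth data and an ordinary lm() instead of the mixed model, so it is only an analogy to what emtrends() reports. The slope for the reference level is the covariate's coefficient, and the other level's slope adds its interaction coefficient.

```r
# Interaction model on a built-in dataset (stand-in for the real data)
fit <- lm(len ~ supp * dose, data = ToothGrowth)
b <- coef(fit)

# Slope of dose within each level of supp, rebuilt from the coefficients
slopes <- c(OJ = unname(b["dose"]),
            VC = unname(b["dose"] + b["suppVC:dose"]))

# The same slopes from fitting each level separately
sep <- sapply(split(ToothGrowth, ToothGrowth$supp),
              function(d) unname(coef(lm(len ~ dose, data = d))["dose"]))

slopes
all.equal(slopes, sep)
```

The interaction coefficients in summary() are differences between these slopes, which is why comparing the slopes addresses the interaction question directly.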

Related

R: Plotting Mixed Effect models plot results

I am working on linguistic data and am trying to investigate the realisation of the vowel in words such as NURSE. There are more or less 3 categories that can be realised, which I coded as <Er, Ir, Vr>. I then measured formant values (F1 and F2). Then I created an LME that predicts the F1 and F2 values with different fixed and random effects, but the main effect is a crossed random effect of phoneme (i.e. <Er, Ir, Vr>) and individual. An example model can be found below.
Linear mixed model fit by REML ['lmerMod']
Formula:
F2 ~ (phoneme | individual) + (1 | word) + age + frequency +
(1 | zduration)
Data: nurse_female
REML criterion at convergence: 654.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.09203 -0.20332 0.03263 0.25273 1.37056
Random effects:
Groups Name Variance Std.Dev. Corr
zduration (Intercept) 0.27779 0.5271
word (Intercept) 0.04488 0.2118
individual (Intercept) 0.34181 0.5846
phonemeIr 0.54227 0.7364 -0.82
phonemeVr 1.52090 1.2332 -0.93 0.91
Residual 0.06326 0.2515
Number of obs: 334, groups:
zduration, 280; word, 116; individual, 23
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.79167 0.32138 5.575
age -0.01596 0.00508 -3.142
frequencylow -0.37587 0.18560 -2.025
frequencymid -1.18901 0.27738 -4.286
frequencyvery high -0.68365 0.26564 -2.574
Correlation of Fixed Effects:
(Intr) age frqncyl frqncym
age -0.811
frequencylw -0.531 -0.013
frequencymd -0.333 -0.006 0.589
frqncyvryhg -0.356 0.000 0.627 0.389
The question is now, how would I go about plotting the mean F2 values for each individual and for each of 3 variants <Er, Ir, Vr>?
I tried plotting the random effects as a caterpillar plot and got the following, but I am not sure if this is accurate or does what I want. If what I have done is right, are there any better ways of plotting it?
ranefs_nurse_female_F2 <- ranef(nurse_female_F2.lmer8_2)
dotplot(ranefs_nurse_female_F2)

Regression equation produces model outside of all data

I'm fairly confused as to why I produce a regression equation that is so far outside the range of all the data in the dataset. I have a feeling the equation is very sensitive to data with a big spread, but I'm still confused. Any assistance would be greatly appreciated; stats certainly isn't my first language!
For reference, this is a geochemical thermodynamics problem: I'm trying to fit the Maier-Kelley equation to some experimental data. The Maier-Kelley equation describes how the equilibrium constant (K), in this case for dolomite dissolving in water, changes with temperature (T, in this case in Kelvin).
log K = A + B.T + C/T + D.logT + E/T^2
To cut a long story short (see Hyeong and Capuano, 2001, if interested), the equilibrium constant (K) is the same as Log_Ca_Mg (the ratio of calcium to magnesium ion activities).
The experimental data uses groundwater data from different locations and different depths (so identified by FIELD and DepthID - which are my random variables).
I have included 3 datasets
(Problem)Dataset 1:https://pastebin.com/fe2r2ebA
(Working)Dataset 2:https://pastebin.com/gFgaJ2c8
(Working)Dataset 3:https://pastebin.com/X5USaaNA
Using the following code, for dataset 1
> dat1 <- read.csv("PATH_TO_DATASET_1.txt", header = TRUE,sep="\t")
> fm1 <- lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1|FIELD) +(1|DepthID),data=dat1)
Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)
3: Some predictor variables are on very different
> summary(fm1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1 | FIELD) + (1 | DepthID)
Data: dat1
REML criterion at convergence: -774.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.5464 -0.4538 -0.0671 0.3736 6.4217
Random effects:
Groups Name Variance Std.Dev.
DepthID (Intercept) 0.01035 0.1017
FIELD (Intercept) 0.01081 0.1040
Residual 0.01905 0.1380
Number of obs: 1175, groups: DepthID, 675; FIELD, 410
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.368e+03 1.706e+03 4.582e-02 1.974 0.876
kelvin 4.615e-01 2.375e-01 4.600e-02 1.943 0.876
I(kelvin^-1) -1.975e+05 9.788e+04 4.591e-02 -2.018 0.875
I(log10(kelvin)) -1.205e+03 6.122e+02 4.582e-02 -1.968 0.876
I(kelvin^-2) 1.230e+07 5.933e+06 4.624e-02 2.073 0.873
Correlation of Fixed Effects:
(Intr) kelvin I(^-1) I(10()
kelvin 0.999
I(kelvn^-1) -1.000 -0.997
I(lg10(kl)) -1.000 -0.999 0.999
I(kelvn^-2) 0.998 0.994 -0.999 -0.997
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)
For Dataset 2
> summary(fm2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1 | FIELD) + (1 | DepthID)
Data: dat2
REML criterion at convergence: -1073.8
Scaled residuals:
Min 1Q Median 3Q Max
-3.0816 -0.4772 -0.0581 0.3650 5.6209
Random effects:
Groups Name Variance Std.Dev.
DepthID (Intercept) 0.007368 0.08584
FIELD (Intercept) 0.014266 0.11944
Residual 0.023048 0.15182
Number of obs: 1906, groups: DepthID, 966; FIELD, 537
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -9.366e+01 2.948e+03 1.283e-03 -0.032 0.999
kelvin -2.798e-02 4.371e-01 1.289e-03 -0.064 0.998
I(kelvin^-1) 2.623e+02 1.627e+05 1.285e-03 0.002 1.000
I(log10(kelvin)) 3.965e+01 1.067e+03 1.283e-03 0.037 0.999
I(kelvin^-2) 2.917e+05 9.476e+06 1.294e-03 0.031 0.999
Correlation of Fixed Effects:
(Intr) kelvin I(^-1) I(10()
kelvin 0.999
I(kelvn^-1) -0.999 -0.997
I(lg10(kl)) -1.000 -0.999 0.999
I(kelvn^-2) 0.998 0.994 -0.999 -0.997
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196967 (tol = 0.002, component 1)
For Dataset 3
> summary(fm2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1 | FIELD) + (1 | DepthID)
Data: dat3
REML criterion at convergence: -1590.1
Scaled residuals:
Min 1Q Median 3Q Max
-4.2546 -0.4987 -0.0379 0.4313 4.5490
Random effects:
Groups Name Variance Std.Dev.
DepthID (Intercept) 0.01311 0.1145
FIELD (Intercept) 0.01424 0.1193
Residual 0.03138 0.1771
Number of obs: 6674, groups: DepthID, 3422; FIELD, 1622
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.260e+03 1.835e+03 9.027e-02 0.687 0.871
kelvin 1.824e-01 2.783e-01 9.059e-02 0.655 0.874
I(kelvin^-1) -7.289e+04 9.961e+04 9.044e-02 -0.732 0.866
I(log10(kelvin)) -4.529e+02 6.658e+02 9.028e-02 -0.680 0.872
I(kelvin^-2) 4.499e+06 5.690e+06 9.104e-02 0.791 0.860
Correlation of Fixed Effects:
(Intr) kelvin I(^-1) I(10()
kelvin 0.999
I(kelvn^-1) -1.000 -0.997
I(lg10(kl)) -1.000 -0.999 0.999
I(kelvn^-2) 0.998 0.994 -0.999 -0.998
fit warnings:
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 1 negative eigenvalues
I've plotted 'all the data', but for the regression analysis there is no data above the red line or below the green line. Only points with a Log_Ca_Mg value between the red and green lines at any temperature are included in the regression analysis.
So, looking at the regressions on a plot, dataset 1 is just way off, but as there is no data above the red line this confuses me no end. The regression is sitting in an area where there is no data. For the other two datasets this isn't a problem. Even for datasets with smaller sizes (n=200) it's approximately in the same area. The three datasets look relatively similar when plotted individually.
I'm kind of lost. Any help in understanding this would be appreciated.
What follows is an attempt to diagnose what may be going wrong with your model. It will use Dataset 1 for this discussion:
As described in your question, running the original model with Dataset 1 produces warnings:
# original model
fm1 <- lme4::lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + I(log10(kelvin)) + I(kelvin^-2) + (1|FIELD) +(1|DepthID),data=dat1)
Some predictor variables are on very different scales: consider rescaling
convergence code: 0
Model failed to converge with max|grad| = 0.0196619 (tol = 0.002, component 1)
This and other information suggests your model has problems, perhaps related to the predictors being on different scales.
Since fm1 has several predictors that are transformations of the variable 'kelvin', we can also check the model for collinearity with the vif function from the car package:
# examine collinearity with the vif (variance inflation factors)
> car::vif(fm1)
kelvin I(kelvin^-1) I(log10(kelvin)) I(kelvin^-2)
716333 9200929 7688348 1224275
These vif values suggest the fm1 model suffers from high collinearity.
We can try dropping some of those predictors to examine a simpler model:
fm1_b <- lme4::lmer(Log_Ca_Mg ~ 1 + kelvin + I(kelvin^-1) + (1|FIELD) +(1|DepthID),data=dat1)
When we run the code, we still get a warning about the predictors being on different scales:
Warning message: Some predictor variables are on very different scales: consider rescaling
At the same time the vif values are much smaller:
# examine collinearity with the vif (variance inflation factors)
> car::vif(fm1_b)
kelvin I(kelvin^-1)
46.48406 46.48406
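Why are kelvin and its reciprocal so collinear? Over a narrow temperature window, 1/T is almost a straight line in T. Here is a small base-R sketch of that, using a hypothetical evenly spaced grid from 275 K to 400 K instead of the actual data, and the fact that in an ordinary linear model with exactly two predictors the VIF equals 1 / (1 - r^2):

```r
# Hypothetical temperature grid (not the real dataset)
kelvin <- seq(275, 400, by = 1)

# Correlation between the raw temperature and its reciprocal
r <- cor(kelvin, 1 / kelvin)

# With only two predictors, both share the same VIF
vif_two_terms <- 1 / (1 - r^2)

c(correlation = r, vif = vif_two_terms)
```

Adding log10(kelvin) and kelvin^-2, which are also nearly linear over this range, compounds the problem, which is consistent with the much larger values reported for fm1.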
Following gung's suggestion that I mentioned in the comments, we can see what happens when we center our kelvin variables:
dat1$kelvin_centered <- as.vector(scale(dat1$kelvin, center= TRUE, scale = FALSE ))
# Make a power transformation on the kelvin_centered variable
dat1$kelvin_centered_pwr <- dat1$kelvin_centered^-1
And check to see if they are correlated
# check the correlation of the centered vars
cor(dat1$kelvin_centered, dat1$kelvin_centered_pwr)
> cor(dat1$kelvin_centered, dat1$kelvin_centered_pwr)
[1] 0.08056641
And construct a different model with the centered variables:
# construct a modifed model
fm1_c <- lme4::lmer(Log_Ca_Mg ~ 1 + kelvin_centered + kelvin_centered_pwr + (1|FIELD) +(1|DepthID),data=dat1)
Notably, we don't see any warnings when we run the code with this model, and the vif values are quite low:
car::vif(fm1_c)
> car::vif(fm1_c)
kelvin_centered kelvin_centered_pwr
1.005899 1.005899
Conclusion
The original model has a high degree of collinearity. Collinearity can make models unstable, which could account for why fm1 failed to converge, and why you are seeing weird predictions in the plots. Model fm1_c may or may not be the correct model for your purpose. It at least provides a lens to understand the issue with your original model.
I think you are going about this the wrong way. It sounds like you are trying to estimate the parameters A, B, C, D and E in the Maier-Kelley equation. You can do this by using non-linear least squares rather than a linear mixed effects model.
Start by defining a function that replicates the formula:
MK_eq <- function(A, B, C, D, E, Temp)
{
A + B * Temp + C / Temp + D * log10(Temp) + E / (Temp^2)
}
Now we use the nls function to get an estimate for A to E:
mod1 <- nls(Log_Ca_Mg ~ MK_eq(A, B, C, D, E, kelvin),
start = list(A = 1, B = 1, C = 1, D = 1, E = 2), data = dat1)
coef(mod1)
#> A B C D E
#> 4.802008e+03 6.538166e-01 -2.818917e+05 -1.717040e+03 1.755566e+07
and we can create a "regression line" by getting a prediction for every value of Kelvin between, say, 275 and 400 in increments of 0.1:
new_data <- data.frame(kelvin = seq(275, 400, 0.1))
new_data$Log_Ca_Mg <- predict(mod1, newdata = new_data)
and we can demonstrate that this is a good approximation by plotting our prediction over the sample:
library(ggplot2)
ggplot(dat1, aes(x = kelvin, y = Log_Ca_Mg)) +
geom_point() +
geom_line(data = new_data, linetype = 2, colour = "red", size = 2)
Note that for simplicity I have avoided discussion of the random effects - it is possible to do mixed effects non-linear least squares using the nlme package, but it is more involved and the discussion here describes how to do it in more detail than I can here.

Colnames error after running Summary() in mixed model

R version 3.1.0 (2014-04-10)
lme4 package version 1.1-6
lmerTest package version 2.0-6
I am currently working with lmer and lmerTest for my analysis.
Every time I add an effect to the random structure, I get the following error when running summary():
#Fitting a mixed model:
TRT5ToVerb.lmer3 = lmer(TRT5ToVerb ~ Group + Condition + (1+Condition|Participant) + (1|Trial), data=AllData, REML=FALSE, na.action=na.omit)
summary(TRT5ToVerb.lmer3)
Error in `colnames<-`(`*tmp*`, value = c("Estimate", "Std. Error", "df", : length of 'dimnames' [2] not equal to array extent
If I leave the structure like this:
TRT5ToVerb.lmer2 = lmer(TRT5ToVerb ~ Group + Condition + (1|Participant) + (1|Trial), data=AllData, REML=FALSE, na.action=na.omit)
there is no error when I run summary(TRT5ToVerb.lmer2), which returns AIC, BIC, logLik, deviance, estimates of the random effects, estimates of the fixed effects and their corresponding p-values, etc.
So, apparently something happens when I run lmerTest, despite the fact that the object TRT5ToVerb.lmer3 is there. The only difference between the two models is the random structure: (1+Condition|Participant) vs. (1|Participant)
Some characteristics of my data:
Both Condition and Group are categorical variables: Condition comprises 3 levels, and Group 2
The dependent variable (TRT5ToVerb) is continuous: it corresponds to reading time in ms
This is a repeated measures experiment, with 48 observations per participant (participants=28)
I read this thread, but I cannot see a clear solution. Will I have to transform my dataframe to long format?
And if so, then how do I work with that in lmer?
I hope it is not that.
Thanks!
Disclaimer: I am neither an expert in R, nor in statistics, so please, have some patience.
(Should be a comment, but too long/code formatting etc.)
This fake example seems to work fine with lmerTest 2.0-6 and a development version of lme4 (1.1-8; but I wouldn't expect there to be any relevant differences from 1.1-6 for this example ...)
AllData <- expand.grid(Condition=factor(1:3),Group=factor(1:2),
Participant=1:28,Trial=1:8)
form <- TRT5ToVerb ~ Group + Condition + (1+Condition|Participant) + (1|Trial)
library(lme4)
set.seed(101)
AllData$TRT5ToVerb <- simulate(form[-2],
newdata=AllData,
family=gaussian,
newparam=list(theta=rep(1,7),sigma=1,beta=rep(0,4)))[[1]]
library(lmerTest)
lmer3 <- lmer(form,data=AllData,REML=FALSE)
summary(lmer3)
Produces:
Linear mixed model fit by maximum likelihood ['merModLmerTest']
Formula: TRT5ToVerb ~ Group + Condition + (1 + Condition | Participant) +
(1 | Trial)
Data: AllData
AIC BIC logLik deviance df.resid
4073.6 4136.0 -2024.8 4049.6 1332
Scaled residuals:
Min 1Q Median 3Q Max
-2.97773 -0.65923 0.02319 0.66454 2.98854
Random effects:
Groups Name Variance Std.Dev. Corr
Participant (Intercept) 0.8546 0.9245
Condition2 1.3596 1.1660 0.58
Condition3 3.3558 1.8319 0.44 0.82
Trial (Intercept) 0.9978 0.9989
Residual 0.9662 0.9829
Number of obs: 1344, groups: Participant, 28; Trial, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.49867 0.39764 12.40000 1.254 0.233
Group2 0.03002 0.05362 1252.90000 0.560 0.576
Condition2 -0.03777 0.22994 28.00000 -0.164 0.871
Condition3 -0.27796 0.35237 28.00000 -0.789 0.437
Correlation of Fixed Effects:
(Intr) Group2 Cndtn2
Group2 -0.067
Condition2 0.220 0.000
Condition3 0.172 0.000 0.794
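The lengths passed to newparam in the simulation above are not arbitrary. As I understand lme4's parameterisation, theta holds the lower-triangular Cholesky entries of each random-effects covariance block, and beta has one entry per fixed-effect column. A base-R sketch of that count follows (the helper name n_theta is mine, not part of lme4):

```r
# Same design grid as in the fake example
AllData <- expand.grid(Condition = factor(1:3), Group = factor(1:2),
                       Participant = 1:28, Trial = 1:8)

# Number of free Cholesky entries for a q x q covariance block (hypothetical helper)
n_theta <- function(q) q * (q + 1) / 2

# (1 + Condition | Participant): intercept plus two Condition contrasts
q_participant <- ncol(model.matrix(~ 1 + Condition, AllData))
# (1 | Trial): a single random intercept
q_trial <- 1

total_theta <- n_theta(q_participant) + n_theta(q_trial)

# Fixed effects: intercept, Group2, Condition2, Condition3
n_beta <- ncol(model.matrix(~ Group + Condition, AllData))

c(theta = total_theta, beta = n_beta, n_obs = nrow(AllData))
```

This reproduces the rep(1,7) and rep(0,4) used above, and the row count matches the "Number of obs" line in the summary.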

Interpreting the output of summary(glmer(...)) in R

I'm an R noob, I hope you can help me:
I'm trying to analyse a dataset in R, but I'm not sure how to interpret the output of summary(glmer(...)) and the documentation isn't a big help:
> data_chosen_stim<-glmer(open_chosen_stim~closed_chosen_stim+day+(1|ID),family=binomial,data=chosenMovement)
> summary(data_chosen_stim)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: open_chosen_stim ~ closed_chosen_stim + day + (1 | ID)
Data: chosenMovement
AIC BIC logLik deviance df.resid
96.7 105.5 -44.4 88.7 62
Scaled residuals:
Min 1Q Median 3Q Max
-1.4062 -1.0749 0.7111 0.8787 1.0223
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 0 0
Number of obs: 66, groups: ID, 35
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.4511 0.8715 0.518 0.605
closed_chosen_stim2 0.4783 0.5047 0.948 0.343
day -0.2476 0.5060 -0.489 0.625
Correlation of Fixed Effects:
(Intr) cls__2
clsd_chsn_2 -0.347
day -0.916 0.077
I understand the GLM behind it, but I can't see the weights of the independent variables and their error bounds.
update: weights.merMod already has a type argument ...
I think what you're looking for is weights(object,type="working").
I believe these are the diagonal elements of W in your notation?
Here's a trivial example that matches up the results of glm and glmer (since the random effect is bogus and gets an estimated variance of zero, the fixed effects, weights, etc. converge to the same values).
Note that the weights() accessor returns the prior weights by default (these are all equal to 1 for the example below).
Example (from ?glm):
d.AD <- data.frame(treatment=gl(3,3),
outcome=gl(3,1,9),
counts=c(18,17,15,20,10,20,25,13,12))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson(),
data=d.AD)
library(lme4)
d.AD$f <- 1 ## dummy grouping variable
glmer.D93 <- glmer(counts ~ outcome + treatment + (1|f),
family = poisson(),
data=d.AD,
control=glmerControl(check.nlev.gtr.1="ignore"))
Fixed effects and weights are the same:
all.equal(fixef(glmer.D93),coef(glm.D93)) ## TRUE
all.equal(unname(weights(glm.D93,type="working")),
weights(glmer.D93,type="working"),
tol=1e-7) ## TRUE
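To see the difference between the two kinds of weights without lme4, here is a sketch on the plain glm fit from the same example. For a Poisson model with log link, the working (IRLS) weights should equal the fitted means, up to the convergence tolerance of the final iteration. That relationship is standard GLM theory, not something stated above, so treat the comparison as an illustration.

```r
# Same data and glm fit as in the example from ?glm
d.AD <- data.frame(treatment = gl(3, 3),
                   outcome = gl(3, 1, 9),
                   counts = c(18, 17, 15, 20, 10, 20, 25, 13, 12))
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson(), data = d.AD)

# Prior weights are what the user supplied (none here, so all 1)
prior_w <- weights(glm.D93)                      # default type = "prior"

# Working weights come from the last IRLS iteration
working_w <- weights(glm.D93, type = "working")

all(prior_w == 1)
all.equal(unname(working_w), unname(fitted(glm.D93)), tolerance = 1e-3)
```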

Why is the standard error different in these two fitting methods (R Logistic Regression and Beta Regression) for a common dataset?

I am trying to understand the difference between two different fitting methods for a data set with a bounded response variable. The response variable is a fraction and therefore has a range of [0,1]. I have uncovered through my Google searching that there are a lot of different methods out there as this is a common operation. I am currently interested in the difference between the stock R GLM fit and the Beta regression offered in the betareg package. I am using the GasolineYield data set from the "betareg" package as my sample data set. Before I post the code and the results my two questions are the following:
Am I performing the Logistic Regression fit in R using the builtin R GLM correctly?
Why are the standard errors reported in the Beta regression so much smaller than the standard errors for the R logistic regression?
R Setup Code
library(betareg)
data("GasolineYield", package = "betareg")
Beta Regression code from the "betareg" package
gy = betareg(yield ~ batch + temp, data = GasolineYield)
summary(gy)
Beta Regression summary output
Call:
betareg(formula = yield ~ batch + temp, data = GasolineYield)
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-2.8750 -0.8149 0.1601 0.8384 2.0483
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1595710 0.1823247 -33.784 < 2e-16 ***
batch1 1.7277289 0.1012294 17.067 < 2e-16 ***
batch2 1.3225969 0.1179020 11.218 < 2e-16 ***
batch3 1.5723099 0.1161045 13.542 < 2e-16 ***
batch4 1.0597141 0.1023598 10.353 < 2e-16 ***
batch5 1.1337518 0.1035232 10.952 < 2e-16 ***
batch6 1.0401618 0.1060365 9.809 < 2e-16 ***
batch7 0.5436922 0.1091275 4.982 6.29e-07 ***
batch8 0.4959007 0.1089257 4.553 5.30e-06 ***
batch9 0.3857930 0.1185933 3.253 0.00114 **
temp 0.0109669 0.0004126 26.577 < 2e-16 ***
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 440.3 110.0 4.002 6.29e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 84.8 on 12 Df
Pseudo R-squared: 0.9617
Number of iterations: 51 (BFGS) + 3 (Fisher scoring)
R GLM Logistic Regression code from stock R
glmfit = glm(yield ~ batch + temp, data = GasolineYield, family = "binomial")
summary(glmfit)
R GLM Logistic Regression summary output
Call:
glm(formula = yield ~ batch + temp, family = "binomial", data = GasolineYield)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.100459 -0.025272 0.004217 0.032879 0.082113
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.130227 3.831798 -1.600 0.110
batch1 1.720311 2.127205 0.809 0.419
batch2 1.305746 2.481266 0.526 0.599
batch3 1.562343 2.440712 0.640 0.522
batch4 1.048928 2.152385 0.487 0.626
batch5 1.125075 2.176242 0.517 0.605
batch6 1.029601 2.229773 0.462 0.644
batch7 0.540401 2.294474 0.236 0.814
batch8 0.497355 2.288564 0.217 0.828
batch9 0.378315 2.494881 0.152 0.879
temp 0.010906 0.008676 1.257 0.209
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2.34184 on 31 degrees of freedom
Residual deviance: 0.07046 on 21 degrees of freedom
AIC: 36.631
Number of Fisher Scoring iterations: 5
The standard errors are different because the variance assumptions in the two models are different.
Logistic regression assumes the response has a binomial distribution, while beta regression assumes it has a beta distribution.
The variance functions of the two are different. For the binomial, if you specify the mean (and $n$ is a given) the variance is determined. For the beta there's another free parameter, so it isn't determined by the mean and would presumably be estimated from the data.
This suggests that if you fit a quasibinomial GLM (adding a variance parameter) you might get closer to the same standard errors, but they still won't be the same, since they would weight the observations differently.
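A minimal sketch of that suggestion, using simulated proportions instead of GasolineYield so that it runs in base R alone. The data-generating choices (a beta distribution with precision 50, the coefficients -1 and 2) are arbitrary assumptions for illustration.

```r
set.seed(1)
n <- 40
x <- runif(n)
mu <- plogis(-1 + 2 * x)

# Hypothetical continuous proportions scattered tightly around mu
y <- rbeta(n, mu * 50, (1 - mu) * 50)

# Binomial fit (warns about non-integer successes, which is the point)
fit_bin <- suppressWarnings(glm(y ~ x, family = binomial))
# Quasibinomial fit: same mean model, dispersion estimated from the data
fit_qb <- glm(y ~ x, family = quasibinomial)

se_bin <- coef(summary(fit_bin))[, "Std. Error"]
se_qb <- coef(summary(fit_qb))[, "Std. Error"]
disp <- summary(fit_qb)$dispersion

all.equal(coef(fit_bin), coef(fit_qb))    # identical point estimates
all.equal(se_qb, se_bin * sqrt(disp))     # SEs rescaled by sqrt(dispersion)
disp
```

With proportions this tightly clustered, the estimated dispersion is far below 1, so the quasibinomial standard errors shrink well below the binomial ones, in the same direction as the beta regression results.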
What you should actually do:
if your proportions are originally counts divided by some total count, then a binomial GLM would be an appropriate model to consider. (You would need the total counts, though.)
if your proportions are continuous fractions (the proportion of milk that's cream for example), then beta regression is an appropriate model to consider.
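For the count-based case, a binomial GLM can take the totals either as a two-column response or as weights on the proportions. A short sketch with made-up counts (the data frame below is hypothetical):

```r
# Hypothetical successes out of known totals
d <- data.frame(x = 1:6,
                total = c(20, 25, 30, 22, 28, 24),
                successes = c(3, 6, 11, 12, 19, 20))

# Form 1: two-column response of successes and failures
fit_counts <- glm(cbind(successes, total - successes) ~ x,
                  family = binomial, data = d)

# Form 2: proportion as response, totals supplied as prior weights
fit_prop <- glm(successes / total ~ x, weights = total,
                family = binomial, data = d)

all.equal(coef(fit_counts), coef(fit_prop))
```

Either way, the totals tell the model how much information each proportion carries, which is what the plain binomial fit on bare fractions is missing.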

Resources