First of all, I apologize if my title is misleading; I am not sure how to phrase my question appropriately.
I am currently working on a fixed-effects model. My data look like the following table, although the values are not the actual data for privacy reasons.
state  district  year  grade    Y    X      id
AK     1001      2009      3  0.1  0.5  1001.3
AK     1001      2010      3  0.8  0.4  1001.3
AK     1001      2011      3  0.5  0.7  1001.3
AK     1001      2009      4  1.5  1.3  1001.4
AK     1001      2010      4  1.1  0.7  1001.4
AK     1001      2011      4  2.1  0.4  1001.4
...    ...       ...     ...  ...  ...  ...
WY     5606      2011      6  4.2  5.3  5606.6
I used the fixest package to run the fixed-effects model for this project. To identify a unique observation in this dataset, I have to combine district, grade, and year. Note that I avoided using plm because there is no way to specify three fixed effects in the model unless you combine two of the identities (in my case, I generated id by combining district and grade). fixest seems able to handle this. However, I got different results when specifying three fixed effects (district, grade, and year) compared to two fixed effects (id and year). The following code and results may clear up any confusion from my explanation.
# Two fixed effects (id and year)
df <- transform(df, id = apply(df[c("district", "grade")], 1, paste, collapse = "."))
fe = feols(y ~ x | id + year, df, se = "standard")
summary(fe)
OLS estimation, Dep. Var.: y
Observations: 499,112
Fixed-effects: id: 64,302, year: 10
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
X 0.012672 0.003602 3.51804 0.00043478 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.589222 Adj. R2: 0.761891
Within R2: 2.846e-5
###########################################################################
# Three fixed effects (district, grade, and year)
fe = feols(y ~ x | district + grade + year, df, se = "standard")
summary(fe)
OLS estimation, Dep. Var.: y
Observations: 499,112
Fixed-effects: district: 11,097, grade: 6, year: 10
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
X 0.014593 0.00401 3.63866 0.00027408 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.702543 Adj. R2: 0.698399
Within R2: 2.713e-5
Questions:
Why are the results different?
This is the equation I plan to use. I am not sure which model corresponds to this specification. My feeling is that it is the second one, but if that is the case, why do so many websites recommend combining two identities and running an ordinary plm model?
Thank you so much for reading my question. Any answers/suggestions/advice would be appreciated!
The answer is simply that you are estimating two different models.
Three fixed-effects (FEs): y = beta*x + district FE + grade FE + year FE + error
Year + id FEs (I renamed id into district_grade): y = beta*x + (district x grade) FE + year FE + error
The first set of fixed-effects is strictly nested within the set of FEs of the second estimation, so the first specification (three additive FEs) is the more restrictive of the two.
Here is a reproducible example in which we see that we obtain two different sets of coefficients.
library(fixest)
data(trade)
est = fepois(Euros ~ log(dist_km) | sw(Origin + Product, Origin^Product) + Year, trade)
etable(est, vcov = "iid")
#> model 1 model 2
#> Dependent Var.: Euros Euros
#>
#> log(dist_km) -1.020*** (1.18e-6) -1.024*** (1.19e-6)
#> Fixed-Effects: ------------------- -------------------
#> Origin Yes No
#> Product Yes No
#> Year Yes Yes
#> Origin-Product No Yes
#> _______________ ___________________ ___________________
#> S.E. type IID IID
#> Observations 38,325 38,325
#> Squared Cor. 0.27817 0.35902
#> Pseudo R2 0.53802 0.64562
#> BIC 2.75e+12 2.11e+12
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that they have different FEs, which confirms that the models estimated are completely different:
summary(fixef(est$`Origin + Product + Year`))
#> Fixed_effects coefficients
#> Origin Product Year
#> Number of fixed-effects 15 20 10
#> Number of references 0 1 1
#> Mean 23.5 -0.012 0.157
#> Standard-deviation 1.15 1.35 0.113
#> COEFFICIENTS:
#> Origin: AT BE DE DK ES
#> 22.91 23.84 24.62 23.62 24.83 ... 10 remaining
#> -----
#> Product: 1 2 3 4 5
#> 0 1.381 0.624 1.414 -1.527 ... 15 remaining
#> -----
#> Year: 2007 2008 2009 2010 2011
#> 0 0.06986 0.006301 0.07463 0.163 ... 5 remaining
summary(fixef(est$`Origin^Product + Year`))
#> Fixed_effects coefficients
#> Origin^Product Year
#> Number of fixed-effects 300 10
#> Number of references 0 1
#> Mean 23.1 0.157
#> Standard-deviation 1.96 0.113
#> COEFFICIENTS:
#> Origin^Product: 101 102 103 104 105
#> 22.32 24.42 24.82 21.28 23.04 ... 295 remaining
#> -----
#> Year: 2007 2008 2009 2010 2011
#> 0 0.06962 0.006204 0.07454 0.1633 ... 5 remaining
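Applied to your own data, both specifications can also be estimated without manually pasting strings, because fixest lets you combine factors directly in the fixed-effects slot with the ^ operator. A minimal sketch, assuming a data frame df with the columns y, x, district, grade, and year from your example:
library(fixest)
# Two FEs: the combined district-grade identity plus year
# (district^grade is equivalent to the manually created id)
fe_two <- feols(y ~ x | district^grade + year, df, se = "standard")
# Three additive FEs: district, grade, and year entered separately
fe_three <- feols(y ~ x | district + grade + year, df, se = "standard")
# compare the two (different) models side by side
etable(fe_two, fe_three)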
I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data; it seems to occur specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
  x = seq(-0.1, 0.1, length.out = n),
  g = floor(runif(n) * 3),
  f = letters[1 + g] %>% as.factor(),
  y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
  z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
  intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
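One way to check this suspicion is to look at the fitted variance components and the conditional modes directly; a minimal sketch using lme4's standard diagnostics on the model object from the question:
library(lme4)
# is the fit singular, i.e. is any variance component at the boundary?
isSingular(model, tol = 1e-4)
# estimated random-effect variances
VarCorr(model)
# conditional modes (BLUPs) for the grouping factor f; if these are all
# identical, the random-effect variance has effectively collapsed to zero
ranef(model)$f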
This seems worth opening an issue on the ggeffects issues list ?
Study background: I want to see if mean caterpillar abundance for a given year and within a given population can explain differences in bird (blue tit) density. The blue tits breed in nest boxes, so I calculated density as the total number of occupied nest boxes / the number of unoccupied nest boxes within a given year and population.
Below I show the structure of the data, my model and the error messages.
Model:
model1.1 <- glmer(cbind(data.density$number.nest.boxes.occupied.that.year,
                        data.density$number_of_nest.boxes) ~ population*year*caterpillar + (1|site),
                  data = data.density, family = binomial)
The first error is:
fixed-effect model matrix is rank deficient so dropping 27 columns / coefficients
I think this is due to not having enough combinations of caterpillars with population x year.
The second error is:
boundary (singular) fit: see ?isSingular
I'm just not sure how to go about fixing this.
I also don't understand what the other error means and how to fix it.
I appreciate any advice.
#loading data density data
data.density<-read.csv ("nest_box_caterpillar_density.csv")
View(data.density)
str(data.density)
#> 'data.frame': 63 obs. of 8 variables:
#> $ year : int 2011 2012 2013 2014 2015 2016 2017 2018 2019 2011 ...
#> $ number.nest.boxes.occupied.that.year: int 17 13 12 16 16 16 15 17 12 17 ...
#> $ number_of_nest.boxes : int 20 20 20 20 20 20 20 20 20 30 ...
#> $ proportion_occupied_boxes : num 0.85 0.65 0.6 0.8 0.8 0.8 0.75 0.85 0.6 0.57 ...
#> $ site : Factor w/ 7 levels "ari","ava","fel",..: 5 5 5 5 5 5 5 5 5 1 ...
#> $ population : Factor w/ 3 levels "D-Muro","E-Muro",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ mean_yearly_frass : num 295 231 437 263 426 ...
#> $ site_ID : Factor w/ 63 levels "2011_ari_","2011_ava_",..: 5 12 19 26 33 40 47 54 61 1 ...
data.density$year<-factor (data.density$year)# making year a factor (categorical variable)
str(data.density) # now we see year as a factor in the data.
#> 'data.frame': 63 obs. of 8 variables:
#> $ year : Factor w/ 9 levels "2011","2012",..: 1 2 3 4 5 6 7 8 9 1 ...
#> $ number.nest.boxes.occupied.that.year: int 17 13 12 16 16 16 15 17 12 17 ...
#> $ number_of_nest.boxes : int 20 20 20 20 20 20 20 20 20 30 ...
#> $ proportion_occupied_boxes : num 0.85 0.65 0.6 0.8 0.8 0.8 0.75 0.85 0.6 0.57 ...
#> $ site : Factor w/ 7 levels "ari","ava","fel",..: 5 5 5 5 5 5 5 5 5 1 ...
#> $ population : Factor w/ 3 levels "D-Muro","E-Muro",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ mean_yearly_frass : num 295 231 437 263 426 ...
#> $ site_ID : Factor w/ 63 levels "2011_ari_","2011_ava_",..: 5 12 19 26 33 40 47 54 61 1 ...
density<-data.density$proportion_occupied_boxes # making a new object called density
caterpillar<-data.density$mean_yearly_frass # making new object called caterpillar
model1.1<-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,data.density$number_of_nest.boxes)~population*year*caterpillar+(1|site),data = data.density, family=binomial)
#> fixed-effect model matrix is rank deficient so dropping 27 columns / coefficients
#> boundary (singular) fit: see ?isSingular
summary(model1.1)
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula:
#> cbind(data.density$number.nest.boxes.occupied.that.year, data.density$number_of_nest.boxes) ~
#> population * year * caterpillar + (1 | site)
#> Data: data.density
#>
#> AIC BIC logLik deviance df.resid
#> 343.7 403.7 -143.8 287.7 35
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.1125 -0.1379 0.0000 0.2264 0.6778
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> site (Intercept) 0 0
#> Number of obs: 63, groups: site, 7
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.4054532 0.2454754 -1.652 0.0986 .
#> populationE-Muro -0.1158123 0.2301030 -0.503 0.6147
#> populationE-Pirio -0.4945158 0.2932707 -1.686 0.0918 .
#> year2012 0.0905137 0.2109513 0.429 0.6679
#> year2013 -0.1223076 0.2160367 -0.566 0.5713
#> year2014 -0.0703760 0.2304236 -0.305 0.7600
#> year2015 -0.0507882 0.2127083 -0.239 0.8113
#> year2016 -0.0562139 0.2077616 -0.271 0.7867
#> year2017 -0.0994962 0.2070464 -0.481 0.6308
#> year2018 0.0977751 0.2192755 0.446 0.6557
#> year2019 -0.2312869 0.2133430 -1.084 0.2783
#> caterpillar 0.0004598 0.0005432 0.846 0.3973
#> populationE-Muro:year2012 -0.1217344 0.3294773 -0.369 0.7118
#> populationE-Pirio:year2012 -0.3121173 0.2912256 -1.072 0.2838
#> populationE-Muro:year2013 -0.0682892 0.3600992 -0.190 0.8496
#> populationE-Pirio:year2013 -0.3345701 0.3051039 -1.097 0.2728
#> populationE-Muro:year2014 0.1604636 0.3383121 0.474 0.6353
#> populationE-Pirio:year2014 -0.1074231 0.3171972 -0.339 0.7349
#> populationE-Muro:year2015 0.0838557 0.3491699 0.240 0.8102
#> populationE-Pirio:year2015 -0.0640988 0.2943189 -0.218 0.8276
#> populationE-Muro:year2016 0.0679017 0.3333771 0.204 0.8386
#> populationE-Pirio:year2016 -0.0899343 0.2919975 -0.308 0.7581
#> populationE-Muro:year2017 0.1643493 0.3300491 0.498 0.6185
#> populationE-Pirio:year2017 0.0338824 0.2730344 0.124 0.9012
#> populationE-Muro:year2018 0.0315607 0.3264224 0.097 0.9230
#> populationE-Pirio:year2018 -0.4196974 0.3180515 -1.320 0.1870
#> populationE-Muro:year2019 -0.0587593 0.3619408 -0.162 0.8710
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation matrix not shown by default, as p = 27 > 12.
#> Use print(x, correlation=TRUE) or
#> vcov(x) if you need it
#> fit warnings:
#> fixed-effect model matrix is rank deficient so dropping 27 columns / coefficients
#> optimizer (Nelder_Mead) convergence code: 0 (OK)
#> boundary (singular) fit: see ?isSingular
Created on 2022-03-21 by the reprex package (v2.0.1)
I tried removing caterpillar from the model, and the first error goes away. But the point of my model is to see how caterpillar affects density. I also still get the "boundary (singular) fit: see ?isSingular" message.
There are a few problems with your model specification.
binomial responses should be specified as cbind(n_success, n_failure), not cbind(n_success, n_total) (in your case, cbind(boxes_occupied, total_boxes - boxes_occupied)). (I actually find it clearer to use the alternative specification boxes_occupied/total_boxes with the additional argument weights = total_boxes ...)
there are few absolutely hard and fast rules, but it probably makes sense to estimate the effects of at least year as a random effect, and possibly population within site (this will depend on how much you are interested in the detailed differences between populations)
with a total of 63 observations, you need to be parsimonious with your model; a reasonable rule of thumb is that you shouldn't try to estimate more than at most 6 parameters
so I would recommend something like
glmer(boxes_occupied/total_boxes ~ caterpillar + population + (1|year) + (1|site),
      data = data.density,
      weights = total_boxes,
      family = binomial)
As for singularity, this is a fundamental issue in mixed models, and has been discussed in lots of places. There's a lot of information provided at ?lme4::isSingular (i.e., look up the help page for the isSingular() function), and at the GLMM FAQ, or see some of the existing answers on Stack Overflow about singular fits ...
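For reference, a sketch of that recommendation written with the actual column names from the question's data frame (I am assuming number_of_nest.boxes is the total number of boxes at a site in a given year, and mean_yearly_frass is the caterpillar measure):
library(lme4)
model2 <- glmer(number.nest.boxes.occupied.that.year / number_of_nest.boxes ~
                  mean_yearly_frass + population + (1 | year) + (1 | site),
                data = data.density,
                weights = number_of_nest.boxes,
                family = binomial)
summary(model2)
isSingular(model2)  # check whether any variance component collapsed to zero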
I'm doing an assignment for university and have copied and pasted the R code, so I know it's right, but I'm still not getting any P or F values from my data:
Food Temperature Area
50 11 820.2175
100 11 936.5437
50 14 1506.568
100 14 1288.053
50 17 1692.882
100 17 1792.54
This is the code I've used so far:
aovdata<-read.table("Condition by area.csv",sep=",",header=T)
attach(aovdata)
Food <- as.factor(Food) ; Temperature <- as.factor(Temperature)
summary(aov(Area ~ Temperature*Food))
but then this is the output:
Df Sum Sq Mean Sq
Temperature 2 757105 378552
Food 1 1 1
Temperature:Food 2 35605 17803
Any help, especially the code I need to fix it, would be great. I think there could be a problem with the data, but I don't know what it is.
I would do this. Be aware of the difference between factor and continuous predictors.
library(tidyverse)
df <- sapply(strsplit(c("Food Temperature Area", "50 11 820.2175", "100 11 936.5437",
                        "50 14 1506.568", "100 14 1288.053", "50 17 1692.882",
                        "100 17 1792.54"), " +"), paste0, collapse = ",") %>%
  read_csv()
model <- lm(Area ~ Temperature * as.factor(Food),df)
summary(model)
#>
#> Call:
#> lm(formula = Area ~ Temperature * as.factor(Food), data = df)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -83.34 25.50 166.68 -50.99 -83.34 25.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -696.328 505.683 -1.377 0.302
#> Temperature 145.444 35.580 4.088 0.055 .
#> as.factor(Food)100 38.049 715.144 0.053 0.962
#> Temperature:as.factor(Food)100 -2.778 50.317 -0.055 0.961
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 151 on 2 degrees of freedom
#> Multiple R-squared: 0.9425, Adjusted R-squared: 0.8563
#> F-statistic: 10.93 on 3 and 2 DF, p-value: 0.08498
ggeffects::ggpredict(model,terms = c('Temperature','Food')) %>% plot()
Created on 2020-12-08 by the reprex package (v0.3.0)
The actual problem with your example is not that you're using factors as predictor variables, but rather that you have fitted a 'saturated' linear model (as many parameters as observations), so there is no variation left to compute a residual SSQ, so the ANOVA doesn't include F/P values etc.
It's fine for temperature and food to be categorical (factor) predictors, that's how they would be treated in a classic two-way ANOVA design. It's just that in order to analyze this design with the interaction you need more replication.
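To illustrate, dropping the interaction leaves two residual degrees of freedom, so the ANOVA table can report F and p values again. A minimal sketch with the six observations from the question entered directly:
aovdata <- data.frame(Food        = factor(c(50, 100, 50, 100, 50, 100)),
                      Temperature = factor(c(11, 11, 14, 14, 17, 17)),
                      Area        = c(820.2175, 936.5437, 1506.568,
                                      1288.053, 1692.882, 1792.54))
# additive two-way ANOVA (no interaction): 2 residual df remain,
# so F statistics and p values can be computed
summary(aov(Area ~ Temperature + Food, data = aovdata))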
I normally work within a generalized least squares framework, estimating what Wooldridge's Introductory Econometrics (2013) calls Random Effects and Fixed Effects models on longitudinal data indexed by an individual and a time dimension.
I've been using the feasible GLS estimation in plm(), from the plm package, to estimate the Random Effects model (what some of the statistics literature calls a mixed model). The plm() function takes an index argument where I indicate the individual and time indexes. However, I'm now faced with data where each individual has several measurements at each time point, i.e. the data have a group-wise structure.
I've found out that it's possible to set up such a model using lmer() from the lme4 package; however, I am a bit confused by the differences in jargon, and also by the likelihood framework, and I wanted to know whether I have specified the model correctly. I fear I could be overlooking something more substantial, as I am not familiar with the framework and the terminology.
I can replicate my usual plm() model using lmer(), but I am unsure how to add the grouping. I've tried to illustrate my question in the following.
I found some data that looks somewhat like my data to illustrate my situation. First some packages that are needed,
install.packages(c("mlmRev", "plm", "lme4", "stargazer"), dependencies = TRUE)
and then the data
data(egsingle, package = "mlmRev")
egsingle is an unbalanced panel consisting of 1721 school children, grouped in 60 schools, across five time points. These data are originally distributed with the HLM software package (Bryk, Raudenbush and Congdon, 1996), but can be found in the mlmRev package; for details see ?mlmRev::egsingle.
Some light data management
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
Here's a snippet of the data:
dta[118:127,c('schoolid','childid','math','year','size','Female')]
#> schoolid childid math year size Female
#> 118 2040 289970511 -1.830 -1.5 502 1
#> 119 2040 289970511 -1.185 -0.5 502 1
#> 120 2040 289970511 0.852 0.5 502 1
#> 121 2040 289970511 0.573 1.5 502 1
#> 122 2040 289970511 1.736 2.5 502 1
#> 123 2040 292772811 -3.144 -1.5 502 0
#> 124 2040 292772811 -2.097 -0.5 502 0
#> 125 2040 292772811 -0.316 0.5 502 0
#> 126 2040 293550291 -2.097 -1.5 502 0
#> 127 2040 293550291 -1.314 -0.5 502 0
Here's how I would set up a random effects model without schoolid using plm():
library(plm)
reg.re.plm <- plm(math~Female+size+year, dta, index = c("childid", "year"), model="random")
# summary(reg.re.plm)
I can reproduce these results with lme4 like this:
require(lme4)
dta$year <- as.factor(dta$year)
reg.re.lmer <- lmer(math~Female+size+year+(1|childid), dta)
# summary(reg.re.lmer)
Now, from reading chapter 2 of Bates (2010), "lme4: Mixed-effects modeling with R", I believe this is how I would specify the model including the cluster level, schoolid:
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
# summary(reg.re.lmer.in.school)
However, when I look at the results I am not too convinced I’ve actually specified it correctly (see below).
In my actual data the repeated measures are within individuals, but I take it that I can use these data as an example. I would appreciate any advice on how to proceed, perhaps a reference to a worked example with notation/terminology not too far from what is used in Wooldridge (2013). And how do I work backwards to write up the specification for the reg.re.lmer.in.school model?
# library(stargazer)
stargazer::stargazer(reg.re.plm, reg.re.lmer, reg.re.lmer.in.school, type="text")
#> =====================================================================
#> Dependent variable:
#> -------------------------------------------------
#> math
#> panel linear
#> linear mixed-effects
#> (1) (2) (3)
#> ---------------------------------------------------------------------
#> Female -0.025 -0.025 0.008
#> (0.046) (0.047) (0.042)
#>
#> size -0.0004*** -0.0004*** -0.0003
#> (0.0001) (0.0001) (0.0002)
#>
#> year-1.5 0.878*** 0.876*** 0.866***
#> (0.059) (0.059) (0.059)
#>
#> year-0.5 1.882*** 1.880*** 1.870***
#> (0.059) (0.058) (0.058)
#>
#> year0.5 2.575*** 2.574*** 2.562***
#> (0.059) (0.059) (0.059)
#>
#> year1.5 3.149*** 3.147*** 3.133***
#> (0.060) (0.059) (0.059)
#>
#> year2.5 3.956*** 3.954*** 3.939***
#> (0.060) (0.060) (0.060)
#>
#> Constant -2.671*** -2.669*** -2.693***
#> (0.085) (0.086) (0.152)
#>
#> ---------------------------------------------------------------------
#> Observations 7,230 7,230 7,230
#> R2 0.735
#> Adjusted R2 0.735
#> Log Likelihood -8,417.815 -8,284.357
#> Akaike Inf. Crit. 16,855.630 16,590.720
#> Bayesian Inf. Crit. 16,924.490 16,666.460
#> F Statistic 2,865.391*** (df = 7; 7222)
#> =====================================================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
After having studied Robert Long's great answer on stats.stackexchange, I have found that the correct specification of the model is a nested design, i.e. (1|schoolid/childid). However, due to the way the data are coded (unique childids within each schoolid), the crossed specification, i.e. (1|childid) + (1|schoolid), which I used above, yields identical results.
Here is an illustration using the same data as above,
data(egsingle, package = "mlmRev")
dta <- egsingle
dta$Female <- with(dta, ifelse(female == 'Female', 1, 0))
require(lme4)
dta$year <- as.factor(dta$year)
Rerunning the crossed-design model, reg.re.lmer.in.school, for comparison:
reg.re.lmer.in.school <- lmer(math~Female+size+year+(1|childid)+(1|schoolid), dta)
and here is the nested structure:
reg.re.lmer.nested <- lmer(math~Female+size+year+(1|schoolid/childid), dta)
and finally the comparison of the two models using the amazing texreg package,
# install.packages(c("texreg"), dependencies = TRUE)
# require(texreg)
texreg::screenreg(list(reg.re.lmer.in.school, reg.re.lmer.nested), digits = 3)
#> ===============================================================
#> Model 1 Model 2
#> ---------------------------------------------------------------
#> (Intercept) -2.693 *** -2.693 ***
#> (0.152) (0.152)
#> Female 0.008 0.008
#> (0.042) (0.042)
#> size -0.000 -0.000
#> (0.000) (0.000)
#> year-1.5 0.866 *** 0.866 ***
#> (0.059) (0.059)
#> year-0.5 1.870 *** 1.870 ***
#> (0.058) (0.058)
#> year0.5 2.562 *** 2.562 ***
#> (0.059) (0.059)
#> year1.5 3.133 *** 3.133 ***
#> (0.059) (0.059)
#> year2.5 3.939 *** 3.939 ***
#> (0.060) (0.060)
#> ---------------------------------------------------------------
#> AIC 16590.715 16590.715
#> BIC 16666.461 16666.461
#> Log Likelihood -8284.357 -8284.357
#> Num. obs. 7230 7230
#> Num. groups: childid 1721
#> Num. groups: schoolid 60 60
#> Var: childid (Intercept) 0.672
#> Var: schoolid (Intercept) 0.180 0.180
#> Var: Residual 0.334 0.334
#> Num. groups: childid:schoolid 1721
#> Var: childid:schoolid (Intercept) 0.672
#> ===============================================================
#> *** p < 0.001, ** p < 0.01, * p < 0.05
Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this on CrossValidated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
                   height = c(56,60,64,56,60,64,74,75,82),
                   weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use the anova function in R. For example, using the data in the question, shown reproducibly in the note at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
giving the following, which is significant at the 2.7% level, so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff, but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested within age and the overall intercept is omitted; thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
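As a quick check of the equivalence just mentioned, the nested formula and the age + age:height - 1 parameterisation give the same fit; a minimal sketch reusing fm3 and the DF defined in Note 4:
fm3b <- lm(weight ~ age + age:height - 1, DF)
# both fit a separate intercept and slope per age group, so the fitted values agree
all.equal(fitted(fm3), fitted(fm3b))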
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance so you will want to lower the cutoff for p values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)