I have a modelsummary of three fixed effects regressions like so:
remotes::install_github("lrberge/fixest")
remotes::install_github("vincentarelbundock/modelsummary")
library(fixest)
library(modelsummary)
mod1 <- feols(mpg ~ hp | cyl, data = mtcars)
mod2 <- feols(mpg ~ wt | cyl, data = mtcars)
mod3 <- feols(mpg ~ drat | cyl, data = mtcars)
modelsummary(list(mod1, mod2, mod3), output = "markdown")
|                | Model 1 | Model 2 | Model 3 |
|----------------|---------|---------|---------|
| hp             | -0.024  |         |         |
|                | (0.015) |         |         |
| wt             |         | -3.206  |         |
|                |         | (1.188) |         |
| drat           |         |         | 1.793   |
|                |         |         | (1.564) |
| Num.Obs.       | 32      | 32      | 32      |
| R2             | 0.754   | 0.837   | 0.745   |
| R2 Adj.        | 0.727   | 0.820   | 0.718   |
| R2 Within      | 0.080   | 0.392   | 0.048   |
| R2 Within Adj. | 0.047   | 0.371   | 0.014   |
| AIC            | 167.9   | 154.6   | 169.0   |
| BIC            | 173.8   | 160.5   | 174.9   |
| RMSE           | 2.94    | 2.39    | 2.99    |
| Std.Errors     | by: cyl | by: cyl | by: cyl |
| FE: cyl        | X       | X       | X       |
Instead of having the table merely show whether certain fixed effects were present, is it possible to show the number of fixed effects that were estimated?
The raw models do contain this information:
> mod1
OLS estimation, Dep. Var.: mpg
Observations: 32
Fixed-effects: cyl: 3
Standard-errors: Clustered (cyl)
Estimate Std. Error t value Pr(>|t|)
hp -0.024039 0.015344 -1.56664 0.25771
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 2.94304 Adj. R2: 0.727485
Within R2: 0.07998
Yes, you’ll need to define a glance_custom.fixest() method. See this section of the docs for detailed instructions and many examples:
https://vincentarelbundock.github.io/modelsummary/articles/modelsummary.html#customizing-existing-models-part-i
And here’s an example with fixest:
library(fixest)
library(tibble)
library(modelsummary)
models <- list(
  feols(mpg ~ hp | cyl, data = mtcars),
  feols(mpg ~ hp | am, data = mtcars),
  feols(mpg ~ hp | cyl + am, data = mtcars)
)

glance_custom.fixest <- function(x, ...) {
  tibble::tibble(`# FE` = paste(x$fixef_sizes, collapse = " + "))
}

modelsummary(models, gof_map = c("nobs", "# FE"))
|          | (1)     | (2)     | (3)     |
|----------|---------|---------|---------|
| hp       | -0.024  | -0.059  | -0.044  |
|          | (0.015) | (0.000) | (0.016) |
| Num.Obs. | 32      | 32      | 32      |
| # FE     | 3       | 2       | 3 + 2   |
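If you would rather report one row per fixed-effect dimension instead of a single concatenated string, a variant like the following should also work. This is a sketch, not taken from the modelsummary docs: it assumes your modelsummary version accepts a character vector of statistic names in `gof_map`, as recent releases do.

```r
library(fixest)
library(modelsummary)

models <- list(
  feols(mpg ~ hp | cyl, data = mtcars),
  feols(mpg ~ hp | am, data = mtcars),
  feols(mpg ~ hp | cyl + am, data = mtcars)
)

# One goodness-of-fit row per fixed-effect dimension, labelled "FE: cyl",
# "FE: am", etc.; a model that lacks a dimension should leave that cell blank.
glance_custom.fixest <- function(x, ...) {
  out <- as.list(x$fixef_sizes)           # named counts, e.g. cyl = 3, am = 2
  names(out) <- paste("FE:", names(out))
  tibble::as_tibble(out)
}

modelsummary(models, gof_map = c("nobs", "FE: cyl", "FE: am"))
```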
Currently I am learning ANCOVA, but I'm confused by the result.
I created a linear regression model using mtcars like this:
summary(lm(qsec ~ wt+factor(am), data = mtcars))
The output is:
Call:
lm(formula = qsec ~ wt + factor(am), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-2.6898 -1.3063 0.0167 1.1398 3.9917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5990 1.5596 14.490 8.17e-15 ***
wt -1.1716 0.4025 -2.911 0.00685 **
factor(am)1 -2.4141 0.7892 -3.059 0.00474 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.582 on 29 degrees of freedom
Multiple R-squared: 0.267, Adjusted R-squared: 0.2165
F-statistic: 5.283 on 2 and 29 DF, p-value: 0.01106
As you can see, the p-value of wt is 0.00685, which suggests a strong linear relationship between wt and qsec, and likewise for am.
But when I ran aov code:
summary(aov(qsec ~ wt+factor(am), data = mtcars))
With the output:
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 3.02 3.022 1.208 0.28081
factor(am) 1 23.41 23.413 9.358 0.00474 **
Residuals 29 72.55 2.502
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It seems there was no effect of wt on qsec.
Does this mean that a strong linear correlation between wt and qsec can be confirmed, but that wt has no great effect on qsec?
Is my explanation appropriate?
First, drop the factor(): am only has two values, so making it a factor will not have any effect on the results.
Now, regarding the tests: the p-values are different because they come from different model comparisons. For lm, the wt p-value is based on the comparison of these two models:
qsec ~ am
qsec ~ wt + am
so we have
anova(lm(qsec ~ am, mtcars), lm(qsec ~ wt + am, mtcars))
## Model 1: qsec ~ am
## Model 2: qsec ~ wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 93.758
## 2 29 72.554 1 21.204 8.4753 0.006854 ** <-- 0.00685
summary(lm(qsec ~ wt + am, mtcars))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5990 1.5596 14.490 8.17e-15 ***
## wt -1.1716 0.4025 -2.911 0.00685 ** <-- 0.00685
## am -2.4141 0.7892 -3.059 0.00474 **
whereas aov is really meant for balanced designs in which the terms are orthogonal. When they are not, as here, it conceptually orthogonalizes them sequentially, so the comparison is effectively between these two models:
qsec ~ r.am
qsec ~ r.wt + r.am
where r.wt is the portion of wt orthogonal to the intercept, and r.am is the portion of am orthogonal to wt and the intercept. So we have:
r.wt <- resid(lm(wt ~ 1, mtcars))
r.am <- resid(lm(am ~ wt, mtcars))
anova(lm(qsec ~ r.am, mtcars), lm(qsec ~ r.wt + r.am, mtcars))
## Model 1: qsec ~ r.am
## Model 2: qsec ~ r.wt + r.am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 75.576
## 2 29 72.554 1 3.0217 1.2078 0.2808 <------- 0.2808
summary(aov(qsec ~ wt + am, mtcars))
## Df Sum Sq Mean Sq F value Pr(>F)
## wt 1 3.02 3.022 1.208 0.28081 <------- 0.28081
## am 1 23.41 23.413 9.358 0.00474 **
## Residuals 29 72.55 2.502
It would also be possible to demonstrate this by performing Gram-Schmidt orthogonalization on the cbind(1, wt, am) model matrix to make the columns orthogonal; the pracma package provides a Gram-Schmidt routine.
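For completeness, here is a sketch of that demonstration (assuming the pracma package is installed). With orthonormal columns, the ordinary t-tests from lm() reproduce the sequential model comparisons that aov() performs:

```r
library(pracma)  # provides gramSchmidt()

X <- model.matrix(~ wt + am, mtcars)  # the cbind(1, wt, am) model matrix
Q <- gramSchmidt(X)$Q                 # orthonormal columns spanning the same space

# Because the columns are now mutually orthogonal, each t-test is equivalent
# to the corresponding sequential comparison shown above:
fit <- lm(mtcars$qsec ~ Q[, 2] + Q[, 3])
summary(fit)$coefficients[, "Pr(>|t|)"]
# Q[, 2] (the wt part) should match the aov wt row (0.28081)
# Q[, 3] (the am part) should match the aov am row (0.00474)
```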
These are the regression models that I want to obtain. I want to select many variables at the same time to develop a multivariate model, since my data frame has 357 variables.
summary(lm(formula = bci_bci ~ bti_acp, data = qog))
summary(lm(formula = bci_bci ~ wdi_pop, data = qog))
summary(lm(formula = bci_bci ~ ffp_sl, data = qog))
Instead of listing all your variables using + signs, you can also use the shorthand notation . to add all variables in data as explanatory variables (except the target variable on the left hand side of course).
data("mtcars")
mod <- lm(mpg ~ ., data = mtcars)
summary(mod)
#>
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.4506 -1.6044 -0.1196 1.2193 4.6271
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.30337 18.71788 0.657 0.5181
#> cyl -0.11144 1.04502 -0.107 0.9161
#> disp 0.01334 0.01786 0.747 0.4635
#> hp -0.02148 0.02177 -0.987 0.3350
#> drat 0.78711 1.63537 0.481 0.6353
#> wt -3.71530 1.89441 -1.961 0.0633 .
#> qsec 0.82104 0.73084 1.123 0.2739
#> vs 0.31776 2.10451 0.151 0.8814
#> am 2.52023 2.05665 1.225 0.2340
#> gear 0.65541 1.49326 0.439 0.6652
#> carb -0.19942 0.82875 -0.241 0.8122
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.65 on 21 degrees of freedom
#> Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
#> F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
par(mfrow=c(2,2))
plot(mod)
par(mfrow=c(1,1))
Created on 2021-12-21 by the reprex package (v2.0.1)
If you want to include all two-way interactions, the notation would be this:
lm(mpg ~ (.)^2, data = mtcars)
If you want to include all three-way interactions, the notation would be this:
lm(mpg ~ (.)^3, data = mtcars)
If you create very large models (with many variables or interactions), make sure that you also perform some model-size reduction afterwards, e.g. using the function step(). It is very likely that not all of your predictors are actually informative, and many could be correlated, which causes problems in multivariate models. One way out is to remove predictors that are highly correlated with other predictors from the model.
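As a minimal sketch of that reduction step (shown on mtcars; the selected formula will of course depend on your own data):

```r
# Stepwise AIC selection starting from the full model.
mod_full <- lm(mpg ~ ., data = mtcars)
mod_red  <- step(mod_full, trace = 0)  # trace = 0 suppresses the per-step log
formula(mod_red)                       # on mtcars this reduces to mpg ~ wt + qsec + am
summary(mod_red)
```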
I have a lot of lmer models with a summary() output that looks like this (usually with more variables, this was just a very quick nonsense example I generated):
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: mpg ~ cyl + hp + disp * drat + (1 | gear) + (1 | carb)
Data: mtcars
REML criterion at convergence: 171.8
Scaled residuals:
Min 1Q Median 3Q Max
-1.4583 -0.5671 -0.2118 0.3912 2.1303
Random effects:
Groups Name Variance Std.Dev.
carb (Intercept) 2.07353 1.4400
gear (Intercept) 0.04659 0.2158
Residual 7.63829 2.7637
Number of obs: 32, groups: carb, 6; gear, 3
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.27454 11.75344 25.17470 0.534 0.5981
cyl -0.30519 0.84443 25.55852 -0.361 0.7208
hp -0.01412 0.01798 10.10227 -0.785 0.4503
disp 0.03101 0.03649 23.25810 0.850 0.4041
drat 5.97395 2.65430 24.44199 2.251 0.0337 *
disp:drat -0.01380 0.01172 24.05299 -1.178 0.2503
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) cyl hp disp drat
cyl -0.512
hp -0.087 -0.337
disp -0.710 -0.021 0.341
drat -0.965 0.317 0.142 0.823
disp:drat 0.698 -0.131 -0.461 -0.950 -0.778
convergence code: 0
Model failed to converge with max|grad| = 0.0112263 (tol = 0.002, component 1)
Is there a way to generate a summary() output with all of this information, but nicely formatted for RMarkdown? I've found a variety of solutions for Markdown (e.g. kable, huxtable, pander), but most of them can only display, for example, the fixed-effects table. In those examples, the significance stars (* to ***) are usually also not shown.
I'm looking for a function that gives me an output just like this, but formatted for RMarkdown and including the significance codes.
I am analyzing some microbiome data by using unconstrained ordination (PCA or NMDS) followed by environmental vector fitting with the envfit function in the vegan package. The output of envfit includes an r2 value for each vector or factor included in the envfit model, but I am interested in the total amount of variation explained by all the vectors/factors, rather than just stand-alone variables. I presume I cannot simply add up the R2 values assigned to each environmental variable, because there may be overlap in the microbiome variation that is "explained" by each environmental variable. However, there does not seem to be any way of accessing the total r2 value for the model.
Using an example dataset, this is what I have tried so far:
library(vegan)
library(MASS)
data(varespec, varechem)
ord <- metaMDS(varespec)
fit <- envfit(ord, varechem, perm = 999)
fit
This shows r2 for each environmental variable, but how do I extract the r2 value for the entire model?
I have tried running fit$r, attributes(fit)$r, and Rsquare.Adj(fit), but these all return NULL.
R-squared = Explained variation / Total variation, or r^2 = 1 - SSE/SST. For two different responses, the residuals will be on different scales, so calculating a combined R^2 for two responses does not make sense.
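That identity is easy to verify by hand for a single-response lm:

```r
fit <- lm(mpg ~ wt, data = mtcars)
sse <- sum(resid(fit)^2)                       # error (residual) sum of squares
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
1 - sse / sst                                  # identical to summary(fit)$r.squared
```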
For example, in classical lm, they are calculated separately:
> summary(lm(cbind(mpg,wt) ~.,data=mtcars))
Response mpg :
Call:
lm(formula = mpg ~ cyl + disp + hp + drat + qsec + vs + am +
gear + carb, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.6453 -1.2655 -0.4199 1.6320 5.0843
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.57062 19.81294 0.786 0.4403
cyl 0.11982 1.10348 0.109 0.9145
disp -0.01361 0.01212 -1.122 0.2738
hp -0.01122 0.02246 -0.500 0.6223
drat 1.32726 1.71312 0.775 0.4467
qsec 0.09428 0.66944 0.141 0.8893
vs 0.66770 2.22845 0.300 0.7673
am 2.90074 2.17590 1.333 0.1961
gear 1.18650 1.56061 0.760 0.4552
carb -1.32912 0.63321 -2.099 0.0475 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.816 on 22 degrees of freedom
Multiple R-squared: 0.845, Adjusted R-squared: 0.7816
F-statistic: 13.33 on 9 and 22 DF, p-value: 5.228e-07
Response wt :
Call:
lm(formula = wt ~ cyl + disp + hp + drat + qsec + vs + am + gear +
carb, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.40769 -0.18831 0.00012 0.15204 0.50382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.879401 2.098183 -0.419 0.679189
cyl -0.062246 0.116858 -0.533 0.599603
disp 0.007252 0.001284 5.649 1.11e-05 ***
hp -0.002763 0.002378 -1.162 0.257792
drat -0.145385 0.181419 -0.801 0.431483
qsec 0.195613 0.070893 2.759 0.011445 *
vs -0.094189 0.235992 -0.399 0.693653
am -0.102418 0.230427 -0.444 0.661045
gear -0.142945 0.165268 -0.865 0.396411
carb 0.304068 0.067056 4.535 0.000163 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2983 on 22 degrees of freedom
Multiple R-squared: 0.9341, Adjusted R-squared: 0.9071
F-statistic: 34.63 on 9 and 22 DF, p-value: 5.944e-11
For your example, this means you have to look at the R^2 for each environmental variable separately.
I'm trying to test the difference between two marginal effects. I can get R to calculate the effects, but I can't find any resource explaining how to test their difference.
I've looked in the margins documentation and other marginal-effects packages, but I have not been able to find anything that tests the difference.
library(margins)
data("mtcars")
mod <- lm(mpg ~ as.factor(am) * disp, data = mtcars)
(marg <- margins(model = mod, at = list(am = c("0", "1"))))
at(am) disp am1
0 -0.02758 0.4518
1 -0.05904 0.4518
summary(marg)
factor am AME SE z p lower upper
am1 1.0000 0.4518 1.3915 0.3247 0.7454 -2.2755 3.1791
am1 2.0000 0.4518 1.3915 0.3247 0.7454 -2.2755 3.1791
disp 1.0000 -0.0276 0.0062 -4.4354 0.0000 -0.0398 -0.0154
disp 2.0000 -0.0590 0.0096 -6.1353 0.0000 -0.0779 -0.0402
I want to produce a test that decides whether or not the marginal effects in each row of marg are significantly different; i.e., that the slopes in the marginal effects plots are different. This appears to be true because the confidence intervals do not overlap -- indicating that the effect of displacement is different for am=0 vs am=1.
We discuss in the comments below that we can test contrasts using emmeans, but that is a test of the average response across am=0 and am=1.
emm <- emmeans(mod, ~ as.factor(am) * disp)
emm
am disp emmean SE df lower.CL upper.CL
0 231 18.8 0.763 28 17.2 20.4
1 231 19.2 1.164 28 16.9 21.6
cont <- contrast(emm, list(`(0-1)` = c(1, -1)))
cont
contrast estimate SE df t.ratio p.value
(0-1) -0.452 1.39 28 -0.325 0.7479
Here the p-value is large, indicating that the average response when am=0 is not significantly different from when am=1.
Is it reasonable to do this (like testing the difference of two means)?
smarg <- summary(marg)
(z <- as.numeric((smarg$AME[3] - smarg$AME[4]) / sqrt(smarg$SE[3]^2 + smarg$SE[4]^2)))
[1] 2.745
2*pnorm(-abs(z))
[1] 0.006044
This p-value appears to agree with the analysis of the non-overlapping confidence intervals.
If I understand your question, it can be answered using emtrends:
library(emmeans)
emt = emtrends(mod, "am", var = "disp")
emt # display the estimated slopes
## am disp.trend SE df lower.CL upper.CL
## 0 -0.0276 0.00622 28 -0.0403 -0.0148
## 1 -0.0590 0.00962 28 -0.0787 -0.0393
##
## Confidence level used: 0.95
pairs(emt) # test the difference of slopes
## contrast estimate SE df t.ratio p.value
## 0 - 1 0.0315 0.0115 28 2.745 0.0104
For the question "Are the slopes statistically different, indicating that the effect of displacement is different for am=0 vs am=1?", you can get the p-value associated with the comparison directly from the summary of the lm() fit.
> summary(mod)
Call:
lm(formula = mpg ~ as.factor(am) * disp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.6056 -2.1022 -0.8681 2.2894 5.2315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.157064 1.925053 13.068 1.94e-13 ***
as.factor(am)1 7.709073 2.502677 3.080 0.00460 **
disp -0.027584 0.006219 -4.435 0.00013 ***
as.factor(am)1:disp -0.031455 0.011457 -2.745 0.01044 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.907 on 28 degrees of freedom
Multiple R-squared: 0.7899, Adjusted R-squared: 0.7674
F-statistic: 35.09 on 3 and 28 DF, p-value: 1.27e-09
Notice that the p-value for the as.factor(am)1:disp term is 0.01044, which matches the output from pairs(emt) in Russ Lenth's answer.
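If you need that p-value programmatically rather than reading it off the printout, you can index the coefficient matrix directly (the row name follows the formula exactly as written):

```r
mod <- lm(mpg ~ as.factor(am) * disp, data = mtcars)
# Extract the interaction p-value (the 0.01044 shown in the printout above):
coef(summary(mod))["as.factor(am)1:disp", "Pr(>|t|)"]
```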
(Posting as an answer because I don't yet have enough reputation to comment.)
I'm not sure, but you're probably looking for contrasts or pairwise comparisons of marginal effects? You can do this using the emmeans package:
library(margins)
library(emmeans)
library(magrittr)
data("mtcars")
mod <- lm(mpg ~ as.factor(am) * disp, data = mtcars)
marg <- margins(model = mod, at = list(am = c("0", "1")))
marg
#> Average marginal effects at specified values
#> lm(formula = mpg ~ as.factor(am) * disp, data = mtcars)
#> at(am) disp am1
#> 0 -0.02758 0.4518
#> 1 -0.05904 0.4518
emmeans(mod, c("am", "disp")) %>%
contrast(method = "pairwise")
#> contrast estimate SE df t.ratio p.value
#> 0,230.721875 - 1,230.721875 -0.452 1.39 28 -0.325 0.7479
emmeans(mod, c("am", "disp")) %>%
contrast()
#> contrast estimate SE df t.ratio p.value
#> 0,230.721875 effect -0.226 0.696 28 -0.325 0.7479
#> 1,230.721875 effect 0.226 0.696 28 0.325 0.7479
#>
#> P value adjustment: fdr method for 2 tests
Or simply use summary():
library(margins)
data("mtcars")
mod <- lm(mpg ~ as.factor(am) * disp, data = mtcars)
marg <- margins(model = mod, at = list(am = c("0", "1")))
marg
#> Average marginal effects at specified values
#> lm(formula = mpg ~ as.factor(am) * disp, data = mtcars)
#> at(am) disp am1
#> 0 -0.02758 0.4518
#> 1 -0.05904 0.4518
summary(marg)
#> factor am AME SE z p lower upper
#> am1 1.0000 0.4518 1.3915 0.3247 0.7454 -2.2755 3.1791
#> am1 2.0000 0.4518 1.3915 0.3247 0.7454 -2.2755 3.1791
#> disp 1.0000 -0.0276 0.0062 -4.4354 0.0000 -0.0398 -0.0154
#> disp 2.0000 -0.0590 0.0096 -6.1353 0.0000 -0.0779 -0.0402
Created on 2019-06-07 by the reprex package (v0.3.0)