How to explain the result of ANCOVA and linear regression - r

Currently I am learning ANCOVA, but I'm confused with the result.
I created a linear regression model using mtcars like this:
summary(lm(qsec ~ wt+factor(am), data = mtcars))
The output is:
Call:
lm(formula = qsec ~ wt + factor(am), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-2.6898 -1.3063 0.0167 1.1398 3.9917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5990 1.5596 14.490 8.17e-15 ***
wt -1.1716 0.4025 -2.911 0.00685 **
factor(am)1 -2.4141 0.7892 -3.059 0.00474 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.582 on 29 degrees of freedom
Multiple R-squared: 0.267, Adjusted R-squared: 0.2165
F-statistic: 5.283 on 2 and 29 DF, p-value: 0.01106
As you can see, the p-value for wt is 0.00685, which I took to mean a strong linear correlation between wt and qsec (and likewise between am and qsec).
But when I ran aov code:
summary(aov(qsec ~ wt+factor(am), data = mtcars))
With the output:
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 3.02 3.022 1.208 0.28081
factor(am) 1 23.41 23.413 9.358 0.00474 **
Residuals 29 72.55 2.502
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It seems like there was no effect from wt on qsec.
Does it mean that a strong linear correlation between wt and qsec could be confirmed but there is no great effect from wt on qsec?
Is my explanation appropriate?

First, drop the factor(): am only has two values, so making it a factor has no effect on the results.
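As a quick check of that claim, the sketch below fits the model both ways and compares the estimates. It assumes the stock mtcars data, where am is stored as numeric 0/1; the object names are only illustrative.

```r
# am is stored as numeric 0/1 in mtcars, so the dummy coding is the same either way
fit_factor  <- lm(qsec ~ wt + factor(am), data = mtcars)
fit_numeric <- lm(qsec ~ wt + am, data = mtcars)

# Same estimates; only the coefficient label differs ("factor(am)1" vs "am")
same_coef <- all.equal(unname(coef(fit_factor)), unname(coef(fit_numeric)))
same_coef
```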
As for the p-values, the two functions run different tests. For lm, the wt p-value is based on comparing these two models:
qsec ~ am
qsec ~ wt + am
so we have
anova(lm(qsec ~ am, mtcars), lm(qsec ~ wt + am, mtcars))
## Model 1: qsec ~ am
## Model 2: qsec ~ wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 93.758
## 2 29 72.554 1 21.204 8.4753 0.006854 ** <-- 0.00685
summary(lm(qsec ~ wt + am, mtcars))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5990 1.5596 14.490 8.17e-15 ***
## wt -1.1716 0.4025 -2.911 0.00685 ** <-- 0.00685
## am -2.4141 0.7892 -3.059 0.00474 **
aov, by contrast, is really meant for balanced designs, where the terms are orthogonal. When they are not, as here, it conceptually orthogonalizes them sequentially, so the comparison is effectively between these two models:
qsec ~ r.am
qsec ~ r.wt + r.am
where r.wt is the portion of wt orthogonal to the intercept, and r.am is the portion of am orthogonal to both wt and the intercept, so we have
r.wt <- resid(lm(wt ~ 1, mtcars))
r.am <- resid(lm(am ~ wt, mtcars))
anova(lm(qsec ~ r.am, mtcars), lm(qsec ~ r.wt + r.am, mtcars))
## Model 1: qsec ~ r.am
## Model 2: qsec ~ r.wt + r.am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 75.576
## 2 29 72.554 1 3.0217 1.2078 0.2808 <------- 0.2808
summary(aov(qsec ~ wt + am, mtcars))
## Df Sum Sq Mean Sq F value Pr(>F)
## wt 1 3.02 3.022 1.208 0.28081 <------- 0.28081
## am 1 23.41 23.413 9.358 0.00474 **
## Residuals 29 72.55 2.502
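One way to see that the aov table is sequential is to reverse the order of the terms. The sketch below (object names are illustrative) shows that when wt is entered last, its F test gives the same p-value as the t test from lm, and that drop1() reports the "each term given all the others" tests regardless of order.

```r
fit_wt_first <- lm(qsec ~ wt + am, data = mtcars)
fit_wt_last  <- lm(qsec ~ am + wt, data = mtcars)

# Sequential (Type I) tables: each term is adjusted only for the terms before it
anova(fit_wt_first)
anova(fit_wt_last)

# With wt entered last, its sequential F test is the same test as the lm t test
p_seq_last <- anova(fit_wt_last)["wt", "Pr(>F)"]
p_t_test   <- summary(fit_wt_first)$coefficients["wt", "Pr(>|t|)"]
all.equal(p_seq_last, p_t_test)

# drop1() tests each term given all the others, whatever the order
drop1(fit_wt_first, test = "F")
```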
It would also be possible to demonstrate this by performing Gram-Schmidt on the cbind(1, wt, am) model matrix to make the columns orthogonal. The pracma package has a Gram-Schmidt routine.
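A minimal sketch of that idea, using base R's QR decomposition in place of an explicit Gram-Schmidt routine (the two give the same orthonormal columns up to sign). The squared projection of qsec onto each orthonormal column reproduces the sequential sums of squares from the aov table. Variable names here are illustrative.

```r
y <- mtcars$qsec
X <- cbind(intercept = 1, wt = mtcars$wt, am = mtcars$am)

# Q has orthonormal columns; column k is column k of X made orthogonal
# to all earlier columns and then scaled to unit length
Q <- qr.Q(qr(X))

# Squared projection of y onto each orthonormal column = sequential sum of squares
seq_ss <- drop(crossprod(Q, y))^2
names(seq_ss) <- colnames(X)
seq_ss[c("wt", "am")]

# Compare with the Sum Sq column of the sequential anova table
anova(lm(qsec ~ wt + am, data = mtcars))[c("wt", "am"), "Sum Sq"]
```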

Related

Fixed effect counts in modelsummary

I have a modelsummary of three fixed effects regressions like so:
remotes::install_github("lrberge/fixest")
remotes::install_github("vincentarelbundock/modelsummary")
library(fixest)
library(modelsummary)
mod1 <- feols(mpg ~ hp | cyl, data = mtcars)
mod2 <- feols(mpg ~ wt | cyl, data = mtcars)
mod3 <- feols(mpg ~ drat | cyl, data = mtcars)
modelsummary(list(mod1, mod2, mod3), output = "markdown")
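|                | Model 1 | Model 2 | Model 3 |
|:---------------|:-------:|:-------:|:-------:|
| hp             | -0.024  |         |         |
|                | (0.015) |         |         |
| wt             |         | -3.206  |         |
|                |         | (1.188) |         |
| drat           |         |         | 1.793   |
|                |         |         | (1.564) |
| Num.Obs.       | 32      | 32      | 32      |
| R2             | 0.754   | 0.837   | 0.745   |
| R2 Adj.        | 0.727   | 0.820   | 0.718   |
| R2 Within      | 0.080   | 0.392   | 0.048   |
| R2 Within Adj. | 0.047   | 0.371   | 0.014   |
| AIC            | 167.9   | 154.6   | 169.0   |
| BIC            | 173.8   | 160.5   | 174.9   |
| RMSE           | 2.94    | 2.39    | 2.99    |
| Std.Errors     | by: cyl | by: cyl | by: cyl |
| FE: cyl        | X       | X       | X       |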
Instead of having the table show merely whether certain fixed effects were present, is it possible to show the number of fixed effects that were estimated instead?
The raw models do contain this information:
> mod1
OLS estimation, Dep. Var.: mpg
Observations: 32
Fixed-effects: cyl: 3
Standard-errors: Clustered (cyl)
Estimate Std. Error t value Pr(>|t|)
hp -0.024039 0.015344 -1.56664 0.25771
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 2.94304 Adj. R2: 0.727485
Within R2: 0.07998
Yes, you’ll need to define a glance_custom.fixest() method. See this section of the docs for detailed instructions and many examples:
https://vincentarelbundock.github.io/modelsummary/articles/modelsummary.html#customizing-existing-models-part-i
And here’s an example with fixest:
library(fixest)
library(tibble)
library(modelsummary)
models <- list(
feols(mpg ~ hp | cyl, data = mtcars),
feols(mpg ~ hp | am, data = mtcars),
feols(mpg ~ hp | cyl + am, data = mtcars)
)
glance_custom.fixest <- function(x, ...) {
tibble::tibble(`# FE` = paste(x$fixef_sizes, collapse = " + "))
}
modelsummary(models, gof_map = c("nobs", "# FE"))
|          | (1)     | (2)     | (3)     |
|:---------|:-------:|:-------:|:-------:|
| hp       | -0.024  | -0.059  | -0.044  |
|          | (0.015) | (0.000) | (0.016) |
| Num.Obs. | 32      | 32      | 32      |
| # FE     | 3       | 2       | 3 + 2   |

how does R handle NA values vs deleted values with regressions

Say I have a table, remove all the inapplicable values, and run a regression. If I ran the exact same regression on the same table, but this time turned the inapplicable values into NA instead of removing them, would the regression still give me the same coefficients?
Yes. By default, the regression omits rows with NA values before fitting (i.e. it drops any row that has an NA in the outcome variable or in any of the predictor variables). You can check this by comparing the degrees of freedom and other statistics for both models.
Here's a toy example:
head(mtcars)
# check the data set size (all non-missings)
dim(mtcars) # has 32 rows
# Introduce some missings
set.seed(5)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 5)] <- NA
head(mtcars)
# Create an alternative where all missings are omitted
mtcars_NA_omit <- na.omit(mtcars)
# Check the data set size again
dim(mtcars_NA_omit) # Now only has 27 rows
# Now compare some simple linear regressions
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars))
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit))
Comparing the two summaries, you can see that they are identical, with one exception: the first model's summary includes a note that 5 observations were deleted due to missingness, which is exactly what we did manually in the mtcars_NA_omit example.
# First, original model
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
# Second model where we dropped missings manually
Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)
Residuals:
Min 1Q Median 3Q Max
-5.0835 -1.7594 -0.2023 1.4313 5.6948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.64284 7.02359 4.220 0.000352 ***
cyl -1.04494 0.83565 -1.250 0.224275
hp -0.03913 0.01918 -2.040 0.053525 .
am 4.02895 1.90342 2.117 0.045832 *
gear 0.31413 1.48881 0.211 0.834833
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.947 on 22 degrees of freedom
Multiple R-squared: 0.7998, Adjusted R-squared: 0.7635
F-statistic: 21.98 on 4 and 22 DF, p-value: 2.023e-07
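The same comparison can be done programmatically rather than by reading the two summaries. The sketch below is a separate, smaller example (the row indices and variable names are arbitrary choices for illustration). It also shows one subtlety: lm only drops rows with an NA in a variable the model uses, whereas na.omit() on the whole data frame drops rows with an NA in any column.

```r
d <- datasets::mtcars            # fresh copy, independent of any edits made above
d[c(3, 7, 11), "hp"] <- NA       # NA in a variable the model uses
d[20, "qsec"] <- NA              # NA in a variable the model does not use

# na.omit is the default na.action unless options(na.action = ...) was changed;
# it is spelled out here so the example does not depend on session options
fit_na <- lm(mpg ~ cyl + hp, data = d, na.action = na.omit)
fit_rm <- lm(mpg ~ cyl + hp, data = d[!is.na(d$hp), ])   # rows removed by hand

all.equal(coef(fit_na), coef(fit_rm))   # same coefficients
nobs(fit_na)                            # rows actually used in the fit
length(na.action(fit_na))               # rows lm dropped

# na.omit() on the whole data frame is stricter: it also drops row 20
nrow(na.omit(d))
```

If silent dropping is not wanted, passing na.action = na.fail makes lm stop with an error when the model variables contain NA.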

Generate beautiful summary table output for lmer models in R Markdown?

I have a lot of lmer models with a summary() output that looks like this (usually with more variables, this was just a very quick nonsense example I generated):
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: mpg ~ cyl + hp + disp * drat + (1 | gear) + (1 | carb)
Data: mtcars
REML criterion at convergence: 171.8
Scaled residuals:
Min 1Q Median 3Q Max
-1.4583 -0.5671 -0.2118 0.3912 2.1303
Random effects:
Groups Name Variance Std.Dev.
carb (Intercept) 2.07353 1.4400
gear (Intercept) 0.04659 0.2158
Residual 7.63829 2.7637
Number of obs: 32, groups: carb, 6; gear, 3
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.27454 11.75344 25.17470 0.534 0.5981
cyl -0.30519 0.84443 25.55852 -0.361 0.7208
hp -0.01412 0.01798 10.10227 -0.785 0.4503
disp 0.03101 0.03649 23.25810 0.850 0.4041
drat 5.97395 2.65430 24.44199 2.251 0.0337 *
disp:drat -0.01380 0.01172 24.05299 -1.178 0.2503
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) cyl hp disp drat
cyl -0.512
hp -0.087 -0.337
disp -0.710 -0.021 0.341
drat -0.965 0.317 0.142 0.823
disp:drat 0.698 -0.131 -0.461 -0.950 -0.778
convergence code: 0
Model failed to converge with max|grad| = 0.0112263 (tol = 0.002, component 1)
Is there a way to generate a summary() output with all of this information, but nicely formatted for RMarkdown? I've found a variety of solutions for Markdown (e.g. kable, huxtable, pander), but most of them are only able to display for example the Fixed Effects table. In those examples, the Signif. codes stars (*-***) are usually also not shown.
I'm looking for a function that gives me an output just like this, but formatted for RMarkdown and including the significance codes.

How do I access the R squared value for the entire model generated by vegan function envfit?

I am analyzing some microbiome data by using unconstrained ordination (PCA or NMDS) followed by environmental vector fitting with the envfit function in the vegan package. The output of envfit includes an r2 value for each vector or factor included in the envfit model, but I am interested in the total amount of variation explained by all the vectors/factors, rather than just stand-alone variables. I presume I cannot simply add up the R2 values assigned to each environmental variable, because there may be overlap in the microbiome variation that is "explained" by each environmental variable. However, there does not seem to be any way of accessing the total r2 value for the model.
Using an example dataset, this is what I have tried so far:
library(vegan)
library(MASS)
data(varespec, varechem)
ord <- metaMDS(varespec)
fit <- envfit(ord, varechem, perm = 999)
fit
This shows r2 for each environmental variable, but how do I extract the r2 value for the entire model?
I have tried running fit$r, attributes(fit)$r, and Rsquare.Adj(fit), but these all return NULL.
R-squared = Explained variation / Total variation, or r^2 = 1 - SSE/SST. For two different responses, the residuals will be on different scales, so calculating a combined R^2 for two responses does not make sense.
For example, in classical lm, they are calculated separately:
> summary(lm(cbind(mpg,wt) ~.,data=mtcars))
Response mpg :
Call:
lm(formula = mpg ~ cyl + disp + hp + drat + qsec + vs + am +
gear + carb, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.6453 -1.2655 -0.4199 1.6320 5.0843
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.57062 19.81294 0.786 0.4403
cyl 0.11982 1.10348 0.109 0.9145
disp -0.01361 0.01212 -1.122 0.2738
hp -0.01122 0.02246 -0.500 0.6223
drat 1.32726 1.71312 0.775 0.4467
qsec 0.09428 0.66944 0.141 0.8893
vs 0.66770 2.22845 0.300 0.7673
am 2.90074 2.17590 1.333 0.1961
gear 1.18650 1.56061 0.760 0.4552
carb -1.32912 0.63321 -2.099 0.0475 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.816 on 22 degrees of freedom
Multiple R-squared: 0.845, Adjusted R-squared: 0.7816
F-statistic: 13.33 on 9 and 22 DF, p-value: 5.228e-07
Response wt :
Call:
lm(formula = wt ~ cyl + disp + hp + drat + qsec + vs + am + gear +
carb, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.40769 -0.18831 0.00012 0.15204 0.50382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.879401 2.098183 -0.419 0.679189
cyl -0.062246 0.116858 -0.533 0.599603
disp 0.007252 0.001284 5.649 1.11e-05 ***
hp -0.002763 0.002378 -1.162 0.257792
drat -0.145385 0.181419 -0.801 0.431483
qsec 0.195613 0.070893 2.759 0.011445 *
vs -0.094189 0.235992 -0.399 0.693653
am -0.102418 0.230427 -0.444 0.661045
gear -0.142945 0.165268 -0.865 0.396411
carb 0.304068 0.067056 4.535 0.000163 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2983 on 22 degrees of freedom
Multiple R-squared: 0.9341, Adjusted R-squared: 0.9071
F-statistic: 34.63 on 9 and 22 DF, p-value: 5.944e-11
In your example, likewise, R^2 has to be calculated separately for each environmental variable.
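To make the formula concrete, here is a sketch that applies r^2 = 1 - SSE/SST by hand to the two-response lm above, one response at a time, and compares the result with what summary() reports. It illustrates the per-response calculation only and does not use envfit; the variable names are illustrative.

```r
fit <- lm(cbind(mpg, wt) ~ ., data = mtcars)

y   <- as.matrix(mtcars[, c("mpg", "wt")])
sse <- colSums(resid(fit)^2)                  # residual sum of squares per response
sst <- colSums(sweep(y, 2, colMeans(y))^2)    # total sum of squares per response

r2_manual <- 1 - sse / sst
r2_manual

# Values reported by summary(), one per response
r2_summary <- sapply(summary(fit), function(s) s$r.squared)
r2_summary
```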

How is 95% CI calculated using confint in R?

I use the example provided in confint help page
> fit <- lm(100/mpg ~ disp + hp + wt + am, data=mtcars)
> summary(fit)
Call:
lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-1.6923 -0.3901 0.0579 0.3649 1.2608
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.740648 0.738594 1.003 0.32487
disp 0.002703 0.002715 0.996 0.32832
hp 0.005275 0.003253 1.621 0.11657
wt 1.001303 0.302761 3.307 0.00267 **
am 0.155815 0.375515 0.415 0.68147
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6754 on 27 degrees of freedom
Multiple R-squared: 0.8527, Adjusted R-squared: 0.8309
F-statistic: 39.08 on 4 and 27 DF, p-value: 7.369e-11
> confint(fit)
2.5 % 97.5 %
(Intercept) -0.774822875 2.256118188
disp -0.002867999 0.008273849
hp -0.001400580 0.011949674
wt 0.380088737 1.622517536
am -0.614677730 0.926307310
> confint(fit, "wt")
2.5 % 97.5 %
wt 0.3800887 1.622518
> confint.default(fit, "wt")
2.5 % 97.5 %
wt 0.4079023 1.594704
> 1.001303 + 1.96*0.302761
[1] 1.594715
> 1.001303 - 1.96*0.302761
[1] 0.4078914
So the 95% CI obtained from confint.default is based on asymptotic normality. What about for confint?
Thanks
You can check out the code for each method.
# View code for 'default'
confint.default
# View code of lm objects
getAnywhere(confint.lm)
The difference is that confint.default uses normal quantiles, whereas the method for linear models (confint.lm) uses t quantiles based on the residual degrees of freedom.
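As a sketch of that difference, both intervals for wt can be rebuilt by hand from the estimate and its standard error. Only the critical value changes: qt() with the residual degrees of freedom for the lm method, qnorm() for the default method. The variable names are illustrative.

```r
fit <- lm(100/mpg ~ disp + hp + wt + am, data = mtcars)

est <- coef(fit)["wt"]
se  <- sqrt(diag(vcov(fit)))["wt"]

# confint() on an lm: t quantile with the residual degrees of freedom (27 here)
t_crit <- qt(0.975, df = df.residual(fit))
ci_t   <- c(est - t_crit * se, est + t_crit * se)

# confint.default(): standard normal quantile (about 1.96)
z_crit <- qnorm(0.975)
ci_z   <- c(est - z_crit * se, est + z_crit * se)

rbind(t = ci_t, normal = ci_z)
```

With only 27 residual degrees of freedom the t critical value is noticeably larger than 1.96, which is why the confint interval is the wider of the two.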
