Levene's test using the squared residuals in R

How can Levene's test be performed using the squared residuals rather than the absolute?
I have tried levene.test in the lawstat package and leveneTest in the car package, both of which use the absolute residuals.
The purpose is to reproduce SAS output, which uses squared residuals by default.

iris.lm <- lm(Petal.Width ~ Species, data = iris)
anova(lm(residuals(iris.lm)^2 ~ iris$Species))
## Analysis of Variance Table
##
## Response: residuals(iris.lm)^2
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 0.100 0.0500 14.8 1.4e-06 ***
## Residuals 147 0.497 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Perhaps it helps to understand how this works.
As is pointed out here, Levene's test is just an ANOVA of the distances between each observation and the centre of its group. Different implementations of Levene's test differ in their definitions of “distance” and “centre”.
“Distance” can mean the absolute differences, or the squared differences.
“Centre” can mean the mean or the median.
SAS uses squared differences and the mean by default. leveneTest in R's car package uses absolute differences only, with the median as the default centre; levene.test in the lawstat package behaves the same way.
All four possible combinations can be done by hand as follows.
require(plyr)
x <- ddply(iris, .(Species), summarize
, abs.mean = abs(Petal.Width - mean(Petal.Width))
, abs.median = abs(Petal.Width - median(Petal.Width))
, squared.mean = (Petal.Width - mean(Petal.Width))^2
, squared.median = (Petal.Width - median(Petal.Width))^2)
anova(lm(abs.mean ~ Species, data = x)) # Levene's test
## Analysis of Variance Table
##
## Response: abs.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.53 0.2648 19.6 2.7e-08 ***
## Residuals 147 1.98 0.0135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(abs.median ~ Species, data = x)) # Brown-Forsythe test
## Analysis of Variance Table
##
## Response: abs.median
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.642 0.321 19.9 2.3e-08 ***
## Residuals 147 2.373 0.016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(squared.mean ~ Species, data = x)) # default SAS Levene's Test
## Analysis of Variance Table
##
## Response: squared.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.100 0.0500 14.8 1.4e-06 ***
## Residuals 147 0.497 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(squared.median ~ Species, data = x)) # Who-Knows-Whose Test
## Analysis of Variance Table
##
## Response: squared.median
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 0.096 0.0478 13.6 3.7e-06 ***
## Residuals 147 0.515 0.0035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
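Since plyr is retired, the same four distance columns can also be built with dplyr; a minimal sketch, assuming dplyr is available (the four anova() calls above work unchanged on this x):
require(dplyr)
x <- iris %>%
  group_by(Species) %>%
  mutate(abs.mean       = abs(Petal.Width - mean(Petal.Width))
       , abs.median     = abs(Petal.Width - median(Petal.Width))
       , squared.mean   = (Petal.Width - mean(Petal.Width))^2
       , squared.median = (Petal.Width - median(Petal.Width))^2) %>%
  ungroup()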
To show that the first two above reproduce leveneTest:
require(car)
leveneTest(Petal.Width ~ Species, data = iris, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 19.6 2.7e-08 ***
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(Petal.Width ~ Species, data = iris, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 19.9 2.3e-08 ***
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since one often already has a linear model at hand, with its residuals (differences from the group means) ready to go, it is usually more convenient to do
iris.lm <- lm(Petal.Width ~ Species, data = iris)
anova(lm(residuals(iris.lm)^2 ~ iris$Species))
## Analysis of Variance Table
##
## Response: residuals(iris.lm)^2
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 0.100 0.0500 14.8 1.4e-06 ***
## Residuals 147 0.497 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence the answer at the top.

Related

lmer and lm - use all factor levels as 'control' group

When a linear model or lmer formula contains a categorical variable (an unordered factor), the function uses the first factor level as the 'control' group for contrasts. In my case I have a categorical variable with several levels, and I would like each level to serve as the 'control' base group. Is there a function that automates this process and creates a nice matrix with p-values for all combinations? Here's sample code using the diamonds dataset.
library(lme4); library(lmerTest)   # lmer() lives in lme4; there is no package called "lmer"
library(ggplot2)                   # for the diamonds data
# create an unordered factor (this overwrites the original color column)
diamonds$color <- factor(sample(c('red','white','blue','green','black'),
                                nrow(diamonds), replace = TRUE))
# lmer formula with the factor in the fixed effects
mod <- lmer(carat ~ color + (1 | clarity), data = diamonds)
summary(mod, corr = FALSE)
As shown in the summary, 'black' is used as the control, and I would like each of the other colors to be used as the control as well.
Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
Formula: carat ~ color + (1 | clarity)
   Data: diamonds

REML criterion at convergence: 64684

Scaled residuals:
   Min     1Q Median     3Q    Max
-2.228 -0.740 -0.224  0.540  8.471

Random effects:
 Groups   Name        Variance Std.Dev.
 clarity  (Intercept) 0.0763   0.276
 Residual             0.1939   0.440
Number of obs: 53940, groups: clarity, 8

Fixed effects:
              Estimate Std. Error           df t value Pr(>|t|)
(Intercept)   0.786709   0.097774     7.005805    8.05 0.000087 ***
colorblue    -0.000479   0.005989 53927.996020   -0.08     0.94
colorgreen    0.007455   0.005998 53927.990722    1.24     0.21
colorred      0.000746   0.005986 53927.988909    0.12     0.90
colorwhite    0.000449   0.005971 53927.993708    0.08     0.94
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I could imagine wanting to do this for one of two reasons. The first would be to get the predicted value of the outcome at each level of the unordered factor (controlling for everything else in the model). The other would be to calculate all of the pairwise differences across the levels of the factor. If either of these is your goal, there are better ways to do it; both are shown below. Let's start with the diamonds data, keeping the existing color variable but making it an unordered factor.
library(lme4)
library(lmerTest)
library(multcomp)
library(ggeffects)
#creating unordered factor
data(diamonds, package="ggplot2")
diamonds$color <- as.factor(as.character(diamonds$color))
Now, we can run the model:
#lmer formula with factor in fixed effects
mod=lmer(data=diamonds,carat~color+(1|clarity))
The function glht in the multcomp package tests pairwise differences among factor levels. Here is the output.
summary(glht(mod, linfct = mcp(color="Tukey")))
#>
#> Simultaneous Tests for General Linear Hypotheses
#>
#> Multiple Comparisons of Means: Tukey Contrasts
#>
#>
#> Fit: lmer(formula = carat ~ color + (1 | clarity), data = diamonds)
#>
#> Linear Hypotheses:
#> Estimate Std. Error z value Pr(>|z|)
#> E - D == 0 0.025497 0.006592 3.868 0.00216 **
#> F - D == 0 0.116241 0.006643 17.497 < 0.001 ***
#> G - D == 0 0.181010 0.006476 27.953 < 0.001 ***
#> H - D == 0 0.271558 0.006837 39.721 < 0.001 ***
#> I - D == 0 0.392373 0.007607 51.577 < 0.001 ***
#> J - D == 0 0.511159 0.009363 54.592 < 0.001 ***
#> F - E == 0 0.090744 0.005997 15.130 < 0.001 ***
#> G - E == 0 0.155513 0.005789 26.863 < 0.001 ***
#> H - E == 0 0.246061 0.006224 39.536 < 0.001 ***
#> I - E == 0 0.366876 0.007059 51.975 < 0.001 ***
#> J - E == 0 0.485662 0.008931 54.380 < 0.001 ***
#> G - F == 0 0.064768 0.005807 11.154 < 0.001 ***
#> H - F == 0 0.155317 0.006258 24.819 < 0.001 ***
#> I - F == 0 0.276132 0.007091 38.939 < 0.001 ***
#> J - F == 0 0.394918 0.008962 44.065 < 0.001 ***
#> H - G == 0 0.090548 0.006056 14.952 < 0.001 ***
#> I - G == 0 0.211363 0.006910 30.587 < 0.001 ***
#> J - G == 0 0.330150 0.008827 37.404 < 0.001 ***
#> I - H == 0 0.120815 0.007276 16.606 < 0.001 ***
#> J - H == 0 0.239602 0.009107 26.311 < 0.001 ***
#> J - I == 0 0.118787 0.009690 12.259 < 0.001 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
If you wanted all the predicted values of carat for the different values of color, you could use ggpredict() from the ggeffects package:
g <- ggpredict(mod, terms = "color")
plot(g)
Plotting the g object produces the plot, while printing it shows the values and confidence intervals.
Created on 2023-02-01 by the reprex package (v2.0.1)
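If the goal really is to refit the model with each level as the base group, base R's relevel() changes the reference level; a minimal sketch (looping over the levels is my own addition, not part of the original answer):
# refit once per reference level; each summary() then reports
# contrasts against a different baseline
fits <- lapply(levels(diamonds$color), function(ref) {
  diamonds$color <- relevel(diamonds$color, ref = ref)  # modifies a local copy only
  lmer(carat ~ color + (1 | clarity), data = diamonds)
})
names(fits) <- levels(diamonds$color)
summary(fits[["E"]])  # e.g. the fit with "E" as the reference level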

Coeftest function in R: variable not reported in output

I have a linear regression I am running in R, for which I am calculating clustered standard errors. I get the output of coeftest(), but in some cases it doesn't report anything for a variable, and I don't get an error. Does this mean the coefficient couldn't be calculated, or does coeftest() not report variables that are insignificant? I can't seem to find the answer in any of the R documentation.
Here is the output from R:
lm1 <- lm(PeaceA ~ Soc_Edu + Pol_Constitution + mediation + gdp + enrollratio + infantmortality , data=qsi.surv)
coeftest(lm1, vcov = vcovHC(lm1, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.05780946 0.20574444 -5.1414 4.973e-06 ***
Soc_Edu -1.00735592 0.11756507 -8.5685 3.088e-11 ***
mediation 0.65682159 0.06291926 10.4391 6.087e-14 ***
gdp 0.00041894 0.00010205 4.1052 0.000156 ***
enrollratio 0.00852143 0.00177600 4.7981 1.598e-05 ***
infantmortality 0.00455383 0.00079536 5.7255 6.566e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that there is nothing reported for the variable Pol_Constitution.
I assume you mean coeftest() from package lmtest and vcovHC() from package sandwich. In this combination, coefficients for linearly dependent columns are silently dropped from coeftest's output. Thus, I assume your variable/column Pol_Constitution suffers from linear dependence.
Below is an example which demonstrates the behaviour with a linearly dependent column. See how the estimated coefficient for I(2 * cyl) is NA in a simple summary() and in coeftest(), but is silently dropped when the latter is combined with vcovHC().
library(lmtest)
library(sandwich)
data(mtcars)
summary(mod <- lm(mpg ~ cyl + I(2*cyl), data = mtcars))
#> [...]
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
#> cyl -2.8758 0.3224 -8.92 6.11e-10 ***
#> I(2 * cyl) NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> [...]
coeftest(mod)
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.88458 2.07384 18.2678 < 2.2e-16 ***
#> cyl -2.87579 0.32241 -8.9197 6.113e-10 ***
#> I(2 * cyl) NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mod, vcov. = vcovHC(mod))
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.88458 2.74154 13.8187 1.519e-14 ***
#> cyl -2.87579 0.38869 -7.3987 3.040e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
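To confirm which column was dropped and why, the fitted lm object itself can be inspected; a quick sketch:
# coefficients lm() could not estimate are NA in the coefficient vector
names(coef(mod))[is.na(coef(mod))]
#> [1] "I(2 * cyl)"
# alias() reports the exact linear dependency among the model columns
alias(mod)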

Convert mgcv or gamm4 gam/bam output to dataframe

The broom package has a great tidy() function for the summary results of simple linear models such as those generated by lm(). However, tidy() does not work for mgcv::bam(), mgcv::gam(), or gamm4::gamm4(). The bam() fit below produces the following:
library(mgcv)
library(broom)   # for tidy() and glance()
set.seed(3)
dat <- gamSim(1, n = 25000, dist = "normal", scale = 20)
bs <- "cr"; k <- 12
b <- bam(y ~ s(x0, bs = bs) + s(x1, bs = bs) + s(x2, bs = bs, k = k) +
           s(x3, bs = bs), data = dat)
summary(b)
tidy(b)
glance(b)
Output of above code:
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, bs = bs) + s(x1, bs = bs) + s(x2, bs = bs, k = k) +
s(x3, bs = bs)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.8918 0.1275 61.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 3.113 3.863 6.667 3.47e-05 ***
s(x1) 2.826 3.511 63.015 < 2e-16 ***
s(x2) 8.620 9.905 52.059 < 2e-16 ***
s(x3) 1.002 1.004 3.829 0.0503 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0295 Deviance explained = 3.01%
fREML = 1.1057e+05 Scale est. = 406.15 n = 25000
> tidy(b)
data frame with 0 columns and 0 rows
> glance(b)
Error in `$<-.data.frame`(`*tmp*`, "logLik", value = -110549.163197452) :
replacement has 1 row, data has 0
How can I convert the summary to a dataframe so I can access outputs like the coefficients?
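One answer-style sketch: summary.gam() (which also handles bam fits) stores the parametric and smooth-term tables as plain matrices in its p.table and s.table components, so they convert straight to data frames without broom:
s <- summary(b)
parametric <- as.data.frame(s$p.table)  # Estimate, Std. Error, t value, Pr(>|t|)
smooths    <- as.data.frame(s$s.table)  # edf, Ref.df, F, p-value
Recent broom releases have also gained tidy() support for mgcv objects, so upgrading broom may work too, but the matrix route above does not depend on that.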

How can I get ggpairs to display ancova results in upper panels?

I'm making a scatterplot matrix with ggpairs() (from the GGally package) as follows, but I'd like to display the p-values for each term of my aov results in the upper panels, rather than just the overall and by-species correlation values that come with the package.
How can I get the right column from this aov result into my upper plots? Can I write a custom function to do this, and how? Is it even possible using ggpairs? Thanks.
library(GGally)
library(ggplot2)
pm <- ggpairs(data = iris,
              mapping = aes(color = Species),
              columns = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
pm
result <- aov(Sepal.Length ~ Sepal.Width*Petal.Length, data = iris)
print(summary(result))
Df Sum Sq Mean Sq F value Pr(>F)
Sepal.Width 1 1.41 1.41 12.9 0.000447 ***
Petal.Length 1 84.43 84.43 771.4 < 2e-16 ***
Sepal.Width:Petal.Length 1 0.35 0.35 3.2 0.075712 .
Residuals 146 15.98 0.11
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
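ggpairs() does accept user-written panel functions through its upper argument. Below is a sketch of one possible approach; panel_aov_p is a hypothetical helper that fits a per-panel aov() of the y variable on the x variable and prints that term's p-value (this shows the mechanism, not the questioner's exact two-term model):
# hypothetical custom upper panel: replace the correlation text with
# the p-value of aov(y ~ x) for the panel's pair of variables
panel_aov_p <- function(data, mapping, ...) {
  x <- GGally::eval_data_col(data, mapping$x)
  y <- GGally::eval_data_col(data, mapping$y)
  p <- summary(aov(y ~ x))[[1]][["Pr(>F)"]][1]
  GGally::ggally_text(label = paste0("aov p = ", signif(p, 2)))
}
ggpairs(data = iris,
        mapping = aes(color = Species),
        columns = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        upper = list(continuous = panel_aov_p))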

Grouping Regressors in Anova Table for Multiple Linear Regression

I am running a regression in R:
fbReg <- lm(y ~ x2 + x7 + x8, data = table.b1)
I then produce an ANOVA table to assess the significance of the regression:
anova(fbReg)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 76.193 76.193 26.172 3.100e-05 ***
x7 1 139.501 139.501 47.918 3.698e-07 ***
x8 1 41.400 41.400 14.221 0.0009378 ***
Residuals 24 69.870 2.911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Is there anything I can do to make my ANOVA table pool the sums of squares for x2, x7, and x8 instead of listing them separately? Essentially, I'd like the ANOVA table to look like this:
                  df  SS       MS    F val  Pr(>F)
Regression         3  257.094  etc....
Error (Residual)  24  69.870   etc.....
Thanks
To illustrate my comment:
> lm2 <- lm(Fertility ~ Catholic+Education+Agriculture, data = swiss)
> lm1 <- lm(Fertility ~ 1, data = swiss)
> anova(lm1,lm2)
Analysis of Variance Table
Model 1: Fertility ~ 1
Model 2: Fertility ~ Catholic + Education + Agriculture
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 7178.0
2 43 2567.9 3 4610.1 25.732 1.089e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
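The same overall F is also printed at the bottom of summary(lm2), so it can be read off without the explicit model comparison:
summary(lm2)$fstatistic  # matches the F and degrees of freedom above
#>  value  numdf  dendf
#> 25.732  3.000 43.000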
