Show family class in TukeyHSD - r

I am used to conducting Tukey post-hoc tests in Minitab. When I do, I usually get a family grouping of the dependent/predictor variables.
In R, using TukeyHSD() the family grouping is not displayed (or calculated?). It only displays the relationship between each pair of levels of the dependent/predictor variables. Is it possible to display the family groupings like in Minitab?
Using the diamonds data set:
av <- aov(price ~ cut, data = diamonds)
tk <- TukeyHSD(av, ordered = T, which = "cut")
plot(tk)
Output:
Fit: aov(formula = price ~ cut, data = diamonds)
$cut
diff lwr upr p adj
Good-Ideal 471.32248 300.28228 642.3627 0.0000000
Very Good-Ideal 524.21792 401.33117 647.1047 0.0000000
Fair-Ideal 901.21579 621.86019 1180.5714 0.0000000
Premium-Ideal 1126.71573 1008.80880 1244.6227 0.0000000
Very Good-Good 52.89544 -130.15186 235.9427 0.9341158
Fair-Good 429.89331 119.33783 740.4488 0.0014980
Premium-Good 655.39325 475.65120 835.1353 0.0000000
Fair-Very Good 376.99787 90.13360 663.8622 0.0031094
Premium-Very Good 602.49781 467.76249 737.2331 0.0000000
Premium-Fair 225.49994 -59.26664 510.2665 0.1950425
Picture added to help clarify my response to Maurits's comment:

Here is a step-by-step example of how to reproduce minitab's table for the ggplot2::diamonds dataset. I've included as much detail/explanation as possible.
Please note that as far as I can tell, results shown in minitab's table are not dependent/related to results from Tukey's post-hoc test; they are based on results from the analysis of variance. Tukey's honest significant difference (HSD) test is a post-hoc test that establishes which comparisons (of all the possible pairwise comparisons of group means) are (honestly) statistically significant, given the ANOVA results.
In order to reproduce minitab's "mean-grouping" summary table (see the first table of "Interpret the results: Step 3" of the minitab Express Support), I recommend (re-)running a linear model to extract means and confidence intervals. Note that this is exactly how aov fits the analysis of variance model for each group.
Fit a linear model
We remove the intercept (the 0 + in the formula) to obtain absolute estimates for every group, rather than estimates of changes relative to a reference level.
fit <- lm(price ~ 0 + cut, data = diamonds)
coef <- summary(fit)$coef;
coef;
# Estimate Std. Error t value Pr(>|t|)
#cutFair 4358.758 98.78795 44.12236 0
#cutGood 3928.864 56.59175 69.42468 0
#cutVery Good 3981.760 36.06181 110.41487 0
#cutPremium 4584.258 33.75352 135.81570 0
#cutIdeal 3457.542 27.00121 128.05137 0
Determine family groupings
In order to obtain something similar to minitab's "family groupings", we adopt the following approach:
Calculate confidence intervals for all parameters
Perform a hierarchical clustering analysis on the confidence interval data for all parameters
Cut the resulting tree at a height corresponding to the standard deviation of the CIs. This will give us a grouping of parameter estimates based on their confidence intervals. This is a somewhat empirical approach, but justifiable, as the tree measures pairwise distances between the confidence intervals, and the standard deviation can be interpreted as a Euclidean distance.
We start by calculating the confidence intervals and clustering the resulting distance matrix using hierarchical clustering with complete linkage (the hclust default).
CI <- confint(fit);
hc <- hclust(dist(CI));
We inspect the cluster dendrogram
plot(hc);
We now cut the tree at a height corresponding to the standard deviation of all CIs across all parameter estimates to get the "family groupings":
grps <- cutree(hc, h = sd(CI))
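To see where this cut lands, we can redraw the dendrogram with the cut height marked. This is a small base-graphics sketch; rect.hclust draws boxes around the groups produced by cutting at that height.
plot(hc)
abline(h = sd(CI), col = "red", lty = 2)  # the cut height used by cutree above
rect.hclust(hc, h = sd(CI))               # boxes around the resulting groups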
Summarise results
Finally, we collate all quantities and store results in a table similar to minitab's "mean-grouping" table.
library(tidyverse)
bind_cols(
  cut = rownames(coef),
  N = as.integer(table(fit$model$cut)),
  Mean = coef[, 1],
  Groupings = grps) %>%
  as.data.frame()
# cut N Mean Groupings
#1 cutFair 1610 4358.758 1
#2 cutGood 4906 3928.864 2
#3 cutVery Good 12082 3981.760 2
#4 cutPremium 13791 4584.258 1
#5 cutIdeal 21551 3457.542 3
Note the near-perfect agreement of our results with those from the minitab "mean-grouping" table: cut = Ideal is by itself in group 3 (group C in minitab's table), while Fair+Premium share group 1 (minitab: group A), and Good+Very Good share group 2 (minitab: group B).

See the cld function in the multcomp package, as explained here (copy-pasted below).
Example data set:
> data(ToothGrowth)
> ToothGrowth$treat <- with(ToothGrowth, interaction(supp,dose))
> str(ToothGrowth)
'data.frame': 60 obs. of 4 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
$ treat: Factor w/ 6 levels "OJ.0.5","VC.0.5",..: 2 2 2 2 2 2 2 2 2 2 ...
Model fit:
> fit <- lm(len ~ treat, data=ToothGrowth)
All pairwise comparisons with Tukey test:
> apctt <- multcomp::glht(fit, linfct = multcomp::mcp(treat = "Tukey"))
Letter-based representation of all-pairwise comparisons (algorithm from Piepho 2004):
> lbrapc <- multcomp::cld(apctt)
> lbrapc
OJ.0.5 VC.0.5 OJ.1 VC.1 OJ.2 VC.2
"b" "a" "c" "b" "c" "c"

Related

How to get emmeans to print degrees of freedom for glmer class

I'm trying to get the degrees of freedom from emmeans of a glmer model for reporting reasons, but they just show Inf.
Here's some sample data. In the real data, there is no nesting structure; this is just a consequence of how I built the data frame:
library(lme4)  # provides glmer
set.seed(1234)
dat <- data.frame(
  dv = rnorm(mean = 1, sd = 0.2, n = 12000),
  id = rep(c("1", "2", "3"), times = c(4000, 4000, 4000)),
  region = rep(rep(c("1", "2"), times = c(2000, 2000)), 3),
  intervention = rep(c("1", "2", "1"), times = c(4000, 4000, 4000)),
  timepoint = rep(rep(c("1", "2"), times = c(2000, 2000)), times = 3),
  direction = rep(rep(c("1", "2"), times = c(2000, 2000)), 3)
)
glmm_1 <- glmer(dv ~ intervention*timepoint*region + direction + (1|id), data = dat, family = gaussian(link = "log"))
glmm_1_emm <- emmeans::emmeans(glmm_1, pairwise ~ intervention*region*timepoint, type = "response")
glmm_1_emm$emmeans
NOTE: A nesting structure was detected in the fitted model:
timepoint %in% (direction*region), region %in% direction
region timepoint direction intervention response SE df asymp.LCL asymp.UCL
1 1 1 1 1 0.00313 Inf 0.994 1.01
2 2 2 1 1 0.00313 Inf 0.998 1.01
1 1 1 2 1 0.00442 Inf 0.992 1.01
2 2 2 2 1 0.00442 Inf 0.995 1.01
Confidence level used: 0.95
Intervals are back-transformed from the log scale
This is really more of a statistical (i.e. for CrossValidated) than a computational question. tl;dr finite-size corrections are rarely considered for GLMs or GLMMs, and for GLMMs in particular there is little theoretical work I'm aware of that would even specify how to compute them. That's why emmeans etc. report df as Inf.
df in emmeans output represents the "denominator degrees of freedom" (i.e. the nu2 value you would use if testing against an F distribution F_{nu1,nu2}), which is something like (number of observations - number of parameters estimated) for simple (non-mixed) models like a linear regression or simple ANOVA, but which is considerably harder to define for multilevel models (i.e. linear mixed models). For generalized linear (and linear mixed) models, it gets even worse. Quoting from the "degrees of freedom" section of the GLMM FAQ (see there for full references):
When the responses are not normally distributed (as in GLMs and GLMMs), and when the scale parameter is not estimated (as in standard Poisson- and binomial-response models), then the deviance differences are only asymptotically F- or chi-square-distributed (i.e. not for our real, finite-size samples). In standard GLM practice, we usually ignore this problem; there is some literature on finite-size corrections for GLMs under the rubrics of “Bartlett corrections” and “higher order asymptotics” (see McCullagh and Nelder (1989), Cordeiro, Paula, and Botter (1994), Cordeiro and Ferrari (1998) and the cond package (on CRAN) [which works with GLMs, not GLMMs]), but it’s rarely used. (The bias correction/Firth approach implemented in the brglm package attempts to address the problem of finite-size bias, not finite-size non-chi-squaredness of the deviance differences.)
When the scale parameter in a GLM is estimated rather than fixed (as in Gamma or quasi-likelihood models), it is sometimes recommended to use an F test to account for the uncertainty of the scale parameter (e.g. Venables and Ripley (2002) recommend anova(..., test="F") for quasi-likelihood models)
Combining these issues, one has to look pretty hard for information on small-sample or finite-size corrections for GLMMs: Feng, Braun, and McCulloch (2004) and Bell and Grunwald (2010) look like good starting points, but it’s not at all trivial.
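One practical workaround (my suggestion, not part of the quoted FAQ): if you want t-based rather than asymptotic z intervals for reporting, summary.emmGrid accepts a df argument that overrides the Inf value, but choosing and justifying a particular value is then up to you. A minimal sketch:
# df = 100 is an arbitrary, purely illustrative choice, not a recommendation
summary(glmm_1_emm$emmeans, df = 100)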
Apparently emmeans::emmeans calculates Inf degrees of freedom, not sure why. But I've spotted this:
str(glmm_1_emm$emmeans)
# 'emmGrid' object with variables:
# region = 1, 2
# timepoint = 1, 2
# direction = 1, 2
# intervention = 1, 2
# Nesting structure: timepoint %in% (direction*region), region %in% direction
# Transformation: “log”
# Some things are non-estimable (null space dim = 5) ## <--------------------- !!!
There's a summary method involved, emmeans:::summary.emmGrid, since the summary is not calculated until you print it, i.e. evaluate glmm_1_emm$emmeans.
Provided that the degrees of freedom are correct later on, you could extract them using a rather artless capture.output approach:
tmp <- capture.output(glmm_1_emm$emmeans)
res <- read.table(text=tmp[1:(which(tmp == '') - 1)], header=TRUE)
res
# region timepoint direction intervention response SE df asymp.LCL asymp.UCL
# 1 1 1 1 1 1 0.00313 Inf 0.994 1.01
# 2 2 2 2 1 1 0.00313 Inf 0.998 1.01
# 3 1 1 1 2 1 0.00442 Inf 0.992 1.01
# 4 2 2 2 2 1 0.00442 Inf 0.995 1.01
res[, 7] ## degrees of freedom
# [1] Inf Inf Inf Inf
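A less fragile alternative to parsing the printed output: summary() on an emmGrid returns a data-frame-like object, so the df column can be pulled out directly.
s <- summary(glmm_1_emm$emmeans)  # a summary_emm, which is a data frame
s$df
# [1] Inf Inf Inf Inf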

R post hoc comparisons of ezANOVA

I performed the following ezANOVA (from the ez package):
RMANOVAGHB1 <- ezANOVA(GHB1, dv=DIF.SCORE.STARTLE, wid=RAT.ID, within=TRIAL.TYPE, between=GROUP, detailed = TRUE, return_aov = TRUE)
My dataset looks like this:
RAT.ID DIF.SCORE.STARTLE GROUP TRIAL.TYPE
1 1 170.73 SAL TONO
2 1 80.07 SAL NOAL
3 2 456.40 PROP TONO
4 2 290.40 PROP NOAL
5 3 507.20 SAL TONO
6 3 261.60 SAL NOAL
7 4 208.67 PROP TONO
8 4 137.60 PROP NOAL
9 5 500.50 SAL TONO
10 5 445.73 SAL NOAL
up until rat.id 16.
My supervisors don't work with R, so they can't help me. I need code that will give me all post hoc contrasts, but looking it up only confuses me more and more.
I already tried TukeyHSD on the aov output of ezANOVA and then tried pairwise.t.test (as I found out Bonferroni is a more appropriate correction in this case), but neither seems to work. I've also found things about using a linear model and then multcomp, but I don't know if that would be a good solution in this case. I feel like the problem with everything I tried is either that I have between and within variables or that my dataset is not set up right. Another complicating factor is that I'm just a beginner with R, and my statistical knowledge is still pretty basic, as this is one of my first practical experiences with doing analyses.
If it's important, this is the output of the anova:
$ANOVA
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 14 1233568.9 1076460.9 16.043280 0.001302172 * 0.508451750
2 GROUP 1 14 212967.9 1076460.9 2.769771 0.118272657 0.151521743
3 TRIAL.TYPE 1 14 137480.6 116097.9 16.578499 0.001143728 * 0.103365833
4 GROUP:TRIAL.TYPE 1 14 11007.2 116097.9 1.327335 0.268574391 0.009145489
$aov
Call:
aov(formula = formula(aov_formula), data = data)
Grand Mean: 196.3391
Stratum 1: RAT.ID
Terms:
GROUP Residuals
Sum of Squares 212967.9 1076460.9
Deg. of Freedom 1 14
Residual standard error: 277.2906
1 out of 2 effects not estimable
Estimated effects are balanced
Stratum 2: RAT.ID:TRIAL.TYPE
Terms:
TRIAL.TYPE GROUP:TRIAL.TYPE Residuals
Sum of Squares 137480.6 11007.2 116097.9
Deg. of Freedom 1 1 14
Residual standard error: 91.0643
Estimated effects may be unbalanced
My solution, considering your dataset (first 5 rats):
1. Let's build the linear model:
model.lm = lm(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)
2. Let's check the homogeneity of variance (car::leveneTest) and the distribution of the residuals (Shapiro-Wilk). We are looking for a normal distribution, and the variance should be homogeneous. Two tests for this:
>shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.91783, p-value = 0.3392
> leveneTest(DIF_SCORE_STARTLE ~ GROUP * TRIAL_TYPE, data = dat)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.066 0.976
6
Our p-values are higher than 0.05 in both cases, so we have no evidence that the variance differs between groups. In the case of the normality test, we can likewise conclude that the sample doesn't deviate from normality. Summarizing, we can use parametric tests such as ANOVA or the pairwise t-test.
3. You can also run:
hist(resid(model.lm))
to check what the distribution of our data looks like, and check the model:
plot(model.lm)
Here: https://stats.stackexchange.com/questions/58141/interpreting-plot-lm/65864 you'll find an interpretation of the plots produced by this function. As far as I can see, the data look fine.
4. Now, finally, we can run the ANOVA and the post-hoc HSD test (HSD.test from the agricolae package):
> anova(model.lm)
Analysis of Variance Table
Response: DIF_SCORE_STARTLE
Df Sum Sq Mean Sq F value Pr(>F)
GROUP 1 7095 7095 0.2323 0.6469
TRIAL_TYPE 1 39451 39451 1.2920 0.2990
GROUP:TRIAL_TYPE 1 84 84 0.0027 0.9600
Residuals 6 183215 30536
> (result.hsd = HSD.test(model.lm, list('GROUP', 'TRIAL_TYPE')))
$statistics
Mean CV MSerror HSD r.harmonic
305.89 57.12684 30535.91 552.2118 2.4
$parameters
Df ntr StudentizedRange alpha test name.t
6 4 4.895599 0.05 Tukey GROUP:TRIAL_TYPE
$means
DIF_SCORE_STARTLE std r Min Max
PROP:NOAL 214.0000 108.0459 2 137.60 290.40
PROP:TONO 332.5350 175.1716 2 208.67 456.40
SAL:NOAL 262.4667 182.8315 3 80.07 445.73
SAL:TONO 392.8100 192.3561 3 170.73 507.20
$comparison
NULL
$groups
trt means M
1 SAL:TONO 392.8100 a
2 PROP:TONO 332.5350 a
3 SAL:NOAL 262.4667 a
4 PROP:NOAL 214.0000 a
As you can see, our 'pairs' have all been placed in one big group a, which means that there are no significant differences between them. However, there is some difference between NOAL and TONO, regardless of SAL and PROP.
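One further note: the plain lm above ignores the within-subject (repeated-measures) structure of the original design. As far as I know, emmeans (the successor to lsmeans used elsewhere on this page) can work directly on the aovlist that ezANOVA returns via return_aov = TRUE, keeping the two error strata. A hedged sketch, assuming the RMANOVAGHB1 object from the question:
library(emmeans)
emm <- emmeans(RMANOVAGHB1$aov, ~ GROUP * TRIAL.TYPE)  # uses the aovlist error strata
pairs(emm, adjust = "bonferroni")                      # all pairwise contrasts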

Inflated DF in lsmeans results for an lmer model

I used lmer from the lme4 package to run a linear mixed effects model. I have 3 years of temperature data for untreated (5) and treated plots (10). The model:
modela<-lmer(ave~yr*tr+(1|pl), REML=FALSE, data=mydata)
Model checked for normality of residuals; qqnorm plot
My data:
'data.frame': 6966 obs. of 7 variables:
$ yr : Factor w/ 3 levels "yr1","yr2","yr3": 1 1 1 1 1 1 1 1 1 1 ...
$ pl : Factor w/ 15 levels "C02","C03","C05",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tr : Factor w/ 2 levels "Cont","OTC": 1 1 1 1 1 1 1 1 1 1 ...
$ ave: num 14.8 16.1 11.6 10.3 11.6 ...
The interaction is significant, so I used lsmeans:
lsmeans(modela, pairwise~yr*tr, adjust="tukey")
In the contrasts, I get (excerpts only)
contrast estimate SE df t.ratio p.value
yr1,Cont - yr2,Cont -0.727102895 0.2731808 6947.24 -2.662 0.0832
yr1,OTC - yr2,OTC -0.990574030 0.2015650 6449.10 -4.914 <.0001
yr1,Cont - yr1,OTC -0.005312771 0.3889335 31.89 -0.014 1.0000
yr2,Cont - yr2,OTC -0.268783907 0.3929332 32.97 -0.684 0.9825
My question regards the high dfs for some of the contrasts and the associated, seemingly meaningless, low p-values.
Can this be due to:
- presence of NAs in my data set (some improvement when removed)
- unequal sample sizes (e.g. 5 of one treatment, 10 of the other); however, those contrasts (yr1,Cont - yr1,OTC) don't seem to be a problem
- other issues?
I have searched Stack Overflow questions, and Cross Validated.
Thanks for any answers, ideas, comments.
In this example, treatments are assigned experimentally to plots. Having small numbers of plots assigned to treatments severely limits the information available to statistically compare the treatments. (If you had only one plot per treatment, it would not even be possible to compare treatments, because you wouldn't be able to sort out the effect of the treatments from the effect of the plots.) You have 10 plots assigned to one treatment and 5 to the other. You thus have (10-1)+(5-1) = 13 d.f. for the main effect of treatment, and if you do
lsmeans(modela, pairwise ~ tr)
you will see around 13 d.f. (maybe less due to imbalance and missingness) for those statistics. When you compare combinations of years and treatments, you get roughly 3 times the d.f. because there are 3 years. However, in some of those comparisons, year is the same in each combination being compared, and in those comparisons, the variation in plots mostly cancels out (it is a within-plot comparison); in those cases, the d.f. basically come from the residual error for the model, which has thousands of d.f. Due to imbalances in the data, these comparisons are a little bit polluted by the between-plot variations, making the d.f. somewhat smaller than the residual d.f.
It appears you are not particularly interested in cross-comparisons such as treat1, year1 vs. treat2, year3. I suggest using "by" variables to cut down on the number of comparisons tested, because when you test them all, the multiplicity correction is unnecessarily conservative. It would go something like this:
modela.lsm = lsmeans(modela, ~ tr * yr)
pairs(modela.lsm, by = "yr") # compare tr for each yr
pairs(modela.lsm, by = "tr") # compare yr for each tr
These calls will apply the Tukey correction separately to each "by" group. If you want a multiplicity correction for each whole family, do this:
rbind(pairs(modela.lsm, by = "yr"))
rbind(pairs(modela.lsm, by = "tr"))
By default, a multivariate t correction is used (Tukey is not the right method here). You can even do
rbind(pairs(modela.lsm, by = "yr"), pairs(modela.lsm, by = "tr"))
to group all of the comparisons into one family and apply a multivariate t adjustment.

Dummy variables for Logistic regression in R

I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime shoplifting. As a result the interaction effect is only 1 if both of the above are 1. I would now like to try different combinations for the interaction term, for example I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
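To verify these four linear predictors numerically, you can assemble them from the fitted coefficients (a quick check using the fit4 object from the question):
b <- coef(fit4)
b[["(Intercept)"]]                            # 1. N, Other: 1.90
b[["(Intercept)"]] + b[["PriorconvP"]]        # 2. P, Other: 1.90 - 1.36
b[["(Intercept)"]] + b[["CrimeShoplifting"]]  # 3. N, Shoplifting: 1.90 + 0.98
sum(b)                                        # 4. P, Shoplifting: all four terms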
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by @eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime, which is more compact.

How to properly set contrasts in R

I have been asked to see if there is a linear trend in 3 groups of data (5 points each) by using ANOVA and linear contrasts. The 3 groups represent data collected in 2010, 2011 and 2012. I want to use R for this procedure and I have tried both of the following:
contrasts(data$groups, how.many=1) <- contr.poly(3)
contrasts(data$groups) <- contr.poly(3)
Both ways seem to work fine but give slightly different answers in terms of their p-values. I have no idea which is correct and it is really tricky to find help for this on the web. I would like help figuring out what is the reasoning behind the different answers. I'm not sure if it has something to do with partitioning sums of squares or whatnot.
Both approaches differ with respect to whether a quadratic polynomial is used.
For illustration purposes, have a look at this example; both x and y are factors with three levels.
x <- y <- gl(3, 2)
# [1] 1 1 2 2 3 3
# Levels: 1 2 3
The first approach creates a contrast matrix for a quadratic polynomial, i.e., with a linear (.L) and a quadratic trend (.Q). The 3 means: create the polynomial contrasts for 3 levels, i.e., up to degree 3 - 1 = 2.
contrasts(x) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L .Q
# 1 -7.071068e-01 0.4082483
# 2 -7.850462e-17 -0.8164966
# 3 7.071068e-01 0.4082483
# Levels: 1 2 3
In contrast, the second approach results in a polynomial of first order (i.e., a linear trend only). This is due to the argument how.many = 1. Hence, only 1 contrast is created.
contrasts(y, how.many = 1) <- contr.poly(3)
# [1] 1 1 2 2 3 3
# attr(,"contrasts")
# .L
# 1 -7.071068e-01
# 2 -7.850462e-17
# 3 7.071068e-01
# Levels: 1 2 3
If you're interested in the linear trend only, the second option seems more appropriate for you.
Changing the contrasts you ask for changes the degrees of freedom of the model. If one model requests linear and quadratic contrasts, and a second specifies only, say, the linear contrast, then the second model has an extra degree of freedom: this will increase the power to test the linear hypothesis (at the cost of preventing the model from fitting the quadratic trend).
Using the full ("nlevels - 1") set of contrasts creates an orthogonal set of contrasts which explore the full set of (independent) response configurations. Cutting back to just one prevents the model from fitting one configuration (in this case the quadratic component, which our data in fact possess).
To see how this works, use the built-in dataset mtcars, and test the (confounded) relationship of gears to gallons. We'll hypothesize that the more gears the better (at least up to some point).
df = mtcars # copy the dataset
df$gear = as.ordered(df$gear) # make an ordered factor
Ordered factors default to polynomial contrasts, but we'll set them here to be explicit:
contrasts(df$gear) <- contr.poly(nlevels(df$gear))
Then we can model the relationship.
m1 = lm(mpg ~ gear, data = df);
summary.lm(m1)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 20.6733 0.9284 22.267 < 2e-16 ***
# gear.L 3.7288 1.7191 2.169 0.03842 *
# gear.Q -4.7275 1.4888 -3.175 0.00353 **
#
# Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
# F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
Note we have F(2,29) = 10.9 for the overall model and p=.038 for our linear effect with an estimated extra 3.7 mpg/gear.
Now let's only request the linear contrast, and run the "same" analysis.
contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))
m1 = lm(mpg ~ gear, data = df)
summary.lm(m1)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.317 1.034 20.612 <2e-16 ***
# gear.L 5.548 1.850 2.999 0.0054 **
# Multiple R-squared: 0.2307, Adjusted R-squared: 0.205
# F-statistic: 8.995 on 1 and 30 DF, p-value: 0.005401
The linear effect of gear is now bigger (5.5 mpg) and p << .05. A win? Except the overall model fit is now significantly worse: variance accounted for is now just 23% (was 43%)! Why becomes clear if we plot the relationship:
plot(mpg ~ gear, data = df) # view the relationship
So, if you're interested in the linear trend, but also expect (or are unclear about) additional levels of complexity, you should also test these higher polynomials: the quadratic (or, in general, trends up to levels - 1). One way to do that is a direct model comparison, sketched below.
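The following sketch reuses the df data frame from above, fits both versions of the model, and compares them with an F test for the quadratic component:
contrasts(df$gear) <- contr.poly(nlevels(df$gear))                # .L and .Q
m_full <- lm(mpg ~ gear, data = df)
contrasts(df$gear, how.many = 1) <- contr.poly(nlevels(df$gear))  # .L only
m_lin <- lm(mpg ~ gear, data = df)
anova(m_lin, m_full)  # F test of the quadratic component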
Note too that in this example the physical mechanism is confounded: We've forgotten that number of gears is confounded with automatic vs manual transmission, and also with weight, and sedan vs sports car.
If someone wants to test the hypothesis that 4 gears is better than 3, they could answer this question :-)
