How to do negative binomial regression with the rms package in R?

How can I use the rms package in R to execute a negative binomial regression? (I originally posted this question on Statistics SE, but it was closed apparently because it is a better fit here.)
With the MASS package, I use the glm.nb function, but I am trying to switch to the rms package because I sometimes get weird errors when bootstrapping with glm.nb and some other functions. But I cannot figure out how to do a negative binomial regression with the rms package.
Here is sample code of what I would like to do (copied from the rms::Glm function documentation):
library(rms)
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
f <- Glm(counts ~ outcome + treatment, family=poisson())
f
anova(f)
summary(f, outcome=c('1','2','3'), treatment=c('1','2','3'))
So, instead of using family=poisson(), I would like to use something like family=negative.binomial(), but I cannot figure out how to do this.
In the documentation for family {stats}, I found this note in the "See also" section:
For binomial coefficients, choose; the binomial and negative binomial distributions, Binomial, and NegBinomial.
But even after clicking the link for ?NegBinomial, I cannot make any sense of this.
I would appreciate any help on how to use the rms package in R to execute a negative binomial regression.

Opinion up front: you might be better off posting (as a separate question) a reproducible example of the "weird errors" from your bootstrap attempts and seeing whether people have ideas for resolving them. It's fairly common for NB fitting procedures to throw warnings or errors when data are equi- or underdispersed, as the estimates of the dispersion parameter become infinite in this case ...
@coffeinjunky is correct that using family = negative.binomial(theta=VALUE) will work (where VALUE is a numeric constant, e.g. theta=1 for the geometric distribution [a special case of the NB]). However, you won't be able (without significantly more work) to fit the general NB model, i.e. the model where the dispersion parameter (theta) is estimated as part of the fitting procedure. That's what MASS::glm.nb does, and AFAICS there is no analogue in the rms package.
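To give a rough idea of what the "significantly more work" could look like, here is a minimal sketch that profiles the NB log-likelihood over a grid of fixed theta values. It uses plain glm() with MASS::negative.binomial() rather than rms::Glm(), and both the helper profile_loglik() and the grid are hypothetical choices made for illustration, not part of either package.

```r
library(MASS)  # provides negative.binomial()

## Dobson (1990) data, as in the question
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)

## hypothetical helper: fit the GLM at a fixed theta and return the
## NB log-likelihood, computed directly from dnbinom()
profile_loglik <- function(theta) {
  fit <- glm(counts ~ outcome + treatment,
             family = negative.binomial(theta = theta))
  sum(dnbinom(counts, mu = fitted(fit), size = theta, log = TRUE))
}

theta_grid <- 10^seq(0, 4, by = 0.5)  # illustrative grid, an assumption
ll <- sapply(theta_grid, profile_loglik)

data.frame(theta = theta_grid, logLik = ll)
theta_grid[which.max(ll)]  # grid value with the highest log-likelihood
```

If the best value sits at the upper edge of the grid, that is consistent with the point above about the dispersion estimate running off to infinity for data that are not overdispersed.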
There are a few other packages/functions in addition to MASS::glm.nb that fit the negative binomial model, including (at least) bbmle and glmmTMB — there may be others such as gamlss.
## Dobson (1990) Page 93: Randomized Controlled Trial :
dd <- data.frame(
counts = c(18,17,15,20,10,20,25,13,12),
outcome = gl(3,1,9),
treatment = gl(3,3))
MASS::glm.nb
library(MASS)
m1 <- glm.nb(counts ~ outcome + treatment, data = dd)
## "iteration limit reached" warning
glmmTMB
library(glmmTMB)
m2 <- glmmTMB(counts ~ outcome + treatment, family = nbinom2, data = dd)
## "false convergence" warning
bbmle
library(bbmle)
m3 <- mle2(counts ~ dnbinom(mu = exp(logmu), size = exp(logtheta)),
parameters = list(logmu ~outcome + treatment),
data = dd,
start = list(logmu = 0, logtheta = 0)
)
signif(cbind(MASS=coef(m1), glmmTMB=fixef(m2)$cond, bbmle=coef(m3)[1:5]), 5)
MASS glmmTMB bbmle
(Intercept) 3.0445e+00 3.04540000 3.0445e+00
outcome2 -4.5426e-01 -0.45397000 -4.5417e-01
outcome3 -2.9299e-01 -0.29253000 -2.9293e-01
treatment2 -1.1114e-06 0.00032174 8.1631e-06
treatment3 -1.9209e-06 0.00032823 6.5817e-06
These all agree fairly well (at least for the intercept/outcome parameters). This example is fairly difficult for a NB model (5 parameters + dispersion for 9 observations, data are Poisson rather than NB).
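One quick way to see how much overdispersion there is to estimate in the first place is the Pearson dispersion statistic from a Poisson fit. The following is only a sketch of that check on the same data; values near 1 suggest that the Poisson variance assumption is roughly adequate.

```r
## Dobson (1990) data, as above
dd <- data.frame(
  counts    = c(18, 17, 15, 20, 10, 20, 25, 13, 12),
  outcome   = gl(3, 1, 9),
  treatment = gl(3, 3))

pois_fit <- glm(counts ~ outcome + treatment, family = poisson(), data = dd)

## Pearson chi-square divided by the residual degrees of freedom
pearson_chisq <- sum(residuals(pois_fit, type = "pearson")^2)
dispersion    <- pearson_chisq / df.residual(pois_fit)
dispersion
```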

Based on this, the following seems to work:
library(rms)
library(MASS)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
Glm(counts ~ outcome + treatment, family = negative.binomial(theta = 1))
General Linear Model
rms::Glm(formula = counts ~ outcome + treatment, family = negative.binomial(theta = 1))
Model Likelihood
Ratio Test
Obs 9 LR chi2 0.31
Residual d.f.4 d.f. 4
g 0.2383063 Pr(> chi2) 0.9892
Coef S.E. Wald Z Pr(>|Z|)
Intercept 3.0756 0.2121 14.50 <0.0001
outcome=2 -0.4598 0.2333 -1.97 0.0487
outcome=3 -0.2962 0.2327 -1.27 0.2030
treatment=2 -0.0347 0.2333 -0.15 0.8819
treatment=3 -0.0503 0.2333 -0.22 0.8293

Related

lme4: How to specify random slopes while constraining all correlations to 0?

Due to an interesting turn of events, I'm trying to use the lme4 package in R to fit a model in which the random slopes are not allowed to correlate with each other or the random intercept. Effectively, I want to estimate the variance parameter for each random slope, but none of the correlations/covariances. From the reading I've done so far, I think what I want is effectively a diagonal variance/covariance structure for the random effects.
An answer to a similar question here provides a workaround to specify a model where slopes are correlated with intercepts, but not with each other. ~~I also know the || syntax in lme4 makes slopes that are correlated with each other, but not with the intercepts.~~ Neither of these seems to fully accomplish what I'm looking to do.
Borrowing the example from the earlier post, if my model is:
m1 <- lmer (Y ~ A + B + (1+A+B|Subject), data=mydata)
is there a way to specify the model such that I estimate variance parameters for A and B while constraining all three correlations to 0? I would like to achieve a result that looks something like this:
VarCorr(m1)
## Groups Name Std.Dev. Corr
## Subject (Intercept) 1.41450
## A 1.49374 0.000
## B 2.47895 0.000 0.000
## Residual 0.96617
I'd prefer a solution that could achieve this for an arbitrary number of random slopes. For example, if I were to add a random effect for a third variable C, there would be 6 correlation parameters to fix at 0 rather than 3. However, anything that could get me started in the right direction would be extremely helpful.
Edit:
On asking this question, I misunderstood what the || syntax does in lme4. Struck through the incorrect statement above to avoid misleading anyone in the future.
This is exactly what the double-bar notation does. However, note that the || in lme4 does not work as one might expect for factor variables. It does work 'properly' in glmmTMB, and the afex::mixed() function is a wrapper for [g]lmer which does implement a fully functional version of ||. (I have meant to import this into lme4 for years but just haven't gotten around to it yet ...)
simulated example
library(lme4)
set.seed(101)
dd <- data.frame(A = runif(500), B = runif(500),
Subject = factor(rep(1:25, 20)))
dd$Y <- simulate(~ A + B + (1 + A + B|Subject),
newdata = dd,
family = gaussian,
newparams = list(beta = rep(1,3), theta = rep(1,6), sigma = 1))[[1]]
solution
summary(m <- lmer (Y ~ A + B + (1+A+B||Subject), data=dd))
The correlations aren't listed because they are structurally absent (internally, the random effects term is expanded to (1|Subject) + (0 + A|Subject) + (0+B|Subject), which is also why the groups are listed as Subject, Subject.1, Subject.2).
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 0.8744 0.9351
Subject.1 A 2.0016 1.4148
Subject.2 B 2.8718 1.6946
Residual 0.9456 0.9724
Number of obs: 500, groups: Subject, 25
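The expansion described above can be checked directly, and the same idea extends to an arbitrary number of numeric slopes by building the formula programmatically. This is a sketch under assumptions: the data are re-simulated with base R (independent random intercepts and slopes), and diag_formula() is a hypothetical helper, not an lme4 function.

```r
library(lme4)

set.seed(101)
n_subj <- 25
n_rep  <- 20
dd <- data.frame(A = runif(n_subj * n_rep),
                 B = runif(n_subj * n_rep),
                 Subject = factor(rep(seq_len(n_subj), n_rep)))

## independent random intercepts and slopes, one set per subject
b0 <- rnorm(n_subj); bA <- rnorm(n_subj); bB <- rnorm(n_subj)
s  <- as.integer(dd$Subject)
dd$Y <- 1 + dd$A + dd$B + b0[s] + bA[s] * dd$A + bB[s] * dd$B +
  rnorm(nrow(dd))

## hypothetical helper: Y ~ A + B + (1 + A + B || Subject) for any slopes
diag_formula <- function(response, slopes, group) {
  rhs <- paste(slopes, collapse = " + ")
  as.formula(sprintf("%s ~ %s + (1 + %s || %s)", response, rhs, rhs, group),
             env = parent.frame())
}

m_bar <- lmer(diag_formula("Y", c("A", "B"), "Subject"), data = dd)

## the same model written out term by term
m_expanded <- lmer(Y ~ A + B + (1 | Subject) + (0 + A | Subject) +
                     (0 + B | Subject), data = dd)

all.equal(as.numeric(logLik(m_bar)), as.numeric(logLik(m_expanded)))
VarCorr(m_bar)
```

As noted above, this relies on the slopes being numeric; for factor variables the || expansion in lme4 does not behave this way.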

R geepack: unreasonably large estimates using GEE

I am using geepack for R to estimate a logistic marginal model with geeglm(), but I am getting garbage estimates. They are about 16 orders of magnitude too large. However, the p-values seem to be similar to what I expected. This means that the response essentially becomes a step function. See attached plot.
Here is the code that generates the plot:
require(geepack)
data = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=data, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=data)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
Here is the regression table:
Call:
geeglm(formula = moden ~ 1 + power, family = binomial, data = data,
id = defacto, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -7.38e+15 1.47e+15 25.1 5.4e-07 ***
power 2.05e+13 1.60e+12 164.4 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1.03e+15 1.65e+37
Correlation: Structure = exchangeable Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.196 3.15e+21
Number of clusters: 3 Maximum cluster size: 381
Hoping for some help. Thanks!
Kind regards,
Marius
I will give three procedures, each of which is a marginalized random intercept model (MRIM). These MRIMs have coefficients with marginal logistic interpretations and are of smaller magnitude than the GEE:
| Model | (Intercept) | power | LogL |
|-------|-------------|--------|--------|
| `L_N` | -1.050| 0.00267| -270.1|
| `LLB` | -0.668| 0.00343| -273.8|
| `LPN` | -1.178| 0.00569| -266.4|
compared to a glm that doesn't account for any correlation, for reference:
| Model | (Intercept) | power | LogL |
|-------|-------------|--------|--------|
| strt | -0.207| 0.00216| -317.1|
A marginalized random intercept model (MRIM) is worth exploring because you want a marginal model with exchangeable correlation structure for the clustered data, and that is the type of structure MRIMs exhibit.
The code (especially R script with comments) and PDFs for literature are in the GITHUB repo. I detail the code and literature down below.
The concept of MRIM has been around since 1999, and some background reading on this is in the GITHUB repo. I suggest reading Swihart et al 2014 first because it reviews the other papers.
In chronological order --
L_N Heagerty (1999): the approach fits a random intercept logistic model with a normally distributed random intercept. The trick is that the predictor in the random intercept model is nonlinearly parameterized with marginal coefficients so that the resulting marginal model has a marginal logistic interpretation. Its code is the lnMLE R package (not on CRAN, but on Patrick Heagerty's website here). This approach is denoted L_N in the code to indicate logit (L) on the marginal, no interpretation on conditional scale (_) and a normally (N) distributed random intercept.
LLB Wang & Louis (2003): the approach fits a random intercept logistic model with a bridge distributed random intercept. Unlike Heagerty 1999 where the trick is nonlinear-predictor for the random intercept model, the trick is a special random effects distribution (the bridge distribution) that allows both the random intercept model and the resulting marginal model to have a logistic interpretation. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. This approach is denoted LLB in the code to indicate logit (L) on the marginal, logit (L) on the conditional scale and a bridge (B) distributed intercept.
LPN Caffo and Griswold (2006): the approach fits a random intercept probit model with a normally distributed random intercept, whereas Heagerty 1999 used a logit random intercept model. This substitution makes computations easier and still yields a marginal logit model. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. This approach is denoted LPN in the code to indicate logit (L) on the marginal, probit (P) on the conditional scale and a normally (N) distributed intercept.
Griswold et al (2013): another review / practical introduction.
Swihart et al 2014: This is a review paper for Heagerty 1999 and Wang & Louis 2003 as well as others and generalizes the MRIM method. One of the most interesting generalizations is allowing the logistic CDF (equivalently, logit link) in both the marginal and conditional models to instead be a stable distribution that approximates a logistic CDF. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. I denote this SSS in the R script with comments to indicate stable (S) on the marginal, stable (S) on the conditional scale and a stable (S) distributed intercept. It is included in the R script but not detailed in this post on SO.
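To make the LPN idea concrete, the following sketch checks numerically that a probit conditional model with a normal random intercept, scaled by sqrt(1 + sigma^2), integrates back to a logistic marginal probability. The values of eta and sigma2 are illustrative assumptions, and the random intercept is assumed to be N(0, sigma2); nothing here comes from the fitted models.

```r
eta    <- 0.7   # illustrative marginal linear predictor (assumption)
sigma2 <- 2     # illustrative random-intercept variance (assumption)

## conditional probability given the random intercept b, in the LPN form
cond_prob <- function(b) {
  pnorm(qnorm(plogis(eta)) * sqrt(1 + sigma2) + b)
}

## marginal probability: integrate the conditional one over b ~ N(0, sigma2)
marginal_numeric <- integrate(
  function(b) cond_prob(b) * dnorm(b, sd = sqrt(sigma2)),
  lower = -Inf, upper = Inf, rel.tol = 1e-10)$value

marginal_target <- plogis(eta)  # what a marginal logit model would give

all.equal(marginal_numeric, marginal_target, tolerance = 1e-6)
```

The agreement follows from the identity E[pnorm(a + b)] = pnorm(a / sqrt(1 + sigma2)) for b ~ N(0, sigma2), which is why the probit substitution keeps the computations simple.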
Prep
#code from OP Question: edit `data` to `d`
require(geepack)
d = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=d, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=d)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
#get some starting values from glm():
strt <- coef(glm(moden ~ power, family = binomial, data=d))
strt
#I'm so sorry but these methods use attach()
attach(d)
L_N Heagerty (1999)
# marginally specifies a logit link and has a nonlinear conditional model
# the following code will not run if lnMLE is not successfully installed.
# See https://faculty.washington.edu/heagerty/Software/LDA/MLV/
library(lnMLE)
L_N <- logit.normal.mle(meanmodel = moden ~ power,
logSigma= ~1,
id=defacto,
model="marginal",
data=d,
beta=strt,
r=10)
print.logit.normal.mle(L_N)
Prep for LLB and LPN
library("gnlm")
library("repeated")
source("gnlmix4MMM.R") ## see ?gnlmix; in GITHUB repo
y <- cbind(d$moden,(1-d$moden))
LLB Wang and Louis (2003)
LLB <- gnlmix4MMM(y = y,
distribution = "binomial",
mixture = "normal",
random = "rand",
nest = defacto,
mu = ~ 1/(1+exp(-(a0 + a1*power)*sqrt(1+3/pi/pi*exp(pmix)) - sqrt(1+3/pi/pi*exp(pmix))*log(sin(pi*pnorm(rand/sqrt(exp(pmix)))/sqrt(1+3/pi/pi*exp(pmix)))/sin(pi*(1-pnorm(rand/sqrt(exp(pmix))))/sqrt(1+3/pi/pi*exp(pmix)))))),
pmu = c(strt, log(1)),
pmix = log(1))
print("code: 1 -best 2-ok 3,4,5 - problem")
LLB$code
print("coefficients")
LLB$coeff
print("se")
LLB$se
LPN Caffo and Griswold (2006)
LPN <- gnlmix4MMM(y = y,
distribution = "binomial",
mixture = "normal",
random = "rand",
nest = defacto,
mu = ~pnorm(qnorm(1/(1+exp(-a0 - a1*power)))*sqrt(1+exp(pmix)) + rand),
pmu = c(strt, log(1)),
pmix = log(1))
print("code: 1 -best 2-ok 3,4,5 - problem")
LPN$code
print("coefficients")
LPN$coeff
print("se")
LPN$se
coefficients from 3 approaches:
rbind("L_N"=L_N$beta, "LLB" = LLB$coefficients[1:2], "LPN"=LPN$coefficients[1:2])
max log likelihood for 3 models:
rbind("L_N"=L_N$logL, "LLB" = -LLB$maxlike, "LPN"=-LPN$maxlike)

Is there a way of getting "marginal effects" from a `glmer` object

I am estimating random effects logit model using glmer and I would like to report Marginal Effects for the independent variables. For glm models, package mfx helps compute marginal effects. Is there any package or function for glmer objects?
Thanks for your help.
A reproducible example is given below
## mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
## as of 2020-08-24:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank) #creating ranks
id <- rep(1:ceiling(nrow(mydata)/2), times=c(2)) #creating ID variable
mydata <- cbind(mydata,data.frame(id,stringsAsFactors=FALSE))
set.seed(12345)
mydata$ran <- runif(nrow(mydata),0,1) #creating a random variable
library(lme4)
cfelr <- glmer(admit ~ (1 | id) + rank + gpa + ran + gre, data=mydata ,family = binomial)
summary(cfelr)
Here's an approach using the margins() package:
library(margins)
library(lme4)
gm1 <- glmer(cbind(incidence, size - incidence) ~ period +
(1 | herd),
data = cbpp,
family = binomial)
m <- margins(gm1, data = cbpp)
m
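For comparison, a discrete-change effect can also be computed by hand from predict(). The sketch below sets the random effects to zero (re.form = NA), so it gives population-level predictions for a typical herd; this is an assumption of the sketch and not necessarily the quantity that margins() reports.

```r
library(lme4)

gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

period_levels <- levels(cbpp$period)

## average predicted probability when every row is set to a given period,
## with the herd random effect fixed at zero
avg_prob <- sapply(period_levels, function(lev) {
  nd <- cbpp
  nd$period <- factor(lev, levels = period_levels)
  mean(predict(gm1, newdata = nd, re.form = NA, type = "response"))
})

avg_prob                 # predicted probability by period
avg_prob - avg_prob[1]   # change relative to the first period
```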
You could use the ggeffects-package (examples in the package-vignettes). So, for your code this might look like this:
library(ggeffects)
# dat is a data frame with marginal effects
dat <- ggpredict(cfelr, terms = "rank")
plot(dat)
Or, as Benjamin described, you could use the sjPlot package, using the plot_model() function with plot-type "pred" (this simply wraps the ggeffects package for marginal effect plots):
library(sjPlot)
plot_model(cfelr, type = "pred", terms = "rank")
This is a much less technical answer, but perhaps provides a useful resource. I am a fan of the sjPlot package which provides plots of marginal effects of glmer objects, like so:
library(sjPlot)
sjp.glmer(cfelr, type = "eff")
The package provides a lot of options for exploring a glmer model's fixed and random effects as well. https://github.com/strengejacke/sjPlot
My solution does not answer the question,
"Is there a way of getting “marginal effects” from a glmer object",
but rather,
"Is there a way of getting marginal logistic regression coefficients from a conditional logistic regression with one random intercept?"
I am only offering this write-up because the reproducible example provided was a conditional logistic regression with one random intercept and I'm intending to be helpful. Please do not downvote; I will take down if this answer is deemed too off topic.
The R-code is based on the work of Patrick Heagerty (click "View Raw" to see pdf), and I include a reproducible example below from my github version of his lnMLE package (excuse the warnings at installation -- I'm shoehorning Patrick's non-CRAN package). I'm omitting the output for all except the last line, compare, which shows the fixed effect coefficients side-by-side.
library(devtools)
install_github("swihart/lnMLE_1.0-2")
library(lnMLE)
## run the example from the logit.normal.mle help page
## see also the accompanying document (click 'View Raw' on page below:)
## https://github.com/swihart/lnMLE_1.0-2/blob/master/inst/doc/lnMLEhelp.pdf
data(eye_race)
attach(eye_race)
marg_model <- logit.normal.mle(meanmodel = value ~ black,
logSigma= ~1,
id=eye_race$id,
model="marginal",
data=eye_race,
tol=1e-5,
maxits=100,
r=50)
marg_model
cond_model <- logit.normal.mle(meanmodel = value ~ black,
logSigma= ~1,
id=eye_race$id,
model="conditional",
data=eye_race,
tol=1e-5,
maxits=100,
r=50)
cond_model
compare<-round(cbind(marg_model$beta, cond_model$beta),2)
colnames(compare)<-c("Marginal", "Conditional")
compare
The output of the last line:
compare
Marginal Conditional
(Intercept) -2.43 -4.94
black 0.08 0.15
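The shrinkage from conditional to marginal coefficients seen in such a table can be illustrated with a well-known approximation: for a logistic model with a normal random intercept of standard deviation sigma, the marginal linear predictor is roughly the conditional one divided by sqrt(1 + c^2 * sigma^2), with c = 16 * sqrt(3) / (15 * pi). The sketch below compares that approximation with numerical integration; sigma and eta_cond are illustrative assumptions, not values from the lnMLE fit.

```r
sigma    <- 1.5   # illustrative random-intercept SD (assumption)
eta_cond <- -2    # illustrative conditional linear predictor (assumption)

## "exact" marginal probability by integrating over the random intercept
exact_marg <- integrate(
  function(b) plogis(eta_cond + b) * dnorm(b, sd = sigma),
  lower = -Inf, upper = Inf)$value

## attenuation approximation
c2 <- (16 * sqrt(3) / (15 * pi))^2
approx_marg <- plogis(eta_cond / sqrt(1 + c2 * sigma^2))

c(exact = exact_marg, approximate = approx_marg)
```

This is only an approximation; the lnMLE approach fits the marginal model directly rather than rescaling conditional estimates.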
I attempted the reproducible example given, but had problems with both the glmer and lnMLE implementations; again I only include output pertaining to the comparison results and the warnings from the glmer() call:
##original question / answer... glmer() function gave a warning and the lnMLE did not fit well...
## mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")  # original URL
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank) #creating ranks
id <- rep(1:ceiling(nrow(mydata)/2), times=c(2)) #creating ID variable
mydata <- cbind(mydata,data.frame(id,stringsAsFactors=FALSE))
set.seed(12345)
mydata$ran <- runif(nrow(mydata),0,1) #creating a random variable
library(lme4)
cfelr <- glmer(admit ~ (1 | id) + rank + gpa + ran + gre,
data=mydata,
family = binomial)
Which gave:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00161047 (tol = 0.001, component 2)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
but I foolishly went on without rescaling, trying to apply the logit.normal.mle to the example given. However, the conditional model doesn't converge or produce standard error estimates,
summary(cfelr)
library(devtools)
install_github("swihart/lnMLE_1.0-2")
library(lnMLE)
mydata$rank2 = mydata$rank==2
mydata$rank3 = mydata$rank==3
mydata$rank4 = mydata$rank==4
cfelr_cond = logit.normal.mle(meanmodel = admit ~ rank2+rank3+rank4+gpa+ran+gre,
logSigma = ~1 ,
id=id,
model="conditional",
data=mydata,
r=50,
tol=1e-6,
maxits=500)
cfelr_cond
cfelr_marg = logit.normal.mle(meanmodel = admit ~ rank2+rank3+rank4+gpa+ran+gre,
logSigma = ~1 ,
id=id,
model="marginal",
data=mydata,
r=50,
tol=1e-6,
maxits=500)
cfelr_marg
compare_glmer<-round(cbind(cfelr_marg$beta, cfelr_cond$beta,summary(cfelr)$coeff[,"Estimate"]),3)
colnames(compare_glmer)<-c("Marginal", "Conditional","glmer() Conditional")
compare_glmer
The last line of which reveals that the conditional model from cfelr_cond did not evaluate a conditional model but just returned the marginal coefficients and no standard errors.
> compare_glmer
Marginal Conditional glmer() Conditional
(Intercept) -4.407 -4.407 -4.425
rank2 -0.667 -0.667 -0.680
rank3 -1.832 -1.833 -1.418
rank4 -1.930 -1.930 -1.585
gpa 0.547 0.548 0.869
ran 0.860 0.860 0.413
gre 0.004 0.004 0.002
I hope to iron out these issues. Any help/comments appreciated. I'll give status updates when I can.

Testing differences in coefficients including interactions from piecewise linear model

I'm running a piecewise linear random coefficient model testing the influence of a covariate on the second piece. Thereby, I want to test whether the coefficient of the second piece under the influence of the covariate (piece2 + piece2:covariate) differs from the coefficient of the first piece (piece1), hence whether the growth rate differs.
I set up some exemplary data:
set.seed(100)
# set up dependent variable
temp <- rep(seq(0,23),50)
y <- c(rep(seq(0,23),50)+rnorm(24*50), ifelse(temp <= 11, temp + runif(1200), temp + rnorm(1200) + (temp/sqrt(temp))))
# set up ID variable, variables indicating pieces and the covariate
id <- sort(rep(seq(1,100),24))
piece1 <- rep(c(seq(0,11), rep(11,12)),100)
piece2 <- rep(c(rep(0,12), seq(1,12)),100)
covariate <- c(rep(0,24*50), rep(c(rep(0,12), rep(1,12)), 50))
# data frame
example.data <- data.frame(id, y, piece1, piece2, covariate)
# run piecewise linear random effects model and show results
library(lme4)
lmer.results <- lmer(y ~ piece1 + piece2*covariate + (1|id) , example.data)
summary(lmer.results)
I came across the linearHypothesis() command from the car package to test differences in coefficients. However, I could not find an example on how to use it when including interactions.
Can I even use linearHypothesis() to test this or am I aiming for the wrong test?
I appreciate your help.
Many thanks in advance!
Mac
Assuming your output looks like this
Estimate Std. Error t value
(Intercept) 0.26293 0.04997 5.3
piece1 0.99582 0.00677 147.2
piece2 0.98083 0.00716 137.0
covariate 2.98265 0.09042 33.0
piece2:covariate 0.15287 0.01286 11.9
if I understand correctly what you want, you are looking for the contrast:
piece1-(piece2+piece2:covariate)
or
c(0,1,-1,0,-1)
My preferred tool for this is the function estimable in gmodels; you could also do it by hand or with one of the functions in Frank Harrell's packages.
library(gmodels)
estimable(lmer.results,c(0,1,-1,0,-1),conf.int=TRUE)
giving
Estimate Std. Error p value Lower.CI Upper.CI
(0 1 -1 0 -1) -0.138 0.0127 0 -0.182 -0.0928
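The "by hand" route mentioned above can be sketched with fixef() and vcov(). The data generation is copied from the question; the p-value uses a normal approximation, which is an assumption of this sketch and may differ from what estimable() reports.

```r
library(lme4)

## example data from the question
set.seed(100)
temp <- rep(seq(0, 23), 50)
y <- c(rep(seq(0, 23), 50) + rnorm(24 * 50),
       ifelse(temp <= 11, temp + runif(1200),
              temp + rnorm(1200) + (temp / sqrt(temp))))
id <- sort(rep(seq(1, 100), 24))
piece1 <- rep(c(seq(0, 11), rep(11, 12)), 100)
piece2 <- rep(c(rep(0, 12), seq(1, 12)), 100)
covariate <- c(rep(0, 24 * 50), rep(c(rep(0, 12), rep(1, 12)), 50))
example.data <- data.frame(id, y, piece1, piece2, covariate)

fit <- lmer(y ~ piece1 + piece2 * covariate + (1 | id), example.data)

## contrast: piece1 - (piece2 + piece2:covariate), in the order of fixef(fit)
k <- c(0, 1, -1, 0, -1)
names(k) <- names(fixef(fit))

est <- sum(k * fixef(fit))
se  <- sqrt(drop(t(k) %*% as.matrix(vcov(fit)) %*% k))
z   <- est / se
p   <- 2 * pnorm(-abs(z))   # normal approximation (assumption)

c(estimate = est, se = se, z = z, p = p)
```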

Anova Type 2 and Contrasts

The study design of the data I have to analyse is simple. There is 1 control group (CTRL) and 2 different treatment groups (TREAT_1 and TREAT_2). The data also include 2 covariates, COV1 and COV2. I have been asked to check if there is a linear or quadratic treatment effect in the data.
I created a dummy data set to explain my situation:
df1 <- data.frame(
Observation = factor(c(rep("CTRL",15), rep("TREAT_1",13), rep("TREAT_2", 12))),
COV1 = c(rep("A1", 30), rep("A2", 10)),
COV2 = c(rep("B1", 5), rep("B2", 5), rep("B3", 10), rep("B1", 5), rep("B2", 5), rep("B3", 10)),
Variable = c(3944133, 3632461, 3351754, 3655975, 3487722, 3644783, 3491138, 3328894,
3654507, 3465627, 3511446, 3507249, 3373233, 3432867, 3640888,
3677593, 3585096, 3441775, 3608574, 3669114, 4000812, 3503511, 3423968,
3647391, 3584604, 3548256, 3505411, 3665138,
4049955, 3425512, 3834061, 3639699, 3522208, 3711928, 3576597, 3786781,
3591042, 3995802, 3493091, 3674475)
)
plot(Variable ~ Observation, data = df1)
As you can see from the plot, there is a linear relationship between the control and the treatment groups. To check whether this linear effect is statistically significant, I change the contrasts using the contr.poly() function and fit a linear model like this:
contrasts(df1$Observation) <- contr.poly(levels(df1$Observation))
lm1 <- lm(log(Variable) ~ Observation, data = df1)
summary.lm(lm1)
From the summary we can see that the linear effect is statistically significant:
Observation.L 0.029141 0.012377 2.355 0.024 *
Observation.Q 0.002233 0.012482 0.179 0.859
However, this first model does not include any of the two covariates. Including them results in a non-significant p-value for the linear relationship:
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
summary.lm(lm2)
Observation.L 0.04116 0.02624 1.568 0.126
Observation.Q 0.01003 0.01894 0.530 0.600
COV1A2 -0.01203 0.04202 -0.286 0.776
COV2B2 -0.02071 0.02202 -0.941 0.354
COV2B3 -0.02083 0.02066 -1.008 0.320
So far so good. However, I have been told to conduct a Type II Anova rather than Type I. To conduct a Type II Anova I used the Anova() function from the car package.
Anova(lm2, type="II")
Anova Table (Type II tests)
Response: log(Variable)
Sum Sq Df F value Pr(>F)
Observation 0.006253 2 1.4651 0.2453
COV1 0.000175 1 0.0820 0.7763
COV2 0.002768 2 0.6485 0.5292
Residuals 0.072555 34
The problem with using Type II here is that you do not get a p-value for the linear and quadratic effects separately. So I do not know whether the effect is statistically linear and/or quadratic.
I found out that the following code produces the same p-value for Observation as the Anova() function. But the result also does not include any p-values for the linear or quadratic effect:
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
lm3 <- lm(log(Variable) ~ COV1 + COV2, data = df1)
anova(lm2, lm3)
Does anybody know how to conduct a Type II anova and the contrasts function to obtain the p-values for the linear and quadratic effects?
Help would be very much appreciated.
Best
Peter
I found one partial workaround for this, but it may require further correction. The documentation for the function drop1() from the stats package indicates that this function produces Type II sums of squares (although this page, http://www.statmethods.net/stats/anova.html, declares that drop1() produces Type III sums of squares, and I didn't spend too much time poring over http://afni.nimh.nih.gov/sscc/gangc/SS.html to cross-check the sums of squares calculations). You could use it to calculate everything manually, but I suspect you're asking this question because it would be nice if someone had already worked through it.
Anyway, I added a second vector to the dummy data called Observation2, and set it up with just the linear contrasts (you can only specify one set of contrasts for a given vector at a given time):
df1[,"Observation2"]<-df1$Observation
contrasts(df1$Observation2, how.many=1)<-contr.poly
Then created a third linear model:
lm3<-lm(log(Variable)~Observation2+COV1+COV2, data=df1)
And conducted F tests with drop1 to compare F statistics from Type II ANOVAs between the two models:
lm2, which contains both the linear and quadratic terms:
drop1(lm2, test="F")
lm3, which contains just the linear contrasts:
drop1(lm3, test="F")
This doesn't include a direct comparison of the models against each other, although the F statistic is higher (and p value accordingly lower) for the linear model, which would lead one to rely upon it instead of the quadratic model.
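A variation on the same drop1() idea, offered only as a sketch: code the linear and quadratic polynomial contrasts as two separate numeric columns, so that drop1() reports a single-degree-of-freedom F test for each, adjusted for the other terms in the model. The column names lin and quad are hypothetical, and whether these partial tests are what was meant by "Type II" for the individual contrasts is an assumption.

```r
df1 <- data.frame(
  Observation = factor(c(rep("CTRL", 15), rep("TREAT_1", 13), rep("TREAT_2", 12))),
  COV1 = c(rep("A1", 30), rep("A2", 10)),
  COV2 = c(rep("B1", 5), rep("B2", 5), rep("B3", 10),
           rep("B1", 5), rep("B2", 5), rep("B3", 10)),
  Variable = c(3944133, 3632461, 3351754, 3655975, 3487722, 3644783, 3491138, 3328894,
               3654507, 3465627, 3511446, 3507249, 3373233, 3432867, 3640888,
               3677593, 3585096, 3441775, 3608574, 3669114, 4000812, 3503511, 3423968,
               3647391, 3584604, 3548256, 3505411, 3665138,
               4049955, 3425512, 3834061, 3639699, 3522208, 3711928, 3576597, 3786781,
               3591042, 3995802, 3493091, 3674475)
)

## polynomial contrast scores as ordinary numeric covariates
P <- contr.poly(nlevels(df1$Observation))
df1$lin  <- P[as.integer(df1$Observation), 1]  # linear score
df1$quad <- P[as.integer(df1$Observation), 2]  # quadratic score

full <- lm(log(Variable) ~ lin + quad + COV1 + COV2, data = df1)

## each row: F test for dropping that term while keeping all the others
tab <- drop1(full, test = "F")
tab
```

For the single-df terms, the F statistics equal the squared t statistics from summary(full), so the same p-values can also be read from the coefficient table of a model that includes the covariates.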
