I'm using the glht function (multcomp) to compute point estimates and standard errors of linear combinations of coefficients after a regression. The regression includes a four-way interaction term for which not all combinations have data points, so those coefficients come back as NA. When I use the glht function I get the following error message:
Error in modelparm.default(model, ...) :
dimensions of coefficients and covariance matrix don't match
When I do the same thing with regression results that don't include NA coefficients (for instance, with no four-way interaction), the glht command works, which makes me think the NA rows are the problem. I've done the same thing in Stata and it works, but I'm not sure what Stata does with those NA rows.
# regression
reg4 <- lm(choice_n ~ sex + e*c*r*r1 + brit_par + occupation + residency,
           data = newdata, na.action = na.omit, weights = Weight)
summary(reg4)
# multcomp package
install.packages("multcomp")
library(multcomp)
# linear combination of Ireland * Excellent English
summary(glht(reg4, linfct = c("cIreland + eExcellent:cIreland = 0")))
Here is part of the regression output:
Call:
lm(formula = choice_n ~ sex + e * c * r * r1 + brit_par + occupation +
residency, data = newdata, weights = Weight, na.action =
na.omit)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.0214 -0.3679 0.1630 0.2728 1.1978
Coefficients: (159 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 1.727242 0.028168 61.318
sexWoman -0.002559 0.006775 -0.378
eGood 0.123681 0.036599 3.379
eExcellent 0.073905 0.035809 2.064
cPoland 0.068136 0.040527 1.681
cItaly 0.010660 0.038552 0.277
cIndia 0.083874 0.040283 2.082
cPakistan 0.008349 0.052466 0.159
cNigeria -0.063009 0.051064 -1.234
cIreland 0.080269 0.031548 2.544
cAustralia 0.069785 0.031285 2.231
cSyria -0.025254 0.056821 -0.444
cSomalia 0.018274 0.051287 0.356
rMuslim -0.094982 0.064429 -1.474
rNo religion 0.016348 0.035872 0.456
r1Yes -0.002653 0.066710 -0.040
r1NA NA NA NA
brit_parBritish grandparent -0.034635 0.008292 -4.177
brit_parNeither -0.096933 0.008268 -11.724
occupationDoctor 0.045576 0.014274 3.193
occupationIT professional 0.020612 0.014246 1.447
occupationLanguage teacher -0.008675 0.014467 -0.600
occupationAdmin worker -0.049867 0.014283 -3.491
occupationFarmer -0.066845 0.014500 -4.610
occupationCleaner -0.093354 0.014352 -6.505
occupationUnemployed -0.359872 0.014304 -25.159
occupationStay at home parent -0.170595 0.014394 -11.851
residency6 years 0.023541 0.009569 2.460
residency10 years 0.085133 0.009553 8.912
residency20 years 0.115200 0.009600 12.000
eGood:cPoland -0.104022 0.060583 -1.717
eExcellent:cPoland -0.074505 0.058459 -1.274
eGood:cItaly -0.021625 0.056038 -0.386
eExcellent:cItaly 0.014001 0.056616 0.247
eGood:cIndia -0.085940 0.057982 -1.482
eExcellent:cIndia -0.022007 0.058273 -0.378
eGood:cPakistan -0.063008 0.074928 -0.841
eExcellent:cPakistan 0.024541 0.073136 0.336
eGood:cNigeria -0.016136 0.069657 -0.232
eExcellent:cNigeria 0.071779 0.070881 1.013
eGood:cIreland NA NA NA
eExcellent:cIreland NA NA NA
eGood:cAustralia NA NA NA
eExcellent:cAustralia NA NA NA
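For what it's worth, the mismatch is easy to see: with aliased terms, coef() keeps the NA entries while vcov() drops the corresponding rows and columns, so modelparm.default() finds objects of different dimensions. One possible workaround, sketched below and untested on this data, is to hand glht the estimable coefficients and covariance matrix yourself via multcomp's parm() and a hand-built contrast matrix. This only works for combinations of coefficients that were actually estimated; eExcellent:cIreland is NA here, so it cannot enter the combination.
# why the error occurs: the two objects disagree in dimension
length(coef(reg4))  # includes the 159 NA (aliased) coefficients
nrow(vcov(reg4))    # aliased rows/columns are already dropped

# workaround sketch: supply the estimable parameters directly
ok   <- !is.na(coef(reg4))
beta <- coef(reg4)[ok]          # estimated coefficients only
V    <- vcov(reg4)              # same dimension as beta
K    <- matrix(0, nrow = 1, ncol = length(beta),
               dimnames = list("cIreland", names(beta)))
K[, "cIreland"] <- 1
summary(glht(parm(beta, V), linfct = K))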
I have the (sample) dataset below:
round<-c( 0.125150, 0.045800, -0.955299, -0.232007, 0.120880, -0.041525, 0.290473, -0.648752, 0.113264, -0.403685)
square<-c(-0.634753, 0.000492, -0.178591, -0.202462, -0.592054, -0.583173, -0.632375, -0.176673, -0.680557, -0.062127)
ideo<-c(0,1,0,1,0,1,0,0,1,1)
ex<-data.frame(round,square,ideo)
When I ran the GEE regression in SPSS, I got a table of results.
I used the gee and geepack packages in R to run the same analysis and got these results:
# gee
library(gee)
summary(gee(ideo ~ square + round, data = ex, id = ideo,
            corstr = "independence"))
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 1.0541 0.4099 2.572 0.1328 7.937
square 1.1811 0.8321 1.419 0.4095 2.884
round 0.7072 0.5670 1.247 0.1593 4.439
# geepack
library(geepack)
summary(geeglm(ideo ~ square + round, data = ex, id = ideo,
               corstr = "independence"))
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 1.054 0.133 63.00 2.1e-15 ***
square 1.181 0.410 8.32 0.0039 **
round 0.707 0.159 19.70 9.0e-06 ***
---
I would like to recreate the SPSS table exactly (not the numerical results, since I use a subset of the original dataset), but I do not know how to obtain all of these columns.
A tiny bit of tidyverse magic can get the same results - more or less.
Get the information from coef(summary(geeglm())) and compute the necessary columns:
library("tidyverse")
library("geepack")
coef(summary(geeglm(ideo ~ square + round, data = ex, id = ideo,
                    corstr = "independence"))) %>%
  mutate(lowerWald = Estimate - 1.96*Std.err, # Lower Wald CI
         upperWald = Estimate + 1.96*Std.err, # Upper Wald CI
         df = 1,
         ExpBeta = exp(Estimate)) %>%         # Transformed estimate
  mutate(lWald = exp(lowerWald),              # Lower transformed
         uWald = exp(upperWald))              # Upper transformed
This produces the following (with the data you provided). The order and the names of the columns can be modified to suit your needs.
Estimate Std.err Wald Pr(>|W|) lowerWald upperWald df ExpBeta lWald uWald
1 1.0541 0.1328 62.997 2.109e-15 0.7938 1.314 1 2.869 2.212 3.723
2 1.1811 0.4095 8.318 3.925e-03 0.3784 1.984 1 3.258 1.460 7.270
3 0.7072 0.1593 19.704 9.042e-06 0.3949 1.019 1 2.028 1.484 2.772
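If you first assign the result of that pipe to an object, say res, you can then reorder and relabel the columns to mirror the SPSS layout. A small sketch (the SPSS-style labels below are guesses, not SPSS's exact headers):
res <- coef(summary(geeglm(ideo ~ square + round, data = ex, id = ideo,
                           corstr = "independence"))) %>%
  mutate(lowerWald = Estimate - 1.96*Std.err,
         upperWald = Estimate + 1.96*Std.err,
         df = 1,
         ExpBeta = exp(Estimate),
         lWald = exp(lowerWald),
         uWald = exp(upperWald))

# reorder and rename to taste
res %>%
  select(B = Estimate, `Std. Error` = Std.err, lowerWald, upperWald,
         Wald, df, Sig. = `Pr(>|W|)`, `Exp(B)` = ExpBeta, lWald, uWald)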
I have to fit an LMM with an interaction random effect but without the marginal random effect, using the lme command. That is, I want to fit the model in oats.lmer (see below) but using the function lme from the nlme package.
The code is
require("nlme")
require("lme4")
oats.lmer <- lmer(yield~nitro + (1|Block:Variety), data = Oats)
summary(oats.lmer)
#Linear mixed model fit by REML ['lmerMod']
#Formula: yield ~ nitro + (1 | Block:Variety)
# Data: Oats
#
#REML criterion at convergence: 598.1
#
#Scaled residuals:
# Min 1Q Median 3Q Max
#-1.66482 -0.72807 -0.00079 0.56416 1.85467
#
#Random effects:
# Groups Name Variance Std.Dev.
# Block:Variety (Intercept) 306.8 17.51
# Residual 165.6 12.87
#Number of obs: 72, groups: Block:Variety, 18
#
#Fixed effects:
# Estimate Std. Error t value
#(Intercept) 81.872 4.846 16.90
#nitro 73.667 6.781 10.86
#
#Correlation of Fixed Effects:
# (Intr)
#nitro -0.420
I started playing with this
oats.lme <- lme(yield~nitro, data = Oats, random = (~1|Block/Variety))
summary(oats.lme)
#Linear mixed-effects model fit by REML
# Data: Oats
# AIC BIC logLik
# 603.0418 614.2842 -296.5209
#
#Random effects:
# Formula: ~1 | Block
# (Intercept)
#StdDev: 14.50596
#
# Formula: ~1 | Variety %in% Block
# (Intercept) Residual
#StdDev: 11.00468 12.86696
#
#Fixed effects: yield ~ nitro
# Value Std.Error DF t-value p-value
#(Intercept) 81.87222 6.945273 53 11.78819 0
#nitro 73.66667 6.781483 53 10.86291 0
# Correlation:
# (Intr)
#nitro -0.293
#
#Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
#-1.74380770 -0.66475227 0.01710423 0.54298809 1.80298890
#
#Number of Observations: 72
#Number of Groups:
# Block Variety %in% Block
# 6 18
but the problem is that it also includes a marginal random effect for Block, which I want to omit.
The question is: how do I specify the random effects in oats.lme such that it is identical (at least structurally) to oats.lmer?
It can be as simple as the following:
library(nlme)
data(Oats)
## construct an auxiliary factor `f` for interaction / nesting effect
Oats$f <- with(Oats, Block:Variety)
## use `random = ~ 1 | f`
lme(yield ~ nitro, data = Oats, random = ~ 1 | f)
#Linear mixed-effects model fit by REML
# Data: Oats
# Log-restricted-likelihood: -299.0328
# Fixed: yield ~ nitro
#(Intercept) nitro
# 81.87222 73.66667
#
#Random effects:
# Formula: ~1 | f
# (Intercept) Residual
#StdDev: 17.51489 12.86695
#
#Number of Observations: 72
#Number of Groups: 18
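As a quick sanity check (a sketch; oats.lme.f is just a throwaway name for the same refit), the fit matches oats.lmer: minus twice the restricted log-likelihood equals lmer's REML criterion, and the variance components and fixed effects agree.
oats.lme.f <- lme(yield ~ nitro, data = Oats, random = ~ 1 | f)
-2 * c(logLik(oats.lme.f))  # ~598.1, the REML criterion from oats.lmer
VarCorr(oats.lme.f)         # intercept SD 17.51, residual 12.87, as in lmer
fixef(oats.lme.f)           # 81.87222 and 73.66667, as in oats.lmer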
Does anyone know how to do post hoc tests in an ANCOVA model with a factorial design?
I have two vectors consisting of 23 baseline values (covariate) and 23 values after treatment (dependent variable), and I have two factors, each with two levels. I created an ANCOVA model and calculated the adjusted means, standard errors, and confidence intervals. Example:
library(effects)
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
ANC = aov(after_treatment ~ baseline + treatment*age)
effect_treatage = effect("treatment*age",ANC)
data.frame(effect_treatage)
treatment age fit se lower upper
1 Drug Old 0.8232137 0.1455190 0.5174897 1.1289377
2 Placebo Old 0.6168641 0.1643178 0.2716452 0.9620831
3 Drug Young 0.5689036 0.1469175 0.2602413 0.8775659
4 Placebo Young 0.7603360 0.1462715 0.4530309 1.0676410
Now I want to test whether there is a difference between the adjusted means of:
Young-Placebo:Young-Drug
Old-Placebo:Old-Drug
Young-Placebo:Old-Drug
Old-Placebo:Young-Drug
So I tried:
library(multcomp)
pH = glht(ANC, linfct = mcp(treatment*age="Tukey"))
# Error: unexpected '=' in "pH = glht(ANC, linfct = mcp(treatment*age="
And:
pH = TukeyHSD(ANC)
# Error in rep.int(n, length(means)) : unimplemented type 'NULL' in 'rep3'
# In addition: Warning message:
# In replications(paste("~", xx), data = mf) : non-factors ignored: baseline
Does anyone know how to resolve this?
Many thanks!
PS: for more info, see R: How to graphically plot adjusted means, SE, CI ANCOVA.
If you wish to use multcomp, you can take advantage of the wonderful and seamless interface between the lsmeans and multcomp packages (see ?lsm): lsmeans provides lsm objects that glht() accepts as its linfct argument.
baseline = c(0.7672,1.846,0.6487,0.4517,0.5599,0.2255,0.5946,1.435,0.5374,0.4901,1.258,0.5445,1.078,1.142,0.5,1.044,0.7824,1.059,0.6802,0.8003,0.5547,1.003,0.9213)
after_treatment = c(0.4222,1.442,0.8436,0.5544,0.8818,0.08789,0.6291,1.23,0.4093,0.7828,-0.04061,0.8686,0.8525,0.8036,0.3758,0.8531,0.2897,0.8127,1.213,0.05276,0.7364,1.001,0.8974)
age = factor(c(rep(c("Young","Old"),11),"Young"))
treatment = factor(c(rep("Drug",12),rep("Placebo",11)))
Treat <- data.frame(baseline, after_treatment, age, treatment)
ANC <- aov(after_treatment ~ baseline + treatment*age, data=Treat)
library(multcomp)
library(lsmeans)
summary(glht(ANC, linfct = lsm(pairwise ~ treatment * age)))
## Note: df set to 18
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: aov(formula = after_treatment ~ baseline + treatment * age, data = Treat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Drug,Old - Placebo,Old == 0 0.20635 0.21913 0.942 0.783
## Drug,Old - Drug,Young == 0 0.25431 0.20698 1.229 0.617
## Drug,Old - Placebo,Young == 0 0.06288 0.20647 0.305 0.990
## Placebo,Old - Drug,Young == 0 0.04796 0.22407 0.214 0.996
## Placebo,Old - Placebo,Young == 0 -0.14347 0.22269 -0.644 0.916
## Drug,Young - Placebo,Young == 0 -0.19143 0.20585 -0.930 0.789
## (Adjusted p values reported -- single-step method)
This eliminates the need for reparametrization. You can achieve the same results by using lsmeans alone:
lsmeans(ANC, list(pairwise ~ treatment * age))
## $`lsmeans of treatment, age`
## treatment age lsmean SE df lower.CL upper.CL
## Drug Old 0.8232137 0.1455190 18 0.5174897 1.1289377
## Placebo Old 0.6168641 0.1643178 18 0.2716452 0.9620831
## Drug Young 0.5689036 0.1469175 18 0.2602413 0.8775659
## Placebo Young 0.7603360 0.1462715 18 0.4530309 1.0676410
##
## Confidence level used: 0.95
##
## $`pairwise differences of contrast`
## contrast estimate SE df t.ratio p.value
## Drug,Old - Placebo,Old 0.20634956 0.2191261 18 0.942 0.7831
## Drug,Old - Drug,Young 0.25431011 0.2069829 18 1.229 0.6175
## Drug,Old - Placebo,Young 0.06287773 0.2064728 18 0.305 0.9899
## Placebo,Old - Drug,Young 0.04796056 0.2240713 18 0.214 0.9964
## Placebo,Old - Placebo,Young -0.14347183 0.2226876 18 -0.644 0.9162
## Drug,Young - Placebo,Young -0.19143238 0.2058455 18 -0.930 0.7893
##
## P value adjustment: tukey method for comparing a family of 4 estimates
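As a side note, the lsmeans package has since been superseded by emmeans, where essentially the same call should work:
library(emmeans)
emmeans(ANC, pairwise ~ treatment * age)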
You need to use the which argument of TukeyHSD: "listing terms in the fitted model for which the intervals should be calculated". This is needed because you have a non-factor variable in the model (baseline); it causes trouble when included, and by default (when which is not specified) all terms are included.
ANC = aov(after_treatment ~ baseline + treatment*age)
TukeyHSD(ANC, which = c("treatment:age"))
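If a quick visual of the intervals is all you need, base R also provides a plot method for TukeyHSD objects:
plot(TukeyHSD(ANC, which = "treatment:age"))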
If you wish to use the more flexible glht, see section 3 (page 8 onward) here.
Reparametrization is a possibility here:
treatAge <- interaction(treatment, age)
ANC1 <- aov(after_treatment ~ baseline + treatAge)
#fits are equivalent:
all.equal(logLik(ANC), logLik(ANC1))
#[1] TRUE
library(multcomp)
summary(glht(ANC1, linfct = mcp(treatAge="Tukey")))
# Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: aov(formula = after_treatment ~ baseline + treatAge)
#
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#Placebo.Old - Drug.Old == 0 -0.20635 0.21913 -0.942 0.783
#Drug.Young - Drug.Old == 0 -0.25431 0.20698 -1.229 0.617
#Placebo.Young - Drug.Old == 0 -0.06288 0.20647 -0.305 0.990
#Drug.Young - Placebo.Old == 0 -0.04796 0.22407 -0.214 0.996
#Placebo.Young - Placebo.Old == 0 0.14347 0.22269 0.644 0.916
#Placebo.Young - Drug.Young == 0 0.19143 0.20585 0.930 0.789
#(Adjusted p values reported -- single-step method)
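If you also want simultaneous confidence intervals rather than adjusted p values, glht objects support confint() directly:
confint(glht(ANC1, linfct = mcp(treatAge = "Tukey")))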
I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.
An example
Suppose you had some data for 300 movie audience members from three different cities:
Chicago
Milwaukee
Dayton
And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale.
Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)
Here's what that might look like in R:
> score <- rnorm(n = 300, mean = -50, sd = 10)
> city <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
score city
1 -64.57515 Chicago
2 -50.51050 Milwaukee
3 -56.51409 Dayton
4 -45.55133 Chicago
5 -47.88686 Milwaukee
6 -51.22812 Dayton
Great. So let's run a quick linear model, assign it to lm1, and get its summary output:
> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)
Call:
lm(formula = score ~ city, data = spider.man.3.sucked)
Residuals:
Min 1Q Median 3Q Max
-29.8515 -6.1090 -0.4745 6.0340 26.2616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.3621 0.9630 -53.337 <2e-16 ***
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693, Adjusted R-squared: -0.004023
F-statistic: 0.4009 on 2 and 297 DF, p-value: 0.6701
What's bugging me
The part I want to highlight is this:
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
It looks like R sensibly concatenated the column name (city, if you remember from above) with the distinct value (in this case either Dayton or Milwaukee). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:
Dayton 1.1892 1.3619 0.873 0.383
Milwaukee 0.8288 1.3619 0.609 0.543
Two questions in one
So,
What's controlling the format of the output for linear model summary rows, and
Can/should I change it?
The extractor function for that component of a summary object is coef. Does this provide the means to control your output acceptably?
summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
# Estimate Std. Error t value Pr(>|t|)
# Dayton 0.8133 1.485 0.5478 0.5842
# Milwaukee 0.3891 1.485 0.2621 0.7934
(No random seed was set, so these values cannot match yours.)
As for question 1, it appears to happen inside model.matrix.default(), and within R's internal compiled code at that.
It might be difficult to change easily; the obvious way would be to write your own model.matrix.default() that calls the original and updates the names afterwards, but this isn't tested or tried.
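You can watch the naming happen at the design-matrix stage, before lm() does any fitting:
colnames(model.matrix(score ~ city, data = spider.man.3.sucked))
# [1] "(Intercept)"   "cityDayton"    "cityMilwaukee"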
Here is a hack:
# RUN REGRESSION
require(reshape2)  # the tips data set ships with reshape2
lm1 = lm(tip ~ total_bill + sex + day, data = tips)

# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
  mydf = mod$model
  # PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS
  vars = names(mydf)[-1]
  eachlen = sapply(mydf[, vars, drop = FALSE], function(x)
    ifelse(is.numeric(x), 1, length(unique(x)) - 1))
  vars = rep(vars, eachlen)
  # REPLACE COEF NAMES WITH VARIABLE NAME WHEN APPROPRIATE
  coefs = names(mod$coefficients)[-1]  # coefficient names of the model passed in
  coefs2 = stringr::str_replace(coefs, vars, "")
  names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)
  return(mod)
}
summary(remove_factors(lm1))
summary(remove_factors(lm1))
This gives
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.95588 0.27579 3.47 0.00063 ***
total_bill 0.10489 0.00758 13.84 < 2e-16 ***
Male -0.03844 0.14215 -0.27 0.78706
Sat -0.08088 0.26226 -0.31 0.75806
Sun 0.08282 0.26741 0.31 0.75706
Thur -0.02063 0.26975 -0.08 0.93910
However, it is not always advisable to do this, as you can see by running the same hack on a different regression. It is not clear what the Yes in the last row stands for; R writes it as smokerYes by default precisely to make its meaning clear. So use with caution.
lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05182 0.29315 3.59 0.00040 ***
total_bill 0.10569 0.00763 13.86 < 2e-16 ***
Male -0.03769 0.14217 -0.27 0.79114
Sat -0.12636 0.26648 -0.47 0.63582
Sun 0.00407 0.27959 0.01 0.98841
Thur -0.09283 0.27994 -0.33 0.74048
Yes -0.13935 0.14422 -0.97 0.33489