post hoc test for linear mixed model with two variables - r

I built a linear mixed model and did a post hoc test for it. Fixed factors are the phase numbers (time) and the group.
statistic_of_comp <- function(x, df) {
  # needs lme4 (lmer) and multcomp (glht, mcp) loaded
  x.full.1 <- lmer(x ~ phase_num + group + (1 | mouse), data = df, REML = FALSE)
  x_phase.null.1 <- lmer(x ~ group + (1 | mouse), data = df, REML = FALSE)
  print(anova(x.full.1, x_phase.null.1))
  summary(glht(x.full.1, linfct = mcp(phase_num = "Tukey")))
}
Now my problem is that I want to do a post hoc test with more than one fixed factor. I found the following
linfct = mcp(phase_num = "Tukey", group = "Tukey")
but that doesn't give the result I want. At the moment I get the comparison for the groups with Tukey (every group with every other group) and the comparison between the two phases.
What I want is a comparison of the phase_numbers for every group.
e.g. group1 phase1-phase2 ..., group2 phase1-phase2 etc.

I'm sure you can do this with multcomp, but let me illustrate how to do it with the emmeans package. I'm going to use a regular linear model (since you haven't given a reproducible example), but the recipe below should work just as well with a mixed model.
Linear model from ?emmeans (using a built-in data set):
warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
Apply emmeans(), followed by the pairs() function:
pairs(emmeans(warp.lm, ~ tension | wool))
wool = A:
contrast estimate SE df t.ratio p.value
L - M 20.556 5.16 48 3.986 0.0007
L - H 20.000 5.16 48 3.878 0.0009
M - H -0.556 5.16 48 -0.108 0.9936
wool = B:
contrast estimate SE df t.ratio p.value
L - M -0.556 5.16 48 -0.108 0.9936
L - H 9.444 5.16 48 1.831 0.1704
M - H 10.000 5.16 48 1.939 0.1389
For more information, see ?pairs.emmGrid or vignette("comparisons",package="emmeans") (which clarifies that these tests do indeed use Tukey comparisons by default).
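For the mixed model in the question itself, the recipe would look roughly as follows. This is a sketch, not tested against your data: it assumes your data frame df with factor columns phase_num and group and the random intercept for mouse, and it refits the model with the phase_num:group interaction, without which the phase comparisons would be identical in every group.
library(lme4)
library(emmeans)
# hypothetical refit including the interaction
x.full.int <- lmer(x ~ phase_num * group + (1 | mouse), data = df, REML = FALSE)
# phase comparisons within each group (Tukey-adjusted by default)
pairs(emmeans(x.full.int, ~ phase_num | group))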

Related

Zero-inflated two-part models in GLMMadaptive (R): anova on fixed effects zero-part?

I'm running a hurdle lognormal model using the GLMMadaptive package in R. Both the continuous part and the zero part have categorical variables in their fixed effects. I would like to run an ANOVA on these categorical variables to detect whether there is a main effect.
I've seen that with the glmmTMB package you can run an ANOVA on the conditional model and on the zero-part model separately, as is demonstrated here.
Is there a similar strategy available for the GLMMadaptive package? (glmmTMB does not support hurdle lognormal models, as far as I understand.) Perhaps using the joint_tests function from the emmeans package? If so, how do you define that you want to test the zero-part model? emmeans::joint_tests(hurdlemodel) only gives the F-tests for the conditional part of the model.
Or, as an alternative method, could you compare the fit of a model that excludes the variable of interest against the full model, as is demonstrated for the relevance of random effects in this vignette?
Many thanks!
The suggestions by Russ Lenth in the comments are implemented below, using the data and model from the GLMMadaptive two-part model vignette:
library(GLMMadaptive)
library(emmeans)
# data generating code from the vignette:
{
  set.seed(1234)
  n <- 100    # number of subjects
  K <- 8      # number of measurements per subject
  t_max <- 5  # maximum follow-up time
  # we construct a data frame with the design:
  # everyone has a baseline measurement, and then measurements at random follow-up times
  DF <- data.frame(id = rep(seq_len(n), each = K),
                   time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                   sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
  # design matrices for the fixed and random effects, non-zero part
  X <- model.matrix(~ sex * time, data = DF)
  Z <- model.matrix(~ 1, data = DF)
  # design matrices for the fixed and random effects, zero part
  X_zi <- model.matrix(~ sex, data = DF)
  Z_zi <- model.matrix(~ 1, data = DF)
  betas <- c(1.5, 0.05, 0.05, -0.03)  # fixed effects coefficients, non-zero part
  shape <- 2                          # shape/size parameter of the negative binomial distribution
  gammas <- c(-1.5, 0.5)              # fixed effects coefficients, zero part
  D11 <- 0.5  # variance of random intercepts, non-zero part
  D22 <- 0.4  # variance of random intercepts, zero part
  # we simulate random effects
  b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)))
  # linear predictor, non-zero part
  eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1, drop = FALSE]))
  # linear predictor, zero part
  eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 2, drop = FALSE]))
  # we simulate negative binomial longitudinal data
  DF$y <- rnbinom(n * K, size = shape, mu = exp(eta_y))
  # we set the extra zeros
  DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
}
# create a categorical time variable
DF$time_categorical[DF$time < 2.5] <- "early"
DF$time_categorical[DF$time >= 2.5] <- "late"
DF$time_categorical <- as.factor(DF$time_categorical)
# model with an interaction in the fixed-effects zero part and a random intercept in the zero part
km3 <- mixed_model(y ~ sex * time_categorical, random = ~ 1 | id, data = DF,
                   family = hurdle.lognormal(), n_phis = 1,
                   zi_fixed = ~ sex * time_categorical, zi_random = ~ 1 | id)
#### attempt at the qdrg() function in emmeans ####
coef_zero_part <- fixef(km3, sub_model = "zero_part")
vcov_zero_part <- vcov(km3)[9:12, 9:12]  # rows/columns of vcov(km3) for the zero-part coefficients
qd_km3 <- emmeans::qdrg(formula = ~ sex * time_categorical, data = DF,
                        coef = coef_zero_part, vcov = vcov_zero_part)
Output:
> joint_tests(qd_km3)
model term df1 df2 F.ratio p.value
sex 1 Inf 11.509 0.0007
time_categorical 1 Inf 0.488 0.4848
sex:time_categorical 1 Inf 1.077 0.2993
> emmeans(qd_km3, pairwise ~ sex|time_categorical)
$emmeans
time_categorical = early:
sex emmean SE df asymp.LCL asymp.UCL
male -1.592 0.201 Inf -1.99 -1.198
female -1.035 0.187 Inf -1.40 -0.669
time_categorical = late:
sex emmean SE df asymp.LCL asymp.UCL
male -1.914 0.247 Inf -2.40 -1.429
female -0.972 0.188 Inf -1.34 -0.605
Confidence level used: 0.95
$contrasts
time_categorical = early:
contrast estimate SE df z.ratio p.value
male - female -0.557 0.270 Inf -2.064 0.0390
time_categorical = late:
contrast estimate SE df z.ratio p.value
male - female -0.942 0.306 Inf -3.079 0.0021
Checking if contrasts correspond with zero-part fixed effects:
> fixef(km3, sub_model = "zero_part")
(Intercept) sexfemale time_categoricallate sexfemale:time_categoricallate
-1.5920415 0.5568072 -0.3220390 0.3849780
> (-1.5920) - (-1.5920 + 0.5568)
[1] -0.5568  # matches the contrast within the "early" level of time_categorical
> (-1.5920 + -0.3220) - (-1.5920 + -0.3220 + 0.5568 + 0.3850)
[1] -0.9418  # matches the contrast within the "late" level of time_categorical
The function emmeans::qdrg() can sometimes be used to create the needed object for a model not directly supported by emmeans. See its documentation. In very simple models (e.g., those inheriting from lm), it may be enough to supply the object and data arguments.
That usually does not work for more sophisticated models, in which case
you will need to specify data, the fixed-effects formula for the conditional or zero part of the model, and the associated regression coefficients (coef) and variance-covariance matrix (vcov) for that part. With models like this that have multiple components, you will likely have to pick out the relevant subset of the coefficients and covariance matrix. These all must conform: the length of coef must equal both the number of rows and columns of vcov and the number of columns in the model matrix generated by formula [which may be checked via model.matrix(formula, data = data)].
qdrg() will not work for a multivariate model -- or at least it's tricky -- because the implied model involves other factor(s) that delineate the levels of the multivariate response. If there are special provisions for, say, spline smoothing, that is another instance where qdrg() probably can't be made to work.
Once qdrg() actually runs and produces results, it is a good idea to use it to estimate some contrasts that correspond directly to the model parameterization. For example, suppose the model was fitted with the default contr.treatment contrasts. Then the regression coefficients are interpretable as comparisons with the first level as a reference level. Accordingly, if we compute rg <- qdrg(...), and one of the factors is treat, look at contrast(rg, "trt.vs.ctrl1", simple = "treat"), and check whether the first set of estimated contrasts matches the main-effect estimates for treat.
I will illustrate all of this with a simple lm model, ignoring the fact that it is already supported by emmeans.
> warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
Here is the reference grid
> rg <- qdrg(~ wool * tension, coef = coef(warp.lm), vcov = vcov(warp.lm),
+ df = df.residual(warp.lm), data = warpbreaks)
Here is a sanity check -- First, look at the model summary:
> summary(warp.lm)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.55556 3.646761 12.217842 2.425903e-16
woolB -16.33333 5.157299 -3.167032 2.676803e-03
tensionM -20.55556 5.157299 -3.985721 2.280796e-04
tensionH -20.00000 5.157299 -3.877999 3.199282e-04
woolB:tensionM 21.11111 7.293523 2.894501 5.698287e-03
woolB:tensionH 10.55556 7.293523 1.447251 1.543266e-01
Second, look at selected contrasts:
> contrast(rg, "trt.vs.ctrl1", simple = "wool")
tension = L:
contrast estimate SE df t.ratio p.value
B - A -16.33 5.16 48 -3.167 0.0027
tension = M:
contrast estimate SE df t.ratio p.value
B - A 4.78 5.16 48 0.926 0.3589
tension = H:
contrast estimate SE df t.ratio p.value
B - A -5.78 5.16 48 -1.120 0.2682
> contrast(rg, "trt.vs.ctrl1", simple = "tension")
wool = A:
contrast estimate SE df t.ratio p.value
M - L -20.556 5.16 48 -3.986 0.0005
H - L -20.000 5.16 48 -3.878 0.0006
wool = B:
contrast estimate SE df t.ratio p.value
M - L 0.556 5.16 48 0.108 0.9863
H - L -9.444 5.16 48 -1.831 0.1338
P value adjustment: dunnettx method for 2 tests
Comparing with the regression coefficients, we do confirm that the first contrast for wool is estimated as -16.33, matching the regression coefficient for woolB. Also, the first set of contrasts for tension are estimated as -20.556 and -20.0, matching the regression coefficients for tensionM and tensionH. The SEs and t ratios match as well. (The P values for the second set do not match due to the multiplicity adjustment.)

Linear model in R doesn't fit properly

I know that the title doesn't specify exactly what I mean so let me explain it here.
I'm working on a dataset that consists of the yield of wheat for several wheat types (A, B, C, D). My issue when fitting a linear model is that when I fit:
lm1 = lm(yield ~ type), R omits the first wheat type (A), absorbs it into a global intercept, and then estimates the influence of all other types on the yield.
I know that I can fit a linear model like this:
lm2 = lm(yield ~ 0 + type), which gives me estimates of the influence of each type on the yield. However, what I really want to see is a sort of combination of the two.
Is there an option to fit a linear model in R such that
lm3 = lm(yield ~ GlobalIntercept + type), where GlobalIntercept would represent the general intercept of my linear model, and I could then see the influence of each type of wheat on that general intercept? So, like the first model, but this time estimating the influence of all types of wheat (A, B, C, D) relative to the general yield.
Questions to SO should include minimal reproducible example data -- see instructions at the top of the r tag page. Since the question did not include this we will provide it this time by using the built-in InsectSprays data set that comes with R.
Here are a few approaches:
1) lm/contr.sum/dummy.coef Try using contr.sum sum-to-zero contrasts for the spray factor and look at the dummy coefficients. That will expand the coefficients to include all 6 levels of the spray factor in this example:
fm <- lm(count ~ spray, InsectSprays, contrasts = list(spray = contr.sum))
dummy.coef(fm)
## Full coefficients are
##
## (Intercept): 9.5
## spray: A B C D E F
## 5.000000 5.833333 -7.416667 -4.583333 -6.000000 7.166667
sum(dummy.coef(fm)$spray) # check that coefs sum to zero
## [1] 0
2) tapply If each level has the same number of rows in the data set, as is the case with InsectSprays where each level has 12 rows, then we can take the mean of each level and subtract the intercept (which is the overall mean). This does not work if the data set is unbalanced, i.e. if the levels have different numbers of rows; a quick check of this appears after the code below. Note how the calculations below give the same result as (1).
mean(InsectSprays$count) # intercept
## [1] 9.5
with(InsectSprays, tapply(count, spray, mean) - mean(count))
## A B C D E F
## 5.000000 5.833333 -7.416667 -4.583333 -6.000000 7.166667
3) aov/model.tables We can also use aov with model.tables like this:
fm2 <- aov(count ~ spray, InsectSprays)
model.tables(fm2)
## Tables of effects
##
## spray
## spray
## A B C D E F
## 5.000 5.833 -7.417 -4.583 -6.000 7.167
model.tables(fm2, type = "means")
## Tables of means
## Grand mean
##
## 9.5
##
## spray
## spray
## A B C D E F
## 14.500 15.333 2.083 4.917 3.500 16.667
4) emmeans We can use lm followed by emmeans like this:
library(emmeans)
fm <- lm(count ~ spray, InsectSprays)
emmeans(fm, "spray")
## spray emmean SE df lower.CL upper.CL
## A 14.50 1.13 66 12.240 16.76
## B 15.33 1.13 66 13.073 17.59
## C 2.08 1.13 66 -0.177 4.34
## D 4.92 1.13 66 2.656 7.18
## E 3.50 1.13 66 1.240 5.76
## F 16.67 1.13 66 14.406 18.93
##
## Confidence level used: 0.95
From the information provided, I infer that you are modeling yield as a linear function of type, which has four categories, and that you expect an intercept in addition to coefficients for each of the types. This doesn't make sense.
You are predicting the yield from a nominal variable. For a regression with an intercept, the predictor variable needs an origin, that is, a meaningful zero value, and the defining property of a nominal variable is that it has no origin. With a continuous predictor, the intercept is the value of the dependent variable y when the predictor is zero; in your case, a type of zero is practically impossible. That is why your model takes one of the categories as a reference category and calculates the intercept for it. The change in y when the category differs from the reference category is given by the coefficients.
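To make the reference-category point concrete, here is a sketch using the built-in InsectSprays data (the wheat data are not shown in the question):
fm <- lm(count ~ spray, InsectSprays)
coef(fm)["(Intercept)"]     # mean response for spray A, the reference category
coef(fm)[-1]                # each other level's difference from spray A
coef(fm)[-1] + coef(fm)[1]  # recovers the other levels' mean responses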

Hypothesis test for intercepts in general mixed linear models with R

I have data of fixed effects: genotypes = C, E, K, M; age = 30, 45, 60, 75, 90 days; random effects: block = 1, 2, 3; and variable = weight_DM.
The file is in: https://drive.google.com/open?id=1_H6YZbdesK7pk5H23mZtp5KhVRKz0Ozl
I have the linear and quadratic slopes of age for each genotype, but I do not have the intercepts and their standard errors. The R code is:
library(nlme)
library(lme4)
library(car)
library(carData)
library(emmeans)
library(ggplot2)
library(Matrix)
library(multcompView)
datos_weight <- read.csv2("D:/investigacion/publicaciones/articulos-escribiendo/pennisetum/pennisetum-agronomicas/data_weight.csv",header=T, sep = ";", dec = ",")
parte_fija_3 <- formula(weight_DM ~ Genotypes
                                    + Age
                                    + I(Age^2)
                                    + Genotypes*Age
                                    + Genotypes*I(Age^2))
heterocedasticidad_5 <- varComb(varExp(form = ~ fitted(.)))
correlacion_4 <- corCompSymm(form = ~ 1 | Block/Genotypes)
modelo_43 <- gls(parte_fija_3,
                 weights = heterocedasticidad_5,
                 correlation = correlacion_4,
                 na.action = na.omit,
                 data = datos_weight)
anova(modelo_43)
#response
Denom. DF: 48
numDF F-value p-value
(Intercept) 1 597.3828 <.0001
Genotypes 3 2.9416 0.0424
Age 1 471.6933 <.0001
I(Age^2) 1 22.7748 <.0001
Genotypes:Age 3 5.9425 0.0016
Genotypes:I(Age^2) 3 0.7544 0.5253
#################################
#test whether the linear age slope of each genotype is equal to zero
################################
(tendencias_em_lin <- emtrends(modelo_43,
                               "Genotypes",
                               var = "Age"))
#response
Genotypes Age.trend SE df lower.CL upper.CL
C 1.693331 0.2218320 48 1.247308 2.139354
E 1.459517 0.2135037 48 1.030239 1.888795
K 2.001097 0.2818587 48 1.434382 2.567811
M 1.050767 0.1301906 48 0.789001 1.312532
Confidence level used: 0.95
(tendencias_em_lin_prueba <- update(tendencias_em_lin,
                                    infer = c(TRUE, TRUE),
                                    null = 0))
#response
Genotypes Age.trend SE df lower.CL upper.CL t.ratio p.value
C 1.693331 0.2218320 48 1.247308 2.139354 7.633 <.0001
E 1.459517 0.2135037 48 1.030239 1.888795 6.836 <.0001
K 2.001097 0.2818587 48 1.434382 2.567811 7.100 <.0001
M 1.050767 0.1301906 48 0.789001 1.312532 8.071 <.0001
Confidence level used: 0.95
########################################
#test differences between the linear age slopes of the genotypes
########################################
CLD(tendencias_em_lin,
    adjust = "bonferroni",
    alpha = 0.05)
#response
Genotypes Age.trend SE df lower.CL upper.CL .group
M 1.050767 0.1301906 48 0.7128801 1.388653 1
E 1.459517 0.2135037 48 0.9054057 2.013628 12
C 1.693331 0.2218320 48 1.1176055 2.269057 12
K 2.001097 0.2818587 48 1.2695822 2.732611 2
Confidence level used: 0.95
Conf-level adjustment: bonferroni method for 4 estimates
P value adjustment: bonferroni method for 6 tests
significance level used: alpha = 0.05
Questions
1. How can I test whether the age intercept of each genotype is equal to zero?
2. How can I test for differences between the intercepts of the genotypes?
3. What are the standard errors of the intercepts of each genotype?
Thanks for your help.
You can answer these questions by using emmeans() in similar ways to what you did with emtrends().
Also look at the documentation for summary.emmGrid, and note that you can choose whether to do CIs, tests, or both. e.g.,
emm <- emmeans(...)
summary(emm, infer = c(TRUE,TRUE))
summary(emm, infer = c(TRUE,FALSE)) # or confint(emm)
summary(emm, infer = c(FALSE,TRUE)) # or test(emm)
True intercepts
If in fact you want the actual y intercepts, you can do that using
emm <- emmeans(..., at = list(Age = 0))
Then predictions are made at age 0; those are the intercepts of the regression equations for each set of conditions. However, I would like to dissuade you from doing that, because (a) these predictions are huge extrapolations, so their standard errors are huge as well; and (b) it makes no practical sense to predict responses at age 0. For that reason, I think question #1 is basically meaningless.
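That said, if you do want them, the computation might look like this for the model in the question (a sketch, using modelo_43 as fitted above):
emm0 <- emmeans(modelo_43, "Genotypes", at = list(Age = 0))
test(emm0)     # tests intercept = 0 for each genotype (question 1)
pairs(emm0)    # pairwise differences between the intercepts (question 2)
summary(emm0)  # the SE column gives the intercepts' standard errors (question 3)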
If you leave that at part out, then emmeans() makes predictions at the mean age in the dataset. Those predictions will have a much smaller standard error than the intercepts do. Since you have interactions involving age in the model, the genotype comparisons differ at each age, so I suggest using
emm <- emmeans(..., cov.reduce = FALSE, by = "Age")
which is equivalent to using at to specify the set of age values that occur in the dataset, and doing separate comparisons at each of those age values.
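In terms of the question's model, that could look like the following sketch:
# genotype comparisons at each Age value occurring in the data
emm_age <- emmeans(modelo_43, "Genotypes", by = "Age", cov.reduce = FALSE)
pairs(emm_age, adjust = "bonferroni")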

Binary logistic regression with multiply imputed data

I have been trying to work with the options available within R (i.e. mice) to do binary logistic regression analyses (with an interaction between continuous and categorical predictors).
However, I am struggling to carry out this simple analysis on multiply imputed data (details and reproducible example here).
Specifically, I have not been able to figure out a way to pool every aspect of the output, including an equivalent of the model likelihood ratio test, using the glm function with mice.
To avoid redundancy with a previous post, I am seeking ANY suggestions for R packages or other software that may make it easy/possible to pool all essential components of the output for binary logistic regression (i.e. the equivalent of the model likelihood ratio test, regression coefficients, Wald tests). See below for an example that I was able to obtain using rms on non-imputed data (I could not figure out a way to run this on multiply imputed data).
> mylogit
Frequencies of Missing Values Due to Each Variable
P1 ST P8
18 0 31
Logistic Regression Model
lrm(formula = P1 ~ ST + P8 + ST * P8, data = PS, x = TRUE,
y = TRUE)
                     Model Likelihood    Discrimination    Rank Discrim.
                           Ratio Test           Indexes          Indexes
Obs           362    LR chi2     18.34    R2      0.077    C       0.652
 0            287    d.f.            9    g       0.664    Dxy     0.304
 1             75    Pr(> chi2) 0.0314    gr      1.943    gamma   0.311
max |deriv| 8e-08                         gp      0.099    tau-a   0.100
                                          Brier   0.155
Coef S.E. Wald Z Pr(>|Z|)
Intercept -0.5509 0.3388 -1.63 0.1040
ST= 2 -0.5688 0.4568 -1.25 0.2131
ST= 3 -0.7654 0.4310 -1.78 0.0757
ST= 4 -0.7995 0.5229 -1.53 0.1263
ST= 5 -1.2813 0.4276 -3.00 0.0027
P8 0.2162 0.4189 0.52 0.6058
ST= 2 * P8 -0.1527 0.5128 -0.30 0.7659
ST= 3 * P8 -0.0461 0.5130 -0.09 0.9285
ST= 4 * P8 -0.5031 0.5635 -0.89 0.3719
ST= 5 * P8 0.3661 0.4734 0.77 0.4393
In sum, my questions are: 1) what package/software is capable of handling multiply imputed data for a traditional binary logistic regression analysis, especially with an interaction term, and 2) what steps do I need to take to run the analysis in that program?
The rms package has great features for combining multiply imputed data, using the fit.mult.impute() function. Here is a small working example:
library(rms)  # also loads Hmisc, which provides aregImpute() and fit.mult.impute()
dat <- mtcars
## introduce NAs
dat[sample(rownames(dat), 10), "cyl"] <- NA
im <- aregImpute(~ cyl + wt + mpg + am, data = dat)
fit.mult.impute(am ~ cyl + wt + mpg, xtrans = im, data = dat, fitter = lrm)
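The returned object behaves like an ordinary lrm fit whose variance-covariance matrix has been adjusted for the imputation, so (as a sketch) the usual rms summaries should apply:
f <- fit.mult.impute(am ~ cyl + wt + mpg, xtrans = im, data = dat, fitter = lrm)
print(f)   # pooled coefficients with imputation-adjusted standard errors
anova(f)   # Wald tests for each term, including interactions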

Odds ratio and confidence intervals from glmer output

I have made a model that looks at a number of variables and the effect they have on pregnancy outcome. The outcome is a grouped binary: one mob of animals will have 34 pregnant and 3 empty, the next will have 20 pregnant and 4 empty, and so on.
I have modelled this data using the glmer function where y is the pregnancy outcome (pregnant or empty).
mclus5 <- glmer(y~adg + breed + bw_start + year + (1|farm),
data=dat, family=binomial)
I get all the usual output with coefficients etc. but for interpretation I would like to transform this into odds ratios and confidence intervals for each of the coefficients.
In past logistic regression models I have used the following code
round(exp(cbind(OR=coef(mclus5),confint(mclus5))),3)
This would very nicely provide what I want, but it does not seem to work with the model I have run.
Does anyone know a way that I can get this output for my model through R?
The only real difference is that you have to use fixef() rather than coef() to extract the fixed-effect coefficients (coef() gives you the estimated coefficients for each group).
I'll illustrate with a built-in example from the lme4 package.
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
Fixed-effect coefficients and confidence intervals, log-odds scale:
cc <- confint(gm1,parm="beta_") ## slow (~ 11 seconds)
ctab <- cbind(est=fixef(gm1),cc)
(If you want faster-but-less-accurate Wald confidence intervals you can use confint(gm1, parm = "beta_", method = "Wald") instead; this will be equivalent to @Gorka's answer but marginally more convenient.)
Exponentiate to get odds ratios:
rtab <- exp(ctab)
print(rtab,digits=3)
## est 2.5 % 97.5 %
## (Intercept) 0.247 0.149 0.388
## period2 0.371 0.199 0.665
## period3 0.324 0.165 0.600
## period4 0.206 0.082 0.449
A marginally simpler/more general solution:
library(broom.mixed)
tidy(gm1,conf.int=TRUE,exponentiate=TRUE,effects="fixed")
for Wald intervals, or add conf.method="profile" for profile confidence intervals.
I believe there is another, much faster way (if you are OK with a less accurate result).
From: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
First we get the confidence intervals for the Estimates
se <- sqrt(diag(vcov(mclus5)))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(mclus5),
             LL = fixef(mclus5) - 1.96 * se,
             UL = fixef(mclus5) + 1.96 * se)
Then the odds ratios with 95% CI
print(exp(tab), digits=3)
Another option, I believe, is to use the emmeans package:
library(emmeans)
data.frame(confint(pairs(emmeans(fit, ~ factor_name, type = "response"))))
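Applied to the cbpp example above, that would look like the sketch below; on the response scale, pairwise contrasts from a logit model are reported as odds ratios:
library(emmeans)
# odds ratios with CIs for pairwise comparisons of the period levels
data.frame(confint(pairs(emmeans(gm1, ~ period, type = "response"))))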
