I have this fake dataset that describes the effect of air temperature on the growth of two plant species (a and b).
data1 <- read.csv(text = "
year,block,specie,temperature,growth
2019,1,a,0,7.217496163
2019,1,a,1,2.809792001
2019,1,a,2,16.09505635
2019,1,a,3,24.52673264
2019,1,a,4,49.98455022
2019,1,a,5,35.78568291
2019,2,a,0,8.332533323
2019,2,a,1,16.5997836
2019,2,a,2,11.95833966
2019,2,a,3,34.4
2019,2,a,4,54.19081002
2019,2,a,5,41.1291734
2019,1,b,0,14.07939683
2019,1,b,1,13.73257973
2019,1,b,2,31.33076651
2019,1,b,3,44.81995622
2019,1,b,4,79.27999184
2019,1,b,5,75.0527336
2019,2,b,0,14.18896232
2019,2,b,1,29.00692747
2019,2,b,2,27.83736734
2019,2,b,3,61.46006916
2019,2,b,4,93.91100024
2019,2,b,5,92.47922985
2020,1,a,0,4.117536842
2020,1,a,1,12.70711508
2020,1,a,2,16.09570046
2020,1,a,3,29.49417491
2020,1,a,4,35.94571498
2020,1,a,5,50.74477018
2020,2,a,0,3.490585144
2020,2,a,1,3.817105315
2020,2,a,2,22.43112718
2020,2,a,3,14.4
2020,2,a,4,46.84223604
2020,2,a,5,39.10398717
2020,1,b,0,10.17712428
2020,1,b,1,22.04514586
2020,1,b,2,30.37221799
2020,1,b,3,51.80333619
2020,1,b,4,76.22765452
2020,1,b,5,78.37284714
2020,2,b,0,7.308139613
2020,2,b,1,22.03241605
2020,2,b,2,45.88385871
2020,2,b,3,30.43669633
2020,2,b,4,76.12904988
2020,2,b,5,85.9324324
")
The experiment was conducted over two years in a block design (blocks nested within years). The goal is to estimate how much growth changes per unit change in temperature, together with a measure of uncertainty (a standard error) for that estimate. The same is needed for the growth recorded at zero degrees of temperature.
library(lme4)
library(lmerTest)
library(lsmeans)
test.model.1 <- lmer(growth ~
specie +
temperature +
specie*temperature +
(1|year) +
(1|year:block),
data= data1,
REML=T,
control=lmerControl(check.nobs.vs.nlev = "ignore",
check.nobs.vs.rankZ = "ignore",
check.nobs.vs.nRE="ignore"))
summary(test.model.1)
The summary gives me this output:
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: growth ~ specie + temperature + specie * temperature + (1 | year) +
(1 | year:block)
Data: data1
Control: lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.rankZ = "ignore",
check.nobs.vs.nRE = "ignore")
REML criterion at convergence: 331.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.6408 -0.7637 0.1516 0.5248 2.4809
Random effects:
Groups Name Variance Std.Dev.
year:block (Intercept) 6.231 2.496
year (Intercept) 0.000 0.000
Residual 74.117 8.609
Number of obs: 48, groups: year:block, 4; year, 2
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.699 3.356 26.256 0.804 0.428
specieb 4.433 4.406 41.000 1.006 0.320
temperature 8.624 1.029 41.000 8.381 2.0e-10 ***
specieb:temperature 7.088 1.455 41.000 4.871 1.7e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) specib tmprtr
specieb -0.656
temperature -0.767 0.584
spcb:tmprtr 0.542 -0.826 -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
From this I can get the growth at 0 degrees of temperature for specie "a" (2.699) and for specie "b" (2.699 + 4.433 = 7.132). Likewise, the rate of change in growth per unit change in temperature is 8.624 for specie "a" and 8.624 + 7.088 = 15.712 for specie "b". The problem I have is that the standard error reported in summary() is for the marginal estimate, not for the actual value of the parameter. For instance, the standard error for 4.433 (specieb) is 4.406, but that is not the standard error for the actual growth at 0 degrees for specie "b", which is 7.132. What I am looking for is the standard error of, say, 7.132. It would also be nice to have all the calculations I did by hand performed automatically.
I made some attempts with emmeans() from the lsmeans package, but I didn't succeed.
emmeans(test.model.1, growth ~ specie*temperature)
Error:
Error in contrast.emmGrid(object = new("emmGrid", model.info = list(call = lmer(formula = growth ~ :
Contrast function 'growth.emmc' not found
I think your main problem is that you don't need the response variable on the left-hand side of the formula you give to emmeans (the package assumes you're going to use the same response variable as in the original model). The left-hand side of the formula is reserved for specifying contrasts, e.g. pairwise ~ ...; see help("contrast-methods", package = "emmeans").
I think you might be looking for:
emmeans(test.model.1, ~specie, at = list(temperature=0))
NOTE: Results may be misleading due to involvement in interactions
specie emmean SE df lower.CL upper.CL
a 2.70 3.36 11.3 -4.665 10.1
b 7.13 3.36 11.3 -0.232 14.5
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
If you don't specify the value of temperature, then emmeans uses (I think) the overall average temperature.
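If you want to see exactly what emmeans is predicting over, you can inspect the reference grid:
# shows the levels of specie and the temperature value used for predictions
# when `at` is not specified
ref_grid(test.model.1)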
For slopes, you want emtrends:
emtrends(test.model.1, ~specie, var = "temperature")
specie temperature.trend SE df lower.CL upper.CL
a 8.62 1.03 41 6.55 10.7
b 15.71 1.03 41 13.63 17.8
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
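If you also want the difference between the two slopes, with its own standard error and test, you can request pairwise contrasts directly:
# pairwise comparison of the temperature slopes between species
emtrends(test.model.1, pairwise ~ specie, var = "temperature")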
I highly recommend the extensive and clearly written vignettes for the emmeans package. Since emmeans has so many capabilities it may take a little while to find the answers to your precise questions, but the effort will be repaid in the long term.
As a small picky point, I would say that what summary() gives you are the "actual" parameters that R uses internally, and what emmeans() gives you are the marginal means (as suggested by the name of the package: estimated marginal means ...)
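For completeness, here is a minimal sketch of how the same estimate and its standard error could be reproduced by hand from the fixed-effect coefficients and their covariance matrix (emmeans does all of this for you; this just shows where the numbers come from):
b <- fixef(test.model.1)
V <- as.matrix(vcov(test.model.1))
# growth at 0 degrees for specie "b" = (Intercept) + specieb
est <- b["(Intercept)"] + b["specieb"]
# the SE of a sum of coefficients needs both variances and their covariance
se <- sqrt(V["(Intercept)", "(Intercept)"] + V["specieb", "specieb"] +
           2 * V["(Intercept)", "specieb"])
c(estimate = unname(est), std.error = se)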
I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B and C (i.e. age, sex, race) are added simply to avoid confounding, but we are really interested in the predictor X. X is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now I'm having trouble plotting a Kaplan-Meier curve for this. I've been searching for how to create this figure but haven't had much luck. I'm not sure whether plotting a Kaplan-Meier curve for a Cox model is even possible. Does the Kaplan-Meier estimate adjust for my covariates, or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$lower)  # lower bound on S(t) gives the upper bound on H(t)
cumhaz.lower <- -log(sfit$upper)  # upper bound on S(t) gives the lower bound on H(t)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
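For example, continuing the code above:
# 50% interval: the limits then correspond to the 25th and 75th percentiles
sfit50 <- survfit(mod.allison, conf.int = 0.50)
cumhaz50.upper <- -log(sfit50$lower)
cumhaz50.lower <- -log(sfit50$upper)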
The request for estimated survival curves at the 25th and 75th percentiles of X first requires determining those percentiles and specifying values for all the other covariates in a data frame to be used as the newdata argument to survfit.
We can use the data suggested by the other respondent from Fox's website, although on my machine it required building a url object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this question, but it does have a numeric variable for which we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
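Adding a legend makes it clear which curve is which (the curves are drawn in the order of the newdata rows, so prio = 1 gets the first colour):
legend("bottomleft", c("prio = 1 (25th percentile)", "prio = 4 (75th percentile)"),
       col = c("red", "blue"), lty = 1)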
Setting the confounders to fixed values and plotting the predicted survival probabilities at multiple points in time for given values of X (as #IRTFM suggested in his answer) results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for, and I don't think it is what the original poster wanted. Usually we are interested in average causal effects. In other words: what would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the Cox model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the Cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface, which can then be visualized using colour scales.
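To make this concrete, here is a minimal hand-rolled sketch of that g-computation step, reusing the Rossi data and the reduced mod.allison from the previous answer and treating prio as the X of interest (the helper function name is just for illustration):
library(survival)
# g-computation by hand: set prio to a fixed value for everyone, predict each
# individual's survival curve from the Cox model, then average over the sample
g_surv <- function(model, data, value) {
  data_x <- data
  data_x$prio <- value                    # "intervene": set X = x for all rows
  sf <- survfit(model, newdata = data_x)  # one survival curve per individual
  list(time = sf$time, surv = rowMeans(sf$surv))
}
s25 <- g_surv(mod.allison, Rossi, 1)      # 25th percentile of prio
s75 <- g_surv(mod.allison, Rossi, 4)      # 75th percentile of prio
plot(s25$time, s25$surv, type = "s", col = "red", ylim = c(0.6, 1),
     xlab = "Weeks", ylab = "Average survival probability")
lines(s75$time, s75$surv, type = "s", col = "blue")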
More conveniently, this can be done with the contsurvplot R package I developed, as discussed in this previous answer (Converting survival analysis by a continuous variable to categorical) or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on the topic: https://arxiv.org/pdf/2208.04644.pdf
I have made a model that looks at a number of variables and the effect they have on pregnancy outcome. The outcome is a grouped binary: one mob of animals will have 34 pregnant and 3 empty, the next will have 20 pregnant and 4 empty, and so on.
I have modelled this data using the glmer function where y is the pregnancy outcome (pregnant or empty).
mclus5 <- glmer(y~adg + breed + bw_start + year + (1|farm),
data=dat, family=binomial)
I get all the usual output with coefficients etc. but for interpretation I would like to transform this into odds ratios and confidence intervals for each of the coefficients.
In past logistic regression models I have used the following code
round(exp(cbind(OR=coef(mclus5),confint(mclus5))),3)
This would very nicely provide what I want, but it does not seem to work with the model I have run.
Does anyone know a way that I can get this output for my model through R?
The only real difference is that you have to use fixef() rather than coef() to extract the fixed-effect coefficients (coef() gives you the estimated coefficients for each group).
I'll illustrate with a built-in example from the lme4 package.
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
Fixed-effect coefficients and confidence intervals, log-odds scale:
cc <- confint(gm1,parm="beta_") ## slow (~ 11 seconds)
ctab <- cbind(est=fixef(gm1),cc)
(If you want faster-but-less-accurate Wald confidence intervals you can use confint(gm1,parm="beta_",method="Wald") instead; this will be equivalent to #Gorka's answer but marginally more convenient.)
Exponentiate to get odds ratios:
rtab <- exp(ctab)
print(rtab,digits=3)
## est 2.5 % 97.5 %
## (Intercept) 0.247 0.149 0.388
## period2 0.371 0.199 0.665
## period3 0.324 0.165 0.600
## period4 0.206 0.082 0.449
A marginally simpler/more general solution:
library(broom.mixed)
tidy(gm1,conf.int=TRUE,exponentiate=TRUE,effects="fixed")
for Wald intervals, or add conf.method="profile" for profile confidence intervals.
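For example, the profile-interval version would be:
# profile-likelihood confidence intervals on the odds-ratio scale
tidy(gm1, conf.int = TRUE, conf.method = "profile",
     exponentiate = TRUE, effects = "fixed")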
I believe there is another, much faster way (if you are OK with a less accurate result).
From: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
First we get the confidence intervals for the estimates:
se <- sqrt(diag(vcov(mclus5)))
# table of estimates with 95% CI
tab <- cbind(Est = fixef(mclus5), LL = fixef(mclus5) - 1.96 * se, UL = fixef(mclus5) + 1.96 * se)
Then the odds ratios with 95% CI
print(exp(tab), digits=3)
Another option, I believe, is to just use the emmeans package:
library(emmeans)
data.frame(confint(pairs(emmeans(fit, ~ factor_name,type="response"))))
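Applied to the cbpp example from the earlier answer, that would be, for instance:
library(emmeans)
# pairwise odds ratios between periods, with 95% confidence intervals
data.frame(confint(pairs(emmeans(gm1, ~ period, type = "response"))))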
I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type, I have a different number of plant species, with different numbers of individuals per plant species (code).
So, I ended up with nested design as follow: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10 + 1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) they are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So I fitted a negative binomial model.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
While qqnorm() and hist() seem OK, there is a tendency toward heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with these graphs in glmer? Or is there a better way to do it? If not, how much should I worry about the 3rd graph?
A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model)  # it shouldn't be over 1.4
To deal with overdispersion I usually add an observation-level random factor.
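A sketch, using the same placeholder names (your_model / your_data) as in the code below:
# one factor level per observation, then refit with it as a random effect
your_data$obs <- factor(seq_len(nrow(your_data)))
your_model_olre <- update(your_model, . ~ . + (1 | obs))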
For model validation I usually start from these plots, but then it depends on your specific model:
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
your_data$fitted <- fitted(your_model)  # fitted vs observed
plot(your_data$fitted, jitter(your_data$total, 0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use #Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With this caveat highlighted there:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than deviance residuals. blmeco::dispersion_glmer sums the squared deviance residuals together with the squared random-effect values (u), divides by the number of residuals, and takes the square root of the result (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823