I have a fairly complicated ZINB model. I have tried to replicate the basic structure of what I'm trying to do:
MyDat<-cbind.data.frame(fac1 = rep(c("A","B","C","D"),10),
fac2=c(rep("X",20),rep("Y",20)),
offset=c(runif(20, 50,60),runif(20,150,165)),
fac3=rep(c(rep("a1",4),rep("a2",4),rep("a3",4),rep("a4",4),rep("a5",4)),2),
Y=c(0,0,0,1,0,0,11,10,0,0,0,5,0,0,0,35,60,0,0,0,0,2,0,0,16,0,0,0,0,0,3,88,0,0,0,0,0,0,27,0))
f<-formula(Y~fac1+ offset(log(offset))|fac3+ fac2)
ZINB <-zeroinfl(f, dist = "negbin",link = "logit", data = MyDat)
summary(ZINB)
The primary goal of this model is to look at the effect of fac1 across the four levels. The other variables are more just artifacts of the sampling process.
Here is the output:
Call:
zeroinfl(formula = f, data = MyDat, dist = "negbin", link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-0.418748 -0.338875 -0.265109 -0.001566 2.682920
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7192 0.9220 -1.865 0.062239 .
fac1B -4.4161 1.4700 -3.004 0.002663 **
fac1C -1.2008 1.2896 -0.931 0.351778
fac1D 0.1928 1.3003 0.148 0.882157
Log(theta) -1.7349 0.4558 -3.806 0.000141 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.5899 210.8434 -0.055 0.956
fac3a2 -0.4775 2.4608 -0.194 0.846
fac3a3 -11.2284 427.5200 -0.026 0.979
fac3a4 10.7771 210.8056 0.051 0.959
fac3a5 -0.3135 2.3358 -0.134 0.893
fac2Y 11.8292 210.8298 0.056 0.955
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 0.1764
Number of iterations in BFGS optimization: 76
Log-likelihood: -63.82 on 11 Df
I have consulted papers and stats books and forums, but I'm still not sure how to present this information. What I really want is a bar plot showing the effects on the Y-axis and the 4 levels on the X.
If I understand correctly, level A of fac1 is currently set at 0, and is my reference level (please correct me if I'm wrong here). So, I can make a plot of the 4 levels (including level A as zero). This doesn't seem ideal. I would really like to have 95%CIs for all levels.
I can also use the predict function, however predict.zeroinfl does not give error estimates, and I'm unsure how to interpret the effect of the offset.
Similar papers have just put a boxplot of the original data next to a boxplot of the predictions and compared. I feel like I should be able to do better.
Below is the code and plot to create the predicted values:
MyDat$phat<-predict(ZINB, type="response")
MyDat$phat_os<-MyDat$phat/MyDat$offset
plot(phat~fac1, MyDat)
Predictions plot
Is bootstrapping the way to go? I have tried this and run into all kinds of trouble for something I'm not sure is necessary.
Thank you in advance, and please go easy on me if I'm making a silly oversight/assumption. I'm still learning, but these stats feel a bit out of my reach.
For starters, you can plot the model coefficients with their confidence intervals. The arm package has the coefplot function, but it doesn't have a method for zeroinfl models, so I've created a simple coefficient plot below using ggplot2. The predict method for zeroinfl models doesn't provide confidence intervals for predictions, but this answer to a question on CrossValidated shows how to construct bootstrapped confidence intervals for zeroinfl models.
Regarding the levels of fac1: A is the reference level, so the coefficients for the other levels are relative to fac1 = "A".
library(pscl)
library(ggplot2)
MyDat<-cbind.data.frame(fac1 = rep(c("A","B","C","D"),10),
fac2=c(rep("X",20),rep("Y",20)),
offset=c(runif(20, 50,60),runif(20,150,165)),
fac3=rep(c(rep("a1",4),rep("a2",4),rep("a3",4),rep("a4",4),rep("a5",4)),2),
Y=c(0,0,0,1,0,0,11,10,0,0,0,5,0,0,0,35,60,0,0,0,0,2,0,0,16,0,0,0,0,0,3,88,0,0,0,0,0,0,27,0))
f<-formula(Y ~ fac1 + offset(log(offset))|fac3 + fac2)
ZINB <-zeroinfl(f, dist = "negbin",link = "logit", data = MyDat)
# Extract coefficients and standard errors from model summary
coefs = as.data.frame(summary(ZINB)$coefficients$count[,1:2])
names(coefs)[2] = "se"
coefs$vars = rownames(coefs)
# Coefficient plot
ggplot(coefs, aes(vars, Estimate)) +
geom_hline(yintercept=0, lty=2, lwd=1, colour="grey50") +
geom_errorbar(aes(ymin=Estimate - 1.96*se, ymax=Estimate + 1.96*se),
lwd=1, colour="red", width=0) +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se),
lwd=2.5, colour="blue", width=0) +
geom_point(size=4, pch=21, fill="yellow") +
theme_bw()
And here's what the plot looks like.
Related
I'm trying to identify the impact of how Sycamore_biomass affects the day which a bird lays its first_egg. My model output indicates a weak positive relationship - i.e. as sycamore biomass increases, the day of the first egg being laid should increase (i.e. should be later) (note I'm including confounding factors in this model):
Call:
lm(formula = First_egg ~ Sycamore_biomass + Distance_to_road +
Distance_to_light + Anthropogenic_cover + Canopy_cover, data = egglay_date)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.61055 16.21391 2.443 0.0347 *
Sycamore_biomass 0.15123 0.53977 0.280 0.7851
Distance_to_road 0.01773 0.46323 0.038 0.9702
Distance_to_light -0.02626 0.44225 -0.059 0.9538
Anthropogenic_cover -0.13879 0.28306 -0.490 0.6345
Canopy_cover -0.30219 0.20057 -1.507 0.1628
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.99 on 10 degrees of freedom
Multiple R-squared: 0.2363, Adjusted R-squared: -0.1455
F-statistic: 0.6189 on 5 and 10 DF, p-value: 0.6891
However, when I plot this using ggplot, the regression line indicates a negative relationship? Can anyone help me out with what is happening here?
ggplot(egglay_date, aes(x=Sycamore_biomass, y=First_egg)) +
geom_point(shape=19, alpha=1/4) +
geom_smooth(method=lm)
GG PLOT of Sycamore biomass and First egg date
I suppose this is because you look at the raw data you fed into the model, not the model predictions. In the plot, you don't "isolate" a single predictor. You look at the result of all predictors doing something to the response variable. I suppose the effect of this predictor is "overshadowed" by the effects of the other predictors.
To take a look at the effect of solely one predictor, you need to predict new values from the model while fixing all other predictors. You can try something along the lines of:
preds <- predict(yourmodel, newdata = data.frame(
"Sycamore_biomass" = 0:25,
"Distance_to_road" = mean(egglay_date$Distance_to_road),
"Distance_to_light" = mean(egglay_date$Distance_to_light),
"Anthropogenic_cover" = mean(egglay_date$Anthropogenic_cover),
"Canopy_cover" = mean(egglay_date$Canopy_cover)))
new_data <- data.frame(
"Sycamore_biomass" = 0:25,
"First_egg" = preds)
ggplot(new_data, aes(x=Sycamore_biomass, y=First_egg)) +
geom_point(shape=19, alpha=1/4) +
geom_smooth(method=lm)
This should give you the predictions of your model when only considering the effect of the one predictor.
The answer to your question is quite simple (but I understand why it may seems complex at first).
First off, your model indicates a positive relationship because you have included all your other variables. Keep in mind, your best fit line through your data here is when you take all your data points, and fit a line to make the sum of residuals = 0. Note: this is not the same as sum of residuals squared.
Since you didn't provide your data (please do on future posts, or at least, something to work with), I will illustrate my point with the data(mtcars) built into R
data("mtcars")
df <- mtcars
This dataset has many variables, to see them all, just type names(df)
Lets just work with three of them to see if miles per gallon (mpg) is explained by:
1) cyl : # of cylinders
2) hp : horse power
3) drat : rear axle ratio
attach(df)
model <-lm(mpg~cyl+hp+drat)
summary(model)
Let's say, I just want to plot the relationship between cylinders and mpg (for you, it would be sycamour biomass and bird lay). Here, from our model summary, we see that our relationship is negative (negative estimate, aka, coefficient), and that the intercept is at 22.5.
So I do what you just did and just plot mpg~cly (without considering my other variables)
plot(mpg~cyl, pch=15, col="blue",cex=2, cex.axis=2, ylab="MPG", xlab="Number of Cylinders", cex.lab=1.5)
abline(lm(mpg~cyl),lwd=2,col="red")
First off, we see that the y intercept is not 22.5, but rather above 25.
If I were to do the math from first model, if I had 4 cylinders, I should predict:
22.51406 + (4 * -1.3606) = 17.07
So lets see if our prediction is correct on our graph
Definitely not.
So lets run a new model (which you need to do), where we model just mpg~cly
reduced_model <- lm(mpg~cyl)
summary(reduced_model)
See how the intercept and coefficent (estimates) changed? Yours will too when you run a reduced model. Lets see if the plots now make sense following the same steps as above with predicting 4 cylinders
37.8846 + (4 * -2.8758 ) # 26.38
plot(mpg~cyl, pch=15, col="blue",cex=2, cex.axis=2, ylab="MPG", xlab="Number of Cylinders", cex.lab=1.5)
abline(lm(mpg~cyl),lwd=2,col="red")
abline(h=26.38,v=4,lwd=2, col="green")
Looks like everything checks out.
Summary: You need to run a simple model with just your two variables of interest if you want to correctly understand your plot
I have fitted a logistic regression for an outcome (a type of side effect - whether patients have this or not). The formula and results of this model is below:
model <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.888112 0.859847 -9.174 < 2e-16 ***
age 0.028529 0.009212 3.097 0.00196 **
bmi 0.095759 0.015265 6.273 3.53e-10 ***
surgery11 0.923723 0.524588 1.761 0.07826 .
surgery21 1.607389 0.600113 2.678 0.00740 **
surgery31 1.544822 0.573972 2.691 0.00711 **
cvd1 0.624692 0.290005 2.154 0.03123 *
rt1 -0.816374 0.353953 -2.306 0.02109 *
I want to check my models, so I have plotted residuals against predictors or fitted values. I know, if a model is properly fitted, there should be no correlation between residuals and predictors and fitted values so I essentially run...
residualPlots(model)
My plots look funny because from what I have seen from examples online, is that it should be symmetrical around 0. Also, my factor variables aren't shown in box-plots although I have checked the structure of my data and coded surgery1, surgery2, surgery4,cvd,rt as factors. Can someone help me interpret my plots and guide me how to plot boxplots for my factor variables?
Thanks
Your label or response variable is expected for an imbalanced dataset. From your plots most of your residuals actually go below the dotted line, so I suspect this is the case.
Very briefly, the symmetric around residuals only holds for logistic regression when your classes are balanced. If it is heavily imbalanced towards the reference label (or 0 label), the intercept will be forced towards a low value (i.e the 0 label), and you will see that positive labels will have a very large pearson residual (because they deviate a lot from the expected). You can read more about imbalanced class and logistic regression in this post
Here's an example to demonstrate this, using a dataset where you see the evenly distributed residues :
library(mlbench)
library(car)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)
neg pos
500 268
mdl = glm(diabetes ~ .,data=PimaIndiansDiabetes,family="binomial")
residualPlots(mdl)
Let's make it more imbalanced, and you get a plot exactly like yours:
da = PimaIndiansDiabetes
wh = c(which(da$diabetes=="neg"),which(da$diabetes == "pos")[1:100])
da = da[wh,]
table(da$diabetes)
neg pos
500 100
mdl = glm(diabetes ~ .,data=da,family="binomial")
residualPlots(mdl)
I use emmeans to derive adjusted means from my linear mixed-effect regression model, but the results do not seem to be correct. I want to plot the model fit and the adjusted values of the individual data points, but the results look weird:
The estimated adjusted means seems to be too high for Course A and too low on Course C. In my linear mixed-effect regression, I am predicting the posttest with pretest as a covariate and the main effect and interaction of Group and Course. Because I have repeated measures on Course and different testing conditions, I have included a random intercept for Course and School. Using emmeans I get the following estimates:
# model fit
CI_post <- lmer(
post.diff ~
pre.diff +
group * course
+ (1|bib)
+ (1|school),
data = dat,
REML = FALSE)
#estimated adjusted means
emmeans(CI_post, specs = c("course", "group"),lmer.df = "satterthwaite")
# Results
course group emmean SE df lower.CL upper.CL
A blocked 0.311 0.191 6.65 -0.1452 0.768
B blocked 0.649 0.180 5.38 0.1954 1.102
C blocked 1.141 0.195 7.28 0.6847 1.598
A interleaved 0.189 0.194 7.15 -0.2666 0.645
B interleaved 0.497 0.179 5.31 0.0451 0.949
C interleaved 1.046 0.191 6.72 0.5907 1.502
It is these values that I have plotted and that I think is incorrect. Can someone please help me so that I get the correct estimated adjusted means?
After having read this, I suspect that the error is because pre.diff is a fixed value?
ref_grid(CI_post)
#result
'emmGrid' object with variables:
pre.diff = 1.5065
group = blocked, interleaved
course = A, B, C
EDIT
Following Lenth advice, I tried:
post.diff.adj = post.diff + b * (1.506 - pre.diff), which gave me the following figure:
It looks better and more correct. I used the model regression coefficient from my model:
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.66087 0.18158 5.58701 -3.639 0.012280 *
pre.diff 0.64544 0.06178 130.60667 10.448 < 0.0000000000000002 ***
groupinterleaved -0.12209 0.15189 65.38709 -0.804 0.424431
courseB 0.33714 0.09703 131.63603 3.475 0.000693 ***
courseC 0.82993 0.16318 151.09201 5.086 0.00000107 ***
groupinterleaved:courseB -0.02922 0.11777 101.47596 -0.248 0.804563
groupinterleaved:courseC 0.02692 0.11763 100.29319 0.229 0.819435
Then I used calculated it in my tibble:
dat <- dat %>%
mutate(adjustedMean = (post.diff) + (0.6454358 * (1.506 - pre.diff)))
Then I plotted it with ggplot:
CI_post_plot <- ggplot(dat, aes(x = interaction(group, course), y = adjustedMean)) +
geom_point(aes(color=group), size=1.5, position=position_jitter(width=0.1), alpha=0.7)+
scale_y_continuous(name = "Time substracted from straight gliding time (sec.)", breaks = seq(-2, 6, 1)) +
theme_pubr()+
theme(legend.position="none",
axis.title.x=element_blank()) +
geom_hline(aes(yintercept=0), linetype = "dashed", size=0.2) +
scale_x_discrete(labels = c("Blocked\nCourse A", "Interleaved\nCourse A", "Blocked\nCourse B", "Interleaved\nCourse B", "Blocked\nCourse C", "Interleaved\nCourse C"))
CI_post_plot <- CI_post_plot +
geom_point(data = estmarg_mean, aes(x=interaction(group, course), y=emmean, group=group), size=2.5) +
geom_errorbar(data = estmarg_mean, aes(x= interaction(group, course), y = emmean, ymin = lower.CL,ymax = upper.CL), width=0.1)
https://cran.r-project.org/web/packages/emmeans/vignettes/basics.html
Reviewing some comments, the second plot in the OP shows the adjusted response values and the adjusted means (AKA EMMs). The intuition for this is shown in this rough sketch:
Most experimental design texts will show a similar picture for how adjusted means are objained: The model fits parallel lines for each treatment; those lines go through the centers of their respective data clouds. The adjusted means are the estimates at the mean value of the covariate.
To obtain the adjusted data, we do the same thing with each data point; a few representative points are shown. The adjusted data are projections of each data point onto the mean line, with the projections going on paths parallel to the regression lines.
There was an article on "aligned data" in The American Statistician a number of years ago, and this is akin to that. I do not remember the author and didn't find it in a quick search. This is also related to "component-plus-residual" plots, discussed in some regression texts. The underlying idea is to remove the estimated effects of nuisance variables from the data, or equivalently to obtain the model residuals and add the non-nuisance effects.
I'm fixing a linear regression with lm() like
model<-lm(y~x+a, data=dat)
where a is a blocking variable with multiple factor levels.
summary(model)
Call:
lm(formula = y ~ x, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.45006 -0.20737 0.04593 0.26337 0.91628
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.704042 1.088024 -7.081 1.08e-10 ***
x 0.248889 0.036436 6.831 3.81e-10 ***
a1 0.002695 0.150530 0.018 0.98575
a2 0.491749 0.152378 3.227 0.00162 **
a3 0.349772 0.145024 2.412 0.01740 *
a4 -0.009058 0.138717 -0.065 0.94805
a5 0.428085 0.128041 3.343 0.00111 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4505 on 119 degrees of freedom
Multiple R-squared: 0.4228, Adjusted R-squared: 0.3937
F-statistic: 14.53 on 6 and 119 DF, p-value: 2.19e-12
I'm trying to display the same equation and R2 I would get with summary(model) when plotting the raw data and the regression line using ggplot, but because I'm not actually providing a, it's not taking into the fitting of stat_poly_eq()
ggplot(data=dat, aes(x, y)) +
geom_point() +
geom_abline(slope=coef(model)[2], intercept=coef(model)[1], color='red') +
stat_poly_eq(data=plankton.dat, formula = y ~ x,
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE, size=3, colour= "red")
Naturally, because lm() and stat_poly_eq() fit the model differently, the resulting parameter estimates and R2 are different.
Is it possible to include the blocking variable in stat_poly_eq and if so, how?
Having factor a six levels, you have fitted six parallel lines, so it does not make much sense to show only one line and one equation. If factor a describes blocks, then using lme() to fit a mixed effects model is possible, and it will give you only one estimate for the line. You have to consider the contrasts used by default in R, and that the first level of a or a0 is the "reference", so the line plotted in your example is for block level a0 and is not valid for the dataset as a whole.
stat_poly_eq() supports only lm(). stat_poly_eq() works in the same way as stat_smooth(method = "lm") as it is intended to be used together with it. If you are fitting the model outside of ggplot then you will need to build a suitable label manually using plotmath syntax, and add it in an annotation layer using annotate(geom = "text", x = 27, y = 1, label = "<your string>", parse = TRUE). To create the string that I show with the placeholder <your string>, you can extract the coefficient estimates in a similar way as you do in geom_abline() in your plot example, and use paste() or sprintf() to assemble the equation. You can also use coef() with a model fitted with lme().
Other statistics in package 'ggpmisc' let you fit a model with lme() but you would anyway need to assemble the label manually. If you will be producing many plots, you may find it worthwhile cheking the User Guide of package 'ggpmisc' for the details.
Somehow as a follow up on the question Creating confidence intervals for regression curve in GLMM using Bootstrapping, I am interested in getting the correct values of a regression curve and the associated confidence interval curves.
Consider a case where in a GLMM, there is one response variable, two continuous fixed effects and one random effect. Here is some fake data:
library (dplyr)
set.seed (1129)
x1 <- runif(100,0,1)
x2 <- rnorm(100,0.5,0.4)
f1 <- gl(n = 5,k = 20)
rnd1<-rnorm(5,0.5,0.1)
my_data <- data.frame(x1=x1, x2=x2, f1=f1)
modmat <- model.matrix(~x1+x2, my_data)
fixed <- c(-0.12,0.35,0.09)
y <- (modmat%*%fixed+rnd1)
my_data$y <- ((y - min (y))/max(y- min (y))) %>% round (digits = 1)
rm (y)
The GLMM that I fit looks like this:
m1<-glmer (y ~x1+x2+(1|f1), my_data, family="binomial")
summary (m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ x1 + x2 + (1 | f1)
Data: my_data
AIC BIC logLik deviance df.resid
65.7 76.1 -28.8 57.7 96
Scaled residuals:
Min 1Q Median 3Q Max
-8.4750 -0.7042 -0.0102 1.5904 14.5919
Random effects:
Groups Name Variance Std.Dev.
f1 (Intercept) 1.996e-10 1.413e-05
Number of obs: 100, groups: f1, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.668 2.051 -4.713 2.44e-06 ***
x1 12.855 2.659 4.835 1.33e-06 ***
x2 4.875 1.278 3.816 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) x1
x1 -0.970
x2 -0.836 0.734
convergence code: 0
boundary (singular) fit: see ?isSingular
Plotting y vs x1:
plot (y~x1, my_data)
It should be possible to get a regression curve from the summary of m1. I have learned that I need to reverse the link-function (in this case, "logit"):
y = 1/(1+exp(-(Intercept+b*x1+c*x2)))
In order to plot a regression curve of x1 in a two-dimensional space, I set x2 = mean(x2) in the formula (which also seems important - the red line in the following plots ignores x2, apparently leading to considerable bias). The regression line:
xx <- seq (from = 0, to = 1, length.out = 100)
yy <- 1/(1+exp(-(-9.668+12.855*xx+4.875*mean(x2))))
yyy <- 1/(1+exp(-(-9.668+12.855*xx)))
lines (yy ~ xx, col = "blue")
lines (yyy~ xx, col = "red")
I think, the blue line looks not so good (and the red line worse, of course). So as a side-question: is y = 1/(1+exp(-(Intercept+b*x1+c*x2))) always the right choice as a back-transformation of the logit-link? I am asking because I found this https://sebastiansauer.github.io/convert_logit2prob/, which made me suspicious. Or is there another reason for the model not to fit so well? Maybe my data creation process is somewhat 'bad'.
What I need now is to add the 95%-confidence interval to the curve. I think that Bootstrapping using the bootMer function should be a good approach. However, all examples that I found were on models with one single fixed effect. #Jamie Murphy asked a similar question, but he was interested in models containing a continuous and a categorical variable as fixed effects here: Creating confidence intervals for regression curve in GLMM using Bootstrapping
But when it comes to models with more than one continuous variables as fixed effects, I get lost. Perhaps someone can help solve this issue - possibly with a modification of the second part of this tutorial:
https://www.r-bloggers.com/2015/06/confidence-intervals-for-prediction-in-glmms/