issue with creating bar plot from logistic regression results - r

I am trying to plot the interaction results of a logistic regression where my independent variable (gbg) is binary and my moderator is binary (gender). The interaction terms is gender*gbg. Here is the code for the logistic regression I ran.
model <- glm(y ~ gender + lunch + cohort + race + gbg + gender*gbg, data = data, family = binomial)
Sample output is here:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.02336 0.20860 -4.906 9.3e-07
gender 0.20342 0.15663 1.299 0.1940
cohort 0.07891 0.13436 0.587 0.5570
race 0.08707 0.18150 0.480 0.6314
gbg 0.22623 0.20622 1.097 0.2726
gxgbg -0.76378 0.30647 -2.492 0.0127
I know how to plot interaction results when at least one of my IVs or moderators are linear, but I don't know how to do it when both are and my outcome is binary. I tried this and no luck
plot_model(model, type = "int")
Can someone please help?

Related

Calculate proportion of random effect variance from zero-inflation component of glmmTMB model

I fitted a zero-inflation Model in glmmTMB onto my data. It is defined in the following way:
MT.total.glmm.zi <- glmmTMB(MT_total ~ Factor1 * Factor2 + Factor3
+ (1|RANEF1)
+ (1|RANEF2)
+ (1|RANEF3)
+ (1|RANEF4)
+ (1|RANEF5),
family = "poisson",
zi=~Factor1 * Factor2 + Factor3
+ (1|RANEF1)
+ (1|RANEF2)
+ (1|RANEF3)
+ (1|RANEF4)
+ (1|RANEF5),
data=df.MT.total)
Now, for the random effects (RANEF1-5) fitted as random intercepts I would like to report the proportion of variance they explain. For the conditional models, there is the function "get_variance" from the "insight" package providing me with the much needed information:
get_variance(MT.total.glmm.zi.complex, component = "all") %>%
print(.)
$var.fixed
[1] 0.02833294
$var.random
[1] 1.029546
$var.residual
[1] 0.4704095
$var.distribution
[1] 0.4704095
$var.dispersion
[1] 0
$var.intercept
RANEF1 RANEF2 RANEF3 RANEF4 RANEF5
0.2862753 0.1710377 0.0486532 0.1655541 0.3580260
Unfortunately, I could not find a downstream wrapper for get_variance to gain the same information for the zero-inflation component of my model. I "only" found downstream wrappers for the Anova, emmeans and effects package in the documentation of the glmmTMB package from Ben Bolker. Unfortunately due to the nature of my data, the models are zero-inflation models and not hurdle/zero-truncated poisson models. Otherwise I could have just modelled a separate binomial model on the binary version of "MT_total".
There is the VarCorr function which allows printing the variances of the individual random effects:
print(VarCorr(MT.total.glmm.zi.complex), comp = "Variance")
Conditional model:
Groups Name Std.Dev.
RANEF1 (Intercept) 0.286275
RANEF2 (Intercept) 0.171038
RANEF3 (Intercept) 0.048653
RANEF4 (Intercept) 0.165554
RANEF5 (Intercept) 0.358026
Zero-inflation model:
Groups Name Std.Dev.
RANEF1 (Intercept) 1.14835
RANEF2 (Intercept) 0.85102
RANEF3 (Intercept) 0.11784
RANEF4 (Intercept) 0.14599
RANEF5 (Intercept) 0.85835
But I don't really know how I could calculate the remaining variance components of the zero-inflation component of my model (var.fixed, var.residual, var.distribution)
So my two questions are:
Is there a function or downstream wrapper I overlooked which would allow me to use get_variance onto the zero-inflation component of my model?
Or
Could someone give me a hint or guide me in the direction of how I can calculate the remaining variance components of my model in order to calculate the proportion of variance explained by my random effects manually?

Ordered probit regression, clustering standard errors in MASS::polr(), including SEs for cut-off points

I am trying to estimate an ordinal probit regression with clustered standard errors on subject level using the MASS package's polr() function. My dependent variable is ordinal with three factors. Treatment is a dummy, randinterval is an integer, period is an integer, and male is a dummy. Standard errors are clustered at the level of the individual subject via SubjectID. I already saw a similar thread here, however I have noticed that the output only includes SE for the coefficients. It does not show the cut-off points from the ordered probit regression. So far, my code looks like this:
oprobit <- polr(factor(frequency) ~ treatment + randinterval + period + male ,
data=my_data, method = "probit", Hess = T)
summary(oprobit)
Call:
polr(formula = factor(frequency) ~ treatment + randinterval +
period + male, data = my_data, Hess = T,
method = "probit")
Coefficients:
Value Std. Error t value
treatment 0.05849 0.064516 0.9065
randinterval -0.05419 0.028686 -1.8890
period -0.01289 0.007973 -1.6163
male 0.01108 0.067770 0.1635
Intercepts:
Value Std. Error t value
0|1 -0.8934 0.1290 -6.9234
1|2 -0.3580 0.1278 -2.8009
Residual Deviance: 2686.905
AIC: 2698.90
--------------------------------------------------------------------------------
oprobit_subj_1 <- coeftest(oprobit, vcovCL, type='HC0', cluster=~SubjectID)
print(oprobit_subj_1, digits=3)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
treatment 0.05849 0.14225 0.41 0.68
randinterval -0.05419 0.03428 -1.58 0.11
period -0.01289 0.00845 -1.53 0.13
male 0.01108 0.14561 0.08 0.94
To the best of my knowledge, STATA does include the cut-off points via the following function:
oprobit frequency treatment randinterval period male, vce(cluster SubjectID)
Is there a way to include the SE for the intercepts (= cut-off points (0|1, 1|2)) in R via MASS as well?
Thanks for your help in advance!

Weird plots when plotting logistic regression residuals vs predictor variables?

I have fitted a logistic regression for an outcome (a type of side effect - whether patients have this or not). The formula and results of this model is below:
model <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.888112 0.859847 -9.174 < 2e-16 ***
age 0.028529 0.009212 3.097 0.00196 **
bmi 0.095759 0.015265 6.273 3.53e-10 ***
surgery11 0.923723 0.524588 1.761 0.07826 .
surgery21 1.607389 0.600113 2.678 0.00740 **
surgery31 1.544822 0.573972 2.691 0.00711 **
cvd1 0.624692 0.290005 2.154 0.03123 *
rt1 -0.816374 0.353953 -2.306 0.02109 *
I want to check my models, so I have plotted residuals against predictors or fitted values. I know, if a model is properly fitted, there should be no correlation between residuals and predictors and fitted values so I essentially run...
residualPlots(model)
My plots look funny because from what I have seen from examples online, is that it should be symmetrical around 0. Also, my factor variables aren't shown in box-plots although I have checked the structure of my data and coded surgery1, surgery2, surgery4,cvd,rt as factors. Can someone help me interpret my plots and guide me how to plot boxplots for my factor variables?
Thanks
Your label or response variable is expected for an imbalanced dataset. From your plots most of your residuals actually go below the dotted line, so I suspect this is the case.
Very briefly, the symmetric around residuals only holds for logistic regression when your classes are balanced. If it is heavily imbalanced towards the reference label (or 0 label), the intercept will be forced towards a low value (i.e the 0 label), and you will see that positive labels will have a very large pearson residual (because they deviate a lot from the expected). You can read more about imbalanced class and logistic regression in this post
Here's an example to demonstrate this, using a dataset where you see the evenly distributed residues :
library(mlbench)
library(car)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)
neg pos
500 268
mdl = glm(diabetes ~ .,data=PimaIndiansDiabetes,family="binomial")
residualPlots(mdl)
Let's make it more imbalanced, and you get a plot exactly like yours:
da = PimaIndiansDiabetes
wh = c(which(da$diabetes=="neg"),which(da$diabetes == "pos")[1:100])
da = da[wh,]
table(da$diabetes)
neg pos
500 100
mdl = glm(diabetes ~ .,data=da,family="binomial")
residualPlots(mdl)

GAM: why mgcv::gam provides different results regarding to the order of the levels of the explanatory variable

I am trying to get the seasonal trend of two groups of individuals using GAMMs. I performed two analysis changing the order of the levels of the explanatory variable in order to get one plot of the seasonal trend for each level.
However, I am surprised with the output of the two GAMMs because they vary according to the order of the levels of the explanatory variable. I expected that the results would be the same because the data and the model are the same in both occasions. However, as you can see below, the results vary the inference of the data studied.
My database contained the next variables:
Species: 4 levels
Populations: 20 levels
Reproductive_State: 2 levels
Survival_probability: range [0-1]
Year
Month
Fortnight: from 1 to 26 (called Seasonality in analysis)
I am trying to get a descriptive estimates and plots of the "common seasonal survival of the species" checking the existence of differences between the two levels of the variable reproductive_state.
In order to check it I performed did:
# Specify the contrast: Reproductive group
data$Reproductive_Group <- as.factor (data$Reproductive_State)
data$Reproductive_Group <- as.ordered(data$Reproductive_Group )
contrasts(data$Reproductive_Group )<-'contr.treatment'
model_1 <- gam (Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by=Reproductive_Group ), random=list(Species=~1, Population=~1), family=quasibinomial, data=data)
later I change the order of the levels of the Reproductive_Group and perform the same analysis:
data$Reproductive_Group <- factor (data$Reproductive_Group , levels=c("phiNB", "phiB"))
levels (data$Reproductive_Group )
model_2 <- gam (Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by=Reproductive_Group ), random=list(Species=~1, Population=~1), family=quasibinomial, data=data)
In the first model the output is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83569 0.01202 152.8 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 3.201 3.963 2.430 0.05046 .
s(Seasonality):Rep_GroupphiNB 5.824 6.956 2.682 0.00991 **
whereas the output of the second model is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83554 0.01205 152.4 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 5.927 7.061 6.156 3.66e-07 ***
s(Seasonality):Rep_GroupphiB 3.218 3.981 1.029 0.411
Furthermore I have attached the plots of the two models:
Group_B_as_second_level
Group_NB_as_second_level
I thought that the plot of the seasonality should be the same for both analysis, as long as it represents exclusively the seasonality. However if the seasonality reflects the seasonal trend of the other level, the plot 1 of the first picture should match with the plot 2 of the second picture and viceversa, and they donĀ“t do it.
To note, that I followed the blog Overview GAMM analysis of time series data for writting the formula and checking the differences of the seasonal trend accross the two reproductive state.
Do you know why I obtain different results with these two models?

Estimate SE for all factor levels with zero-inflated model

I have a fairly complicated ZINB model. I have tried to replicate the basic structure of what I'm trying to do:
MyDat<-cbind.data.frame(fac1 = rep(c("A","B","C","D"),10),
fac2=c(rep("X",20),rep("Y",20)),
offset=c(runif(20, 50,60),runif(20,150,165)),
fac3=rep(c(rep("a1",4),rep("a2",4),rep("a3",4),rep("a4",4),rep("a5",4)),2),
Y=c(0,0,0,1,0,0,11,10,0,0,0,5,0,0,0,35,60,0,0,0,0,2,0,0,16,0,0,0,0,0,3,88,0,0,0,0,0,0,27,0))
f<-formula(Y~fac1+ offset(log(offset))|fac3+ fac2)
ZINB <-zeroinfl(f, dist = "negbin",link = "logit", data = MyDat)
summary(ZINB)
The primary goal of this model is to look at the effect of fac1 across the four levels. The other variables are more just artifacts of the sampling process.
Here is the output:
Call:
zeroinfl(formula = f, data = MyDat, dist = "negbin", link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-0.418748 -0.338875 -0.265109 -0.001566 2.682920
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7192 0.9220 -1.865 0.062239 .
fac1B -4.4161 1.4700 -3.004 0.002663 **
fac1C -1.2008 1.2896 -0.931 0.351778
fac1D 0.1928 1.3003 0.148 0.882157
Log(theta) -1.7349 0.4558 -3.806 0.000141 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.5899 210.8434 -0.055 0.956
fac3a2 -0.4775 2.4608 -0.194 0.846
fac3a3 -11.2284 427.5200 -0.026 0.979
fac3a4 10.7771 210.8056 0.051 0.959
fac3a5 -0.3135 2.3358 -0.134 0.893
fac2Y 11.8292 210.8298 0.056 0.955
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 0.1764
Number of iterations in BFGS optimization: 76
Log-likelihood: -63.82 on 11 Df
I have consulted papers and stats books and forums, but I'm still not sure how to present this information. What I really want is a bar plot showing the effects on the Y-axis and the 4 levels on the X.
If I understand correctly, level A of fac1 is currently set at 0, and is my reference level (please correct me if I'm wrong here). So, I can make a plot of the 4 levels (including level A as zero). This doesn't seem ideal. I would really like to have 95%CIs for all levels.
I can also use the predict function, however predict.zeroinfl does not give error estimates, and I'm unsure how to interpret the effect of the offset.
Similar papers have just put a boxplot of the original data next to a boxplot of the predictions and compared. I feel like I should be able to do better.
Below is the code and plot to create the predicted values:
MyDat$phat<-predict(ZINB, type="response")
MyDat$phat_os<-MyDat$phat/MyDat$offset
plot(phat~fac1, MyDat)
Predictions plot
Is bootstrapping the way to go? I have tried this and run into all kinds of trouble for something I'm not sure is necessary.
Thank you in advance, and please go easy on me if I'm making a silly oversight/assumption. I'm still learning, but these stats feel a bit out of my reach.
For starters, you can plot the model coefficients with their confidence intervals. The arm package has the coefplot function, but it doesn't have a method for zeroinfl models, so I've created a simple coefficient plot below using ggplot2. The predict method for zeroinfl models doesn't provide confidence intervals for predictions, but this answer to a question on CrossValidated shows how to construct bootstrapped confidence intervals for zeroinfl models.
Regarding the levels of fac1: A is the reference level, so the coefficients for the other levels are relative to fac1 = "A".
library(pscl)
library(ggplot2)
MyDat<-cbind.data.frame(fac1 = rep(c("A","B","C","D"),10),
fac2=c(rep("X",20),rep("Y",20)),
offset=c(runif(20, 50,60),runif(20,150,165)),
fac3=rep(c(rep("a1",4),rep("a2",4),rep("a3",4),rep("a4",4),rep("a5",4)),2),
Y=c(0,0,0,1,0,0,11,10,0,0,0,5,0,0,0,35,60,0,0,0,0,2,0,0,16,0,0,0,0,0,3,88,0,0,0,0,0,0,27,0))
f<-formula(Y ~ fac1 + offset(log(offset))|fac3 + fac2)
ZINB <-zeroinfl(f, dist = "negbin",link = "logit", data = MyDat)
# Extract coefficients and standard errors from model summary
coefs = as.data.frame(summary(ZINB)$coefficients$count[,1:2])
names(coefs)[2] = "se"
coefs$vars = rownames(coefs)
# Coefficient plot
ggplot(coefs, aes(vars, Estimate)) +
geom_hline(yintercept=0, lty=2, lwd=1, colour="grey50") +
geom_errorbar(aes(ymin=Estimate - 1.96*se, ymax=Estimate + 1.96*se),
lwd=1, colour="red", width=0) +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se),
lwd=2.5, colour="blue", width=0) +
geom_point(size=4, pch=21, fill="yellow") +
theme_bw()
And here's what the plot looks like.

Resources