Predicted probabilities based on a logit model - correct specification? (R)

I have a logit model with 4 independent variables:
logit <- glm(y ~ x1 + x2 + x3 + x4, family = binomial(), data = df)
All variables in the model are dichotomous (0,1).
I want to get the predicted probabilities for when x1=1 vs x1=0, while holding all other variables in the model constant, using the following code:
mean(predict(logit, transform(df, x1 = 1), type = "response"))
mean(predict(logit, transform(df, x1 = 0), type = "response"))
Is this the correct way to do this? I'm asking because I tried a few different logit models and compared the difference between the two values produced by the code above with the number the margins command produces:
summary(margins(logit, variables = "x1"))
For most models I tried, the difference between the two values equals the number produced by the margins command above, but for one model specification the numbers were slightly different.
Finally, when I use weights in my logit model, do I have to do anything differently when calculating the predicted probabilities, compared to when I don't include any weights?

Your approach seems reasonable. The only thing I'll note is that there are many packages that will do this automatically for you and can also estimate standard errors. One example is the marginaleffects package. (Disclaimer: I am the author.)
library(marginaleffects)
mod <- glm(am ~ vs + hp, data = mtcars, family = binomial)
# average predicted outcome for observed units with vs=1 or vs=0
predictions(
  mod,
  by = "vs")
#> type vs predicted std.error statistic p.value conf.low conf.high
#> 1 response 0 0.3333333 0.1082292 3.079883 0.0020708185 0.1212080 0.5454587
#> 2 response 1 0.5000000 0.1328955 3.762356 0.0001683204 0.2395297 0.7604703
# are the average predicted outcomes equal?
predictions(
  mod,
  by = "vs",
  hypothesis = "pairwise")
#> type term predicted std.error statistic p.value conf.low conf.high
#> 1 response 0 - 1 -0.1666667 0.1713903 -0.9724392 0.3308321 -0.5025855 0.1692522
FYI, there is a wts argument to handle weights in the predictions() function.
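For example, here is a minimal sketch with a hypothetical weight column w (made-up values, purely to illustrate the mechanics):
dat <- mtcars
dat$w <- runif(nrow(dat), 1, 5)  # hypothetical survey weights
mod_w <- glm(am ~ vs + hp, data = dat, family = binomial)
predictions(mod_w, by = "vs", wts = "w")  # weighted average predicted outcomes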
Edit: the two equivalent calls below build a counterfactual grid in which every observation is duplicated with vs set to 0 and then to 1, and average the predictions within each level (the analogue of the transform() approach in the question):
predictions(
  mod,
  newdata = datagridcf(vs = 0:1),
  by = "vs")
predictions(
  mod,
  variables = list(vs = 0:1),
  by = "vs")

Related

How to calculate confidence intervals for predictive margins/means of predicted values with a logistic regression model

I am not sure whether this is a question for Stack Overflow or Cross Validated, as it contains both an R-specific coding part and a general statistics part.
In a project we want to use predictive margins or, more generally, means of predicted values. The model used for prediction is a logistic regression model. The data come from multiple surveys conducted at different times, without clusters, and come with weights for known characteristics in the population. The data are weighted for each timespan.
As the data will grow over time, we don't want to do the prediction from the data used for modelling, but from a data frame containing the information on the population.
Thanks to this answer I know how to calculate the confidence intervals for every observation.
However, I want to get the mean probability for different groups. For the point estimates I can just calculate the mean of the probabilities, but how do I get the right confidence intervals?
It seems that marginpred and svypredmeans from the survey package by Thomas Lumley do some of the things I want, but don't allow prediction on new data.
Here is some code and data to show the approach. Please consider that in my real use case the data I use for predictions is not the same I use for modelling.
# libraries
library(dplyr)
# Data for modelling
data(mtcars)
# Get a weight column because in my real use case the data has population weights
mtcars["weight"] <- runif(nrow(mtcars), 500, 1000)
numtofac <- c("cyl", "vs", "am", "gear", "carb")
mtcars[numtofac] <- lapply(numtofac, function(x) factor(mtcars[[x]]))
mtcars["mpg20"] <- ifelse(mtcars$mpg >=20, 1, 0)
# Get prediction data (standardizes for "disp")
# In my real use case the data for prediction is not created from my data for modelling
mtcars_vs0 <- mtcars
mtcars_vs0["vs"] <- factor(0, levels=c(0,1))
mtcars_vs1 <- mtcars
mtcars_vs1["vs"] <- factor(1, levels=c(0,1))
pred_mtcars <- rbind(mtcars_vs0, mtcars_vs1)[c("vs", "disp", "weight")]
# Logistic Regression Model
glm_mpg20 <- glm(mpg20 ~ vs * disp, family = binomial(link = "logit"),
                 data = mtcars, weights = mtcars$weight)
# Prediction on logit scale
preds <- predict(glm_mpg20, newdata=pred_mtcars, type = "link", se.fit = TRUE)
#Calculate CIs for fitted values
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
# Transform from logit to propability scale
fit2 <- glm_mpg20$family$linkinv(fit)
upr2 <- glm_mpg20$family$linkinv(upr)
lwr2 <- glm_mpg20$family$linkinv(lwr)
# combine fitted values and CIs in one dataframe
fitted_mpg20 <- data.frame(propability = fit2,
                           lwr = lwr2,
                           upr = upr2)
# bind CIs and data for prediction
predicted_data <- cbind(pred_mtcars, fitted_mpg20)
# calculate mean propability for vs=0 and vs=1
mean_prop <- predicted_data %>%
  group_by(vs) %>%
  summarise(mean_propability = sum(propability * weight) / sum(weight))
Thank you very much for your help, and let me know if you need anything else from me.
You can use the predictions() function from the marginaleffects package (disclaimer: I am the author). Here is a minimal example:
library(marginaleffects)
mod <- glm(am ~ hp + mpg, data = mtcars, family = binomial)
predictions(mod, newdata = datagrid())
## rowid type predicted std.error conf.low conf.high hp mpg
## 1 1 response 0.4441406 0.1413224 0.206472 0.7104511 146.6875 20.09062
By default, the datagrid() function will create a data frame with each
variable set to its mean, but you can pick arbitrary values:
predictions(mod, newdata = datagrid(mpg = 30, hp = c(100, 120)))
## rowid type predicted std.error conf.low conf.high mpg hp
## 1 1 response 0.9999380 0.0002845696 0.6674229 1 30 100
## 2 2 response 0.9999794 0.0001046887 0.6992366 1 30 120
Or feed a data frame:
nd <- data.frame(mpg = 30, hp = c(100, 140))
predictions(mod, newdata = nd)
## rowid type predicted std.error conf.low conf.high mpg hp
## 1 1 response 0.9999380 0.0002845696 0.6674229 1 30 100
## 2 2 response 0.9999931 0.0000382223 0.7255411 1 30 140
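If what you need is the mean predicted probability per group, together with delta-method confidence intervals, you can also aggregate with the by argument; a sketch using the gear column that is already present in mtcars:
# average predicted probability within each gear group
predictions(mod, by = "gear")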

How to calculate marginal effects of logit model with fixed effects by using a sample of more than 50 million observations

I have a sample of more than 50 million observations. I estimate the following model in R:
model1 <- feglm(rejection ~ variable1 + variable1^2 + variable2 + variable3 + variable4 |
                  city_fixed_effects + year_fixed_effects,
                family = binomial(link = "logit"), data = database)
Based on the estimates from model1, I calculate the marginal effects:
mfx2 <- marginaleffects(model1)
summary(mfx2)
This line of code also calculates the marginal effects of each fixed effect, which slows down R. I only need the average marginal effects of variables 1, 2, and 3. If I instead calculate the marginal effects separately, using mfx2 <- marginaleffects(model1, variables = "variable1"), then it does not show the standard error and the p-value of the average marginal effects.
Any solution for this issue?
Both the fixest and the marginaleffects packages have made recent
changes to improve interoperability. The next official CRAN releases
will be able to do this, but as of 2021-12-08 you can use the
development versions. Install:
library(remotes)
install_github("lrberge/fixest")
install_github("vincentarelbundock/marginaleffects")
I recommend converting your fixed effects variables to factors before
fitting your models:
library(fixest)
library(marginaleffects)
dat <- mtcars
dat$gear <- as.factor(dat$gear)
mod <- feglm(am ~ mpg + mpg^2 + hp + hp^3 | gear,
             family = binomial(link = "logit"),
             data = dat)
Then, you can use marginaleffects and summary to compute average
marginal effects:
mfx <- marginaleffects(mod, variables = "mpg")
summary(mfx)
## Average marginal effects
## type Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
## 1 response mpg 0.3352 40 0.008381 0.99331 -78.06 78.73
##
## Model type: fixest
## Prediction type: response
Note that computing average marginal effects requires calculating a
distinct marginal effect for every single row of your dataset. This can
be computationally expensive when your data includes millions of
observations.
Instead, you can compute marginal effects for specific values of the
regressors using the newdata argument and the typical function.
Please refer to the marginaleffects documentation for details on
those:
marginaleffects(mod,
                variables = "mpg",
                newdata = typical(mpg = 22, gear = 4))
## rowid type term dydx std.error hp mpg gear predicted
## 1 1 response mpg 1.068844 50.7849 146.6875 22 4 0.4167502
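Note that in more recent marginaleffects releases the typical() function was renamed datagrid(), and marginaleffects() itself was superseded by slopes(); assuming a current version, the equivalent call would look roughly like this:
# sketch, assuming a recent marginaleffects version
slopes(mod, variables = "mpg", newdata = datagrid(mpg = 22, gear = 4))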

How do I find the p-value for my random effect in my linear mixed effect model?

I am running the following line of code in R:
model = lme(divedepth ~ oarea, random=~1|deployid, data=GDataTimes, method="REML")
summary(model)
and I am seeing this result:
Linear mixed-effects model fit by REML
Data: GDataTimes
AIC BIC logLik
2512718 2512791 -1256352
Random effects:
Formula: ~1 | deployid
(Intercept) Residual
StdDev: 9.426598 63.50004
Fixed effects: divedepth ~ oarea
Value Std.Error DF t-value p-value
(Intercept) 25.549003 3.171766 225541 8.055135 0.0000
oarea2 12.619669 0.828729 225541 15.227734 0.0000
oarea3 1.095290 0.979873 225541 1.117787 0.2637
oarea4 0.852045 0.492100 225541 1.731447 0.0834
oarea5 2.441955 0.587300 225541 4.157933 0.0000
[snip]
Number of Observations: 225554
Number of Groups: 9
However, I cannot find the p-value for the random effect deployid. How can I see this value?
As stated in the comments, there is material about significance tests of random effects in the GLMM FAQ. You should definitely consider:
why you are really interested in the p-value (it's not that it is never of interest, but it's an unusual case)
the fact that the likelihood ratio test is extremely conservative for testing variance parameters (in this case it gives a p-value that is 2x too large)
Here's an example that shows that the lme() fit and the corresponding lm() model without the random effect have commensurate log-likelihoods (i.e., they're computed in a comparable way) and can be compared with anova():
Load packages and simulate data (with zero random effect variance)
library(lme4)
library(nlme)
set.seed(101)
dd <- data.frame(x = rnorm(120), f = factor(rep(1:3, 40)))
dd$y <- simulate(~ x + (1 | f),
                 newdata = dd,
                 newparams = list(beta = rep(1, 2),
                                  theta = 0,
                                  sigma = 1))[[1]]
Fit models (note that you cannot compare a model fitted with REML to a model without random effects).
m1 <- lme(y ~ x , random = ~ 1 | f, data = dd, method = "ML")
m0 <- lm(y ~ x, data = dd)
Test:
anova(m1, m0)
## Model df AIC BIC logLik Test L.Ratio p-value
## m1 1 4 328.4261 339.5761 -160.2131
## m0 2 3 326.4261 334.7886 -160.2131 1 vs 2 6.622332e-08 0.9998
Here the test correctly identifies that the two models are essentially identical and gives a p-value of (effectively) 1.
If you use lme4::lmer instead of lme you have some other, more accurate (but slower) options: simulation-based tests via the RLRsim package or pbkrtest::PBmodcomp. See the GLMM FAQ.
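For illustration, a sketch of the simulation-based route on the same simulated data, assuming you refit the model with lmer (RLRsim::exactRLRT() tests the single random-intercept variance):
library(RLRsim)
m1_lmer <- lmer(y ~ x + (1 | f), data = dd, REML = TRUE)
exactRLRT(m1_lmer)  # p-value from a simulated null distribution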

How can I get the p-value for whether my binomial regression is significantly different from a null model in R?

I have a dataset demos_mn of demographics and an outcome variable. There are 5 variables of interest, so that my glm and null models looks like this:
# binomial model
res.binom <- glm(var.bool ~ var1 + var2 * var3 + var4 + var5,
                 data = demos_mn, family = "binomial")
# null model
res.null <- glm(var.bool ~ 1,
                data = demos_mn, family = "binomial")
# calculate marginal R2
print(r.squaredGLMM(res.binom))
# show p value
print(anova(res.null, res.binom))
That is my workflow for mixed GLMs, but for my binomial model I do not get a p-value for the overall model, only for the predictors. I'm hoping someone can enlighten me.
I did have some success using glmer for a repeated-measures version of the model; however, that unfortunately meant dropping some key variables that were not measured repeatedly.
Perhaps you forgot test = "Chisq"? From ?anova.glm:
test: a character string, (partially) matching one of "Chisq", "LRT", "Rao", "F" or "Cp". See stat.anova.
example("glm") ## to set up / fit the glm.D93 model
null <- update(glm.D93, . ~ 1)
anova(glm.D93, null, test="Chisq")
Analysis of Deviance Table
Model 1: counts ~ outcome + treatment
Model 2: counts ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 4 5.1291
2 8 10.5814 -4 -5.4523 0.244
test="Chisq" is poorly named: it's a likelihood ratio test, note it's an asymptotic test [relies on a large sample size]. For GLMs with an adjustable scale parameter (Gaussian, Gamma, quasi-likelihood) you would use test="F".

Simple slopes for interaction in Negative Binomial regression

I am looking to obtain parameter estimates for one predictor when constraining another predictor to specific values in a negative binomial GLM, in order to better explain an interaction effect.
My model is something like this:
model <- glm.nb(outcome ~ IV * moderator + covariate1 + covariate2)
Because the IV:moderator term is significant, I would like to obtain parameter estimates for IV at specific values of moderator (i.e., at +1 and -1 SD). I can obtain slope estimates for IV at various levels of moderator using the visreg package but I don't know how to estimate SEs and test statistics. moderator is a continuous variable so I can't use the multcomp package and other packages designed for finding simple slopes (e.g., pequod and QuantPsyc) are incompatible with negative binomial regression. Thanks!
If you want to constrain one of the coefficients in your regression, consider taking that variable out of the model and adding it in as an offset. For example, with this sample data:
dd <- data.frame(
  x1 = runif(50),
  x2 = runif(50)
)
dd <- transform(dd,
  y = 5 * x1 - 2 * x2 + 3 + rnorm(50)
)
We can run a model with both x1 and x2 as predictors:
lm(y ~ x1 + x2, dd)
# Call:
# lm(formula = y ~ x1 + x2, data = dd)
#
# Coefficients:
# (Intercept) x1 x2
# 3.438438 4.135162 -2.154770
Or say we know that the coefficient of x2 is -2. Then, rather than estimating x2, we can put that term in as an offset:
lm(y ~ x1 + offset(-2 * x2), dd)
# Call:
# lm(formula = y ~ x1 + offset(-2 * x2), data = dd)
#
# Coefficients:
# (Intercept) x1
# 3.347531 4.153594
The offset() option basically just creates a covariate whose coefficient is fixed at 1. Even though I've demonstrated with lm, the same method works for glm.nb and many other regression models.
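A minimal sketch of the same trick with glm.nb, using made-up count data (the offset pins the coefficient of x2 at -1 instead of estimating it):
library(MASS)
set.seed(1)
dd2 <- data.frame(x1 = runif(100), x2 = runif(100))
dd2$y <- rnbinom(100, mu = exp(1 + 2 * dd2$x1 - 1 * dd2$x2), size = 2)
glm.nb(y ~ x1 + offset(-1 * x2), data = dd2)  # x2 coefficient fixed at -1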
