Obtain predicted probabilities from rstanarm in ordinal regression - r

How can I generate the posterior probability distribution for each outcome for each predictor in an ordinal regression?
For example, what I am looking for is this:
library(rstanarm)
fit_f <- MASS::polr(tobgp ~ agegp, data = esoph)
predict(fit_f,newdata=data.frame(agegp=factor(levels(esoph$agegp))),type = "probs")
Now with rstanarm I do:
fit <- stan_polr(tobgp ~ agegp, data = esoph, method = "logit",
                 prior = R2(0.2, "mean"), init_r = 0.1, seed = 12345)
But how do I obtain the distribution for the individual outcomes/predictors?
I do get a distribution of probabilities using posterior_epred, but I don't understand which outcome/predictor combination each value corresponds to:
posterior_epred(fit, newdata=data.frame(agegp=factor(levels(esoph$agegp))))

The easiest way to do this in rstanarm is to use the posterior_predict function to obtain posterior predictions and then calculate the proportion of predictions that fall in each outcome category by observation. In code,
PPD <- posterior_predict(fit) # uses esoph
probs <- t(apply(PPD, MARGIN = 2, FUN = table) / nrow(PPD))
The matrix called probs has rows equal to the number of observations (in esoph) and columns equal to the number of categories in tobgp, and each of its rows sums to 1.
head(probs)
0-9g/day 10-19 20-29 30+
1 0.26400 0.26250 0.22875 0.24475
2 0.25650 0.26750 0.23050 0.24550
3 0.25175 0.27975 0.22450 0.24400
4 0.25575 0.26000 0.24025 0.24400
5 0.26350 0.26625 0.23575 0.23450
6 0.28275 0.26025 0.21500 0.24200
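To get the same summary for the newdata in the question (one row per agegp level) rather than for the rows of esoph, the same recipe works with the newdata argument of posterior_predict. A minimal sketch (my addition, assuming the fit object from above); tabulating against the full set of tobgp levels guards against a category that happens to receive no draws for some row:
nd <- data.frame(agegp = factor(levels(esoph$agegp)))
PPD_new <- posterior_predict(fit, newdata = nd)
probs_new <- t(apply(PPD_new, MARGIN = 2, FUN = function(x)
  table(factor(x, levels = levels(esoph$tobgp))) / nrow(PPD_new)))
rownames(probs_new) <- levels(esoph$agegp)
probs_new  # rows = agegp levels, columns = tobgp categories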

Related

When does brms assume parameters are distributed multivariate normal?

Suppose I fit a varying-intercepts model in brms per below.
library(brms)
# Example data in "long" format:
# 50 subjects each completing 2 trials. Outcomes were recorded 2x per subject
# per trial: once when treat = 1 and once when treat = 0.
trials <- 2
subjects <- 50
N <- trials * subjects
df <- data.frame(
  outcome = rnorm(N),
  treated = rep(0:1, N),
  subject = rep(1:subjects, each = trials)
)
# Fit varying-intercepts model
mod <- brm(outcome ~ treated + (1 | subject), df,
           iter = 500, warmup = 490, chains = 1)
If we look at the brms documentation, it notes:
(...) group-level parameters u are assumed to come from a multivariate
normal distribution with mean zero and unknown covariance matrix Σ:
However, there is no such covariance matrix for the intercepts:
# empty
mod$cov_ranef
What exactly does brms assume to be multivariate normal? Is it only when coefficients are hierarchically modeled as varying across groups that there is a covariance matrix? And what if a predictor is included that varies only by group but is not modelled hierarchically?
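(Not an answer from the thread, just an illustrative sketch one could run to investigate, reusing the simulated df above: compare the group-level structure reported by VarCorr() for the intercept-only model with a model in which the slope of treated also varies by subject.)
mod_int   <- brm(outcome ~ treated + (1 | subject), df, chains = 1)
mod_slope <- brm(outcome ~ treated + (1 + treated | subject), df, chains = 1)
VarCorr(mod_int)    # a single sd(Intercept) for subject
VarCorr(mod_slope)  # sd(Intercept), sd(treated), and their correlation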

How do I extract standard errors or variation from predicted ordinal logistic regression analyses?

I am undertaking an ordinal logistic regression using the R package MASS.
For example:
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(house.plr, digits = 3)
I am using the S3 method predict() to obtain the predicted values:
test_dat <- data.frame(Infl = factor(rep("Low", 4)),
                       Cont = factor(rep("Low", 4)),
                       Type = unique(housing$Type))
predict(house.plr, test_dat, type = "p")
Low Medium High
1 0.3784493 0.2876752 0.3338755
2 0.5190445 0.2605077 0.2204478
3 0.4675584 0.2745383 0.2579033
4 0.6444840 0.2114256 0.1440905
The result is a table of predicted means for each level of Sat given the variables defined in test_dat.
How might I extract the variation around each of these means in the form of a standard error or standard deviation?
First, your predicted values are the predicted probabilities of each outcome for each observation; they are not predicted means on the response scale.
Second, you can use the marginaleffects package to get standard errors for the predicted probabilities and then calculate the confidence intervals yourself. Alternatively, you can implement a non-parametric bootstrap. I implement both below. Note that I shifted the order of the columns in the test data to match the training data.
# Packages
library(MASS)
library(marginaleffects)
library(dplyr)
# Create a test set
N <- 4
test_dat <- data.frame(
  Infl = factor(rep("Low", N)),
  Type = unique(housing$Type),
  Cont = factor(rep("Low", N))
)
# Fit ordered logistic regression model
house.plr <- polr(Sat ~ Infl + Type + Cont,
                  weights = Freq,
                  data = housing,
                  Hess = TRUE)
# Demonstrate that predict() doesn't provide any measure of variability
# for the predicted class probabilities, as shown in question
predict(house.plr, test_dat, type = "probs")
# Use the marginaleffects package to get delta method standard errors for
# each predicted probability
probs <- marginaleffects::predictions(house.plr,
                                      newdata = test_dat,
                                      type = "probs")
# Compute CIs from the standard error using normal approximation
probs$predicted - 1.96*probs$std.error
probs$predicted + 1.96*probs$std.error
# Alternatively, use non-parametric bootstrapped confidence intervals.
# note that this does not adjust the weights to a constant sum for
# each bootstrap, although it is easy to implement. You're free to
# determine how to handle the weights, including resampling based
# on the weights.
# Generate bootstrapped data.frames
set.seed(123)
sims <- 5
samples <- vector(mode = "list", length = sims)
samples <- lapply(samples, function(x){ slice_sample(housing, n = nrow(housing), replace = TRUE)})
# Fit model on each bootstrapped data.frame
models <- lapply(samples, function(x) {
  polr(Sat ~ Infl + Type + Cont,
       weights = Freq,
       data = x,
       Hess = TRUE)
})
# Get test predictions into a data.frame
probs_boot <- lapply(models, function(x) {
  marginaleffects::predictions(x,
                               newdata = test_dat,
                               type = "probs")
})
probs_boot_df <- bind_rows(probs_boot)
# Compute CIs
probs_boot_df %>%
  group_by(group, Type.x, Infl, Type.y, Cont) %>%
  summarise(ci_low = quantile(predicted, probs = 0.025),
            ci_high = quantile(predicted, probs = 0.975))

Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which makes a distinction between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome.

I have been exploring the use of the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why not, as I know that the gamma distribution cannot take on the value of zero, so I suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1),
          sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(cbind(id, disease, age, gender, cost, time))
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
                    data = subset(df, cost > 0),
                    family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
                       data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
                         family = ziGamma(link = "log"),
                         ziformula = ~ disease + gender + age + time,
                         data = df)
model_parameters(zigamma_model, exponentiate = T)
df <- df %>% predict(zigamma_model, newdata = df, type = "response") # doesn't work
# Error: no applicable method for 'predict' applied to an object of class "data.frame"
The coefficients from the gamma component of my hurdle model and the fixed-effects (conditional) component of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients in the zero-inflation component are different, and I also noticed that its z values have the opposite sign of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
The glmmTMB package can do this:
glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)
ought to do it. Maybe something in VGAM as well?
To answer the questions about coefficients and standard errors:
The change in sign of the binomial coefficients is exactly what you suspected: the difference between estimating the probability of zero (glmmTMB) vs the probability of not-zero (your/Zuur's code).
The standard errors on the binomial part of the model are close but not identical: using broom.mixed::tidy,
library(broom.mixed)
round(1 - abs(tidy(zigamma_model, component = "zi")$statistic) /
          abs(tidy(hurdle_binomial)$statistic), 3)
## [1] 0.057 0.001 0.000 0.000 0.295
6% for the intercept, up to 30% for the effect of age ...
The nearly twofold difference in the standard errors of the conditional (cost > 0) component is definitely puzzling; it holds up if we simply fit the Gamma/log-link model in glmmTMB vs glm. It's hard to know which is right, or what the gold standard should be for this case. I might distrust Wald p-values here and try to get p-values with a likelihood ratio test instead (via drop1).
In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?
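A minimal sketch of those two checks, assuming the objects from the question (this is my code, not the original answer's): refit the same Gamma/log-link conditional model with both glm and glmmTMB to compare Wald standard errors side by side, and get a likelihood ratio test for the predictor of interest via an explicit nested refit (drop1() is the more compact route mentioned above).
gamma_glm <- glm(cost ~ disease + gender + age + offset(log(time)),
                 data = subset(df, cost > 0), family = Gamma(link = "log"))
gamma_tmb <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
                     data = subset(df, cost > 0), family = Gamma(link = "log"))
# Side-by-side Wald SEs for the conditional model
cbind(glm = sqrt(diag(vcov(gamma_glm))),
      glmmTMB = sqrt(diag(vcov(gamma_tmb)$cond)))
# Likelihood ratio test for `disease`, dropping it from the conditional part only
zigamma_nodisease <- glmmTMB(cost ~ gender + age + offset(log(time)),
                             family = ziGamma(link = "log"),
                             ziformula = ~ disease + gender + age + time,
                             data = df)
anova(zigamma_model, zigamma_nodisease)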

Calculation of log likelihood function of multinomial logistic regression in R

Suppose I have the following data set
df <- data.frame(x1 = rnorm(100),                                 # predictor 1
                 x2 = rpois(100, 2.5),                            # predictor 2
                 x3 = rgeom(100, prob = 0.48),                    # predictor 3
                 y = as.factor(sample(1:3, 100, replace = TRUE))) # categorical response
If I run the multinomial logistic regression with 1 as the reference category, the estimated parameters are:
Call:
multinom(formula = y ~ ., data = df)
Coefficients:
(Intercept) x1 x2 x3
2 -0.71018723 -0.4193710 0.15820110 0.05849252
3 -0.05987773 -0.2978596 -0.08335957 0.10149408
I would like to calculate the log-likelihood value of the multinomial logistic regression using these estimated parameters.
Any help is appreciated.
This should work. The log-likelihood is just the sum of the log of the probabilities that each observation takes on its observed value. In the code below probs is an N x m matrix of probabilities for each of the N observations on each of the m categories. We can then get y from the model frame and turn it into a numeric variable which will indicate the category number. We then use cbind(1:length(y), y) to index the probability matrix. This makes an N x 2 matrix that gives for each row number (in the first column) the column number of the probs matrix that you should keep. So, probs[cbind(1:length(y), y)] creates a vector of probabilities that each observation takes on its observed y value. We can log them and then sum them to get the log-likelihood.
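In symbols (my notation, not part of the original answer): if $\pi_{ic}$ is the fitted probability that observation $i$ falls in category $c$ and $y_i$ is its observed category, the quantity being computed is
$$\ell = \sum_{i=1}^{N} \log \pi_{i,\,y_i}.$$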
df <- data.frame(x1 = rnorm(100),                                 # predictor 1
                 x2 = rpois(100, 2.5),                            # predictor 2
                 x3 = rgeom(100, prob = 0.48),                    # predictor 3
                 y = as.factor(sample(1:3, 100, replace = TRUE))) # categorical response
mod <- nnet::multinom(formula = y ~ ., data = df)
probs <- predict(mod, type="probs")
y <- as.numeric(model.response(model.frame(mod)))
indiv_ll <- log(probs[cbind(1:length(y), y)])
sum(indiv_ll)
# [1] -106.8012
logLik(mod)
# 'log Lik.' -106.8012 (df=8)
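Since the question asks for the value "using these estimated parameters", here is a short sketch (my addition, assuming mod, df, and y from the code above) that rebuilds the probability matrix directly from the coefficient matrix via the softmax instead of calling predict():
X <- model.matrix(~ x1 + x2 + x3, data = df)   # N x 4 design matrix
eta <- cbind(0, X %*% t(coef(mod)))            # linear predictors; reference category fixed at 0
probs_manual <- exp(eta) / rowSums(exp(eta))   # softmax gives the N x 3 probability matrix
sum(log(probs_manual[cbind(1:length(y), y)]))  # same value as logLik(mod)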

Get marginal effect and predicted probability for glmer model in R

I'm trying to calculate both the predicted probability values and marginal effects values (with p-values) for a categorical variable over time in a logistic regression model in R. Basically, I want to know 1) the predicted probability of the response variable (an event occurring) in each year for sample sites in one of 2 categories and 2) the average marginal effect of a site being in 1 category vs. the other in each year. I can get predicted probability values using the ggeffects package and marginal effects values from the margins package, but I haven't figured out a way to get both sets of values from a single package.
So my questions are 1) is there a package/method to get both of these sets of values, and 2) if I get the predicted probability values from ggeffects and the marginal effects values from margins, are these values compatible? Or are there differences in the ways that the packages treat the models that mean I can't assume the marginal effects from one correspond to the predicted probabilities of the other? 3) In the margins package, how can I get the average marginal effect of the interaction of two factor variables over time? And 4) how can I get margins() to work with a large dataset?
Here is some sample data:
### Make dataset
df <- data.frame(year = rep(2001:2010, each = 100),
                 state = rep(c("montana", "idaho",
                               "colorado", "wyoming", "utah"),
                             times = 10, each = 20),
                 site_id = as.factor(rep(1:100, times = 10)),
                 cat_variable = as.factor(rep(0:1, times = 5, each = 10)),
                 ind_cont_variable = rnorm(100, mean = 20, sd = 5),
                 event_occurred = as.factor(sample(c(0, 1),
                                                   replace = TRUE,
                                                   size = 1000)))
### Add dummy columns for states
library(fastDummies)
df <- dummy_cols(df,
                 select_columns = "state",
                 remove_first_dummy = TRUE)
I'm interested in the effects of the state and the categorical variable on the probability that the event occurred, and in how the effect of the state and categorical variable changed over time. Here's the model:
library(lme4)
fit_state <- glmer(event_occurred ~ ind_cont_variable +
                     cat_variable * year * state +
                     (1 | site_id),
                   data = df,
                   family = binomial(link = "logit"),
                   nAGQ = 0,
                   control = glmerControl(optimizer = "nloptwrap"))
I can use ggeffects to get the predicted probability values for each state and category combination over time:
library(ggeffects)
fit_pp_state <- data.frame(ggpredict(fit_state,
                                     terms = c("year [all]",
                                               "cat_variable",
                                               "state")))
head(fit_pp_state)
### x = year, predicted = predicted probability, group = categorical variable level, facet = state
# x predicted std.error conf.low conf.high group facet
# 1 2001 0.2835665 0.3981910 0.1535170 0.4634655 0 colorado
# 2 2001 0.5911911 0.3762090 0.4089121 0.7514289 0 idaho
# 3 2001 0.5038673 0.3719418 0.3288209 0.6779708 0 montana
# 4 2001 0.7101610 0.3964843 0.5297327 0.8420101 0 utah
# 5 2001 0.5714579 0.3747205 0.3901606 0.7354088 0 wyoming
# 6 2001 0.6788503 0.3892568 0.4963910 0.8192719 1 colorado
This is really great for visualizing the changes in predicted probability over time in the 5 states. But I can't figure out how to go from these values to estimates of marginal effects using ggeffects. Using the margins package, I can get the marginal effect of the categorical variable over time, but I'm not sure how to interpret the outputs of the two different packages together or if that's even appropriate (my first two questions). In addition, I'm not sure how to get margins to give me the marginal effect of a sample site being in each combination of categorical variable level/state over time (bringing me to my third question):
library(margins)
fit_state_me <- summary(margins(fit_state,
                                at = list(year = 2001:2010),
                                variables = "cat_variable"))
head(fit_state_me)
# factor year AME SE z p lower
# cat_variable1 2001.0000 0.0224 0.0567 0.3953 0.6926 -0.0887
# cat_variable1 2002.0000 0.0146 0.0490 0.2978 0.7659 -0.0814
# cat_variable1 2003.0000 0.0062 0.0418 0.1478 0.8825 -0.0757
# cat_variable1 2004.0000 -0.0026 0.0359 -0.0737 0.9413 -0.0731
# cat_variable1 2005.0000 -0.0117 0.0325 -0.3604 0.7186 -0.0754
# cat_variable1 2006.0000 -0.0208 0.0325 -0.6400 0.5222 -0.0845
The actual dataset I'm using is fairly large (the csv of raw data is 1.51 GB and the regression model object is 1.29 GB when I save it as a .rds file). When I try to use margins() on my data, I get an error message:
Error: cannot allocate vector of size 369.5 Gb
Any advice for getting around this issue so that I can use this function on my data?
I'd be grateful for any tips-- packages I should check out, mistakes I'm making in my code or my conceptual understanding, etc. Thank you!
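(Not an answer from the original thread, just a hedged sketch.) For questions 1 and 2, one option is the marginaleffects package used in the polr answer above: it supports glmer fits and returns both predicted probabilities and average marginal effects from one framework, so the two sets of values come from the same treatment of the model. Computing on a small representative grid (as datagrid() does here) rather than over every row of the full data should also keep memory use far below what margins() needed.
library(marginaleffects)
# Predicted probabilities on a grid of year x cat_variable x state
pp <- predictions(fit_state,
                  newdata = datagrid(year = 2001:2010,
                                     cat_variable = levels(df$cat_variable),
                                     state = unique(df$state)))
# Average marginal effect of cat_variable in each year
ame <- avg_slopes(fit_state, variables = "cat_variable", by = "year")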
