rstanarm for Bayesian hierarchical modeling of binomial experiments

Suppose there are three binomial experiments conducted chronologically. For each experiment, I know the number of trials as well as the number of successes. To use the first two older experiments as a prior for the third experiment, I want to "fit a Bayesian hierarchical model on the two older experiments and use the posterior from that as the prior for the third experiment".
Given my available data (below), my question is: is my rstanarm code below capturing what I described above?
Study1_trial = 70
Study1_succs = 27
#==================
Study2_trial = 84
Study2_succs = 31
#==================
Study3_trial = 100
Study3_succs = 55
What I have tried in package rstanarm:
library("rstanarm")
data <- data.frame(n = c(70, 84, 100), y = c(27, 31, 55));
mod <- stan_glm(cbind(y, n - y) ~ 1, prior = NULL, data = data, family = binomial(link = 'logit'))
## can I use a beta(1.2, 1.2) as prior for the first experiment?

TL;DR: If you were directly predicting the probability of success, the model would be a Bernoulli likelihood with parameter theta (the probability of success) that could take on values between zero and one. You could use a Beta prior for theta in this case. But with a logistic regression model, you're actually modeling the log odds of success, which can take on any value from -Inf to Inf, so a prior with a normal distribution (or some other prior that can take on any real value within some range determined by the available prior information) would be more appropriate.
For a model where the only parameter is the intercept, the prior is the probability distribution for the log odds of success. Mathematically, the model is:
log(p/(1-p)) =  a
Where p is the probability of success and a, the parameter you're estimating, is the intercept, which can be any real number. If the odds of success are 1:1 (that is, p = 0.5) then a = 0. If the odds are greater than 1:1 then a is positive. If the odds are less than 1:1 then a is negative.
Since we want a prior for a, we need a probability distribution that can take on any real value. If we didn't know anything about the odds of success, we might use a very weakly informative prior like a normal distribution with, say, mean=0 and sd=10 (this is the rstanarm default), meaning that one standard deviation would encompass odds of success ranging from about 22000:1 to 1:22000! So this prior is essentially flat.
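To see how wide that is on the odds scale:
# Odds implied by one prior standard deviation (sd = 10) above/below zero
exp(c(-10, 10))
# [1] 4.539993e-05 2.202647e+04  (roughly 1:22000 and 22000:1)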
If we take your first two studies to construct the prior, we can use the probability density based on those studies and then transform it to the log odds scale:
# Possible outcomes (that is, the possible number of successes)
s = 0:(70+84)
# Probability density over all possible outcomes
dens = dbinom(s, 70+84, (27+31)/(70+84))
Assuming we'll use a normal distribution for the prior, we want the most likely probability of success (which will be the mean for the prior) and the standard deviation of the mean.
# Prior parameters
pp = s[which.max(dens)]/(70+84) # most likely probability
psd = sum(dens * (s/max(s) - pp)^2)^0.5 # standard deviation
# Convert prior to log odds scale
pp_logodds = log(pp/(1-pp))
psd_logodds = log(pp/(1-pp)) - log((pp-psd)/(1 - (pp-psd)))
c(pp_logodds, psd_logodds)
[1] -0.5039052 0.1702006
You could generate essentially the same prior by running stan_glm on the first two studies with the default (flat) prior:
prior = stan_glm(cbind(y, n-y) ~ 1,
                 data = data[1:2,],
                 family = binomial(link = 'logit'))
c(coef(prior), se(prior))
[1] -0.5090579 0.1664091
Now let's fit the model to the Study 3 data, first with the default prior and then with the prior we just generated. I've switched to a data frame of individual 0/1 outcomes, since stan_glm seems to fail when the data frame has only one row (as in data = data[3, ]).
# Default weakly informative prior
mod1 <- stan_glm(y ~ 1,
                 data = data.frame(y = rep(0:1, c(45, 55))),
                 family = binomial(link = 'logit'))
# Prior based on studies 1 & 2
mod2 <- stan_glm(y ~ 1,
                 data = data.frame(y = rep(0:1, c(45, 55))),
                 prior_intercept = normal(location = pp_logodds, scale = psd_logodds),
                 family = binomial(link = 'logit'))
For comparison, let's also generate a model with all three studies and the default flat prior. We would expect this model to give virtually the same results as mod2:
mod3 <- stan_glm(cbind(y, n - y) ~ 1,
                 data = data,
                 family = binomial(link = 'logit'))
Now let's compare the three models:
library(tidyverse)
list(`Study 3, Flat Prior` = mod1,
     `Study 3, Prior from Studies 1 & 2` = mod2,
     `All Studies, Flat Prior` = mod3) %>%
  map_df(~data.frame(log_odds = coef(.x),
                     p_success = predict(.x, type = "response")[1]),
         .id = "Model")
Model log_odds p_success
1 Study 3, Flat Prior 0.2008133 0.5500353
2 Study 3, Prior from Studies 1 & 2 -0.2115362 0.4473123
3 All Studies, Flat Prior -0.2206890 0.4450506
For Study 3 with the flat prior (row 1), the predicted probability of success is 0.55, as expected, since that's what the data says and the prior provides no additional information.
For Study 3 with a prior based on studies 1 and 2, the probability of success is 0.45. The lower value reflects the additional information contributed by the lower success rates in Studies 1 and 2. In fact, the probability of success from mod2 is essentially what you'd calculate directly from the pooled data: with(data, sum(y)/sum(n)). mod3 puts all the information into the likelihood instead of splitting it between the prior and the likelihood, but is otherwise essentially the same as mod2.
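For reference, the pooled estimate computed directly from the data:
with(data, sum(y) / sum(n))
# [1] 0.4448819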
Answer to (now deleted) comment: If all you know is the number of trials and successes and you think that a binomial probability is a reasonable model for how the data were generated, then it doesn't matter how you split up the data into "prior" and "likelihood" or whether you shuffle the order of the data. The resulting model fit will be the same.
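As a sanity check outside rstanarm, the same point falls out of conjugate Beta-Binomial updating (a sketch assuming a flat Beta(1, 1) starting prior, which is not the logistic model above but illustrates the principle):
a0 <- 1; b0 <- 1                                   # flat Beta(1, 1) prior
# Sequential: use studies 1 & 2 to form the prior, then update with study 3
a12 <- a0 + 27 + 31; b12 <- b0 + (70 - 27) + (84 - 31)
a_seq <- a12 + 55;   b_seq <- b12 + (100 - 55)
# All three studies at once
a_all <- a0 + 27 + 31 + 55
b_all <- b0 + (70 - 27) + (84 - 31) + (100 - 55)
c(a_seq, b_seq)   # 114 142
c(a_all, b_all)   # 114 142 -- identical posterior either way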

Related

Exchangeable correlations and constant variance in MCMCglmm?

I am attempting to fit a very simple model with MCMCglmm, but am getting quite stuck.
Imagine a class (30 students) gets grades for two papers during the semester, where the paper assignments are exactly the same (we don't want to model a difference in average scores between the papers, there are no "learning effects", and we can assume that the variance in grades is the same).
Let $i = 1,\dots,30$ index the students, and let $y_{i1}$ and $y_{i2}$ be the scores for student $i$'s first and second papers.
One way to model these data is with random intercepts for students to account for the correlation between each student's scores. Let $\mu_i$ be the student intercept, $\sigma$ the residual sd, and $\sigma_{\mu}$ the sd of the intercepts. Then we write (in shorthand) our random intercept model as $f(y_{ij} \mid \mu_i) = \text{Normal}(\mu_i, \sigma)$ and $f(\mu_i) = \text{Normal}(\mu, \sigma_{\mu})$.
An alternative way to write this model is to specify the residual correlation structure explicitly. That is, we would write that $\{y_{i1}, y_{i2}\}$ have a multivariate normal distribution with mean $\{\mu, \mu\}$, variance $\tau = \sigma^2 + \sigma_{\mu}^2$, and correlation $\rho = \sigma_{\mu}^2 / (\sigma^2 + \sigma_{\mu}^2)$.
To be clear, these models are mathematically equivalent, but statistical software will often have a specific implementation for each. For example we can fit the two approaches separately with nlme:
library(nlme)
library(tidyverse)
library(MCMCglmm)
df <-
  tibble(id = factor(rep(1:100, each = 20))) %>%
  mutate(paper = 1:n()) %>%
  group_by(id) %>%
  mutate(obs = 1:n(),                  # observation index within student (used below)
         mu = rnorm(1),
         y = mu + rnorm(n(), 0, 3)) %>%
  ungroup()
gls(data = df,
    model = y ~ 1,
    correlation = corCompSymm(form = ~ 1 | id))
lme(data = df, fixed = y ~ 1, random = ~ 1 | id)
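A quick way to see the equivalence numerically (a check added here, not part of the original post) is to convert the variance components from the random-intercept fit into the implied exchangeable correlation and compare it with the Rho reported by gls():
fit_gls <- gls(y ~ 1, data = df,
               correlation = corCompSymm(form = ~ 1 | id))
fit_lme <- lme(y ~ 1, random = ~ 1 | id, data = df)
vc <- VarCorr(fit_lme)                        # variance components (character matrix)
sd_mu  <- as.numeric(vc["(Intercept)", "StdDev"])
sd_res <- as.numeric(vc["Residual",    "StdDev"])
sd_mu^2 / (sd_mu^2 + sd_res^2)                # implied rho; compare with Rho from fit_gls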
It seems MCMCglmm can fit the first parameterization (random intercepts) of the model just fine.
MCMCglmm(data = df,
         fixed = y ~ 1,
         random = ~ id,
         nitt = 1000, burnin = 0, thin = 1)
However, I am not seeing a way to implement the second approach. My best attempt involves "widening" the data frame and fitting a multiple response model.
df.wide <- df %>%
  select(-paper) %>%
  pivot_wider(values_from = "y",
              names_from = "obs", names_prefix = "paper") %>%
  as.data.frame
MCMCglmm(fixed = cbind(paper1, paper2) ~ 1,
         rcov = ~ us(trait):units,
         data = df.wide)
However, (1) I am not sure that I am fitting this model correctly, (2) I am not sure how to interpret the fitted values (especially since my posterior mean covariances seem much too small) and (3) there doesn't seem to be a way to get a constant variance across traits.
p.s. I would appreciate not being told to just fit the random intercept model. I am writing some course materials, and would like students to be able to more directly compare the exchangeable correlation model with other types of correlation structures that we might use when we have more than two observations (i.e. AR, Toeplitz, etc.), and I would like my students to be able to do the comparison of the two parameterizations themselves, as I would do when I used nlme.
FOLLOW-UP: I am currently trying to fit the model with BRMS, though would still be open to any "hacks" in MCMCglmm.
model1 <- brms::brm(data = df,
                    formula = y ~ 1 + cosy(gr = id, time = obs),
                    family = "gaussian",
                    chains = 4, thin = 1, iter = 5000, warmup = 100)
Is exchangeability + equal variances the same as what I would call compound symmetry? (I guess so, since you're using corCompSymm() in nlme) ...
As far as I can tell this isn't possible (I can't rule out that there's some way to hack it with the available variance structures, but it's far from obvious ...) From ?MCMCglmm:
Currently, the only ‘variance.functions’ available are ‘idv’,
‘idh’, ‘us’, ‘cor[]’ and ‘ante[]’. ‘idv’ fits a constant
variance across all components in ‘formula’. Both ‘idh’ and
‘us’ fit different variances across each component in
‘formula’, but ‘us’ will also fit the covariances. ‘corg’
fixes the variances along the diagonal to one and ‘corgh’
fixes the variances along the diagonal to those specified in
the prior. ‘cors’ allows correlation submatrices. ‘ante[]’
fits ante-dependence structures of different order (e.g
ante1, ante2), and the number can be prefixed by a ‘c’ to
hold all regression coefficients of the same order equal. The
number can also be suffixed by a ‘v’ to hold all innovation
variances equal (e.g ‘antec2v’ has 3 parameters).
By using the us() (unstructured, what nlme would call pdSymm for "positive-definite symmetric") structure, I believe you're not constraining the correlation parameters to be all the same (i.e., violating exchangeability).
For what it's worth, one reason (other than pedagogy) to want to specify a compound-symmetric correlation matrix explicitly rather than by composing the sum of group-level and individual-level random effects would be if you wanted to model negative compound symmetry (the sum-of-random-effects approach can only model rho>0).
My guess is that you're also restricted to answers using MCMCglmm, but if "some Bayesian MCMC approach" is good enough, then you could do this via brms or (somewhat more obscurely, sort of) glmmTMB + tmbstan (although this combination does not currently use informative priors!)
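For concreteness, a rough sketch of the glmmTMB + tmbstan route (untested here, and note the assumption: as far as I recall glmmTMB's cs() is a heterogeneous compound-symmetry structure, so it estimates one common correlation but does not force equal variances):
library(glmmTMB)
library(tmbstan)
# cs() puts a compound-symmetric covariance on the repeated measures within id;
# dispformula = ~0 suppresses the separate residual variance so the structure
# is carried entirely by the cs() term
fit_tmb <- glmmTMB(y ~ 1 + cs(factor(obs) + 0 | id),
                   dispformula = ~ 0,
                   data = df)
# MCMC sampling of the same objective function (flat/improper priors by default)
fit_mcmc <- tmbstan(fit_tmb$obj, chains = 4)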

Syntax for diagonal variance-covariance matrix for non-linear mixed effects model in nlme

I am analysing routinely collected substance use data during the first 12 months of treatment in a large sample of outpatients attending drug and alcohol treatment services. I am interested in whether differing levels of methamphetamine use (no use, low use, and high use) at the outset of treatment predict different levels after a year in treatment, but the data are very irregular, with different clients measured at different times and different numbers of times during their year of treatment.
The data for the high and low use group seem to suggest that drug use at outset reduces during the first 3 months of treatment and then asymptotes. Hence I thought I would try a non-linear exponential decay model.
I started with the following nonlinear generalised least squares model using the gnls() function in the nlme package:
fitExp <- gnls(outcome ~ C*exp(-k*yearsFromStart),
               params = list(C ~ atsBase_fac, k ~ atsBase_fac),
               data = dfNL,
               start = list(C = c(nsC[1], lsC[1], hsC[1]),
                            k = c(nsC[2], lsC[2], hsC[2])),
               weights = varExp(-0.8, form = ~ yearsFromStart),
               control = gnlsControl(nlsTol = 0.1))
where outcome is the number of days of drug use in the 28 days preceding measurement, atsBase_fac is a three-level categorical predictor indicating level of amphetamine use at baseline (noUse, lowUse, and highUse), yearsFromStart is a continuous predictor indicating time from start of treatment in years (baseline = 0, max = 1), C is a parameter indicating initial level of drug use, and k is the rate of decay in drug use. The starting values of C and k are taken from nls models estimating these parameters for each group. These are the results of that model:
Generalized nonlinear least squares fit
Model: outcome ~ C * exp(-k * yearsFromStart)
Data: dfNL
AIC BIC logLik
27672.17 27725.29 -13828.08
Variance function:
Structure: Exponential of variance covariate
Formula: ~yearsFromStart
Parameter estimates:
expon
0.7927517
Coefficients:
Value Std.Error t-value p-value
C.(Intercept) 0.130410 0.0411728 3.16738 0.0015
C.atsBase_faclow 3.409828 0.1249553 27.28839 0.0000
C.atsBase_fachigh 20.574833 0.3122500 65.89218 0.0000
k.(Intercept) -1.667870 0.5841222 -2.85534 0.0043
k.atsBase_faclow 2.481850 0.6110666 4.06150 0.0000
k.atsBase_fachigh 9.485155 0.7175471 13.21886 0.0000
So it looks as if there are differences between groups in initial rate of drug use and in rate of reduction in drug use. I would like to go a step further and fit a nonlinear mixed effects model. I tried consulting Pinheiro and Bates' book accompanying the nlme package, but the only models I could find that used irregular, sparse data like mine used a self-starting function, and my model does not do that.
I tried to adapt the gnls() model to nlme like so:
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
                data = dfNL,
                fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
                random = pdDiag(yearsFromStart ~ id),
                groups = ~ id,
                start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
                weights = varExp(-0.8, form = ~ yearsFromStart),
                control = nlmeControl(optim = "optimizer"))
but I keep getting error messages, I presume through errors in the syntax specifying the random effects.
Can anyone give me some tips on how the syntax for the random effects works in nlme?
The only dataset in Pinheiro and Bates that resembled mine used a diagonal variance-covariance matrix. Can anyone fill me in on the syntax of this nlme function, or suggest a better one?
p.s. I wish I could provide a reproducible example but coming up with synthetic data that re-creates the same errors is way beyond my skills.
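For what it's worth, the usual pattern in nlme is to put the random effects on the model parameters (here C and k) rather than on a covariate, with pdDiag() making them uncorrelated. A minimal, untested sketch adapted from the gnls() call above (the starting values and variance function are carried over as assumptions):
fitNLME <- nlme(outcome ~ C * exp(-k * yearsFromStart),
                data = dfNL,
                fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
                random = pdDiag(C + k ~ 1),      # uncorrelated random effects for C and k
                groups = ~ id,
                start = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2]),
                weights = varExp(-0.8, form = ~ yearsFromStart),
                control = nlmeControl(maxIter = 200, msMaxIter = 200))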

hurdle model prediction - count vs response

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle model is the count prediction multiplied by the probability prediction, i.e., the overall response has to be smaller than or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.
Here's a similar result for a toy model (code adapted from here):
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin")
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")
Why is it that the counts are higher than the responses?
added comparison to glm.nb
Also, I was under the impression that the count part of the hurdle model should give identical predictions to a count model fitted to the positive values only. When I try that, I get completely different values. What am I missing?
library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,])
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only
The last question is the easiest to answer. The "count" part of the hurdle model is not simply a standard count model (including a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).
Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)
library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
                       subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
## hurdle zerotrunc negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality 0.02482553 0.02483015 0.01671314
Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate for the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.
As for the predictions, one can distinguish three quantities:
1. The expectation from the untruncated count part, i.e., simply exp(x'b).
2. The conditional/truncated expectation from the count part, i.e., accounting for the zero truncation: exp(x'b)/(1 - f(0)), where f(0) is the probability of 0 in that count part.
3. The overall expectation for the complete hurdle model, i.e., the probability of crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)), where g(0) is the probability of 0 in the zero hurdle part of the model.
See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above, while predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.
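To make the mapping concrete, here is a small by-hand check of the three quantities (a sketch, assuming the model m and newdata nd from the question above, and that "no" is the reference level of ski):
b     <- coef(m, model = "count")     # count-part coefficients
g0cf  <- coef(m, model = "zero")      # zero-hurdle (binomial, logit link) coefficients
theta <- m$theta["count"]             # negbin size parameter of the count part
# 1. untruncated count expectation exp(x'b)
mu <- exp(b["(Intercept)"] + b["quality"] * nd$quality)
# f(0): zero probability of the untruncated negbin count part
f0 <- dnbinom(0, mu = mu, size = theta)
# 2. zero-truncated expectation
mu_trunc <- mu / (1 - f0)
# 1 - g(0): probability of crossing the hurdle (ski = "no", so intercept only)
p_pos <- plogis(g0cf["(Intercept)"])
# 3. overall hurdle expectation
mu_resp <- p_pos * mu_trunc
cbind(count    = predict(m, newdata = nd, type = "count"),    by_hand_1 = mu,
      response = predict(m, newdata = nd, type = "response"), by_hand_3 = mu_resp)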

Error when trying to run fixed effects logistic regression

Not sure where I can get help, since this exact post was considered off-topic on StackExchange.
I want to run some regressions based on a balanced panel with electoral data from Brazil focusing on 2 time periods. I want to understand if after a change in legislation that prohibited firm donations to candidates, those individuals that depended most on these resources had a lower probability of getting elected.
I have already ran a regression like this on R:
model_continuous <- plm(percentage_of_votes ~ time + treatment + time*treatment,
                        data = dataset, model = 'fd')
In this model I used a continuous variable (% of votes) as my dependent variable. My treatment units are those that at time = 0 had no campaign contributions coming from corporations.
Now I want to change my dependent variable so that it is a binary variable indicating if the candidate was elected on that year. All of my units were elected on time = 0. How can I estimate a logit or probit model using fixed effects? I have tried using the pglm package in R.
model_binary <- pglm(dummy_elected ~ time + treatment + time*treatment,
                     data = dataset,
                     effects = 'twoways',
                     model = 'within',
                     family = 'binomial',
                     start = NULL)
However, I got this error:
Error in maxRoutine(fn = logLik, grad = grad, hess = hess, start = start, :
argument "start" is missing, with no default
Why is that happening? What is wrong with my model? Is it conceptually correct?
I want the second regression to be as similar as possible to the first one.
I have read that the clogit function from the survival package could do the job, but I don't know how to do it.
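For reference, a conditional (fixed-effects) logit via clogit would look roughly like this (a sketch only, assuming the full panel contains the dummy_elected column; the time-invariant treatment main effect is absorbed by the individual strata, so only time and the interaction are identified):
library(survival)
# Conditional logistic regression with individual as the stratum
model_clogit <- clogit(dummy_elected ~ time + time:treatment + strata(individual),
                       data = dataset)
summary(model_clogit)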
Edit:
this is what a sample dataset could look like:
dataset <- data.frame(individual = c(1,1,2,2,3,3,4,4,5,5),
                      time = c(0,1,0,1,0,1,0,1,0,1),
                      treatment = c(0,0,1,1,0,0,1,1,0,0),
                      corporate = c(0,0,0.1,0,0,0,0.5,0,0,0))
Based on the comments, I believe the logistic regression reduces to treatment and dummy_elected. Accordingly I have fabricated the following dataset:
dataset <- data.frame("treatment" = c(rep(1,1000),rep(0,1000)),
"dummy_elected" = c(rep(1, 700), rep(0, 300), rep(1, 500), rep(0, 500)))
I then ran the GLM model:
library(MASS)
model_binary <- glm(dummy_elected ~ treatment, family = binomial(), data = dataset)
summary(model_binary)
Note that the treatment coefficient is significant and the coefficients are given. The resulting probabilities are thus
P(dummy_elected = 1) = 1 / (1 + exp(-(1.37674342264577e-16 + 0.847297860386033 * treatment)))
P(dummy_elected = 0) = 1 - 1 / (1 + exp(-(1.37674342264577e-16 + 0.847297860386033 * treatment)))
Note that these probabilities are consistent with the frequencies with which I generated the data.
So for each row, take the max probability across the two equations above and that's the value for dummy_elected.
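As a quick check, the fitted probabilities reproduce the frequencies used to generate the data (roughly 0.7 for treated and 0.5 for untreated units):
predict(model_binary,
        newdata = data.frame(treatment = c(1, 0)),
        type = "response")
# approximately 0.7 and 0.5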

glm model dataset summarisation

first post, so go easy.
In the insurance world of GLMing, the classic approach is to model claims frequency and average severity. With that in mind, I built a couple of models to experiment for myself and now have a question.
Could somebody please explain how GLM handles varying levels of summarisation of a dataset, particularly with regard to error estimates?
Consider the example below. The data exhibits strong severity trends for both variables:
- A has more expensive claims than B
- Ford > Kia > Vaux > Jag
I fitted a model to an unsummarised and a summarised version of the dataset, and in both cases the GLM fitted the same parameters.
However, the GLM indicates a well-fitted model for the unsummarised data, but when I summarise and use a weighted mean, i.e. average severity, the model fits poorly. Maybe this is to be expected; after all, the unsummarised data has more points to model with. Also, it appears the weighted mean is used to indicate RELATIVE strength, so here specifying the weighted mean is pointless, since the weights are all the same.
But more fundamentally, can I not model average severity with a GLM? I know the result of fitting a GLM to an unsummarised dataset will be an average severity, but I was hoping to fit a model to already-summarised data. It appears that modelling on aggregated datasets will not give a true indication of the model fit.
Apologies if this is a stupid question; I'm not a statistician, so I don't fully understand the Hessian matrix.
Please see code below:
library(boot)
library(plyr)   # provides ddply()
dataset <- data.frame(
  Person = rep(c("A", "B"), each = 200),
  Car = rep(c("Ford", "Kia", "Vaux", "Jag"), 2, each = 50),
  Amount = c(rgamma(50, 200), rgamma(50, 180), rgamma(50, 160), rgamma(50, 140),
             rgamma(50, 100), rgamma(50, 80), rgamma(50, 60), rgamma(50, 40))
)
Agg1 <- ddply(dataset, .(Person, Car), summarise, mean=mean(Amount), length=length(Amount))
m1 <- glm(Amount ~ Person + Car, data = dataset, family = Gamma(link="log"))
m2 <- glm(mean ~ Person + Car, data = Agg1, family = Gamma(link="log"), weights=length)
summary(m1)
summary(m2)
Thanks,
Nick
Bottom line is that both models are identical - the reason the aggregated model "fits poorly" is entirely due to the reduction in degrees of freedom due to aggregation.
Before getting into why the models are identical, I should point out that this does not necessarily mean that either model is a good fit. You should run diagnostics on both, especially using:
par(mfrow=c(2,2))
plot(m1)
When you do this, you'll see that the residuals are normally distributed (which is essential), but that they follow a pattern (-, +, -), which is disturbing. I would want to understand that before declaring that this is a good model. [Admittedly, this is made-up data, but the principles apply nevertheless.]
Comparing the aggregated to base models, look at the values of the coefficients.
coef.m1 <- summary(m1)$coefficients
coef.m2 <- summary(m2)$coefficients
cbind(coef.m1[,1],coef.m2[,1])
# [,1] [,2]
# (Intercept) 5.4096980 5.4096976
# PersonB -0.9249371 -0.9249366
# CarJag -0.6144606 -0.6144602
# CarKia -0.1786556 -0.1786555
# CarVaux -0.3597925 -0.3597923
The reason you think the aggregated model is "worse" is because of the p-values, but these depend on t = coeff/se. The ratio of the se's of m2 to m1 is the same for all coefficients:
coef.m2[,2]/coef.m1[,2]
# (Intercept) PersonB CarJag CarKia CarVaux
# 7.836171 7.836171 7.836171 7.836171 7.836171
Since
se ~ sd / √df
the ratio of the se's for the two models should be approximately
se_m2 / se_m1 ≈ √((n_m1 - 1) / (n_m2 - 1))
sqrt((nrow(dataset)-1)/(nrow(Agg1)-1))
# [1] 7.549834
Frankly I'm puzzled why the ratio is not exactly equal to 7.55.
Put another way, glm(...) has no way of knowing that you aggregated your data. It thinks you are trying to fit a model with 4 parameters and an intercept to 8 data points.
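The residual degrees of freedom make the point directly:
df.residual(m1)   # 400 observations - 5 coefficients = 395
df.residual(m2)   # 8 aggregated rows - 5 coefficients = 3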
