Obtaining standardized coefficients from the "rstanarm" package in R?

I was wondering if it might be possible (and perhaps recommended) to obtain standardized coefficients from stan_glm() in the rstanarm package? (I did not find anything specific in the documentation.)
Can I just standardize all variables as in normal regression? (see below)
Example:
library("rstanarm")
fit <- stan_glm(wt ~ vs*gear, data = mtcars)
Standardization:
design <- wt ~ vs*gear
vars <- all.vars(design)
stand.vars <- as.data.frame(lapply(mtcars[, vars], scale))
fit <- stan_glm(design, data = stand.vars)

I would not say that it is affirmatively recommended, but I would recommend that you not subtract the sample mean and divide by the sample standard deviation of the outcome because the estimation uncertainty in those two statistics will not be propagated to the posterior distribution.
Standardizing the predictors is more debatable. You can do it, but it makes doing posterior prediction with new data harder because you have to remember to subtract the old means from the new data and divide by the old standard deviations.
The most computationally efficient approach is to leave the variables as they are but specify the non-default argument QR = TRUE, especially if you are not going to modify the default (normal) priors on the coefficients anyway.
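For example, this is just the model from the question refit with the non-default QR argument added:
fit <- stan_glm(wt ~ vs*gear, data = mtcars, QR = TRUE)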
You can then standardize the posterior coefficients after the fact if standardized coefficients are of interest:
X <- model.matrix(fit)
sd_X <- apply(X, MARGIN = 2, FUN = sd)[-1]
sd_Y <- apply(posterior_predict(fit), MARGIN = 1, FUN = sd)
beta <- as.matrix(fit)[ , 2:ncol(X), drop = FALSE]
b <- sweep(sweep(beta, MARGIN = 2, STATS = sd_X, FUN = `*`),
           MARGIN = 1, STATS = sd_Y, FUN = `/`)
summary(b)
However, standardizing regression coefficients just gives the illusion of comparability across variables and says nothing about how germane a one standard deviation difference is, particularly for dummy variables. If your question is really whether manipulating this predictor or that predictor is going to make a bigger difference on the outcome variable, then simply simulate those manipulations like
PPD_0 <- posterior_predict(fit)
nd <- model.frame(fit)
nd[ , 2] <- nd[ , 2] + 1 # for example
PPD_1 <- posterior_predict(fit, newdata = nd)
summary(c(PPD_1 - PPD_0))
and repeat that process for other manipulations of interest.
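For example, the same pattern applied to another column of the model frame (which manipulation is actually meaningful depends on your data; this just mirrors the code above):
nd2 <- model.frame(fit)
nd2[ , 3] <- nd2[ , 3] + 1 # another predictor, for example
PPD_2 <- posterior_predict(fit, newdata = nd2)
summary(c(PPD_2 - PPD_0))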

Related

Simulating logistic regression from saved estimates in R

I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model. As I am relatively new to R, I understand I can extract these coefficients one by one through indexing, but it is difficult to "scale" this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and set up the formula. Then, I would have to develop the actual variables, but the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured this out in R. Here is the code for as far as I have gotten:
#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)
#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[,1]
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner to accommodate any number of variables and different distributions. This is not a class assignment, but a necessary piece for the research that I am producing. Any help will be greatly appreciated, and please, treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop)
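For instance, a minimal sketch using the model fitted above:
sims <- simulate(model, nsim = 10) # data frame with 10 simulated response columns (sim_1 ... sim_10)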
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
                gender = rbinom(100, prob = 0.5, size = 1),
                age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
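For instance, using the_estimates saved earlier in the question in place of coef(model) (a sketch that reuses simdat and invlink from above):
stored_coefs <- the_estimates # previously extracted coefficient vector
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% stored_coefs))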
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
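For reference, a one-liner showing what reformulate() does, using the question's variable names:
reformulate(c("gender", "age"), response = "y") # builds the formula y ~ gender + age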
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
    ## simulate returns a list (in this case, of length 1);
    ## extract the response vector
    newresp <- simulate(model)[[1]]
    newfit <- update(model, newresp ~ .)
    res[i, ] <- coef(newfit)
}
You don't have to store coefficients - you can extract/compute whatever model summaries you like (change the number of columns of res appropriately).
Let's say your data matrix, including age and gender (or whatever predictors), is X. Then you can use X on the right-hand side of your glm formula, compute xb_hat <- X %*% the_estimates (or use any other data matrix in place of X, as long as it has the same columns), and plug xb_hat into whatever inverse-link function you want.
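A minimal sketch of that idea, reusing the simulated gender and age and the_estimates from the question (the column order of X must match the order of the coefficients):
X <- cbind(1, gender, age) # design matrix with an intercept column
xb_hat <- X %*% the_estimates # linear predictor
p_hat <- 1/(1 + exp(-xb_hat)) # inverse-logit
y_sim <- rbinom(n = nrow(X), size = 1, prob = p_hat) # simulated binary outcomes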

Exchangeable correlations and constant variance in MCMCglmm?

I am attempting to fit a very simple model with MCMCglmm, but am getting quite stuck.
Imagine a class of 30 students gets grades for two papers throughout the semester, where the paper assignments are exactly the same (we don't want to model a difference in average scores between the papers, there are no "learning effects", and we can assume the variance in grades is the same).
Let $i = 1, \dots, 30$ index the students, and let $y_{i1}$ and $y_{i2}$ denote the scores for student $i$'s first and second papers.
One way to model these data is with random intercepts for students to account for the correlation between each student's scores. Let $\mu_i$ be the student intercept, $\sigma$ the residual sd, and $\sigma_{\mu}$ the sd of the intercepts. Then we write (in shorthand) our random-intercept model as $f(y_{ij} \mid \mu_i) = Normal(\mu_i, \sigma)$ and $f(\mu_i) = Normal(\mu, \sigma_{\mu})$.
An alternative way to write this model is to specify the residual correlation structure explicitly. That is, we write that $(y_{i1}, y_{i2})$ has a multivariate normal distribution with mean $(\mu, \mu)$, variance $\tau = \sigma^2 + \sigma_{\mu}^2$, and correlation $\rho = \sigma_{\mu}^2 / (\sigma^2 + \sigma_{\mu}^2)$.
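Written out, this marginal (compound-symmetric) parameterization is
$$\begin{pmatrix} y_{i1} \\ y_{i2} \end{pmatrix} \sim MVN\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix}, \begin{pmatrix} \tau & \rho\tau \\ \rho\tau & \tau \end{pmatrix} \right), \qquad \tau = \sigma^2 + \sigma_{\mu}^2, \quad \rho = \frac{\sigma_{\mu}^2}{\sigma^2 + \sigma_{\mu}^2},$$
so the off-diagonal covariance $\rho\tau$ equals $\sigma_{\mu}^2$.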
To be clear, these models are mathematically equivalent, but statistical software will often have a specific implementation for each. For example we can fit the two approaches separately with nlme:
library(nlme)
library(tidyverse)
library(MCMCglmm)
df <-
  tibble(id = factor(rep(1:100, each = 20))) %>%
  group_by(id) %>%
  mutate(paper = 1:n(),
         mu = rnorm(1),
         y = mu + rnorm(n(), 0, 3)) %>%
  ungroup()
gls(data = df,
    model = y ~ 1,
    correlation = corCompSymm(form = ~ 1 | id))
lme(data = df, fixed = y ~ 1, random = ~ 1 | id)
It seems MCMCglmm can fit the first parameterization (random intercepts) of the model just fine.
MCMCglmm(data = df,
         fixed = y ~ 1,
         random = ~ id,
         nitt = 1000, burnin = 0, thin = 1)
However, I am not seeing a way to implement the second approach. My best attempt involves "widening" the data frame and fitting a multiple response model.
df.wide <- df %>%
  select(id, paper, y) %>%
  pivot_wider(values_from = "y",
              names_from = "paper", names_prefix = "paper") %>%
  as.data.frame
MCMCglmm(fixed = cbind(paper1, paper2) ~ 1,
         rcov = ~ us(trait):units,
         data = df.wide)
However, (1) I am not sure that I am fitting this model correctly, (2) I am not sure how to interpret the fitted values (especially since my posterior mean covariances seem much too small) and (3) there doesn't seem to be a way to get a constant variance across traits.
p.s. I would appreciate not being told to just fit the random intercept model. I am writing some course materials and would like students to be able to compare the exchangeable correlation model more directly with other types of correlation structures that we might use when we have more than two observations (e.g. AR, Toeplitz, etc.), and I would like my students to be able to do the comparison of the two parameterizations themselves, as I would do in nlme.
FOLLOW-UP: I am currently trying to fit the model with brms, though I would still be open to any "hacks" in MCMCglmm.
model1 <- brms::brm(data = df,
                    formula = y ~ 1 + cosy(gr = id, time = paper),
                    family = "gaussian",
                    chains = 4, thin = 1, iter = 5000, warmup = 100)
Is exchangeability + equal variances the same as what I would call compound symmetry? (I guess so, since you're using corCompSymm() in nlme) ...
As far as I can tell this isn't possible (I can't rule out that there's some way to hack it with the available variance structures, but it's far from obvious ...) From ?MCMCglmm:
Currently, the only ‘variance.functions’ available are ‘idv’,
‘idh’, ‘us’, ‘cor[]’ and ‘ante[]’. ‘idv’ fits a constant
variance across all components in ‘formula’. Both ‘idh’ and
‘us’ fit different variances across each component in
‘formula’, but ‘us’ will also fit the covariances. ‘corg’
fixes the variances along the diagonal to one and ‘corgh’
fixes the variances along the diagonal to those specified in
the prior. ‘cors’ allows correlation submatrices. ‘ante[]’
fits ante-dependence structures of different order (e.g
ante1, ante2), and the number can be prefixed by a ‘c’ to
hold all regression coefficients of the same order equal. The
number can also be suffixed by a ‘v’ to hold all innovation
variances equal (e.g ‘antec2v’ has 3 parameters).
By using the us() (unstructured, what nlme would call pdSymm for "positive-definite symmetric") structure, I believe you're not constraining the correlation parameters to be all the same (i.e., violating exchangeability).
For what it's worth, one reason (other than pedagogy) to want to specify a compound-symmetric correlation matrix explicitly, rather than by composing the sum of group-level and individual-level random effects, would be if you wanted to model negative compound symmetry (the sum-of-random-effects approach can only model $\rho > 0$).
My guess is that you're also restricted to answers using MCMCglmm, but if "some Bayesian MCMC approach" is good enough, then you could do this via brms or (somewhat more obscurely, sort of) glmmTMB + tmbstan (although this combination does not currently use informative priors!)

(mis) understanding priors in MCMCglmm

I'm new to this package; I usually use Stan for Bayesian models, but I'm using a lot of data and hoping I can get the models to run faster in MCMCglmm. I have read the course notes as well as the readme on github. They are really helpful and great descriptions, but I still do not understand how to set priors.
I have a simple mixed effects model, and here is some example code. (I know that others have asked this question, but I had trouble finding one with reproducible code and those answers helped users with a specific question, while I'm trying to understand what the values in a prior list mean).
library(MCMCglmm)
# inverse logit function for simulation
inv.logit <- function(x) {
    exp(x)/(exp(x) + 1)
}
# simulate some data
set.seed(123)
n <- 1000
covariates <- replicate(3, rnorm(n, 0, .5))
X <- cbind(rep(1, length(covariates[,1])), covariates)
colnames(X) <- c('int', 'X1', 'X2', 'X3')
coefs <- c(1, -1, -.5, 0.3)
resp <- X %*% coefs
psi <- inv.logit(resp)
ind <- rep(1:10, floor(n/10)) # assigning individuals for random effect, but it's not important
n_inds <- length(unique(ind))
y <- rbinom(n, 1, psi)
df <- data.frame(ind, y, X)
# priors: I really don't know what I'm doing here
prior1 <- list(R = list(V = 2, n = 1, fix = 1),
               G = list(G1 = list(V = diag(3), n = 3)))
# run the model
glmm <- MCMCglmm(y ~ X1 + X2 + X3,
                 random = ~ us(1 + ind), # having random slope and random intercept for ind
                 family = "categorical",
                 data = df,
                 prior = prior1)
If I were running this in Stan, I would set priors for each fixed effect covariate: Beta ~ normal(0, sigma_beta) and the hyperprior: sigma_beta ~ gamma(2,1). (Although, I'd also be happy just setting the prior as Beta ~ normal(0,100) or something like this.) I would use similar priors for the random effects. I understand that MCMCglmm is more limited in distributions, but I really don't understand the notation. To be clear, I'm not really interested in specifying priors in this example model, I'm trying to understand how one does it.
Is there somewhere I can find a definition of what is exactly meant by each of the values that goes into the prior (e.g., V, n, alpha, ...) and how these values correspond to what we would write in a full model description of the priors? Or is someone willing to explain it to more simple-minded people like me? The descriptions in the course notes and on GitHub were not able to answer my question.
Thank you!

Extract data from glmnet output data

I am trying to do feature selection using the glmnet package. I have been able to run glmnet. However, I have a tough time understanding the output. My goal is to get the list of genes and their respective coefficients so I can rank the genes based on how relevant they are at separating my two groups of labels.
x <- manual_normalized_melt[, colnames(manual_normalized_melt) %in% sig_0_01_ROTS$Gene]
y <- cellID_reference$conditions
glmnet_l0 <- glmnet(x = as.matrix(x), y = y, family = "binomial", alpha = 1)
Any hints/instructions on how to go from here? I know that the data I want is within glmnet_l0, but I am a bit unsure how to extract it.
Additionally, does anyone know if there is a way to use the L0-norm for feature selection in R?
Thank you so much!
Here are some approaches in glmnet:
First, some data, because you did not post any (the iris data collapsed to two levels of Species):
data(iris)
x <- iris[,1:4]
y <- iris[,5]
y[y == "setosa"] <- "virginica"
y <- factor(y)
First run a cross validation model to see what is the best lambda:
library(glmnet)
model_cv <- cv.glmnet(x = as.matrix(x),
                      y = y,
                      family = "binomial",
                      alpha = 1,
                      nfolds = 5,
                      intercept = FALSE)
Here I chose 5-fold cross-validation to determine the best lambda.
To see the coefficients at the best lambda:
coef(model_cv, s = "lambda.min")
#output
#5 x 1 sparse Matrix of class "dgCMatrix"
#                      1
#(Intercept)           .
#Sepal.Length -0.7966676
#Sepal.Width   1.9291364
#Petal.Length -0.9502821
#Petal.Width   2.7113327
Here you can see no variables were dropped (or they would have . instead of a coefficient). If all the features are on the same scale (like gene expression data) you might consider adding standardize = FALSE as an argument to the glmnet call since it is by default set to TRUE. At least I would when modeling expression.
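For example, the same cross-validation call with standardization turned off (everything else as above):
model_cv <- cv.glmnet(x = as.matrix(x), y = y, family = "binomial",
                      alpha = 1, nfolds = 5, intercept = FALSE,
                      standardize = FALSE)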
To see the best lambda:
model_cv$lambda[which.min(model_cv$cvm)]
Now you can make a model with all the data:
glmnet_l0 <- glmnet(x = as.matrix(x),
                    y = y,
                    family = "binomial",
                    alpha = 1,
                    intercept = FALSE)
You can plot it on the lambda scale and add a vertical line depicting the best lambda:
plot(glmnet_l0, xvar = "lambda")
abline(v = log(model_cv$lambda[which.min(model_cv$cvm)]))
Here one can see the coefficients were hardly shrunk at all at the best lambda.
With higher-dimensional data you will see many coefficient traces go towards 0 before the best lambda kicks in, and many . entries in the coef matrix.
When using predict.glmnet, set s = model_cv$lambda[which.min(model_cv$cvm)] or it will generate predictions for all tested lambdas.
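For instance, a sketch reusing the objects above (type = "response" returns predicted probabilities for the binomial fit):
preds <- predict(glmnet_l0, newx = as.matrix(x),
                 s = model_cv$lambda[which.min(model_cv$cvm)],
                 type = "response")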
Also check this post; it contains some other relevant information.
A while back I wrapped glmnet in a package for feature selection; you can either look at the code (beginning from line 89) or download the package using devtools::install_github('mlampros/FeatureSelection'). I also explained how it works in a blog post.

R: obtain coefficients&CI from bootstrapping mixed-effect model results

The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
                 fac1 = as.factor(sample(c("A", "B", "C", "D", "E"), 30, replace = TRUE)),
                 fac2 = as.factor(sample(c("NY", "NC", "CA"), 30, replace = TRUE)),
                 x = rnorm(1:30))
The lmer model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrap, and I can successfully obtain the coefficients (intercepts) and SEs for the fixed & random effects:
mixed_boot_sum <- function(data) {
    s <- sigma(data)
    c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100,
                      type = "parametric", use.u = FALSE)
My first question is how to obtain the coefficients(slope) of each individual level of the two random effects from the bootstrapping results mixed_boot?
I have no problem extracting the coefficients(slope) from the mixed model by using the augment function from the broom package; see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with boot class objects, so I can't use the above function directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList() (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both fixed & random effects by specifying FUN as well?
Now, I am very confused about the correct specification of FUN in order to achieve my needs. Any help regarding my question would be greatly appreciated!
My first question is how to obtain the coefficients(slope) of each individual levels of the two random effects from the bootstrapping results mixed_boot ?
I'm not sure what you mean by "coefficients(slope) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level I would try
mixed_boot_coefs <- function(fit) {
    unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
                                 outer(rownames(cc), colnames(cc),
                                       function(x, y) paste0(y, x)))
mixed_boot_coefs <- function(fit) {
    unlist(lapply(coef(fit), flatten))
}
When run through bootMer/confint/boot::boot.ci, these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups, because the model assumes random variation in the intercept only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict], ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.
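For example, a minimal sketch of plugging the flattened extractor into bootMer and pulling a percentile interval with boot::boot.ci (nsim kept small here just for illustration):
mixed_boot2 <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 100, type = "parametric")
boot::boot.ci(mixed_boot2, index = 1, type = "perc") # percentile CI for the first returned element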
