(Mis)understanding priors in MCMCglmm

I'm new to this package; I usually use Stan for Bayesian models, but I'm working with a lot of data and hoping the models will run faster in MCMCglmm. I have read the course notes as well as the readme on GitHub. They are really helpful and well written, but I still do not understand how to set priors.
I have a simple mixed effects model, and here is some example code. (I know that others have asked this question, but I had trouble finding one with reproducible code, and those answers addressed a specific problem, whereas I'm trying to understand what the values in a prior list mean.)
library(MCMCglmm)
# inverse logit function for simulation
inv.logit <- function(x) {
  exp(x) / (exp(x) + 1)
}
# simulate some data
set.seed(123)
n <- 1000
covariates <- replicate(3, rnorm(n, 0, 0.5))
X <- cbind(rep(1, n), covariates)
colnames(X) <- c('int', 'X1', 'X2', 'X3')
coefs <- c(1, -1, -0.5, 0.3)
resp <- X %*% coefs
psi <- inv.logit(resp)
ind <- rep(1:10, floor(n/10)) # assigning individuals for the random effect; the grouping itself is not important
n_inds <- length(unique(ind))
y <- rbinom(n, 1, psi)
df <- data.frame(ind, y, X)
# priors: I really don't know what I'm doing here
prior1 <- list(R = list(V = 2, n = 1, fix = 1),
               G = list(G1 = list(V = diag(3), n = 3)))
# run the model
glmm <- MCMCglmm(y ~ X1 + X2 + X3,
                 random = ~ us(1 + ind), # intended as a random slope and random intercept for ind
                 family = "categorical",
                 data = df,
                 prior = prior1)
If I were running this in Stan, I would set a prior for each fixed effect covariate, Beta ~ normal(0, sigma_beta), with the hyperprior sigma_beta ~ gamma(2, 1). (Although I'd also be happy just setting the prior to Beta ~ normal(0, 100) or something like that.) I would use similar priors for the random effects. I understand that MCMCglmm offers a more limited set of prior distributions, but I really don't understand the notation. To be clear, I'm not especially interested in the particular priors for this example model; I'm trying to understand how one specifies them at all.
Is there somewhere I can find a definition of what exactly is meant by each of the values that goes into the prior (e.g., V, n, alpha, ...) and how these values correspond to what we would write in a full model description of the priors? Or is someone willing to explain it to simple-minded people like me? The descriptions in the course notes and on GitHub were not able to answer my question.
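To make the question concrete, below is the kind of annotated prior list I would like to be able to write, with my current (quite possibly wrong) reading of each element as comments; it is exactly the meaning of B, mu, V, nu, fix and the alpha terms that I am hoping someone can confirm or correct:
prior2 <- list(
  # fixed effects: multivariate normal with mean mu and covariance V?
  B = list(mu = rep(0, 4), V = diag(4) * 100),
  # residual (co)variance: inverse-Wishart with scale V and "degree of belief" nu?
  # fix = 1 keeps it constant, which I gather is needed for binary responses?
  R = list(V = 1, nu = 0.002, fix = 1),
  # random-effect (co)variance matrices, one list per term in `random`;
  # alpha.mu / alpha.V switch on parameter expansion?
  G = list(G1 = list(V = diag(2), nu = 2,
                     alpha.mu = rep(0, 2), alpha.V = diag(2) * 1000))
)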
Thank you!

Related

Is there a way in R to create a linear combination of coefficients from **different** lm models?

Unfortunately, I can't provide the data I'm working with, but I don't think it's necessary to solve the problem. My question is, I think, the same one this guy had 5 years ago (R: testing linear combination of coefficients from multiple regressions with plm), but back then it went unanswered, so I'll try my luck.
Essentially, as the title says, I want to test for a linear combination of coefficients. This sounds very straightforward and indeed there are several packages to do so (glht, lincom, etc.). The only problem is that all of them take as one of the arguments a single lm (or glm, etc.) model, whereas I am estimating several models and I want to compute point estimates, se, etc. from a linear combination of coefficients of different previously estimated models.
In Stata, this is done with a simple trick: first run the suest command, which takes the model names as arguments and combines them into a single object, and then run lincom as one normally would. I have been searching the internet for a while and couldn't find anything; do you know how to achieve this in R?
Here is a more concrete example of what I want to do, I hope it helps.
library(data.table)
library(estimatr)
library(foreign)
library(multcomp)
df <- data.table(y1 = runif(100, 0, 100),
                 y2 = runif(100, 0, 100),
                 y3 = runif(100, 0, 100),
                 x = runif(100, 0, 100),
                 z = runif(100, 0, 100),
                 id = round(runif(100, 0, 100)/10)*10)

lm1 <- lm_robust(formula = y1 ~ x + z,
                 data = df,
                 subset = x > 10 & x < 90,
                 clusters = id)
lm2 <- lm_robust(formula = y2 ~ x + z,
                 data = df,
                 subset = x > 10 & x < 90,
                 clusters = id)
lm3 <- lm_robust(formula = y3 ~ x + z,
                 data = df,
                 subset = x > 10 & x < 90,
                 clusters = id)
# linear combination I can for example do
summary(glht(lm1, linfct = "x + z = 0"))
# linear combination I would like to do
# 0.1*z_lm1 + 0.4*z_lm2 + 0.2*z_lm3 = 0, where the numbers before the coefficients are weights defined somewhere else
I have now read in the glht documentation that, by passing it a vector of coefficients and a covariance matrix, it can work without a fitted "model" argument and thus let me compute a linear combination of coefficients from different models. However, I still have no idea how one would go about building that covariance matrix from the results of different models.
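One possible sketch, using multcomp's parm() helper and treating the three models as independent (i.e. setting all cross-model covariances to zero, which is precisely what Stata's suest does not do), might look like this:
library(multcomp)
library(Matrix)

# Stack the coefficient vectors; c() prefixes the names with lm1., lm2., lm3.
b_all <- c(lm1 = coef(lm1), lm2 = coef(lm2), lm3 = coef(lm3))

# Block-diagonal covariance matrix. NOTE: this assumes the three models are
# independent (all cross-model covariances set to zero); because the outcomes
# share the same observations this is only an approximation, whereas suest
# actually estimates those cross-model covariances.
V_all <- as.matrix(bdiag(vcov(lm1), vcov(lm2), vcov(lm3)))
dimnames(V_all) <- list(names(b_all), names(b_all))

# Contrast matrix for 0.1*z_lm1 + 0.4*z_lm2 + 0.2*z_lm3 = 0
K <- matrix(0, nrow = 1, ncol = length(b_all),
            dimnames = list("weighted z", names(b_all)))
K[, c("lm1.z", "lm2.z", "lm3.z")] <- c(0.1, 0.4, 0.2)

# parm() wraps a coefficient vector and covariance matrix so that glht() can
# use them in place of a fitted model object
summary(glht(parm(b_all, V_all), linfct = K))
Whether zeroing the cross-model covariances is acceptable depends on the application; estimating the equations jointly (for example by stacking them into one long regression) would be one way to obtain those covariances.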

Obtaining Standardized coefficients from "rstanarm" package in R?

I was wondering if it might be possible (and perhaps recommended) to obtain standardized coefficients from stan_glm() in the rstanarm package? (did not find anything specific in the documentation)
Can I just standardize all variables as in normal regression? (see below)
Example:
library("rstanarm")
fit <- stan_glm(wt ~ vs*gear, data = mtcars)
Standardization:
design <- wt ~ vs * gear
vars <- all.vars(design)
stand.vars <- as.data.frame(lapply(mtcars[, vars], scale))
fit <- stan_glm(design, data = stand.vars)
I would not say that it is affirmatively recommended, but I would recommend that you not subtract the sample mean and divide by the sample standard deviation of the outcome because the estimation uncertainty in those two statistics will not be propagated to the posterior distribution.
Standardizing the predictors is more debatable. You can do it, but it makes doing posterior prediction with new data harder because you have to remember to subtract the old means from the new data and divide by the old standard deviations.
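If you do standardize the predictors anyway, the bookkeeping referred to above looks roughly like this (newdata is a hypothetical data frame of new observations):
# standardize the training predictors and keep the centers/scales
train_X <- scale(mtcars[, c("vs", "gear")])
centers <- attr(train_X, "scaled:center")
scales  <- attr(train_X, "scaled:scale")

# new data must be transformed with the *training* statistics, not its own
new_X <- scale(newdata[, c("vs", "gear")], center = centers, scale = scales)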
The most computationally efficient approach is to leave the variables as they are but specify the non-default argument QR = TRUE, especially if you are not going to modify the default (normal) priors on the coefficients anyway.
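A minimal sketch of that suggestion, reusing the model from the question:
fit <- stan_glm(wt ~ vs * gear, data = mtcars, QR = TRUE)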
You can then standardize the posterior coefficients after-the-fact if standardized coefficients are of interest. To do so, you can do
X <- model.matrix(fit)
sd_X <- apply(X, MARGIN = 2, FUN = sd)[-1]                   # SD of each predictor (drop the intercept)
sd_Y <- apply(posterior_predict(fit), MARGIN = 1, FUN = sd)  # SD of the outcome, one value per draw
beta <- as.matrix(fit)[ , 2:ncol(X), drop = FALSE]           # posterior draws of the slopes
b <- sweep(sweep(beta, MARGIN = 2, STATS = sd_X, FUN = `*`),
           MARGIN = 1, STATS = sd_Y, FUN = `/`)              # beta * sd_X / sd_Y, draw by draw
summary(b)
However, standardizing regression coefficients just gives the illusion of comparability across variables and says nothing about how germane a one standard deviation difference is, particularly for dummy variables. If your question is really whether manipulating this predictor or that predictor is going to make a bigger difference on the outcome variable, then simply simulate those manipulations like
PPD_0 <- posterior_predict(fit)
nd <- model.frame(fit)
nd[ , 2] <- nd[ , 2] + 1 # for example
PPD_1 <- posterior_predict(fit, newdata = nd)
summary(c(PPD_1 - PPD_0))
and repeat that process for other manipulations of interest.

I am using glmnet to address multicollinearity, and for the best lambda I want to calculate the VIF between variables

I am using glmnet, and for the best lambda I want to check the VIF between variables. Can anyone suggest how I can accomplish this?
Below is the code I am following and fielddfm is the data frame containing the independent variables:
x <- model.matrix(depvar ~ ., fielddfm)[, -1]
y <- depvar
lambda <- 10^seq(10, -2, length = 100)

ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')

cv.out <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
bestlam <- cv.out$lambda.min

ridge.pred <- predict(ridge.mod, s = bestlam, newx = x)
predict(ridge.mod, type = "coefficients", s = bestlam)
Here I get the coefficients for the different promotion vehicles, but I want to know the VIF values at the best lambda for the different independent variables.
Could you please suggest how I can achieve this?
Since a) VIF is a function of your predictors rather than your model and b) a ridge regression keeps all variables irrespective of lambda, you could get the VIFs from an arbitrarily-fitted linear model. For example:
vifs = car::vif(lm(y ~ ., data = X))
where y is your response and X is your dataframe of predictors. Note that the results are independent of the values contained in y.
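In terms of the objects in the question (assuming depvar is the response vector and fielddfm holds only the predictors, as described there), that would be something like:
library(car)
# depvar is found in the calling environment; "." expands to the columns of fielddfm
vif(lm(depvar ~ ., data = fielddfm))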
Given the above, however, it's a little dubious whether the question makes sense in the first place...

Non-linear regression with the nls2 package

I work on agglomeration effects, and I want to run a non-linear regression with the nls2 package.
I am trying to run this model in R:
dens <- runif(100)
surf <- rnorm(100, 10, 2)
zone <- seq(1, 100, 1)
donnees <- data.frame(dens, surf, zone)
attach(donnees)
donnees$salaire <- rnorm(100, 1000, 3)

mp <- rep(0, 100)
MP <- rep(0, 100)
MPfonc <- function(alpha) {
  for (i in 1:100) {
    for (j in 1:100) {
      if (j != i) {
        mp[j] <- dens[j] / (surf[i] - surf[j])^alpha
      }
    }
    MP[i] <- sum(mp)
  }
  return(MP)
}

fo <- salaire ~ const + gamma1*dens + gamma2*surf + gamma3*MPfonc(alpha)
gstart <- data.frame(const = c(-100, 100), gamma1 = c(-10, 10),
                     gamma2 = c(-10, 10), gamma3 = c(-10, 10), alpha = c(-10, 10))
fm <- nls2(fo, start = gstart, alg = "plinear-random")
It does not run, and I think the problem is alpha.
Can the nls2 function accept a function (MPfonc(alpha)) as an input?
Here is the specification of my model: [the model formulas were given as an image in the original post and are not reproduced here]
The problems are:
- set.seed should be set to make the code reproducible.
- salaire is only defined inside the data frame donnees, but donnees is never used afterwards, so salaire is not found when the formula is evaluated.
- the elements summed in the sum call inside MPfonc include NaN or NA values, so the sum becomes similarly undefined.
- for the plinear algorithms, the right-hand side of the formula must evaluate to a matrix whose columns correspond to the linear parameters.
- for the plinear algorithms, provide starting values only for the non-linear parameters (i.e. only for alpha).
- the nls2 package is never loaded; a library statement is needed.
- code posted to SO should be indented 4 spaces so that it formats properly (this was addressed in an edit to the question).
- the mathematical formulas in the question are not clear in how they relate to the problem and are missing significant elements, e.g. alpha. This needs to be cleaned up. We have assumed that MPfonc gives the desired result and have just simplified it.
The following corrects all the points and adds some minor improvements.
library(nls2)

set.seed(123)  # for reproducibility

dens <- runif(100)
surf <- rnorm(100, 10, 2)
zone <- seq(1, 100, 1)
salaire <- rnorm(100, 1000, 3)

MPfonc <- function(alpha) {
  sapply(1:100, function(i) sum((dens / (surf[i] - surf)^alpha)[-i], na.rm = TRUE))
}

fo <- salaire ~ cbind(1, dens, surf, MPfonc(alpha))
gstart <- data.frame(alpha = c(-10, 10))
fm <- nls2(fo, start = gstart, alg = "plinear-random")
giving:
> fm
Nonlinear regression model
model: salaire ~ cbind(1, dens, surf, MPfonc(alpha))
data: parent.frame()
alpha .lin1 .lin.dens .lin.surf .lin4
0.90477 1001.20905 -0.50642 -0.12269 0.00681
residual sum-of-squares: 757.6
Number of iterations to convergence: 50
Achieved convergence tolerance: NA
Note: Now that we have the starting value we can use it with nls like this:
nls(fo, start = coef(fm)["alpha"], alg = "plinear")

MCMClogit confusion

Could anybody explain to me why
simulatedCase <- rbinom(100,1,0.5)
simDf <- data.frame(CASE = simulatedCase)
posterior_m0 <<- MCMClogit(CASE ~ 1, data = simDf, b0 = 0, B0 = 1)
always results in an MCMC acceptance ratio of 0? Any explanation would be greatly appreciated!
I think your problem is the model formula, since logistic regression models have no error term. Thus your model CASE ~ 1 should be replaced by something like CASE ~ x (a predictor variable x is mandatory). Here is your example, modified:
CASE <- rbinom(100, 1, 0.5)
x <- 1:100
posterior_m0 <- MCMClogit(CASE ~ x, b0 = 0, B0 = 1)
classic_m0 <- glm(CASE ~ x, family = binomial(link = "logit"), na.action = na.pass)
So I think your problem is not related to the MCMCpack library (disclaimer: I have never used this package).
For anyone stumbling into this same problem:
It seems that the MCMClogit function cannot handle anything but B0=0 if your model only has an intercept.
If you add a covariate, then you can specify a precision just fine.
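A minimal sketch of that workaround (B0 = 0 is MCMClogit's default and corresponds to an improper uniform prior on the coefficients):
library(MCMCpack)

set.seed(1)
CASE <- rbinom(100, 1, 0.5)

# intercept-only model: leave B0 at 0 (the default) rather than supplying a finite precision
posterior_m0 <- MCMClogit(CASE ~ 1, b0 = 0, B0 = 0)
summary(posterior_m0)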
I would consider other packages (such as arm or rjags) if you really want to sample from this model. For a list of options available for Bayesian regression, see http://cran.r-project.org/web/views/Bayesian.html
