Calculating and indexing mcmc chains in coda - r

There are two things I need to do. Firstly I would like to be able to create new variables in a coda mcmc object that have been calculated from existing variables so that I can run chain diagnostics on the new variable. Secondly I would like to be able to index single variables in some of the coda plot functions while still viewing all chains.
Toy data. Bayesian t-test on the sleep data using JAGS and rjags.
data(sleep)
# read in data
y <- sleep$extra
x <- as.numeric(as.factor(sleep$group))
nTotal <- length(y)
nGroup <- length(unique(x))
mY <- mean(y)
sdY <- sd(y)
# make dataList
dataList <- list(y = y, x = x, nTotal = nTotal, nGroup = nGroup, mY = mY, sdY = sdY)
# model string
modelString <- "
model{
for (oIdx in 1:nTotal) {
y[oIdx] ~ dnorm(mu[x[oIdx]], 1/sigma[x[oIdx]]^2)
}
for (gIdx in 1:nGroup) {
mu[gIdx] ~ dnorm(mY, 1/sdY)
sigma[gIdx] ~ dunif(sdY/10, sdY*10)
}
}
"
writeLines(modelString, con = "tempModel.txt")
# chains
# 1. adapt
jagsModel <- jags.model(file = "tempModel.txt",
data = dataList,
n.chains = 3,
n.adapt = 1000)
# 2. burn-in
update(jagsModel, n.iter = 1000)
# 3. generate
codaSamples <- coda.samples(model = jagsModel,
variable.names = c("mu", "sigma"),
thin = 15,
n.iter = 10000*15/3)
Problem one
If I convert the coda object to a dataframe I can calculate the difference between the estimates for the two groups and plot this new variable, like so...
df <- as.data.frame(as.matrix(codaSamples))
names(df) <- gsub("\\[|\\]", "", names(df), perl = T) # remove brackets
df$diff <- df$mu1 - df$mu2
ggplot(df, aes(x = diff)) +
geom_histogram(bins = 100, fill = "skyblue") +
geom_vline(xintercept = mean(df$diff), colour = "red", size = 1, linetype = "dashed")
...but how do I get a traceplot? I can get one for existing variables within the coda object like so...
traceplot(codaSamples[[1]][,1])
...but I would like to be able to get them for the the new diff variable.
Problem Two
Which brings me to the second problem. I would like to be able to get a traceplot (among other things) for individual variables. As I have shown above I can get them for a single variable if I only want to see one chain but I'd like to see all chains. I can see all chains for all variables in the model with the simple
plot(codaSamples)
...but what if I don't want or need to see all variables? What if I just want to see the trace and/or desnity plots for one, or even two, variables (but not all variables) but with all chains in the plot?

Related

Kfold CV in brms

I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob , type = 'l')
sim_data <-
expand_grid(fish_id = seq_len(n_fish_per_length),
length = length) %>%
mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
mutate(is_carp = rbinom(n = n(), size = 1, prob= prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method = "glm", formula = y ~ x + I(x^2),
method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
data=sim_data,
family = bernoulli(link = "logit"),
warmup = 2500,
iter = 5000,
chains = 4,
inits= "0",
cores=4,
seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to used stratified kfold cv to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error. Likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing/misinterpreting something here? I'm assuming that there is a very simple solution here that I am overlooking but appreciate any thoughts.
After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits is a list of the model fitting results and the observations used in the test data set (omitted) for each fold.
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for(i in 1:10){
print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
and it's easy to then compare these proportions with the proportion of observations where is_carp = 1 from the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891

sampling from posterior predictive distribution (stan vs inla)

I'm trying to implement functions from bayesplot package on a INLA object and a little unsure of how to draw from the posterior predictive distribution. I think I almost have it but rstan draws are more variable than the INLA ones.
In rstan, using the simplified example from bayesplot vignette I can:
library(bayesplot)
library(ggplot2)
library(rstanarm)
library(ggpubr)
library(tidyverse)
#rstan model set up
roaches$roach100 <- roaches$roach1 / 100 # pre-treatment number of roaches (in 100s)
fit_poisson <- stan_glm(y ~ roach100 + treatment + senior, offset = log(exposure2), family = poisson(link = "log"), data = roaches, seed = 1111, refresh = 0)
#In order to use the PPC functions from the bayesplot package we need a vector y of outcome values:
y <- roaches$y
#and a matrix yrep of draws from the posterior predictive distribution,
yrep_poisson <- posterior_predict(fit_poisson, draws = 500)
#then plot:
p1 <- bayesplot::ppc_dens_overlay(y, yrep_poisson[1:50, ])
p1
I want to replicate that plot on a INLA object. According to the bayesplot vignette you can do this as they have provided code to define a simple pp_check method that creates fitted model objects of class e.g. foo:
pp_check.foo <- function(object, type = c("multiple", "overlaid"), ...) {
type <- match.arg(type)
y <- object[["y"]]
yrep <- object[["yrep"]]
stopifnot(nrow(yrep) >= 50)
samp <- sample(nrow(yrep), size = ifelse(type == "overlaid", 50, 5))
yrep <- yrep[samp, ]
if (type == "overlaid") {
ppc_dens_overlay(y, yrep, ...)
} else {
ppc_hist(y, yrep, ...)
}
}
To use pp_check.foo we can just make a list with y and yrep components and give it class foo:
x <- list(y = rnorm(200), yrep = matrix(rnorm(1e5), nrow = 500, ncol = 200))
class(x) <- "foo"
#create plot above:
pp_check(x, type = "overlaid")
INLA
#create same model but in inla:
library(INLA)
fit_poisson_inla <- inla(y ~ roach100 + treatment + senior, offset = log(exposure2), data = roaches,
control.predictor = list(compute = T),
family = "poisson")
inla_object_name$marginals.fitted.values returns a posterior predictive distribution for each y:
fit_poisson_inla$marginals.fitted.values
#so to get distribution for first oberservation:
fitted.Predictor.1 <- fit_poisson_inla$marginals.fitted.values[[1]]
I think repeatedly sampling from this would give me what I need but there are only 75 values (dim(fitted.Predictor.1) per observation used to create this distribution when in reality I would want to be sampling from a full range of values. I think we can do this (section 4.3 here) by using inla.tmarginal using linear predictor:
fitted_dist <- fit_poisson_inla$marginals.linear.predictor
#should i have used "inla.rmarginal(n, marginal)"?
marginal_dist <- lapply(fitted_dist, function(y) inla.tmarginal(function(x) {exp(x)}, y)) %>% map(~ as.data.frame(.) %>% rename(., xx = x))
#resample 500 times
yrep_poisson_inla <- as.matrix(bind_rows(rerun(500, lapply(marginal_dist, function(x) sample(x$xx, 1)) %>% as.data.frame())))
#convert to class foo for pp_check
x <- list(y = y, yrep = yrep_poisson_inla[1:50, ])
class(x) <- "foo"
p2 <- pp_check(x, type = "overlaid")
#plot
ggarrange(p1, p2, ncol = 1, nrow = 2, labels = c("rstan", "inla sample"))
My question is how do I correctly get a matrix of draws from the posterior predictive distribution from this inla (fit_poisson_inla) object to pass into pp_check? yrep_poisson produces discrete values while yrep_poisson_inla produces continuous values. There is a lot more variation in the rstan draws than INLA (second plot). Is what I have done correct and this is just some sampling issue or is it an artifact of the different methods? In more complicated examples the differences could be substantial.
Thanks

How to cluster standard error in clubSandwich's vcovCR()?

I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.

ggplot equivalent of gelman.plot MCMC diagnostic in r

I am creating a series of MCMC diagnostic plots in r using ggplot. I realize there is already a package available in gg for MCMC plotting, but much of this is for my own education as well as practical use. One thing I can't seem to figure out is how to generate the gelman.plot in a ggplot framework.
The gelman.diag function only returns a simple data point and I would like to recreate the complete running chart as shown in gelman.plot.
Is anyone familiar with the algorithmic structure of the gelman potential scale reduction factor and/or a means to port its output to ggplot?
Thank you!
You haven't provided a reproducible example, so I've used the example here. We need the object called combinedchains from that example. In order to avoid cluttering the answer, I've put the code for that at the end of this post.
Now we can run gelman.plot on combined.chains. This is the plot we want to duplicate:
library(coda)
gelman.plot(combined.chains)
To create a ggplot version, we need to get the data for the plot. I haven't done MCMC before, so I'm going to let gelman.plot generate the data for me. For your actual use case, you can probably just generate the appropriate data directly.
Let's look at what gelman.plot is doing: We can see the code for that function by typing the bare function name in the console. A portion of the function code is below. The ... show where I've removed sections of the original code for brevity. Note the call to gelman.preplot, with the output of that function stored in y. Note also that y is returned invisibly at the end. y is a list containing the data we need to create a gelman.plot in ggplot.
gelman.plot = function (x, bin.width = 10, max.bins = 50, confidence = 0.95,
transform = FALSE, autoburnin = TRUE, auto.layout = TRUE,
ask, col = 1:2, lty = 1:2, xlab = "last iteration in chain",
ylab = "shrink factor", type = "l", ...)
{
...
y <- gelman.preplot(x, bin.width = bin.width, max.bins = max.bins,
confidence = confidence, transform = transform, autoburnin = autoburnin)
...
return(invisible(y))
}
So, let's get the data that gelman.plot returns invisibly and store it in an object:
gp.dat = gelman.plot(combinedchains)
Now for the ggplot version. First, gp.dat is a list and we need to convert the various parts of that list into a single data frame that ggplot can use.
library(ggplot2)
library(dplyr)
library(reshape2)
df = data.frame(bind_rows(as.data.frame(gp.dat[["shrink"]][,,1]),
as.data.frame(gp.dat[["shrink"]][,,2])),
q=rep(dimnames(gp.dat[["shrink"]])[[3]], each=nrow(gp.dat[["shrink"]][,,1])),
last.iter=rep(gp.dat[["last.iter"]], length(gp.dat)))
For the plot, we'll melt df into long format, so that we can have each chain in a separate facet.
ggplot(melt(df, c("q","last.iter"), value.name="shrink_factor"),
aes(last.iter, shrink_factor, colour=q, linetype=q)) +
geom_hline(yintercept=1, colour="grey30", lwd=0.2) +
geom_line() +
facet_wrap(~variable, labeller= labeller(.cols=function(x) gsub("V", "Chain ", x))) +
labs(x="Last Iteration in Chain", y="Shrink Factor",
colour="Quantile", linetype="Quantile") +
scale_linetype_manual(values=c(2,1))
MCMC example code to create the combinedchains object (code copied from here):
trueA = 5
trueB = 0
trueSd = 10
sampleSize = 31
x = (-(sampleSize-1)/2):((sampleSize-1)/2)
y = trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd)
likelihood = function(param){
a = param[1]
b = param[2]
sd = param[3]
pred = a*x + b
singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T)
sumll = sum(singlelikelihoods)
return(sumll)
}
prior = function(param){
a = param[1]
b = param[2]
sd = param[3]
aprior = dunif(a, min=0, max=10, log = T)
bprior = dnorm(b, sd = 5, log = T)
sdprior = dunif(sd, min=0, max=30, log = T)
return(aprior+bprior+sdprior)
}
proposalfunction = function(param){
return(rnorm(3,mean = param, sd= c(0.1,0.5,0.3)))
}
run_metropolis_MCMC = function(startvalue, iterations) {
chain = array(dim = c(iterations+1,3))
chain[1,] = startvalue
for (i in 1:iterations) {
proposal = proposalfunction(chain[i,])
probab = exp(likelihood(proposal) + prior(proposal) - likelihood(chain[i,]) - prior(chain[i,]))
if (runif(1) < probab){
chain[i+1,] = proposal
}else{
chain[i+1,] = chain[i,]
}
}
return(mcmc(chain))
}
startvalue = c(4,2,8)
chain = run_metropolis_MCMC(startvalue, 10000)
chain2 = run_metropolis_MCMC(startvalue, 10000)
combinedchains = mcmc.list(chain, chain2)
UPDATE: gelman.preplot is an internal coda function that's not directly visible to users. To get the function code, in the console type getAnywhere(gelman.preplot). Then you can see what the function is doing and, if you wish, construct your own function to return the appropriate diagnostic data in a form more suitable for ggplot.

R - Splitting Data, regression and applying equation to new split data set

I have a large data set that has older and newer data. I created two data frames, EarlyYears with the older data and LaterYears with the new data, so they have the same columns.
What I want to do is regress the data from Early years to determine an equation and apply it to the Later Years to test the equation's strength - A and B are constants, Input is what I am testing - I change it for different runs of the code - and Dummy is 1 is there is no data for the input. However, I want to split both the EarlyYears and LaterYears data by quintiles of one of the variables, and apply the equation found in quintile 1 of EarlyYears to data from LaterYears that is in quintile 1. I am fairly new at R, and so far have:
Model<-data.frame(Date = rep(c("3/31/09","3/31/11"),each = 20),
InputRating = rep(c(1:5), 8), Dummy = rep(c(rep(0,9),1),4),
Y = rep(1,3,5,7,11,13,17,19), A = 1:40,B = 1:40*3+7)
newer<-as.numeric(grep("/11",Model$Date))
later<-as.numeric(grep("/11",Model$Date,invert = TRUE))
LaterYears<-Model[newer,]
EarlyYears<-Model[later,]
newModel<-EarlyYears
DataSet.Input<-data.frame(Date = newModel$Date, InputRating = newModel$InputRating,
Dummy = newModel$Dummy, Y = newModel$Y, A = newModel$A,B = newModel$B)
quintiles<-quantile(DataSet.Input$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
VarQuint<-findInterval(DataSet.Input$A,quintiles,rightmost.closed=TRUE)+1L
regressionData<-do.call(rbind,lapply(split(DataSet.Input,VarQuint),
FUN = function(SplitData) {
SplitRegression<-lm(Y ~ A + B + InputRating + Dummy, data = SplitData, na.action = na.omit)
c(coef.Intercept = coef(summary(SplitRegression))[1],
coef.A = coef(summary(SplitRegression))[2],
coef.B = coef(summary(SplitRegression))[3],
coef.Input = coef(summary(SplitRegression))[4],
coef.Dummy= coef(summary(SplitRegression))[5])
}))
i = 0
quintiles.LY<-quantile(LaterYears$A,probs=c(0.2,0.4,0.6, 0.8, 1.0))
Quint.LY<-findInterval(LaterYears$A,quintiles,rightmost.closed=TRUE)+1L
LaterYears$ExpectedValue <-apply(split(LaterYears,Quint.LY),1,
FUN = function(SplitData) {
i=i+1
regressionData[i,1]+regressionData[i,2]*SplitData$A +
regressionData[i,3]*SplitData$B + regressionData[i,4]*SplitData$Input +
regressionData[i,5]*SplitData$Dummy
})
The first part works great to get the data in regressionData. I want this results of applying the equation to be held in a column within the LaterYears dataset, but I get an error -
Error in apply(split(LaterYears, Quint.LY), 1, FUN = function(SplitData) { :
dim(X) must have a positive length
when running this with apply, and blank when running with lapply which is what I originally tried.
Any help with how to fix this would be greatly appreciated!
Thanks!
Perhaps something like this, using predict would be better. It doesn't work very well for your example data but it may work on the real data.
# by, splits a dataset by a factor
regressionData <- by(DataSet.Input,VarQuint,
function(d) {
lm1 <- lm(Y ~ A + B + InputRating + Dummy, d)
})
quintiles.LY<-quantile(LaterYears$A,probs=seq(0,1,0.2))
Quint.LY<-findInterval(LaterYears$A,quintiles,rightmost.closed=TRUE)+1L
LaterYearsPredict <- split(LaterYears,Quint.LY)
# lapply's arguments can be anything that is a sequence
LaterYears$ExpectedValue <- unlist(lapply(1:length(LaterYearsPredict),
function(x)
predict(regressionData[[x]],LaterYearsPredict[[x]])
))

Resources