Extracting Mean Parameter Estimates from Stan Output Table - stan

I understand how to extract chains from a Stan model but I was wondering if there was any quick way to extract the values displayed on the default Stan output table.
Here is some toy data
# simulate linear model
a <- 3 # intercept
b <- 2 # slope
# we can have both the predictor and the noise vary
x <- rnorm(28, 0, 1)
eps <- rnorm(28, 0, 2)
y <- a + b*x + eps
Which when we analyse
mod <- lm(y ~ x, df)
We can extract coefficients from
mod$coefficients
# (Intercept) x
# 3.355967 2.151597
I wondered if there is any way to do the equivalent with a Stan output table
# Step 1: Make List
data_reg <- list(N = 28, x = x, y = y)
# Step 2: Create Model String
write("
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
vector[N] mu;
sigma ~ cauchy(0, 2);
beta ~ normal(0,10);
alpha ~ normal(0,100);
for ( i in 1:N ) {
mu[i] = alpha + beta * x[i];
}
y ~ normal(mu, sigma);
}
", file = "temp.stan")
# Step 3: Generate MCMC Chains
fit1 <- stan(file = "temp.stan",
data = data_reg,
chains = 2,
warmup = 1000,
iter = 2000,
cores = 2,
refresh = 1000)
Now, when we call the model
fit1
# Output
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# alpha 3.33 0.01 0.40 2.57 3.06 3.33 3.59 4.13 1229 1
# beta 2.14 0.01 0.40 1.37 1.89 2.14 2.40 2.98 1470 1
# sigma 1.92 0.01 0.27 1.45 1.71 1.90 2.09 2.51 1211 1
# lp__ -31.92 0.05 1.30 -35.27 -32.50 -31.63 -30.96 -30.43 769 1
Is there any way to index and extract elements from the Output table displayed above?

If you only want means, then the get_posterior_mean function will work. Otherwise, you assign the result of print(fit1) or summary(print1) to an object, you can extract stuff from that object, but it is probably better to just do as.matrix(fit1) or as.data.frame(fit1) and calculate whatever you want yourself on the resulting columns.

Related

Fixing call notice printed after an lm() like function in R

I using an lm() like function called robu() from library robumeta within my own function foo.
However, I'm manipulating the formula argument such that when it is missing the default formula would be: formula(dint~1) or else any formula that user defines.
It works fine, however, in the output of foo the printed formula call always is: Model: missing(f) if formula(dint ~ 1) regardless of what formula is inputted in the foo.
Can I correct this part of output so that it only shows the exact formula used? (see below examples)
dat <- data.frame(dint = 1:9, SD = 1:9*.1,
time = c(1,1,2,3,4,3,2,4,1),
study.name = rep(c("bob", "jim", "jon"), 3))
library(robumeta)
# MY FUNCTION:
foo <- function(f, data){
robu(formula = if(missing(f)) formula(dint~1) else formula(f), data = data, studynum = study.name, var = SD^2)
}
# EXAMPLES OF USE:
foo(data = dat) ## HERE I expect: `Model: dint ~ 1`
foo(dint~as.factor(time), data = dat) ## HERE I expect: `Model: dint ~ time`
One option is to update the 'ml' object
foo <- function(f, data){
fmla <- if(missing(f)) {
formula(dint ~ 1)
} else {
formula(f)
}
model <- robu(formula = fmla, data = data, studynum = study.name, var = SD^2)
model$ml <- fmla
model
}
-checking
foo(data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ 1
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 96.83379
Tau.sq = 9.985899
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 4.99 0.577 8.65 2 0.0131 2.51 7.48 **
---
Signif. codes: < .01 *** < .05 ** < .10 *
---
Note: If df < 4, do not trust the results
foo(dint~ as.factor(time), data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ as.factor(time)
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 97.24601
Tau.sq = 11.60119
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 3.98 2.50 1.588 2.00 0.253 -6.80 14.8
2 as.factor.time.2 1.04 4.41 0.236 1.47 0.842 -26.27 28.3
3 as.factor.time.3 1.01 1.64 0.620 1.47 0.617 -9.10 11.1
4 as.factor.time.4 2.52 2.50 1.007 2.00 0.420 -8.26 13.3
---
Signif. codes: < .01 *** < .05 ** < .10 *

STAN in R: dimensions error in linear regression

Below you can find a simple example of linear regression with 2 predictors. It works well. However, when having only 1 predictor (see second script), the following error message appears:
Exception: mismatch in number dimensions declared and found in context; processing stage=data initialization; variable name=x; dims declared=(20,1); dims found=(20)
The problem is that a matrix with 1 row is automatically transformed in a vector and therefore, doesn't match with the declared dimensions. One solution would be to declare xas a vector, but the problem is that I'm running the same scripts with different number of predictors (could be 1 or more).
STAN script:
write("// Stan model for simple linear regression
data {
int<lower=0> N; // number of data items
int<lower=0> K;// number of predictors
matrix[N, K] x;// predictor matrix
vector[N] y;// outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // likelihood
}", "ex_dimension.stan")
R script with 2 predictors (working):
N=20
K=2
x1=1:N+rnorm(N,0,0.5)
x2=rnorm(N,2,1)
x=cbind(x1,x2)
a=2
b=3
y=a*x1+b*x2+rnorm(N,0,1)
stan_data=list(N=N,
K=K,
x=x,
y=y)
fit <- stan(file = "ex_dimension.stan",
data = stan_data,
warmup = 500,
iter = 2000,
chains = 4,
cores = 4,
thin = 1,
control=list(adapt_delta=0.8))
fit
Script with 1 predictor (not working):
stan_data=list(N=N,
K=1,
x=x[,1],
y=y)
fit <- stan(file = "ex_dimension.stan",
data = stan_data,
warmup = 500,
iter = 2000,
chains = 4,
cores = 4,
thin = 1,
control=list(adapt_delta=0.8))
fit
Subset the matrix with drop = FALSE to avoid collapsing it to a vector and thereby passing the wrong input to the Stan model (see also e.g. Advanced R - Subsetting Chapter).
library(rstan)
stan_data <- list(N = N, K = 1, x = x[, 1, drop = FALSE], y = y)
fit <- stan(
model_code = "// Stan model for simple linear regression
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // likelihood
}",
data = stan_data,
chains = 1
)
fit
#> Inference for Stan model: 4f8ba0f0c644593f519910e9d2741995.
#> 1 chains, each with iter=2000; warmup=1000; thin=1;
#> post-warmup draws per chain=1000, total post-warmup draws=1000.
#>
#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#> alpha 6.26 0.06 1.20 3.93 5.49 6.25 7.04 8.68 470 1
#> beta[1] 2.00 0.00 0.10 1.81 1.94 2.00 2.06 2.19 453 1
#> sigma 2.70 0.02 0.50 1.87 2.35 2.62 2.97 3.88 458 1
#> lp__ -28.15 0.06 1.21 -31.12 -28.80 -27.84 -27.23 -26.74 366 1
#>
#> Samples were drawn using NUTS(diag_e) at Thu Aug 15 12:41:19 2019.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at
#> convergence, Rhat=1).

Fitting a poisson HMM JAGS model with RSTAN

Walter Zucchini in his book Hidden Markov Models for Time Series An Introduction Using R, in chapter 8 page 129, adjusts a Poisson HMM using R2OpenBUGS, then I show the code. I am interested in adjusting this same model but with rstan, but since I am new using this package, I am not clear about the syntax any suggestion.
data
dat <- read.table("http://www.hmms-for-time-series.de/second/data/earthquakes.txt")
RJAGS
library(R2jags)
library(rjags)
HMM <- function(){
for(i in 1:m){
delta[i] <- 1/m
v[i] <- 1}
s[1] ~ dcat(delta[])
for(i in 2:100){
s[i] ~ dcat(Gamma[s[i-1],])}
states[1] ~ dcat(Gamma[s[100],])
x[1]~dpois(lambda[states[1]])
for(i in 2:n){
states[i]~dcat(Gamma[states[i-1],])
x[i]~dpois(lambda[states[i]])}
for(i in 1:m){
tau[i]~dgamma(1,0.08)
Gamma[i,1:m]~ddirch(v[])}
lambda[1]<-tau[1]
for(i in 2:m){
lambda[i]<-lambda[i-1]+tau[i]}}
x = dat[,2]
n = dim(dat)[1]
m = 2
mod = jags(data = list("x", "n", "m" ), inits = NULL, parameters.to.save = c("lambda","Gamma"),
model.file = HMM, n.iter = 10000, n.chains = 1)
output
mod
Inference for Bugs model at "C:/Users/USER/AppData/Local/Temp/RtmpOkrM6m/model36c8429c5442.txt", fit using jags,
1 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
n.sims = 1000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
Gamma[1,1] 0.908 0.044 0.805 0.884 0.915 0.940 0.971
Gamma[2,1] 0.155 0.071 0.045 0.105 0.144 0.195 0.325
Gamma[1,2] 0.092 0.044 0.029 0.060 0.085 0.116 0.195
Gamma[2,2] 0.845 0.071 0.675 0.805 0.856 0.895 0.955
lambda[1] 15.367 0.763 13.766 14.877 15.400 15.894 16.752
lambda[2] 26.001 1.321 23.418 25.171 25.956 26.843 28.717
deviance 645.351 8.697 630.338 639.359 644.512 650.598 665.405
DIC info (using the rule, pD = var(deviance)/2)
pD = 37.8 and DIC = 683.2
DIC is an estimate of expected predictive error (lower deviance is better).
RSTAN
library("rstan")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
HMM <- '
data{
int<lower=0> n; // number of observations (length)
int<lower=0> x[n]; // observations
int<lower=1> m; // number of hidden states
}
parameters{
simplex[m] Gamma[n]; // t.p.m
vector[m] lambda; // mean of poisson ordered
}
model{
vector[m] delta[m];
vector[m] v[m];
vector[100] s[100];
vector[n] states[n];
vector[m] tau;
for(i in 1:m){
delta[i] = 1/m;
v[i] = 1;}
s[1] ~ categorical(delta[]);
for(i in 2:100){
s[i] ~ categorical(Gamma[s[i-1],]);}
states[1] ~ categorical(Gamma[s[100],]);
x[1] ~ poisson(lambda[states[1]]);
for(i in 2:n){
states[i] ~ categorical(Gamma[states[i-1],]);
x[i] ~ poisson(lambda[states[i]])};
for(i in 1:m){
tau[i] ~ gamma(1,0.08);
Gamma[i,1:m] ~ dirichlet(v[]);}
lambda[1] = tau[1];
for(i in 2:m){
lambda[i] = lambda[i-1] + tau[i]};}'
data <- list(n = dim(dat)[1], x = dat[,2], m = 2)
system.time(mod2 <- stan(model_code = HMM, data = data, chains = 1, iter = 1000, thin = 4))
mod2
however, an error occurs when running the stan model.
Using the forward algorithm, and as priors the gamma distribution, for the means vector of the dependent states, and imposing the restriction on the simplex[m] object, for the probability transition matrix, in which the sum by rows equals 1 The following estimates are obtained.
dat <- read.table("http://www.hmms-for-time-series.de/second/data/earthquakes.txt")
stan.data <- list(n=dim(dat)[1], m=2, x=dat$V2)
PHMM <- '
data {
int<lower=0> n; // length of the time series
int<lower=0> x[n]; // data
int<lower=1> m; // number of states
}
parameters{
simplex[m] Gamma[m]; // tpm
positive_ordered[m] lambda; // mean of poisson - ordered
}
model{
vector[m] log_Gamma_tr[m]; // log, transposed tpm
vector[m] lp; // for forward variables
vector[m] lp_p1; // for forward variables
lambda ~ gamma(0.1, 0.01); // assigning exchangeable priors
//(lambdas´s are ordered for sampling purposes)
// transposing tpm and taking the log of each entry
for(i in 1:m)
for(j in 1:m)
log_Gamma_tr[j, i] = log(Gamma[i, j]);
lp = rep_vector(-log(m), m); //
for(i in 1:n) {
for(j in 1:m)
lp_p1[j] = log_sum_exp(log_Gamma_tr[j] + lp) + poisson_lpmf(x[i] | lambda[j]);
lp = lp_p1;
}
target += log_sum_exp(lp);
}'
model <- stan(model_code = PHMM, data = stan.data, iter = 1000, chains = 1)
print(model,digits_summary = 3)
output
Inference for Stan model: 11fa5b74e5bea2ca840fe5068cb01b7b.
1 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
Gamma[1,1] 0.907 0.002 0.047 0.797 0.882 0.913 0.941 0.972 387 0.998
Gamma[1,2] 0.093 0.002 0.047 0.028 0.059 0.087 0.118 0.203 387 0.998
Gamma[2,1] 0.147 0.004 0.077 0.041 0.090 0.128 0.190 0.338 447 0.999
Gamma[2,2] 0.853 0.004 0.077 0.662 0.810 0.872 0.910 0.959 447 0.999
lambda[1] 15.159 0.044 0.894 13.208 14.570 15.248 15.791 16.768 407 1.005
lambda[2] 25.770 0.083 1.604 22.900 24.581 25.768 26.838 28.940 371 0.998
lp__ -350.267 0.097 1.463 -353.856 -351.091 -349.948 -349.155 -348.235 230 1.001
Samples were drawn using NUTS(diag_e) at Wed Jan 16 00:35:06 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

How to calculate differences between the coefficients (categorical variables) of glm procedure in Rstan

I have a question about the how to calculate differences between the coefficients (categorical variables) of glm in Rstan.
As example, I used iris dataset in R to judge whether I can calculate the posterior distribution of differences of coefficients.
At first, I conducted a basic glm procedure like below and calculate the significant differences of coefficients.
library(tidyverse)
library(magrittr)
library(multcomp)
iris_glm <-
glm(Sepal.Length ~ Species, data = iris)
multcomp::glht(iris_glm, linfct = mcp(Species = "Tukey")) %>%
summary(.) %>%
broom::tidy()
lhs rhs estimate std.error statistic p.value
1 versicolor - setosa 0 0.930 0.1029579 9.032819 0.000000e+00
2 virginica - setosa 0 1.582 0.1029579 15.365506 0.000000e+00
3 virginica - versicolor 0 0.652 0.1029579 6.332686 4.294805e-10
Next, I conducted bayesian glm procedure using stan like below code, and calculate the posterior distribution of the differences between coefficients in generated quantities section.
# Make the model matrix for Rstan
iris_mod <-
model.matrix(Sepal.Length ~ Species, data = iris) %>%
as.data.frame(.)
# Input data
stan_data <-
list(N = nrow(iris_mod),
SL = iris$Sepal.Length,
Intercept = iris_mod$`(Intercept)`,
versicolor = iris_mod$Speciesversicolor,
virginica = iris_mod$Speciesvirginica)
# Stan code
data{
int N;
real <lower = 0> SL[N];
int <lower = 1> Intercept[N];
int <lower = 0, upper = 1> versicolor[N];
int <lower = 0, upper = 1> virginica[N];
}
parameters{
real beta0;
real beta1;
real beta2;
real <lower = 0> sigma;
}
transformed parameters{
real mu[N];
for(n in 1:N) mu[n] = beta0*Intercept[n] + beta1*versicolor[n] +
beta2*virginica[n];
}
model{
for(n in 1:N) SL[n] ~ normal(mu[n], sigma);
}
generated quantities{
real diff_beta0_beta1;
real diff_beta1_beta2;
real diff_beta0_beta2;
diff_beta0_beta1 = (beta0 + beta1) - beta0;
diff_beta1_beta2 = (beta0 + beta1) - (beta0 + beta2);
diff_beta0_beta2 = (beta0 + beta2) - beta0;
}
library(rstan)
fit_stan <-
stan(file = "iris.stan", data = stan_data, chains = 4,
seed = 1234)
# confirmation of posterior distribution
print(fit_stan, pars = c("diff_beta0_beta1", "diff_beta1_beta2",
"diff_beta0_beta2"))
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
diff_beta0_beta1 0.92 0 0.1 0.73 0.86 0.92 0.99 1.13 2041 1
diff_beta1_beta2 0.65 0 0.1 0.45 0.58 0.65 0.72 0.86 4000 1
diff_beta0_beta2 1.58 0 0.1 1.38 1.51 1.58 1.64 1.78 1851 1
Finally, I could get same results between the frequentist method and bayesian method.
I think this is correct way, but I'm not sure this because there are no information nor examples.
Also I also confirm this way could be extended another error distributions (including, poisson, gamma, binomial, negative- binomial etc.).
If there are another good ways or advices, please teach me.
You can calculate any function (including the difference in coefficients) of draws (such as those produced by Stan) from any proper posterior distribution.

Arrange monte carlo p-value into a matrix for different sample size and variance estimators

The following code works out quite well (based on my previous question). But I have to change the variance estimator (ols, hc0, hc1, hc2, hc3) every time before I run the code. I would like to solve this problem with a loop.
Hereafter, I briefly describe the code. Within the code, 1000 regression models for each sample size (n = 25, 50, 100, 250, 500, 1000) are created. Then, each regression model out of the 1000 is estimated by OLS. After that, I calculate t-statistics based on the different beta values of x3 out of the 1000 samples. The null hypothesis reads: H0: beta03 = beta3, that is the calculated beta value of x3 equals the 'real' value which I defined as 1. In the last step, I check how often the null hypothesis is rejected (significance level = 0.05). My final goal is to create a code which spits out the procentual rejection rate of the null hypothesis for each sample size and variance estimator. Thus, the result should be a matrix whereas right now I get a vector as a result. I would be pleased if anyone of you could help me with that. Here you can see my code:
library(car)
sample_size = c("n=25"=25, "n=50"=50, "n=100"=100, "n=250"=250, "n=500"=500, "n=1000"=1000)
B <- 1000
beta0 <- 1
beta1 <- 1
beta2 <- 1
beta3 <- 1
alpha <- 0.05
simulation <- function(n, beta3h0){
t.test.values <- rep(NA, B)
#simulation of size
for(rep in 1:B){
#data generation
d1 <- runif(n, 0, 1)
d2 <- rnorm(n, 0, 1)
d3 <- rchisq(n, 1, ncp=0)
x1 <- (1 + d1)
x2 <- (3*d1 + 0.6*d2)
x3 <- (2*d1 + 0.6*d3)
# homoskedastic error term: exi <- rchisq(n, 4, ncp = 0)
exi <- sqrt(x3 + 1.6)*rchisq(n, 4, ncp = 0)
y <- beta0 + beta1*x1 + beta2*x2 + beta3*x3 + exi
mydata <- data.frame(y, x1, x2, x3)
#ols estimation
lmobj <- lm(y ~ x1 + x2 + x3, mydata)
#extraction
betaestim <- coef(lmobj)[4]
betavar <- vcov(lmobj)[4,4]
#robust variance estimators: hc0, hc1, hc2, hc3
betavar0 <- hccm(lmobj, type="hc0")[4,4]
betavar1 <- hccm(lmobj, type="hc1")[4,4]
betavar2 <- hccm(lmobj, type="hc2")[4,4]
betavar3 <- hccm(lmobj, type="hc3")[4,4]
#t statistic
t.test.values[rep] <- (betaestim - beta3h0)/sqrt(betavar)
}
mean(abs(t.test.values) > qt(p=c(1-alpha/2), df=n-4))
}
sapply(sample_size, simulation, beta3h0 = 1)
You don't need a double nested loop. Just make sure you get a matrix inside your loop. Update your current simulation with the following:
## set up a matrix
## replacing `t.test.values <- rep(NA, B)`
t.test.values <- matrix(nrow = 5, ncol = B) ## 5 estimators
## update / fill a column
## replacing `t.test.values[rep] <- (betaestim - beta3h0)/sqrt(betavar)`
t.test.values[, rep] <- abs(betaestim - beta3h0) / sqrt(c(betavar, betavar0, betavar1, betavar2, betavar3))
## row means
## replacing `mean(abs(t.test.values) > qt(p=c(1-alpha/2), df=n-4))`
rowMeans(t.test.values > qt(1-alpha/2, n-4))
Now, simulation would return a vector of length 5. For each sample size, the monte carlo estimate of t-statistic p-value is returned for all 5 variance estimators. Then, when you call sapply, you get a matrix result:
sapply(sample_size, simulation, beta3h0 = 1)
# n=25 n=50 n=100 n=250 n=500 n=1000
#[1,] 0.132 0.237 0.382 0.696 0.917 0.996
#[2,] 0.198 0.241 0.315 0.574 0.873 0.994
#[3,] 0.157 0.220 0.299 0.569 0.871 0.994
#[4,] 0.119 0.173 0.248 0.545 0.859 0.994
#[5,] 0.065 0.122 0.197 0.510 0.848 0.993

Resources