Stan: Copula on observed variable and latent variable

Consider observed data y1 and y2, where y1 is measured on a continuous scale and y2 on a binary scale. A continuous latent variable z is assumed to generate y2 via y2 = I(z > 0). (If z is normal, then y2 marginally follows a binary probit model.) Furthermore, a copula is used to model the dependency between y1 and z. This model can be written hierarchically (with some abuse of notation) as:
y2 = I(z > 0)
(y1, z) ~ C(F_y1(· | w), F_z(· | w) | phi)
w, phi ~ priors
where w is the vector of marginal parameters for y1 and z, F_y1 and F_z are the respective marginal CDFs of y1 and z, and phi is the copula parameter.
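In density form (by Sklar's theorem), the joint likelihood the copula implies for a single observation is

f(y1, z | w, phi) = c(F_y1(y1 | w), F_z(z | w) | phi) * f_y1(y1 | w) * f_z(z | w),

where c is the copula density. This is exactly the quantity the custom log-likelihood function in the code below accumulates over observations.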
How could this be modelled in Stan? I have written a custom probability function that evaluates the bivariate log likelihood of y1 and z produced by the copula. What I don't know how to do is account for (generate?) the latent variables z, and how to specify the relationship between y2 and z.
I have already looked at Probit regression with data augmentation in stan, but this does not seem helpful due to the copula I have in my model.
Edit: I might be mistaken about the above link not being useful. I have written the following code and would appreciate comments on whether it looks correct (theoretically).
functions {
  // Copula log likelihood. dCphi_du1du2_s is the user-defined copula density
  // (the mixed partial derivative of the copula C), defined elsewhere.
  real copulapdf_log(real[] y1, real[] z, vector mu1, vector mu2,
                     real sigma1, real phi, int n) {
    real logl;
    real s;
    logl <- 0.0;
    for (i in 1:n) {
      s <- log(dCphi_du1du2_s(normal_cdf(y1[i], mu1[i], sigma1),
                              logistic_cdf(z[i], mu2[i], 1), phi))
           + normal_log(y1[i], mu1[i], sigma1)
           + logistic_log(z[i], mu2[i], 1);
      logl <- logl + s;
    }
    return logl;
  }
}
data {
  int<lower=0> n;                  // number of subjects
  int<lower=0> k1;                 // number of predictors for y1
  int<lower=0> k2;                 // number of predictors for y2
  real y1[n];                      // continuous data
  int<lower=0, upper=1> y2[n];     // 0/1 binary data
  matrix[n, k1] x1;                // predictor variables for y1
  matrix[n, k2] x2;                // predictor variables for y2
}
transformed data {
  int<lower=-1, upper=1> sign[n];  // +1 where y2 == 1, -1 where y2 == 0
  for (i in 1:n) {
    if (y2[i] == 1)
      sign[i] <- 1;
    else
      sign[i] <- -1;
  }
}
parameters {
  real phi;                 // Frank copula parameter
  vector[k1] b1;            // beta coefficients for y1
  vector[k2] b2;            // beta coefficients for y2
  real<lower=0> abs_z[n];   // absolute value of the latent variable
  real<lower=0> sigma1;     // sd for y1's normal distribution
}
transformed parameters {
  real v[n];                // latent variable z, with sign fixed by y2
  vector[n] mu1;            // location for y1
  vector[n] mu2;            // location for z
  for (i in 1:n) {
    v[i] <- sign[i] * abs_z[i];
  }
  mu1 <- x1 * b1;
  mu2 <- x2 * b2;
}
model {
  b1 ~ normal(0, 100);
  b2 ~ normal(0, 100);
  phi ~ normal(0, 10);
  increment_log_prob(copulapdf_log(y1, v, mu1, mu2, sigma1, phi, n));
}

If you need the latent-parameter formulation, that's just like the Albert and Chib characterization of probit regression. What you need to do is declare the truncation in the parameters block. There's an example in the manual's chapter on regression, under multivariate probit, that shows how it's done. Basically, the latent values that must be positive get a lower=0 constraint and the negative ones an upper=0 constraint, and then you put both sets of parameters back together into a single z vector (if you actually need to reassemble it), as sketched below.
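Here is a minimal sketch of that construction for this model (the bookkeeping names n_pos, n_neg, pos_idx, and neg_idx are hypothetical; it assumes y2 is declared as an integer array, as in the data block above, so that sum(y2) counts the ones):

transformed data {
  int<lower=0> n_pos;              // number of observations with y2 == 1
  int<lower=0> n_neg;              // number of observations with y2 == 0
  int pos_idx[sum(y2)];            // indices where y2 == 1
  int neg_idx[n - sum(y2)];        // indices where y2 == 0
  n_pos <- sum(y2);
  n_neg <- n - n_pos;
  {
    int p;
    int q;
    p <- 1;
    q <- 1;
    for (i in 1:n) {
      if (y2[i] == 1) {
        pos_idx[p] <- i;
        p <- p + 1;
      } else {
        neg_idx[q] <- i;
        q <- q + 1;
      }
    }
  }
}
parameters {
  vector<lower=0>[n_pos] z_pos;    // latent z where y2 == 1 (z > 0)
  vector<upper=0>[n_neg] z_neg;    // latent z where y2 == 0 (z <= 0)
}
transformed parameters {
  real z[n];                       // latent vector reassembled in original order
  for (i in 1:n_pos) z[pos_idx[i]] <- z_pos[i];
  for (i in 1:n_neg) z[neg_idx[i]] <- z_neg[i];
}

The declared constraints implement the truncation, so no explicit T[,] statements are needed; z can then be passed to copulapdf_log in place of v.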

Related

Estimating parameters using stan when the distribution for response variable in a regression is non-normal

I am using R + Stan for Bayesian estimation of model parameters when the distribution of the response variable in a regression is not normal but rather some custom distribution, as below.
Let's say I have the data-generating process below:
x <- rnorm(100, 0, .5)
noise <- rnorm(100, 0, 1)
y <- exp(10 * x + noise) / (1 + exp(10 * x + noise))
data <- list(x = x, y = y, N = length(x))
In Stan, I am creating the model object below:
Model = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  for (f in 1:N) {
    mu[f] = alpha + beta * x[f];
  }
}
model {
  sigma ~ chi_square(5);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  y ~ ???;
}
"
However, as you can see, what would be the right Stan continuous distribution for y in the model block?
Any pointer will be highly appreciated.
Thanks for your time.
The problem is not so much that the distribution of the errors isn't normal (which is the assumption in a regular linear regression), but that the relationship between x and y is clearly not linear. You DO have a linear relationship with normally distributed noise (z = 10 * x + noise, where I use z to avoid confusion with your y), but you then apply the softmax function: y = softmax(z). If you want to model this with a linear regression, you need to invert the softmax, i.e. recover z from y. You do that with the logit function, since the softmax here is the inverse logit, and the inverse of the inverse logit is the logit. Then you can do a standard linear regression.
model = "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  // invert the softmax function, so there's a linear relationship between x and z
  vector[N] z = logit(y);
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  // no need to loop here; this can be vectorized
  vector[N] mu = alpha + beta * x;
}
model {
  sigma ~ chi_square(5);
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  z ~ normal(mu, sigma);
}
generated quantities {
  // to check the prediction, predict linearly and then apply the softmax again
  vector[N] y_pred = inv_logit(alpha + beta * x);
}
"
If you won't use mu again outside the model, you can skip the transformed parameters block and compute it directly when needed:
z ~ normal(alpha + beta * x, sigma);
On a side note: you might want to reconsider your priors. The true values of alpha and beta here are 0 and 10, respectively. The likelihood is precise enough to largely overwhelm the prior, but you'll probably see some shrinkage of beta towards zero (i.e. you might get 9 instead of 10). Try something like normal(0, 10) instead. And I've never seen anyone use a chi-squared distribution as a prior on a standard deviation.

Generated quantities block in Stan model

I'm building a standard linear regression model and I want to include the generated quantities block, using the dot_self() function. The problem is I can't get simulation samples. The error is: Stan model 'LinearRegression' does not contain samples. I think the function dot_self() is not being recognized as a function.
I show the Stan code and R code here.
Thanks in advance.
Note: I am sure that the data entered are correct, because the model without the generated quantities block works perfectly.
Stan Code:
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  mu = X * beta;
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  y ~ normal(mu, sigma);
}
generated quantities {
  real rss;
  real totalss;
  real<lower=0, upper=1> R2;
  vector[N] mu;
  mu = X * beta;
  rss = dot_self(y - mu);
  totalss = dot_self(y - mean(y));
  R2 = 1 - rss / totalss;
}
R Code to run Stan model:
library(rstan)
library(coda)
library(ggplot2)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
dat <- list(N = N, K = ncol(X), y = y, X = X)
fit3 <- stan(file = "C:.... LinearRegression.stan", data = dat, iter = 100, chains = 4)
print(fit3, digits = 3, prob = c(.025, .5, .975))
The error is due to the bounds on R2: for some draws of beta (especially early ones), rss exceeds totalss, so R2 goes negative, violates the declared lower bound, and no samples are saved. There is no need to impose bounds on generated quantities.
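For reference, the same block with the constraint dropped:

generated quantities {
  real rss;
  real totalss;
  real R2;           // no <lower=0, upper=1> constraint
  vector[N] mu;
  mu = X * beta;
  rss = dot_self(y - mu);
  totalss = dot_self(y - mean(y));
  R2 = 1 - rss / totalss;
}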
Here I used simulated X and y:
X = matrix(runif(N*K), N, K)
y = rowSums(X)
With the bounds removed, the model samples correctly.

How can I simulate the response variable from a model already fitted?

I have already fitted a regression model with JAGS:
model {
  for (i in 1:n) {
    y[i] ~ dbeta(alpha[i], beta[i])
    alpha[i] <- mu[i] * phi[i]
    beta[i] <- (1 - mu[i]) * phi[i]
    log(phi[i]) <- -inprod(X2[i, ], delta[])
    cloglog(mu[i]) <- inprod(X1[i, ], B[])
  }
  for (j in 1:p) {
    B[j] ~ dnorm(0, .001)
  }
  for (k in 1:s) {
    delta[k] ~ dnorm(0, .001)
  }
}
But I need to simulate 50 samples of the response variable, each of a given size, in order to make some plots. How can I do it?
I found this thread a little helpful: Estimating unknown response variable in JAGS - unsupervised learning.
Should I run the chain again, using the posterior estimates I already have as inits?
I assume that your data are y, X1 and X2.
You can add 50 rows of new data to your X1 and X2 covariates, add 50 NA values to y, and increase n to include the 50 new values.
Your model will then produce predictions for the 50 added NA values of y; a minimal sketch of this step follows below.
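A sketch of that augmentation step in R (the new covariate rows here are hypothetical placeholders; use whatever design points you want predictions at):

# Append 50 prediction points; JAGS treats the NA responses as unknowns
# and samples them from the posterior predictive distribution.
new_X1 <- X1[sample(nrow(X1), 50, replace = TRUE), ]
new_X2 <- X2[sample(nrow(X2), 50, replace = TRUE), ]
jags_data <- list(
  y  = c(y, rep(NA, 50)),
  X1 = rbind(X1, new_X1),
  X2 = rbind(X2, new_X2),
  n  = n + 50,
  p  = ncol(X1),
  s  = ncol(X2)
)
# Monitor y to obtain draws of y[n+1], ..., y[n+50].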
Yes, you can do exactly as you described, as long as you first create a new dataset with 50 observations on the variables y, X1, and X2 as described by StatnMap (viz., 50 values for each of X1 and X2 and 50 NAs for y). But you will not need to rerun your model, as StatnMap implies. Just to be clear: you can, but you do not need to.

Simple Gamma GLM in Stan

I'm trying a simple Gamma GLM in Stan and R, but R crashes immediately when I run it.
Generate the data:
set.seed(1)
library(rstan)
N <- 500  # sample size
dat <- data.frame(x1 = runif(N, -1, 1), x2 = runif(N, -1, 1))
# the model matrix
X <- model.matrix(~., dat)
K <- dim(X)[2]  # number of regression params
# the regression slopes
betas <- runif(K, -1, 1)
shape <- 10
# simulate gamma data
mus <- exp(X %*% betas)
y <- rgamma(N, shape = shape, rate = shape / mus)
This is my Stan model:
model_string <- "
data {
  int<lower=0> N;                    // the number of observations
  int<lower=0> K;                    // the number of columns in the model matrix
  matrix[N, K] X;                    // the model matrix
  vector[N] y;                       // the response
}
parameters {
  vector[K] betas;                   // the regression parameters
  real<lower=0, upper=1000> shape;   // the shape parameter
}
model {
  y ~ gamma(shape, (shape / exp(X * betas)));
}"
When I run this model, R immediately crashes:
m <- stan(model_code = model_string, data = list(X = X, K = 3, N = 500, y = y), chains = 1, cores = 1)
Update: I think the problem is somewhere in the vectorization, as I can get a running model when I pass every column of X as a separate vector.
Update 2: this also works:
for (i in 1:N)
  y[i] ~ gamma(shape, (shape / exp(X[i, ] * betas)));
The problem with the original code is that, in the version of Stan in question, no operator is defined for dividing a scalar by a vector, which is what
shape / exp(X * betas)
attempts. You can work around this by expanding the scalar into a vector and using elementwise division:
rep_vector(shape, N) ./ exp(X * betas)
or, equivalently (since 1 / exp(a) = exp(-a)), by multiplying the scalar by a vector, which is defined:
shape * exp(-X * betas)
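Putting either fix into the model block gives, as a sketch:

model {
  // rate as a vector: expand the scalar shape, then divide elementwise
  y ~ gamma(shape, rep_vector(shape, N) ./ exp(X * betas));
  // equivalently: y ~ gamma(shape, shape * exp(-X * betas));
}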

Coda for visualizing Stan trajectories

I'm using Stan (specifically, rstan) with a Bayesian univariate linear regression $y = \beta_0 + \beta_1 x + \varepsilon$, and I'm trying to use the coda package to visualize the resulting trajectories and distributions for the $\beta$s. However, this produces the error Error in plot.new() : figure margins too large. traceplot and densplot seem to work fine; the problem seems to be with plot.mcmc, which is supposed to produce a nice panel output. You can see an example of the expected output here, on the slide "Traceplots and Density Plots".
Here's a minimum (non-)working example using the mtcars dataset:
library(rstan)
library(coda)
stanmodel <- "
data { // Data block: exogenously given information
  int<lower=1> N;                  // Sample size
  vector<lower=1>[N] y;            // Response or output; [N] means a vector of length N
  vector<lower=0, upper=1>[N] x;   // The single regressor; either 0 or 1
}
parameters { // Parameter block: unobserved variables to be estimated
  vector[2] beta;                  // Regression coefficients
  real<lower=0> sigma;             // Standard deviation of the error term
}
model { // Model block: connects data to parameters
  vector[N] yhat;                  // Regression estimate for y
  yhat <- beta[1] + x * beta[2];
  // Priors
  beta ~ normal(0, 10);
  // To plot in R: plot(function (x) {dnorm(x, 0, 10)}, -30, 30)
  sigma ~ cauchy(0, 5);            // With sigma bounded at 0, this is half-Cauchy
  // http://en.wikipedia.org/wiki/Cauchy_distribution
  // To plot in R: plot(function (x) {dcauchy(x, 0, 5)}, 0, 10)
  // Likelihood
  y ~ normal(yhat, sigma);         // yhat is the estimator, with N(0, sigma^2) error;
                                   // note that Stan parameterizes by the standard deviation
}
"
# Designate data
nobs <- nrow(mtcars)
y <- mtcars$mpg
x <- mtcars$am   # Simple regression version doesn't include a constant
data <- list(
  N = nobs,   # Sample size or number of observations
  y = y,      # The response or output
  x = x       # The single regressor, transmission type
)
# Set a seed for the random number generator
set.seed(123456)
# Run the model
bayes <- stan(
  model_code = stanmodel,
  data = data,    # Use the model and data we just defined
  iter = 12000,   # Take 12,000 draws from the posterior,
  warmup = 2000,  # throw away the first 2,000,
  thin = 10,      # and keep only every tenth draw.
  chains = 3      # Run 3 independent chains.
)
# Use the coda library to visualize parameter trajectories and distributions
param_samples <- as.data.frame(bayes)[, c('beta[1]', 'beta[2]')]
plot(as.mcmc(param_samples))
