Say I want to model a random effect at two levels, i.e. I have two levels of nesting: individuals within a parent group, and parent groups within a grandparent group. I know how to write a basic model for a single random effect (below) from examples like these, but I don't know how to write the equivalent of
lmer(resp ~ (1|a/b), data = DAT)
in Stan.
The Stan code for a single random effect is below. The question is: how do I nest a within a higher level b?
data{
  int<lower=0> N;                 // number of observations
  int<lower=0> K;                 // number of predictors
  matrix[N,K] X;                  // design matrix
  vector[N] price;                // response
  int J;                          // number of groups
  int<lower=1,upper=J> re[N];     // group membership of each observation
}
parameters{
  vector[J] a;                    // group-level intercepts
  real mu_a;                      // mean of the group intercepts
  real<lower=0> sigma_a;          // sd of the group intercepts
  real<lower=0> sigma;            // residual sd
  vector[K] beta;                 // regression coefficients
}
transformed parameters{
  vector[N] mu_hat;
  for(i in 1:N)
    mu_hat[i] = a[re[i]];
}
model {
  mu_a ~ normal(0,10);
  sigma_a ~ cauchy(0,5);
  a ~ normal(mu_a,sigma_a);
  for(i in 1:N)
    price[i] ~ normal(X[i]*beta + mu_hat[i], sigma);
}
"
I'm not sure what the a/b notation is in lmer, but if you want nested levels multiple layers deep, it's easy to do with index arrays. Say you have an IRT model with students (j in 1:J) nested in schools (school[j] in 1:S) and schools nested in cities (city[s] in 1:C).
[Update 14 April 2017]
You can now vectorize everything. So rather than this:
for (j in 1:J)
theta[j] ~ normal(alpha[school[j]], sigma_theta);
for (s in 1:S)
alpha[s] ~ normal(beta[city[s]], sigma_alpha);
beta ~ normal(0, 5);
you can have
theta ~ normal(alpha[school], sigma_theta);
alpha ~ normal(beta[city], sigma_alpha);
beta ~ normal(0, 5);
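Applied to the original question, here is a minimal sketch of the (1|a/b) nesting (the names M and b_of_a are mine, not from the original post: b_of_a[j] gives the b level that a level j belongs to):
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N,K] X;
  vector[N] price;
  int<lower=1> J;                   // number of a levels
  int<lower=1> M;                   // number of b levels
  int<lower=1,upper=J> re[N];       // a level of each observation
  int<lower=1,upper=M> b_of_a[J];   // b level of each a level
}
parameters {
  vector[J] a;
  vector[M] b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_b;
  real<lower=0> sigma;
  vector[K] beta;
}
model {
  sigma_a ~ cauchy(0,5);
  sigma_b ~ cauchy(0,5);
  b ~ normal(0, sigma_b);           // grandparent-level effects
  a ~ normal(b[b_of_a], sigma_a);   // parent effects centered on their grandparent
  price ~ normal(X*beta + a[re], sigma);
}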
If your model is simple, the brms package is worth looking at. It compiles your formula down to Stan code and runs the model, and it has expressive syntax borrowed from lmer. What I like is that if you already have the lmer formula, you can generate the model as a Stan file and then build on it.
And of course it has the added benefit (from Stan) that the confusing distinction between estimating "fixed effects" and "random effects" is gone: both are estimated simply as parameters with posterior distributions.
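For example, a sketch assuming the data frame DAT from the question:
library(brms)
# fit the nested random-effects model straight from the lmer-style formula
fit <- brm(resp ~ (1 | a/b), data = DAT)
# or just generate and inspect the Stan code without fitting
make_stancode(resp ~ (1 | a/b), data = DAT)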
Related
I am using R + Stan for Bayesian estimation of model parameters when the distribution of the response variable in a regression is not normal but rather some custom distribution, as below.
Say I have the following data-generating process:
x <- rnorm(100, 0, .5)
noise <- rnorm(100, 0, 1)
y <- exp(10 * x + noise) / (1 + exp(10 * x + noise))
data <- list(x = x, y = y, N = length(x))
In Stan, I am creating the model object below:
Model = "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
transformed parameters {
vector[N] mu;
for (f in 1:N) {
mu[f] = alpha + beta * x[f];
}
}
model {
sigma ~ chi_square(5);
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
y ~ ???;
}
"
However, as you can see, the model block is incomplete: what would be the right continuous Stan distribution for y?
Any pointer will be highly appreciated.
Thanks for your time.
The problem is not so much that the distribution of the errors isn't normal (which is the assumption in a regular linear regression), but that the relationship between x and y is clearly not linear. You DO have a linear relationship with normally distributed noise, z = 10 * x + noise (where I use z to avoid confusion with your y), but you then pass z through the logistic function: y = inv_logit(z). (The logistic function is the scalar version of what is sometimes called softmax; in Stan it is inv_logit.) If you want to model this using a linear regression, you need to invert that transformation, i.e. recover z from y, which you do with the logit function, since logit is the inverse of inv_logit. Then you can do a standard linear regression on z.
model = "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
transformed data{
// invert the softmax function, so there's a linear relationship between x and z
vector[N] z = logit(y);
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
transformed parameters {
vector[N] mu;
// no need to loop here, this can be vectorized
mu = alpha + beta * x;
}
model {
sigma ~ chi_square(5);
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
z ~ normal(mu, sigma);
}
generated quantities {
// now if you want to check the prediction,
//you predict linearly and then apply the softmax again
vector[N] y_pred = inv_logit(alpha + beta * x);
}
"
If you won't use mu again outside the model, you can skip the transformed parameters block and compute it directly when needed:
z ~ normal(alpha + beta * x, sigma);
On a side note: you might want to reconsider your priors. The true values for alpha and beta in this case are 0 and 10, respectively. The likelihood is precise enough to largely overwhelm the prior, but you'll probably still see some shrinkage of beta towards zero (i.e. you might get 9 instead of 10). Try something like normal(0, 10) instead. And I've never seen a chi-squared distribution used as a prior on a standard deviation; a half-normal or half-Cauchy would be more conventional.
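For completeness, a minimal sketch of running the corrected model with rstan, reusing the model string and data list defined above:
library(rstan)
fit <- stan(model_code = model, data = data)
print(fit, pars = c("alpha", "beta", "sigma"))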
I'm building a standard linear regression model, and I want to include a generated quantities block that uses the dot_self() function. The problem is I can't get simulation samples. The error is: "Stan model 'LinearRegression' does not contain samples." I think the function dot_self() is not being recognized as a function.
I show the Stan code and R code below.
Thanks in advance.
Note: I am sure that the data entered are correct, because the model without the generated quantities block works perfectly.
Stan code:
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model{
  vector[N] mu;
  mu = X * beta;
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  y ~ normal(mu, sigma);
}
generated quantities {
  real rss;
  real totalss;
  real<lower=0, upper=1> R2;
  vector[N] mu;
  mu = X * beta;
  rss = dot_self(y - mu);
  totalss = dot_self(y - mean(y));
  R2 = 1 - rss / totalss;
}
R code to run the Stan model:
library(rstan)
library(coda)
library(ggplot2)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
dat <- list(N = N, K = ncol(X), y = y, X = X)
fit3 <- stan(file = "C:.... LinearRegression.stan", data = dat, iter = 100, chains = 4)
print(fit3, digits = 3, prob = c(.025, .5, .975))
The error is due to the bounds on R2. There is no need to impose bounds on generated quantities: declared constraints are validated after each draw, and early draws with poor beta values can give rss > totalss, i.e. R2 < 0, which violates the lower bound and makes the check fail.
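Concretely, the generated quantities block with the bounds removed:
generated quantities {
  real rss;
  real totalss;
  real R2;   // no bounds: early draws may legitimately give R2 < 0
  vector[N] mu;
  mu = X * beta;
  rss = dot_self(y - mu);
  totalss = dot_self(y - mean(y));
  R2 = 1 - rss / totalss;
}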
Here I used simulated X and y (with N = 500 and K = 3 to match the stan() call):
N <- 500; K <- 3
X <- matrix(runif(N * K), N, K)
y <- rowSums(X)
After removing the bounds, the model samples without error.
...and what does the ~ sign mean, compared to R, in y[i] ~ dnorm(m[i], tau) vs y[i] <- dnorm(n, m[i], tau)?
Consider the two lines of code:
# JAGS (stochastic node and deterministic node):
for(i in 1:length(y)) {
  y[i] ~ dnorm(m[i], tau)
  m[i] <- alpha + beta * (x[i] - x_bar)
  ...
}
# R:
y[i] <- dnorm(n, m[i], tau)
In JAGS, what will the n values be, since n is not specified inside the dnorm function (dnorm(m[i], tau))?
For each i, does the dnorm function calculate the density value of each y with respect to the mean m[i], which has a linear relationship determined by the deterministic node, and tau (the precision)?
In short, I want to know what n values will be used by dnorm, or by any other density function (dgamma or dbeta).
In this specific instance y is your response variable, m is your linear predictor, and tau is the precision (the inverse of the variance). Using ~ makes the relationship stochastic. Looking at the JAGS user manual...
"Relations can be of two types. A stochastic relation (~) defines a stochastic node, representing a random variable in the model. A deterministic relation (<-) defines a deterministic
node, the value of which is determined exactly by the values of its parents. The equals sign
(=) can be used for a deterministic relation in place of the left arrow (<-)."
So, in other words, you are assuming that the values in y are drawn from a normal distribution whose parameters are m and tau.
While dnorm in R calculates the density, JAGS works with the log density (as per the user manual). Effectively, this stochastic relationship allows you to use y and x to estimate alpha, beta, and tau; you use dnorm here to make a distributional assumption about the data-generating process.
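A quick R illustration of the difference (the values here are just examples): in R, dnorm() deterministically evaluates a density at a quantile, while the n argument belongs to rnorm(), which draws random values; the JAGS ~ dnorm() is a distributional declaration, not a function call.
# R: dnorm() evaluates the normal density at a quantile; nothing is drawn
dnorm(0.5, mean = 0, sd = 1)   # about 0.352
# R: rnorm() is the function with an n argument: the number of draws
rnorm(3, mean = 0, sd = 1)     # three random draws
# JAGS: y[i] ~ dnorm(m[i], tau) declares y[i] to be normally distributed
# with mean m[i] and precision tau; there is no n, because nothing is
# being "called" on the data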
Of course, as this is a Bayesian analysis, you'll need priors for your parameters. You can also deterministically calculate the standard deviation instead of precision. A full model would look something like...
model{
  # likelihood
  for(i in 1:length(y)) {
    y[i] ~ dnorm(m[i], tau)
    m[i] <- alpha + beta * (x[i] - x_bar)
  }
  # priors
  tau ~ dgamma(0.001, 0.001)
  sd <- 1 / sqrt(tau)   # derived standard deviation
  alpha ~ dnorm(0, 0.001)
  beta ~ dnorm(0, 0.001)
}
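A minimal sketch of fitting this with rjags (assuming the model is stored in the string model_string and the vectors x and y exist; those names are mine):
library(rjags)
jm <- jags.model(textConnection(model_string),
                 data = list(y = y, x = x, x_bar = mean(x)),
                 n.chains = 3)
update(jm, 1000)   # burn-in
samps <- coda.samples(jm, c("alpha", "beta", "sd"), n.iter = 5000)
summary(samps)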
I'm having trouble implementing the deviance information criterion (DIC) manually for a JAGS model.
model = "
data{
for(i in 1:n){
zeros[i]<- 0
}
}
model{
C <- 10000
for (i in 1:n) {
zeros[i] ~ dpois(lambda[i])
lambda[i] <- -l[i] + C
l[i] <-
-0.5*log(sigma[i]*(y[i]*(1-y[i]))^3) +
-0.5*(1/sigma[i])*((y[i]-mu[i])^2)/(y[i]*(1-y[i])*mu[i]^2*(1-mu[i])^2)
logit(mu[i]) <- beta0 + beta1*income[i] + beta2*person[i]
log(sigma[i]) <- -delta0
}
Deviance <- -2*sum(l[])
beta0 ~ dnorm(0,.001)
beta1 ~ dnorm(0,.001)
beta2 ~ dnorm(0,.001)
delta0 ~ dnorm(0,.001)
}"
The rjags package has a function dic.samples() that returns the DIC value, but for this model, which uses the Poisson zeros trick, it doesn't work.
I want to implement the DIC computation myself, but I don't know how to do it.
EDIT:
If I run coda.samples() and monitor the Deviance node, it returns the posterior mean and standard deviation of the deviance; I can then calculate DIC using Gelman's approximation to pD (pD ≈ var(Deviance)/2). Is that right?
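In code, the calculation described in the EDIT would look something like this (a sketch; jm stands for the compiled rjags model object):
samps <- coda.samples(jm, "Deviance", n.iter = 10000)
D <- as.matrix(samps)[, "Deviance"]
Dbar <- mean(D)    # posterior mean deviance
pD <- var(D) / 2   # Gelman's approximation to the effective number of parameters
DIC <- Dbar + pD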
A related problem: I couldn't extract DIC from models fit with the 'R2jags' package - the dic.samples() and related functions did not work.
Also, because my model was simultaneously calculating a lot of derived parameters (my outcome variable over a fine-scale gradient of the predictor variable), I couldn't use the documented print() function, because there was too much text and it got truncated before the DIC output.
The solution took a bit of poking around in the output data structure, but it is very easy. If you fit your model by:
model.name <- jags(data=jag.data, inits=inits, parameters.to.save=parameters, model.file="modelfile.txt", n.thin=nt, n.chains=nc, n.burnin=nb, n.iter=ni, DIC=T, working.directory=getwd())
Then you can call the pD and DIC values via:
model.name$BUGSoutput$pD
model.name$BUGSoutput$DIC
I'm trying a simple Gamma GLM in Stan and R, but it crashes immediately.
Generate data:
set.seed(1)
library(rstan)
N <- 500  # sample size
dat <- data.frame(x1 = runif(N, -1, 1), x2 = runif(N, -1, 1))
# the model matrix
X <- model.matrix(~., dat)
K <- dim(X)[2]  # number of regression params
# the regression slopes
betas <- runif(K, -1, 1)
shape <- 10
# simulate gamma data
mus <- exp(X %*% betas)
y <- rgamma(N, shape = shape, rate = shape / mus)
This is my Stan model:
model_string <- "
data {
int<lower=0> N; //the number of observations
int<lower=0> K; //the number of columns in the model matrix
matrix[N,K] X; //the model matrix
vector[N] y; //the response
}
parameters {
vector[K] betas; //the regression parameters
real<lower=0, upper=1000> shape; //the shape parameter
}
model {
y ~ gamma(shape, (shape/exp(X * betas)));
}"
When I run this model, R immediately crashes:
m <- stan(model_code = model_string, data = list(X=X, K=3, N=500, y=y), chains = 1, cores=1)
Update: I think the problem is somewhere in the vectorization, as I can get a running model if I pass every column of X as a separate vector.
Update 2: this also works:
for(i in 1:N)
  y[i] ~ gamma(shape, (shape / exp(X[i] * betas)));
The problem with the original code is that there is no operator defined in Stan for a scalar divided by a vector, which is what
shape / exp(X * betas)
attempts. You can instead expand the scalar into a vector and use elementwise division:
rep_vector(shape, N) ./ exp(X * betas)
or, equivalently,
(shape * rep_vector(1, N)) ./ exp(X * betas)
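Putting that into the model block, a corrected version (a sketch under the same data declarations) would be:
model {
  // expand shape to a vector, then divide elementwise by the means
  y ~ gamma(shape, rep_vector(shape, N) ./ exp(X * betas));
}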