Writing a custom Probit function in Stan

I am trying to code a custom probit function in Stan to improve my understanding of the Stan language and likelihoods. So far I've written the logarithm of the normal pdf, but when I try to write the likelihood I get an error message that I find unintelligible. What am I doing wrong?
Stan model
functions {
real normal_lpdf(real mu, real sigma) {
return -log(2 * pi()) / 2 - log(sigma)
- square(mu) / (2 * sigma^2);
}
real myprobit_lpdf(int y | real mu, real sigma) {
return normal_lpdf(mu, sigma)^y * (1 - normal_lpdf(mu, sigma))^(1-y);
}
}
data {
int N;
int y[N];
}
parameters {
real mu;
real<lower = 0> sigma;
}
model {
for (n in 1:N) {
target += myprobit_lpdf(y[n] | mu, sigma);
}
}
Error
PARSER EXPECTED:
Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname, :
failed to parse Stan model 'Probit_lpdf' due to the above error.
R code to simulate data
## DESCRIPTION
# testing a Probit model
## DATA
N <- 2000
sigma <- 1
mu <- 0.3
u <- rnorm(N, 0, 2)
y.star <- rnorm(N, mu, sigma)
y <- ifelse(y.star > 0,1, 0)
data = list(
N = N,
y = y
)
## MODEL
out.stan <- stan("Probit_lpdf.stan",data = data, chains = 2, iter = 1000 )

The full error message is
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Probabilty functions with suffixes _lpdf, _lpmf, _lcdf, and _lccdf,
require a vertical bar (|) between the first two arguments.
error in 'model2a7252aef8cf_probit' at line 7, column 27
-------------------------------------------------
5: }
6: real myprobit_lpdf(real y, real mu, real sigma) {
7: return normal_lpdf(mu, sigma)^y * (1 - normal_lpdf(mu, sigma))^(1-y);
^
8: }
-------------------------------------------------
which is telling you that the normal_lpdf function expects three inputs, with a vertical bar (|) separating the first from the second.
It is also not a good idea to give your function the same name as a function that is already in the Stan language, such as normal_lpdf.
But the functions you have written do not implement the log-likelihood of a probit model anyway. First, the standard deviation of the errors is not identified by the data: since Pr(y = 1) = Phi(mu / sigma), only the ratio mu / sigma is identified, so you do not need sigma. Then, the correct expressions would be something like
real Phi_mu = Phi(mu);
real log_Phi_mu = log(Phi_mu);
real log1m_Phi_mu = log1m(Phi_mu);
for (n in 1:N)
target += y[n] == 1 ? log_Phi_mu : log1m_Phi_mu;
although that is just a slow way of doing
target += bernoulli_lpmf(y | Phi(mu));
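Putting the pieces together, a complete model along those lines would look something like this (just a sketch: the bounds in the data block are additions, and no prior is placed on mu, matching the original post):
data {
  int<lower=1> N;
  int<lower=0, upper=1> y[N];  // binary outcomes
}
parameters {
  real mu;  // probit index; sigma is fixed to 1 since it is not identified
}
model {
  // probit likelihood: Pr(y = 1) = Phi(mu)
  target += bernoulli_lpmf(y | Phi(mu));
}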

Related

Need to fix Stan code for the generalized pareto distribution to take in real arguments rather than vectors

I am using the functions defined here: Extreme value analysis and user defined probability functions in Stan, for modeling the data with a generalized Pareto distribution. My problem is that my model is in a for-loop and expects three real-valued arguments, whereas the gpd functions assume vector, real, real arguments.
I'm not sure my model block is amenable to being vectorized, so I was thinking I would need to have the gpd functions take real-valued arguments (but maybe I'm wrong).
I'd appreciate any help with switching the code around to achieve this. Here is my Stan code:
functions {
real gpareto_lpdf(vector y, real k, real sigma) {
// generalised Pareto log pdf
int N = rows(y);
real inv_k = inv(k);
if (k<0 && max(y)/sigma > -inv_k)
reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
if (sigma<=0)
reject("sigma<=0; found sigma =", sigma);
if (fabs(k) > 1e-15)
return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
else
return -sum(y)/sigma -N*log(sigma); // limit k->0
}
real gpareto_lcdf(vector y, real k, real sigma) {
// generalised Pareto log cdf
real inv_k = inv(k);
if (k<0 && max(y)/sigma > -inv_k)
reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
if (sigma<=0)
reject("sigma<=0; found sigma =", sigma);
if (fabs(k) > 1e-15)
return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
else
return sum(log1m_exp(-(y)/sigma)); // limit k->0
}
}
data {
// the input data
int<lower = 1> n; // number of observations
real<lower = 0> value[n]; // value measurements
int<lower = 0, upper = 1> censored[n]; // vector of 0s and 1s
// parameters for the prior
real<lower = 0> a;
real<lower = 0> b;
}
parameters {
real k;
real sigma;
}
model {
// prior
k ~ gamma(a, b);
sigma ~ gamma(a,b);
// likelihood
for (i in 1:n) {
if (censored[i]) {
target += gpareto_lcdf(value[i] | k, sigma);
} else {
target += gpareto_lpdf(value[i] | k, sigma);
}
}
}
Here is how the log PDF could be adapted. This way, index arrays for subsetting y into censored and non-censored observations can be passed.
real cens_gpareto_lpdf(vector y, int[] cens, int[] no_cens, real k, real sigma) {
// generalised Pareto log pdf
int N = size(cens);
real inv_k = inv(k);
if (k<0 && max(y)/sigma > -inv_k)
reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
if (fabs(k) > 1e-15)
return -(1+inv_k)*sum(log1p((y[no_cens]) * (k/sigma))) -N*log(sigma) +
sum(log1m_exp((-inv_k)*(log1p((y[cens]) * (k/sigma)))));
else
return -sum(y[no_cens])/sigma -N*log(sigma) +
sum(log1m_exp(-(y[cens])/sigma));
}
Extend the data block: n_cens, n_not_cens, cens, and no_cens are values that need to be supplied.
int<lower = 1> n; // total number of obs
int<lower = 1> n_cens; // number of censored obs
int<lower = 1> n_not_cens; // number of regular obs
int cens[n_cens]; // index set censored
int no_cens[n_not_cens]; // index set regular
vector<lower = 0>[n] value; // value measurements
Constrain the parameters to be positive, as suggested by gfgm:
parameters {
real<lower=0> k;
real<lower=0> sigma;
}
Rewrite the model block:
model {
// prior
k ~ gamma(a, b);
sigma ~ gamma(a,b);
// likelihood
value ~ cens_gpareto(cens, no_cens, k, sigma);
}
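(As an aside, not from the original answer: the sampling statement above is equivalent to incrementing the log density explicitly with the _lpdf suffix and the vertical-bar syntax.)
target += cens_gpareto_lpdf(value | cens, no_cens, k, sigma);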
Disclaimer: I neither checked the formulas for sanity nor ran the model using test data. Just compiled via rstan::stan_model() which worked fine. gfgm's suggestion may be more convenient for post-processing / computing stuff in generated quantities etc. I'm not a Stan expert :-).
Edit:
Fixed the divergence issue found by gfgm through simulation. The likelihood was ill-defined (N = rows(y) instead of N = size(cens)). It runs fine now with gfgm's data (using set.seed(123) and rstan):
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
k 0.16 0.00 0.10 0.02 0.08 0.14 0.21 0.42 1687 1
sigma 0.90 0.00 0.12 0.67 0.82 0.90 0.99 1.16 1638 1
lp__ -106.15 0.03 1.08 -109.09 -106.56 -105.83 -105.38 -105.09 1343 1
You can get the model to accept a vector and avoid a for-loop. You need to change the type of value where it is declared, and also feed in as data the indices of the censored and uncensored observations.
I'm posting code below that runs on made-up data, but the data I made up are just some random log-normal variates rather than actual draws from a Pareto, so I have no idea whether the model recovers the parameters or what its coverage looks like. You'll probably want to do a bit of simulation-based calibration to check the model.
The stan code:
functions {
real gpareto_lpdf(vector y, real k, real sigma) {
// generalised Pareto log pdf
int N = rows(y);
real inv_k = inv(k);
if (k<0 && max(y)/sigma > -inv_k)
reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
if (sigma<=0)
reject("sigma<=0; found sigma =", sigma);
if (fabs(k) > 1e-15)
return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
else
return -sum(y)/sigma -N*log(sigma); // limit k->0
}
real gpareto_lcdf(vector y, real k, real sigma) {
// generalised Pareto log cdf
real inv_k = inv(k);
if (k<0 && max(y)/sigma > -inv_k)
reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
if (sigma<=0)
reject("sigma<=0; found sigma =", sigma);
if (fabs(k) > 1e-15)
return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
else
return sum(log1m_exp(-(y)/sigma)); // limit k->0
}
}
data {
// the input data
int<lower = 1> n; // number of observations
int<lower = 0> n_cens;
vector<lower = 0>[n] value; // value measurements
int<lower=1> cens_id[n_cens]; // the indices of censored observations
int<lower=1> nocens_id[n - n_cens]; // the indices of uncensored obs
// parameters for the prior
real<lower = 0> a;
real<lower = 0> b;
}
parameters {
real<lower=0> k;
real<lower=0> sigma;
}
model {
// prior
k ~ gamma(a, b);
sigma ~ gamma(a,b);
// likelihood
target += gpareto_lcdf(value[cens_id] | k, sigma);
target += gpareto_lpdf(value[nocens_id] | k, sigma);
}
This runs basically instantaneously on my mid-tier laptop using cmdstanr. You can run it with the R code
library(cmdstanr)
library(bayesplot)
gparet_mod <- cmdstan_model("gpareto_example.stan")
# make some fake data. This won't be correct and it's
# no proper test of the capacity of the model to
# recover the parameters, just seeing if it runs
N <- 100
value <- rlnorm(N)
censored <- rbinom(N, 1, .5)
N_cens <- sum(censored)
cens_id <- which(censored == 1)
nocens_id <- which(censored == 0)
a <- b <- 2
dat <- list(n = N, n_cens = N_cens, value = value,
cens_id = cens_id, nocens_id = nocens_id,
a = a, b = b)
ests <- gparet_mod$sample(data = dat, parallel_chains = 2, chains = 2)
This produces what appear to be sane samples
ests$summary()
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -103. -103. 1.10 0.767 -105. -102. 1.00 750. 807.
2 k 0.316 0.296 0.135 0.131 0.128 0.557 1.00 665. 570.
3 sigma 0.691 0.685 0.113 0.110 0.511 0.887 1.00 772. 887.
And posterior distributions
mcmc_hist(ests$draws(variables = c("k", "sigma")))
I think the approach of Martin Arnold is a bit more elegant, but when I tried to get it to run on the same data the sampler broke down with divergences, so maybe I ported the model wrong or something needs tweaking.
Edit
Given the likelihood functions it makes more sense to constrain the parameters k and sigma to be positive. The model runs better and you'll get fewer error messages. I modified the stan code above to reflect this.

RStan - Problem in stan_file model code - Variational Bayes

I am trying to do Variational inference, so that I can get the best approximating distribution with respect to the target distribution: a Normal-Inverse-Wishart.
[Image: Normal-Inverse-Wishart distribution]
However when I compile the stan file with the model code it gives the error:
Error in stanc(file = file, model_code = model_code, model_name = model_name, : 0
Syntax error in 'string', line 16, column 14 to column 15, parsing error:
Expected "generated quantities {" or end of file after end of model block.
I have tried to investigate what this is referring to but I require some help. My R code is:
stan_file <- "RStan VI model.stan"
stan_model <- rstan::stan_model(file = stan_file)  # error occurs at this line
The RStan file model code is:
data {
int<lower=1> N; // number of assets
real<lower=0> nu0; // prior confidence for sigma
matrix[N, N] sigma_0; // prior sigma
real<lower=0> T0; // prior confidence for mu
vector[N] mu0; // prior mu
}
parameters {
matrix[N, N] sigma;
vector[N] mu;
}
transformed parameters {
matrix[N, N] a;
matrix[N, N] b;
a = sigma0*nu0;
b = sigma/T0;
}
model {
target += inv_wishart_lpdf(sigma | nu0, a);
target += normal_lpdf(mu | mu0, b);
}
I even tried changing the last section of the model code to:
model {
sigma ~ inv_wishart(nu0, a);
mu ~ normal(mu0, b);
}
But I still get the same error. Would anyone know what the error is and how I can fix it?
Many thanks.
Best,
Nihaar

Problems fitting a linear regression in Stan

I'm having trouble fitting a linear regression model in Stan. The error message points to the transformed parameters block.
See the structure of the Stan code below.
Packages:
library(rstan)
library(bayesplot)
Data:
head(Orange)
cols <- c(colnames(Orange[-1]))
Orange <- Orange[,cols]
str(Orange)
Code in Stan:
The block structure follows the recommended pattern; however, I am not able to identify which part of the code is wrong.
y = Orange$circumference
x = Orange$age
n = length(y)
regresstan = '
data{
int n;
real y[n];
real x[n];
}
parameters{
real alpha;
real beta;
real sigma;
}
transformed parameters{
real mu[n];
mu = alpha + beta*x;
}
model{
//Priors
alpha ~ normal(0, 100);
beta ~ normal(0, 100);
sigma ~ uniform(0, 100);
//Likelihood
y ~ normal(mu, sigma);
}
'
Error:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for:
real * real[ ]
Available argument signatures for operator*:
real * real
vector * real
row_vector * real
matrix * real
row_vector * vector
vector * row_vector
matrix * vector
row_vector * matrix
matrix * matrix
real * vector
real * row_vector
real * matrix
No matches for:
real + ill-formed
Available argument signatures for operator+:
int + int
real + real
vector + vector
row_vector + row_vector
matrix + matrix
vector + real
row_vector + real
matrix + real
real + vector
real + row_vector
real + matrix
+int
+real
+vector
+row_vector
+matrix
Expression is ill formed.
error in 'modele28054257a16_a9d23411185fa271b60f20be43062e80' at line 16, column 23
-------------------------------------------------
14: transformed parameters{
15: real mu[n];
16: mu = alpha + beta*x;
^
17: }
-------------------------------------------------
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model 'a9d23411185fa271b60f20be43062e80' due to the above error.
The error comes from the transformed parameters block at the line
mu = alpha + beta*x;
The error is saying you can't multiply a real scalar by an array of reals (the real * real[ ] in the message). You can solve this by looping over the values of mu:
transformed parameters {
real mu[n];
for(i in 1:n) {
mu[i] = alpha + beta * x[i];
}
}
which resolves the issue as now you have a real scalar times a real scalar.
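Alternatively (a sketch, not part of the original answer), you can keep the vectorized expression by declaring y and x as Stan vectors instead of arrays of reals, since vector * real and real + vector are both defined; the bounds on sigma are an added assumption chosen to match the uniform(0, 100) prior:
data{
  int n;
  vector[n] y;  // declared as vectors rather than real arrays
  vector[n] x;
}
parameters{
  real alpha;
  real beta;
  real<lower=0, upper=100> sigma;  // bounds match the uniform(0, 100) prior
}
transformed parameters{
  vector[n] mu;
  mu = alpha + beta*x;  // legal now: beta*x is a vector, alpha + vector is a vector
}
model{
  //Priors
  alpha ~ normal(0, 100);
  beta ~ normal(0, 100);
  sigma ~ uniform(0, 100);
  //Likelihood
  y ~ normal(mu, sigma);
}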

State-space models with latent variables in R

I tried to estimate a state-space model with latent variables in R using the rstan package, but I failed.
Does someone know how to do it better?
Let's say we have two observable variables: gdp and inflation.
We also know there are three unobservable variables: potential growth, potential gdp, and the output gap.
We want to estimate the unobservable variables and two coefficients (a1 and a2).
Our model:
State Equations:
grow[t] = grow[t-1] + e
potential[t] = potential[t-1] + grow[t] + e
gap[t] = a1*gap[t-1] + e
Measurement Equations:
gdp[t] = potential[t] + gap[t]
inflation[t] = inflation[t-1] + a2*gap[t] + e
Here I generated data:
a1 <- 0.7
a2 <- 0.3
grow <- c()
potential <- c()
gap <- c()
gdp <- c()
inflation <- c()
grow[1] <- 2
potential[1] <- 10
gap[1] <- 0
gdp[1] <- potential[1] + gap[1]/100
inflation[1] <- 2
for (i in 2:100) {
grow[i] <- grow[i-1] + rnorm(1, 0, 0.1)
potential[i] <- potential[i-1] + grow[i]/100 + rnorm(1, 0, 0.1)
gap[i] <- a1*gap[i-1] + rnorm(1,0,0.1)
gdp[i] <- potential[i] + gap[i]/100
inflation[i] <- inflation[i-1] + a2*gap[i] + rnorm(1,0,0.1)
}
And here is my rstan code:
data {
int T; // number of obs
int P; //number of variables
matrix[T,P] Y; //dataset of generated series
}
parameters {
#Coefficients
vector[1] alfa1; //ar gap
vector[1] alfa2; //phillips curve
#State Variables (unobserved economic variables)
vector[T] gap; // output gap
vector[T] potential; // potential output
vector[T] grow; // growth of potential output
#Innovations
real<lower = 0> sigma_pot; // The scale of innovations to potential output
real<lower = 0> sigma_grow; // The scale of innovations to growth in potential output
real<lower = 0> sigma_gap; // The scale of innovations to output gap
real<lower = 0> sigma_inf; // The scale of innovations to phillips curve
}
model {
// priors
//Innovations
sigma_pot ~ cauchy(0.2,3);
sigma_grow ~ cauchy(0.3,3);
sigma_gap ~ cauchy(0.9,5);
sigma_inf ~ cauchy(2,5);
//coefficients
alfa1 ~ normal(0,1);
alfa2 ~ normal(0,1);
//Initialize State Equations
potential[1] ~ normal(0,1);
grow[1] ~ normal(0,1);
gap[1] ~ normal(0,1);
// State Equations
for(t in 2:T) {
grow[t] ~ normal(grow[t-1], sigma_grow);
potential[t] ~ normal(potential[t-1] + grow[t], sigma_pot);
gap[t] ~ normal( alfa1*gap[t-1], sigma_gap);
}
// Measurement Equations
for(t in 1:T) {
Y[t,1] = potential[t] + gap[t];
Y[t,2] ~ normal(Y[t-1,2] + alfa1*gap[t],sigma_inf);
}
}
Here is how I tried to run the model:
mvf_model <- stan(file = "newstan.stan" ,
data = list(T = nrow(data),
P = ncol(data),
Y = data),
chains = 4)
And here is the error I've got
Cannot assign to variable outside of declaration block; left-hand-side variable origin=data
Illegal statement beginning with non-void expression parsed as
Y[[t, 1]]
Not a legal assignment, sampling, or function statement. Note that
* Assignment statements only allow variables (with optional indexes) on the left;
* Sampling statements allow arbitrary value-denoting expressions on the left.
* Functions used as statements must be declared to have void returns
error in 'model16a4a72378d_newstan' at line 27, column 3
-------------------------------------------------
25: transformed parameters {
26: for(t in 1:T) {
27: Y[t,1] = potential[t] + gap[t];
^
28: }
-------------------------------------------------
PARSER EXPECTED: "}"
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model 'newstan' due to the above error.
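The parser message spells out the rule: Y is data, and data can never be assigned to inside the model block; observed quantities enter the model only through sampling statements or target += increments. As a minimal sketch (not from the original thread), the inflation measurement equation can be written as a sampling statement rather than an assignment, assuming alfa2 is declared as a plain real; the noiseless gdp equation would additionally need a measurement-error term or a reparameterization, which is not worked out here:
// sketch only: a measurement equation as a sampling statement, not an assignment
for(t in 2:T) {  // start at 2 so Y[t-1,2] never indexes row 0
  Y[t,2] ~ normal(Y[t-1,2] + alfa2*gap[t], sigma_inf);
}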

Partially observed parameter in Stan

I am trying to migrate some code from JAGS to Stan. Say I have the following dataset:
N <- 10
nchoices <- 3
ncontrols <- 3
toydata <- list("y" = rbinom(N, nchoices - 1, .5),
"controls" = matrix(runif(N*ncontrols), N, ncontrols),
"N" = N,
"nchoices" = nchoices,
"ncontrols" = ncontrols)
and that I want to run a multinomial logit with the following code (taken from section 9.5 of the documentation):
data {
int N;
int nchoices;
int y[N];
int ncontrols;
vector[ncontrols] controls[N];
}
parameters {
matrix[nchoices, ncontrols] beta;
}
model {
for (k in 1:nchoices)
for (d in 1:ncontrols)
beta[k,d] ~ normal(0,100);
for (n in 1:N)
y[n] ~ categorical(softmax(beta * controls[n]));
}
I now want to fix the first row of beta to zero. In JAGS I would simply declare in the model block that
for (i in 1:ncontrols) {
beta[1,i] <- 0
}
but I am not sure about how to do this in Stan. I have tried many combinations along the lines of section 6.2 of the documentation (Partially Known Parameters) like, for instance,
parameters {
matrix[nchoices, ncontrols] betaNonObs;
}
transformed parameters {
matrix[nchoices, ncontrols] beta;
for (i in 1:ncontrols) beta[1][i] <- 0
for (k in 2:nchoices) beta[k] <- betaNonObs[k - 1]
}
but none of them work. Any suggestions?
It would be helpful to mention the error message. In this case, if beta is declared to be a matrix, then the syntax you want is the R-like syntax
beta[1,i] <- 0.0; // you also omitted the semicolon
To answer your broader question, I believe you were on the right track with your last approach. I would create a matrix of parameters in the parameters block called free_beta and copy those elements to another matrix declared in the model block called beta that has one extra row at the top for the fixed zeros. Like
data {
int N;
int nchoices;
int y[N];
int ncontrols;
vector[ncontrols] controls[N];
}
parameters {
matrix[nchoices-1, ncontrols] free_beta;
}
model {
// copy free beta into beta
matrix[nchoices,ncontrols] beta;
for (d in 1:ncontrols)
beta[1,d] <- 0.0;
for (k in 2:nchoices)
for (d in 1:ncontrols)
beta[k,d] <- free_beta[k-1,d];
// priors on free_beta, which execute faster this way
for (k in 1:(nchoices-1))
row(free_beta,k) ~ normal(0.0, 100.0);
// likelihood
for (n in 1:N)
y[n] ~ categorical(softmax(beta * controls[n]));
}
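As a side note (not part of the original answer), in more recent versions of the Stan language the fixed-zero row can also be built directly in the transformed parameters block with append_row, which has the advantage that beta is then saved with the posterior draws:
transformed parameters {
  // first row fixed at zero, remaining rows are the free coefficients
  matrix[nchoices, ncontrols] beta = append_row(rep_row_vector(0, ncontrols), free_beta);
}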
