Efficiently sample a collection of multi-normal variables with varying sigma (covariance) matrix - stan

I'm new to Stan, so hoping you can point me in the right direction. I'll build up to my situation to make sure we're on the same page...
If I had a collection of univariate normals, the docs tell me that:
y ~ normal(mu_vec, sigma);
provides the same model as the unvectorized version:
for (n in 1:N)
y[n] ~ normal(mu_vec[n], sigma);
but that the vectorized version is (much?) faster. Ok, fine, makes good sense.
So the first question is: is it possible to take advantage of this vectorization speedup in the univariate normal case where both the mu and sigma of the samples vary by position in the vector. I.e. if both mu_vec and sigma_vec are vectors (in the previous case sigma was a scalar), then is this:
y ~ normal(mu_vec, sigma_vec);
equivalent to this:
for (n in 1:N)
y[n] ~ normal(mu_vec[n], sigma_vec[n]);
and if so is there a comparable speedup?
Ok. That's the warmup. The real question is how to best approach the multi-variate equivalent of the above.
In my particular case, I have N of observations of bivariate data for some variable y, which I store in an N x 2 matrix. (For order of magnitude, N is about 1000 in my use case.)
My belief is that the mean of each component of each observation is 0 and that the stdev of each component is each observation is 1 (and I'm happy to hard code them as such). However, my belief is that the correlation (rho) varies from observation to observation as a (simple) function of another observed variable, x (stored in an N element vector). For example, we might say that rho[n] = 2*inverse_logit(beta * x[n]) - 1 for n in 1:N and our goal is to learn about beta from our data. I.e. the covariance matrix for the nth observation would be:
[1, rho[n]]
[rho[n], 1 ]
My question is what's the best way to put this together in a STAN model so that it isn't slow as heck? Is there a vectorized version of the multi_normal distribution so that I could specify this as:
y ~ multi_normal(vector_of_mu_2_tuples, vector_of_sigma_matrices)
or perhaps as some other similar formulation? Or will I need to write:
for (n in 1:N)
y[n] ~ multi_normal(vector_of_mu_2_tuples[n], vector_of_sigma_matrices[n])
after having set up vector_of_sigma_matrices and vector_of_mu_2_tuples in an earlier block?
Thanks in advance for any guidance!
Edit to add code
Using python, I can generate data in the spirit of my problem as follows:
import numpy as np
import pandas as pd
import pystan as pys
import scipy as sp
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns
def gen_normal_data(N, true_beta, true_mu, true_stdevs):
N = N
true_beta = true_beta
true_mu = true_mu
true_stdevs = true_stdevs
drivers = np.random.randn(N)
correls = 2.0 * sp.special.expit(drivers*true_beta)-1.0
observations = []
for i in range(N):
covar = np.array([[true_stdevs[0]**2, true_stdevs[0] * true_stdevs[1] * correls[i]],
[true_stdevs[0] * true_stdevs[1] * correls[i], true_stdevs[1]**2]])
observations.append(sp.stats.multivariate_normal.rvs(true_mu, covar, size=1).tolist())
observations = np.array(observations)
return {
'N': N,
'true_mu': true_mu,
'true_stdev': true_stdevs,
'y': observations,
'd': drivers,
'correls': correls
}
and then actually generate the data using:
normal_data = gen_normal_data(100, 1.5, np.array([1., 5.]), np.array([2., 5.]))
Here's what the data set looks like (scatterplot of y colored by correls in the left pane and by drivers in the right pane...so the idea is that the higher the driver the closer to 1 the correl and the lower the driver, the closer to -1 the correl. So would expect red dots on the left pane to be "down-left to up-right" and blue dots to be "up-left to down-right", and indeed they are:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
x = normal_data['y'][:, 0]
y = normal_data['y'][:, 1]
correls = normal_data['correls']
drivers = normal_data['d']
for ax, colordata, cmap in zip(axes, [correls, drivers], ['coolwarm', 'viridis']):
color_extreme = max(abs(colordata.max()), abs(colordata.min()))
sc = ax.scatter(x, y, c=colordata, lw=0, cmap=cmap, vmin=-color_extreme, vmax=color_extreme)
divider = make_axes_locatable(ax)
cax = divider.append_axes('right', size='5%', pad=0.05)
fig.colorbar(sc, cax=cax, orientation='vertical')
fig.tight_layout()
Using the brute force approach, I can set up a STAN model that looks like this:
model_naked = pys.StanModel(
model_name='naked',
model_code="""
data {
int<lower=0> N;
vector[2] true_mu;
vector[2] true_stdev;
real d[N];
vector[2] y[N];
}
parameters {
real beta;
}
transformed parameters {
}
model {
real rho[N];
matrix[2, 2] cov[N];
for (n in 1:N) {
rho[n] = 2.0*inv_logit(beta * d[n]) - 1.0;
cov[n, 1, 1] = true_stdev[1]^2;
cov[n, 1, 2] = true_stdev[1] * true_stdev[2] * rho[n];
cov[n, 2, 1] = true_stdev[1] * true_stdev[2] * rho[n];
cov[n, 2, 2] = true_stdev[2]^2;
}
beta ~ normal(0, 10000);
for (n in 1:N) {
y[n] ~ multi_normal(true_mu, cov[n]);
}
}
"""
)
This fits nicely:
fit_naked = model_naked.sampling(data=normal_data, iter=1000, chains=2)f = fit_naked.plot();
f.tight_layout()
But I'm hoping someone can point me in the right direction for the "marginalized" approach where we break down our bivariate normal into a pair of independent normals that can be blended using the correlation. The reason I need this is that in my actual use case, both dimensions of are fat-tailed. I am happy to model this as a student-t distribution, but the issue is that STAN only allows a single nu to be specified (not one for each dimension), so I think I'll need to find a way to decompose a multi_student_t into a pair of independent student_t's so that I can set the degrees of freedom separately for each dimension.

The univariate normal distribution does accept vectors for any or all of its arguments and it will be faster than looping over the N observations to call it N times with scalar arguments.
However, the speedup is only going to be linear because the calculations are all the same, but it only has to allocate memory once if you only call it once. The overall wall time is more affected by the number of function evaluations you have to do, which is up to 2^10 - 1 per MCMC iteration (by default), but whether you hit the maximum treedepth depends on the geometry of the posterior distribution you are trying to sample from, which, in turn, depends on everything including the data you condition on.
The bivariate normal distribution can be written as a product of a marginal univariate normal distribution for the first variable and a conditional univariate normal distribution for the second variable given the first variable. In Stan code, we can utilize element-wise multiplication and division to write its log-density like
target += normal_lpdf(first_variable | first_means, first_sigmas);
target += normal_lpdf(second_variable | second_means +
rhos .* first_sigmas ./ second_sigmas .* (first_variable - first_means),
second_sigmas .* sqrt(1 - square(rhos)));
Unfortunately, the more general multivariate normal distribution in Stan does not have an implementation that inputs arrays of covariance matrices.

This isn't quite answering your question, but you can make your program more efficient by removing a bunch of redundant calculations and converting scale a bit to use tanh rather than scaled inverse logit. I'd get rid of the scaling and just use smaller betas, but I left it so that it should get the same results.
data {
int<lower=0> N;
vector[2] mu;
vector[2] sigma;
vector[N] d;
vector[2] y[N];
}
parameters {
real beta;
}
transformed data {
real var1 = square(sigma[1]);
real var2 = square(sigma[2]);
real covar12 = sigma[1] * sigma[2];
vector[N] d_div_2 = d * 0.5;
}
model {
// note: tanh(u) = 2 * inv_logit(u / 2) - 1
vector[N] rho = tanh(beta * d_div_2);
matrix[2, 2] Sigma;
Sigma[1, 1] = var1;
Sigma[2, 2] = var2;
// only reassign what's necessary with minimal recomputation
for (n in 1:N) {
Sigma[1, 2] = rho[n] * covar12;
Sigma[2, 1] = Sigma[1, 2];
y[n] ~ multi_normal(true_mu, Sigma);
}
// weakly informative priors fit more easily
beta ~ normal(0, 8);
}
You could also factor by figuring out Cholesky factorization as function of rho and other fixed values and use that---it saves a solver step in the multivariate normal.
The other option you have is to write out the multi-student-t directly rather than using our built-in implementation. The built-in probably won't be a whole lot faster as the whole operation's pretty heavily dominated by the matrix solve.

Related

RJAGS - How to pass more complex functions in BUGS file

My goal is to basically migrate this code to R.
All the preprocessing wrt datasets has been already done, now however I am stuck in writing the "model" file. As a first attempt, and for the sake of clarity, I wrote the code which is shown below in R language.
What I want to do is to run an MCMC to have an estimate of the parameter R_t, given the daily reported data for Italian Country.
The main steps that have been pursued are:
Sample an array parameter, namely the log(R_t), from a Gaussian RW distribution
Gauss_RandomWalk <- function(N, x0, mu, variance) {
z <- cumsum(rnorm(n=N, mean=mu, sd=sqrt(variance)))
t <- 1:N
x <- (x0 + t*mu + z)
return(x)
}
log_R_t <- Gauss_RandomWalk(tot_dates, 0., 0., 0.035**2)
R_t_candidate <- exp(log_R_t)
Compute some quantities, that are function of this sampled parameters, namely the number of infections. This dependence is quite simple, since it is linear algebra:
infections <- rep(0. , tot_dates)
infections[1] <- exp(seed)
for (t in 2:tot_dates){
infections[t] <- sum(R_t_candidate * infections * gt_to_convolution[t-1,])
}
Convolve the array I have just computed with a delay distribution (onset+reporting delay), finally rescaling it by the exposure variable:
test_adjusted_positive <- convolve(infections, delay_distribution_df$density, type = "open")
test_adjusted_positive <- test_adjusted_positive[1:tot_dates]
positive <- round(test_adjusted_positive*exposure)
Compute the Likelihood, which is proportional to the probability that a certain set of data was observed (i.e. daily confirmed cases), by sampling the aforementioned log(R_t) parameter from which the variable positive is computed.
likelihood <- dnbinom(round(Italian_data$daily_confirmed), mu = positive, size = 1/6)
Finally, here we come to my BUGS model file:
model {
#priors as a Gaussian RW
log_rt[1] ~ dnorm(0, 0.035)
log_rt[2] ~ dnorm(0, 0.035)
for (t in 3:tot_dates) {
log_rt[t] ~ dnorm(log_rt[t-1] + log_rt[t-2], 0.035)
R_t_candidate[t] <- exp(log_rt[t])
}
# data likelihood
for (t in 2:tot_dates) {
infections[t] <- sum(R_t_candidate * infections * gt_to_convolution[t-1,])
}
test_adjusted_positive <- convolve(infections, delay_distribution)
test_adjusted_positive <- test_adjusted_positive[1:tot_dates]
positive <- test_adjusted_positive*exposure
for (t in 2:tot_dates) {
confirmed[t] ~ dnbinom( obs[t], positive[t], 1/6)
}
}
where gt_to_convolution is a constant matrix, tot_dates is a constant value and exposure is a constant array.
When trying to compile it through:
data <- NULL
data$obs <- round(Italian_data$daily_confirmed)
data$tot_dates <- n_days
data$delay_distribution <- delay_distribution_df$density
data$exposure <- exposure
data$gt_to_convolution <- gt_to_convolution
inits <- NULL
inits$log_rt <- rep(0, tot_dates)
library (rjags)
library (coda)
set.seed(1995)
model <- "MyModel.bug"
jm <- jags.model(model , data, inits)
It raises the following raising error:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Deleting model
Error in jags.model(model, data, inits) : RUNTIME ERROR:
Compilation error on line 19.
Possible directed cycle involving test_adjusted_positive
Hence I am not even able to debug it a little, even though I'm pretty sure there is something wrong more in general but I cannot figure out what and why.
At this point, I think the best choice would be to implement a Metropolis Algorithm myself according to the likelihood above, but obviously, I would way much more prefer to use an already tested framework that is BUGS/JAGS, this is the reason why I am asking for help.

Generate random natural numbers that sum to a given number and comply to a set of general constraints

I had an application that required something similar to the problem described here.
I too need to generate a set of positive integer random variables {Xi} that add up to a given sum S, where each variable might have constraints such as mi<=Xi<=Mi.
This I know how to do, the problem is that in my case I also might have constraints between the random variables themselves, say Xi<=Fi(Xj) for some given Fi (also lets say Fi's inverse is known), Now, how should one generate the random variables "correctly"? I put correctly in quotes here because I'm not really sure what it would mean here except that I want the generated numbers to cover all possible cases with as uniform a probability as possible for each possible case.
Say we even look at a very simple case:
4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, what would be the "correct" way to generate them?
P.S. I know that this seems like it would be a better fit for math overflow but I found no solutions there either.
For 4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, one could use multinomial distribution
As soon as probability of the first number is low enough, your
condition would be almost always satisfied, if not - reject and repeat.
And multinomial distribution by design has the sum equal to 100.
Code, Windows 10 x64, Python 3.8
import numpy as np
def x1x2x3x4(rng):
while True:
v = rng.multinomial(100, [0.1, 1/2-0.1, 1/4, 1/4])
if v[0] <= 2*v[1]:
return v
return None
rng = np.random.default_rng()
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
UPDATE
Lots of freedom in selecting probabilities. E.g., you could make other (##2, 3, 4) symmetric. Code
def x1x2x3x4(rng, pfirst = 0.1):
pother = (1.0 - pfirst)/3.0
while True:
v = rng.multinomial(100, [pfirst, pother, pother, pother])
if v[0] <= 2*v[1]:
return v
return None
UPDATE II
If you start rejecting combinations, then you artificially bump probabilities of one subset of events and lower probabilities of another set of events - and total sum is always 1. There is NO WAY to have uniform probabilities with conditions you want to meet. Code below runs with multinomial with equal probabilities and computes histograms and mean values. Mean supposed to be exactly 25 (=100/4), but as soon as you reject some samples, you lower mean of first value and increase mean of the second value. Difference is small, but UNAVOIDABLE. If it is ok with you, so be it. Code
import numpy as np
import matplotlib.pyplot as plt
def x1x2x3x4(rng, summa, pfirst = 0.1):
pother = (1.0 - pfirst)/3.0
while True:
v = rng.multinomial(summa, [pfirst, pother, pother, pother])
if v[0] <= 2*v[1]:
return v
return None
rng = np.random.default_rng()
s = 100
N = 5000000
# histograms
first = np.zeros(s+1)
secnd = np.zeros(s+1)
third = np.zeros(s+1)
forth = np.zeros(s+1)
mfirst = np.float64(0.0)
msecnd = np.float64(0.0)
mthird = np.float64(0.0)
mforth = np.float64(0.0)
for _ in range(0, N): # sampling with equal probabilities
v = x1x2x3x4(rng, s, 0.25)
q = v[0]
mfirst += np.float64(q)
first[q] += 1.0
q = v[1]
msecnd += np.float64(q)
secnd[q] += 1.0
q = v[2]
mthird += np.float64(q)
third[q] += 1.0
q = v[3]
mforth += np.float64(q)
forth[q] += 1.0
x = np.arange(0, s+1, dtype=np.int32)
fig, axs = plt.subplots(4)
axs[0].stem(x, first, markerfmt=' ')
axs[1].stem(x, secnd, markerfmt=' ')
axs[2].stem(x, third, markerfmt=' ')
axs[3].stem(x, forth, markerfmt=' ')
plt.show()
print((mfirst/N, msecnd/N, mthird/N, mforth/N))
prints
(24.9267492, 25.0858356, 24.9928602, 24.994555)
NB! As I said, first mean is lower and second is higher. Histograms are a little bit different as well
UPDATE III
Ok, Dirichlet, so be it. Lets compute mean values of your generator before and after the filter. Code
import numpy as np
def generate(n=10000):
uv = np.hstack([np.zeros([n, 1]),
np.sort(np.random.rand(n, 2), axis=1),
np.ones([n,1])])
return np.diff(uv, axis=1)
a = generate(1000000)
print("Original Dirichlet sample means")
print(a.shape)
print(np.mean((a[:, 0] * 100).astype(int)))
print(np.mean((a[:, 1] * 100).astype(int)))
print(np.mean((a[:, 2] * 100).astype(int)))
print("\nFiltered Dirichlet sample means")
q = (a[(a[:,0]<=2*a[:,1]) & (a[:,2]>0.35),:] * 100).astype(int)
print(q.shape)
print(np.mean(q[:, 0]))
print(np.mean(q[:, 1]))
print(np.mean(q[:, 2]))
I've got
Original Dirichlet sample means
(1000000, 3)
32.833758
32.791228
32.88054
Filtered Dirichlet sample means
(281428, 3)
13.912784086871243
28.36360987535
56.23109285501087
Do you see the difference? As soon as you apply any kind of filter, you alter the distribution. Nothing is uniform anymore
Ok, so I have this solution for my actual question where I generate 9000 triplets of 3 random variables by joining zeros to sorted random tuple arrays and finally ones and then taking their differences as suggested in the answer on SO I mentioned in my original question.
Then I simply filter out the ones that don't match my constraints and plot them.
S = 100
def generate(n=9000):
uv = np.hstack([np.zeros([n, 1]),
np.sort(np.random.rand(n, 2), axis=1),
np.ones([n,1])])
return np.diff(uv, axis=1)
a = generate()
def plotter(a):
fig = plt.figure(figsize=(10, 10), dpi=100)
ax = fig.add_subplot(projection='3d')
surf = ax.scatter(*zip(*a), marker='o', color=a / 100)
ax.view_init(elev=25., azim=75)
ax.set_xlabel('$A_1$', fontsize='large', fontweight='bold')
ax.set_ylabel('$A_2$', fontsize='large', fontweight='bold')
ax.set_zlabel('$A_3$', fontsize='large', fontweight='bold')
lim = (0, S);
ax.set_xlim3d(*lim);
ax.set_ylim3d(*lim);
ax.set_zlim3d(*lim)
plt.show()
b = a[(a[:, 0] <= 3.5 * a[:, 1] + 2 * a[:, 2]) &\
(a[:, 1] >= (a[:, 2])),:] * S
plotter(b.astype(int))
As you can see, the distribution is uniformly distributed over these arbitrary limits on the simplex but I'm still not sure if I could forego throwing away samples that don't adhere to the constraints (work the constraints somehow into the generation process? I'm almost certain now that it can't be done for general {Fi}). This could be useful in the general case if your constraints limit your sampled area to a very small subarea of the entire simplex (since resampling like this means that to sample from the constrained area a you need to sample from the simplex an order of 1/a times).
If someone has an answer to this last question I will be much obliged (will change the selected answer to his).
I have an answer to my question, under a general set of constraints what I do is:
Sample the constraints in order to evaluate s, the constrained area.
If s is big enough then generate random samples and throw out those that do not comply to the constraints as described in my previous answer.
Otherwise:
Enumerate the entire simplex.
Apply the constraints to filter out all tuples outside the constrained area.
List the resulting filtered tuples.
When asked to generate, I generate by choosing uniformly from this result list.
(note: this is worth my effort only because I'm asked to generate very often)
A combination of these two strategies should cover most cases.
Note: I also had to handle cases where S was a randomly generated parameter (m < S < M) in which case I simply treat it as another random variable constrained between m and M and I generate it together with the rest of the variables and handle it as I described earlier.

How can I use RStan without for loops?

Is there a way to more efficiently perform the following calculations in RStan?
I have only provided the minimal amount of coded that is needed:
parameters {
real beta_0;
real beta_1;
}
model {
vector [n] p_i = exp(beta_0 + beta_1*x)/[1 + exp(beta_0 + beta_1*x)];
y ~ bernoulli(p_i);
/* Likelihood:
for(i in 1:n){
p_i[i] = exp(beta_0 + beta_1*x[i])/(1 + exp(beta_0 + beta_1*x[i]));
y[i] ~ bernoulli(p_i[i]);
*/}
// Prior:
beta_0 ~ normal(m_beta_0, s_beta_0);
beta_1 ~ normal(m_beta_1, s_beta_1);
}
I obtain the following error message: "Matrix expression elements must be type row_vector and row vector expression elements must be int or real, but found element of type vector". If I use the for loop (which is commented out), the code works fine, but I would like to limit the use of for loops in my code. In the above code, x, is a vector of length n.
Another example:
parameters {
real gamma1;
real gamma2;
real gamma3;
real gamma4;
}
model {
// Likelihood:
real lambda;
real beta;
real phi;
for(i in 1:n){
lambda = exp(gamma1)*x[n_length[i]]^gamma2;
beta = exp(gamma3)*x[n_length[i]]^gamma4;
phi = lambda^(-1/beta);
y[i] ~ weibull(beta, phi);
}
//y ~ weibull(exp(gamma1)*x^gamma2, exp(gamma3)*x^gamma4); //cannot raise a vector to a power
// Prior:
gamma1 ~ normal(m_gamma1, s_gamma1);
gamma2 ~ normal(m_gamma2, s_gamma2);
gamma3 ~ normal(m_gamma3, s_gamma3);
gamma4 ~ normal(m_gamma4, s_gamma4);
}
The above code works, but the commented out likelihood calculation does not work since I "cannot raise a vector to a power" (but you can in R). I would, once again, like to not be forced to use for loops. In the above code, n_length, is a vector of length n.
A final example. If I want to draw 10000 samples from a normal distribution in R, I can simply specify
rnorm(10000, mu, sigma)
But in RStan, I would have to use a for loop, for example
parameters {
real mu;
real sigma;
}
generated quantities {
vector[n] x;
for(i in 1:n) {
x[i] = normal_rng(mu, sigma);
}
}
Is there anything that I can do to speed up my RStan examples?
This line of code:
vector [n] p_i = exp(beta_0 + beta_1*x)/[1 + exp(beta_0 + beta_1*x)];
is not valid syntax in the Stan language because square brackets are only used for indexing. It could instead be
vector [n] p_i = exp(beta_0 + beta_1*x) ./ (1 + exp(beta_0 + beta_1*x));
which utilizes the elementwise division operator, or better yet
vector [n] p_i = inv_logit(beta_0 + beta_1*x);
in which case y ~ bernoulli(p_i); would work as a likelihood. Better still, just do
y ~ bernoulli_logit(beta_0 + beta_1 * x);
and it will do the transformation for you in a numerically stable fashion. You could also use bernoulli_logit_glm, which is slightly faster particularly with large datasets.
In Stan 2.19.x, I think you can draw N values from a probability distribution in the generated quantities block. But you are too worried about for loops. The Stan program is transpiled to C++ where loops are fast and almost all of the functions in the Stan language that accept vector inputs and produce vector outputs actually involve the same loop in C++ as if you had done the loop yourself.

R angular distance weighting interpolation function

I would like use R for doing an Angular-Distance-Weighting Interpolation as described e.g. here (p. 3).
It is an interpolation which (1) weights the data points by distance relative to a correlation decay distance and (2) gives more weight to isolated data points.
My question is whether there is some package implementation of this in R or whether one would need to do it from scratch. My Google search did not get me anything obvious on this but I do not know R's interpolation packages so well. So if anybody has ever done this, please let me know.
There is no R package available..I wrote a simple function for Angular distance weighting (ADW) as implemented in Shepard 1968
[X,Y] are coordinates where you wish to estimate. [Xtrain,Ytrain] are the coordinates of data points. weights radial are the weights by distance.
compute_weight_by_theta = function(X,Y,xtrain,ytrain,weights.radial){
N = length(xtrain)
weight.directional = integer(N)
for (i in 1:N){
numerator = 0;
denominator = 0;
for (j in 1:N){
xi = xtrain[i]; yi = ytrain[i]
xj = xtrain[j]; yj = ytrain[j]
if ((xi != xj) | (yi != yj)){
Sj = weights.radial[j]
Di = sqrt((X - xi)**2 + (Y - yi)**2)
Dj = sqrt((X - xj)**2 + (Y - yj)**2)
cos_theta = ((X-xi)*(X-xj) + (Y-yi)*(Y-yj)) / (Di*Dj)
numerator = numerator + (1-cos_theta)*Sj
denominator = denominator + Sj
}
}
weight.directional[i] = numerator/denominator
}
return(weight.directional)
}

Using JAGS or STAN when an observed node is the max of latent nodes

I have the following latent variable model: Person j has two latent variables, Xj1 and Xj2. The only thing we get to observe is their maximum, Yj = max(Xj1, Xj2). The latent variables are bivariate normal; they each have mean mu, variance sigma2, and their correlation is rho. I want to estimate the three parameters (mu, sigma2, rho) using only Yj, with data from n patients, j = 1,...,n.
I've tried to fit this model in JAGS (so I'm putting priors on the parameters), but I can't get the code to compile. Here's the R code I'm using to call JAGS. First I generate the data (both latent and observed variables), given some true values of the parameters:
# true parameter values
mu <- 3
sigma2 <- 2
rho <- 0.7
# generate data
n <- 100
Sigma <- sigma2 * matrix(c(1, rho, rho, 1), ncol=2)
X <- MASS::mvrnorm(n, c(mu,mu), Sigma) # n-by-2 matrix
Y <- apply(X, 1, max)
Then I define the JAGS model, and write a little function to run the JAGS sampler and return the samples:
# JAGS model code
model.text <- '
model {
for (i in 1:n) {
Y[i] <- max(X[i,1], X[i,2]) # Ack!
X[i,1:2] ~ dmnorm(X_mean, X_prec)
}
# mean vector and precision matrix for X[i,1:2]
X_mean <- c(mu, mu)
X_prec[1,1] <- 1 / (sigma2*(1-rho^2))
X_prec[2,1] <- -rho / (sigma2*(1-rho^2))
X_prec[1,2] <- X_prec[2,1]
X_prec[2,2] <- X_prec[1,1]
mu ~ dnorm(0, 1)
sigma2 <- 1 / tau
tau ~ dgamma(2, 1)
rho ~ dbeta(2, 2)
}
'
# run JAGS code. If latent=FALSE, remove the line defining Y[i] from the JAGS model
fit.jags <- function(latent=TRUE, data, n.adapt=1000, n.burnin, n.samp) {
require(rjags)
if (!latent)
model.text <- sub('\n *Y.*?\n', '\n', model.text)
textCon <- textConnection(model.text)
fit <- jags.model(textCon, data, n.adapt=n.adapt)
close(textCon)
update(fit, n.iter=n.burnin)
coda.samples(fit, variable.names=c("mu","sigma2","rho"), n.iter=n.samp)[[1]]
}
Finally, I call JAGS, feeding it only the observed data:
samp1 <- fit.jags(latent=TRUE, data=list(n=n, Y=Y), n.burnin=1000, n.samp=2000)
Sadly this results in an error message: "Y[1] is a logical node and cannot be observed". JAGS does not like me using "<-" to assign a value to Y[i] (I denote the offending line with an "Ack!"). I understand the complaint, but I'm not sure how to rewrite the model code to fix this.
Also, to demonstrate that everything else (besides the "Ack!" line) is fine, I run the model again, but this time I feed it the X data, pretending that it's actually observed. This runs perfectly and I get good estimates of the parameters:
samp2 <- fit.jags(latent=FALSE, data=list(n=n, X=X), n.burnin=1000, n.samp=2000)
colMeans(samp2)
If you can find a way to program this model in STAN instead of JAGS, that would be fine with me.
Theoretically you can implement a model like this in JAGS using the dsum distribution (which in this case uses a bit of a hack as you are modelling the maximum and not the sum of the two variables). But the following code does compile and run (although it does not 'work' in any real sense - see later):
set.seed(2017-02-08)
# true parameter values
mu <- 3
sigma2 <- 2
rho <- 0.7
# generate data
n <- 100
Sigma <- sigma2 * matrix(c(1, rho, rho, 1), ncol=2)
X <- MASS::mvrnorm(n, c(mu,mu), Sigma) # n-by-2 matrix
Y <- apply(X, 1, max)
model.text <- '
model {
for (i in 1:n) {
Y[i] ~ dsum(max_X[i])
max_X[i] <- max(X[i,1], X[i,2])
X[i,1:2] ~ dmnorm(X_mean, X_prec)
ranks[i,1:2] <- rank(X[i,1:2])
chosen[i] <- ranks[i,2]
}
# mean vector and precision matrix for X[i,1:2]
X_mean <- c(mu, mu)
X_prec[1,1] <- 1 / (sigma2*(1-rho^2))
X_prec[2,1] <- -rho / (sigma2*(1-rho^2))
X_prec[1,2] <- X_prec[2,1]
X_prec[2,2] <- X_prec[1,1]
mu ~ dnorm(0, 1)
sigma2 <- 1 / tau
tau ~ dgamma(2, 1)
rho ~ dbeta(2, 2)
#data# n, Y
#monitor# mu, sigma2, rho, tau, chosen[1:10]
#inits# X
}
'
library('runjags')
results <- run.jags(model.text)
results
plot(results)
Two things to note:
JAGS isn't smart enough to initialise the matrix of X while satisfying the dsum(max(X[i,])) constraint on its own - so we have to initialise X for JAGS using sensible values. In this case I'm using the simulated values which is cheating - the answer you get is highly dependent on the choice of initial values for X, and in the real world you won't have the simulated values to fall back on.
The max() constraint causes problems to which I can't think of a solution within a general framework: unlike the usual dsum constraint that allows one parameter to decrease while the other increases and therefore both parameters are used at all times, the min() value of X[i,] is ignored and the sampler is therefore free to do as it pleases. This will very very rarely (i.e. never) lead to values of min(X[i,]) that happen to be identical to Y[i], which is the condition required for the sampler to 'switch' between the two X[i,]. So switching never happens, and the X[] that were chosen at initialisation to be the maxima stay as the maxima - I have added a trace parameter 'chosen' which illustrates this.
As far as I can see the other potential solutions to the 'how do I code this' question will fall into essentially the same non-mixing trap which I think is a fundamental problem here (although I might be wrong and would very much welcome working BUGS/JAGS/Stan code that illustrates otherwise).
Solutions to the failure to mix are harder, although something akin to the Carlin & Chibb method for model selection may work (force a min(pseudo_X) parameter to be equal to Y to encourage switching). This is likely to be tricky to get working, but if you can get help from someone with a reasonable amount of experience with BUGS/JAGS you could try it - see:
Carlin, B.P., Chib, S., 1995. Bayesian model choice via Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B 57, 473–484.
Alternatively, you could try thinking about the problem slightly differently and model X directly as a matrix with the first column all missing and the second column all equal to Y. You could then use dinterval() to set a constraint on the missing values that they must be lower than the corresponding maximum. I'm not sure how well this would work in terms of estimating mu/sigma2/rho but it might be worth a try.
By the way, I realise that this doesn't necessarily answer your question but I think it is a useful example of the difference between 'is it codeable' and 'is it workable'.
Matt
ps. A much smarter solution would be to consider the distribution of the maximum of two normal variates directly - I am not sure if such a distribution exists, but it it does and you can get a PDF for it then the distribution could be coded directly using the zeros/ones trick without having to consider the value of the minimum at all.
I believe you can model this in the Stan language treating the likelihood as a two component mixture with equal weights. The Stan code could look like
data {
int<lower=1> N;
vector[N] Y;
}
parameters {
vector<upper=0>[2] diff[N];
real mu;
real<lower=0> sigma;
real<lower=-1,upper=1> rho;
}
model {
vector[2] case_1[N];
vector[2] case_2[N];
vector[2] mu_vec;
matrix[2,2] Sigma;
for (n in 1:N) {
case_1[n][1] = Y[n]; case_1[n][2] = Y[n] + diff[n][1];
case_2[n][2] = Y[n]; case_2[n][1] = Y[n] + diff[n][2];
}
mu_vec[1] = mu; mu_vec[2] = mu;
Sigma[1,1] = square(sigma);
Sigma[2,2] = Sigma[1,1];
Sigma[1,2] = Sigma[1,1] * rho;
Sigma[2,1] = Sigma[1,2];
// log-likelihood
target += log_mix(0.5, multi_normal_lpdf(case_1 | mu_vec, Sigma),
multi_normal_lpdf(case_2 | mu_vec, Sigma));
// insert priors on mu, sigma, and rho
}

Resources