Sequential Monte Carlo - r

I was given this model, and to get the probability distribution I am supposed to simulate the data.
x_1 ∼ N(0, 10^2)
x_t = 0.5 * x_{t-1} + 25 * x_{t-1} / (1 + x_{t-1}^2) + 8 * cos(1.2 * (t - 1)) + ε_t,  t = 2, 3, ...
y_t = x_t^2 / 25 + η_t,  t = 1, 2, 3, ...
where ε_t and η_t follow normal distributions.
I tried to invert the function, but I cannot do it because I have no idea whether my x's will be positive or negative. I understand that I should use sequential Monte Carlo, but I can't figure out how to choose the functions in the algorithm. What are f and g, and how can we decide x_{t-1} if it is equally likely to be positive or negative because of the square on x?
Algorithm:
1. Sample X_1 ∼ g_1(·). Let w_1 = u_1 = f_1(x_1)/g_1(x_1). Set t = 2.
2. Sample X_t | x_{t-1} ∼ g_t(x_t | x_{t-1}).
3. Append x_t to x_{1:t-1}, obtaining x_{1:t}.
4. Let u_t = f_t(x_t | x_{t-1}) / g_t(x_t | x_{t-1}).
5. Let w_t = w_{t-1} * u_t, the importance weight for x_{1:t}.
6. Increment t and return to step 2.

With a time-series model like yours, essentially the only way to compute the probability distribution of x or y is to run multiple simulations of the model, with randomly drawn values of x_0, eps_t, eta_t, and then construct histograms by aggregating the samples across all the runs. In very special cases (e.g. damped Brownian motion) it may be possible to calculate the resulting probability distributions algebraically, but I don't think there's any chance of that for your model.
In Python (I'm afraid I'm not fluent enough in R), you can simulate the time-series by something like this:
import math, random

def simSequence(steps, eps=0.1, eta=0.1):
    x = random.normalvariate(0, 10)  # x_1 ~ N(0, 10^2), i.e. standard deviation 10
    ySamples = []
    for t in range(steps):
        y = (x ** 2) / 25 + random.normalvariate(0, eta)
        ySamples.append(y)
        x = (0.5 * x + 25 * x / (1 + x ** 2)
             + 8 * math.cos(1.2 * t) + random.normalvariate(0, eps))
    return ySamples
(This replaces your t=1..n with t=0..(n-1).)
You could then generate a plot of a few examples of the y time-series:
import matplotlib.pyplot as plt

nSteps = 100
for run in range(5):
    history = simSequence(nSteps)
    plt.plot(range(nSteps), history)
plt.show()
to get a plot overlaying a few sample y trajectories.
If you then want to compute the probability distribution of y at different times, you could generate a matrix whose columns represent realizations of y_t at a common value of time and compute histograms at selected values of t:
import numpy
runs = numpy.array([ simSequence(nSteps) for run in range(10000) ])
plt.hist(runs[:,5], bins=25, label='t=5', alpha=0.5, density=True)
plt.hist(runs[:,10], bins=25, label='t=10', alpha=0.5, density=True)
plt.legend(loc='best')
plt.show()
which gives overlaid histograms of y at t=5 and t=10.
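As for the sequential Monte Carlo algorithm quoted in the question: a common and simple choice (the "bootstrap" filter) takes g_t to be the state-transition density itself, so the incremental weight u_t = f_t/g_t reduces to the observation density p(y_t | x_t). Notice that you never need to invert y_t = x_t^2/25 + η_t or decide the sign of x_t; you only simulate x_t forward and weight it by how well it explains the observed y_t. Here is a minimal sketch along those lines, reusing simSequence from above; the noise standard deviations eps and eta are placeholders, since the question does not specify them.

import math, random

def norm_pdf(v, mean, sd):
    # density of N(mean, sd^2) evaluated at v
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bootstrap_filter(ySamples, nParticles=1000, eps=0.1, eta=0.1):
    # Step 1: draw x_1 from N(0, 10^2); with the bootstrap choice g_1 = f_1,
    # the first weight is just p(y_1 | x_1).
    xs = [random.normalvariate(0, 10) for _ in range(nParticles)]
    ws = [norm_pdf(ySamples[0], (x ** 2) / 25, eta) for x in xs]
    for t in range(1, len(ySamples)):
        # Step 2: propagate every particle through the state equation
        # (same 0-based time convention as simSequence above).
        xs = [0.5 * x + 25 * x / (1 + x ** 2)
              + 8 * math.cos(1.2 * (t - 1)) + random.normalvariate(0, eps)
              for x in xs]
        # Steps 4-5: multiply each weight by u_t = p(y_t | x_t).
        ws = [w * norm_pdf(ySamples[t], (x ** 2) / 25, eta) for w, x in zip(ws, xs)]
    total = sum(ws)
    return xs, [w / total for w in ws]

ys = simSequence(50)
xs, ws = bootstrap_filter(ys)
print(sum(w * x for w, x in zip(ws, xs)))  # weighted-mean estimate of the final state

In practice you would also resample the particles whenever the weights degenerate (the resampling step of sequential importance resampling), which the algorithm as quoted does not include.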

Related

How can I find LGCP random field Lambda values in overall area?

There is an rLGCP model example in the RandomFields package.
library(spatstat)  # rLGCP, as.im and owin come from spatstat
if(require(RandomFields)) {
  # homogeneous LGCP with exponential covariance function
  X <- rLGCP("exp", 3, var=0.2, scale=.1)
  # inhomogeneous LGCP with Gaussian covariance function
  m <- as.im(function(x, y){5 - 1.5 * (x - 0.5)^2 + 2 * (y - 0.5)^2}, W=owin())
  X <- rLGCP("gauss", m, var=0.15, scale=0.5)
  plot(attr(X, "Lambda"))
  points(X)
}
I think the Lambda attribute of X does not show the values over the whole two-dimensional area.
How can I find the Lambda values over the entire area?
I'm not entirely sure if this is what you are looking for, but the matrix of values of Lambda for each point in the plot is stored in the Lambda attribute of the model created by spatstat::rLGCP.
You can access them like this:
m <- as.im(function(x, y){5 - 1.5 * (x - 0.5)^2 + 2 * (y - 0.5)^2}, W=owin())
X <- rLGCP("gauss", m, var=0.15, scale = 0.5)
lambda_matrix <- attr(X, "Lambda")$v
Now lambda_matrix is a 128 x 128 matrix containing the value of Lambda at each point on the grid.

Clustering Time Series in R - is K Mean accurate?

My data set is composed of measurements of the same index over 14 years (columns) for 105 countries (rows). I want to cluster countries based on their index trend over time.
I am trying hierarchical clustering (hclust) and K-medoids (pam) using a DTW distance matrix (dtw package).
I also tried K-means, passing the DTW distance matrix as the first argument of the kmeans function. The algorithm runs, but I'm not sure about its accuracy, since K-means uses Euclidean distance and computes centroids as means.
I am also thinking about using the data directly, but I can't see how the result would be accurate, since the algorithm would treat different measurements of the same variable over time as different variables when computing the centroids at each iteration and when using Euclidean distance to assign observations to clusters. It doesn't seem to me that this process could cluster time series as well as hierarchical and K-medoids clustering do.
Is the K-means algorithm a good choice for clustering time series, or is it better to use algorithms that exploit a distance concept such as DTW (but are slower)? Is there an R function that allows K-means to be used with a distance matrix, or a specific package for clustering time series data?
KMeans will do exactly what you tell it to do. Unfortunately, trying to feed a time series dataset into a KMeans algo will produce meaningless results. The KMeans algo, and most general clustering methods, are built around Euclidean distance, which does not seem to be a good measure for time series data. Quite simply, K-means often doesn't work when clusters are not round shaped, because it uses a distance function and measures distance from the cluster center. Check out the GMM algo as an alternative. It sounds like you are going with R for this experiment. If so, check out the sample code below.
Here is a KMeans cluster.
Here is a GMM cluster.
Which one looks more like a time series plot to you??!!
I Googled around for a good sample of R code to demonstrate how GMM clustering works. Unfortunately, I couldn't find anything decent. Personally, I use Python much more than I use R. If you are open to a Python solution, check out the sample code below.
import numpy as np
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

lowest_bic = np.inf
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

# Plot the winner
splot = plt.subplot(2, 1, 2)
Y_ = clf.predict(X)
for i, (mean, cov, color) in enumerate(zip(clf.means_, clf.covariances_,
                                           color_iter)):
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
    # Plot an ellipse to show the Gaussian component
    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180. * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(.5)
    splot.add_artist(ell)

plt.xticks(())
plt.yticks(())
plt.title('Selected GMM: full model, 2 components')
plt.subplots_adjust(hspace=.35, bottom=.02)
plt.show()
Here's an example of how to visualize clusters using plotGMM. The code to reproduce follows:
require(quantmod)
library(ggplot2)   # for fortify()
library(plotGMM)   # for plot_GMM()
SCHB <- fortify(getSymbols('SCHB', auto.assign=FALSE))
set.seed(730)  # for reproducibility
mixmdl <- mixtools::normalmixEM(Cl(SCHB), k = 5)
plot_GMM(mixmdl, k = 5)  # 5 clusters
I hope that helps. Oh, and for plotting time series with ggplot2, you should avail yourself of ggplot2's fortify function.

Efficiently sample a collection of multi-normal variables with varying sigma (covariance) matrix

I'm new to Stan, so hoping you can point me in the right direction. I'll build up to my situation to make sure we're on the same page...
If I had a collection of univariate normals, the docs tell me that:
y ~ normal(mu_vec, sigma);
provides the same model as the unvectorized version:
for (n in 1:N)
  y[n] ~ normal(mu_vec[n], sigma);
but that the vectorized version is (much?) faster. Ok, fine, makes good sense.
So the first question is: is it possible to take advantage of this vectorization speedup in the univariate normal case where both the mu and sigma of the samples vary by position in the vector. I.e. if both mu_vec and sigma_vec are vectors (in the previous case sigma was a scalar), then is this:
y ~ normal(mu_vec, sigma_vec);
equivalent to this:
for (n in 1:N)
  y[n] ~ normal(mu_vec[n], sigma_vec[n]);
and if so is there a comparable speedup?
Ok. That's the warmup. The real question is how to best approach the multi-variate equivalent of the above.
In my particular case, I have N observations of bivariate data for some variable y, which I store in an N x 2 matrix. (For order of magnitude, N is about 1000 in my use case.)
My belief is that the mean of each component of each observation is 0 and that the stdev of each component of each observation is 1 (and I'm happy to hard-code them as such). However, my belief is that the correlation (rho) varies from observation to observation as a (simple) function of another observed variable, x (stored in an N-element vector). For example, we might say that rho[n] = 2*inverse_logit(beta * x[n]) - 1 for n in 1:N, and our goal is to learn about beta from our data. I.e. the covariance matrix for the nth observation would be:
[1, rho[n]]
[rho[n], 1 ]
My question is what's the best way to put this together in a STAN model so that it isn't slow as heck? Is there a vectorized version of the multi_normal distribution so that I could specify this as:
y ~ multi_normal(vector_of_mu_2_tuples, vector_of_sigma_matrices)
or perhaps as some other similar formulation? Or will I need to write:
for (n in 1:N)
  y[n] ~ multi_normal(vector_of_mu_2_tuples[n], vector_of_sigma_matrices[n])
after having set up vector_of_sigma_matrices and vector_of_mu_2_tuples in an earlier block?
Thanks in advance for any guidance!
Edit to add code
Using python, I can generate data in the spirit of my problem as follows:
import numpy as np
import pandas as pd
import pystan as pys
import scipy as sp
import scipy.special  # for sp.special.expit
import scipy.stats    # for sp.stats.multivariate_normal
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns
def gen_normal_data(N, true_beta, true_mu, true_stdevs):
    drivers = np.random.randn(N)
    correls = 2.0 * sp.special.expit(drivers * true_beta) - 1.0
    observations = []
    for i in range(N):
        covar = np.array([[true_stdevs[0]**2, true_stdevs[0] * true_stdevs[1] * correls[i]],
                          [true_stdevs[0] * true_stdevs[1] * correls[i], true_stdevs[1]**2]])
        observations.append(sp.stats.multivariate_normal.rvs(true_mu, covar, size=1).tolist())
    observations = np.array(observations)
    return {
        'N': N,
        'true_mu': true_mu,
        'true_stdev': true_stdevs,
        'y': observations,
        'd': drivers,
        'correls': correls
    }
and then actually generate the data using:
normal_data = gen_normal_data(100, 1.5, np.array([1., 5.]), np.array([2., 5.]))
Here's what the data set looks like: a scatterplot of y colored by correls in the left pane and by drivers in the right pane. The idea is that the higher the driver, the closer the correl is to 1, and the lower the driver, the closer the correl is to -1. So we would expect the red dots in the left pane to run "down-left to up-right" and the blue dots to run "up-left to down-right", and indeed they do:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
x = normal_data['y'][:, 0]
y = normal_data['y'][:, 1]
correls = normal_data['correls']
drivers = normal_data['d']
for ax, colordata, cmap in zip(axes, [correls, drivers], ['coolwarm', 'viridis']):
    color_extreme = max(abs(colordata.max()), abs(colordata.min()))
    sc = ax.scatter(x, y, c=colordata, lw=0, cmap=cmap, vmin=-color_extreme, vmax=color_extreme)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes('right', size='5%', pad=0.05)
    fig.colorbar(sc, cax=cax, orientation='vertical')
fig.tight_layout()
Using the brute force approach, I can set up a STAN model that looks like this:
model_naked = pys.StanModel(
    model_name='naked',
    model_code="""
    data {
      int<lower=0> N;
      vector[2] true_mu;
      vector[2] true_stdev;
      real d[N];
      vector[2] y[N];
    }
    parameters {
      real beta;
    }
    transformed parameters {
    }
    model {
      real rho[N];
      matrix[2, 2] cov[N];
      for (n in 1:N) {
        rho[n] = 2.0*inv_logit(beta * d[n]) - 1.0;
        cov[n, 1, 1] = true_stdev[1]^2;
        cov[n, 1, 2] = true_stdev[1] * true_stdev[2] * rho[n];
        cov[n, 2, 1] = true_stdev[1] * true_stdev[2] * rho[n];
        cov[n, 2, 2] = true_stdev[2]^2;
      }
      beta ~ normal(0, 10000);
      for (n in 1:N) {
        y[n] ~ multi_normal(true_mu, cov[n]);
      }
    }
    """
)
This fits nicely:
fit_naked = model_naked.sampling(data=normal_data, iter=1000, chains=2)
f = fit_naked.plot()
f.tight_layout()
But I'm hoping someone can point me in the right direction for the "marginalized" approach where we break down our bivariate normal into a pair of independent normals that can be blended using the correlation. The reason I need this is that in my actual use case, both dimensions of y are fat-tailed. I am happy to model this as a student-t distribution, but the issue is that Stan only allows a single nu to be specified (not one for each dimension), so I think I'll need to find a way to decompose a multi_student_t into a pair of independent student_t's so that I can set the degrees of freedom separately for each dimension.
The univariate normal distribution does accept vectors for any or all of its arguments and it will be faster than looping over the N observations to call it N times with scalar arguments.
However, the speedup is only going to be linear because the calculations are all the same; it just has to allocate memory once rather than N times. The overall wall time is affected more by the number of function evaluations you have to do, which can be up to 2^10 - 1 per MCMC iteration (by default). Whether you hit the maximum treedepth depends on the geometry of the posterior distribution you are trying to sample from, which, in turn, depends on everything, including the data you condition on.
The bivariate normal distribution can be written as a product of a marginal univariate normal distribution for the first variable and a conditional univariate normal distribution for the second variable given the first variable. In Stan code, we can utilize element-wise multiplication and division to write its log-density like
target += normal_lpdf(first_variable | first_means, first_sigmas);
target += normal_lpdf(second_variable | second_means
            + rhos .* second_sigmas ./ first_sigmas .* (first_variable - first_means),
          second_sigmas .* sqrt(1 - square(rhos)));
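As a quick numerical sanity check on that factorization (a Python sketch with made-up values, not part of the original answer), the marginal-times-conditional log density matches the joint bivariate normal log density:

import numpy as np
from scipy import stats

# hypothetical example values, fixed here just for the check
mu = np.array([1.0, 5.0])
sd = np.array([2.0, 5.0])
rho = 0.3
cov = np.array([[sd[0] ** 2,          rho * sd[0] * sd[1]],
                [rho * sd[0] * sd[1], sd[1] ** 2         ]])
y = np.array([0.5, 7.0])

joint = stats.multivariate_normal.logpdf(y, mu, cov)
marginal = stats.norm.logpdf(y[0], mu[0], sd[0])
conditional = stats.norm.logpdf(
    y[1],
    mu[1] + rho * sd[1] / sd[0] * (y[0] - mu[0]),  # conditional mean
    sd[1] * np.sqrt(1 - rho ** 2))                 # conditional sd
print(np.isclose(joint, marginal + conditional))   # True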
Unfortunately, the more general multivariate normal distribution in Stan does not have an implementation that inputs arrays of covariance matrices.
This isn't quite answering your question, but you can make your program more efficient by removing a bunch of redundant calculations and converting the scaling a bit to use tanh rather than a scaled inverse logit. I'd get rid of the scaling and just use smaller betas, but I left it in so that it should give the same results.
data {
  int<lower=0> N;
  vector[2] mu;
  vector[2] sigma;
  vector[N] d;
  vector[2] y[N];
}
transformed data {
  real var1 = square(sigma[1]);
  real var2 = square(sigma[2]);
  real covar12 = sigma[1] * sigma[2];
  vector[N] d_div_2 = d * 0.5;
}
parameters {
  real beta;
}
model {
  // note: 2 * inv_logit(u) - 1 = tanh(u / 2), so this matches the original scaling
  vector[N] rho = tanh(beta * d_div_2);
  matrix[2, 2] Sigma;
  Sigma[1, 1] = var1;
  Sigma[2, 2] = var2;
  // only reassign what's necessary with minimal recomputation
  for (n in 1:N) {
    Sigma[1, 2] = rho[n] * covar12;
    Sigma[2, 1] = Sigma[1, 2];
    y[n] ~ multi_normal(mu, Sigma);
  }
  // weakly informative priors fit more easily
  beta ~ normal(0, 8);
}
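In case the tanh substitution looks mysterious, the identity being used is 2 * inv_logit(u) - 1 = tanh(u / 2), which is easy to check numerically (a quick sketch, not part of the original answer):

import numpy as np

u = np.linspace(-3.0, 3.0, 7)
inv_logit = 1.0 / (1.0 + np.exp(-u))
print(np.allclose(2.0 * inv_logit - 1.0, np.tanh(u / 2.0)))  # True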
You could also work out the Cholesky factorization as a function of rho and the other fixed values and use that---it saves a solver step in the multivariate normal.
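To illustrate that suggestion concretely (a sketch with made-up numbers, not from the original answer): for a 2 x 2 covariance matrix the lower-triangular Cholesky factor has a closed form in terms of rho and the two scales, so you could build it directly and use Stan's multi_normal_cholesky instead of multi_normal.

import numpy as np

sigma1, sigma2, rho = 2.0, 5.0, 0.3  # hypothetical example values

# closed-form lower-triangular Cholesky factor of the 2x2 covariance matrix
L = np.array([[sigma1, 0.0],
              [rho * sigma2, sigma2 * np.sqrt(1.0 - rho ** 2)]])

Sigma = np.array([[sigma1 ** 2,           rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2 ** 2          ]])

print(np.allclose(L @ L.T, Sigma))  # True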
The other option you have is to write out the multi-student-t directly rather than using our built-in implementation. The built-in probably won't be a whole lot faster as the whole operation's pretty heavily dominated by the matrix solve.

R Optimize linear equations coefficients with constraints

Say I have n linear equations of the form:
ax1 + bx2 + cx3 = y1
-ax1 + bx2 + cx3 = y2
-ax1 -bx2 + cx3 = y3
Here n = 3, and a, b, c are known and fixed.
I'm looking for the optimal values of x1, x2, x3 such that they lie within [-r, r] for some positive r and the sum y1 + y2 + y3 is maximized.
Is there a package for R which can handle such optimization problems?
You can use the optim function in R for this purpose.
If you are trying to maximize sum(y1, y2, y3), the problem simplifies to maximizing (-a*x1 + b*x2 + 3*c*x3) subject to x1, x2, x3 ∈ [-r, r]: adding the three equations, the a terms contribute a - a - a = -a, the b terms contribute b + b - b = b, and the c terms contribute 3c.
You can use the code below to find the optimal values. Note that the optim function minimizes by default, so I am returning the negative value of the sum in the function.
max_sum <- function(x){
  a <- 2; b <- -3; c <- 2
  y <- -a*x[1] + b*x[2] + 3*c*x[3]
  return(-1 * y)
}
r <- 5
optim(par=c(0,0,0), max_sum, method="L-BFGS-B", lower=-r, upper=r)

$par
[1] -5 -5  5

How to define Intervals for a uniform probability distribution?

I do not know if this is the right forum to ask this question, but I would appreciate it if someone could help me.
I have two processes, each with a distinct random variable, say X1 and X2, and each random variable is uniformly distributed on [0, 1]. How can random.nextDouble() help me identify the variation between the probabilities of these two random variables? I need this because I want to find the probability distribution of the minimum of the two random variables.
Is it as simple as running the program 100,000 or more times twice and then counting the minimum value from the two runs? If so, how can I map this result to the probabilities of the two random variables X1 and X2? That is, what is the criterion for saying that the first run of the program was for X1 and the second for X2?
The probability that a single uniform random variable on [0, 1] is below d is P(X <= d) = d.
Thus, the probability that it is above d is P(X >= d) = 1 - d.
The probability that two independent such variables are both above d is P(X >= d AND Y >= d) = P(X >= d) * P(Y >= d) = (1 - d)^2.
Thus, the probability that at least one of X or Y is below d is p = 1 - (1 - d)^2, and this is exactly the probability that the minimum is below d: P(min(X, Y) <= d) = 1 - (1 - d)^2.
If you are looking for the probability density function, you can just take the derivative of this probability:
f(x) = d/dx P(x) = d/dx [1 - (1 - x)^2]
     = d/dx (1 - 1 + 2x - x^2)
     = d/dx (2x - x^2) = 2 - 2x
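If you want to check this by simulation, which is essentially what repeatedly calling random.nextDouble() amounts to, here is a quick sketch in Python (any language's uniform generator behaves the same way here): draw X1 and X2 many times, take the minimum of each pair, and compare the empirical fraction below d with 1 - (1 - d)^2.

import random

def estimate_min_cdf(d, trials=100_000):
    # empirical P(min(X1, X2) <= d) for X1, X2 ~ Uniform(0, 1)
    hits = sum(min(random.random(), random.random()) <= d for _ in range(trials))
    return hits / trials

for d in (0.1, 0.5, 0.9):
    print(d, estimate_min_cdf(d), 1 - (1 - d) ** 2)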
