Why are the gradients added instead of averaged in backpropagation through time?

In the following implementation of the backward pass of an RNN, the gradients for Wh, Wx, and b are accumulated by adding the gradient computed at each time step. Intuitively, what does this do, and why wouldn't they be averaged?
import numpy as np

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)
    - cache: Values saved in the forward pass, one tuple per time step

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    Wx, Wh, b, x, prev_h, next_h = cache[0]
    N, T, H = dh.shape
    D, H = Wx.shape

    # Initialise gradients.
    dx = np.zeros([N, T, D])
    dWx = np.zeros_like(Wx)
    dWh = np.zeros_like(Wh)
    db = np.zeros_like(b)
    dprev_h = np.zeros_like(prev_h)

    # Backprop in time - start at the last time step and work back.
    for t_step in reversed(range(T)):
        # Add the current timestep's upstream gradient to the previously calculated dh.
        cur_dh = dprev_h + dh[:, t_step, :]

        # Calculate gradients at this time step.
        dx[:, t_step, :], dprev_h, dWx_temp, dWh_temp, db_temp = rnn_step_backward(cur_dh, cache[t_step])

        # Add gradient contributions from each time step.
        dWx += dWx_temp
        dWh += dWh_temp
        db += db_temp

    # dh0 is the last hidden state gradient calculated.
    dh0 = dprev_h

    return dx, dh0, dWx, dWh, db
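
For reference, here is a sketch (not the poster's actual helper) of what rnn_step_backward typically looks like when the forward step is next_h = tanh(x @ Wx + prev_h @ Wh + b); the cache layout is assumed to match the unpacking above:

def rnn_step_backward(dnext_h, cache):
    # Hypothetical per-step backward pass, assuming the usual tanh step
    # next_h = tanh(x @ Wx + prev_h @ Wh + b).
    Wx, Wh, b, x, prev_h, next_h = cache
    da = dnext_h * (1 - next_h ** 2)   # backprop through tanh
    dx = da @ Wx.T
    dprev_h = da @ Wh.T
    dWx = x.T @ da
    dWh = prev_h.T @ da
    db = da.sum(axis=0)
    return dx, dprev_h, dWx, dWh, db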

At each timestep you have a set of gradients, in the same way an MLP has a set of gradients for each training example. Because Wx, Wh, and b are shared across every timestep, the chain rule makes the derivative of the loss with respect to each shared weight the sum of that weight's contributions from every timestep, so the per-step gradients are added rather than averaged. Think of each timestep as a separate training example for the same set of weights; averaging would only rescale the gradient by 1/T, which changes the effective learning rate, not the direction.
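
As a concrete illustration (a hypothetical scalar "RNN", not the code above), summing each timestep's contribution to dL/dw is exactly what the chain rule produces for a shared weight, and it matches a numerical gradient check:

import numpy as np

# Toy recurrence h_t = w * h_{t-1} + x_t with a loss on the final state.
np.random.seed(0)
w, h0 = 0.7, 0.0
x = np.random.randn(5)

def loss(w):
    h = h0
    for x_t in x:
        h = w * h + x_t
    return 0.5 * h ** 2

# Forward pass, keeping every hidden state.
h = [h0]
for x_t in x:
    h.append(w * h[-1] + x_t)

# Backward pass: add (not average) each timestep's contribution to dw.
dh = h[-1]            # dL/dh_T
dw = 0.0
for t in reversed(range(len(x))):
    dw += dh * h[t]   # local gradient of h_{t+1} = w*h_t + x_{t+1} w.r.t. w
    dh = dh * w       # pass the gradient back to h_t

eps = 1e-6
print(dw, (loss(w + eps) - loss(w - eps)) / (2 * eps))  # the two values agree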

Related

Initial state starts at y(1), how to go backwards to find y(0)? [duplicate]


Clustering Time Series in R - is K Mean accurate?

My data set consists of measurements of the same index over 14 years (columns) for 105 countries (rows). I want to cluster countries based on their index trend over time.
I am trying hierarchical clustering (hclust) and K-Medoids (pam) using a DTW distance matrix (dtw package).
I also tried K-Means, using the DTW distance matrix as the first argument of the kmeans function. The algorithm works, but I'm not sure about its accuracy, since K-Means uses Euclidean distance and computes centroids as means.
I am also thinking about using the data directly, but I don't see how the result would be accurate, since the algorithm would treat the different measurements of the same variable over time as different variables when computing the centroids at each iteration, and would use Euclidean distance to assign observations to clusters. It doesn't seem to me that this process could cluster time series as well as hierarchical and K-Medoids clustering do.
Is the K-Means algorithm a good choice for clustering time series, or is it better to use algorithms built on a distance concept such as DTW (but slower)? Is there an R function that allows using the K-Means algorithm with a distance matrix, or a specific package for clustering time-series data?
KMeans will do exactly what you tell it to do. Unfortunately, feeding a time series dataset into KMeans will produce meaningless results. KMeans, like most general clustering methods, is built around Euclidean distance, which is not a good measure for time series data. Quite simply, K-Means often doesn't work when clusters are not round shaped, because it uses a distance function and measures distance from the cluster center. Check out the GMM algorithm as an alternative. It sounds like you are going with R for this experiment; if so, check out the sample code below.
Here is a KMeans cluster, and here is a GMM cluster. Which one looks more like a time series plot to you?
I Googled around for a good sample of R code to demonstrate how GMM clustering works. Unfortunately, I couldn't find anything decent. Personally, I use Python much more than I use R. If you are open to a Python solution, check out the sample code below.
import numpy as np
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

# Fit a Gaussian mixture with EM for each covariance type and component
# count, keeping the model with the lowest BIC.
lowest_bic = np.infty
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 + \
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

# Plot the winner
splot = plt.subplot(2, 1, 2)
Y_ = clf.predict(X)
for i, (mean, cov, color) in enumerate(zip(clf.means_, clf.covariances_,
                                           color_iter)):
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180. * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(.5)
    splot.add_artist(ell)

plt.xticks(())
plt.yticks(())
plt.title('Selected GMM: full model, 2 components')
plt.subplots_adjust(hspace=.35, bottom=.02)
plt.show()
Here's an example of how to visualize clusters using plotGMM. The code to reproduce follows:
require(quantmod)
require(plotGMM)   # provides plot_GMM
require(ggplot2)   # provides fortify
SCHB <- fortify(getSymbols('SCHB', auto.assign = FALSE))
set.seed(730)  # for reproducibility
mixmdl <- mixtools::normalmixEM(Cl(SCHB), k = 5)  # fit a 5-component mixture
plot_GMM(mixmdl, k = 5)                           # plot the 5 clusters
I hope that helps. Oh, and for plotting time series with ggplot2, you should avail yourself of ggplot2's fortify function.

Efficiently sample a collection of multi-normal variables with varying sigma (covariance) matrix

I'm new to Stan, so hoping you can point me in the right direction. I'll build up to my situation to make sure we're on the same page...
If I had a collection of univariate normals, the docs tell me that:
y ~ normal(mu_vec, sigma);
provides the same model as the unvectorized version:
for (n in 1:N)
y[n] ~ normal(mu_vec[n], sigma);
but that the vectorized version is (much?) faster. Ok, fine, makes good sense.
So the first question is: is it possible to take advantage of this vectorization speedup in the univariate normal case where both the mu and sigma of the samples vary by position in the vector? That is, if both mu_vec and sigma_vec are vectors (in the previous case sigma was a scalar), then is this:
y ~ normal(mu_vec, sigma_vec);
equivalent to this:
for (n in 1:N)
y[n] ~ normal(mu_vec[n], sigma_vec[n]);
and if so is there a comparable speedup?
Ok. That's the warmup. The real question is how to best approach the multi-variate equivalent of the above.
In my particular case, I have N observations of bivariate data for some variable y, which I store in an N x 2 matrix. (For order of magnitude, N is about 1000 in my use case.)
My belief is that the mean of each component of each observation is 0 and that the stdev of each component of each observation is 1 (and I'm happy to hard code them as such). However, my belief is that the correlation (rho) varies from observation to observation as a (simple) function of another observed variable, x (stored in an N element vector). For example, we might say that rho[n] = 2*inverse_logit(beta * x[n]) - 1 for n in 1:N and our goal is to learn about beta from our data. I.e. the covariance matrix for the nth observation would be:
[1, rho[n]]
[rho[n], 1 ]
My question is what's the best way to put this together in a STAN model so that it isn't slow as heck? Is there a vectorized version of the multi_normal distribution so that I could specify this as:
y ~ multi_normal(vector_of_mu_2_tuples, vector_of_sigma_matrices)
or perhaps as some other similar formulation? Or will I need to write:
for (n in 1:N)
y[n] ~ multi_normal(vector_of_mu_2_tuples[n], vector_of_sigma_matrices[n])
after having set up vector_of_sigma_matrices and vector_of_mu_2_tuples in an earlier block?
Thanks in advance for any guidance!
Edit to add code
Using python, I can generate data in the spirit of my problem as follows:
import numpy as np
import pandas as pd
import pystan as pys
import scipy as sp
import scipy.special  # make sp.special available
import scipy.stats    # make sp.stats available
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns
def gen_normal_data(N, true_beta, true_mu, true_stdevs):
    drivers = np.random.randn(N)
    correls = 2.0 * sp.special.expit(drivers * true_beta) - 1.0
    observations = []
    for i in range(N):
        covar = np.array([[true_stdevs[0]**2, true_stdevs[0] * true_stdevs[1] * correls[i]],
                          [true_stdevs[0] * true_stdevs[1] * correls[i], true_stdevs[1]**2]])
        observations.append(sp.stats.multivariate_normal.rvs(true_mu, covar, size=1).tolist())
    observations = np.array(observations)
    return {
        'N': N,
        'true_mu': true_mu,
        'true_stdev': true_stdevs,
        'y': observations,
        'd': drivers,
        'correls': correls
    }
and then actually generate the data using:
normal_data = gen_normal_data(100, 1.5, np.array([1., 5.]), np.array([2., 5.]))
Here's what the data set looks like: a scatterplot of y colored by correls in the left pane and by drivers in the right pane. The idea is that the higher the driver, the closer the correl is to 1, and the lower the driver, the closer the correl is to -1; so you would expect red dots in the left pane to run "down-left to up-right" and blue dots to run "up-left to down-right", and indeed they do:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
x = normal_data['y'][:, 0]
y = normal_data['y'][:, 1]
correls = normal_data['correls']
drivers = normal_data['d']
for ax, colordata, cmap in zip(axes, [correls, drivers], ['coolwarm', 'viridis']):
    color_extreme = max(abs(colordata.max()), abs(colordata.min()))
    sc = ax.scatter(x, y, c=colordata, lw=0, cmap=cmap, vmin=-color_extreme, vmax=color_extreme)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes('right', size='5%', pad=0.05)
    fig.colorbar(sc, cax=cax, orientation='vertical')
fig.tight_layout()
Using the brute force approach, I can set up a STAN model that looks like this:
model_naked = pys.StanModel(
    model_name='naked',
    model_code="""
    data {
        int<lower=0> N;
        vector[2] true_mu;
        vector[2] true_stdev;
        real d[N];
        vector[2] y[N];
    }
    parameters {
        real beta;
    }
    transformed parameters {
    }
    model {
        real rho[N];
        matrix[2, 2] cov[N];
        for (n in 1:N) {
            rho[n] = 2.0 * inv_logit(beta * d[n]) - 1.0;
            cov[n, 1, 1] = true_stdev[1]^2;
            cov[n, 1, 2] = true_stdev[1] * true_stdev[2] * rho[n];
            cov[n, 2, 1] = true_stdev[1] * true_stdev[2] * rho[n];
            cov[n, 2, 2] = true_stdev[2]^2;
        }
        beta ~ normal(0, 10000);
        for (n in 1:N) {
            y[n] ~ multi_normal(true_mu, cov[n]);
        }
    }
    """
)
This fits nicely:
fit_naked = model_naked.sampling(data=normal_data, iter=1000, chains=2)
f = fit_naked.plot()
f.tight_layout()
But I'm hoping someone can point me in the right direction for the "marginalized" approach where we break down our bivariate normal into a pair of independent normals that can be blended using the correlation. The reason I need this is that in my actual use case, both dimensions of y are fat-tailed. I am happy to model this as a Student-t distribution, but the issue is that Stan only allows a single nu to be specified (not one for each dimension), so I think I'll need to find a way to decompose a multi_student_t into a pair of independent student_t's so that I can set the degrees of freedom separately for each dimension.
The univariate normal distribution does accept vectors for any or all of its arguments and it will be faster than looping over the N observations to call it N times with scalar arguments.
However, the speedup is only going to be linear because the calculations are all the same, but it only has to allocate memory once if you only call it once. The overall wall time is more affected by the number of function evaluations you have to do, which is up to 2^10 - 1 per MCMC iteration (by default), but whether you hit the maximum treedepth depends on the geometry of the posterior distribution you are trying to sample from, which, in turn, depends on everything including the data you condition on.
The bivariate normal distribution can be written as a product of a marginal univariate normal distribution for the first variable and a conditional univariate normal distribution for the second variable given the first variable. In Stan code, we can utilize element-wise multiplication and division to write its log-density like
target += normal_lpdf(first_variable | first_means, first_sigmas);
target += normal_lpdf(second_variable | second_means +
                      rhos .* second_sigmas ./ first_sigmas .* (first_variable - first_means),
                      second_sigmas .* sqrt(1 - square(rhos)));
Unfortunately, the more general multivariate normal distribution in Stan does not have an implementation that inputs arrays of covariance matrices.
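
As a quick sanity check of that factorization (a numpy simulation sketch, not part of either answer): drawing the first variable from its marginal and the second from the conditional above reproduces the intended covariance matrix.

import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, 5.0, 2.0, 5.0, 0.6
n = 200_000

x1 = rng.normal(mu1, s1, n)
x2 = rng.normal(mu2 + rho * s2 / s1 * (x1 - mu1), s2 * np.sqrt(1 - rho ** 2))

# Should be close to [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]] = [[4, 6], [6, 25]]
print(np.cov(x1, x2))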
This isn't quite answering your question, but you can make your program more efficient by removing a bunch of redundant calculations and by rescaling a bit so that you can use tanh rather than a scaled inverse logit. I'd get rid of the scaling and just use smaller betas, but I left it in so that it should give the same results.
data {
    int<lower=0> N;
    vector[2] mu;
    vector[2] sigma;
    vector[N] d;
    vector[2] y[N];
}
transformed data {
    real var1 = square(sigma[1]);
    real var2 = square(sigma[2]);
    real covar12 = sigma[1] * sigma[2];
    vector[N] d_div_2 = d * 0.5;
}
parameters {
    real beta;
}
model {
    // note: 2 * inv_logit(u) - 1 = tanh(u / 2), so this matches the question's rho
    vector[N] rho = tanh(beta * d_div_2);
    matrix[2, 2] Sigma;
    Sigma[1, 1] = var1;
    Sigma[2, 2] = var2;
    // only reassign what's necessary with minimal recomputation
    for (n in 1:N) {
        Sigma[1, 2] = rho[n] * covar12;
        Sigma[2, 1] = Sigma[1, 2];
        y[n] ~ multi_normal(mu, Sigma);
    }
    // weakly informative priors fit more easily
    beta ~ normal(0, 8);
}
You could also factor by figuring out Cholesky factorization as function of rho and other fixed values and use that---it saves a solver step in the multivariate normal.
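As an aside (a quick numpy check, not Stan code from the answer), the 2x2 Cholesky factor mentioned here has a simple closed form when the standard deviations are fixed:

import numpy as np

s1, s2, rho = 2.0, 5.0, 0.3
L = np.array([[s1,       0.0],
              [rho * s2, s2 * np.sqrt(1 - rho ** 2)]])
Sigma = np.array([[s1 ** 2,       rho * s1 * s2],
                  [rho * s1 * s2, s2 ** 2]])
print(np.allclose(L @ L.T, Sigma))  # True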
The other option you have is to write out the multi-student-t directly rather than using our built-in implementation. The built-in probably won't be a whole lot faster as the whole operation's pretty heavily dominated by the matrix solve.

How to leverage Convex Optimization for Portfolio Optimization in Julia

I'm trying to use Julia (0.5) and Convex.jl (with the ECOS solver) to figure out, given a portfolio of 2 stocks, how I can distribute my allocations (in percent) across both stocks such that I maximize my portfolio return and minimize my risk (standard deviation of returns). I want to maximize the Sharpe ratio, which is calculated from the percentages I hold in each of the 2 stocks, and have the solver figure out the optimal allocation (I want it to tell me I need x% of stock 1 and 1-x% of stock 2). The only real constraint is that the percent allocations sum to 100%. I have code below that runs, but does not give me the optimal weights/allocations I'm expecting (which is 36.3% for Supertech and 63.7% for Slowpoke); the solver instead comes back with 50/50.
My intuition is that I either have the objective function modeled incorrectly for the solver, or I need to do more with constraints. I don't have a good grasp on convex optimization so I'm winging it. Also, my objective function uses the variable.value attribute to get the correct output and I suspect I need to be working with the Variable expression object instead.
Question is, is what I'm trying to achieve something the Convex solver is designed for and I just have to model the objective function and constraints better, or do I have to just iterate the weights and brute force it?
Code with comments:
using Convex, ECOS
Supertech = [-.2; .1; .3; .5];
Slowpoke = [.05; .2; -.12; .09];
A = reshape([Supertech; Slowpoke],4,2)
mlen = size(A)[1]
R = vec(mean(A,1))
n=rank(A)
w = Variable(n)
c1 = sum(w) == 1;
λ = .01
w.value = [λ; 1-λ]
sharpe_ratio = sqrt(mlen) * (w.value' * R) / sqrt(sum(vec(w.value' .* w.value) .* vec(cov(A,1,false))))
# sharpe_ratio will be maximized at 1.80519 when w.value = [λ, 1-λ] where λ = .363
p = maximize(sharpe_ratio,c1);
solve!(p, ECOSSolver(verbose = false)); # when verbose=true, it says the problem is 'degenerate' because I don't have enough constraints...
println(w.value) # expecting to get [.363; .637] here but I get [0.5; 0.5]
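
As a sanity check on the expected answer (a brute-force Python grid search, not a Convex.jl solution), maximizing the same Sharpe ratio over a grid of weights does land near 36.3% / 63.7%:

import numpy as np

# Same returns as the Julia code; ddof=0 matches the population covariance cov(A, 1, false).
A = np.array([[-0.20,  0.05],
              [ 0.10,  0.20],
              [ 0.30, -0.12],
              [ 0.50,  0.09]])
R = A.mean(axis=0)
Sigma = np.cov(A, rowvar=False, ddof=0)
mlen = A.shape[0]

lams = np.linspace(0.0, 1.0, 10001)
W = np.column_stack([lams, 1.0 - lams])                  # candidate allocations
sharpe = np.sqrt(mlen) * (W @ R) / np.sqrt(np.einsum('ij,jk,ik->i', W, Sigma, W))
print(lams[sharpe.argmax()], sharpe.max())               # ~0.363 and ~1.805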

how to specify final value (rather than initial value) for solving differential equations

I would like to solve a differential equation in R (with deSolve?) for which I do not have the initial condition, but only the final condition of the state variable. How can this be done?
The typical code is: ode(times, y, parameters, function ...) where y is the initial condition and function defines the differential equation.
Are your equations time reversible, that is, can you change your differential equations so they run backward in time? Most typically this will just mean reversing the sign of the gradient. For example, for a simple exponential growth model with rate r (gradient of x equal to r*x), flipping the sign makes the gradient -r*x and generates exponential decay rather than exponential growth.
If so, all you have to do is use your final condition(s) as your initial condition(s), change the signs of the gradients, and you're done.
As suggested by @LutzLehmann, there's an even easier answer: ode can handle negative time steps, so just enter your time vector as (t_end, 0). Here's an example, using f'(x) = r*x (i.e. exponential growth). If f(1) = 3, r=1, and we want the value at t=0, analytically we would say:
x(T) = x(0) * exp(r*T)
x(0) = x(T) * exp(-r*T)
= 3 * exp(-1*1)
= 1.103638
Now let's try it in R:
library(deSolve)
g <- function(t, y, parms) { list(parms*y) }
res <- ode(3, times = c(1, 0), func = g, parms = 1)
print(res)
## time 1
## 1 1 3.000000
## 2 0 1.103639
I initially misread your question as stating that you knew both the initial and final conditions. This type of problem is called a boundary value problem and requires a separate class of numerical algorithms from standard (more elementary) initial-value problems.
library(sos)
findFn("{boundary value problem}")
tells us that there are several R packages on CRAN (bvpSolve looks the most promising) for solving these kinds of problems.
Given a differential equation
y'(t) = F(t,y(t))
over the interval [t0, tf] where the final value y(tf) = yf is given, one can transform this into standard initial-value form by considering
x(s) = y(tf - s)
==> x'(s) = - y'(tf-s) = - F( tf-s, y(tf-s) )
x'(s) = - F( tf-s, x(s) )
now with
x(0) = x0 = yf.
This should be easy to code using wrapper functions and in the end some list reversal to get from x to y.
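
Here is a minimal sketch of that substitution in Python with scipy (assuming the same exponential-growth example as above: y' = r*y with y(tf) = 3, r = 1, tf = 1):

import numpy as np
from scipy.integrate import solve_ivp

r, tf, yf = 1.0, 1.0, 3.0

def F(t, y):            # original right-hand side y'(t) = F(t, y)
    return r * y

def G(s, x):            # transformed RHS: x'(s) = -F(tf - s, x(s))
    return -F(tf - s, x)

sol = solve_ivp(G, (0.0, tf), [yf])   # x(0) = y(tf) = yf
print(sol.y[0, -1])                   # x(tf) = y(t0), ~1.1036 = 3 * exp(-1)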
Some ODE solvers also allow negative step sizes, so that one can simply give the times for the construction of y in the descending order tf to t0 without using some intermediary x.
