Basic RNN cost function not converging to zero

I'm writing a basic RNN.
My function for generating the hidden state is:
tanh(xU + sW)
Where x is the input vector and s is the previous hidden-state vector. U and W are the parameters being adjusted during backprop.
For modifying U and W I use:
U += 1/(cosh^2(xU+sW)) * x * expectedValue * stepSize
W += 1/(cosh^2(xU+sW)) * s * expectedValue * stepSize
Where stepSize is about 0.01, though I've tested lots of smaller values. The expectedValue is the same for both and it is just the value of the function I am trying to learn for testing.
For the cost function to determine how close my estimations are, I am using the mean squared error function:
1/n * (expectedValue^2 - predictedValue^2)
My cost function is not converging to zero over 10,000,000 iterations. Am I screwing up some of the math somewhere?

It could be the vanishing gradient problem. Try the ReLU activation function instead of tanh().
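For illustration, a minimal Julia sketch of that swap (hypothetical helper names, assuming row-vector x and s and weight matrices U and W - not the asker's actual code):

relu(z) = max.(z, 0.0)                         # elementwise ReLU
new_state(x, U, s, W) = relu(x * U + s * W)    # was: tanh.(x * U + s * W)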

Related

Is there any way to bound the region searched by NLsolve in Julia?

I'm trying to find one of the roots of a nonlinear (roughly quartic) equation.
The equation always has four roots, a pair of them close to zero, a large positive, and a large negative root. I'd like to identify either of the near zero roots, but nlsolve, even with an initial guess very close to these roots, seems to always converge on the large positive or negative root.
A plot of the function essentially looks like a constant negative value, with a (very narrow) even-ordered pole near zero, and gradually rising to cross zero at the large positive and negative roots.
Is there any way I can limit the region searched by nlsolve, or do something to make it more sensitive to the presence of this pole in my function?
EDIT:
Here's some example code reproducing the problem:
using NLsolve
function f!(F,x)
x = x[1]
F[1] = -15000 + x^4 / (x+1e-5)^2
end
# nlsolve will find the root at -122
nlsolve(f!,[0.0])
As output, I get:
Results of Nonlinear Solver Algorithm
* Algorithm: Trust-region with dogleg and autoscaling
* Starting Point: [0.0]
* Zero: [-122.47447713915808]
* Inf-norm of residuals: 0.000000
* Iterations: 15
* Convergence: true
* |x - x'| < 0.0e+00: false
* |f(x)| < 1.0e-08: true
* Function Calls (f): 16
* Jacobian Calls (df/dx): 6
We can find the exact roots in this case by transforming the objective function into a polynomial: multiplying through by (x+1e-5)^2 gives x^4 - 15000*x^2 - 0.3*x - 1.5e-6 = 0, whose coefficients (in ascending order) are passed below:
using PolynomialRoots
roots([-1.5e-6,-0.3,-15000,0,1])
produces
4-element Array{Complex{Float64},1}:
122.47449713915809 - 0.0im
-122.47447713915808 + 0.0im
-1.0000000813048448e-5 + 0.0im
-9.999999186951818e-6 + 0.0im
I would love a way to identify the pair of roots around the pole at x = -1e-5 without knowing the exact form of the objective function.
EDIT2:
Trying out Roots.jl:
using Roots
f(x) = -15000 + x^4 / (x+1e-5)^2
find_zero(f,0.0) # finds +122... root
find_zero(f,(-1e-4,0.0)) # error, not a bracketing interval
find_zeros(f,-1e-4,0.0) # finds 0-element Array{Float64,1}
find_zeros(f,-1e-4,0.0,no_pts=6) # finds root slightly less than -1e-5
find_zeros(f,-1e-4,0.0,no_pts=10) # finds 0-element Array{Float64,1}, sensitive to value of no_pts
I can get find_zeros to work, but it's very sensitive to the no_pts argument and to the exact endpoints I pick. Doing a loop over no_pts and taking the first non-empty result might work, but something that converges more deterministically would be preferable.
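Something like this hypothetical loop is what I have in mind (sketch only):

using Roots
f(x) = -15000 + x^4 / (x+1e-5)^2
for n in 5:50
    z = find_zeros(f, -1e-4, 0.0, no_pts=n)
    if !isempty(z)            # take the first non-empty result
        println((n, z))
        break
    end
end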
EDIT3:
Here is the tanh transformation suggested by Bogumił applied:
using NLsolve
function f_tanh!(F,x)
x = x[1]
x = -1e-4 * (tanh(x)+1) / 2
F[1] = -15000 + x^4 / (x+1e-5)^2
end
nlsolve(f_tanh!,[100.0]) # doesn't converge
nlsolve(f_tanh!,[1e5]) # doesn't converge
using Roots
function f_tanh(x)
x = -1e-4 * (tanh(x)+1) / 2
return -15000 + x^4 / (x+1e-5)^2
end
find_zeros(f_tanh,-1e10,1e10) # 0-element Array
find_zeros(f_tanh,-1e3,1e3,no_pts=100) # 0-element Array
find_zero(f_tanh,0.0) # convergence failed
find_zero(f_tanh,0.0,max_evals=1_000_000,maxfnevals=1_000_000) # convergence failed
EDIT4: This combination of techniques identifies at least one root roughly 95% of the time, which is good enough for me.
using Peaks
using Primes
using Roots
# randomize pole location
a = 1e-4*rand()
f(x) = -15000 + x^4 / (x+a)^2
# do an initial sample to find the pole location
l = 1000
minval = -1e-4
maxval = 0
m = []
sample_r = []
while l < 1e6
sample_r = range(minval,maxval,length=l)
rough_sample = f.(sample_r)
m = maxima(rough_sample)
if length(m) > 0
break
else
l *= 10
end
end
guess = sample_r[m[1]]
# functions to compress the range around the estimated pole
cube(x) = (x-guess)^3 + guess
uncube(x) = cbrt(x-guess) + guess
f_cube(x) = f(cube(x))
shift = l ÷ 1000
low = sample_r[m[1]-shift]
high = sample_r[m[1]+shift]
# search only over prime no_pts, so no samplings divide into each other
# possibly not necessary?
for i in primes(500)
z = find_zeros(f_cube,uncube(low),uncube(high),no_pts=i)
if length(z)>0
println(i)
println(cube.(z))
break
end
end
More specific comments could be given if you provided more information about your problem.
However, in general:
It seems that your problem is univariate, in which case you can use Roots.jl, where find_zero and find_zeros provide the interface you are asking for (i.e. they allow you to specify the search region).
If the problem is multivariate, you have several options for handling it in the problem specification for nlsolve (which by default does not allow you to specify a bounding box, AFAICT). The simplest is a variable transformation: e.g. you can apply an ai * tanh(xi) + bi transformation, choosing ai and bi for each variable so that it is bounded to the desired interval.
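For example, a minimal sketch of that transformation on a made-up two-variable system (g!, lo and hi are illustrative, not from the question):

using NLsolve
# Each variable y[i] is kept inside (lo[i], hi[i]) by substituting
# y = a .* tanh.(x) .+ b and solving for the unconstrained x instead.
lo = [-1.0, 0.0]
hi = [ 1.0, 2.0]
a = (hi .- lo) ./ 2
b = (hi .+ lo) ./ 2
function g!(G, y)            # the original system, in the bounded variables y
    G[1] = y[1]^2 + y[2] - 1.0
    G[2] = y[1] - y[2] + 0.5
end
g_bounded!(G, x) = g!(G, a .* tanh.(x) .+ b)   # same system in unconstrained x
res = nlsolve(g_bounded!, zeros(2))
y_sol = a .* tanh.(res.zero) .+ b              # map the solution back into the box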
The first problem in your definition is that, the way you define f, it never crosses 0 near the two roots you are looking for, because Float64 does not have enough precision when you write 1e-5. You need to use higher-precision computations:
julia> using Roots
julia> f(x) = -15000 + x^4 / (x+1/big(10.0^5))^2
f (generic function with 1 method)
julia> find_zeros(f,big(-2*10^-5), big(-8*10^-6), no_pts=100)
2-element Array{BigFloat,1}:
-1.000000081649671426108658262468117284940444265467160592853348997523986352593615e-05
-9.999999183503552405580084054429938261707450678661727461293670518591720605751116e-06
and set no_pts to be sufficiently large to find intervals bracketing the roots.

How to leverage Convex Optimization for Portfolio Optimization in Julia

I'm trying to use Julia (0.5) and Convex.jl (with the ECOS solver) to figure out, given a portfolio of 2 stocks, how to distribute my allocations (in percent) across both stocks so that I maximize my portfolio return and minimize my risk (standard deviation of returns). In other words, I want to MAXIMIZE the Sharpe ratio, a quantity driven by the percentages I hold in each of the 2 stocks, and have the solver figure out the optimal allocation (I want it to tell me I need x% of stock 1 and 1-x% of stock 2). The only real constraint is that the percent allocations sum to 100%. The code below runs, but does not give me the optimal weights/allocations I'm expecting (which is 36.3% for Supertech and 63.7% for Slowpoke). The solver instead comes back with 50/50.
My intuition is that I either have the objective function modeled incorrectly for the solver, or I need to do more with constraints. I don't have a good grasp on convex optimization so I'm winging it. Also, my objective function uses the variable.value attribute to get the correct output and I suspect I need to be working with the Variable expression object instead.
The question is: is what I'm trying to achieve something the Convex solver is designed for, and I just have to model the objective function and constraints better, or do I have to just iterate the weights and brute-force it?
Code with comments:
using Convex, ECOS
Supertech = [-.2; .1; .3; .5];
Slowpoke = [.05; .2; -.12; .09];
A = reshape([Supertech; Slowpoke],4,2)
mlen = size(A)[1]
R = vec(mean(A,1))
n=rank(A)
w = Variable(n)
c1 = sum(w) == 1;
λ = .01
w.value = [λ; 1-λ]
sharpe_ratio = sqrt(mlen) * (w.value' * R) / sqrt(sum(vec(w.value' .* w.value) .* vec(cov(A,1,false))))
# sharpe_ratio will be maximized at 1.80519 when w.value = [λ, 1-λ] where λ = .363
p = maximize(sharpe_ratio,c1);
solve!(p, ECOSSolver(verbose = false)); # when verbose=true, says it is 'degenerate' because I don't have enough constraints...
println(w.value) # expecting to get [.363; .637] here but I get [0.5; 0.5]
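Not a fix for the Convex.jl model itself, but here is a minimal Julia 1.x sketch of the brute-force fallback the question mentions (using the Statistics stdlib; cov(A; corrected=false) plays the role of cov(A,1,false) in the 0.5 code):

using Statistics
Supertech = [-.2, .1, .3, .5]
Slowpoke  = [.05, .2, -.12, .09]
A = hcat(Supertech, Slowpoke)
R = vec(mean(A, dims=1))
C = cov(A; corrected=false)          # population covariance matrix
sharpe(λ) = (w = [λ, 1 - λ]; sqrt(size(A, 1)) * (w' * R) / sqrt(w' * C * w))
λs = 0:0.001:1
λ_best = λs[argmax(sharpe.(λs))]     # lands near the expected 0.363 for these returns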

How to solve this "non-finite function value" in R?

I am using R to calculate nested functions like this:
C1_B <- function(T){integrate(function(tau)f(tau),lower=0.01*T,upper=0.99*T)$value}
f <- function(tau) {integrate(function(tau1)sqrt(1/(tau-tau1)),lower=0.01*tau,upper=0.99*tau)$value}
C1_B(0.5)
However, I receive a message like:
Error in integrate(function(tau1) sqrt(1/(tau - tau1)), lower = 0.01 * :
  non-finite function value
In addition: Warning message:
In sqrt(1/(tau - tau1)) : NaNs produced
I guess the problem is the (tau - tau1) term in my code, but given the integration domain I defined (lower=0.01*tau, upper=0.99*tau), (tau - tau1) should never be equal to zero.
Could anybody please tell me how I can solve this problem?
I gave it a try - the problem is that integrate expects the function handed to it to accept a vector of inputs and return a vector of the same length.
Luckily the solution is easy - just wrap your function in sapply.
The following code works:
f <- function(tau) {integrate(function(tau1)sqrt(1/(tau-tau1)),lower=0.01*tau,upper=0.99*tau)$value}
intfun <- function(x) sapply(x,f)
C1_B <- function(T){integrate(function(tau) intfun(tau),lower=0.01*T,upper=0.99*T)$value}
C1_B(0.5)
There exists an exact solution to your integral f. However, the value I get does not agree with this numeric approximation. I would say the integral of
d(tau1)/sqrt(tau - tau1)
is
-2 * sqrt(tau - tau1)
With your upper bound of 0.99*tau and your lower bound of 0.01*tau you get
-2 * (sqrt(tau - 0.99 * tau) - sqrt(tau - 0.01 * tau)) =
-2 * sqrt(tau) * (sqrt(0.01) - sqrt(0.99))
The integration of that for tau can again be solved exactly. It yields
-(4/3)(sqrt(0.01) - sqrt(0.99)) * tau^(3/2)
Edit: With your given boundaries 0.01*T and 0.99*T the final resulting solution is
-(4/3)(sqrt(0.01)-sqrt(0.99)) * ((0.99 * T)^(3/2) - (0.01 * T)^(3/2))
You can use integrate on the first exact integration result (for f); no errors are produced. The errors you report are probably due to the method of approximation - maybe you could try another integration function that uses a different approximation. The exact solution for f matches the integral calculated in your program.
When you use integrate to integrate the exact result for f, the results are equal to the exact final solution I gave.
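As a quick numeric check of that closed form at the question's value T = 0.5 (my arithmetic, not part of the original answer):

-\frac{4}{3}\left(\sqrt{0.01}-\sqrt{0.99}\right)\left((0.99\cdot 0.5)^{3/2}-(0.01\cdot 0.5)^{3/2}\right) \approx 1.1933 \times 0.3479 \approx 0.415

which the sapply-wrapped C1_B(0.5) above should reproduce up to integration tolerance.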

Translating code that carries out SOCP/SDP optimisation from MATLAB to R

I have the following MATLAB code, which was used in the linked paper (http://www.optimization-online.org/DB_FILE/2014/05/4366.pdf), and I would like to use the Rsocp package to carry out the same function, but in R. The Rsocp package is available by using the command:
install.packages("Rsocp", repos="http://R-Forge.R-project.org")
and through the socp() function it carries out a similar function to solvesdp(constraints, -wcvar, ops) in the MATLAB code below.
I do not have MATLAB, which makes this problem more difficult for me to solve.
The issue I have is that R's socp() function takes matrices as inputs that encode the data (covariance matrix and average return values) and the constraints all together, whereas the MATLAB code seems to be optimising a function - in this specific case it looks like it is optimising -wcvar to get the optimal weights - so I am unsure how to set up my problem in R to get similar results.
The MATLAB code I would therefore like help in translating to R is as follows:
function [w] = rgop(T, mu, sigma, epsilon)
% This function determines the robust growth-optimal portfolio
% Input parameters:
% T - the number of time periods
% mu - the mean vector of asset returns
% sigma - the covariance matrix of asset returns
% epsilon - violation probability
% Output parameters:
% w - robust growth-optimal portfolios
% the number of assets
n = length(mu);
% portfolio weights
w = sdpvar(n,1);
% mean and standard deviation of portfolio
rp = w'*mu;
sigmap = sqrt(w'*sigma*w);
% preclude short selling
constraints = [w >= 0]; %#ok<NBRAK>
% budget constraint
constraints = [constraints, sum(w) == 1];
% worst-case value-at-risk (theorem 4.1)
wcvar = 1/2*(1 - (1 - rp + sqrt((1-epsilon)/epsilon/T)*sigmap)^2 - ((T-1)/epsilon/T)*sigmap^2);
% maximise WCVAR
ops = sdpsettings('solver','sdpt3','verbose',0);
solvesdp(constraints, -wcvar, ops);
w = double(w);
end
For the square root function of the covariance matrix one can use:
Rsocp:::.SqrtMatrix()
Note this question is partially related to my previous question, but is more focused on getting the worst-case VaR weights:
SOCP Solver Error for fPortoflio using solveRsocp
Perhaps a good start would be to use this code where the Rsocp package has already been used...
https://r-forge.r-project.org/scm/viewvc.php/pkg/fPortfolio/R/solveRsocp.R?view=markup&root=rmetrics&pathrev=3507
EDIT
I think the MATLAB code for the solvesdp function is available from this link:
https://code.google.com/p/vroster/source/browse/trunk/matlab/yalmip/solvesdp.m?r=11
Also a quick question about SOCP optimisations in general... would the result obtained via SOCP optimisation be the same as that achieved using other methods of optimisation? Will the only difference be speed and efficiency?
EDIT2
Since it was requested...
rgop <- function(tp, mu, sigma, epsilon){
# INPUTS
# tp - the number of time periods
# mu - the mean vector of asset returns
# sigma - the covariance matrix of asset returns
# epsilon - violation probability
# OUTPUT
# w - robust growth-optimal portfolios
#n is number of assets
n <- length(mu)
# portfolio weights (BUT THIS IS THE OUTPUT)
# for now will assume equal weight
w <- rep(1/n,n)
# mean and standard deviation of portfolio
rp <- sum(w*mu)
sigmap <- as.numeric(sqrt(t(w) %*% sigma %*% w))
# worst-case value-at-risk (theorem 4.1)
wcvar = 1/2*(1 - (1 - rp + sqrt((1-epsilon)/epsilon/tp)*sigmap)^2 - ((tp-1)/epsilon/tp)*sigmap^2);
# optimise...not sure how to carry out this optimisation...
# which is the main thrust of this question...
# could use DEoptim...but would like to understand the SOCP method
}
SOCP is just a fast way of finding the minimum in cases where you know enough about the problem to constrain it in certain technical ways. As you're discovering, these constraints can be tricky to formulate, so it is worth asking whether you need the speed. Often the answer is yes, but for debugging/exploration purposes brute numerical optimisation using R's optim function can be fruitful.

Help understanding a definitive integral

I am trying to translate a function in a book into code, using MATLAB and C#.
I am first trying to get the function to work properly in MATLAB.
Here are the instructions:
The variables are:
xt and m can be ignored.
zMax = Maximum Sensor Range (100)
zkt = Sensor Measurement (49)
zkt* = What sensor measurement should have been (50)
oHit = Std Deviation of my measurement (5)
I have written the first formula, N(zkt;zkt*,oHit) in MATLAB as this:
hitProbabilty = (1/sqrt( 2*pi * (oHit^2) ))...
* exp(-0.5 * (((zkt- zktStar) ^ 2) / (oHit^2)) );
This gives me the Gaussian curve I expect.
I have an issue with the definite integral below: I do not understand how to turn it into a real number, because I get horrible values out of my code, which is this:
func = @(x) hitProbabilty * zkt * x;
normaliser = quad(func, 0, max) ^ -1;
hitProbabilty = normaliser * hitProbabilty;
Can someone help me with this integral? It is supposed to normalize my curve, but it just goes crazy.... (I am doing this for zkt 0:1:100, with everything else the same, and graphing the probability it should output.)
You should use the error function ERF (available in basic MATLAB)
EDIT1:
As #Jim Brissom mentioned, the cumulative distribution function (CDF) is related to the error function by:
normcdf(X) = (1 + erf(X/sqrt(2))) / 2, where X ~ N(0,1)
Note that NORMCDF requires the Statistics Toolbox
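Spelled out for the question's range [0, zMax] (mean zktStar, standard deviation oHit), the erf identity above gives the normalising constant as (my algebra, not from the original answer):

\text{normaliser} = \frac{1}{P(0 \le z \le \text{zMax})} = \frac{2}{\operatorname{erf}\!\left(\frac{\text{zMax}-\text{zktStar}}{\text{oHit}\sqrt{2}}\right) - \operatorname{erf}\!\left(\frac{0-\text{zktStar}}{\text{oHit}\sqrt{2}}\right)}

which is exactly what the normcdf-based code in EDIT2 below computes.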
EDIT2:
I think there's been a small confusion, judging by the comments... The above only computes the normalizing factor, so if you want to compute the final probability over a certain range of values, you should do this:
zMax = 100; %# Maximum Sensor Range
zktStar = 50; %# What sensor measurement should have been
oHit = 5; %# Std Deviation of my measurement
%# p(0<z<zMax) = p(z<zMax) - p(z<0)
ncdf = diff( normcdf([0 zMax], zktStar, oHit) );
normaliser = 1 ./ ncdf;
zkt = linspace(0,zMax,500); %# Sensor Measurement, 500 values in [0,zMax]
hitProbabilty = normpdf(zkt, zktStar, oHit) * normaliser;
plot(zkt, hitProbabilty)
xlabel('z^k_t'), ylabel('P_{hit}(z^k_t)'), title('Measurement Probability')
The N in your code is just the well-known Gaussian (normal) distribution. I am mentioning this because, since you re-implemented it in MATLAB, it seems you missed that it is already implemented there.
Integrating the normal distribution yields the cumulative distribution function, available in MATLAB for the normal distribution via normcdf. The normal CDF can be written in terms of erf, which is probably what Amro was talking about.
Using normcdf avoids integrating manually.
In case you still need the result for the integral:
From Mathematica, the calculation is
hitProbabilty[zkt_] := (1/Sqrt[2*Pi*oHit^2])*Exp[-0.5*(((zkt - zktStar)^2)/(oHit^2))];
Integrate[hitProbabilty[zkt], {zkt, 0, zMax}];
The result is (just for copy/paste)
((1.2533141373155001*oHit*zktStar*Erf[(0.7071067811865476*Sqrt[zktStar^2])/oHit])/
Sqrt[zktStar^2] +
(1.2533141373155001*oHit*(zMax-zktStar)*Erf[(0.7071067811865476*Sqrt[(zMax-zktStar)^2])/oHit])/
Sqrt[(zMax-zktStar)^2])/(2*oHit*Sqrt[2*Pi])
Where Erf[] is the error function
HTH!
