R: Performing Gradient Descent

I am working with the R programming language.
I am trying to learn more about optimization algorithms, and as a learning exercise I would like to try to optimize a mathematical function using the (famous) gradient descent algorithm in R.
For instance, I would like to "optimize" (i.e. find the values of "x1 and x2" that produce the smallest possible value of "y") the following function (this is the Rastrigin function, f(x1, x2) = 20 + x1^2 + x2^2 - 10*(cos(2*pi*x1) + cos(2*pi*x2)), a popular test case for optimization algorithms due to its irregular and complicated shape):
I first defined this function in R:
Rastrigin <- function(x)
{
return(20 + x[1]^2 + x[2]^2 - 10*(cos(2*pi*x[1]) + cos(2*pi*x[2])))
}
Then I did some research to see whether there are any standard and common implementations of gradient descent in R. For example, I found the "optim()" function in (base) R, which provides many choices of popular optimization algorithms such as "BFGS", "Simulated Annealing" and "Nelder-Mead". For instance, below I used a variant of the "BFGS" algorithm to optimize the Rastrigin function:
#run BFGS optimization algorithm:
optim(par = c(2,2), Rastrigin, lower = c(-5,-5), upper = c(5,5), method = "L-BFGS-B")
$par
[1] 5.453531e-15 5.453531e-15
$value
[1] 0
$counts
function gradient
7 7
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Based on the results of the above code, it seems like the BFGS algorithm was able to successfully find the minimum of this function, returning values of x1 and x2 that are very close to the true minimizer (at x1 = x2 = 0 we get f(x1, x2) = 20 + 0 + 0 - 10*(cos(0) + cos(0)) = 20 - 10*2 = 20 - 20 = 0).
My Question: I tried looking for a standard function in R that would allow you to perform gradient descent optimization, but I could not find anything.
Does anyone know if there are any standard functions in R for gradient descent optimization? Can someone please show me how to do this?
Thanks!
References:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html

As indicated in the comments, (I just learned that) "gradient descent" is the same as "steepest descent":
library(pracma)
> steep_descent(c(1, 1), Rastrigin)
$xmin
[1] 0.9949586 0.9949586
$fmin
[1] 1.989918
$niter
[1] 3
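If you want to see what the algorithm itself is doing, gradient descent is also easy enough to write by hand. Below is a minimal sketch (not a standard library function; the helper names grad_num and grad_descent are just illustrative) that uses a central-difference gradient and a fixed step size:

grad_num <- function(f, x, h = 1e-6) {
  # central-difference approximation of the gradient of f at x
  sapply(seq_along(x), function(i) {
    e <- numeric(length(x)); e[i] <- h
    (f(x + e) - f(x - e)) / (2 * h)
  })
}

grad_descent <- function(f, x0, lr = 0.001, maxiter = 10000, tol = 1e-9) {
  x <- x0
  for (i in seq_len(maxiter)) {
    x_new <- x - lr * grad_num(f, x)   # step against the gradient
    if (sum(abs(x_new - x)) < tol) break
    x <- x_new
  }
  list(par = x, value = f(x), iter = i)
}

grad_descent(Rastrigin, c(1, 1))

Note that, just like steep_descent() above, this converges to the local minimum near (0.995, 0.995) (with f roughly 2) rather than to the global minimum at (0, 0): plain gradient descent is a local method, which is exactly why multimodal functions like the Rastrigin function are hard for it.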

Related

Maximum-Likelihood Estimation of three parameter reverse Weibull model implementation in R

I'm implementing a maximum-likelihood estimation in R for a three-parameter reverse Weibull model and have some trouble getting plausible results, which include:
bad optimization results and unwanted optimx behaviour. Besides these, I wonder how I could make use of parscale in this model.
Here is my implementation attempt:
To generate data I use the probability integral transform:
#Generate N sigma*RWei(alph)-mu distributed points
gen.wei <- function(N, theta) {
  alph  <- theta[1]
  mu    <- theta[2]
  sigma <- theta[3]
  return(mu - sigma * (-log(runif(N)))**(1/alph))
}
Now I define the Log-Likelihood and negative Log-Likelihood to use optimx optimization:
#LL----
ll.wei <- function(theta, x) {
  N     <- length(x)
  alph  <- theta[1]
  mu    <- theta[2]
  sigma <- theta[3]
  val <- sum(ifelse(
    x <= mu,
    log(alph/sigma) + (alph - 1) * log((mu - x)/sigma) - ((mu - x)/sigma)**(alph - 1),
    -Inf
  ))
  return(val)
}
#Negative LL----
nll.wei <- function(theta, x) {
  return(-ll.wei(theta = theta, x = x))
}
Afterwards I define the analytical gradient of the negative LL. Remark: There are points at which the negative LL isn't differentiable (the upper end-point mu)
gradnll.wei <- function(theta, x) {
  N     <- length(x)
  alph  <- theta[1]
  mu    <- theta[2]
  sigma <- theta[3]
  argn  <- (mu - x)/sigma
  del.alph <- sum(ifelse(x <= mu,
    1/alph + log(argn) - log(argn) * argn**(alph - 1),
    0
  ))
  del.mu <- sum(ifelse(x <= mu,
    (alph - 1)/(mu - x) - (alph - 1)/sigma * argn**(alph - 2),
    0))
  del.sigma <- sum(ifelse(x <= mu,
    ((alph - 1) * argn**(alph - 1) - alph)/sigma,
    0))
  return(-c(del.alph, del.mu, del.sigma))
}
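To sanity-check a hand-derived gradient like this, one can compare it against a numerical gradient, for example with the numDeriv package (a small sketch; numDeriv is assumed to be installed, and the anonymous wrapper is used so the data does not have to be passed through an argument named x):

library(numDeriv)
set.seed(1)
xdat <- gen.wei(100, c(4, 1, 1/2))
th   <- c(8, 4, 2)
gradnll.wei(th, xdat)                           # analytical gradient
grad(function(theta) nll.wei(theta, xdat), th)  # numerical gradient, for comparison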
Finally, I try to optimize using the optimx package and the methods Nelder-Mead (derivative-free) and BFGS (my LL is fairly smooth; there is just one problematic point).
library(optimx)

#MLE for Weibull
mle.wei <- function(start, sample) {
  optimx(
    par = start,
    fn = nll.wei,
    gr = gradnll.wei,
    method = c("BFGS"),
    x = sample
  )
}
theta.s <- c(4, 1, 1/2)               #test parameters
sample  <- gen.wei(100, theta.s)      #generate 100 data points distributed according to theta.s
mle.wei(start = c(8, 4, 2), sample)   #MLE estimation
To my surprise I get the following error:
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Cannot evaluate function at initial parameters
I checked manually: Both nll and gradnll are finite at the initial parameters...
If I switch to optim instead of optimx I get a result, but a pretty bad one:
$par
[1] 8.178674e-01 9.115766e-01 1.745724e-06
$value
[1] -1072.786
$counts
function gradient
574 100
$convergence
[1] 1
$message
NULL
So it doesn't converge. If I don't supply the gradient to BFGS, there isn't a result. If I use Nelder-Mead instead:
$par
[1] 1.026393e+00 9.649121e-01 9.865624e-18
$value
[1] -3745.039
$counts
function gradient
502 NA
$convergence
[1] 1
$message
NULL
So it is also very bad...
My questions are:
1. Should I, instead of defining the ll outside of the support as -Inf, give it a very high negative value like -1e20 to circumvent -Inf errors, or does it not matter?
2. Like the first one, but for the gradient: technically the ll isn't defined outside of the support, but since the likelihood is 0 (albeit constant) outside of the support, is it smart to define the gradnll as 0 outside?
3. I checked the implementation of the MLE estimator fgev from the evd package and saw that they use the BFGS method but don't supply the gradient, even though the gradient does exist. So my question is whether there are situations where it is counterproductive to supply the gradient because it isn't defined everywhere (as in my case and the evd case).
4. I got an error of the "argument x matches multiple formal arguments" type in optimx but not in optim, which surprised me. What am I doing wrong in supplying my functions and data to the optimx function?
Thank you very much in advance!
Re 4: That's kind of a bug in optimx, but one that's hard to avoid. It uses x as a variable name when calculating a numerical gradient; you also use it as an "additional parameter" to your functions. You can work around that by renaming your argument, e.g. call it xdata in your functions.
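A minimal sketch of that workaround (the wrapper names below are just illustrative; the only thing that matters is that the data argument is no longer called x):

# wrappers whose data argument is called xdata instead of x
nll.wei2     <- function(theta, xdata) nll.wei(theta, x = xdata)
gradnll.wei2 <- function(theta, xdata) gradnll.wei(theta, x = xdata)

optimx(par = c(8, 4, 2), fn = nll.wei2, gr = gradnll.wei2,
       method = "BFGS", xdata = sample)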
Re 1 & 2: There are several techniques to handle boundary problems in optimization. Setting to a big constant value tends not to work: if the optimizer goes out of bounds, it finds the objective function really flat. If the exact boundary is legal, then pushing the parameter to the boundary and adding a penalty sometimes works. If the exact boundary is illegal, you might be able to reflect: e.g. if mu > 0 is a requirement, sometimes replacing mu by abs(mu) in the objective function gets things to work. Sometimes the best solution is to get rid of the boundary by transforming the parameters.
Edited to add some more details:
For this problem, it looks to me as though transformations of the parameters might work. I think alpha and sigma must both be positive. Setting alpha <- exp(theta[1]) and sigma <- exp(theta[3]) will guarantee that. Limits on mu are harder, but I think mu > max(xdata) is needed, so mu <- max(xdata) + exp(theta[2]) should keep it in bounds. Of course, making these changes messes up your gradient formula and starting values.
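A sketch of what that reparameterization could look like, reusing nll.wei from the question (the wrapper name nll.wei.tr and the starting values are only illustrative, and as noted above the analytical gradient would have to be rewritten for the new parameterization):

# negative LL on an unconstrained scale:
# theta is now (log alph, log(mu - max(xdata)), log sigma)
nll.wei.tr <- function(theta, xdata) {
  alph  <- exp(theta[1])               # alph > 0 for any finite theta[1]
  sigma <- exp(theta[3])               # sigma > 0 for any finite theta[3]
  mu    <- max(xdata) + exp(theta[2])  # mu > max(xdata), so the -Inf region is never hit
  nll.wei(c(alph, mu, sigma), x = xdata)
}

# plain optim with a numerical gradient, just to keep the sketch short
optim(par = c(log(8), 0, log(2)), fn = nll.wei.tr, xdata = sample, method = "BFGS")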
As to resources: I'm afraid I don't know any. This advice is based on years of painful experience.
https://web.ncf.ca/nashjc/optimx202112/ has a version of the package that deals with at least some variable clashes in the dot args.
There are some separate cleanups to be done before this goes on CRAN, but
the package should be more or less robust at the moment.
JN

An Example of Nonlinear Optimization with JuMP

As a test in understanding nonlinear optimization using Julia's JuMP modeling language, I am trying to minimize the Rosenbrock function in 10 dimensions with constraints 0 <= x[i] <= 0.5. First Rosenbrock with variable arguments:
function rosen(x...)
    local n = length(x); local s = 0.0
    for i = 1:length(x)-1
        s += 100*(x[i+1] - x[i]^2)^2 + (x[i] - 1)^2
    end
    return s
end
## rosen (generic function with 1 method)
Define the optimization model with Ipopt as solver,
using JuMP; using Ipopt
m = Model(solver = IpoptSolver())
## Feasibility problem with:
## * 0 linear constraints
## * 0 variables
## Solver is Ipopt
and the variables with bound constraints and starting values x[i] = 0.1:
@variable(m, 0.0 <= x[1:10] <= 0.5)
for i in 1:10 setvalue(x[i], 0.1); end
Now I understand that I have to register the objective function.
JuMP.register(m, :rosen, 10, rosen, autodiff=true)
I am uncertain here whether I can do it like this, or if I need to define and register a mysquare function, as is done in the "User-defined Functions" section of the JuMP manual.
@NLobjective(m, Min, rosen(x[1],x[2],x[3],x[4],x[5],x[6],x[7],x[8],x[9],x[10]))
How can I write this more compactly? An expression like
@NLobjective(m, Min, rosen(x[1:10]))
## ERROR: Incorrect number of arguments for "rosen" in nonlinear expression.
gives an error. What if I would like to solve this problem with 100 variables?
Now we solve this model problem and it does return a solution, and indeed the correct one, as I know from solving it with the NEOS IPOPT solver.
sol = solve(m);
## ...
## EXIT: Optimal Solution Found.
As I am only interested in the exact value of x[10], I extract it thus:
getvalue(x[10])
## 0.00010008222367154784
Can this be simplified somehow? Think of how easy it is to solve this problem with fminsearch in MATLAB or optim in R.
R> optim(rep(0.1,10), fnRosenbrock, method="L-BFGS-B",
lower=rep(0.0,10), upper=rep(0.5,10),
control=list(factr=1e-12, maxit=5000))
## $par
## [1] 0.50000000 0.26306537 0.08003061 0.01657414 0.01038065
## [6] 0.01021197 0.01020838 0.01020414 0.01000208 0.00000000
Except, of course, it says $par[10] is 0.0 which is not true.

Optimization in R - not accurate enough

Suppose f(k) = exp(k/200) - 1 and we want to minimize ( f(a) + f(b) + f(c) + f(d) - pi )^2. The solution should be a = 6, b = 75, c = 89, d = 226. The sum of squares for this solution is ~ 8e-17.
sumsq <- function(theta, n = 200) {
  f <- function(k) exp(k/n) - 1
  (f(theta[1]) + f(theta[2]) + f(theta[3]) + f(theta[4]) - pi)^2
}
theta <- optim(par=c(10, 90, 70, 300), fn=sumsq)
# theta$par = 62.97 106.89, 78.64, 189.82
# theta$value = 6.32e-10
# sumsq(c(6,75,89,226)) = 8.20e-17
So clearly, the solution of a = 6, b = 75, c = 89, d = 226 is better than the one the optim function gave, judging by the sum of squares. I would like to know how to make R more accurate in its optimization. I have also tried the nlm() function, without success.
The value of pi used is 3.1415926535897931 - I don't think the accuracy of pi is the reason why the optim function isn't producing an optimal solution.
As the commenters say, this isn't a problem with the accuracy of optim, but rather that the algorithm used by optim may not be suitable for your particular problem. There are very many optimization packages and interfaces available in R; I have had good results using the rgenoud package to improve maximum-likelihood-based parameter estimates obtained with the fitdistrplus package (which I believe uses optim by default).
The other question, of course, is whether the problem you are posing actually has a global minimum that is distinguishable from other local minima within the numerical tolerance you can specify/R can detect. 6.32e-10 and 8.20e-17 are both pretty small and far beyond the numerical tolerances I consider acceptable in my work... but I don't know about your field.
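For what it's worth, a minimal rgenoud call for a problem like this might look as follows (a sketch only; the search bounds and population size are illustrative, not tuned):

library(rgenoud)
set.seed(1)
genoud(sumsq, nvars = 4,
       Domains = cbind(rep(0, 4), rep(300, 4)),  # illustrative bounds for a, b, c, d
       pop.size = 2000)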
This is not a well-posed minimization problem. There are infinitely many possible solutions. One of them is
a=b=c=d=200*log(1+pi/4)
which numerically is approximately
115.92829021682383
The residual sumsq in this case is zero (within the numerical accuracy) if you insert the numbers.
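A quick check with the sumsq function defined above confirms this:

k <- 200 * log(1 + pi/4)
k                 # 115.9283, as above
sumsq(rep(k, 4))  # essentially zero, limited only by floating-point precision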
The problem would probably be far more complex to solve if one imposed, e.g., the restriction that only natural numbers or only integers are allowed. In that case, your combination (and permutations thereof) might be the best, but at the moment I wouldn't know how to verify this. A minimization in the presence of such a constraint would be a qualitatively different problem, which might be interesting for mathematicians. In any case, the usual numerical optimization algorithms won't let you introduce such a constraint directly.
I used the "BFGS" method:
sumsq <- function(theta, n = 200) {
  f <- function(k) exp(k/n) - 1
  (f(theta[1]) + f(theta[2]) + f(theta[3]) + f(theta[4]) - pi)^2
}
theta <- optim(par=c(10, 90, 70, 300), fn=sumsq, method="BFGS")
Look at the result:
> theta
$par
[1] -2.629695 71.159586 52.952260 246.174513
$value
[1] 4.009243e-22
$counts
function gradient
19 8
$convergence
[1] 0

Optimizing a simple linear curve (constant and coefficient estimated from a regression)

I am trying to calculate the turning point of a few functions for which I have estimated the coefficient and constant from a regression. I'm using the optimize function for this, as my curves are all linear.
My function looks like:
F <- function(x){
  beta * x + alpha
}
Mind that beta and alpha are both vectors here. When running the optimisation with optimize, I'm getting the following error:
Error in optimize(F, interval = c(10, 20), lower = (10), :
invalid function value in 'optimize'
Is this because optimize is running the optimisation mathematically, so the beta and alphas need to be single parameters? If anyone knows a better way of doing this please do contribute!
Thank you in advance :)
If the functions are linear, then they will be at a minimum at the lower end of the range if beta >= 0, and at the upper end of the range if beta <= 0 - no need to use optimize().
It's not entirely clear what you're expecting the code to do - if you want it to return an x for each set of parameters, look at optim() instead and have F return the sum, or run optimize on each set of parameters in turn using an apply() function or loop.
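For example, a small sketch of the second approach (alpha and beta here are made-up example vectors; each call minimizes one line over the interval [10, 20]):

alpha <- c(1, 5, -2)    # illustrative intercepts
beta  <- c(2, -1, 0.5)  # illustrative slopes

# run optimize() once per (alpha, beta) pair and collect the minimizing x values
xmin <- mapply(function(a, b) {
  optimize(function(x) b * x + a, interval = c(10, 20))$minimum
}, alpha, beta)
xmin

As expected from the closed-form observation above, each entry of xmin simply ends up at 10 or 20 depending on the sign of the corresponding beta.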
One other thing is that your syntax is a bit wonky - I imagine that you mean:
> F<- function(x){
+ beta* x + alpha
+ }
> alpha <- 1
> beta <- 2
> optimize(F,c(10,20))
$minimum
[1] 10.00006
$objective
[1] 21.00011

How can extreme values of a functional be found using R?

I have a functional like this:
v[y] = integral from 0 to 2 of ( y'^2 + 23*y*y' + 12*y^2 + 3*y*exp(2*t) ) dt
with given start and end conditions y(0) = -1, y(2) = 18.
How can I find extreme values of this functional in R? I realize how it can be done for example in Excel but didn't find appropriate solution in R.
Before trying to solve such a task in a numerical setting, it might be better to lean back and think about it for a moment.
This is a problem typically treated in the mathematical discipline of "variational calculus". A necessary condition for a function y(t) to be an extremum of the functional (i.e. the integral) is the so-called Euler-Lagrange equation, see
Calculus of Variations at Wolfram Mathworld.
Applying it to f(t, y, y') as the integrand in your request, I get (please check, I can easily have made a mistake)
y'' - 12*y - 3/2*exp(2*t) = 0
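To spell out that step: with F(t, y, y') = y'^2 + 23*y*y' + 12*y^2 + 3*y*exp(2*t), the Euler-Lagrange equation dF/dy - d/dt(dF/dy') = 0 gives
dF/dy = 23*y' + 24*y + 3*exp(2*t)
dF/dy' = 2*y' + 23*y, so d/dt(dF/dy') = 2*y'' + 23*y'
and therefore 24*y + 3*exp(2*t) - 2*y'' = 0, i.e. y'' - 12*y - 3/2*exp(2*t) = 0.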
You can go now and find a symbolic solution for this differential equation (with the help of a textbook, or some CAS), or solve it numerically with the help of an R package such as 'deSolve'.
PS: Solving this as an optimization problem based on discretization is possible, but may lead you on a long and stony road. I remember solving the "brachistochrone problem" to a satisfactory accuracy only by applying several hundred variables (not in R).
Here is a numerical solution in R. First the functional:
f <- function(y, t = head(seq(0, 2, len = length(y)), -1)) {
  len <- length(y) - 1                  # number of subintervals on [0, 2]
  dy  <- diff(y) * len / 2              # approximate y' on each subinterval (step size is 2/len)
  y0  <- (head(y, -1) + y[-1]) / 2      # y at the midpoint of each subinterval
  2 * sum(dy^2 + 23*y0*dy + 12*y0^2 + 3*y0*exp(2*t)) / len  # midpoint-rule approximation of the integral
}
Now the function that does the actual optimization. The best results I got were using the BFGS optimization method, and parametrizing using dy rather than y:
findMinY <- function(points = 100,           ## number of points of evaluation
                     boundary = c(-1, 18),   ## boundary values
                     y0 = NULL,              ## optional initial value
                     method = "Nelder-Mead", ## optimization method
                     dff = T)                ## if TRUE, optimizes based on dy rather than y
{
  t <- head(seq(0, 2, len = points), -1)
  if (is.null(y0) || length(y0) != points)
    y0 <- seq(boundary[1], boundary[2], len = points)
  if (dff)
    y0 <- diff(y0)
  else
    y0 <- y0[-1]
  y0 <- head(y0, -1)
  ff <- function(z) {
    if (dff)
      y <- c(cumsum(c(boundary[1], z)), boundary[2])
    else
      y <- c(boundary[1], z, boundary[2])
    f(y, t)
  }
  res <- optim(y0, ff, control = list(maxit = 1e9), method = method)
  cat("Iterations:", res$counts, "\n")
  ymin <- res$par
  if (dff)
    c(cumsum(c(boundary[1], ymin)), boundary[2])
  else
    c(boundary[1], ymin, boundary[2])
}
With 500 points of evaluation, it only takes a few seconds with BFGS:
> system.time(yy<-findMinY(500,method="BFGS"))
Iterations: 90 18
user system elapsed
2.696 0.000 2.703
The resulting function looks like this:
plot(seq(0,2,len=length(yy)),yy,type='l')
And now a solution that numerically integrates the Euler equation.
As @HansWerner pointed out, this problem boils down to applying the Euler-Lagrange equation to the integrand in OP's question, and then solving that differential equation, either analytically or numerically. In this case the relevant ODE is
y'' - 12*y = 3/2*exp(2*t)
subject to:
y(0) = -1
y(2) = 18
So this is a boundary value problem, best approached using bvpcol(...) in package bvpSolve.
library(bvpSolve)
F <- function(t, y.in, pars){
  dy  <- y.in[2]
  d2y <- 12*y.in[1] + 1.5*exp(2*t)
  return(list(c(dy, d2y)))
}
init <- c(-1,NA)
end <- c(18,NA)
t <- seq(0, 2, by = 0.01)
sol <- bvpcol(yini = init, yend = end, x = t, func = F)
y = function(t){ # analytic solution...
  b  <- sqrt(12)
  a  <- 1.5/(4 - b*b)
  u  <- exp(2*b)
  C1 <- ((18*u + 1) - a*(exp(4)*u - 1))/(u*u - 1)
  C2 <- -1 - a - C1
  return(a*exp(2*t) + C1*exp(b*t) + C2*exp(-b*t))
}
par(mfrow=c(1,2))
plot(t,y(t), type="l", xlim=c(0,2),ylim=c(-1,18), col="red", main="Analytical Solution")
plot(sol[,1],sol[,2], type="l", xlim=c(0,2),ylim=c(-1,18), xlab="t", ylab="y(t)", main="Numerical Solution")
It turns out that in this very simple example, there is an analytical solution:
y(t) = a * exp(2*t) + C1 * exp(sqrt(12)*t) + C2 * exp(-sqrt(12)*t)
where a = -3/16 and C1 and C2 are determined to satisfy the boundary conditions. As the plots show, the numerical and analytical solutions agree completely, and also agree with the solution provided by @mrip.
