Difference between solnp and gosolnp functions in R for Non-Linear optimization problems - r

What is the main difference between the two functions? The R help manual says that gosolnp helps to set the initial parameters correctly. Is there any other difference? Also, if that is the case, how do we determine the correct distribution type for the parameter space?
In my problem, the initial set of parameters is difficult to determine, which is why the optimization problem is used. However, I do have an idea of the upper and lower bounds of the parameters.

gosolnp is an extension of solnp: a wrapper that allows multiple restarts. Simply put, it runs solnp several times (controlled by n.restarts) to avoid getting stuck in local minima. If your function is known to have no local minima (e.g., it is convex, which can sometimes be shown analytically), use solnp to save time. Otherwise, use gosolnp. If you have any additional information (for instance, a region where the global minimum is expected to lie), you can use it for finer control of the starting parameter distribution: see the parameters distr and distr.opt.
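The multiple-restart idea can be sketched in a few lines. This is only a conceptual illustration in Python, not the Rsolnp implementation: the local solver is a toy projected gradient descent, and the names multistart, gradient_descent, f and df are made up for this example.

```python
import random

def gradient_descent(f, df, x0, lower, upper, step=0.01, iters=2000):
    """Toy local solver: projected gradient descent on a 1-D function."""
    x = x0
    for _ in range(iters):
        # Take a gradient step, then clip back into [lower, upper].
        x = min(max(x - step * df(x), lower), upper)
    return x, f(x)

def multistart(f, df, lower, upper, n_restarts=20, seed=0):
    """Run the local solver from random starting points and keep the best result."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_restarts):
        x0 = rng.uniform(lower, upper)  # uniform starts within the known bounds
        x, val = gradient_descent(f, df, x0, lower, upper)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Hypothetical objective with two local minima (near x = -1.30 and x = 1.13).
def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    return 4 * x**3 - 6 * x + 1

single_x, single_val = gradient_descent(f, df, 1.5, -2.0, 2.0)  # one unlucky start
best_x, best_val = multistart(f, df, -2.0, 2.0)                 # many starts
```

A single run started at 1.5 settles in the shallower minimum, while the restarts find the deeper one. The price is roughly n_restarts times the work, which is why a single solnp call is preferable when the problem is known to be convex.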


Min Max component, objective function

I would like to perform some optimizations in Dymos by minimizing the maximum of a specific path variable, or the maximum of the absolute value of such a variable.
In linear programming methods, this can be done by introducing slack variables.
Do you know if this has been attempted before with Dymos, or if there was a reason not to include it?
I understand gradient based methods are not entirely suitable for these problems, though I think some "functions" can be introduced to mitigate this.
For example,
The space shuttle reentry problem from [Betts][1] is used as a [test example][2] in Dymos. The original source contains a variant in which the maximum heat flux is minimized. Such functionality could be implemented with the "loc" argument as:
phase.add_objective('q_c', loc='max')
[1]: J. Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. Society for Industrial and Applied Mathematics, second edition, 2010. doi:10.1137/1.9780898718577. URL: https://epubs.siam.org/doi/abs/10.1137/1.9780898718577
[2]: https://openmdao.github.io/dymos/examples/reentry/reentry.html
This has been done with pseudospectral methods before. Dymos currently doesn't have any direct way of implementing this, for a few reasons:
As you said, doing this naively can introduce discontinuous gradients that confuse the optimizer. When the node at which the maximum occurs switches, this tends to cause a sharp edge discontinuity in the gradient.
Since the pseudospectral methods are discrete, you cannot guarantee that the maximum will occur at a node. It's often fine to assume it does, but sometimes your requirements might demand more precision.
There are two possible ways to get around this.
The KSComp in OpenMDAO can be used as a "differentiable maximum". Add one after the trajectory, feed it the timeseries data for the output of interest, and set it up such that it returns a smooth approximation to the maximum. The KS function is a bit conservative, so it won't pick out the precise maximum, but depending on the value of the rho option it can be tuned to get pretty close.
When a more precise value of a maximum is needed, it's pretty common to set up a trajectory such that a phase ends when the maximum or minimum is reached.
If the variable whose maximum is being sought is a state, this can be done by adding a boundary constraint on the rate source for that state.
This ensures that the maximum occurs at the first or last node in the phase (depending on whether it's an initial or final boundary constraint). That lets you capture its value more accurately.
If the variable being sought is not a state, it's possible to use the polynomials that are used for fitting states and controls in a phase to interpolate the variable of interest. By then taking the time derivative of that polynomial we can get a reasonably good approximation for its rate. The master branch of dymos has a method add_timeseries_rate_output that does this. And soon, within a few weeks hopefully, we'll add add_boundary_rate_constraint so that these interpolated rates can be easily used as boundary constraints.
In the meantime, you should be able to achieve this by adding the timeseries rate output and then manually applying the OpenMDAO method 'add_constraint' to the resulting timeseries output, using either indices=[0] or indices=[-1] to treat it as an initial or final constraint.
This is a common enough request that we'll add some documentation on how to achieve this behavior using both the KSComp approach and the boundary constraint approach.
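To make the "differentiable maximum" idea concrete, here is a minimal standalone sketch of the KS (Kreisselmeier-Steinhauser) aggregation in plain Python. It is not the OpenMDAO KSComp itself; the function name ks_max and the sample values are assumptions for illustration.

```python
import math

def ks_max(values, rho=50.0):
    """Smooth, conservative approximation of max(values).

    Shifting by the true maximum m keeps the exponentials from overflowing.
    The result always lies between m and m + log(len(values)) / rho.
    """
    m = max(values)
    return m + math.log(sum(math.exp(rho * (v - m)) for v in values)) / rho

samples = [1.0, 2.0, 3.0]          # stand-in for timeseries values of the path variable
loose = ks_max(samples, rho=1.0)   # about 3.41: noticeably above the true max
tight = ks_max(samples, rho=50.0)  # essentially 3.0
```

Larger rho tightens the approximation, but it also makes the function more sharply curved near the point where the active node switches, so there is a trade-off between accuracy and how well-behaved the gradients are.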
Personally I'm not as much of a fan of KSComp because I've had trouble getting those types of objectives to converge in the past. I've used the slack variable approach and that has worked well. In the following example, we take a guess at the rotor power in the static analysis, and then we run a trajectory and get the actual rotor power during the mission. The objective was to minimize aircraft weight, and a larger rotor power in statics costs more weight. The constraint shown below prevents the optimizer from decreasing our updated guess of rotor power in statics below the maximum power required during the trajectory.
om.ExecComp('Power_check = Power_ODE - Power_statics',
            Power_check={'value': np.ones(nn_timeseries_main_tx), 'units': 'kW'},
            Power_ODE={'value': np.ones(nn_timeseries_main_tx), 'units': 'kW'},
            Power_statics={'value': 0.0, 'units': 'kW'}),
('Power_ODE','hop0.main_phase.timeseries.Power_R'), ('Power_statics','Power_{rotor,slack}')],
p.model.add_constraint('Power_check', upper=0, ref=1)
The constraint on the slack variable effectively helped us ensure that our slack rotor power matched the maximum rotor power during the mission. This allowed us to get the right sizes for the rotor parts (i.e. motors).
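Written out, the slack-variable approach described above is the standard epigraph reformulation of a min-max problem. The symbols are assumptions for illustration: x are the design variables, P(t_i; x) is the rotor power at timeseries node t_i, s is the slack variable (the statics power guess), and W is the weight objective.

```latex
\begin{aligned}
\min_{x,\,s}\quad & W(x, s) \\
\text{s.t.}\quad  & P(t_i; x) - s \le 0, \qquad i = 1, \dots, N .
\end{aligned}
```

If W increases with s, the optimizer pushes s down until at least one constraint becomes active, so at the optimum s equals the largest P(t_i; x). For a pure min-max objective, W(x, s) = s. Every constraint is smooth in x and s, so no max operator ever appears in the gradients.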

Convergence Criteria in glmmTMB - what are my options?

When using glmmTMB() from the R package {glmmTMB} (see CRAN for links to the manual and vignettes), I am aware that I have certain options when dealing with the convergence of models. More specifically, there is the control = argument, to which I can pass glmmTMBControl() parameters; these are documented in their own section of the manual.
Furthermore, one of the vignettes - i.e. Troubleshooting with glmmTMB - talks explicitly about dealing with convergence problems. My key point is now, however, that to my knowledge any time glmmTMBControl() is mentioned, it is always in one of these two ways:
glmmTMBControl(optCtrl=list(iter.max=1e3,eval.max=1e3)) i.e. increase the number of iterations
glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")) i.e. try a different optimizer
Regarding the second one, I am left with the impression that I have multiple options besides the one shown there, since the manual says "The optimizer argument can be any optimization (minimizing) function [...]".
Yet, I was not able to find out about any other options I could actually put as my optimizer=, since it really seems to be the exact example shown above that is presented, and I would be thankful if someone could provide a list.
P.S.: I am trying to play around with glmmTMB's convergence criteria because it often seems to estimate slightly smaller variance components than the same model fit via PROC MIXED in SAS.
From ?glmmTMB:
The ‘optimizer’ argument can be any optimization (minimizing) function, provided that:
• the first three arguments, in order, are the starting values, objective function, and gradient function;
• the function also takes a ‘control’ argument;
• the function returns a list with elements (at least) ‘par’, ‘objective’, ‘convergence’ (0 if convergence is successful) and ‘message’ (‘glmmTMB’ automatically handles output from ‘optim()’, by renaming the ‘value’ component to ‘objective’)
The options built into base R that satisfy these requirements (gradient-based minimizers) are nlminb and optim with method="BFGS", "L-BFGS-B", or "CG". You could also look into optimizers provided by optimx or nloptr, but you'd probably have to write a wrapper function to make sure they satisfied the criteria above ...

How do I warm start a dymos optimization problem?

My problem:
I have a system with 4 states and 4 parameters (static) that I would like to optimize. The parameters are initialized to some known values that result in trajectories that respect the constraints. The states are initialized to a constant value. To verify the model, I run the problem with the parameters set to opt=False. Once verified, I rebuild the OpenMDAO problem with opt=True and run the optimizer.
I'm running a study to evaluate how each parameter affects the system, cost function, etc. and how the initial guess impacts the optimization (ideally, it doesn't). The problem I encounter is that some initial guesses for a parameter result in a failed optimization (iteration limit or positive line search) while others don't and it's not immediately clear why. Note: I always provide an initial guess for the problem that results in feasible trajectories. I check this by setting opt=False for the parameters when I build the problem.
My assumption is that although my initial guess for the parameters is okay, my initial guess for the states is not, and the problem gets stuck trying to find feasible trajectories.
My solution/idea:
Is it possible to warm start an optimization problem in Dymos? To warm start, I would like to provide a feasible solution to the states and state rates of the optimizer. As a general flow I would like to first (1) run the optimization with the opt setting in controls and parameters set to False to get a state trajectory, then (2) set the opt setting for controls and parameters to True, and finally (3) re-run the optimization. It seems like there should be an easy way to do this, but I can't determine how without creating 2 problems (with different opt settings) and setting all the initial state guesses of the opt=True problem.
Note: I did read this post: Dymos how to use previous trajectory solution as initial guess? and I can rerun a problem. I just don't know how to change the opt setting between runs.
If there is an alternate or better solution to my problem, I'd be interested in that as well.
If you are using IPOPT, using a previous solution as your initial guess doesn't really help. This is due to the nature of interior point optimizers. At the start, the barrier parameter mu is large. Newton's method then solves for the "optimum" for that value of mu, and this solution lies AWAY from the initial guess. Then mu is decreased and Newton's method gets you closer to the true optimum. This process is repeated as mu is decreased, until finally mu is small and you get back to the point that you guessed initially.
Also, because we are using a Quasi-Newton method with a limited-memory Hessian approximation (L-BFGS) when going through Dymos/pyoptsparse, all the information about the Hessian is not there when you start again even if your initial guess is the optimum. So that information has to be filled in again as the algorithm iterates.
I am not an IPOPT expert but this seems to explain why I had no luck trying to use an "improved" initial guess. One thing that did help a lot with convergence was increasing the "limited_memory_max_history" parameter to 100 or so.
IPOPT does have the warm-start option but getting it the initial information it needs regarding the Hessian and initial multipliers might be something you have to go into pyoptsparse to figure out how to do.
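The effect of the barrier parameter can be seen on a one-variable toy problem: minimize x subject to x >= 1, whose true optimum is x = 1. The sketch below is a plain log-barrier method with a crude safeguard, not IPOPT's actual primal-dual algorithm; the function names and the mu schedule are assumptions for illustration.

```python
def solve_barrier(mu, x0, iters=60):
    """Newton's method on phi(x) = x - mu * log(x - 1) for a fixed mu."""
    x = x0
    for _ in range(iters):
        d = x - 1.0                 # distance to the constraint boundary
        grad = 1.0 - mu / d         # phi'(x)
        hess = mu / d ** 2          # phi''(x), always positive
        x_new = x - grad / hess
        if x_new <= 1.0:
            # Crude safeguard: never step onto or across the boundary.
            x_new = 1.0 + 0.5 * d
        x = x_new
    return x

def central_path(x_start, mus=(0.1, 0.01, 0.001)):
    """Solve a sequence of barrier subproblems, each started from the previous solution."""
    path = []
    x = x_start
    for mu in mus:
        x = solve_barrier(mu, x)
        path.append(x)
    return path

# "Warm start" essentially at the true optimum x* = 1.
path = central_path(1.0 + 1e-6)
```

Each subproblem has its minimizer at x = 1 + mu, so the first solve moves from 1.000001 out to about 1.1 before later, smaller values of mu bring the iterate back toward 1. That is the behaviour described above: a very good initial point does not stay put while mu is still large.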

Can I use automatic differentiation for non-differentiable functions?

I am testing performance of different solvers on minimizing an objective function derived from simulated method of moments. Given that my objective function is not differentiable, I wonder if automatic differentiation would work in this case? I tried my best to read some introduction on this method, but I couldn't figure it out.
I am actually trying to use Ipopt+JuMP in Julia for this test. Previously, I have tested it using BlackBoxoptim in Julia. I will also appreciate if you could provide some insights on optimization of non-differentiable functions in Julia.
It seems that I am not clear on "non-differentiable". Let me give you an example. Consider the following objective function. X is dataset, B is unobserved random errors which will be integrated out, \theta is parameters. However, A is discrete and therefore not differentiable.
I'm not exactly an expert on optimization, but: it depends on what you mean by "nondifferentiable".
For many mathematical functions that are used, "nondifferentiable" will just mean "not everywhere differentiable" -- but that's still "differentiable almost everywhere, except on countably many points" (e.g., abs, relu). These functions are not a problem at all -- you can just choose any subgradient and apply any normal gradient method. That's what basically all AD systems for machine learning do. The case of non-singular subgradients will happen with low probability anyway. An alternative for certain forms of convex objectives is proximal gradient methods, which "smooth" the objective in an efficient way that preserves optima (cf. ProximalOperators.jl).
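As an illustration of "just choose a subgradient", here is a minimal forward-mode AD sketch in Python using dual numbers. It is not how any particular Julia AD package is implemented; the class Dual, the helper derivative, and the test function f are made up for this example, and the choice of 0 as the derivative of abs at 0 is one valid subgradient among many.

```python
class Dual:
    """Dual number val + der * eps, carrying a value and its derivative."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

    def __abs__(self):
        # abs is not differentiable at 0; pick the subgradient 0 there.
        if self.val > 0:
            sign = 1.0
        elif self.val < 0:
            sign = -1.0
        else:
            sign = 0.0
        return Dual(abs(self.val), sign * self.der)


def _lift(x):
    """Wrap plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def derivative(f, x):
    """Evaluate df/dx at x by seeding the derivative part with 1."""
    return f(Dual(x, 1.0)).der


def f(x):
    return abs(x - 2) + 0.5 * x * x

d_right = derivative(f, 3.0)  # 1 + 3 = 4
d_left = derivative(f, 1.0)   # -1 + 1 = 0
d_kink = derivative(f, 2.0)   # chosen subgradient 0, plus 2
```

The AD system never "notices" the kink: it simply returns whatever the primitive's rule says at that point, which is enough for gradient methods as long as the kink is hit rarely.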
Then there are those functions that seem like they can't be differentiated at all, since they seem "combinatoric" or discrete, but are in fact piecewise differentiable (if seen from the correct point of view). This includes sorting and ranking. But you have to find them, and describing and implementing the derivative is rather complicated. Whether such functions are supported by an AD system depends on how sophisticated its "standard library" is. Some variants of this, like "permute", can just fall out of AD over control structures, while more complex ones require the primitive adjoints to be manually defined.
For certain kinds of problems, though, we just work in an intrinsically discrete space -- like integer parameters of some probability distributions. In these cases, differentiation makes no sense, and hence AD libraries define their primitives not to work on these parameters. Possible alternatives are to use (mixed) integer programming, approximations, search, and model selection. This case also occurs for problems where the optimized space itself depends on the parameter in question, like the second argument of fill. (We also have things like the ℓ0 "norm" or the rank of a matrix, for which there exist well-known continuous relaxations, but that's outside of the scope of AD.)
(In the specific case of MCMC for discrete or dimensional parameters, there's other ways to deal with that, like combining HMC with other MC methods in a Gibbs sampler, or using a nonparametric model instead. Other tricks are possible for VI.)
That being said, you will rarely encounter complicated nowhere differentiable continuous functions in optimization. They are already complicated to describe, and are just unlikely to arise in the kind of math we use for modelling.

constrained optimization without gradient

This is a more general question, somewhat independent of data, so I do not have a MWE.
I often have functions fn(.) that implement algorithms that are not differentiable but that I want to optimize. I usually use optim(.) with its standard method, which works fine for me in terms of speed and results.
However, I now have a problem that requires me to set bounds on one of the several parameters of fn. From what I understand, optim(method="L-BFGS-B",...) allows me to set limits to parameters but also requires a gradient. Because fn(.) is not a mathematical function but an algorithm, I suspect it does not have a gradient that I could derive through differentiation. This leads me to ask whether there is a way of performing constrained optimization in R in a way that does not require me to give a gradient.
I have looked at some sources, e.g. John C. Nash's texts on this topic but as far as I understand them, they concern mostly differentiable functions where gradients can be supplied.
Summarizing the comments so far (which are all things I would have said myself):
you can use method="L-BFGS-B" without providing explicit gradients (the gr argument is optional); in that case, R will compute approximations to the derivative by finite differencing (#G.Grothendieck). It is the simplest solution, because it works "out of the box": you can try it and see if it works for you. However:
L-BFGS-B is probably the finickiest of the methods provided by optim() (e.g. it can't handle the case where a trial set of parameters evaluates to NA)
finite-difference approximations are relatively slow and numerically unstable (but, fine for simple problems)
for simple cases you can fit the parameter on a transformed scale, e.g. if b is a parameter that must be positive, you can use log_b as a parameter (and transform it via b <- exp(log_b) in your objective function). (#SamMason) But:
there isn't always a simple transformation that will achieve the constraint you want
if the optimal solution is on the boundary, transforming will cause problems
there are a variety of derivative-free optimizers with constraints (typically "box constraints", i.e. independent lower and/or upper bounds on one or more parameters) (#ErwinKalvelagen): dfoptim has a few, I have used the nloptr package (and its BOBYQA optimizer) extensively, minqa has some as well. This is the solution I would recommend.
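To show what a derivative-free method with box constraints looks like in principle, here is a minimal compass (pattern) search sketch in Python. It is a much simpler technique than BOBYQA and is not taken from dfoptim, nloptr or minqa; the function name compass_search and the test objective fn are assumptions for illustration.

```python
def compass_search(fn, x0, lower, upper, step=1.0, tol=1e-8, max_iter=10000):
    """Minimize fn within box bounds using only function values."""
    # Start from x0 clipped into the box.
    x = [min(max(v, lo), hi) for v, lo, hi in zip(x0, lower, upper)]
    fx = fn(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for direction in (1.0, -1.0):
                trial = list(x)
                # Move along one coordinate and clip to the bounds.
                trial[i] = min(max(x[i] + direction * step, lower[i]), upper[i])
                ft = fn(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
                    break
        if not improved:
            step *= 0.5  # no better neighbour found: shrink the step
    return x, fx

# Non-smooth objective whose unconstrained minimum (3, -1) lies outside the box.
def fn(p):
    return abs(p[0] - 3.0) + (p[1] + 1.0) ** 2

x_opt, f_opt = compass_search(fn, [0.0, 0.0], lower=[0.0, -5.0], upper=[2.0, 5.0])
```

The search ends on the boundary at p[0] = 2 without needing a gradient or a parameter transformation, which is the situation where the log-transform trick above runs into trouble.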
