This is a more general question, somewhat independent of data, so I do not have a MWE.
I often have functions fn(.) that implement algorithms that are not differentiable but that I want to optimize. I usually use optim(.) with its standard method, which works fine for me in terms of speed and results.
However, I now have a problem that requires me to set bounds on one of the several parameters of fn. From what I understand, optim(method="L-BFGS-B",...) allows me to set limits to parameters but also requires a gradient. Because fn(.) is not a mathematical function but an algorithm, I suspect it does not have a gradient that I could derive through differentiation. This leads me to ask whether there is a way of performing constrained optimization in R in a way that does not require me to give a gradient.
I have looked at some sources, e.g. John C. Nash's texts on this topic but as far as I understand them, they concern mostly differentiable functions where gradients can be supplied.
Summarizing the comments so far (which are all things I would have said myself):
you can use method="L-BFGS-B" without providing explicit gradients (the gr argument is optional); in that case, R will compute approximations to the derivative by finite differencing (#G.Grothendieck). It is the simplest solution, because it works "out of the box": you can try it and see if it works for you. However:
L-BFGS-B is probably the finickiest of the methods provided by optim() (e.g. it can't handle the case where a trial set of parameters evaluates to NA)
finite-difference approximations are relatively slow and numerically unstable (but, fine for simple problems)
for simple cases you can fit the parameter on a transformed scale, e.g. if b is a parameter that must be positive, you can use log_b as a parameter (and transform it via b <- exp(log_b) in your objective function). (#SamMason) But:
there isn't always a simple transformation that will achieve the constraint you want
if the optimal solution is on the boundary, transforming will cause problems
there are a variety of derivative-free optimizers with constraints (typically "box constraints", i.e. independent lower and/or upper bounds one or more parameters) (#ErwinKalvelagen): dfoptim has a few, I have used the nloptr package (and its BOBYQA optimizer) extensively, minqa has some as well. This is the solution I would recommend.
Related
I had to add some circular dependencies to my model and thus adding NonlinearBlockGS and LinearBlockGS to the Group with the circular dependency. I get messages like this
LN: LNBGSSolver 'LN: LNBGS' on system 'XXX' failed to converge in 10
iterations.
in the phase where it's finding the Coloring of the problem. There is a Dymos trajectory as part of the problem, but the circular dependency is not in the Trajectory group, it's upstream. It however converges very easily when actually solving the problem. The number of FWD solves is the same as it was before-- everything seem to work fine. Should I be worried about anything?
the way our total derivative coloring works is that we replace partial derivatives with random numbers and then solve the linear system. So the linear solver should be converging. Now, whether or not it should converge with LNBGS in 10 iterations... probably not.
Its hard to speak diffinitively when putting random numbers into a matrix to invert it... but generally speaking it should remain invertible (though we can't promise). That does not mean that it will remain easily invertible. How close does the linear residual get during the coloring? it is decreasing, but slowly. Would more iteration let it get there?
If your problem is working well, I don't think you need to freak out about this. If you would like it to converge better, it won't hurt anything and might give you better coloring. You can increase the iprint of that solver to get more information on the convergence history.
Another option, if your system is small enough, is to try using the DirectSolver instead of LNBGS. For most models with less than 10,000 variables in them a DirectSolver will be overall faster than the LNBGS. There is a nice symetry to using LNBGS with NLGBS ... but while the nonlinear solver tends to be a good choice (i.e. fast and stable) for cyclic dependencies the same can't be said for its linear counter part.
So my go-to combination if NLBGS and DirectSolver. You can't always use the DirectSolver. If you have distributed components in your model, or components that use the matrix-free derivative APIs (apply_linear, compute_jacvec_product), then LNBGS is a good option. But if everything is explicit components with compute_partials or implicit components that provide partials in the linearize method then I suggest using the DirectSolver as your first option.
I think you may have discovered a coloring performance issue in OpenMDAO. When we compute coloring, internally we replace the component partials with random arrays matching the declared sparsity. Since we're not trying to find an actual solution when we compute coloring, we probably don't need to iterate more than once in any given system. And we shouldn't be generating convergence warnings when computing the coloring. I don't think you need to be worried in this case. I'll put a story in our bug tracker to look into this.
When using glmmTMB() of the R-package {glmmTMB} (see CRAN with links to manual & vignettes), I am aware that I have certain options when dealing with the convergence of models. More specifically, there is the control = argument to which I can pass glmmTMBControl() parameters, whose section in the manual is this:
Furthermore, one of the vignettes - i.e. Troubleshooting with glmmTMB - talks explicitly about dealing with convergence problems. My key point is now, however, that to my knowledge any time glmmTMBControl() is mentioned, it is always in one of these two ways:
glmmTMBControl(optCtrl=list(iter.max=1e3,eval.max=1e3)) i.e. increase the number of iterations
glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")) i.e. try a different optimizer
Regarding the second one, I am left with the impression that I have multiple options besides the one shown there since "The optimizer argument can be any optimization (minimizing) function [...]" and the following phrasing:
Yet, I was not able to find out about any other options I could actually put as my optimizer=, since it really seems to be the exact example shown above that is presented, and I would be thankful if someone could provide a list.
P.S.: I am trying to play around with glmmTMBs convergence criteria, because it seems to often estimate slightly smaller variance components compared to the same model fit via PROC MIXED in SAS.
From ?glmmTMB:
The ‘optimizer’ argument can be any optimization (minimizing) function, provided that:
• the first three arguments, in order, are the starting values, objective function, and gradient function;
• the function also takes a ‘control’ argument;
• the function returns a list with elements (at least) ‘par’,
‘objective’, ‘convergence’ (0 if convergence is successful)
and ‘message’ (‘glmmTMB’ automatically handles output from
‘optim()’, by renaming the ‘value’ component to ‘objective’)
The options built into base R that satisfy these requirements (gradient-based minimizers) are nlminb and optim with method="BFGS", "L-BFGS-B", or "CG". You could also look into optimizers provided by optimx or nloptr, but you'd probably have to write a wrapper function to make sure they satisfied the criteria above ...
I am testing performance of different solvers on minimizing an objective function derived from simulated method of moments. Given that my objective function is not differentiable, I wonder if automatic differentiation would work in this case? I tried my best to read some introduction on this method, but I couldn't figure it out.
I am actually trying to use Ipopt+JuMP in Julia for this test. Previously, I have tested it using BlackBoxoptim in Julia. I will also appreciate if you could provide some insights on optimization of non-differentiable functions in Julia.
It seems that I am not clear on "non-differentiable". Let me give you an example. Consider the following objective function. X is dataset, B is unobserved random errors which will be integrated out, \theta is parameters. However, A is discrete and therefore not differentiable.
I'm not exactly an expert on optimization, but: it depends on what you mean by "nondifferentiable".
For many mathematical functions that are used, "nondifferentiable" will just mean "not everywhere differentiable" -- but that's still "differentiable almost everywhere, except on countably many points" (e.g., abs, relu). These functions are not a problem at all -- you can just chose any subgradient and apply any normal gradient method. That's what basically all AD systems for machine learning do. The case for non-singular subgradients will happen with low probability anyway. An alternative for certain forms of convex objectives are proximal gradient methods, which "smooth" the objective in an efficient way that preserves optima (cf. ProximalOperators.jl).
Then there's those functions that seem like they can't be differentiated at all, since they seem "combinatoric" or discrete, but are in fact piecewise differentiable (if seen from the correct point of view). This includes sorting and ranking. But you have to find them, and describing and implementing the derivative is rather complicated. Whether such functions are supported by an AD system depends on how sophisticated its "standard library" is. Some variants of this, like "permute", can just fall out AD over control structures, while move complex ones require the primitive adjoints to be manually defined.
For certain kinds of problems, though, we just work in an intrinsically discrete space -- like, integer parameters of some probability distributions. In these case, differentiation makes no sense, and hence AD libraries define their primitives not to work on these parameters. Possible alternatives are to use (mixed) integer programming, approximations, search, and model selection. This case also occurs for problems where the optimized space itself depends on the parameter in question, like the second argument of fill. We also have things like the ℓ0 "norm" or the rank of a matrix, for which there exist well-known continuous relaxations, but that's outside of the scope of AD).
(In the specific case of MCMC for discrete or dimensional parameters, there's other ways to deal with that, like combining HMC with other MC methods in a Gibbs sampler, or using a nonparametric model instead. Other tricks are possible for VI.)
That being said, you will rarely encounter complicated nowhere differentiable continuous functions in optimization. They are already complicated to describe, are just unlikely to arise in the kind of math we use for modelling.
If multiple solvers are attached to the model as external code components (complete black boxes) is there a way to implement adjoint method?
Many of the usages of openMDAO that I have seen so far seems to use some black box flow solvers or structural solvers etc.
But I dont see how one can implement an adjoint method without touching the source codes.
As it is also unlikely to implement a complex step method. The only way is finite difference if one wants to use a gradient based optimizer?
You can implement a semi-analytic adjoint method, even if you don't have any analytic derivatives in your model. Assume you have a single black-bock code, that has an internal solver to converge some implicit equations before computing some set of output values.
You wrap the code in OpenMDAO as an ImplicitComponent, using custom file wrapper. (note: You can't use the ExternalCode component because it is a sub-class of ExplicitComponent)
In your new wrapper, you will implement two methods in your custom ImplicitComponent:
solve_nonlienar
apply_nonlinear
The first method is simply the normal run of your code. The second method however takes the input values and given output values and computes a residual.
Then in order to get partial derivatives, OpenMDAO will finite-difference the apply_linear method to compute dResidaul_dinputs and dResidual_dOutputs, which it can then use to compute total derivatives using reverse (or adjoint) mode.
This approach is typically both more efficient and more accurate then wrapping the code as an ExplicitComponent. First of all, the apply_nonlinear method (residual evaluation) should be very significantly less expensive the solve_nonlinear method, so the finite-difference should be much cheaper. Perhaps more importantly though, finite-difference across a residual evaluation should be much more accurate than when you finite-difference around a full solver convergence loop (see this technical paper for an example of why that is).
There are a few things to note in this approach. First, there may be a large number of residual equations and implicit variables. So you may be doing many more finite-difference calls than if you wrapped your code as an ExplicitComponent. However, since the residual evaluation is much cheaper it should be acceptable. Second, not all codes have the ability to return residual evaluations, so that might require some modification of the black-box. Without the residual evaluation, you can't use an adjoint method.
One other general approach is to mix codes with analytic derivatives and finite-difference partial derivatives. Lets say you have an expensive CFD analysis that does have an existing adjoint, and you want to couple in a inexpensive black-box code. You can finite-difference the black-box code to compute its partials, and then OpenMDAO will use those with the exisiing adjoint to compute semi-analytic total derivatives.
I'm working on a simulation project with a 3-dimensional piece-wise constant function, and I'm trying to find the inputs that maximize the output. Using optim() in R with the Nelder-Mead or SANN algorithms seems best (they don't require the function to be differentiable), but I'm finding that optim() ends up returning my starting value exactly. This starting value was obtained using a grid search, so it's likely reasonably good, but I'd be surprised if it was the exact peak.
I suspect that optim() is not testing points far enough out from the initial guess, leading to a situation where all tested points give the same output.
Is this a reasonable concern?
How can I tweak the breadth of values that optim() is testing as it searches?