Can I use automatic differentiation for non-differentiable functions? - julia

I am testing performance of different solvers on minimizing an objective function derived from simulated method of moments. Given that my objective function is not differentiable, I wonder if automatic differentiation would work in this case? I tried my best to read some introduction on this method, but I couldn't figure it out.
I am actually trying to use Ipopt+JuMP in Julia for this test. Previously, I have tested it using BlackBoxoptim in Julia. I will also appreciate if you could provide some insights on optimization of non-differentiable functions in Julia.
It seems that I am not clear on "non-differentiable". Let me give you an example. Consider the following objective function. X is dataset, B is unobserved random errors which will be integrated out, \theta is parameters. However, A is discrete and therefore not differentiable.

I'm not exactly an expert on optimization, but: it depends on what you mean by "nondifferentiable".
For many mathematical functions that are used, "nondifferentiable" will just mean "not everywhere differentiable" -- but that's still "differentiable almost everywhere, except on countably many points" (e.g., abs, relu). These functions are not a problem at all -- you can just chose any subgradient and apply any normal gradient method. That's what basically all AD systems for machine learning do. The case for non-singular subgradients will happen with low probability anyway. An alternative for certain forms of convex objectives are proximal gradient methods, which "smooth" the objective in an efficient way that preserves optima (cf. ProximalOperators.jl).
Then there's those functions that seem like they can't be differentiated at all, since they seem "combinatoric" or discrete, but are in fact piecewise differentiable (if seen from the correct point of view). This includes sorting and ranking. But you have to find them, and describing and implementing the derivative is rather complicated. Whether such functions are supported by an AD system depends on how sophisticated its "standard library" is. Some variants of this, like "permute", can just fall out AD over control structures, while move complex ones require the primitive adjoints to be manually defined.
For certain kinds of problems, though, we just work in an intrinsically discrete space -- like, integer parameters of some probability distributions. In these case, differentiation makes no sense, and hence AD libraries define their primitives not to work on these parameters. Possible alternatives are to use (mixed) integer programming, approximations, search, and model selection. This case also occurs for problems where the optimized space itself depends on the parameter in question, like the second argument of fill. We also have things like the ℓ0 "norm" or the rank of a matrix, for which there exist well-known continuous relaxations, but that's outside of the scope of AD).
(In the specific case of MCMC for discrete or dimensional parameters, there's other ways to deal with that, like combining HMC with other MC methods in a Gibbs sampler, or using a nonparametric model instead. Other tricks are possible for VI.)
That being said, you will rarely encounter complicated nowhere differentiable continuous functions in optimization. They are already complicated to describe, are just unlikely to arise in the kind of math we use for modelling.

Related

constrained optimization without gradient

This is a more general question, somewhat independent of data, so I do not have a MWE.
I often have functions fn(.) that implement algorithms that are not differentiable but that I want to optimize. I usually use optim(.) with its standard method, which works fine for me in terms of speed and results.
However, I now have a problem that requires me to set bounds on one of the several parameters of fn. From what I understand, optim(method="L-BFGS-B",...) allows me to set limits to parameters but also requires a gradient. Because fn(.) is not a mathematical function but an algorithm, I suspect it does not have a gradient that I could derive through differentiation. This leads me to ask whether there is a way of performing constrained optimization in R in a way that does not require me to give a gradient.
I have looked at some sources, e.g. John C. Nash's texts on this topic but as far as I understand them, they concern mostly differentiable functions where gradients can be supplied.
Summarizing the comments so far (which are all things I would have said myself):
you can use method="L-BFGS-B" without providing explicit gradients (the gr argument is optional); in that case, R will compute approximations to the derivative by finite differencing (#G.Grothendieck). It is the simplest solution, because it works "out of the box": you can try it and see if it works for you. However:
L-BFGS-B is probably the finickiest of the methods provided by optim() (e.g. it can't handle the case where a trial set of parameters evaluates to NA)
finite-difference approximations are relatively slow and numerically unstable (but, fine for simple problems)
for simple cases you can fit the parameter on a transformed scale, e.g. if b is a parameter that must be positive, you can use log_b as a parameter (and transform it via b <- exp(log_b) in your objective function). (#SamMason) But:
there isn't always a simple transformation that will achieve the constraint you want
if the optimal solution is on the boundary, transforming will cause problems
there are a variety of derivative-free optimizers with constraints (typically "box constraints", i.e. independent lower and/or upper bounds one or more parameters) (#ErwinKalvelagen): dfoptim has a few, I have used the nloptr package (and its BOBYQA optimizer) extensively, minqa has some as well. This is the solution I would recommend.

Are series representations of functions every practically used to graph in computer science?

As you probably know functions can be represented as a infinite series. For example f(x) = cosx can be represented as this. My question is if this is every used practically in programming for any type of application. I know it can be used I was just wondering if it actually is for serious projects.
Aside from infinite series, there are other representations for functions which can be useful for computing approximations. Asymptotic series, identities involving other "elementary" functions, and interpolation in a table of values are all used in different contexts. Take a look at Abramowitz & Stegun "Handbook of Mathematical Functions" to get an idea of the variety of possibilities. Also look for the source code for popular libraries or systems such as R, Numpy, Scipy, or Octave to see what approaches have been used by the authors of that software.
Specifically about series approximations for trigonometric functions, I think that might be a reasonable thing to do, but only if the range of the argument is reduced (via identities) so that it is as small as possible.
Approximation of functions is a great topic; good luck and have fun.

Understanding the complex-step in a physical sense

I think I understand what complex step is doing numerically/algorithmically.
But the questions still linger. First two questions might have the same answer.
1- I replaced the partial derivative calculations of 'Betz_limit' example with complex step and removed the analytical gradients. Looking at the recorded design_var evolution none of the values are complex? Aren't they supposed to be shown as somehow a+bi?
Or it always steps in the real space. ?
2- Tying to picture 'cs', used in a physical concept. For example a design variable of beam length (m), objective of mass (kg) and a constraint of loads (Nm). I could be using an explicit component to calculate these (pure python) or an external code component (pure fortran). Numerically they all can handle complex numbers but obviously the mass is a real value. So when we say capable of handling the complex numbers is it just an issue of handling a+bi (where actual mass is always 'a' and b is always equal to 0?)
3- How about the step size. I understand there wont be any subtractive cancellation errors but what if i have a design variable normalized/scaled to 1 and a range of 0.8 to 1.2. Decreasing the step to 1e-10 does not make sense. I am a bit confused there.
The ability to use complex arithmetic to compute derivative approximations is based on the mathematics of complex arithmetic.
You should read about the theory to get a better understanding of why it works and how the step size issue is resolved with complex-step vs finite-difference.
There is no physical interpretation that you can make for the complex-step method. You are simply taking advantage of the mathematical properties of complex arithmetic to approximate a derivative in a more accurate manner than FD can. So the key is that your code is set up to do complex-arithmetic correctly.
Sometimes, engineering analyses do actually leverage complex numbers. One aerospace example of this is the Jukowski Transformation. In electrical engineering, complex numbers come up all the time for load-flow analysis of ac circuits. If you have such an analysis, then you can not easily use complex-step to approximate derivatives since the analysis itself is already complex. In these cases, it is technically possible to use a more general class of numbers called hyper dual numbers, but this is not supported in OpenMDAO. So if you had an analysis like this you could not use complex-step.
Also, occationally there are implementations of methods that are not complex-step safe which will prevent you from using it unless you define a new complex-step safe version. The simplest example of this is the np.absolute() method in the numpy library for python. The implementation of this, when passed a complex number, will return the asolute magnitude of the number:
abs(a+bj) = sqrt(1^2 + 1^2) = 1.4142
While not mathematically incorrect, this implementation would mess up the complex-step derivative approximation.
Instead you need an alternate version that gives:
abs(a+bj) = abs(a) + abs(b)*j
So in summary, you need to watch out for these kinds of functions that are not implemented correctly for use with complex-step. If you have those functions, you need to use alternate complex-step safe versions of them. Also, if your analysis itself uses complex numbers then you can not use complex-step derivative approximations either.
With regard to your step size question, again I refer you to the this paper for greater detail. The basic idea is that without subtractive cancellation you are free to use a very small step size with complex-step without the fear of lost accuracy due to numerical issues. So typically you will use 1e-20 smaller as the step. Since complex-step accuracy scalea with the order of step^2, using such a small step gives effectively exact results. You need not worry about scaling issues in most cases, if you just take a small enough step.

How can I do blind fitting on a list of x, y value pairs if I don't know the form of f(x) = y?

If I have a function f(x) = y that I don't know the form of, and if I have a long list of x and y value pairs (potentially thousands of them), is there a program/package/library that will generate potential forms of f(x)?
Obviously there's a lot of ambiguity to the possible forms of any f(x), so something that produces many non-trivial unique answers (in reduced terms) would be ideal, but something that could produce at least one answer would also be good.
If x and y are derived from observational data (i.e. experimental results), are there programs that can create approximate forms of f(x)? On the other hand, if you know beforehand that there is a completely deterministic relationship between x and y (as in the input and output of a pseudo random number generator) are there programs than can create exact forms of f(x)?
Soooo, I found the answer to my own question. Cornell has released a piece of software for doing exactly this kind of blind fitting called Eureqa. It has to be one of the most polished pieces of software that I've ever seen come out of an academic lab. It's seriously pretty nifty. Check it out:
It's even got turnkey integration with Amazon's ec2 clusters, so you can offload some of the heavy computational lifting from your local computer onto the cloud at the push of a button for a very reasonable fee.
I think that I'm going to have to learn more about GUI programming so that I can steal its interface.
(This is more of a numerical methods question.) If there is some kind of observable pattern (you can kinda see the function), then yes, there are several ways you can approximate the original function, but they'll be just that, approximations.
What you want to do is called interpolation. Two very simple (and not very good) methods are Newton's method and Laplace's method of interpolation. They both work on the same principle but they are implemented differently (Laplace's is iterative, Newton's is recursive, for one).
If there's not much going on between any two of your data points (ie, the actual function doesn't have any "bumps" whose "peaks" are not represented by one of your data points), then the spline method of interpolation is one of the best choices you can make. It's a bit harder to implement, but it produces nice results.
Edit: Sometimes, depending on your specific problem, these methods above might be overkill. Sometimes, you'll find that linear interpolation (where you just connect points with straight lines) is a perfectly good solution to your problem.
It depends.
If you're using data acquired from the real-world, then statistical regression techniques can provide you with some tools to evaluate the best fit; if you have several hypothesis for the form of the function, you can use statistical regression to discover the "best" fit, though you may need to be careful about over-fitting a curve -- sometimes the best fit (highest correlation) for a specific dataset completely fails to work for future observations.
If, on the other hand, the data was generated something synthetically (say, you know they were generated by a polynomial), then you can use polynomial curve fitting methods that will give you the exact answer you need.
Yes, there are such things.
If you plot the values and see that there's some functional relationship that makes sense, you can use least squares fitting to calculate the parameter values that minimize the error.
If you don't know what the function should look like, you can use simple spline or interpolation schemes.
You can also use software to guess what the function should be. Maybe something like Maxima can help.
Wolfram Alpha can help you guess:
http://blog.wolframalpha.com/2011/05/17/plotting-functions-and-graphs-in-wolframalpha/
Polynomial Interpolation is the way to go if you have a totally random set
http://en.wikipedia.org/wiki/Polynomial_interpolation
If your set is nearly linear, then regression will give you a good approximation.
Creating exact form from the X's and Y's is mostly impossible.
Notice that what you are trying to achieve is at the heart of many Machine Learning algorithm and therefor you might find what you are looking for on some specialized libraries.
A list of x/y values N items long can always be generated by an degree-N polynomial (assuming no x values are the same). See this article for more details:
http://en.wikipedia.org/wiki/Polynomial_interpolation
Some lists may also match other function types, such as exponential, sinusoidal, and many others. It is impossible to find the 'simplest' matching function, but the best you can do is go through a list of common ones like exponential, sinusoidal, etc. and if none of them match, interpolate the polynomial.
I'm not aware of any software that can do this for you, though.

Non Linear Integer Programming

I would like to know if there is a package in R handling non linear integer optimization.
"Basically", I would like to solve the following problem:
max f(x) s.t x in (0,10) and x is integer.
I know that some branching algorithms are able to handle the linear version of this problem, but here my function f() might be more complicated. (I can't even make sure it would be quadratic of the form f(x)=xQx).
I guess there is always the brute force solution to test all the possibilities as long as they are bounded, but I was wondering if there wasn't anything smarter.
I have a few options for you, but none of them is the silver bullet, although it looks like your silver bullet is in the works under the rino project: http://r-forge.r-project.org/projects/rino/.
Since your function is complicated, you may want to use a genetic algorithm (i.e., gradient-based optimizers may not be reliable). genoud in the rgenoud library may do the trick (link text). If you set data.type.int=TRUE it should do the trick. I have not used this library, but have some experience with GAs in matlab and the time to convergence is sensitive to the settings, so you'll be well served to read the man page a few times through.
Or, if your function in strictly concave (unlikely, since you say it may be complicated) you can solve with a gradient solver (e.g., optim) then check the neighborhood around the optimum (can't be more than 2^n points to check).
Sorry, I can't be of more help.
If it is hardly nonlinear there is no better method than brute force (you will never know if the minimum is local or if some flat-looking fragment doesn't have any narrow and deep valleys), except of course symbolic computation (which probably won't work because the function is too complicated) or soft computing, I mean things like genetic algorithms, Monte-Carlo, swarms, etc. (here you don't have a guarantee that it will find the very global minimum and because you have integer x it can be slower than brute force).
http://cran.r-project.org/web/views/Optimization.html lists the packages Rdonlp2 and Rsolnp which may be suitable.
Discrete filled function method is one of the recent methods that can find global solution of nonlinear integer programming with about 100 constraints and variables.

Resources