Find N best solutions using nloptr - r

I am using nloptr in R, however, I want to give my model more freedom since the best solution and avoid overfitting. I have described my problem earlier in this question:
nloptr(x0, eval_f, eval_grad_f = NULL, lb = NULL, ub = NULL,
eval_g_ineq = NULL, eval_jac_g_ineq = NULL, eval_g_eq = NULL,
eval_jac_g_eq = NULL, opts = list(), ... )
Basically I have a non-linear problem to solve. I have a function to minimize and some non-linear constrains. But I don't want to use the best found solution because it overfits the in sample data and gives me extreme values. Hence I want to find N best solutions and then choose ones I want.
So now I am wondering is there a way of finding N best solutions which nloptr finds during the iteration. Are there any other ways of doing this except nloptr?

This is not really an answer, but rather a long comment, which I hope will be helpful.
I agree with #tonytonov that you should better define "second best" and your general needs. regardless, to get N different solutions, that are not just very near each other, I would run nloptr iteratively, each time with a slightly different target function, each time adding a penalty for being near the previous solution. here is an example:
sols = list()
evalf= list(eval_f)
for (i in 1:N) {
sols[i] = nloptr(x0,evalf[[i]],...)
# now creating a new evaluation function which adds a log(distance) penalty to the
# last solution
evalf[[i+1]] = function(x) {evalf[[i]](x)-log(sum((x-sols[i]$solution)^2))}
}
you can think of a different penalty of course, the idea is to add a big big penalty for being very close to an existing solution, but once you get relatively far from it (you should know, what does it mean to be far enough - this is context specific), the penalty is relatively flat, and hence does not affect the original minimum points.
you should of course check that the last solution exists, and probably change the starting point (x0) from one iteration to another, but you get the point, I think.
More generally, as you try to avoid overfitting, I would think of adding a penalty to your eval function in the first place. For example, a sign of overfitting in regression analysis is the magnitude of the coefficients, so it is typical to try minimzing not the square root of error (the typical OLS method) in determining the regression estimation, but rather the square root of error + the sum of coefficients (normalized in some way), which create a preference for small coefficients, and hence decrease the likelihood of overfitting.
I know very little on your specific problem, but maybe you can come up with some "penalty" function that will decrease overfitting when minimized.
Another approach, if your eval_f depends on data, would be to use the same evaluation function but on bootstrap subsamples of your data. each time you get a differen minimum (because of the different sample). you get N such solutions and you can average them up or do anything you want to generate a non-overfitting solution (now the solution does not overfit the data, because each solution is based on different part of the data).
I hope it helps.

Related

How to remove oscillations in a solution to Differential Equation having indeterminate form using DifferentialEquations.jl

I am trying to solve differential equations with indeterminate forms and I am playing with Julia and DifferentialEquations.jl package to do so.
I started with a simple differential equation with indeterminate form that I know of (as a test case) to see if this might work ( My eventual goal is to solve a complex system of differential equations and couple it with Machine Learning. This requires substantial amount of effort on my part to analyze and code, which I would only like to begin once I know that some basic test cases work).
Here is the test case code I started with:
bondi(u,p,t) = -(2*k/t^2)*(1 - (2*a^2*t/k))/(1 - a^2/u)
u0 = 0.01
p = (1e4, 10)
k,a = p
tc = k/(2*a^2)
tspan = (tc/10, tc*5)
prob = ODEProblem(bondi,u0,tspan,p)
The analytical solution to this problem (known as the Bondi flow problem in Astrophysics) is well known (even for the indeterminate cases). I am interested in the indeterminate solutions obtained from the solver.
When I solve using sol = solve(prob) I am getting an oscillating solution quite different from the smooth analytical solution I am aware exists (see the figure in the link below).
Plot u vs t
I was expecting to encounter some 'issues' as t approaches 50 (while simultaneously the y axis variable (representing speed) denoted by u would approach 100) as only then the numerator (and denominator) would vanish together. Any ideas why the solutions starts to oscillate?
I also tried with sol = solve(prob, alg_hints = [:stiff]) and got the following warning:
Warning: Interrupted. Larger maxiters is needed.
# DiffEqBase C:\Users\User.julia\packages\DiffEqBase\1yTcS\src\integrator_interface.jl:329
The solution still oscillates (similar to the solutions obtained without imposing stiffness).
Am I doing something incorrect here?
Is there another way to solve such indeterminate equations with the DifferentialEquations.jl package?
That ODE is highly numerically unstable at the singularity. Time stepping based methods are prone to explode when encountering that kind of problem. You need to use a collocation based method, like something in ApproxFun.jl, to avoid directly evaluating at the singularity in order to stabilize the solve.

How can optimization be used as a solver?

In a question on Cross Validated (How to simulate censored data), I saw that the optim function was used as a kind of solver instead of as an optimizer. Here is an example:
optim(1, fn=function(scl){(pweibull(.88, shape=.5, scale=scl, lower.tail=F)-.15)^2})
# $par
# [1] 0.2445312
# ...
pweibull(.88, shape=.5, scale=0.2445312, lower.tail=F)
# [1] 0.1500135
I have found a tutorial on optim here, but I am still not able to figure out how to use optim to work as a solver. I have several questions:
What is first parameter (i.e., the value 1 being passed in)?
What is the function that is passed in?
I can understand that it is taking the Weibull probability distribution and subtracting 0.15, but why are we squaring the result?
I believe you are referring to my answer. Let's walk through a few points:
The OP (of that question) wanted to generate (pseudo-)random data from a Weibull distribution with specified shape and scale parameters, and where the censoring would be applied for all data past a certain censoring time, and end up with a prespecified censoring rate. The problem is that once you have specified any three of those, the fourth is necessarily fixed. You cannot specify all four simultaneously unless you are very lucky and the values you specify happen to fit together perfectly. As it happened, the OP was not so lucky with the four preferred values—it was impossible to have all four as they were inconsistent. At that point, you can decide to specify any three and solve for the last. The code I presented were examples of how to do that.
As noted in the documentation for ?optim, the first argument is par "[i]nitial values for the parameters to be optimized over".
Very loosely, the way the optimization routine works is that it calculates an output value given a function and an input value. Then it 'looks around' to see if moving to a different input value would lead to a better output value. If that appears to be the case, it moves in that direction and starts the process again. (It stops when it does not appear that moving in either direction will yield a better output value.)
The point is that is has to start somewhere, and the user is obliged to specify that value. In each case, I started with the OP's preferred value (although really I could have started most anywhere).
The function that I passed in is ?pweibull. It is the cumulative distribution function (CDF) of the Weibull distribution. It takes a quantile (X value) as its input and returns the proportion of the distribution that has been passed through up to that point. Because the OP wanted to censor the most extreme 15% of that distribution, I specified that pweibull return the proportion that had not yet been passed through instead (that is the lower.tail=F part). I then subtracted.15 from the result.
Thus, the ideal output (from my point of view) would be 0. However, it is possible to get values below zero by finding a scale parameter that makes the output of pweibull < .15. Since optim (or really most any optimizer) finds the input value that minimizes the output value, that is what it would have done. To keep that from happening, I squared the difference. That means that when the optimizer went 'too far' and found a scale parameter that yielded an output of .05 from pweibull, and the difference was -.10 (i.e., < 0), the squaring makes the ultimate output +.01 (i.e., > 0, or worse). This would push the optimizer back towards the scale parameter that makes pweibull output (.15-.15)^2 = 0.
In general, the distinction you are making between an "optimizer" and a "solver" is opaque to me. They seem like two different views of the same elephant.
Another possible confusion here involves optimization vs. regression. Optimization is simply about finding an input value[s] that minimizes (maximizes) the output of a function. In regression, we conceptualize data as draws from a data generating process that is a stochastic function. Given a set of realized values and a functional form, we use optimization techniques to estimate the parameters of the function, thus extracting the data generating process from noisy instances. Part of regression analyses partakes of optimization then, but other aspects of regression are less concerned with optimization and optimization itself is much larger than regression. For example, the functions optimized in my answer to the other question are deterministic, and there were no "data" being analyzed.

Hmm training with multiple observations and mhsmm package in R

i wanted to train a new hmm model, by means of Poisson observations that are the only thing i know.
I'm using the mhsmm package for R.
The first thing that bugs me is the initialization of the model, in the examples is:
J<-3
initial <- rep(1/J,J)
P <- matrix(1/J, nrow = J, ncol = J)
b <- list(lambda=c(1,3,6))
model = hmmspec(init=initial, trans=P, parms.emission=b,dens.emission=dpois.hsmm)
in my case i don't have initial values for the emission distribution parameters, that's what i want to estimate. How?
Secondly: if i only have observations, how do i pass them to
h1 = hmmfit(list_of_observations, model ,mstep=mstep.pois)
in order to obtain the trained model?
list_of_observations, in the examples, contains a vector of states, one of observations and one of observation sequence length and is usually obtained by a simulation of the model:
list_of_observations = simulate(model, N, rand.emis = rpois.hsmm)
EDIT: Found this old question with an answer that partially solved my problem:
MHSMM package in R-Input Format?
These two lines did the trick:
train <- list(x = data.df$sequences, N = N)
class(train) <- "hsmm.data"
where data.df$sequences is the array containing all observations sequences and N is the array containing the count of observations for each sequence.
Still, the initial model is totally random, but i guess this is the way it is meant to be since it will be re-estimated, am i right?
The problem of initialization is critical not only for HMMs and HSMMs, but for all learning methods based on a form of the Expectation-Maximization algorithm. EM converges to a local optimum in terms of likelihood between model and data, but that does not always guarantee to reach the global optimum.
Goal: find estimates of the emission distribution but it also works for initial probability and transition matrix
Algorithm: needs initial estimate to start the optimisation from
You: have to provide an initial "guess" of the parameters
This may seem confusing at first, but the EM algorithm needs a point to start the optimisation. Then it makes some computations and it gives you a better estimate of your own initial guess (re-estimation, as you said). It is not able to just find the best parameters on its own, without being initialised.
From my experience, there is no general way to initialise the parameters that guarantee to converge to a global optimum, but it will depend more on the case at hand. That's why initialisation plays a critical role (mostly for emission distribution).
What I used to do in such a case is to separate the training data in different groups (e.g. percentiles of a certain parameter in the set), estimate the parameters on these groups, and then use them as initial parameter estimates for the EM algorithm. Basically, you have to try different methods and see which one works best.
I'd recommend to search the literature if similar problems have been solved with HMM, and try their initialisation method.

High order PDEs

I'm trying to solve a 6th-order nonlinear PDE (1D) with fixed boundary values (extended Fisher-Kolmogorov - EFK). After failing with FTCS, next attempt is MoL (either central in space or FEM) using e.g. LSODES.
How can this be implemented? Using Python/C + OpenMP so far, but need some pointers
to do this efficiently.
EFK with additional 6th order term:
u_t = d u_6x - g u_4x + u_xx + u-u^3
where d, g are real coefficients.
u(x,0) = exp(-x^2/16),
ux = 0 on boundary
domain is [0,300] and dx << 1 since i'm looking for pattern formations (subject to the values
of d, g)
I hope this is sufficient information.
All PDE solutions like this will ultimately end up being expressed using linear algebra in your program, so the trick is to figure out how to get the PDE into that form before you start coding.
Finite element methods usually begin with a weighted residual method. Non-linear equations will require a linear approximation and iterative methods like Newton-Raphson. I would recommend that you start there.
Yours is a transient solution, so you'll have to do time stepping. You can either use an explicit method and live with the small time steps that stability limits will demand or an implicit method, which will force you to do a matrix inversion at each step.
I'd do a Fourier analysis first of the linear piece to get an idea of the stability requirements.
The only term in that equation that makes it non-linear is the last one: -u^3. Have you tried starting by leaving that term off and solving the linear equation that remains?
UPDATE: Some additional thoughts prompted by comments:
I understand how important the u^3 term is. Diffusion is a 2nd order derivative w.r.t. space, so I wouldn't be so certain that a 6th order equation will follow suit. My experience with PDEs comes from branches of physics that don't have 6th order equations, so I honestly don't know what the solution might look like. I'd solve the linear problem first to get a feel for it.
As for stability and explicit methods, it's dogma that the stability limits placed on time step size makes them likely to fail, but the probability isn't 1.0. I think map reduce and cloud computing might make an explicit solution more viable than it was even 10-20 years ago. Explicit dynamics have become a mainstream way to solve difficult statics problems, because they don't require a matrix inversion.

Fitting a binormal distribution in R

As from title, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting to the data distribution the sum of two normal with means m1 and m2 and standard deviations s1 and s2. The two gaussians are scaled by a weight factor such that w1+w2 = 1
I can succeed to do this using the vglm function of the VGAM package such as:
fitRes <- vglm(mydata ~ 1, mix2normal1(equalsd=FALSE),
iphi=w, imu=m1, imu2=m2, isd1=s1, isd2=s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data in a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) how do I speed up the fit process? I tried to use nls or mle that look much faster but mostly failed to get good fit (but succeeded in getting all the possible errors these function could throw on me). Also is not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1)
2) how do I automagically choose some good starting parameters (I know this is a $1 million question but you'll never know, maybe someone has the answer)? Right now I have a little interface that allow me to choose the parameters and visually see what the initial distribution would look like which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x corresponding to the 3rd and 4th quartiles of the y as starting parameters for the two mean? Do you thing that would be a reasonable thing to do?
First things first:
did you try to search for fit mixture model on RSeek.org?
did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.

Resources