Solver for non-linear least squares with boundary constraints - julia

I'm looking for an analog to Matlab's lsqnonlin function in Julia.
LsqFit.jl looks great, but doesn't accept the same arguments Matlab's implementation does; specifically:
Lower bounds
Upper bounds
Initial conditions
where initial conditions, lower, and upper bounds are vectors of length 6.
Any advice would be awesome. Thanks!

Actually, it does; it's just not explained in the README (for good measure, here is a stable link).
It is unclear what you mean by initial conditions. If you mean initial parameters, this is very much possible.
using LsqFit
# a two-parameter exponential model
# x: array of independent variables
# p: array of model parameters
model(x, p) = p[1]*exp.(-x.*p[2])
# some example data
# xdata: independent variables
# ydata: dependent variable
xdata = range(0, stop=10, length=20)  # linspace was removed in Julia 1.0
ydata = model(xdata, [1.0, 2.0]) + 0.01*randn(length(xdata))
p0 = [0.5, 0.5]
fit = curve_fit(model, xdata, ydata, p0)
(taken from the manual). Here p0 is the initial parameter vector.
This will give you something very close to [1.0, 2.0]. But what if we want to constrain the parameters to lie in [0,1]x[0,1]? Then we simply set the keyword arguments lower and upper to be vectors of lower and upper bounds:
fit = curve_fit(model, xdata, ydata, p0; lower = zeros(2), upper = ones(2))
That should give something like [1.0, 1.0] depending on your exact data.

Maybe it's not a proper answer, but I have had some success in the past
adding a penalty term to the cost function that applies outside the bounds, something like a steep exponential with step-like behaviour. The downside, of course, is that you have to define your cost function manually.
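A minimal sketch of that penalty idea, written in Python for illustration rather than Julia. The function names, the penalty strength and the overflow cap are all assumptions, not part of LsqFit.jl or any other library; the model is the two-parameter exponential from the answer above.

```python
import math

def model(x, p):
    """Two-parameter exponential model p[0] * exp(-p[1] * x)."""
    return p[0] * math.exp(-p[1] * x)

def bound_penalty(p, lower, upper, strength=50.0):
    """Zero inside the box, growing exponentially with the distance outside it.

    `strength` is an assumed tuning constant; the exponent is capped at 700
    so that math.expm1 cannot overflow.
    """
    total = 0.0
    for value, lo, hi in zip(p, lower, upper):
        excess = max(lo - value, 0.0, value - hi)  # 0.0 when lo <= value <= hi
        total += math.expm1(min(strength * excess, 700.0))
    return total

def penalized_cost(p, xdata, ydata, lower, upper):
    """Sum of squared residuals plus the out-of-bounds penalty."""
    sse = sum((y - model(x, p)) ** 2 for x, y in zip(xdata, ydata))
    return sse + bound_penalty(p, lower, upper)

# Noise-free example data generated with the "true" parameters [1.0, 2.0].
xdata = [10 * i / 19 for i in range(20)]
ydata = [model(x, [1.0, 2.0]) for x in xdata]
lower, upper = [0.0, 0.0], [1.0, 1.0]

cost_inside = penalized_cost([1.0, 1.0], xdata, ydata, lower, upper)
cost_outside = penalized_cost([1.0, 2.0], xdata, ydata, lower, upper)
```

Any unconstrained minimiser can then be applied to penalized_cost. Here the unconstrained optimum [1.0, 2.0] lies outside the box, so it ends up with a far larger cost than the boundary point [1.0, 1.0].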


How to use the ncp parameter in R's qt function?

I'm using R to make some calculations. This question is about R but also about statistics.
Say I have a dataset of paired samples consisting of a subject's blood platelet concentration after injection of placebo and then again after injection of medication for a number of subjects. I want to estimate the mean difference for the paired samples. I'm just learning about the t distribution. If I wanted a 95% confidence interval for the mean difference using a Z-test, I could simply use:
mydata$diff <- mydata$medication - mydata$placebo
mu0 <- mean(mydata$diff)
sdmu <- sd(mydata$diff) / sqrt(length(mydata$diff))
qnorm(c(0.025, 0.975), mu0, sdmu)
After much confusion and cross-checking with the t.test function, I've figured out that I can get the 95% confidence interval for a t-test with:
qt(c(0.025, 0.975), df=19) * sdmu + mu0
My understanding of this is as follows:
Tstatistic = (mu - mu0)/sdmu
Tcdf^-1(0.025) <= (mu - mu0) / sdmu <= Tcdf^-1(0.975)
sdmu * Tcdf^-1(0.025) + mu0 <= mu <= sdmu * Tcdf^-1(0.975) + mu0
The reason this is confusing is that if I were using a Z-test, I would write it like this:
qnorm(c(0.025, 0.975), mu0, sdmu)
and it's not until I tried to figure out how to use the t distribution that I realised I could move the normal distribution parameters out of the function too:
qnorm(c(0.025, 0.975), 0, 1) * sdmu + mu0
Trying to wrap my head around what this means mathematically, it means that the Z-statistic (mu - mu0)/sdmu is always normally distributed with mean 0 and standard deviation of 1?
What has me stumped is that I'd like to move the t distribution parameters into the arguments to the function to cut down on the enormous mental overhead of thinking about this transformation.
However, according to my version of the R function qt's documentation, in order to do this, I would need to calculate the non-centrality parameter ncp. According to (my version of) the documentation, the ncp is explained as follows:
Let T= (mX - m0) / (S/sqrt(n)) where mX is the mean and S the sample standard deviation (sd) of X_1, X_2, …, X_n which are i.i.d. N(μ, σ^2) Then T is distributed as non-central t with df= n - 1 degrees of freedom and non-centrality parameter ncp = (μ - m0) * sqrt(n)/σ.
I can't wrap my head around this at all. At first it seems to fit into my framework because Tstatistic = (mu - m0) / sdmu. But isn't μ what I want the qt function (which is Tcdf^-1) to return? How can it appear in the ncp, which I need to give as an input? And what about σ? What do μ and σ mean in this context?
Basically, how can I get the same result as qt(c(0.025, 0.975), df=19) * sdmu + mu0, without any terms outside of the function call, and could I have an explanation of how it works?
Let me try to explain without using any formulae.
First of all, the student t distribution and the normal distribution are two distinct probability distributions and (in most situations) are not supposed to give you the same results.
The t distribution is the appropriate probability distribution to test for a difference between two normally distributed samples. Since we do not know the population sd, we have to stick with the one we get from the sample. The resulting test statistic is no longer normally distributed; it is t-distributed.
The z-distribution can be used to approximate the test; that is, we use the z-distribution as an approximation of the t-distribution. However, it is recommended not to do this with low degrees of freedom. Reason: the more degrees of freedom a t distribution has, the more similar it becomes to a normal distribution. Textbooks usually say that with df > 30 the t and normal distributions are similar enough to approximate t with the normal distribution. In order to do that, you would first have to normalise your data so that mean = 0 and sd = 1. Then you can do the approximation using the z-distribution.
I usually recommend not using this approximation. It was a reasonable crutch when calculations had to be done on paper using your head, a pen, and a bunch of tables. There are many workarounds in basic statistics that were meant to give you a reasonable result with less computational effort. With modern computers they are usually obsolete (in most cases at least).
The z distribution, by the way, is defined (by convention) as a normal distribution N(0, 1) i.e. a normal distribution with mean = 0 and sd = 1.
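To make the location-scale point concrete, here is a small Python check (standard library only) that a quantile of N(mu0, sdmu) equals the corresponding N(0, 1) quantile times sdmu plus mu0. The values of mu0 and sdmu are made-up numbers for illustration. R's qt takes only df and ncp, with no mean or sd arguments, which is why the shift and scale have to stay outside the call in the t case.

```python
from statistics import NormalDist

mu0, sdmu = 1.8, 0.4  # hypothetical sample mean and standard error
probs = (0.025, 0.975)

# Quantiles taken directly from N(mu0, sdmu) ...
direct = [NormalDist(mu0, sdmu).inv_cdf(p) for p in probs]

# ... and from the standard normal, shifted and scaled afterwards.
rescaled = [NormalDist(0.0, 1.0).inv_cdf(p) * sdmu + mu0 for p in probs]

# Largest absolute disagreement between the two routes.
max_gap = max(abs(a - b) for a, b in zip(direct, rescaled))
```

Both routes give the same interval up to floating-point rounding, which is the identity the question noticed for qnorm.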
Finally, about the different ways these distributions are specified. The normal distribution is actually the only probability distribution I know of that you can specify by setting mean and sd directly (there are dozens of distributions, in case you're interested). The non-centrality parameter has an effect similar to that of the mean of the normal distribution: in a plot it moves the t-distribution along the x-axis. But it also changes its shape and skews it, so that mean and ncp move away from each other.
This code will show how the ncp changes the shape and location of the t-distribution:
x <- seq(-5, 15, 0.1)
plot(x, dt(x, df = 10, ncp = 0), type = "l")  # from/to are not arguments of plot() for x-y data
for(ncp in 1:6)
lines(x, dt(x, df = 10, ncp = ncp))

Weibull Distribution parameter estimation error

I used the following function to estimate the three-parameter Weibull distribution.
library(FAdist)  # provides rweibull3() and dweibull3()
library(bbmle)   # provides mle2()
xl <- rweibull3(50, shape = 1, scale = 1, thres = 0)
dweib3l <- function(shape, scale, thres) {
-sum(dweibull3(xl, shape, scale, thres, log=TRUE))
}
ml <- mle2(dweib3l, start = list(shape = 1, scale = 1, thres = 0), data = list(xl))
However, when I run the above function I am getting the following error.
Error in optim(par = c(shape = 1, scale = 1, thres = 0), fn = function (p) :
non-finite finite-difference value [3]
In addition: There were 16 warnings (use warnings() to see them)
Is there any way to overcome this issue?
Thank you!
The problem is that the threshold parameter is special: it defines a sharp boundary to the distribution, so any value of thres above the minimum value of the data will give zero likelihoods (-Inf negative log-likelihoods): if a given value of xl is less than the specified threshold, then it's impossible according to the statistical model you have defined. Furthermore, we know already that the maximum likelihood value of the threshold is equal to the minimum value in the data set (analogous results hold for MLE estimation of the bounds of a uniform distribution ...)
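The boundary effect can be reproduced in a few lines. The sketch below is in Python rather than R and uses a hand-written three-parameter Weibull log-density and a small made-up sample, so every name in it is an assumption for illustration; it is not the FAdist implementation. Points at or below the threshold are treated as outside the support for simplicity.

```python
import math

def weibull3_logpdf(x, shape, scale, thres):
    """Log-density of a three-parameter Weibull; -inf at or below the threshold."""
    z = (x - thres) / scale
    if z <= 0:
        return -math.inf
    return math.log(shape / scale) + (shape - 1) * math.log(z) - z ** shape

def nll(data, shape, scale, thres):
    """Negative log-likelihood of the sample."""
    return -sum(weibull3_logpdf(x, shape, scale, thres) for x in data)

data = [0.31, 0.75, 1.20, 2.05, 0.52]  # hypothetical sample, min(data) = 0.31

nll_far_below = nll(data, 1.0, 1.0, 0.0)               # finite
nll_just_below = nll(data, 1.0, 1.0, 0.30)             # finite and smaller
nll_above = nll(data, 1.0, 1.0, min(data) + 0.01)      # infinite
```

With shape and scale held at 1, the negative log-likelihood keeps dropping as the threshold approaches min(data) from below and jumps to infinity as soon as it crosses it, which is what upsets a finite-difference gradient in the optimiser.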
I don't know why the other questions on SO that address this question don't encounter this particular problem - it may be because they use a starting value of the threshold that's far enough below the minimum value in the data set ...
Below, I use a fixed value of min(xl)-1e-5 for the threshold (shifting the value downward avoids numerical problems when the value is exactly on the boundary). I also use the formula interface so we can call the dweibull3() function directly, and put lower bounds on the shape and scale parameters (as a result I need to use method="L-BFGS-B", which allows for constraints).
ml <- mle2(xl ~ dweibull3(shape=shape, scale = scale,
start=list(shape=1, scale=1),
(The formula interface is convenient for simple examples: if you want to do something very much more complicated you may want to go back to defining your own log-likelihood function explicitly.)
If you insist on fitting the threshold parameter, you can do it by setting an upper bound that is (nearly) equal to the minimum value that occurs in the data [any larger value will give NA values and thus break the optimization]. However, you will find that the estimate of the threshold parameter always converges to this upper bound ... so this approach is really getting to the previous answer the hard way (you'll also get warnings about parameters being on the boundary, and about not being able to invert the Hessian).
eps <- 1e-8
ml3 <- mle2(xl ~ dweibull3(shape=shape, scale = scale, thres = thres),
start=list(shape=1, scale=1, thres=-5),
For what it's worth it does seem to be possible to fit the model without fixing the threshold parameter, if you start with a small value and use Nelder-Mead optimization: however, it seems to give unreliable results.

Brent's method in the R optim implementation always returns the same local minimum

I'm trying to minimise the function shown above. I'm searching between (-1,1). I use the following code
optim(runif(1,min=-1,max=+1), ..., method = "Brent", lower = -1.0, upper = 1.0)
and I've noticed that it always returns a value of x = -0.73 instead of the correct x = 0.88 answer. The reason is given in the optimise help page:
The first evaluation of f is always at x_1 = a + (1-φ)(b-a) where (a,b) = (lower, upper) and φ = (sqrt(5) - 1)/2 = 0.61803.. is the golden section ratio. Almost always, the second evaluation is at x_2 = a + φ(b-a). Note that a local minimum inside [x_1,x_2] will be found as solution, even when f is constant in there, see the last example.
I'm curious if there is any way to use Brent's method without hitting the same local minimum each time.
Changing method to "L-BFGS-B" works better (a random local minimum is returned each time):
optim(runif(1,min=-1,max=+1), ..., method = "L-BFGS-B", lower = -1.0, upper = 1.0)
Your function is NOT convex, so it can have several local minima and maxima in addition to the global ones. For a function like this I would run a derivative-free global optimizer, such as simulated annealing or a genetic algorithm, and use its output as the starting point for BFGS or another local optimizer to get a precise solution. Repeating that step many times from different starting points should uncover most of the global and local optima.
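A simpler variant of the same restart idea for the one-dimensional case is to split the search interval into pieces, run a local search on each piece and keep the best result. The sketch below is in Python and uses a plain golden-section search, not Brent's method and not R's optim. The objective is a made-up multimodal function, since the original one is not shown, and the number of pieces is an arbitrary choice.

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Golden-section search for a local minimum of f on [a, b]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 0.61803...
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # the minimum is bracketed by [a, d]
        else:
            a = c  # the minimum is bracketed by [c, b]
    return (a + b) / 2

def multistart_minimize(f, lower, upper, pieces=8):
    """Run the local search on each sub-interval and keep the best candidate."""
    width = (upper - lower) / pieces
    candidates = [
        golden_section(f, lower + i * width, lower + (i + 1) * width)
        for i in range(pieces)
    ]
    return min(candidates, key=f)

def example(x):
    """Hypothetical objective with two local minima on (-1, 1)."""
    return math.sin(5 * x) + 0.5 * x

best = multistart_minimize(example, -1.0, 1.0)
```

Because every sub-interval is searched, the result no longer depends on where the first two evaluation points of a single search happen to fall. The price is one local search per piece, and a minimum can still be missed if a piece contains more than one.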

Quadrature to approximate a transformed beta distribution in R

I am using R to run a simulation in which I use a likelihood ratio test to compare two nested item response models. One version of the LRT uses the joint likelihood function L(θ,ρ) and the other uses the marginal likelihood function L(ρ). I want to integrate L(θ,ρ) over f(θ) to obtain the marginal likelihood L(ρ). I have two conditions: in one, f(θ) is standard normal (μ=0,σ=1), and my understanding is that I can just pick a number of abscissa points, say 20 or 30, and use Gauss-Hermite quadrature to approximate this density. But in the other condition, f(θ) is a linearly transformed beta distribution (a=1.25,b=10), where the linear transformation B'=11.14*(B-0.11) is such that B' also has (approximately) μ=0,σ=1.
I am confused enough about how to implement quadrature for a beta distribution but then the linear transformation confuses me even more. My question is threefold: (1) can I use some variation of quadrature to approximate f(θ) when θ is distributed as this linearly transformed beta distribution, (2) how would I implement this in R, and (3) is this a ridiculous waste of time such that there is an obviously much faster and better method to accomplish this task? (I tried writing my own numerical approximation function but found that my implementation of it, being limited to the R language, was just too slow to suffice.)
First, I assume you can express your L(θ,ρ) and f(θ) in terms of actual code; otherwise you're kinda screwed. Given that assumption, you can use integrate to perform the necessary computations. Something like this should get you started; just plug in your expressions for L and f.
marglik <- function(rho) {
integrand <- function(theta, rho) L(theta, rho) * f(theta)
# set your lower/upper integration limits as appropriate
integrate(integrand, lower=-5, upper=5, rho=rho)
}
For this to work, your integrand has to be vectorized; ie, given a vector input for theta, it must return a vector of outputs. If your code doesn't fit the bill, you can use Vectorize on the integrand function before passing it to integrate:
integrand <- Vectorize(integrand, "theta")
Edit: not sure if you're also asking how to define f(θ) for the transformed beta distribution; that seems rather elementary for someone working with joint and marginal likelihoods. But if you are, then the density of B' = a*B + b, given f(B), is
f'(B') = f(B)/a = f((B' - b)/a) / a
So in your case, f(theta) is dbeta(theta/11.14 + 0.11, 1.25, 10) / 11.14
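As a sanity check on that change of variables, the following Python sketch (standard library only) builds the transformed density and integrates it numerically with a midpoint rule. The function names and the number of grid cells are assumptions for illustration; the shape parameters and the transformation B' = 11.14*(B - 0.11) are the ones from the question.

```python
import math

A_SHAPE, B_SHAPE = 1.25, 10.0
SCALE, SHIFT = 11.14, 0.11  # B' = SCALE * (B - SHIFT)

def beta_pdf(x, a, b):
    """Density of Beta(a, b); zero outside (0, 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def f_theta(theta):
    """Density of the linearly transformed beta variable."""
    return beta_pdf(theta / SCALE + SHIFT, A_SHAPE, B_SHAPE) / SCALE

def moments(n=20000):
    """Midpoint-rule estimates of total mass, mean and sd of f_theta."""
    lo, hi = -SHIFT * SCALE, (1 - SHIFT) * SCALE  # image of (0, 1)
    h = (hi - lo) / n
    total = mean = second = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        w = f_theta(t) * h
        total += w
        mean += t * w
        second += t * t * w
    return total, mean, math.sqrt(second - mean ** 2)

total, mean, sd = moments()
```

The total mass should come out very close to 1, with a mean near 0 and a standard deviation near 1, in line with the "approximately μ=0, σ=1" statement in the question.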

Curve fitting: Find the smoothest function that satisfies a list of constraints

Consider the set of non-decreasing surjective (onto) functions from (-inf,inf) to [0,1].
(Typical CDFs satisfy this property.)
In other words, for any real number x, 0 <= f(x) <= 1.
The logistic function is perhaps the most well-known example.
We are now given some constraints in the form of a list of x-values and for each x-value, a pair of y-values that the function must lie between.
We can represent that as a list of {x,ymin,ymax} triples such as
constraints = {{0, 0, 0}, {1, 0.00311936, 0.00416369}, {2, 0.0847077, 0.109064},
{3, 0.272142, 0.354692}, {4, 0.53198, 0.646113}, {5, 0.623413, 0.743102},
{6, 0.744714, 0.905966}}
Graphically that looks like this:
We now seek a curve that respects those constraints.
For example:
Let's first try a simple interpolation through the midpoints of the constraints:
mids = ({#1, Mean[{#2,#3}]}&) @@@ constraints
f = Interpolation[mids, InterpolationOrder->0]
Plotted, f looks like this:
That function is not surjective. Also, we'd like it to be smoother.
We can increase the interpolation order but now it violates the constraint that its range is [0,1]:
The goal, then, is to find the smoothest function that satisfies the constraints:
Tends to 0 as x approaches negative infinity and tends to 1 as x approaches infinity.
Passes through a given list of y-error-bars.
The first example I plotted above seems to be a good candidate but I did that with Mathematica's FindFit function assuming a lognormal CDF.
That works well in this specific example but in general there need not be a lognormal CDF that satisfies the constraints.
I don't think you've specified enough criteria to make the desired CDF unique.
If the only criteria that must hold are:
CDF must be "fairly smooth" (see below)
CDF must be non-decreasing
CDF must pass through the "error bar" y-intervals
CDF must tend toward 0 as x --> -Infinity
CDF must tend toward 1 as x --> Infinity.
then perhaps you could use Monotone Cubic Interpolation.
This will give you a C^1 (continuously differentiable) function which,
unlike cubic splines, is guaranteed to be monotone when given monotone data.
This leaves open the question, exactly what data should you use to generate the
monotone cubic interpolation. If you take the center point (mean) of each error
bar, are you guaranteed that the resulting data points are monotonically
increasing? If not, you might as well make some arbitrary choice to guarantee
that the points you select are monotonically increasing (because the criteria do not force our solution to be unique).
Now what to do about the last data point? Is there an X which is guaranteed to
be larger than any x in the constraints data set? Perhaps you can again make an
arbitrary choice of convenience and pick some very large X and put (X,1) as the
final data point.
Comment 1: Your problem can be broken into 2 sub-problems:
Given exact points (x_i,y_i) through which the CDF must pass, how do you generate CDF? I suspect there are infinitely many possible solutions, even with the infinite-smoothness constraint.
Given y-errorbars, how should you pick (x_i,y_i)? Again, there are infinitely many possible solutions. Some additional criteria may need to be added to force a unique choice. Additional criteria would also probably make the problem even harder than it currently is.
Comment 2: Here is a way to use monotonic cubic interpolation, and satisfy criteria 4 and 5:
The monotonic cubic interpolation (let's call it f) maps R --> R.
Let CDF(x) = exp(-exp(f(x))). Then CDF: R --> (0,1). If we could find the appropriate f, then by defining CDF this way, we could satisfy criteria 4 and 5.
To find f, transform the CDF constraints (x_0,y_0),...,(x_n,y_n) using the transformation xhat_i = x_i, yhat_i = log(-log(y_i)). This is the inverse of the CDF transformation. If the y_i's were increasing, then the yhat_i's are decreasing.
Now apply monotone cubic interpolation to the (x_hat,y_hat) data points to generate f. Then finally, define CDF(x) = exp(-exp(f(x))). This will be a monotonically increasing function from R --> (0,1), which passes through the points (x_i,y_i).
This, I think, satisfies criteria 2--5. Criterion 1 is somewhat satisfied, though there certainly could exist smoother solutions.
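A minimal Python sketch of the recipe in Comment 2, under several assumptions. The data points are the midpoints of the example error bars from the question. The {0, 0, 0} triple is left out because log(-log(0)) is not finite. Node slopes use a harmonic-mean rule (one common way to keep a cubic Hermite interpolant monotone), and f is extended linearly beyond the first and last points. All function names are made up for illustration.

```python
import math

def monotone_slopes(xs, ys):
    """Node slopes: harmonic mean of adjacent secants, zero at local extrema."""
    d = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    m = [d[0]]
    for left, right in zip(d, d[1:]):
        m.append(0.0 if left * right <= 0 else 2 * left * right / (left + right))
    m.append(d[-1])
    return m

def make_interpolant(xs, ys):
    """Monotone cubic Hermite interpolant with linear extrapolation at both ends."""
    m = monotone_slopes(xs, ys)
    def f(x):
        if x <= xs[0]:
            return ys[0] + m[0] * (x - xs[0])
        if x >= xs[-1]:
            return ys[-1] + m[-1] * (x - xs[-1])
        i = max(k for k in range(len(xs) - 1) if xs[k] <= x)
        h = xs[i + 1] - xs[i]
        t = (x - xs[i]) / h
        h00 = (1 + 2 * t) * (1 - t) ** 2
        h10 = t * (1 - t) ** 2
        h01 = t * t * (3 - 2 * t)
        h11 = t * t * (t - 1)
        return h00 * ys[i] + h10 * h * m[i] + h01 * ys[i + 1] + h11 * h * m[i + 1]
    return f

constraints = [(1, 0.00311936, 0.00416369), (2, 0.0847077, 0.109064),
               (3, 0.272142, 0.354692), (4, 0.53198, 0.646113),
               (5, 0.623413, 0.743102), (6, 0.744714, 0.905966)]
xs = [float(x) for x, _, _ in constraints]
mids = [(lo + hi) / 2 for _, lo, hi in constraints]
yhat = [math.log(-math.log(y)) for y in mids]  # decreasing when mids increase
f = make_interpolant(xs, yhat)

def cdf(x):
    """CDF(x) = exp(-exp(f(x))), with a guard against overflow far to the left."""
    v = f(x)
    return 0.0 if v > 700 else math.exp(-math.exp(v))
```

Since f is non-increasing and unbounded in both directions, cdf is non-decreasing, passes through the chosen midpoints, and tends to 0 on the left and 1 on the right.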
I have found a solution that gives reasonable results for a variety of inputs.
I start by fitting a model -- once to the low ends of the constraints, and again to the high ends.
I'll refer to the mean of these two fitted functions as the "ideal function".
I use this ideal function to extrapolate to the left and to the right of where the constraints end, as well as to interpolate between any gaps in the constraints.
I compute values for the ideal function at regular intervals, including all the constraints, from where the function is nearly zero on the left to where it's nearly one on the right.
At the constraints, I clip these values as necessary to satisfy the constraints.
Finally, I construct an interpolating function that goes through these values.
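The clipping step is the part that enforces the constraints, so here is a tiny Python illustration of it on its own. The "ideal function" is a logistic curve of the form 1/(1 + c*exp(-k*x)) with made-up values for c and k, standing in for the mean of the two fitted models. The names are hypothetical, and the fitting and interpolation steps are not shown.

```python
import math

def ideal(x, c=40.0, k=1.0):
    """Hypothetical 'ideal function': a logistic curve with made-up parameters."""
    return 1.0 / (1.0 + c * math.exp(-k * x))

def clip(value, lo, hi):
    """Move value into [lo, hi] if it lies outside."""
    return max(lo, min(hi, value))

constraints = [(0, 0.0, 0.0), (1, 0.00311936, 0.00416369), (2, 0.0847077, 0.109064),
               (3, 0.272142, 0.354692), (4, 0.53198, 0.646113),
               (5, 0.623413, 0.743102), (6, 0.744714, 0.905966)]

# One (x, y) point per constraint, as close to the ideal curve as the bar allows.
bests = [(x, clip(ideal(x), lo, hi)) for x, lo, hi in constraints]
```

Every point in bests lies inside its error bar by construction; whether the interpolant through those points stays monotone between them is a separate question.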
My Mathematica implementation follows.
First, a couple helper functions:
(* Distance from x to the nearest member of list l. *)
listdist[x_, l_List] := Min[Abs[x - #] & /@ l]
(* Return a value x for the variable var such that expr/.var->x is at least (or
at most, if dir is -1) t. *)
invertish[expr_, var_, t_, dir_:1] := Module[{x = dir},
While[dir*(expr /. var -> x) < dir*t, x *= 2];
x]
And here's the main function:
(* Return a non-decreasing interpolating function that maps from the
reals to [0,1] and that is as close as possible to expr[var] without
violating the given constraints (a list of {x,ymin,ymax} triples).
The model, expr, will have free parameters, params, so first do a
model fit to choose the parameters to satisfy the constraints as well
as possible. *)
cfit[constraints_, expr_, params_, var_] :=
Module[{xlist, bots, tops, loparams, hiparams, lofit, hifit, xmin, xmax, gap, aug, bests},
xlist = First /@ constraints;
bots = Most /@ constraints; (* bottom points of the constraints *)
tops = constraints /. {x_, _, ymax_} -> {x, ymax};
(* fit a model to the lower bounds of the constraints, and
to the upper bounds *)
loparams = FindFit[bots, expr, params, var];
hiparams = FindFit[tops, expr, params, var];
lofit[z_] = (expr /. loparams /. var -> z);
hifit[z_] = (expr /. hiparams /. var -> z);
(* find x-values where the fitted function is very close to 0 and to 1 *)
{xmin, xmax} = {
Min@Append[xlist, invertish[expr /. hiparams, var, 10^-6, -1]],
Max@Append[xlist, invertish[expr /. loparams, var, 1-10^-6]]};
(* the smallest gap between x-values in constraints *)
gap = Min[(#2 - #1 &) @@@ Partition[Sort[xlist], 2, 1]];
(* augment the constraints to fill in any gaps and extrapolate so there are
constraints everywhere from where the function is almost 0 to where it's
almost 1 *)
aug = SortBy[Join[constraints, Select[Table[{x, lofit[x], hifit[x]},
{x, xmin,xmax, gap}],
listdist[#[[1]],xlist]>gap&]], First];
(* pick a y-value from each constraint that is as close as possible to
the mean of lofit and hifit *)
bests = ({#1, Clip[(lofit[#1] + hifit[#1])/2, {#2, #3}]} &) @@@ aug;
Interpolation[bests, InterpolationOrder -> 3]]
For example, we can fit to a lognormal, normal, or logistic function:
g1 = cfit[constraints, CDF[LogNormalDistribution[mu,sigma], z], {mu,sigma}, z]
g2 = cfit[constraints, CDF[NormalDistribution[mu,sigma], z], {mu,sigma}, z]
g3 = cfit[constraints, 1/(1 + c*Exp[-k*z]), {c,k}, z]
Here's what those look like for my original list of example constraints:
The normal and logistic are nearly on top of each other and the lognormal is the blue curve.
These are not quite perfect.
In particular, they aren't quite monotone.
Here's a plot of the derivatives:
Plot[{g1'[x], g2'[x], g3'[x]}, {x, 0, 10}]
That reveals some lack of smoothness as well as the slight non-monotonicity near zero.
I welcome improvements on this solution!
You can try to fit a Bezier curve through the midpoints. Specifically I think you want a C2 continuous curve.
