Weibull distribution parameter estimation error in R

I used the following function to estimate the three-parameter Weibull distribution.
library(bbmle)
library(FAdist)

set.seed(16)
xl <- rweibull3(50, shape = 1, scale = 1, thres = 0)

dweib3l <- function(shape, scale, thres) {
  -sum(dweibull3(xl, shape, scale, thres, log = TRUE))
}

ml <- mle2(dweib3l, start = list(shape = 1, scale = 1, thres = 0), data = list(xl))
However, when I run the above code I get the following error:
Error in optim(par = c(shape = 1, scale = 1, thres = 0), fn = function (p) :
non-finite finite-difference value [3]
In addition: There were 16 warnings (use warnings() to see them)
Is there any way to overcome this issue?
Thank you!

The problem is that the threshold parameter is special: it defines a sharp lower boundary for the distribution, so any value of thres above the minimum of the data gives a zero likelihood (a -Inf negative log-likelihood); if a given value of xl is less than the specified threshold, then that observation is impossible according to the statistical model you have defined. Furthermore, we already know that the maximum-likelihood value of the threshold is the minimum value in the data set (analogous results hold for ML estimation of the bounds of a uniform distribution ...)
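A quick way to see this, reusing the xl vector and the dweib3l negative log-likelihood defined in the question (a sketch, not part of the original post):
dweib3l(shape = 1, scale = 1, thres = 0)               # finite: all observations lie above the threshold
dweib3l(shape = 1, scale = 1, thres = min(xl) + 0.01)  # Inf: at least one observation falls below the threshold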
I don't know why the other questions on SO that address this topic don't run into this particular problem - it may be because they use a starting value for the threshold that's far enough below the minimum value in the data set ...
Below, I use a fixed value of min(xl)-1e-5 for the threshold (shifting the value downward avoids numerical problems when the value is exactly on the boundary). I also use the formula interface so we can call the dweibull3() function directly, and put lower bounds on the shape and scale parameters (as a result I need to use method="L-BFGS-B", which allows for constraints).
ml <- mle2(xl ~ dweibull3(shape = shape, scale = scale,
                          thres = min(xl) - 1e-5),
           start = list(shape = 1, scale = 1),
           lower = c(0, 0),
           method = "L-BFGS-B",
           data = data.frame(xl))
(The formula interface is convenient for simple examples: if you want to do something very much more complicated you may want to go back to defining your own log-likelihood function explicitly.)
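(If you do go that route, here is a minimal sketch of the same fixed-threshold fit written with an explicit negative log-likelihood; thres_fixed and nll_fixed are names made up for illustration.)
thres_fixed <- min(xl) - 1e-5
nll_fixed <- function(shape, scale) {
  # negative log-likelihood with the threshold held fixed
  -sum(dweibull3(xl, shape, scale, thres_fixed, log = TRUE))
}
ml2 <- mle2(nll_fixed, start = list(shape = 1, scale = 1),
            lower = c(0, 0), method = "L-BFGS-B")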
If you insist on fitting the threshold parameter, you can do it by setting an upper bound that is (nearly) equal to the minimum value that occurs in the data [any larger value will give NA values and thus break the optimization]. However, you will find that the estimate of the threshold parameter always converges to this upper bound ... so this approach is really getting to the previous answer the hard way (you'll also get warnings about parameters being on the boundary, and about not being able to invert the Hessian).
eps <- 1e-8
ml3 <- mle2(xl ~ dweibull3(shape = shape, scale = scale, thres = thres),
            start = list(shape = 1, scale = 1, thres = -5),
            lower = c(shape = 0, scale = 0, thres = -Inf),
            upper = c(shape = Inf, scale = Inf, thres = min(xl) - eps),
            method = "L-BFGS-B",
            data = data.frame(xl))
For what it's worth it does seem to be possible to fit the model without fixing the threshold parameter, if you start with a small value and use Nelder-Mead optimization: however, it seems to give unreliable results.
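For completeness, a sketch of that Nelder-Mead approach (reusing dweib3l from the question and a starting threshold well below the data minimum; treat the estimates with caution, as noted above):
ml_nm <- mle2(dweib3l,
              start = list(shape = 1, scale = 1, thres = -5),
              method = "Nelder-Mead")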

Related

Difference in likelihood functions for continuous vs discrete lognormal distributions in R's poweRlaw package

I'm trying to fit a lognormal distribution to some count data using Colin Gillespie's poweRlaw package in R. I'm aware that the lognormal distribution is continuous and count data is discrete, however, the package contains classes and methods for both continuous and discrete versions of the lognormal distribution.
When I fit xmin (threshold below which count values are disregarded), log mean and log sd parameters and bootstrap the results to get a p value, I get a vector memory exhaustion error. I found that this happens when the package-internal function sample_p_helper tries to generate random numbers from the fitted distribution. The fitted log mean and log sd parameters are so low that the rejection sampling algorithm tries to generate literally billions of numbers to get anything above xmin, hence the memory issue.
Input:
library(poweRlaw)
counts <- c(54, 64, 126, 161, 162, 278, 281, 293, 296, 302, 322, 348, 418, 511, 696, 793, 1894)
dist <- dislnorm$new(counts) # Create discrete lnorm object
dist$setXmin(estimate_xmin(dist)) # Get xmin and parameters
bs <- bootstrap_p(dist) # Run bootstrapping
Error message:
Expected total run time for 100 sims, using 1 threads is 24.6 seconds.
Error in checkForRemoteErrors(val) :
one node produced an error: vector memory exhausted (limit reached?)
The question then becomes why such low and poor-fitting log mean and log sd parameter values are being fitted in the first place.
I noticed that if I fit the continuous version of the lognormal distribution, the error does not occur and the parameter values seem more reasonable (in fact, the p value suggests the data are compatible with the lognormal distribution):
dist_cont <- conlnorm$new(counts)
dist_cont$setXmin(estimate_xmin(dist_cont))
bs <- bootstrap_p(dist_cont)
bs
Looking at the source code for the package, I noticed the likelihood functions for the discrete vs continuous lognormal distributions are different. Specifically, the part where joint probability is calculated.
The continuous version looks how I'd expect:
########################################################
#Log-likelihood
########################################################
conlnorm_tail_ll = function(x, pars, xmin) {
  if (is.vector(pars)) pars = t(as.matrix(pars))
  n = length(x)
  joint_prob = colSums(apply(pars, 1,
                             function(i) dlnorm(x, i[1], i[2], log = TRUE)))
  prob_over = apply(pars, 1, function(i)
    plnorm(xmin, i[1], i[2], log.p = TRUE, lower.tail = FALSE))
  joint_prob - n * prob_over
}
However, in the discrete version, joint probability is calculated differently:
########################################################
#Log-likelihood
########################################################
dis_lnorm_tail_ll = function(xv, xf, pars, xmin) {
  if (is.vector(pars)) pars = t(as.matrix(pars))
  n = sum(xf)
  p = function(par) {
    m_log = par[1]; sd_log = par[2]
    plnorm(xv - 0.5, m_log, sd_log, lower.tail = FALSE) -
      plnorm(xv + 0.5, m_log, sd_log, lower.tail = FALSE)
  }
  if (length(xv) == 1L) {
    joint_prob = sum(xf * log(apply(pars, 1, p)))
  } else {
    joint_prob = colSums(xf * log(apply(pars, 1, p)))
  }
  prob_over = apply(pars, 1, function(i)
    plnorm(xmin - 0.5, i[1], i[2],
           lower.tail = FALSE, log.p = TRUE))
  return(joint_prob - n * prob_over)
}
There's a similar difference between the discrete and continuous implementations of the exponential distribution, but not between the discrete and continuous power-law distributions. In the continuous version, joint_prob is calculated with a relatively simple call to dlnorm, but the discrete version calls plnorm instead. Further, it calls plnorm twice, first on the observed values minus 0.5 and then on the observed values plus 0.5, and subtracts the latter from the former.
So, at last, my questions:
Why does poweRlaw calculate joint probability in this way in the discrete implementation of the lognormal distribution? I'm sure it's been written in this way for a reason and it's just my mathematical ignorance, but I don't really understand it.
Is it safe to use poweRlaw's continuous lognormal distribution instead, even though my data is discrete, since it seems to work well enough anyway?
Any other clues as to what might be going wrong with my data when trying to fit the discrete lognormal distribution? I'm thinking there might be a scaling issue somewhere but having a hard time getting my head around it.
Does my comically small dataset play into things at all? I'm trying to fit a distribution to just 8 values above xmin, which is way too few for maximum likelihood to be reliable, I know.
Thanks for bearing with me through this lengthy post. I'm aware this is as much a statistics question as a coding question. Any helpful nudges in the right direction are very much appreciated! Cheers.
dlnorm() gives the probability density value. Remember that densities integrate to one but don't sum to one. So to work out the discrete distribution we take the difference of the CDF at the values either side of an integer. There'll be a normalising constant as well. For the continuous (CTN) case, the likelihood is just a product of dlnorm() values, which is easier and faster.
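As a quick, made-up illustration of that distinction (the meanlog/sdlog values are arbitrary, not fitted to the data above):
k <- 1:1000
sum(dlnorm(k, meanlog = 0, sdlog = 0.1))                # density values at the integers: about 3.99, not 1
sum(plnorm(k + 0.5, 0, 0.1) - plnorm(k - 0.5, 0, 0.1))  # interval probabilities: essentially 1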
"Safe" is a hard word to define. For this data, the CTN and discrete give visually the same fit. But neither fit well.
The estimated parameters values for the discrete distribution gives a truncated lognormal in the very extreme tails. Simulating data in that region is challenging
Yep, your data is the problem. But that's also the issue when the model doesn't work ;)

R: Using fitdistrplus to fit curve over histogram of discrete data

So I have this discrete set of data my_dat that I am trying to fit a curve over to be able to generate random variables based on my_dat. I had great success using fitdistrplus on continuous data but have many errors when attempting to use it for discrete data.
Table settings:
library(fitdistrplus)
my_dat <- c(2,5,3,3,3,1,1,2,4,6,
3,2,2,8,3,4,3,3,4,4,
2,1,5,3,1,2,2,4,3,4,
2,4,1,6,2,3,2,1,2,4,
5,1,2,3,2)
I take a look at the histogram of the data first:
hist(my_dat)
Since the data are discrete, I decided to try fitting a binomial or a negative binomial distribution, and this is where I run into trouble. Here I try to fit each:
fitNB3 <- fitdist(my_dat, discrete = T, distr = "nbinom" ) #NaNs Produced
fitB3 <- fitdist(my_dat, discrete = T, distr = "binom")
I run into two problems:
fitNB3 seems to run but notes "NaNs produced" - can anyone tell me why this is the case?
fitB3 doesn't run at all and gives the error "Error in start.arg.default(data10, distr = distname) : Unknown starting values for distribution binom." - can anyone point out why this won't work here? I am unclear about what starting value to provide given that the data are discrete (I tried start = 1 in the fitdist call, but received another error: "Error in fitdist(my_dat, discrete = T, distr = "binom", start = 1) : the function mle failed to estimate the parameters, with the error code 100").
I've been spinning my wheels on this for a while, and I would appreciate any feedback regarding these errors.
Don't use hist on discrete data, because it doesn't do what you think it's doing.
Compare plot(table(my_dat)) with hist(my_dat)... and then ponder how many wrong impressions you've gotten doing this before. If you must use hist, make sure you specify the breaks, don't rely on defaults designed for continuous variables.
hist(my_dat)
lines(table(my_dat),col=4,lwd=6,lend=1)
Neither of your models can be suitable as both these distributions start from 0, not 1, and with the size of values you have, p(0) will not be ignorably small.
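(As a rough check of that claim, assuming the fitNB3 fit from the question succeeded and uses the usual size/mu parameterisation, you could look at the fitted probability of a zero count:)
dnbinom(0, size = fitNB3$estimate[["size"]], mu = fitNB3$estimate[["mu"]])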
I don't get any errors fitting the negative binomial when I run your code.
The issue you had with fitting the binomial is you need to supply starting values for the parameters, which are called size (n) and prob (p), so
you'd need to say something like:
fitdist(my_dat, distr = "binom", start=list(size=15, prob=0.2))
However, you will then get a new problem! The optimizer assumes that the parameters are continuous and will fail on size.
On the other hand this is probably a good thing because with unknown n MLE is not well behaved, particularly when p is small.
Typically, with the binomial it would be expected that you know n. In that case, estimation of p could be done as follows:
fitdist(my_dat, distr = "binom", fix.arg=list(size=20), start=list(prob=0.15))
However, with fixed n, maximum likelihood estimation is straightforward in any case -- you don't need an optimizer for that.
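For instance, a minimal sketch assuming size is known to be 20 (the illustrative value used above): the ML estimate of prob is just the sample mean divided by size.
p_hat <- mean(my_dat) / 20  # closed-form MLE of prob with size fixed at 20
p_hat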
If you really don't know n, there are a number of better-behaved estimators than the MLE to be found, but that's outside the scope of this question.

Solver for non-linear least squares with boundary constraints

I'm looking for an analog to Matlab's lsqnonlin function in Julia.
LsqFit.jl looks great, but doesn't accept the same arguments Matlab's implementation does; specifically:
Lower bounds
Upper bounds
Initial conditions
where initial conditions, lower, and upper bounds are vectors of length 6.
Any advice would be awesome. Thanks!
Actually, it does; it's just not explained in the readme (for good measure, here is a stable link to the README.md).
It is unclear what you mean by initial conditions. If you mean initial parameters, this is very much possible.
using LsqFit
# a two-parameter exponential model
# x: array of independent variables
# p: array of model parameters
model(x, p) = p[1]*exp.(-x.*p[2])
# some example data
# xdata: independent variables
# ydata: dependent variable
xdata = linspace(0,10,20)
ydata = model(xdata, [1.0 2.0]) + 0.01*randn(length(xdata))
p0 = [0.5, 0.5]
fit = curve_fit(model, xdata, ydata, p0)
(taken from the manual). Here p0 is the initial parameter vector.
This will give you something very close to [1.0, 2.0]. But what if we want to constrain the parameter to be in [0,1]x[0,1]? Then we simply set the keyword arguments lower and upper to be vectors of lower and upper bounds
fit = curve_fit(model, xdata, ydata, p0; lower = zeros(2), upper = ones(2))
That should give something like [1.0, 1.0] depending on your exact data.
Maybe it's not a proper answer, but I have had some success in the past adding a penalization term to the cost function outside the bounds, something like a strong exponential with step-like behaviour. The downside is defining your cost function manually, of course.

Bootstrap failed using mixed model in lme4 package

I want to use the bootMer() feature of the lme4 package with a linear mixed model, together with boot.ci to get 95% CIs by parametric bootstrapping, and I have been getting warnings of the type "In bootMer(object, bootFun, nsim = nsim, ...) : some bootstrap runs failed (30/100)".
My code is:
> lmer(LLA ~ 1 +(1|PopID/FamID), data=fp1) -> LLA
> LLA.boot <- bootMer(LLA, qst, nsim=999, use.u=F, type="parametric")
Warning message:
In bootMer(LLA, qst, nsim = 999, use.u = F, type = "parametric") :
some bootstrap runs failed (3/999)
> boot.ci(LLA.boot, type=c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 996 bootstrap replicates
CALL :
boot.ci(boot.out = LLA.boot, type = c("norm", "basic", "perc"))
Intervals :
Level Normal Basic Percentile
95% (-0.2424, 1.0637 ) (-0.1861, 0.8139 ) ( 0.0000, 1.0000 )
Calculations and Intervals on Original Scale
My problem is: why does the bootstrap fail for a few runs, and why does the 95% confidence interval estimated by boot.ci include negative values, even though there are no negative values among the values generated by the bootstrap?
The result of plot(LLA.boot):
It's not surprising, for a slightly difficult or unstable model, that a few parametric bootstrap runs might fail to converge for numerical reasons. You should be able to retrieve the specific error messages via attr(LLA.boot,"boot.fail.msgs") (this really should be documented, but isn't ...). In general I wouldn't worry about it too much if the failure fraction is very small (as it is in this case); if it were large (say >5-10%), I would revisit the data and model to see whether something else was wrong that was manifesting itself in this way.
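Something like the following should show those messages (a sketch; the attribute name is as described above):
msgs <- attr(LLA.boot, "boot.fail.msgs")
head(msgs)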
As for the confidence intervals: the "norm" method uses a bias-corrected Normal approximation and the "basic" method reflects the bootstrap percentiles around the observed estimate, so it's not surprising that these intervals can extend beyond the range of the computed values. Since your function is
Qst <- function(x) {
  uu <- unlist(VarCorr(x))
  uu[2] / (uu[3] + uu[2])
}
Its possible range is from 0 to 1, and your percentile bootstrap CI shows this range is attained. If your model were perfectly uninformative, the distribution of Qst would be uniform (mean = 0.5, sd = sqrt(1/12) = 0.289) and the Normal approximation to the CI would be
> 0.5+c(-1,1)*1.96*sqrt(1/12)
[1] -0.06580326 1.06580326
The upper end is about in the same place as your Normal CI, but your lower limit is even smaller, suggesting that there may even be some bimodality in the sampling distribution of your estimate (this is confirmed by the distribution plot you posted). In any case, I suspect that the bottom line is that your confidence intervals (however computed) are so wide that they're telling you that your data provide almost no practical information about the value of Qst ... In particular, it looks like the majority of your bootstrap replicates are finding singular fits, in which one or the other of the variances are estimated as zero. I'm guessing your data set is just not large enough to estimate these variances very precisely.
For more information on how the Normal and bias-corrected Normal approximations are computed, see boot:::basic.ci and boot:::norm.ci or chapter 5 of Davison and Hinkley as cited in ?boot.ci.

Predict.lm() in R - how to get nonconstant prediction bands around fitted values

So I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function and I do not like using functions without knowing what's happening. I found several how-to's on this subject, but only with the corresponding R-code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, type = c("response", "terms"),
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)
Now, what I have trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data the interval is for (like the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then, what does it mean?
2) interval
Type of interval calculation.
Okay, but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for? If I can specify the term in 3a, it wouldn't make sense to do it again in 3b, so I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: why not just try this out? And I would (even if that might not solve everything here), but right now I don't know how to. Since I do not know what newdata is for, I don't know how to use it, and when I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand!
EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I don't know whether it works the way I think it does: that it calculates y-hat (the predicted values) and then adds/subtracts the upper/lower bounds of the interval for each, producing a set of points that looks like a confidence line. Then I would understand why it is necessary for newdata to have the same length as the data in the linear model.
Make up some data:
d <- data.frame(x = c(1, 4, 5, 7),
                y = c(0.8, 4.2, 4.7, 8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.

Resources