Problems fitting a log pearson III with negative scale - r

I'd like to perform a log Pearson III fit to some data points I have in R. I followed this guidelines
link. But i encounter a problem when my skewness (g) is negative (and of course the parameter "scale" is negative too, because the "sign(g)" in the computation of scale). The distribution from "FAdist" does not work with negative scale parameter, i need this for starting values for the "fitdist" (from fitdistrplus). Some pages say that the parameters shape and scale is only positive in a pearson III (or generalize gamma distribution) and other dont, i run out of ideas. The code is:
library(fitdistrplus)
library(FAdist)
library(e1071)
#data
df <-(92.8, 53.2, 112.0, 164.0, 132.0, 69.9, 140.0, 48.3, 123.0 ,24.6, 179.0, 55.1, 31.3, 17.0, 111.0, 35.4, 133.0, 505.0, 303.0, 121.5, 203.0, 198.0, 250.0, 232.0, 185.0, 222.0, 191.0, 238.0, 53.0, 121.0, 106.4, 347.3, 186.4, 89.1, 131.4 ,53.2, 252.6)
# log of df
df<-log(df)
#Pearson 3 Sample moments
m <- mean(df)
v <- var(df)
s <- sd(df)
g <- e1071::skewness(df, type=1)
n <- length(df)
#Pearson 3 Parameter estimation
my.shape <- (2/g)^2
my.scale <- sqrt(v)/sqrt(my.shape)*sign(g) # modified as recommended by Carl Schwarz, this is negative =(
my.thres <- m-(my.shape*my.scale)
# All parameter together
my.param <- list(shape=my.shape, scale=my.scale, thres=my.thres )
# fit dist from the "fitdistrplus" and "lgamma3" from "FAdist"
q.p3 <- fitdist(data = caudales,distr = "lgamma3", start = my.param)
Give me the following
Error in fitdist(data = df, distr = "lgamma3", start = my.param, :
the function mle failed to estimate the parameters,
with the error code 100

I already solve the problem. I used the package "PearsonDS" instead of "FAdist" to compute log pearson III, why? because the latter accept negative values of skewness (g) and in consecuence negative values of "scale" parameter (if u look at the documentation of PearsonDS in Pearson III, they add an absolute value in the argument). I had to do a little changes to the equation of Pearson III to work right with package "fitdistrplus" (to fit distribution to sample data):
The final code is right this (IMPORTANT : DATA MUST BE THE ORIGINAL VALUES NOT THE LOG(DATA)), BUT MEAN,VAR AND G THEY NEED TO BE COMPUTED WITH LOG(DATA):
library(FAdist); library(fitdistrplus)
library("e1071")
library(PearsonDS)
load("data")
## PP for precipitation
pp <- as.numeric(data$pp)
## IMPORTANT : DATA MUST BE THE ORIGINAL VALUES NOT THE LOG(DATA)
data <- (pp)
m <- mean(log(data))
v <- var(log(data))
s <- sd(log(data))
g <- e1071::skewness(log(data), type=1)
n <- length(data)
#g <- g*(sqrt(n*(n-1))/(n-2))*(1+8.5/n)
my.shape <- (2/g)^2
my.scale <- (sqrt(v/my.shape))*sign(g) # modified as recommended by Carl Schwarz
my.thres <- m-my.shape*my.scale
#load functions
dlPIII<-function(x, shape, location, scale) {PearsonDS::dpearsonIII(log(x), shape, location, scale, log=FALSE)/x}
plPIII<-function(q, shape, location, scale) {PearsonDS::ppearsonIII(log(q), shape, location, scale, lower.tail = TRUE, log.p = FALSE)}
qlPIII<-function(p, shape, location, scale) {exp(PearsonDS::qpearsonIII(p, shape, location, scale, lower.tail = TRUE, log.p = FALSE))}
# Fit the distribution
f.pl3 <- fitdist(data, distr="lPIII", method="mle", start=list(shape=my.shape,location=my.thres ,scale=my.scale )

Related

R code for maximum likelihood estimate from a specific likelihood function

I have been trying to generate R code for maximum likelihood estimation from a log likelihood function in a paper (equation 9 in page 609). Authors in the paper estimated it using MATLAB, which I am not familiar with. So I tried to generate codes in R.
Here is the snapshot of the log likelihood function in the paper:
, where
r: Binary decision (0 or 1) indicating infested plant(s) detection (1) or not (0).
e: Inspection efficiency. This is known.
n: Sample size
The overall objective is to estimate plant infestation rate (gamma: γ) and epsilon (e) based on binary decision of presence and absence of infested plants instead of using infested plant(s) detected. So, the function has only binary information (r) of infested plant detection and sample size. Since epsilon (e) is known or fixed, the actual goal is to estimate gamma (γ) in a population.
Another objective is to compare estimated infestation rates from above with ones in hypergeometric sampling formula in another paper (in page 6). The formula is:
This formula generates required sample size to detect infested plants with selected probability (e.g., 95) given an infested rate. For example:
# Sample size calculation function
fosgate.sample1 <- function(box, p, ci){ # Note: box represent total plant number
ninf <- p*box
sample.size <- round(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
#sample.size <- ceiling(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
sample.size
}
fosgate.sample1(box=100, p = .05, ci = .95) # where box: population or total plants, p: infestation rate, and ci: probability of detection
## 44
The idea is if sample size (e.g., 44) and binary decision data are provided the log-likelihood function can be used to estimate infestation rate and the rate may be close to anticipated rate (e.g., .05). Ultimately, I would like to compare plant infestation rates (gamma: γ) estimated from the log likelihood function above and D/N in the sample size calculation formula (second) or p in the sample size code below.
I generated R code for the log-likelihood described above.
### MLE with stat4
library(stats4)
# Log-likelihood function
plant.inf.lik <- function(inf.rate){
logl <- suppressWarnings(
sum((1-insp.result)*n*log(1-inf.rate) +
insp.result*log(1-(1-inf.rate)^n))
)
return(-logl)
}
Using the sample size function (i.e., fosgate.sample1) I generated sample sizes for various cases of total plant (or box) and anticipated detection rate (p) in the function. Since I am also interested in error/confidence ranges of estimated plant infestation rates, I used bootstrapping to calculate range of estimates (I am not sure if this is appropriate/acceptable). Here is the final code I generated:
### MLE and CI with bootstrapping with multiple scenarios
plant <- c(100, 500, 1000, 5000, 10000, 100000) # Total plant number
ir <- seq(.01, .2, by = .01) # Plant infestation rate
df.result <- data.frame(expand.grid(plant=plant, inf.rate = ir))
df.result$sample.size <- fosgate.sample1(box=df.result$plant, p=df.result$inf.rate, ci=.95) # Sample size
df.result$insp.result <- 1000 # Shipment number (can be replaced with random integers)
df.result <- df.result[order(df.result$plant, df.result$inf.rate, df.result$sample.size), ]
rownames(df.result) <- 1:nrow(df.result)
df.result$est.mean <- 0
#df.result$est.median <- 0
df.result$est.lower.ci <- 0
df.result$est.upper.ci <- 0
df.result$nsim <- 0
str(df.result)
head(df.result)
# Looping
est <- rep(NA, 1000)
for(j in 1:nrow(df.result)){
for(i in 1:1000){
insp.result <- sample(c(rep(1, df.result$insp.result[j]-df.result$insp.result[j]*df.result$inf.rate[j]),
rep(0, df.result$insp.result[j]*df.result$inf.rate[j])))
ir <- df.result$inf.rate[j]
n <- df.result$sample.size[j]
insp.result <- sample(insp.result, replace = TRUE)
est[i] <- mle(plant.inf.lik, start = list(inf.rate = ir*.9), method = "BFGS", nobs = length(insp.result))#coef
df.result$est.mean[j] <- mean(est, na.rm = TRUE)
# df.result$est.median[j] <- median(est, na.rm = TRUE)
df.result$est.lower.ci[j] <- quantile(est, prob = .025, na.rm = TRUE)
df.result$est.upper.ci[j] <- quantile(est, prob = .975, na.rm = TRUE)
df.result$nsim[j] <- length(est)
}
}
# Significance test result
sig <- ifelse(df.result$inf.rate >= df.result$est.lower.ci & df.result$inf.rate <= df.result$est.upper.ci, "no sig", "sig")
table(sig)
# Plot
library(ggplot2)
library(reshape2)
df.result$num <- ave(df.result$inf.rate, df.result$plant, FUN=seq_along)
df.result.m <- melt(df.result, id.vars=c("plant", "sample.size", "insp.result", "est.lower.ci", "est.upper.ci", "nsim", "num"))
df.result.m$est.lower.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.lower.ci)
df.result.m$est.upper.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.upper.ci)
str(df.result.m)
ggplot(data = df.result.m, aes(x = num, y = value, group=variable, color=variable, shape=variable))+
geom_point()+
geom_errorbar(aes(ymin = est.lower.ci, ymax = est.upper.ci), width=.5)+
scale_y_continuous(breaks = seq(0, .2, .02))+
xlab("Index")+
ylab("Plant infestation rate")+
facet_wrap(~plant, ncol = 3)
When I ran the code, I was able to obtain results and to compare estimated (est.mean) and anticipated (inf.rate) infestation rates as shown in the plot below.
If results are correct, plot indicates that estimation looks fine but off for greater infestation rates.
Also, I always got warning messages without "suppressWarnings" function and occasionally error messages below. I have no clue how to fix them.
## Warning messages
## 29: In log(1 - (1 - inf.rate)^n) : NaNs produced
## 30: In log(1 - inf.rate) : NaNs produced
## Error message (occasionally)
## Error in solve.default(oout$hessian) :
## Lapack routine dgesv: system is exactly singular: U[1,1] = 0
My questions are:
Is R function (plant.inf.lik) for maximum likelihood estimation of the log-likelihood function appropriate?
Should I take care of warning and error messages? If yes, how? Again, I have no clue how to fix...
Is bootstrapping (resampling?) method appropriate to estimate CI ranges and/or standard error?
I found this link useful for alternative approach. Although I am still working both approaches together, results seem different (maybe following question).
Any suggestion would be greatly appreciated.
Concerning your last question about estimating CI ranges, there are three common methods for ML estimators:
Variance estimation from the inverted Hessian matrix.
Jackknife estimator for the variance (simpler and more stable, if the Hessian is estimated numerically, but computationally more expensive)
Bootstrap CIs (the computatianally most expensive approach).
For bootstrap CIs, you do not need to implement them yourself (bias correction, e.g. can be tricky), but can rely on the R library boot.
Incidentally, I have written a summary with R code for all three approaches two years ago: Construction of Confidence Intervals (see section 5). For the method utilizing the Hessian Matrix, e.g., the outline is as follows:
lnL <- function(theta1, theta2, ...) {
# definition of the negative (!)
# log-likelihood function...
}
# starting values for the optimization
theta0 <- c(start1, start2, ...)
# optimization
p <- optim(theta0, lnL, hessian=TRUE)
if (p$convergence == 0) {
theta <- p$par
covmat <- solve(p$hessian)
sigma <- sqrt(diag(covmat))
}
The function mle from stats4 already wraps the covrainace matrix estimation and retruns it in vcov. In the practical use cases in which I have tried this (paired comparison models), though, this estimation was rather unstable, and I have resorted to the jackknife method instead.

Data approximation with maxLik library. NaNs produced

I'm trying to make a data approximation for quakes dataset, stations column. I'm following an example in the maxLik library documentation https://cran.r-project.org/web/packages/maxLik/maxLik.pdf
library(maxLik)
set.seed(1)
attach(quakes)
## log likelihood function.
## Note: 􏰀param􏰀 is a vector
llf <- function( param ) {
mu <- param[ 1 ]
sigma <- param[ 2 ]
llValue <- dnorm(stations, mean=mu, sd=sigma, log=TRUE)
return(sum(llValue))
}
## Estimate it. Take standard normal as start values
ml <- maxLik( llf, start = c(mu=0, sigma=1) )
I'm getting In dnorm(stations, mean = mu, sd = sigma, log = TRUE) : NaNs produced warning. This happens because sum(llValue) returns negative values. As a result, the approximation is very bad. If I change the code to return(abs(sum(llValue))) the warning disappear and the ​approximation is a bit closer, but still very bad.
What I'm doing wrong? How to make an approximation for quakes$stations data?
The problem was in the wrong distribution for llValue. I tried to approximate lognormal distribution using the normal distribution.
Stations from quakes dataset could be approximated with following function
library(maxLik)
attach(quakes)
llf_dlnorm <- function(param) {
mu <- param[ 1 ]
sigma <- param[ 2 ]
llValue <- dlnorm(stations, mean=mu, sd=sigma, log=TRUE)
return(sum(llValue))
}
ml_dlnorm <- maxLik(llf_dlnorm, start = c(mu=mean(stations), sigma=sd(stations)))
x<-seq(0, 150, by=0.01)
hist(stations, prob=T)
lines(x,dlnorm(x, ml_dlnorm$estimate[1], ml_dlnorm$estimate[2]), col="blue",lwd=2)

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis Hasting Algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# metropolis hasting algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
lik <- dcustom(x = x, sd = 0.1, L = 200, R =400)
# propose a value for x (you can adjust the stepsize with the sd)
x.proposed <- x + rnorm(1, 0, sd = 20)
lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
r <- lik.proposed/lik # this is the acceptance ratio
# accept new value with probablity of ratio
if (runif(1) < r) {
x <- x.proposed
posterior_sample[i] <- x
}
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302

How to draw an $\alpha$ confidence areas on a 2D-plot?

There are a lot of answers regarding to plotting confidence intervals.
I'm reading the paper by Lourme A. et al (2016) and I'd like to draw the 90% confidence boundary and the 10% exceptional points like in the Fig. 2 from the paper: .
I can't use LaTeX and insert the picture with the definition of confidence areas:
library("MASS")
library(copula)
set.seed(612)
n <- 1000 # length of sample
d <- 2 # dimension
# random vector with uniform margins on (0,1)
u1 <- runif(n, min = 0, max = 1)
u2 <- runif(n, min = 0, max = 1)
u = matrix(c(u1, u2), ncol=d)
Rg <- cor(u) # d-by-d correlation matrix
Rg1 <- ginv(Rg) # inv. matrix
# round(Rg %*% Rg1, 8) # check
# the multivariate c.d.f of u is a Gaussian copula
# with parameter Rg[1,2]=0.02876654
normal.cop = normalCopula(Rg[1,2], dim=d)
fit.cop = fitCopula(normal.cop, u, method="itau") #fitting
# Rg.hat = fit.cop#estimate[1]
# [1] 0.03097071
sim = rCopula(n, normal.cop) # in (0,1)
# Taking the quantile function of N1(0, 1)
y1 <- qnorm(sim[,1], mean = 0, sd = 1)
y2 <- qnorm(sim[,2], mean = 0, sd = 1)
par(mfrow=c(2,2))
plot(y1, y2, col="red"); abline(v=mean(y1), h=mean(y2))
plot(sim[,1], sim[,2], col="blue")
hist(y1); hist(y2)
Reference.
Lourme, A., F. Maurer (2016) Testing the Gaussian and Student's t copulas in a risk management framework. Economic Modelling.
Question. Could anyone help me and give the explanation of the variable v=(v_1,...,v_d) and G(v_1),..., G(v_d) in the equation?
I think v is the non-random matrix, the dimensions should be $k^2$ (grid points) by d=2 (dimensions). For example,
axis_x <- seq(0, 1, 0.1) # 11 grid points
axis_y <- seq(0, 1, 0.1) # 11 grid points
v <- expand.grid(axis_x, axis_y)
plot(v, type = "p")
So, your question is about the vector nu and correponding G(nu).
nu is a simple random vector drawn from any distribution that has a domain (0,1). (Here I use uniform distribution). Since you want your samples in 2D one single nu can be nu = runif(2). Given the explanations above, G is a gaussain pdf with mean 0 and a covariance matrix Rg. (Rg has dimensions of 2x2 in 2D).
Now what the paragraph says: if you have a random sample nu and you want it to be drawn from Gamma given the number of dimensions d and confidence level alpha then you need to compute the following statistic (G(nu) %*% Rg^-1) %*% G(nu) and check that is below the pdf of Chi^2 distribution for d and alpha.
For example:
# This is the copula parameter
Rg <- matrix(c(1,runif(2),1), ncol = 2)
# But we need to compute the inverse for sampling
Rginv <- MASS::ginv(Rg)
sampleResult <- replicate(10000, {
# we draw our nu from uniform, but others that map to (0,1), e.g. beta, are possible, too
nu <- runif(2)
# we compute G(nu) which is a gaussian cdf on the sample
Gnu <- qnorm(nu, mean = 0, sd = 1)
# for this we compute the statistic as given in formula
stat <- (Gnu %*% Rginv) %*% Gnu
# and return the result
list(nu = nu, Gnu = Gnu, stat = stat)
})
theSamples <- sapply(sampleResult["nu",], identity)
# this is the critical value of the Chi^2 with alpha = 0.95 and df = number of dimensions
# old and buggy threshold <- pchisq(0.95, df = 2)
# new and awesome - we are looking for the statistic at alpha = .95 quantile
threshold <- qchisq(0.95, df = 2)
# we can accept samples given the threshold (like in equation)
inArea <- sapply(sampleResult["stat",], identity) < threshold
plot(t(theSamples), col = as.integer(inArea)+1)
The red points are the points you would keep (I plot all points here).
As for drawing the decision boundries, I think it is a little bit more complicated, since you need to compute the exact pair of nu so that (Gnu %*% Rginv) %*% Gnu == pchisq(alpha, df = 2). It is a linear system that you solve for Gnu and then apply inverse to get your nu at the decision boundries.
edit: Reading the paragraph again, I noticed, the parameter for Gnu does not change, it is simply Gnu <- qnorm(nu, mean = 0, sd = 1).
edit: There was a bug: for threshold you need to use the quantile function qchisq instead of the distribution function pchisq - now corrected in the code above (and updated the figures).
This has two parts: first, compute the copula value as a function of X and Y; then, plot the curve giving the boundary where the copula exceeds the threshold.
Computing the value is basically linear algebra which #drey has answered. This is a rewritten version so that the copula is given by a function.
cop1 <- function(x)
{
Gnu <- qnorm(x)
Gnu %*% Rginv %*% Gnu
}
copula <- function(x)
{
apply(x, 1, cop1)
}
Plotting the boundary curve can be done using the same method as here (which in turn is the method used by the textbooks Modern Applied Stats with S, and Elements of Stat Learning). Create a grid of values, and use interpolation to find the contour line at the given height.
Rg <- matrix(c(1,runif(2),1), ncol = 2)
Rginv <- MASS::ginv(Rg)
# draw the contour line where value == threshold
# define a grid of values first: avoid x and y = 0 and 1, where infinities exist
xlim <- 1e-3
delta <- 1e-3
xseq <- seq(xlim, 1-xlim, by=delta)
grid <- expand.grid(x=xseq, y=xseq)
prob.grid <- copula(grid)
threshold <- qchisq(0.95, df=2)
contour(x=xseq, y=xseq, z=matrix(prob.grid, nrow=length(xseq)), levels=threshold,
col="grey", drawlabels=FALSE, lwd=2)
# add some points
data <- data.frame(x=runif(1000), y=runif(1000))
points(data, col=ifelse(copula(data) < threshold, "red", "black"))

Maximum Likelihood Estimation for three-parameter Weibull distribution in r

I want to estimate the scale, shape and threshold parameters of a 3p Weibull distribution.
What I've done so far is the following:
Refering to this post, Fitting a 3 parameter Weibull distribution in R
I've used the functions
EPS = sqrt(.Machine$double.eps) # "epsilon" for very small numbers
llik.weibull <- function(shape, scale, thres, x)
{
sum(dweibull(x - thres, shape, scale, log=T))
}
thetahat.weibull <- function(x)
{
if(any(x <= 0)) stop("x values must be positive")
toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
mu = mean(log(x))
sigma2 = var(log(x))
shape.guess = 1.2 / sqrt(sigma2)
scale.guess = exp(mu + (0.572 / shape.guess))
thres.guess = 1
res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower=EPS)
c(shape=res$par[1], scale=res$par[2], thres=res$par[3])
}
to "pre-estimate" my Weibull parameters, such that I can use them as initial values for the argument "start" in the "fitdistr" function of the MASS-Package.
You might ask why I want to estimate the parameters twice... reason is that I need the variance-covariance-matrix of the estimates which is also estimated by the fitdistr function.
EXAMPLE:
set.seed(1)
thres <- 450
dat <- rweibull(1000, 2.78, 750) + thres
pre_mle <- thetahat.weibull(dat)
my_wb <- function(x, shape, scale, thres) {
dweibull(x - thres, shape, scale)
}
ml <- fitdistr(dat, densfun = my_wb, start = list(shape = round(pre_mle[1], digits = 0), scale = round(pre_mle[2], digits = 0),
thres = round(pre_mle[3], digits = 0)))
ml
> ml
shape scale thres
2.942548 779.997177 419.996196 ( 0.152129) ( 32.194294) ( 28.729323)
> ml$vcov
shape scale thres
shape 0.02314322 4.335239 -3.836873
scale 4.33523868 1036.472551 -889.497580
thres -3.83687258 -889.497580 825.374029
This works quite well for cases where the shape parameter is above 1. Unfortunately my approach should deal with the cases where the shape parameter could be smaller than 1.
The reason why this is not possible for shape parameters that are smaller than 1 is described here: http://www.weibull.com/hotwire/issue148/hottopics148.htm
in Case 1, All three parameters are unknown the following is said:
"Define the smallest failure time of ti to be tmin. Then when γ → tmin, ln(tmin - γ) → -∞. If β is less than 1, then (β - 1)ln(tmin - γ) goes to +∞ . For a given solution of β, η and γ, we can always find another set of solutions (for example, by making γ closer to tmin) that will give a larger likelihood value. Therefore, there is no MLE solution for β, η and γ."
This makes a lot of sense. For this very reason I want to do it the way they described it on this page.
"In Weibull++, a gradient-based algorithm is used to find the MLE solution for β, η and γ. The upper bound of the range for γ is arbitrarily set to be 0.99 of tmin. Depending on the data set, either a local optimal or 0.99tmin is returned as the MLE solution for γ."
I want to set a feasible interval for gamma (in my code called 'thres') such that the solution is between (0, .99 * tmin).
Does anyone have an idea how to solve this problem?
In the function fitdistr there seems to be no opportunity doing a constrained MLE, constraining one parameter.
Another way to go could be the estimation of the asymptotic variance via the outer product of the score vectors. The score vector could be taken from the above used function thetahat.weibul(x). But calculating the outer product manually (without function) seems to be very time consuming and does not solve the problem of the constrained ML estimation.
Best regards,
Tim
It's not too hard to set up a constrained MLE. I'm going to do this in bbmle::mle2; you could also do it in stats4::mle, but bbmle has some additional features.
The larger issue is that it's theoretically difficult to define the sampling variance of an estimate when it's on the boundary of the allowed space; the theory behind Wald variance estimates breaks down. You can still calculate confidence intervals by likelihood profiling ... or you could bootstrap. I ran into a variety of optimization issues when doing this ... I haven't really thought about wether there are specific reasons
Reformat three-parameter Weibull function for mle2 use (takes x as first argument, takes log as an argument):
dweib3 <- function(x, shape, scale, thres, log=TRUE) {
dweibull(x - thres, shape, scale, log=log)
}
Starting function (slightly reformatted):
weib3_start <- function(x) {
mu <- mean(log(x))
sigma2 <- var(log(x))
logshape <- log(1.2 / sqrt(sigma2))
logscale <- mu + (0.572 / logshape)
logthres <- log(0.5*min(x))
list(logshape = logshape, logsc = logscale, logthres = logthres)
}
Generate data:
set.seed(1)
dat <- data.frame(x=rweibull(1000, 2.78, 750) + 450)
Fit model: I'm fitting the parameters on the log scale for convenience and stability, but you could use boundaries at zero as well.
tmin <- log(0.99*min(dat$x))
library(bbmle)
m1 <- mle2(x~dweib3(exp(logshape),exp(logsc),exp(logthres)),
data=dat,
upper=c(logshape=Inf,logsc=Inf,
logthres=tmin),
start=weib3_start(dat$x),
method="L-BFGS-B")
vcov(m1), which should normally provide a variance-covariance estimate (unless the estimate is on the boundary, which is not the case here) gives NaN values ... not sure why without more digging.
library(emdbook)
tmpf <- function(x,y) m1#minuslogl(logshape=x,
logsc=coef(m1)["logsc"],
logthres=y)
tmpf(1.1,6)
s1 <- curve3d(tmpf,
xlim=c(1,1.2),ylim=c(5.9,tmin),sys3d="image")
with(s1,contour(x,y,z,add=TRUE))
h <- lme4:::hessian(function(x) do.call(m1#minuslogl,as.list(x)),coef(m1))
vv <- solve(h)
diag(vv) ## [1] 0.002672240 0.001703674 0.004674833
(se <- sqrt(diag(vv))) ## standard errors
## [1] 0.05169371 0.04127558 0.06837275
cov2cor(vv)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8852090 -0.8778424
## [2,] 0.8852090 1.0000000 -0.9616941
## [3,] -0.8778424 -0.9616941 1.0000000
This is the variance-covariance matrix of the log-scaled variables. If you want to convert to the variance-covariance matrix on the original scale, you need to scale by (x_i)*(x_j) (i.e. by the derivatives of the transformation exp(x)).
outer(exp(coef(m1)),exp(coef(m1))) * vv
## logshape logsc logthres
## logshape 0.02312803 4.332993 -3.834145
## logsc 4.33299307 1035.966372 -888.980794
## logthres -3.83414498 -888.980794 824.831463
I don't know why this doesn't work with numDeriv - would be very careful with variance estimates above. (Maybe too close to boundary for Richardson extrapolation to work?)
library(numDeriv)
hessian()
grad(function(x) do.call(m1#minuslogl,as.list(x)),coef(m1)) ## looks OK
vcov(m1)
The profiles look OK ... (we have to supply std.err because the Hessian isn't invertible)
pp <- profile(m1,std.err=c(0.01,0.01,0.01))
par(las=1,bty="l",mfcol=c(1,3))
plot(pp,show.points=TRUE)
confint(pp)
## 2.5 % 97.5 %
## logshape 0.9899645 1.193571
## logsc 6.5933070 6.755399
## logthres 5.8508827 6.134346
Alternately, we can do this on the original scale ... one possibility would be to use the log-scaling to fit, then refit starting from those parameters on the original scale.
wstart <- as.list(exp(unlist(weib3_start(dat$x))))
names(wstart) <- gsub("log","",names(wstart))
m2 <- mle2(x~dweib3(shape,sc,thres),
data=dat,
lower=c(shape=0.001,sc=0.001,thres=0.001),
upper=c(shape=Inf,sc=Inf,
thres=exp(tmin)),
start=wstart,
method="L-BFGS-B")
vcov(m2)
## shape sc thres
## shape 0.02312399 4.332057 -3.833264
## sc 4.33205658 1035.743511 -888.770787
## thres -3.83326390 -888.770787 824.633714
all.equal(unname(coef(m2)),unname(exp(coef(m1))),tol=1e-4)
About the same as the values above.
We can fit with a small shape, if we are a little more careful to bound the paraameters, but now we end up on the boundary for the threshold, which will cause lots of problems for the variance calculations.
set.seed(1)
dat <- data.frame(x = rweibull(1000, .53, 365) + 100)
tmin <- log(0.99 * min(dat$x))
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
lower=c(logshape=-10,logscale=0,logthres=0),
upper = c(logshape = 20, logsc = 20, logthres = tmin),
data = dat,
start = weib3_start(dat$x), method = "L-BFGS-B")
For censored data, you need to replace dweibull with pweibull; see Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf for some hints.
Another possible solution is to do Bayesian inference. Using scale priors on the shape and scale parameters and a uniform prior on the location parameter, you can easily run Metropolis-Hastings as follows. It might be adviceable to reparameterize in terms of log(shape), log(scale) and log(y_min - location) because the posterior for some of the parameters becomes strongly skewed, in particular for the location parameter. Note that the output below shows the posterior for the backtransformed parameters.
library(MCMCpack)
logposterior <- function(par,y) {
gamma <- min(y) - exp(par[3])
sum(dweibull(y-gamma,exp(par[1]),exp(par[2]),log=TRUE)) + par[3]
}
y <- rweibull(100,shape=.8,scale=10) + 1
chain0 <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=.01*diag(3))
chain <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=var(chain0))
plot(exp(chain))
summary(exp(chain))
This produces the following output
#########################################################
The Metropolis acceptance rate was 0.43717
#########################################################
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] 0.81530 0.06767 0.0004785 0.001668
[2,] 10.59015 1.39636 0.0098738 0.034495
[3,] 0.04236 0.05642 0.0003990 0.001174
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
var1 0.6886083 0.768054 0.81236 0.8608 0.9498
var2 8.0756210 9.637392 10.50210 11.4631 13.5353
var3 0.0003397 0.007525 0.02221 0.0548 0.1939

Resources