How does glmnet compute the maximal lambda value? - r

The glmnet package uses a range of LASSO tuning parameters lambda scaled from the maximal lambda_max under which no predictors are selected. I want to find out how glmnet computes this lambda_max value. For example, in a trivial dataset:
set.seed(1)
library("glmnet")
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
fitGLM <- glmnet(x,y)
max(fitGLM$lambda)
# 0.1975946
The package vignette (http://www.jstatsoft.org/v33/i01/paper) describes in section 2.5 that it computes this value as follows:
sx <- as.matrix(scale(x))
sy <- as.vector(scale(y))
max(abs(colSums(sx*sy)))/100
# 0.1865232
Which clearly is close but not the same value. So, what causes this difference? And in a related question, how could I compute lambda_max for a logistic regression?

To get the same result you need to standardize the variables using a standard deviation with n instead of n-1 denominator.
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x,scale=apply(x, 2, mysd))
sx <- as.matrix(sx, ncol=20, nrow=100)
sy <- as.vector(scale(y, scale=mysd(y)))
max(abs(colSums(sx*sy)))/100
## [1] 0.1758808
fitGLM <- glmnet(sx,sy)
max(fitGLM$lambda)
## [1] 0.1758808
For the unscaled (original) x and y, the maximum lambda should be
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x,scale=apply(x, 2, mysd))
norm(t(sx) %*% y, 'i') / nrow(x)
## [1] 0.1975946
# norm of infinity is also equal to
max(abs(colSums(sx*y)))/100
## [1] 0.1975946
max(fitGLM$lambda) - norm(t(sx) %*% y, 'i') / nrow(x)
## [1] 2.775558e-17

It seems lambda_max for a logistic regression is calculated similarly as for linear regression, but with weights based on class proportions:
set.seed(1)
library("glmnet")
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x, scale=apply(x, 2, mysd))
sx <- as.matrix(sx, ncol=20, nrow=100)
y_bin <- factor(ifelse(y<0, -1, 1))
prop.table(table(y_bin))
# y_bin
# -1 1
# 0.62 0.38
fitGLM_log <- glmnet(sx, y_bin, family = "binomial")
max(fitGLM_log$lambda)
# [1] 0.1214006
max(abs(colSums(sx*ifelse(y<0, -.38, .62))))/100
# [1] 0.1214006

For your second question, look to Friedman et al's paper, "Regularization paths for generalized linear models via coordinate descent". In particular, see equation (10), which is equality at equilibrium. Just check under what conditions the numerator $S(\cdot,\cdot)$ is zero for all parameters.

Sorry, been a while, but maybe still of help:
You can calculate the maximum lambda value for any problem with L1-regularization by finding the highest absolute value of the gradient of the objective function (i.e. the score function for likelihoods) at the optimized parameter values for the completely regularized model (eg. all penalized parameters set to zero).
I sadly can't help with the difference in values, though. Although I can say that I try to use a max lambda value that is a bit higher - say 5% - than the calculated maximum lambda, so that the model with all selected parameterers constrained will surely be a part of the number of estimated models. Maybe this is what is being done in glmnet.
Edit: sorry, I confused the non-regularized with the fully penalized model. Edited it above now.

According to help("glmnet") the maximal lambda value is "the smallest value for which all coefficients are zero":
sum(fitGLM$beta[, which.max(fitGLM$lambda)])
#[1] 0
sum(glmnet(x,y, lambda=max(fitGLM$lambda)*0.999)$beta)
#[1] -0.0001809804
At a quick glance the value seems to be calculated by the Fortran code called by elnet.

Related

Generalized Gini with Weights in R

I want to calculate weighted generalized gini coefficients.
CRAN distributes the "acid" package with a suitable function.
What am i missing here? When weights are constant, the estimates of weighted.gini and sgini are equal given the parameter nu = 2 (as to get the regular gini). When weights are nonconstant, they differ. Is there something fishy going on or am i missing something? They ought to be the same, right?
Checked back with STATA sgini function by van Kerm which is cited in the documentation of acid and its function returns the expected same estimates.
set.seed(123)
install.packages("acid")
library(acid)
x <- rnorm(100,10,1)
w <- rep(1, length(x))
acid::weighted.gini(x,w)$Gini
acid::sgini(x,w,nu=2)$Gini
w <- rnorm(100,10,1)
acid::weighted.gini(x,w)$Gini
acid::sgini(x,w,nu=2)$Gini
There is a mistake in "sgini".
In the formula that the command "sgini" has, at same point, mean(x) is calculated without taking into account the weights.
If we tried to calculate manually the weighted Gini given the formulas: https://core.ac.uk/download/pdf/41339501.pdf
set.seed(123)
x <- rnorm(100,10,1)
w <- rep(1, length(x))
acid::weighted.gini(x,w)$Gini
acid::sgini(x,w,nu=2)$Gini
w <- rnorm(100,10,1)
acid::weighted.gini(x,w)
acid::sgini(x,w,nu=2)
#calc manually
ox<-order(x)
x<-x[ox]
w<-w[ox]
#cov(x,cumsum(x)/cumsum(x)[length(x)])*2/mean(x) #gini without weights
w<-w/sum(w)
f<-w/2+cumsum(c(0,w[-length(w)]))
2/sum(x*w)*sum(w*(x-sum(x*w))*(f-sum(f*w))) #==weighted.gini(x,w)$Gini

find total variation distance between multinomial distributions in r

I am comparing Bayes estimators to MLE in multinomial distributions. I am drawing random samples using rmultinom from a particular multinomial distribution using
rmultinom(400, size = 30, prob = c(5,7,10,8,14,10,15,12,10,9))
For each of the 400 samples, I compute the MLE and Bayes estimators for the ten probability parameters. I now want to find in each case the total variation distance between the true distribution and the one defined by the estimators.
Since for size 30 and 10 bins there are over 200 million possible arrangements, I don't think that using the theoretical definition is a good idea.
The package distrEx has a function "TotalVarDist()", but it can only be used with distributions defined in the distr package, and multinomial is not one of them. There are directions for defining them (see here and here) but the options are either to define a discrete distribution by explicitly listing the support (again, I don't think this is a good option since the support has a size of over 200 million) or starting from scratch using the same methods as how the distr package was created, which is beyond my current ability.
Any thoughts on how to do this, either using the packages mentioned or in a completely different way?
My answer is about how to calculate this using base R.
We have two multinomial parameter vectors, θ and η. The total variation distance is equivalent to P_θ(E) - P_η(E), where E={ω | P_θ({ω})>P_η({ω})}, and ω is a vector of sample counts.
I know of two ways to evaluate P(E) in base R. One is a very simple simulation-based method. The other reframes the problem in terms of a linear combination of the counts, which is approximately normally distributed, and uses the pnorm function.
Simulation-based method
You simulate samples from each distribution, check whether they're in E using the probability mass functions, and count how often they are. I'll go through an example here. We'll assume the true distribution from your question:
unnormalized.true <- c(5,7,10,8,14,10,15,12,10,9)
true <- unnormalized.true / sum(unnormalized.true)
We'll draw a sample and estimate a new distribution using a Bayes estimator:
set.seed(921)
result <- as.vector(rmultinom(1, size = 30, prob = true))
result
## [1] 3 6 2 0 5 3 3 4 1 3
dirichlet <- (result+1)/(30+length(true))
Calculating the probability of E under the true distribution:
set.seed(939)
true.dist <- rmultinom(10^6, 30, true)
p.true.e <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
Calculating the probability of E under the estimated distribution from the Bayes estimator:
dirichlet.dist <- rmultinom(10^6, 30, dirichlet)
p.dirichlet.e <- mean(apply(dirichlet.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, dirichlet) > 0))
And we can subtract to get the total variation distance.
p.true.e - p.dirichlet.e
## [1] 0.83737
Repeating this with the maximum likelihood estimate, we get a comparison of the estimators.
mle <- result/30
mle.dist <- rmultinom(10^6, 30, mle)
p.true.e2 <- mean(apply(true.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.mle.e2 <- mean(apply(mle.dist, 2, function(x)
dmultinom(x, 30, true) - dmultinom(x, 30, mle) > 0))
p.true.e2 - p.mle.e2
## [1] 0.968301
(edited to fix a serious mistake. Previously I had re-used p.true.e in the comparison with the MLE. I forgot that the event E is defined in terms of the estimated distribution.)
Normal approximation
I think this method is actually more accurate than the simulation based method, despite the normal approximation. As you'll see, we're not taking a normal approximation to the multinomial counts, which would be unlikely to be accurate for n=30. We're taking a normal approximation to a linear combination of these counts, which is close to normal. The weakness of this method will turn out to be that it can't handle zero probabilities in the estimated distribution. That's a real problem, since handling zeros gracefully is, to me, part of the point of using total variation distance rather than Kullback-Leibler divergence. But here it is.
The following derivation yields a restatement of E:
Define
where N_i is one cell of the multinomial sample, and
Then, E is the event that L>0.
The reason we have a problem with a zero probability is that it causes one of the λ_i's to be infinite.
I want to verify that L is close to normally distributed, in the example from before. I'll do that by getting samples from the distribution of L, using the previous multinomial simulation:
lambda <- log(true/dirichlet)
L.true.dist <- apply(true.dist, 2, function(x) sum(lambda*x))
L.dirichlet.dist <- apply(dirichlet.dist, 2, function(x) sum(lambda*x))
Note that I'm doing the comparison between the true distribution and the Bayes estimated distribution. I can't do the one with the MLE, because my sample had a zero count.
Plotting the distribution of L and comparing to a normal fit:
par(mfrow=c(1,2))
L.true.dist.hist <- hist(L.true.dist)
L.true.dist.fit <- function(x)
length(L.true.dist) * diff(L.true.dist.hist$breaks)[1] *
dnorm(x, mean(L.true.dist), sd=sd(L.true.dist))
curve(L.true.dist.fit, add=TRUE, n=1000, col='red')
L.dirichlet.dist.hist <- hist(L.dirichlet.dist)
L.dirichlet.dist.fit <- function(x)
length(L.dirichlet.dist) * diff(L.dirichlet.dist.hist$breaks)[1] *
dnorm(x, mean(L.dirichlet.dist), sd=sd(L.dirichlet.dist))
curve(L.dirichlet.dist.fit, add=TRUE, n=1000, col='red')
par(mfrow=c(1,1))
The distribution of L appears normal. So, instead of using simulations, we can just use pnorm. However, we need to calculate the mean and standard deviation of L. This can be done as follows.
The mean of L is
where p_i is the cell probability of cell i in the distribution p. The variance is
where
is the covariance matrix of the multinomial distribution. I'll calculate these moments for this example, and check them against the empirical moments in the simulation. First, for the distribution of L under the true distribution:
n <- 30
k <- length(true)
mean.L.true <- sum(lambda * n * true)
# Did we get the mean right?
c(mean.L.true, mean(L.true.dist))
## [1] 3.873509 3.875547
# Covariance matrix assuming the true distribution
sigma.true <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*true[i]*(1-true[i]), -n*true[i]*true[j]))
var.L.true <- t(lambda) %*% sigma.true %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.true), sd(L.true.dist))
## [1] 2.777787 2.776945
Then, the mean and variance of L under the Bayes estimate of the distribution:
mean.L.dirichlet <- sum(lambda * n * dirichlet)
# Did we get the mean right?
c(mean.L.dirichlet, mean(L.dirichlet.dist))
## [1] -3.893836 -3.895983
# Covariance matrix assuming the estimated distribution
sigma.dirichlet <- outer(1:k, 1:k, function(i,j)
ifelse(i==j, n*dirichlet[i]*(1-dirichlet[i]), -n*dirichlet[i]*dirichlet[j]))
var.L.dirichlet <- t(lambda) %*% sigma.dirichlet %*% lambda
# Did we get the standard deviation right?
c(sqrt(var.L.dirichlet), sd(L.dirichlet.dist))
## [1] 2.796348 2.793421
With these in hand, we can calculate the total variation distance with pnorm:
pnorm(0, mean.L.true, sd=sqrt(var.L.true), lower.tail=FALSE) -
pnorm(0, mean.L.dirichlet, sd=sqrt(var.L.true), lower.tail=FALSE)
## [1] 0.8379193
# Previous result was 0.83737
We get three digits of agreement with the simulation.
I don't know of any easy way to extend the normal approximation method to handle zero probabilities, though. I had an idea, but I got stuck trying to calculate the covariance matrix of the counts conditional on a specific cell having 0 count. I could share my progress if you think you could make something of it.

Fast nonnegative quantile and Huber regression in R

I am looking for a fast way to do nonnegative quantile and Huber regression in R (i.e. with the constraint that all coefficients are >0). I tried using the CVXR package for quantile & Huber regression and the quantreg package for quantile regression, but CVXR is very slow and quantreg seems buggy when I use nonnegativity constraints. Does anybody know of a good and fast solution in R, e.g. using the Rcplex package or R gurobi API, thereby using the faster CPLEX or gurobi optimizers?
Note that I need to run a problem size like below 80 000 times, whereby I only need to update the y vector in each iteration, but still use the same predictor matrix X. In that sense, I feel it's inefficient that in CVXR I now have to do obj <- sum(quant_loss(y - X %*% beta, tau=0.01)); prob <- Problem(Minimize(obj), constraints = list(beta >= 0)) within each iteration, when the problem is in fact staying the same and all I want to update is y. Any thoughts to do all this better/faster?
Minimal example:
## Generate problem data
n <- 7 # n predictor vars
m <- 518 # n cases
set.seed(1289)
beta_true <- 5 * matrix(stats::rnorm(n), nrow = n)+20
X <- matrix(stats::rnorm(m * n), nrow = m, ncol = n)
y_true <- X %*% beta_true
eps <- matrix(stats::rnorm(m), nrow = m)
y <- y_true + eps
Nonnegative quantile regression using CVXR :
## Solve nonnegative quantile regression problem using CVX
require(CVXR)
beta <- Variable(n)
quant_loss <- function(u, tau) { 0.5*abs(u) + (tau - 0.5)*u }
obj <- sum(quant_loss(y - X %*% beta, tau=0.01))
prob <- Problem(Minimize(obj), constraints = list(beta >= 0))
system.time(beta_cvx <- pmax(solve(prob, solver="SCS")$getValue(beta), 0)) # estimated coefficients, note that they ocasionally can go - though and I had to clip at 0
# 0.47s
cor(beta_true,beta_cvx) # correlation=0.99985, OK but very slow
Syntax for nonnegative Huber regression is the same but would use
M <- 1 ## Huber threshold
obj <- sum(CVXR::huber(y - X %*% beta, M))
Nonnegative quantile regression using quantreg package :
### Solve nonnegative quantile regression problem using quantreg package with method="fnc"
require(quantreg)
R <- rbind(diag(n),-diag(n))
r <- c(rep(0,n),-rep(1E10,n)) # specify bounds of coefficients, I want them to be nonnegative, and 1E10 should ideally be Inf
system.time(beta_rq <- coef(rq(y~0+X, R=R, r=r, tau=0.5, method="fnc"))) # estimated coefficients
# 0.12s
cor(beta_true,beta_rq) # correlation=-0.477, no good, and even worse with tau=0.01...
To speed up CVXR, you can get the problem data once in the beginning, then modify it within a loop and pass it directly to the solver's R interface. The code for this is
prob_data <- get_problem_data(prob, solver = "SCS")
Then, parse out the arguments and pass them to scs from the scs library. (See Solver.solve in solver.R). You'll have to dig into the details of the canonicalization, but I expect if you're just changing y at each iteration, it should be a straightforward modification.

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define you function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics
Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b are identical for any constant shift a and b. So it is a weak or even useless measure on "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??
In reply to wordsforthewise
Thanks for your comments 1, 2 and your answer of details.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution based on [https://en.wikipedia.org/wiki/Coefficient_of_determination]
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise, I just turned into a function for the sake of convenience if somebody uses it a lot like me.
r2_general <-function(preds,actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
I am use the function MLmetrics::R2_Score from the packages MLmetrics, to compute R2 it uses the vanilla 1-(RSS/TSS) formula.

Maximum Likelihood Estimation for three-parameter Weibull distribution in r

I want to estimate the scale, shape and threshold parameters of a 3p Weibull distribution.
What I've done so far is the following:
Refering to this post, Fitting a 3 parameter Weibull distribution in R
I've used the functions
EPS = sqrt(.Machine$double.eps) # "epsilon" for very small numbers
llik.weibull <- function(shape, scale, thres, x)
{
sum(dweibull(x - thres, shape, scale, log=T))
}
thetahat.weibull <- function(x)
{
if(any(x <= 0)) stop("x values must be positive")
toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
mu = mean(log(x))
sigma2 = var(log(x))
shape.guess = 1.2 / sqrt(sigma2)
scale.guess = exp(mu + (0.572 / shape.guess))
thres.guess = 1
res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower=EPS)
c(shape=res$par[1], scale=res$par[2], thres=res$par[3])
}
to "pre-estimate" my Weibull parameters, such that I can use them as initial values for the argument "start" in the "fitdistr" function of the MASS-Package.
You might ask why I want to estimate the parameters twice... reason is that I need the variance-covariance-matrix of the estimates which is also estimated by the fitdistr function.
EXAMPLE:
set.seed(1)
thres <- 450
dat <- rweibull(1000, 2.78, 750) + thres
pre_mle <- thetahat.weibull(dat)
my_wb <- function(x, shape, scale, thres) {
dweibull(x - thres, shape, scale)
}
ml <- fitdistr(dat, densfun = my_wb, start = list(shape = round(pre_mle[1], digits = 0), scale = round(pre_mle[2], digits = 0),
thres = round(pre_mle[3], digits = 0)))
ml
> ml
shape scale thres
2.942548 779.997177 419.996196 ( 0.152129) ( 32.194294) ( 28.729323)
> ml$vcov
shape scale thres
shape 0.02314322 4.335239 -3.836873
scale 4.33523868 1036.472551 -889.497580
thres -3.83687258 -889.497580 825.374029
This works quite well for cases where the shape parameter is above 1. Unfortunately my approach should deal with the cases where the shape parameter could be smaller than 1.
The reason why this is not possible for shape parameters that are smaller than 1 is described here: http://www.weibull.com/hotwire/issue148/hottopics148.htm
in Case 1, All three parameters are unknown the following is said:
"Define the smallest failure time of ti to be tmin. Then when γ → tmin, ln(tmin - γ) → -∞. If β is less than 1, then (β - 1)ln(tmin - γ) goes to +∞ . For a given solution of β, η and γ, we can always find another set of solutions (for example, by making γ closer to tmin) that will give a larger likelihood value. Therefore, there is no MLE solution for β, η and γ."
This makes a lot of sense. For this very reason I want to do it the way they described it on this page.
"In Weibull++, a gradient-based algorithm is used to find the MLE solution for β, η and γ. The upper bound of the range for γ is arbitrarily set to be 0.99 of tmin. Depending on the data set, either a local optimal or 0.99tmin is returned as the MLE solution for γ."
I want to set a feasible interval for gamma (in my code called 'thres') such that the solution is between (0, .99 * tmin).
Does anyone have an idea how to solve this problem?
In the function fitdistr there seems to be no opportunity doing a constrained MLE, constraining one parameter.
Another way to go could be the estimation of the asymptotic variance via the outer product of the score vectors. The score vector could be taken from the above used function thetahat.weibul(x). But calculating the outer product manually (without function) seems to be very time consuming and does not solve the problem of the constrained ML estimation.
Best regards,
Tim
It's not too hard to set up a constrained MLE. I'm going to do this in bbmle::mle2; you could also do it in stats4::mle, but bbmle has some additional features.
The larger issue is that it's theoretically difficult to define the sampling variance of an estimate when it's on the boundary of the allowed space; the theory behind Wald variance estimates breaks down. You can still calculate confidence intervals by likelihood profiling ... or you could bootstrap. I ran into a variety of optimization issues when doing this ... I haven't really thought about wether there are specific reasons
Reformat three-parameter Weibull function for mle2 use (takes x as first argument, takes log as an argument):
dweib3 <- function(x, shape, scale, thres, log=TRUE) {
dweibull(x - thres, shape, scale, log=log)
}
Starting function (slightly reformatted):
weib3_start <- function(x) {
mu <- mean(log(x))
sigma2 <- var(log(x))
logshape <- log(1.2 / sqrt(sigma2))
logscale <- mu + (0.572 / logshape)
logthres <- log(0.5*min(x))
list(logshape = logshape, logsc = logscale, logthres = logthres)
}
Generate data:
set.seed(1)
dat <- data.frame(x=rweibull(1000, 2.78, 750) + 450)
Fit model: I'm fitting the parameters on the log scale for convenience and stability, but you could use boundaries at zero as well.
tmin <- log(0.99*min(dat$x))
library(bbmle)
m1 <- mle2(x~dweib3(exp(logshape),exp(logsc),exp(logthres)),
data=dat,
upper=c(logshape=Inf,logsc=Inf,
logthres=tmin),
start=weib3_start(dat$x),
method="L-BFGS-B")
vcov(m1), which should normally provide a variance-covariance estimate (unless the estimate is on the boundary, which is not the case here) gives NaN values ... not sure why without more digging.
library(emdbook)
tmpf <- function(x,y) m1#minuslogl(logshape=x,
logsc=coef(m1)["logsc"],
logthres=y)
tmpf(1.1,6)
s1 <- curve3d(tmpf,
xlim=c(1,1.2),ylim=c(5.9,tmin),sys3d="image")
with(s1,contour(x,y,z,add=TRUE))
h <- lme4:::hessian(function(x) do.call(m1#minuslogl,as.list(x)),coef(m1))
vv <- solve(h)
diag(vv) ## [1] 0.002672240 0.001703674 0.004674833
(se <- sqrt(diag(vv))) ## standard errors
## [1] 0.05169371 0.04127558 0.06837275
cov2cor(vv)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8852090 -0.8778424
## [2,] 0.8852090 1.0000000 -0.9616941
## [3,] -0.8778424 -0.9616941 1.0000000
This is the variance-covariance matrix of the log-scaled variables. If you want to convert to the variance-covariance matrix on the original scale, you need to scale by (x_i)*(x_j) (i.e. by the derivatives of the transformation exp(x)).
outer(exp(coef(m1)),exp(coef(m1))) * vv
## logshape logsc logthres
## logshape 0.02312803 4.332993 -3.834145
## logsc 4.33299307 1035.966372 -888.980794
## logthres -3.83414498 -888.980794 824.831463
I don't know why this doesn't work with numDeriv - would be very careful with variance estimates above. (Maybe too close to boundary for Richardson extrapolation to work?)
library(numDeriv)
hessian()
grad(function(x) do.call(m1#minuslogl,as.list(x)),coef(m1)) ## looks OK
vcov(m1)
The profiles look OK ... (we have to supply std.err because the Hessian isn't invertible)
pp <- profile(m1,std.err=c(0.01,0.01,0.01))
par(las=1,bty="l",mfcol=c(1,3))
plot(pp,show.points=TRUE)
confint(pp)
## 2.5 % 97.5 %
## logshape 0.9899645 1.193571
## logsc 6.5933070 6.755399
## logthres 5.8508827 6.134346
Alternately, we can do this on the original scale ... one possibility would be to use the log-scaling to fit, then refit starting from those parameters on the original scale.
wstart <- as.list(exp(unlist(weib3_start(dat$x))))
names(wstart) <- gsub("log","",names(wstart))
m2 <- mle2(x~dweib3(shape,sc,thres),
data=dat,
lower=c(shape=0.001,sc=0.001,thres=0.001),
upper=c(shape=Inf,sc=Inf,
thres=exp(tmin)),
start=wstart,
method="L-BFGS-B")
vcov(m2)
## shape sc thres
## shape 0.02312399 4.332057 -3.833264
## sc 4.33205658 1035.743511 -888.770787
## thres -3.83326390 -888.770787 824.633714
all.equal(unname(coef(m2)),unname(exp(coef(m1))),tol=1e-4)
About the same as the values above.
We can fit with a small shape, if we are a little more careful to bound the paraameters, but now we end up on the boundary for the threshold, which will cause lots of problems for the variance calculations.
set.seed(1)
dat <- data.frame(x = rweibull(1000, .53, 365) + 100)
tmin <- log(0.99 * min(dat$x))
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
lower=c(logshape=-10,logscale=0,logthres=0),
upper = c(logshape = 20, logsc = 20, logthres = tmin),
data = dat,
start = weib3_start(dat$x), method = "L-BFGS-B")
For censored data, you need to replace dweibull with pweibull; see Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf for some hints.
Another possible solution is to do Bayesian inference. Using scale priors on the shape and scale parameters and a uniform prior on the location parameter, you can easily run Metropolis-Hastings as follows. It might be adviceable to reparameterize in terms of log(shape), log(scale) and log(y_min - location) because the posterior for some of the parameters becomes strongly skewed, in particular for the location parameter. Note that the output below shows the posterior for the backtransformed parameters.
library(MCMCpack)
logposterior <- function(par,y) {
gamma <- min(y) - exp(par[3])
sum(dweibull(y-gamma,exp(par[1]),exp(par[2]),log=TRUE)) + par[3]
}
y <- rweibull(100,shape=.8,scale=10) + 1
chain0 <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=.01*diag(3))
chain <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=var(chain0))
plot(exp(chain))
summary(exp(chain))
This produces the following output
#########################################################
The Metropolis acceptance rate was 0.43717
#########################################################
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] 0.81530 0.06767 0.0004785 0.001668
[2,] 10.59015 1.39636 0.0098738 0.034495
[3,] 0.04236 0.05642 0.0003990 0.001174
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
var1 0.6886083 0.768054 0.81236 0.8608 0.9498
var2 8.0756210 9.637392 10.50210 11.4631 13.5353
var3 0.0003397 0.007525 0.02221 0.0548 0.1939

Resources