Fitting truncnorm using fitdistrplus - r

I am trying to fit a truncated normal distribution to some data. However, I have been running into the following error:
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): non-finite finite-difference value [1]>
Error in fitdist(testData, "truncnorm", start = list(a = 0, mean = 0.8, :
the function mle failed to estimate the parameters,
with the error code 100
I'm not sure what's going wrong - I've read that in some cases there can be problems fitting if the initial guesses are wrong or higher than the actual values, but I've tried a number of different start values and none seem to work.
Here is a small sample of my data, and the code I used to get the error:
library(fitdistrplus)
library(truncnorm)
testData <- c(3.2725167726, 0.1501345235, 1.5784128343, 1.218953218, 1.1895520932,
2.659871271, 2.8200152609, 0.0497193249, 0.0430677458, 1.6035277181,
0.2003910167, 0.4982836845, 0.9867184303, 3.4082793339, 1.6083770189,
2.9140912221, 0.6486576911, 0.335227878, 0.5088426851, 2.0395797721,
1.5216239237, 2.6116576364, 0.1081283479, 0.4791143698, 0.6388625172,
0.261194346, 0.2300098384, 0.6421213993, 0.2671907741, 0.1388568942,
0.479645736, 0.0726750815, 0.2058983462, 1.0936704833, 0.2874115077,
0.1151566887, 0.0129750118, 0.152288794, 0.1508512023, 0.176000366,
0.2499423442, 0.8463027325, 0.0456045486, 0.7689214668, 0.9332181529,
0.0290242892, 0.0441181842, 0.0759601229, 0.0767983979, 0.1348839304
)
fitdist(testData, "truncnorm", start = list(a = 0, mean = 0.8, sd = 0.9))

The problem is that the MLE produces increasingly negative estimates for the parameter mean as the fixed lower bound a tends to zero (note that a must be supplied via fix.arg, not via start):
fitdist(testData, "truncnorm", fix.arg=list(a=-.5),
start = list(mean = mean(testData), sd = sd(testData)))
fitdist(testData, "truncnorm", fix.arg=list(a=-.2),
start = list(mean = mean(testData), sd = sd(testData)))
fitdist(testData, "truncnorm", fix.arg=list(a=-.15),
start = list(mean = mean(testData), sd = sd(testData)))
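To make the drift visible, the same three fits can be run in a loop that prints the fitted parameters (a small sketch, equivalent to the calls above; the exact estimates depend on the data and on fitdistrplus defaults):
for (a in c(-0.5, -0.2, -0.15)) {
  fit <- fitdist(testData, "truncnorm", fix.arg = list(a = a),
                 start = list(mean = mean(testData), sd = sd(testData)))
  cat("a =", a, "=> mean =", fit$estimate["mean"],
      ", sd =", fit$estimate["sd"], "\n")
}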
One possibility to prevent large negative values for mean is to use a lower bound for the optimisation:
fitdist(testData, "truncnorm", fix.arg=list(a=0),
start = list(mean = mean(testData), sd = sd(testData)),
optim.method="L-BFGS-B", lower=c(0, 0))
However, this alters the estimation procedure; in fact you are imposing additional constraints on the parameters and might obtain different answers with different lower bounds.

Related

Why can't the mle function run with lower and upper bounds?

I tried to use mle to estimate the parameters for the negative binomial distribution. Here is my code.
library(stats4)
library(bbmle)
library(MASS)
b <- rnbinom(n = 1000, size = 3, prob = 0.1)
LL2 <- function(size, prob) {
  R = dnbinom(b, size, prob, log = TRUE)
  -sum(R)
}
When I set the mle function with lower and upper bound, I got
stats4::mle(LL2, start = list(size = 3, prob = 0.1),lower = c(-Inf,-Inf),upper = c(Inf,Inf))
Error in optim(start, f, method = method, hessian = TRUE, lower = lower, :
L-BFGS-B needs finite values of 'fn'
When I removed the bounds
stats4::mle(LL2, start = list(size = 3, prob = 0.1))
Call:
stats4::mle(minuslogl = LL2, start = list(size = 3, prob = 0.1))
Coefficients:
     size      prob
3.0467857 0.1037522
However, if I change the bounds to finite values, the error is still there.
I was wondering why this happens. Is it because the L-BFGS-B method cannot handle bound settings?
Any comments will be appreciated.
I ran your setup code with set.seed(101).
Create an instrumented version of the negative log-likelihood function so we can see where the optimizer is going:
LL2 <- function(size, prob) {
  R = dnbinom(b, size, prob, log = TRUE)
  res <- -sum(R)
  cat(size, prob, res, "\n")
  res
}
stats4::mle(LL2, start = list(size = 3, prob = 0.1),lower = c(-Inf,-Inf),upper = c(Inf,Inf))
## 3 0.1 4085.146
## 3.001 0.1 4085.166
## 2.999 0.1 4085.127
## 3 0.101 4084.767
## 3 0.099 4085.858
## 2.964666 1.099376 NaN
Error in optim(start, f, method = method, hessian = TRUE, lower = lower, :
L-BFGS-B needs finite values of 'fn'
In addition: Warning message:
In dnbinom(b, size, prob, log = TRUE) : NaNs produced
The first five evaluations are the initial value and the finite-difference approximation of the derivatives. The very next optimization step takes us to prob = 1.099, which gives an NaN result (we need 0 < prob < 1). L-BFGS-B is much more finicky than the other optimizers about non-finite values; most of the others treat a non-finite result as "bad" and try something sensible instead.
You could set the lower bound to 0 for size and bounds (0, 1) for prob ... (I tried it and it seems to work). You do have to be a little careful with L-BFGS-B: it doesn't always respect the bounds when calculating the finite-difference approximation, so if values <= 0 give non-finite results you may need to set the lower bound slightly above 0 (e.g. 0.002, since the default finite-difference epsilon is 0.001).
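A minimal sketch of that bounded call; the 0.002 offsets and the 0.998 upper bound for prob are my illustrative guards following the epsilon caveat above, not values from the original answer:
stats4::mle(LL2, start = list(size = 3, prob = 0.1),
            lower = c(0.002, 0.002), upper = c(Inf, 0.998))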

Integrate function returning roundoff error after working previously

When using integrate to integrate a lognormal density function from 2000 to Inf, I get an error. I had used a very similar expression previously with no problems.
I have tried disabling stop on error and setting rel.tol lower. I am fairly new to R and unfamiliar with it, so I apologize if neither of those is expected to have done anything.
> integrand = function(x) {(x-2000)*(1/x)*(1/(.99066*((2*pi)^.5)))*exp(-((log(x)-7.641)^2)/((2*(.99066)^2)))}
> integrate(integrand,lower=2000,upper=Inf)
1854.002 with absolute error < 0.018
#returns value fine
> integrand = function(x) {(x-2000)*(1/x)*(1/(1.6247*((2*pi)^.5)))*exp(-((log(x)-9.0167)^2)/((2*(1.6247)^2)))}
> integrate(integrand,lower=2000,upper=Inf)
Error in integrate(integrand, lower = 2000, upper = Inf) :
roundoff error is detected in the extrapolation table
#small change in the mu and sigma in the lognormal density function results in roundoff error
> integrate(integrand,lower=1293,upper=Inf)
29005.08 with absolute error < 2
#integrating on lower bound works fine, but having lower=1294 returns a roundoff error again
> integrate(integrand,lower=1294,upper=Inf)
Error in integrate(integrand, lower = 1294, upper = Inf) :
roundoff error is detected in the extrapolation table
I should be getting a value back, no? I struggle to see how slightly altering the values could cause the function to no longer integrate.
First of all, I believe you are overcomplicating the integrand by writing out the entire expression; it is better to use the built-in dlnorm function.
g <- function(x, deduce, meanlog, sdlog) {
  (x - deduce) * dlnorm(x, meanlog = meanlog, sdlog = sdlog)
}
curve(g(x, deduce = 2000, meanlog = 9.0167, sdlog = 1.6247),
      from = 1294, to = 1e4)
As for the integration problem, the cubature package generally does a better job when integrate fails. All of the following produce results with no errors.
library(cubature)
cubintegrate(g, lower = 1293, upper = Inf, method = "pcubature",
             deduce = 2000, meanlog = 9.0167, sdlog = 1.6247)
cubintegrate(g, lower = 1294, upper = Inf, method = "pcubature",
             deduce = 2000, meanlog = 9.0167, sdlog = 1.6247)
cubintegrate(g, lower = 2000, upper = Inf, method = "pcubature",
             deduce = 2000, meanlog = 9.0167, sdlog = 1.6247)
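For the lower = 2000 case there is also a closed-form cross-check not in the original exchange: the integral is the lognormal stop-loss premium E[(X - d)+], which has a standard closed form. A sketch:
# E[(X - d)+] for X ~ lognormal(meanlog, sdlog): standard stop-loss formula
stoploss_lnorm <- function(d, meanlog, sdlog) {
  exp(meanlog + sdlog^2 / 2) * pnorm((meanlog + sdlog^2 - log(d)) / sdlog) -
    d * pnorm((meanlog - log(d)) / sdlog)
}
stoploss_lnorm(2000, meanlog = 9.0167, sdlog = 1.6247)  # should match the numeric value
stoploss_lnorm(2000, meanlog = 7.641,  sdlog = 0.99066) # ~1854, as in the first call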

Understanding R-syntax in the code for Bayesian Optimization

This is with reference to this answer on implementation of Bayesian Optimization. I am unable to understand the following R-code that defines a function xgb.cv.bayes(). The code is as follows:
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma) {
  cv <- xgb.cv(params = list(booster = 'gbtree', eta = 0.05,
                             max_depth = max.depth,
                             min_child_weight = min_child_weight,
                             subsample = subsample,
                             colsample_bytree = colsample_bytree,
                             gamma = gamma,
                             lambda = 1, alpha = 0,
                             objective = 'binary:logistic',
                             eval_metric = 'auc'),
               data = data.matrix(df.train[, -target.var]),
               label = as.matrix(df.train[, target.var]),
               nround = 500, folds = cv_folds, prediction = TRUE,
               showsd = TRUE, early.stop.round = 5, maximize = TRUE,
               verbose = 0)
  list(Score = cv$dt[, max(test.auc.mean)],
       Pred = cv$pred)
}
I am unable to understand the following part of the code, which comes after the closing parenthesis of xgb.cv():
list(Score = cv$dt[, max(test.auc.mean)],
     Pred = cv$pred)
Or, very briefly, I do not understand the following pattern:
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma) {
  cv <- xgb.cv(...)
  list(...)
}
I would be grateful for help understanding this R syntax, and for pointers to where I can find more examples of it.
In R, the value of the last expression in a function is automatically the return value of that function. So the function you presented has exactly two steps:

1. compute the result of xgb.cv(...) and store it in the variable cv;
2. create a list with two entries (Score and Pred) whose values are extracted from cv.

Since the expression that creates the list is the last expression in the function, the list is automatically the return value. So if you execute test <- xgb.cv.bayes(...), you can then access test$Score and test$Pred.
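A toy example of the same pattern (hypothetical function, purely to illustrate the implicit return):
f <- function(x) {
  y <- x^2                       # intermediate computation
  list(input = x, square = y)    # last expression: becomes the return value
}
res <- f(3)
res$square  # 9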
Does this answer your question?

Error when using dtruncnorm: Argument 's_x' is not a real vector

I am attempting to fit a truncated normal distribution to a dataset of 5000 claim sizes using maximum likelihood:
l1 = function(theta) {
  -sum(dtruncnorm(x = size, a = 0, b = Inf, mean = theta[1], sd = theta[2]))
}
mle1 = optim(par = c(4, 4), fn = l1)
When I run the optim(par = c(4, 4), fn = l1) line, however, I get the error:
Error in dtruncnorm(x = size, a = 0, b = Inf, mean = theta[1], sd = theta[2]) :
Argument 's_x' is not a real vector.
I know it has something to do with the size variable but as far as I can tell it is a vector of integers since when I run typeof(size) I get "integer" as the output.
Any help is appreciated!
The error comes from the storage type of size: dtruncnorm expects a double ("real") vector at the C level, while size is stored as integer (hence typeof(size) returning "integer"). Converting with as.numeric works; for example, elementwise:
-sum(sapply(size, function(v) {
  dtruncnorm(x = as.numeric(v), a = 0, b = Inf, mean = theta[1], sd = theta[2])
}))
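A simpler sketch converts the whole vector once (dtruncnorm is vectorized over x), and also takes the log of the density, which a maximum-likelihood objective needs and which the code above omits (dtruncnorm has no log argument, so wrap it in log()):
size_num <- as.numeric(size)  # one-off conversion of the integer claim sizes
l1 <- function(theta) {
  -sum(log(dtruncnorm(x = size_num, a = 0, b = Inf,
                      mean = theta[1], sd = theta[2])))
}
mle1 <- optim(par = c(4, 4), fn = l1)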

R: error in mle initial value in 'vmmin' is not finite

library(stats4)   # for mle
library(mvtnorm)  # for dmvnorm

nll <- function(lambda, kappa) {
  logit = function(x) { log(x / (1 - x)) }
  a = c(1 - exp(-(15/lambda)^kappa), 1 - exp(-(25/lambda)^kappa), 1 - exp(-(35/lambda)^kappa))
  a = logit(a)
  mu = c(0.1, 0.2, 0.3)
  mu = logit(mu)
  cov = matrix(c(0.18830690, 0.00235681, 0.00071954,
                 0.00235681, 0.00736811, 0.00110457,
                 0.00071954, 0.00110457, 0.00423955), nrow = 3)
  L1 = dmvnorm(a, mu, cov)
  a = c(1 - exp(-(25/lambda)^kappa), 1 - exp(-(35/lambda)^kappa), 1 - exp(-(45/lambda)^kappa))
  a = logit(a)
  mu = c(0.4, 0.1, 0.9)
  mu = logit(mu)
  cov = matrix(c(2.7595442, 0.0045178, 0.0010505,
                 0.0045178, 0.00972309, 0.0015120,
                 0.0010505, 0.0015120, 0.0088425), nrow = 3)
  L2 = dmvnorm(a, mu, cov)
  -sum(log(L1 * L2))
}
> mle(nll, start = list(lambda = 1, kappa = 1))
Error in optim(start, f, method = method, hessian = TRUE, ...) :
initial value in 'vmmin' is not finite
I'm trying to find the lambda and kappa values that maximize the function above.
My original likelihood function returns L1*L2, but because the mle function requires the negative log-likelihood to be passed in, I modified the function to return -sum(log(L1*L2)) instead.
However, I ran into the above error. And I've also tried specifying dmvnorm(... ,log = TRUE) but that didn't solve the problem.
L1 and L2 are both scalars. Assuming we're going to pass log=TRUE to dmvnorm so they are each log-likelihoods, do you mean just -(L1+L2) in the final line?
By calling debug(nll) and then nll(lambda = 1, kappa = 1), stepping through the code until we hit a non-finite value and backtracking, we see that 1 - exp(-(45/lambda)^kappa) is exactly 1 in floating point: exp(-45) is far smaller than machine epsilon (about 2.2e-16), so 1 - exp(-45) rounds to 1, and the final element of logit(a) is therefore infinite ...
So if I use dmvnorm(..., log = TRUE) in both places, change the last line to return -(L1 + L2), and change the initial value of lambda to 10, I get a finite value for nll(10, 1) (about 4474), and stats4::mle(nll, start = list(lambda = 10, kappa = 1)) gives:
Coefficients:
   lambda     kappa
40.622673  4.883857
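Putting those fixes together, here is a sketch of the corrected function (same constants as the question; only the dmvnorm calls, the return line, and the starting value change):
library(stats4)
library(mvtnorm)

nll <- function(lambda, kappa) {
  logit <- function(x) log(x / (1 - x))
  a1 <- logit(c(1 - exp(-(15/lambda)^kappa),
                1 - exp(-(25/lambda)^kappa),
                1 - exp(-(35/lambda)^kappa)))
  mu1 <- logit(c(0.1, 0.2, 0.3))
  cov1 <- matrix(c(0.18830690, 0.00235681, 0.00071954,
                   0.00235681, 0.00736811, 0.00110457,
                   0.00071954, 0.00110457, 0.00423955), nrow = 3)
  L1 <- dmvnorm(a1, mu1, cov1, log = TRUE)   # log-likelihood of first block

  a2 <- logit(c(1 - exp(-(25/lambda)^kappa),
                1 - exp(-(35/lambda)^kappa),
                1 - exp(-(45/lambda)^kappa)))
  mu2 <- logit(c(0.4, 0.1, 0.9))
  cov2 <- matrix(c(2.7595442, 0.0045178, 0.0010505,
                   0.0045178, 0.00972309, 0.0015120,
                   0.0010505, 0.0015120, 0.0088425), nrow = 3)
  L2 <- dmvnorm(a2, mu2, cov2, log = TRUE)   # log-likelihood of second block

  -(L1 + L2)   # negative log-likelihood; L1, L2 are already on the log scale
}

nll(10, 1)  # finite, about 4474
mle(nll, start = list(lambda = 10, kappa = 1))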
