Fréchet distribution parameter estimation in R?

I need to estimate the parameters of a Fréchet distribution. I am using the fitdistrplus and evd packages in R, but I don't know what values to use to initialize the parameters.
library(fitdistrplus)
library(evd)
# Data
x<-c(19.1,20.2,14.3,19.0,18.8,18.5,20.0,18.6,11.4,15.6,17.4,16.2,15.7,14.3,14.9,14.0,20.2,17.4,18.6,17.0,16.0,12.2,10.8,12.4,10.2,19.8,23.4)
fit.frechet<-fitdist(x,"frechet")
which generates the following error:
Error in computing default starting values.
Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = data, :
Error in start.arg.default(obs, distname) :
Unknown starting values for distribution frechet.
When I supply starting values for the parameters:
fit.frechet2<-fitdist(x,"frechet", start = list(loc=0,scale=1, shape=1))
Output:
Warning messages:
1: In fitdist(x, "frechet", start = list(loc = 0, scale = 1, shape = 1)) :
The dfrechet function should return a vector of with NaN values when input has inconsistent parameters and not raise an error
2: In fitdist(x, "frechet", start = list(loc = 0, scale = 1, shape = 1)) :
The pfrechet function should return a vector of with NaN values when input has inconsistent parameters and not raise an error
3: In sqrt(diag(varcovar)) : NaNs produced
4: In sqrt(1/diag(V)) : NaNs produced
5: In cov2cor(varcovar) :
diag(.) had 0 or NA entries; non-finite result is doubtful
Fitting of the distribution ' frechet ' by maximum likelihood
Parameters:
estimate Std. Error
loc -12128345 40.10705
scale 12128360 40.10705
shape 3493998 NaN
How can I estimate the parameters of the Fréchet distribution in R?

Well, you could try limiting your parameter values and starting with some reasonable estimates. For example,
fit.frechet<-fitdist(x, "frechet", method = "mle", lower = c(0, 0, 0), start = list(loc=1,scale=12, shape=4))
will produce a couple of expected warnings, and
print(fit.frechet)
will print somewhat reasonable values:
loc 2.146861e-07
scale 1.449643e+01
shape 4.533351e+00
along with a plot of the fit vs. the empirical distribution:
plot(fit.frechet,demp=TRUE)
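If you prefer to derive starting values from the data instead of guessing them, one rough heuristic (just a sketch, not anything built into fitdistrplus) is to pin loc at 0, start scale at the sample median, and pick a moderate shape:
# Sketch only: data-driven starting values; median(x) and shape = 2 are heuristics
start.guess <- list(loc = 0, scale = median(x), shape = 2)
fit.frechet3 <- fitdist(x, "frechet", method = "mle",
                        lower = c(0, 0, 0), start = start.guess)
print(fit.frechet3)
With these starting points the optimizer should end up near the same fit as above, with the same sort of warnings along the way.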
UPDATE
I would say that the Fréchet distribution might not be a good fit for your data. I tried a Weibull and it looks a lot better; check it yourself:
fit.weibull<-fitdist(x, "weibull", method = "mle", lower = c(0, 0))
print(fit.weibull)
plot(fit.weibull, demp=TRUE)
Output is
shape 5.865337
scale 17.837188
Note that the scale parameter is fairly similar and could have been guessed just from the histogram. The plot for the Weibull fit looks quite good, given the data.
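If you want a numerical comparison rather than eyeballing the plots, gofstat from fitdistrplus can compare the two fits directly (this assumes both fit objects from above are still in the workspace):
# Goodness-of-fit statistics (KS, AD, AIC, BIC, ...) for both candidate fits
gofstat(list(fit.frechet, fit.weibull), fitnames = c("frechet", "weibull"))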

Related

Error in shape() function in evir library

I have a dataframe where a column is a mix of positive and negative numbers and the first entry is NA. I'm trying to run the shape function as
shape(data$col, models = 30, start = 30, end = 400, ci=.90,reverse = TRUE,auto.scale = TRUE)
where the data in 'col' is [NA, -0.2663194135, -3.7665034719, -0.2072122334, 1.5721742718, -9.142419, -8.954330, -5.167314, 11.805930, 9.533830, 7.065835]
but I get an error that says
Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
non-finite value supplied by optim
Can someone help me figure out what it means? I've googled it but haven't found anything concrete.
It's not clear what you are trying to do here. Calling shape allows you to see how altering the threshold or nextremes parameters in the gpd function will alter the xi parameter of the resulting generalised Pareto distribution model.
There are a few reasons why the example you supplied doesn't work. Let's first of all show an example of what does work. The exponential distribution is a special case of a GPD with mu = 0 and xi = 0, so a sample drawn from the exponential distribution should do the trick:
library(evir) # For the shape() function
set.seed(69) # Makes this example reproducible
x <- rexp(300) # Random sample of 300 elements drawn from exponential distribution
shape(x)
Fine.
However, your sample contains an NA. What happens if we make a single value NA in our sample?
x[1] <- NA
shape(x)
#> Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
#> non-finite value supplied by optim
So, no NAs allowed.
Unfortunately, you will find that you still get the same error if you remove your NA value. There are two reasons for this. Firstly, you have 9 non-NA samples. What happens if we try a length-9 exponential sample?
shape(rexp(9))
#> Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
#> non-finite finite-difference value [1]
We will find that the model will fail to fit with fewer than about 16 data points.
But that's not the only problem. What if we try to get a plot for data that can't be drawn from a generalized Pareto distribution?
# Maybe a uniform distribution?
shape(runif(300, 1, 10))
#> Error in optim(theta, negloglik, hessian = TRUE, ..., tmp = excess) :
#> non-finite finite-difference value [1]
#> In addition: Warning message:
#> In sqrt(diag(varcov)) : NaNs produced
#>
So in effect, you need a bigger sample with no NAs, and it needs to conform approximately to a GPD, otherwise the gpd function will throw an error.
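As a concrete sketch of those requirements (assuming your column really is data$col as in your post; the cutoff of 16 is just the rough figure observed above):
library(evir)

dat <- na.omit(data$col)  # shape() cannot handle NA values
if (length(dat) >= 16) {  # roughly the minimum sample size noted above
  shape(dat)
} else {
  message("Too few non-NA observations for shape(); a larger sample is needed")
}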
I might be able to help if you let us know the bigger picture of what you are trying to do.

Error using fitdist with truncated distributions

I am trying to fit a certain truncated distribution to a data set. For example, for a lognormal distribution, I define the density function of the truncated distribution as:
dtlnorm <- function(x, meanlog, sdlog,low)
dlnorm(x,meanlog,sdlog)/(1-plnorm(low,meanlog,sdlog))* (x >= low)
where low is the truncation point.
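As a quick sanity check of that definition (the parameter values below are arbitrary illustrative choices, not estimates for your data), the density should be zero below low and integrate to 1 above it:
# Illustrative parameters only: meanlog = 0, sdlog = 1, truncation at low = 0.5
dtlnorm(c(0.1, 1, 2), meanlog = 0, sdlog = 1, low = 0.5)  # first value is 0
integrate(dtlnorm, lower = 0.5, upper = Inf,
          meanlog = 0, sdlog = 1, low = 0.5)              # should be close to 1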
My data is the following vector
Data <- c(1068295.00589834, 1406446.49289834, 1540330.78489834, 1152321.94489834,
3108649.66189834, 3718417.97089834, 2981945.18089834, 4552923.31989834,
5747260.98289834, 2105461.57989834, 1044515.95889834, 1133641.75289834,
3847920.72789834, 2536441.02989834, 3073854.15789834, 1591039.28389834,
2592446.73289834, 4989152.55189834, 2426457.45489834, 120265066.499898,
6888222046.1999, 1092811.87089834, 3440123.51689834, 74684298.1398983,
1475038.27689834, 1124226.39489834, 11739544.5798983, 1187688.74489834,
1023193.88789834, 18784663.9698983)
To fit the distribution, I write:
fitdist(Data,distr="tlnorm",method="mle",start = list(meanlog=0,sdlog=0),fix.arg = list(low=100))
But the following error appears:
Error in fitdist(Data, distr = "tlnorm", method = "mle", start = list(meanlog = 0, :
the function mle failed to estimate the parameters,
with the error code 100
I do not know what is happening. Can somebody help me? Thank you!

Error with fitting a Generalized Extreme Value (GEV) using `extRemes` in R?

I have some data and I want to fit a Generalized Extreme Value (GEV) distribution using the extRemes package in R. However, an error occurs:
library(extRemes)
Mydata = c(6,3,3,3,5,5,4,3,5,5,4,3,4,4,6,5,5,4,5,2,6,4,6,5,3,3,8,3,4,4,6,6,6,6,6,5,6,6,5,5)
fit_gev <- fevd(x=Mydata, method = "MLE", type="GEV", period.basis = "year")
summary(fit_gev)
Error in diag(cov.theta) : invalid 'nrow' value (too large or NA)
In addition: Warning message:
In diag(cov.theta) : NAs introduced by coercion
I wonder how can I fix this error? Thanks for any help.
You could fit it with the EnvStats package as follows:
library(EnvStats)
# Data
Mydata =
c(6,3,3,3,5,5,4,3,5,5,4,3,4,4,6,5,5,4,5,2,6,4,6,5,3,3,8,3,4,4,6,6,6,6,6,5,6,6,5,5)
# Generalized Extreme Value (EnvStats)
egevd(Mydata, method = "pwme")  # Method: probability-weighted moments
Results of Distribution Parameter Estimation
--------------------------------------------
Assumed Distribution: Generalized Extreme Value
Estimated Parameter(s): location = 4.268896
scale = 1.314489
shape = 0.353434
Estimation Method: Unbiased pwme
Data: Mydata
Sample Size: 40
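If you need the estimates programmatically, you can store the fit; a small sketch (assuming the usual EnvStats estimation-object structure, in which the estimates live in the parameters component):
# Sketch: keep the fitted object and pull out the estimated parameters
fit_gev <- egevd(Mydata, method = "pwme")
fit_gev$parameters  # named vector: location, scale, shape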

Power law fitted by `fitdist()` function in package `fitdistrplus`

I generate some random variables using the rplcon() function in the poweRlaw package:
data <- rplcon(1000,10,2)
Now, I want to know which known distributions fit the data best. Lognormal? Exponential? Gamma? Power law? Power law with exponential cutoff?
So I use the fitdist() function from the fitdistrplus package:
fit.lnormdl <- fitdist(data,"lnorm")
fit.gammadl <- fitdist(data, "gamma", lower = c(0, 0))
fit.expdl <- fitdist(data,"exp")
Because the power law distribution and the power law with exponential cutoff are not among the base probability distributions listed in the CRAN Task View: Probability Distributions, I wrote the d, p, and q functions of the power law based on example 4 of ?fitdist:
dplcon <- function (x, xmin, alpha, log = FALSE)
{
if (log) {
pdf = log(alpha - 1) - log(xmin) - alpha * (log(x/xmin))
pdf[x < xmin] = -Inf
}
else {
pdf = (alpha - 1)/xmin * (x/xmin)^(-alpha)
pdf[x < xmin] = 0
}
pdf
}
pplcon <- function (q, xmin, alpha, lower.tail = TRUE)
{
cdf = 1 - (q/xmin)^(-alpha + 1)
if (!lower.tail)
cdf = 1 - cdf
cdf[q < round(xmin)] = 0
cdf
}
qplcon <- function(p,xmin,alpha) alpha*p^(1/(1-xmin))
Finally, I use the code below to get the xmin and alpha parameters of the power law:
fitpl <- fitdist(data,"plcon",start = list(xmin=1,alpha=1))
But it throws an error:
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Error in fitdist(data, "plcon", start = list(xmin = 1, alpha = 1)) :
the function mle failed to estimate the parameters,
with the error code 100
I searched Google and Stack Overflow, and many questions with similar errors appear, but after reading and trying their solutions, none of them worked for my problem. What should I do to estimate the parameters correctly?
Thank you to everyone who can help!
This was an interesting one. I am not entirely happy with what I discovered, but I will tell you what I found and see if it helps.
On calling the fitdist function, by default it wants to use mledist from the same package. This itself results in a call to stats::optim, which is a general optimization function. In its return value it gives a convergence error code; see ?optim for details. The 100 you see is not one of the codes returned by optim, so I pulled apart the code for mledist and fitdist to find where that error code comes from. Unfortunately it is defined in more than one place and is a general trap error code. If you break down all of the code, what fitdist is trying to do here is essentially the following (subject to various checks beforehand):
fnobj <- function(par, fix.arg, obs, ddistnam) {
-sum(do.call(ddistnam, c(list(obs), as.list(par),
as.list(fix.arg), log = TRUE)))
}
vstart = list(xmin = 1, alpha = 1)  # the starting values from the question
ddistname=dplcon
fix.arg = NULL
meth = "Nelder-Mead"
lower = -Inf
upper = Inf
optim(par = vstart, fn = fnobj,
fix.arg = fix.arg, obs = data, ddistnam = ddistname,
hessian = TRUE, method = meth, lower = lower,
upper = upper)
If we run this code we find a more useful error: "function cannot be evaluated at initial parameters". This makes sense if we look at the function definition: having xmin = 0 or alpha = 1 yields a log-likelihood of -Inf. OK, so the next thought is to try different initial values. I tried a few random choices, but they all returned a new error, "non-finite finite-difference value [1]".
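You can see the -Inf directly by evaluating the density at the question's starting values (with alpha = 1, log(alpha - 1) is log(0)):
# With alpha = 1 the log-density is -Inf for every observation,
# so the negative log-likelihood cannot be evaluated
dplcon(10, xmin = 1, alpha = 1, log = TRUE)
#> [1] -Inf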
Searching the optim source further for these two errors, they are not part of the R source itself; there is, however, a .External2 call, so I can only assume the errors come from there. The non-finite error implies that one of the function evaluations somewhere gives a non-numeric result, which dplcon will do when alpha <= 1 or xmin <= 0. fitdist lets you specify additional arguments that get passed on to mledist (or another fitting function, depending on the method you choose; mle is the default), one of which is lower, for controlling lower bounds on the parameters to be optimized. So I tried imposing these limits and trying again:
fitpl <- fitdist(data,"plcon",start = list(xmin=1,alpha=2), lower = c(xmin = 0, alpha = 1))
Annoyingly, this still gives error code 100. Tracking it down yields the error "L-BFGS-B needs finite values of 'fn'". The optimization method changes from the default Nelder-Mead to L-BFGS-B once you specify bounds, and somewhere in the external C code this error arises, presumably close to the limits of xmin or alpha, where the numerical calculation becomes unstable as the log-likelihood heads towards infinity.
I decided to do quantile matching rather than maximum likelihood to find out more:
fitpl <- fitdist(data,"plcon",start = list(xmin=1,alpha=2),
method= "qme",probs = c(1/3,2/3))
fitpl
## Fitting of the distribution ' plcon ' by matching quantiles
## Parameters:
## estimate
## xmin 0.02135157
## alpha 46.65914353
which suggests that the optimum value of xmin is close to 0, its lower limit. The reason I am not satisfied is that I can't get a maximum-likelihood fit of the distribution using fitdist; however, hopefully this explanation helps, and the quantile matching gives an alternative.
Edit:
After learning a little more about power-law distributions in general, it makes sense that this does not work as you expect. The power parameter alpha has a likelihood function which can be maximised conditional on a given xmin. However, no such expression exists for xmin, since the likelihood function is increasing in xmin. Typically, estimation of xmin is based on a Kolmogorov-Smirnov statistic; see this MathOverflow question and the d_jss_paper vignette of the poweRlaw package for more information and associated references.
There is functionality to estimate the parameters of the power law distribution in the poweRlaw package itself.
m = conpl$new(data)                # continuous power-law object
xminhat = estimate_xmin(m)$xmin    # xmin chosen by minimising the Kolmogorov-Smirnov distance
m$setXmin(xminhat)
alphahat = estimate_pars(m)$pars   # alpha by maximum likelihood, conditional on xmin
c(xmin = xminhat, alpha = alphahat)
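As a follow-up visual check (a sketch using the poweRlaw plotting methods), you can overlay the fitted line on the empirical distribution:
plot(m)                     # data on log-log axes
lines(m, col = 2, lwd = 2)  # fitted power law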

Issues plotting count distribution with distplot()

I have count data. I'm trying to document my decision to use a negative binomial distribution rather than Poisson (I couldn't get a quasi-Poisson distribution in lme4) and am having graphical issues (the vector is appended to the end of the post).
I've been trying to use the distplot() function to inform my decision about which distribution to model.
Here's the outcome variable (physician count):
plot(d1.2$totalmds)
which might look Poisson, but the mean and variance aren't close (the variance is doubled by two extreme values, but is still not anywhere near the mean):
> var(d1.2$totalmds, na.rm = T)
[1] 114240.7
> mean(d1.2$totalmds, na.rm = T)
[1] 89.3121
My outcome is partly population-driven, so I'm using the total population as an offset variable in preliminary models. This, as I understand it, divides the outcome by the natural log of the offset variable, so totalmds/log(poptotal) is essentially what's being modeled, which looks something like:
But when I try to model this using:
plot 1: distplot(x = d1.2$totalmds, type = "poisson")
plot 2: distplot(x = d1.2$totalmds, type = "nbinomial") # looks way off
plot 3: plot(fitdist(data = d1.2$totalmds, distr = "pois", method = "mle"))
plot 4: plot(fitdist(data = d1.2$totalmds, distr = "nbinom", method = "mle")) # throws warnings
plot 5: qqcomp(fitdist(data = d1.2$totalmds, distr = "pois", method = "mle"))
plot 6: qqcomp(fitdist(data = d1.2$totalmds, distr = "nbinom", method = "mle")) # throws warnings
Does anyone have suggestions for why the following plots look a little screwy/inconsistent?
As I mentioned I'm using another variable as an offset variable in my actual analysis, if that makes a difference.
Here's the vector:
https://gist.github.com/timothyslau/f95a777b713eb33a2fe6
I'm fairly sure NB is better than Poisson, since var(d1.2$totalmds) / mean(d1.2$totalmds), the variance-to-mean ratio (VMR), is > 1.
But if NB is appropriate the plots should look a lot cleaner (I think, unless I'm doing something wrong with these plotting functions/packages).
