I need the R code for setting a threshold while fitting a generalized Pareto distribution.
I've just made an R package that serves exactly this purpose, namely gfiExtremes (now also on CRAN).
remotes::install_github("stla/gfiExtremes", build_vignettes = TRUE)
It lets you perform inference on the quantiles of a generalized Pareto distribution model and on the parameters of the Pareto exceedance distribution, with or without assuming that the exceedance threshold is known.
Usage
The data must be given as a numeric vector, say x.
library(gfiExtremes)
gf <- gfigpd2(x, beta = c(0.99, 0.995, 0.999))
summary(gf) # provides estimates and confidence intervals of the beta-quantiles
thresholdEstimate(gf) # an estimate of the threshold
See the examples in the package documentation and the vignette for more info.
Related
When using the GAMLSS package in R, there are many different ways to fit a distribution to a set of data. My data is a single vector of values, and I am fitting a distribution over these values.
My question is this: what is the main difference between using fitDist() and gamlss() since they give similar but different answers for parameter values, and different worm plots?
Also, using the function confint() works for gamlss() fitted objects but not for objects fitted with fitDist(). Is there any way to produce confidence intervals for parameters fitted with the fitDist() function? Is there an accuracy difference between the two procedures? Thanks!
m1 <- fitDist()
fits many distributions and chooses the best according to a generalized Akaike information criterion, GAIC(k), with penalty k for each fitted parameter in the distribution, where k is specified by the user, e.g.
k = 2 for AIC,
k = log(n) for BIC,
k = 4 for a Chi-squared test (rounded from 3.84, the 5% critical value of a Chi-squared distribution with 1 degree of freedom), which is my preference.
m1$fits
gives the full results from the best to worst distribution according to GAIC(k).
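To make the penalty concrete, here is a minimal Python sketch of the GAIC(k) ranking described above, i.e. -2 times the maximized log-likelihood plus k per fitted parameter. The log-likelihood values, parameter counts and sample size are made up for illustration and are not output from fitDist().

```python
import math

def gaic(log_lik, n_params, k=2.0):
    """Generalized AIC: -2 * log-likelihood + k * (number of fitted parameters)."""
    return -2.0 * log_lik + k * n_params

# Hypothetical fits: name -> (maximized log-likelihood, number of parameters).
fits = {"NO": (-152.3, 2), "GA": (-150.9, 2), "BCT": (-149.8, 4)}
n = 100  # hypothetical sample size, only needed for the BIC penalty

for label, k in [("AIC", 2.0), ("BIC", math.log(n)), ("k=4", 4.0)]:
    # Smaller GAIC is better, so sort in ascending order.
    ranked = sorted(fits, key=lambda name: gaic(*fits[name], k=k))
    print(label, ranked)
```

A larger k penalizes the four-parameter model more heavily, which is why the ranking can change between AIC and BIC.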
I want to use a mixture of Gamma distributions as a parametric model for survival analysis on censored data using R. The "flexsurv" package offers several distributions, but I couldn't find a Gamma mixture distribution. The package documentation states that:
"Any user-defined parametric distribution can be fitted, given at least an R function defining the probability density or hazard."
https://cran.r-project.org/web/packages/flexsurv/flexsurv.pdf
Is there a way to directly define a Gamma mixture distribution (with a pre-specified number of components) in a parametric way to directly use this package for the maximum likelihood estimation?
data <- Surv(ages, censored)
fit_gammamixture <- flexsurvreg(data~1, dist=???)
I've found this paper on survival analysis with a mixture of gamma distributions, but the algorithm it presents is hard to understand and implement.
Modeling Censored Lifetime Data Using a Mixture of Gammas Baseline
https://projecteuclid.org/download/pdf_1/euclid.ba/1340371053
I am trying to use Julia to estimate a continuous univariate distribution using N observed data points (stored as an array of Float64 numbers), and then sample from this estimated distribution. I have no prior knowledge restricting attention to some family of distributions.
I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.
Any help/tips would be much appreciated.
Without any restrictions on the estimated distribution, a natural candidate would be the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about convergence to the true distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
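As a rough illustration of that convergence result, the Dvoretzky–Kiefer–Wolfowitz inequality bounds P(sup|F_n - F| > eps) by 2*exp(-2*n*eps^2), so a uniform band around the empirical CDF has half-width sqrt(log(2/alpha) / (2n)). A minimal Python sketch, with a made-up dataset and helper names chosen only for this example:

```python
import math

def ecdf(data, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

def dkw_halfwidth(n, alpha=0.05):
    """Half-width eps with sup|F_n - F| <= eps holding with probability >= 1 - alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

dataset = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]  # hypothetical observations
eps = dkw_halfwidth(len(dataset))
print(ecdf(dataset, 3.4), eps)
```

The half-width shrinks like 1/sqrt(n), so quadrupling the number of observations halves the band.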
With this choice, sampling is especially simple. If dataset is a list of current samples, then dataset[rand(1:length(dataset),sample_size)] is a set of new samples from the empirical distribution. With the Distributions package, it could be more readable, like so:
using Distributions
new_sample = sample(dataset,sample_size)
Finally, kernel density estimation is also good, but it requires choosing a kernel and its width, which amounts to a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each drawn value using a sample from the kernel function.
For example, if the kernel function is a Normal distribution of width w, then the perturbed sample could be calculated as:
new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size)
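The same two steps can be sketched in Python using only the standard library. The function name sample_kde, the dataset and the width w = 0.5 are assumptions made for this example; choosing w well is a separate problem that this sketch does not address.

```python
import random

def sample_kde(dataset, sample_size, w, rng=random):
    """Sample from a Gaussian kernel density estimate with kernel width w."""
    # Step 1: draw from the empirical distribution (with replacement).
    picks = rng.choices(dataset, k=sample_size)
    # Step 2: perturb each draw with Normal(0, w^2) noise.
    return [x + rng.gauss(0.0, w) for x in picks]

random.seed(1)  # only so the sketch is reproducible
dataset = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]  # hypothetical observations
new_sample = sample_kde(dataset, 1000, w=0.5)
```

With w = 0 the perturbation vanishes and the function reduces to plain resampling from the empirical distribution.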
I am converting SAS code to R, and one feature I need is the lognormal fit in the SAS univariate procedure, used with histograms and midpoints. The result is a table containing the following variables:
EXPPCT - estimated percent of population in histogram interval determined from optional fitted distribution (here it is lognormal)
OBSPCT - percent of variable values in histogram interval
VAR - variable name
MIDPT - midpoint of histogram interval
There is an option in SAS to consider the MLE of the zeta, theta and sigma parameters while applying the distribution.
I was able to figure out how to do this in R. My only problem is the likelihood estimation: SAS estimates all three parameters, and R gives me different values.
I am using the following for MLE in R.
library(fitdistrplus)
set.seed(0)
cd <- rlnorm(40,4)
pars <- coef(fitdist(cd, "lnorm"))
meanlog sdlog
4.0549354 0.8620153
I am using the following for MLE in SAS (with the est option).
proc univariate data = testing;
histogram cd /lognormal (theta = est zeta=est sigma=est)
midpoints = 1 to &maxx. by 100
outhistogram = this;
run;
&maxx denotes the maximum of the input. The results of the run from SAS can be found here.
I am new to statistics. I cannot find which method SAS uses for the MLE, and I have no clue how to reproduce the same estimates in R.
Thanks in advance.
I found the packages EnvStats and FAdist, which let me estimate the threshold parameter and then use the estimates to fit the three-parameter lognormal distribution. Backlin was right about the parameters. Right now the parameters are not an exact match, but the end result is the same as in SAS. Thank you very much.
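For reference, the three-parameter lognormal with threshold theta, scale zeta and shape sigma has CDF Phi((log(x - theta) - zeta) / sigma) for x > theta. The Python sketch below computes an expected percentage per histogram interval from that CDF, assuming EXPPCT is the fitted probability mass of each interval and that midpoints are equally spaced. The parameter values are made up, not estimates from the data in the question.

```python
import math
from statistics import NormalDist

def lnorm3_cdf(x, theta, zeta, sigma):
    """CDF of the three-parameter lognormal (threshold theta, scale zeta, shape sigma)."""
    if x <= theta:
        return 0.0
    return NormalDist().cdf((math.log(x - theta) - zeta) / sigma)

def expected_pct(midpoints, theta, zeta, sigma):
    """Expected percent of the population in each interval around equally spaced midpoints."""
    half = (midpoints[1] - midpoints[0]) / 2.0
    return [
        100.0 * (lnorm3_cdf(m + half, theta, zeta, sigma)
                 - lnorm3_cdf(m - half, theta, zeta, sigma))
        for m in midpoints
    ]

midpoints = list(range(1, 502, 100))  # mirrors "midpoints = 1 to &maxx. by 100"
pct = expected_pct(midpoints, theta=0.0, zeta=4.0, sigma=1.0)  # hypothetical parameters
print(pct)
```

Fixing theta = 0 gives the ordinary two-parameter lognormal, which is what fitdist(cd, "lnorm") estimates. That is one reason the R and SAS estimates can differ when SAS also estimates theta.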
I want to fit a distribution to my data. I use the fitdistrplus package in R to find the distribution. I can compare the goodness-of-fit results for different distributions to see which one fits my data best, but I don't know how to check the p-value of the goodness-of-fit test for each distribution. The results might show that among gamma, lognormal and exponential, the exponential distribution has the lowest Anderson-Darling statistic, but I don't know how to check whether the p-values for these tests reject the null hypothesis. Is there any built-in function in R which gives the p-values?
Here is a piece of code I used as an example:
d <- sample(100,50)
library(fitdistrplus)
descdist(d)
fitg <- fitdist(d,"gamma")
fitg2 <- fitdist(d,"exp")
gofstat(list(fitg,fitg2))
This code draws 50 random integers from 1 to 100 (without replacement) and tries to find the best-fitting model for these data. If descdist(d) suggests that gamma and exponential are the two best candidates, fitg and fitg2 fit the corresponding models. The last line compares the Kolmogorov-Smirnov and Anderson-Darling statistics to show which distribution fits best; the distribution with the lower value is the better one. However, I don't know how to find p-values for fitg and fitg2 before comparing them. If the p-values show that none of these distributions fits the data, then to my knowledge there is no point in comparing their goodness-of-fit statistics.
Any help is appreciated.
Thanks
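One common approach, which does not come from this thread, is a parametric bootstrap: standard table p-values are not valid when the parameters were estimated from the same data, so the null distribution of the statistic is simulated instead. Below is a minimal Python sketch for the exponential case with the Kolmogorov-Smirnov statistic. The function names and the number of bootstrap replicates are assumptions made for this example.

```python
import math
import random

def ks_stat_exp(data, rate):
    """Kolmogorov-Smirnov distance between the data and an Exponential(rate) CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - math.exp(-rate * x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def bootstrap_pvalue_exp(data, n_boot=500, rng=random):
    """Parametric-bootstrap p-value for the exponential fit, refitting on each replicate."""
    n = len(data)
    rate_hat = n / sum(data)  # maximum likelihood estimate of the rate
    d_obs = ks_stat_exp(data, rate_hat)
    exceed = 0
    for _ in range(n_boot):
        sim = [rng.expovariate(rate_hat) for _ in range(n)]
        rate_sim = n / sum(sim)  # refit, so estimation error is accounted for
        if ks_stat_exp(sim, rate_sim) >= d_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)

random.seed(0)  # only so the sketch is reproducible
d = random.sample(range(1, 101), 50)  # mirrors sample(100, 50) in the question
p_value = bootstrap_pvalue_exp(d)
print(p_value)
```

A small p-value suggests the exponential model is a poor description of the data. The same scheme applies to other distributions and statistics, as long as each simulated sample is refitted in the same way as the original data.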