Finding the distribution from the data in R

I am trying to find the distribution for this dataset. I tried the fitdistrplus package:
data <- data.matrix(Book1)
descdist(data, discrete = FALSE)
but get this error:
Error in descdist(data, discrete = FALSE) : data must be a numeric vector

descdist() expects a numeric vector, and data.matrix() returns a matrix. Convert with as.numeric() instead:
data <- as.numeric(Book1)
descdist(data, discrete = FALSE)
This gets you the Cullen and Frey (skewness-kurtosis) graph, and these values:
summary statistics
------
min: 3 max: 35
median: 5
mean: 6.244898
estimated sd: 3.517
estimated skewness: 1.977063
estimated kurtosis: 9.456783
If you then decide that the closest is an exponential distribution, you can get its parameters like this:
ft <- fitdist(data, distr = "exp" )
ft
Fitting of the distribution ' exp ' by maximum likelihood
Parameters:
estimate Std. Error
rate 0.1601307 0.002299016
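As a quick sanity check, the maximum-likelihood estimate of the exponential rate is the reciprocal of the sample mean:
1 / mean(data)  # 1/6.244898 = 0.1601..., matching the fitted rate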
And you can compare the fitted density with the data using:
denscomp(ft)
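If you want a numeric comparison in addition to the plot, fitdistrplus also provides gofstat(); a minimal sketch, fitting a gamma as a second, purely illustrative candidate:
ft_gamma <- fitdist(data, distr = "gamma")
# Goodness-of-fit statistics (KS, AD, AIC, BIC, ...) for both candidates
gofstat(list(ft, ft_gamma), fitnames = c("exp", "gamma"))
# Overlay both fitted densities on the data
denscomp(list(ft, ft_gamma), legendtext = c("exp", "gamma"))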

Error in fitdist function: the function mle failed to estimate the parameters, with the error code 1 (Rayleigh distribution)

I am trying to fit my data to a Rayleigh distribution using the fitdist function from the fitdistrplus package.
x <- c(19.000000,23.000000,26.000000,45.000000,8.050000,46.900000,1.268333,30.000000,
1.466667,3.733333,1.683333,4.000000,3.950000,1.850000,42.000000,1.333333,
1.550000,1.000000,2.066667,1.566667,1.216667,1.850000,1.400000,8.366667,
19.000000,29.000000,17.000000,42.000000,19.000000,10.000000,53.000000,2.550000,
15.483333,1.533333,1.216667,1.550000,32.000000,6.583333,6.516667,5.750000,
9.283333,46.000000,2.016667,2.133333,4.516667,46.950000,1.600000,1.433333,
3.166667,4.416667,17.016667,2.433333,2.713333,8.633333,3.150000,1.183333,
14.000000,10.706667,7.026944,31.000000,35.000000,21.000000,14.000000,2.200000,
26.000000,3.316667,51.000000,13.000000,34.000000,11.650000,49.000000,12.000000,
26.000000,20.000000,22.000000,6.483333,24.000000,5.333333,4.833333,8.750000,
6.216667,17.000000,1.083333,19.000000,48.000000,15.000000,1.266667,54.000000,
32.000000,3.616667,6.666667,1.600000,2.083333,6.933333,33.033333,1.883333,
1.000000,3.072222,49.000000,1.400000)
dat <- data.frame(x)
# Kernel density estimate of the data (for plotting; not used in the fit below)
den <- density(x)
orig <- data.frame(x = den$x, y = den$y)
fit.params.2 <- fitdistrplus::fitdist(dat$x, "rayleigh", start = list(sigma = 1))
Then an error occurs:
Error in fitdistrplus::fitdist(dat$x, "rayleigh", start = list(sigma = 1)) :
the function mle failed to estimate the parameters,
with the error code 1
Is there any solution to this problem? Thanks for any help.
There are two problems: (1) the Rayleigh distribution does not seem to be a good fit to the data (see the plot output below), and (2) you need a better starting value. Since sigma is proportional to the mean for the Rayleigh distribution (see Wikipedia), try this:
library(fitdistrplus)
library(extraDistr)
fit <- fitdist(dat$x, "rayleigh", start = list(sigma = mean(dat$x)))
fit
## Fitting of the distribution ' rayleigh ' by maximum likelihood
## Parameters:
## estimate Std. Error
## sigma 15.00063 0.749935
plot(fit)
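Since the plot suggests the Rayleigh is a poor fit, you can compare it against other candidates the same way; a sketch, with gamma and log-normal chosen purely as illustrative alternatives:
fit_gamma <- fitdist(dat$x, "gamma")
fit_lnorm <- fitdist(dat$x, "lnorm")
# Compare AIC/BIC and goodness-of-fit statistics across the three fits
gofstat(list(fit, fit_gamma, fit_lnorm),
        fitnames = c("rayleigh", "gamma", "lnorm"))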

Truncated negative binomial distribution from age-binned population data

I have data for two populations that are binned by age, with different bins for each population.
Age bins in population 1: 18-24, 25-29, 30-34, 35-45, 46-60, 61+
Age bins in population 2: 15-19, 20-24, 25-29, 30-34 ... 85-89, 90+
I want to infer a continuous distribution from these binned data in order to compare the two populations more directly. I tried fitting an untruncated negative binomial distribution, but it underestimated the lower bins.
So, now I want to try a truncated negative binomial distribution. I did the following:
library(truncdist)
library(fitdistrplus)
dtruncated_nbinom <- function(x)
dtrunc(x, "nbinom", a=18, b=100)
ptruncated_nbinom <- function(q)
ptrunc(q, "nbinom", a=18, b=100)
pop1_nbinom <- fitdistcens(pop1_dt, "truncated_nbinom")
But I got the following error:
Error in computing default starting values.
Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = pseudodata, :
Error in start.arg.default(obs, distname) :
Unknown starting values for distribution truncated_nbinom.
Any advice on how to approach/resolve this?
Here's the pop 1 data:
library(data.table)
pop1 <- data.table(left = c(18,25,30,35,46,61), right = c(25,30,35,46,61,100), counts = c(2745,3115,2726,3433,1368,204))
pop1_dt <- pop1[rep(1:nrow(pop1), pop1[,counts]), .(left, right)]
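A likely fix (a sketch, untested against this data): fitdistcens has no built-in starting values for a custom "truncated_nbinom" distribution, so the wrappers need to pass the nbinom parameters through and a start list must be supplied explicitly. The size and mu values below are rough guesses:
dtruncated_nbinom <- function(x, size, mu)
  dtrunc(x, "nbinom", a = 18, b = 100, size = size, mu = mu)
ptruncated_nbinom <- function(q, size, mu)
  ptrunc(q, "nbinom", a = 18, b = 100, size = size, mu = mu)
pop1_nbinom <- fitdistcens(pop1_dt, "truncated_nbinom",
                           start = list(size = 1, mu = 30))  # rough guesses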

How to compute a confidence interval for Krippendorff's alpha in R?

I am sure this is related to Bootstrapping Krippendorff's Alpha, but I understood neither the question nor the answers there, and it looks like even the answers and comments contradict each other.
library(irr)
set.seed(0)
df <- data.frame(a = rep(sample(1:4), 10), b = rep(sample(1:4), 10))
kripp.alpha(t(df))
This is the output:
Krippendorff's alpha
Subjects = 40
Raters = 2
alpha = 0.342
How can I compute the confidence interval here?
You are right, it is connected to bootstrapping. You could compute the confidence interval the following way:
library(irr)
library(boot)
# Statistic for boot(): resample subjects (rows), then recompute alpha
alpha.boot <- function(d, w) {
  data <- t(d[w, ])
  kripp.alpha(data)$value
}
b <- boot(data = df, statistic = alpha.boot, R = 1000)
b
plot(b)
boot.ci(b, type = "perc")
This is the output:
Bootstrap Statistics :
original bias std. error
t1* 0.3416667 -0.01376158 0.1058123
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b, type = "perc")
Intervals :
Level Percentile
95% ( 0.1116, 0.5240 )
Calculations and Intervals on Original Scale
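The percentile interval is only one of several types boot.ci supports; a bias-corrected and accelerated interval, for example, is available from the same bootstrap object:
boot.ci(b, type = "bca")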
There is also an R script from Zapf et al. (2016); look for Additional file 3 at the bottom of the page, just before the references.
Or you could use the kripp.boot function available on GitHub (MikeGruz/kripp.boot).

Fitting survival density curves using different distributions

I am working with some log-normal data, and naturally I want to demonstrate that a log-normal distribution results in a better overlap than other possible distributions. Essentially, I want to replicate a graph from the text in which the fitted density curves are juxtaposed over log(time).
The text where the linked image is from describes the process as fitting each model and obtaining the following parameters:
For that purpose, I fitted four naive survival models with the above-mentioned distributions:
survreg(Surv(time,event)~1,dist="family")
and extracted the shape parameter (α) and the coefficient (β).
I have several questions regarding the process:
1) Is this the right way of going about it? I have looked into several R packages but couldn't locate one that plots density curves as a built-in function, so I feel like I must be overlooking something obvious.
2) Are the values corresponding to the log-normal distribution (μ and σ²) just the mean and the variance of the intercept?
3) How can I create a similar table in R? (Maybe this is more of a Stack Overflow question.) I know I can just cbind them manually, but I am more interested in calling them from the fitted models. survreg objects store the coefficient estimates, but calling survreg.obj$coefficients returns a named numeric vector (instead of just a number).
4) Most importantly, how can I plot a similar graph? I thought it would be fairly simple if I just extract the parameters and plot them over the histogram, but so far no luck. The author of the text says he estimated the density curves from the parameters, but I just get a point estimate; what am I missing? Should I calculate the density curves manually based on the distribution before plotting?
I am not sure how to provide an MWE in this case; honestly, I just need a general solution for adding multiple density curves to survival data. On the other hand, if you think it will help, feel free to recommend an MWE format and I will try to produce one.
Thanks for your input!
Edit: Based on eclark's post, I have made some progress. My parameters are:
Dist = data.frame(
Exponential = rweibull(n = 10000, shape = 1, scale = 6.636684),
Weibull = rweibull(n = 10000, shape = 6.068786, scale = 2.002165),
Gamma = rgamma(n = 10000, shape = 768.1476, scale = 1433.986),
LogNormal = rlnorm(n = 10000, meanlog = 4.986, sdlog = .877)
)
However, given the massive difference in scales, the resulting plot is unusable.
Going back to question number 3, is this how I should get the parameters?
Currently this is how I do it (sorry for the mess):
summary(fit.exp)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "exponential")
Value Std. Error z p
(Intercept) 6.64 0.052 128 0
Scale fixed at 1
Exponential distribution
Loglik(model)= -2825.6 Loglik(intercept only)= -2825.6
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.wei)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "weibull")
Value Std. Error z p
(Intercept) 6.069 0.1075 56.5 0.00e+00
Log(scale) 0.694 0.0411 16.9 6.99e-64
Scale= 2
Weibull distribution
Loglik(model)= -2622.2 Loglik(intercept only)= -2622.2
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.gau)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "gaussian")
Value Std. Error z p
(Intercept) 768.15 72.6174 10.6 3.77e-26
Log(scale) 7.27 0.0372 195.4 0.00e+00
Scale= 1434
Gaussian distribution
Loglik(model)= -3243.7 Loglik(intercept only)= -3243.7
Number of Newton-Raphson Iterations: 4
n= 397
summary(fit.log)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "lognormal")
Value Std. Error z p
(Intercept) 4.986 0.1216 41.0 0.00e+00
Log(scale) 0.877 0.0373 23.5 1.71e-122
Scale= 2.4
Log Normal distribution
Loglik(model)= -2624 Loglik(intercept only)= -2624
Number of Newton-Raphson Iterations: 5
n= 397
I feel like I am particularly messing up the lognormal, given that it is not the standard shape-and-coefficient tandem but the mean and variance.
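For reference, a sketch of how those survreg summaries map onto the usual d/p/q/r parameterisations (worth verifying against the survival documentation): for dist = "lognormal" the intercept is meanlog and Scale is sdlog; for dist = "weibull", shape = 1/Scale and scale = exp(intercept); for dist = "exponential", rate = 1/exp(intercept).
# Hedged sketch: pull plain numbers out of the fitted survreg objects
meanlog <- unname(coef(fit.log))          # 4.986
sdlog   <- fit.log$scale                  # 2.4 (the Scale line, not Log(scale))
wshape  <- 1 / fit.wei$scale              # Weibull shape = 1/Scale
wscale  <- exp(unname(coef(fit.wei)))     # Weibull scale = exp(intercept)
rate    <- 1 / exp(unname(coef(fit.exp))) # exponential rate
unname() also addresses the named-vector complaint in question 3: coef() returns a named vector, and unname() strips the name.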
Try this; the idea is to generate random variables using the random distribution functions and then plot the density functions from the output data. Here is an example along the lines of what you need:
require(ggplot2)
require(dplyr)
require(tidyr)
SampleData <- data.frame(Duration = rlnorm(n = 184, meanlog = 2.859, sdlog = .246)) # Assume this is data we have sampled from a log-normal distribution
#Then we estimate the parameters for different types of distributions for that sample data and come up for this parameters
#We then generate a dataframe with those distributions and parameters
Dist = data.frame(
Weibull = rweibull(10000,shape = 1.995,scale = 22.386),
Gamma = rgamma(n = 10000,shape = 4.203,scale = 4.699),
LogNormal = rlnorm(n = 10000,meanlog = 2.859,sdlog = .246)
)
#We use gather to prepare the distribution data in a manner better suited for group plotting in ggplot2
Dist <- Dist %>% gather(Distribution,Duration)
#Create the plot that sample data as a histogram
G1 <- ggplot(SampleData, aes(x = Duration)) + geom_histogram(aes(y = ..density..), binwidth = 5, colour = "black", fill = "white")
#Add the density distributions of the different distributions with the estimated parameters
G2 <- G1 + geom_density(aes(x=Duration,color=Distribution),data=Dist)
plot(G2)
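If you would rather draw the closed-form density curves than simulate draws (one way to address question 4), stat_function can overlay the theoretical densities directly; a sketch using the same illustrative parameters as above:
# Sketch: overlay theoretical densities instead of simulated draws
G1 +
  stat_function(aes(colour = "Weibull"), fun = dweibull,
                args = list(shape = 1.995, scale = 22.386)) +
  stat_function(aes(colour = "Gamma"), fun = dgamma,
                args = list(shape = 4.203, scale = 4.699)) +
  stat_function(aes(colour = "LogNormal"), fun = dlnorm,
                args = list(meanlog = 2.859, sdlog = .246))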

How to determine the initial points of the maximum likelihood method

I'm currently working on distribution fitting. I used the fitdistr function, but am having trouble determining the initial values for the MLE. For example, I want to fit my data (rainfall, a 13149 x 1 matrix) with a gamma distribution.
fit.gamma = fitdistr(rainfall,dgamma,start=list(shape = ?, scale = ?),method="Nelder-Mead")
The fitdistrplus package is very good for this. It will guess gamma starting values for you if you don't supply them. Also, you can use the method of moments if its guesses fail (see the sketch after the output below).
x <- rgamma(100, 0.5, 0.5)
library(fitdistrplus)
(pars <- fitdist(x, "gamma"))
# Fitting of the distribution ' gamma ' by maximum likelihood
# Parameters:
# estimate Std. Error
# shape 0.4443304 0.05131369
# rate 0.5622472 0.10644511
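If the automatic start fails, method-of-moments estimates make good manual starting values. For the gamma distribution, mean = shape/rate and variance = shape/rate², so a sketch would be:
# Method-of-moments starting values for a gamma fit
m <- mean(x); v <- var(x)
fitdist(x, "gamma", start = list(shape = m^2 / v, rate = m / v))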
