How to determine the starting values for the maximum likelihood method in R

I'm currently working on distribution fitting. I used the fitdistr function, but I am having trouble determining the starting values for the MLE. For example, I want to fit my data (rainfall, a 13149-by-1 matrix) with a gamma distribution.
fit.gamma = fitdistr(rainfall,dgamma,start=list(shape = ?, scale = ?),method="Nelder-Mead")

The fitdistrplus package is very good for this. It will guess the gamma starting values for you if you don't supply any, and you can fall back on the method of moments if the guesses fail.
x <- rgamma(100, 0.5, 0.5)
library(fitdistrplus)
(pars <- fitdist(x, "gamma"))
# Fitting of the distribution ' gamma ' by maximum likelihood
# Parameters:
# estimate Std. Error
# shape 0.4443304 0.05131369
# rate 0.5622472 0.10644511
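If the automatic starting values do fail, the method-of-moments fit mentioned above can be requested explicitly and, if you like, used to seed a maximum likelihood fit. A minimal sketch, reusing the simulated x from the snippet above:
pars.mme <- fitdist(x, "gamma", method = "mme")
pars.mme$estimate  # moment-based shape and rate estimates
fitdist(x, "gamma", method = "mle", start = as.list(pars.mme$estimate))  # MLE seeded with the moment estimates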

Related

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv does (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# By large-sample GLM theory, the estimated value on the link scale is approximately normally distributed,
# so for the negative binomial with a `log` link the fitted mean `mu` is log-normally distributed
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
  scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
  sapply(y, scalar.p.y)
}
Now, since you want the probability of y = 7 conditional on the specified new data, use
p.y(7)
# 0.07810065
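As a quick sanity check (my addition, not part of the original answer), the values returned by p.y should sum to roughly 1 when evaluated over a range of counts wide enough to cover essentially all of the probability mass:
sum(p.y(0:100))  # should be close to 1; widen the range if the fitted mean is large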
In general, this numerical-integration approach is not always easy. For example, if another link function such as sqrt() is used for the negative binomial, the distribution of the fitted mean is not as straightforward (though also not difficult to derive).
I now offer a sampling-based, or Monte Carlo, approach. This is the most similar to a Bayesian procedure.
N <- 1000 # sample size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that, being empirical Bayes, all of the above methods are conditional on the estimated smoothing parameters. If you want something closer to a "full Bayes" treatment, set unconditional = TRUE in predict().
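Concretely, that only changes the predict() call at the top of the answer; everything downstream stays the same. A sketch:
# link-scale prediction whose standard error also reflects smoothing-parameter uncertainty
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
             se.fit = TRUE, unconditional = TRUE)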
Remark 2
Perhaps some people assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the regression coefficients (assumed fixed, without uncertainty), so mu becomes fixed rather than random. This is not a predictive distribution. A predictive distribution integrates out the uncertainty in the model estimate.

Fitting survival density curves using different distributions

I am working with some log-normal data, and naturally I want to demonstrate that a log-normal distribution fits better than other candidate distributions. Essentially, I want to replicate the following graph with my data:
where the fitted density curves are juxtaposed over log(time).
The text where the linked image is from describes the process as fitting each model and obtaining the following parameters:
For that purpose, I fitted four naive survival models with the above-mentioned distributions:
survreg(Surv(time,event)~1,dist="family")
and extracted the shape parameter (α) and the coefficient (β).
I have several questions regarding the process:
1) Is this the right way of going about it? I have looked into several R packages but couldn't locate one that plots density curves as a built-in function, so I feel like I must be overlooking something obvious.
2) Are the values corresponding to the log-normal distribution (μ and σ²) just the mean and the variance of the intercept?
3) How can I create a similar table in R? (Maybe this is more of a Stack Overflow question.) I know I can just cbind them manually, but I am more interested in calling them from the fitted models. survreg objects store the coefficient estimates, but calling survreg.obj$coefficients returns a named numeric vector (instead of just a number).
4) Most importantly, how can I plot a similar graph? I thought it would be fairly simple if I just extracted the parameters and plotted them over the histogram, but so far no luck. The author of the text says he estimated the density curves from the parameters, but I just get a point estimate - what am I missing? Should I calculate the density curves manually, based on the distribution, before plotting?
I am not sure how to provide a mwe in this case, but honestly I just need a general solution for adding multiple density curves to survival data. On the other hand, if you think it will help, feel free to recommend a mwe solution and I will try to produce one.
Thanks for your input!
Edit: Based on eclark's post, I have made some progress. My parameters are:
Dist = data.frame(
Exponential = rweibull(n = 10000, shape = 1, scale = 6.636684),
Weibull = rweibull(n = 10000, shape = 6.068786, scale = 2.002165),
Gamma = rgamma(n = 10000, shape = 768.1476, scale = 1433.986),
LogNormal = rlnorm(n = 10000, meanlog = 4.986, sdlog = .877)
)
However, given the massive difference in scales, this is what I get:
Going back to question number 3, is this how I should get the parameters?
Currently this is how I do it (sorry for the mess):
summary(fit.exp)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "exponential")
Value Std. Error z p
(Intercept) 6.64 0.052 128 0
Scale fixed at 1
Exponential distribution
Loglik(model)= -2825.6 Loglik(intercept only)= -2825.6
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.wei)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "weibull")
Value Std. Error z p
(Intercept) 6.069 0.1075 56.5 0.00e+00
Log(scale) 0.694 0.0411 16.9 6.99e-64
Scale= 2
Weibull distribution
Loglik(model)= -2622.2 Loglik(intercept only)= -2622.2
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.gau)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "gaussian")
Value Std. Error z p
(Intercept) 768.15 72.6174 10.6 3.77e-26
Log(scale) 7.27 0.0372 195.4 0.00e+00
Scale= 1434
Gaussian distribution
Loglik(model)= -3243.7 Loglik(intercept only)= -3243.7
Number of Newton-Raphson Iterations: 4
n= 397
summary(fit.log)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "lognormal")
Value Std. Error z p
(Intercept) 4.986 0.1216 41.0 0.00e+00
Log(scale) 0.877 0.0373 23.5 1.71e-122
Scale= 2.4
Log Normal distribution
Loglik(model)= -2624 Loglik(intercept only)= -2624
Number of Newton-Raphson Iterations: 5
n= 397
I feel like I am particularly messing up the lognormal, given that it is parameterized not by the standard shape and coefficient but by the mean and variance.
Try this; the idea is to generate random variables using the random-generation functions for each distribution and then plot the density functions of that output data. Here is an example along the lines of what you need:
require(ggplot2)
require(dplyr)
require(tidyr)
SampleData <- data.frame(Duration=rlnorm(n = 184,meanlog = 2.859,sdlog = .246)) #Assume this is data we have sampled from a lognormal distribution
#Then we estimate the parameters of different candidate distributions for that sample data and come up with these parameters
#We then generate a data frame with draws from those distributions using the estimated parameters
Dist = data.frame(
Weibull = rweibull(10000,shape = 1.995,scale = 22.386),
Gamma = rgamma(n = 10000,shape = 4.203,scale = 4.699),
LogNormal = rlnorm(n = 10000,meanlog = 2.859,sdlog = .246)
)
#We use gather to reshape the distribution data into a form better suited for grouped plotting in ggplot2
Dist <- Dist %>% gather(Distribution,Duration)
#Plot the sample data as a histogram
G1 <- ggplot(SampleData,aes(x=Duration)) + geom_histogram(aes(y=..density..),binwidth=5, colour="black", fill="white")
#Add the density distributions of the different distributions with the estimated parameters
G2 <- G1 + geom_density(aes(x=Duration,color=Distribution),data=Dist)
plot(G2)
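Regarding question 3: survreg() fits a location-scale model on log(time) (or on time itself for the Gaussian), so its intercept and scale are not in the parameterization that rweibull()/rlnorm() expect, which is why the scales in your edit come out so different. A sketch of the usual conversions, assuming the fitted objects fit.exp, fit.wei, fit.gau and fit.log from the question (my addition, not part of the answer above):
# exponential: rate = exp(-intercept)
rate.exp <- exp(-unname(coef(fit.exp)))
# Weibull: shape = 1 / survreg scale, scale = exp(intercept)
shape.wei <- 1 / fit.wei$scale
scale.wei <- exp(unname(coef(fit.wei)))
# Gaussian: mean = intercept, sd = survreg scale (on the original time scale)
mean.gau <- unname(coef(fit.gau))
sd.gau <- fit.gau$scale
# log-normal: meanlog = intercept, sdlog = survreg scale
meanlog.ln <- unname(coef(fit.log))
sdlog.ln <- fit.log$scale
# these converted values are what should go into rexp()/rweibull()/rnorm()/rlnorm() in the Dist data frame above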

Maximum Likelihood Estimator for a Gamma density in R

I just simulated 100 random observations from a gamma density with alpha (shape parameter) = 5 and lambda (rate parameter) = 5:
x=rgamma(100,shape=5,rate=5)
Now, I want to find the maximum likelihood estimates of alpha and lambda with a function that returns both parameters and uses these observations.
Any hints would be appreciated. Thank you.
You can use fitdistr(...) for this in the MASS package.
set.seed(1) # for reproducible example
x <- rgamma(100,shape=5,rate=5)
library(MASS)
fitdistr(x, "gamma", start=list(shape=1, rate=1))$estimate
# shape rate
# 6.603328 6.697338
Notice that with a small sample like this you don't get great estimates.
x <- rgamma(10000,shape=5,rate=5)
library(MASS) # already loaded above
fitdistr(x, "gamma", start=list(shape=1, rate=1))$estimate
# shape rate
# 4.984220 4.971021
fitdistr(...) also returns the standard error of the estimates and the log-likelihood.
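For instance, continuing the example above, the returned object stores those pieces alongside the point estimates:
fit <- fitdistr(x, "gamma", start=list(shape=1, rate=1))
fit$estimate # shape and rate MLEs
fit$sd       # standard errors of the estimates
fit$loglik   # maximized log-likelihood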

Fitting a lognormal distribution to truncated data in R

For a brief background, I am interested in describing a distribution of fire sizes, which is presumed to follow a lognormal distribution (many small fires and few large fires). For my specific application I am only interested in the fires that fall within a certain range of sizes (> min, < max). So, I am attempting to fit a lognormal distribution to a data set that has been censored on both ends. In essence, I want to find the parameters of the lognormal distribution (mu and sigma) that best fit the full distribution prior to censoring. Can I fit the distribution taking into account that I know I am only looking at a portion of it?
I have done some experimentation, but have become stumped. Here's an example:
# Generate data #
D <- rlnorm(1000,meanlog = -0.75, sdlog = 1.5)
# Censor data #
min <- 0.10
max <- 20
Dt <- D[D > min]
Dt <- Dt[Dt <= max]
If I fit the non-censored data (D) using either fitdistr (MASS) or fitdist (fitdistrplus) I obviously get approximately the same parameter values as I entered. But if I fit the censored data (Dt) then the parameter values do not match, as expected. The question is how to incorporate the known censoring. I have seen some references elsewhere to using upper and lower within fitdistr, but I encounter an error that I'm not sure how to resolve:
> fitt <- fitdist(Dt, "lognormal", lower = min, upper = max)
Error in fitdist(Dt, "lognormal", lower = min, upper = max) :
The dlognormal function must be defined
I will appreciate any advice, first on whether this is the appropriate way to fit a censored distribution, and if so, how to go about defining the dlognormal function so that I can make this work. Thanks!
Your data are not censored (that would mean the observations outside the interval are present but you do not know their exact values) but truncated (those observations have been discarded). You just have to provide fitdist with the density and the cumulative distribution function of your truncated distribution.
library(truncdist)
dtruncated_log_normal <- function(x, meanlog, sdlog)
  dtrunc(x, "lnorm", a = 0.10, b = 20, meanlog = meanlog, sdlog = sdlog)
ptruncated_log_normal <- function(q, meanlog, sdlog)
  ptrunc(q, "lnorm", a = 0.10, b = 20, meanlog = meanlog, sdlog = sdlog)
library(fitdistrplus)
fitdist(Dt, "truncated_log_normal", start = list(meanlog=0, sdlog=1))
# Fitting of the distribution ' truncated_log_normal ' by maximum likelihood
# Parameters:
# estimate Std. Error
# meanlog -0.7482085 0.08390333
# sdlog 1.4232373 0.0668787
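If you would rather not depend on truncdist, the same truncated density and distribution function can be written by hand by renormalizing the log-normal by the probability mass inside the interval. A sketch using the same bounds 0.10 and 20 as above:
dtruncated_log_normal <- function(x, meanlog, sdlog) {
  Z <- plnorm(20, meanlog, sdlog) - plnorm(0.10, meanlog, sdlog)  # mass inside the interval
  ifelse(x >= 0.10 & x <= 20, dlnorm(x, meanlog, sdlog) / Z, 0)
}
ptruncated_log_normal <- function(q, meanlog, sdlog) {
  Z <- plnorm(20, meanlog, sdlog) - plnorm(0.10, meanlog, sdlog)
  (plnorm(pmin(pmax(q, 0.10), 20), meanlog, sdlog) - plnorm(0.10, meanlog, sdlog)) / Z
}
fitdist(Dt, "truncated_log_normal", start = list(meanlog = 0, sdlog = 1))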

R: How to fit a large dataset with a combination of distributions?

To fit a dataset of real-valued numbers (x) with a single distribution, say the gamma or Student's t, we can use MASS as follows:
fitdistr(x, "gamma")
or
fitdistr(x, "t")
What if I believe my dataset should be fit by the sum of gamma and t distributions?
P(X) = Gamma(x) + t(x)
Can I fit the parameters of mixtures of probability distributions using Maximum Likelihood fitting in R?
There are analytic maximum-likelihood estimators for some parameters, such as the mean of a normal distribution or the rate of an exponential distribution. For other parameters there is no analytic estimator, but you can use numerical optimization to find reasonable parameter estimates.
The fitdistr() function in R optimizes the log-likelihood numerically by calling optim(). If you think your data are a mixture of gamma and t distributions, simply write a likelihood function that describes such a mixture, then pass it, along with starting parameter values, to optim(). Here is an example of using this approach to fit a (single normal) distribution:
library( MASS )
vals = rnorm( n = 10000, mean = 0, sd = 1 )
print( summary(vals) )
ll_func = function(params) {
log_probs = log( dnorm( x = vals, mean = params[1], sd = params[2] ))
tot = sum(log_probs)
return(-1 * tot)
}
params = c( 0.5, 10 )
print( ll_func(params) )
res = optim( params, ll_func )
print( paste( "mean:", res$par[1] ) )
print( paste( "sd:", res$par[2] ) )
Running this program in R produces this output:
[1] "mean: 0.0223766157516646"
[1] "sd: 0.991566611447471"
That's fairly close to the true values of mean = 0 and sd = 1.
Don't forget that with a mixture of two distributions, you have one extra parameter that specifies the relative weights between the distributions. Also, be careful about fitting lots of parameters at once. With lots of free parameters you need to worry about overfitting.
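To make the mixture idea concrete, here is a minimal sketch (my own illustration on simulated data, not a recipe for your dataset) of a negative log-likelihood for a two-component gamma/t mixture. The mixing weight and the positive parameters are transformed so optim() can search an unconstrained space, and the t degrees of freedom are fixed at 5 to keep the example small:
mix_nll <- function(par, x) {
  w     <- plogis(par[1]) # mixing weight on the gamma component, kept in (0, 1)
  shape <- exp(par[2])    # gamma shape > 0
  rate  <- exp(par[3])    # gamma rate > 0
  mu    <- par[4]         # t location
  sigma <- exp(par[5])    # t scale > 0
  dens  <- w * dgamma(x, shape = shape, rate = rate) +
    (1 - w) * dt((x - mu) / sigma, df = 5) / sigma
  -sum(log(dens))
}
set.seed(1)
x <- c(rgamma(500, shape = 2, rate = 1), 3 + 0.5 * rt(500, df = 5)) # simulated mixture data
fit <- optim(c(0, log(2), log(1), 3, log(0.5)), mix_nll, x = x)
plogis(fit$par[1]) # estimated mixing weight
exp(fit$par[2:3])  # estimated gamma shape and rate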
Try mixdist. Here's an example of a mixture of three distributions:
https://stats.stackexchange.com/questions/10062/which-r-package-to-use-to-calculate-component-parameters-for-a-mixture-model
