Fitting survival density curves using different distributions - r

I am working with some log-normal data, and naturally I want to demonstrate log-normal distribution results in a better overlap than other possible distributions. Essentially, I want to replicate the following graph with my data:
where the fitted density curves are juxtaposed over log(time).
The text where the linked image is from describes the process as fitting each model and obtaining the following parameters:
For that purpose, I fitted four naive survival models with the above-mentioned distributions:
survreg(Surv(time,event)~1,dist="family")
and extracted the shape parameter (α) and the coefficient (β).
I have several questions regarding the process:
1) Is this the right way of going about it? I have looked into several R packages but couldn't locate one that plots density curves as a built-in function, so I feel like I must be overlooking something obvious.
2) Do the values corresponding log-normal distribution (μ and σ$^2$) just the mean and the variance of the intercept?
3) How can I create a similar table in R? (Maybe this is more of a stack overflow question) I know I can just cbind them manually, but I am more interested in calling them from the fitted models. survreg objects store the coefficient estimates, but calling survreg.obj$coefficients results a named number vector (instead of just a number).
4) Most importantly, how can I plot a similar graph? I thought it would be fairly simple if I just extract the parameters and plot them over the histrogram, but so far no luck. The author of the text says he estimated the density curves from the parameters, but I just get a point estimate - what am I missing? Should I calculate the density curves manually based on distribution before plotting?
I am not sure how to provide a mwe in this case, but honestly I just need a general solution for adding multiple density curves to survival data. On the other hand, if you think it will help, feel free to recommend a mwe solution and I will try to produce one.
Thanks for your input!
Edit: Based on eclark's post, I have made some progress. My parameters are:
Dist = data.frame(
Exponential = rweibull(n = 10000, shape = 1, scale = 6.636684),
Weibull = rweibull(n = 10000, shape = 6.068786, scale = 2.002165),
Gamma = rgamma(n = 10000, shape = 768.1476, scale = 1433.986),
LogNormal = rlnorm(n = 10000, meanlog = 4.986, sdlog = .877)
)
However, given the massive difference in scales, this is what I get:
Going back to question number 3, is this how I should get the parameters?
Currently this is how I do it (sorry for the mess):
summary(fit.exp)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "exponential")
Value Std. Error z p
(Intercept) 6.64 0.052 128 0
Scale fixed at 1
Exponential distribution
Loglik(model)= -2825.6 Loglik(intercept only)= -2825.6
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.wei)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "weibull")
Value Std. Error z p
(Intercept) 6.069 0.1075 56.5 0.00e+00
Log(scale) 0.694 0.0411 16.9 6.99e-64
Scale= 2
Weibull distribution
Loglik(model)= -2622.2 Loglik(intercept only)= -2622.2
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.gau)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "gaussian")
Value Std. Error z p
(Intercept) 768.15 72.6174 10.6 3.77e-26
Log(scale) 7.27 0.0372 195.4 0.00e+00
Scale= 1434
Gaussian distribution
Loglik(model)= -3243.7 Loglik(intercept only)= -3243.7
Number of Newton-Raphson Iterations: 4
n= 397
summary(fit.log)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "lognormal")
Value Std. Error z p
(Intercept) 4.986 0.1216 41.0 0.00e+00
Log(scale) 0.877 0.0373 23.5 1.71e-122
Scale= 2.4
Log Normal distribution
Loglik(model)= -2624 Loglik(intercept only)= -2624
Number of Newton-Raphson Iterations: 5
n= 397
I feel like I am particularly messing up the lognormal, given that it is not the standard shape-and-coefficient tandem but the mean and variance.

Try this; the idea is generating random variables using the random distribtion functions and then plotting the density functions with the output data, here is an example like you need:
require(ggplot2)
require(dplyr)
require(tidyr)
SampleData <- data.frame(Duration=rlnorm(n = 184,meanlog = 2.859,sdlog = .246)) #Asume this is data we have sampled from a lognormal distribution
#Then we estimate the parameters for different types of distributions for that sample data and come up for this parameters
#We then generate a dataframe with those distributions and parameters
Dist = data.frame(
Weibull = rweibull(10000,shape = 1.995,scale = 22.386),
Gamma = rgamma(n = 10000,shape = 4.203,scale = 4.699),
LogNormal = rlnorm(n = 10000,meanlog = 2.859,sdlog = .246)
)
#We use gather to prepare the distribution data in a manner better suited for group plotting in ggplot2
Dist <- Dist %>% gather(Distribution,Duration)
#Create the plot that sample data as a histogram
G1 <- ggplot(SampleData,aes(x=Duration)) + geom_histogram(aes(,y=..density..),binwidth=5, colour="black", fill="white")
#Add the density distributions of the different distributions with the estimated parameters
G2 <- G1 + geom_density(aes(x=Duration,color=Distribution),data=Dist)
plot(G2)

Related

How to check for overdispersion in a GAM with negative binomial distribution?

I fit a Generalized Additive Model in the Negative Binomial family using gam from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random variable ran. I fit the following model
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb"
I have read in Negative Binomial Regression book that it is still possible for the model to be overdisperesed. I have found code to check for overdispersion in glm but I am failing to find it for a gam. I have also encountered suggestions to just check the QQ plot and standardised residuals vs. predicted residuals, but I can not decide from my plots if the data is still overdisperesed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and plot the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot)
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data — the bottom of the bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertain band around the zero line as a form of goodness of fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
r$> sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
[1] 28.61546
r$> sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
[1] 0.5918471
In both cases, these should be 1; we see substantial overdispersion in the Poisson GAM, and some under-dispersion in the Negative Binomial GAM.

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM :
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv is doing (empirical) Bayesian estimation, so you can obtain predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# Under central limit theory in GLM theory, link value is normally distributed
# for negative binomial with `log` link, the response is log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this approach by numerical integration is not easy. For example, if other link functions like sqrt() is used for negative binomial, the distribution of response is not that straightforward (though also not difficult to derive).
Now I offer a sampling based approach, or Monte Carlo approach. This is most similar to Bayesian procedure.
N <- 1000 # samples size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(1000, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that as empirical Bayes, all above methods are conditional on estimated smoothing parameters. If you want something like a "full Bayes", set unconditional = TRUE in predict().
Remark 2
Perhaps some people are assuming the solution as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such result is conditional on regression coefficients (assumed fixed without uncertainty), thus mu becomes fixed and not random. This is not predictive distribution. Predictive distribution would integrate out uncertainty of model estimation.

Hill Equation Curve Fitting NLS

I am trying to calculate rate for the following data. I tried Michaelis menten equation, however, Km was coming negative. I am now trying to fit hill equation, but I am getting error message. I think my starting values are not so good. Any help will be very appreciated.
Thanks,
Krina
x<- c(0.0, 2.5, 5.0, 10.0, 25.0)
y <- c(4.91, 1.32, 1.18, 1.12, 1.09)
fo <- y~(Emax*(x^hill)/((EC50^hill)+(x^hill)))
st <- c(Emax=1.06, EC50=0.5, hill=1)
fit <- nls(fo, data = data.frame(x, y), start = st, trace = T)
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
I fit the data you posted to a few hundred known, named equations using a genetic algorithm for initial parameter estimation and found an excellent fit to a simple power law equation as follows (also see attached graph):
y = (a + x)b + Offset
a = 3.6792869983309306E-01
b = -1.3439157691306818E+00
Offset = 1.0766655470363218E+00
Degrees of freedom (error): 2
Degrees of freedom (regression): 2
Chi-squared: 1.98157151386e-06
R-squared: 0.999999822702
R-squared adjusted: 0.999999645405
Model F-statistic: 5640229.45337
Model F-statistic p-value: 1.77297720061e-07
Model log-likelihood: 29.7579529506
AIC: -10.7031811802
BIC: -10.9375184328
Root Mean Squared Error (RMSE): 0.000629534989315
a = 3.6792869983309306E-01
std err: 2.36769E-06
t-stat: 2.39112E+02
p-stat: 1.74898E-05
95% confidence intervals: [3.61308E-01, 3.74549E-01]
b = -1.3439157691306818E+00
std err: 2.91468E-05
t-stat: -2.48929E+02
p-stat: 1.61375E-05
95% confidence intervals: [-1.36714E+00, -1.32069E+00]
Offset = 1.0766655470363218E+00
std err: 9.37265E-07
t-stat: 1.11211E+03
p-stat: 8.08538E-07
95% confidence intervals: [1.07250E+00, 1.08083E+00]
Coefficient Covariance Matrix
[ 2.38970842 -8.3732707 1.30483649]
[ -8.3732707 29.41789844 -4.52058247]
[ 1.30483649 -4.52058247 0.94598199]
I was able to get a good fit using log-logistic model in drc library. However, I am not able to find parameters definition for this model. Is it similar to hill model with log transformation?
library(drc)
fit.ll <- drm(y~x, data=data.frame(x,y), fct=LL.5(), type="continuous")
print(summary(fit.ll))
plot(fit.ll)

How to determine the initial points of the maximum likelihood method

I'm currently working on distribution fitting. I used fitdistr function, but having problem in determining the initial points for the MLE. For example, I want to fit my data (rainfall- 13149 by 1 matrix) with gamma distribution.
fit.gamma = fitdistr(rainfall,dgamma,start=list(shape = ?, scale = ?),method="Nelder-Mead")
The library fitdistrplus is very good for this. It will guess gamma parameters for you if you don't have starting values. Also, you can use method of moments if your guesses fail.
x <- rgamma(100, 0.5, 0.5)
library(fitdistrplus)
(pars <- fitdist(x, "gamma"))
# Fitting of the distribution ' gamma ' by maximum likelihood
# Parameters:
# estimate Std. Error
# shape 0.4443304 0.05131369
# rate 0.5622472 0.10644511

Interpreting Weibull parameters from survreg

I am trying to generate an inverse Weibull distribution using parameters estimated from survreg in R. By this I mean I would like to, for a given probability (which will be a random number in a small simulation model implemented in MS Excel), return the expected time to failure using my parameters. I understand the general form for the inverse Weibull distribution to be:
X=b[-ln(1-rand())]^(1/a)
where a and b are shape and scale parameters respectively and X is the time to failure I want. My problem is in the interpretation of the intercept and covariate parameters from survreg. I have these parameters, the unit of time is days:
Value Std. Error z p
(Intercept) 7.79 0.2288 34.051 0.000
Group 2 -0.139 0.2335 -0.596 0.551
Log(scale) 0.415 0.0279 14.88 0.000
Scale= 1.51
Weibull distribution
Loglik(model)= -8356.7 Loglik(intercept only)= -8356.9
Chisq = 0.37 on 1 degrees of freedom, p= 0.55
Number of Newton-Raphson Iterations: 4
n=1682 (3 observations deleted due to missing values)
I have read in the help files that the coefficients from R are from the "extreme value distribution" but I'm unsure what this really means and how I get 'back to' the standard scale parameter used directly in the formulae. Using b=7.79 and a=1.51 gives nonsensical answers. I really want to be able to generate a time for both the base group and 'Group 2'. I also should note that I did not perform the analysis myself and cannot interrogate the data further.
This is explained in the manual page, ?survreg (in the "examples" section).
library(survival)
y <- rweibull(1000, shape=2, scale=5)
r <- survreg(Surv(y)~1, dist="weibull")
a <- 1/r$scale # Approximately 2
b <- exp( coef(r) ) # Approximately 5
y2 <- b * ( -ln( 1-runif(1000) ) ) ^(1/a)
y3 <- rweibull(1000, shape=a, scale=5)
# Check graphically that the distributions are the same
plot(sort(y), sort(y2))
abline(0,1)
The key is that the shape parameter the rweibull generates is the inverse of the shape parameter the survreg inputs

Resources