fitdistrplus vs MASS - difference in standard error outputs of estimates

I am trying to fit a Gamma distribution to my data. Since the data vector is huge, I cannot paste it here, but here are some summary statistics:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
   11.96   170.41   792.28  1983.93  2511.30 42039.76
I tried to fit a Gamma distribution using the fitdistrplus package:
fitdist(df %>% pull(x)/100, "gamma", start=list(shape = 1, rate = 0.1), lower=0.01)
I get estimates of the parameters, but the standard errors are NA (as is the correlation matrix between the parameters):
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters :
estimate Std. Error
shape 0.56172991 NA
rate 0.02846644 NA
Loglikelihood: -1582.601 AIC: 3169.202 BIC: 3177.244
Correlation matrix:
[1] NA
However, when I do the same with fitdistr from MASS, it does return standard errors:
fitdistr(df$x/100, "gamma", start=list(shape = 1, rate = 0.1), lower=0.01)
shape rate
0.561910739 0.028481215
(0.032652615) (0.002494628)
My questions are the following:
Obviously the estimates come out the same, but why does one give standard errors while the other is unable to do so?
The fitdist function from fitdistrplus also prints a correlation matrix between the parameters. What inferences can I draw from it about the quality of the fit?
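As background on where the standard errors come from: both fitters derive them by inverting the Hessian of the negative log-likelihood at the optimum, and NA typically means that inversion failed (for example a singular or non-positive-definite Hessian, which can happen when the optimum sits at the lower bound). A minimal sketch of that computation, assuming x holds the rescaled data df$x/100:
# Negative log-likelihood of the gamma model
nll <- function(par, x) -sum(dgamma(x, shape = par[1], rate = par[2], log = TRUE))
opt <- optim(c(shape = 1, rate = 0.1), nll, x = x,
             method = "L-BFGS-B", lower = 0.01, hessian = TRUE)
sqrt(diag(solve(opt$hessian)))  # standard errors; failure here is what shows up as NA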

Related

Confidence Interval of the predicted mean of a LMER object for large dataset

I would like to get the confidence interval (CI) for the predicted mean of a Linear Mixed Effect Model on a large dataset (~40k rows), which is itself a subset of an even larger dataset. This CI is then used for estimating the uncertainty of another calculation that uses the mean and its related CI as input data.
I managed to create a prediction estimate and interval for the full dataset, but a prediction interval is not the same as a CI and is much larger. Besides bootstrapping (which takes way too much time with this much data), I cannot find a method that would allow me to estimate a CI, either because it throws errors or because it only offers to calculate prediction intervals.
I moved into LME only recently and might therefore have overlooked some obvious method.
Here is what I did so far in more detail:
The input data is confidential and I can therefore unfortunately not share any extract.
But in general, we have one dependent variable (y) representing the probability of an event, two categorical variables (c1 and c2), two continuous variables (x1 and x2), and a weighting factor (w1). Some values in the dataset are missing. An extract of the first rows of the data could look like the example below:
c1      c2     x1  x2  w1   y
London  small   1  10  NA   NA
London  small   1  20  NA   NA
London  large   2  10  0.2  0.1
Paris   small   1  10  0.2  0.23
Paris   large   2  10  0.3  0.3
Based on this input data, I am then fitting a LMER model in the following form:
lmer1 <- lme4::lmer(y ~ x1 * poly(x2, 5) + ((x1 * poly(x2, 5)) | c1),
                    data = df,
                    weights = w1,
                    control = lme4::lmerControl(check.conv.singular = lme4::.makeCC(action = "ignore", tol = 1e-3)))
This runs for some minutes and returns several warnings:
Warning messages:
1: In optwrap(optimizer, devfun, getStart(start, rho$pp), lower = rho$lower, :
   convergence code 5 from nloptwrap: NLOPT_MAXEVAL_REACHED: Optimization
   stopped because maxeval (above) was reached.
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
   unable to evaluate scaled gradient
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
   Model failed to converge: degenerate Hessian with 11 negative eigenvalues
I increased the maxeval parameter, but this did not get rid of the warnings. I found that despite these warnings the model is still fitted, so I started to apply different methods to get a prediction of the mean for the whole dataset and the related CI of that mean.
predictInterval
I started with creating a Prediction Interval for the full dataset:
predictions <- merTools::predictInterval(lmer1,
newdata = df,
which = "full",
n.sims = 1000,
include.resid.var = FALSE,
level=0.95,
stat="mean")
However, as stated above, the Prediction Interval is not the same as the CI (see also https://datascienceplus.com/prediction-interval-the-wider-sister-of-confidence-interval/).
I found that the generic predict function has the option to set interval to either "prediction" or "confidence", but this option does not exist for predictions from a LMER object. And I could not find another possibility to switch from a prediction interval to a CI, even though I would believe that the data drawn should be sufficient to do this.
confint
I then saw that there is a function called confint, but when running this function I get the following error:
prediction_ci = lme4::confint.merMod(lmer1)
Computing profile confidence intervals ...
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling
detected new, lower deviance
In addition: Warning messages:
1: In commonArgs(par, fn, control, environment()) : maxfun < 10 *
length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, x@theta, lower = x@lower, calc.derivs
= TRUE, : convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
I found this thread (Error when estimating CI for GLMM using confint()), which said that I need to reduce the devtol parameter by computing a different profile. But doing so results in the same error:
lmer1_devtol = profile(lmer1, devtol = 1e-7)
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling
detected new, lower deviance
In addition: Warning messages:
1: In commonArgs(par, fn, control, environment()) : maxfun < 10 *
length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, x@theta, lower = x@lower, calc.derivs
= TRUE, : convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
add_ci
I found the function add_ci from the ciTools package, but this again resulted in another error:
predictions_ci = ciTools::add_ci(df, lmer1,
alpha = 0.05)
Error in levelfun(r, n, allow.new.levels = allow.new.levels) : new
levels detected in newdata
I then set the allow.new.levels parameter to TRUE, as in the description of the predict function, but this parameter seems not to be carried through:
predictions_ci = ciTools::add_ci(df, lmer1,
alpha = 0.05,
allow.new.levels = TRUE)
Error in levelfun(r, n, allow.new.levels = allow.new.levels) : new
levels detected in newdata
Diag
I found a method to calculate CIs for the sleepstudy data, which uses a matrix computation with diag.
# Design matrix for the fixed effects (drop the response from the formula)
Designmat <- model.matrix(as.formula("y ~ x1 * poly(x2, 5)")[-2], df)
# Variance of each fitted value from the fixed-effects covariance matrix
predvar <- diag(Designmat %*% vcov(lmer1) %*% t(Designmat))
# With new data
newdat <- df
newdat$pred <- predict(lmer1, newdat, allow.new.levels = TRUE)
Designmat <- model.matrix(formula(lmer1)[-2], newdat)
But the diag method does not work for such large datasets.
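The blow-up comes from materializing the full n-by-n matrix only to read its diagonal. A sketch of a common workaround, assuming the Designmat and lmer1 objects above, that computes just the diagonal:
# Same result as diag(D %*% V %*% t(D)) without forming the n x n matrix:
# row i of (D %*% V) dotted with row i of D
predvar <- rowSums(as.matrix(Designmat %*% vcov(lmer1)) * Designmat)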
bootMer
As said earlier, bootstrapping the confidence interval with bootMer takes too much time for this subset of data (I started it one day ago and it is still running). I tried some parallel processing with the sleepstudy sample data, but this did not increase the speed dramatically, so I assume it will behave the same on my large dataset.
merBoot <- bootMer(lmer1, predict, nsim = 1000, re.form = NA)
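For reference, bootMer can distribute the simulations over several cores via the parallel and ncpus arguments it forwards to the boot package; a sketch (the core count is illustrative):
merBoot <- lme4::bootMer(lmer1, predict, nsim = 1000, re.form = NA,
                         parallel = "multicore", ncpus = 4)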
Others
I have read through all these posts (and more), but none of them could help me get the CI in reasonable time for my case. But maybe I have overlooked something.
https://stats.stackexchange.com/questions/344012/confidence-intervals-from-bootmer-in-r-and-pros-cons-of-different-interval-type
https://stats.stackexchange.com/questions/117641/how-trustworthy-are-the-confidence-intervals-for-lmer-objects-through-effects-pa
How to get coefficients and their confidence intervals in mixed effects models?
Error when estimating CI for GLMM using confint()
https://stats.stackexchange.com/questions/235018/r-extract-and-plot-confidence-intervals-from-a-lmer-object-using-ggplot
How to get confidence intervals for lmer object?
Confidence intervals for the predicted probabilities from glmer object, error with bootMer
https://rdrr.io/cran/ciTools/man/add_ci.lmerMod.html
Error when estimating Confidence interval in lme4
https://fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/
Unsurprising to me but unfortunate for you, the nonconvergence of the mixed-model estimation and the difficulty in generating confidence intervals result from the misuse of a linear model for data with a limited dependent variable. "Despite these warnings, the model is still fitted" is a dangerous practice: estimates from a fit that has not converged should not be used for prediction. As you described, the dependent variable (y) represents the probability of an event, which is a continuous variable between zero and one. Using a linear model to predict a probability constitutes linear probability regression, which requires censoring the predicted outcomes (e.g. forcing all predicted values greater than .99 to be .99 and all predicted values smaller than .01 to be .01) and adjusting for heterogeneous variances using weighted least squares (see https://bookdown.org/ccolonescu/RPoE4/heteroskedasticity.html). Having continuous variables produce both fixed and random effects also burdens convergence, and some or all of the random effects of the continuous variables may not be necessary. The use of weights can be problematic as well.
Instead of a linear probability regression, beta regression works best for dependent variables that are proportions or probabilities. Beta regression without random effects is done in betareg::betareg(); glmmTMB::glmmTMB() handles beta regression with random effects. Start from a simple setting where only the intercept has random effects, such as:
glmmTMB(y ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), family = beta_family(link = "logit"), data = df)
You may compare the result with glmer() and lmer():
glmer(y ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), family = gaussian(link = "logit"), data = df)
lmer(log(y/(1-y)) ~ 1 + x1 * poly(x2, 5) + c2 + (1 | c1), data = df)
glmer() and lmer() with the above specifications are equivalent; both assume that predicting log(y/(1-y)) has normal residuals, while glmmTMB() assumes that y follows a beta distribution. lmer() results are easier to explain and receive wider support from other packages, since they are linear models. On the other hand, glmmTMB() may fit better according to AIC, BIC, and log-likelihood. Note that all three require y strictly in (0, 1), boundaries excluded. To include occasional zeros and ones, adjust observations at the boundaries by introducing a small tolerance, usually equal to half of the smaller distance from a boundary to its closest observed value (see https://stats.stackexchange.com/questions/109702 and https://graphworkflow.com/eda/bounded01/). For probabilities with many zeros and/or ones, zero-, one-, and zero-one-inflated beta regression is fitted via gamlss::gamlss(). See Korosteleva, O. (2019). Advanced Regression Models with SAS and R. CRC Press.
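A sketch of that boundary adjustment, assuming df$y holds the observed probabilities (NAs tolerated):
# Squeeze exact 0s and 1s into (0, 1) by half the distance to the nearest interior value
eps0 <- min(df$y[df$y > 0], na.rm = TRUE) / 2
eps1 <- (1 - max(df$y[df$y < 1], na.rm = TRUE)) / 2
df$y_adj <- pmin(pmax(df$y, eps0), 1 - eps1)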
Add random effects for slopes if necessary, according to likelihood ratio tests. Make sure there are enough levels in c1 (e.g. more than 10 different cities) to justify a mixed-effects model. The {glmmTMB} package extends glm() and glmer(); its alternative, the {brms} package, is built for the Bayesian approach. Note that the weights = argument in glmmTMB(), as in glm(), specifies values inversely proportional to the dispersions; they are not automatically scaled to sum to one, unless they are integers, in which case they specify numbers of observation units. You therefore need to investigate what w1 stands for and evaluate how to use it in modeling.
merTools::predictInterval() generates many kinds of intervals for mixed models, some comparable to confidence intervals and prediction intervals in linear models without random effects. However, it supports lmer() model objects only. See https://cran.r-project.org/web/packages/merTools/vignettes/merToolsIntro.html and https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html.
predictInterval(lmer1, include.resid.var = FALSE) includes uncertainty from both fixed and random effects of all coefficients including the intercept, but excludes variation from multiple measurements of the same group or individual. This can be considered similar to prediction intervals of linear models without random effects. predictInterval(lmer1, include.resid.var = FALSE, fix.intercept.variance = TRUE) generates a shorter CI than the above by accounting for the covariance between the fixed and random effects of the intercept. predictInterval(lmer1, include.resid.var = FALSE, ignore.fixed.terms = "(Intercept)") also shortens the CI by removing uncertainty from the fixed effect of the intercept. If there are no random slopes other than the random intercept, the last two methods are comparable to confidence intervals of linear models without random effects. confint(lmer1) and confint(profile(lmer1)) generate confidence intervals of model parameters such as a slope, so they do not produce confidence intervals of predicted outcomes. The three predictInterval() calls are sketched below.
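A sketch, assuming the lmer1 object and data frame df from the question:
pi1 <- merTools::predictInterval(lmer1, newdata = df, include.resid.var = FALSE)
pi2 <- merTools::predictInterval(lmer1, newdata = df, include.resid.var = FALSE,
                                 fix.intercept.variance = TRUE)
pi3 <- merTools::predictInterval(lmer1, newdata = df, include.resid.var = FALSE,
                                 ignore.fixed.terms = "(Intercept)")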
You may also find the following functions and packages useful for generating CIs of mixed effect models.
ggeffect() from {ggeffects}; predictions() from {marginaleffects}; margins() and prediction() from {margins}
They can produce predictions averaged over the observed distribution of covariates, instead of predictions made by holding some predictors at specific values such as means or modes, which can be misleading and not useful.
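For instance, a sketch with the model above (support for merMod objects varies by package version):
ggeffects::ggeffect(lmer1, terms = "x2")  # effects averaged over observed covariates
marginaleffects::predictions(lmer1)       # unit-level predictions with confidence intervals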

standard error & variance of model predictions using merTools::predictInterval

I would like to estimate the standard error and variance of predictions from a linear mixed model. I'm using merTools::predictInterval to estimate prediction intervals because I want to include some of the uncertainty in the random effects (in addition to the uncertainty in the fixed effects). Is it acceptable to use the simulations from merTools::predictInterval to estimate the SE and variance of predictions? If so, how should I calculate them? I can think of two ways:
To get variance corresponding to the prediction interval (i.e. including residual variance), I would first get the simulated predictions:
predictions <- merTools::predictInterval(...,
include.resid.var = TRUE,
returnSims = TRUE)
1. Then I could estimate variance using the normal approximation (calculate the distance between the fit and the upper/lower interval and then divide that by 1.96):
var1 <- ((predictions$upr - predictions$lwr)/2/1.96)^2
2. Or I could just take the variance of the simulated values:
var2 <- apply(X = attr(x = predictions, which = 'sim.results'), MARGIN = 1, FUN = var)
The SE would then be the square root of the variance. To get the SE and/or variance relating to the confidence interval, I could repeat this with include.resid.var = FALSE in the merTools::predictInterval call.
Is either of these methods acceptable? Is one preferable to the other?

obtaining quantiles from complete gaussian fit of data in R

I have been struggling with how R calculates quantiles and fits a normal distribution to data.
I have data (NDVI values) that follow a truncated normal distribution (see figure).
I am interested in getting the lowest 10th percentile value (p=0.1) from the data and from the fitting normal distribution curve.
In my understanding, because the data are truncated, the two should be quite different: I expect the quantile from the data to be higher than the one calculated from the normal distribution, but this is not so. From what I understand of the quantile function's help, the quantile from the data should be the default quantile type:
q=quantile(y, p=0.1)
while the quantile from the normal distribution is :
qx=quantile(y, p=0.1, type=9)
However, the two come out very close in all cases, which makes me wonder what type of distribution R fits to the data to calculate the quantile (a truncated normal distribution?).
I have also tried to calculate the quantile based on the fitted normal curve as:
fitted=fitdist(as.numeric(y), "norm", discrete = T)
fit.q=as.numeric(quantile(fitted, p=0.1)[[1]][1])
but obtained no difference.
So my questions are:
To what curve does R fit the data for calculating quantiles, in particular for type=9? How can I calculate the quantile based on the complete normal distribution (including the lower tail)?
I don't know how to generate a reproducible example for this, but the data is available at https://dl.dropboxusercontent.com/u/26249349/data.csv
Thanks!
R uses the empirical ordering of the data when determining quantiles, rather than assuming any particular distribution.
The 10th percentile for your truncated data and a normal distribution fit to your data happen to be pretty close, although the 1st percentile is quite a bit different. For example:
# Load packages and data
library(fitdistrplus)
df = read.csv("data.csv", header=TRUE, stringsAsFactors=FALSE)
# Fit a normal distribution to the data
df.dist = fitdist(df$x, "norm", discrete = T)
Now let's get quantiles of the fitted distribution and the original data. I've included the 1st percentile in addition to the 10th percentile. You can see that the fitted normal distribution's 10th percentile is just a bit lower than that of the data. However, the 1st percentile of the fitted normal distribution is much lower.
quantile(df.dist, p=c(0.01, 0.1))
Estimated quantiles for each specified probability (non-censored data)
p=0.01 p=0.1
estimate 1632.829 2459.039
quantile(df$x, p=c(0.01, 0.1))
1% 10%
2064.79 2469.90
quantile(df$x, p=c(0.01, 0.1), type=9)
1% 10%
2064.177 2469.400
You can also see this by direct ranking of the data and by getting the 1st and 10th percentiles of a normal distribution with mean and sd equal to the fitted values from fitdist:
# 1st and 10th percentiles of data by direct ranking
df$x[order(df$x)][round(c(0.01,0.1)*5780)]
[1] 2064 2469
# 1st and 10th percentiles of fitted distribution
qnorm(c(0.01,0.1), df.dist$estimate[1], df.dist$estimate[2])
[1] 1632.829 2459.039
Let's plot histograms of the original data (blue) and of fake data generated from the fitted normal distribution (red). The area of overlap is purple.
# Histogram of data (blue)
hist(df$x, xlim=c(0,8000), ylim=c(0,1600), col="#0000FF80")
# Overlay histogram of random draws from fitted normal distribution (red)
set.seed(685)
x.fit = rnorm(length(df$x), df.dist$estimate[1], df.dist$estimate[2])
hist(x.fit, add=TRUE, col="#FF000080")
Or we can plot the empirical cumulative distribution function (ecdf) for the data (blue) and the random draws from the fitted normal distribution (red). The horizontal grey line marks the 10th percentile:
plot(ecdf(df$x), xlim=c(0,8000), col="blue")
lines(ecdf(x.fit), col="red")
abline(0.1,0, col="grey40", lwd=2, lty="11")
Now that I've gone through this, I'm wondering if you were expecting fitdist to return the parameters of the normal distribution we would have gotten had your data really come from a normal distribution and not been truncated. Rather, fitdist returns a normal distribution with the mean and sd of the (truncated) data at hand, so the distribution returned by fitdist is shifted to the right compared to where we might have "expected" it to be.
c(mean=mean(df$x), sd=sd(df$x))
mean sd
3472.4708 790.8538
df.dist$estimate
mean sd
3472.4708 790.7853
Or, another quick example: x is normally distributed with mean ~ 0 and sd ~ 1. xtrunc removes all values less than -1, and xtrunc.dist is the output of fitdist on xtrunc:
set.seed(55)
x = rnorm(6000)
xtrunc = x[x > -1]
xtrunc.dist = fitdist(xtrunc, "norm")
round(cbind(sapply(list(x=x,xtrunc=xtrunc), function(x) c(mean=mean(x),sd=sd(x))),
xtrunc.dist=xtrunc.dist$estimate),3)
x xtrunc xtrunc.dist
mean -0.007 0.275 0.275
sd 1.009 0.806 0.806
And you can see in the ecdf plot below that the truncated data and the normal distribution fitted to the truncated data have about the same 10th percentile, while the 10th percentile of the untruncated data is (as we would expect) shifted to the left.
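If the goal really is to recover the untruncated parameters, one option (a sketch of maximum likelihood for a truncated normal, not something fitdist does out of the box) is to maximize the likelihood of a normal truncated at the known cutoff, here -1 for the xtrunc example above:
# Truncated-normal negative log-likelihood: dnorm rescaled by P(X > -1)
nll <- function(par) {
  -sum(dnorm(xtrunc, mean = par[1], sd = par[2], log = TRUE) -
       pnorm(-1, mean = par[1], sd = par[2], lower.tail = FALSE, log.p = TRUE))
}
optim(c(0, 1), nll, method = "L-BFGS-B", lower = c(-Inf, 1e-6))$par
# should land near the untruncated mean 0 and sd 1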

Interpreting mcmc output using glmmadmb

I'm trying to evaluate the output from a negative binomial mixed model using glmmadmb. To summarize the output, I'm comparing the summary function with the output from the mcmc option. I have run this model:
pre1 <- glmmadmb(walleye ~ (1|year.center) + (1|Site), data = pre,
                 family = "nbinom2", link = "log",
                 mcmc = TRUE, mcmc.opts = mcmcControl(mcmc = 1000))
I have two random intercepts: year and site. Year has 33 levels and site has 15.
The random effect parameter estimates for site and year from summary(pre1) do not seem to agree with the posterior distribution from the mcmc output. I am using the 50% confidence interval as the estimate that should coincide with the parameter estimate from the summary function. Is that incorrect? Is there a way to obtain an error around the random effect parameter using the summary function, to gauge whether this is a variance issue? I tried using postvar=T with ranef but that did not work. Also, is there a way to format the mcmc output with informative row names to ensure I'm using the proper estimates?
summary output from glmmadmb:
summary(pre1)
Call:
glmmadmb(formula = walleye ~ (1 | year.center) + (1 | Site),
data = pre, family = "nbinom2", link = "log", mcmc = TRUE,
mcmc.opts = mcmcControl(mcmc = 1000))
AIC: 4199.8
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.226 0.154 21 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Number of observations: total=495, year.center=33, Site=15
Random effect variance(s):
Group=year.center
Variance StdDev
(Intercept) 0.1085 0.3295
Group=Site
Variance StdDev
(Intercept) 0.2891 0.5377
Negative binomial dispersion parameter: 2.0553 (std. err.: 0.14419)
Log-likelihood: -2095.88
mcmc output:
m <- as.mcmc(pre1$mcmc)
CI <- t(apply(m,2,quantile,c(0.025,0.5,0.975)))
2.5% 50% 97.5%
(Intercept) 2.911667943 3.211775843 3.5537371345
tmpL.1 0.226614903 0.342206509 0.4600328729
tmpL.2 0.395353518 0.554211483 0.8619127547
alpha 1.789687691 2.050871824 2.3175742167
u.01 0.676758365 0.896844797 1.0726750539
u.02 0.424938481 0.588191585 0.7364795440
these estimates continue to u.48 to include year and site specific coefficients.
Thank you in advance for any thoughts on this issue.
Tiffany
The random effect parameter estimates for site and year from summary(pre1) do not seem to agree with the posterior distribution from the mcmc output. I am using the 50% confidence interval as the estimate that should coincide with the parameter estimate from the summary function. Is that incorrect?
It's not the 50% confidence interval, it's the 50% quantile (i.e. the median). The point estimates from the Laplace approximation of the among-year and among-site standard deviations respectively are {0.3295,0.5377}, which seem quite close to the MCMC median estimates {0.342206509,0.554211483} ... as discussed below, the MCMC tmpL parameters are the random-effects standard deviations, not the variances -- this might be the main cause of your confusion?
Is there a way to obtain an error around the random effect parameter using the summary function to gauge whether this is variance issue? I tried using postvar=T with ranef but that did not work.
The lme4 package (not the glmmadmb package) allows estimates of the variances of the conditional modes (i.e. the random effects associated with particular levels) via ranef(...,condVar=TRUE) (postVar=TRUE is now deprecated). The equivalent information on the uncertainty of the conditional modes is available via ranef(model,sd=TRUE) (see ?ranef.glmmadmb).
However, I think you might be looking for the $S (variance-covariance matrices) and $sd_S (Wald standard errors of the variance-covariance estimates) instead (although as stated above, I don't think there's really a problem).
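A sketch of where those pieces live (field names as documented for glmmADMB objects; verify against your version):
ranef(pre1, sd = TRUE)  # conditional modes with their standard deviations
pre1$S                  # estimated variance-covariance matrices of the random effects
pre1$sd_S               # Wald standard errors of those variance-covariance estimates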
Also, is there a way to format the mcmc output with informative row names to ensure I'm using the proper estimates?
See p. 15 of vignette("glmmADMB",package="glmmADMB"):
The MCMC output in glmmADMB is not completely translated. It includes, in order:
- pz: zero-inflation parameter (raw)
- fixed effect parameters: named in the same way as the results of coef() or fixef()
- tmpL: variances (standard-deviation scale)
- tmpL1: correlation/off-diagonal elements of variance-covariance matrices (off-diagonal elements of the Cholesky factor of the correlation matrix). If you need to transform these to correlations, you will need to construct the relevant matrices with 1 on the diagonal and compute the cross-product, CC^T (see tcrossprod); if this makes no sense to you, contact the maintainers.
- alpha: overdispersion/scale parameter
- u: random effects (unscaled: these can be scaled using the estimated random-effects standard deviations from VarCorr())
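So for this model, tmpL.1 and tmpL.2 are the among-year and among-site standard deviations. A sketch of labeling the columns by hand (the mapping is an assumption based on the ordering above; verify against your summary output):
m <- as.mcmc(pre1$mcmc)
colnames(m)[colnames(m) == "tmpL.1"] <- "sd_year.center"  # assumed mapping
colnames(m)[colnames(m) == "tmpL.2"] <- "sd_Site"         # assumed mapping
colnames(m)[colnames(m) == "alpha"]  <- "nb_dispersion"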

Estimating mean time to failure with survreg/flexsurvreg with standard error

I am trying to estimate the mean time to failure for a Weibull distribution fitted to some survival data with flexsurvreg from the flexsurv package. I need to be able to estimate the standard error for use in a simulation model.
Using flexsurvreg with the lung data as an example:
require(flexsurv)
lungS <- Surv(lung$time,lung$status)
lungfit <- flexsurvreg(lungS~1,dist="weibull")
lungfit
Call:
flexsurvreg(formula = lungS ~ 1, dist = "weibull")
Maximum likelihood estimates:
est L95% U95%
shape 1.32 1.14 1.52
scale 418.00 372.00 469.00
N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1153.851, df = 2
AIC = 2311.702
Now, calculating the mean is just a case of plugging the estimated parameter values into the standard formula (for a Weibull, mean = scale * gamma(1 + 1/shape)), but is there an easy way of getting the standard error of this estimate? Can survreg do this?
In flexsurv version 0.2, if x is the fitted model object, then x$cov is the covariance matrix of the parameter estimates, with positive parameters on the log scale. You can then use the asymptotic normality of maximum likelihood estimators: simulate a large number of multivariate normal vectors with the estimates as means and this covariance matrix (using e.g. rmvnorm from the mvtnorm package). This gives you replicates of the parameter estimates under sampling uncertainty. Calculate the corresponding mean survival for each replicate, then take the SD or quantiles of the resulting sample to get the standard error or a confidence interval.
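A minimal sketch of that recipe, assuming the lungfit object above and that lungfit$opt$par holds the maximum likelihood estimates on the log scale (the field layout varies across flexsurv versions):
library(mvtnorm)
set.seed(1)
sims  <- rmvnorm(10000, mean = lungfit$opt$par, sigma = lungfit$cov)  # log-scale replicates
shape <- exp(sims[, 1])
scale <- exp(sims[, 2])
mttf  <- scale * gamma(1 + 1/shape)   # Weibull mean time to failure per replicate
sd(mttf)                              # standard error of the MTTF estimate
quantile(mttf, c(0.025, 0.975))       # 95% confidence interval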
