How can I predict the probability of an event using the Weibull distribution in R?

I have a data set of connection forces based on axial force in N.
Some previous analysis has been undertaken (by another party), which fitted a Weibull distribution to the data and then predicted that the chance of recording a force of 60 N or higher is around 1.2%.
I have to say that eyeballing the data, that doesn't seem likely to me, but I know nothing about this particular distribution.
So far I am able to fit the curve:
library(MASS) # provides fitdistr()
force <- read.csv(file = "forcestats.csv", header = TRUE)
fitdistr(force$F, 'weibull')
I am trying to understand:
1. is a Weibull distribution really the best fit for this data?
2. how can I make that same prediction using R (how to calculate the probability of values above 60 N)?
3. is it possible to calculate the 95% confidence interval for that value (i.e., 1.2% +/- x%)?
Thanks for reading

To address your first item,
is a weibull distro really the best fit for this data ?
conceptually, this is more of a question about statistical inference rather than programming, so you most likely want to tackle that on CrossValidated rather than SO. However, you can certainly inquire about the means of investigating this programmatically, such as comparing the estimated density of the observed data to the theoretical density function or to the density function of random samples from a weibull distribution with your parameter estimates:
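library(MASS)
Weibull <- read.csv("forcestats.csv", header = TRUE) # assumed: the same file as in the question
params <- fitdistr(Weibull$F, 'weibull')
Shape <- params[[1]][1]
Scale <- params[[1]][2]
# overlay the fitted Weibull density on the kernel density estimate of the data
plot(density(Weibull$F), main = "Observed vs. fitted density")
curve(dweibull(x, shape = Shape, scale = Scale), add = TRUE, lty = 2)
legend("topright", legend = c("observed data", "fitted Weibull"), lty = 1:2)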
Of course, there are many other ways of assessing the fit of your model, this is just a quick sanity check.
As for your second question, you can use the pweibull function with lower.tail=FALSE to get probabilities from the theoretical survival function (S(x) = 1 - F(x)):
## Pr(X >= 60)
> pweibull(60, shape = Shape, scale = Scale, lower.tail = FALSE)
[1] 0.01268268
As for your final item, I believe that calculating confidence intervals on probabilities (as well as certain other statistical quantities) for an estimated distribution requires using the Delta method; I could be recalling incorrectly though, so you may want to double check on this. If this is the case and you aren't familiar with the Delta method, then unfortunately you will probably have to do a fair amount of reading on the subject, because the calculation involved is generally non-trivial and the Wikipedia article doesn't give a very in-depth treatment of the subject. Or, you could inquire about this on Cross Validated as well.
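To make the Delta method idea a little more concrete, here is a minimal sketch of how it might look for the tail probability Pr(X >= 60). It is only an illustration under stated assumptions: the data are simulated (force_sim is a made-up stand-in for the real forces), the gradient is approximated numerically, and the interval is a plain normal-approximation interval that can dip below 0 when the probability is very small.

```r
library(MASS)

set.seed(1)
# Simulated stand-in data (hypothetical); replace with the real force vector
force_sim <- rweibull(200, shape = 2, scale = 30)

fit <- fitdistr(force_sim, "weibull")
est <- unname(fit$estimate)  # c(shape, scale)
V   <- fit$vcov              # estimated covariance matrix of (shape, scale)

# Tail probability Pr(X >= 60) as a function of the parameter vector
surv60 <- function(p) pweibull(60, shape = p[1], scale = p[2], lower.tail = FALSE)

# Numerical gradient of surv60 at the estimates (central differences)
grad <- sapply(1:2, function(i) {
  h <- 1e-5 * est[i]
  step <- replace(numeric(2), i, h)
  (surv60(est + step) - surv60(est - step)) / (2 * h)
})

p_hat <- surv60(est)
se    <- sqrt(drop(t(grad) %*% V %*% grad))  # Delta-method standard error
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se

c(estimate = p_hat, lower = ci[1], upper = ci[2])
```

Applying the same steps on a transformed scale (for example the logit of the probability) and back-transforming would keep the interval inside [0, 1].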


Sample proportion confidence interval estimates using logit

This seems like a problem that has an accepted, statistically and mathematically sound answer, but I can't seem to find it.
When estimating confidence intervals from sample proportions, I generally use the normal approximation technique described here:
However, this fails spectacularly when the sample proportion is close to 0 or 1: the interval is symmetric, so it can extend above 1 or below 0. Generally, since proportion estimates "behave better" when modeled using a logit, I assume there is some way to apply a logit transform to the confidence intervals which would result in an asymmetric confidence interval that would never cross 0 or 1.
However, instead of trying to hack together my own technique with freshman calculus and MBA statistics as my highest formal mathematical training, I have been searching the web to see if such a technique has already been described by someone more qualified.
Is anyone aware of a way to do this?
A straightforward derivation via the usual change of variables formula shows that y = logit(x) where x has a beta distribution (the posterior distribution for the binomial proportion assuming a beta prior), has a distribution with pdf (exp(y)^a)/((1 + exp(y))^(a + b))/beta(a, b) where beta(a, b) = gamma(a)*gamma(b)/gamma(a + b).
That pdf has a somewhat Gaussian-like shape, but it's less symmetrical the more different a and b are. It probably has a name, although I don't recognize it.
It's not clear that taking y = logit(x) here is helpful. For several other approaches, see: Binomial proportion confidence interval
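For comparison, here is a small sketch of what a logit-scale interval could look like next to the usual normal-approximation (Wald) interval and two standard alternatives from base R. The counts x and n are made up for illustration, and the logit interval uses the Delta-method standard error 1/sqrt(n p (1 - p)); it is undefined when the sample proportion is exactly 0 or 1.

```r
x <- 2    # hypothetical number of successes
n <- 50   # hypothetical sample size
p <- x / n
z <- qnorm(0.975)

# Normal-approximation (Wald) interval: symmetric, may leave [0, 1]
wald <- p + c(-1, 1) * z * sqrt(p * (1 - p) / n)

# Wald interval on the logit scale, transformed back: asymmetric, stays in (0, 1)
se_logit <- 1 / sqrt(n * p * (1 - p))
logit_ci <- plogis(qlogis(p) + c(-1, 1) * z * se_logit)

# Wilson score interval (prop.test without continuity correction)
wilson <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

# Clopper-Pearson ("exact") interval
exact <- as.numeric(binom.test(x, n)$conf.int)

rbind(wald = wald, logit = logit_ci, wilson = wilson, exact = exact)
```

With these counts the Wald lower bound falls below 0, while the other three intervals stay inside the unit interval.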
How does one extract hat values and Cook's Distance from an `nlsLM` model object in R?

I'm using the nlsLM function to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's Distance of the observations, but I can't figure out how to get them using nlsLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's Distance or based on hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model, which means it removes each point in turn and refits the model to see, roughly speaking, how much the model coefficients change with or without a given observation in there.
Reproducible example:
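library(nlstools)
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
for (i in 1:100) {
xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
# jackknife the fit and plot the influence of each observation
plot(nlsJack(nls1))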
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it marks points above a certain threshold as "influential". The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this is a fairly liberal criterion--it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits though.
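If it helps to see what the jackknife is doing, the leave-one-out idea can be sketched by hand with base R's nls. This is not the nlsJack implementation, just an illustration on simulated data (noise is added to ys here so the fit is well behaved), and it reads the quoted criterion as 2 * SE / sqrt(n), which is my interpretation of the wording.

```r
set.seed(1)
xs <- rep(1:10, times = 10)
ys <- 3 + 2 * exp(-0.5 * xs) + rnorm(100, sd = 0.1)  # simulated response
df1 <- data.frame(xs, ys)

fit_all <- nls(ys ~ a + b * exp(d * xs), data = df1,
               start = c(a = 3, b = 2, d = -0.5))
est <- coef(fit_all)
se  <- summary(fit_all)$coefficients[, "Std. Error"]
n   <- nrow(df1)

# Refit without each observation in turn, starting from the full-data estimates
loo <- t(sapply(seq_len(n), function(i) {
  coef(nls(ys ~ a + b * exp(d * xs), data = df1[-i, ], start = as.list(est)))
}))

# Flag an observation for a parameter when dropping it moves the estimate
# by more than 2 * SE / sqrt(n) (assumed reading of the documented rule)
shift       <- abs(sweep(loo, 2, est, "-"))
threshold   <- matrix(2 * se / sqrt(n), nrow = n, ncol = length(est), byrow = TRUE)
influential <- shift > threshold

colSums(influential)  # number of flagged observations per parameter
```

Since nlsLM returns an object that also inherits from class nls, the same leave-one-out loop could in principle be written around nlsLM calls instead.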

{Methcomp} – Deming / orthogonal regression – goodness of fit + confidence intervals

A question following this post. I have the following data:
x1, disease symptom
y1, another disease symptom
I fitted the x1/y1 data with a Deming regression with vr (or sdr) option set to 1. In other words, the regression is a Total Least Squares regression, i.e. orthogonal regression. See previous post for the graph.
10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
dem_reg <- Deming(x1, y1)
abline(dem_reg[1:2], col = "green")
I would like to know how much x1 helps to predict y1:
normally, I'd go for an R-squared, but it does not seem to be relevant, although another mathematician told me he thinks an R-squared may be appropriate. And this page suggests calculating a Pearson product-moment correlation coefficient, which is r, I believe?
partially related, there is possibly a tolerance interval. I could calculate it with R ({tolerance} package or code shown in the post), but it is not exactly what I am searching for.
Does someone know how to calculate a goodness of fit for Deming regression using R? I looked at the MethComp PDF but could not find it (perhaps I missed it, though).
EDIT: following Gaurav's answers about confidence interval: R code
Firstly: confidence intervals for parameters
Secondly: confidence intervals for predicted values
# plot of data
# Deming regression using functions from {mcr}
library(mcr)
MCR_reg <- mcreg(x1, y1, method.reg = "Deming", error.ratio = 1, method.ci = "analytical")
# CI for predicted values
# plot regression line and CI for predicted values
abline(MCR_intercept,MCR_slope, col="red")
# comments
text(7.5,60, "Deming regression", col="red")
text(7.5,40, "Confidence Interval for", col="royalblue")
text(7.5,35, "Predicted values - 95%", col="royalblue")
Topic moved to Cross Validated:
There are many proposed methods to calculate goodness of fit and tolerance intervals for Deming regression, but none of them is widely accepted. The conventional methods we use for OLS regression may not make sense. This is an area of active research. I don't think there are many R packages which will help you compute that, since not many mathematicians agree on any particular method. Most methods for calculating intervals are based on resampling techniques.
However you can check out the 'mcr' package for intervals...
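As an illustration of the resampling idea, here is a rough sketch of a percentile bootstrap interval for the Deming slope with error ratio 1 (orthogonal regression), using only base R. The data are simulated (x_sim and y_sim are made-up stand-ins), and deming_slope is a small helper written for this example from the closed-form orthogonal regression slope, not a function from MethComp or mcr.

```r
set.seed(99)
n <- 100
truth <- runif(n, 0, 20)            # hypothetical error-free values
x_sim <- truth + rnorm(n)           # both variables measured with error
y_sim <- 1 + 2 * truth + rnorm(n)   # true slope is 2

# Closed-form slope of orthogonal (error ratio = 1) Deming regression
deming_slope <- function(x, y) {
  sxx <- var(x); syy <- var(y); sxy <- cov(x, y)
  (syy - sxx + sqrt((syy - sxx)^2 + 4 * sxy^2)) / (2 * sxy)
}

slope_hat <- deming_slope(x_sim, y_sim)

# Percentile bootstrap: resample (x, y) pairs and recompute the slope
boot_slopes <- replicate(2000, {
  idx <- sample(n, replace = TRUE)
  deming_slope(x_sim[idx], y_sim[idx])
})
ci <- quantile(boot_slopes, c(0.025, 0.975))

slope_hat
ci
```

The same resampling loop could wrap a call to a package fitting function instead of the helper, at the cost of a longer run time.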

estimating density in a multidimensional space with R

I have two types of individuals, say M and F, each described with six variables (forming a 6D space S). I would like to identify the regions in S where the densities of M and F differ maximally. I first tried a binomial logistic model linking F/M to the six variables, but the result of this GLM is very hard to interpret (in part due to the numerous significant interaction terms). Thus I am thinking of a "spatial" analysis where I would separately estimate the density of M and F individuals everywhere in S, then calculate the difference in densities. Eventually I would manually look for the largest difference in densities, and extract the values of the six variables there.
I found the function sm.density in the package sm that can estimate densities in a 3D space, but I find nothing for a space with n>3. Would you know of something that can do this in R? Alternatively, would you have a more elegant method to answer my first question (2nd sentence)?
In advance,
Thanks a lot for your help
The function kde of the package ks performs kernel density estimation for multivariate data with dimensions ranging from 1 to 6.
The pdfCluster and np packages provide functions to perform kernel density estimation in higher dimensions.
If you prefer parametric techniques, you can look at R packages doing Gaussian mixture estimation, like mclust or mixtools.
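To show the kind of computation involved, here is a hand-rolled sketch of a 6D kernel density comparison in base R. It does not use the ks, pdfCluster or np APIs: it is a plain product Gaussian kernel with a normal-reference bandwidth per variable, applied to simulated groups (M and Fm are made-up stand-ins), and kde_at is a helper defined only for this example.

```r
set.seed(42)
d  <- 6
M  <- matrix(rnorm(300 * d), ncol = d)              # simulated group M
Fm <- matrix(rnorm(300 * d, mean = 0.5), ncol = d)  # simulated group F

# Product Gaussian kernel density estimate of `data`, evaluated at rows of `points`
kde_at <- function(data, points) {
  n <- nrow(data)
  d <- ncol(data)
  h <- apply(data, 2, sd) * n^(-1 / (d + 4))  # normal-reference bandwidths
  apply(points, 1, function(p) {
    z <- sweep(sweep(data, 2, p, "-"), 2, h, "/")
    mean(exp(-0.5 * rowSums(z^2))) / ((2 * pi)^(d / 2) * prod(h))
  })
}

# Evaluate both densities at the observed points and take the difference
pts <- rbind(M, Fm)
diff_dens <- kde_at(M, pts) - kde_at(Fm, pts)

# Location (in the 6 variables) where the estimated densities differ most
pts[which.max(abs(diff_dens)), ]
```

Evaluating at the observed points avoids building a full grid, which becomes impractical in six dimensions.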
The ability to do this with GLM models may be constrained both by the interpretability issues that you already encountered and by numerical stability issues. Furthermore, you don't describe the GLM models, so it's not possible to see whether you include consideration of non-linearity. If you have lots of data, you might consider using 2D crossed spline terms. (These are not really density estimates.) If I were doing initial exploration with facilities in the rms/Hmisc packages in five dimensions, it might look like:
library(rms) # also loads Hmisc
dd <- datadist(dat)
options(datadist = "dd") # Predict() looks up reference values here
big.mod <- lrm( MF ~ ( rcs(var1, 3) + # `lrm` is logistic regression in rms
rcs(var2, 3) +
rcs(var3, 3) +
rcs(var4, 3) +
rcs(var5, 3) )^2,# all 2way interactions
data = dat,
maxit=50) # these fits may take longer times
bplot( Predict(big.mod, var1,var2, n=10) )
That should show the simultaneous functional form of var1's and var2's contribution to the "5 dimensional" model estimates at 10 points each and at the median value of the three other variables.

Given a set of random numbers drawn from a continuous univariate distribution, find the distribution

Given a set of real numbers drawn from an unknown continuous univariate distribution (let's say it is one of beta, Cauchy, chi-square, exponential, F, gamma, Laplace, log-normal, normal, Pareto, Student's t, uniform and Weibull) ..
x <- c(7.7495976,12.1007857,5.8663491,9.9137894,11.3822335,7.4406175,8.6997212,9.4456074,11.8370711,6.4251469,9.3597039,8.7625700,10.3171063,8.0983110,11.7564283,11.7583461,7.3760516,14.5713098,14.3289690,12.8436795,7.1834376,12.2530520,8.9362235,11.8964391,5.4378782,7.8083060,0.1356370,14.9341847,6.8625143,9.0285873,10.2251998,10.3348486,7.7518365,2.8757024,9.2676577,10.6879259,11.7623207,14.0745924,9.3478318,7.6788852,9.7491924,14.9409955,11.0297640,8.5541261,8.6129808,9.2192320,12.3507414,8.9156903,11.6892831,10.2571897,11.1673235,10.5883741,8.2396129,7.3505839,3.4437525,8.3660082,10.5779227,8.5382177,13.6647484,9.0712034,4.1090454,13.4238382,16.1965937,14.2539891,14.6498816,6.9662381,12.3282141,10.9628268,10.8859495,11.6742822,12.0469869,9.1764119,4.2324549,12.6665295,10.7467579,6.4153703,10.3090806,12.0267082,9.2375369,13.8011813,13.0457227,14.0147179,6.9224316,7.1164269,10.7577799,8.0965571,13.3371566,14.6997535,8.8248384,8.0634834,10.2226001,8.5112199,8.1701147,8.1970784,10.5432878,5.9603389,6.6287037,13.3417943,3.1122822,10.4241008) # ... truncated for brevity
.. is there some easy way in R to programmatically and automatically find the most likely distribution and the estimated distribution parameters?
Please note that the distribution identification code will be part of an automated process, so manual intervention in the identification won't be possible.
My first approach would be to generate qq plots of the given data against the possible distributions.
x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640)
> qqnorm(x)
For more info see link
Another possibility is based on the fitdistr function in the MASS package. Here are the different distributions, ordered by their log-likelihood:
> library(MASS)
> fitdistr(x, 't')$loglik
[1] -252.2659
Warning message:
In log(s) : NaNs produced
> fitdistr(x, 'normal')$loglik
[1] -252.2968
> fitdistr(x, 'logistic')$loglik
[1] -252.2996
> fitdistr(x, 'weibull')$loglik
[1] -252.3507
> fitdistr(x, 'gamma')$loglik
[1] -255.9099
> fitdistr(x, 'lognormal')$loglik
[1] -260.6328
> fitdistr(x, 'exponential')$loglik
[1] -331.8191
Warning messages:
1: In dgamma(x, shape, scale, log) : NaNs produced
2: In dgamma(x, shape, scale, log) : NaNs produced
Another similar approach is using the fitdistrplus package
Loop through the distributions of interest and generate 'fitdist' objects. Use either "mle" for maximum likelihood estimation or "mme" for matching moment estimation, as the fitting method.
Use bootstrap re-sampling in order to simulate uncertainty in the parameters of the selected model
The fitdist method allows for using custom distributions or distributions from other packages, provided that the corresponding density function dname, the corresponding distribution function pname and the corresponding quantile function qname have been defined (or even just the density function).
So if you wanted to test the log-likelihood for the inverse normal distribution:
You may also find Fitting distributions with R helpful.
(Answer edited to add additional explanation)
You can't really find "the" distribution; the actual distribution from which data are drawn can nearly always* be guaranteed not to be in any "laundry list" provided by any such software. At best you can find "a" distribution (more likely several), one that is an adequate description. Even if you find a great fit there are always an infinity of distributions that are arbitrarily close by. Real data tends to be drawn from heterogeneous mixtures of distributions that themselves don't necessarily have simple functional form.
* an example where you might hope to is where you know the data were actually generated from exactly one distribution on a list, but such situations are extremely rare.
I don't think just comparing likelihoods is necessarily going to make sense, since some distributions have more parameters than others. AIC might make more sense, except that ...
Attempting to identify a "best fitting" distribution from a list of candidates will tend to produce overfitting, and unless the effect of such model selection is accounted for properly will lead to overconfidence (a model that looks great but doesn't actually fit the data not in your sample). There are such possibilities in R (the package fitdistrplus comes to mind), but as a common practice I would advise against the idea. If you must do it, use holdout samples or cross-validation to obtain models with better generalization error.
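A minimal sketch of the holdout idea, assuming MASS::fitdistr for the fits: each candidate is fitted on a training subset and scored by its log-likelihood on data it has not seen. The data are simulated (x_sim is a made-up stand-in), and the list of candidate densities is just an example selection.

```r
library(MASS)

set.seed(123)
x_sim <- rgamma(400, shape = 9, rate = 0.9)  # simulated stand-in data

# Split into a training part (used for fitting) and a holdout part (used for scoring)
train_idx <- sample(seq_along(x_sim), 300)
train <- x_sim[train_idx]
test  <- x_sim[-train_idx]

# Log-density of each candidate, given the named estimates returned by fitdistr
candidates <- list(
  normal    = function(x, p) dnorm(x, p["mean"], p["sd"], log = TRUE),
  lognormal = function(x, p) dlnorm(x, p["meanlog"], p["sdlog"], log = TRUE),
  gamma     = function(x, p) dgamma(x, shape = p["shape"], rate = p["rate"], log = TRUE),
  weibull   = function(x, p) dweibull(x, shape = p["shape"], scale = p["scale"], log = TRUE)
)

# Fit on the training data, evaluate the log-likelihood on the holdout data
holdout_ll <- sapply(names(candidates), function(nm) {
  p <- fitdistr(train, nm)$estimate
  sum(candidates[[nm]](test, p))
})

sort(holdout_ll, decreasing = TRUE)
```

Repeating the split several times (or using k-fold cross-validation) and averaging the scores would reduce the dependence on one particular split.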
I find it hard to imagine a realistic situation where this would be useful. Why not use a non-parametric tool like a kernel density estimate?
You could try using the Kolmogorov-Smirnov tests (ks.test in R).
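For example, a quick sketch with simulated data (x_sim is a made-up stand-in) might look like the following. One caveat to keep in mind: when the parameters are estimated from the same data, as here, the p-values reported by ks.test are no longer exact, so the statistic is better treated as a rough distance measure than as a formal test.

```r
set.seed(7)
x_sim <- rnorm(200, mean = 10, sd = 3)  # simulated stand-in data

# Compare the sample against a normal and an exponential distribution,
# with parameters estimated from the sample itself
ks_norm <- ks.test(x_sim, "pnorm", mean = mean(x_sim), sd = sd(x_sim))
ks_exp  <- ks.test(x_sim, "pexp", rate = 1 / mean(x_sim))

# Smaller D means the fitted CDF is closer to the empirical CDF
c(normal = unname(ks_norm$statistic), exponential = unname(ks_exp$statistic))
```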
If you have time-to-event data, here's software that does a Bayesian chi squared test against a list of common distributions to report the best fit.
As others have pointed out, this might be framed as a model selection question. It is wrong to pick the distribution that fits the data best without taking into account the complexity of the distribution: a more complicated distribution will generally fit better, but it will likely overfit the data.
You can use the Akaike Information Criterion (AIC) to take the complexity of the distribution into account. This is still unsatisfactory, as you're only considering a limited number of distributions, but it is still better than just using the log-likelihood.
I use just a few distributions, but you can check the documentation to find others that could be relevant
Using the fitdistrplus package, you can run:
library(fitdistrplus)
distributions = c("norm", "lnorm", "exp",
"cauchy", "gamma", "logis",
"weibull") # last entry assumed; the original line is missing
# the x vector is defined as in the question
# Plot to see which distributions make sense. This should influence
# your choice of candidate distributions
descdist(x, discrete = FALSE, boot = 500)
distr_aic = list()
distr_fit = list()
for (distribution in distributions) {
distr_fit[[distribution]] = fitdist(x, distribution)
distr_aic[[distribution]] = distr_fit[[distribution]]$aic
}
> distr_aic
[1] 5032.269
[1] 5421.815
[1] 6602.334
[1] 5382.643
[1] 5184.17
[1] 5047.796
[1] 5058.336
According to our plot and the AIC, it makes sense to use a normal distribution. You can automate this by just picking the distribution with the minimum AIC. You can check the estimated parameters with
> distr_fit[['norm']]
Fitting of the distribution ' norm ' by maximum likelihood
Parameters:
     estimate Std. Error
mean 9.975849 0.09454476
sd   2.989768 0.06685321
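The "pick the minimum AIC" step could be automated along these lines. To keep the sketch runnable with only the packages that ship with R, it uses MASS::fitdistr and the generic AIC() rather than fitdistrplus, so the distribution names follow the MASS convention; the data are simulated (x_sim is a made-up stand-in) and the candidate list is only an example.

```r
library(MASS)

set.seed(2024)
x_sim <- rnorm(500, mean = 10, sd = 3)  # simulated stand-in data
x_sim <- x_sim[x_sim > 0]               # several candidates need positive values

candidates <- c("normal", "lognormal", "gamma", "weibull", "logistic", "exponential")

# Fit every candidate by maximum likelihood and collect the AIC values
fits <- lapply(setNames(candidates, candidates), function(d) fitdistr(x_sim, d))
aic  <- sapply(fits, AIC)

# Select the candidate with the smallest AIC and report its parameters
best <- names(which.min(aic))
best
fits[[best]]$estimate
```

The earlier caveats still apply: the winner is only the best of the listed candidates, and the selection step itself is not reflected in the reported standard errors.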
