R: How to obtain the fitted values from a distribution fit?

I fitted a gamma distribution to my empirical data using the fitdist function from the fitdistrplus package:
fit <- fitdist(data = empdistr, distr = "gamma")
I then use the denscomp function to compare the data to the fitted values:
dc <- denscomp(fit)
But I would like to extract from fit or from dc the actual fitted values, i.e. the points of the gamma density (with the fitted parameters) that denscomp displays.
Does anybody have an idea of how I can do that?
Thanks in advance!

Use dgamma to evaluate the fitted density at any quantiles:
dgamma(x, coef(fit)[1], coef(fit)[2])
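For instance, a minimal sketch (assuming empdistr is the data vector from the question) that evaluates the fitted gamma density on a grid spanning the data, which is essentially the curve denscomp draws:
# grid over the observed data range
x <- seq(min(empdistr), max(empdistr), length.out = 100)
# fitted gamma density at those points; coef(fit) holds shape and rate
y <- dgamma(x, shape = coef(fit)["shape"], rate = coef(fit)["rate"])
fitted_points <- data.frame(x = x, density = y)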

Related

How do I use the pgamma() function in R to compute the CDF of a gamma distribution?

I want to compute the cumulative distribution function in R for data that follows a gamma distribution. I understood how to do this with a lognormal distribution using the equation from Wikipedia; however, the gamma equation seems more complicated and I decided to use the pgamma() function.
I'm new to this and don't understand the following:
Why do I get three different values out of pgamma, and how does it make sense that they are negative?
Am I supposed to take the log of all the quantiles, just as I used log(mean) and log(standard deviation) when doing calculations with a lognorm distribution?
How do I conceptually understand the CDF calculated by pgamma? It made sense for lognorm that I was calculating the probability that X would take a value <= x, but there is no "x" in this pgamma function.
Really appreciate the help in understanding this.
shape <- 1.35721347
scale <- 1/0.01395087
quantiles <- c(3.376354, 3.929347, 4.462594)
pgamma(quantiles, shape = shape, scale = scale, log.p = TRUE)
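The negative values are simply a consequence of log.p = TRUE, which makes pgamma return log(p) rather than p; since probabilities are at most 1, their logs are at most 0. A sketch with the same parameters as above, showing the ordinary CDF values P(X <= x) for each quantile:
# without log.p the result is P(X <= x), one value per quantile
pgamma(quantiles, shape = shape, scale = scale)
# exponentiating the logged values recovers the same probabilities
exp(pgamma(quantiles, shape = shape, scale = scale, log.p = TRUE))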

Fit a Weibull cumulative distribution to mass passing data in R

I have some particle size mass-passing cumulative data for crushed rock material to which I would like to fit a Weibull distribution using R. I have managed to do this in Excel using the WEIBULL.DIST() function with the cumulative switch set to TRUE.
I then used Excel's SOLVER to derive the alpha and beta parameters, using RMSE to get the best fit. I would like to reproduce the result in R.
(see attached spreadsheet here)
The particle size data and cumulative mass-passing % are the following vectors:
d.mm <- c(20.001,6.964,4.595,2.297,1.741,1.149,
0.871,0.574,0.287,0.082,0.062,0.020)
m.pct <- c(1.00,0.97,0.78,0.49,0.27,0.20,0.14,
0.11,0.07,0.03,0.025,0.00)
This is the plot to which I would like to fit the Weibull result:
plot(log10(d.mm),m.pct)
... computing the function for a vector of diameter values, as per the spreadsheet:
d.wei <- c(seq(0.01,0.1,0.01),seq(0.2,1,0.1),seq(2,30,1))
The best-fit values for the Weibull alpha and beta that I determined in Excel with Solver are 1.41 and 3.31, respectively.
So my question is: how do I reproduce this analysis in R, i.e. fit the Weibull to this dataset (not necessarily the Solver part)?
The nonlinear least-squares function nls is the R counterpart of Excel's Solver.
pweibull computes the cumulative distribution function of the Weibull distribution. The comments in the code explain the step-by-step solution:
d.mm <- c(20.001, 6.964, 4.595, 2.297, 1.741, 1.149,
          0.871, 0.574, 0.287, 0.082, 0.062, 0.020)
m.pct <- c(1.00, 0.97, 0.78, 0.49, 0.27, 0.20, 0.14,
           0.11, 0.07, 0.03, 0.025, 0.00)
# create a data frame to store the data
df <- data.frame(m.pct, d.mm)
# diameters at which to predict
d.wei <- c(seq(0.01, 0.1, 0.01), seq(0.2, 1, 0.1), seq(2, 30, 1))
# solve (starting values are required); alpha is the shape, beta the scale
fit <- nls(m.pct ~ pweibull(d.mm, shape = alpha, scale = beta),
           data = df, start = list(alpha = 1, beta = 2))
print(summary(fit))
# extract the fitted shape and scale
print(summary(fit)$parameters[, 1])
# predict new values based on the model
y <- predict(fit, newdata = data.frame(d.mm = d.wei))
# plot the comparison
plot(log10(d.mm), m.pct)
lines(log10(d.wei), y, col = "blue")
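As an aside, coef(fit) is a shorter way to pull the estimates out of an nls fit than indexing into summary(fit):
# named vector with the fitted alpha (shape) and beta (scale)
coef(fit)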

How to define a GEV (generalized extreme value) distribution as a margin of a copula?

I am trying to fit a copula to two variables that have an extreme value distribution. For the mvdc class, I need to define margins and paramMargins. Since GEV is not among the default distribution functions of the R copula package, I got these two values from the evd package, using these two functions:
# pgev gives the generalized extreme value distribution function
GEVmarginU1 <- pgev(U1, loc = 0, scale = 1, shape = 0, lower.tail = TRUE)
GEVmarginV2 <- pgev(V2, loc = 0, scale = 1, shape = 0, lower.tail = TRUE)
# fit a generalized extreme value distribution to my data
MU1 <- fgev(U1, scale = 1, shape = 0)
MV2 <- fgev(V2, scale = 1, shape = 0)
But when I pass these values to the mvdc function, I get an error:
myMvd <- mvdc(copula = ellipCopula(family = "Frank", param = 0), margins = c(pgev, pgev),
              paramMargins = list(list(MU1), list(MV2)))
Most importantly, I want to be sure that I am on the right track. Since the two variables are obtained from a discrete choice model, they have an extreme value distribution. The margins therefore have a GEV distribution too, right? So I need to define GEV margins for mvdc, otherwise my fitted copula will not work well.
(1) $U_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$
(2) $V_i = \gamma_1 Y_{j1} + \gamma_2 Y_{j2} + \gamma_3 Y_{j3} + \eta_i$
in summary:
(1) $U_i = \beta' X_i + \varepsilon_i$
(2) $V_i = \gamma' Y_j + \eta_i$
Since these models come from a discrete choice modelling approach, the error terms follow an extreme value distribution. First step: I estimate the coefficients $\beta_1, \beta_2, \beta_3$ and $\gamma_1, \gamma_2, \gamma_3$ separately for $U_i$ and $V_j$ with a multinomial logit model in the Biogeme software. But intuitively I know that the two variables are dependent, so I try to fit a copula and re-estimate the coefficients taking that dependency into account, i.e. I need the joint probability that $U_i$ and $V_i$ are chosen by decision-maker n.
These margins are transformed to continuous, but still have an extreme value distribution, am I right?
1) How can I define GEV margins when using the mvdc copula class of the R copula package?
2) Assume I used fitCopula instead of mvdc and obtained param (the dependency parameter of the copula). If I understood correctly, fitCopula is for the parametric case, whereas my case is non-parametric, am I right? And how should I then update the coefficients using the joint distribution and the dependency parameter?
For the first question, I found out that my margins are actually logistic: they are differences of two error terms in the utility model, the error terms follow a type 1 extreme value (Gumbel) distribution, and the difference of two Gumbel random variables follows a logistic distribution, according to Wikipedia.
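Given that conclusion, a minimal sketch of how the mvdc object could be set up with logistic margins (the copula family, its parameter, and the location/scale values below are placeholders, not estimates). Note that margins takes character names such as "logis", which resolve to dlogis/plogis/qlogis, rather than function objects like pgev:
library(copula)
# placeholder copula family and parameter, logistic margins
myMvd <- mvdc(copula = frankCopula(param = 2, dim = 2),
              margins = c("logis", "logis"),
              paramMargins = list(list(location = 0, scale = 1),
                                  list(location = 0, scale = 1)))
# simulate from the joint distribution as a sanity check
u <- rMvdc(500, myMvd)
# fit margins and copula parameter jointly by MLE;
# start = margin parameters in order, then the copula parameter
fitMvdc(u, myMvd, start = c(0, 1, 0, 1, 2))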

How can I get the probability density function from a regression random forest?

I am using random-forest for a regression problem to predict the label values of Test-Y for a given set of Test-X (new values of features). The model has been trained over a given Train-X (features) and Train-Y (labels). "randomForest" of R serves me very well in predicting the numerical values of Test-Y. But this is not all I want.
Instead of only a number, I want random forest to produce a probability density function. I searched for a solution for several days, and here is what I found so far:
randomForest produces probabilities only in classification (via predict with type = "prob"), not in regression.
quantregForest provides a nice way to make and visualize prediction intervals, but still not the probability density function!
Any other thought on this?
Please see the predict.all parameter of the predict.randomForest function.
library("ggplot2")
library("randomForest")
data(mpg)
rf <- randomForest(cty ~ displ + cyl + trans, data = mpg)
# predict the first car in the dataset
pred <- predict(rf, newdata = mpg[1, ], predict.all = TRUE)
hist(pred$individual)
The histogram then shows the 500 per-tree ("elementary") predictions, which together approximate the predictive distribution for that observation.
You can also use quantregForest with a very fine grid of quantiles, convert them into a cumulative distribution function (CDF) with the R function ecdf, and turn that CDF into a density estimate with a kernel density estimator.
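A minimal sketch of that second route, with a slight shortcut that skips the explicit ecdf step by treating equally spaced quantiles as approximate draws from the predictive distribution (X_train, y_train and X_new are hypothetical placeholders for your training features, training labels and new observations):
library(quantregForest)
# X_train / y_train / X_new are placeholders, not data from the question
qrf <- quantregForest(x = X_train, y = y_train)
# predict a fine grid of quantiles for the first new observation
qs <- seq(0.01, 0.99, by = 0.01)
qpred <- predict(qrf, newdata = X_new[1, , drop = FALSE], what = qs)
# quantiles at equally spaced probabilities behave like draws from the
# predictive distribution, so a kernel density estimate approximates it
plot(density(as.numeric(qpred)))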

Maximum likelihood lognormal R and SAS

I am converting SAS code to R. One feature is fitting a lognormal distribution in the SAS UNIVARIATE procedure using histograms and midpoints. The result is a table containing the following variables:
EXPPCT - estimated percent of population in histogram interval determined from optional fitted distribution (here it is lognormal)
OBSPCT - percent of variable values in histogram interval
VAR - variable name
MIDPT - midpoint of histogram interval
There is an option in SAS to estimate the theta, zeta and sigma parameters by maximum likelihood when fitting the distribution.
I was able to figure out how to do most of this in R. My only problem arises in the likelihood estimation: when the three parameters are estimated in SAS, R gives me different values.
I am using the following for MLE in R:
library(fitdistrplus)
set.seed(0)
cd <- rlnorm(40,4)
pars <- coef(fitdist(cd, "lnorm"))
pars
#   meanlog     sdlog
# 4.0549354 0.8620153
I am using the following for MLE in SAS (the est options):
proc univariate data = testing;
histogram cd /lognormal (theta = est zeta=est sigma=est)
midpoints = 1 to &maxx. by 100
outhistogram = this;
run;
&maxx denotes the maximum of the input. The results of the run from SAS can be found here.
I am new to statistics, have been unable to find the method SAS uses for this MLE, and have no clue how to estimate the same in R.
Thanks in advance.
I found the packages EnvStats and FAdist, which let me estimate the threshold parameter and use it to fit the three-parameter lognormal distribution. Backlin was right about the parameters. The parameters are not an exact match, but the end result is the same as SAS. Thank you very much.
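A minimal sketch of what that can look like with FAdist, assuming its lnorm3 parameterization (shape = sdlog, scale = meanlog, thres = threshold); the starting values are rough guesses, not SAS's estimates:
library(FAdist)        # provides dlnorm3/plnorm3 for the 3-parameter lognormal
library(fitdistrplus)
set.seed(0)
cd <- rlnorm(40, 4)
# starting values are guesses; thres must stay below min(cd)
fit3 <- fitdist(cd, "lnorm3",
                start = list(shape = 1, scale = 4, thres = 0))
coef(fit3)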
