compute the density of a multivariate Dirichlet and Gamma distribution in R

I'd like to compute the density of a multivariate Dirichlet distribution and to generate random realizations from such a distribution, like what the function dmvnorm does for the multivariate normal distribution. I found the following for the normal distribution, and I would like to know whether there is a function that can do the same for the Dirichlet and Gamma distributions:
library(mvtnorm)  # provides dmvnorm
g <- expand.grid(x = seq(-2, 2, 0.05), y = seq(-2, 2, 0.05))  ## grid over the two coordinates
g$z <- dmvnorm(x = cbind(g$x, g$y), mean = c(0, 0), sigma = diag(2), log = FALSE)
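For the Dirichlet part, there is no such function in base R, but the gtools package provides rdirichlet() and ddirichlet() (MCMCpack has equivalents). A minimal sketch, with the concentration parameters alpha chosen as example values:
library(gtools)                   # ddirichlet() and rdirichlet()
alpha <- c(2, 3, 5)               # example concentration parameters (assumed values)
draws <- rdirichlet(1000, alpha)  # 1000 random realizations; each row sums to 1
dens  <- ddirichlet(draws, alpha) # density evaluated at each realization
## density over a grid on the 2-simplex, analogous to the dmvnorm example above:
g <- expand.grid(x = seq(0.01, 0.98, 0.01), y = seq(0.01, 0.98, 0.01))
g <- g[g$x + g$y < 1, ]           # keep only points inside the simplex
g$z <- ddirichlet(cbind(g$x, g$y, 1 - g$x - g$y), alpha)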

Related

Plotting the CDF of the Generalized Pareto Distribution

I need to plot the CDF of a generalized Pareto distribution for x greater than 100,000,000, with location parameter 100,000,000, scale parameter 49,761,000 and shape parameter 0.10. The CDF starts at probability 0.946844; the values below 100,000,000 are modeled by a uniform distribution. I only need to plot the CDF of the GPD.
library(DescTools)
x <- 100000001:210580000
pareto_distribution <- dGenPareto(x, 100000000, 49761000, 0.10)
graph <- data.frame(loss = x, probability = pareto_distribution)
plot(graph)
When I try the code above, the probabilities start at 0. I know that dGenPareto is not the CDF; I was starting from the pdf and intending to compute the CDF from it. How do I rescale the GPD so that its CDF starts at 0.946844 rather than at zero?
I am expecting the CDF of the GPD to start at 0.946844 when x = 100,000,000. The x values are discrete.
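A hedged sketch of one way to do this, assuming DescTools also exports pGenPareto (the CDF counterpart of the dGenPareto used above): evaluate the GPD CDF and rescale it onto [0.946844, 1), since the spliced model places probability 0.946844 below the threshold.
library(DescTools)
p0 <- 0.946844                               # mass below the threshold (from the uniform part)
x  <- seq(100000000, 210580000, by = 10000)  # a coarser grid; 110 million points are not needed
F_spliced <- p0 + (1 - p0) * pGenPareto(x, 100000000, 49761000, 0.10)
plot(x, F_spliced, type = "l", xlab = "loss", ylab = "cumulative probability", ylim = c(0, 1))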

MASS::fitdistr negative binomial with weights in R

We are carrying out an Operational Risk study; in particular, we are fitting the frequency distribution with a negative binomial as follows:
# Negative binomial fitting
fit <- MASS::fitdistr(datosf$Freq, "negative binomial")[[1]]
BN_s  <- fit[1]
BN_mu <- fit[2]
# fitdistr parametrizes the NB with size and mu; we compute p as size / (size + mu)
BN_prob <- fit[1] / (fit[1] + fit[2])
# scale size to model annual frequency
BN_size <- BN_s * f_escala
# goodness-of-fit test
chi_2_test <- chisq.test(datosf$Freq, rnbinom(n = l, size = BN_s, prob = BN_prob))
# goodness-of-fit plot
nbinom <- function(x) dnbinom(x, size = BN_s, mu = BN_mu)
hist(datosf$Freq, freq = FALSE, nclass = 50)
curve(nbinom, from = 0, to = max(datosf$Freq), n = max(datosf$Freq) + 1, add = TRUE, col = "blue")
In the data frame column datosf$Freq we have the frequency of the historical series, grouped monthly.
Currently, our objective is to weight these years according to the time horizon using the function:
w(t) = 1.05 - t/20, where t is the number of years, t = 1, ..., 10
i.e., the objective is to maximise the following likelihood function:
L(x_i, \theta) = \prod_{i} w_i f(x_i, \theta)
where x_i is the frequency and f(x_i, \theta) is the negative binomial density function.
How can we adapt the code to include the weights w_i?
Thank you very much!
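MASS::fitdistr() has no weights argument, but since it simply minimizes the negative log-likelihood via optim(), you can do the same directly. A minimal sketch, reading the weights in the usual weighted-likelihood sense, L(\theta) = \prod_i f(x_i; \theta)^{w_i}, i.e. maximizing \sum_i w_i \log f(x_i; \theta) (note that constant multiplicative weights, as literally written above, would not change the maximizer). The per-observation year index datosf$t is an assumption:
w <- 1.05 - datosf$t / 20   # w(t) = 1.05 - t/20, t = 1, ..., 10 (datosf$t assumed to exist)
negloglik <- function(par) {
  size <- exp(par[1])       # log-parametrization keeps size and mu positive
  mu   <- exp(par[2])
  -sum(w * dnbinom(datosf$Freq, size = size, mu = mu, log = TRUE))
}
fit_w   <- optim(c(log(BN_s), log(BN_mu)), negloglik)  # start from the unweighted fit
BN_s_w  <- exp(fit_w$par[1])
BN_mu_w <- exp(fit_w$par[2])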

Generate random numbers with bivariate gamma distribution in R

How can I generate random numbers from a bivariate gamma distribution? The density is:
f_{X,Y}(x, y) = \frac{\alpha^{p+q} \, x^{p-1} (y - x)^{q-1} e^{-\alpha y}}{\Gamma(p)\,\Gamma(q)} \; \mathbb{I}_{\{0 \le x \le y\}}
with y > x > 0, \alpha > 0, p > 0 and q > 0.
I did not find any R package that does this, and nothing in the literature.
This is straightforward:
1. Generate X ~ Gamma(p, alpha) (alpha being the rate parameter in your formulation).
2. Generate W ~ Gamma(q, alpha), independent of X.
3. Set Y = X + W.
Then (X, Y) has the required bivariate distribution.
In R (assuming p, q, alpha and n are already defined):
x <- rgamma(n, p, alpha)      # X ~ Gamma(p, alpha)
y <- x + rgamma(n, q, alpha)  # Y = X + W with W ~ Gamma(q, alpha)
This generates n draws from the bivariate distribution with parameters p, q, alpha.
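A quick sanity check, with assumed example parameter values: the marginal means should be E[X] = p/alpha and E[Y] = (p + q)/alpha, and the support constraint x <= y holds by construction.
n <- 1e5; p <- 2; q <- 3; alpha <- 1.5  # example values (assumptions)
x <- rgamma(n, p, alpha)
y <- x + rgamma(n, q, alpha)
c(mean(x), p / alpha)                   # sample vs. theoretical marginal mean of X
c(mean(y), (p + q) / alpha)             # sample vs. theoretical marginal mean of Y
mean(x <= y)                            # should be exactly 1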

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n <- 400
dat <- gamSim(1, n = n)
g <- exp(dat$f / 5)
## negative binomial data...
dat$y <- rnbinom(g, size = 3, mu = g)
## fit with theta estimation...
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), family = nb(), data = dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv does (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link scale (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# by large-sample GLM theory the link value is approximately normal, so with
# the `log` link the mean `mu` is log-normally distributed
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density, as the negative binomial is discrete) of `y`, integrating out `mu`
# this function is written so it can take vector input
p.y <- function (y) {
  scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
  sapply(y, scalar.p.y)
}
Now, since you want the probability of y = 7 conditional on the specified new data, use
p.y(7)
# 0.07810065
In general, this numerical-integration approach is not easy. For example, if another link function such as sqrt() is used for the negative binomial, the implied distribution of mu is not as straightforward (though also not difficult to derive).
Now I offer a sampling-based, or Monte Carlo, approach. This is closest in spirit to a Bayesian procedure.
N <- 1000  # sample size
set.seed(0)
## draw N samples from the posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from the likelihood `Pr(y | mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that, as empirical Bayes, all of the methods above are conditional on the estimated smoothing parameters. If you want something like "full Bayes", set unconditional = TRUE in predict().
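For example, the same predict() call as above with the extra flag, so that the smoothing-parameter uncertainty is propagated into the standard errors:
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
             se.fit = TRUE, unconditional = TRUE)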
Remark 2
Perhaps some people assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the regression coefficients (treated as fixed, without uncertainty), so mu is fixed rather than random. This is not a predictive distribution: a predictive distribution integrates out the uncertainty of model estimation.

R: How to fit a large dataset with a combination of distributions?

To fit a dataset of real-valued numbers x with a single distribution, we can use MASS::fitdistr() with, for example, the gamma or Student's t distribution:
fitdistr(x, "gamma")
or
fitdistr(x, "t")
What if I believe my dataset should be fit by a combination of the gamma and t distributions?
P(X) = Gamma(x) + t(x)
Can I fit the parameters of mixtures of probability distributions using Maximum Likelihood fitting in R?
There are analytic maximum-likelihood estimators for some parameters, such as the mean of a normal distribution or the rate of an exponential distribution. For other parameters, there is no analytic estimator, but you can use numerical analysis to find reasonable parameter estimates.
The fitdistr() function in R numerically optimizes the log-likelihood by calling optim(). If you think your data is a mixture of a gamma and a t distribution, simply write a likelihood function that describes such a mixture, then pass starting parameter values to optim(). Here is an example of this approach, fitting a single distribution:
library(MASS)
vals <- rnorm(n = 10000, mean = 0, sd = 1)
print(summary(vals))
ll_func <- function(params) {
  # negative log-likelihood of a normal with mean = params[1] and sd = params[2]
  log_probs <- log(dnorm(x = vals, mean = params[1], sd = params[2]))
  -sum(log_probs)
}
params <- c(0.5, 10)  # starting values
print(ll_func(params))
res <- optim(params, ll_func)
print(paste("mean:", res$par[1]))
print(paste("sd:", res$par[2]))
Running this program in R produces this output:
[1] "mean: 0.0223766157516646"
[1] "sd: 0.991566611447471"
That's fairly close to the true values of mean = 0 and sd = 1.
Don't forget that with a mixture of two distributions, you have one extra parameter specifying the relative weight of the two components. Also, be careful about fitting many parameters at once; with many free parameters you need to worry about overfitting.
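A minimal sketch of such a mixture likelihood for gamma + t, with the mixing weight kept in (0, 1) via plogis() and positivity enforced via exp(). The location shift m for the t component is an added assumption, since dt() is centered at zero while the data need not be:
mix_nll <- function(par, x) {
  w     <- plogis(par[1])  # mixing weight in (0, 1)
  shape <- exp(par[2])     # gamma shape > 0
  rate  <- exp(par[3])     # gamma rate > 0
  m     <- par[4]          # t location shift (assumption)
  df    <- exp(par[5])     # t degrees of freedom > 0
  dens  <- w * dgamma(x, shape, rate) + (1 - w) * dt(x - m, df)
  -sum(log(dens))
}
## usage, with x the data vector and rough starting values:
# res <- optim(c(0, log(1), log(1), median(x), log(5)), mix_nll, x = x)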
Try mixdist. Here's an example of a mixture of three distributions:
https://stats.stackexchange.com/questions/10062/which-r-package-to-use-to-calculate-component-parameters-for-a-mixture-model
