R: How to fit a large dataset with a combination of distributions? - r

To fit a dataset of real-valued numbers (x) with a single distribution, such as the gamma or Student's t distribution, we can use fitdistr() from the MASS package as follows:
fitdistr(x, "gamma")
or
fitdistr(x2, "t")
What if I believe my dataset should be fitted by the sum of a gamma and a t distribution?
P(X) = Gamma(x) + t(x)
Can I fit the parameters of mixtures of probability distributions using Maximum Likelihood fitting in R?

There are analytic maximum-likelihood estimators for some parameters, such as the mean of a normal distribution or the rate of an exponential distribution. For other parameters, there is no analytic estimator, but you can use numerical analysis to find reasonable parameter estimates.
The fitdistr() function in R uses numerical optimization of the log-likelihood function by calling the optim() function. If you think that your data is a mixture of a gamma and a t distribution, then simply write a likelihood function that describes such a mixture. Then pass that function, together with starting values for the parameters, to optim() for optimization. Here is an example using this approach to fit a (single, normal) distribution:
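library( MASS )
x_vals = rnorm( n = 10000, mean = 0, sd = 1 )
print( summary(x_vals) )
# negative log-likelihood of a normal distribution; params = c(mean, sd)
ll_func = function(params) {
  log_probs = log( dnorm( x = x_vals, mean = params[1], sd = params[2] ))
  tot = sum(log_probs)
  return(-1 * tot)
}
params = c( 0.5, 10 )   # starting values for the optimizer
print( ll_func(params) )
res = optim( params, ll_func )
print( paste("mean:", res$par[1]) )
print( paste("sd:", res$par[2]) )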
Running this program in R produces this output:
[1] "mean: 0.0223766157516646"
[1] "sd: 0.991566611447471"
That's fairly close to the true values of mean = 0 and sd = 1 that were used to generate the data.
Don't forget that with a mixture of two distributions, you have one extra parameter that specifies the relative weights between the distributions. Also, be careful about fitting lots of parameters at once. With lots of free parameters you need to worry about overfitting.
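A minimal sketch of the mixture approach described above, assuming a gamma component plus a location-scale t component with mixing weight w. The simulated data, the helper names mix_dens and nll, and the plogis()/exp() transformations (used only to keep the weight in (0, 1) and the scale parameters positive) are illustrative assumptions, not part of the original answer.

```r
set.seed(1)
n <- 2000
# Simulated data (assumption): 60% gamma(shape = 2, rate = 1),
# 40% t with 5 df, shifted to location 6 with scale 1
is_gamma <- runif(n) < 0.6
x <- ifelse(is_gamma, rgamma(n, shape = 2, rate = 1), 6 + rt(n, df = 5))

# Mixture density; p holds the parameters on an unconstrained scale
mix_dens <- function(x, p) {
  w     <- plogis(p[1])   # mixing weight in (0, 1)
  shape <- exp(p[2])
  rate  <- exp(p[3])
  m     <- p[4]           # t location
  s     <- exp(p[5])      # t scale
  df    <- exp(p[6])      # t degrees of freedom
  w * dgamma(x, shape = shape, rate = rate) +
    (1 - w) * dt((x - m) / s, df = df) / s
}

# Negative log-likelihood of the mixture
nll <- function(p) -sum(log(mix_dens(x, p)))

start <- c(qlogis(0.5), log(1), log(1), median(x), log(sd(x)), log(10))
fit <- optim(start, nll, control = list(maxit = 5000))

# Back-transform to the natural parameter scale
est <- c(w = plogis(fit$par[1]), shape = exp(fit$par[2]),
         rate = exp(fit$par[3]), m = fit$par[4],
         s = exp(fit$par[5]), df = exp(fit$par[6]))
print(round(est, 3))
```

Mixture likelihoods often have several local optima, so it is probably worth rerunning optim() from a few different starting values and comparing fit$value.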

Try mixdist. Here's an example of a mixture of three distributions:
https://stats.stackexchange.com/questions/10062/which-r-package-to-use-to-calculate-component-parameters-for-a-mixture-model

Related

How can I use the results from fitdist to create a probability distribution in R?

I used fitdist from the fitdistrplus package to fit a (gamma) distribution to my data:
fitg <- fitdist(mdt, "gamma")
The result is a list of parameters that describe the fit. I wonder if there is a way to use the result to create cumulative distribution functions and random sample generators from this distribution.
For example, if the distribution fitted with fitdist corresponded to a normal distribution with mean 0 and sd 1, how can I recreate easily pnorm(..,0,1) and rnorm(..,0,1)?
I understand I can do it manually, but it would be much easier for me to have a function doing it "automatically", as I have to do it for many different datasets that will be fitted with different kinds of distributions.
Thanks a lot for your help!
Do you want something like the following?
library(fitdistrplus)
data <- rnorm(1000, 0.01, 1.01) # sampled from original distribution N(0.01, 1.01^2)
fit_and_draw_sample <- function(data, nsamples, distr='norm') {
if (distr == 'norm') {
fitg <- fitdist(data, distr)
params <- fitg$estimate
print(params) # fitted distribution N(0.0398281, 0.9876068^2) with estimated params
# mean sd
# 0.0398281 0.9876068
mu <- params[1]
sigma <- params[2]
return (rnorm(nsamples, mu, sigma))
}
# handle other distributions here
return (NULL)
}
samples <- fit_and_draw_sample(data, 1000)
hist(data, col=scales::alpha('blue',.2), border=FALSE, main='samples from original and fitted distribution')
hist(samples, col=scales::alpha('red',.2), add=TRUE, border=FALSE)
legend('topright', c("original", "fitted"), col = c(rgb(0,0,1,0.2), rgb(1,0,0,0.2)), lwd=c(2,2))
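To avoid writing one branch per distribution, one possible sketch builds the p/q/r functions from the distribution name and the named estimate vector using do.call(). The helper make_dist_funs is hypothetical. The demo uses MASS::fitdistr() to keep it dependency-light; with a fitdistrplus fit, the same two inputs should be available as fitg$distname and fitg$estimate.

```r
library(MASS)

# Hypothetical helper: returns CDF, quantile and sampler closures for a
# fitted distribution, given its base name (e.g. "gamma") and a named
# vector of parameter estimates (e.g. c(shape = ..., rate = ...)).
make_dist_funs <- function(distname, estimate) {
  args <- as.list(estimate)
  list(
    p = function(q) do.call(paste0("p", distname), c(list(q), args)),
    q = function(p) do.call(paste0("q", distname), c(list(p), args)),
    r = function(n) do.call(paste0("r", distname), c(list(n), args))
  )
}

set.seed(42)
x <- rgamma(500, shape = 2, rate = 0.5)
fit <- fitdistr(x, "gamma")

funs <- make_dist_funs("gamma", fit$estimate)
funs$p(4)                    # fitted CDF evaluated at 4
new_sample <- funs$r(1000)   # 1000 draws from the fitted distribution
```

This relies on the estimate names matching the argument names of the p/q/r functions (shape and rate for the gamma, mean and sd for the normal), which holds for the standard distributions in the stats package.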

Computing the marginal likelihood of a Gaussian model in R with integrate()

I am trying to compute the marginal likelihood of a Gaussian model in R. More precisely, I am trying to integrate the likelihood over both a Gaussian prior on mu and a Gaussian prior on sigma, with some observations yi.
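In other words, I am trying to compute (written here to match the code below, with the normal densities parameterised by mean and standard deviation):
$$p(y) = \iint \prod_{i=1}^{n} \mathrm{Normal}(y_i \mid \mu, \sigma)\, \mathrm{Normal}(\mu \mid 110, 1)\, \mathrm{Normal}(\sigma \mid 10, 1)\, d\mu\, d\sigma$$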
I tried to write this in R using the following function (following a similar SA question here: Quadrature to approximate a transformed beta distribution in R):
marglik <- function(data) {
integrand <-
Vectorize(function(data, mu, sigma) {
prod(dnorm(data, mu, sigma) ) * dnorm(mu, 110, 1) * dnorm(sigma, 10, 1)
} )
integrate(integrand, lower = 0, upper = Inf, mu = 100, sigma = 10)$value
}
Using this function, I can compute the marginal likelihood of the above model for a set of observations:
set.seed(666)
d <- rnorm(100, mean = 107.5, sd = 2.5)
marglik(data = d)
[1] 9.704133e-24
However, the results I obtain with this procedure are quite different from results I obtain with grid approximation, or using other packages/softwares.
My question is then: is it possible to do this double integration with integrate ? If it is, how would you do that ?
integrate() only takes in univariate functions. That is, the function you put in must be one-dimensional.
In general, such a problem is better tackled with specialised tools: the bridgesampling package if you have MCMC output, or the cubature package for more general multivariate integration problems.
However, if we absolutely must do this by calling integrate() twice, we can make it work, but some errors need to be removed from the code first. Something like the following would work, although numerically the result seems to be zero most of the time, which is why you would generally try to obtain the log-marginal likelihood instead.
marglik <- function(data) {
# Function that integrates over mu for given sigma.
mu_integrand <- Vectorize(function(sigma) {
mu_given_sigma_fun <- Vectorize(function(mu) {
prod(dnorm(data, mu, sigma) ) * dnorm(mu, 110, 1) * dnorm(sigma, 10, 1)
})
integrate(mu_given_sigma_fun, lower = -Inf, upper = Inf)$value
})
integrate(mu_integrand, lower = 0, upper = Inf)$value
}
set.seed(666)
d <- rnorm(100, mean = 110, sd = 10)
marglik(data = d)
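Since the raw marginal likelihood underflows, a sketch of the log-marginal-likelihood variant mentioned above might work on the log scale and subtract a constant before exponentiating. The function name log_marglik, the shift constant, and the finite integration ranges (roughly the prior mean plus or minus many prior standard deviations) are assumptions made here for numerical stability, not part of the original answer.

```r
log_marglik <- function(data, mu_range = c(100, 120), sigma_range = c(5, 15)) {
  # log of likelihood times priors, for scalar mu and sigma
  log_joint <- function(mu, sigma) {
    sum(dnorm(data, mu, sigma, log = TRUE)) +
      dnorm(mu, 110, 1, log = TRUE) +
      dnorm(sigma, 10, 1, log = TRUE)
  }
  # Constant subtracted before exp() so the integrand does not underflow
  shift <- log_joint(mean(data), sd(data))

  # Integrate over mu for each value of sigma (vectorised over sigma)
  inner <- function(sigma) {
    sapply(sigma, function(s) {
      f <- function(mu) sapply(mu, function(m) exp(log_joint(m, s) - shift))
      integrate(f, lower = mu_range[1], upper = mu_range[2])$value
    })
  }
  outer_int <- integrate(inner, lower = sigma_range[1], upper = sigma_range[2])$value
  shift + log(outer_int)
}

set.seed(666)
d <- rnorm(100, mean = 110, sd = 10)
log_ml <- log_marglik(d)
print(log_ml)
```

Truncating the ranges ignores prior mass far in the tails, which should be negligible here; for other priors the ranges would need to be chosen again.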

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM :
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv performs (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# Under central limit theory in GLM theory, link value is normally distributed
# for negative binomial with `log` link, the response is log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this numerical integration approach is not easy. For example, if another link function such as sqrt() is used for the negative binomial, the distribution of the response is not as straightforward (though also not difficult to derive).
Now I offer a sampling-based (Monte Carlo) approach. This is the most similar to a Bayesian procedure.
N <- 1000 # samples size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that as empirical Bayes, all above methods are conditional on estimated smoothing parameters. If you want something like a "full Bayes", set unconditional = TRUE in predict().
Remark 2
Some people may assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the regression coefficients (assumed fixed, without uncertainty), so mu is fixed rather than random. This is not the predictive distribution, which would integrate out the uncertainty of the model estimation.
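As a sanity check on the numerical integration idea, the sketch below integrates on the link scale instead of the response scale and confirms that the resulting predictive probabilities sum to about one. It does not need a fitted model: eta_hat, eta_se and theta are hypothetical stand-ins for the link-scale fit, its standard error, and the negative binomial size parameter.

```r
# Hypothetical values standing in for the link-scale prediction,
# its standard error, and the negative binomial size parameter
eta_hat <- 1.2
eta_se  <- 0.15
theta   <- 3

# Predictive pmf: integrate Pr(y | mu = exp(eta)) against the normal
# distribution of eta (log link assumed); vectorised over y
pred_pmf <- function(y) {
  sapply(y, function(yy) {
    integrand <- function(eta) {
      dnbinom(yy, size = theta, mu = exp(eta)) * dnorm(eta, eta_hat, eta_se)
    }
    integrate(integrand,
              lower = eta_hat - 8 * eta_se,
              upper = eta_hat + 8 * eta_se)$value
  })
}

y_grid  <- 0:200
p_pred  <- pred_pmf(y_grid)
plug_in <- dnbinom(y_grid, size = theta, mu = exp(eta_hat))

total <- sum(p_pred)   # should be close to 1
# Compare predictive and plug-in probabilities for y = 7
c(predictive = pred_pmf(7), plug_in = plug_in[8])
```

Working on the link scale avoids having to derive the distribution of mu for each link function; only the inverse link inside the integrand would change.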

Finding critical values for the Pearson correlation coefficient

I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
cor(x,y)
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function( n, alpha = .05 ) {
df <- n - 2
critical.t <- qt(alpha/2, df, lower.tail = FALSE)
critical.r <- sqrt( (critical.t^2) / ( (critical.t^2) + df ) )
return(critical.r)
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r( 100 )
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
methods(cor.test)
getAnywhere(cor.test.default)
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative, less = c(-Inf, z + sigma *
qnorm(conf.level)), greater = c(z - sigma * qnorm(conf.level),
Inf), two.sided = z + c(-1, 1) * sigma * qnorm((1 +
conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
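A table of critical values like the one described can be sketched by inverting the t-statistic formula used in cor.test. The function name critical_r and the chosen sample sizes and alpha levels are illustrative assumptions; the test is taken to be two-sided.

```r
# Critical |r| for a two-sided test at level alpha with sample size n,
# obtained by solving t = r * sqrt(df / (1 - r^2)) for r
critical_r <- function(n, alpha = 0.05) {
  df <- n - 2
  t_crit <- qt(alpha / 2, df, lower.tail = FALSE)
  t_crit / sqrt(t_crit^2 + df)
}

n <- c(5, 10, 20, 30, 50, 100)
tab <- data.frame(n = n, df = n - 2,
                  r_05 = critical_r(n, 0.05),
                  r_01 = critical_r(n, 0.01))
print(round(tab, 3))

# Round trip: a correlation exactly at the critical value has p = alpha
r_crit <- critical_r(30)
t_stat <- r_crit * sqrt(28 / (1 - r_crit^2))
p_val  <- 2 * pt(t_stat, df = 28, lower.tail = FALSE)
```

The round trip at the end is just a consistency check that the critical value maps back to a p-value of 0.05.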
I would do the same. But if you are using a Spearman correlation you need to convert t into r using a different formula.
Just change the last line before the return in the function to this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)

calculate vector valued Hessian in R

I want to calculate a variance-covariance matrix of parameters. The parameters are obtained by a non-linear least squares fit.
library(minpack.lm)
library(numDeriv)
# variables
t <- seq(0.1,20,0.3)
a <- 20
b <- 14
c <- 0.4
jitter <- rnorm(length(t),0,0.5)
Hobs <- a+b*exp(-c*t)+jitter
# function def
Hhat <- function(parList, t) {parList$a + parList$b*exp(-parList$c*t)}
Hhatde <- function(par, t) {par[1] + par[2]*exp(-par[3]*t)}
residFun <- function(par, t, observed) observed - Hhat(par,t)
# initial conditions
parStart = list(a = 20, b = 10 ,c = 0.5)
# nls.lm
library(minpack.lm)
out1 <- nls.lm(par = parStart, fn = residFun, observed = Hobs,
t = t, control = nls.lm.control(nprint=0))
I wish to calculate manually what is returned by vcov(out1).
I tried the following, but sigma and vcov(out1) don't seem to be the same:
J <- jacobian(Hhatde, c(19.9508523,14.6586555,0.4066367 ), method="Richardson",
method.args=list(),t=t)
sigma <- solve((t(J)%*%J))
vcov(out1)
Now, trying to do it with the Hessian instead, I can't get it to work; see the error message below.
# hessian
H <- hessian(Hhatde, x = c(19.9508523,14.6586555,0.4066367 ), method="complex", method.args=list(),t=t)
Error in hessian.default(Hhatde, x = c(19.9508523, 14.6586555, 0.4066367), :
Richardson method for hessian assumes a scalar valued function.
How do I get my hessian() call to work?
I am not very strong on the math here, hence the trial and error approach.
vcov(out1) returns an estimate of the scaled variance-covariance matrix for the parameters in your model. The inverse of the cross product of the gradient, solve(crossprod(J)) returns an estimate of the unscaled variance-covariance matrix. The scaling factor is the estimated variance of the errors. So to calculate the scaled variance-covariance matrix (with some rounding error) using the gradient and the residuals from your model:
df = length(Hobs) - length(out1$par) # degrees freedom
se_var = sum(out1$fvec^2) / df # estimated error variance
var_cov = se_var * solve(crossprod(J)) # scaled variance-covariance
print(var_cov)
print(vcov(out1))
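The same calculation can be sketched end to end with base R only, assuming stats::nls() in place of nls.lm() and an analytic Jacobian in place of numDeriv::jacobian(). The final lines also show one way around the Hessian error: a Hessian needs a scalar-valued function, so it is taken here of the residual sum of squares (via stats::optimHess()) rather than of the model function. The names rss, vc_manual and vc_hess are illustrative.

```r
set.seed(1)
t <- seq(0.1, 20, 0.3)
Hobs <- 20 + 14 * exp(-0.4 * t) + rnorm(length(t), 0, 0.5)

fit <- nls(Hobs ~ a + b * exp(-c * t), start = list(a = 20, b = 10, c = 0.5))
p <- coef(fit)

# Analytic Jacobian of the model with respect to (a, b, c)
J <- cbind(a = 1,
           b = exp(-p[["c"]] * t),
           c = -p[["b"]] * t * exp(-p[["c"]] * t))

# Estimated error variance: residual sum of squares / degrees of freedom
sigma2 <- sum(residuals(fit)^2) / (length(t) - length(p))

# Scaled variance-covariance matrix; should agree closely with vcov(fit)
vc_manual <- sigma2 * solve(crossprod(J))
print(vc_manual)
print(vcov(fit))

# Hessian route: the residual sum of squares is scalar-valued
rss <- function(par) sum((Hobs - (par[1] + par[2] * exp(-par[3] * t)))^2)
H <- optimHess(p, rss)
vc_hess <- 2 * sigma2 * solve(H)   # approximate; not identical to vc_manual
```

The Hessian-based matrix includes second-derivative terms that the Jacobian-based (Gauss-Newton) form drops, so the two are expected to be similar rather than equal.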
To brush up on non-linear regression and non-linear least squares, you might wish to check out Seber & Wild's Nonlinear regression, or Bates & Watts' Nonlinear regression analysis and its applications. John Fox also has a short online appendix that you may find helpful.
