Calculating mean and variance from a density kernel using Monte Carlo methods - r

Assuming a probability distribution has a density kernel of x^2 * exp(-x^2/4) on (0, Inf),
what Monte Carlo methods can I use to estimate the mean and variance of the distribution?

We can use numerical methods here. First of all, we create a function to represent your probability density function (though this is not yet scaled so that its integral is 1 over its full domain):
pdf <- function(x) x^2 * exp(-x^2/4)
plot(pdf, xlim = c(0, 10))
To get the scaling factor to turn this into a genuine pdf, we can integrate this function over its domain of c(0, Inf).
integrate(pdf, 0, Inf)$value
#> [1] 3.544908
So now we can generate a genuine pdf by dividing our original pdf by this amount:
pdf <- function(x) x^2 * exp(-x^2/4) / 3.544908
plot(pdf, xlim = c(0, 10))
Now that we have a pdf, we can create a cdf with numerical integration:
cdf <- function(x) sapply(x, \(i) integrate(pdf, 0, i)$value)
plot(cdf, xlim = c(0, 10))
The inverse of the cdf is what we need, to be able to convert a sample taken from a uniform distribution between 0 and 1 into a sample drawn from our new distribution. We can create this inverse function using uniroot to find where the output of our cdf matches an arbitrary number between 0 and 1:
inverse_cdf <- function(p)
{
  sapply(p, function(i) {
    uniroot(function(a) {cdf(a) - i}, c(0, 100))$root
  })
}
The inverse cdf looks like this:
plot(inverse_cdf, xlim = c(0, 0.99))
We are now ready to draw a sample from our distribution:
set.seed(1) # Makes this draw reproducible
x_sample <- inverse_cdf(runif(1000))
Now we can plot a histogram of our sample and ensure it matches the pdf:
hist(x_sample, freq = FALSE)
plot(function(x) pdf(x), add = TRUE, xlim = c(0, 6))
Now that we have a sample drawn from x, we can use the sample mean and variance as estimates for the distribution's mean and variance:
mean(x_sample)
#> [1] 2.264438
var(x_sample)
#> [1] 0.9265678
We can increase the accuracy of these estimates by increasing the sample size, i.e. by replacing the 1000 in inverse_cdf(runif(1000)) with a larger number.
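As a cross-check that does not rely on sampling at all (a minimal sketch, using the normalised pdf defined above), we can also obtain the mean and variance by numerical integration of the first two moments:
m1 <- integrate(function(x) x * pdf(x), 0, Inf)$value   # E[X]
m2 <- integrate(function(x) x^2 * pdf(x), 0, Inf)$value # E[X^2]
m1        # should be close to mean(x_sample)
m2 - m1^2 # should be close to var(x_sample)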
Created on 2021-11-06 by the reprex package (v2.0.0)

Related

How can I use the results from fitdist to create a probability distribution in R?

I used fitdist from the fitdistrplus package to fit a (gamma) distribution to my data:
fitg <- fitdist(mdt, "gamma")
The result is a list of parameters that describe the fit. I wonder if there is a way to use the result to create cumulative distribution functions and random sample generators from this distribution.
For example, if the distribution fitted with fitdist corresponded to a normal distribution with mean 0 and sd 1, how can I easily recreate pnorm(.., 0, 1) and rnorm(.., 0, 1)?
I understand I can do it manually, but it would be much easier for me to have a function doing it "automatically", as I have to do it for many different datasets that will be fitted with different kinds of distributions.
Thanks a lot for your help!
Do you want something like the following?
library(fitdistrplus)
data <- rnorm(1000, 0.01, 1.01) # sampled from original distribution N(0.01, 1.01^2)
fit_and_draw_sample <- function(data, nsamples, distr = 'norm') {
  if (distr == 'norm') {
    fitg <- fitdist(data, distr)
    params <- fitg$estimate
    print(params) # fitted distribution N(0.0398281, 0.9876068^2) with estimated params
    #      mean        sd
    # 0.0398281 0.9876068
    mu <- params[1]
    sigma <- params[2]
    return(rnorm(nsamples, mu, sigma))
  }
  # handle other distributions here
  return(NULL)
}
samples <- fit_and_draw_sample(data, 1000)
hist(data, col=scales::alpha('blue',.2), border=FALSE, main='samples from original and fitted distribution')
hist(samples, col=scales::alpha('red',.2), add=TRUE, border=FALSE)
legend('topright', c("original", "fitted"), col = c(rgb(0,0,1,0.2), rgb(1,0,0,0.2)), lwd=c(2,2))
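A more general sketch (not part of the original answer): base R names its random generators "r" + the distribution name that fitdist uses ("norm", "gamma", "weibull", ...), so, assuming the fitted family follows that convention, you can build the call programmatically instead of branching on each distribution. The function name fit_and_draw_sample2 is just illustrative:
fit_and_draw_sample2 <- function(data, nsamples, distr = 'norm') {
  fitg <- fitdist(data, distr)
  params <- as.list(fitg$estimate) # named estimates, e.g. mean/sd or shape/rate
  do.call(paste0('r', distr), c(list(n = nsamples), params)) # e.g. rnorm(n, mean, sd)
}
samples_gamma <- fit_and_draw_sample2(rgamma(1000, shape = 2, rate = 1), 1000, 'gamma')
The same idea gives you the cdf and quantile function via paste0('p', distr) and paste0('q', distr).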

how to generate data from a distribution whose cdf is not in closed form?

I am working on a distribution whose PDF is f(x) = ((beta/alpha) * (x/alpha)^beta * sin(pi/beta)) / ((pi/beta) * (1 + (x/alpha)^beta)^2) (implemented as pdf in the code below).
Clearly, the CDF is not in a neat closed form. So how do I generate data from this distribution in R, given that the (analytical) inverse CDF method can't be applied here?
I am using the following R program to generate a sample from this distribution, but I am not able to justify every step. Here is my code:
alpha=1.5; beta=3.2;
pdf=function(x)((beta/alpha)*((x/alpha)^beta)*sin(pi/beta))/(((1+(x/alpha)^beta)^2)*(pi/beta))
rand_smplfunc <- function(func, lower, upper){
  x_values <- seq(lower, upper, by = 10^-3)
  sample(x = x_values, size = 1, replace = TRUE, prob = func(x_values))
}
replicate(10,rand_smplfunc(pdf,0,10))
You can do it the following way, using the inverse cdf method with the empirical cdf (you don't need the analytically derived cdf):
alpha = 1.5
beta = 3.2
pdf = function(x)((beta/alpha)*((x/alpha)^beta)*sin(pi/beta))/(((1+(x/alpha)^beta)^2)*(pi/beta))
cdf = function(x) cumsum(pdf(x)) / sum(pdf(x)) # approximate the cdf
x <- seq(0.1, 50, 0.1)
y <- pdf(x)
plot(x, pdf(x), type='l', main='pdf', ylab='f(x)') # plot the pdf
plot(x, cdf(x), type='l', main='cdf', ylab='F(x)') # plot the cdf
# draw n samples via the probability integral transform: push uniform random
# samples through the (approximately computed) inverse cdf
n <- 5000
random_samples <- approx(x = cdf(x), y = x, runif(n))$y
# plot histogram of the samples
hist(random_samples, 200)
From the above plot, you can see that the distribution of the generated samples indeed resembles the original pdf.
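As a further check (a sketch, not part of the original answer), you can overlay the pdf on a density-scaled histogram; this assumes the pdf in the question already integrates to 1 over its support (if in doubt, divide by integrate(pdf, 0, Inf)$value first):
hist(random_samples, breaks = 100, freq = FALSE, main = 'samples vs pdf')
curve(pdf(x), from = 0, to = 20, add = TRUE, col = 'red')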
You could use the empirical cdf function ecdf as an approximation. E.g.
set.seed(7)
pdf <- rnorm(10000) # despite the name, this is a large sample from the target distribution, not a density
cdf <- ecdf(pdf)    # empirical cdf of that sample
summary(cdf)
Empirical CDF: 10000 unique values with summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.930902 -0.671235 -0.001501 0.001928 0.669820 3.791824
range <- -3:3
cdf(range)
[1] 0.0013 0.0219 0.1601 0.5007 0.8420 0.9748 0.9988
So similarly generate a sufficient number of realizations of your pdf and then use the ecdf function to approximate your cdf.
Note that you can also plot your empirical cdf: plot(cdf)
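If you also want to draw new values from that approximation (a sketch of one possible extension, not spelled out in this answer), the sample quantile function can serve as an approximate inverse of the empirical cdf:
set.seed(7)
big_sample <- rnorm(10000) # stands in for realizations generated from your pdf
new_draws <- quantile(big_sample, runif(500), names = FALSE) # approximate inverse-cdf sampling
hist(new_draws, freq = FALSE)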

get the derivative of an ECDF

Is it possible to differentiate an ECDF? Take the one obtained in the following example.
set.seed(1)
a <- sort(rnorm(100))
b <- ecdf(a)
plot(b)
I would like to take the derivative of b in order to obtain its probability density function (PDF).
n <- length(a) ## `a` must be sorted in non-decreasing order already
plot(a, 1:n / n, type = "s") ## "staircase" plot; not "line" plot
However I'm looking to find the derivative of b
In sample-based statistics, an estimated density (for a continuous random variable) is not obtained from the ECDF by differentiation, because the sample size is finite and an ECDF is not differentiable. Instead, we estimate the density directly. I guess plot(density(a)) is what you are really looking for.
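For completeness, a minimal sketch of that direct density estimate, overlaid on a density-scaled histogram of the sample:
set.seed(1)
a <- sort(rnorm(100))
hist(a, freq = FALSE)      # histogram on the density scale
lines(density(a), col = 2) # kernel density estimate of the PDF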
a few days later..
Warning: the following is just a numerical solution without statistical ground!
I take it as an exercise to learn about R package scam for shape constrained additive models, a child package of mgcv by Prof Wood's early PhD student Dr Pya.
The logic is as such:
using scam::scam, fit a monotonically increasing P-spline to the ECDF (you have to specify how many knots you want); [Note that monotonicity is not the only theoretical constraint. The smoothed ECDF must also be "clipped" at its two edges: the left edge at 0 and the right edge at 1. I am currently imposing this constraint via weights, by giving very large weight to the two edges.]
using stats::splinefun, reparametrize the fitted spline with a monotonic interpolation spline through knots and predicted values at knots;
return the interpolation spline function, which can also evaluate the 1st, 2nd and 3rd derivatives.
Why I expect this to work:
As sample size grows,
ECDF converges to CDF;
the P-spline is consistent, so the smoothed ECDF will be increasingly unbiased for the ECDF;
the 1st derivative of smoothed ECDF will be increasingly unbiased for PDF.
Use with caution:
You have to choose number of knots yourself;
the derivative is NOT normalized so that the area under the curve is 1;
the result can be rather unstable, and is only good for large sample size.
function arguments:
x: a vector of samples;
n.knots: number of knots;
n.cells: number of grid points when plotting derivative function
You need to install scam package from CRAN.
library(scam)
test <- function (x, n.knots, n.cells) {
  ## get ECDF
  n <- length(x)
  x <- sort(x)
  y <- 1:n / n
  dat <- data.frame(x = x, y = y) ## make sure `scam` can find `x` and `y`
  ## fit a monotonically increasing P-spline for ECDF
  fit <- scam::scam(y ~ s(x, bs = "mpi", k = n.knots), data = dat,
                    weights = c(n, rep(1, n - 2), 10 * n))
  ## interior knots
  xk <- with(fit$smooth[[1]], knots[4:(length(knots) - 3)])
  ## spline values at interior knots
  yk <- predict(fit, newdata = data.frame(x = xk))
  ## reparametrization into a monotone interpolation spline
  f <- stats::splinefun(xk, yk, "hyman")
  par(mfrow = c(1, 2))
  plot(x, y, pch = 19, col = "gray") ## ECDF
  lines(x, f(x), type = "l") ## smoothed ECDF
  title(paste0("number of knots: ", n.knots,
               "\neffective degree of freedom: ", round(sum(fit$edf), 2)),
        cex.main = 0.8)
  xg <- seq(min(x), max(x), length = n.cells)
  plot(xg, f(xg, 1), type = "l") ## density estimated by scam
  lines(stats::density(x), col = 2) ## a proper density estimate by density
  ## return smooth ECDF function
  f
}
## try large sample size
set.seed(1)
x <- rnorm(1000)
f <- test(x, n.knots = 20, n.cells = 100)
f is a function as returned by stats::splinefun (read ?splinefun).
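A small usage sketch, building on the caveat above that the derivative is not normalised: evaluate the estimated density f(., deriv = 1), check its area, and renormalise if you wish.
dens_hat <- function(z) f(z, deriv = 1)           # estimated density from the smoothed ECDF
area <- integrate(dens_hat, min(x), max(x))$value # typically close to, but not exactly, 1
zg <- seq(min(x), max(x), length.out = 200)
plot(zg, dens_hat(zg) / area, type = "l")         # renormalised density estimate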
A naive, similar approach is to fit an interpolation spline to the ECDF without any smoothing. But this is a very bad idea, as we then have no consistency.
g <- splinefun(sort(x), 1:length(x) / length(x), method = "hyman")
curve(g(x, deriv = 1), from = -3, to = 3)
A reminder: it is highly recommended to use stats::density for a direct density estimation.

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM:
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv is doing (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# By large-sample theory for GLMs, the prediction on the link scale is approximately normal;
# with the `log` link of this negative binomial model, `mu` is therefore (approximately) log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
  scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
  sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this numerical-integration approach is not easy. For example, if another link function like sqrt() is used for the negative binomial, the distribution of the response is not as straightforward (though also not difficult to derive).
Now I offer a sampling-based, or Monte Carlo, approach. This is most similar to a Bayesian procedure.
N <- 1000 # sample size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(N, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
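The same posterior samples also give you the whole predictive distribution, not just Pr(y = 7); a quick sketch:
y.grid <- 0:30
pred.pmf <- sapply(y.grid, function(k) mean(sample.y == k))
plot(y.grid, pred.pmf, type = "h", xlab = "y", ylab = "predictive probability")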
Remark 1
Note that, being empirical Bayes, all of the above is conditional on the estimated smoothing parameters. If you want something closer to a "full Bayes" treatment, set unconditional = TRUE in predict().
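For example (a sketch, reusing the same newdata), the only change is in the predict() call; note that unconditional = TRUE needs the smoothing-parameter-corrected covariance matrix, which is available for REML/ML fits:
l2 <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
              se.fit = TRUE, unconditional = TRUE)
# then reuse l2[[1]] and l2[[2]] in p.mu() or in the sampling step above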
Remark 2
Perhaps some people assume the solution is as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such a result is conditional on the regression coefficients (treated as fixed, without uncertainty), so mu is fixed rather than random. This is not a predictive distribution; a predictive distribution integrates out the uncertainty of model estimation.

Exponential curve fitting in R

time = 1:100
head(y)
0.07841589 0.07686316 0.07534116 0.07384931 0.07238699 0.07095363
plot(time,y)
This is an exponential curve.
How can I fit a curve to these data without knowing the formula? I can't use nls() because the formula is unknown (only the data points are given).
How can I get the equation for this curve and determine the constants in the equation?
I tried loess but it doesn't give the intercepts.
You need a model to fit to the data.
Without knowing the full details of your model, let's say that this is an exponential growth model, which one could write as: y = a * e^(r*t), where y is your measured variable, t is the time at which it was measured, a is the value of y when t = 0, and r is the growth constant. We want to estimate a and r.
This is a non-linear problem because we want to estimate the exponent, r.
However, in this case we can use some algebra and transform it into a linear equation by taking the log on both sides and solving (remember
logarithmic rules), resulting in:
log(y) = log(a) + r * t
We can visualise this with an example, by generating a curve from our model, assuming some values for a and r:
t <- 1:100 # these are your time points
a <- 10 # assume the size at t = 0 is 10
r <- 0.1 # assume a growth constant
y <- a*exp(r*t) # generate some y observations from our exponential model
# visualise
par(mfrow = c(1, 2))
plot(t, y) # on the original scale
plot(t, log(y)) # taking the log(y)
So, for this case, we could explore two possibilities:
Fit our non-linear model to the original data (for example using nls() function)
Fit our "linearised" model to the log-transformed data (for example using the lm() function)
Which option to choose (and there's more options), depends on what we think
(or assume) is the data-generating process behind our data.
Let's illustrate with some simulations that include added noise (sampled from
a normal distribution), to mimic real data. Please look at this
StackExchange post
for the reasoning behind this simulation (pointed out by Alejo Bernardin's comment).
set.seed(12) # for reproducible results
# errors constant across time - additive
y_add <- a*exp(r*t) + rnorm(length(t), sd = 5000) # or: rnorm(length(t), mean = a*exp(r*t), sd = 5000)
# errors grow as y grows - multiplicative (constant on the log-scale)
y_mult <- a*exp(r*t + rnorm(length(t), sd = 1)) # or: rlnorm(length(t), mean = log(a) + r*t, sd = 1)
# visualise
par(mfrow = c(1, 2))
plot(t, y_add, main = "additive error")
lines(t, a*exp(t*r), col = "red")
plot(t, y_mult, main = "multiplicative error")
lines(t, a*exp(t*r), col = "red")
For the additive model, we could use nls(), because the error is constant across
t. When using nls() we need to specify some starting values for the optimization algorithm (try to "guesstimate" what these are, because nls() often struggles to converge on a solution).
add_nls <- nls(y_add ~ a*exp(r*t),
start = list(a = 0.5, r = 0.2))
coef(add_nls)
# a r
# 11.30876845 0.09867135
Using the coef() function we can get the estimates for the two parameters.
This gives us OK estimates, close to what we simulated (a = 10 and r = 0.1).
You could see that the error variance is reasonably constant across the range of the data, by plotting the residuals of the model:
plot(t, resid(add_nls))
abline(h = 0, lty = 2)
For the multiplicative error case (our y_mult simulated values), we should use lm() on log-transformed data, because
the error is constant on that scale instead.
mult_lm <- lm(log(y_mult) ~ t)
coef(mult_lm)
# (Intercept) t
# 2.39448488 0.09837215
To interpret this output, remember again that our linearised model is log(y) = log(a) + r*t, which is equivalent to a linear model of the form Y = β0 + β1 * X, where β0 is our intercept and β1 our slope.
Therefore, in this output (Intercept) is equivalent to log(a) of our model and t is the coefficient for the time variable, so equivalent to our r.
To meaningfully interpret the (Intercept) we can take its exponential (exp(2.39448488)), giving us ~10.96, which is quite close to our simulated value.
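A small add-on sketch (not in the original answer): confint() on the linear model gives interval estimates that can be back-transformed the same way:
ci <- confint(mult_lm)   # 95% CIs for log(a) (intercept) and r (slope on t)
exp(ci["(Intercept)", ]) # interval for a on the original scale
ci["t", ]                # interval for r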
It's worth noting what would happen if we'd fit data where the error is multiplicative
using the nls function instead:
mult_nls <- nls(y_mult ~ a*exp(r*t), start = list(a = 0.5, r = 0.2))
coef(mult_nls)
# a r
# 281.06913343 0.06955642
Now we over-estimate a and under-estimate r
(Mario Reutter
highlighted this in his comment). We can visualise the consequence of using the wrong approach to fit our model:
# get the model's coefficients
lm_coef <- coef(mult_lm)
nls_coef <- coef(mult_nls)
# make the plot
plot(t, y_mult)
lines(t, a*exp(r*t), col = "brown", lwd = 5)
lines(t, exp(lm_coef[1])*exp(lm_coef[2]*t), col = "dodgerblue", lwd = 2)
lines(t, nls_coef[1]*exp(nls_coef[2]*t), col = "orange2", lwd = 2)
legend("topleft", col = c("brown", "dodgerblue", "orange2"),
legend = c("known model", "nls fit", "lm fit"), lwd = 3)
We can see how the lm() fit to log-transformed data was substantially better than the nls() fit on the original data.
You can again plot the residuals of this model, to see that the variance is not constant across the range of the data (we can also see this in the graphs above, where the spread of the data increases for higher values of t):
plot(t, resid(mult_nls))
abline(h = 0, lty = 2)
Unfortunately, taking the logarithm and fitting a linear model is not optimal. The reason is that the errors for large y-values carry much more weight than those for small y-values when applying the exponential function to go back to the original model.
Here is one example:
f <- function(x){exp(0.3*x+5)}
squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
x <- 0:12
y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
x
y
#--------------------------------------------------------------------
M <- lm(log(y)~x)
a <- unlist(M[1])[2]
b <- unlist(M[1])[1]
print(c(a,b))
squaredError(a,b,x,y)
approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
for ( i in 0:10 )
{
eps <- -i*sign(approxPartAbl_a)*1e-5
print(c(eps,squaredError(a+eps,b,x,y)))
}
Result:
> f <- function(x){exp(0.3*x+5)}
> squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
> x <- 0:12
> y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
> x
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12
> y
[1] 151.2182 203.4020 278.3769 366.8992 503.5895 682.4353 880.1597 1186.5158 1630.9129 2238.1607 3035.8076 4094.6925 5559.3036
> #--------------------------------------------------------------------
>
> M <- lm(log(y)~x)
> a <- unlist(M[1])[2]
> b <- unlist(M[1])[1]
> print(c(a,b))
coefficients.x coefficients.(Intercept)
0.2995808 5.0135529
> squaredError(a,b,x,y)
[1] 5409.752
> approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
> for ( i in 0:10 )
+ {
+ eps <- -i*sign(approxPartAbl_a)*1e-5
+ print(c(eps,squaredError(a+eps,b,x,y)))
+ }
[1] 0.000 5409.752
[1] -0.00001 5282.91927
[1] -0.00002 5157.68422
[1] -0.00003 5034.04589
[1] -0.00004 4912.00375
[1] -0.00005 4791.55728
[1] -0.00006 4672.70592
[1] -0.00007 4555.44917
[1] -0.00008 4439.78647
[1] -0.00009 4325.71730
[1] -0.0001 4213.2411
>
Perhaps one can try some numerical method, e.g. a gradient search, to find the minimum of the squared error function directly.
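For instance, a minimal sketch using base R's optim(), started from the lm-based values a and b computed above, to minimise the same squared-error function numerically:
sse <- function(par) squaredError(par[1], par[2], x, y)
opt <- optim(c(a, b), sse) # Nelder-Mead search for the least-squares exponential fit
opt$par                    # refined estimates of the exponent a and intercept b
opt$value                  # should be below squaredError(a, b, x, y) from the lm fit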
If it really is exponential, you can try taking the logarithm of your variable and fitting a linear model to that.
