How to find the inverse for the inverse sampling method in R

Generally, for the inverse sampling method, we have a density and we would like to sample from it. The first step is to find the cumulative distribution function for the density, then to find its inverse, and finally to evaluate that inverse at values sampled from the uniform distribution.
For example, I have the density y = (3/2)/(1+x)^2 on [0, 2], so the cdf equals (3x)/(2(x+1)) and the inverse of the cdf is x = 2u/(3 - 2u).
To do this in R, I wrote
f <- function(x) {
  y <- (3/2) / (1 + x)^2
  return(y)
}
cdf <- function(x) {
  integrate(f, -Inf, x)$value
}
invcdf <- function(q) {
  uniroot(function(x) { cdf(x) - q }, range(x))$root
}
U <- runif(1e6)
X <- invcdf(U)
I have two problems! First: the code returns the function, not the samples.
Second: is there another, simpler way to do this? For example, is there a simpler way to find the cdf and its inverse?
I should add that I am not looking for efficient code. I am just interested in code that could be written by a beginner.

You could try a numerical approach to inverse sampling. As per your request, this is more about transparency of method than efficiency.
This function will numerically integrate a given function over the given range (though it will trim infinite bounds):
cdf <- function(f, lower_bound, upper_bound)
{
  if (lower_bound < -10000) lower_bound <- -10000  # Trim large negative bounds
  if (upper_bound > 10000) upper_bound <- 10000    # Trim large positive bounds
  x <- seq(lower_bound, upper_bound, length.out = 100001)  # Finely divide the x axis
  delta <- mean(diff(x))                           # Get delta x (i.e. dx)
  mid_x <- (x[-1] + x[-length(x)]) / 2             # Get the midpoint of each slice
  result <- cumsum(delta * f(mid_x))               # Sum f(x) dx
  result <- result / max(result)                   # Normalize
  list(x = mid_x, cdf = result)                    # Return both x and F(x) in a list
}
And to get the inverse, we find the value in the cdf closest to a random number drawn from the uniform distribution between 0 and 1, then read off the value of x corresponding to that value of the cdf. We want to be able to do this for n samples at a time, so we use sapply:
inv_sample <- function(f, n = 1, lower_bound = -1000, upper_bound = 1000)
{
  CDF <- cdf(f, lower_bound, upper_bound)
  samples <- runif(n)
  sapply(samples, function(s) CDF$x[which.min(abs(s - CDF$cdf))])
}
We can test it by drawing histograms of the results. We'll start with the normal distribution's density function (dnorm in R), drawing 1000 samples and plotting their distribution:
hist(inv_sample(dnorm, 1000))
And we can do the same for the exponential distribution, this time setting the limits of integration between 0 and 100:
hist(inv_sample(dexp, 1000, 0, 100))
And finally we can do the same with your own example:
f <- function(x) 3/2/(1 + x)^2
hist(inv_sample(f, 1000, 0, 10))
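As an aside on the question's two problems: the original invcdf failed because range(x) referred to an undefined x (uniroot needs an explicit bracketing interval), and because it was applied to a whole vector at once. And for this particular density there is a simpler route, since the inverse CDF has a closed form. A minimal sketch of both, assuming the support is [0, 2] (where the density integrates to 1):
## Fix 1: give uniroot a bracketing interval on the support
## and vectorize over the uniform draws with sapply
f <- function(x) (3/2) / (1 + x)^2
cdf_num <- function(x) integrate(f, 0, x)$value  # renamed to avoid the cdf() above
invcdf <- function(q) uniroot(function(x) cdf_num(x) - q, c(0, 2))$root
X <- sapply(runif(1000), invcdf)
hist(X, freq = FALSE)

## Fix 2: the closed-form inverse, F^-1(u) = 2u / (3 - 2u)
U <- runif(1e6)
Y <- 2 * U / (3 - 2 * U)
hist(Y, breaks = 100, freq = FALSE)
curve(3/2 / (1 + x)^2, add = TRUE, col = "red")  # overlay the target density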

Related

Trying to replicate the rgeom() function

As an exercise, I'm trying to write a function which replicates the rgeom() function. I want it to have the same arguments and return values. I've started out by using runif to generate a vector with x elements, but I'm not sure how to apply the probability distribution:
rgeometric <- function(x, prob) {
  outcomes <- runif(x)
  P <- (1 - prob)^length(x) * prob
  return(P)
}
Would it be something like the following? How can I check that the distribution is geometric?
set.seed(0)
rgeometric <- function(x, prob) {
  outcomes <- runif(x)
  P <- (1 - prob)^length(x) * prob
  for (i in x) {
    x[i] <- x[i] * P
  }
  return(outcomes)
}
rgeometric(5, 0.4)
We can accomplish this task using Inverse Transform Sampling.
First, let's clear up some of your notation.
In the rgeom() function, we'll want that first argument to be n, an integer vector of length one giving the number of samples to generate:
rgeometric <- function(n, prob) {
  u <- runif(n)
  ## do stuff
}
So how does inverse transform sampling work?
First we generate a vector u of standard uniform deviates, as shown above.
Then, for each element ui of u, we find the value of the inverse of the cumulative distribution function (CDF) at ui.
For the geometric distribution, the CDF is 1 - (1 - prob)^(x+1); inverting it gives ceiling(log(1-u) / log(1-prob)) - 1.
So, we can complete the function like so:
rgeometric <- function(n, prob) {
  u <- runif(n)
  return(ceiling(log(1 - u) / log(1 - prob)) - 1)
}
Your last question is: how can we test whether the resulting samples are geometrically distributed?
I don't know of a formal test that will settle it, but we can see that it appears to work when we compare the density of 1 million random draws from this custom function to the density of 1 million random draws from base R's rgeom() function:
n <- 1e6
p <- 0.25
set.seed(0)
x <- rgeometric(n, p)
y <- rgeom(n, p)
png("so-answer.png", width = 960)
opar <- par(mfrow = c(1, 2))
plot(density(x), main = "Draws from custom function")
plot(density(y), main = "Draws from base R function")
par(opar)
dev.off()
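As an additional numeric check (a sketch), the empirical mean should be near (1 - p)/p = 3, and the empirical mass function should track dgeom():
mean(x); (1 - p) / p     # empirical vs. theoretical mean
table(x)[1:5] / n        # empirical mass at 0, 1, 2, 3, 4
dgeom(0:4, p)            # theoretical mass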
Note that for the definition of the geometric distribution implemented by R, the random variable is the number of failures before the first success. Therefore you could do:
my_rgeom <- function(n, p) {
  fun <- function(p) {
    n <- 0
    stopifnot(p > 0)
    while (runif(1) > p) n <- n + 1
    n
  }
  replicate(n, fun(p))
}
Now test the function:
n <- 100000
p <- 0.25
X <- rgeom(n, p)
Y <- my_rgeom(n, p)
You could run a ks.test on X and Y, though that test is intended for continuous variables. A chisq.test is better suited to determining whether the two discrete samples follow the same distribution.
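A rough sketch of such a comparison, using the X and Y simulated above: bin both samples on a common support and run a chi-squared test of homogeneity on the two rows of counts.
lev <- 0:max(X, Y)
counts <- rbind(table(factor(X, levels = lev)),
                table(factor(Y, levels = lev)))
chisq.test(counts)  # a large p-value is consistent with identical distributions;
                    # sparse tail bins may trigger an accuracy warning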
Lastly, we could use graphical methods, e.g. superimposed bar plots:
barplot(table(X), col = rgb(0.5, 1, 0.5, 0.4))
barplot(table(Y), add = TRUE, col = rgb(1, 0.5, 0, 0.3))
From the image above you can see that the two are nearly identical.

How to run monte carlo simulation from a custom distribution in R

I would like to draw 1000 samples from a custom distribution in R.
I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code, which produced a histogram shape I did not expect:
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for, as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, holds values that increase with the index up to a threshold around 400, level off, and then decrease. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values stored in points is something different: a histogram shows how often values fall within a certain range, and you'll notice that the x axis of your histogram resembles the y axis of the plot(points ~ x) plot. The actual distribution of values in points is easy enough to see, and it is similar to what you get when sampling 1000 values at random, without replacement, from an object with 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top, which you may not be expecting if you want the histogram to look like the plot of the values against the index (or some increasing x). It means there are more values in points around 2 than there are around 1. See how the curve of plot(points ~ x) flattens when the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values, whether at the low end or the high end of that curve, are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here.
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20)
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for (i in 2:N) {
    pr <- out[i - 1] + rnorm(1, ...)                # random-walk proposal
    r <- fun(pr) / fun(out[i - 1])                  # acceptance ratio
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])  # accept, or keep current value
  }
  out
}
It starts from the point init and returns N draws. The approach can be improved in many ways, but here I'm simply going to start from init = 5, discard a burn-in period of 20,000 draws, and keep every second draw to reduce autocorrelation:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
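As a quick sanity check (a sketch), you can normalize fun numerically and overlay it on the density of the draws; the two curves should roughly agree:
Z <- integrate(fun, 0, Inf)$value  # numerical normalizing constant of the target
plot(density(d))
curve(fun(x) / Z, add = TRUE, col = "red")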
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with(environment(ecd.points), approxfun(y, x))  # swap x and y to invert
samp.inv.ecd <- function(n = 100) invecdfpts(runif(n))
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original"); dev.off()
Here's another way to do it that draws from R: Generate data from a probability density distribution and How to create a distribution function in R?:
x <- seq(1, 2e1, 0.01)
points <- 20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1)
f <- function(x) 20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1)
C <- integrate(f, -Inf, Inf)
C$value
# [1] 11.50361

# normalize by C$value
f <- function(x)
  20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1) / 11.50361

pdf <- data.frame(x = x, y = f(x))  # tabulate the normalized density on the grid
random.points <- approx(cumsum(pdf$y) / sum(pdf$y), pdf$x, runif(10000))$y
hist(random.points, 1000)
hist(random.points * 40, 1000) will recover scaling similar to your original function.

How to find quantiles of an empirical cumulative distribution function (ECDF)

I am using the ecdf() function to compute the empirical cumulative distribution function (ECDF) from some random samples:
set.seed(0)
X = rnorm(100)
P = ecdf(X)
Now P gives the ECDF, and we can plot it:
plot(P)
abline(h = 0.6, lty = 3)
My question is: how can I find the sample value x, such that P(x) = 0.6, i.e., the 0.6-quantile of ECDF, or the x-coordinate of the intersection point of ECDF and h = 0.6?
In the following, I will not use ecdf(), as it is easy to obtain the empirical cumulative distribution function (ECDF) ourselves.
First, we sort samples X in ascending order:
X <- sort(X)
The ECDF takes the following function values at those samples:
e_cdf <- 1:length(X) / length(X)
We could then sketch ECDF by:
plot(X, e_cdf, type = "s")
abline(h = 0.6, lty = 3)
Now we are looking for the first value of X such that P(X) >= 0.6. This is just:
X[which(e_cdf >= 0.6)[1]]
# [1] 0.2290196
Since our data are sampled from the standard normal distribution, the theoretical quantile is
qnorm(0.6)
# [1] 0.2533471
So our result is quite close.
Extension
Since the inverse of a CDF is the quantile function (for example, the inverse of pnorm() is qnorm()), one may guess that the inverse of the ECDF is the sample quantile function, i.e., that the inverse of ecdf() is quantile(). This is not true!
The ECDF is a staircase / step function, and it does not have an inverse. If we reflect the ECDF about the line y = x, the resulting curve is not a mathematical function. So the sample quantile function has nothing to do with the ECDF.
For n sorted samples, the sample quantile function is in fact a linear interpolation of (x, y), with:
x-values being seq(0, 1, length = n);
y-values being sorted samples.
We can define our own version of sample quantile function by:
my_quantile <- function(x, prob) {
if (is.unsorted(x)) x <- sort(x)
n <- length(x)
approx(seq(0, 1, length = n), x, prob)$y
}
Let's have a test:
my_quantile(X, 0.6)
# [1] 0.2343171
quantile(X, prob = 0.6, names = FALSE)
# [1] 0.2343171
Note that the result is different from what we got from X[which(e_cdf >= 0.6)[1]].
It is for this reason that I did not use quantile() in my answer.
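That said, if you do want a built-in that matches the step-function inverse used above, quantile() with type = 1 computes exactly the inverse of the ECDF:
quantile(X, probs = 0.6, type = 1, names = FALSE)
# should equal X[which(e_cdf >= 0.6)[1]], i.e. 0.2290196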

Draw random numbers from restricted Pareto distribution

I am a newcomer to R and need advice on how to draw random numbers from a limited area of a Pareto distribution with parameters s and beta. (System: Windows 7, R 2.15.2.)
(1) I have data in a vector data$t; I'll call each single data point data$tx.
For these data the parameters s & beta of a Pareto distribution are estimated following https://stats.stackexchange.com/questions/27426/how-do-i-fit-a-set-of-data-to-a-pareto-distribution-in-r
pareto.MLE <- function(X)
{
  n <- length(X)
  m <- min(X)
  a <- n / sum(log(X) - log(m))
  return(c(m, a))
}
(2) Now I need to draw as many random numbers (RndNew) from this Pareto distribution (s, beta; see (1)) as there are observations (= data points: data$tx). For the draw, the region from which random numbers are drawn must be limited to where RndNewx >= data$tx; in other words, RndNewx must never be smaller than the corresponding data$tx.
Question: how do I tell R to restrict the area of a Pareto distribution from which to draw a random number so that RndNewx >= data$tx?
Thanks a million for any help!
The standard approach to sampling from a truncated distribution has three steps. Here's an example with the normal distribution so you can get the idea.
n <- 1000
lower_bound <- -1
upper_bound <- 1
Apply the CDF to your lower and upper bounds to find the quantiles of the edges of your distribution.
(quantiles <- pnorm(c(lower_bound, upper_bound)))
# [1] 0.1586553 0.8413447
Sample from a uniform distribution between those quantiles.
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
Apply the inverse CDF.
truncated_normal_random_numbers <- qnorm(uniform_random_numbers)
The CDF for the Pareto distribution is
ppareto <- function(x, scale, shape)
{
  ifelse(x > scale, 1 - (scale / x) ^ shape, 0)
}
And the inverse is
qpareto <- function(y, scale, shape)
{
  ifelse(
    y >= 0 & y <= 1,
    scale * (1 - y) ^ (-1 / shape),
    NaN
  )
}
We can rework the above example to use these Pareto functions.
n <- 1000
scale <- 1
shape <- 1
lower_bound <- 2
upper_bound <- 10
(quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape))
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
truncated_pareto_random_numbers <- qpareto(uniform_random_numbers, scale, shape)
To make it easier to reuse this code, we can wrap it into a function. The lower and upper bounds have default values that match the range of the distribution, so if you don't pass values in, then you'll get a non-truncated Pareto distribution.
rpareto <- function(n, scale, shape, lower_bound = scale, upper_bound = Inf)
{
  quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape)
  uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
  qpareto(uniform_random_numbers, scale, shape)
}
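Finally, since the question asks for one draw per observation with RndNewx >= data$tx (a different lower bound for each draw), note that ppareto(), runif(), and qpareto() are all vectorized. A minimal sketch, assuming fit <- pareto.MLE(data$t) returns c(scale, shape) as in the question's code:
fit <- pareto.MLE(data$t)                # c(scale, shape), from the question's code
q_lo <- ppareto(data$t, fit[1], fit[2])  # quantile of each observation's lower bound
u <- runif(length(data$t), q_lo, 1)      # one uniform draw above each bound
RndNew <- qpareto(u, fit[1], fit[2])     # each RndNew >= the matching data$t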

Generating samples from a two-Gaussian mixture in r (code given in MATLAB)

I'm trying to create in R the equivalent of the following MATLAB function, which generates n samples from a mixture of N(m1, (s1)^2) and N(m2, (s2)^2), with a fraction alpha coming from the first Gaussian.
I have a start, but the results are notably different between MATLAB and R (i.e., the MATLAB results give occasional values of +-8, but the R version never even gives a value of +-5). Please help me sort out what is wrong here. Thanks :-)
For Example:
Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.
MATLAB
function
function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);
implementation
P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])
resulting plot
resulting hist
R
yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))
resulting plot
resulting hist
As always, THANK YOU!
SOLUTION
gaussmix <- function(nsim, mean_1, mean_2, std_1, std_2, alpha) {
  U <- runif(nsim)
  I <- as.numeric(U < alpha)
  y <- I * rnorm(nsim, mean = mean_1, sd = std_1) +
    (1 - I) * rnorm(nsim, mean = mean_2, sd = std_2)
  return(y)
}
z1 <- gaussmix(1000,0,0,1,6,0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000,0,3,1,1,0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))
par(mfrow=c(2,3))
hist(z1_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 95% of N(0,1) and 5% of N(0,36)",
     col = "blue", xlab = " ")
hist(z2_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 80% of N(0,1) and 20% of N(3,1)",
     col = "blue", xlab = " ")
hist(z3_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of samples of LN(0,1)", col = "blue", xlab = " ")
##
plot(z1_standardized, type = 'l',
     main = "1000 samples from a mixture of N(0,1) and N(0,36)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
plot(z2_standardized, type = 'l',
     main = "1000 samples from a mixture of N(0,1) and N(3,1)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
plot(z3_standardized, type = 'l',
     main = "1000 samples from LN(0,1)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
There are two problems, I think: (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37; (2) by setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of a Gaussian with sd 1, rather than a Gaussian with sd 1 contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36) basically reduces it to a standard Gaussian with a slight bump near the origin ...
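Concretely, a minimal repair of the question's two lines (a sketch) would be:
yn <- rbinom(1000, 1, 0.95)                 # 1 = draw from the first component
s <- rnorm(1000, 0, ifelse(yn == 1, 1, 6))  # sd 1 with probability 0.95, else sd 6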
(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)
A more compact (and perhaps more idiomatic) version of your MATLAB gaussmix function (I think runif(n) < alpha is slightly more efficient than rbinom(n, size = 1, prob = alpha)):
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:
require(mclust)
simdata <- sim(modelName = "V",
               parameters = list(pro = c(0.95, 0.05),
                                 mean = c(0, 0),
                                 variance = list(modelName = "V",
                                                 d = 1,
                                                 G = 2,
                                                 sigmasq = c(1, 36))),
               n = 1000)
plot(scale(simdata[,2]), type = "h")
I recently wrote density and sampling functions for a mixture of normal distributions:
dmultiNorm <- function(x, means, sds, weights)
{
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  N <- length(x)
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  weights <- weights / sum(weights)
  dens <- numeric(N)
  for (i in 1:n)
  {
    dens <- dens + weights[i] * dnorm(x, means[i], sds[i])
  }
  return(dens)
}
rmultiNorm <- function(N, means, sds, weights)
{
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  Res <- numeric(N)
  for (i in 1:N)
  {
    s <- sample(1:n, 1, prob = weights)   # pick a component
    Res[i] <- rnorm(1, means[s], sds[s])  # draw from it
  }
  return(Res)
}
Here means is a vector of means, sds is a vector of standard deviations, and weights is a vector of probabilities proportional to how often each component is sampled. Is this useful to you?
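For instance, the question's example task could be done with these functions like so (a sketch):
z <- rmultiNorm(1000, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05))
hist(z, breaks = 50, freq = FALSE)
curve(dmultiNorm(x, c(0, 0), c(1, 6), c(0.95, 0.05)), add = TRUE, col = "red")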
Here is code to do this task:
"For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."
plot(multG <- c(rnorm(950), rnorm(50, 0, 36))[sample(1000)], type = "h")
scmulG <- scale(multG)
summary(scmulG)
#-----------
V1
Min. :-9.01845
1st Qu.:-0.06544
Median : 0.03841
Mean : 0.00000
3rd Qu.: 0.13940
Max. :12.33107
