Empirical CDF vs Theoretical CDF in R

I want to check the "probability integral transform" theorem using R.
Let's suppose X is an exponential random variable with lambda = 5.
I want to check that the random variable U = F_X(X) = 1 - exp(-5*X) has a Uniform(0,1) distribution.
How would you do it?
I would start in this way:
nsample <- 1000
lambda <- 5
x <- rexp(nsample, lambda) # 1000 exponential observations
u <- 1 - exp(-lambda*x) # CDF evaluated at each observation
Then I need to find the CDF of u and compare it with the CDF of a Uniform (0,1).
For the empirical CDF of u I could use the ecdf function:
ECDF_u <- ecdf(u) #empirical CDF of U
Now I should create the theoretical CDF of Uniform (0,1) and plot it on the same graph of the ECDF in order to compare the two graphs.
Can you help with the code?

You are almost there. You don't need to compute the ECDF yourself – qqplot will take care of this. All you need is your sample (u) and data from the distribution you want to check against. The lazy (and not quite correct) approach would be to check against a random sample drawn from a uniform distribution:
qqplot(runif(nsample), u)
But of course, it is better to plot against the theoretical quantiles:
# the actual plot
qqplot( qunif(ppoints(length(u))), u )
# add a line
qqline(u, distribution=qunif, col='red', lwd=2)
Looks pretty good to me.
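If you also want the ECDF-versus-CDF overlay you originally described, here is a minimal sketch; the Uniform(0,1) CDF is just the identity on [0,1], so it plots as a straight line:
# Empirical CDF of u with the theoretical Uniform(0,1) CDF overlaid
plot(ecdf(u), main = "ECDF of u vs Uniform(0,1) CDF")
abline(a = 0, b = 1, col = "red", lwd = 2) # F(t) = t on [0,1]
# Optional formal check of uniformity
ks.test(u, "punif")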

Random number simulation in R

I have been going through some random number simulation equations and found out that the Pareto distribution doesn't have a built-in generator in R.
An rpareto function I found is written as:
rpareto <- function(n, a, l){
  rp <- l*((1 - runif(n))^(-1/a) - 1)
  rp
}
Can someone explain the intuition behind this?
It's a well known result that if X is a continuous random variable with CDF F(.), then Y = F(X) has a Uniform distribution on [0, 1].
This result can be used to draw random samples of any continuous random variable whose CDF is known: generate u, a Uniform(0, 1) random variable and then determine the value of x for which F(x) = u.
In specific cases, there may well be more efficient ways of sampling from F(.), but this will always work as a fallback.
It's likely (I haven't checked the accuracy of the code myself, but it looks about right) that the body of your function solves F(x) = u for known u in order to generate a random variable with a Pareto distribution. You can check it with a little algebra after getting the CDF from this Wikipedia page.
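For example, if we take the CDF to be the Pareto Type II (Lomax) form F(x) = 1 - (1 + x/l)^(-a) (my assumption from the Wikipedia page, with a the shape and l the scale), a little algebra recovers the function body, and a simulation agrees:
# Inverting u = 1 - (1 + x/l)^(-a) gives x = l*((1 - u)^(-1/a) - 1),
# which is exactly the expression inside rpareto
plomax <- function(q, a, l) 1 - (1 + q/l)^(-a) # assumed Lomax CDF
set.seed(1)
ks.test(rpareto(10000, a = 3, l = 2), plomax, a = 3, l = 2) # should not reject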

Simulating a draw from the distribution of $X$ (in R)

I have a pdf $f(x)=4x^3$ (on $[0,1]$) for a random variable $X$, and I need to simulate a draw from its distribution.
My solution consists of finding the cdf from the pdf (1st issue):
> pdf <- function(x){4*x^3}
> cdf <- integrate(pdf,lower=0,upper=x)
Error in integrate(pdf, lower = 0, upper = x) : object 'x' not found
Once I have the CDF $F$, I will set $X=F^{-1}(U)$ for $U \sim$ Uniform(0,1). I notice that the pdf is that of a Beta distribution with $\alpha=4$ and $\beta=1$.
Is it best to find $F^{-1}$ via an inverse beta function? Is there a quick way to find the inverse of a beta function in R?
Since you have identified your pdf as beta, just use rbeta to sample.
s1 <- rbeta(5000,4,1)
In the case where the distribution is non-standard and you cannot solve analytically, you can use rejection sampling. Let's pretend we don't know your pdf is a beta and we don't know how to integrate or invert it.
pdf <- function(x) 4*x^3 # on [0,1]
First we draw from our proposal distribution
p <- runif(50000)
Calculate the density values under our pdf
dp <- pdf(p)
And randomly accept/reject in proportion
s2 <- p[runif(50000) < dp/max(dp)]
You should find the distributions of s1 and s2 comparable, using histograms or, preferably, a qqplot.
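Since the question started from the inverse-CDF idea, note that here it is a one-liner: the CDF of 4*x^3 on [0,1] is F(x) = x^4, so F^-1(u) = u^(1/4). A quick sketch:
s3 <- runif(5000)^(1/4) # inverse-transform sample: F(x) = x^4, F^-1(u) = u^(1/4)
qqplot(s1, s3) # should line up with the rbeta sample
abline(0, 1, col = "red")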

Plot density of a distribution

Is it possible to plot in R the density function of a distribution?
For example suppose that I want to plot the density function of a Normal(0,5) or a Gamma(5,5).
f <- function(x) dnorm(x,0,5)
g <- function(x) dgamma(x,5,5)
par(mfrow=c(1,2)) # set up graphics window for 2 plots
plot(f,xlim=c(-15,15),main="N(0,5)")
plot(g,xlim=c(0,3),main="Gamma(5,5)")
In R the distribution functions follow a pattern. For instance, for the normal distribution, the PDF is dnorm(...), the CDF is pnorm(...), the inverse CDF (quantile function) is qnorm(...), and the random number generator is rnorm(...).
One thing to watch out for is that R's argument conventions do not necessarily match what you find, say, on Wikipedia: the arguments to dgamma(...) are x, shape, and rate, not x, k, and theta.
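For example, the full d/p/q/r pattern for the Gamma(5,5) plotted above:
dgamma(1, shape = 5, rate = 5) # density at x = 1
pgamma(1, shape = 5, rate = 5) # CDF: P(X <= 1)
qgamma(0.5, shape = 5, rate = 5) # quantile function: the median
rgamma(3, shape = 5, rate = 5) # three random draws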

Testing ratio of density distributions for normality

I have a normal distribution and a uniform distribution. I want to calculate a ratio: the density of the normal distribution, over the density of the uniform. Then I want to test this ratio for normality.
ht <- runif(3000, 1, 18585056) # Uniform distribution
hm <- rnorm(35, 10000000, 5000000) # Normal distribution
hmd <- density(hm, from=0, to=18585056) # Kernel density of distributions over range
htd <- density(ht, from=0, to=18585056)
ratio <- hmd$y/htd$y # Ratio of kernel density values
The distributions hm and ht above are examples of what my experimental data shows; the vectors I will actually be using are not randomly generated in R.
I know that I can get a good idea of normality from the correlation coefficient of a Q-Q plot:
qqp <- qqnorm(hm)
cor(qqp$x,qqp$y)
For hm, which is normally distributed, this gives a value close to 1.
Is there a way of determining the normality of the density vectors? e.g. hmd and ratio.
(Additional information: hm and ht are modelling homozygous and heterozygous SNPs across a genome of length 18585056)
First, this is really a statistics question; you should consider posting it on stats.stackexchange.com - you are likely to get a better answer.
Second, the short answer to your question is that "testing the ratio of two density functions for normality" is not a meaningful idea. As mentioned in the comment, the ratio of two density functions is not a density function. Among other things, a density function must integrate to 1 over (-Inf,+Inf), which this ratio will not (generally).
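You can check this numerically for the kernel estimates above; since density() returns an equally spaced grid, a Riemann sum approximates the integral:
dx <- diff(hmd$x)[1] # grid spacing of the density() output
sum(hmd$y * dx) # the density estimate itself integrates to roughly 1
sum(ratio * dx) # the ratio does not: here it is enormous, nowhere near 1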
It is meaningful, however, to test if the distribution of the ratio of two random variables is normal. If you know that the numerator is normally distributed and the denominator is uniformly distributed, then the ratio will definitely not be normally distributed, as demonstrated below in the discussion of the slash distribution.
If you do not know the distributions of the numerator and denominator, but just have random samples, you should calculate the ratio of the random variates and test that for normality. In your case (with minor edits):
set.seed(123)
ht <- runif(3000, 1, 18585056)
hm <- rnorm(3500, 10000000, 5000000)
Z <- sample(hm,1000)/sample(ht,1000) # numer. and denom. must be same length
par(mfrow=c(1,2))
# histogram of Z
hist(Z,xlim=c(-5,5), breaks=c(-Inf,seq(-5,5,0.2),Inf),freq=F, ylim=c(0,.4))
# normal Q-Q plot
qqnorm(Z,ylim=c(-5,5))
qqline(Z,xlim=c(-5,5),lty=2,col="blue")
Clearly, the ratio distribution is not normal.
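A formal test says the same thing, for example:
shapiro.test(Z) # p-value is essentially zero, so normality is rejected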
Slash Distribution
In the special case
X ~ N(0,1), with density φ(x), -Inf < x < Inf
Y ~ U(0,1), with density 1 on (0,1) and 0 elsewhere
Z = X/Y, with density [ φ(0) - φ(x) ] / x^2
That is, a random variable formed as the ratio of two other (independent) random variables, the numerator distributed as N(0,1) and the denominator distributed as U(0,1), has the slash distribution, defined above. We can show this in R code as follows
set.seed(123)
X <- rnorm(10000)
Y <- runif(10000)
Z <- X/Y
dslash <- function(x) (dnorm(0)-dnorm(x))/x^2
x <- seq(-5,5,0.02)
par(mfrow=c(1,2))
hist(Z,xlim=c(-5,5), breaks=c(-Inf,seq(-5,5,0.2),Inf),freq=F, ylim=c(0,.4))
lines(x,dslash(x),xlim=c(-5,5),col="red")
lines(x,dnorm(x),xlim=c(-5,5),col="blue",lty=2)
qqnorm(Z,ylim=c(-5,5))
qqline(Z,xlim=c(-5,5),lty=2,col="blue")
The bars represent the histogram of Z = X/Y, the red curve is the slash distribution, and the blue curve is the pdf of N[0,1] for reference. Because the red curve is "bell shaped" there is a temptation to think that Z is normally distributed, just with a larger variance. The Q-Q plot shows clearly that this is not the case. The tails of the slash distribution are much larger than would be expected from a normal distribution.

How to get Inverse CDF (kernel) in R?

Is there any function in R that will calculate the inverse CDF of a kernel density estimate (I am considering a normal kernel) for a particular alpha in (0,1)?
I have found quantile, but I am not sure how it works.
Thanks
We can integrate to get the cdf and we can use a root finding algorithm to invert the cdf. First we'll want to interpolate the output from density.
set.seed(10000)
x <- rnorm(1000, 10, 13)
pdf <- density(x)
# Interpolate the density
f <- approxfun(pdf$x, pdf$y, yleft=0, yright=0)
# Get the cdf by numeric integration
cdf <- function(x){
  integrate(f, -Inf, x)$value
}
# Use a root finding function to invert the cdf
invcdf <- function(q){
  uniroot(function(x){cdf(x) - q}, range(x))$root
}
which gives
med <- invcdf(.5)
cdf(med)
#[1] 0.5000007
This could definitely be improved upon. One issue is that the estimated cdf is not guaranteed to stay at or below 1 (if you evaluate it at values larger than max(x) you might get something like 1.00097), but this should give a decent start.
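One simple patch (my suggestion, not part of the code above) is to normalise by the total estimated mass, which pins the cdf at 1 at the upper end:
# f is zero outside the grid (yleft = 0, yright = 0), so this is the total mass
total <- integrate(f, min(pdf$x), max(pdf$x))$value
cdf <- function(x){
  integrate(f, -Inf, x)$value / total
}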
An alternative approach would be to use log-spline density estimation rather than kernel density estimation. Look at the 'logspline' package. With logspline density estimations you get CDF (plogspline) and inverse CDF (qlogspline) functions.
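A minimal sketch of that route, assuming the logspline package is installed:
library(logspline)
fit <- logspline(x) # log-spline density estimate of the same sample
qlogspline(0.5, fit) # inverse CDF at alpha = 0.5
plogspline(qlogspline(0.5, fit), fit) # round trip through the CDF, ~0.5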
