Testing ratio of density distributions for normality - r

I have a normal distribution and a uniform distribution. I want to calculate a ratio: the density of the normal distribution, over the density of the uniform. Then I want to test this ratio for normality.
ht <- runif(3000, 1, 18585056) # Uniform distribution
hm <- rnorm(35, 10000000, 5000000) # Normal distribution
hmd <- density(hm, from=0, to=18585056) # Kernel density of distributions over range
htd <- density(ht, from=0, to=18585056)
ratio <- hmd$y/htd$y # Ratio of kernel density values
The distributions hm and ht above are examples of what my experimental data shows; the vectors I will actually be using are not randomly generated in R.
I know that I can get a good idea of normality from the correlation coefficient of a Q-Q plot:
qqp <- qqnorm(hm)
cor(qqp$x,qqp$y)
For hm, which is normally distributed, this gives a value close to 1.
Is there a way of determining the normality of the density vectors, e.g. hmd and ratio?
(Additional information: hm and ht are modelling homozygous and heterozygous SNPs across a genome of length 18585056)

First, this is really a statistics question; you should consider posting it on stats.stackexchange.com - you are likely to get a better answer.
Second, the short answer to your question is that "testing the ratio of two density functions for normality" is not a meaningful idea. As mentioned in the comment, the ratio of two density functions is not a density function. Among other things, a density function must integrate to 1 over (-Inf,+Inf), which this ratio will not (generally).
It is meaningful, however, to test if the distribution of the ratio of two random variables is normal. If you know that the numerator is normally distributed and the denominator is uniformly distributed, then the ratio will definitely not be normally distributed, as demonstrated below in the discussion of the slash distribution.
If you do not know the distributions of the numerator and denominator, but just have random samples, you should calculate the ratio of the random variates and test that for normality. In your case (with minor edits):
set.seed(123)
ht <- runif(3000, 1, 18585056)
hm <- rnorm(3500, 10000000, 5000000)
Z <- sample(hm,1000)/sample(ht,1000) # numer. and denom. must be same length
par(mfrow=c(1,2))
# histogram of Z
hist(Z,xlim=c(-5,5), breaks=c(-Inf,seq(-5,5,0.2),Inf),freq=F, ylim=c(0,.4))
# normal Q-Q plot
qqnorm(Z,ylim=c(-5,5))
qqline(Z,lty=2,col="blue") # xlim is not a graphical parameter for qqline(), so it is dropped
Clearly, the ratio distribution is not normal.
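If you want a number to go with the plots, a formal test can be run on the ratio of the variates. The following is only a sketch under the same simulated setup as above; the seed and the object names sw and r are my own choices, and shapiro.test() only accepts between 3 and 5000 observations.

```r
set.seed(123)                                # arbitrary seed, for reproducibility only
ht <- runif(3000, 1, 18585056)               # uniform sample, as in the question
hm <- rnorm(3500, 10000000, 5000000)         # normal sample, as in the answer above
Z  <- sample(hm, 1000) / sample(ht, 1000)    # ratio of the random variates

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(Z)
print(sw)

# Q-Q correlation, as used in the question, without drawing the plot
qq <- qqnorm(Z, plot.it = FALSE)
r  <- cor(qq$x, qq$y)
print(r)
```

A Q-Q correlation well below 1 together with a very small p-value points the same way as the plots.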
Slash Distribution
In the special case
X ~ N(0,1), with density φ(x) for -Inf < x < Inf, and
Y ~ U(0,1), with density 1 for 0 ≤ y ≤ 1 and 0 elsewhere,
Z = X/Y has density f(z) = [ φ(0) - φ(z) ]/z^2 for z ≠ 0, with f(0) = φ(0)/2
That is, a random variable formed as the ratio of two other (independent) random variables, the numerator distributed as N(0,1) and the denominator distributed as U(0,1), has the slash distribution, defined above. We can show this in R code as follows
set.seed(123)
X <- rnorm(10000)
Y <- runif(10000)
Z <- X/Y
dslash <- function(x) ifelse(abs(x) < 1e-4, dnorm(0)/2, (dnorm(0)-dnorm(x))/x^2) # the limit at x = 0 is dnorm(0)/2
x <- seq(-5,5,0.02)
par(mfrow=c(1,2))
hist(Z,xlim=c(-5,5), breaks=c(-Inf,seq(-5,5,0.2),Inf),freq=F, ylim=c(0,.4))
lines(x,dslash(x),col="red")
lines(x,dnorm(x),col="blue",lty=2)
qqnorm(Z,ylim=c(-5,5))
qqline(Z,lty=2,col="blue")
The bars represent the histogram of Z = X/Y, the red curve is the slash distribution, and the blue curve is the pdf of N[0,1] for reference. Because the red curve is "bell shaped" there is a temptation to think that Z is normally distributed, just with a larger variance. The Q-Q plot shows clearly that this is not the case. The tails of the slash distribution are much larger than would be expected from a normal distribution.
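To put a number on "much larger tails", the slash density can be integrated numerically. This is a rough sketch: dslash_safe, total, tail_slash and tail_norm are names I made up for illustration, and the cutoff of 5 is arbitrary.

```r
# Slash density; near 0 the formula is 0/0, so use its limit dnorm(0)/2 there
dslash_safe <- function(x) {
  ifelse(abs(x) < 1e-4, dnorm(0) / 2, (dnorm(0) - dnorm(x)) / x^2)
}

# A density must integrate to 1 over the whole real line
total <- integrate(dslash_safe, -Inf, Inf)$value

# Two-sided probability of landing more than 5 units from 0
tail_slash <- 2 * integrate(dslash_safe, 5, Inf)$value   # about 0.16
tail_norm  <- 2 * pnorm(-5)                              # about 5.7e-07

print(c(total = total, slash = tail_slash, normal = tail_norm))
```

So roughly one slash variate in six falls beyond ±5, which essentially never happens for N(0,1).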

Related

how to sample from an upside down bell curve

I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (Aka an "upside down bell curve")
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the Beta distribution and map it to the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, on [0, 1], so it can easily be mapped onto any bounded interval just by scaling and shifting. It has two parameters and is symmetric if they are equal. Parameters above 1 give a bell-like shape, while parameters below 1 give the inverted bell (a U shape), which is what you're looking for. You can play with them, putting different values in place of 0.5 to see how the shape changes. Parameters equal to 1 make it uniform.
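To see the effect of the parameters without a plot, you can look at how much of the sample lands near the boundaries. A small sketch, using the interval from the question; lo, hi, edge and the 10% margin are just illustrative choices of mine.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
lo <- 10
hi <- 20

# Beta(0.5, 0.5) is U-shaped on [0, 1]; scale and shift it to [lo, hi]
q <- lo + (hi - lo) * rbeta(10000, 0.5, 0.5)

# Share of draws within 10% of either boundary
margin <- 0.1 * (hi - lo)
edge   <- mean(q < lo + margin | q > hi - margin)

# Theoretical value for Beta(0.5, 0.5) versus a uniform distribution
edge_theory  <- 2 * pbeta(0.1, 0.5, 0.5)
edge_uniform <- 0.2
print(c(sample = edge, theory = edge_theory, uniform = edge_uniform))
```

With parameters of 0.5 about 41% of the draws fall in the outer two tenths of the interval, roughly twice what a uniform distribution gives.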
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n, where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution on (0, 1), the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved: the maximum puts most of its mass close to the upper end and the minimum puts most of its mass close to the lower end.
A web search for "order statistics" will turn up some resources.
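A sketch of the order-statistics idea in R. Picking the minimum or the maximum with probability 1/2 each, so that both ends are covered, is my own assumption and not part of the answer above; k and the object names are likewise illustrative.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
n <- 10000                           # number of draws wanted
k <- 5                               # uniforms per draw (the "n" in the formulas above)

u     <- matrix(runif(n * k), nrow = n)
u_max <- apply(u, 1, max)            # cdf x^k, mass near 1
u_min <- apply(u, 1, min)            # cdf 1 - (1 - x)^k, mass near 0

# Compare empirical and theoretical cdf values at x = 0.5
print(c(mean(u_max <= 0.5), 0.5^k))
print(c(mean(u_min <= 0.5), 1 - 0.5^k))

# Assumption: a 50/50 mixture of the two gives mass at both ends
ends <- ifelse(runif(n) < 0.5, u_min, u_max)
q    <- 10 + (20 - 10) * ends        # map to the interval from the question
```

Larger k pushes more of the mass towards the boundaries.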
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed datapoints using rnorm(), then count the number of times each given rounded value appears (n), and then subtract n from the maximum value of n (max(n)) to get inverse counts.
You can then use the inverse count to make a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%
  mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1-exp(-(x0*x0))
x1pdf = 1-exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x=RNDU01()*(x1-x0)+x0
    # Choose a random y-coordinate
    y=RNDU01()*ymax
    # Return x if y falls within PDF
    if y < 1-exp(-(x*x)): return x
end
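For reference, the pseudocode above could be translated to R roughly as follows. This is a sketch rather than a tested library routine: rinvbell and its arguments are names I made up, and it proposes candidates in batches instead of one at a time.

```r
# Rejection sampler for a density proportional to 1 - exp(-x^2) on [x0, x1]
rinvbell <- function(n, x0, x1) {
  stopifnot(x0 < x1)
  f    <- function(x) 1 - exp(-x^2)
  ymax <- max(f(x0), f(x1))          # f is largest at the endpoint farthest from 0
  out  <- numeric(0)
  while (length(out) < n) {
    x   <- runif(n, x0, x1)          # random x-coordinates
    y   <- runif(n, 0, ymax)         # random y-coordinates
    out <- c(out, x[y < f(x)])       # keep x where y falls under the curve
  }
  out[seq_len(n)]
}

set.seed(1)                          # arbitrary seed, for reproducibility only
s <- rinvbell(1000, -2, 2)
print(mean(abs(s) > 1))              # share of draws in the outer half of [-2, 2]
```

Most of the accepted draws should sit in the outer half of the interval, which is the upside-down bell shape.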

Empirical CDF vs Theoretical CDF in R

I want to check the "probability integral transform" theorem using R.
Let's suppose X is an exponential random variable with lambda = 5.
I want to check that the random variable U = F_X(X) = 1 - exp(-5*X) has a uniform (0,1) distribution.
How would you do it?
I would start in this way:
nsample <- 1000
lambda <- 5
x <- rexp(nsample, lambda) # 1000 exponential observations
u <- 1- exp(-lambda*x) #CDF of x
Then I need to find the CDF of u and compare it with the CDF of a Uniform (0,1).
For the empirical CDF of u I could use the ECDF function:
ECDF_u <- ecdf(u) #empirical CDF of U
Now I should create the theoretical CDF of Uniform (0,1) and plot it on the same graph of the ECDF in order to compare the two graphs.
Can you help with the code?
You are almost there. You don't need to compute the ECDF yourself – qqplot will take care of this. All you need is your sample (u) and data from the distribution you want to check against. The lazy (and not quite correct) approach would be to check against a random sample drawn from a uniform distribution:
qqplot(runif(nsample), u)
But of course, it is better to plot against the theoretical quantiles:
# the actual plot
qqplot( qunif(ppoints(length(u))), u )
# add a line
qqline(u, distribution=qunif, col='red', lwd=2)
Looks pretty good to me.
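If you do want the ECDF picture that the question asks for, something like the following should work. It is a sketch: the seed, the grid and the names max_gap and ks are my own. The CDF of a Uniform(0,1) is just the line y = x on [0, 1], so abline() is enough to draw it.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
nsample <- 1000
lambda  <- 5
x <- rexp(nsample, lambda)
u <- pexp(x, rate = lambda)          # same as 1 - exp(-lambda * x)

# Empirical CDF of u with the theoretical Uniform(0,1) CDF on top
plot(ecdf(u), main = "ECDF of u vs Uniform(0,1) CDF")
abline(a = 0, b = 1, col = "red", lwd = 2, lty = 2)

# Largest vertical gap between the two curves on a grid
grid    <- seq(0, 1, by = 0.01)
max_gap <- max(abs(ecdf(u)(grid) - punif(grid)))
print(max_gap)

# Kolmogorov-Smirnov test against the uniform distribution
ks <- ks.test(u, "punif")
print(ks)
```

A small maximum gap and a large K-S p-value are both consistent with u being uniform.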

Fitting the Poisson distribution

I was unable to calculate the maximum likelihood estimator and BIC for the Poisson distribution. I was able to get the histogram but couldn't superimpose a kernel density estimate on it.
Can you please tell me where I went wrong?
x.pois<-rpois(Y1, 20)
hist(x.pois, breaks=100,freq=FALSE)
lines(density(Y1, bw=0.8), col="red")
library(MASS)
fitdistr(Y1,densfun="pois")
my.mle<-fitdistr(Y1, densfun="poison")
print(my.mle)
BIC(my.mle)
You need to (1) spell "poisson" correctly; (2) use x.pois (the Poisson sample), not Y1 (which should be the number of points you're trying to sample, based on your code example).
Note that kernel density estimates, and histograms, of discrete distributions don't necessarily make a lot of sense.
Y1 <- 100
set.seed(101) ## for reproducibility
x.pois<-rpois(Y1, 20)
hist(x.pois, breaks=100,freq=FALSE)
lines(density(x.pois, bw=0.8), col="red")
library(MASS)
(my.mle<-fitdistr(x.pois, densfun="poisson"))
## lambda
## 20.6700000
## ( 0.4546427)
BIC(my.mle)
## [1] 572.7861
update: your other question makes it clear that Y1 really is your sample, in which case the whole rpois()-sampling thing is just a red herring. In that case you should just leave out the first three lines, and substitute Y1 for x.pois, in the code above.
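For the Poisson distribution the estimate can also be checked by hand, since the MLE of lambda is the sample mean and BIC is -2*logLik + k*log(n) with k = 1 parameter. A sketch with base R only; the names lambda_hat, se_hat, loglik and bic are mine, and the result should agree with what fitdistr() and BIC() report.

```r
set.seed(101)                         # same seed as in the code above
x.pois <- rpois(100, 20)
n      <- length(x.pois)

lambda_hat <- mean(x.pois)            # MLE of lambda
se_hat     <- sqrt(lambda_hat / n)    # its approximate standard error

# Log-likelihood at the MLE and the BIC with one free parameter
loglik <- sum(dpois(x.pois, lambda_hat, log = TRUE))
bic    <- -2 * loglik + 1 * log(n)

print(c(lambda = lambda_hat, se = se_hat, BIC = bic))
```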

Derivative of Kernel Density

I am using density {stats} to construct a "gaussian" kernel density estimate of a vector of values. If I use the following example dataset:
x <- rlogis(1475, location=0, scale=1) # x is a vector of values - taken from a rlogis just for the purpose of explanation
d<- density(x=x, kernel="gaussian")
Is there some way to get the first derivative of this density d at each of the n=1475 points?
Edit #2:
Following up on Greg Snow's excellent suggestion to use the analytical expression for the derivative of a Gaussian, and our conversation following his post, this will get you the exact slope at each of those points:
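s <- d$bw
## The derivative of dnorm(X - xi, sd = s) with respect to X is dnorm(xi - X, sd = s) * (xi - X) / s^2
slope2 <- sapply(x, function(X) {mean(dnorm(x - X, mean = 0, sd = s) * (x - X)) / s^2})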
## And then, to compare to the method below, plot the results against one another
plot(slope2 ~ slope)
Edit:
OK, I just reread your question, and see that you wanted slopes at each of the points in the input vector x. Here's one way you might approximate that:
slope <- (diff(d$y)/diff(d$x))[findInterval(x, d$x)]
A possible further refinement would be to find the location of the point within its interval, and then calculate its slope as the weighted average of the slope of the present interval and the interval to its right or left.
I'd approach this by averaging the slopes of the segments just to the right and left of each point. (A bit of special care needs to be taken for the first and last points, which have no segment to their left and right, respectively.)
dy <- diff(d$y)
dx <- diff(d$x)[1] ## Works b/c density() returns points at equal x-intervals
((c(dy, tail(dy, 1)) + c(head(dy, 1), dy))/2)/dx
The curve of a density estimator is just the sum of all the kernels, in your case a gaussian (divided by the number of points). The derivative of a sum is the sum of the derivatives and the derivative of a constant times a function is that constant times the derivative. So the derivative of the density estimate at a given point will just be the average of the slopes for the 1475 different gaussian curves at that given point. Each gaussian curve will have a mean corresponding to each of the data points and a standard deviation based on the bandwidth. So if you can calculate the slope for a gaussian, then finding the slope for the density estimate is just a mean of the 1475 slopes.
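The argument above can be checked numerically by comparing the averaged Gaussian slopes with a finite-difference derivative of the density estimate. This is a sketch; kde, kde_deriv, t_chk and h are names and values I picked for illustration.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
x <- rlogis(1475, location = 0, scale = 1)
d <- density(x = x, kernel = "gaussian")
s <- d$bw                             # bandwidth = sd of each Gaussian kernel

# Density estimate at t: average of the Gaussian kernels centred at the data
kde <- function(t) sapply(t, function(t0) mean(dnorm(t0 - x, sd = s)))

# Its derivative: average of the kernel slopes (note the division by s^2)
kde_deriv <- function(t) {
  sapply(t, function(t0) mean(dnorm(t0 - x, sd = s) * (x - t0)) / s^2)
}

# Central finite differences at a few arbitrary points
t_chk <- c(-2, 0, 1.5)
h     <- 1e-5
fd    <- (kde(t_chk + h) - kde(t_chk - h)) / (2 * h)

print(rbind(analytic = kde_deriv(t_chk), finite_diff = fd))
```

To get the slope at every data point, call kde_deriv(x).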

Plotting Probability Density / Mass Function of Dataset in R

I have a dataset and I want to analyse these data with a probability density function or a probability mass function in R. I used a density function but it didn't give me the probability.
My data are like this:
"step","Time","energy"
1, 22469, 392.96E-03
2, 22547, 394.82E-03
3, 22828, 400.72E-03
4, 21765, 383.51E-03
5, 21516, 379.85E-03
6, 21453, 379.89E-03
7, 22156, 387.47E-03
8, 21844, 384.09E-03
9, 21250, 376.14E-03
10, 21703, 380.83E-03
I want to get the PDF/PMF for the energy vector; the data we take into account are discrete in nature, so I don't assume any particular type of distribution for the data.
Your data looks far from discrete to me. Expecting a probability when working with continuous data is plain wrong: for a continuous variable any single value has probability zero, and only intervals have a probability. density() gives you an empirical density function, which approximates the true density function. To show that it is a proper density, we calculate the area under the curve:
energy <- rnorm(100)
dens <- density(energy)
sum(dens$y)*diff(dens$x[1:2])
[1] 1.000952
Allowing for some rounding error, the area under the curve is one, and hence the outcome of density() fulfills the requirements of a PDF.
Use the probability=TRUE option of hist or the function density() (or both)
eg :
hist(energy,probability=TRUE)
lines(density(energy),col="red")
If you really need a probability for a discrete variable, you use:
x <- sample(letters[1:4],1000,replace=TRUE)
prop.table(table(x))
x
    a     b     c     d
0.244 0.262 0.275 0.219
Edit: an illustration of why the naive count(x)/sum(count(x)) is not a solution. The fact that the bin values sum to one does not mean that the area under the curve does; for that, you have to multiply by the width of the bins. Take the normal distribution, for which we can calculate the PDF using dnorm(). The following code draws a sample from a normal distribution, calculates the density, and compares it with the naive solution:
x <- sort(rnorm(100,0,0.5))
h <- hist(x,plot=FALSE)
dens1 <- h$counts/sum(h$counts)
dens2 <- dnorm(x,0,0.5)
hist(x,probability=TRUE,breaks="fd",ylim=c(0,1))
lines(h$mids,dens1,col="red")
lines(x,dens2,col="darkgreen")
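The same point can be made numerically: dividing the naive proportions by the bin widths turns them into a density whose area is one. A short sketch; the names naive, widths and dens_fix are mine.

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
x <- rnorm(100, 0, 0.5)
h <- hist(x, plot = FALSE)

naive  <- h$counts / sum(h$counts)    # proportions per bin: these sum to 1
widths <- diff(h$breaks)              # bin widths

print(sum(naive))                     # sum of the bin values
print(sum(naive * widths))            # area under the naive "density"

# Dividing by the bin width gives a proper density estimate
dens_fix <- naive / widths
print(all.equal(dens_fix, h$density)) # matches what hist() computes
print(sum(dens_fix * widths))         # area under the corrected density
```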
The cumulative distribution function
In case @Iterator was right, it's rather easy to construct the cumulative distribution function from the density. The CDF is the integral of the PDF. For discrete values, that is simply the cumulative sum of the probabilities. For continuous values, we can use the fact that the points at which density() evaluates the estimate are equally spaced, and calculate:
cdf <- cumsum(dens$y * diff(dens$x[1:2]))
cdf <- cdf / max(cdf) # to correct for the rounding errors
plot(dens$x,cdf,type="l")
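As a sanity check, the CDF built from the density can be compared with the empirical CDF of the same data. A sketch only; cdf_fun and gap are names I introduced, and approxfun() is used to turn the grid values into a function.

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
energy <- rnorm(100)
dens   <- density(energy)

# Cumulative sum of density * grid spacing, rescaled to end at exactly 1
cdf <- cumsum(dens$y * diff(dens$x[1:2]))
cdf <- cdf / max(cdf)

# Interpolate so the CDF can be evaluated at arbitrary points
cdf_fun <- approxfun(dens$x, cdf, yleft = 0, yright = 1)

# Largest difference from the empirical CDF at the observed values
gap <- max(abs(cdf_fun(energy) - ecdf(energy)(energy)))
print(gap)
```

The two will not agree exactly, because the kernel estimate is a smoothed version of the step function that ecdf() returns.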
