How to find quantiles of an empirical cumulative distribution function (ECDF) in R

I am using the ecdf() function to compute the empirical cumulative distribution function (ECDF) of some random samples:
set.seed(0)
X = rnorm(100)
P = ecdf(X)
Now P gives the ECDF, and we can plot it:
plot(P)
abline(h = 0.6, lty = 3)
My question is: how can I find the sample value x such that P(x) = 0.6, i.e., the 0.6-quantile of the ECDF, or the x-coordinate of the intersection of the ECDF with the horizontal line y = 0.6?

In the following, I will not use ecdf(), since it is easy to compute the ECDF ourselves.
First, sort the samples X in ascending order:
X <- sort(X)
At these sample points the ECDF takes the values:
e_cdf <- 1:length(X) / length(X)
We can then sketch the ECDF with:
plot(X, e_cdf, type = "s")
abline(h = 0.6, lty = 3)
Now we are looking for the first value of X such that P(X) >= 0.6. This is just:
X[which(e_cdf >= 0.6)[1]]
# [1] 0.2290196
Since our data are sampled from the standard normal distribution, the theoretical quantile is
qnorm(0.6)
# [1] 0.2533471
So our result is quite close.
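For completeness, the same value can be read directly off the ecdf object P from the question, since P(X) evaluates the ECDF at the sample points. A small sketch, assuming the P and (sorted) X defined above:
min(X[P(X) >= 0.6])
# [1] 0.2290196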
Extension
Since the inverse of a CDF is the quantile function (for example, the inverse of pnorm() is qnorm()), one may guess that the inverse of the ECDF is the sample quantile function, i.e., that the inverse of ecdf() is quantile(). This is not true!
The ECDF is a staircase / step function, and it does not have an inverse. If we reflect the ECDF about the line y = x, the resulting curve is not a mathematical function. So the sample quantile function has nothing to do with the ECDF.
For n sorted samples, the sample quantile function is in fact the linear interpolation function through (x, y), with:
x-values being seq(0, 1, length = n);
y-values being sorted samples.
We can define our own version of sample quantile function by:
my_quantile <- function(x, prob) {
  if (is.unsorted(x)) x <- sort(x)
  n <- length(x)
  approx(seq(0, 1, length = n), x, prob)$y
}
Let's have a test:
my_quantile(X, 0.6)
# [1] 0.2343171
quantile(X, probs = 0.6, names = FALSE)
# [1] 0.2343171
Note that the result differs from what we get from X[which(e_cdf >= 0.6)[1]].
It is for this reason that I refuse to use quantile() in my answer.
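As a side note, quantile() does offer the inverse-ECDF definition if you ask for it: type = 1 is documented in ?quantile as the inverse of the empirical distribution function, so it reproduces the step-function result above rather than the interpolated one:
quantile(X, probs = 0.6, type = 1, names = FALSE)
# [1] 0.2290196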

Related

Integrating the square of a probability density?

Suppose I have
set.seed(2020) # make the results reproducible
a <- rnorm(100, 0, 1)
My probability density is estimated with a (Gaussian) kernel density estimator in R, using the built-in function density. The question is how to integrate the square of the estimated density. It does not matter between which values; let us suppose between -Inf and +Inf. I have tried the following:
f <- approxfun(density(a)$y, density(a)$x)
integrate (f*f, min(density(a)$x), max(density(a)$x))
There are a couple of problems here. First, you have x and y the wrong way round in approxfun. Secondly, you can't multiply function names together. You need to define a new function that returns the square of your original function:
set.seed(2020)
a <- rnorm(100, 0, 1)
f <- approxfun(density(a)$x, density(a)$y)
f2 <- function(v) ifelse(is.na(f(v)), 0, f(v)^2)
integrate(f2, -Inf, Inf)
#> 0.2591153 with absolute error < 0.00011
We can also plot the original density function and the squared density function:
curve(f, -3, 3)
curve(f2, -3, 3, add = TRUE, col = "red")
I think you should write the objective function as function(x) f(x)**2, rather than f*f, e.g.,
integrate(function(x) f(x)**2, min(density(a)$x), max(density(a)$x))
# 0.2331793 with absolute error < 6.6e-06
Here is a way using the caTools package and its function trapz, which computes the integral via the trapezoidal rule, given a vector x and its corresponding image y.
I also include a function trapzf, based on the original, to compute the integral with the function returned by approxfun:
library(caTools)
trapzf <- function(x, FUN) trapz(x, FUN(x))
set.seed(2020) # make the results reproducible
a <- rnorm(100, 0, 1)
d <- density(a)
f <- approxfun(d$x, d$y)
int1 <- trapz(d$x, d$y^2)
int2 <- trapzf(d$x, function(x) f(x)^2)
int1
#[1] 0.2591226
identical(int1, int2)
#[1] TRUE
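If you prefer to avoid the caTools dependency, the trapezoidal rule is a one-liner in base R. A minimal sketch (my_trapz is a hypothetical helper, not a library function):
my_trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
my_trapz(d$x, d$y^2) # should match int1 above, up to floating point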

How to find the inverse for the inverse sampling method in R

Generally, for the inverse sampling method, we have a density we would like to sample from. The first step is to find the cumulative distribution function for the density, then to find its inverse, and finally to evaluate that inverse at values randomly sampled from the uniform distribution.
For example, I have the function y = (3/2)/(1+x)^2, so the cdf equals 3x/(2(x+1)) and the inverse of the cdf is ((3/2)*u)/(1-(3/2)*u).
To do this in R, I wrote
f <- function(x) {
  y <- (3/2)/(1+x)^2
  return(y)
}
cdf <- function(x) {
  integrate(f, -Inf, x)$value
}
invcdf <- function(q) {
  uniroot(function(x) {cdf(x) - q}, range(x))$root
}
U <- runif(1e6)
X <- invcdf(U)
I have two problems. First, the code returns the function and not the samples. Second, is there a simpler way to do this, for example to find the cdf and its inverse more simply?
I would like to add that I am not looking for efficiency; I am just interested in code that could be written by a beginner.
You could try a numerical approach to inverse sampling. As per your request, this is more about transparency of method than efficiency.
This function numerically integrates a given function over the given range (though it trims infinite values):
cdf <- function(f, lower_bound, upper_bound)
{
  if (lower_bound < -10000) lower_bound <- -10000 # trim large negative bounds
  if (upper_bound > 10000) upper_bound <- 10000   # trim large positive bounds
  x <- seq(lower_bound, upper_bound, length.out = 100001) # finely divide the x axis
  delta <- mean(diff(x))                # get delta x (i.e. dx)
  mid_x <- (x[-1] + x[-length(x)]) / 2  # get the mid point of each slice
  result <- cumsum(delta * f(mid_x))    # sum f(x) dx
  result <- result / max(result)        # normalize
  list(x = mid_x, cdf = result)         # return both x and F(x) in a list
}
And to get the inverse, we find the value in the cdf closest to a random number drawn from the uniform distribution between 0 and 1, and then see which value of x corresponds to that value of the cdf. We want to do this for n samples at a time, so we use sapply:
inverse_sample <- function(f, n = 1, lower_bound = -1000, upper_bound = 1000)
{
  CDF <- cdf(f, lower_bound, upper_bound)
  samples <- runif(n)
  sapply(samples, function(s) CDF$x[which.min(abs(s - CDF$cdf))])
}
We can test it by drawing histograms of the results. We'll start with the normal distribution's density function (dnorm in R), drawing 1000 samples and plotting their distribution:
hist(inverse_sample(dnorm, 1000))
And we can do the same for the exponential distribution, this time setting the limits of integration between 0 and 100:
hist(inverse_sample(dexp, 1000, 0, 100))
And finally we can do the same with your own example:
f <- function(x) 3/2/(1 + x)^2
hist(inverse_sample(f, 1000, 0, 10))
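As for the simpler-way part of the question: for this particular density the cdf can be inverted by hand. A sketch, under the assumption that the density (3/2)/(1+x)^2 is supported on (0, 2), where it integrates to exactly 1; solving u = 3x/(2(x+1)) for x gives x = 2u/(3 - 2u):
Finv <- function(u) 2 * u / (3 - 2 * u) # hand-derived inverse cdf, assuming support (0, 2)
X <- Finv(runif(1e6))                   # inverse-transform samples
hist(X, breaks = 100, freq = FALSE)
curve(3/2 / (1 + x)^2, add = TRUE, col = "red") # overlay the target density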

Find x from a given y in a spline function in R [duplicate]

I am interested in a general root finding problem for an interpolation function.
Suppose I have the following (x, y) data:
set.seed(0)
x <- 1:10 + runif(10, -0.1, 0.1)
y <- rnorm(10, 3, 1)
as well as a linear interpolation and a cubic spline interpolation:
f1 <- approxfun(x, y)
f3 <- splinefun(x, y, method = "fmm")
How can I find x-values where these interpolation functions cross a horizontal line y = y0? The following is a graphical illustration with y0 = 2.85.
par(mfrow = c(1, 2))
curve(f1, from = x[1], to = x[10]); abline(h = 2.85, lty = 2)
curve(f3, from = x[1], to = x[10]); abline(h = 2.85, lty = 2)
I am aware of a few previous threads on this topic, like
predict x values from simple fitting and annoting it in the plot
Predict X value from Y value with a fitted model
It is suggested that we simply reverse x and y, do an interpolation for (y, x) and compute the interpolated value at y = y0.
However, this is a bogus idea. Let y = f(x) be the interpolation function for (x, y); this idea is only valid when f(x) is a monotonic function of x, so that f is invertible. Otherwise x is not a function of y, and interpolating (y, x) makes no sense.
Taking the linear interpolation with my example data, this fake idea gives
fake_root <- approx(y, x, 2.85)[[2]]
# [1] 6.565559
First of all, the number of roots is incorrect: we see two roots in the figure (on the left), but the code returns only one. Secondly, it is not a correct root, as
f1(fake_root)
#[1] 2.906103
is not 2.85.
I have made a first attempt at this general problem at How to estimate x value from y value input after approxfun() in R. That solution turns out to be stable for linear interpolation, but not necessarily for non-linear interpolation. I am now looking for a stable solution, especially for cubic interpolation splines.
How can a solution be useful in practice?
Sometimes after a univariate linear regression y ~ x or a univariate non-linear regression y ~ f(x) we want to backsolve x for a target y. This Q & A is an example and has attracted many answers: Solve best fit polynomial and plot drop-down lines, but none is truly adaptive or easy to use in practice.
The accepted answer using polyroot only works for a simple polynomial regression;
Answers using the quadratic formula for an analytical solution only work for a quadratic polynomial;
My answer using predict and uniroot works in general, but is inconvenient, as in practice using uniroot requires interaction with the user (see Uniroot solution in R for more on uniroot).
It would be really good if there were an adaptive and easy-to-use solution.
First of all, let me copy in the stable solution for linear interpolation proposed in my previous answer.
## given (x, y) data, find x where the linear interpolation crosses y = y0
## the default value y0 = 0 implies root finding
## since linear interpolation is just a linear spline interpolation
## the function is named RootSpline1
RootSpline1 <- function (x, y, y0 = 0, verbose = TRUE) {
  if (is.unsorted(x)) {
    ind <- order(x)
    x <- x[ind]; y <- y[ind]
  }
  z <- y - y0
  ## which piecewise linear segment crosses zero?
  k <- which(z[-1] * z[-length(z)] <= 0)
  ## analytical root finding
  xr <- x[k] - z[k] * (x[k + 1] - x[k]) / (z[k + 1] - z[k])
  ## make a plot?
  if (verbose) {
    plot(x, y, "l"); abline(h = y0, lty = 2)
    points(xr, rep.int(y0, length(xr)))
  }
  ## return roots
  xr
}
For cubic interpolation splines returned by stats::splinefun with methods "fmm", "natural", "periodic" and "hyman", the following function provides a stable numerical solution.
RootSpline3 <- function (f, y0 = 0, verbose = TRUE) {
  ## extract piecewise construction info
  info <- environment(f)$z
  n_pieces <- info$n - 1L
  x <- info$x; y <- info$y
  b <- info$b; c <- info$c; d <- info$d
  ## list of roots on each piece
  xr <- vector("list", n_pieces)
  ## loop through pieces
  i <- 1L
  while (i <= n_pieces) {
    ## complex roots
    croots <- polyroot(c(y[i] - y0, b[i], c[i], d[i]))
    ## real roots (be careful when testing 0 for floating point numbers)
    rroots <- Re(croots)[round(Im(croots), 10) == 0]
    ## the parametrization is for (x - x[i]), so need to shift the roots
    rroots <- rroots + x[i]
    ## real roots in (x[i], x[i + 1])
    xr[[i]] <- rroots[(rroots >= x[i]) & (rroots <= x[i + 1])]
    ## next piece
    i <- i + 1L
  }
  ## collapse list to atomic vector
  xr <- unlist(xr)
  ## make a plot?
  if (verbose) {
    curve(f, from = x[1], to = x[n_pieces + 1], xlab = "x", ylab = "f(x)")
    abline(h = y0, lty = 2)
    points(xr, rep.int(y0, length(xr)))
  }
  ## return roots
  xr
}
It applies polyroot piecewise: first find all roots in the complex plane, then retain only the real ones lying in the piece's interval. This works because a cubic interpolation spline is just a collection of piecewise cubic polynomials. My answer at How to save and load spline interpolation functions in R? shows how to obtain the piecewise polynomial coefficients, so using polyroot is straightforward.
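To see what RootSpline3 reads from the spline object, here is a minimal peek, assuming the f3 <- splinefun(x, y, method = "fmm") from the question (the components n, x, y, b, c, d are how stats::splinefun stores its piecewise construction):
info <- environment(f3)$z
str(info[c("n", "x", "y", "b", "c", "d")])
## on piece i, for t in [x[i], x[i + 1]], the spline value is
## y[i] + b[i] * (t - x[i]) + c[i] * (t - x[i])^2 + d[i] * (t - x[i])^3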
Using the example data in the question, both RootSpline1 and RootSpline3 correctly identify all roots.
par(mfrow = c(1, 2))
RootSpline1(x, y, 2.85)
#[1] 3.495375 6.606465
RootSpline3(f3, 2.85)
#[1] 3.924512 6.435812 9.207171 9.886640
Given the data points and the spline function as above, simply apply findzeros() from the pracma package.
library(pracma)
xs <- findzeros(function(x) f3(x) - 2.85, min(x), max(x))
xs # [1] 3.924513 6.435812 9.207169 9.886618
points(xs, f3(xs))


How to draw an $\alpha$ confidence area on a 2D plot?

There are a lot of answers about plotting confidence intervals.
I'm reading the paper by Lourme A. et al. (2016) and I'd like to draw the 90% confidence boundary and the 10% exceptional points, as in Fig. 2 of the paper.
I can't use LaTeX here, so I refer to the paper for the definition of the confidence areas. Here is my setup:
library("MASS")
library(copula)
set.seed(612)
n <- 1000 # length of sample
d <- 2 # dimension
# random vector with uniform margins on (0,1)
u1 <- runif(n, min = 0, max = 1)
u2 <- runif(n, min = 0, max = 1)
u = matrix(c(u1, u2), ncol=d)
Rg <- cor(u) # d-by-d correlation matrix
Rg1 <- ginv(Rg) # inv. matrix
# round(Rg %*% Rg1, 8) # check
# the multivariate c.d.f of u is a Gaussian copula
# with parameter Rg[1,2]=0.02876654
normal.cop = normalCopula(Rg[1,2], dim=d)
fit.cop = fitCopula(normal.cop, u, method="itau") #fitting
# Rg.hat = fit.cop@estimate[1]
# [1] 0.03097071
sim = rCopula(n, normal.cop) # in (0,1)
# Taking the quantile function of N1(0, 1)
y1 <- qnorm(sim[,1], mean = 0, sd = 1)
y2 <- qnorm(sim[,2], mean = 0, sd = 1)
par(mfrow=c(2,2))
plot(y1, y2, col="red"); abline(v=mean(y1), h=mean(y2))
plot(sim[,1], sim[,2], col="blue")
hist(y1); hist(y2)
Reference.
Lourme, A. and F. Maurer (2016). Testing the Gaussian and Student's t copulas in a risk management framework. Economic Modelling.
Question. Could anyone help me by explaining the variables v = (v_1, ..., v_d) and G(v_1), ..., G(v_d) in the equation?
I think v is a non-random matrix; its dimensions should be $k^2$ (grid points) by d = 2 (dimensions). For example,
axis_x <- seq(0, 1, 0.1) # 11 grid points
axis_y <- seq(0, 1, 0.1) # 11 grid points
v <- expand.grid(axis_x, axis_y)
plot(v, type = "p")
So your question is about the vector nu and the corresponding G(nu).
nu is a random vector drawn from any distribution with domain (0,1) (here I use the uniform distribution). Since you want your samples in 2D, a single nu can be nu = runif(2). Given the explanations above, G is the standard normal quantile function (qnorm with mean 0 and sd 1), applied componentwise; Rg is the copula's correlation matrix (2x2 in 2D).
Now what the paragraph says: if you have a random sample nu and you want to check whether it belongs to the confidence region Gamma, given the number of dimensions d and confidence level alpha, you need to compute the statistic (G(nu) %*% Rg^-1) %*% G(nu) and check that it is below the alpha-quantile of the Chi^2 distribution with d degrees of freedom.
For example:
# This is the copula parameter: a correlation matrix must be symmetric
r <- runif(1)
Rg <- matrix(c(1, r, r, 1), ncol = 2)
# But we need to compute the inverse for sampling
Rginv <- MASS::ginv(Rg)
sampleResult <- replicate(10000, {
  # we draw our nu from the uniform, but others that map to (0,1), e.g. beta, are possible too
  nu <- runif(2)
  # we compute G(nu), the standard normal quantile of the sample
  Gnu <- qnorm(nu, mean = 0, sd = 1)
  # for this we compute the statistic as given in the formula
  stat <- (Gnu %*% Rginv) %*% Gnu
  # and return the result
  list(nu = nu, Gnu = Gnu, stat = stat)
})
theSamples <- sapply(sampleResult["nu",], identity)
# this is the critical value of the Chi^2 with alpha = 0.95 and df = number of dimensions
# old and buggy threshold <- pchisq(0.95, df = 2)
# new and awesome - we are looking for the statistic at alpha = .95 quantile
threshold <- qchisq(0.95, df = 2)
# we can accept samples given the threshold (like in equation)
inArea <- sapply(sampleResult["stat",], identity) < threshold
plot(t(theSamples), col = as.integer(inArea)+1)
The red points are the points you would keep (I plot all points here).
As for drawing the decision boundaries, I think it is a little more complicated, since you need to find the exact pairs nu such that (Gnu %*% Rginv) %*% Gnu == qchisq(alpha, df = 2). This is a quadratic form set equal to a constant, i.e. an ellipse in the Gnu plane; you can solve it for Gnu and then apply the inverse transform to get nu on the decision boundary.
edit: Reading the paragraph again, I noticed that the parameter for Gnu does not change; it is simply Gnu <- qnorm(nu, mean = 0, sd = 1).
edit: There was a bug: for the threshold you need the quantile function qchisq instead of the distribution function pchisq. This is now corrected in the code above (and the figures updated).
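To see why the bug matters, compare the two numbers (standard chi-squared values, df = 2):
pchisq(0.95, df = 2) # about 0.378, a probability, far too small to act as a threshold
qchisq(0.95, df = 2) # about 5.99, the 95% critical value we actually need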
This has two parts: first, compute the copula value as a function of X and Y; then, plot the curve giving the boundary where the copula exceeds the threshold.
Computing the value is basically linear algebra, which @drey has answered. This is a rewritten version so that the copula is given by a function.
cop1 <- function(x)
{
  Gnu <- qnorm(x)
  Gnu %*% Rginv %*% Gnu
}
copula <- function(x)
{
  apply(x, 1, cop1)
}
Plotting the boundary curve can be done using the same method as here (which in turn is the method used by the textbooks Modern Applied Statistics with S and The Elements of Statistical Learning): create a grid of values, and use interpolation to find the contour line at the given height.
r <- runif(1)
Rg <- matrix(c(1, r, r, 1), ncol = 2) # a valid (symmetric) correlation matrix
Rginv <- MASS::ginv(Rg)
# draw the contour line where value == threshold
# define a grid of values first: avoid x and y = 0 and 1, where infinities exist
xlim <- 1e-3
delta <- 1e-3
xseq <- seq(xlim, 1-xlim, by=delta)
grid <- expand.grid(x=xseq, y=xseq)
prob.grid <- copula(grid)
threshold <- qchisq(0.95, df=2)
contour(x=xseq, y=xseq, z=matrix(prob.grid, nrow=length(xseq)), levels=threshold,
col="grey", drawlabels=FALSE, lwd=2)
# add some points
data <- data.frame(x=runif(1000), y=runif(1000))
points(data, col=ifelse(copula(data) < threshold, "red", "black"))
