how to transform density object to function - r

I would like to use the output of density() as a function, so that I can do things like differentiate it, integrate it over a specific interval, or evaluate it at a specific point.
To be clear, let's take an example:
a=c(1,3,10,-5,0,0,2, 1, 3, 8,2, -2)
b=density(a)
I would like some transformation of b
f=some_transformation(b) # transformation I don't know
is.function(f) # answer must be "TRUE"
so that I can evaluate the density at any point
f(1.2) # evaluate density at 1.2
compute its derivative
Df=D(body(f), "x") # derivative of f
Df(1.2) # derivative at 1.2
and do other R stuff as if f is a function.

You can use approxfun.
a <- c(1,3,10,-5,0,0,2, 1, 3, 8,2, -2)
b <- density(a)
f <- approxfun(b, rule=2)
is.function(f)
f(1.2)
Since it is not defined by a formula,
you cannot use D (symbolic differentiation)
to compute its derivative.
You can estimate it numerically, though.
library(numDeriv)
df <- function(x) grad(f,x)
curve( f(x), lwd=3, xlim=c(-10,10) )
curve( df(x), lwd=3, xlim=c(-10,10) )
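The same f can also be handed to integrate to get the area over a specific interval. This is a minimal sketch, assuming the limits stay inside the range of b$x: with rule=2 the function is constant rather than zero outside that range, so infinite limits would not converge. The names total and p_0_5 are only illustrative.

```r
a <- c(1, 3, 10, -5, 0, 0, 2, 1, 3, 8, 2, -2)
b <- density(a)
f <- approxfun(b, rule = 2)

# area under the interpolated density over the whole evaluation grid;
# this should be close to 1 (extra subdivisions because f is piecewise linear)
total <- integrate(f, min(b$x), max(b$x), subdivisions = 1000L)$value

# estimated probability mass between 0 and 5
p_0_5 <- integrate(f, 0, 5)$value

total
p_0_5
```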

D takes an expression, not a function, as its first argument. It is for symbolic calculus, not for finding the gradient of numeric values. You can numerically calculate the derivative of b with respect to x using:
with(b, diff(y) / diff(x))
Here's a visualisation of the gradient to give an example of how you might use it.
library(ggplot2)
gradient_data <- with(
density(a),
{
data.frame(
dy_by_dx = diff(y) / diff(x),
x = (x[-1] + x[-length(x)]) / 2 # midpoints of consecutive grid points
)
}
)
(gradient_plot <- ggplot(gradient_data, aes(x, dy_by_dx)) +
geom_line()
)
If you want to evaluate the function at any point, then use approx.
with(density(a), approx(x, y, xout = -8:13))
The answer will be more accurate if you increase the n argument to the density function.
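To combine the two ideas, the finite-difference slopes can themselves be wrapped into a function with approxfun. This is only a sketch: the choice n = 2048 and the names slope, mid and dfun are assumptions made for illustration, not part of the answer above.

```r
a <- c(1, 3, 10, -5, 0, 0, 2, 1, 3, 8, 2, -2)

# a finer grid than the default n = 512
d <- density(a, n = 2048)

# finite-difference slope on each grid interval, located at the interval midpoint
slope <- diff(d$y) / diff(d$x)
mid   <- (d$x[-1] + d$x[-length(d$x)]) / 2

# interpolate the slopes so the derivative can be evaluated anywhere on the grid range
dfun <- approxfun(mid, slope, rule = 2)

dfun(1.2)  # approximate derivative of the density estimate at 1.2
```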

Related

integrating the square of probability density?

Suppose I have
set.seed(2020) # make the results reproducible
a <- rnorm(100, 0, 1)
My probability density is estimated through kernel density estimator (gaussian) in R using the R built in function density. The question is how to integrate the square of the estimated density. It does not matter between which values, let us suppose between -Inf and +Inf. I have tried the following:
f <- approxfun(density(a)$x, density(a)$y)
integrate (f*f, min(density(a)$x), max(density(a)$x))
There are a couple of problems here. First, check that x and y are passed to approxfun in the right order (x first, then y). Secondly, you can't multiply functions together: f*f is not itself a function. You need to define a new function that returns the square of your original function:
set.seed(2020)
a <- rnorm(100, 0, 1)
f <- approxfun(density(a)$x, density(a)$y)
f2 <- function(v) ifelse(is.na(f(v)), 0, f(v)^2)
integrate (f2, -Inf, Inf)
#> 0.2591153 with absolute error < 0.00011
We can also plot the original density function and the squared density function:
curve(f, -3, 3)
curve(f2, -3, 3, add = TRUE, col = "red")
I think you should write the objective function as function(x) f(x)**2, rather than f*f, e.g.,
> integrate (function(x) f(x)**2, min(density(a)$x), max(density(a)$x))
0.2331793 with absolute error < 6.6e-06
Here is a way using package caTools, function trapz. It computes the integral given a vector x and its corresponding image y using the trapezoidal rule.
I also include a function trapzf, based on trapz, that computes the integral from the function returned by approxfun.
library(caTools)
trapzf <- function(x, FUN) trapz(x, FUN(x))
set.seed(2020) # make the results reproducible
a <- rnorm(100, 0, 1)
d <- density(a)
f <- approxfun(d$x, d$y)
int1 <- trapz(d$x, d$y^2)
int2 <- trapzf(d$x, function(x) f(x)^2)
int1
#[1] 0.2591226
identical(int1, int2)
#[1] TRUE
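For a Gaussian kernel, the integral of the squared estimate also has a closed form: the product of two Gaussian kernels with bandwidth h integrates to a normal density with standard deviation sqrt(2) * h, evaluated at the distance between the two data points. The sketch below assumes the defaults of density() (Gaussian kernel, bandwidth from bw.nrd0) and compares the closed form with a trapezoidal sum on the grid. The variable names are illustrative.

```r
set.seed(2020)
a <- rnorm(100, 0, 1)

# bandwidth that density() uses by default
h <- bw.nrd0(a)

# closed form: average of N(0, 2 h^2) densities over all pairs (i, j)
exact <- mean(outer(a, a, function(u, v) dnorm(u - v, sd = sqrt(2) * h)))

# numerical check: trapezoidal rule on the squared grid values
d  <- density(a, n = 2048)
y2 <- d$y^2
numeric_est <- sum(diff(d$x) * (y2[-1] + y2[-length(y2)]) / 2)

c(exact = exact, numeric = numeric_est)
```

The two numbers should agree closely but not exactly, since density() evaluates the estimate on a finite grid with a binned approximation.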

Find x from a given y in a spline function in R [duplicate]

I am interested in a general root finding problem for an interpolation function.
Suppose I have the following (x, y) data:
set.seed(0)
x <- 1:10 + runif(10, -0.1, 0.1)
y <- rnorm(10, 3, 1)
as well as a linear interpolation and a cubic spline interpolation:
f1 <- approxfun(x, y)
f3 <- splinefun(x, y, method = "fmm")
How can I find x-values where these interpolation functions cross a horizontal line y = y0? The following is a graphical illustration with y0 = 2.85.
par(mfrow = c(1, 2))
curve(f1, from = x[1], to = x[10]); abline(h = 2.85, lty = 2)
curve(f3, from = x[1], to = x[10]); abline(h = 2.85, lty = 2)
I am aware of a few previous threads on this topic, like
predict x values from simple fitting and annoting it in the plot
Predict X value from Y value with a fitted model
It is suggested that we simply reverse x and y, do an interpolation for (y, x) and compute the interpolated value at y = y0.
However, this is a bogus idea. Let y = f(x) be an interpolation function for (x, y). The idea is only valid when f(x) is a monotonic function of x, so that f is invertible. Otherwise x is not a function of y and interpolating (y, x) makes no sense.
Taking the linear interpolation with my example data, this fake idea gives
fake_root <- approx(y, x, 2.85)[[2]]
# [1] 6.565559
First of all, the number of roots is incorrect. We see two roots from the figure (on the left), but the code only returns one. Secondly, it is not a correct root, as
f1(fake_root)
#[1] 2.906103
is not 2.85.
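The monotonicity condition can be checked directly before attempting the reversed interpolation. A small sketch: is_monotone is a hypothetical helper, and xm, ym are made-up monotone data used only for contrast.

```r
set.seed(0)
x <- 1:10 + runif(10, -0.1, 0.1)
y <- rnorm(10, 3, 1)

# hypothetical helper: TRUE when the values are strictly increasing or strictly decreasing
is_monotone <- function(v) all(diff(v) > 0) || all(diff(v) < 0)

is_monotone(y)   # the example data are not monotone, so reversing x and y is invalid

# made-up monotone data, where interpolating (y, x) is legitimate
xm <- 1:5
ym <- c(1, 2, 4, 7, 11)
is_monotone(ym)
x_at_3 <- approx(ym, xm, xout = 3)$y  # halfway between x = 2 and x = 3
x_at_3
```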
I have made my first attempt at this general problem at How to estimate x value from y value input after approxfun() in R. The solution turns out stable for linear interpolation, but not necessarily stable for non-linear interpolation. I am now looking for a stable solution, especially for a cubic interpolation spline.
How can a solution be useful in practice?
Sometimes after a univariate linear regression y ~ x or a univariate non-linear regression y ~ f(x) we want to backsolve x for a target y. This Q & A is an example and has attracted many answers: Solve best fit polynomial and plot drop-down lines, but none is truly adaptive or easy to use in practice.
The accepted answer using polyroot only works for a simple polynomial regression;
Answers using the quadratic formula for an analytical solution only work for a quadratic polynomial;
My answer using predict and uniroot works in general, but is not convenient, as in practice using uniroot needs interaction with users (see Uniroot solution in R for more on uniroot).
It would be really good if there is an adaptive and easy-to-use solution.
First of all, let me copy in the stable solution for linear interpolation proposed in my previous answer.
## given (x, y) data, find x where the linear interpolation crosses y = y0
## the default value y0 = 0 implies root finding
## since linear interpolation is just a linear spline interpolation
## the function is named RootSpline1
RootSpline1 <- function (x, y, y0 = 0, verbose = TRUE) {
if (is.unsorted(x)) {
ind <- order(x)
x <- x[ind]; y <- y[ind]
}
z <- y - y0
## which piecewise linear segment crosses zero?
k <- which(z[-1] * z[-length(z)] <= 0)
## analytical root finding
xr <- x[k] - z[k] * (x[k + 1] - x[k]) / (z[k + 1] - z[k])
## make a plot?
if (verbose) {
plot(x, y, "l"); abline(h = y0, lty = 2)
points(xr, rep.int(y0, length(xr)))
}
## return roots
xr
}
For cubic interpolation splines returned by stats::splinefun with methods "fmm", "natural", "periodic" and "hyman", the following function provides a stable numerical solution.
RootSpline3 <- function (f, y0 = 0, verbose = TRUE) {
## extract piecewise construction info
info <- environment(f)$z
n_pieces <- info$n - 1L
x <- info$x; y <- info$y
b <- info$b; c <- info$c; d <- info$d
## list of roots on each piece
xr <- vector("list", n_pieces)
## loop through pieces
i <- 1L
while (i <= n_pieces) {
## complex roots
croots <- polyroot(c(y[i] - y0, b[i], c[i], d[i]))
## real roots (be careful when testing 0 for floating point numbers)
rroots <- Re(croots)[round(Im(croots), 10) == 0]
## the parametrization is for (x - x[i]), so need to shift the roots
rroots <- rroots + x[i]
## real roots in [x[i], x[i + 1]]
xr[[i]] <- rroots[(rroots >= x[i]) & (rroots <= x[i + 1])]
## next piece
i <- i + 1L
}
## collapse list to atomic vector
xr <- unlist(xr)
## make a plot?
if (verbose) {
curve(f, from = x[1], to = x[n_pieces + 1], xlab = "x", ylab = "f(x)")
abline(h = y0, lty = 2)
points(xr, rep.int(y0, length(xr)))
}
## return roots
xr
}
It uses polyroot piecewise, first finding all roots in the complex field, then retaining only the real ones that lie on the piece's interval. This works because a cubic interpolation spline is just a sequence of piecewise cubic polynomials. My answer at How to save and load spline interpolation functions in R? has shown how to obtain piecewise polynomial coefficients, so using polyroot is straightforward.
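The core step on a single piece can be illustrated in isolation. The cubic below is a made-up example with known roots 1, 2 and 3, and the interval [1.5, 3.5] stands in for one spline piece. Note that polyroot expects coefficients in increasing order of power.

```r
# p(t) = -6 + 11 t - 6 t^2 + t^3 = (t - 1)(t - 2)(t - 3)
croots <- polyroot(c(-6, 11, -6, 1))

# keep roots whose imaginary part is numerically zero
rroots <- sort(Re(croots)[abs(Im(croots)) < 1e-8])

# keep only the real roots inside the (made-up) piece interval [1.5, 3.5]
inside <- rroots[rroots >= 1.5 & rroots <= 3.5]

rroots
inside
```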
Using the example data in the question, both RootSpline1 and RootSpline3 correctly identify all roots.
par(mfrow = c(1, 2))
RootSpline1(x, y, 2.85)
#[1] 3.495375 6.606465
RootSpline3(f3, 2.85)
#[1] 3.924512 6.435812 9.207171 9.886640
Given data points and spline function as above, simply apply findzeros() from the pracma package.
library(pracma)
xs <- findzeros(function(x) f3(x) - 2.85, min(x), max(x))
xs # [1] 3.924513 6.435812 9.207169 9.886618
points(xs, f3(xs))
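For an interpolation function whose piecewise coefficients are not accessible, a base-R alternative is to bracket sign changes on a fine grid and refine each bracket with uniroot. This is a generic sketch, not the method from the answers above. roots_on_grid and its grid size n are hypothetical, and roots that only touch y0 without crossing it, or that fall exactly on a grid point, will be missed.

```r
# hypothetical helper: find x in [lower, upper] where f(x) crosses y0
roots_on_grid <- function(f, y0, lower, upper, n = 1000L) {
  g <- function(x) f(x) - y0
  grid <- seq(lower, upper, length.out = n)
  s <- sign(g(grid))
  # grid intervals over which g strictly changes sign
  k <- which(s[-1] * s[-n] < 0)
  # refine each bracket with uniroot
  vapply(k, function(i) uniroot(g, c(grid[i], grid[i + 1]), tol = 1e-10)$root,
         numeric(1))
}

set.seed(0)
x <- 1:10 + runif(10, -0.1, 0.1)
y <- rnorm(10, 3, 1)
f3 <- splinefun(x, y, method = "fmm")

xr <- roots_on_grid(f3, 2.85, min(x), max(x))
xr
f3(xr)  # each value should be very close to 2.85
```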

Plot density curve of mixture of two normal distribution

I am rather new to R and could use some basic help. I'd like to generate sums of two normal random variables (variance = 1 for each) as their means move apart and plot the results. The basic idea: if the means are sufficiently far apart, the distribution will be bimodal. Here's the code I'm trying:
x <- seq(-3, 3, length=500)
for(i in seq(0, 3, 0.25)) {
y <- dnorm(x, mean=0-i, sd=1)
z <- dnorm(x, mean=0+i, sd=1)
plot(x,y+z, type="l", xlim=c(-3,3))
}
Several questions:
Are there better ways to do this?
I'm only getting one PDF on my plot. How can I put multiple PDFs on the same plot?
Thank you in advance!
It is not difficult to do this using basic R features. We first define a function f to compute the density of this mixture of normals:
## `x` is an evaluation grid
## `dev` is deviation of mean from 0
f <- function (x, dev) {
(dnorm(x, -dev) + dnorm(x, dev)) / 2
}
Then we use sapply to loop through various dev to get corresponding density:
## `dev` sequence to test
dev <- seq(0, 3, 0.25)
## evaluation grid; extending `c(-1, 1) * max(dev)` by 4 standard deviations
x <- seq(-max(dev) - 4, max(dev) + 4, by = 0.1)
## density matrix
X <- sapply(dev, f, x = x)
## a comment on 2022-07-31: X <- outer(x, dev, f)
Finally we use matplot for plotting:
matplot(x, X, type = "l", lty = 1)
Explanation of sapply:
During sapply, x is not changed, while we pick up and try one element of dev each iteration. It is like
X <- matrix(0, nrow = length(x), ncol = length(dev))
for (i in 1:length(dev)) X[, i] <- f(x, dev[i])
matplot(x, X) will plot columns of X one by one, against x.
A comment on 2022-07-31: Just use outer. Here are more examples:
Run a function of 2 arguments over a span of parameter values in R
Plot of a Binomial Distribution for various probabilities of success in R
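The sapply and outer versions can be checked against each other, and the number of peaks in each column shows where the mixture turns bimodal. The peak counter n_modes is an illustrative addition. For an equal-weight mixture of two unit-variance normals, two peaks are expected once the means are more than two standard deviations apart, which here means dev > 1.

```r
f <- function(x, dev) (dnorm(x, -dev) + dnorm(x, dev)) / 2

dev <- seq(0, 3, 0.25)
x <- seq(-max(dev) - 4, max(dev) + 4, by = 0.1)

X1 <- sapply(dev, f, x = x)   # one column per value of dev
X2 <- outer(x, dev, f)        # same matrix, built in one vectorised call

all.equal(X1, X2, check.attributes = FALSE)

# count local maxima in each column: places where the slope turns from + to -
n_modes <- apply(X2, 2, function(col) sum(diff(sign(diff(col))) < 0))
data.frame(dev = dev, n_modes = n_modes)
```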

How to find quantiles of an empirical cumulative density function (ECDF)

I am using the ecdf() function to calculate the empirical cumulative distribution function (ECDF) from some random samples:
set.seed(0)
X = rnorm(100)
P = ecdf(X)
Now P gives ECDF and we could plot it:
plot(P)
abline(h = 0.6, lty = 3)
My question is: how can I find the sample value x, such that P(x) = 0.6, i.e., the 0.6-quantile of ECDF, or the x-coordinate of the intersection point of ECDF and h = 0.6?
In the following, I will not use ecdf(), as it is easy to obtain the empirical cumulative distribution function (ECDF) ourselves.
First, we sort samples X in ascending order:
X <- sort(X)
ECDF at those samples takes function values:
e_cdf <- 1:length(X) / length(X)
We could then sketch ECDF by:
plot(X, e_cdf, type = "s")
abline(h = 0.6, lty = 3)
Now, we are looking for the first value of X, such that P(X) >= 0.6. This is just:
X[which(e_cdf >= 0.6)[1]]
# [1] 0.2290196
Since our data are sampled from standard normal distribution, the theoretical quantile is
qnorm(0.6)
# [1] 0.2533471
So our result is quite close.
Extension
Since the inverse of a CDF is the quantile function (for example, the inverse of pnorm() is qnorm()), one may guess that the inverse of the ECDF is the sample quantile, i.e., that the inverse of ecdf() is quantile(). This is not true!
The ECDF is a staircase / step function, and it does not have an inverse. If we reflect the ECDF about y = x, the resulting curve is not a mathematical function. So the sample quantile, as computed by quantile() by default, has nothing to do with the ECDF.
For n sorted samples, sample quantile function is in fact a linear interpolation function of (x, y), with:
x-values being seq(0, 1, length = n);
y-values being sorted samples.
We can define our own version of sample quantile function by:
my_quantile <- function(x, prob) {
if (is.unsorted(x)) x <- sort(x)
n <- length(x)
approx(seq(0, 1, length = n), x, prob)$y
}
Let's have a test:
my_quantile(X, 0.6)
# [1] 0.2343171
quantile(X, prob = 0.6, names = FALSE)
# [1] 0.2343171
Note that the result is different from what we get from X[which(e_cdf >= 0.6)[1]].
It is for this reason that I did not use quantile() in my answer.
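For comparison, quantile() also offers discontinuous definitions through its type argument, and type = 1 is documented as the inverse of the empirical distribution function. The sketch below, using the same data as above, checks that it agrees with the step-function lookup while the default (type = 7) does not have to. The variable names are illustrative.

```r
set.seed(0)
X <- sort(rnorm(100))
e_cdf <- seq_along(X) / length(X)

# first sample value at which the ECDF reaches 0.6
q_step <- X[which(e_cdf >= 0.6)[1]]

# type = 1: inverse of the empirical distribution function
q_type1 <- quantile(X, probs = 0.6, type = 1, names = FALSE)

# default type = 7: linear interpolation between order statistics
q_type7 <- quantile(X, probs = 0.6, names = FALSE)

c(step = q_step, type1 = q_type1, type7 = q_type7)
```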
