R: probability / numerical integral of bivariate (or multivariate) kernel density - r

I am using the package ks for kernel density estimation. Here's an easy example:
n <- 70
x <- rnorm(n)
library(ks)
f_kde <- kde(x)
I am actually interested in the respective exceeding probabilities of my input data, which can be easily returned by ks having f_kde:
p_kde <- pkde(x, f_kde)
This is done in ks with a numerical integration using Simpson's rule. Unfortunately, they only implemented this for a 1d case. In a bivariate case, there's no implementation in ks of any method for returning the probabilities :
y <- rnorm(n)
f_kde <- kde(data.frame(x,y))
# does not work, but it's what I am looking for:
p_kde <- pkde(data.frane(x,y), f_kde)
I couldnt find any package or help searching in stackoverflow to solve this issue in R (some suggestions for Python exist, but I would like to keep it in R). Any line of code or package recommendation is appreciated. Even though I am mostly interested in the bivariate case, any ideas for a multivariate case are appreciated as well.

kde allows multidimensional kernel estimate, so we could use kde to calculate pkde.
For this, we calculate kde on small enough dx and dy steps using eval.points parameter : this gives us the local density estimate on a dx*dy
square.
We verify that the sum of estimates mutiplied by the surface of the squares almost equals 1:
library(ks)
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
xy <- cbind(x,y)
xmin <- -10
xmax <- 10
dx <- .1
ymin <- -10
ymax <- 10
dy <- .1
pts.x <- seq(xmin, xmax, dx)
pts.y <- seq(ymin, ymax, dy)
pts <- as.data.frame(expand.grid(x = pts.x, y = pts.y))
f_kde <- kde(xy,eval.points=pts)
pts$est <- f_kde$estimate
sum(pts$est)*dx*dy
[1] 0.9998778
You can now query the pts dataframe for the cumulative probability on the area of your choice :
library(data.table)
setDT(pts)
# cumulative density
pts[x < 1 & y < 2 , .(pkde=sum(est)*dx*dy)]
pkde
1: 0.7951228
# average density around a point
tolerance <-.1
pts[pmin(abs(x-1))<tolerance & pmin(abs(y-2))<tolerance, .(kde = mean(est))]
kde
1: 0.01465478

Related

Manually calculate two-sample kolmogorov-smirnov using ECDF

I am trying to manually compute the KS statistic for two random samples. As far as I understood the KS statistic D is the maximum vertical deviation between the two CDFs. However, manually calculating the differences between the two CDF and running the ks.test from the base R yields different results. I wonder where is the mistake.
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
### Manual calculation
# function for calculating manually the ecdf
decdf <- function(x, baseline, treatment) ecdf(baseline)(x) - ecdf(treatment)(x)
#Difference between the two CDFs
d <- curve(decdf(x,a,b), from=min(a,b), to=max(a,b))
# getting D
ks <- max(abs(d$y))
#### R-Base calculation
ks.test(a,b)
The R-Base D = 0.0109 while the manual calculation is 0.0088. Any help explaining the difference is appreciated.
I attach the R-Base source code ( a bit cleaned up)
n <- length(a)
n.x <- as.double(n)
n.y <- length(b)
n <- n.x * n.y/(n.x + n.y)
w <- c(a, b)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
STATISTIC <- max(abs(z))
By default, curve evaluates the function on a subdivision of 100 points between from and to. By restricting to these 100 points, it's possible that you miss the value for which the maximum difference is attained.
Instead, evaluate the difference at all points where the ecdf's jump and you are sure to catch the value for which the maximum difference is attained.
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
Fa <- ecdf(a)
Fb <- ecdf(b)
x <- c(a,b) # the points where Fa or Fb jump
max(abs(Fa(x) - Fb(x)))
# [1] 0.0109

plotting the posterior distribution of N given likelihood and prior

Given that the likelihood is Y|n~Binomial(n, theta) and the prior is n~Poisson(5), I tried to calculate the posterior distribution of sample size n with Y=0 and theta=0.2. My code is as follows:
Y <- 0
theta <- 0.2
n_grid <- seq(0,1,length=1000)
like <- dbinom(Y,n_grid,theta)
prior <- dpois(n_grid,5)
fy <- sum(like*prior)
post <- like*prior/fy
plot(n_grid,post,type="l")
I keep getting NaN results when computing the likelihood function and priors. Any help would be appreciated!
So I realize it might be unconventional to answer my own question, but I figured out my solution and thought I would post the answer to help someone else out.
Y <- 0
theta <- 0.2
N <- 0:0.01:100
like <- dbinom(Y,N,theta)
prior <- dpois(N,5)
fy <- sum(like*prior)
post <- like*prior/fy
plot(N,post,type="l")

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis Hasting Algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# metropolis hasting algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
lik <- dcustom(x = x, sd = 0.1, L = 200, R =400)
# propose a value for x (you can adjust the stepsize with the sd)
x.proposed <- x + rnorm(1, 0, sd = 20)
lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
r <- lik.proposed/lik # this is the acceptance ratio
# accept new value with probablity of ratio
if (runif(1) < r) {
x <- x.proposed
posterior_sample[i] <- x
}
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302

Extract approximate probability density function (pdf) in R from random sampling

I have got n>2 independent continuous Random Variables(RV). For example say I have 4 Uniform RVs with different set of Upper and lowers.
W~U[-1,5], X~U[0,1], Y~[0,2], Z~[0.5,2]
I am trying to find out the approximate PDF for the sum of these RVs i.e. for T=W+X+Y+Z. As I don't need any closed form solution, I have sampled 1 million points for each of them to get 1 million samples for T. Is it possible in R to get the approximate PDF function or a way to get approximate probability of P(t<T)from this samples I have drawn. For example is there a easy way I can calculate P(0.5<T) in R. My priority here is to get probability first even if getting the density function is not possible.
Thanks
Consider the ecdf function:
set.seed(123)
W <- runif(1e6, -1, 5)
X <- runif(1e6, 0, 1)
Y <- runif(1e6, 0, 2)
Z <- runif(1e6, 0.5, 2)
T <- Reduce(`+`, list(W, X, Y, Z))
cdfT <- ecdf(T)
1 - cdfT(0.5) # Pr(T > 0.5)
# [1] 0.997589
See How to calculate cumulative distribution in R? for more details.

R optimize not giving the finite minimum but Inf when the search interval is wider

I have a problem with optimize().
When I limit the search in a small interval around zero, e.g., (-1, 1), the optimize algorithm gives a finite minimum with a finite objective function value.
But when I make the interval wider to (-10, 10), then the minimum is on the boundary of the interval and the objective is Inf, which is really puzzling for me.
How can this happen and how to fix this? Thanks a lot in advance.
The following is my code.
set.seed(123)
n <- 120
c <- rnorm(n,mean=1,sd=.3);
eps <- rnorm(n,mean=0,sd=5)
tet <- 32
r <- eps * c^tet
x <- matrix(c(c,r), ncol=2)
g <- function(tet, x){
matrix((x[,1]^(-tet))*x[,2],ncol=1)
}
theta <- 37
g_t <- g(theta,x)
f.tau <- function(tau){
exp.tau.g <- exp(g_t %*% tau)
g.exp <- NULL; i <- 1:n
g.exp <- matrix(exp.tau.g[i,] * g_t[i,], ncol=1)
sum.g.exp <- apply(g.exp,2,sum)
v <- t(sum.g.exp) %*% sum.g.exp
return(v)
}
band.tau <- 1;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-1, 1)"); print(f);
band.tau <- 10;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-10, 10)"); print(f);
The problem is that your function f.tau(x) is not well behaved. You can see that here:
vect.f <- Vectorize(f.tau)
z1 <- seq(-1,1,by=0.01)
z10 <- seq(-10,10,by=0.01)
par(mfrow=c(2,1), mar=c(2,2,1,1))
plot(z1, log(vect.f(z1)), type="l")
plot(z10,log(vect.f(z10)),type="l")
Note that these are plots of log(f.tau). So there are two problems: f.tau(...) has an extremely large slope on either side of the minimum, and f.tau = Inf for x<-0.6 and x>1.0, where Inf means that f.tau(...) is greater than the largest number that can be represented on this system. When you set the range to (-1,1) your starting point is close enough to the minimum that optimize(...) manages to converge. When you set the limits to (-10,10) the starting point is too far away. There are examples in the documentation which show a similar problem with functions that are not nearly as ill-behaved as f.tau.
EDIT (Response to OP's comment)
The main problem is that you are trying to optimize a function which has computational infinities in the interval of interest. Here's a way around that.
band.tau <- 10
z <- seq(-band.tau,band.tau,length=1000)
vect.f <- Vectorize(f.tau)
interval <- range(z[is.finite(vect.f(z))])
f <- optimize(f.tau, interval, tol=1e-20)
f
# $minimum
# [1] 0.001615433
#
# $objective
# [,1]
# [1,] 7.157212e-12
This evaluates f.tau(x) at 1000 equally spaced points on (-band.tau,+band.tau), identifies all the values of x where f.tau is finite, and uses the range as the increment in optimize(...). This works in your case because f.tau(x) does not (appear to...) have asymptotes.

Resources