Calculate probability of point on 2d density surface - r

If I calculate the 2d density surface of two vectors like in this example:
library(MASS)
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)
I get the following surface
filled.contour(f1)
The z-value is the estimated density.
My question now is: is it possible to calculate the probability of a single point, e.g. a = 1, b = -4?
[As I'm not a statistician this may be the wrong wording, sorry for that. What I would like to know, if it is possible at all, is the probability with which a point occurs.]
Thanks for every comment!

If you specify an area, then that area has a probability with respect to your density function. Of course a single point does not have a probability different from zero. But it does have a non-zero density at that point. What is that then?
The density at a point is the limit of the probability of a small area around that point, divided by the area itself, as the area shrinks to zero. (It was actually rather hard to state that correctly; it took a few tries and it is still not optimal.)
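In symbols (just restating that limit, nothing specific to kde2d): the density $f$ at a point $(x,y)$ satisfies
$$ f(x,y) = \lim_{|A| \to 0} \frac{P\big((X,Y) \in A\big)}{|A|}, $$
where $A$ is a small region containing $(x,y)$ and $|A|$ is its area.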
All this is really basic calculus. It is also fairly easy to write a routine to calculate the integral of that density over the area, although I imagine MASS has standard ways to do it that use more sophisticated integration techniques. Here is a quick routine that I threw together based on your example:
library(MASS)
n <- 100
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)
lims <- c(min(a),max(a),min(b),max(b))
filled.contour(f1)
prob <- function(f, xmin, xmax, ymin, ymax, n, lims){
  # map the rectangle limits to grid indices of the n-by-n kde2d surface
  ixmin <- max( 1, n*(xmin-lims[1])/(lims[2]-lims[1]) )
  ixmax <- min( n, n*(xmax-lims[1])/(lims[2]-lims[1]) )
  iymin <- max( 1, n*(ymin-lims[3])/(lims[4]-lims[3]) )
  iymax <- min( n, n*(ymax-lims[3])/(lims[4]-lims[3]) )
  # average density over the selected grid cells times the rectangle area
  avg <- mean(f$z[ixmin:ixmax, iymin:iymax])
  probval <- (xmax-xmin)*(ymax-ymin)*avg
  return(probval)
}
prob(f1,0.5,1.5,-4.5,-3.5,n,lims)
# [1] 0.004788993
prob(f1,-1,1,-1,1,n,lims)
# [1] 0.2224353
prob(f1,-2,2,-2,2,n,lims)
# [1] 0.5916984
prob(f1,0,1,-1,1,n,lims)
# [1] 0.119455
prob(f1,1,2,-1,1,n,lims)
# [1] 0.05093696
prob(f1,-3,3,-3,3,n,lims)
# [1] 0.8080565
lims
# [1] -3.081773 4.767588 -5.496468 7.040882
Caveat, the routine seems right and is giving reasonable answers, but it has not undergone anywhere near the scrutiny I would give it for a production function.
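For a cross-check, here is a minimal sketch (not part of the original answer) that integrates the kde2d estimate directly over its own grid, using the coordinates stored in f1$x and f1$y; the helper name prob_grid is made up for illustration:
prob_grid <- function(f, xmin, xmax, ymin, ymax) {
  ix <- f$x >= xmin & f$x <= xmax   # grid columns inside the x range
  iy <- f$y >= ymin & f$y <= ymax   # grid rows inside the y range
  dx <- diff(f$x[1:2])              # grid spacing in x (kde2d uses an equally spaced grid)
  dy <- diff(f$y[1:2])              # grid spacing in y
  sum(f$z[ix, iy]) * dx * dy        # Riemann-sum approximation of the integral
}
prob_grid(f1, -2, 2, -2, 2)                          # should be close to prob(f1,-2,2,-2,2,n,lims)
prob_grid(f1, lims[1], lims[2], lims[3], lims[4])    # over the full grid: a bit below 1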

The z-value here is called a "probability density" rather than a "probability". As comments have pointed out, if you want an estimated probability you will need to integrate the estimated density to find the volume under your estimated surface.
However, if what you want is the probability density at a particular point, then you can use:
kde2d(a, b, n=1, lims=c(1, 1, -4, -4))$z[1,1]
# [1] 0.006056323
This will calculate a 1x1 "grid" with a single density estimate for the point you want.
A plot confirming that it worked:
z0 <- kde2d(a, b, n=1, lims=c(1, 1, -4, -4))$z[1,1]
filled.contour(
  f1,
  plot.axes = {
    contour(f1, levels=z0, add=TRUE)
    abline(v=1, lty=3)
    abline(h=-4, lty=3)
    axis(1); axis(2)
  }
)
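A rough alternative, as a sketch only (not the answer's method): instead of re-running kde2d, look up the grid cell of the existing f1 surface closest to (1, -4); the value will differ slightly from z0 because the grid points do not fall exactly on (1, -4):
ix <- which.min(abs(f1$x - 1))
iy <- which.min(abs(f1$y - (-4)))
f1$z[ix, iy]   # density at the nearest grid point, roughly comparable to z0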

Related

approximate variance of the area of a randomly generated triangle using simulations in R

If two sides of a triangle (a and b) and the angle between them (γ) are known, one can calculate the area of the triangle with
area = (1/2)*ab*sin(γ)
Using simulations find the approximate variance of the area of a randomly generated triangle, with the parameters having following distributions:
a ~ Gamma(shape=11, rate=6)
b ~ Gamma(shape=24, rate=18)
γ ~ Uniform(1, π/2)
I am quite stuck and don't really know how to proceed. I have tried
rgamma(n, shape, rate = 1, scale = 1/rate)
however, I do not clearly see how to connect a, b and γ together, or how to calculate the variance of the area.
# Define a function that calculates the area
tri.area <- function(a, b, gamma){
  return( 0.5*a*b*sin(gamma) )
}
# Number of random triangles to construct
# (a larger N gives a more stable variance estimate)
N <- 100
# create random vectors with the given distributions
a.vector <- rgamma(N, shape=11, rate=6)
b.vector <- rgamma(N, shape=24, rate=18)
gamma.vector <- runif(N, min=1, max=pi/2)
# calculate the vector of areas, one for each triple of values
tri.area.vec <- tri.area(a.vector, b.vector, gamma.vector)
# find the variance
var(tri.area.vec)
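As a usage note (a sketch, not part of the original answer): with only N <- 100 triangles the variance estimate is fairly noisy; re-running the same simulation with a larger N gives a more stable value:
N <- 1e5
a.vector <- rgamma(N, shape=11, rate=6)
b.vector <- rgamma(N, shape=24, rate=18)
gamma.vector <- runif(N, min=1, max=pi/2)
var(tri.area(a.vector, b.vector, gamma.vector))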

How to run monte carlo simulation from a custom distribution in R

I would like to pull 1000 samples from a custom distribution in R
I have the following custom distribution
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code which produced a histogram shape I did not expect.
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase as the index increases up to a threshold around 400, level off, and then decrease. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of values in points is different: that is what a histogram shows, namely how often values fall within a certain range. You'll notice the x axis of your histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top, which you may not be expecting if you want the histogram to look like the plot of the values vs. the index (or some increasing x). It means that there are more values in points around 2 than around 1. See how the curve of plot(points ~ x) flattens when the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values (whether they're at the low end or the high end of that curve) are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
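To see the same point in code (a small sketch, not part of the original answer): a large resample of the values in points has essentially the same histogram as points itself, which is exactly what sample() is reproducing:
# resampling the values of points (with replacement) reproduces hist(points, 100)
hist(sample(points, 1e5, replace = TRUE), breaks = 100)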
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20) {
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
}
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for(i in 2:N) {
    # random-walk proposal around the previous draw
    pr <- out[i - 1] + rnorm(1, ...)
    # Metropolis acceptance ratio (the normalizing constant cancels)
    r <- fun(pr) / fun(out[i - 1])
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}
It starts from the point init and returns N draws. The approach can be improved in many ways, but here I will simply start from init = 5, discard a burn-in period of 20000 draws, and keep every second draw to reduce autocorrelation:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
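As a quick visual check (a sketch, not part of the original answer): overlay the normalized target density on the estimated density of the draws. The normalizing constant is obtained with integrate (the same value, about 11.5, appears in another answer further below):
konst <- integrate(fun, -Inf, Inf)$value        # normalizing constant of fun
curve(fun(x) / konst, from = 0, to = 20, add = TRUE, col = "red")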
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with( environment(ecd.points), approxfun(y,x) )
samp.inv.ecd <- function(n=100) invecdfpts( runif(n) )
plot(density (samp.inv.ecd(100) ) )
plot(density(points) )
png(); layout(matrix(1:2,1)); plot(density (samp.inv.ecd(100) ),main="The Sample" )
plot(density(points) , main="The Original"); dev.off()
Here's another way to do it that draws from R: Generate data from a probability density distribution and How to create a distribution function in R?:
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)
f <- function (x) (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1))
C <- integrate(f,-Inf,Inf)
C$value
# [1] 11.50361
# normalize by C$value
f <- function (x)
  (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)/11.50361)
# pdf was not defined above; one way to build it is a grid of x and the normalized density
pdf <- data.frame(x = x, y = f(x))
random.points <- approx(cumsum(pdf$y)/sum(pdf$y),pdf$x,runif(10000))$y
hist(random.points,1000)
hist((random.points*40), 1000) will give you the scaling of your original function.
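A quick visual check (a sketch, not part of the original answer): on the density scale the sample should roughly follow the normalized f, apart from the truncation of the x grid at 1:
hist(random.points, 100, freq = FALSE)
lines(x, f(x), col = "red")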

How to draw $\alpha$ confidence areas on a 2D plot?

There are a lot of answers about plotting confidence intervals.
I'm reading the paper by Lourme A. et al. (2016) and I'd like to draw the 90% confidence boundary and the 10% exceptional points like in Fig. 2 of the paper.
I can't use LaTeX here, so I inserted a picture with the definition of the confidence areas (this is the equation referred to in the question below):
library("MASS")
library(copula)
set.seed(612)
n <- 1000 # length of sample
d <- 2 # dimension
# random vector with uniform margins on (0,1)
u1 <- runif(n, min = 0, max = 1)
u2 <- runif(n, min = 0, max = 1)
u = matrix(c(u1, u2), ncol=d)
Rg <- cor(u) # d-by-d correlation matrix
Rg1 <- ginv(Rg) # inv. matrix
# round(Rg %*% Rg1, 8) # check
# the multivariate c.d.f of u is a Gaussian copula
# with parameter Rg[1,2]=0.02876654
normal.cop = normalCopula(Rg[1,2], dim=d)
fit.cop = fitCopula(normal.cop, u, method="itau") #fitting
# Rg.hat = fit.cop@estimate[1]
# [1] 0.03097071
sim = rCopula(n, normal.cop) # in (0,1)
# Taking the quantile function of N1(0, 1)
y1 <- qnorm(sim[,1], mean = 0, sd = 1)
y2 <- qnorm(sim[,2], mean = 0, sd = 1)
par(mfrow=c(2,2))
plot(y1, y2, col="red"); abline(v=mean(y1), h=mean(y2))
plot(sim[,1], sim[,2], col="blue")
hist(y1); hist(y2)
Reference.
Lourme, A., F. Maurer (2016) Testing the Gaussian and Student's t copulas in a risk management framework. Economic Modelling.
Question. Could anyone help me and explain the variables v = (v_1,...,v_d) and G(v_1),..., G(v_d) in the equation?
I think v is a non-random matrix; its dimensions should be $k^2$ (grid points) by d = 2 (dimensions). For example,
axis_x <- seq(0, 1, 0.1) # 11 grid points
axis_y <- seq(0, 1, 0.1) # 11 grid points
v <- expand.grid(axis_x, axis_y)
plot(v, type = "p")
So, your question is about the vector nu and the corresponding G(nu).
nu is simply a random vector drawn from any distribution with domain (0,1) (here I use the uniform distribution). Since you want your samples in 2D, a single nu can be nu = runif(2). Given the explanations above, G is the quantile function (inverse CDF) of the standard normal, applied componentwise, and Rg is the correlation matrix (2x2 in 2D).
Now what the paragraph says: if you have a random sample nu and you want to check whether it falls in the confidence area Gamma, given the number of dimensions d and the confidence level alpha, then you need to compute the statistic (G(nu) %*% Rg^-1) %*% G(nu) and check that it is below the alpha quantile of the Chi^2 distribution with d degrees of freedom.
For example:
# This is the copula parameter: a symmetric 2x2 correlation matrix
rho <- runif(1)
Rg <- matrix(c(1, rho, rho, 1), ncol = 2)
# But we need its inverse for the statistic
Rginv <- MASS::ginv(Rg)
sampleResult <- replicate(10000, {
  # we draw our nu from the uniform, but others that map to (0,1), e.g. beta, are possible too
  nu <- runif(2)
  # we compute G(nu), the standard normal quantile of the sample
  Gnu <- qnorm(nu, mean = 0, sd = 1)
  # then we compute the statistic as given in the formula
  stat <- (Gnu %*% Rginv) %*% Gnu
  # and return the result
  list(nu = nu, Gnu = Gnu, stat = stat)
})
theSamples <- sapply(sampleResult["nu",], identity)
# this is the critical value of the Chi^2 with alpha = 0.95 and df = number of dimensions
# old and buggy threshold <- pchisq(0.95, df = 2)
# new and awesome - we are looking for the statistic at alpha = .95 quantile
threshold <- qchisq(0.95, df = 2)
# we can accept samples given the threshold (like in equation)
inArea <- sapply(sampleResult["stat",], identity) < threshold
plot(t(theSamples), col = as.integer(inArea)+1)
The red points are the points you would keep (I plot all points here).
As for drawing the decision boundaries, I think it is a little bit more complicated, since you need to compute the exact pairs of nu for which (Gnu %*% Rginv) %*% Gnu == qchisq(alpha, df = 2). That is a quadratic equation in Gnu (an ellipse in the Gnu space); you solve it for Gnu and then apply the inverse transform to get your nu on the decision boundary. (The contour-based approach in the next answer does essentially this numerically.)
edit: Reading the paragraph again, I noticed that the parameters for Gnu do not change; it is simply Gnu <- qnorm(nu, mean = 0, sd = 1).
edit: There was a bug: for the threshold you need to use the quantile function qchisq instead of the distribution function pchisq; this is now corrected in the code above (and the figures updated).
This has two parts: first, compute the copula value as a function of X and Y; then, plot the curve giving the boundary where the copula exceeds the threshold.
Computing the value is basically linear algebra, which @drey has answered. This is a rewritten version so that the copula is given by a function.
cop1 <- function(x)
{
  Gnu <- qnorm(x)
  Gnu %*% Rginv %*% Gnu
}

copula <- function(x)
{
  apply(x, 1, cop1)
}
Plotting the boundary curve can be done using the same method as here (which in turn is the method used by the textbooks Modern Applied Stats with S, and Elements of Stat Learning). Create a grid of values, and use interpolation to find the contour line at the given height.
rho <- runif(1)
Rg <- matrix(c(1, rho, rho, 1), ncol = 2)   # symmetric correlation matrix
Rginv <- MASS::ginv(Rg)
# draw the contour line where value == threshold
# define a grid of values first: avoid x and y = 0 and 1, where infinities exist
xlim <- 1e-3
delta <- 1e-3
xseq <- seq(xlim, 1-xlim, by=delta)
grid <- expand.grid(x=xseq, y=xseq)
prob.grid <- copula(grid)
threshold <- qchisq(0.95, df=2)
contour(x=xseq, y=xseq, z=matrix(prob.grid, nrow=length(xseq)), levels=threshold,
        col="grey", drawlabels=FALSE, lwd=2)
# add some points
data <- data.frame(x=runif(1000), y=runif(1000))
points(data, col=ifelse(copula(data) < threshold, "red", "black"))

Extract approximate probability density function (pdf) in R from random sampling

I have n > 2 independent continuous random variables (RVs). For example, say I have 4 uniform RVs with different sets of upper and lower limits:
W ~ U[-1,5], X ~ U[0,1], Y ~ U[0,2], Z ~ U[0.5,2]
I am trying to find the approximate PDF of the sum of these RVs, i.e. of T = W + X + Y + Z. As I don't need a closed-form solution, I have sampled 1 million points from each of them to get 1 million samples of T. Is it possible in R to get an approximate PDF, or a way to get the approximate probability P(t < T), from the samples I have drawn? For example, is there an easy way to calculate P(0.5 < T) in R? My priority is to get the probability first, even if getting the density function is not possible.
Thanks
Consider the ecdf function:
set.seed(123)
W <- runif(1e6, -1, 5)
X <- runif(1e6, 0, 1)
Y <- runif(1e6, 0, 2)
Z <- runif(1e6, 0.5, 2)
T <- Reduce(`+`, list(W, X, Y, Z))
cdfT <- ecdf(T)
1 - cdfT(0.5) # Pr(T > 0.5)
# [1] 0.997589
See How to calculate cumulative distribution in R? for more details.
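Since the question also asks about an approximate PDF, a kernel density estimate of the simulated T is one option (a sketch, not part of the original answer):
denT <- density(T)                      # approximate pdf of T = W + X + Y + Z
plot(denT)
approx(denT$x, denT$y, xout = 0.5)$y    # approximate density at t = 0.5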

Observation in a bivariate Ellipse

I am trying to find the probability that a point lies within an ellipse.
For example, if I plot the bivariate data (x, y) for 300 data points with a 95% ellipsoid region, how do I calculate how many of the 300 points will fall inside the ellipse?
Here's the code I am using:
library(MASS)
set.seed(1234)
x <- NULL
k <- 1
Sigma2 <- matrix(c(.72,.57,.57,.46),2,2)
Sigma2
rho <- Sigma2[1,2]/sqrt(Sigma2[1,1]*Sigma2[2,2])
rho
eta1 <- replicate(300, mvrnorm(k, mu=c(-1.59,-2.44), Sigma2))
library(car)
dataEllipse(eta1[1,],eta1[2,], levels=c(0.05, 0.95))
Thanks for your help.
I don't see why people are jumping on the OP. In context, it's clearly a programming question: it's about getting the empirical frequency of data points within a given ellipse, not a theoretical probability. The OP even posted code and a graph showing what they're trying to obtain.
It may be that they don't fully understand the statistical theory behind a 95% ellipse, but they didn't ask about that. Besides, making plots and calculating frequencies like this is an excellent way of coming to grips with the theory.
Anyway, here's some code that answers the narrowly-defined question of how to count the points within an ellipse obtained via a normal distribution (which is what underlies dataEllipse). The idea is to transform your data to the unit circle via principal components, then get the points within a certain radius of the origin.
within.ellipse <- function(x, y, plot.ellipse=TRUE)
{
  if(missing(y) && is.matrix(x) && ncol(x) == 2)
  {
    y <- x[,2]
    x <- x[,1]
  }
  if(plot.ellipse)
    dataEllipse(x, y, levels=0.95)
  # transform to standardized principal component scores
  d <- scale(prcomp(cbind(x, y), scale.=TRUE)$x)
  # radius of the 95% normal-theory ellipse on that scale
  rad <- sqrt(2 * qf(.95, 2, nrow(d) - 1))
  # proportion of points inside the ellipse
  mean(sqrt(d[,1]^2 + d[,2]^2) < rad)
}
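As a usage sketch with the simulated data from the question (assuming eta1 as defined there):
# proportion of the 300 simulated points falling inside their own 95% ellipse
within.ellipse(eta1[1,], eta1[2,])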
It was also commented that a 95% data ellipse contains 95% of the data by definition. This is certainly not true, at least for normal-theory ellipses. If your distribution is particularly bad, the coverage frequency may not even converge to the assumed level as the sample size increases. Consider a generalised Pareto distribution, for example:
library(evd) # for rgpd
# generalised pareto has no variance for shape > 0.5
z <- sapply(1:1000, function(...) within.ellipse(rgpd(100, shape=5), rgpd(100, shape=5), FALSE))
mean(z)
[1] 0.97451
z <- sapply(1:1000, function(...) within.ellipse(rgpd(10000, shape=5), rgpd(10000, shape=5), FALSE))
mean(z)
[1] 0.9995808
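For contrast, a sketch (not part of the original answer) with well-behaved normal data, where the coverage frequency does sit close to the nominal 95%:
z <- sapply(1:200, function(...) within.ellipse(rnorm(1000), rnorm(1000), FALSE))
mean(z)   # close to 0.95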
