Calculate the volume under a plot of kernel bivariate density estimation - r

I need to calculate a measure called mutual information. First of all, I need to calculate another measure, called entropy, for example, the joint entropy of x and y:
-∬p(x,y)·log p(x,y)dxdy
So, to calculate p(x,y), I used a kernel density estimator (the function kde2d), and it returned the Z values (the probability of having x and y in that window).
So by now I have a matrix of Z values, [1x100] x [1x100], which is my p(x,y). But I have to integrate it by finding the volume under the surface (a double integral), and I didn't find a way to do that. The function quad2d, which computes a double quadrature, didn't work, because I was only integrating a numerical matrix p(x,y), and it gives me a constant....
Anyone knows something to find that volume/calculate the double integral?
The image of the plot from persp3d:
Thanks everybody !!!!

Once you have the results from kde2d, it is very straightforward to compute a numerical integral. The example session below sketches how to do it.
As you know, a numerical double integral is just a 2D summation. By default, kde2d takes range(x) and range(y) as the 2D domain. I see that you got a 100 * 100 matrix, so I think you have set n = 100 when calling kde2d. Now, den$x and den$y define a 100 * 100 grid, with den$z giving the density on each grid cell. It is easy to compute the size of each grid cell (they are all equal), and then we do three steps:
find the normalizing constant; in theory the density integrates to 1, but after discretization on a grid the sum only approximates 1. So we first compute this normalizing constant for later rescaling;
the integrand for entropy is -z * log(z); since z is a 100 * 100 matrix, this is also a matrix. You simply sum its entries and multiply the sum by the cell size cell_size to get a non-normalized entropy;
rescale the non-normalized entropy to get a normalized one.
## sample data: bivariate normal, with covariance/correlation 0
set.seed(123); x <- rnorm(1000, 0, 2) ## marginal variance: 4
set.seed(456); y <- rnorm(1000, 0, 2) ## marginal variance: 4
## load MASS
library(MASS)
## domain:
xlim <- range(x)
ylim <- range(y)
## 2D Kernel Density Estimation
den <- kde2d(x, y, n = 100, lims = c(xlim, ylim))
##persp(den$x,den$y,den$z)
z <- den$z ## extract density
## den$x, den$y expands a 2D grid, with den$z being density on each grid cell
## numerical integration is straightforward, by aggregation over all cells
## the size of each grid cell (a rectangular cell) is:
cell_size <- (diff(xlim) / 100) * (diff(ylim) / 100)
## normalizing constant; ideally should be 1, but actually only close to 1 due to discretization
norm <- sum(z) * cell_size
## your integrand: z * log(z) * (-1):
integrand <- z * log(z) * (-1)
## get numerical integral by summation:
entropy <- sum(integrand) * cell_size
## self-normalization:
entropy <- entropy / norm
Verification
The above code gives entropy of 4.230938. Now, Wikipedia - Multivariate normal distribution gives entropy formula:
(k / 2) * (1 + log(2 * pi)) + (1 / 2) * log(det(Sigma))
For the above bivariate normal distribution, we have k = 2. We have Sigma (covariance matrix):
4 0
0 4
whose determinant is 16. Hence, the theoretical value is:
(1 + log(2 * pi)) + (1 / 2) * log(16) = 4.224171
Good match!
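Since the end goal in the question is mutual information, here is a minimal sketch of how the same grid sums could be combined into I(x, y) = H(x) + H(y) - H(x, y). It is not part of the original answer: the names dx, dy, p, px, py and plogp are my own, the grid spacing is read from den$x and den$y rather than from diff(range)/100, and the marginals are obtained by summing the gridded joint density, which is an assumption about how you want to estimate them.

```r
library(MASS)  # provides kde2d

set.seed(123); x <- rnorm(1000, 0, 2)
set.seed(456); y <- rnorm(1000, 0, 2)

den <- kde2d(x, y, n = 100)
dx <- diff(den$x[1:2])   # grid spacing in x
dy <- diff(den$y[1:2])   # grid spacing in y

## renormalise so that the gridded density integrates to exactly 1
p <- den$z / (sum(den$z) * dx * dy)

## marginal densities: integrate out the other variable
## (rows of den$z index x, columns index y)
px <- rowSums(p) * dy
py <- colSums(p) * dx

## d * log(d), with cells of zero density contributing nothing
plogp <- function(d) ifelse(d > 0, d * log(d), 0)

H_xy <- -sum(plogp(p)) * dx * dy   # joint entropy
H_x  <- -sum(plogp(px)) * dx       # marginal entropy of x
H_y  <- -sum(plogp(py)) * dy       # marginal entropy of y

mutual_info <- H_x + H_y - H_xy
```

For the independent samples used here, mutual_info should come out close to zero; because the marginals are derived from the same gridded joint density, the estimate cannot go negative apart from rounding.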

Related

How can I find LGCP random field Lambda values in overall area?

There is an rLGCP model example that requires the RandomFields package.
if(require(RandomFields)) {
# homogeneous LGCP with exponential covariance function
X <- rLGCP("exp", 3, var=0.2, scale=.1)
# inhomogeneous LGCP with Gaussian covariance function
m <- as.im(function(x, y){5 - 1.5 * (x - 0.5)^2 + 2 * (y - 0.5)^2}, W=owin())
X <- rLGCP("gauss", m, var=0.15, scale =0.5)
plot(attr(X, "Lambda"))
points(X)
}
I think that the Lambda attribute of X does not show the overall values in the overall two dimensional area.
How can I find the overall Lambda values in overall area?
I'm not entirely sure if this is what you are looking for, but the matrix of Lambda values for each point in the plot is stored in the Lambda attribute of the object created by spatstat::rLGCP.
You can access them like this:
m <- as.im(function(x, y){5 - 1.5 * (x - 0.5)^2 + 2 * (y - 0.5)^2}, W=owin())
X <- rLGCP("gauss", m, var=0.15, scale = 0.5)
lambda_matrix <- attr(X, "Lambda")$v
Now lambda_matrix is a 128 x 128 matrix containing the value of Lambda at each point on the grid.

Cholesky Decomposition of a random exponential correlation matrix in R

I have a set of exponential correlation matrices created using the following code.
x=runif(n)
R=matrix(0,n,n)
for (j in 1:n)
{
for(k in 1:n)
{
R[j,k]=exp(-(x[j]-x[k])^2);
}
}
and now I want to get their Cholesky decomposition. But many of these turn out not to be positive definite. How could I resolve this?
The exponential correlation matrix used in spatial or temporal modeling, has a factor alpha that controls the speed of decay:
exp(- alpha * (x[i] - x[j]) ^ 2)
You have fixed such factor at 1. But in practice, such factor is estimated from data.
Note that alpha is necessary to ensure numerical positive definiteness. This matrix is in principle positive definite, but numerically not if alpha is not large enough for a fast decay.
Given that x <- runif(n, 0, 1), the distance between x[i] and x[j] is clustered in a short range [0, 1]. This is not a big range to see a decay in correlation, and maybe you want to try alpha = 10000.
Alternatively if you want to stay with alpha = 1, you need to make distance more spread out. Try x <- runif(n, 0, 100). The decay is very fast, even with alpha = 1.
So we see a duality between distance and alpha. This is also the reason why such correlation matrix can be used stably in statistical modeling. When alpha is to be estimated, it can be made adaptive to distance, so that the correlation matrix is always positive definite.
Examples:
f <- function (xi, xj, alpha) exp(- alpha * (xi - xj) ^ 2)
n <- 100
# large alpha, small distance
x <- runif(n, 0, 1)
A <- outer(x, x, f, alpha = 10000)
R <- chol(A)
# small alpha, large distance
x <- runif(n, 0, 100)
A <- outer(x, x, f, alpha = 1)
R <- chol(A)
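To see the numerical issue directly, one can look at the smallest eigenvalue of the matrix for several values of alpha and check whether chol() goes through. This is only a sketch: the helper chol_ok and the particular alpha values are my own choices for illustration.

```r
f <- function(xi, xj, alpha) exp(-alpha * (xi - xj)^2)

## TRUE if chol() succeeds, FALSE if it stops with an error
chol_ok <- function(A) !inherits(try(chol(A), silent = TRUE), "try-error")

set.seed(1)
n <- 100
x <- runif(n, 0, 1)

for (alpha in c(1, 100, 10000)) {
  A <- outer(x, x, f, alpha = alpha)
  ## smallest eigenvalue: values at or below rounding error mean trouble
  lam_min <- min(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
  cat(sprintf("alpha = %5g  smallest eigenvalue = % .3e  chol ok: %s\n",
              alpha, lam_min, chol_ok(A)))
}
```

With distances confined to [0, 1], a small alpha leaves the smallest eigenvalue at the level of floating-point noise, which is why the factorization fails even though the matrix is positive definite in exact arithmetic.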
Try using this to construct a positive definite matrix:
A<-matrix(runif(n^2),n,n)
dim(A)
A<-A%*%t(A)
chol(A)

How to interpolate those signal data with a polynomial?

I am trying to find the coefficients of a polynomial in R, but I am not sure of which order the polynomial is.
I have data:
x=seq(6, 174, by=8)
y=rep(c(-1,1),11)
Now I want to find the (obviously) non-linear function that passes through all these points. Function values should still lie in the interval [-1,1], and all these points should be understood as the vertex of a parabola.
EDIT
Actually this is not example data, I just need exactly this function for exactly these points.
I tried to describe it with polynomials up to degree 25 and then gave up, with polynomials it seems that it is only possible to approximate the curve but not to get it directly.
Comments suggested using a sine curve. Does someone know how to get the exact trigonometric function?
Your data have a strong characteristic that they are sampled from a sinusoid signal. With restriction that y is constrained onto [-1,1], we know for sure the amplitude is 1, so let's assume we want a sin function:
y = sin((2 * pi / T) * x + phi)
where T is period and phi is phase. The period of your data is evident: 2 * 8 = 16. To get phi, just use the fact that when x = 6, y = -1. That is
sin(12 * pi / T + phi) = -1
which gives one solution: phi = -pi/2 - 12 * pi / T.
Here we go:
T <- 16
phi <- -pi/2 - 12 * pi / T
f <- function(x) sin(x * pi / 8 + phi)
plot(x, y)
x0 <- seq(6, 174, by = 0.2)
y0 <- f(x0)
lines(x0, y0, col = 2)
Your original intention to have a polynomial is not impossible, but it can't be an ordinary polynomial. An ordinary polynomial is unbounded. It will tend to Inf or -Inf when x tends to Inf or -Inf.
A local polynomial is possible. Since you say that all these points should be understood as the vertex of a parabola, you seem to expect a smooth function. Then a cubic spline is ideal. Specifically, we don't want a natural cubic spline but a periodic cubic spline. The spline function from the stats package can help us:
int <- spline(x[-1], y[-1], method = "periodic", xout = x0)
Note, I have dropped the first datum, as with "periodic" method, spline wants y to have the same value on both ends. Once we drop the first datum, y values are 1 on both sides.
plot(x, y)
lines(int, col = 2)
I did not compare the spline interpolation with the sinusoid function. They can't be exactly the same, but in statistical modelling we can use either one to model the underlying cyclic signal / effect.
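For anyone curious about that comparison, a quick check along the following lines should do. It is a sketch, not part of the original answer: the names T_period, max_gap and hits_data are mine, and the evaluation grid is restricted to [14, 174] so the spline is only evaluated inside its knots.

```r
x <- seq(6, 174, by = 8)
y <- rep(c(-1, 1), 11)

## sinusoid with period 16 and phase fixed by f(6) = -1
T_period <- 16
phi <- -pi / 2 - 12 * pi / T_period
f <- function(x) sin(2 * pi * x / T_period + phi)

## periodic cubic spline through the same points (first datum dropped)
x0 <- seq(14, 174, by = 0.2)
int <- spline(x[-1], y[-1], method = "periodic", xout = x0)

max_gap <- max(abs(int$y - f(x0)))   # largest pointwise disagreement
hits_data <- all.equal(f(x), y)      # does the sinusoid reproduce every datum?
```

Here hits_data confirms that the sinusoid passes through every data point, and max_gap shows that the two curves agree closely between the points without being identical.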

Maximum Likelihood Estimation for three-parameter Weibull distribution in r

I want to estimate the scale, shape and threshold parameters of a 3p Weibull distribution.
What I've done so far is the following:
Referring to this post, Fitting a 3 parameter Weibull distribution in R
I've used the functions
EPS = sqrt(.Machine$double.eps) # "epsilon" for very small numbers
llik.weibull <- function(shape, scale, thres, x)
{
sum(dweibull(x - thres, shape, scale, log=T))
}
thetahat.weibull <- function(x)
{
if(any(x <= 0)) stop("x values must be positive")
toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
mu = mean(log(x))
sigma2 = var(log(x))
shape.guess = 1.2 / sqrt(sigma2)
scale.guess = exp(mu + (0.572 / shape.guess))
thres.guess = 1
res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower=EPS)
c(shape=res$par[1], scale=res$par[2], thres=res$par[3])
}
to "pre-estimate" my Weibull parameters, such that I can use them as initial values for the argument "start" in the "fitdistr" function of the MASS-Package.
You might ask why I want to estimate the parameters twice... reason is that I need the variance-covariance-matrix of the estimates which is also estimated by the fitdistr function.
EXAMPLE:
set.seed(1)
thres <- 450
dat <- rweibull(1000, 2.78, 750) + thres
pre_mle <- thetahat.weibull(dat)
my_wb <- function(x, shape, scale, thres) {
dweibull(x - thres, shape, scale)
}
ml <- fitdistr(dat, densfun = my_wb, start = list(shape = round(pre_mle[1], digits = 0), scale = round(pre_mle[2], digits = 0),
thres = round(pre_mle[3], digits = 0)))
ml
> ml
shape scale thres
2.942548 779.997177 419.996196
( 0.152129) ( 32.194294) ( 28.729323)
> ml$vcov
shape scale thres
shape 0.02314322 4.335239 -3.836873
scale 4.33523868 1036.472551 -889.497580
thres -3.83687258 -889.497580 825.374029
This works quite well for cases where the shape parameter is above 1. Unfortunately my approach should deal with the cases where the shape parameter could be smaller than 1.
The reason why this is not possible for shape parameters that are smaller than 1 is described here: http://www.weibull.com/hotwire/issue148/hottopics148.htm
in Case 1, All three parameters are unknown the following is said:
"Define the smallest failure time of ti to be tmin. Then when γ → tmin, ln(tmin - γ) → -∞. If β is less than 1, then (β - 1)ln(tmin - γ) goes to +∞ . For a given solution of β, η and γ, we can always find another set of solutions (for example, by making γ closer to tmin) that will give a larger likelihood value. Therefore, there is no MLE solution for β, η and γ."
This makes a lot of sense. For this very reason I want to do it the way they described it on this page.
"In Weibull++, a gradient-based algorithm is used to find the MLE solution for β, η and γ. The upper bound of the range for γ is arbitrarily set to be 0.99 of tmin. Depending on the data set, either a local optimal or 0.99tmin is returned as the MLE solution for γ."
I want to set a feasible interval for gamma (in my code called 'thres') such that the solution is between (0, .99 * tmin).
Does anyone have an idea how to solve this problem?
In the function fitdistr there seems to be no opportunity doing a constrained MLE, constraining one parameter.
Another way to go could be the estimation of the asymptotic variance via the outer product of the score vectors. The score vector could be taken from the function thetahat.weibull(x) used above. But calculating the outer product manually (without a function) seems to be very time consuming and does not solve the problem of the constrained ML estimation.
Best regards,
Tim
It's not too hard to set up a constrained MLE. I'm going to do this in bbmle::mle2; you could also do it in stats4::mle, but bbmle has some additional features.
The larger issue is that it's theoretically difficult to define the sampling variance of an estimate when it's on the boundary of the allowed space; the theory behind Wald variance estimates breaks down. You can still calculate confidence intervals by likelihood profiling ... or you could bootstrap. I ran into a variety of optimization issues when doing this ... I haven't really thought about whether there are specific reasons.
Reformat three-parameter Weibull function for mle2 use (takes x as first argument, takes log as an argument):
dweib3 <- function(x, shape, scale, thres, log=TRUE) {
dweibull(x - thres, shape, scale, log=log)
}
Starting function (slightly reformatted):
weib3_start <- function(x) {
mu <- mean(log(x))
sigma2 <- var(log(x))
logshape <- log(1.2 / sqrt(sigma2))
logscale <- mu + (0.572 / logshape)
logthres <- log(0.5*min(x))
list(logshape = logshape, logsc = logscale, logthres = logthres)
}
Generate data:
set.seed(1)
dat <- data.frame(x=rweibull(1000, 2.78, 750) + 450)
Fit model: I'm fitting the parameters on the log scale for convenience and stability, but you could use boundaries at zero as well.
tmin <- log(0.99*min(dat$x))
library(bbmle)
m1 <- mle2(x~dweib3(exp(logshape),exp(logsc),exp(logthres)),
data=dat,
upper=c(logshape=Inf,logsc=Inf,
logthres=tmin),
start=weib3_start(dat$x),
method="L-BFGS-B")
vcov(m1), which should normally provide a variance-covariance estimate (unless the estimate is on the boundary, which is not the case here) gives NaN values ... not sure why without more digging.
library(emdbook)
tmpf <- function(x,y) m1@minuslogl(logshape=x,
logsc=coef(m1)["logsc"],
logthres=y)
tmpf(1.1,6)
s1 <- curve3d(tmpf,
xlim=c(1,1.2),ylim=c(5.9,tmin),sys3d="image")
with(s1,contour(x,y,z,add=TRUE))
h <- lme4:::hessian(function(x) do.call(m1@minuslogl,as.list(x)),coef(m1))
vv <- solve(h)
diag(vv) ## [1] 0.002672240 0.001703674 0.004674833
(se <- sqrt(diag(vv))) ## standard errors
## [1] 0.05169371 0.04127558 0.06837275
cov2cor(vv)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8852090 -0.8778424
## [2,] 0.8852090 1.0000000 -0.9616941
## [3,] -0.8778424 -0.9616941 1.0000000
This is the variance-covariance matrix of the log-scaled variables. If you want to convert to the variance-covariance matrix on the original scale, you need to scale by (x_i)*(x_j) (i.e. by the derivatives of the transformation exp(x)).
outer(exp(coef(m1)),exp(coef(m1))) * vv
## logshape logsc logthres
## logshape 0.02312803 4.332993 -3.834145
## logsc 4.33299307 1035.966372 -888.980794
## logthres -3.83414498 -888.980794 824.831463
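The rescaling above is the delta method for the transformation exp(). As a small self-contained illustration, the sketch below rebuilds the log-scale matrix from the standard errors and correlations printed earlier and applies J V J' with J = diag(exp(coef)). The back-transformed estimates in est are rounded, illustrative values that I assumed for this example; they are not taken from coef(m1).

```r
## log-scale standard errors and correlations as reported above
se_log  <- c(logshape = 0.05169371, logsc = 0.04127558, logthres = 0.06837275)
cor_log <- matrix(c( 1.0000000,  0.8852090, -0.8778424,
                     0.8852090,  1.0000000, -0.9616941,
                    -0.8778424, -0.9616941,  1.0000000), 3, 3)
vv <- outer(se_log, se_log) * cor_log   # covariance on the log scale

## hypothetical, rounded values standing in for exp(coef(m1))
est <- c(shape = 2.94, sc = 780, thres = 420)

## delta method: J V J' with J = diag(d exp(theta) / d theta) = diag(est)
J <- diag(est)
vcov_orig <- J %*% vv %*% t(J)
dimnames(vcov_orig) <- list(names(est), names(est))

## the elementwise shortcut gives the same matrix
same <- all.equal(unname(vcov_orig), unname(outer(est, est) * vv))
```

Because J is diagonal, the matrix product collapses to the elementwise form outer(est, est) * vv, which is exactly the one-liner used above.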
I don't know why this doesn't work with numDeriv - would be very careful with variance estimates above. (Maybe too close to boundary for Richardson extrapolation to work?)
library(numDeriv)
hessian()
grad(function(x) do.call(m1@minuslogl,as.list(x)),coef(m1)) ## looks OK
vcov(m1)
The profiles look OK ... (we have to supply std.err because the Hessian isn't invertible)
pp <- profile(m1,std.err=c(0.01,0.01,0.01))
par(las=1,bty="l",mfcol=c(1,3))
plot(pp,show.points=TRUE)
confint(pp)
## 2.5 % 97.5 %
## logshape 0.9899645 1.193571
## logsc 6.5933070 6.755399
## logthres 5.8508827 6.134346
Alternately, we can do this on the original scale ... one possibility would be to use the log-scaling to fit, then refit starting from those parameters on the original scale.
wstart <- as.list(exp(unlist(weib3_start(dat$x))))
names(wstart) <- gsub("log","",names(wstart))
m2 <- mle2(x~dweib3(shape,sc,thres),
data=dat,
lower=c(shape=0.001,sc=0.001,thres=0.001),
upper=c(shape=Inf,sc=Inf,
thres=exp(tmin)),
start=wstart,
method="L-BFGS-B")
vcov(m2)
## shape sc thres
## shape 0.02312399 4.332057 -3.833264
## sc 4.33205658 1035.743511 -888.770787
## thres -3.83326390 -888.770787 824.633714
all.equal(unname(coef(m2)),unname(exp(coef(m1))),tol=1e-4)
About the same as the values above.
We can fit with a small shape, if we are a little more careful to bound the parameters, but now we end up on the boundary for the threshold, which will cause lots of problems for the variance calculations.
set.seed(1)
dat <- data.frame(x = rweibull(1000, .53, 365) + 100)
tmin <- log(0.99 * min(dat$x))
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
lower=c(logshape=-10,logsc=0,logthres=0),
upper = c(logshape = 20, logsc = 20, logthres = tmin),
data = dat,
start = weib3_start(dat$x), method = "L-BFGS-B")
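If you would rather not depend on bbmle, roughly the same box-constrained fit can be sketched with optim from base R. This is a substitute technique, not what the answer above uses: the function nll, the starting values and the box bounds are all assumptions chosen for illustration, with the only deliberate choice being the upper limit of 0.99 * min(x) on the threshold.

```r
set.seed(1)
x <- rweibull(1000, shape = 0.53, scale = 365) + 100

## negative log-likelihood; par holds log(shape), log(scale), log(thres)
nll <- function(par, x) {
  -sum(dweibull(x - exp(par[3]), exp(par[1]), exp(par[2]), log = TRUE))
}

## rough starting values (assumed, not tuned)
start <- c(log(1.2 / sd(log(x))), mean(log(x)), log(0.5 * min(x)))

## box constraints on the log scale; thres is capped at 0.99 * min(x)
fit <- optim(start, nll, x = x, method = "L-BFGS-B",
             lower = log(c(0.05, 1, 1e-3)),
             upper = log(c(20, 1e6, 0.99 * min(x))))

est <- setNames(exp(fit$par), c("shape", "sc", "thres"))

## TRUE when the threshold estimate sits on its upper bound
at_bound <- isTRUE(all.equal(unname(est["thres"]), 0.99 * min(x)))
```

Checking at_bound before trusting any Hessian-based standard errors is worthwhile, since the Wald theory mentioned above does not apply when the estimate sits on the boundary.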
For censored data, you need to replace dweibull with pweibull; see Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf for some hints.
Another possible solution is to do Bayesian inference. Using scale priors on the shape and scale parameters and a uniform prior on the location parameter, you can easily run Metropolis-Hastings as follows. It might be advisable to reparameterize in terms of log(shape), log(scale) and log(y_min - location) because the posterior for some of the parameters becomes strongly skewed, in particular for the location parameter. Note that the output below shows the posterior for the backtransformed parameters.
library(MCMCpack)
logposterior <- function(par,y) {
gamma <- min(y) - exp(par[3])
sum(dweibull(y-gamma,exp(par[1]),exp(par[2]),log=TRUE)) + par[3]
}
y <- rweibull(100,shape=.8,scale=10) + 1
chain0 <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=.01*diag(3))
chain <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=var(chain0))
plot(exp(chain))
summary(exp(chain))
This produces the following output
#########################################################
The Metropolis acceptance rate was 0.43717
#########################################################
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] 0.81530 0.06767 0.0004785 0.001668
[2,] 10.59015 1.39636 0.0098738 0.034495
[3,] 0.04236 0.05642 0.0003990 0.001174
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
var1 0.6886083 0.768054 0.81236 0.8608 0.9498
var2 8.0756210 9.637392 10.50210 11.4631 13.5353
var3 0.0003397 0.007525 0.02221 0.0548 0.1939
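To make it clearer what the sampler is doing, here is a hand-rolled random-walk Metropolis loop in base R for the same log posterior. It is a simplified stand-in for MCMCmetrop1R, not a reimplementation of it: the seed, the proposal standard deviation of 0.1, the 5000 iterations and the burn-in of 1000 are arbitrary choices of mine, and no tuning of the proposal covariance is attempted.

```r
set.seed(42)
y <- rweibull(100, shape = 0.8, scale = 10) + 1

## log posterior in par = (log shape, log scale, log(min(y) - location))
logposterior <- function(par, y) {
  gamma <- min(y) - exp(par[3])
  sum(dweibull(y - gamma, exp(par[1]), exp(par[2]), log = TRUE)) + par[3]
}

n_iter <- 5000
chain <- matrix(NA_real_, n_iter, 3)
cur <- rep(0, 3)
cur_lp <- logposterior(cur, y)
accepted <- 0

for (i in seq_len(n_iter)) {
  prop <- cur + rnorm(3, sd = 0.1)   # symmetric random-walk proposal
  prop_lp <- logposterior(prop, y)
  ## accept with probability min(1, posterior ratio)
  if (is.finite(prop_lp) && log(runif(1)) < prop_lp - cur_lp) {
    cur <- prop
    cur_lp <- prop_lp
    accepted <- accepted + 1
  }
  chain[i, ] <- cur                  # store current state (old one if rejected)
}

accept_rate <- accepted / n_iter

## back-transform after discarding burn-in; third column is min(y) - location
post <- exp(chain[-(1:1000), ])
colnames(post) <- c("shape", "scale", "min_minus_loc")
post_summary <- apply(post, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

The acceptance rate and the quantiles in post_summary play the same role as the MCMCpack output above, though the numbers will differ because the data and the proposal are not the same.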

Draw random numbers from restricted Pareto distribution

Am a newcomer to R and need advice on how to draw random numbers from a limited area of a Pareto Distribution with parameters s & beta. (System: Windows 7, R 2.15.2.)
(1) I have data in a vector data$t; each single data point I'll call data$tx
For these data the parameters s & beta of a Pareto distribution are estimated following https://stats.stackexchange.com/questions/27426/how-do-i-fit-a-set-of-data-to-a-pareto-distribution-in-r
pareto.MLE <- function(X)
{
n <- length(X)
m <- min(X)
a <- n/sum(log(X)-log(m))
return( c(m,a) )
}
(2) Now I need to draw as many random numbers (RndNew) from this Pareto distribution (s, beta, see (1)) as there are observations (= data points: data$tx). For the draw, the area from which random numbers are drawn must be limited to the area where RndNewx >= data$tx; in other words: RndNewx must never be smaller than the corresponding data$tx.
Question: how to tell R to restrict the area of a Pareto distribution from which to draw a random number to be RndNewx >= data$tx?
Thanks a million for any help!
The standard approach to sampling from a truncated distribution has three steps. Here's an example with the normal distribution so you can get the idea.
n <- 1000
lower_bound <- -1
upper_bound <- 1
Apply the CDF to your lower and upper bounds to find the quantiles of the edges of your distribution.
(quantiles <- pnorm(c(lower_bound, upper_bound)))
# [1] 0.1586553 0.8413447
Sample from a uniform distribution between those quantiles.
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
Apply the inverse CDF.
truncated_normal_random_numbers <- qnorm(uniform_random_numbers)
The CDF for the pareto distribution is
ppareto <- function(x, scale, shape)
{
ifelse(x > scale, 1 - (scale / x) ^ shape, 0)
}
And the inverse is
qpareto <- function(y, scale, shape)
{
ifelse(
y >= 0 & y <= 1,
scale * ((1 - y) ^ (-1 / shape)),
NaN
)
}
We can rework the above example to use these Pareto functions.
n <- 1000
scale <- 1
shape <- 1
lower_bound <- 2
upper_bound <- 10
(quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape))
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
truncated_pareto_random_numbers <- qpareto(uniform_random_numbers, scale, shape)
To make it easier to reuse this code, we can wrap it into a function. The lower and upper bounds have default values that match the range of the distribution, so if you don't pass values in, then you'll get a non-truncated Pareto distribution.
rpareto <- function(n, scale, shape, lower_bound = scale, upper_bound = Inf)
{
quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape)
uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
qpareto(uniform_random_numbers, scale, shape)
}
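The question actually asks for one draw per observation, each with its own lower bound. Since runif is vectorised over min, the same idea extends naturally. The function rpareto_above and the vector obs below are hypothetical names introduced for this sketch, with obs standing in for data$t; the scale and shape values are placeholders for the fitted parameters.

```r
ppareto <- function(x, scale, shape) ifelse(x > scale, 1 - (scale / x)^shape, 0)
qpareto <- function(y, scale, shape) {
  ifelse(y >= 0 & y <= 1, scale * (1 - y)^(-1 / shape), NaN)
}

## one draw per element of `lower`, each truncated below at its own bound
rpareto_above <- function(lower, scale, shape) {
  p_low <- ppareto(lower, scale, shape)            # CDF at each lower bound
  u <- runif(length(lower), min = p_low, max = 1)  # uniform on (p_low, 1)
  pmax(qpareto(u, scale, shape), lower)            # guard against rounding
}

set.seed(1)
obs <- c(1.5, 2, 3.7, 10, 25)   # hypothetical stand-in for data$t
new <- rpareto_above(obs, scale = 1, shape = 1.2)
all(new >= obs)
```

The pmax call is only a safety net: the inverse CDF applied to a value just above p_low could in principle land a hair below the bound because of floating-point rounding.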
