Confusion about 2-dimensional kernel density estimation in R

A kernel density estimator is used to estimate a probability density function (see the mvstat.net and scikit-learn docs for references).
My confusion is about what exactly kde2d() does. Does it estimate the joint probability density function f(a,b) of the two random variables in the example below? And what do the colours mean?
Here is the code example I am referring to.
library(MASS)  # kde2d() lives in the MASS package
b <- log10(rgamma(1000, 6, 3))
a <- log10((rweibull(1000, 8, 2)))
density <- kde2d(a, b, n=100)
colour_flow <- colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred'))
filled.contour(density, color.palette=colour_flow)

What is a kernel density estimator?
Essentially it fits a little normal density curve over every point (the center of the normal density being that point) of the data and then adds up all little normal densities to a kernel density estimator.
For the sake of illustration I will add an image of a 1 dimensional kernel density estimator from one of your links.
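As a minimal sketch of that construction (the sample x, the grid and the variable names are illustrative assumptions, not taken from the linked docs), the following builds a 1D estimate by hand from one normal curve per data point and compares it with R's density():

```r
set.seed(1)
x  <- rnorm(50)                       # illustrative sample
bw <- bw.nrd0(x)                      # default bandwidth rule of density()
grid <- seq(min(x) - 3 * bw, max(x) + 3 * bw, length.out = 200)

# One normal curve per data point, evaluated on the grid
# (rows = grid positions, columns = data points)
bumps <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw))

# The estimate is the average of the curves, so it integrates to about 1
kde_manual <- rowMeans(bumps)
area <- sum(kde_manual) * diff(grid)[1]

# density() computes the same quantity with a faster, approximate algorithm
kde_builtin <- density(x, bw = bw, from = min(grid), to = max(grid), n = 200)
max_diff <- max(abs(kde_builtin$y - kde_manual))
```

Here area should come out close to 1 and max_diff close to 0, which is what "adding up the little normal densities" amounts to once the sum is divided by the number of points.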
What about 2 dimensional kernel densities?
library(MASS)  # needed for kde2d()
b <- log10(rgamma(1000, 6, 3))
a <- log10((rweibull(1000, 8, 2)))
# a and b contain 1000 values each.
density <- kde2d(a,b,n=100)
The function creates an n x n grid (here 100 x 100) from min(a) to max(a) and from min(b) to max(b). Instead of fitting a tiny 1D normal density over every value in a or b, kde2d now fits a tiny 2D normal density over every data point (a[i], b[i]). Just like in the 1 dimensional case, it then adds up all these densities (divided by the number of points) and evaluates the result at every node of the grid.
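To make the summation concrete, here is a small sketch that rebuilds the kde2d() result by hand: one 2D normal bump per observation (a[i], b[i]), evaluated on the grid and averaged. The reduced sample size, the 25 x 25 grid and the names gx, gy and z_manual are my own choices for illustration. It relies on kde2d() dividing its bandwidths by 4 before using them as normal standard deviations.

```r
library(MASS)
set.seed(42)
a <- log10(rweibull(200, 8, 2))
b <- log10(rgamma(200, 6, 3))

# kde2d() uses bandwidth.nrd() by default and divides the bandwidth by 4
h  <- c(bandwidth.nrd(a), bandwidth.nrd(b)) / 4
gx <- seq(min(a), max(a), length.out = 25)
gy <- seq(min(b), max(b), length.out = 25)

# Add one 2D normal bump per data point, evaluated at every grid node
z_manual <- matrix(0, nrow = 25, ncol = 25)
for (i in seq_along(a)) {
  z_manual <- z_manual + outer(dnorm(gx, a[i], h[1]), dnorm(gy, b[i], h[2]))
}
z_manual <- z_manual / length(a)

fit <- kde2d(a, b, n = 25)
all.equal(fit$z, z_manual)
```

The grid only determines where the estimate is evaluated; the bumps themselves sit on the data points.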
What do the colours mean?
As @cel pointed out in the comments: the estimated density depends on two variables, so we now have three axes (a, b and the estimated density). One way to visualize three axes is with iso-probability contours, i.e. lines along which the estimated density is constant. This sounds fancy, but it is basically the same as the high/low pressure maps we know from the weather forecast.
You are using
filled.contour(density,
color.palette = colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred')))
So from low to high, the plot will be coloured white, blue, yellow, red and eventually darkred for the highest values of the estimated density. This results in the following plot:
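The mapping from values to colours can be inspected directly. The sketch below assumes the defaults of filled.contour(), which cuts the range of z into roughly 20 "pretty" levels and requests one colour per band from the palette; the names lev, cols and total_mass are mine.

```r
library(MASS)
set.seed(42)
a <- log10(rweibull(1000, 8, 2))
b <- log10(rgamma(1000, 6, 3))
density <- kde2d(a, b, n = 100)
colour_flow <- colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred'))

# Levels and colours as filled.contour() would choose them by default
lev  <- pretty(range(density$z), 20)
cols <- colour_flow(length(lev) - 1)

# Each row is one colour band: lowest band white, highest band dark red
data.frame(from = head(lev, -1), to = tail(lev, -1), colour = cols)

# z holds density values, not probabilities: single values can exceed 1,
# but summed over the grid cells they account for nearly all of the mass
cell_area  <- diff(density$x)[1] * diff(density$y)[1]
total_mass <- sum(density$z) * cell_area
```

So a dark red region is one where the estimated joint density of (a, b) is highest, and total_mass should be close to 1.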

Related

2d density plot from curves

I have a multi-parameter function on which I infer the parameters using MCMC. This means that I have many samples of the parameters, and I can plot the functions:
# Simulate some parameters. Really, I get these from MCMC sampling.
first = rnorm(1000) # a
second = rnorm(1000) # b
# The function (geometric)
geometric = function(x, a, b) b*(1 - a^(x + 1)/a)
# Plot curves. Perhaps not the most efficient way, but it works.
curve(geometric(x, first[1], second[1]), ylim=c(-3, 3)) # first curve
for(i in 2:length(first)) {
  curve(geometric(x, first[i], second[i]), add=TRUE, col='#00000030') # add others
}
How do I make this into a density plot instead of plotting the individual curves? For example, it's hard to see just how much denser it is around y=0 than around other values.
The following would be nice:
The ability to draw observed values on top (points and lines).
Drawing a contour line in the density, e.g. the 95% Highest Posterior Density interval or the 2.5 and 97.5 quantiles.

R: area under curve of ogive?

I have an algorithm that uses an x,y plot of sorted y data to produce an ogive.
I then compute the area under the curve to derive percentages.
I'd like to do something similar using kernel density estimation. I like how the upper/lower bounds are smoothed out using kernel densities (i.e. the min and max will extend slightly beyond my hard coded input).
Either way... I was wondering if there is a way to treat an ogive as a type of cumulative distribution function and/or use kernel density estimation to derive a cumulative distribution function given y data?
I apologize if this is a confusing question. I know there is a way to derive a cumulative frequency graph (i.e. ogive). However, I can't determine how to derive a % given this cumulative frequency graph.
What I don't want is an ecdf. I know how to do that, and I am not quite trying to capture an ecdf. But, rather integration of an ogive given two intervals.
I'm not exactly sure what you have in mind, but here's a way to calculate the area under the curve for a kernel density estimate, or more generally for any case where you have the y values at equally spaced x values (you can, of course, generalize to variable x intervals as well):
library(zoo)
# Kernel density estimate
# Set n to higher value to get a finer grid
set.seed(67839)
dens = density(c(rnorm(500,5,2),rnorm(200,20,3)), n=2^5)
# How to extract the x and y values of the density estimate
#dens$y
#dens$x
# x interval
dx = median(diff(dens$x))
# mean height for each pair of y values
h = rollmean(dens$y, 2)
# Area under curve
sum(h*dx) # 1.000943
# Cumulative area
# cumsum(h*dx)
# Plot density, showing points at which density is calculated
plot(dens)
abline(v=dens$x, col="#FF000060", lty="11")
# Plot cumulative area under curve, showing mid-point of each x-interval
plot(dens$x[-length(dens$x)] + 0.5*dx, cumsum(h*dx), type="l")
abline(v=dens$x[-length(dens$x)] + 0.5*dx, col="#FF000060", lty="11")
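If the goal is a percentage between two bounds, the same trapezoid idea can be restricted to an interval. This is only a sketch under my own assumptions: the helper area_between() is hypothetical, the bounds 0 and 10 are arbitrary, and a finer grid (n = 512) is used so that dropping the partial cells at the interval ends matters little.

```r
set.seed(67839)
y <- c(rnorm(500, 5, 2), rnorm(200, 20, 3))
dens <- density(y, n = 512)

# Trapezoid-rule area of the density curve between lo and hi
# (only grid points inside [lo, hi] are used, so a fine grid matters)
area_between <- function(d, lo, hi) {
  keep <- d$x >= lo & d$x <= hi
  x <- d$x[keep]
  y <- d$y[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

total  <- area_between(dens, -Inf, Inf)   # close to 1
p_0_10 <- area_between(dens, 0, 10)       # estimated share of values in [0, 10]
```

Multiplying p_0_10 by 100 gives the percentage; because the estimate is smoothed, it will differ somewhat from the raw proportion mean(y >= 0 & y <= 10).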
UPDATE to include ecdf function
To address your comments, look at the two plots below. The first is the empirical cumulative distribution function (ECDF) of the mixture of normal distributions that I used above. Note that the plot of this data looks the same below as it does above. The second is a plot of the ECDF of a plain vanilla normal distribution, mean=0, sd=1.
set.seed(67839)
x = c(rnorm(500,5,2),rnorm(200,20,3))
plot(ecdf(x), do.points=FALSE)
plot(ecdf(rnorm(1000)))

Represent the frequency line in a histogram using freq=TRUE

I have the following piece of code in R:
w=rbeta(365,1,3,ncp=0)
hist(10*w,breaks=25,freq=TRUE,xlim=c(0,10), ylim=c(0,60))
h=seq(0,1,0.05)
So far so good.
What I want to do now is add a line representing the beta distribution with parameters alpha=1, beta=3 (as in the rbeta call I used), scaled to frequency rather than density. The total number of elements drawn with rbeta is 365 (the days in a year), and I multiply w by 10 because the variable I am studying can take values in [0,10] each day, following the beta distribution described above.
What do I have to do to represent this line?
Summarizing, the histogram is based on simulated values, and I want to show how the theoretical beta distribution would have behaved in comparison to the simulation.
If you want them to match up, you're going to want to match up the area under the curves of the histogram and the density plot. That should put them on the same scale. One way to do that would be
set.seed(15) #to make it reproducible
w <- rbeta(365, 1, 3, ncp=0)
hh <- hist(w*10, breaks=25, freq=TRUE, xlim=c(0,10), ylim=c(0,60))
ss <- sum(diff(hh$breaks)*hh$counts)
curve(dbeta(x/10, 1, 3, ncp=0)*ss/10, add=T)
This gives
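The scaling can be sanity-checked without plotting. In the sketch below (the names bin_width, n_obs and expected are mine), the histogram area ss equals the number of observations times the bin width, and the density of 10*w is dbeta(x/10, 1, 3)/10, which is where the factor ss/10 in the curve() call comes from.

```r
set.seed(15)
w  <- rbeta(365, 1, 3, ncp = 0)
hh <- hist(w * 10, breaks = 25, plot = FALSE)

# Area of the frequency histogram: bin width times count, summed over bins
ss <- sum(diff(hh$breaks) * hh$counts)

# With equal-width bins this is (number of observations) * (bin width)
bin_width <- diff(hh$breaks)[1]
n_obs     <- sum(hh$counts)

# Expected count per bin under Beta(1, 3) rescaled to [0, 10]
expected <- n_obs * diff(pbeta(hh$breaks / 10, 1, 3))
cbind(observed = hh$counts, expected = round(expected, 1))
```

Comparing the observed and expected columns gives the same information as the overlaid curve, bin by bin.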

How do I interpret the output of corrplot?

The corrplot package provides some neat plots and documentation with examples.
But I don't understand the output. I can see that if you have a matrix A_ij, you can plot it as an arrangement of n by n square tiles, where the color of tile ij corresponds to the value of A_ij. But some examples appear to have more dimensions:
Here we can guess that color shows the correlation coefficient, and orientation of the ellipse is negative/positive correlation. What is the eccentricity?
The documentation for method says:
the visualization method of correlation matrix to be used. Currently, it supports seven methods, named "circle" (default), "square", "ellipse", "number", "pie", "shade" and "color". See examples for details.
The areas of circles or squares show the absolute value of corresponding correlation coefficients. Method "pie" and "shade" came from Michael Friendly’s job (with some adjustment about the shade added on), and "ellipse" came from D.J. Murdoch and E.D. Chow’s job, see in section References.
So we know that the area, for circles and squares, should show the coefficient. What about the other dimensions, and other methods?
There is only one dimension shown by the plot.
Michael Friendly, in Corrgrams: Exploratory displays for correlation matrices (the corrplot documentation confusingly refers to this as his "job"), says:
In the shaded row, each cell is shaded blue or red depending on the sign of the correlation, and with the intensity of color scaled 0–100% in proportion to the magnitude of the correlation. (Such scaled colors are easily computed using RGB coding from red, (1, 0, 0), through white (1, 1, 1), to blue (0, 0, 1). For simplicity, we ignore the non-linearities of color reproduction and perception, but note that these are easily accommodated in the color mapping function.) White diagonal lines are added so that the direction of the correlation may still be discerned in black and white. This bipolar scale of color was chosen to leave correlations near 0 empty (white), and to make positive and negative values of equal magnitude approximately equally intensely shaded. Gray scale and other color schemes are implemented in our software (Section 6), but not illustrated here.
The bar and circular symbols also use the same scaled colors, but fill an area proportional to the absolute value of the correlation. For the bars, negative values are filled from the bottom, positive values from the top. The circles are filled clockwise for positive values, anti-clockwise for negative values. The ellipses have their eccentricity parametrically scaled to the correlation value (Murdoch and Chow, 1996). Perceptually, they have the property of becoming visually less prominent as the magnitude of the correlation increases, in contrast to the other glyphs.
(emphasis mine)
"Murdoch and Chow, 1996" is a publication describing the equation for drawing the ellipses (A Graphical Display of Large Correlation Matrices). The ellipses are apparently meant to be caricatures of bivariate normal distributions:
So in conclusion, the only dimension shown is always the correlation coefficient (or the value of A_ij, to use the question's terminology) itself. The multiple apparent dimensions are redundant.
I think the plot is quite self explanatory. On the right hand side you have the scale which is colored from red (negative correlation) to blue (positive correlation). The color follows a gradient according to the strength of the correlation.
If the ellipse leans towards the right, it is again positive correlation and if it leans to the left, it is negative correlation.
The diffusion around a line (which denotes perfect correlation, for example mpg ~ mpg) creates an ellipse. You will have a more diffused ellipse for lower strengths of the correlation. This is typically how a weakly correlated relationship will look in a scatterplot. These I think are caricatures, however.
Here is some code from the corrplot function responsible for drawing ellipses. I am not going to attempt to explain this (because it is a part of a larger system). I wanted to show that the logic is all there if you'd like to deep dive into it:
if (method == "ellipse" & plotCI == "n") {
    ell.dat <- function(rho, length = 99) {
        k <- seq(0, 2 * pi, length = length)
        x <- cos(k + acos(rho)/2)/2
        y <- cos(k - acos(rho)/2)/2
        return(cbind(rbind(x, y), c(NA, NA)))
    }
    ELL.dat <- lapply(DAT, ell.dat)
    ELL.dat2 <- 0.85 * matrix(unlist(ELL.dat), ncol = 2,
        byrow = TRUE)
    ELL.dat2 <- ELL.dat2 + Pos[rep(1:length(DAT), each = 100),
        ]
    polygon(ELL.dat2, border = col.border, col = col.fill)
}
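Reading off the parametrisation in ell.dat, x = cos(k + acos(rho)/2) and y = cos(k - acos(rho)/2) trace an ellipse whose axes lie along the two diagonals, and the ratio of the axis lengths works out to tan(acos(rho)/2) = sqrt((1 - rho)/(1 + rho)). The sketch below checks this numerically; the helpers ellipse_xy() and axis_ratio() are my own and not part of corrplot, and the constant scale factors from the package code are dropped because they do not affect the ratio.

```r
# Same parametrisation as ell.dat, without the scaling and NA separator
ellipse_xy <- function(rho, n = 401) {
  k <- seq(0, 2 * pi, length.out = n)
  cbind(x = cos(k + acos(rho) / 2), y = cos(k - acos(rho) / 2))
}

# Width across the anti-diagonal divided by length along the diagonal y = x
axis_ratio <- function(rho) {
  p <- ellipse_xy(rho)
  u <- (p[, "x"] + p[, "y"]) / sqrt(2)   # coordinate along y = x
  v <- (p[, "x"] - p[, "y"]) / sqrt(2)   # coordinate along y = -x
  diff(range(v)) / diff(range(u))
}

rhos <- c(0, 0.3, 0.8, 0.99)
cbind(rho           = rhos,
      numeric_ratio = sapply(rhos, axis_ratio),
      closed_form   = sqrt((1 - rhos) / (1 + rhos)))
```

At rho = 0 the ratio is 1 (a circle), and it shrinks towards 0 as rho approaches 1, so the shape of the ellipse carries no information beyond the correlation itself.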

kde2d density comparison

I have a question about kde2d (kernel density estimator). I am computing two kde2d estimates for two different data sets in the same space of variables. When I compare them with filled.contour2 or contour, I see that the set with the lower density of points in a scatter plot (it also has fewer points in total, by a factor of 10) has higher values on its density contours. I expected the set with the higher point density to have the higher contour values, but as I said, that is not the case. Does it have to do with the choice of bandwidth (h)? I am using the same h for both, and I tried changing it, but the result did not change much. What is my error?
An example
a <- runif(1000, 5.0, 7.5)
b <- runif(1000, 2.0, 3.0)
c <- runif(100000,5.0, 7.5)
d <- runif(100000, 2.0, 3.0)
library(MASS)
abdens <- kde2d(a,b,n=100,h=0.5)
cddens <- kde2d(c,d,n=100,h=0.5)
mylevels <- seq(-2.4,30,0.9)
filled.contour2(abdens,xlab="a",ylab="b",xlim=c(5,7.5),ylim=c(2,3),
col=hsv(seq(0,1,length=length(mylevels))))
plot(a,b)
contour(abdens,nlevels=5,add=T,col="blue")
plot(c,d)
contour(cddens,nlevels=5,add=T,col="orange")
I'm not sure I agree that the densities should be different in the uniform case. I would have expected a set with a higher number of randomly drawn points from a Normal distribution to have more support for extreme regions and therefore to have lower (estimated) density in the center. That effect might also be occasionally apparent with bivariate Uniform draws with 1,000 points versus 100,000. I hope my modifications to your code are acceptable. It's easier to see the contours if done after the plots.
(The theoretical density would be the same in both these cases since the density is normalized to an integral of 1.0. We are only looking at estimates with some expected artifacts from "edge" effects. In the case of univariate densities, adding the information about bounds can be done with the density functions in the logspline package.)
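A quick numerical check of the normalization argument, as a sketch: the sample sizes are reduced from the question's 1,000 and 100,000 to keep memory use modest, the grid limits are fixed explicitly, and the helper interior() is my own. Both estimates should sit near the true uniform density of 1/(2.5 * 1) = 0.4 away from the edges, regardless of how many points were drawn.

```r
library(MASS)
set.seed(1)
lims <- c(5, 7.5, 2, 3)   # same evaluation grid for both samples

small <- kde2d(runif(2000, 5, 7.5),  runif(2000, 2, 3),
               h = 0.5, n = 50, lims = lims)
large <- kde2d(runif(20000, 5, 7.5), runif(20000, 2, 3),
               h = 0.5, n = 50, lims = lims)

# True density of a uniform distribution on a 2.5 x 1 rectangle
true_density <- 1 / (2.5 * 1)

# Average estimated density over grid nodes away from the edges
interior <- function(d) {
  mean(d$z[d$x > 5.5 & d$x < 7, d$y > 2.4 & d$y < 2.6])
}
c(small = interior(small), large = interior(large), truth = true_density)
```

The contour values therefore reflect how the points are distributed, not how many there are; ten or a hundred times more points from the same distribution gives a smoother estimate at about the same height.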
