I have a question about kde2d (the 2D kernel density estimator in MASS). I am computing two different kde2d estimates for two different sets of data in the same space of variables. When I compare them with filled.contour2 or contour, the set that looks sparser in a scatter plot (it also has 10 times fewer points in total) shows higher density values on its contours. I was expecting the set with the higher point density to have the higher contour values, but as I said, that is not the case. Does it have to do with the choice of bandwidth (h)? I am using the same h for both, and when I tried changing it the result did not change much. What is my error?
An example
a <- runif(1000, 5.0, 7.5)
b <- runif(1000, 2.0, 3.0)
c <- runif(100000,5.0, 7.5)
d <- runif(100000, 2.0, 3.0)
library(MASS)
abdens <- kde2d(a,b,n=100,h=0.5)
cddens <- kde2d(c,d,n=100,h=0.5)
mylevels <- seq(-2.4,30,0.9)
filled.contour2(abdens,xlab="a",ylab="b",xlim=c(5,7.5),ylim=c(2,3),
col=hsv(seq(0,1,length=length(mylevels))))
plot(a,b)
contour(abdens,nlevels=5,add=T,col="blue")
plot(c,d)
contour(cddens,nlevels=5,add=T,col="orange")
I'm not sure I agree that the densities should be different in the uniform case. I would have expected a set with a higher number of randomly drawn points from a Normal distribution to have more support for extreme regions and therefore to have lower (estimated) density in the center. That effect might also be occasionally apparent with bivariate Uniform draws with 1,000 points versus 100,000. I hope my modifications to your code are acceptable. It's easier to see the contours if they are drawn after the plots.
(The theoretical density would be the same in both these cases, since a density is normalized to an integral of 1.0 regardless of how many points were drawn. We are only looking at estimates, with some expected artifacts from "edge" effects. In the case of univariate densities, the information about bounds can be added with the density functions in package::logspline.)
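A minimal sketch of that point, assuming the same uniform ranges and h = 0.5 as in the question (the seed, the coarser grid n = 25 and the variable names are my own choices for illustration). The true density on this rectangle is 1 / (2.5 * 1) = 0.4 for both samples; only the noise in the estimate should differ.

```r
library(MASS)

set.seed(1)  # arbitrary seed, only for reproducibility
x_small <- runif(1000,   5.0, 7.5); y_small <- runif(1000,   2.0, 3.0)
x_large <- runif(100000, 5.0, 7.5); y_large <- runif(100000, 2.0, 3.0)

# Same bandwidth for both, as in the question; n = 25 keeps memory use modest
dens_small <- kde2d(x_small, y_small, n = 25, h = 0.5)
dens_large <- kde2d(x_large, y_large, n = 25, h = 0.5)

# True density of a uniform on [5, 7.5] x [2, 3]
true_density <- 1 / ((7.5 - 5.0) * (3.0 - 2.0))

# Both estimates are on the same scale; sample size does not enter the height
range(dens_small$z)
range(dens_large$z)
```

If the smaller sample shows higher contour values, that is most likely sampling noise: with fewer points the estimate wobbles more around 0.4, so its peaks can be higher even though the underlying density is identical.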
Suppose that I have a Poisson distribution with a mean of 6. I would like to plot its probability mass function with an overlay of the approximating normal density.
This is what I have tried:
plot( dpois( x=0:10, lambda=6 ))
This produces a plot that is not what I want, since it doesn't contain an overlay of the approximating normal density.
How do I go about this?
Something like what you seem to be asking for (I'm outlining the commands and the basic ideas, but checking the help on the functions and trying should fill in the remaining details):
Take a wider range of x-values (out to at least 13 or so) and use xlim to extend the plot slightly into the negatives (maybe to -1.5).
Plot the pmf of the Poisson with solid dots (similar to your command but with pch=16 as an argument to plot) in a suitable colour.
Call points with the same x and y arguments as above, with type="h" and lty=3, to get vertical dotted lines (these give a clear impression of the relative heights, somewhat akin to the appearance of a Cleveland dot-chart); I'd use the same colour as the dots or a slightly lighter/greyer version of it.
Use curve to draw the normal curve with the same mean and standard deviation as the Poisson with mean 6 (for a Poisson the variance equals the mean; see the Wikipedia page for the Poisson), across the wider range we plotted; I'd use a slightly contrasting colour for that.
Draw a light x-axis (e.g. using abline with the h argument).
Putting all those suggestions together:
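A minimal sketch of those steps, assuming an x-range of 0 to 15 and colours picked only for illustration:

```r
lambda <- 6
x <- 0:15
pmf <- dpois(x, lambda)

# Poisson pmf as solid dots, with the x-axis extended slightly below zero
plot(x, pmf, pch = 16, col = "steelblue4",
     xlim = c(-1.5, 15), ylim = c(0, max(pmf) * 1.1),
     xlab = "x", ylab = "pmf / density")

# Dotted vertical lines from the axis up to each dot
points(x, pmf, type = "h", lty = 3, col = "steelblue4")

# Normal density with the same mean and standard deviation (variance = lambda)
curve(dnorm(x, mean = lambda, sd = sqrt(lambda)),
      from = -1.5, to = 15, add = TRUE, col = "darkorange")

# Light horizontal reference line at zero
abline(h = 0, col = "grey70")
```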
(However, while this is what you asked for, it is not strictly a suitable way to compare a discrete and a continuous distribution: a pmf and a density are not on the same scale, because a density is not a probability. The "right" comparison between a Poisson and its approximating normal is on the scale of the cdfs, where both are probabilities and you compare like with like.)
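A rough sketch of the cdf comparison, under the same assumptions as before (x from 0 to 15, illustrative colours, no continuity correction applied to the normal):

```r
lambda <- 6
x <- 0:15

# Poisson cdf as a step function: 0 below x = 0, then ppois at each integer
pois_cdf <- stepfun(x, c(0, ppois(x, lambda)))
plot(pois_cdf, verticals = FALSE, pch = 16, main = "",
     xlab = "x", ylab = "F(x)", xlim = c(-1.5, 15))

# Normal cdf with matching mean and standard deviation
curve(pnorm(x, mean = lambda, sd = sqrt(lambda)),
      add = TRUE, col = "darkorange")

# Largest gap between the two cdfs at the integers
max_gap <- max(abs(ppois(x, lambda) - pnorm(x, lambda, sqrt(lambda))))
max_gap
```

Here both curves are probabilities, so the vertical distance between them can be read directly as an approximation error.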
A kernel density estimator is used to estimate a particular probability density function (see mvstat.net and the scikit-learn docs for references).
My confusion is about what exactly does kde2d() do? Does it estimate the joint distribution probability density function of two random variables f(a,b) in the below example? And what does the color mean?
Here is the code example I am referring to.
library(MASS)  # provides kde2d()
b <- log10(rgamma(1000, 6, 3))
a <- log10((rweibull(1000, 8, 2)))
density <- kde2d(a, b, n=100)
colour_flow <- colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred'))
filled.contour(density, color.palette=colour_flow)
What is a kernel density estimator?
Essentially, it places a little normal density curve over every data point (centred at that point) and then adds up all these little normal densities, scaled so that the total integrates to 1. The resulting sum is the kernel density estimate.
For the sake of illustration I will add an image of a 1 dimensional kernel density estimator from one of your links.
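A small sketch of that idea in one dimension, assuming simulated normal data and the default bw.nrd0 bandwidth (both chosen only for illustration). The hand-built estimate is the average of normal densities centred on the data points, and it is drawn over the result of density() for comparison.

```r
set.seed(1)  # arbitrary seed
x <- rnorm(50)
h <- bw.nrd0(x)  # the default bandwidth used by density()

# Evaluate the estimate on a grid that extends a little beyond the data
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)

# At each grid value: average of normal densities centred at every data point
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

plot(density(x, bw = h), main = "")
lines(grid, kde_manual, col = "red", lty = 2)

# Approximate area under the hand-built curve (should be close to 1)
area <- sum(kde_manual) * diff(grid)[1]
area
```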
What about 2 dimensional kernel densities?
library(MASS)  # provides kde2d()
b <- log10(rgamma(1000, 6, 3))
a <- log10((rweibull(1000, 8, 2)))
# a and b contain 1000 values each.
density <- kde2d(a,b,n=100)
The function creates an n x n grid from min(a) to max(a) and from min(b) to max(b). Instead of placing a tiny 1D normal density over every value, kde2d places a tiny 2D normal density over every data point (a[i], b[i]). Just like in the 1-dimensional case, it then adds up all these densities, and it reports the resulting value at every point of the grid.
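A hedged sketch that reproduces the kde2d output by hand, to make the "sum of 2D normals evaluated on a grid" description concrete. It assumes, based on my reading of the MASS source, that kde2d uses bandwidth.nrd() by default and divides that bandwidth by 4 to get the kernel standard deviation; the seed and the smaller grid (n = 25) are my own choices.

```r
library(MASS)

set.seed(42)  # arbitrary seed
a <- log10(rweibull(1000, 8, 2))
b <- log10(rgamma(1000, 6, 3))

dens <- kde2d(a, b, n = 25)

# Kernel standard deviations (assumption: default bandwidth divided by 4)
sx <- bandwidth.nrd(a) / 4
sy <- bandwidth.nrd(b) / 4

# Rows: grid values, columns: data points
kx <- outer(dens$x, a, function(g, xi) dnorm(g, mean = xi, sd = sx))
ky <- outer(dens$y, b, function(g, yi) dnorm(g, mean = yi, sd = sy))

# Average over data points of the product of the two 1D kernels
z_manual <- kx %*% t(ky) / length(a)

all.equal(z_manual, dens$z)
```

The product of two 1D normal kernels is a 2D normal with independent components, which is why the whole estimate can be written as one matrix product.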
What do the colours mean?
As @cel pointed out in the comments: the estimated density depends on two variables, so we have three axes now (a, b and the estimated density). One way to visualize three axes is by using iso-density contours. This sounds fancy, but it is basically the same as the high/low pressure maps we know from the weather forecast.
You are using
filled.contour(density,
color.palette = colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred')))
So from low to high, the plot will be coloured white, blue, yellow, red and eventually darkred for the highest values of the estimated density. This results in the following plot:
I have an algorithm that uses an x,y plot of sorted y data to produce an ogive.
I then compute the area under the curve to derive percentages.
I'd like to do something similar using kernel density estimation. I like how the upper/lower bounds are smoothed out using kernel densities (i.e. the min and max will extend slightly beyond my hard coded input).
Either way... I was wondering if there is a way to treat an ogive as a type of cumulative distribution function and/or use kernel density estimation to derive a cumulative distribution function given y data?
I apologize if this is a confusing question. I know there is a way to derive a cumulative frequency graph (i.e. ogive). However, I can't determine how to derive a % given this cumulative frequency graph.
What I don't want is an ecdf. I know how to do that, and it is not quite what I am trying to capture. Rather, I want to integrate an ogive between two bounds.
I'm not exactly sure what you have in mind, but here's a way to calculate the area under the curve for a kernel density estimate (or more generally for any case where you have the y values at equally spaced x-values (though you can, of course, generalize to variable x intervals as well)):
library(zoo)
# Kernel density estimate
# Set n to higher value to get a finer grid
set.seed(67839)
dens = density(c(rnorm(500,5,2),rnorm(200,20,3)), n=2^5)
# How to extract the x and y values of the density estimate
#dens$y
#dens$x
# x interval
dx = median(diff(dens$x))
# mean height for each pair of y values
h = rollmean(dens$y, 2)
# Area under curve
sum(h*dx) # 1.000943
# Cumulative area
# cumsum(h*dx)
# Plot density, showing points at which density is calculated
plot(dens)
abline(v=dens$x, col="#FF000060", lty="11")
# Plot cumulative area under curve, showing mid-point of each x-interval
plot(dens$x[-length(dens$x)] + 0.5*dx, cumsum(h*dx), type="l")
abline(v=dens$x[-length(dens$x)] + 0.5*dx, col="#FF000060", lty="11")
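To get a percentage between two bounds from the smoothed curve, one option is to turn the cumulative area into an interpolated CDF and take differences. A minimal sketch, assuming the same simulated mixture as above; the finer grid (n = 512), the renormalisation step and the helper names kde_cdf and share_between are my own additions.

```r
set.seed(67839)
y <- c(rnorm(500, 5, 2), rnorm(200, 20, 3))
dens <- density(y, n = 512)

# Cumulative trapezoid area at each grid point, starting from 0
dx <- diff(dens$x)
cum <- c(0, cumsum(dx * (head(dens$y, -1) + tail(dens$y, -1)) / 2))
cum <- cum / cum[length(cum)]  # renormalise so the curve ends exactly at 1

# Interpolated CDF: 0 to the left of the grid, 1 to the right
kde_cdf <- approxfun(dens$x, cum, yleft = 0, yright = 1)

# Share of the smoothed distribution between two bounds (hypothetical helper)
share_between <- function(lo, hi) kde_cdf(hi) - kde_cdf(lo)

share_between(0, 10)
```

Multiplying the result by 100 gives the percentage of the smoothed distribution that falls between the two bounds.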
UPDATE to include ecdf function
To address your comments, look at the two plots below. The first is the empirical cumulative distribution function (ECDF) of the mixture of normal distributions that I used above. Note that the plot of this data looks the same below as it does above. The second is a plot of the ECDF of a plain vanilla normal distribution, mean=0, sd=1.
set.seed(67839)
x = c(rnorm(500,5,2),rnorm(200,20,3))
plot(ecdf(x), do.points=FALSE)
plot(ecdf(rnorm(1000)))
I have calculated the kernel density of a 3-column matrix in R using the following code:
library(ks)  # provides kde()
ss<-read.table("data.csv",header=TRUE,sep=",")
x<-ss[,1]
y<-ss[,2]
z<-ss[,3]
ssdata<-c(x,y,z)
ssmat<-matrix(ssdata,,3)
rp<-kde(ssmat)
plot(rp)
What I need now are the (x,y,z) coordinates of the point of maximum kernel density. Based on an answer provided on the R-help list, I understand that the kde() function computes the joint density of the three variables as a fourth dimension, which is represented in the 3D plot by shading to indicate areas of greater point density. So in effect I am trying to locate the maximum value of this "fourth" dimension. I suspect that this is a relatively simple problem but I haven't been able to find the answer. Any ideas?
You can extract the max value from the info returned from kde. To see all the stuff returned, use str(rp).
## Get the indices
inds <- which(abs(rp$estimate - max(rp$estimate)) < 1e-10, arr.ind=TRUE)
xyz <- mapply(function(a, b) a[b], rp$eval.points, inds)
## Add it to the plot (points3d() comes from the rgl package)
plot(rp)
points3d(x=xyz[1], y=xyz[2], z=xyz[3], size=20, col="blue")
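The index lookup can be tried without any data file. The sketch below uses a synthetic 3D array as a stand-in for rp$estimate and a list of three grids as a stand-in for rp$eval.points (both are assumptions about the layout, not output from kde()), and uses which.max with arrayInd as an alternative to the tolerance test above.

```r
# Synthetic stand-ins (assumption: same layout as rp$eval.points / rp$estimate)
g <- seq(-3, 3, length.out = 31)
eval_points <- list(g, g, g)

# A separable 3D "density" whose peak lies at (1, -1, 0.4)
estimate <- outer(outer(dnorm(g, mean = 1), dnorm(g, mean = -1)),
                  dnorm(g, mean = 0.4))

# Position of the first maximum, converted to (i, j, k) array indices
idx <- arrayInd(which.max(estimate), dim(estimate))

# Look up the coordinate on each axis
mode_xyz <- mapply(function(pts, i) pts[i], eval_points, idx[1, ])
mode_xyz
```

Note that the result is only as precise as the evaluation grid: the reported maximum is the grid point with the highest estimate, not an interpolated optimum.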
I'm trying to plot a histogram of the Cauchy distribution in R using the following code:
X = rcauchy(10^5)
hist(X)
and no matter what options I try in the hist() function, I can never see more than two bars on my histogram (basically one for negative values and one for positive values).
It works fine, however, when I use the normal distribution (or others).
This results from the properties of the distribution.
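Most values are relatively close to zero, but very large absolute values are much more probable than for the normal distribution. About 1.3 % of the values have an absolute value greater than 50, and about 0.13 % greater than 500. A handful of such extreme values stretches the x-axis so far that nearly all of the data ends up in the one or two bins around zero.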
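These tail probabilities can be checked directly from the distribution functions. A short sketch for the standard Cauchy (location 0, scale 1, as produced by rcauchy with default arguments), with the standard normal shown for contrast:

```r
# P(|X| > t) for a symmetric distribution is twice the lower tail at -t
cauchy_tail <- function(t) 2 * pcauchy(-t)
normal_tail <- function(t) 2 * pnorm(-t)

cauchy_tail(c(50, 500))   # roughly 0.013 and 0.0013
normal_tail(c(5, 50))     # tiny at 5, numerically zero at 50
```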
Try plotting only part of the values:
hist(X[abs(X)<1])
hist(X[abs(X)<5])
hist(X[abs(X)<50])
hist(X)
You can also look at the cumulative distribution function:
plot(ecdf(X))
And check the boxplot:
boxplot(X)