How to rescale the kernel density values?

I found that some estimated kernel density values are negative. The solution to this is to set the negative values to 0, rescale the kernel density estimate, and find the closest pdf to the rescaled KDE. But I don't know how to rescale the KDE.

The KDE should not have negative values unless the kernel function allows negative values (e.g., Silverman Kernel).
To rescale, divide the chopped KDE values by the area (integral) under the chopped KDE so that the area is again equal to one.
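As a rough sketch of that rescaling in R: the grid `x` and the values `f` below are made-up stand-ins for a KDE that dips below zero (the shape is that of a fourth-order Gaussian-based kernel, chosen only for illustration), and the area is approximated with the trapezoidal rule via a small helper, `trapz`, defined here.

```r
# Hypothetical stand-in for a KDE evaluated on a grid: this curve integrates
# to 1 but is negative for |x| > sqrt(3).
x <- seq(-6, 6, length.out = 1201)
f <- (1.5 - 0.5 * x^2) * dnorm(x)

# Trapezoidal approximation of the area under y over the grid x
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

f_chopped  <- pmax(f, 0)            # set the negative values to 0
area       <- trapz(x, f_chopped)   # area under the chopped curve (now above 1)
f_rescaled <- f_chopped / area      # divide so that the area is 1 again
```

With a real estimate, `x` and `f` would be the evaluation grid and the estimated values; a finer grid gives a better approximation of the area.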

Related

Plotting a probability mass function for a poisson distribution

Suppose that I have a Poisson distribution with a mean of 6. I would like to plot its probability mass function with an overlay of the approximating normal density.
This is what I have tried:
plot(dpois(x = 0:10, lambda = 6))
This produces a plot that is wrong, since it doesn't contain an overlay of the approximating normal density.
How do I go about this?
Something like what you seem to be asking for (I'm outlining the commands and the basic ideas; checking the help on the functions and trying them out should fill in the remaining details):
take a wider range of x-values (out to at least 13 or so) and use xlim to extend the plot slightly into the negatives (maybe to -1.5);
plot the pmf of the Poisson with solid dots (similar to your command but with pch=16 as an argument to plot) in a suitable colour;
call points with the same x and y arguments as above, with type="h" and lty=3, to get vertical dotted lines (to give a clear impression of the relative heights, somewhat akin to the appearance of a Cleveland dot-chart); I'd use the same colour as the dots or a slightly lighter/greyer version of the dot-colour;
use curve to draw the normal curve with the same mean and standard deviation as the Poisson with mean 6 (the mean and variance of a Poisson are both equal to its parameter; see the Wikipedia page for the Poisson), but across the wider range we plotted; I'd use a slightly contrasting colour for that;
draw a light x-axis (e.g. using abline with the h argument).
Putting all those suggestions together:
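A minimal sketch of how those steps might look in R. The x-range of 0 to 14, the colour names and the axis labels are assumptions made for illustration, not part of the outline above.

```r
lambda <- 6
x   <- 0:14                      # wider range of x-values than 0:10
pmf <- dpois(x, lambda)

# Poisson pmf as solid dots, with the x-axis extended slightly into the negatives
plot(x, pmf, pch = 16, col = "steelblue4",
     xlim = c(-1.5, 14.5), ylim = c(0, 1.1 * max(pmf)),
     xlab = "x", ylab = "pmf / density")

# Vertical dotted lines from the axis up to each dot
points(x, pmf, type = "h", lty = 3, col = "steelblue4")

# Normal density with the same mean and standard deviation as the Poisson
curve(dnorm(x, mean = lambda, sd = sqrt(lambda)),
      from = -1.5, to = 14.5, add = TRUE, col = "darkorange")

# Light horizontal axis at zero
abline(h = 0, col = "grey70")
```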
(However, while this is what you're asking for, it's not strictly a suitable way to compare discrete and continuous variables: a density and a pmf are not on the same scale, because density is not probability. The "right" comparison between a Poisson and an approximating normal would be on the scale of the cdfs, so that you compare like with like; both would then be on the scale of probabilities.)
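A comparison on the cdf scale could be sketched as follows. The grid `k`, the colours and the summary `max_gap` are illustrative choices, and no continuity correction is applied.

```r
lambda <- 6
k <- 0:14

pois_cdf <- ppois(k, lambda)                            # P(X <= k) for the Poisson
norm_cdf <- pnorm(k, mean = lambda, sd = sqrt(lambda))  # normal cdf at the same points

# Poisson cdf as a step function, normal cdf as a smooth curve
plot(k, pois_cdf, type = "s", col = "steelblue4", ylim = c(0, 1),
     xlab = "x", ylab = "cumulative probability")
curve(pnorm(x, mean = lambda, sd = sqrt(lambda)), add = TRUE, col = "darkorange")

# Largest absolute difference between the two cdfs at the integer points
max_gap <- max(abs(pois_cdf - norm_cdf))
```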

kde2d density comparison

I have a question about kde2d (kernel density estimator). I am computing two different kde2d estimates for two different data sets in the same space of variables. When I compare them with filled.contour2 or contour, I see that the set with the lower density of points in a scatter plot (it also has fewer points in total, by a factor of 10) has higher density values on its contours. I was expecting the set with the higher point density to have higher density contour values, but as I said above that is not the case. Does it have to do with the choice of bandwidth (h)? I am using equal h for both, and I tried changing it, but the result did not change much. What is my error?
An example
a <- runif(1000, 5.0, 7.5)
b <- runif(1000, 2.0, 3.0)
c <- runif(100000, 5.0, 7.5)
d <- runif(100000, 2.0, 3.0)
library(MASS)
# same grid size and bandwidth for both data sets
abdens <- kde2d(a, b, n = 100, h = 0.5)
cddens <- kde2d(c, d, n = 100, h = 0.5)
mylevels <- seq(-2.4, 30, 0.9)
# filled.contour2 is not part of base R or MASS; it has to be defined separately
filled.contour2(abdens, xlab = "a", ylab = "b", xlim = c(5, 7.5), ylim = c(2, 3),
                col = hsv(seq(0, 1, length = length(mylevels))))
plot(a, b)
contour(abdens, nlevels = 5, add = TRUE, col = "blue")
plot(c, d)
contour(cddens, nlevels = 5, add = TRUE, col = "orange")
I'm not sure I agree that the densities should be different in the uniform case. I would have expected a set with a higher number of points drawn at random from a Normal distribution to have more support for extreme regions and therefore a lower (estimated) density in the center. That effect might also occasionally be apparent with bivariate Uniform draws with 1,000 points versus 100,000. I hope my modifications to your code are acceptable. It's easier to see the contours if they are drawn after the plots.
(The theoretical density would be the same in both cases, since a density is normalized to an integral of 1.0. We are only looking at estimates, with some expected artifacts from "edge" effects. In the case of univariate densities, information about bounds can be supplied to the density functions in the logspline package.)
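The normalization point can be checked numerically. The sketch below is not the modified code mentioned above: the sample sizes are reduced to keep it light, and the grid limits `lims` are an assumption chosen so that the grid extends past the data and very little kernel mass is cut off.

```r
library(MASS)
set.seed(1)

# Two uniform samples on the same rectangle, differing only in size
small_x <- runif(1000, 5, 7.5);  small_y <- runif(1000, 2, 3)
large_x <- runif(20000, 5, 7.5); large_y <- runif(20000, 2, 3)

# Common evaluation grid, wider than the data range in both directions
lims <- c(4, 8.5, 1, 4)
small_dens <- kde2d(small_x, small_y, n = 60, h = 0.5, lims = lims)
large_dens <- kde2d(large_x, large_y, n = 60, h = 0.5, lims = lims)

# Approximate volume under each estimate: sum of heights times grid cell area
cell_area <- diff(small_dens$x[1:2]) * diff(small_dens$y[1:2])
vol_small <- sum(small_dens$z) * cell_area
vol_large <- sum(large_dens$z) * cell_area
```

Both volumes should come out close to 1, so the contour heights reflect how the points are spread over the plane and not how many points there are.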

Calculating the volume under a surface

I have created a 3D plot (a surface) using the wireframe function. Is there any function with which I can calculate the volume under the surface in a 3D plot?
Here is a sample of my data plus the wireframe syntax I used to create my 3D (surface) plot:
x1<-c(13,27,41,55,69,83,97,111,125,139)
x2<-c(27,55,83,111,139,166,194,222,250,278)
x3<-c(41,83,125,166,208,250,292,333,375,417)
x4<-c(55,111,166,222,278,333,389,445,500,556)
x5<-c(69,139,208,278,347,417,487,556,626,695)
x6<-c(83,166,250,333,417,500,584,667,751,834)
x7<-c(97,194,292,389,487,584,681,779,876,974)
x8<-c(111,222,333,445,556,667,779,890,1001,1113)
x9<-c(125,250,375,500,626,751,876,1001,1127,1252)
x10<-c(139,278,417,556,695,834,974,1113,1252,1391)
df<-data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)
df.matrix<-as.matrix(df)
library(lattice)  # wireframe() is provided by lattice
wireframe(df.matrix,
aspect = c(61/87, 0.4), scales = list(arrows = FALSE, cex = .5, tick.number = 10, z = list(arrows = TRUE)),
ylim = c(1:10), xlab = expression(phi1), ylab = "Percentile", zlab = " Loss", main = "Random Classifier",
light.source = c(10, 10, 10), drape = TRUE,
col.regions = rainbow(100, s = 1, v = 1, start = 0, end = max(1, 100 - 1)/100, alpha = 1), screen = list(z = -60, x = -60))
Note: my real data is a 100x100 matrix.
Thanks
The data you are feeding to wireframe is a grid of values. Hence one estimate of the volume of whatever underlying surface this is approximating is the sum of the grid values multiplied by the grid cell areas. This is just like adding up the heights of histogram bars to get the number of values in your histogram.
The problem I see with you doing this on your data is that the cell areas are going to be in odd units - percentiles on one axis, phi on the other has unknown units, so your volume is going to have units of loss times units of percentile times units of phi.
This isn't a problem if you want to compare volumes of similar things on exactly the same grid, but if you have surfaces on different grids (different values of phi, or different percentiles) then you need to be careful.
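The histogram-style estimate can be sketched in a few lines. The helper name `grid_volume` and the cell sizes `dx` and `dy` are hypothetical; the real spacing depends on the units of phi and percentile, which are not given here.

```r
# Volume estimate: sum of the grid heights times the area of one grid cell.
# dx and dy are the (assumed constant) grid spacings along the two axes.
grid_volume <- function(z, dx = 1, dy = 1) {
  sum(z) * dx * dy
}

# Made-up 3x3 grid of constant height 2
z_demo <- matrix(2, nrow = 3, ncol = 3)
vol_unit  <- grid_volume(z_demo)                     # unit cells
vol_other <- grid_volume(z_demo, dx = 0.5, dy = 2)   # different cell shape, same cell area
```

For the matrix in the question the call would be grid_volume(df.matrix, dx, dy), with dx and dy set to the actual spacing of the phi and percentile values.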
Now, noting that wireframe doesn't draw like a 3d histogram would (looking like square tower blocks), this gives us another way to estimate the volume. Your 10x10 matrix is plotted as 9x9 squares. Divide each of those squares into two triangles and then compute the volume of the resulting 162 right truncated triangular prisms (I think this is what they are: triangular prisms with a right-angled base and one sloping end). The formula for that should be out there somewhere. It should be the base area times the height at the centroid of the triangle, which for a flat sloping top is the mean of the three corner heights.
I thought maybe this would be in the raster package, but it isn't. There's code for computing the surface area but not the volume! I'm sure the raster maintainer would be happy to have some code for this!
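A sketch of the triangle-based estimate, assuming a regular grid with spacings `dx` and `dy` and assuming each square is split along the diagonal from its first corner to the opposite one (the other diagonal generally gives a slightly different answer). The function name `prism_volume` is made up for this example.

```r
# Volume under a piecewise-planar surface over a regular grid.
# Each grid square is split into two triangles; each truncated prism has
# volume = triangle area * mean of its three corner heights.
prism_volume <- function(z, dx = 1, dy = 1) {
  nr <- nrow(z)
  nc <- ncol(z)
  z00 <- z[-nr, -nc, drop = FALSE]  # corner (i,   j)
  z10 <- z[-1,  -nc, drop = FALSE]  # corner (i+1, j)
  z01 <- z[-nr, -1,  drop = FALSE]  # corner (i,   j+1)
  z11 <- z[-1,  -1,  drop = FALSE]  # corner (i+1, j+1)
  tri_area <- dx * dy / 2
  sum(tri_area * ((z00 + z10 + z11) / 3 + (z00 + z01 + z11) / 3))
}

# Checks on made-up surfaces: a flat surface and a tilted plane
vol_flat  <- prism_volume(matrix(2, 10, 10))           # 81 unit squares of height 2
vol_plane <- prism_volume(matrix(rep(1:3, 3), 3, 3))   # z equal to the row index
```

Because the surface is treated as planar on each triangle, the result is exact for a plane, which makes a tilted plane a convenient sanity check.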
If the points are arbitrary (i.e., don't follow a smooth function), it seems like you're looking for the volume of the convex hull (minimum surface) surrounding these points. One package to help you calculate this is alphashape3d.
You'll need a 3-column matrix of the coordinates to form the right type of object to make the calculation but it seems rather straight-forward.

Plot histogram of Cauchy distribution in R

I'm trying to plot a histogram of the Cauchy distribution in R using the following code:
X = rcauchy(10^5)
hist(X)
and no matter what options I try in the hist() function, I can never see more than two bars on my histogram (basically one for negative values and one for positive values).
It works fine, however, when I use the normal distribution (or others).
This results from the properties of the distribution.
Most values are relatively close to zero, but very large absolute values are much more probable than for the normal distribution. About 1% of the values have an absolute value greater than 50, and about 0.1% greater than 500.
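Those tail proportions can be checked against the theoretical distribution. A small sketch, with `cauchy_tail` being a helper name invented here:

```r
# P(|X| > t) for a standard Cauchy, using symmetry about 0
cauchy_tail <- function(t) 2 * pcauchy(-t)

p50  <- cauchy_tail(50)    # share of values beyond +/- 50
p500 <- cauchy_tail(500)   # share of values beyond +/- 500

# For comparison with the normal distribution, at a much smaller threshold
cauchy_tail5 <- cauchy_tail(5)
normal_tail5 <- 2 * pnorm(-5)
```

Printing `p50` and `p500` gives the theoretical proportions, and the last two values show how much heavier the Cauchy tails are than the normal ones even at a threshold of 5.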
Try plotting only part of the values:
hist(X[abs(X)<1])
hist(X[abs(X)<5])
hist(X[abs(X)<50])
hist(X)
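To see how well a truncated histogram matches the Cauchy shape, the density can be overlaid after renormalising it to the plotted window. The cutoff of 10 and the bin width of 0.25 below are arbitrary choices for illustration.

```r
set.seed(123)
X <- rcauchy(10^5)

cutoff <- 10                      # hypothetical cutoff for the plotted window
kept   <- X[abs(X) < cutoff]

# Histogram on the density scale, with explicit equally spaced breaks
h <- hist(kept, breaks = seq(-cutoff, cutoff, by = 0.25), freq = FALSE,
          main = "Cauchy sample, |X| < 10", xlab = "X")

# The histogram only covers (-cutoff, cutoff), so rescale the Cauchy density
# by the probability of falling inside that window before overlaying it
window_prob <- pcauchy(cutoff) - pcauchy(-cutoff)
curve(dcauchy(x) / window_prob, add = TRUE, col = "red")
```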
You can also look at the cumulative distribution function:
plot(ecdf(X))
And check the boxplot:
boxplot(X)

Density plot in R, ggplot2

I am trying to plot and compare two sets of decimal numbers, between 0 and 1 using the R package, ggplot2. When I plotted using geom="density" in qplot, I noticed that the density curve goes past 1.0. I would like to have a density plot for the data that does not exceed the value range of the set, ie, all the area stays between 0 and 1.
Is it possible to plot the density between the values 0 and 1, without going past 1 or 0? If so, how would I accomplish this? I need the area of the two plots to be equal between 0 and 1, the range of the data.
Here is the code I used to generate the plots.
Right: qplot(precision,data = compare, fill=factor(dataset),binwidth = .05,geom="density", alpha=I(0.5))+ xlim(-1,2)
Left:qplot(precision,data = compare, fill=factor(dataset),binwidth = .05,geom="density", alpha=I(0.5))
You might consider using a different tool to estimate the density (the built-in density functions do not consider bounds), then use ggplot2 to plot the estimated densities. The logspline package has tools that will estimate densities (using a different algorithm than density does), and you can tell the functions that your density is bounded between 0 and 1 so that they take this into consideration when estimating the densities. Then use ggplot2 (or other code) to compare the estimated densities.
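If installing another package is not an option, a cruder base-R alternative is to evaluate density only on [0, 1] and renormalise the curve so that its area over that range is 1. This is not the logspline approach described above and it does not correct the shape near the bounds; the data below are a made-up stand-in for the precision values.

```r
set.seed(42)
precision <- rbeta(500, 2, 1)     # hypothetical stand-in data in [0, 1]

# Evaluate the usual kernel estimate on [0, 1] only
d <- density(precision, from = 0, to = 1, n = 512)

# Trapezoidal approximation of the area under the curve
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area_inside <- trapz(d$x, d$y)    # below 1: some kernel mass lies outside [0, 1]
d$y <- d$y / area_inside          # rescale so the area over [0, 1] is 1

plot(d$x, d$y, type = "l", xlab = "precision", ylab = "density")
```

Doing this for each data set puts both curves on the same footing, with equal area between 0 and 1; the rescaled x and y values can also be put in a data frame and drawn with ggplot2.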
