2D density plot from curves in R

I have a multi-parameter function on which I infer the parameters using MCMC. This means that I have many samples of the parameters, and I can plot the functions:
# Simulate some parameters. Really, I get these from MCMC sampling.
first = rnorm(1000) # a
second = rnorm(1000) # b
# The function (geometric)
geometric = function(x, a, b) b*(1 - a^(x + 1)/a)
# Plot curves. Perhaps not the most efficient way, but it works.
curve(geometric(x, first[1], second[1]), ylim=c(-3, 3)) # first curve
for(i in 2:length(first)) {
  curve(geometric(x, first[i], second[i]), add=TRUE, col='#00000030') # add the others
}
How do I make this into a density plot instead of plotting the individual curves? For example, it's hard to see just how much denser it is around y=0 than around other values.
The following would be nice:
The ability to draw observed values on top (points and lines).
Drawing a contour line in the density, e.g. the 95% Highest Posterior Density interval or the 2.5 and 97.5 quantiles.
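For what it's worth, here is one hedged sketch of an approach (not from the original thread): evaluate every sampled curve on a common x grid, pool all the (x, y) pairs, and hand them to MASS::kde2d; pointwise 2.5% and 97.5% quantile bands then come straight from the curve matrix. Grid size and limits below are illustrative.
library(MASS) # for kde2d
xs = seq(0, 1, length.out = 100) # curve()'s default x-range
ys = mapply(function(a, b) geometric(xs, a, b), first, second) # 100 x 1000 matrix of curves
xy = data.frame(x = rep(xs, times = length(first)), y = as.vector(ys))
xy = xy[is.finite(xy$y) & abs(xy$y) < 3, ] # drop NaN (negative a) and clip to the ylim used above
dens = kde2d(xy$x, xy$y, n = 100, lims = c(0, 1, -3, 3))
image(dens, col = grey(seq(1, 0, length.out = 64)), xlab = "x", ylab = "y")
# Pointwise 2.5% and 97.5% quantiles of the curves at each x
qs = apply(ys, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
lines(xs, qs[1, ], lty = 2)
lines(xs, qs[2, ], lty = 2)
# Observed values can be overlaid with points() / lines() as usual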

Related

Filling a curve with points that fit under the curve in R plot

I was wondering how I can efficiently (using short R code) fill the area under a curve with points?
I have tried something without success, here is my R code:
data = rnorm(1000) ## random data points to fill the curve
curve(dnorm(x), -4, 4) ## curve to be filled by "data" above
points(data) ## this plots data against its index, so it does not fill the curve
Here's a method that uses interpolation to ensure that the plotted points won't exceed the height of the curve (although, if you want the actual point markers to not stick out above the curve, you'll need to set the threshold slightly below the height of the curve):
# Curve to be filled
c.pts = as.data.frame(curve(dnorm(x), -4, 4))
# Generate 1000 random points in the same x-interval and with y value between
# zero and the maximum y-value of the curve
set.seed(2)
pts = data.frame(x=runif(1000,-4,4), y=runif(1000,0,max(c.pts$y)))
# Using interpolation, keep only those points whose y-value is less than y(x)
pts = pts[pts$y < approx(c.pts$x,c.pts$y,xout=pts$x)$y, ]
# Plot the points
points(pts, pch=16, col="red", cex=0.7)
A method for plotting exactly a desired number of points under a curve
Responding to @d.b's comment, here's a way to get exactly a desired number of points plotted under a curve:
First, let's figure out how many random points we need to generate over the entire plot region in order to get (roughly) a target number of points under the curve. We do this as follows:
Calculate the area under the curve as a fraction of the area of the rectangle bounded by zero and the maximum height of the curve on the vertical axis, and by the width of the curve on the horizontal axis.
The number of random points we need to generate is the target number of points, divided by the area ratio calculated above.
# Area ratio
aa = sum(c.pts$y*median(diff(c.pts$x)))/(diff(c(-4,4))*max(c.pts$y))
# Target number of points under curve
n.target = 1000
# Number of random points to generate
n = ceiling(n.target/aa)
But we need more points than this to be sure of getting at least n.target, because random variation will leave us with fewer than n.target points about half the time once we keep only the points below the curve. So we add an excess.factor to generate more points under the curve than we need, and then randomly select n.target of those points to plot. Here's a function that takes care of the entire process for a general curve.
# Plot a specified number of points under a curve
pts.under.curve = function(data, n.target=1000, excess.factor=1.5) {
  # Area under curve as fraction of area of plot region
  aa = sum(data$y*median(diff(data$x)))/(diff(range(data$x))*max(data$y))
  # Number of random points to generate
  n = excess.factor*ceiling(n.target/aa)
  # Generate n random points in x-range of the data and with y value between
  # zero and the maximum y-value of the curve
  pts = data.frame(x=runif(n, min(data$x), max(data$x)), y=runif(n, 0, max(data$y)))
  # Using interpolation, keep only those points whose y-value is less than y(x)
  pts = pts[pts$y < approx(data$x, data$y, xout=pts$x)$y, ]
  # Randomly select only n.target points
  pts = pts[sample(1:nrow(pts), n.target), ]
  # Plot the points
  points(pts, pch=16, col="red", cex=0.7)
}
Let's run the function for the original curve:
c.pts = as.data.frame(curve(dnorm(x), -4, 4))
pts.under.curve(c.pts)
Now let's test it with a different distribution:
# Curve to be filled
c.pts = as.data.frame(curve(df(x, df1=100, df2=20),0,5,n=1001))
pts.under.curve(c.pts, n.target=200)
An alternative approach matches each random point to the nearest curve value with findInterval():
n_points = 10000 # a large number
# Store curve in a variable and plot
cc = curve(dnorm(x), -4, 4, n = n_points)
# Generate n_points candidate points: x on a regular grid, y ~ N(0,1)
p = data.frame(x = seq(-4, 4, length.out = n_points), y = rnorm(n = n_points))
#OR p = data.frame(x = runif(n_points,-4,4), y = rnorm(n = n_points))
#Find out the index of values in cc$x closest to p$x
p$ind = findInterval(p$x, cc$x)
# Only retain points under the curve, i.e. those whose p$y is smaller than the matching cc$y
p2 = p[p$y >= 0 & p$y < cc$y[p$ind],] #may need p[p$y < 0.90 * cc$y[p$ind],] or something
#Plot points
points(p2$x, p2$y)

confusion on 2 dimension kernel density estimation in R

A kernel density estimator is used to estimate a particular probability density function (see mvstat.net and the scikit-learn docs for references).
My confusion is about what exactly kde2d() does. Does it estimate the joint probability density function of the two random variables, f(a,b), in the example below? And what does the colour mean?
Here is the code example I am referring to.
library(MASS) # kde2d comes from the MASS package
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
density <- kde2d(a, b, n=100)
colour_flow <- colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred'))
filled.contour(density, color.palette=colour_flow)
What is a kernel density estimator?
Essentially, it places a little normal density curve over every data point (the center of each normal density being that point) and then adds up all of the little normal densities into the kernel density estimate.
For the sake of illustration, one of your links shows an image of a 1-dimensional kernel density estimator built this way (image not reproduced here).
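Since that image isn't shown, here is a small hand-rolled sketch of the same idea in code (an illustration of the principle, not density()'s exact internals):
# Sum-of-normals sketch of a 1D kernel density estimate
x <- log10(rgamma(1000, 6, 3))
h <- bw.nrd0(x) # the rule-of-thumb bandwidth density() uses by default
gx <- seq(min(x) - 3*h, max(x) + 3*h, length.out = 512)
manual <- sapply(gx, function(g) mean(dnorm(g, mean = x, sd = h)))
plot(gx, manual, type = "l", col = "red", lwd = 2)
lines(density(x), lty = 2) # built-in estimate, for comparison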
What about 2 dimensional kernel densities?
library(MASS)
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
# a and b contain 1000 values each.
density <- kde2d(a,b,n=100)
The function creates a grid from min(a) to max(a) and from min(b) to max(b). Instead of fitting a tiny 1D normal density over every value in a or b, kde2d now fits a tiny 2D normal density over every point in the grid. Just like in the 1-dimensional case, it then adds up all of the density values.
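To see this structure concretely, here is a quick inspection of the result (output abridged; exact numbers will vary):
str(density)
# List of 3
#  $ x: num [1:100] ...        grid points along a
#  $ y: num [1:100] ...        grid points along b
#  $ z: num [1:100, 1:100] ... estimated joint density f(a, b) at each grid point
So yes, kde2d() estimates the joint probability density function f(a, b) on a regular grid.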
What do the colours mean?
As @cel pointed out in the comments: the estimated density depends on two variables, so we now have three axes (a, b, and the estimated density). One way to visualize three axes is with iso-probability contours. This sounds fancy, but it is basically the same as the high/low pressure maps we know from the weather forecast.
You are using
filled.contour(density,
               color.palette = colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred')))
So from low to high, the plot is coloured white, blue, yellow, red, and eventually dark red for the highest values of the estimated density. (The resulting filled contour plot is not reproduced here.)

R: area under curve of ogive?

I have an algorithm that uses an x,y plot of sorted y data to produce an ogive.
I then compute the area under the curve to derive percentages.
I'd like to do something similar using kernel density estimation. I like how the upper/lower bounds are smoothed out using kernel densities (i.e. the min and max will extend slightly beyond my hard-coded input).
Either way... I was wondering if there is a way to treat an ogive as a type of cumulative distribution function and/or use kernel density estimation to derive a cumulative distribution function given y data?
I apologize if this is a confusing question. I know there is a way to derive a cumulative frequency graph (i.e. an ogive). However, I can't determine how to derive a percentage from this cumulative frequency graph.
What I don't want is an ecdf. I know how to do that, and I am not quite trying to capture an ecdf; rather, I want the integral of an ogive between two intervals.
I'm not exactly sure what you have in mind, but here's a way to calculate the area under the curve for a kernel density estimate, or more generally for any case where you have y values at equally spaced x values (you can, of course, generalize to variable x intervals as well):
library(zoo)
# Kernel density estimate
# Set n to higher value to get a finer grid
set.seed(67839)
dens = density(c(rnorm(500,5,2),rnorm(200,20,3)), n=2^5)
# How to extract the x and y values of the density estimate
#dens$y
#dens$x
# x interval
dx = median(diff(dens$x))
# mean height for each pair of y values
h = rollmean(dens$y, 2)
# Area under curve
sum(h*dx) # 1.000943
# Cumulative area
# cumsum(h*dx)
# Plot density, showing points at which density is calculated
plot(dens)
abline(v=dens$x, col="#FF000060", lty="11")
# Plot cumulative area under curve, showing mid-point of each x-interval
plot(dens$x[-length(dens$x)] + 0.5*dx, cumsum(h*dx), type="l")
abline(v=dens$x[-length(dens$x)] + 0.5*dx, col="#FF000060", lty="11")
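Building on this, the proportion of the distribution between two x values (the "integral of an ogive given two intervals" from the question) can be read off by interpolating the cumulative curve; the endpoints 0 and 15 below are purely illustrative:
# Interpolate the cumulative curve to get the proportion between two x-values
cum.x = dens$x[-length(dens$x)] + 0.5*dx
Fcum = approxfun(cum.x, cumsum(h*dx), yleft=0, yright=sum(h*dx))
Fcum(15) - Fcum(0) # estimated fraction of the distribution between 0 and 15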
UPDATE to include ecdf function
To address your comments, look at the two plots below. The first is the empirical cumulative distribution function (ECDF) of the mixture of normal distributions that I used above. Note that the plot of this data looks the same below as it does above. The second is a plot of the ECDF of a plain vanilla normal distribution, mean=0, sd=1.
set.seed(67839)
x = c(rnorm(500,5,2),rnorm(200,20,3))
plot(ecdf(x), do.points=FALSE)
plot(ecdf(rnorm(1000)))
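Since ecdf() returns an actual function, percentages between two values also fall out directly (a one-line aside, using the mixture data above):
Fn = ecdf(x)
Fn(15) - Fn(0) # proportion of observations in (0, 15]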

How to estimate the area of 95% contour of a kde object from ks R package

I'm trying to estimate the area of the 95% contour of a kde object from the ks package in R.
If I use the example data set from the ks package, I would create the kernel object as follow:
library(ks)
data(unicef)
H.scv <- Hscv(x=unicef)
fhat <- kde(x=unicef, H=H.scv)
I can easily plot the 25, 50, and 75% contours using the plot function:
plot(fhat)
But I want to estimate the area within the contour.
I saw a similar question here, but the answer proposed does not solve the problem.
In my real application, my dataset is a time series of coordinates of an animal, and I want to measure the home range size of this animal using a bivariate normal kernel. I'm using the ks package because it allows estimating the bandwidth of a kernel distribution with methods such as plug-in and smoothed cross-validation.
Any help would be really appreciated!
Here are two ways to do it. They are both fairly complex conceptually, but actually very simple in code.
fhat <- kde(x=unicef, H=H.scv,compute.cont=TRUE)
contour.95 <- with(fhat, contourLines(x=eval.points[[1]], y=eval.points[[2]],
                                      z=estimate, levels=cont["95%"])[[1]])
library(pracma)
with(contour.95,polyarea(x,y))
# [1] -113.677
library(sp)
library(rgeos)
poly <- with(contour.95,data.frame(x,y))
poly <- rbind(poly,poly[1,]) # polygon needs to be closed...
spPoly <- SpatialPolygons(list(Polygons(list(Polygon(poly)),ID=1)))
gArea(spPoly)
# [1] 113.677
Explanation
First, the kde(...) function returns a kde object, which is a list with 9 elements. You can read about this in the documentation, or you can type str(fhat) at the command line, or, if you're using RStudio (highly recommended), you can see this by expanding the fhat object in the Environment tab.
One of the elements is $eval.points, the points at which the kernel density estimates are evaluated. The default is to evaluate at 151 equally spaced points. $eval.points is itself a list of, in your case, two vectors. So fhat$eval.points[[1]] represents the points along "Under-5" and fhat$eval.points[[2]] represents the points along "Ave life exp".
Another element is $estimate, which has the z-values for the kernel density, evaluated at every combination of x and y. So $estimate is a 151 × 151 matrix.
If you call kde(...) with compute.cont=TRUE, you get an additional element in the result: $cont, which contains the z-value in $estimate corresponding to every percentile from 1% to 99%.
So, you need to extract the x- and y-values corresponding to the 95% contour, and use that to calculate the area. You would do that as follows:
fhat <- kde(x=unicef, H=H.scv,compute.cont=TRUE)
contour.95 <- with(fhat, contourLines(x=eval.points[[1]], y=eval.points[[2]],
                                      z=estimate, levels=cont["95%"])[[1]])
Now, contour.95 has the x- and y-values corresponding to the 95% contour of fhat. There are (at least) two ways to get the area. One uses the pracma package and calculates it directly.
library(pracma)
with(contour.95,polyarea(x,y))
# [1] -113.677
The reason for the negative value has to do with the ordering of x and y: polyarea(...) is interpreting the polygon as a "hole", so it has negative area.
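If the vertex ordering isn't known in advance, a simple guard is to take the absolute value:
abs(with(contour.95, polyarea(x, y)))
# [1] 113.677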
An alternative uses the area calculation routines in rgeos (a GIS package). Unfortunately, this requires you to first turn your coordinates into a "SpatialPolygon" object, which is a bit of a bear. Nevertheless, it is also straightforward.
library(sp)
library(rgeos)
poly <- with(contour.95,data.frame(x,y))
poly <- rbind(poly,poly[1,]) # polygon needs to be closed...
spPoly <- SpatialPolygons(list(Polygons(list(Polygon(poly)),ID=1)))
gArea(spPoly)
# [1] 113.677
Another method would be to use the contourSizes() function within the ks package. I've also been interested in using this package to compare both 2D and 3D space use in ecology, but I wasn't sure how to extract the 2D density estimates. I tested this method by estimating the area of an "animal" which was limited to the area of a circle with a known radius. Below is the code:
set.seed(123)
require(GEOmap) # need this library for the inpoly function
require(ks)
# Create a data frame centered at coordinates 0,0
data = data.frame(x=0,y=0)
# Create a vector of radians from 0 to 2*pi for making a circle to
# test the area
circle = seq(0,2*pi,length=100)
# Select a radius for your circle
radius = 10
# Create a buffer for when you simulate points (this will be more clear below)
buffer = radius+2
# Simulate x and y coordinates from uniform distribution and combine
# values into a dataframe
createPointsX = runif(1000,min = data$x-buffer, max = data$x+buffer)
createPointsY = runif(1000,min = data$y-buffer, max = data$y+buffer)
data1 = data.frame(x=createPointsX,y=createPointsY)
# Plot the raw data
plot(data1$x,data1$y)
# Calculate the coordinates used to create a circle with center 0,0 and
# with radius specified above
coords = as.data.frame(t(rbind(data$x + sin(circle)*radius,
                               data$y + cos(circle)*radius)))
names(coords) = c("x","y")
# Add circle to plot with red line
lines(coords$x,coords$y,col=2,lwd=2)
# Use the inpoly function to calculate whether points lie within
# the circle or not.
inp = inpoly(data1$x, data1$y, coords)
data1 = data1[inp == 1,]
# Finally add points that lie with the circle as blue filled dots
points(data1$x,data1$y,pch=19,col="blue")
# Known area of the circle
pi * radius^2
#[1] 314.1593
# Sub in your own data here to calculate 95% homerange or 50% core area usage
H.pi = Hpi(data1,binned=T)
fhat = kde(data1,H=H.pi)
ct1 = contourSizes(fhat, cont = 95, approx=TRUE)
# Compare the known area of the circle to the 95% contour size
ct1
# 5%
# 291.466
I've also tried creating 2 un-connected circles and testing the contourSizes() function and it seems to work really well on disjointed distributions.
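For reference, here is a hedged sketch of that two-circle test, reusing radius and buffer from the simulation above (the centers are arbitrary):
# Two disjoint circles: simulate points in each, pool them, and check that
# the 95% contour area is roughly 2 * pi * radius^2
centers = data.frame(x = c(0, 40), y = c(0, 0))
pts = do.call(rbind, lapply(1:2, function(i) {
  px = runif(1000, centers$x[i] - buffer, centers$x[i] + buffer)
  py = runif(1000, centers$y[i] - buffer, centers$y[i] + buffer)
  keep = (px - centers$x[i])^2 + (py - centers$y[i])^2 <= radius^2
  data.frame(x = px[keep], y = py[keep])
}))
fhat2 = kde(pts, H = Hpi(pts, binned = TRUE))
contourSizes(fhat2, cont = 95, approx = TRUE) # expect roughly 2 * 314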

Find the ranges of minimal and maximal gradients on sigmoidal curve in R

First, this is my first Stack Overflow question, so I apologize for any breaches of decorum. Second, I realize this may be trivial, but I'm stumped. I'm trying to figure out how to find the minimum and maximum gradients on a sigmoidal curve.
I have a function that generates a vector of y values that form a sigmoidal curve:
# Function to generate sigmoid curves - works better with enough x values to be smooth
genSigmoid = function(a, b, c, theta){
  y = c + ((1 - c) / (1 + exp(-a*(theta - b))))
  return(y)
}
x<-c(1:100)
y<-genSigmoid(.25, .50, 0, x)
plot(x, y, type="n")
lines(x, y)
What I would like to do is find the points along this curve where the gradient is smallest or zero, and the points where the gradient is largest. My ultimate goal is to plot the different sections of this curve with different line styles according to the strength of the gradient along the curve. I can generate these different styles by 'eye-balling' it, but it would be nice to have something that can do this more precisely.
You could do this using the grad(...) function in package numDeriv.
genSigmoid = function(theta, pars){
  y <- with(pars, c + ((1 - c) / (1 + exp(-a*(theta - b)))))
  return(y)
}
x<-c(1:100)
pars <- list(a=0.25, b=0.50, c=0.0)
y<-genSigmoid(x, pars)
plot(x, y, type="l", ylim=c(0,1), col="blue")
library(numDeriv)
z<-grad(genSigmoid,x,pars=pars)
lines(x,z,col="red")
Here z is a vector of the derivative of genSigmoid(...) with respect to theta.
I redefined your function a bit to make the calling sequence simpler (combined the parameters into a named list, and reversed the order of the arguments).
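As an optional sanity check (not part of the original answer), the analytic derivative of this logistic form can be compared against the numeric gradient:
# Analytic derivative of genSigmoid with respect to theta
dSigmoid = function(theta, pars) {
  with(pars, a * (1 - c) * exp(-a*(theta - b)) / (1 + exp(-a*(theta - b)))^2)
}
max(abs(z - dSigmoid(x, pars))) # should be near zero, up to numerical tolerance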
Plotting segments of the curve with different line styles is a bit trickier:
lt <- as.integer(3*(z-min(z))/diff(range(z))+1)
df <- data.frame(x,y,z,lt)
plot(x,y,type="n")
lapply(split(df,df$lt),function(df)with(df,lines(x,y,lty=lt)))
So this creates a vector of line types (1,2,3, or 4) based on the value of the derivative, then splits the data based on line type, and plots the segments.
