How can I get the area under overlapping density curves?
How can I solve the problem with R? (There is a solution for python here: Calculate overlap area of two functions )
set.seed(1234)
df <- data.frame(
sex=factor(rep(c("F", "M"), each=200)),
weight=round(c(rnorm(200, mean=55, sd=5),
rnorm(200, mean=65, sd=5)))
)
(Source: http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization )
ggplot(df, aes(x=weight, color=sex, fill=sex)) +
geom_density(aes(y=..density..), alpha=0.5)
"The points used in the plot are returned by ggplot_build(), so you can access them." So now, I have the points, and I can feed them to approxfun, but my problem is that i don't know how to subtract the density functions.
Any help greatly appreciated! (And I believe in high demand, there is no solution for this readily available.)
I was looking for a way to do this for empirical data, and had the problem of multiple intersections as mentioned by user5878028. After some digging I found a very simple solution, even for a total R noob like me:
Install and load the libraries "overlapping" (which performs the calculation) and "lattice" (which displays the result):
library(overlapping)
library(lattice)
Then define a variable "x" as a list that contains the two density distributions you want to compare. For this example, the two datasets "data1" and "data2" are both columns in a text file called "yourfile":
x <- list(X1=yourfile$data1, X2=yourfile$data2)
Then just tell it to display the output as a plot which will also display the estimated % overlap:
out <- overlap(x, plot=TRUE)
I hope this helps someone like it helped me! Here's an example overlap plot
I will make a few base R plots, but the plots are not actually part of
the solution. They are just there to confirm that I am getting the right
answer.
You can get each of the density functions and solve for where they intersect.
## Create the two density functions and display
FDensity = approxfun(density(df$weight[df$sex=="F"], from=40, to=80))
MDensity = approxfun(density(df$weight[df$sex=="M"], from=40, to=80))
plot(FDensity, xlim=c(40,80), ylab="Density")
curve(MDensity, add=TRUE)
Now solve for the intersection
## Solve for the intersection and plot to confirm
FminusM = function(x) { FDensity(x) - MDensity(x) }
Intersect = uniroot(FminusM, c(40, 80))$root
points(Intersect, FDensity(Intersect), pch=20, col="red")
Now we can just integrate to get the area of the overlap.
integrate(MDensity, 40,Intersect)$value +
integrate(FDensity, Intersect, 80)$value
[1] 0.2952838
The above two proposed methods give different results.
If the data in the first answer is given to the overlap function it will result in overlap% of 0.18, while the first one results in overlap% of 0.29.
X1 = df$weight[df$sex=="F"]
X2 = df$weight[df$sex=="M"]
x=list(X1=X1, X2=X2)
out <- overlap(x, plot=TRUE)
out$OV
X1-X2
0.1754
Related
To create a parallel coordinate plot I wanted to use ggparcoord() function in package GGally. The following codes show a reproducible example.
set.seed(3674)
k <- rep(1:3, each=30)
x <- k + rnorm(mean=10, sd=.2,n=90)
y <- -2*k + rnorm(mean=10, sd=.4,n=90)
z <- 3*k + rnorm(mean=10, sd=.6,n=90)
dat <- data.frame(group=factor(k),x,y,z)
library(GGally)
ggparcoord(dat,columns=1:4,groupColumn = 1)
Notice in the picture that the color for group was continuous even though I have the group variable as a factor. Is there any way I can display the plot with three discrete color instead?
I have looked at some other posts where they discuss various other ways of doing parallel coordinate plots in here. But I really wanted to do this in ggparcoord() function of package GGally. I appreciate your time in thinking about this problem.
Your code was almost correct. I spotted that columns=1:4 was not right in this case. You need to drop the column for groupColumn in columns
ggparcoord(dat,columns=2:4,groupColumn = 1)
I've run a 2d simulation in some modelling software from which i've got an export of x,y point locations with a set of 6 attributes. I wish to recreate a figure that combines the data, like this:
The ellipses and the background are shaded according to attribute 1 (and the borders of these are of course representing the model geometry, but I don't think I can replicate that), the isolines are contours of attribute 2, and the arrow glyphs are from attributes 3 (x magnitude) and 4 (y magnitude).
The x,y points are centres of the triangulated mesh I think, and look like this:
I want to know how I can recreate a plot like this with R. To start with I have irregularly-spaced data due to it being exported from an irregular mesh. That's immediately where I get stuck with R, having only ever used it for producing box-and-whisper plots and the like.
Here's the data:
https://dl.dropbox.com/u/22417033/Ellipses_noheader.txt
Edit: fields: x, y, heat flux (x), heat flux (y), thermal conductivity, Temperature, gradT (x), gradT (y).
names(Ellipses) <- c('x','y','dfluxx','dfluxy','kxx','Temps','gradTx','gradTy')
It's quite easy to make the lower plot (making the assumption that there is a dataframe named 'edat' read in with:
edat <- read.table(file=file.choose())
with(edat, plot(V1,V2), cex=0.2)
Things get a bit more beautiful with:
with(edat, plot(V1,V2, cex=0.2, col=V5))
So I do not think your original is being faithfully represented by the data. The contour lines are NOT straight across the "conductors". I call them "conductors" because this looks somewhat like iso-potential lines in electrostatics. I'm adding some text here to serve as a search handle for others who might be searching for plotting problems in real world physics: vector-field (the arrows) , heat equations, gradient, potential lines.
You can then overlay the vector field with:
with(edat, arrows(V1,V2, V1-20*V6*V7, V2-20*V6*V8, length=0.04, col="orange") )
You could"zoom in" with xlim and ylim:
with(edat, plot(V1,V2, cex=0.3, col=V5, xlim=c(0, 10000), ylim=c(-8000, -2000) ))
with(edat, arrows(V1,V2, V1-20*V6*V7, V2-20*V6*V8, length=0.04, col="orange") )
Guessing that the contour requested if for the Temps variable. Take your pick of contourplots.
require(akima)
intflow<- with(edat, interp(x=x, y=y, z=Temps, xo=seq(min(x), max(x), length = 410),
yo=seq(min(y), max(y), length = 410), duplicate="mean", linear=FALSE) )
require(lattice)
contourplot(intflow$z)
filled.contour(intflow)
with( intflow, contour(x=x, y=y, z=z) )
The last one will mix with the other plotting examples since those were using base plotting functions. You may need to switch to points instead of plot.
There are several parts to your plot so you will probably need several tools to make the different parts.
The background and ellipses can be created with polygon (once you figure where they should be).
The contourLines function can calculate the contour lines for you which you can add with the lines function (or contour has and add argument and could probably be used to add the lines directly).
The akima package has a function interp which can estimate values on a grid given the values ungridded.
The my.symbols function along with ms.arrows, both from the TeachingDemos package, can be used to draw the vector field.
#DWin is right to say that your graph don't represent faithfully your data, so I would advice to follow his answer. However here is how to reproduce (the closest I could) your graph:
Ellipses <- read.table(file.choose())
names(Ellipses) <- c('x','y','dfluxx','dfluxy','kxx','Temps','gradTx','gradTy')
require(splancs)
require(akima)
First preparing the data:
#First the background layer (the 'kxx' layer):
# Here the regular grid on which we're gonna do the interpolation
E.grid <- with(Ellipses,
expand.grid(seq(min(x),max(x),length=200),
seq(min(y),max(y),length=200)))
names(E.grid) <- c("x","y") # Without this step, function inout throws an error
E.grid$Value <- rep(0,nrow(E.grid))
#Split the dataset according to unique values of kxx
E.k <- split(Ellipses,Ellipses$kxx)
# Find the convex hull delimiting each of those values domain
E.k.ch <- lapply(E.k,function(X){X[chull(X$x,X$y),]})
for(i in unique(Ellipses$kxx)){ # Pick the value for each coordinate in our regular grid
E.grid$Value[inout(E.grid[,1:2],E.k.ch[names(E.k.ch)==i][[1]],bound=TRUE)]<-i
}
# Then the regular grid for the second layer (Temp)
T.grid <- with(Ellipses,
interp(x,y,Temps, xo=seq(min(x),max(x),length=200),
yo=seq(min(y),max(y),length=200),
duplicate="mean", linear=FALSE))
# The regular grids for the arrow layer (gradT)
dx <- with(Ellipses,
interp(x,y,gradTx,xo=seq(min(x),max(x),length=15),
yo=seq(min(y),max(y),length=10),
duplicate="mean", linear=FALSE))
dy <- with(Ellipses,
interp(x,y,gradTy,xo=seq(min(x),max(x),length=15),
yo=seq(min(y),max(y),length=10),
duplicate="mean", linear=FALSE))
T.grid2 <- with(Ellipses,
interp(x,y,Temps, xo=seq(min(x),max(x),length=15),
yo=seq(min(y),max(y),length=10),
duplicate="mean", linear=FALSE))
gradTgrid<-expand.grid(dx$x,dx$y)
And then the plotting:
palette(grey(seq(0.5,0.9,length=5)))
par(mar=rep(0,4))
plot(E.grid$x, E.grid$y, col=E.grid$Value,
axes=F, xaxs="i", yaxs="i", pch=19)
contour(T.grid, add=TRUE, col=colorRampPalette(c("blue","red"))(15), drawlabels=FALSE)
arrows(gradTgrid[,1], gradTgrid[,2], # Here I multiply the values so you can see them
gradTgrid[,1]-dx$z*40*T.grid2$z, gradTgrid[,2]-dy$z*40*T.grid2$z,
col="yellow", length=0.05)
To understand in details how this code works, I advise you to read the following help pages: ?inout, ?chull, ?interp, ?expand.grid and ?contour.
I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_points and geom_density2d. I want geom_density2d to be weighted based on the organisation's size (OrgSize variable). However when I add OrgSize as a weighting variable nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately, should I instead be using contour and supplying OrgSize as the 'height'? If so then why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter, but then not passing to MASS::kde2d, since that function has no weights. As a consequence, you will need to use a different 2d-density method.
(I realize my answer is not addressing why the help page says that geom_density2d "understands" the weight argument, but when I have tried to calculate weighted 2D-KDEs, I have needed to use other packages besides MASS. Maybe this is a TODO that #hadley put in the help page that then got overlooked?)
I have a couple of cumulative empirical density functions which I would like to plot on top of each other in order to illustrate differences in the two curves. As was pointed out in a previous question, the function to draw the ECDF is simply plot(Ecdf()) And as I read the fine manual page, I determined that I can plot multiple ECDFs on top of each other using something like the following:
require( Hmisc )
set.seed(3)
g <- c(rep(1, 20), rep(2, 20))
Ecdf(c( rnorm(20), rnorm(20)), group=g)
However my curves sometimes overlap a bit and can be hard to tell which is which, just like the example above which produces this graph:
I would really like to make the color of these two CDFs different. I can't figure out how to do that, however. Any tips?
If memory serves, I have done this in the past. As I recall, you needed to trick it as Ecdf() is so darn paramterised. I think in help(ecdf) it hints that it is just a plot of stepfunctions, so you could estimate two or more ecdfs, plot one and then annotate via lines().
Edit Turns out it is as easy as
R> Ecdf(c(rnorm(20), rnorm(20)), group=g, col=c('blue', 'orange'))
as the help page clearly states the col= argument. But I have also found some scriptlets where I used plot.stepfun() explicitly.
You can add each curve one at a time (each with its own style), e.g.
Ecdf(rnorm(20), lwd = 2)
Ecdf(rnorm(20),add = TRUE, col = 'red', lty = 1)
Without using Ecdf (doesn't look like Hmisc is available):
set.seed(3)
mat <- cbind(rnorm(20), rnorm(20))
matplot(apply(mat, 2, sort), seq(20)/20, type='s')
I have come across a number of situations where I want to plot more points than I really ought to be -- the main holdup is that when I share my plots with people or embed them in papers, they occupy too much space. It's very straightforward to randomly sample rows in a dataframe.
if I want a truly random sample for a point plot, it's easy to say:
ggplot(x,y,data=myDf[sample(1:nrow(myDf),1000),])
However, I was wondering if there were more effective (ideally canned) ways to specify the number of plot points such that your actual data is accurately reflected in the plot. So here is an example.
Suppose I am plotting something like the CCDF of a heavy tailed distribution, e.g.
ccdf <- function(myList,density=FALSE)
{
# generates the CCDF of a list or vector
freqs = table(myList)
X = rev(as.numeric(names(freqs)))
Y =cumsum(rev(as.list(freqs)));
data.frame(x=X,count=Y)
}
qplot(x,count,data=ccdf(rlnorm(10000,3,2.4)),log='xy')
This will produce a plot where the x & y axis become increasingly dense. Here it would be ideal to have fewer samples plotted for large x or y values.
Does anybody have any tips or suggestions for dealing with similar issues?
Thanks,
-e
I tend to use png files rather than vector based graphics such as pdf or eps for this situation. The files are much smaller, although you lose resolution.
If it's a more conventional scatterplot, then using semi-transparent colours also helps, as well as solving the over-plotting problem. For example,
x <- rnorm(10000); y <- rnorm(10000)
qplot(x, y, colour=I(alpha("blue",1/25)))
Beyond Rob's suggestions, one plot function I like as it does the 'thinning' for you is hexbin; an example is at the R Graph Gallery.
Here is one possible solution for downsampling plot with respect to the x-axis, if it is log transformed. It log transforms the x-axis, rounds that quantity, and picks the median x value in that bin:
downsampled_qplot <- function(x,y,data,rounding=0, ...) {
# assumes we are doing log=xy or log=x
group = factor(round(log(data$x),rounding))
d <- do.call(rbind, by(data, group,
function(X) X[order(X$x)[floor(length(X)/2)],]))
qplot(x,count,data=d, ...)
}
Using the definition of ccdf() from above, we can then compare the original plot of the CCDF of the distribution with the downsampled version:
myccdf=ccdf(rlnorm(10000,3,2.4))
qplot(x,count,data=myccdf,log='xy',main='original')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=1,main='rounding = 1')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=0,main='rounding = 0')
In PDF format, the original plot takes up 640K, and the downsampled versions occupy 20K and 8K, respectively.
I'd either make image files (png or jpeg devices) as Rob already mentioned, or I'd make a 2D histogram. An alternative to the 2D histogram is a smoothed scatterplot, it makes a similar graphic but has a more smooth cutoff from dense to sparse regions of space.
If you've never seen addictedtor before, it's worth a look. It has some very nice graphics generated in R with images and sample code.
Here's the sample code from the addictedtor site:
2-d histogram:
require(gplots)
# example data, bivariate normal, no correlation
x <- rnorm(2000, sd=4)
y <- rnorm(2000, sd=1)
# separate scales for each axis, this looks circular
hist2d(x,y, nbins=50, col = c("white",heat.colors(16)))
rug(x,side=1)
rug(y,side=2)
box()
smoothscatter:
library("geneplotter") ## from BioConductor
require("RColorBrewer") ## from CRAN
x1 <- matrix(rnorm(1e4), ncol=2)
x2 <- matrix(rnorm(1e4, mean=3, sd=1.5), ncol=2)
x <- rbind(x1,x2)
layout(matrix(1:4, ncol=2, byrow=TRUE))
op <- par(mar=rep(2,4))
smoothScatter(x, nrpoints=0)
smoothScatter(x)
smoothScatter(x, nrpoints=Inf,
colramp=colorRampPalette(brewer.pal(9,"YlOrRd")),
bandwidth=40)
colors <- densCols(x)
plot(x, col=colors, pch=20)
par(op)