ggplot - what is the unit of stat_density2d plots? - r

stat_density2d is really a nice display for dense scatter plots, however I could not find any explanation on what the density actually means. I have a plot with densities ranging from 0 to 400. What is the unit of this scale ?
Thanks !

The density values will depend on the range of x and y in your dataset.
stat_density2d(...) uses kde2d(...) in the MASS package to calculate the 2-dimensional kernal density estimate, based on bivariate normal distributions. The density at a point is scaled so that the integral of density over all x and y = 1. So if you data is highly localized, or if the range for x and y is small, you can get large numbers for density.
You can see this in the following simple example:
library(ggplot2)
set.seed(1)
df1 <- data.frame(x=c(rnorm(50,0,5),rnorm(50,20,5)),
y=c(rnorm(50,0,5),rnorm(50,20,5)))
ggplot(df1, aes(x,y)) + geom_point()+
stat_density2d(geom="path",aes(color=..level..))
set.seed(1)
df2 <- data.frame(x=c(rnorm(50,0,5),rnorm(50,20,5))/100,
y=c(rnorm(50,0,5),rnorm(50,20,5))/100)
ggplot(df2, aes(x,y)) + geom_point()+
stat_density2d(geom="path",aes(color=..level..))
These two data frames are identical except that in df2 the scale is 1/100 that in df1 (in each direction), and therefore the density levels are 10,000 times greater in the the plot of df2.

Related

Sorting data vector for a histogram using ggplot and R

So I have 10.000 values in a vector from a Monte Carlo simulation. I want to plot this data as a histogram and a density plot. Doing this with the hist() function is easy, and it will calculate the frequency of the of the different values automatically. My ambition is however doing this in ggplot.
My biggest problem right now is how to transform the data so ggplot can handle it. I would like my x-axis to show the "price" while the x-axis shows the frequency or density. My data has a lot decimals as shown in the example data below.
myData <- c(266.8997, 271.5137, 225.4786, 223.3533, 258.1245, 199.5601, 234.2341, 231.7850, 260.2091, 184.5102, 272.8287, 203.7482, 212.5140, 220.9094, 221.2627, 236.3224)
My current code using the hist()-function, and the plot is shown below.
hist(myData,
xlab ="Price",
prob=TRUE)
lines(density(myData))
Histogram for the data vector containing 10000 values
How would you sort the data, and how would you do this with ggplot? I am thinking if I should round the numbers as well?
Hard to say exactly without seeing a sample of your data, but have you tried:
ggplot(myData, aes(Price)) + geom_histogram()
or:
ggplot(myData, aes(Price)) + geom_density()
Just try this:
ggplot() +
geom_bar(aes(myData)) +
geom_density(aes(myData))

How to plot the difference between two ggplot density distributions?

I would like to use ggplot2 to illustrate the difference between two similar density distributions. Here is a toy example of the type of data I have:
library(ggplot2)
# Make toy data
n_sp <- 100000
n_dup <- 50000
D <- data.frame(
event=c(rep("sp", n_sp), rep("dup", n_dup) ),
q=c(rnorm(n_sp, mean=2.0), rnorm(n_dup, mean=2.1))
)
# Standard density plot
ggplot( D, aes( x=q, y=..density.., col=event ) ) +
geom_freqpoly()
Rather than separately plot the density for each category ( dup and sp ) as above, how could I plot a single line that shows the difference between these distributions?
In the toy example above, if I subtracted the dup density distribution from the sp density distribution, the resulting line would be above zero on the left side of the plot (since there is an abundance of smaller sp values) and below 0 on the right (since there is an abundance of larger dup values). Not that there may be a different number of observations of type dup and sp.
More generally - what is the best way to show differences between similar density distributions?
There may be a way to do this within ggplot, but frequently it's easiest to do the calculations beforehand. In this case, call density on each subset of q over the same range, then subtract the y values. Using dplyr (translate to base R or data.table if you wish),
library(dplyr)
library(ggplot2)
D %>% group_by(event) %>%
# calculate densities for each group over same range; store in list column
summarise(d = list(density(q, from = min(.$q), to = max(.$q)))) %>%
# make a new data.frame from two density objects
do(data.frame(x = .$d[[1]]$x, # grab one set of x values (which are the same)
y = .$d[[1]]$y - .$d[[2]]$y)) %>% # and subtract the y values
ggplot(aes(x, y)) + # now plot
geom_line()

R ggplot: Weighted CDF

I'd like to plot a weighted CDF using ggplot. Some old non-SO discussions (e.g. this from 2012) suggest this is not possible, but thought I'd reraise.
For example, consider this data:
df <- data.frame(x=sort(runif(100)), w=1:100)
I can show an unweighted CDF with
ggplot(df, aes(x)) + stat_ecdf()
How would I weight this by w? For this example, I'd expect an x^2-looking function, since the larger numbers have higher weight.
There is a mistake in your answer.
This is the right code to compute the weighted ECDF:
df <- df[order(df$x), ] # Won't change anything since it was created sorted
df$cum.pct <- with(df, cumsum(w) / sum(w))
ggplot(df, aes(x, cum.pct)) + geom_line()
The ECDF is a function F(a) equal to the sum of weights (probabilities) of observations where x<a divided by the total sum of weights.
But here is a more satisfying option that simply modifies the original code of the ggplot2 stat_ecdf:
https://github.com/NicolasWoloszko/stat_ecdf_weighted

R - How to histogram multiple matrixes using qplot/ggplot2

I'm using R to read and plot data from NetCDF files (ncdf4). I've started using R only recently thus I'm very confused, I beg your pardon.
Let's say from the files I obtain N 2-D matrixes of numerical values, each with different dimensions and many NA values.
I have to histogram these values in the same plot, with bins of given width and within given limits, the same for every matrix.
For just one matrix, I can do this:
library(ncdf4)
library(ggplot2)
file0 <- nc_open("test.nc")
#Read a variable
prec0 <- ncvar_get(file0,"pr")
#Some settings
min_plot=0
max_plot=30
bin_width=2
xlabel="mm/day"
ylabel="PDF"
title="Precipitation"
#Get maximum of array, exclude NAs
maximum_prec0=max(prec0, na.rm=TRUE)
#Store the histogram
histo_prec0 <- hist(prec0, xlim=c(min_plot,max_plot), right=FALSE, breaks=seq(0,ceiling(maximum_prec0),by=bin_width))
#Plot the histogram densities using points instead of bars, which is what we want
qplot(histo_prec0$mids, histo_prec0$density, xlim=c(min_plot,max_plot), color=I("yellow"), xlab=xlabel, ylab=ylabel, main=title, log="y")
#If necessary, can transform matrix to vector using
#vector_prec0 <- c(prec0)
However it occurs to me that it would be best to use a DataFrame for plotting multiple matrixes. I'm not certain of that nor on how to do it. This would also allow for automatic legends and all the advantages that come from using dataframes with ggplot2.
What I want to achieve is something akin to this:
https://copy.com/thumbs_public/j86WLyOWRs4N1VTi/scatter_histo.jpg?size=1024
Where on Y we have the Density and on X the bins.
Thanks in advance.
To be honest, it is unclear what you are after (scatter plot or histogram of data with values as points?).
Here are a couple of examples using ggplot which might fit your goals (based on your last sentence: "Where on Y we have the Density and on X the bins"):
# some data
nsample<- 200
d1<- rnorm(nsample,1,0.5)
d2<- rnorm(nsample,2,0.6)
#transformed into histogram bins and collected in a data frame
hist.d1<- hist(d1)
hist.d2<- hist(d2)
data.d1<- data.frame(hist.d1$mids, hist.d1$density, rep(1,length(hist.d1$density)))
data.d2<- data.frame(hist.d2$mids, hist.d2$density, rep(2,length(hist.d2$density)))
colnames(data.d1)<- c("bin","den","group")
colnames(data.d2)<- c("bin","den","group")
ddata<- rbind(data.d1,data.d2)
ddata$group<- factor(ddata$group)
# plot
plots<- ggplot(data=ddata, aes(x=bin, y=den, group=group)) +
geom_point(aes(color=group)) +
geom_line(aes(color=group)) #optional
print(plots)
However, you could also produce smooth density plots (or histograms) directly in ggplot:
ddata2<- cbind(c(rep(1,nsample),rep(2,nsample)),c(d1,d2))
ddata2<- as.data.frame(ddata2)
colnames(ddata2)<- c("group","value")
ddata2$group<- factor(ddata2$group)
plots2<- ggplot(data=ddata2, aes(x=value, group=group)) +
geom_density(aes(color=group))
# geom_histogram(aes(color=group, fill=group)) # for histogram instead
windows()
print(plots2)

Scaled/weighted density plot

I want to generate a density plot of observed temperatures that is scaled by the number of events observed for each temperature data point. My data contains two columns: Temperature and Number [of observations].
Right now, I have a density plot that only incorporates the Temperature frequency according to:
plot(density(Temperature, na.rm=T), type="l", bty="n")
How do I scale this density to account for the Number of observations at each temperature? For example, I want to be able to see the temperature density plot scaled to show if there are greater/fewer observations for each temperature at higher/lower temperatures.
I think I'm looking for something that could weight the temperatures?
I think you can get what you want by passing a weights argument to density. Here's an example using ggplot
dat <- data.frame(Temperature = sort(runif(10)), Number = 1:10)
ggplot(dat, aes(Temperature)) + geom_density(aes(weights=Number/sum(Number)))
And to do this in base (using DanM's data):
plot(density(dat$Temperature,weights=dat$Number/sum(dat$Number),na.rm=T),type='l',bty='n')

Resources