I'd like to plot data so that the y-axis shows probability (in the range [0,1]) and the x-axis shows the data values. The data is continuous (also in the range [0,1]), so I'd like to use a kernel density estimation function and normalize it so that the y-value at a point x gives the probability of seeing the value x in the input data.
So, I'd like to ask:
a) Is this reasonable at all? I understand that I cannot have a probability of seeing values that are not in the data, but I would just like to interpolate between the points I have using a kernel density estimation function and normalize the result afterwards.
b) Are there any built-in options in ggplot that I could use to do this, for example by overriding the default behavior of geom_density()?
Thanks in advance,
Timo
EDIT:
When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks, guys, for clearing this up for me.
Here is a quick merge of @JD Long's and @yesterday's answers:
ggplot(df, aes(x=x)) +
geom_histogram(aes(y = ..density..), binwidth=density(df$x)$bw) +
geom_density(fill="red", alpha = 0.2) +
theme_bw() +
xlab('') +
ylab('')
This way the binwidth for ggplot2 is calculated by the density function, and the density curve is drawn on top of the histogram with a nice transparency. But you should definitely look into stat_density, as @yesterday suggested, for further customization.
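To make the difference between a density and a probability concrete, here is a minimal base R sketch. The Beta-distributed sample is made up for illustration and is not the data from the question. It shows that a kernel density estimate has an area of about 1 while its height can exceed 1, and that dividing by the maximum only rescales the curve so that it peaks at 1.

```r
# Hypothetical sample in [0, 1]; it stands in for the real data
set.seed(42)
x <- rbeta(500, 2, 5)

# Kernel density estimate evaluated on [0, 1]
d <- density(x, from = 0, to = 1)

# Area under the curve by the trapezoid rule; close to 1
# (slightly less, because kernel mass outside [0, 1] is cut off)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Density values are not probabilities and can exceed 1
peak <- max(d$y)

# Rescale so that the highest point of the curve is exactly 1
scaled <- d$y / peak

plot(d$x, scaled, type = "l", xlab = "x", ylab = "scaled density")
```

In ggplot2 3.3.0 or later the same rescaling should be available through the computed variable `scaled`, for example geom_density(aes(y = after_stat(scaled))).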
This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)
Start with some synthetic data:
set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")
The density function has a reasonably smart bandwidth calculator which you can borrow from:
bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)
Then use the calculated bandwidth as the SD to make some random noise:
noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")
Hey look! A kernel smoothed histogram!
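The resample-plus-noise trick above amounts to drawing from the Gaussian kernel density estimate itself. Here is a minimal sketch that wraps it in a function; `smoothed_bootstrap` is a made-up name, not part of any package.

```r
# Draw n values from the Gaussian KDE of x:
# resample the data with replacement, then add N(0, bw^2) noise
smoothed_bootstrap <- function(x, n = 10000, bw = density(x)$bw) {
  sample(x, n, replace = TRUE) + rnorm(n, mean = 0, sd = bw)
}

set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3))
smoothed <- smoothed_bootstrap(randomData)

# Histogram of the smoothed draws with the KDE of the raw data on top
hist(smoothed, freq = FALSE, breaks = 40)
lines(density(randomData), col = "red")
```

Because the noise is independent of the resampled values, the variance of the result is roughly the variance of the data plus bw^2, which is why the histogram looks slightly wider than the raw one.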
I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.
You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().
See the on-line user manual: http://had.co.nz/ggplot2/stat_density.html
You can specify any of the kernel estimation functions that are supported by stats::density():
library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")
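The kernel names come straight from stats::density(), so their effect can be compared outside ggplot2 first. This is a small base R sketch on simulated data (the sample is arbitrary):

```r
set.seed(1)
x <- rnorm(1000)

kernels <- c("gaussian", "epanechnikov", "biweight", "triangular")

# One density estimate per kernel, all with the same default bandwidth rule
estimates <- lapply(kernels, function(k) density(x, kernel = k))
names(estimates) <- kernels

# Overlay the curves; with 1000 points they differ only slightly
plot(estimates[["gaussian"]], main = "Kernel comparison")
for (i in seq_along(kernels)[-1]) {
  lines(estimates[[i]], col = i)
}
legend("topright", legend = kernels, col = seq_along(kernels), lty = 1)
```

With this much data the choice of kernel usually matters far less than the choice of bandwidth.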
Considering the following random data:
set.seed(123456)
# generate random normal data
x <- rnorm(100, mean = 20, sd = 5)
weights <- 1:100
df1 <- data.frame(x, weights)
#
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
We can create a general cumulative distribution plot.
But, I want to compare my curve to that from data used 20 years ago. From the paper, I only know that the data is "best modeled by a shifted exponential distribution with an x intercept of 1.1 and a mean of 18"
How can I add such a function to my plot?
+ stat_function(fun=dexp, geom = "line", size=2, col="red", args = (mean=18.1))
but I am not sure how to deal with the shift (x intercept)
I think scenarios like this are best handled by making your function first outside of the ggplot call.
dexp doesn't take a parameter mean; it uses rate instead, which is the same as lambda. That means you want rate = 1/18.1, based on the properties of exponential distributions. Also, I don't think dexp makes much sense here, since it shows the density, and I think you really want the cumulative probability, which is pexp.
Your code could look something like this:
library(ggplot2)
test <- function(x) {pexp(x, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red")
You could shift your pexp distribution like this:
test <- function(x) {pexp(x-10, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red") +
xlim(10,45)
just for fun this is what using dexp produces:
I am not entirely sure I understand the concept of a mean for the exponential function. In general, though, when you pass a function as an argument (fun=dexp in your case), you can pass your own modified function instead, for example: fun = function(x) dexp(x)+1.1.
Maybe experimenting with this feature will get you to the solution.
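One possible reading of "shifted exponential distribution with an x intercept of 1.1 and a mean of 18" is an exponential distribution moved right by 1.1, so that the exponential part has mean 18 - 1.1 = 16.9 and rate 1/16.9. That interpretation is an assumption, not something the quoted description confirms. Under it, the CDF to overlay could be sketched as follows (`pshiftexp` is a made-up helper name):

```r
# CDF of an exponential distribution shifted right by `shift`.
# `mean_total` is the assumed mean of the shifted distribution, so the
# exponential part has mean (mean_total - shift).
pshiftexp <- function(x, shift = 1.1, mean_total = 18) {
  pexp(x - shift, rate = 1 / (mean_total - shift))
}

# The curve is 0 up to the shift and rises from there
curve(pshiftexp(x), from = 0, to = 60, ylab = "cumulative probability")
```

It could then be added to the ECDF plot with stat_function(fun = pshiftexp, colour = "red"). Note that shifting inside the function argument (x - shift) moves the curve horizontally, whereas adding a constant to the result moves it vertically.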
I have a more general question regarding the principle behind density2d.
I'm using ggplot and the density2d function to visualize animal movements. My idea was calculating heat maps showing where the animal is most of the time and/or to identify areas of particular interest. Yet, the density2d function sometimes generates rather inexplicable plots.
Here's what I mean:
library(data.table)
library(ggplot2)
set.seed(4)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
# each "+" must end its line; a line that starts with "+" is parsed as a new expression
ggplot(df,aes(x=x,y=y)) +
stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon") +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600)) +
geom_path()
which looks like this:
There are areas with a density estimate but without data (around x:50, y:300).
Now compare with this:
set.seed(13)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y)) +
stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon") +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600)) +
geom_path()
which looks like this:
Here there are regions "without" a density estimate but with actual data (around x:100, y:550).
Someone asked a related question:
Create heatmap with distribution of attribute values in R (not density heatmap)
but there are no satisfactory answers to be found.
So my question would be (i) Why? and (ii) How to avoid/adjust if possible?
This may be helpful. I am not that familiar with stat_density2d, but after looking at your code and the ggplot2 documentation (http://docs.ggplot2.org/0.9.2.1/stat_density2d.html), I suspected that ..level.. was not the right variable and tried ..density.. instead. Someone else may be able to explain why density is needed; in the meantime, I think this is the graph you wanted.
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) +
geom_path() +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600))
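As for why density can appear where there are no points: stat_density2d() relies on MASS::kde2d(), which spreads each observation over a neighbourhood whose size is set by the bandwidth `h`. Here is a minimal sketch with the same kind of uniform data, comparing the default bandwidth with one a quarter as wide (the factor 4 is an arbitrary choice for illustration):

```r
library(MASS)  # kde2d() and bandwidth.nrd()

set.seed(4)
x <- runif(50, 1, 599)
y <- runif(50, 1, 599)

# Default bandwidths used by kde2d(), one per axis
h_default <- c(bandwidth.nrd(x), bandwidth.nrd(y))

lims <- c(0, 600, 0, 600)
dens_wide   <- kde2d(x, y, h = h_default,     n = 100, lims = lims)
dens_narrow <- kde2d(x, y, h = h_default / 4, n = 100, lims = lims)

# Wide kernels bridge the gaps between points; narrow kernels hug the data
par(mfrow = c(1, 2))
image(dens_wide, main = "default h")
points(x, y, pch = 20)
image(dens_narrow, main = "h / 4")
points(x, y, pch = 20)
```

stat_density2d() accepts the same `h` argument, so a smaller bandwidth can be passed there too. The second effect in the question, points lying outside every polygon, is probably because the lowest contour level sits above the estimated density at those points.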
I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems able to do.
However, I can't seem to replace my geometry with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data=dataM,
aes(x=bins, y=value, colour=variable)) +
geom_line() + scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking at smoothing the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)
Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data=dataM, aes(x=bins, y=value, colour=variable)) +
geom_smooth(se=FALSE, span=0.3) +
scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.
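If the goal is literally a Gaussian-kernel smooth of already binned values, base R's ksmooth() can do that without ggplot2. This is a minimal sketch on made-up data; the bins, values, and bandwidth below are placeholders, not the contents of 1.csv.

```r
# Hypothetical noisy frequencies on bins in [0, 2]
set.seed(1)
bins  <- seq(0, 2, by = 0.02)
value <- dnorm(bins, mean = 1, sd = 0.3) + rnorm(length(bins), sd = 0.1)

# Kernel-weighted average (Nadaraya-Watson) with a Gaussian kernel;
# a larger bandwidth gives a smoother curve
smoothed <- ksmooth(bins, value, kernel = "normal",
                    bandwidth = 0.2, x.points = bins)

plot(bins, value, type = "l", col = "grey")
lines(smoothed, col = "red", lwd = 2)
```

The result is a list with `x` and `y` components, so it can be turned into a data frame and drawn with geom_line() alongside the original lines.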
I'm working with a dataset where I have to transform some data for a curve fit. I'm plotting it using ggplot2, and can use stat_smooth on the transformed data to get the fit, but then want to overlay the result on the correct datapoints.
As a toy example, let's say I had
qplot(1:10, 1:10)+stat_smooth(formula=y+1~x, method="lm")
But I want to shift the stat_smooth line down by one (other than by taking the +1 out of the formula). Is this possible?
I don't think position_nudge() was available when this was asked 10.5 years ago, but it has provided a simpler way of doing this for some time (as of ggplot2 3.3.5, late 2021).
qplot(1:10, 1:10 + rnorm(10, sd = 0.3)) + stat_smooth(formula = y~x, method = "lm", position = position_nudge(y = 1))
It seems worth cautioning there's a good chance of displaying confusing or misleading confidence intervals when manipulating stat_smooth()'s formula. I've added a bit of variation to qplot()'s input in the line above to illustrate this.
Sometimes things can be very obvious :
qplot(1:10, 1:10)+stat_smooth(formula=(y+1)-1~x, method="lm")
If you can raise it by 1 by adding 1 to y, you can lower it by 1 by subtracting 1 from y. ;-)
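Another route, sketched here under the assumption that a plain linear fit is all that is needed, is to fit the model outside ggplot2, undo the transformation on the predictions, and plot them as an ordinary line. The data below are simulated for illustration.

```r
set.seed(1)
dat <- data.frame(x = 1:10, y = 1:10 + rnorm(10, sd = 0.3))

# Fit on the transformed response (y + 1), as in the question
fit <- lm(I(y + 1) ~ x, data = dat)

# Undo the transformation on the fitted values
dat$fit_original_scale <- predict(fit, newdata = dat) - 1

plot(dat$x, dat$y)
lines(dat$x, dat$fit_original_scale, col = "blue")
```

In ggplot2 the back-transformed column can be drawn with geom_line(aes(y = fit_original_scale)). This generalises to transformations that cannot be undone inside the formula itself.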
I am trying to produce some example graphics using ggplot2, and one of the examples I picked was the birthday problem, here using code 'borrowed' from a Revolution Computing presentation at OSCON.
library(plyr)     # ddply() and .()
library(ggplot2)
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
sum(sapply(seq(ntests), anydup))/ntests
}
x<-data.frame(x=rep(1:100, each=5))
x<-ddply(x, .(x), function(df) {return(data.frame(x=df$x, prob=birthday(df$x)))})
birthdayplot<-ggplot(x, aes(x, prob))+
geom_point()+geom_smooth()+
theme_bw()+
# ggtitle() sets the plot title; opts() is defunct in current ggplot2
ggtitle("Probability that at least two people share a birthday in a random group")+
labs(x="Size of Group", y="Probability")
Here my graph is what I would describe as exponential, but geom_smooth doesn't fit the data particularly well. I've tried the loess method, but this didn't change things much. Can anyone suggest how to add a better smooth?
Thanks
Paul.
The smoothing routine does not react fast enough to the sudden change at low values of x (and it has no way of knowing that the values of prob are restricted to the range 0-1). Since the variability is so low, a quick solution is to reduce the span of values over which the smoothing at each point is done. Check out the red line in this plot:
birthdayplot + geom_smooth(span=0.1, colour="red")
The problem is that the probabilities follow a logistic curve. You could fit a proper smoothing line if you change the birthday function to return the raw successes and failures instead of the probabilities.
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
data.frame(Dups = sapply(seq(ntests), anydup) * 1, n = n)
}
x<-ddply(x, .(x),function(df) birthday(df$x))
Now, you'll have to add the points as a summary, and specify a logistic regression as the smoothing type.
ggplot(x, aes(n, Dups)) +
stat_summary(fun = mean, geom = "point") +
stat_smooth(method = "glm", method.args = list(family = binomial))
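For reference, the simulated proportions can be checked against the exact probability, which for n people and 365 equally likely birthdays is 1 - (365/365)(364/365)...((365 - n + 1)/365). A short sketch (`birthday_exact` is a name chosen here, not part of the original code):

```r
# Exact probability that at least two of n people share a birthday,
# assuming 365 equally likely days
birthday_exact <- function(n) {
  1 - prod((365 - seq_len(n) + 1) / 365)
}

exact <- data.frame(n = 1:100, prob = sapply(1:100, birthday_exact))

# Smallest group size for which the probability exceeds 0.5 (this is 23)
min(exact$n[exact$prob > 0.5])
```

Adding geom_line(data = exact, aes(n, prob)) to the plot would overlay the exact curve on the simulated points, which makes it easy to judge how well any smoother tracks the true shape.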