Overlaying histogram and density estimate [duplicate] - r

I am trying to make a histogram of density values and overlay that with the curve of a density function (not the density estimate).
Using a simple standard normal example, here is some data:
x <- rnorm(1000)
I can do:
q <- qplot( x, geom="histogram")
q + stat_function( fun = dnorm )
but this puts the histogram on a frequency (count) scale rather than a density scale. With ..density.. I can get the proper scale on the histogram:
q <- qplot( x,..density.., geom="histogram")
q
But now this gives an error:
q + stat_function( fun = dnorm )
Is there something I am not seeing?
Another question: is there a way to plot the curve of a function, like curve(), but not as a layer?

Here you go!
# create some data to work with
x = rnorm(1000);
# overlay histogram, empirical density and normal density
p0 = qplot(x, geom = 'blank') +
  geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
  stat_function(fun = dnorm, aes(colour = 'Normal')) +
  geom_histogram(aes(y = ..density..), alpha = 0.4) +
  scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
  theme(legend.position = c(0.85, 0.85))
print(p0)

A more bare-bones alternative to Ramnath's answer, passing the observed mean and standard deviation, and using ggplot instead of qplot:
df <- data.frame(x = rnorm(1000, 2, 2))
# overlay histogram and normal density
ggplot(df, aes(x)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(df$x), sd = sd(df$x)),
    lwd = 2,
    col = 'red'
  )
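If the y-axis should stay in counts, the opposite route is also possible: leave the histogram alone and rescale the curve. The expected count in a bin is roughly density * n * binwidth, so multiplying dnorm() by that factor puts it on the frequency scale. The following is only a sketch under assumptions: the bin width of 0.5 and the helper name scaled_dnorm are made up for this example and are not part of the answers above.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = rnorm(1000, 2, 2))
bw <- 0.5        # bin width, chosen arbitrarily for illustration
n  <- nrow(df)

# Hypothetical helper: normal density rescaled to the count scale.
# The area under a count histogram is n * binwidth, so the curve
# is multiplied by the same factor.
scaled_dnorm <- function(x) {
  dnorm(x, mean = mean(df$x), sd = sd(df$x)) * n * bw
}

p_counts <- ggplot(df, aes(x)) +
  geom_histogram(binwidth = bw) +                 # y-axis stays in counts
  stat_function(fun = scaled_dnorm, colour = "red")
p_counts
```

The same bin width has to be used in geom_histogram() and in the scaling factor, otherwise the curve and the bars no longer share a scale.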

What about using geom_density() from ggplot2? Like so:
df <- data.frame(x = rnorm(1000, 2, 2))
ggplot(df, aes(x)) +
geom_histogram(aes(y=..density..)) + # scale histogram y
geom_density(col = "red")
This also works for multimodal distributions, for example:
df <- data.frame(x = c(rnorm(1000, 2, 2), rnorm(1000, 12, 2), rnorm(500, -8, 2)))
ggplot(df, aes(x)) +
geom_histogram(aes(y=..density..)) + # scale histogram y
geom_density(col = "red")
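Note that geom_density() draws a kernel density estimate, not a theoretical density, which is why it follows all three modes. As far as I know its defaults match those of stats::density() (Gaussian kernel, bandwidth from bw.nrd0()), so the curve can be inspected in base R. A small sketch, with the seed and variable names chosen only for this example:

```r
set.seed(1)
x <- c(rnorm(1000, 2, 2), rnorm(1000, 12, 2), rnorm(500, -8, 2))

# Kernel density estimate with the default Gaussian kernel
kde <- density(x)

# The default bandwidth is the rule-of-thumb value from bw.nrd0()
kde$bw
bw.nrd0(x)

# Trapezoidal approximation of the area under the estimate;
# it should be close to 1, like any density
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
area
```

A single normal curve fitted with the sample mean and standard deviation would be unimodal and miss this structure, so the two overlays answer different questions.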

Here is an example with the iris data set. This simple code should produce the graph you need:
ker_graph <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = ..density..),
                 colour = 1, fill = "white") +
  geom_density(lwd = 1.2,
               linetype = 2,
               colour = 2)
# display the plot (the assignment alone draws nothing)
ker_graph

Related

To plot a theoretical density over an empirical density

If I want to plot an empirical density, I would go:
library(ggplot2)
ggplot() +
  geom_density(aes(x = rbeta(100,3,1)))
or
library(ggplot2)
ggplot() +
geom_histogram(aes(x = rbeta(100,3,1)))
where rbeta(100,3,1) can be any vector.
If I want to plot a theoretical density, I could go:
library(ggplot2)
ggplot(data = data.frame(x = c(0, 1)), mapping = aes(x = x)) +
stat_function(fun = dbeta, args = c(3,1), n = 100)
But when I try to plot the first curve over the second:
library(ggplot2)
ggplot(data = data.frame(x = c(0, 1)), mapping = aes(x = x)) +
stat_function(fun = dbeta, args = c(3,1), n = 100) +
geom_histogram(aes(x = rbeta(100,3,1)))
I will get an error.
How can I plot an empirically determined density over a theoretical?
This seems to work.
Plot the histogram first, then the density. And with more data points the histogram fits the theoretical density better.
library(ggplot2)
library(gridExtra)
set.seed(2022)
p1 <- ggplot() +
geom_histogram(aes(x = rbeta(100,3,1), y = ..density..), bins = 30) +
stat_function(fun = dbeta, args = c(3,1), n = 100)
p2 <- ggplot() +
geom_histogram(aes(x = rbeta(10000,3,1), y = ..density..), bins = 30) +
stat_function(fun = dbeta, args = c(3,1), n = 100)
grid.arrange(p1, p2, ncol=2)
Created on 2022-02-20 by the reprex package (v2.0.1)
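The error in the question most likely arises because geom_histogram() inherits the two-row data frame given to ggplot(), while aes(x = rbeta(100,3,1)) supplies 100 values, so the aesthetic length does not match the data. An alternative sketch keeps the sample in its own data frame and lets stat_function() draw the theoretical curve over a fixed range. The name samp is hypothetical, and the code assumes a ggplot2 version that provides after_stat().

```r
library(ggplot2)

set.seed(2022)
# Hypothetical data frame holding the empirical sample
samp <- data.frame(x = rbeta(100, 3, 1))

p <- ggplot(samp, aes(x = x)) +
  # histogram on the density scale, comparable to dbeta()
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  # theoretical Beta(3, 1) density, evaluated at 100 points on [0, 1]
  stat_function(fun = dbeta, args = list(shape1 = 3, shape2 = 1),
                n = 100, xlim = c(0, 1))
p
```

Because both layers now share one data frame of matching length, the order of the layers only affects which one is drawn on top.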

Fitting Rayleigh in R

This code
library(ggplot2)
library(MASS)
# Generate gamma rvs
x <- rgamma(100000, shape = 2, rate = 0.2)
den <- density(x)
dat <- data.frame(x = den$x, y = den$y)
ggplot(data = dat, aes(x = x, y = y)) +
  geom_point(size = 3) +
  theme_classic()
# Fit parameters (to avoid errors, set lower bounds to zero)
# fitdistr() is given the data vector x generated above
fit.params <- fitdistr(x, "gamma", lower = c(0, 0))
# Plot using density points
ggplot(data = dat, aes(x = x,y = y)) +
  geom_point(size = 3) +
  geom_line(aes(x=dat$x, y=dgamma(dat$x,fit.params$estimate["shape"], fit.params$estimate["rate"])),
            color="red", size = 1) +
  theme_classic()
fits and plots the distribution of series x. The resulting plot is:
Packages stats and MASS seem not to support the Rayleigh distribution. How can I extend the previous code to the Rayleigh distribution?
In the code below I start by recreating the vector x, this time setting the RNG seed, in order to make the results reproducible. Then a data.frame dat with only that vector is also recreated.
The parameters of the Gamma and Rayleigh distributions are estimated from x first, and the fitted density functions are then drawn over the histogram of x with stat_function.
library(ggplot2)
library(MASS)
library(extraDistr) # for the Rayleigh distribution functions
# Generate gamma rvs
set.seed(2020)
x <- rgamma(100000, shape = 2, rate = 0.2)
dat <- data.frame(x)
# Fit parameters (to avoid errors, set lower bounds to zero)
fit.params <- fitdistr(dat$x, "gamma", lower = c(0, 0))
ggplot(data = dat, aes(x = x)) +
  geom_histogram(aes(y = ..density..), bins = nclass.Sturges(x)) +
  stat_function(fun = dgamma,
                args = list(shape = fit.params$estimate["shape"],
                            rate = fit.params$estimate["rate"]),
                color = "red", size = 1) +
  ggtitle("Gamma density") +
  theme_classic()
fit.params.2 <- fitdistrplus::fitdist(dat$x, "rayleigh", start = list(sigma = 1))
fit.params.2$estimate
ggplot(data = dat, aes(x = x)) +
  geom_histogram(aes(y = ..density..), bins = nclass.Sturges(x)) +
  stat_function(fun = drayleigh,
                args = list(sigma = fit.params.2$estimate),
                color = "blue", size = 1) +
  ggtitle("Rayleigh density") +
  theme_classic()
To plot points and lines like in the question, not histograms, use the code below.
den <- density(x)
orig <- data.frame(x = den$x, y = den$y)
ggplot(data = orig, aes(x = x)) +
  geom_point(aes(y = y), size = 3) +
  geom_line(aes(y = dgamma(x, fit.params$estimate["shape"], fit.params$estimate["rate"])),
            color="red", size = 1) +
  geom_line(aes(y = drayleigh(x, fit.params.2$estimate)),
            color="blue", size = 1) +
  theme_classic()
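As a cross-check that needs no extra packages, the Rayleigh density and its maximum likelihood estimate can be written out by hand: the density is f(x) = x / sigma^2 * exp(-x^2 / (2 sigma^2)) for x >= 0, and the MLE of sigma is sqrt(sum(x^2) / (2n)). The sketch below uses hypothetical helper names (drayleigh_base, rayleigh_mle) that do not come from any package. The result should be close to, though not necessarily identical to, a numerically optimised estimate such as fit.params.2$estimate.

```r
# Hypothetical helper: Rayleigh density written out in base R
drayleigh_base <- function(x, sigma) {
  ifelse(x >= 0, x / sigma^2 * exp(-x^2 / (2 * sigma^2)), 0)
}

# Hypothetical helper: closed-form maximum likelihood estimate of sigma
rayleigh_mle <- function(x) {
  sqrt(sum(x^2) / (2 * length(x)))
}

set.seed(2020)
x <- rgamma(100000, shape = 2, rate = 0.2)

sigma_hat <- rayleigh_mle(x)
sigma_hat

# Sanity check: the fitted density should integrate to about 1
integrate(drayleigh_base, lower = 0, upper = Inf, sigma = sigma_hat)$value
```

drayleigh_base can be passed to stat_function() with args = list(sigma = sigma_hat) in the same way as drayleigh above.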

ggplot2: Stat_function misbehaviour with log scales

I am trying to plot a point histogram (a histogram that shows the values with a point instead of bars) that is log-scaled. The result should look like this:
MWE:
Let's simulate some data:
set.seed(123)
d <- data.frame(x = rnorm(1000))
To get the point histogram I need to calculate the histogram data (hdata) first
hdata <- hist(d$x, plot = FALSE)
tmp <- data.frame(mids = hdata$mids,
density = hdata$density,
counts = hdata$counts)
which we can plot like this
p <- ggplot(tmp, aes(x = mids, y = density)) + geom_point() +
stat_function(fun = dnorm, col = "red")
p
to get this graph:
In theory we should be able to apply the log scales (and set the y-limits to be above 0) and we should have a similar picture to the target graph.
However, if I apply it I get the following graph:
p + scale_y_log10(limits = c(0.001, 10))
The stat_function clearly shows non-scaled values instead of producing a figure closer to the solid line in the first picture.
Any ideas?
Bonus
Are there any ways to graph the histogram with dots without using the hist(..., plot = FALSE) function?
EDIT Workaround
One possible solution is to calculate the dnorm-data outside of ggplot and then insert it as a line. For example
tmp2 <- data.frame(mids = seq(from = min(tmp$mids), to = max(tmp$mids),
by = (max(tmp$mids) - min(tmp$mids))/10000))
tmp2$dnorm <- dnorm(tmp2$mids)
# Plot it
ggplot() +
geom_point(data = tmp, aes(x = mids, y = density)) +
geom_line(data = tmp2, aes(x = mids, y = dnorm), col = "red") +
scale_y_log10()
This returns a graph like the following. This is basically the graph, but it doesn't resolve the stat_function issue.
library(ggplot2)
set.seed(123)
d <- data.frame(x = rnorm(1000))
ggplot(d, aes(x)) +
  stat_bin(geom = "point",
           aes(y = ..density..),
           #same breaks as function hist's default:
           breaks = pretty(range(d$x), n = nclass.Sturges(d$x), min.n = 1),
           position = "identity") +
  stat_function(fun = dnorm, col = "red") +
  scale_y_log10(limits = c(0.001, 10))
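The comment about the breaks can be checked in base R. With default settings, hist() picks the number of classes with nclass.Sturges() and then calls pretty() on the data range, so the hand-built breaks should agree with those stored in the hist() result. A short sketch (variable names are chosen for this example only):

```r
set.seed(123)
d <- data.frame(x = rnorm(1000))

# Breaks chosen by hist() with its default settings
h <- hist(d$x, plot = FALSE)

# Breaks built by hand, as passed to stat_bin() above
brks <- pretty(range(d$x), n = nclass.Sturges(d$x), min.n = 1)

# Compare the two sets of breaks
all.equal(h$breaks, brks)

# Bin midpoints derived from the hand-built breaks
mids <- head(brks, -1) + diff(brks) / 2
all.equal(h$mids, mids)
```

This makes the stat_bin() version bin the data in the same way as the hist(..., plot = FALSE) approach from the question.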
Another possible solution that I found while revisiting this issue is to apply the log10 to the stat_function-call.
library(ggplot2)
set.seed(123)
d <- data.frame(x = rnorm(1000))
hdata <- hist(d$x, plot = FALSE)
tmp <- data.frame(mids = hdata$mids,
density = hdata$density,
counts = hdata$counts)
ggplot(tmp, aes(x = mids, y = density)) + geom_point() +
stat_function(fun = function(x) log10(dnorm(x)), col = "red") +
scale_y_log10()
Created on 2018-07-25 by the reprex package (v0.2.0).
