Adding a density line to a histogram with count data in ggplot2 - r

I want to add a density line (a normal density actually) to a histogram.
Suppose I have the following data. I can plot the histogram by ggplot2:
library(ggplot2)
set.seed(123)
df <- data.frame(x = rbeta(10000, shape1 = 2, shape2 = 4))
ggplot(df, aes(x = x)) + geom_histogram(colour = "black", fill = "white",
binwidth = 0.01)
I can add a density line using:
ggplot(df, aes(x = x)) +
geom_histogram(aes(y = ..density..),colour = "black", fill = "white",
binwidth = 0.01) +
stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)))
But this is not what I actually want; I want the density line to be fitted to the count data.
I found a similar post (HERE) that offered a solution to this problem, but it did not work in my case. I need an arbitrary expansion factor to get what I want, and this is not generalizable at all:
ef <- 100 # Expansion factor
ggplot(df, aes(x = x)) +
geom_histogram(colour = "black", fill = "white", binwidth = 0.01) +
stat_function(fun = function(x, mean, sd, n){
n * dnorm(x = x, mean = mean, sd = sd)},
args = list(mean = mean(df$x), sd = sd(df$x), n = ef))
Any clues that I can use to generalize this
first to normal distribution,
then to any other bin size,
and lastly to any other distribution will be very helpful.

Fitting a distribution function does not happen by magic. You have to do it explicitly. One way is using fitdistr(...) in the MASS package.
library(MASS) # for fitdistr(...)
# excellent fit (of course...)
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
stat_function(fun=dbeta,args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)
# horrible fit - no surprise here
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
stat_function(fun=dnorm,args=fitdistr(df$x,"normal")$estimate)
# mediocre fit - also not surprising...
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
stat_function(fun=dgamma,args=fitdistr(df$x,"gamma")$estimate)
EDIT: Response to OP's comment.
The scale factor is binwidth × sample size, here 0.01 × nrow(df).
ggplot(df, aes(x = x)) +
geom_histogram(colour = "black", fill = "white", binwidth = 0.01)+
stat_function(fun=function(x,shape1,shape2)0.01*nrow(df)*dbeta(x,shape1,shape2),
args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)
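The same scale factor carries over to any bin width and any density function. A minimal sketch, assuming a hypothetical helper called `scaled_density()` (it is not part of ggplot2 or MASS) that wraps a density function so that it returns expected counts per bin:

```r
library(ggplot2)
library(MASS)  # fitdistr()

# Hypothetical helper (not from any package): wraps a density function so
# that it returns binwidth * n * density, i.e. expected counts per bin.
scaled_density <- function(dfun, binwidth, n) {
  function(x, ...) binwidth * n * dfun(x, ...)
}

set.seed(123)
df <- data.frame(x = rbeta(10000, shape1 = 2, shape2 = 4))
bw <- 0.02  # any bin width; the curve rescales with it

# Fit the distribution explicitly, as in the answer above
fit <- fitdistr(df$x, "beta", start = list(shape1 = 1, shape2 = 1))

p <- ggplot(df, aes(x = x)) +
  geom_histogram(colour = "black", fill = "white", binwidth = bw) +
  stat_function(fun = scaled_density(dbeta, bw, nrow(df)),
                args = as.list(fit$estimate))
print(p)
```

Swapping `dbeta` for `dnorm` or `dgamma` (together with the matching `fitdistr()` call) should be the only change needed for another distribution.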

Binned Histogram with overlay of empirical and/or normal distribution [duplicate]

I am trying to look at the frequency distribution of a certain variable. Due to the large amount of data, I have created bins for a range of values and I'm plotting the count of each bin. I want to be able to overlay lines which will represent both the empirical distribution seen by my data, and what a theoretically normal distribution would look like. I can accomplish this without pre-binning my data or using ggplot2 by doing something such as this:
df <- ggplot2::diamonds
hist(df$price,freq = FALSE)
lines(density(df$price),lwd=3,col="blue")
or with ggplot2 as such:
mean_price <- mean(df$price)
sd_price <- sd(df$price)
ggplot(df, aes(x = price)) +
geom_histogram(aes(y = ..density..),
bins = 40, colour = "black", fill = "white") +
geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(color = 'Normal'),
args = list(mean = mean_price, sd = sd_price)) +
scale_colour_manual(name = "Colors", values = c("red", "blue"))
but I cannot figure out how to overlay similar lines on my pre-binned data:
breaks <- seq(from=min(df$price),to=max(df$price),length.out=11)
price_freq <- cut(df$price,breaks = breaks,right = TRUE,include.lowest = TRUE)
ggplot(data = df,mapping = aes(x=price_freq)) +
stat_count() +
theme(axis.text.x = element_text(angle = 270))
# + geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
# stat_function(fun = dnorm, aes(color = 'Normal'),
# args = list(mean = mean_price, sd = sd_price)) +
# scale_colour_manual(name = "Colors", values = c("red", "blue"))
Any ideas?
Your problem is that cut gives you a factor/character for your x-axis. You need a numeric x-axis to add the other layers. A first step might be to try the following. I added a small fudge to get the last bin to work out.
library(tidyverse)
df <- ggplot2::diamonds
mean_price <- mean(df$price)
sd_price <- sd(df$price)
num_bins <- 40
breaks <- seq(from=min(df$price),to=max(df$price)+1e-10,length.out=num_bins+1)
midpoints <- (breaks[1:num_bins] + breaks[2:(num_bins+1)])/2
precomputed <- df %>%
mutate(bin_left = breaks[findInterval(price, breaks)],
bin_mid = midpoints[findInterval(price, breaks)]) %>%
count(bin_mid)
precomputed %>%
ggplot(aes(x = bin_mid, weight = n)) +
geom_histogram(aes(y = ..density..), bins = num_bins, boundary = breaks[1], colour = "black", fill = "white") +
geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(color = 'Normal'),
args = list(mean = mean_price, sd = sd_price)) +
scale_colour_manual(name = "Colors", values = c("red", "blue"))
But you will notice that the red Empirical curve is quite different from your ggplot2 example. The reason is that here it is being computed using the summary data which moves all x-values to the bin midpoint. You will need to pre-compute this empirical curve, or drop it and rely on the histogram to represent this data.
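One way to pre-compute the empirical curve is to run `density()` on the raw prices and draw the result with `geom_line()`, while the bars come from the binned summary. The following is only a sketch under that assumption; the names `binned` and `empirical` are made up for illustration:

```r
library(ggplot2)

df <- ggplot2::diamonds
num_bins <- 40
breaks <- seq(min(df$price), max(df$price), length.out = num_bins + 1)
bin_width <- diff(breaks)[1]

# Pre-binned summary: one row per bin, density = count / (n * bin width)
bins <- cut(df$price, breaks = breaks, include.lowest = TRUE)
binned <- data.frame(
  bin_mid = (head(breaks, -1) + tail(breaks, -1)) / 2,
  n = as.vector(table(bins))
)
binned$density <- binned$n / (sum(binned$n) * bin_width)

# Kernel density estimated from the raw prices, before any binning
dens <- density(df$price)
empirical <- data.frame(price = dens$x, density = dens$y)

p <- ggplot() +
  geom_col(data = binned, aes(x = bin_mid, y = density),
           width = bin_width, colour = "black", fill = "white") +
  geom_line(data = empirical,
            aes(x = price, y = density, colour = "Empirical")) +
  stat_function(fun = dnorm, aes(colour = "Normal"),
                args = list(mean = mean(df$price), sd = sd(df$price)),
                xlim = range(breaks)) +
  scale_colour_manual(name = "Colors", values = c("red", "blue"))
print(p)
```

Because the kernel estimate is computed before binning, it does not suffer from the shift of all x-values to the bin midpoints.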
Sorry for the partial answer.
Take a look at the PearsonDS package (I am guessing you are not using rnorm for a reason). The easiest approach may be to generate a vector of data that meets your requirements and map that vector using geom_line.
library("PearsonDS")
df <- rpearson(5000,moments=c(mean=10,variance=2,skewness=0,kurtosis=3))

Overlaying histogram and density estimate [duplicate]

I am trying to make a histogram of density values and overlay that with the curve of a density function (not the density estimate).
Using a simple standard normal example, here is some data:
x <- rnorm(1000)
I can do:
q <- qplot( x, geom="histogram")
q + stat_function( fun = dnorm )
but this gives the scale of the histogram in frequencies and not densities. With ..density.. I can get the proper scale on the histogram:
q <- qplot( x,..density.., geom="histogram")
q
But now this gives an error:
q + stat_function( fun = dnorm )
Is there something I am not seeing?
Another question, is there a way to plot the curve of a function, like curve(), but then not as layer?
Here you go!
# create some data to work with
x = rnorm(1000);
# overlay histogram, empirical density and normal density
p0 = qplot(x, geom = 'blank') +
geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(colour = 'Normal')) +
geom_histogram(aes(y = ..density..), alpha = 0.4) +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
theme(legend.position = c(0.85, 0.85))
print(p0)
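As for the second question (drawing a function on its own, the way `curve()` does), one possible approach is to give `ggplot()` a tiny data frame that only fixes the x-range and let `stat_function()` be the sole layer. A sketch, with the range of -4 to 4 chosen arbitrarily:

```r
library(ggplot2)

# The data frame only supplies the x-range over which the function is drawn
p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, n = 201) +
  labs(y = "density")
print(p)
```

The `n` argument sets how many points the function is evaluated at; a larger value gives a smoother curve.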
A more bare-bones alternative to Ramnath's answer, passing the observed mean and standard deviation, and using ggplot instead of qplot:
df <- data.frame(x = rnorm(1000, 2, 2))
# overlay histogram and normal density
ggplot(df, aes(x)) +
geom_histogram(aes(y = after_stat(density))) +
stat_function(
fun = dnorm,
args = list(mean = mean(df$x), sd = sd(df$x)),
lwd = 2,
col = 'red'
)
What about using geom_density() from ggplot2? Like so:
df <- data.frame(x = rnorm(1000, 2, 2))
ggplot(df, aes(x)) +
geom_histogram(aes(y=..density..)) + # scale histogram y
geom_density(col = "red")
This also works for multimodal distributions, for example:
df <- data.frame(x = c(rnorm(1000, 2, 2), rnorm(1000, 12, 2), rnorm(500, -8, 2)))
ggplot(df, aes(x)) +
geom_histogram(aes(y=..density..)) + # scale histogram y
geom_density(col = "red")
I tried this with the iris data set. You should be able to get the graph you need with this simple code:
ker_graph <- ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density(lwd = 1.2,
linetype = 2,
colour = 2)

ggplot2: histogram with normal curve

I've been trying to superimpose a normal curve over my histogram with ggplot2.
My code:
data <- read.csv (path...)
ggplot(data, aes(V2)) +
geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)
I tried several things:
+ stat_function(fun=dnorm)
....didn't change anything
+ stat_density(geom = "line", colour = "red")
...gave me a straight red line on the x-axis.
+ geom_density()
doesn't work for me because I want to keep my frequency values on the y-axis, and want no density values.
Any suggestions?
Solution found!
+geom_density(aes(y=0.045*..count..), colour="black", adjust=4)
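The constant 0.045 looks hand-tuned. The `count` variable computed by `stat_density()` is the density multiplied by the number of observations, so multiplying it by the histogram's bin width should put the kernel curve on the count scale without a guessed factor. A sketch with simulated data standing in for the CSV, which is not available here (the column name `V2` is kept, the values are made up):

```r
library(ggplot2)

set.seed(1)
data <- data.frame(V2 = rnorm(500, mean = 1, sd = 0.2))  # stand-in data
bw <- 0.04  # same bin width as the histogram

p <- ggplot(data, aes(V2)) +
  geom_histogram(alpha = 0.3, fill = "white", colour = "black",
                 binwidth = bw) +
  # density * n * binwidth: same area as the count histogram
  geom_density(aes(y = bw * after_stat(count)), colour = "black")
print(p)
```

The `adjust` argument can still be added to `geom_density()` to smooth the curve; it changes the shape, not the scale.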
Think I got it:
library(ggplot2)
set.seed(1)
df <- data.frame(PF = 10*rnorm(1000))
ggplot(df, aes(x = PF)) +
geom_histogram(aes(y =..density..),
breaks = seq(-50, 50, by = 10),
colour = "black",
fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))
This has been answered here and partially here.
The area under a density curve equals 1, and the area under the histogram equals the width of the bars times the sum of their heights, i.e. the binwidth times the total number of non-missing observations. To fit both on the same graph, one or the other needs to be rescaled so that their areas match.
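The area argument can be written out as follows, using notation introduced here for illustration: $n$ is the number of non-missing observations, $w$ the bin width, $c_j$ the count in bin $j$, and $f$ the density being overlaid.

```latex
\underbrace{\sum_{j} w \, c_j}_{\text{area of count histogram}}
  = w \sum_{j} c_j = w\,n,
\qquad
\int_{-\infty}^{\infty} f(x)\,dx = 1
\;\Longrightarrow\;
\int_{-\infty}^{\infty} n\,w\,f(x)\,dx = n\,w .
```

So plotting $n\,w\,f(x)$ on the count axis, or dividing the counts by $n\,w$ on the density axis, makes the two areas equal.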
If you want the y-axis to have frequency counts, there are a number of options:
First simulate some data.
library(ggplot2)
set.seed(1)
dat_hist <- data.frame(
group = c(rep("A", 200), rep("B",150)),
value = c(rnorm(200, 20, 5), rnorm(150,25,10)))
# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))
Option 1: Plot both histogram and density curve as density and then rescale the y axis
This is perhaps the easiest approach for a single histogram.
Using the approach suggested by Carlos, plot both histogram and density curve as density
g <- ggplot(dat_hist, aes(value)) +
geom_histogram(aes(y = ..density..), binwidth = bw, colour = "black") +
stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))
And then rescale the y axis.
ybreaks = seq(0,50,5)
## On primary axis
g + scale_y_continuous("Counts", breaks = ybreaks / (bw * n_obs), labels = ybreaks)
## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))
Option 2: Rescale the density curve using stat_function
With code tidied as per PatrickT's answer.
ggplot(dat_hist, aes(value)) +
geom_histogram(colour = "black", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)
Option 3: Create an external dataset and plot using geom_line.
Unlike the above options, this one works with facets. (EDITED to provide a dplyr-based rather than plyr-based solution.) Note that the summarised dataset is used as the primary data, and the raw data is passed in for the histogram only.
library(tidyverse)
dat_hist %>%
group_by(group) %>%
nest(data = c(value)) %>%
mutate(y = map(data, ~ dnorm(
.$value, mean = mean(.$value), sd = sd(.$value)
) * bw * sum(!is.na(.$value)))) %>%
unnest(c(data,y)) %>%
ggplot(aes(x = value)) +
geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
geom_line(aes(y = y)) +
facet_wrap(~ group)
Option 4: Create external functions to edit the data on the fly
A bit over the top perhaps, but might be useful for someone?
## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
.x <- na.omit(data[,x])
if(is.null(xlim))
xlim = c(min(.x), max(.x))
x_range = seq(xlim[1], xlim[2], length.out = 101)
setNames(
data.frame(
x = x_range,
y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
c(x, "y"))
}
## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
dat_hists <- lapply(
split(data, data[, group]), dnorm_scaled,
x = x, binwidth = binwidth, xlim = xlim)
for(g in names(dat_hists))
dat_hists[[g]][, "group"] <- g
setNames(do.call(rbind, dat_hists), c(x, "y", group))
}
## Single histogram
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = bw, colour = "black") +
geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw),
aes(y = y))
## With a single faceting variable
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = 2, colour = "black") +
geom_line(data = ~ dnorm_scaled_group(
., x = "value", group = "group", binwidth = 2, xlim = c(0,50)),
aes(y = y)) +
facet_wrap(~ group)
This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around I discovered a way to simplify the code. I'm not saying it is a better way, but I thought I would mention it.
Note that JWilliman's answer provides the count on the y-axis and a "hack" to scale the corresponding density normal approximation (which otherwise would cover a total area of 1 and have therefore a much lower peak).
Main point of this comment: simpler syntax inside stat_function, by passing the needed parameters to the aesthetics function, e.g.
aes(x = x, mean = 0, sd = 1, binwidth = 0.3, n = 1000)
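This avoids having to pass args = to stat_function and is therefore more user-friendly. Okay, it's not very different, but hopefully someone will find it interesting. (Strictly speaking, the anonymous function below finds mean, sd, n and binwidth in the enclosing environment through R's lexical scoping, so the code relies on those variables being defined rather than on the extra aesthetics.)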
# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))
ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
theme_bw() +
geom_histogram(binwidth = binwidth,
colour = "white", fill = "cornflowerblue", size = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
color = "darkred", size = 1)
This code should do it:
set.seed(1)
z <- rnorm(1000)
qplot(z, geom = "blank") +
geom_histogram(aes(y = ..density..)) +
stat_density(geom = "line", aes(colour = "bla")) +
stat_function(fun = dnorm, aes(x = z, colour = "blabla")) +
scale_colour_manual(name = "", values = c("red", "green"),
breaks = c("bla", "blabla"),
labels = c("kernel_est", "norm_curv")) +
theme(legend.position = "bottom", legend.direction = "horizontal")
Note: I used qplot but you can use the more versatile ggplot.
Here's a tidyverse-informed version:
Setup
library(tidyverse)
Some data
d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")
Preparing data
We'll use a "total" histogram for the whole sample. To that end, we'll need to remove the grouping information from the data.
d2 <-
d |>
select(-gender)
Here's a data set with summary data:
d_summary <-
d %>%
group_by(gender) %>%
summarise(height_m = mean(height, na.rm = TRUE),
height_sd = sd(height, na.rm = TRUE))
d_summary
Plot it
d %>%
ggplot() +
aes() +
geom_histogram(aes(y = ..density.., x = height, fill = gender)) +
facet_wrap(~ gender) +
geom_histogram(data = d2, aes(y = ..density.., x = height),
alpha = .5) +
stat_function(data = d_summary %>% filter(gender == "female"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "female")$height_m,
sd = filter(d_summary,
gender == "female")$height_sd)) +
stat_function(data = d_summary %>% filter(gender == "male"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "male")$height_m,
sd = filter(d_summary,
gender == "male")$height_sd)) +
theme(legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(title = "Facetted histograms with overlaid normal curves",
caption = "The grey histograms show the whole distribution over both groups, i.e. females and males") +
scale_fill_brewer(type = "qual", palette = "Set1")
