I'm a novice with the R programming language. What is the standard way to overlay a density curve on a histogram using ggplot2?
It depends on whether you want an empirical density estimate or a fitted theoretical density. In both cases, you need to scale the density to match the histogram's bin width.
For the empirical kernel density estimates:
library(ggplot2)
# dummy data
df <- data.frame(
x = rnorm(1000)
)
binwidth <- 0.1
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
geom_density(aes(y = after_stat(count * binwidth)),
color = "red")
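To see why the density has to be scaled by the bin width, here is a rough base-R check. It is only a sketch: the seed, the sample and the bin width of 0.5 are arbitrary choices for illustration. A count histogram has total area n * binwidth, while a density curve has area 1, so the curve must be multiplied by n * binwidth to sit on top of the bars. In ggplot2 the computed variable count from the density stat is already density * n, which is why only binwidth shows up in the plotting code.

```r
set.seed(1)
x <- rnorm(1000)
binwidth <- 0.5  # arbitrary, chosen only for this illustration

# Count histogram: total area is n * binwidth
breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)
hist_area <- sum(h$counts * binwidth)

# Kernel density (area ~ 1), rescaled to the count scale
d <- density(x)
scaled_y <- d$y * length(x) * binwidth

# Trapezoidal rule for the area under the rescaled curve
curve_area <- sum(diff(d$x) * (head(scaled_y, -1) + tail(scaled_y, -1)) / 2)

c(histogram = hist_area, curve = curve_area)
```

Both areas should come out close to n * binwidth, which is what makes the two layers comparable on one y axis.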
Theoretical density estimates don't live in ggplot2 itself but in extension packages. Disclaimer: I'm the author of the following package, so I'm biased:
library(ggh4x)
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
stat_theodensity(aes(y = after_stat(count * binwidth)),
color = "red")
Alternatively, if you don't want to bother with setting binwidths you can also scale the histogram to density instead:
ggplot(df, aes(x)) +
geom_histogram(aes(y = after_stat(density))) +
geom_density(color = "red")
Note: after_stat() requires ggplot2 v3.3.0 or later; earlier versions use stat().
You need to multiply the value of ..count.. in the density layer by whatever binwidth is used in the histogram layer.
You can do it as follows:
library(ggplot2)
set.seed(100)
a = data.frame(z = rnorm(10000))
binwidthVal=0.1
ggplot(a, aes(x=z)) +
geom_histogram(binwidth = binwidthVal) +
geom_density(colour='red', aes(y=binwidthVal * ..count..))
Credit to Brian Diggs for the idea.
EDIT: Seems like there is already a perfectly good answer here
I have a bunch of data, that I've fitted with a gamma distribution. I've got the histogram and the fitting curve just fine, but now I wish to draw a histogram and the curve with x in log scale. Using scale_x_log10 works just fine for the histogram, but I can't make it work for the stat_function/geom_line.
I understand that's because stat_function now takes the log values, but I'm not sure how to transform the gamma function beforehand for it to work properly. Here are the relevant pictures and code snippets:
This is the original graph:
library(fitdistrplus)  # provides fitdist()
library(ggplot2)
fit.gamma2 <- fitdist(myvalues[, 1], distr = "gamma", method = "mme")
ggplot(myvalues, aes(x = V1)) +
geom_histogram(aes(y =..density..),
boundary = 0,
binwidth = sirina,
col="black",
fill="blue",
alpha=.2) +
stat_function(fun=dgamma,
args=list(shape = fit.gamma2$estimate["shape"],
rate = fit.gamma2$estimate["rate"])) +
labs(title="Histogram žarkov + Gama porazdelitev (MM)",
x = "Medprihodni časi (s)",
y = "Gostota")
This is that same graph after using scale_x_log10. The red curve is supposed to be the fitting curve, but it's obviously way off.
ggplot(myvalues, aes(x = V1)) +
geom_histogram(aes(y =..density..),
boundary = 0,
binwidth = sirina_log,
col="black",
fill="blue",
alpha=.2) +
stat_function(fun=dgamma,
args=list(shape = fit.gamma2$estimate["shape"],
rate = fit.gamma2$estimate["rate"])) +
# geom_line(aes(x=V1,y=dgamma(V1,fit.gamma2$estimate["shape"], fit.gamma2$estimate["rate"])), color="red", size = 1) +
scale_x_log10()
I have tried applying the values in 10**x form, but as my original data ranges between 0.1 and 800, some values then escape to Inf.
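You need to transform your PDF using the derivative of the log10 transformation: if y = log10(x), then dx/dy = x * log(10), so the density of y is x * log(10) * f(x). First create a function for the transformed PDF: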
dgammalog10 <- function(x, shape, rate) {
return(x*log(10)*dgamma(x, shape, rate))
}
Then you can use fun=dgammalog10 where you had fun=dgamma.
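As a sanity check on the transformed PDF, the sketch below integrates it over the log10 axis and confirms the area is still 1. The shape and rate values are hypothetical placeholders, not the fitted estimates from the question, and the integration limits are chosen to cover essentially all of the probability mass for those values.

```r
# Density of log10(X) for X ~ Gamma(shape, rate), written as a function of x
dgammalog10 <- function(x, shape, rate) {
  x * log(10) * dgamma(x, shape, rate)
}

shape <- 2     # hypothetical value, for illustration only
rate  <- 0.05  # hypothetical value, for illustration only

# Integrate over y = log10(x); x = 10^y runs from 0.001 to 10000
area_log <- integrate(
  function(y) dgammalog10(10^y, shape, rate),
  lower = -3, upper = 4
)$value

area_log
```

An area of about 1 on the log10 axis is what a histogram drawn with aes(y = ..density..) under scale_x_log10() also has, so the curve and the bars end up on the same scale.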
I'm quite new to R and I'm struggling to overlay a filled histogram divided into 6 classes and a KDE based on the whole distribution (not the individual distributions of the 6 classes).
I have this dataset with 4 columns (data1, data2, data3, origin) with all data being continuous and origin being my categories (geographical locations). I'm fine with plotting the histogram for data1 with the 6 classes but when I'm adding the KDE curve, it's also divided in 6 curves (one for each class). I think I understand I have to override the first aes argument and make a new one when I call geom_density, but I can't find how to do so.
Translating my problem to the iris dataset, I would like a single KDE curve for Sepal.Length and not one KDE curve per species. Here is my code and my results with the iris data.
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_histogram() +
theme_minimal() +
geom_density(kernel="gaussian", bw= 0.1, alpha=.3)
The problem is that the histogram displays counts, so its total area is the number of observations times the bin width, and the density plot shows, well, density, which integrates to 1. To make the two compatible you'd have to use the 'computed variables' of the stat parts of the layers, which are accessible with after_stat(). You can either scale the density so that its area matches the histogram's, or you can scale the histogram so that it integrates to 1.
Scaling the histogram to the density:
library(ggplot2)
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(aes(y = after_stat(density)),
position = 'identity') +
geom_density(bw = 0.1, alpha = 0.3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Scaling density to counts. To do this properly you should multiply the count computed variable with the binwidth parameter of the histogram.
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.2, position = 'identity') +
geom_density(aes(y = after_stat(count * 0.2)),
bw = 0.1, alpha = 0.3)
Created on 2021-06-22 by the reprex package (v1.0.0)
As a side note, the default position argument for the histogram is to stack bars on top of one another. Setting position = "identity" prevents this. Alternatively, you could also set position = "stack" in the density layer.
EDIT: Sorry, I seem to have glossed over the 'I want 1 KDE for the entire Sepal.Length' part of the question. You'd have to manually set the group, like so:
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.2) +
geom_density(bw = 0.1, alpha = 0.3,
aes(group = 1, y = after_stat(count * 0.2)))
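Another way to get a single curve, closer to the "override the first aes" idea in the question, is to stop the density layer from inheriting the plot-level mapping at all. This is a sketch of an alternative, not the approach used in the answer above; the variable name binwidth and the use of layer_data() to count the resulting curves are additions for illustration.

```r
library(ggplot2)

binwidth <- 0.2  # must match the histogram layer

p <- ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = binwidth) +
  # inherit.aes = FALSE drops fill = Species for this layer only,
  # so the density is computed once over all rows
  geom_density(
    aes(x = Sepal.Length, y = after_stat(count * binwidth)),
    bw = 0.1, alpha = 0.3,
    inherit.aes = FALSE
  )

# Number of separate density curves computed for the second layer
n_curves <- length(unique(layer_data(p, 2)$group))
n_curves
```

Because the fill mapping never reaches the density layer, there is no per-species grouping to undo, and the curve is drawn with the default fill.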
I also found a nice tutorial on combining geom_histogram() and geom_density() with matching scales on sthda.com:
http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization#combine-histogram-and-density-plots
Reprex from there is:
set.seed(1234)
df <- data.frame(
sex=factor(rep(c("F", "M"), each=200)),
weight=round(c(rnorm(200, mean=55, sd=5),
rnorm(200, mean=65, sd=5)))
)
library(ggplot2)
ggplot(df, aes(x=weight, color=sex, fill=sex)) +
geom_histogram(aes(y=..density..), alpha=0.5,position="identity") +
geom_density(alpha=.2)
I have a vector of sample means and I've been trying to plot a probability histogram using hist(x) and ggplot, but the bin heights exceed 1 (which is very unusual for a probability distribution). I then used a PlotRelativeFrequency(hist(x)) function to force R to plot a histogram of probabilities. It worked, but my problem is that I cannot plot a density function over the histogram. When I use lines(density(x)), it plots a density function that goes way off the graph.
Since your question is tagged with ggplot, I'll give a ggplot answer.
To make histograms relative you have to set aes(y = stat(density)) such that it integrates to 1. Then, you could give the stat_function() the relevant density function for any theoretical distribution. The downside is that you'll have to pre-compute the parameters.
df <- data.frame(x = rnorm(500, 10, 2))
pars <- list(mean = mean(df$x), sd = sd(df$x))
library(ggplot2)
ggplot(df, aes(x)) +
geom_histogram(binwidth = 1, aes(y = stat(density))) +
stat_function(fun = function(x) {dnorm(x, mean = pars$mean, sd = pars$sd)})
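On the point about bars exceeding 1: on the density scale that is not necessarily a problem, because it is the area of the bars that sums to 1, not their heights. The base-R sketch below uses made-up, tightly clustered data (the seed, mean and standard deviation are assumptions chosen for illustration) to show bar heights well above 1 with a total area of exactly 1.

```r
set.seed(42)
x <- rnorm(500, mean = 10, sd = 0.1)  # tightly clustered, like sample means

h <- hist(x, plot = FALSE)

# Heights on the density scale can exceed 1 when bins are narrower than 1
max_height <- max(h$density)

# The total area (height * width, summed over bins) is still 1
total_area <- sum(h$density * diff(h$breaks))

c(max_height = max_height, total_area = total_area)
```

This is why a density curve lines up with a density-scaled histogram, but goes "way off" when drawn over a relative-frequency histogram whose heights (rather than areas) sum to 1.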
Next up, we can plot the empirical density using kernel density estimates, which does everything pretty much automatically:
ggplot(df, aes(x)) +
geom_histogram(binwidth = 1, aes(y = stat(density))) +
geom_density()
Lastly, you can have a look at this stat function, which essentially automates the first version. Full disclosure: I'm the author of that GitHub repo.
library(ggnomics)
ggplot(df, aes(x)) +
geom_histogram(binwidth = 1, aes(y = stat(density))) +
stat_theodensity()
When I plot densities with ggplot, they seem to be very wrong around the limits. I see that geom_density and other functions allow specifying various density kernels, but none of them seem to fix the issue.
How do you correctly plot densities around the limits with ggplot?
As an example, let's plot the chi-square distribution with 2 degrees of freedom. Using the built-in probability densities:
library(ggplot2)
u = seq(0, 2, by=0.01)
v = dchisq(u, df=2)
df = data.frame(x=u, p=v)
p = ggplot(df) +
geom_line(aes(x=x, y=p), size=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 0.5))
show(p)
We get the expected plot:
Now let's try simulating it and plotting the empirical distribution:
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_density(aes(x=x)) +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
We get an incorrect plot:
We can try to visualize the actual distribution:
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_point(aes(x=x, y=0.5), position=position_jitter(height=0.2), shape='.', alpha=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 1))
show(p)
And it seems to look correct, contrary to the density plot:
It seems like the problem has to do with kernels, and geom_density does allow using different kernels. But they don't really correct the limit problem. For example, the code above with a triangular kernel looks about the same:
Here's an idea of what I'm expecting to see (of course, I want a density, not a histogram):
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_histogram(aes(x=x), center=0.1, binwidth=0.2, fill='white', color='black') +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
The usual kernel density methods have trouble when the support is constrained, as in this case, where the density is only defined above zero. The usual recommendation for handling this has been to use the logspline package:
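To make the boundary problem concrete, the base-R sketch below compares a plain kernel estimate at x = 0 with the true chi-square(2) density there, which is 0.5. It also tries a simple reflection correction (mirroring the sample around 0 and doubling the estimate). Reflection is a different technique from logspline and is shown here only as an illustrative assumption, not as the method recommended in this answer; the seed and sample size are arbitrary.

```r
set.seed(1)
x <- rchisq(10000, df = 2)
true_at_0 <- dchisq(0, df = 2)  # true density at the boundary

# Plain kernel estimate: roughly half the kernel mass at 0 falls on x < 0,
# where there is no data, so the estimate is biased downwards
plain <- density(x, from = 0, to = 2)
plain_at_0 <- plain$y[1]

# Reflection: mirror the sample around 0, estimate, then double on x >= 0
refl <- density(c(x, -x), from = 0, to = 2)
refl_at_0 <- 2 * refl$y[1]

c(true = true_at_0, plain = plain_at_0, reflected = refl_at_0)
```

The corrected curve could then be drawn in ggplot2 with geom_line() on a data frame built from refl$x and 2 * refl$y.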
install.packages("logspline")
library(logspline)
png(); fit <- logspline(rchisq(10000, 3))
plot(fit) ; dev.off()
If this needs to be done in ggplot2, there is a dlogspline function:
densdf <- data.frame( y=dlogspline(seq(0,12,length=1000), fit),
x=seq(0,12,length=1000))
ggplot(densdf, aes(y=y,x=x))+geom_line()
Perhaps you were insisting on one with 2 degrees of freedom?
I have a question that is probably similar to Fitting a density curve to a histogram in R. Using qplot I have created 7 histograms with this command:
qplot(V1, data=data, binwidth=10, facets=V2~.)
For each slice, I would like to add a fitted Gaussian curve. When I try to use the lines() method, I get this error:
Error in plot.xy(xy.coords(x, y), type = type, ...) :
plot.new has not been called yet
What is the command to do it correctly?
Have you tried stat_function?
+ stat_function(fun = dnorm)
You'll probably want to plot the histograms using aes(y = ..density..) in order to plot the density values rather than the counts.
A lot of useful information can be found in this question, including some advice on plotting different normal curves on different facets.
Here are some examples:
dat <- data.frame(x = c(rnorm(100),rnorm(100,2,0.5)),
a = rep(letters[1:2],each = 100))
Overlay a single normal density on each facet:
ggplot(data = dat,aes(x = x)) +
facet_wrap(~a) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm, colour = "red")
From the question I linked to, create a separate data frame with the different normal curves:
library(plyr)  # provides ddply()
grid <- with(dat, seq(min(x), max(x), length = 100))
normaldens <- ddply(dat, "a", function(df) {
data.frame(
predicted = grid,
density = dnorm(grid, mean(df$x), sd(df$x))
)
})
And plot them separately using geom_line:
ggplot(data = dat,aes(x = x)) +
facet_wrap(~a) +
geom_histogram(aes(y = ..density..)) +
geom_line(data = normaldens, aes(x = predicted, y = density), colour = "red")
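If plyr is not available, the same per-facet data frame can be built with base R. This is a sketch of an alternative and not the code from the linked question; it assumes the same dat structure as above and adds a seed only to make the example reproducible.

```r
set.seed(1)
dat <- data.frame(x = c(rnorm(100), rnorm(100, 2, 0.5)),
                  a = rep(letters[1:2], each = 100))

# Common x grid spanning the whole data range
grid <- seq(min(dat$x), max(dat$x), length.out = 100)

# One normal curve per level of 'a', using that group's mean and sd
normaldens <- do.call(rbind, lapply(split(dat, dat$a), function(df) {
  data.frame(
    a = df$a[1],  # keep the facet variable so facet_wrap(~a) can match it
    predicted = grid,
    density = dnorm(grid, mean(df$x), sd(df$x))
  )
}))

head(normaldens)
```

The result has the same columns as the ddply version (a, predicted, density), so it can be passed to the geom_line() layer unchanged.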
ggplot2 uses a different graphics paradigm than base graphics. Although you can use grid graphics with it, the best way is to add a new stat_function layer to the plot. The ggplot2 code is the following.
Note that I couldn't get this to work using qplot, but the transition to ggplot is reasonably straightforward. The most important difference is that your data must be in data.frame format.
Also note the explicit mapping of the y aesthetic, aes(y=..density..). This is slightly unusual, but it takes the density computed by stat_bin and maps it to the y axis:
library(ggplot2)
data <- data.frame(V1 = rnorm(700), V2 = sample(LETTERS[1:7], 700, replace = TRUE))
ggplot(data, aes(x=V1)) +
stat_bin(aes(y=..density..)) +
stat_function(fun=dnorm) +
facet_grid(V2~.)