I'm quite new to R and I'm struggling to overlay a filled histogram divided into 6 classes with a KDE based on the whole distribution (not the individual distributions of the 6 classes).
I have a dataset with 4 columns (data1, data2, data3, origin), where the data columns are continuous and origin holds my categories (geographical locations). I'm fine with plotting the histogram for data1 with the 6 classes, but when I add the KDE curve, it is also split into 6 curves (one for each class). I think I understand that I have to override the first aes argument and make a new one when I call geom_density, but I can't find how to do so.
Translating my problem to the iris dataset, I would like a single KDE curve for Sepal.Length and not one KDE curve of Sepal.Length for each species. Here is my code and my results with the iris data.
ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
geom_histogram() +
theme_minimal() +
geom_density(kernel="gaussian", bw= 0.1, alpha=.3)
The problem is that the histogram displays counts, so its total area equals the number of observations times the bin width, whereas the density plot shows, well, density, which integrates to 1. To make the two compatible you have to use the 'computed variables' of the stat parts of the layers, which are accessible with after_stat(). You can either scale the density so that it covers the same area as the count histogram, or you can scale the histogram so that it integrates to 1.
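To make the mismatch concrete, here is a small numeric check in base R (no ggplot2 needed). It is only a sketch: the bin width of 0.2 and the variable names are illustrative choices, not something taken from the question.

```r
# Compare the area under a count histogram with the area under a KDE.
x <- iris$Sepal.Length
binwidth <- 0.2  # illustrative bin width

# Breaks that cover the whole range of x
breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Area of the bars when the y axis shows counts: n * binwidth
count_area <- sum(h$counts * binwidth)

# Area of the bars when the y axis shows density: 1
density_area <- sum(h$density * binwidth)

# Riemann-sum approximation of the area under the kernel density estimate
d <- density(x, bw = 0.1)
kde_area <- sum(d$y) * mean(diff(d$x))

c(count_area = count_area, density_area = density_area, kde_area = kde_area)
```

The count histogram covers an area of n * binwidth while the KDE covers an area of about 1, which is why one of the two layers has to be rescaled before they can share a y axis.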
Scaling the histogram to the density:
library(ggplot2)
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(aes(y = after_stat(density)),
position = 'identity') +
geom_density(bw = 0.1, alpha = 0.3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Scaling the density to counts. To do this properly you should multiply the count computed variable by the binwidth parameter of the histogram.
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.2, position = 'identity') +
geom_density(aes(y = after_stat(count * 0.2)),
bw = 0.1, alpha = 0.3)
Created on 2021-06-22 by the reprex package (v1.0.0)
As a side note: the default position argument for the histogram is to stack bars on top of one another. Setting position = "identity" prevents this. Alternatively, you could also set position = "stack" in the density layer.
EDIT: Sorry, I seem to have glossed over the 'I want 1 KDE for the entire Sepal.Length' part of the question. You'd have to manually set the group, like so:
ggplot(iris, aes(Sepal.Length, fill = Species)) +
geom_histogram(binwidth = 0.2) +
geom_density(bw = 0.1, alpha = 0.3,
aes(group = 1, y = after_stat(count * 0.2)))
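An alternative sketch, not taken from the answer above: if fill = Species is mapped only in the histogram layer rather than in the global aes(), the density layer has nothing to split on and falls back to a single group. The object name p and the bin width of 0.2 are just illustrative.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length)) +
  # fill is mapped in this layer only, so only the bars are split by species
  geom_histogram(aes(fill = Species), binwidth = 0.2) +
  # no discrete aesthetic here, so one KDE is computed for all rows;
  # count * binwidth puts it on the same scale as the stacked counts
  geom_density(aes(y = after_stat(count * 0.2)), bw = 0.1, alpha = 0.3)

# Inspect the computed data of the density layer: it has a single group
kde_groups <- unique(layer_data(p, 2)$group)
p
```

The trade-off is that the density area is no longer filled by species colour, which is usually what you want for an overall curve anyway.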
I also found a nice tutorial on combining geom_histogram() and geom_density() with matching scale on sthda.com:
http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization#combine-histogram-and-density-plots
Reprex from there is:
set.seed(1234)
df <- data.frame(
sex=factor(rep(c("F", "M"), each=200)),
weight=round(c(rnorm(200, mean=55, sd=5),
rnorm(200, mean=65, sd=5)))
)
library(ggplot2)
ggplot(df, aes(x=weight, color=sex, fill=sex)) +
geom_histogram(aes(y=..density..), alpha=0.5,position="identity") +
geom_density(alpha=.2)
I would like to plot a background that captures the density of points in one dimension in a scatter plot. This would serve a similar purpose to a marginal density plot or a rug plot. I have a way of doing it that is not particularly elegant, I am wondering if there's some built-in functionality I can use to produce this kind of plot.
Mainly there are a few issues with the current approach:
Alpha overlap at boundaries causes banding at lower resolution as seen here. - Primary objective, looking for a geom or other solution that draws a nice continuous band filled with specific colour. Something like geom_density_2d() but with the stat drawn from only the X axis.
"Background" does not cover expanded area, can use coord_cartesian(expand = FALSE) but would like to cover regular margins. - Not a big deal, is nice-to-have but not required.
Setting scale_fill "consumes" the option for the plot, not allowing it to be set independently for the points themselves. - This may not be easily achievable, independent palettes for layers appears to be a fundamental issue with ggplot2.
library(tidyverse)  # provides tibble(), %>%, mutate() and ggplot()
data(iris)
dns <- density(iris$Sepal.Length)
dns_df <- tibble(
x = dns$x,
density = dns$y
)%>%
mutate(
start = x - mean(diff(x))/2,
end = x + mean(diff(x))/2
)
ggplot() +
geom_rect(
data = dns_df,
aes(xmin = start, xmax = end, fill = density),
ymin = min(iris$Sepal.Width),
ymax = max(iris$Sepal.Width),
alpha = 0.5) +
scale_fill_viridis_c(option = "A") +
geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_rug(data = iris, aes(x = Sepal.Length))
This is a bit of a hacky solution because it (ab)uses knowledge of how objects are internally parametrised, which will yield some warnings, but it gets you what you want.
First, we'll use a geom_raster() + stat_density() decorated with some choice after_stat()/stage() delayed evaluation. Normally, this would result in a height = 1 strip, but by setting the internal parameters ymin/ymax to infinities, we'll have the strip extend over the whole height of the plot. Using geom_raster() resolves the alpha issue you were having.
library(ggplot2)
p <- ggplot(iris) +
geom_raster(
aes(Sepal.Length,
y = mean(Sepal.Width),
fill = after_stat(density),
ymin = stage(NULL, after_scale = -Inf),
ymax = stage(NULL, after_scale = Inf)),
stat = "density", alpha = 0.5
)
#> Warning: Ignoring unknown aesthetics: ymin, ymax
p
#> Warning: Duplicated aesthetics after name standardisation: NA
Next, we add a fill scale, and immediately follow that by ggnewscale::new_scale_fill(). This allows another layer to use a second fill scale, as demonstrated with fill = Species.
p <- p +
scale_fill_viridis_c(option = "A") +
ggnewscale::new_scale_fill() +
geom_point(aes(Sepal.Length, Sepal.Width, fill = Species),
shape = 21) +
geom_rug(aes(Sepal.Length))
p
#> Warning: Duplicated aesthetics after name standardisation: NA
Lastly, to get rid of the padding at the x-axis, we can manually extend the limits and then shrink in the expansion. It allows for an extended range over which the density can be estimated, making the raster fill the whole area. There is some mismatch between how ggplot2 and scales::expand_range() are parameterised, so the exact values are a bit of trial and error.
p +
scale_x_continuous(
limits = ~ scales::expand_range(.x, mul = 0.05),
expand = c(0, -0.2)
)
#> Warning: Duplicated aesthetics after name standardisation: NA
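To see what the limits function above actually returns, here is a minimal sketch that calls scales::expand_range() directly on the range of Sepal.Length. The names rng and widened are illustrative.

```r
library(scales)

rng <- range(iris$Sepal.Length)  # observed data range

# mul = 0.05 widens each end by 5% of the span of the range
widened <- expand_range(rng, mul = 0.05)

rng
widened
```

The second element of expand in scale_x_continuous() is an additive amount in data units, so a negative value pulls each end of the axis back in again after the limits were widened.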
Created on 2022-07-04 by the reprex package (v2.0.1)
This doesn't solve your problem (I'm not sure I understand all the issues correctly), but perhaps it will help:
Background does not cover expanded area, can use coord_cartesian(expand = FALSE) but would like to cover regular margins.
If you make the 'background' larger and use coord_cartesian() you can get the same 'filled-to-the-edges' effect; would this work for your use-case?
Alpha overlap at boundaries causes banding at lower resolution as seen here.
I wasn't able to fix the banding completely, but my approach below appears to reduce it.
Setting scale_fill "consumes" the option for the plot, not allowing it to be set independently for the points themselves.
If you use geom_segment() you can map density to colour, leaving fill available for e.g. the points. Again, not sure if this is a useable solution, just an idea that might help.
library(tidyverse)
data(iris)
dns <- density(iris$Sepal.Length)
dns_df <- tibble(
x = dns$x,
density = dns$y
) %>%
mutate(
start = x - mean(diff(x))/2,
end = x + mean(diff(x))/2
)
ggplot() +
geom_segment(
data = dns_df,
aes(x = start, xend = end,
y = min(iris$Sepal.Width) * 0.9,
yend = max(iris$Sepal.Width) * 1.1,
color = density), alpha = 0.5) +
coord_cartesian(ylim = c(min(iris$Sepal.Width),
max(iris$Sepal.Width)),
xlim = c(min(iris$Sepal.Length),
max(iris$Sepal.Length))) +
scale_color_viridis_c(option = "A", alpha = 0.5) +
scale_fill_viridis_d() +
geom_point(data = iris, aes(x = Sepal.Length,
y = Sepal.Width,
fill = Species),
shape = 21) +
geom_rug(data = iris, aes(x = Sepal.Length))
Created on 2022-07-04 by the reprex package (v2.0.1)
I created the plot below using:
ggplot(data_all, aes(x = Speed, fill = Season)) +
theme_bw() +
geom_histogram(position = "identity", alpha = 0.2, binwidth=0.1)
As you can see, the difference in the amount of data available is very large. Is there a way to look only at the distribution and not at the total data amount?
You can reference some of the other values calculated by stat functions using a notation that you may have seen before: the variable name wrapped in double dots, as in ..value.. here. The ggplot2 documentation calls these "computed variables" and lists them in the "Computed variables" section of each stat's help page (for example ?stat_bin); you will also see them called "special variables" or "calculated aesthetics".
In this case, the default calculated aesthetic on the y axis for geom_histogram() is ..count.. (the number of observations in each bin). When comparing distributions with very different total N, it's more useful to use ..density.. instead. You can access ..density.. by passing it to the y aesthetic directly in the geom_histogram() function.
First, here's an example of two histograms with vastly different sizes (similar to OP's question):
library(ggplot2)
set.seed(8675309)
df <- data.frame(
x = c(rnorm(1000, -1, 0.5), rnorm(100000, 3, 1)),
group = c(rep("A", 1000), rep("B", 100000))
)
ggplot(df, aes(x, fill=group)) + theme_classic() +
geom_histogram(
alpha=0.2, color='gray80',
position="identity", bins=80)
And here's the same plot using ..density..:
ggplot(df, aes(x, fill=group)) + theme_classic() +
geom_histogram(
aes(y=..density..), alpha=0.2, color='gray80',
position="identity", bins=80)
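For newer ggplot2 versions, here is a sketch of the same plot written with after_stat(density) instead of the double-dot notation, followed by a check on the computed bars. The names p, bars and areas are illustrative; layer_data() is only used to inspect what the stat computed.

```r
library(ggplot2)

set.seed(8675309)
df <- data.frame(
  x = c(rnorm(1000, -1, 0.5), rnorm(100000, 3, 1)),
  group = rep(c("A", "B"), times = c(1000, 100000))
)

p <- ggplot(df, aes(x, fill = group)) +
  theme_classic() +
  geom_histogram(
    aes(y = after_stat(density)), alpha = 0.2, colour = "gray80",
    position = "identity", bins = 80
  )

# Sum bar height * bar width within each group
bars <- layer_data(p, 1)
areas <- tapply(bars$density * (bars$xmax - bars$xmin), bars$group, sum)
areas
```

Each group's bars add up to an area of 1 regardless of how many rows the group has, which is what makes the two shapes directly comparable.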
I'm a novice with the R programming language. What is the standard/general method for overlaying a density curve on a histogram using ggplot2?
It depends on whether you want an empirical density estimate or to fit a theoretical density. In both cases, you'd need to match the width of the histogram bins to the density.
For the empirical kernel density estimates:
library(ggplot2)
# dummy data
df <- data.frame(
x = rnorm(1000)
)
binwidth <- 0.1
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
geom_density(aes(y = after_stat(count * binwidth)),
color = "red")
Theoretical density estimates don't live in ggplot2 but in extension packages. Disclaimer: I'm the author of the following package, so I'm biased:
library(ggh4x)
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
stat_theodensity(aes(y = after_stat(count * binwidth)),
color = "red")
Alternatively, if you don't want to bother with setting binwidths you can also scale the histogram to density instead:
ggplot(df, aes(x)) +
geom_histogram(aes(y = after_stat(density))) +
geom_density(color = "red")
Note: after_stat() requires ggplot2 v3.3.0, earlier versions use stat().
You need to make sure to multiply the value of ..count.. in the density plot call by whatever the binwidth is in the histogram call.
You can do it as follows:
library(ggplot2)
set.seed(100)
a = data.frame(z = rnorm(10000))
binwidthVal=0.1
ggplot(a, aes(x=z)) +
geom_histogram(binwidth = binwidthVal) +
geom_density(colour='red', aes(y=binwidthVal * ..count..))
Credit to Brian Diggs for the idea.
EDIT: Seems like there is already a perfectly good answer here
I want a boxplot to be overlaid on a histogram. To avoid messing with the histogram, I am forced to draw the boxplot like this:
library(ggplot2)
ggplot(iris) + geom_boxplot(aes(x = Sepal.Length, y = factor(0)))
However, the plot doesn't look right unless I swap x and y.
I want to combine a histogram and a boxplot on the same coordinates, but there seems to be no way to draw a flipped boxplot without coord_flip(), which doesn't help here because it flips the whole plot.
ggplot(iris) +
geom_histogram(aes(x = Sepal.Length))+
geom_boxplot(aes(x = Sepal.Length, y = factor(0))) +
coord_flip()
Something like this?
library(ggplot2)
library(ggstance)
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram() +
geom_boxploth(aes(y = 3), width = 3, color = "blue", lwd = 2, alpha = .5) +
theme_minimal()
This works in the current development version of ggplot2, hopefully to be released soon.
library(ggplot2) # remotes::install_github("tidyverse/ggplot2")
packageVersion("ggplot2")
#> [1] '3.2.1.9000'
ggplot(iris) +
geom_histogram(aes(x = Sepal.Length))+
geom_boxplot(aes(x = Sepal.Length, y = factor(0)))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2019-11-12 by the reprex package (v0.3.0)
I am trying to recreate a figure from a ggplot2 seminar http://dl.dropbox.com/u/42707925/ggplot2/ggplot2slides.pdf.
In this case, I am trying to generate Example 5, with jittered data points subject to a dodge. When I run the code, the points are centered around the correct line, but have no jitter.
Here is the code directly from the presentation.
set.seed(12345)
hillest<-c(rep(1.1,100*4*3)+rnorm(100*4*3,sd=0.2),
rep(1.9,100*4*3)+rnorm(100*4*3,sd=0.2))
rep<-rep(1:100,4*3*2)
process<-rep(rep(c("Process 1","Process 2","Process 3","Process 4"),each=100),3*2)
memorypar<-rep(rep(c("0.1","0.2","0.3"),each=4*100),2)
tailindex<-rep(c("1.1","1.9"),each=3*4*100)
ex5<-data.frame(hillest=hillest,rep=rep,process=process,memorypar=memorypar, tailindex=tailindex)
stat_sum_df <- function(fun, geom="crossbar", ...) {stat_summary(fun.data=fun, geom=geom, ...) }
dodge <- position_dodge(width=0.9)
p<- ggplot(ex5,aes(x=tailindex ,y=hillest,color=memorypar))
p<- p + facet_wrap(~process,nrow=2) + geom_jitter(position=dodge) +geom_boxplot(position=dodge)
p
In ggplot2 version 1.0.0 there is a new position named position_jitterdodge() that is made for this situation. This position should be used inside geom_point(), and fill= should be set inside aes() to show by which variable to dodge your data. To control the width of the dodging, use the dodge.width= argument.
ggplot(ex5, aes(x=tailindex, y=hillest, color=memorypar, fill=memorypar)) +
facet_wrap(~process, nrow=2) +
geom_point(position=position_jitterdodge(dodge.width=0.9)) +
geom_boxplot(fill="white", outlier.colour=NA, position=position_dodge(width=0.9))
EDIT: There is a better solution with ggplot2 version 1.0.0 using position_jitterdodge. See @Didzis Elferts' answer. Note that dodge.width controls the width of the dodging and jitter.width controls the width of the jittering.
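A small self-contained sketch of how the two arguments interact, using made-up toy data with the same column names as ex5 (the data, the seed and the value jitter.width = 0.2 are illustrative assumptions, not values from the presentation):

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  tailindex = rep(c("1.1", "1.9"), each = 60),
  memorypar = rep(rep(c("0.1", "0.2", "0.3"), each = 20), times = 2),
  hillest   = rnorm(120, mean = rep(c(1.1, 1.9), each = 60), sd = 0.2)
)

p <- ggplot(toy, aes(tailindex, hillest, colour = memorypar, fill = memorypar)) +
  geom_boxplot(fill = "white", outlier.colour = NA,
               position = position_dodge(width = 0.9)) +
  # dodge.width matches the boxplot dodge so points sit over their own box;
  # jitter.width sets how far the points spread sideways around that centre
  geom_point(position = position_jitterdodge(jitter.width = 0.2,
                                             dodge.width = 0.9, seed = 1))

# x positions of the points after dodging and jittering
pts <- layer_data(p, 2)
p
```

Keeping dodge.width equal to the width passed to position_dodge() in the boxplot layer is what keeps each cloud of points centred on its box.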
I'm not sure how the code produced the graph in the pdf.
But does something like this get you close to what you're after?
I convert tailindex and memorypar to numeric; add them together; and the result is the x coordinate for the geom_jitter layer. There's probably a more effective way to do it. Also, I'd like to see how dodging geom_boxplot and geom_jitter, and with no jittering, will produce the graph in the pdf.
library(ggplot2)
dodge <- position_dodge(width = 0.9)
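# factor() first, so this also works when tailindex is a character column
# (the data.frame() default since R 4.0.0) rather than a factor
ex5$memorypar2 <- as.numeric(factor(ex5$tailindex)) +
  3 * (as.numeric(as.character(ex5$memorypar)) - 0.2)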
p <- ggplot(ex5,aes(x=tailindex , y=hillest)) +
scale_x_discrete() +
geom_jitter(aes(colour = memorypar, x = memorypar2),
position = position_jitter(width = .05), alpha = 0.5) +
geom_boxplot(aes(colour = memorypar), outlier.colour = NA, position = dodge) +
facet_wrap(~ process, nrow = 2)
p