Ggplot boxplot by group, change summary statistics shown - r

I want to change the summary statistics shown in the following boxplot:
I have created the boxplot as follows:
ggplot(as.data.frame(beta2), aes(y=var1,x=as.factor(Year))) +
geom_boxplot(outlier.shape = NA)+
ylab(expression(beta[1]))+
xlab("\nYear")+
theme_bw()
By default, the box spans the first and third quartiles. I want the box to show the 2.5% and 97.5% quantiles instead. I know one can easily change what is shown when a single boxplot is drawn by adding the following to geom_boxplot:
aes(
ymin= min(var1),
lower = quantile(var1,0.025),
middle = mean(var1),
upper = quantile(var1,0.975),
ymax=max(var1))
However, this does not work when the boxplots are generated by group. Any idea how to do this? You can use the iris data set:
ggplot(iris, aes(y=Sepal.Length,x=Species)) +
geom_boxplot(outlier.shape = NA)
EDIT:
The accepted answer does work. My data frame is really big, so the method provided takes a while. I found another solution here: SOLUTION that works for large datasets and my specific need.

This can be achieved via stat_summary by setting geom = "boxplot" and passing to fun.data a function that returns a data frame with the summary statistics you want to display as ymin, lower, middle, upper and ymax in your boxplot. The function is applied to each group separately:
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
stat_summary(geom = "boxplot", fun.data = function(x) {
data.frame(
ymin = min(x),
lower = quantile(x, 0.025),
middle = mean(x),
upper = quantile(x, 0.975),
ymax = max(x)
)}, outlier.shape = NA)

Related

R ggplot2 - why does geom_boxplot ignore aesthetics "ymin", "lower", "middle", "upper", "ymax"?

For some testing purposes, I am trying to make boxplots where the upper and lower whiskers extend to the max and min data points (respectively), instead of treating them like outliers.
Not completely sure how this can be done best, but I figured I would just change my definition of upper and lower whisker to max() and min(), and pass them to geom_boxplot as ymin and ymax aesthetics (as indicated here: https://ggplot2.tidyverse.org/reference/geom_boxplot.html)
However, even when I get no warnings or errors, those aesthetics seem to be completely ignored.
See the MWE below with iris data:
data("iris")
iris_long <- as.data.frame(tidyr::pivot_longer(subset(iris, Species=="setosa"), -Species, names_to = "Var", values_to = "Value"))
iris_long_dt <- data.table::data.table(iris_long)
ddf <- as.data.frame(iris_long_dt[,list(y0=min(Value, na.rm=T),
y100=max(Value, na.rm=T)),
by=list(Var=Var)])
iris_long <- merge(iris_long, ddf, by="Var")
grDevices::png(filename="test.png", height=600, width=600)
P <- ggplot2::ggplot(iris_long, ggplot2::aes(x=Var, y=Value)) +
ggplot2::geom_boxplot(ggplot2::aes(fill=Var, ymin=y0, ymax=y100),
position=ggplot2::position_dodge(.9)) +
ggplot2::stat_summary(fun=mean, geom="point", shape=5, size=2) +
ggplot2::theme_light()
print(P)
grDevices::dev.off()
This is what I get (which is identical to not passing ymin and ymax):
What I would expect is that the whiskers would extend to the min and max outlier data points (and hence the outliers not plotted, since they would not be outliers anymore).
Why is this happening? Am I doing something wrong? Thanks!
Unfortunately it's not possible to provide only some of the boxplot stats. If you want to draw a boxplot manually you have to provide all stats and use stat="identity".
But to extend the whiskers over the whole data range you could use coef=Inf.
library(ggplot2)
ggplot(iris_long, aes(x = Var, y = Value)) +
geom_boxplot(aes(fill = Var),
position = position_dodge(.9), coef = Inf
) +
stat_summary(fun = mean, geom = "point", shape = 5, size = 2) +
theme_light()
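For the fully manual route mentioned above, here is a minimal sketch, assuming the five statistics are precomputed per group and then handed to geom_boxplot with stat = "identity". The names box_stats and groups are illustrative only, and the data is rebuilt with base R so the snippet runs on its own.

```r
library(ggplot2)

# Illustrative data: the four numeric iris columns for setosa, in long form.
setosa <- subset(iris, Species == "setosa", select = -Species)
iris_long <- stack(setosa)                 # columns: values, ind
names(iris_long) <- c("Value", "Var")

# One row per group holding all five boxplot statistics.
# Whiskers go to the min and max, the box keeps the usual quartiles.
groups <- split(iris_long$Value, iris_long$Var)
box_stats <- data.frame(
  Var    = names(groups),
  ymin   = sapply(groups, min),
  lower  = sapply(groups, quantile, probs = 0.25, names = FALSE),
  middle = sapply(groups, median),
  upper  = sapply(groups, quantile, probs = 0.75, names = FALSE),
  ymax   = sapply(groups, max)
)

# stat = "identity" tells geom_boxplot to draw the supplied values as they are.
p <- ggplot(box_stats, aes(x = Var, fill = Var)) +
  geom_boxplot(
    aes(ymin = ymin, lower = lower, middle = middle, upper = upper, ymax = ymax),
    stat = "identity"
  ) +
  theme_light()
p
```

Because the summary is computed once up front, this layout may also suit large data frames, though that has not been benchmarked here.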

ggplot2 barplots with errorbars when using stacked bars

I'm trying to produce a stacked barplot with an error bar which represents the total variability per bar. I don't want to use a dodged barplot as I have >10 categories per bar.
Below I have some sample data for a reproducible example:
scenario = c('A','A','A','A')
strategy = c('A','A','A','A')
decile = c(0,0,10,10)
asset = c('A','B','A','B')
lower = c(10,20,10, 15)
mean = c(30,50,60, 70)
upper = c(70,90,86,90)
data = data.frame(scenario, strategy, decile, asset, lower, mean, upper)
And once we have the data df we can use ggplot2 to create a stacked bar as so:
ggplot(data, aes(x=decile, y=mean, fill=asset)) +
geom_bar(stat="identity") +
facet_grid(strategy~scenario) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.25)
However, the error bars produced are for each individual component of each stacked bar:
I appreciate this results from me providing the lower, mean and upper for each row of the df, but even when I summed these per decile I didn't get my desired errorbars at the top of each bar stack.
What is the correct ggplot2 code, or alternatively, what is the correct data structure to enable this?
I think you're correct in realising you need to manipulate your data rather than your plot. You can't really have position_stack on an errorbar, so you'll need to recalculate the mean, upper and lower values for the errorbars. Essentially this means getting the cumulative sum of the mean values, and shifting the upper and lower ranges accordingly. You can do this inside a dplyr pipe.
Note I think you will also need to have a position_dodge on the error bars, since their range overlaps even when shifted appropriately, which will make them harder to interpret visually:
library(ggplot2)
library(dplyr)
data %>%
mutate(lower = lower - mean, upper = upper - mean) %>%
group_by(decile) %>%
arrange(desc(asset), .by_group = TRUE) %>%
mutate(mean2 = cumsum(mean), lower = lower + mean2, upper = upper + mean2) %>%
ggplot(aes(x = decile, y = mean, fill = asset)) +
geom_bar(stat = "identity") +
facet_grid(strategy ~ scenario) +
geom_errorbar(aes(y = mean2, ymin = lower, ymax = upper), width = 2,
position = position_dodge(width = 2)) +
geom_point(aes(y = mean2), position = position_dodge(width = 2))
If you want only one error bar per decile, you should aggregate the values so that there is no difference between assets, like this:
library(ggplot2)
library(dplyr)
#Code
data %>% group_by(scenario,decile) %>%
mutate(nlower=mean(lower),nupper=mean(upper)) %>%
ggplot(aes(x=factor(decile), y=mean, fill=asset,group=scenario)) +
geom_bar(stat="identity") +
facet_grid(strategy~scenario) +
geom_errorbar(aes(ymin = nlower, ymax = nupper), width = 0.25)
Output:
Using the per-asset values is a different matter: each asset has its own lower and upper values, so each one gets its own error bar:
#Code 2
data %>%
ggplot(aes(x=factor(decile), y=mean, fill=asset,group=scenario)) +
geom_bar(stat="identity") +
facet_grid(strategy~scenario) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.25)
Output:
In this last version, each asset has its own error bar. If you want to show the error for the bar as a whole, you should aggregate the limits, as was done above with mean values, or with any other measure you wish.
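One way to place a single error bar at the top of each stack is to summarise the limits per bar in a separate data frame and give that data frame to geom_errorbar. The sketch below assumes, purely for illustration, that the total interval is the sum of the per-asset bounds; whether that is statistically appropriate depends on your data. The names totals, total_lower, total_mean and total_upper are made up for this example.

```r
library(ggplot2)
library(dplyr)

# Sample data from the question.
data <- data.frame(
  scenario = "A", strategy = "A",
  decile = c(0, 0, 10, 10),
  asset  = c("A", "B", "A", "B"),
  lower  = c(10, 20, 10, 15),
  mean   = c(30, 50, 60, 70),
  upper  = c(70, 90, 86, 90)
)

# Assumption: per-bar limits are the sums of the per-asset limits.
totals <- data %>%
  group_by(scenario, strategy, decile) %>%
  summarise(
    total_lower = sum(lower),
    total_mean  = sum(mean),
    total_upper = sum(upper),
    .groups = "drop"
  )

# The bars use the original rows; the error bars use one row per decile.
p <- ggplot(data, aes(x = factor(decile), y = mean)) +
  geom_col(aes(fill = asset)) +
  geom_errorbar(
    data = totals,
    aes(x = factor(decile), ymin = total_lower, ymax = total_upper),
    width = 0.25, inherit.aes = FALSE
  ) +
  facet_grid(strategy ~ scenario)
p
```

Keeping the totals in their own data frame means the fill aesthetic no longer splits the error bars by asset.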

Question regarding plotting 95% Intervals using ggplot2

I have a data frame with several columns and I need to draw boxplots and some kind of interval plot (from the 2.5% to the 97.5% quantile) for each column.
My data set looks like this:
set.seed(123)
x1=rnorm(100,0,1)
x2=rnorm(100,0,0.5)
x3=rnorm(100,0,0.6)
data_x=data.frame(x1,x2,x3)
I was able to draw the boxplots for this data using following lines of code:
library(tidyr)
library(ggplot2)
datax_long=data_x %>% gather(x ,value ,x1:x3)
ggplot(data=datax_long, aes(y= x, x=value, fill=x))+ geom_boxplot()
Now I need to draw an interval plot for each column: a horizontal line from the 2.5th percentile to the 97.5th percentile. The range of values for each variable should be roughly the same as in the boxplot.
Is this something we can do using ggplot2 package in R ?
Thank you
Something like this should work:
ggplot(datax_long, aes(x = value, y = x)) +
stat_summary(geom = "errorbarh",
fun.min = function(z) quantile(z, .025),
fun = mean,
fun.max = function(z) quantile(z, 0.975), color = "red") +
stat_summary(geom = "point", fun = mean, color = "blue")

How to vary line and ribbon colours in a facet_grid

I'm hoping someone can help with this plotting problem I have. The data can be found here.
Basically I want to plot a line (mean) and its associated confidence interval (lower, upper) for 4 models I have tested. I want to facet on the Cat_Auth variable, which has 4 categories (so 4 plots). The first 'model' is actually just the mean of the sample data and I don't want a CI for it (NA values specified in the data; not sure if this is the correct thing to do).
I can get the plot some way there with:
newdata <- read.csv("data.csv", header=T)
ggplot(newdata, aes(x = Affil_Max, y = Mean)) +
geom_line(data = newdata, aes(), colour = "blue") +
geom_ribbon(data = newdata, alpha = .5, aes(ymin = Lower, ymax = Upper, group = Model, fill = Model)) +
facet_grid(.~ Cat_Auth)
But I'd like different coloured lines and shaded ribbons for each model (e.g. a red mean line and red shaded ribbon for model 2, green for model 3 etc). Also, I can't figure out why the blue line corresponding to the first set of mean values is disjointed as it is.
Would be really grateful for any assistance!
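Try this. Mapping both colour and group to Model gives each model its own line; without a group, geom_line joins the points from all models into a single path, which is why the blue line looks disjointed: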
library(dplyr)
library(ggplot2)
newdata %>%
mutate(Model = as.factor(Model)) %>%
ggplot(aes(Affil_Max, Mean)) +
geom_line(aes(color = Model, group = Model)) +
geom_ribbon(alpha = .5, aes(ymin = Lower, ymax = Upper,
group = Model, fill = Model)) +
facet_grid(. ~ Cat_Auth)

Plot min, max, median for each x value in geom_pointrange

So I know the better way to approach this is to use the stat_summary() function, but this is to address a question presented in Hadley's R for Data Science book mostly for my own curiosity. It asks how to convert code for an example plot made using stat_summary() to make the same plot with geom_pointrange(). The example is:
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
And the plot should look like this:
(source: had.co.nz)
I've attempted with code such as:
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
geom_pointrange(mapping = aes(ymin = min(depth), ymax = max(depth)))
However, this plots the min and max for all depth values across each cut category (i.e., all ymin's and ymax's are the same). I also tried passing a vector of mins and maxs, but ymin only takes single values as far as I can tell. It's probably something simple, but I think people mostly use stat_summary() as I've found very few examples of geom_pointrange() usage via Google.
I think you need to do the summary outside the plot function to use geom_pointrange:
library(dplyr)
library(ggplot2)
summary_diamonds <- diamonds %>%
group_by(cut) %>%
summarise(lower = min(depth), upper = max(depth), p = median(depth))
ggplot(data = summary_diamonds, mapping = aes(x = cut, y = p)) +
geom_pointrange(mapping = aes(ymin = lower, ymax = upper))
geom_pointrange includes a stat argument, so you can do the statistical transformation inline https://stackoverflow.com/a/41865061
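As a minimal sketch of that inline approach, assuming ggplot2 3.3.0 or later (where the summary arguments are named fun, fun.min and fun.max):

```r
library(ggplot2)

# stat = "summary" makes geom_pointrange compute the statistics per x value;
# the fun* arguments are passed on to the summary stat.
p <- ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  geom_pointrange(
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
p
```

This keeps the raw diamonds data in the plot call, so no separate summary data frame is needed.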
