I'm learning R and ggplot2. According to the instructions, geom and stat are usually interchangeable, as a geom has a default stat and a stat has a default geom.
My exercise is to create a plot in 3 ways: with stat, manually, and with geom_pointrange. I'm stuck at the third part:
library("tidyverse")
# With stat_summary
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
# Manually
diamonds_summary = diamonds %>%
group_by(cut) %>%
summarize(p = median(depth), lower = min(depth), upper = max(depth))
ggplot(diamonds_summary) +
geom_pointrange(
mapping = aes(x = cut, y = p, ymin = lower, ymax = upper)
)
# With geom_pointrange and stat
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = median(depth)),
stat = "summary"
)
# Warning: No summary function supplied, defaulting to `mean_se()`
How can I pass two summary functions (min and max) to the function identified by the stat param?
All three solutions should produce the following output:
When you specify stat="summary", it still needs to know how to summarize your values. The default is to use the mean with standard errors. But you want medians with min and max values. You can write your own summary function:
median_min_max <- function(x) {
data.frame(y=median(x), ymin=min(x), ymax=max(x))
}
And then pass that to your pointrange layer via the fun.data= parameter.
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = depth),
stat = "summary", fun.data = median_min_max
)
This will give you a plot that matches the one you created from your summarized data.
ggplot2 does have a median_hilow summary function, but that wraps the Hmisc::smedian.hilow function, which uses quantiles based on a confidence interval rather than min/max values.
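To check that the stat-based layer really computes the same numbers as the manual summary, the values can be pulled back out of the plot. This is only a sketch: it assumes ggplot2's layer_data() helper, uses base R's tapply() for the by-hand side, and the names computed, manual and matches are made up for illustration.

```r
library(ggplot2)

# Summary function from the answer: median with min/max as the range
median_min_max <- function(x) {
  data.frame(y = median(x), ymin = min(x), ymax = max(x))
}

p <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary", fun.data = median_min_max
  )

# Values the summary stat computed for the layer, one row per cut
computed <- layer_data(p)
computed <- computed[order(computed$x), ]

# The same statistics computed by hand, in factor-level order
manual <- data.frame(
  y    = as.numeric(tapply(diamonds$depth, diamonds$cut, median)),
  ymin = as.numeric(tapply(diamonds$depth, diamonds$cut, min)),
  ymax = as.numeric(tapply(diamonds$depth, diamonds$cut, max))
)

# TRUE for each column when both approaches agree
matches <- c(
  y    = isTRUE(all.equal(as.numeric(computed$y), manual$y)),
  ymin = isTRUE(all.equal(as.numeric(computed$ymin), manual$ymin)),
  ymax = isTRUE(all.equal(as.numeric(computed$ymax), manual$ymax))
)
print(matches)
```

If any entry comes back FALSE, the summary function and the manual summarize() call are not computing the same thing.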
I'm trying to replicate this image.
I was able to plot a scatter plot and the median (but it's not continuous).
I failed to plot the percentiles.
The median varies according to different spell length.
ggplot(df,aes(x=Spell.Length,y=Growth.Rate)) +
geom_point() +
stat_summary(fun = median, fun.min = median, fun.max = median,
geom = "crossbar", width = 0.5,colour="red")
What I'm trying to do
What I got so far
Use dplyr::summarize() together with group_by(Spell.Length) to create a data frame of the percentile values, then plot those using geom_line(). Then add the horizontal lines with geom_hline().
df %>% group_by(Spell.Length) %>%
summarize(median = quantile(Growth.Rate, probs = .5), q1 = quantile(Growth.Rate, probs = .25)) %>%
ggplot(aes(x = Spell.Length, y = median)) +
geom_line() +
geom_line(aes(x = Spell.Length, y = q1)) +
geom_hline(yintercept = 3)
That would be the basic idea:
geom_line() for each specific line style/group
geom_hline() for the red lines
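The same idea can also be sketched without pre-summarizing, by letting stat_summary() draw lines instead of crossbars. The question's data isn't shown, so the df below is simulated purely for illustration (only the column names Spell.Length and Growth.Rate come from the question), and the 25th/75th percentiles are an assumption about which percentiles are wanted.

```r
library(ggplot2)

# Simulated stand-in for the question's data (values are made up)
set.seed(1)
df <- data.frame(
  Spell.Length = rep(1:10, each = 30),
  Growth.Rate  = rnorm(300, mean = 3)
)

p <- ggplot(df, aes(x = Spell.Length, y = Growth.Rate)) +
  geom_point(alpha = 0.3) +
  # median at each spell length, joined into one continuous line
  stat_summary(fun = median, geom = "line", colour = "red") +
  # lower and upper quartiles as dashed lines
  stat_summary(fun = function(y) quantile(y, 0.25, names = FALSE),
               geom = "line", linetype = "dashed") +
  stat_summary(fun = function(y) quantile(y, 0.75, names = FALSE),
               geom = "line", linetype = "dashed") +
  # fixed horizontal reference line, as in the answer above
  geom_hline(yintercept = 3, colour = "red")

print(p)
```

Using geom = "line" rather than "crossbar" is what makes the median continuous across spell lengths. The fun argument needs ggplot2 3.3.0 or later.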
Consider the following example dataset:
library(dplyr)
library(ggplot2)
d = mtcars %>%
as_tibble(rownames = "name") %>%
mutate(wt.cat = cut(wt, seq(1.5, 5.5, by = 1))) %>%
group_by(wt.cat) %>%
summarize(
Mean = mean(mpg),
Min = min(mpg),
Max = max(mpg)
)
Say I want to plot points for the "mean" value of each category in wt.cat and a ribbon showing the range. This works:
ggplot(d, aes(x = wt.cat)) +
geom_point(aes(y= Mean)) +
geom_ribbon(aes(x = as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue")
But the points are masked by the ribbon. However, if I change the order of the layers so that the points are plotted on top of the ribbon, I get an error:
ggplot(d, aes(x = wt.cat)) +
geom_ribbon(aes(x = as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") +
geom_point(aes(y= Mean))
## Error: Discrete value supplied to continuous scale
So even though I'm specifying the discrete axis as the "default" aesthetic, it gets overridden by the specification of the first plotted layer. The only way I can find around this is to plot a dummy point layer first:
ggplot(d, aes(x = wt.cat)) +
geom_point(aes(y= Mean), shape = NA) +
geom_ribbon(aes(x = as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") +
geom_point(aes(y= Mean))
## Warning message:
## Removed 4 rows containing missing values (geom_point).
Is there a more "proper" or correct way of combining discrete and continuous layers? Is there a solution that doesn't require creating a dummy layer?
Would something like this be a solution?
d %>% {
ggplot(., aes(x = wt.cat)) +
scale_x_discrete(labels = levels(.$wt.cat)) +
geom_ribbon(aes(x =as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") +
geom_point(aes(y=Mean))
}
I just learned you can wrap a pipe with { } and then reference the entire data frame with .
As camille said, the issue is that geom_ribbon requires a continuous scale because it plots area across values relating to the adjacent position. I believe the scale gets converted to continuous when geom_ribbon is added, but the labels are maintained.
Hope this helps
As per my reply -- the following works just as well if you want ggplot2 to handle all the labeling
d %>%
ggplot(aes(x = wt.cat)) +
scale_x_discrete() +
geom_ribbon(aes(x =as.numeric(wt.cat), ymin = Min, ymax = Max), fill = "blue") +
geom_point(aes(y=Mean))
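A different route, not the one used in the answers above, is to keep x discrete in every layer and give the ribbon a single group so it connects across categories. The sketch below is hedged: it assumes geom_ribbon() accepts a discrete x once group = 1 is set, and it rebuilds the question's summary with base R (the cars helper name is made up) so the block runs on its own.

```r
library(ggplot2)

# Same summary as in the question, rebuilt with base R
cars <- transform(mtcars, wt.cat = cut(wt, seq(1.5, 5.5, by = 1)))
d <- do.call(rbind, lapply(split(cars, cars$wt.cat), function(g) {
  data.frame(wt.cat = g$wt.cat[1],
             Mean = mean(g$mpg), Min = min(g$mpg), Max = max(g$mpg))
}))

p <- ggplot(d, aes(x = wt.cat)) +
  # group = 1 puts all categories in one group, so one ribbon spans them
  geom_ribbon(aes(ymin = Min, ymax = Max, group = 1),
              fill = "blue", alpha = 0.3) +
  # points are added last, so they are drawn on top of the ribbon
  geom_point(aes(y = Mean))

print(p)
```

Because no layer maps as.numeric(wt.cat), the x scale stays discrete throughout and no dummy layer or explicit scale_x_discrete() call is needed.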
About 18 months ago, this helpful exchange appeared, with code to show how to produce a plot of median along with interquartile ranges. Here's the code:
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.min = function(z) {quantile(z,0.25)}, # fun.ymin before ggplot2 3.3.0
fun.max = function(z) {quantile(z,0.75)}, # fun.ymax before ggplot2 3.3.0
fun = median) # fun.y before ggplot2 3.3.0
Producing this plot:
What I'd like to know is how to add labels for the median and the IQR bounds, and how to format the bar (color, alpha, etc.). I tried saving the plot as an object to see whether it contained anything I could pass to formatting functions, but nothing was obvious when I inspected it in the RStudio IDE.
Is this even doable? I know I can do a boxplot, but that would have to include min/max. I'd like to produce boxplots with just the mean/median and the IQR.
You can change the formatting like you would for any ggplot layer; see the docs for "Vertical intervals: lines, crossbars & errorbars" in this case. An example of this is the following:
library(ggplot2)
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.min = function(z) {quantile(z,0.25)}, # fun.ymin before ggplot2 3.3.0
fun.max = function(z) {quantile(z,0.75)}, # fun.ymax before ggplot2 3.3.0
fun = median, # fun.y before ggplot2 3.3.0
size = 4, # <- adjusts size
colour = "red", # <- adjusts colour
alpha = .3) # <- adjusts transparency
If you want to control the formatting of the points and lines individually, you need to do as @camille suggests and pre-process your data, because geom_pointrange() draws a single graphical object, so the points and lines are one and the same.
I would suggest something like this:
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise(median = median(depth),
lq = quantile(depth, 0.25),
uq = quantile(depth, 0.75)) %>%
ggplot(aes(cut, median)) +
geom_linerange(aes(ymin=lq, ymax=uq), linewidth = 4, colour = "blue", alpha = .4) + # linewidth needs ggplot2 >= 3.4.0; use size in older versions
geom_point(size = 10, colour = "red", alpha = .8)
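The question also asked about labels, which the pre-summarised data makes straightforward to add with geom_text(). The following is a rough sketch rather than a polished answer: the iqr_summary name, the nudge_x offset and the text sizes are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)

# One row per cut with the median and the quartiles
iqr_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(median = median(depth),
            lq = quantile(depth, 0.25, names = FALSE),
            uq = quantile(depth, 0.75, names = FALSE))

p <- ggplot(iqr_summary, aes(x = cut, y = median)) +
  geom_linerange(aes(ymin = lq, ymax = uq), colour = "blue", alpha = 0.4) +
  geom_point(colour = "red") +
  # label the median, shifted right so it does not sit on the point
  geom_text(aes(label = median), nudge_x = 0.25) +
  # label the lower and upper quartiles in smaller text
  geom_text(aes(y = lq, label = lq), nudge_x = 0.25, size = 3) +
  geom_text(aes(y = uq, label = uq), nudge_x = 0.25, size = 3)

print(p)
```

Since each label is its own layer, colour, size and position can be set separately for the median and the quartile labels.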
So I know the better way to approach this is to use the stat_summary() function, but this is to address a question presented in Hadley's R for Data Science book mostly for my own curiosity. It asks how to convert code for an example plot made using stat_summary() to make the same plot with geom_pointrange(). The example is:
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min, # fun.ymin before ggplot2 3.3.0
fun.max = max, # fun.ymax before ggplot2 3.3.0
fun = median # fun.y before ggplot2 3.3.0
)
And the plot should look like this:
(source: had.co.nz)
I've attempted with code such as:
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
geom_pointrange(mapping = aes(ymin = min(depth), ymax = max(depth)))
However, this plots the min and max for all depth values across each cut category (i.e., all ymin's and ymax's are the same). I also tried passing a vector of mins and maxs, but ymin only takes single values as far as I can tell. It's probably something simple, but I think people mostly use stat_summary() as I've found very few examples of geom_pointrange() usage via Google.
I think you need to do the summary outside the plot function to use geom_pointrange:
library(dplyr)
library(ggplot2)
summary_diamonds <- diamonds %>%
group_by(cut) %>%
summarise(lower = min(depth), upper = max(depth), p = median(depth))
ggplot(data = summary_diamonds, mapping = aes(x = cut, y = p)) +
geom_pointrange(mapping = aes(ymin = lower, ymax = upper))
geom_pointrange includes a stat argument, so you can do the statistical transformation inline https://stackoverflow.com/a/41865061
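As for why the attempt in the question drew the same range for every cut: expressions inside aes() are evaluated once against the layer's whole data, not separately per group, so min(depth) and max(depth) each yield a single overall value. A small sketch to make that visible, assuming ggplot2's layer_data() helper (the names attempt, built and per_cut are illustrative):

```r
library(ggplot2)

# The attempt from the question
attempt <- ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  geom_pointrange(mapping = aes(ymin = min(depth), ymax = max(depth)))

# Inspect what the layer actually received
built <- layer_data(attempt)
unique(built$ymin)  # one value only: the overall minimum of depth
unique(built$ymax)  # one value only: the overall maximum of depth

# What was intended: one minimum per cut
per_cut <- tapply(diamonds$depth, diamonds$cut, min)
print(per_cut)
```

That is why the summary has to happen either before plotting (group_by() plus summarise()) or inside a stat that works per group.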
When you plot something using ggplot2, it warns you if it automatically removes missing values.
I would love to be able to disable that specific warning or to set the default of na.rm to true system-wide, but that's not possible AFAIK.
I know I can disable it by specifying na.rm=T for each geom that I use. But this fails when ggplot generates further geoms that I don't explicitly specify. In the example below I would get three warnings per plot using my original data (10 when I facet it, so you can see this gets annoying in a knitr report).
I can suppress two warnings with na.rm=T, but the third one about geom_segment I can't. Incidentally it also occurs with mtcars, so I used that as an example.
Warning message:
Removed 23 rows containing missing values (geom_segment).
ggplot(data=mtcars, aes(x = disp, y = wt)) +
geom_linerange(stat = "summary", fun.data = "median_hilow", colour = "#aec05d", na.rm=T) +
geom_pointrange(stat = "summary", fun.data = "mean_cl_boot", colour = "#6c92b2", na.rm=T)
Until I figure this out I can use warning=FALSE for the offending chunks, but I don't really like that since it might suppress warnings that I do care about. I could also use na.omit on the dataset, but working out which variables each plot uses takes a lot of effort and extra syntax.
I guess the only way to avoid this is not to use stat_summary, but calculate the summary statistics yourself. For your example that's no problem, but I'll admit that this is not a very satisfactory solution in general.
# load dplyr (used to calculate the summary) and ggplot2
library(dplyr)
library(ggplot2)
# calculate summary statistics (mean_cl_boot() needs the Hmisc package installed)
df <- mtcars %>% group_by(disp) %>% do(mean_cl_boot(.$wt))
# use geom_point and geom_segment with na.rm=TRUE
ggplot(data=mtcars, aes(x = disp, y = wt)) +
geom_linerange(stat = "summary", fun.data = "median_hilow", colour = "#aec05d") +
geom_point(data = df, aes(x = disp, y = y), colour = "#6c92b2") +
geom_segment(data = df, aes(x = disp, xend = disp, y = ymin, yend = ymax), colour = "#6c92b2", na.rm=TRUE)
Alternatively, you can write your own version of mean_cl_boot. If ymin or ymax are NA just set them to the value of y.
# your summary function
my_mean_cl_boot <- function(x, ...){
res <- mean_cl_boot(x, ...)
res[is.na(res$ymin), "ymin"] <- res[is.na(res$ymin), "y"]
res[is.na(res$ymax), "ymax"] <- res[is.na(res$ymax), "y"]
na.omit(res)
}
# plotting command
ggplot(data=mtcars, aes(x = disp, y = wt)) +
geom_linerange(stat = "summary", fun.data = "median_hilow", colour = "#aec05d", na.rm=T) +
geom_pointrange(stat = "summary", fun.data = "my_mean_cl_boot", colour = "#6c92b2", na.rm=T)
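If the goal is only to silence this one warning while letting every other warning through, a third option is to muffle it by message when the plot is rendered. This is a sketch using base R's withCallingHandlers(); the helper name muffle_removed_rows and the regular expression are my own assumptions, and the pattern may need adjusting if the wording of the warning differs in your ggplot2 version.

```r
library(ggplot2)

# Evaluate `expr`, dropping only warnings that look like ggplot2's
# "Removed N rows containing ..." message; other warnings pass through.
muffle_removed_rows <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl("Removed [0-9]+ rows? containing", conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Toy plot with a missing x value, which normally triggers the warning
p <- ggplot(data.frame(x = c(1, 2, NA), y = c(1, 2, 3)), aes(x, y)) +
  geom_point()

# The warning is raised while the plot is rendered, so wrap the print() call
muffle_removed_rows(print(p))
```

In a knitr chunk the plot has to be printed explicitly inside the wrapper, because that is when the warning is raised.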