Calculation error creating ggplot2 stat extension - r

I'm trying to create a ggplot2 extension: a stat (or geom) that lets me plot well logs in a fashion somewhat similar to this well log image. The calculation encapsulated in the ggproto object eventually fails, though.
Let's look at a toy example. Suppose we have some well log information giving us the lower boundary of each geological unit, its name and a position indicator:
library(tidyverse)
testwell <- tibble::tribble(
  ~depth, ~type,    ~pos,
  3,      "loam",   1,
  7,      "sand",   1,
  7.5,    "clay",   1,
  11,     "murl",   1,
  12.2,   "gravel", 1
)
In order to transform this into plottable polygons, I created a ggproto object and a stat.
StatGeostack <- ggproto("StatGeostack", Stat,
  required_aes = c("x", "y", "group"),
  compute_group = function(data, scales, params, glvl = 0){
    bound_low <- data$y
    strata <- data$group
    position <- data$x
    # x-position doesn't really matter for now so let's just jitter it a bit
    # and pretend we have a drillcore diameter of "1" so we can draw polygons
    xmin <- position[1] - 0.5
    xmax <- position[1] + 0.5
    # The lower boundary of the first strata is the upper boundary of the second and so on
    bound_up <- c(glvl, bound_low)
    length(bound_up) <- length(bound_low)
    # This tibble contains all the information, alas not in the right format
    stackframe <- tibble::tibble(
      strata = strata,
      bound_up = bound_up,
      bound_low = bound_low
    )
    # adapt input format for ggplot2 polygons
    purrr::pmap_dfr(stackframe,
      function(strata, bound_up, bound_low, xmin, xmax){
        tibble::tibble(y = c(rep(bound_up, times = 2),
                             rep(bound_low, times = 2)),
                       x = c(xmin, xmax, xmax, xmin),
                       group = rep(strata, times = 4))
      },
      xmin = xmin, xmax = xmax)
  }
)
# creating the stat corresponding to the ggproto object
stat_geostack <- function(mapping = NULL, data = NULL, geom = "polygon",
                          position = "identity", na.rm = FALSE, show.legend = NA,
                          inherit.aes = TRUE, glvl = 0, ...){
  layer(
    stat = StatGeostack, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, glvl = 0, ...)
  )
}
So this is what I came up with. To see why I'm not happy with the result, let's use the new stat in a plot
ggplot(data = testwell,
aes(x = pos, y = depth*-1, group = type)) +
stat_geostack(aes(fill = type)) +
theme_bw()
The result looks like this. At first glance it doesn't look too bad, but half of our stratigraphy is missing: it's in the legend but not drawn in the panel.
I tried to figure out what is going on by changing from "fill" to "color"
ggplot(data = testwell,
aes(x = pos, y = depth*-1, group = type)) +
stat_geostack(fill = NA, aes(color = type), size = 2) +
theme_bw()
So what apparently happens is that the missing stratigraphic units aren't gone but hidden behind each other. The reason seems to be that each polygon has its upper boundary set to zero, while the lower boundary is calculated as intended. The calculation mechanism I used inside the ggproto's compute_group function works just fine in the global environment: I can calculate the polygon vertices outside the ggplot2 extension mechanism and then make a regular ggplot2 + geom_polygon plot with no problem, but that's not what I want to do.
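For reference, the out-of-extension version I mean looks roughly like this (a sketch reconstructed from the description above, not the stat I'm after):
# build the polygon vertices by hand, then plot them with a plain geom_polygon()
bound_low <- testwell$depth
bound_up  <- c(0, head(bound_low, -1))  # upper boundary = previous unit's lower boundary
poly <- purrr::pmap_dfr(
  list(strata = testwell$type, bound_up = bound_up, bound_low = bound_low),
  function(strata, bound_up, bound_low) {
    tibble::tibble(
      y = -c(bound_up, bound_up, bound_low, bound_low),
      x = c(0.5, 1.5, 1.5, 0.5),  # pos = 1 with a "drillcore diameter" of 1
      group = strata
    )
  }
)
ggplot(poly, aes(x, y, group = group, fill = group)) +
  geom_polygon() +
  theme_bw()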
Thank you for staying with me in this somewhat exhausting post.

Related

How to align scale transformation across geoms?

I have a geom_foo() which does some transformation of the input data, and I also have a scale transformation. My problem is that these do not work together with other geom_*s as I would expect in terms of scaling.
To illustrate the behavior, consider foo(), which is used in the setup_data method of GeomFoo, defined at the end of the question.
foo <- function(x, y) {
data.frame(
x = x + 2,
y = y + 2
)
}
foo(1, 1)
The transformer is:
foo_trans <- scales::trans_new(
name = "foo",
transform = function(x) x / 5,
inverse = function(x) x * 5
)
Given this input data:
df1 <- data.frame(x = c(1, 2), y = c(1, 2))
Here is a basic plot:
library(ggplot2)
ggplot(df1, aes(x = x, y = y)) +
geom_foo()
When I apply the transformation to the vertical scale, I get this
ggplot(df1, aes(x = x, y = y)) +
geom_foo() +
scale_y_continuous(trans = foo_trans)
What I can say is that the y-axis limits are calculated as 11 = 1 + (2 * 5) and 12 = 2 + (2 * 5), where 1 and 2 are df1$y, the 2 comes from the setup_data method and the 5 from foo_trans: the scale first transforms y to y / 5, setup_data then adds 2, and the inverse transform multiplies by 5 again, so each y ends up shifted by 10 on the displayed scale.
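That arithmetic can be reproduced by hand with the transformer itself (just a sanity check using the objects defined above):
# transform (y / 5), add the 2 from setup_data, then invert (* 5)
foo_trans$inverse(foo_trans$transform(df1$y) + 2)
#> [1] 11 12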
My real problem is that I would like to add a text layer with labels. These labels and their coordinates come from another data frame, as below.
df_label <- foo(df1$x, df1$y)
df_label$label <- c("A", "B")
Label and point layers are on same x-y positions without the scale transformation
p <- ggplot(df1, aes(x = x, y = y)) +
geom_foo(color = "red", size = 6) +
geom_text(data = df_label, aes(x, y, label = label))
p
But when I apply the transformation, the coordinates do not match anymore
p +
scale_y_continuous(trans = foo_trans)
How do I get the two layers to match in x-y coordinates after the transformation? Thanks
ggproto object:
GeomFoo <- ggproto("GeomFoo", GeomPoint,
  setup_data = function(data, params) {
    cols_to_keep <- setdiff(names(data), c("x", "y"))
    cbind(
      foo(data$x, data$y),
      data[, cols_to_keep]
    )
  }
)
geom constructor:
geom_foo <- function(mapping = NULL, data = NULL, ...,
na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = "identity",
geom = GeomFoo,
position = "identity",
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
...
)
)
}
Doing data transformations isn't really the task of a geom but of a stat. That said, the larger issue is that scale transformations are applied before the GeomFoo$setup_data() method is called. There are two ways I can see to accomplish this task.
(1) Apply foo() before the scale transformation. I don't think geoms or stats ever have access to the data before the scale transformation. A possible place for this is the ggplot2:::Layer$setup_layer() method. However, this isn't exported, which probably means the devs would like to discourage this approach before we even make an attempt.
(2) Invert the scale transformation, apply foo(), and transform again. For this, you need a method with access to the scales. AFAIK no geom method has this access, but Stat$compute_panel() does, so we can use that.
To give an example of (2), I think you could get away with the following:
StatFoo <- ggproto(
"StatFoo", Stat,
compute_panel = function(self, data, scales) {
cols_to_keep <- setdiff(names(data), c("x", "y"))
food <- foo(scales$x$trans$inverse(data$x),
scales$y$trans$inverse(data$y))
cbind(
data.frame(x = scales$x$trans$transform(food$x),
y = scales$y$trans$transform(food$y)),
data[, cols_to_keep]
)
}
)
geom_foo <- function(mapping = NULL, data = NULL, ...,
na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = StatFoo,
geom = GeomPoint,
position = "identity",
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
...
)
)
}
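With those two pieces in place, re-running the layered example from the question should put the point and text layers back on top of each other (usage sketch only, reusing the objects defined in the question):
p <- ggplot(df1, aes(x = x, y = y)) +
  geom_foo(color = "red", size = 6) +
  geom_text(data = df_label, aes(x, y, label = label))
p + scale_y_continuous(trans = foo_trans)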
If someone else has brighter ideas to do this, I'd also like to know!

ggplot extension function to plot a superimposed mean in a scatterplot

I am trying to create a custom function that extends ggplot2. The goal of the function is to superimpose a mean with horizontal and vertical standard errors. The code below does the entire thing.
library(plyr)
library(tidyverse)
summ <- ddply(mtcars,.(),summarise,
dratSE = sqrt(var(drat))/length(drat),
mpgSE = sqrt(var(mpg))/length(mpg),
drat = mean(drat),
mpg = mean(mpg))
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = 'black', color = 'white', size = 3) +
geom_errorbarh(data = summ, aes(xmin = drat - dratSE, xmax = drat + dratSE)) +
geom_errorbar(data = summ, aes(ymin = mpg - mpgSE, ymax = mpg+mpgSE), width = .1) +
geom_point(data = summ, color='red',size=4)
Ideally, it would only take a function such as geom_scattermeans() to do this whole thing. But I am not sure how the aesthetics get transferred into subsequent geom functions from ggplot().
Also, I've had difficulties making a function that receives column names as arguments and works with ddply().
I think plyr is pretty much defunct at this point; I would recommend the dplyr package instead. When programming with dplyr you can use {{ }} (curly-curly, or "embracing" as the documentation calls it) to properly quote expressions.
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
geom_point_error <- function(data, x, y, color = 'red', size = 4) {
data <- dplyr::summarise(
data,
x_se = sqrt(var({{x}}))/length({{x}}),
y_se = sqrt(var({{y}}))/length({{y}}),
x = mean({{x}}),
y = mean({{y}})
)
list(
geom_errorbarh(data = data,
mapping = aes(y = y,
xmin = x - x_se, xmax = x + x_se), inherit.aes = F),
geom_errorbar(data = data,
mapping = aes(x = x,
ymin = y - y_se, ymax = y + y_se), width = .1,inherit.aes = F),
geom_point(data = data,
mapping = aes(x = x, y = y),
color = color, size = size)
)
}
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = 'black', color = 'white', size = 3) +
geom_point_error(mtcars, x = drat, y = mpg)
Created on 2021-05-17 by the reprex package (v1.0.0)
The second option would be to build your own ggproto Geom to handle these calculations inside ggplot2, but that is a bit much for right now.
Since my first answer is still the easier solution, I decided to keep it. This answer should get the OP closer to their goal.
Building a ggproto object can be cumbersome depending on what you are trying to do. In your case you are combining three ggproto Geom classes, with the possibility of a new Stat on top.
The three Geoms are:
GeomErrorbar
GeomErrorbarh
GeomPoint
To get started, sometimes you just need to inherit from one of these classes and override a method, but to pool the three together you will need to do more work.
Let's first consider how each of these Geoms draws its grid objects. Depending on the Geom, this happens in one of draw_layer(), draw_panel(), or draw_group(). Fortunately, each of the Geoms we want to use only implements draw_panel(), which means a bit less work for us: we will just call those methods directly and build a new grobTree object. We will just need to be careful that all the correct parameters make it to our new Geom's draw_panel() method.
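If you want to check that claim yourself, ggproto objects are environments, so ls() lists the methods each Geom defines directly (a quick console check, nothing that ends up in the extension):
lapply(
  list(GeomErrorbar = GeomErrorbar, GeomErrorbarh = GeomErrorbarh, GeomPoint = GeomPoint),
  function(geom) intersect(c("draw_layer", "draw_panel", "draw_group"), ls(geom))
)
#> each element should come back as just "draw_panel"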
Before we get to writing our own draw_panel(), we first have to consider the setup_params() and setup_data() functions. Occasionally these modify the data right out of the gate; they are handy for automatic processing and are often used to standardize or transform the data. A good example is GeomTile versus GeomRect: they are essentially the same Geom, but their setup_data() functions differ because they are parameterized differently.
Let's assume you only want to supply an x and a y aesthetic, and leave the calculation of xmin, ymin, xmax, and ymax to the geoms/stats.
Fortunately, GeomPoint's setup_data() just returns the data with no modifications, so we only need to cover what GeomErrorbar's and GeomErrorbarh's setup_data() would have done. To skip some steps, I am just going to make a new Stat that takes care of computing those values for us within a compute_group() method.
A note here: GeomErrorbar and GeomErrorbarh each allow an extra parameter, width and height respectively, which controls how wide the flat portions of the error bars are.
Also, within these functions each will make its own xmin, xmax, ymin, and ymax, so we will need to distinguish these parameters.
First, load the required pieces into the namespace:
library(ggplot2)
library(grid)
"%||%" <- ggplot2:::`%||%`
Start with the new Stat; I've decided to call it PointError.
StatPointError <- ggproto(
"StatPointError",
Stat,
# having `width` and `height` as named parameters here ensures
# that they will be available to the `Stat` ggproto object.
compute_group = function(data, scales, width = NULL, height = NULL){
data$width <- data$width %||% width %||% (resolution(data$x, FALSE)*0.9)
data$height <- data$height %||% height %||% (resolution(data$y, FALSE)*0.9)
data <- transform(
data,
x = mean(x),
y = mean(y),
# positions for flat parts of vertical error bars
xmin = mean(x) - width /2,
xmax = mean(x) + width / 2,
width = NULL,
# y positions of vertical error bars
ymin = mean(y) - sqrt(var(y))/length(y),
ymax = mean(y) + sqrt(var(y))/length(y),
#positions for flat parts of horizontal error bars
ymin_h = mean(y) - height /2,
ymax_h = mean(y) + height /2,
height = NULL,
# x positions of horizontal error bars
xmin_h = mean(x) - sqrt(var(x))/length(x),
xmax_h = mean(x) + sqrt(var(x))/length(x)
)
unique(data)
}
)
Now for the fun part: the Geom. Again I'm going with PointError to keep the naming consistent.
GeomPointError <- ggproto(
"GeomPointError",
GeomPoint,
#include some additional defaults
default_aes = aes(
shape = 19,
colour = "black",
size = 1.5, # error bars have defaults of 0.5 - you may want to add another parameter?
fill = NA,
alpha = NA,
linetype = 1,
stroke = 0.5, # for GeomPoint
width = 0.5, # for GeomErrorbar
height = 0.5, # for GeomErrorbarh
),
draw_panel = function(data, panel_params, coord, width = NULL, height = NULL, na.rm = FALSE) {
#make errorbar grobs
data_errbar <- data
data_errbar[["size"]] <- 0.5
errorbar_grob <- GeomErrorbar$draw_panel(data = data_errbar,
panel_params = panel_params, coord = coord,
width = width, flipped_aes = FALSE)
#re-parameterize errbarh data
data_errbarh <- transform(data,
xmin = xmin_h, xmax = xmax_h, ymin = ymin_h, ymax = ymax_h,
xmin_h = NULL, xmax_h = NULL, ymin_h = NULL, ymax_h = NULL,
size = 0.5)
#make errorbarh grobs
errorbarh_grob <- GeomErrorbarh$draw_panel(data = data_errbarh,
panel_params = panel_params, coord = coord,
height = height)
point_grob <- GeomPoint$draw_panel(data = data, panel_params = panel_params,
coord = coord, na.rm = na.rm)
gt <- grobTree(
errorbar_grob,
errorbarh_grob,
point_grob, name = 'geom_point_error')
gt
}
)
Last, we need a function for the user to call that will make a Layer object.
geom_point_error <- function(mapping = NULL, data = NULL,
position = "identity",
...,
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = StatPointError,
geom = GeomPointError,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
...
)
)
}
Now we can test if this is working properly
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = 'black', color = 'white', size = 3) +
geom_point_error(color = "red", width = .1, height = .3)
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = 'black', color = 'white', size = 3) +
geom_point_error(aes(color = hp>100))
Created on 2021-05-18 by the reprex package (v1.0.0)
There is obviously much more you could do with this, such as adding further default aesthetics so that you can control the color and size of the lines and points separately (you may want to override GeomPointError$setup_data() to ensure everything maps correctly).
Finally, this geom is pretty naive in that it assumes the x and y mappings are continuous. It still works when mixing continuous and discrete, but it looks a bit funky:
ggplot(mpg, aes(cty, model)) +
geom_point() +
geom_point_error(color = 'red')

Extending ggplot2: How to build a geom and stat?

I am in the early stages of learning how to extend ggplot2. I would like to create a custom geom and associated stat. My starting point was the vignette. In addition, I have benefited from this and this. I'm trying to put together a template to teach myself and hopefully others.
Main question:
Inside my function calculate_shadows() the needed parameter params$anchor is NULL. How can I access it?
The goal described below is intended solely for learning how to create custom stat and geom functions; it's not a real goal. As you can see from the screenshots, I do know how to leverage the power of ggplot2 to make these graphs.
The geom will read the data and for the supplied variables ("x", "y") will plot (for want of a better word) shadows: a horizontal line min(x)--max(x) at the default y=0 and a vertical line min(y)--max(y) at the default x=0. If an option is supplied, these "anchors" could be changed, e.g. if the user supplies x = 35, y = 1, the horizontal line would be drawn at the intercept y = 1 while the vertical line would be drawn at the intercept x = 35. Usage:
library(ggplot2)
ggplot(data = mtcars, aes(x = mpg, y = wt)) +
geom_point() +
geom_shadows(x = 35, y = 1)
The stat will read the data and for the supplied variables ("x", "y") will compute shadows according to the value of stat. For instance, by passing stat = "identity", the shadows would be computed for the min and max of the data (as done by geom_shadows). But by passing stat = "quartile", the shadows would be computed for first and third quartile. More generally, one could pass a function like stats::quantile with arguments args = list(probs = c(0.10, 0.90), type = 6), to compute shadows using the 10th and 90th percentiles and the quantile method of type 6. Usage:
ggplot(data = mtcars, aes(x = mpg, y = wt)) +
geom_point() +
stat_shadows(stat = "quartile")
Unfortunately, my lack of familiarity with extending ggplot2 stopped me well short of my objective. These plots were "faked" with geom_segment. Based on the tutorial and discussions cited above, and on inspecting existing code like stat-qq or stat-smooth, I have put together a basic architecture for this goal. It must contain many mistakes; I would be grateful for guidance. Also, note that either of these approaches would be fine: geom_shadows(anchor = c(35, 1)) or geom_shadows(x = 35, y = 1).
Now here are my efforts. First, geom-shadows.r to define geom_shadows(). Second, stat-shadows.r to define stat_shadows(). The code doesn't work as is. But if I execute its content, it does produce the desired statistics. For clarity, I have removed most of the calculations in stat_shadows(), such as quartiles, to focus on essentials. Any obvious mistake in the layout?
geom-shadows.r
#' documentation ought to be here
geom_shadows <- function(
mapping = NULL,
data = NULL,
stat = "shadows",
position = "identity",
...,
anchor = list(x = 0, y = 0),
shadows = list("x", "y"),
type = NULL,
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = stat,
geom = GeomShadows,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
anchor = anchor,
shadows = shadows,
type = type,
na.rm = na.rm,
...
)
)
}
GeomShadows <- ggproto("GeomShadows", Geom,
# set up the data, e.g. remove missing data
setup_data = function(data, params) {
data
},
# set up the parameters, e.g. supply warnings for incorrect input
setup_params = function(data, params) {
params
},
draw_group = function(data, panel_params, coord, anchor, shadows, type) {
# draw_group uses stats returned by compute_group
# set common aesthetics
geom_aes <- list(
alpha = data$alpha,
colour = data$color,
size = data$size,
linetype = data$linetype,
fill = alpha(data$fill, data$alpha),
group = data$group
)
# merge aesthetics with data calculated in setup_data
geom_stats <- new_data_frame(c(list(
x = c(data$x.xmin, data$y.xmin),
xend = c(data$x.xmax, data$y.xmax),
y = c(data$x.ymin, data$y.ymin),
yend = c(data$x.ymax, data$y.ymax),
alpha = c(data$alpha, data$alpha)
), geom_aes
), n = 2)
# turn the stats data into a GeomPath
geom_grob <- GeomSegment$draw_panel(unique(geom_stats),
panel_params, coord)
# pass the GeomPath to grobTree
ggname("geom_shadows", grobTree(geom_grob))
},
# set legend box styles
draw_key = draw_key_path,
# set default aesthetics
default_aes = aes(
colour = "blue",
fill = "red",
size = 1,
linetype = 1,
alpha = 1
)
)
stat-shadows.r
#' documentation ought to be here
stat_shadows <-
function(mapping = NULL,
data = NULL,
geom = "shadows",
position = "identity",
...,
# do I need to add the geom_shadows arguments here?
anchor = list(x = 0, y = 0),
shadows = list("x", "y"),
type = NULL,
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE) {
layer(
stat = StatShadows,
data = data,
mapping = mapping,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
# geom_shadows argument repeated here?
anchor = anchor,
shadows = shadows,
type = type,
na.rm = na.rm,
...
)
)
}
StatShadows <-
ggproto("StatShadows", Stat,
# do I need to repeat required_aes?
required_aes = c("x", "y"),
# set up the data, e.g. remove missing data
setup_data = function(data, params) {
data
},
# set up parameters, e.g. unpack from list
setup_params = function(data, params) {
params
},
# calculate shadows: returns data_frame with colnames: xmin, xmax, ymin, ymax
compute_group = function(data, scales, anchor = list(x = 0, y = 0), shadows = list("x", "y"), type = NULL, na.rm = TRUE) {
.compute_shadows(data = data, anchor = anchor, shadows = shadows, type = type)
}
)
# Calculate the shadows for each type / shadows / anchor
.compute_shadows <- function(data, anchor, shadows, type) {
# Deleted all type-checking, etc. for MWE
# Only 'type = c(double, double)' accepted, e.g. type = c(0, 1)
qs <- type
# compute shadows along the x-axis
if (any(shadows == "x")) {
shadows.x <- c(
xmin = as.numeric(stats::quantile(data[, "x"], qs[[1]])),
xmax = as.numeric(stats::quantile(data[, "x"], qs[[2]])),
ymin = anchor[["y"]],
ymax = anchor[["y"]])
}
# compute shadows along the y-axis
if (any(shadows == "y")) {
shadows.y <- c(
xmin = anchor[["x"]],
xmax = anchor[["x"]],
ymin = as.numeric(stats::quantile(data[, "y"], qs[[1]])),
ymax = as.numeric(stats::quantile(data[, "y"], qs[[2]])))
}
# store shadows in one data_frame
stats <- new_data_frame(c(x = shadows.x, y = shadows.y))
# return the statistics
stats
}
Until a more thorough answer comes along: you are missing
extra_params = c("na.rm", "shadows", "anchor", "type"),
inside GeomShadows <- ggproto("GeomShadows", Geom, ...), and possibly also inside StatShadows <- ggproto("StatShadows", Stat, ...).
Inside geom-.r and stat-.r there are many very useful comments that clarify how geoms and stats work. In particular (hat tip to Claus Wilke over at the GitHub issues):
# Most parameters for the geom are taken automatically from draw_panel() or
# draw_groups(). However, some additional parameters may be needed
# for setup_data() or handle_na(). These can not be imputed automatically,
# so the slightly hacky "extra_params" field is used instead. By
# default it contains `na.rm`
extra_params = c("na.rm"),
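Put differently, the sketch below shows where the field goes; everything else in GeomShadows (and, if needed, StatShadows) stays exactly as in the question:
GeomShadows <- ggproto("GeomShadows", Geom,
  # new field: lets anchor, shadows and type pass through the layer without
  # being dropped as unknown parameters (na.rm is included by default)
  extra_params = c("na.rm", "shadows", "anchor", "type"),
  # ... keep setup_data, setup_params, draw_group and default_aes
  # exactly as in the question
  draw_key = draw_key_path
)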

Creating geom / stat from scratch

I just started working with R not long ago, and I am currently trying to strengthen my visualization skills. What I want to do is to create boxplots with mean diamonds as a layer on top (see picture in the link below). I did not find any functions that does this already, so I guess I have to create it myself.
What I was hoping to do was to create a geom or a stat that would allow something like this to work:
ggplot(data, aes(...))) +
geom_boxplot(...) +
geom_meanDiamonds(...)
I have no idea where to start in order to build this new function. I know which values are needed for the mean diamonds (mean and confidence interval), but I do not know how to build the geom / stat that takes the data from ggplot(), calculates the mean and CI for each group, and plots a mean diamond on top of each boxplot.
I have searched for detailed descriptions on how to build these type of functions from scratch, however, I have not found anything that really starts from the bottom. I would really appreciate it, if anyone could point me towards some useful guides.
Thank you!
I'm currently learning to write geoms myself, so this is going to be a rather long & rambling post as I go through my thought processes, untangling the Geom aspects (creating polygons & line segments) from the Stats aspects (calculating where these polygons & segments should be) of a geom.
Disclaimer: I'm not familiar with this kind of plot, and Google didn't throw up many authoritative guides. My understanding of how the confidence interval is calculated / used here may be off.
Step 0. Understand the relationship between a geom / stat and a layer function.
geom_boxplot and stat_boxplot are examples of layer functions. If you enter them into the R console, you'll see that they are (relatively) short and do not contain the actual code for calculating the box / whiskers of the boxplot. Instead, geom_boxplot contains a line that says geom = GeomBoxplot, while stat_boxplot contains a line that says stat = StatBoxplot (reproduced below).
> stat_boxplot
function (mapping = NULL, data = NULL, geom = "boxplot", position = "dodge2",
..., coef = 1.5, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
{
layer(data = data, mapping = mapping, stat = StatBoxplot,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = list(na.rm = na.rm,
coef = coef, ...))
}
GeomBoxplot and StatBoxplot are ggproto objects. They are where the magic happens.
Step 1. Recognise that ggproto()'s _inherit parameter is your friend.
Don't reinvent the wheel. Since we want to create something that overlaps nicely with a boxplot, we can take reference from the Geom / Stat used for that, and only change what's necessary.
StatMeanDiamonds <- ggproto(
`_class` = "StatMeanDiamonds",
`_inherit` = StatBoxplot,
... # add functions here to override those defined in StatBoxplot
)
GeomMeanDiamonds <- ggproto(
`_class` = "GeomMeanDiamonds",
`_inherit` = GeomBoxplot,
... # as above
)
Step 2. Modify the Stat.
There are 3 functions defined within StatBoxplot: setup_data, setup_params, and compute_group. You can refer to the code on Github (link above) for the details, or view them by entering for example StatBoxplot$compute_group.
The compute_group function calculates the ymin / lower / middle / upper / ymax values for all the y values associated with each group (i.e. each unique x value), which are used to plot the box plot. We can override it with one that calculates the confidence interval & mean values instead:
# ci is added as a parameter, to allow the user to specify different confidence intervals
compute_group_new <- function(data, scales, width = NULL,
ci = 0.95, na.rm = FALSE){
a <- mean(data$y)
s <- sd(data$y)
n <- sum(!is.na(data$y))
error <- qt(ci + (1-ci)/2, df = n-1) * s / sqrt(n)
stats <- c("lower" = a - error, "mean" = a, "upper" = a + error)
if(length(unique(data$x)) > 1) width <- diff(range(data$x)) * 0.9
df <- as.data.frame(as.list(stats))
df$x <- if(is.factor(data$x)) data$x[1] else mean(range(data$x))
df$width <- width
df
}
(Optional) StatBoxplot has provision for the user to include weight as an aesthetic mapping. We can allow for that as well, by replacing:
a <- mean(data$y)
s <- sd(data$y)
n <- sum(!is.na(data$y))
with:
if(!is.null(data$weight)) {
a <- Hmisc::wtd.mean(data$y, weights = data$weight)
s <- sqrt(Hmisc::wtd.var(data$y, weights = data$weight))
n <- sum(data$weight[!is.na(data$y) & !is.na(data$weight)])
} else {
a <- mean(data$y)
s <- sd(data$y)
n <- sum(!is.na(data$y))
}
There's no need to change the other functions in StatBoxplot. So we can define StatMeanDiamonds as follows:
StatMeanDiamonds <- ggproto(
`_class` = "StatMeanDiamonds",
`_inherit` = StatBoxplot,
compute_group = compute_group_new
)
Step 3. Modify the Geom.
GeomBoxplot has 3 functions: setup_data, draw_group, and draw_key. It also includes definitions for default_aes() and required_aes().
Since we've changed the upstream data source (the data produced by StatMeanDiamonds contain the calculated columns "lower" / "mean" / "upper", while the data produced by StatBoxplot would have contained the calculated columns "ymin" / "lower" / "middle" / "upper" / "ymax"), do check whether the downstream setup_data function is affected as well. (In this case, GeomBoxplot$setup_data makes no reference to the affected columns, so no changes required here.)
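A quick way to double-check is to print the inherited method at the console; it builds xmin/xmax from x and width and, per the point above, never references the columns our stat computes:
GeomBoxplot$setup_data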
The draw_group function takes the data produced by StatMeanDiamonds and set up by setup_data, and produces multiple data frames: "common" contains the aesthetic mappings common to all geoms, "diamond.df" the mappings that contribute to the diamond polygon, and "segment.df" the mappings that contribute to the horizontal line segment at the mean. The data frames are then passed to the draw_panel functions of GeomPolygon and GeomSegment respectively, to produce the actual polygons / line segments.
draw_group_new = function(data, panel_params, coord,
varwidth = FALSE){
common <- data.frame(colour = data$colour,
size = data$size,
linetype = data$linetype,
fill = alpha(data$fill, data$alpha),
group = data$group,
stringsAsFactors = FALSE)
diamond.df <- data.frame(x = c(data$x, data$xmax, data$x, data$xmin),
y = c(data$upper, data$mean, data$lower, data$mean),
alpha = data$alpha,
common,
stringsAsFactors = FALSE)
segment.df <- data.frame(x = data$xmin, xend = data$xmax,
y = data$mean, yend = data$mean,
alpha = NA,
common,
stringsAsFactors = FALSE)
ggplot2:::ggname("geom_meanDiamonds",
grid::grobTree(
GeomPolygon$draw_panel(diamond.df, panel_params, coord),
GeomSegment$draw_panel(segment.df, panel_params, coord)
))
}
The draw_key function is used to create the legend for this layer, should the need arise. Since GeomMeanDiamonds inherits from GeomBoxplot, the default is draw_key = draw_key_boxplot, and we don't have to change it. Leaving it unchanged will not break the code. However, I think a simpler legend such as draw_key_polygon offers a less cluttered look.
GeomBoxplot's default_aes specifications look fine. But we need to change the required_aes since the data we expect to get from StatMeanDiamonds is different ("lower" / "mean" / "upper" instead of "ymin" / "lower" / "middle" / "upper" / "ymax").
We are now ready to define GeomMeanDiamonds:
GeomMeanDiamonds <- ggproto(
"GeomMeanDiamonds",
GeomBoxplot,
draw_group = draw_group_new,
draw_key = draw_key_polygon,
required_aes = c("x", "lower", "upper", "mean")
)
Step 4. Define the layer functions.
This is the boring part. I copied from geom_boxplot / stat_boxplot directly, removing all references to outliers in geom_meanDiamonds, changing to geom = GeomMeanDiamonds / stat = StatMeanDiamonds, and adding ci = 0.95 to stat_meanDiamonds.
geom_meanDiamonds <- function(mapping = NULL, data = NULL,
stat = "meanDiamonds", position = "dodge2",
..., varwidth = FALSE, na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE){
if (is.character(position)) {
if (varwidth == TRUE) position <- position_dodge2(preserve = "single")
} else {
if (identical(position$preserve, "total") & varwidth == TRUE) {
warning("Can't preserve total widths when varwidth = TRUE.", call. = FALSE)
position$preserve <- "single"
}
}
layer(data = data, mapping = mapping, stat = stat,
geom = GeomMeanDiamonds, position = position,
show.legend = show.legend, inherit.aes = inherit.aes,
params = list(varwidth = varwidth, na.rm = na.rm, ...))
}
stat_meanDiamonds <- function(mapping = NULL, data = NULL,
geom = "meanDiamonds", position = "dodge2",
..., ci = 0.95,
na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {
layer(data = data, mapping = mapping, stat = StatMeanDiamonds,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ci = ci, ...))
}
Step 5. Check output.
# basic
ggplot(iris,
aes(Species, Sepal.Length)) +
geom_boxplot() +
geom_meanDiamonds()
# with additional parameters, to see if they break anything
ggplot(iris,
aes(Species, Sepal.Length)) +
geom_boxplot(width = 0.8) +
geom_meanDiamonds(aes(fill = Species),
color = "red", alpha = 0.5, size = 1,
ci = 0.99, width = 0.3)

Constrict ggplot ellipse to realistic/possible values

When plotting an ellipse with ggplot, is it possible to constrain the ellipse to values that are actually possible?
For example, the following reproducible code and data plots Ele vs. Var for two species. Var is a positive variable and cannot be negative. Nonetheless, negative values are included in the resulting ellipse. Is it possible to bound the ellipse by 0 on the x-axis (using ggplot)?
More specifically, I am picturing a flat edge, with the ellipses truncated at 0 on the x-axis.
library(ggplot2)
set.seed(123)
df <- data.frame(Species = rep(c("BHS", "MTG"), each = 100),
Ele = c(sample(1500:3000, 100), sample(2500:3500, 100)),
Var = abs(rnorm(200)))
ggplot(df, aes(Var, Ele, color = Species)) +
geom_point() +
stat_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2)
You could edit the default stat to clip points to a particular value. Here we change the basic Stat so that x values less than 0 are trimmed to 0 (note that the pipe and mutate() below come from dplyr, so it needs to be loaded as well):
library(dplyr)
StatClipEllipse <- ggproto("StatClipEllipse", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales, type = "t", level = 0.95,
                           segments = 51, na.rm = FALSE) {
    xx <- ggplot2:::calculate_ellipse(data = data, vars = c("x", "y"), type = type,
                                      level = level, segments = segments)
    xx %>% mutate(x = pmax(x, 0))
  }
)
Then we have to wrap it in a ggplot2 stat constructor that is identical to stat_ellipse, except that it uses our custom Stat object:
stat_clip_ellipse <- function(mapping = NULL, data = NULL,
geom = "path", position = "identity",
...,
type = "t",
level = 0.95,
segments = 51,
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE) {
layer(
data = data,
mapping = mapping,
stat = StatClipEllipse,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
type = type,
level = level,
segments = segments,
na.rm = na.rm,
...
)
)
}
Then you can use it to make your plot:
ggplot(df, aes(Var, Ele, color = Species)) +
geom_point() +
stat_clip_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2)
This was inspired by the source code for stat_ellipse.
Based on my comment above, I created a less-misleading option for visualization. This is ignoring the problem with y being uniformly distributed, since that's a somewhat less egregious problem than the heavily skewed x variable.
Both these options use the ggforce package, which is an extension of ggplot2, but just in case, I've also included the source for the particular function I used.
library(ggforce)
library(scales)
# power_trans <- function (n)
# {
# scales::trans_new(name = paste0("power of ", fractions(n)), transform = function(x) {
# x^n
# }, inverse = function(x) {
# x^(1/n)
# }, breaks = scales::extended_breaks(), format = scales::format_format(),
# domain = c(0, Inf))
# }
Option 1:
ggplot(df, aes(Var, Ele, color = Species)) +
geom_point() +
stat_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2) +
scale_x_sqrt(limits = c(-0.1,3.5),
breaks = c(0.0001,1:4),
labels = 0:4,
expand = c(0.00,0))
This option stretches the x-axis along a square-root transform, spreading out the points clustered near zero. Then it computes an ellipse over this new space.
Advantage: looks like an ellipse still.
Disadvantage: in order to get it to play nice and label the Var=0 point on the x axis, you have to use expand = c(0,0), which clips the limits exactly, and so requires a bit more fiddling with manual limits/breaks/labels, including choosing a very small value (0.0001) to be represented as 0.
Disadvantage: the x values aren't linearly distributed along the axis, which requires a bit more cognitive load when reading the figure.
Option 2:
ggplot(df, aes(sqrt(Var), Ele, color = Species)) +
geom_point() +
stat_ellipse() +
coord_trans(x = ggforce::power_trans(2)) +
scale_x_continuous(breaks = sqrt(0:4), labels = 0:4,
name = "Var")
This option plots the pre-transformed sqrt(Var) (notice the aes(...)). It then calculates the ellipses based on this new approximately normal value. Then it stretches out the x-axis so that the values of Var are once again linearly spaced, which distorts the ellipse in the same transformation.
Advantage: looks cool.
Advantage: values of Var are easy to interpret on the x-axis.
Advantage: you can see the density near Var=0 with the points and the wide flat end of the "egg" easily.
Advantage: the pointy end shows you how low the density is at those values.
Disadvantage: looks unfamiliar and requires explanation and additional cognitive load to interpret.
