What are the limits to inheritance in ggplot2? - r

I have been trying to work out a few things about ggplot2, and how supplementary arguments inherit from the first part, ggplot(). Specifically, whether inheritance is passed on beyond the geom_*** part.
I have a histogram of data:
ggplot(data = faithful, aes(eruptions)) + geom_histogram()
Which produces a fine chart, though the breaks are default. It appears to me (an admitted novice), that geom_histogram() is inheriting the data specification from ggplot(). If I want to have a smarter way of setting the breaks I could use a process like so:
ggplot(data = faithful, aes(eruptions)) +
geom_histogram(breaks = seq(from = min(faithful$eruptions),
to = max(faithful$eruptions), length.out = 10))
However, here I am re-specifying within the geom_histogram() function that I want faithful$eruptions. I have been unable to find a way to phrase this without re-specifying. Further, if I use the data = argument in geom_histogram(), and specify just eruptions in min and max, seq() still doesn't understand that I mean the faithful data set.
I know that seq is not part of ggplot2, but I wondered if it might be able to inherit regardless, as it is bound within geom_histogram(), which itself inherits from ggplot(). Am I just using the wrong syntax, or is this possible?

Note that the term you are looking for is not "inheritance", but non-standard evaluation (NSE). ggplot offers a couple of places where you can refer to your data items by their column names instead of a full reference (NSE), but those are the mapping arguments to the geom_* layers only, and even then only when you are using aes. These work:
ggplot(faithful) + geom_point(aes(eruptions, eruptions))
ggplot(faithful) + geom_point(aes(eruptions, eruptions, size=waiting))
The following doesn't work because we are referring to waiting outside of aes and mapping (note first arg to geom_* is the mapping arg):
ggplot(faithful) + geom_point(aes(eruptions, eruptions), size=waiting)
But this works:
ggplot(faithful) + geom_point(aes(eruptions, eruptions), size=faithful$waiting)
though differently, since now size is being interpreted literally instead of being normalized as it is when part of mapping.
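To make that difference concrete, here is a minimal sketch (not part of the original answer) that compares the size values each version ends up drawing. It relies on layer_data(), which returns the data a layer hands to its geom; the object names mapped and literal are only illustrative.

```r
library(ggplot2)

# size mapped inside aes(): values pass through the default size scale
mapped <- ggplot(faithful) +
  geom_point(aes(eruptions, eruptions, size = waiting))

# size set outside aes(): values are used as given, no scale involved
literal <- ggplot(faithful) +
  geom_point(aes(eruptions, eruptions), size = faithful$waiting)

mapped_sizes  <- layer_data(mapped)$size
literal_sizes <- layer_data(literal)$size

range(mapped_sizes)   # rescaled into the size scale's output range
range(literal_sizes)  # same range as faithful$waiting
```

The second plot draws very large points because the raw waiting times are taken directly as point sizes.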
In your case, since breaks is not part of the aes/mapping spec, you can't use NSE and you are left using the full reference. Some possible work-arounds:
ggplot(data = faithful, aes(eruptions)) + geom_histogram(bins=10) # not identical
ggplot(data=faithful, aes(eruptions)) +
geom_histogram(
breaks=with(faithful, # use `with`
seq(from=min(eruptions), to=max(eruptions), length.out=10)
) )
And no-NSE, but a little less typing:
ggplot(data=faithful, aes(eruptions)) +
geom_histogram(
breaks=do.call(seq, c(as.list(range(faithful$eruptions)), len=10))
)
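Another option, sketched here as a suggestion rather than something from the original answer, is to compute the breaks once before building the plot. The variable name eruption_breaks is made up for this example.

```r
library(ggplot2)

# compute the bin boundaries once, with the data frame named a single time
eruption_breaks <- with(
  faithful,
  seq(from = min(eruptions), to = max(eruptions), length.out = 10)
)

# breaks is an ordinary argument, so it accepts the precomputed vector
p <- ggplot(data = faithful, aes(eruptions)) +
  geom_histogram(breaks = eruption_breaks)
p
```

This does not add any NSE to breaks; it just moves the full reference out of the plotting call, which keeps the ggplot code shorter when the same breaks are reused.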

Based on the ggplot2 documentation it seems that + operator which is really the +.gg function allows adding the following objects to a ggplot object:
data.frame, uneval, layer, theme, scale, coord, facet
The geom functions create layers, which inherit the data and aes from the ggplot object "above" unless stated otherwise.
However, the ggplot object and functions "live" in the global environment. A call to a function such as seq does not create one of the objects listed above, so it inherits nothing from the ggplot object (the + operator applies only to those objects). It is evaluated in the global environment, which doesn't include an object named eruptions.
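One way to see where column names are and are not visible is to evaluate an aes() expression by hand. The sketch below only illustrates the scoping idea and is not a description of ggplot2's internals; it assumes the rlang package (a ggplot2 dependency) is installed.

```r
library(ggplot2)

mapping <- aes(eruptions)

# aes() captures the expression without evaluating it
rlang::as_label(mapping$x)

# evaluated with the data frame as context, the name resolves to the column
x_values <- rlang::eval_tidy(mapping$x, data = faithful)
identical(x_values, faithful$eruptions)

# evaluated as an ordinary R expression, there is no such object
# (assuming nothing named `eruptions` exists in your workspace)
plain <- tryCatch(
  seq(min(eruptions), max(eruptions), length.out = 10),
  error = function(e) conditionMessage(e)
)
plain
```

The seq() call is evaluated before geom_histogram() ever sees it, so nothing on the ggplot side has a chance to supply the data.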

Related

Are there any reasons not to use ggplot() + aes() + geom_() syntax?

I am a fairly experienced ggplot2 user and teach it to university students. However, I only just came across an example that uses the following syntax:
ggplot(mtcars) + aes(cyl) + geom_histogram()
This fits a lot better into the logic of adding up layers than specifying aes inside ggplot() or the geom_ ... but it does not seem to be documented anywhere in the ggplot2 help. Therefore, I am wondering whether there are any reasons why this syntax is limited / should not be used? (Obviously, I see that it needs to be specified in the geom if it is meant to differ between geoms ...)
This is verging on an opinion-based question, but I think it is on-topic, since it helps to clarify the syntax and structure of ggplot calls.
In a sense you have already answered the question yourself:
it does not seem to be documented anywhere in the ggplot2 help
This, and the near absence of examples in online tutorials, blogs and SO answers is a good enough reason not to use aes this way (or at least not to teach people to use it this way). It could lead to confusion and frustration on the part of new users.
This fits a lot better into the logic of adding up layers
This is sort of true, but could be a bit misleading. What it actually does is to specify the default aesthetic mapping that subsequent layers will inherit from the ggplot object itself. It should be considered a core part of the base plot, along with the default data object, and therefore "belongs" in the initial ggplot call, rather than something that is being added or layered on to the plot. If you create a default ggplot object without data and mapping, the slots are still there, but contain waivers rather than being NULL:
p <- ggplot()
p$mapping
#> Aesthetic mapping:
#> <empty>
p$data
#> list()
#> attr(,"class")
#> [1] "waiver"
Note that unlike the scales and co-ordinate objects, for which you might argue that the same is also true, there can be no defaults for data and aesthetic mappings.
Does this mean you should never use this syntax? No, but it should be considered an advanced trick for folks who are well versed in ggplot. The most frequent use case I find for it is in changing the mapping of ggplots that are created in extension packages, such as ggsurvplot or ggraph, where the plotting functions use wrappers around ggplot. It can also be used to quickly create multiple plots with the same themes and colour scales:
p <- ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
geom_point(aes(color = Species)) +
theme_light()
library(patchwork)
p + (p + aes(Petal.Width, Petal.Length))
So the bottom line is that you can use this if you want, but it is best to avoid teaching it to beginners.
TL;DR
I cannot see any strong reasons why not to use this pattern, but other patterns are recommended in the documentation, without elaboration.
What does + aes() do?
A ggplot has two types of aesthetics:
the default one (typically supplied inside ggplot()), and
geom_*() specific aesthetics
If inherit.aes = TRUE (the default) is set inside the geoms, then these two types of aesthetics are combined in the final plot. If the default aesthetic is not set, then the geom_* specific aesthetics must be set.
Using ggplot(df) + aes(x, y) changes the default aesthetic.
This is documented in ?"+.gg":
An aes() object replaces the default aesthetics.
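A quick way to check what + aes() does to the defaults is to inspect the mapping slot of the plot object. This is a minimal sketch with illustrative object names. The last step shows behaviour I would expect from current ggplot2 versions (a later aes() overrides the aesthetics it names and keeps the others), which the help page does not spell out.

```r
library(ggplot2)

inside <- ggplot(mtcars, aes(hp, mpg))   # mapping given to ggplot()
added  <- ggplot(mtcars) + aes(hp, mpg)  # mapping added afterwards

names(inside$mapping)
names(added$mapping)

# adding a second aes() overrides x but leaves the existing y in place
changed <- inside + aes(x = wt)
rlang::as_label(changed$mapping$x)
rlang::as_label(changed$mapping$y)
```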
Are there any reasons not to use it?
I cannot see any strong reasons not to. However, in the documentation of ?ggplot it is stated that:
There are three common ways to invoke ggplot():
ggplot(df, aes(x, y, other aesthetics))
ggplot(df)
ggplot()
The first method is recommended if all layers use the same data and the same set of aesthetics.
As far as I can see, the typical use case for + aes() is when all layers use the same aesthetics. So the documentation recommends the usual pattern ggplot(df, aes(x, y, other aesthetics)), but I cannot find an elaboration of why.
Further: even though the plots look identical, the objects returned by ggplot(df, aes()) and ggplot(df) + aes() are not identical, so there might be some edge cases where one pattern would lead to errors or a different plot.
You can see the many small differences with this code:
library(ggplot2)
a <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
b <- ggplot(mtcars) + aes(hp, mpg) + geom_point()
waldo::compare(a, b, x_arg = "a", y_arg = "b")

Two dots before and after an R token [duplicate]

Consider the following lines.
p <- ggplot(mpg, aes(x=factor(cyl), y=..count..))
p + geom_histogram()
p + stat_summary(fun.y=identity, geom='bar')
In theory, the last two should produce the same plot. In practice, stat_summary fails and complains that the required y aesthetic is missing.
Why can't I use ..count.. in stat_summary? I can't find anywhere in the docs information about how to use these variables.
Expanding @joran's comment, the special variables in ggplot with double periods around them (..count.., ..density.., etc.) are returned by a stat transformation of the original data set. Those particular ones are returned by stat_bin which is implicitly called by geom_histogram (note in the documentation that the default value of the stat argument is "bin"). Your second example calls a different stat function which does not create a variable named ..count... You can get the same graph with
p + geom_bar(stat="bin")
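In newer versions of ggplot2, one can also use the stat function instead of the enclosing .., so aes(y = ..count..) becomes aes(y = stat(count)). From ggplot2 3.3.0 onwards the recommended spelling is aes(y = after_stat(count)); both stat() and the .. notation were deprecated in 3.4.0.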
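To see which variables a stat hands back, you can look at the computed layer data directly. The sketch below is an illustration rather than part of the original answer: it assumes ggplot2 3.3.0 or later (for after_stat()) and uses geom_bar(), whose default stat_count() produces a count variable for a discrete x.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(count)))  # geom_bar() defaults to stat_count()

# layer_data() returns the data after the stat has run
computed <- layer_data(p)
computed[, c("x", "count")]
```

A stat that does not compute count has no such column to offer, which is why referring to it from stat_summary fails.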

How do ggplot stat_* functions work conceptually?

I'm currently trying to get my head around the differences between stat_* and geom_* in ggplot2. (Please note this is more of an interest/understanding based question than a specific problem I am trying to solve).
Introduction
My current understanding is that the stat_* functions apply a transformation to your data and that the result is then passed on to the geom_* to be displayed.
Most simple example being the identity transformation which simply passes your data untransformed onto the geom.
ggplot(data = iris) +
stat_identity(aes(x = Sepal.Length, y = Sepal.Width) , geom= "point")
More practical use-cases appear to be when you want to use some transformation and supply the results to a non-default geom, for example if you wanted to plot an error bar of the 1st and 3rd quartile you could do something like:
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length, ymax = ..upper.., ymin = ..lower..), geom = "errorbar")
Question 1
So how / when are these transformations applied to the dataset and how does data pass through them exactly?
As an example, say I wanted to take the stat_boxplot transformation and plot the point of the 3rd quartile how would I do this ?
My intuition would be something like :
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = ..upper..) , geom = "point")
or
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length) , geom = "point")
however both error with
Error: geom_point requires the following missing aesthetics: y
My guess is as part of the stat_boxplot transformation it consumes the y aesthetic and produces a dataset not containing any y variable however this leads onto ....
Question 2
Where can I find out which variables are consumed as part of the stat_* transformation and what variables they output? Maybe I'm looking in the wrong places, but the documentation does not seem clear to me at all...
Interesting questions...
As background info, you can read this chapter of R for Data Science, focusing on the grammar of graphics. I'm sure Hadley Wickham's book on ggplot2 is an even better source, but I don't have that one.
The main steps for building a graph with one layer and no facet are:
1. Apply aesthetics mapping on input data (in simple cases, this is a selection and renaming of columns)
2. Apply scale transformation (if any) on each data column
3. Compute stat on each data group (i.e. per Species in this case)
4. Apply aesthetics mapping on stat data, detected with ..<name>.. or stat(name)
5. Apply position adjustment
6. Build graphical objects
7. Apply coordinate transformations
As you guessed, the behaviour at step 3 is similar to dplyr::transmute(): it consumes all aesthetics columns and outputs a data frame having as columns all freshly computed stats and all columns that are constant within the group. The stat output may also have a different number of rows from its input. Thus indeed in your example the y column isn't passed to the geom.
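You can inspect what the stat returns with layer_data(), which gives the data frame a layer passes on to its geom. The following is a small sketch; the names computed and stat_cols are just for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(Species, Sepal.Length)) + stat_boxplot()

# data as it looks after the stat has been computed for each group
computed <- layer_data(p)

nrow(computed)  # one row per Species group instead of one per observation
stat_cols <- c("ymin", "lower", "middle", "upper", "ymax")
computed[, stat_cols]
```

The 150 input rows collapse to one row per group, and the columns are the computed boxplot statistics rather than the original y values.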
To do this, we'd like to specify different mappings at step 1 (before stat) and at step 4 (before geom). I thought something like this would work:
# This does not work:
ggplot(data = iris) +
geom_point(
aes(x=Species, y=stat(upper)),
stat=stat_boxplot(aes(x=Species, y=Sepal.Length)) )
... but it doesn't (stat must be a string or a Stat object, but stat_boxplot actually returns a Layer object, like geom_point does).
NB: stat(upper) is an equivalent, more recent, notation to your ..upper..
I might be wrong but I don't think there is a way of doing this directly within ggplot. What you can do is extract the stat part of the process above and manage it yourself before entering ggplot():
library(tidyverse)
iris %>%
group_by(Species) %>%
select(y=Sepal.Length) %>%
do(StatBoxplot$compute_group(.)) %>%
ggplot(aes(Species, upper)) + geom_point()
A bit less elegant, I admit...
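If all you need is the third quartile as a point, another route is stat_summary() with your own summary function. This is a sketch under the assumption that you have ggplot2 3.3.0 or later (for the fun argument); q3 is a helper defined here, not a ggplot2 function.

```r
library(ggplot2)

# hypothetical helper: third quartile of a numeric vector
q3 <- function(v) unname(quantile(v, 0.75))

p <- ggplot(iris, aes(Species, Sepal.Length)) +
  stat_summary(fun = q3, geom = "point")

# the summarised y values, one per Species
d <- layer_data(p)
d$y
p
```

This sidesteps stat_boxplot entirely, so it only helps when the statistic you want is easy to compute yourself.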
For your question 2, it's in the doc: see sections Aesthetics and Computed variables of ?stat_boxplot

addressing `data` in `geom_line` of ggplot

I am building a barplot with a line connecting two bars in order to show that asterisk refers to the difference between them:
Most of the plot is built correctly with the following code:
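library(ggplot2)
library(scales) # for percent
mytbl <- data.frame(
var = c("test", "control"),
mean1 = c(0.019, 0.022),
sderr = c(0.001, 0.002),
stringsAsFactors = TRUE # relevel() needs a factor; strings are no longer converted by default since R 4.0
)
mytbl$var <- relevel(mytbl$var, "test") # without this will be sorted alphabetically (i.e. 'control', then 'test')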
p <-
ggplot(mytbl, aes(x=var, y=mean1)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=mean1-sderr, ymax=mean1+sderr), width=.2)+
scale_y_continuous(labels=percent, expand=c(0,0), limits=c(NA, 1.3*max(mytbl$mean1+mytbl$sderr))) +
geom_text(mapping=aes(x=1.5, y= max(mean1+sderr)+0.005), label='*', size=10)
p
The only thing missing is the line itself. In my very old code, it was supposedly working with the following:
p +
geom_line(
mapping=aes(x=c(1,1,2,2),
y=c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)
)
)
But when I run this code now, I get an error: Error: Aesthetics must be either length 1 or the same as the data (2): x, y. By trying different things, I came to an awkward workaround: I add data=rbind(mytbl,mytbl), before mapping but I don't understand what really happens here.
P.S. additional little question (I know, I should ask in a separate SO post, sorry for that) - why in scale_y_continuous(..., limits()) I can't address data by columns and have to call mytbl$ explicitly?
Just put all that in a separate data frame:
line_data <- data.frame(x=c(1,1,2,2),
y=with(mytbl,c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)))
p + geom_line(data = line_data,aes(x = x,y = y))
In general, you should avoid using things like [ and $ when you map aesthetics inside of aes(). The intended way to use ggplot2 is usually to adjust your data into a format such that each column is exactly what you want plotted already.
You can't reference variables in mytbl in the scale_* functions because that data environment isn't passed along like it is with layers. The scales are treated separately from the data layers, and so the information about them is generally assumed to live somewhere separate from the data you are plotting.
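If the repeated mytbl$ in the scale bothers you, one possibility is to compute the limit beforehand and pass the plain number. This is only a sketch: y_top is a name invented for the example, and geom_col() stands in for geom_bar(stat = "identity").

```r
library(ggplot2)

mytbl <- data.frame(
  var   = c("test", "control"),
  mean1 = c(0.019, 0.022),
  sderr = c(0.001, 0.002)
)

# with() gives column-name access once, outside of ggplot
y_top <- with(mytbl, 1.3 * max(mean1 + sderr))

p <- ggplot(mytbl, aes(x = var, y = mean1)) +
  geom_col() +
  scale_y_continuous(expand = c(0, 0), limits = c(NA, y_top))
p
```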
