Are there any reasons not to use ggplot() + aes() + geom_() syntax?

I am a fairly experienced ggplot2 user and teach it to university students. However, I only just came across an example that uses the following syntax:
ggplot(mtcars) + aes(cyl) + geom_histogram()
This fits a lot better into the logic of adding up layers than specifying aes inside ggplot() or the geom_ ... but it does not seem to be documented anywhere in the ggplot2 help. Therefore, I am wondering whether there are any reasons why this syntax is limited / should not be used? (Obviously, I see that it needs to be specified in the geom if it is meant to differ between geoms ...)

This is verging on an opinion-based question, but I think it is on-topic, since it helps to clarify the syntax and structure of ggplot calls.
In a sense you have already answered the question yourself:
it does not seem to be documented anywhere in the ggplot2 help
This, and the near absence of examples in online tutorials, blogs and SO answers is a good enough reason not to use aes this way (or at least not to teach people to use it this way). It could lead to confusion and frustration on the part of new users.
This fits a lot better into the logic of adding up layers
This is sort of true, but could be a bit misleading. What it actually does is specify the default aesthetic mapping that subsequent layers will inherit from the ggplot object itself. It should be considered a core part of the base plot, along with the default data object, and therefore "belongs" in the initial ggplot call, rather than being something that is added or layered on to the plot. If you create a default ggplot object without data and mapping, the slots are still there, but they contain an empty mapping and a waiver rather than being NULL:
p <- ggplot()
p$mapping
#> Aesthetic mapping:
#> <empty>
p$data
#> list()
#> attr(,"class")
#> [1] "waiver"
Note that unlike the scales and co-ordinate objects, for which you might argue that the same is also true, there can be no defaults for data and aesthetic mappings.
Does this mean you should never use this syntax? No, but it should be considered an advanced trick for folks who are well versed in ggplot. The most frequent use case I find for it is in changing the mapping of ggplots that are created in extension packages, such as ggsurvplot or ggraph, where the plotting functions use wrappers around ggplot. It can also be used to quickly create multiple plots with the same themes and colour scales:
p <- ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
geom_point(aes(color = Species)) +
theme_light()
library(patchwork)
p + (p + aes(Petal.Width, Petal.Length))
So the bottom line is that you can use this if you want, but it is best to avoid teaching it to beginners.

TL;DR
I cannot see any strong reasons why not to use this pattern, but other patterns are recommended in the documentation, without elaboration.
What does + aes() do?
A ggplot has two types of aesthetics:
the default one (typically supplied inside ggplot()), and
geom_*() specific aesthetics
If inherit.aes = TRUE (the default) is set inside the geoms, then these two types of aesthetics are combined in the final plot. If the default aesthetic is not set, then the geom_* specific aesthetics must be set.
Using ggplot(df) + aes(x, y) changes the default aesthetic.
This is documented in ?"+.gg":
An aes() object replaces the default aesthetics.
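A minimal sketch of what adding aes() does to the defaults, using the built-in iris data. The object names p and p2 are only illustrative. As far as I can tell from the ggplot2 versions I have used, aesthetics named in the added aes() override the existing defaults, while defaults that are not mentioned (colour here) are kept:

```r
library(ggplot2)

# Defaults set inside ggplot(), including a colour mapping
p <- ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species))

# Adding aes() afterwards changes the default x and y
p2 <- p + aes(Petal.Width, Petal.Length)

# Inspect which default aesthetics the plot now carries
default_names <- names(p2$mapping)
x_label <- rlang::as_label(p2$mapping$x)

print(default_names)
print(x_label)
```

Inspecting p2$mapping like this is a quick way to check what a later layer will inherit.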
Are there any reasons not to use it?
I cannot see any strong reasons not to. However, in the documentation of ?ggplot it is stated that:
There are three common ways to invoke ggplot():
ggplot(df, aes(x, y, other aesthetics))
ggplot(df)
ggplot()
The first method is recommended if all layers use the same data and the same set of aesthetics.
As far as I can see, the typical use case for + aes() is when all layers use the same aesthetics. So the documentation recommends the usual pattern ggplot(df, aes(x, y, other aesthetics)), but I cannot find an elaboration of why.
Further: even though the plots look identical, the objects returned by ggplot(df, aes()) and ggplot(df) + aes() are not identical, so there might be some edge cases where one pattern would lead to errors or a different plot.
You can see the many small differences with this code:
library(ggplot2)
a <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
b <- ggplot(mtcars) + aes(hp, mpg) + geom_point()
waldo::compare(a, b, x_arg = "a", y_arg = "b")
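If waldo is not installed, a rougher check is to compare only the data that each plot actually draws. This sketch uses layer_data() from ggplot2 and the same two plots; it says nothing about differences elsewhere in the objects, only that the computed layer data match:

```r
library(ggplot2)

a <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
b <- ggplot(mtcars) + aes(hp, mpg) + geom_point()

# Compare the computed data behind the first (and only) layer of each plot
same_data <- isTRUE(all.equal(layer_data(a), layer_data(b)))
print(same_data)
```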

Related

Two dots before and after an R token

Consider the following lines.
p <- ggplot(mpg, aes(x=factor(cyl), y=..count..))
p + geom_histogram()
p + stat_summary(fun.y=identity, geom='bar')
In theory, the last two should produce the same plot. In practice, stat_summary fails and complains that the required y aesthetic is missing.
Why can't I use ..count.. in stat_summary? I can't find anywhere in the docs information about how to use these variables.
Expanding on @joran's comment, the special variables in ggplot with double periods around them (..count.., ..density.., etc.) are returned by a stat transformation of the original data set. Those particular ones are returned by stat_bin, which is implicitly called by geom_histogram (note in the documentation that the default value of the stat argument is "bin"). Your second example calls a different stat function, which does not create a variable named ..count.. at all. You can get the same graph with
p + geom_bar(stat="bin")
In newer versions of ggplot2, one can also use the stat function instead of the enclosing .., so aes(y = ..count..) becomes aes(y = stat(count)). From ggplot2 3.3.0 onwards, after_stat(count) is the preferred spelling of the same thing.
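A small sketch of the computed-variable idea with the after_stat() helper, assuming ggplot2 3.3.0 or later. It uses geom_bar (whose default stat counts rows per x value) rather than the histogram from the question, and reads the computed counts back with layer_data():

```r
library(ggplot2)

# `count` does not exist in mpg; it is created by the stat and
# referenced with after_stat()
p <- ggplot(mpg, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(count)))

# The computed variable is visible in the built layer data
bar_counts <- layer_data(p)$count
print(bar_counts)
```

The counts should match table(mpg$cyl), which is a handy way to confirm which variables a given stat provides.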

Facetting with factorised variables and geom_hline / geom_vline

Consider this code:
require(ggplot2)
ggplot(data = mtcars) +
geom_point(aes(x = drat, y = wt)) +
geom_hline(yintercept = 3) +
facet_grid(~ cyl) ## works
ggplot(data = mtcars) +
geom_point(aes(x = drat, y = wt)) +
geom_hline(yintercept = 3) +
facet_grid(~ factor(cyl)) ## does not work
# Error in factor(cyl) : object 'cyl' not found
# removing geom_hline: works again.
Google helped me to find a workaround, namely wrapping the intercept in aes:
ggplot(data = mtcars) +
geom_point(aes(x = drat, y = wt)) +
geom_hline(aes(yintercept = 3)) +
facet_grid(~ factor(cyl)) # works
# R version 3.4.3 (2017-11-30)
# ggplot2_2.2.1
Hadley writes here that variables used inside a function need to be present in every layer (which sounds mysterious to me).
Why does this happen when factorising the facet variable?
So here's my best guess and explanation.
When Hadley says:
This is a known limitation of facetting with a function - the variables you use have to be present on every layer.
He means that in ggplot, when you use a function inside the facetting specification, the variable it refers to has to be available in every geom. The issue occurs because the cyl variable is not present in the hline geom.
It's important to remember that this is a limitation, not ideal behaviour. It is a consequence of how the facetting code is implemented: when using functions to facet, the variables must be present in every geom.
Without looking into the specifics of the ggplot2 functions, I'm guessing what wrapping aes around the yintercept argument does, is give an aesthetic mapping to the geom_hline function. The aes function maps variables to components of the plot, rather than static values. It's an important distinction. Even though we still set yintercept = 3, the fact that we have placed it in the aesthetic mapping, must somehow reference that cyl also exists in this space. That is, it connects geom_hline indirectly with cyl, meaning it's now in the layer, and no longer a limitation.
This may not be an entirely satisfying answer, but without reading over the ggplot2 code to try and work out specifically why this limitation occurs, this might be as good as you'll get for now. Hopefully one of these workarounds is sufficient for you :)
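Another way to sidestep the limitation, not mentioned in the thread, is to avoid the function call in the facet specification altogether by creating the factor as a column first. This is only a sketch; the names mtcars_f and cyl_f are made up for the example:

```r
library(ggplot2)

# Create the factor up front so the facet formula uses a plain column
mtcars_f <- transform(mtcars, cyl_f = factor(cyl))

p <- ggplot(data = mtcars_f) +
  geom_point(aes(x = drat, y = wt)) +
  geom_hline(yintercept = 3) +      # constant intercept, no aes() needed
  facet_grid(~ cyl_f)

# Both layers should appear in all three panels
n_panels_points <- length(unique(layer_data(p, 1)$PANEL))
n_panels_hline  <- length(unique(layer_data(p, 2)$PANEL))
print(c(n_panels_points, n_panels_hline))
```

Facetting on a plain column is the case that already worked in the question (facet_grid(~ cyl)), so this keeps the factor without relying on the aes() workaround.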

What are the limits to inheritance in ggplot2?

I have been trying to work out a few things about ggplot2, and how supplementary arguments inherit from the first part, ggplot(). Specifically, whether inheritance is passed on beyond the geom_*** part.
I have a histogram of data:
ggplot(data = faithful, aes(eruptions)) + geom_histogram()
Which produces a fine chart, though the breaks are default. It appears to me (an admitted novice), that geom_histogram() is inheriting the data specification from ggplot(). If I want to have a smarter way of setting the breaks I could use a process like so:
ggplot(data = faithful, aes(eruptions)) +
geom_histogram(breaks = seq(from = min(faithful$eruptions),
to = max(faithful$eruptions), length.out = 10))
However, here I am re-specifying within the geom_histogram() function that I want faithful$eruptions. I have been unable to find a way to phrase this without re-specifying. Further, if I use the data = argument in geom_histogram(), and specify just eruptions in min and max, seq() still doesn't understand that I mean the faithful data set.
I know that seq is not part of ggplot2, but I wondered if it might be able to inherit regardless, as it is bound within geom_histogram(), which itself inherits from ggplot(). Am I just using the wrong syntax, or is this possible?
Note that the term you are looking for is not "inheritance" but non-standard evaluation (NSE). ggplot offers a couple of places where you can refer to your data items by their column names instead of a full reference (NSE), but those are only the mapping arguments to the geom_* layers, and only when you are using aes. These work:
ggplot(faithful) + geom_point(aes(eruptions, eruptions))
ggplot(faithful) + geom_point(aes(eruptions, eruptions, size=waiting))
The following doesn't work because we are referring to waiting outside of aes and mapping (note first arg to geom_* is the mapping arg):
ggplot(faithful) + geom_point(aes(eruptions, eruptions), size=waiting)
But this works:
ggplot(faithful) + geom_point(aes(eruptions, eruptions), size=faithful$waiting)
though differently, since now size is being interpreted literally instead of being normalized as it is when part of the mapping.
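The difference between mapping and setting size can be seen in the built layer data. A rough sketch, with illustrative object names; it assumes the default continuous size scale, which rescales mapped values into a small range of point sizes:

```r
library(ggplot2)

# size mapped inside aes(): values pass through the size scale
mapped <- ggplot(faithful) +
  geom_point(aes(eruptions, eruptions, size = waiting))

# size set outside aes(): values are used as given
set_literal <- ggplot(faithful) +
  geom_point(aes(eruptions, eruptions), size = faithful$waiting)

mapped_sizes  <- layer_data(mapped)$size
literal_sizes <- layer_data(set_literal)$size

print(range(mapped_sizes))   # rescaled by the size scale
print(range(literal_sizes))  # same range as faithful$waiting
```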
In your case, since breaks is not part of the aes/mapping spec, you can't use NSE and you are left using the full reference. Some possible work-arounds:
ggplot(data = faithful, aes(eruptions)) + geom_histogram(bins=10) # not identical
ggplot(data=faithful, aes(eruptions)) +
geom_histogram(
breaks=with(faithful, # use `with`
seq(from=min(eruptions), to=max(eruptions), length.out=10)
) )
And no-NSE, but a little less typing:
ggplot(data=faithful, aes(eruptions)) +
geom_histogram(
breaks=do.call(seq, c(as.list(range(faithful$eruptions)), len=10))
)
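A further option is simply to compute the breaks once, outside the plotting call, and pass the resulting vector in. This is a sketch; eruption_breaks is just a name chosen for the example:

```r
library(ggplot2)

# Compute the breaks once with ordinary R evaluation
eruption_breaks <- seq(min(faithful$eruptions),
                       max(faithful$eruptions),
                       length.out = 10)

p <- ggplot(data = faithful, aes(eruptions)) +
  geom_histogram(breaks = eruption_breaks)

# 10 break points define 9 bins
n_bins <- nrow(layer_data(p))
print(n_bins)
```

This still names faithful twice, but it keeps the plotting call short and makes the breaks reusable across plots.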
Based on the ggplot2 documentation, it seems that the + operator, which is really the +.gg function, allows adding the following objects to a ggplot object:
data.frame, uneval, layer, theme, scale, coord, facet
The geom functions are functions that create layers, which inherit the data and aes from the ggplot object "above" unless stated otherwise.
However, arguments such as breaks are ordinary R arguments: they are evaluated where the call is written (here, the global environment), not inside the plot's data. A function such as seq does not create one of the objects listed above, so it inherits nothing from the ggplot object, and the global environment does not contain an object named eruptions.

Adding stat_smooth in to only 1 facet in ggplot2

I have some data for which, at one level of a factor, there is a significant correlation. At the other level, there is none. Plotting these side-by-side is simple. Adding a line to both of them with stat_smooth, also straightforward. However, I do not want the line or its fill displayed in one of the two facets. Is there a simple way to do this? Perhaps specifying a blank color for the fill and colour of one of the lines somehow?
Don't think about picking a facet, think supplying a subset of your data to stat_smooth:
ggplot(df, aes(x, y)) +
geom_point() +
geom_smooth(data = subset(df, z =="a")) +
facet_wrap(~ z)
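Since the question did not include data, here is a self-contained sketch of the same idea with simulated data. The data frame df, its columns x, y and z, and the choice of method = "lm" are all assumptions made for illustration:

```r
library(ggplot2)

set.seed(1)
# Level "a" has a clear linear trend, level "b" is pure noise
df <- data.frame(x = rep(1:20, times = 2),
                 z = rep(c("a", "b"), each = 20))
df$y <- ifelse(df$z == "a", 2 * df$x, 0) + rnorm(nrow(df))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  # Only rows with z == "a" reach the smoother, so only that facet gets a line
  geom_smooth(data = subset(df, z == "a"), method = "lm", formula = y ~ x) +
  facet_wrap(~ z)

# Panels that contain points vs. panels that contain a fitted line
point_panels  <- unique(as.character(layer_data(p, 1)$PANEL))
smooth_panels <- unique(as.character(layer_data(p, 2)$PANEL))
print(point_panels)
print(smooth_panels)
```

The same pattern extends to fitting different functions per panel: add one geom_smooth per subset, each with its own method or formula.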
Of course, I later answered my own question. Although, is there a less hack-y way to do this? I wonder if one could even fit different functions to different panels.
One technique is to use + scale_fill_manual and scale_colour_manual. They allow one to specify what colors will be used. So, in this case, let's say you have
a<-qplot(x, y, facets=~z)+stat_smooth(method="lm", aes(colour=z, fill=z))
You can specify colors for the fill and colour using the following. Note that the second color is clear, as it uses a hex value whose final two digits represent transparency. So, 00 = clear.
a+scale_fill_manual(values=c("grey", "#11111100"))+scale_colour_manual(values=c("blue", "#11111100"))
