How do ggplot stat_* functions work conceptually? - r

I'm currently trying to get my head around the differences between stat_* and geom_* in ggplot2. (Please note this is more of an interest/understanding based question than a specific problem I am trying to solve.)
Introduction
My current understanding is that the stat_* functions apply a transformation to your data and that the result is then passed on to the geom_* to be displayed.
The simplest example is the identity transformation, which simply passes your data untransformed on to the geom.
ggplot(data = iris) +
stat_identity(aes(x = Sepal.Length, y = Sepal.Width) , geom= "point")
More practical use-cases appear to be when you want to use some transformation and supply the results to a non-default geom, for example if you wanted to plot an error bar of the 1st and 3rd quartile you could do something like:
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length, ymax = ..upper.., ymin = ..lower..), geom = "errorbar")
Question 1
So how / when are these transformations applied to the dataset and how does data pass through them exactly?
As an example, say I wanted to take the stat_boxplot transformation and plot the 3rd quartile as a point. How would I do this?
My intuition would be something like:
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = ..upper..) , geom = "point")
or
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length) , geom = "point")
however both error with
Error: geom_point requires the following missing aesthetics: y
My guess is that, as part of its transformation, stat_boxplot consumes the y aesthetic and produces a dataset that contains no y variable. However, this leads on to ...
Question 2
Where can I find out which variables are consumed as part of the stat_* transformation and which variables they output? Maybe I'm looking in the wrong places, but the documentation does not seem clear to me at all...

Interesting questions...
As background info, you can read this chapter of R for Data Science, focusing on the grammar of graphics. I'm sure Hadley Wickham's book on ggplot2 is an even better source, but I don't have that one.
The main steps for building a graph with one layer and no facet are:
1. Apply aesthetics mapping on input data (in simple cases, this is a selection and renaming on columns)
2. Apply scale transformation (if any) on each data column
3. Compute stat on each data group (i.e. per Species in this case)
4. Apply aesthetics mapping on stat data, detected with ..<name>.. or stat(name)
5. Apply position adjustment
6. Build graphical objects
7. Apply coordinate transformations
As you guessed, the behaviour at step 3 is similar to dplyr::transmute(): it consumes all aesthetics columns and outputs a data frame having as columns all freshly computed stats and all columns that are constant within the group. The stat output may also have a different number of rows from its input. Thus indeed in your example the y column isn't passed to the geom.
To do this, we'd like to specify different mappings at step 1 (before stat) and at step 4 (before geom). I thought something like this would work:
# This does not work:
ggplot(data = iris) +
geom_point(
aes(x=Species, y=stat(upper)),
stat=stat_boxplot(aes(x=Species, y=Sepal.Length)) )
... but it doesn't (stat must be a string or a Stat object, but stat_boxplot actually returns a Layer object, like geom_point does).
NB: stat(upper) is an equivalent, more recent notation for your ..upper..; from ggplot2 3.3.0 on, the same thing is written after_stat(upper).
I might be wrong but I don't think there is a way of doing this directly within ggplot. What you can do is extract the stat part of the process above and manage it yourself before entering ggplot():
library(tidyverse)
iris %>%
group_by(Species) %>%
select(y=Sepal.Length) %>%
do(StatBoxplot$compute_group(.)) %>%
ggplot(aes(Species, upper)) + geom_point()
A bit less elegant, I admit...
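Since this answer was written, ggplot2 3.3.0 added stage(), which (as far as I can tell) allows exactly the two-step mapping described above: one mapping before the stat and a different one after it. A minimal sketch, assuming ggplot2 >= 3.3.0; the object names p and q3 are only for illustration:

```r
library(ggplot2)

# y is mapped to Sepal.Length before the stat runs (so stat_boxplot can
# compute its summaries), then remapped to the computed `upper` column
# (the 3rd quartile) before the point geom draws it.
p <- ggplot(data = iris) +
  stat_boxplot(
    aes(x = Species, y = stage(Sepal.Length, after_stat = upper)),
    geom = "point"
  )

# Inspect the data the geom receives: one row per Species.
q3 <- layer_data(p)$y
print(q3)
```

The values in q3 can be compared against a direct quantile(..., 0.75) per Species to confirm that the points sit on the 3rd quartile.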
For your question 2, it's in the doc: see sections Aesthetics and Computed variables of ?stat_boxplot
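Besides the documentation, you can look at what a stat actually returns by building the plot and inspecting the layer data. A small sketch using ggplot2's layer_data(); the exact set of columns may differ between ggplot2 versions, so treat the printed names as informative rather than guaranteed:

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_boxplot()

# One row per group, with the computed variables as columns.
computed <- layer_data(p)
print(names(computed))

# Check which computed variables are available and whether y survived.
print(c("lower", "middle", "upper") %in% names(computed))
print("y" %in% names(computed))
```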

Related

Why can facet_wrap() in ggplot2 be expressed with either a tilde (~) or vars()?

A tilde (~) in R generally denotes an anonymous function or formula, if I understand correctly. In ggplot2, you can use facet_wrap() to split your plot into facets based on a factor variable with multiple levels. There are two different ways to express this, and they both produce similar results:
# load starwars and tidyverse
library(tidyverse)
data(starwars)
With a ~:
ggplot(data = starwars, mapping = aes(x = mass)) +
geom_histogram(fill = "blue", alpha = .2) +
theme_minimal() +
facet_wrap( ~ gender, nrow = 1)
With vars():
ggplot(data = starwars, mapping = aes(x = mass)) +
geom_histogram(fill = "blue", alpha = .2) +
theme_minimal() +
facet_wrap( vars(gender), nrow = 1)
How are vars() and ~ equivalent in ggplot2? How is ~ being used in a manner that is analogous, or equivalent to, its typical usage as an anonymous function or formula in R? It doesn't seem like it's a function here? Can someone help clarify how vars() and ~ for facet_wrap() denote the same thing?
The two plots should be identical.
In ggplot2, vars() is just a quoting function that takes inputs to be evaluated, which in this case is the variable name used to form the faceting groups. In other words, the column you supplied, usually a variable with more than one level, will be automatically quoted, then evaluated in the context of the data to form small panels of plots. I recommend using vars() inputs when you want to create a function to wrap around facet_wrap(); it’s a lot easier.
The ~, on the other hand, is R's ordinary formula syntax, which facet_wrap() also accepts as a facet specification. Here facet_wrap(~ variable_name) does not imply the estimation of some formulaic expression, and it is not an anonymous function either. Rather, the one-sided formula with a variable on the right-hand side is just a way of handing the function the name of the column without evaluating it, so that it can be looked up in the data later. It’s confusing because we usually use the ~ to denote a relationship between y and x. The analogy holds loosely in facet_grid(), where the variable to the left of the ~ defines the rows of panels and the variable to the right defines the columns. In facet_wrap(~ ...) nothing is on the left: you get one panel per level of the right-hand variable, and the panels are simply wrapped into rows and columns.
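To illustrate why vars() is convenient inside a wrapper function, here is a minimal sketch. The helper facet_hist() is hypothetical, and it uses the mpg data bundled with ggplot2 (rather than starwars) so that it runs without extra packages; the {{ }} forwarding assumes a reasonably recent ggplot2 and rlang:

```r
library(ggplot2)

# Hypothetical helper: histogram of `x`, one panel per level of `facet`.
# {{ }} forwards the unquoted column names into aes() and vars().
facet_hist <- function(data, x, facet) {
  ggplot(data, aes(x = {{ x }})) +
    geom_histogram(bins = 30) +
    facet_wrap(vars({{ facet }}), nrow = 1)
}

p_vars <- facet_hist(mpg, hwy, drv)

# The same plot written directly with the formula interface.
p_formula <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ drv, nrow = 1)

# Both specifications should produce the same number of panels.
n_vars    <- length(unique(layer_data(p_vars)$PANEL))
n_formula <- length(unique(layer_data(p_formula)$PANEL))
print(c(n_vars, n_formula))
```

Doing the same with a formula inside the function would mean building the formula programmatically, which is the extra work the recommendation above is trying to avoid.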

Issue adding second variable to scatter plot in R

I've been set this question for an assignment, but I've never used R before. Any help is appreciated.
Many thanks.
Question:
Produce a scatter plot to compare CO2 emissions from Brazil and Argentina between 1950 and 2019....
I can get it for Brazil but cannot figure out how to add Argentina.
I think I have to do something with geom_point and filter?
df%>%
filter(Country=="Brazil", Year<=2019 & Year>=1950) %>%
ggplot(aes(x = Year, y = CO2_annual_tonnes)) +
geom_point(na.rm =TRUE, shape=20, size=2, colour="green") +
labs(x = "Year", y = "CO2 emissions (tonnes)")
The answer depends on what you're looking to do, but generally adding another dimension to a scatter plot where you already have clear x and y dimensions is done by applying an aesthetic (color, shape, etc) or via faceting.
In both approaches, you don't want to filter the data down to a single country. Instead, aesthetics or faceting do the "filtering" for you, mapping the data appropriately based on the Country column in the dataset. If your dataset contains more countries than Argentina and Brazil, you will still want to filter so that only those two remain:
your_filtered_df <- your_df %>%
dplyr::filter(Country %in% c("Argentina", "Brazil"))
Faceting
Faceting is another way of saying you want to split up your one plot into two separate plots (one for Argentina, one for Brazil). Each plot will have the same aesthetics (look the same), but will have the appropriate "filtered" dataset.
In your case, you can try:
your_filtered_df %>%
ggplot(aes(x = Year, y = CO2_annual_tonnes)) +
geom_point(na.rm =TRUE, shape=20, size=2, colour="green") +
facet_wrap(~Country)
Aesthetics
Here, you have a lot of options. The idea is that you tell ggplot2 to map the appearance of individual points in the point geom to the value specified in your_filtered_df$Country. You do this by placing one of the aesthetic arguments for geom_point() inside of aes(). If you use shape=, for example it might look like this:
your_filtered_df %>%
ggplot(aes(x = Year, y = CO2_annual_tonnes)) +
geom_point(aes(shape=Country), na.rm =TRUE, size=2, colour="green")
This should show a plot with a legend and two different point shapes, one per country name. It's very important to remember that when you put an aesthetic like shape or color or size inside of aes(), you must not also set it outside. So, this will behave correctly:
geom_point(aes(colour=Country), ...)
But this will not:
geom_point(aes(colour=Country), colour="green", ...)
A value set outside aes() overrides the mapping inside it, so the second call will still show all points as green.
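A small self-contained sketch of that difference. The data frame toy is made up for illustration: the column names follow the question, but the numbers are placeholders and not real emissions figures.

```r
library(ggplot2)

# Placeholder data: column names as in the question, values invented.
toy <- data.frame(
  Country = rep(c("Argentina", "Brazil"), each = 3),
  Year = rep(c(1950, 1985, 2019), times = 2),
  CO2_annual_tonnes = c(1, 2, 3, 2, 4, 6)
)

# colour is mapped to Country: one colour per country, plus a legend.
p_mapped <- ggplot(toy, aes(x = Year, y = CO2_annual_tonnes)) +
  geom_point(aes(colour = Country))

# colour is mapped AND set: the fixed value wins.
p_fixed <- ggplot(toy, aes(x = Year, y = CO2_annual_tonnes)) +
  geom_point(aes(colour = Country), colour = "green")

# Look at the colours each layer actually draws with.
mapped_colours <- unique(layer_data(p_mapped)$colour)
fixed_colours  <- unique(layer_data(p_fixed)$colour)
print(mapped_colours)
print(fixed_colours)
```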
Don't Do this... but it works
OP posted a comment that indicated some additional hints from the professor, which was:
We were given the hint in the question "you can embed piped filter
functions within geom_point objects"
I believe they are referring to a final... very bad way of generating the points. This method would require you to have two geom_point() objects, and send each one a different filtered dataset. You can do this by accessing the data= argument within each geom_point() object. There are many problems with this approach, including the lack of a legend being generated, but if you simply must do it this way... here it is:
# painful to write this. it goes against all good practices with ggplot
your_filtered_df %>%
ggplot(aes(x = Year, y = CO2_annual_tonnes)) +
geom_point(data=your_filtered_df %>% dplyr::filter(Country=="Argentina"),
color="green", shape=20) +
geom_point(data=your_filtered_df %>% dplyr::filter(Country=="Brazil"),
color="red", shape=20)
You can probably see why this is not a good convention. Think about what you would do to represent 50 different countries... the earlier approaches would work unchanged, but with this method you would need 50 individual geom_point() objects in your plot... ugh. Don't make a typo!

r - scatterplot summary stat (e.g. sum or mean) for each point instead of individual data points

I am looking for a way to summarize data within a ggplot call, not before. I could pre-aggregate the data and then plot it, but I know there is a way to do it within a ggplot call. I'm just unsure how.
In this example, I want to get a mean for each (x,y) combo, and map it onto the colour aes
library(tidyverse)
df <- tibble(x = rep(c(1,2,4,1,5),10),
y = rep(c(1,2,3,1,5),10),
col = sample(c(1:100), 50))
df_summar <- df %>%
group_by(x,y) %>%
summarise(col_mean = mean(col))
ggplot(df_summar, aes(x=x, y=y, col=col_mean)) +
geom_point(size = 5)
I think there must be a better way to avoid the pre-ggplot step (yes, I could also have piped dplyr transformations into the ggplot, but the mechanics would be the same).
For instance, geom_count() counts the instances and plots them onto size aes:
ggplot(df, aes(x=x, y=y)) + geom_count()
I want the same, but mean instead of count, and col instead of size
I'm guessing I need stat_summary() or a stat() call (a replacement for ..xxx.. notation), but I can't get it to give me what I need.
You'll need stat_summary_2d:
ggplot(df, aes(x, y, z = col)) +
stat_summary_2d(aes(col = ..value..), fun = 'mean', geom = 'point', size = 5)
(Or stat(value) from ggplot2 3.0.0, or after_stat(value) from 3.3.0 on, if you read this in the future; the calc() spelling from the dev version did not survive to a release.)
You can pass any arbitrary function to fun.
While stat_summary seems like it would be useful, it is not in this case. It is specialized in the common transformation for plotting, summarizing a range of y values, grouped by x, into a set of summary statistics that are plotted as y(, ymin and ymax). You want to group by both x and y, so 2d it is.
Note that this uses binning, however, so to get the points to line up accurately you need to increase the number of bins (e.g. bins = 1e3). Unfortunately, there is no non-binning 2d summary stat.
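Putting the pieces together, here is a minimal sketch with a large bin count. It uses a deterministic variant of the question's data (col = 1:50 instead of a random sample) so that the result can be compared with a plain grouped mean, and it assumes ggplot2 >= 3.3.0 for after_stat():

```r
library(ggplot2)

# Deterministic variant of the question's data.
df <- data.frame(
  x   = rep(c(1, 2, 4, 1, 5), 10),
  y   = rep(c(1, 2, 3, 1, 5), 10),
  col = 1:50
)

# Many narrow bins, so each distinct (x, y) pair falls into its own bin.
p <- ggplot(df, aes(x, y, z = col)) +
  stat_summary_2d(
    aes(colour = after_stat(value)),
    fun = mean, geom = "point", size = 5, bins = 1000
  )

# Compare the stat's output with a plain grouped mean.
from_stat <- sort(layer_data(p)$value)
by_hand   <- sort(aggregate(col ~ x + y, data = df, FUN = mean)$col)
print(from_stat)
print(by_hand)
```

The points are drawn at bin midpoints, so with 1000 bins they sit very close to, but not exactly on, the original coordinates.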

addressing `data` in `geom_line` of ggplot

I am building a barplot with a line connecting two bars in order to show that asterisk refers to the difference between them:
Most of the plot is built correctly with the following code:
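library(ggplot2)
library(scales) # for percent
mytbl <- data.frame(
"var" =c("test", "control"),
"mean1" =c(0.019, 0.022),
"sderr"= c(0.001, 0.002)
);
# set the level order explicitly; relevel() fails when var is a character column (the default since R 4.0)
mytbl$var <- factor(mytbl$var, levels=c("test", "control")); # without this will be sorted alphabetically (i.e. 'control', then 'test')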
p <-
ggplot(mytbl, aes(x=var, y=mean1)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=mean1-sderr, ymax=mean1+sderr), width=.2)+
scale_y_continuous(labels=percent, expand=c(0,0), limits=c(NA, 1.3*max(mytbl$mean1+mytbl$sderr))) +
geom_text(mapping=aes(x=1.5, y= max(mean1+sderr)+0.005), label='*', size=10)
p
The only thing missing is the line itself. In my very old code, it was supposedly working with the following:
p +
geom_line(
mapping=aes(x=c(1,1,2,2),
y=c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)
)
)
But when I run this code now, I get an error: Error: Aesthetics must be either length 1 or the same as the data (2): x, y. By trying different things, I came to an awkward workaround: I add data=rbind(mytbl,mytbl), before mapping but I don't understand what really happens here.
P.S. additional little question (I know, I should ask in a separate SO post, sorry for that) - why in scale_y_continuous(..., limits()) I can't address data by columns and have to call mytbl$ explicitly?
Just put all that in a separate data frame:
line_data <- data.frame(x=c(1,1,2,2),
y=with(mytbl,c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)))
p + geom_line(data = line_data,aes(x = x,y = y))
In general, you should avoid using things like [ and $ when you map aesthetics inside of aes(). The intended way to use ggplot2 is usually to adjust your data into a format such that each column is exactly what you want plotted already.
You can't reference variables in mytbl in the scale_* functions because that data environment isn't passed along like it is with layers. The scales are treated separately from the data layers, and so the information about them is generally assumed to live somewhere separate from the data you are plotting.

What are the limits to inheritance in ggplot2?

I have been trying to work out a few things about ggplot2, and how supplementary arguments inherit from the first part, ggplot(). Specifically, whether inheritance is passed on beyond the geom_*** part.
I have a histogram of data:
ggplot(data = faithful, aes(eruptions)) + geom_histogram()
Which produces a fine chart, though the breaks are default. It appears to me (an admitted novice), that geom_histogram() is inheriting the data specification from ggplot(). If I want to have a smarter way of setting the breaks I could use a process like so:
ggplot(data = faithful, aes(eruptions)) +
geom_histogram(breaks = seq(from = min(faithful$eruptions),
to = max(faithful$eruptions), length.out = 10))
However, here I am re-specifying within the geom_histogram() function that I want faithful$eruptions. I have been unable to find a way to phrase this without re-specifying. Further, if I use the data = argument in geom_histogram(), and specify just eruptions in min and max, seq() still doesn't understand that I mean the faithful data set.
I know that seq is not part of ggplot2, but I wondered if it might be able to inherit regardless, as it is bound within geom_histogram(), which itself inherits from ggplot(). Am I just using the wrong syntax, or is this possible?
Note that the term you are looking for is not "inheritance", but non-standard evaluation (NSE). ggplot offers a couple of places where you can refer to your data items by their column names instead of a full reference (NSE), but those are the mapping arguments to ggplot() and the geom_* layers only, and even then only when you are using aes. These work:
ggplot(faithful) + geom_point(aes(eruptions, eruptions))
ggplot(faithful) + geom_point(aes(eruptions, eruptions, size=waiting))
The following doesn't work because we are referring to waiting outside of aes and mapping (note first arg to geom_* is the mapping arg):
ggplot(faithful) + geom_point(aes(eruptions, eruptions), size=waiting)
But this works:
ggplot(faithful) + geom_point(aes(eruptions, eruptions), size=faithful$waiting)
though differently, since now size is being interpreted literally instead of being normalized as it is when part of mapping.
In your case, since breaks is not part of the aes/mapping spec, you can't use NSE and you are left using the full reference. Some possible work-arounds:
ggplot(data = faithful, aes(eruptions)) + geom_histogram(bins=10) # not identical
ggplot(data=faithful, aes(eruptions)) +
geom_histogram(
breaks=with(faithful, # use `with`
seq(from=min(eruptions), to=max(eruptions), length.out=10)
) )
And no-NSE, but a little less typing:
ggplot(data=faithful, aes(eruptions)) +
geom_histogram(
breaks=do.call(seq, c(as.list(range(faithful$eruptions)), len=10))
)
Based on the ggplot2 documentation it seems that + operator which is really the +.gg function allows adding the following objects to a ggplot object:
data.frame, uneval, layer, theme, scale, coord, facet
The geom_* functions create layers, which inherit the data and aes from the ggplot object "above" unless stated otherwise.
However, an argument such as breaks is an ordinary R expression that is evaluated in the calling environment (here the global environment) before geom_histogram() ever sees it. A call to seq() does not create one of the objects listed above and inherits nothing from the ggplot object, so it looks for eruptions in the global environment, which doesn't include an object of that name.
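A minimal sketch of the point both answers make: compute the breaks from the data frame explicitly, then hand the finished vector to geom_histogram(). The names brks and bin_counts are only for illustration.

```r
library(ggplot2)

# `eruptions` is a column of `faithful`, not an object in the calling
# environment, so a bare reference inside seq() cannot be resolved there.
print(exists("eruptions"))

# Compute the breaks from the data frame explicitly, then pass them in.
brks <- with(faithful, seq(min(eruptions), max(eruptions), length.out = 10))

p <- ggplot(faithful, aes(eruptions)) +
  geom_histogram(breaks = brks)

# 10 break points give 9 bins; inspect the counts the stat computed.
bin_counts <- layer_data(p)$count
print(bin_counts)
```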

Resources