I am building a barplot with a line connecting two bars in order to show that the asterisk refers to the difference between them.
Most of the plot is built correctly with the following code:
mytbl <- data.frame(
"var" =c("test", "control"),
"mean1" =c(0.019, 0.022),
"sderr"= c(0.001, 0.002)
);
mytbl$var <- relevel(factor(mytbl$var), "test"); # without this will be sorted alphabetically (i.e. 'control', then 'test'); factor() is needed because data.frame() no longer creates factors by default (R >= 4.0)
p <-
ggplot(mytbl, aes(x=var, y=mean1)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=mean1-sderr, ymax=mean1+sderr), width=.2)+
scale_y_continuous(labels=scales::percent, expand=c(0,0), limits=c(NA, 1.3*max(mytbl$mean1+mytbl$sderr))) +
geom_text(mapping=aes(x=1.5, y= max(mean1+sderr)+0.005), label='*', size=10)
p
The only thing missing is the line itself. In my very old code, it was supposedly working with the following:
p +
geom_line(
mapping=aes(x=c(1,1,2,2),
y=c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)
)
)
But when I run this code now, I get an error: Error: Aesthetics must be either length 1 or the same as the data (2): x, y. By trying different things, I came to an awkward workaround: I add data=rbind(mytbl,mytbl), before mapping but I don't understand what really happens here.
P.S. additional little question (I know, I should ask in a separate SO post, sorry for that) - why in scale_y_continuous(..., limits()) I can't address data by columns and have to call mytbl$ explicitly?
Just put all that in a separate data frame:
line_data <- data.frame(x=c(1,1,2,2),
y=with(mytbl,c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)))
p + geom_line(data = line_data,aes(x = x,y = y))
In general, you should avoid using things like [ and $ when you map aesthetics inside of aes(). The intended way to use ggplot2 is usually to adjust your data into a format such that each column is exactly what you want plotted already.
You can't reference variables in mytbl in the scale_* functions because that data environment isn't passed along like it is with layers. The scales are treated separately from the data layers, and so the information about them is generally assumed to live somewhere separate from the data you are plotting.
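As a minimal sketch of that last point, the upper limit can be computed from mytbl before the plot is built and then handed to the scale as an ordinary number. The name y_top is made up for this illustration, and geom_col() is used as shorthand for geom_bar(stat="identity"):

```r
library(ggplot2)

mytbl <- data.frame(
  var   = factor(c("test", "control"), levels = c("test", "control")),
  mean1 = c(0.019, 0.022),
  sderr = c(0.001, 0.002)
)

# Scales do not see the layer data, so compute the limit up front.
# y_top is a hypothetical helper name, not part of ggplot2.
y_top <- 1.3 * max(mytbl$mean1 + mytbl$sderr)

p <- ggplot(mytbl, aes(x = var, y = mean1)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean1 - sderr, ymax = mean1 + sderr), width = 0.2) +
  scale_y_continuous(expand = c(0, 0), limits = c(NA, y_top))
p
```

Keeping the computation outside the ggplot() call also makes the limit easy to reuse, for example when positioning the asterisk.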
I'm currently trying to get my head around the differences between stat_* and geom_* in ggplot2. (Please note this is more of an interest/understanding based question than a specific problem I am trying to solve).
Introduction
My current understanding is that the stat_* functions apply a transformation to your data and that the result is then passed on to the geom_* to be displayed.
Most simple example being the identity transformation which simply passes your data untransformed onto the geom.
ggplot(data = iris) +
stat_identity(aes(x = Sepal.Length, y = Sepal.Width) , geom= "point")
More practical use-cases appear to be when you want to use some transformation and supply the results to a non-default geom, for example if you wanted to plot an error bar of the 1st and 3rd quartile you could do something like:
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length, ymax = ..upper.., ymin = ..lower..), geom = "errorbar")
Question 1
So how / when are these transformations applied to the dataset and how does data pass through them exactly?
As an example, say I wanted to take the stat_boxplot transformation and plot the point of the 3rd quartile how would I do this ?
My intuition would be something like :
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = ..upper..) , geom = "point")
or
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length) , geom = "point")
however both error with
Error: geom_point requires the following missing aesthetics: y
My guess is as part of the stat_boxplot transformation it consumes the y aesthetic and produces a dataset not containing any y variable however this leads onto ....
Question 2
Where can I find out which variables are consumed as part of the stat_* transformation and what variables they output? Maybe I'm looking in the wrong places but the documentation does not seem clear to me at all...
Interesting questions...
As background info, you can read this chapter of R for Data Science, focusing on the grammar of graphics. I'm sure Hadley Wickham's book on ggplot2 is an even better source, but I don't have that one.
The main steps for building a graph with one layer and no facet are:
Apply aesthetics mapping on input data (in simple cases, this is a selection and renaming on columns)
Apply scale transformation (if any) on each data column
Compute stat on each data group (i.e. per Species in this case)
Apply aesthetics mapping on stat data, detected with ..<name>.. or stat(name)
Apply position adjustment
Build graphical objects
Apply coordinate transformations
As you guessed, the behaviour at step 3 is similar to dplyr::transmute(): it consumes all aesthetics columns and outputs a data frame having as columns all freshly computed stats and all columns that are constant within the group. The stat output may also have a different number of rows from its input. Thus indeed in your example the y column isn't passed to the geom.
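One way to see what a stat hands to its geom is to build the plot and inspect the layer's data with layer_data(). This is only a rough sketch for exploration; the object names p and computed are made up here:

```r
library(ggplot2)

# A plain boxplot layer; geom_boxplot() uses stat_boxplot by default.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# layer_data() returns the data frame for the first layer after the
# stat, scales and position adjustment have been applied.
computed <- layer_data(p)

names(computed)                                  # columns available to the geom
computed[, c("x", "lower", "middle", "upper")]   # one row per Species
```

The result has one row per group rather than one row per observation, which illustrates that the stat output can have a different shape from the input.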
To do this, we'd like to specify different mappings at step 1 (before stat) and at step 4 (before geom). I thought something like this would work:
# This does not work:
ggplot(data = iris) +
geom_point(
aes(x=Species, y=stat(upper)),
stat=stat_boxplot(aes(x=Species, y=Sepal.Length)) )
... but it doesn't (stat must be a string or a Stat object, but stat_boxplot actually returns a Layer object, like geom_point does).
NB: stat(upper) is an equivalent, more recent, notation to your ..upper..
I might be wrong but I don't think there is a way of doing this directly within ggplot. What you can do is extract the stat part of the process above and manage it yourself before entering ggplot():
library(tidyverse)
iris %>%
group_by(Species) %>%
select(y=Sepal.Length) %>%
do(StatBoxplot$compute_group(.)) %>%
ggplot(aes(Species, upper)) + geom_point()
A bit less elegant, I admit...
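A different technique that stays inside ggplot is stat_summary(), which lets you supply your own summary function and choose the geom. Note this swaps stat_boxplot for a hand-written quantile; I am assuming the boxplot's upper hinge corresponds to quantile(y, 0.75), and that your ggplot2 is recent enough (3.3.0 or later) to have the fun argument:

```r
library(ggplot2)

# Summarise y per x value with the 75th percentile and draw it as a point.
p_q3 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_summary(fun = function(v) unname(quantile(v, 0.75)), geom = "point")

# The summarised values end up in the y column of the layer data.
q3 <- layer_data(p_q3)$y
p_q3
```

Here the function passed to fun returns a single number per group, and stat_summary() stores it as y, so geom_point has the aesthetic it requires.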
For your question 2, it's in the doc: see sections Aesthetics and Computed variables of ?stat_boxplot
I have currently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I would like to plot this dataset, the following two code snippets are completely interchangeable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to get the approximation of the sample distribution. To demonstrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorical variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
And plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck", that in case of plot 1 and 2 the filling is identical, or I have stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using data$variable inside aes() is never a good idea and is not recommended. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
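ggplot uses nonstandard evaluation. A standard-evaluating R function only looks up names in the environment it is called from and the environments enclosing it (here, the global environment); it does not look inside data frames. If I have data named mydata with a column named col1, and I do mean(col1), I get an error: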
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
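The same kind of lookup can be seen in base R, which may make the idea less mysterious. This is only an analogy for what aes() arranges, not a description of ggplot2's internals; the names m1 and m2 are made up for the example:

```r
mydata <- data.frame(col1 = 1:3)

# with() evaluates the expression using the data frame's columns as variables.
m1 <- with(mydata, mean(col1))

# The same thing spelled out: capture the expression unevaluated,
# then evaluate it with the data frame supplying the names.
m2 <- eval(quote(mean(col1)), envir = mydata)

m1
m2
```

In both calls col1 is found because the data frame is consulted first, which is roughly what happens to the expressions you write inside aes().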
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is a name of a column, and mydata$col1 is the actual values. ggplot will look for columns in your data named col1, and use that. mydata$col1 is just a vector, it's the full column. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot is splitting your data up into pieces and doing stuff. To do this effectively, it needs to identify the data and column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values - whatever happens to be in that column, and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.
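Applied to the faceted example from the question, a sketch of the recommended form looks like this. The data are rebuilt so the snippet runs on its own, the y = ..count../sum(..count..) mapping is left out to keep it short, and built is a made-up name used only to inspect the result:

```r
library(ggplot2)

data <- data.frame(
  species = rep(c("cat", "dog"), 30),
  numb    = rep(c(1, 2, 3, 7, 8, 10), 10),
  groups  = rep(c("A", "A", "B", "B"), 15)
)
data$factnumb <- factor(data$numb)
data_miss <- data[data$numb != 3, ]

# Only bare column names inside aes(), so each facet panel receives the
# rows (and fill values) that belong to it.
pm <- ggplot(data_miss, aes(x = factnumb, fill = species)) +
  facet_grid(groups ~ .) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

# Inspect the per-panel counts that will be drawn.
built <- layer_data(pm)
built[, c("PANEL", "x", "count")]
pm
```

Because species is looked up inside each panel's subset of data_miss, the fill stays aligned with the bars even though one factor level has no observations.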
I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionally, the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly screwed up in the facetting. Basically any work around would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
The problem only lies in the use of test_dat$y inside the color aes. Never use $ in aes, ggplot will mess up.
Anyway, I think your plot would improve if you use a geom_hline for the benchmark, instead of hacking in a single value boxplot:
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")
I'm plotting a dense scatter plot in ggplot2 where each point might be labeled by a different color:
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size))
When I do this, the scatter point labeled "point" (green) is plotted on top of the red points which have the label "a". What controls this z ordering in ggplot, i.e. what controls which point is on top of which?
For example, what if I wanted all the "a" points to be on top of all the points labeled "point" (meaning they would sometimes partially or fully hide that point)? Does this depend on alphanumerical ordering of labels?
I'd like to find a solution that can be translated easily to rpy2.
2016 Update:
The order aesthetic has been deprecated, so at this point the easiest approach is to sort the data.frame so that the green point is at the bottom, and is plotted last. If you don't want to alter the original data.frame, you can sort it during the ggplot call - here's an example that uses %>% and arrange from the dplyr package to do the on-the-fly sorting:
library(dplyr)
ggplot(df %>%
arrange(label),
aes(x = x, y = y, color = label, size = size)) +
geom_point()
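If you would rather not depend on dplyr (which may be simpler when calling from rpy2), the same sorting idea can be sketched in base R. The variable on_top is a hypothetical name for whichever label should be drawn last:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(500))
df$y <- rnorm(500) * 0.1 + df$x
df$label <- "a"
df$label[50] <- "point"
df$size <- 2

# Rows whose label equals on_top sort to the end (FALSE < TRUE),
# so they are drawn last and end up on top.
on_top <- "point"
df_sorted <- df[order(df$label == on_top), ]

ggplot(df_sorted, aes(x = x, y = y, color = label, size = size)) +
  geom_point()
```

Setting on_top <- "a" instead would move all the "a" rows to the end, so they are drawn over the single "point" row.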
Original 2015 answer for ggplot2 versions < 2.0.0
In ggplot2, you can use the order aesthetic to specify the order in which points are plotted. The last ones plotted will appear on top. To apply this, you can create a variable holding the order in which you'd like points to be drawn.
To put the green dot on top by plotting it after the others:
df$order <- ifelse(df$label=="a", 1, 2)
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=order))
Or to plot the green dot first and bury it, plot the points in the opposite order:
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=-order))
For this simple example, you can skip creating a new sorting variable and just coerce the label variable to a factor and then a numeric:
ggplot(df) +
geom_point(aes(x=x, y=y, color=label, size=size, order=as.numeric(factor(df$label))))
ggplot2 will create plots layer-by-layer and within each layer, the plotting order is defined by the geom type. The default is to plot in the order that they appear in the data.
Where this is different, it is noted. For example
geom_line
Connect observations, ordered by x value.
and
geom_path
Connect observations in data order
There are also known issues regarding the ordering of factors, and it is interesting to note the response of the package author Hadley
The display of a plot should be invariant to the order of the data frame - anything else is a bug.
This quote in mind, a layer is drawn in the specified order, so overplotting can be an issue, especially when creating dense scatter plots. So if you want a consistent plot (and not one that relies on the order in the data frame) you need to think a bit more.
Create a second layer
If you want certain values to appear above other values, you can use the subset argument to create a second layer to definitely be drawn afterwards. You will need to explicitly load the plyr package so .() will work.
set.seed(1234)
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
library(plyr)
ggplot(df) + geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(aes(x = x, y = y, color = label, size = size),
subset = .(label == 'point'))
Update
In ggplot2_2.0.0, the subset argument is deprecated. Use e.g. base::subset to select relevant data specified in the data argument. And no need to load plyr:
ggplot(df) +
geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(data = subset(df, label == 'point'),
aes(x = x, y = y, color = label, size = size))
Or use alpha
Another approach to avoid the problem of overplotting would be to set the alpha (transparency) of the points. This will not be as effective as the explicit second layer approach above, however, with judicious use of scale_alpha_manual you should be able to get something to work.
eg
# set alpha = 1 (no transparency) for your point(s) of interest
# and a low value otherwise
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size,alpha = label)) +
scale_alpha_manual(guide='none', values = c(a = 0.2, point = 1))
The fundamental question here can be rephrased like this:
How do I control the layers of my plot?
In the 'ggplot2' package, you can do this quickly by splitting each different layer into a different command. Thinking in terms of layers takes a little bit of practice, but it essentially comes down to what you want plotted on top of other things. You build from the background upwards.
Prep: Prepare the sample data. This step is only necessary for this example, because we don't have real data to work with.
# Establish random seed to make data reproducible.
set.seed(1)
# Generate sample data.
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
# Initialize 'label' and 'size' default values.
df$label <- "a"
df$size <- 2
# Label and size our "special" point.
df$label[50] <- "point"
df$size[50] <- 4
You may notice that I've added a different size to the example just to make the layer difference clearer.
Step 1: Separate your data into layers. Always do this BEFORE you use the 'ggplot' function. Too many people get stuck by trying to do data manipulation from within the 'ggplot' functions. Here, we want to create two layers: one with the "a" labels and one with the "point" labels.
df_layer_1 <- df[df$label=="a",]
df_layer_2 <- df[df$label=="point",]
You could do this with other functions, but I'm just quickly using the data frame matching logic to pull the data.
Step 2: Plot the data as layers. We want to plot all of the "a" data first and then plot all the "point" data.
ggplot() +
geom_point(
data=df_layer_1,
aes(x=x, y=y),
colour="orange",
size=df_layer_1$size) +
geom_point(
data=df_layer_2,
aes(x=x, y=y),
colour="blue",
size=df_layer_2$size)
Notice that the base plot layer ggplot() has no data assigned. This is important, because we are going to override the data for each layer. Then, we have two separate point geometry layers geom_point(...) that use their own specifications. The x and y axis will be shared, but we will use different data, colors, and sizes.
It is important to move the colour and size specifications outside of the aes(...) function, so we can specify these values literally. Otherwise, the 'ggplot' function will usually assign colors and sizes according to the levels found in the data. For instance, if you have size values of 2 and 5 in the data, it will assign a default size to any occurrences of the value 2 and will assign some larger size to any occurrences of the value 5. An 'aes' function specification will not use the values 2 and 5 for the sizes. The same goes for colors. I have exact sizes and colors that I want to use, so I move those arguments into the 'geom_point' function itself. Also, any specifications in the 'aes' function will be put into the legend, which can be really useless.
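The difference between mapping a value inside aes() and setting it outside can be checked by inspecting the built layer data. This is a small sketch with a made-up two-row data frame pts; mapped and fixed are likewise names invented for the comparison:

```r
library(ggplot2)

pts <- data.frame(x = 1:2, y = 1:2, size = c(2, 4))

# Mapped: size goes through the size scale, so the drawn sizes are
# rescaled and are not the literal 2 and 4.
mapped <- layer_data(
  ggplot(pts, aes(x, y)) + geom_point(aes(size = size))
)

# Set: size and colour are passed outside aes() and used as given.
fixed <- layer_data(
  ggplot(pts, aes(x, y)) + geom_point(size = pts$size, colour = "blue")
)

mapped$size
fixed$size
fixed$colour
```

Comparing the two size columns shows why literal values belong outside aes() when you already know the exact sizes and colours you want.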
Final note: In this example, you could achieve the wanted result in many ways, but it is important to understand how 'ggplot2' layers work in order to get the most out of your 'ggplot' charts. As long as you separate your data into different layers before you call the 'ggplot' functions, you have a lot of control over how things will be graphed on the screen.
It's plotted in order of the rows in the data.frame. Try this:
df2 <- rbind(df[-50,],df[50,])
ggplot(df2) + geom_point(aes(x=x, y=y, color=label, size=size))
As you see the green point is drawn last, since it represents the last row of the data.frame.
Here is a way to order the data.frame to have the green point drawn first:
df2 <- df[order(-as.numeric(factor(df$label))),]
Suppose that I have two data frames
df1 = data.frame(x=1:10)
df2 = data.frame(x=11:20)
and I want a scatter plot with these two series defining the coordinates. It would be simple to do
plot(df1$x,df2$x)
From what I can tell so far about ggplot2, I could also do
df = data.frame(x1 = df1$x, x2 = df2$x)
ggplot(data = df, aes(x=x1, y=x2)) + geom_point()
rm(df)
but that would be slower (for me) than not creating a new data frame, is hard to read, and could lead to increased mistakes (deleting the wrong data frame, writing over a needed data frame, forgetting to remove the excess clutter, etc.). Do I really need to create a separate data frame just to house the data that are already there? Why does the first line of the following work even though it only lists one of the data frames under "data" while the second line does not?
ggplot(data = df1, aes(x=df1$x, y=df2$x)) + geom_point()
ggplot( aes(x=df1$x, y=df2$x)) + geom_point()
Here is an example image of basically what I want:
Any line of the following (all taken from comments) should work:
ggplot(data=data.frame(x=df1$x, y=df2$x), aes(x,y)) + geom_point()
ggplot() + geom_point(aes(x=df1$x, y=df2$x))
ggplot(data=NULL, aes(x=df1$x, y=df2$x)) + geom_point()
ggplot(data=df1, aes(x=x)) + geom_point(aes(y=df2$x))
I prefer the last line (taken from a comment that was deleted). As mentioned in comments on the question, ggplot() will create a data.frame anyway. What these solutions do is permit the user to ignore this aspect of data management somewhat (admittedly in ways that some users would find abhorrent).
This was going to be a comment, but I'm not reputable enough yet.
You could also try qplot(x = df1$x, y = df2$x). Note from ?qplot that qplot will create a data frame for you from the inputs provided, if the data argument is left unset.