How to add legend to plot with data from multiple data frames - r

I have scripted a ggplot compiled from two separate data frames, but as it stands there is no legend as the colours aren't included in aes. I'd prefer to keep the two datasets separate if possible, but can't figure out how to add the legend. Any thoughts?
I've tried adding the colours directly to the aes function, but then colours are just added as variables and listed in the legend instead of colouring the actual data.
Plotting this with base r, after creating the plot I would've used:
legend("top",c("Delta 18O","Delta 13C"),fill=c("red","blue")
and gotten what I needed, but I'm not sure how to replicate this in ggplot.
The following code currently plots exactly what I want, it's just missing the legend... which ideally should match what the above line would produce, except the "18" and "13" need superscripted.
Examples of an old plot using base r (with a correct legend, except lacking superscripted 13 and 18) and the current plot missing the legend can be found here:
Old: https://imgur.com/xgd9e9C
New, missing legend: https://imgur.com/eGRhUzf
Background data
head(avar.data.x)
time av error
1 1.015223 0.030233604 0.003726832
2 2.030445 0.014819145 0.005270609
3 3.045668 0.010054801 0.006455241
4 4.060891 0.007477541 0.007453974
5 5.076113 0.006178282 0.008333912
6 6.091336 0.004949045 0.009129470
head(avar.data.y)
time av error
1 1.015223 0.06810001 0.003726832
2 2.030445 0.03408136 0.005270609
3 3.045668 0.02313839 0.006455241
4 4.060891 0.01737148 0.007453974
5 5.076113 0.01405144 0.008333912
6 6.091336 0.01172788 0.009129470
The following avarn function produces a data frame with three columns and several thousand rows (see header above). These are then graphed over time on a log/log plot.
avar.data.x <- avarn(data3$"d Intl. Std:d 13C VPDB - Value",frequency)
avar.data.y <- avarn(data3$"d Intl. Std:d 18O VPDB-CO2 - Value",frequency)
Create allan deviation plot
ggplot()+
geom_line(data=avar.data.y,aes(x=time,y=sqrt(av)),color="red")+
geom_line(data=avar.data.x,aes(x=time,y=sqrt(av)),color="blue")+
scale_x_log10()+
scale_y_log10()+
labs(x=expression(paste("Averaging Time ",tau," (seconds)")),y="Allan Deviation (per mil)")
The above plot is only missing a legend to show the name of the two plotted datasets and their respective colours. I would like the legend in the top centre of the graph.
How to superscript legend titles?:
ggplot()+
geom_line(data=avar.data.y,aes(x=time,y=sqrt(av),
color =expression(paste("Delta ",18^,"O"))))+
geom_line(data=avar.data.xmod,aes(x=time,y=sqrt(av),
color=expression(paste("Delta ",13^,"C"))))+
scale_color_manual(values = c("blue", "red"),name=NULL) +
scale_x_log10()+
scale_y_log10()+
labs(
x=expression(paste("Averaging Time ",tau," (seconds)")),
y="Allan Deviation (per mil)") +
theme(legend.position = c(0.5, 0.9))

Set color inside the aes and add a scale_color_ function to your plot should do the trick.
ggplot()+
geom_line(data=avar.data.y,aes(x=time,y=sqrt(av), color = "a"))+
geom_line(data=avar.data.x,aes(x=time,y=sqrt(av), color="b"))+
scale_color_manual(
values = c("red", "blue"),
labels = expression(avar.data.x^2, "b")
) +
scale_x_log10()+
scale_y_log10()+
labs(
x=expression(paste("Averaging^2 Time ",tau," (seconds)")),
y="Allan Deviation (per mil)") +
theme(legend.position = c(0.5, 0.9))

You can make better use of ggplot's aesthetics by combining both data sets into one. This is particularly easy when your data frames have the same structure. Here, you could then for example use color.
This way you only need one call to geom_line and it is easier to control the legend(s). You could even make some fancy function to automate your labels. etc.
Also note that white spaces in column names are not great (you're making your own life very difficult) and that you may want to think about automating your avarn calls, e.g. with lapply, which would result in a list of data frames and makes the binding of the data frames even easier.
avar.data.x <- readr::read_table("0 time av error
1 1.015223 0.030233604 0.003726832
2 2.030445 0.014819145 0.005270609
3 3.045668 0.010054801 0.006455241
4 4.060891 0.007477541 0.007453974
5 5.076113 0.006178282 0.008333912
6 6.091336 0.004949045 0.009129470")
avar.data.y <- readr::read_table("0 time av error
1 1.015223 0.06810001 0.003726832
2 2.030445 0.03408136 0.005270609
3 3.045668 0.02313839 0.006455241
4 4.060891 0.01737148 0.007453974
5 5.076113 0.01405144 0.008333912
6 6.091336 0.01172788 0.009129470")
library(tidyverse)
combine_df <- bind_rows(list(a = avar.data.x, b = avar.data.y), .id = 'ID')
ggplot(combine_df)+
geom_line(aes(x = time, y = sqrt(av), color = ID))+
scale_color_manual(values = c("red", "blue"),
labels = c(expression("Delta 18"^"O"), expression("Delta 13"^"C")))
Created on 2019-11-11 by the reprex package (v0.2.1)

Related

Can someone explain why my first ggplot2 box plot was just one big box and how the solution worked?

So my first ggplot2 box plot was just one big stretched out box plot, the second one was correct but I don't understand what changed and why the second one worked. I'm new to R and ggplot2, let me know if you can, thanks.
#----------------------------------------------------------
# This is the original ggplot that didn't work:
#----------------------------------------------------------
zSepalFrame <- data.frame(zSepalLength, zSepalWdth)
zPetalFrame <- data.frame(zPetalLength, zPetalWdth)
p1 <- ggplot(data = zSepalFrame, mapping = aes(x=zSepalWdth, y=zSepalLength, group = 4)) + #fill = zSepalLength
geom_boxplot(notch=TRUE) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
theme_classic() +
labs(title = "Iris Data Box Plot") +
labs(subtitle ="Z Values of Sepals From Iris.R")
p1
#----------------------------------------------------------
# This is the new ggplot box plot line that worked:
#----------------------------------------------------------
bp = ggplot(zSepalFrame, aes(x=factor(zSepalWdth), y=zSepalLength, color = zSepalWdth)) + geom_boxplot() + theme(legend.position = "none")
bp
This is what the ggplot box plot looked like
I don't have your precise dataset, OP, but it seems to stem from assigning a continuous variable to your x axis, when boxplots require a discrete variable.
A continuous variable is something like a numeric column in a dataframe. So something like this:
x <- c(4,4,4,8,8,8,8)
Even though the variable x only contains 4's and 8's, R assigns this as a numeric type of variable, which is continuous. It means that if you plot this on the x axis, ggplot will have no issue with something falling anywhere in-between 4 or 8, and will be positioned accordingly.
The other type of variable is called discrete, which would be something like this:
y <- c("Green", "Green", "Flags", "Flags", "Cars")
The variable y contains only characters. It must be discrete, since there is no such thing as something between "Green" and "Cars". If plotted on an x axis, ggplot will group things as either being "Green", "Flags", or "Cars".
The cool thing is that you can change a continuous variable into a discrete one. One way to do that is to factorize or force R to consider a variable as a factor. If you typed factor(x), you get this:
[1] 4 4 4 8 8 8 8
Levels: 4 8
The values in x are the same, but now there is no such thing as a number between 4 and 8 when x is a factor - it would just add another level.
That is in short why your box plot changes. Let's demonstrate with the iris dataset. First, an example like yours. Notice that I'm assigning x=Sepal.Length. In the iris dataset, Sepal.Length is numeric, so continuous.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_boxplot()
This is similar to yours. The reason is that the boxplot is drawn by grouping according to x and then calculating statistics on those groups. If a variable is continuous, there are no "groups", even if data is replicated (like as in x above). One way to make groups is to force the data to be discrete, as in factor(Sepal.Length). Here's what it looks like when you do that:
ggplot(iris, aes(x=factor(Sepal.Length), y=Sepal.Width)) +
geom_boxplot()
The other way to have this same effect would be to use the group= aesthetic, which does what you might think: it groups according to that column in the dataset.
ggplot(iris, aes(x=Sepal.Length), y=Sepal.Width, group=Sepal.Length)) +
geom_boxplot()

compare boxplots with a single value

I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionaly the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly screwed up in the facetting. Basically any work around would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
The problem only lies int the use of test_dat$y inside the color aes. Never use $ in aes, ggplot will mess up.
Anyway, I think you plot would improve if you use a geom_hline for the benchmark, instead of hacking in a single value boxplot:
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")

ggplot scale_fill_discrete(breaks = user_countries) creates a second, undesired legend

I am trying to change the factor level ordering of a data frame column to control the legend ordering and ggplot coloring of factor levels specified by country name. Here is my dataframe country_hours:
countries hours
1 Brazil 17
2 Mexico 13
3 Poland 20
4 Indonesia 2
5 Norway 20
6 Poland 20
Here is how I try to plot subsets of the data frame depending on a list of selected countries, user_countries:
make_country_plot<-function(user_countries, country_hours_pre)
{
country_hours = country_hours_pre[which(country_hours_pre$countries %in% user_countries) ,]
country_hours$countries = factor(country_hours$countries, levels = c(user_countries))
p = ggplot(data=country_hours, aes(x=hours, color=countries))
for(name in user_countries){
p = p + geom_bar( data=subset(country_hours, countries==name), aes(y = (..count..)/sum(..count..), fill=countries), binwidth = 1, alpha = .3)
}
p = p + scale_y_continuous(labels = percent) + geom_density(size = 1, aes(color=countries), adjust=1) +
ggtitle("Baltic countries") + theme(plot.title = element_text(lineheight=.8, face="bold")) + scale_fill_discrete(breaks = user_countries)
}
This works great in that the coloring goes according to my desired order as does the top legend, but a second legend appears and shows a different order. Without scale_fill_discrete(breaks = user_countries) I do not get my desired order, but I also do not get two legends. In the plot shown below, the desired order, given by user_countries was
user_countries = c("Lithuania", "Latvia", "Estonia")
I'd like to get rid of this second legend. How can I do it?
I also have another problem, which is that the plotting/coloring is inconsistent between different plots. I'd like the "first" country to always be blue, but it's not always blue. Also the 'real' legend (darker/solid colors) is not always in the same position - sometimes it's below the incorrect/black legend. Why does this happen and how can I make this consistent across plots?
Also, different plots have different numbers of factor groups, sometimes more than 9, so I'd rather stick with standard ggplot coloring as most of the solutions for defining your own colors seem limited in the number of colors you can do (How to assign colors to categorical variables in ggplot2 that have stable mapping?)
You are mapping to two different aesthetics (color and fill) but you changed the scale specifications for only one of them. Doing this will always split a previously combined legend. There is a nice example of this on this page
To keep your legends combined, you'll want to add scale_color_discrete(breaks = user_countries) in addition to scale_fill_discrete(breaks = user_countries).
I don't have enough reputation to comment, but this previous question has a comprehensive answer.
Short answer is to change geom_density so that it doesn't map countries to color. That means just taking everything inside the aes() and putting it outside.
geom_density(size = 1, color=countries, adjust=1)
(This should work. Don't have an example to confirm).

How do you plot two vectors on x-axis and another on y-axis in ggplot2

I am trying to plot two vectors with different values, but equal length on the same graph as follows:
a<-23.33:52.33
b<-33.33:62.33
days<-1:30
df<-data.frame(x,y,days)
a b days
1 23.33 33.33 1
2 24.33 34.33 2
3 25.33 35.33 3
4 26.33 36.33 4
5 27.33 37.33 5
etc..
I am trying to use ggplot2 to plot x and y on the x-axis and the days on the y-axis. However, I can't figure out how to do it. I am able to plot them individually and combine the graphs, but I want just one graph with both a and b vectors (different colors) on x-axis and number of days on y-axis.
What I have so far:
X<-ggplot(df, aes(x=a,y=days)) + geom_line(color="red")
Y<-ggplot(df, aes(x=b,y=days)) + geom_line(color="blue")
Is there any way to define the x-axis for both a and b vectors? I have also tried using the melt long function, but got stuck afterwards.
Any help is much appreciated. Thank you
I think the best way to do it is via a the approach of melting the data (as you have mentioned). Especially if you are going to add more vectors. This is the code
library(reshape2)
library(ggplot2)
a<-23:52
b<-33:62
days<-1:30
df<-data.frame(x=a,y=b,days)
df_molten=melt(df,id.vars="days")
ggplot(df_molten) + geom_line(aes(x=value,y=days,color=variable))
You can also change the colors manually via scale_color_manual.
A simpler solution is to use only ggplot. The following code will work in your case
a<-23.33:52.33
b<-33.33:62.33
days<-1:30
df<-data.frame(a,b,days)
ggplot(data = df)+
geom_line(aes(x = df$days,y = df$a), color = "blue")+
geom_line(aes(x = df$days,y = df$b), color = "red")
I added the colors, you might want to use them to differentiate between your variables.

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location), #$
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2) #$
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) + #$
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(height = 0, width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to to combine those two plots, by adding "plot2" (Station depth scatterplot) to "plot1" boxplots (Detection depths). They are both looking at the same variables (depth and station), and must be the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the `col=f1`` argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.

Resources