Add arbitrary series with legend in ggplot2? - r

I have a bunch of data - three timeseries (model group means), coloured by group, with standard deviation represented by geom_ribbon. By default they have a nice legend on the side. I also have a single timeseries of observations, that I want to overlay over the plot (without the geom_ribbon), like this:
df <- data.frame(year=1991:2010, group=c(rep('group1',20), rep('group2',20), rep('group3',20)), mean=c(cumsum(abs(rnorm(20))),cumsum(abs(rnorm(20))),cumsum(abs(rnorm(20)))),sd=3+rnorm(60))
obs_df <- data.frame(year=1991:2010, value=cumsum(abs(rnorm(20))))
ggplot(df, aes(x=year, y=mean)) + geom_line(aes(colour=group)) + geom_ribbon(aes(ymax=mean+sd, ymin=mean-sd, fill=group), alpha = 0.2) +geom_line(data=obs_df, aes(x=year, y=value))
But the observations do not appear in the legend, because that line is not mapped to a colour (I want it black). How can I add the obs to the legend?

First, create a combined data frame of df and obs_df:
dat <- rbind(df, data.frame(year = obs_df$year,
group = "obs", mean = obs_df$value, sd = 0))
Plot:
ggplot(dat, aes(x=year, y=mean)) +
geom_line(aes(colour=group)) +
geom_ribbon(aes(ymax=mean+sd, ymin=mean-sd, fill=group), alpha = 0.2) +
scale_colour_manual(values = c("red", "green", "blue", "black")) +
scale_fill_manual(values = c("red", "green", "blue", NA))
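If you would rather keep obs_df separate, a minimal sketch of an alternative is to map colour to a constant string inside aes() for the observation layer, so that the colour scale gains one extra entry. The label "obs", the seed, and the colour choices below are illustrative assumptions, not part of the original answer. The colour and fill legends will probably be drawn separately here, because they no longer share the same entries.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the example data are reproducible
df <- data.frame(
  year  = rep(1991:2010, times = 3),
  group = rep(c("group1", "group2", "group3"), each = 20),
  mean  = c(cumsum(abs(rnorm(20))), cumsum(abs(rnorm(20))), cumsum(abs(rnorm(20)))),
  sd    = 3 + rnorm(60)
)
obs_df <- data.frame(year = 1991:2010, value = cumsum(abs(rnorm(20))))

p <- ggplot(df, aes(x = year, y = mean)) +
  geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd, fill = group), alpha = 0.2) +
  geom_line(aes(colour = group)) +
  # A constant string inside aes() becomes a legend entry named "obs"
  geom_line(data = obs_df, aes(y = value, colour = "obs")) +
  scale_colour_manual(values = c(group1 = "red", group2 = "green",
                                 group3 = "blue", obs = "black"))
p
```

Because the observation layer has no ribbon, no NA placeholder is needed in the fill scale with this approach.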

I'm guessing you made an error with your construction of 'obs_df'. If you create it with year = 1991:2010 it makes more sense in the context of the rest of the data and it gives you the plot you are hoping for with the ggplot call unchanged.

Related

Bar plot using continuous variable and factor to color bars

I have a dataset that looks like this, but much bigger:
County <- rep(c("Alameda", "Clallam", "Clatsop", "Contra Costa", "Coos", "Curry"), each=2)
Habitat <- rep(c("Seagrass", "Saltmarsh"), times = 6)
Acres <- c(892.03, 6841.43, 5092.35,214.74, 0, 463.06,3165.39,2159.99,263.21, 12.53, 0,83.1)
SVI<-rep(c(0.4701, 0.6146,0.5185,0.4057,0.529,0.8774), each=2)
df <- data.frame(County, Habitat, Acres, SVI)
I would like to make a barplot that shows the number of acres for seagrass and saltmarsh by county but I would like the color of the barplot to reflect the SVI value.
So ideally I would have bars in a range of shades of pink that reflect seagrass + SVI value, and bars in a range of shades of blue that reflect saltmarsh + SVI value. I figured out how to do this for two discrete values, but not for one categorical and one continuous variable.
So far I have:
library(ggplot2)
p<-ggplot(data=df, aes(x=County, y=Acres, fill=Habitat, color=SVI)) +
geom_bar(stat="identity", position=position_dodge())
But that's not exactly what I want.
Any suggestions?
Thank you,
Annick
One option is you can use alpha to set the transparency, which will give you the result of a range of colors within a given variable.
ggplot(data=df, aes(x=County, y=Acres, fill=Habitat, alpha=SVI, color = Habitat)) +
geom_bar(stat="identity", position=position_dodge())
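To get closer to the pink and blue shades asked for, the alpha approach can be combined with a manual fill scale. The sketch below assumes the colour names "deeppink" and "royalblue" and an alpha range of 0.3 to 1. These are illustrative choices, not something taken from the question.

```r
library(ggplot2)

# Same data as in the question, with Acres and SVI stored as numbers
df <- data.frame(
  County  = rep(c("Alameda", "Clallam", "Clatsop", "Contra Costa", "Coos", "Curry"), each = 2),
  Habitat = rep(c("Seagrass", "Saltmarsh"), times = 6),
  Acres   = c(892.03, 6841.43, 5092.35, 214.74, 0, 463.06,
              3165.39, 2159.99, 263.21, 12.53, 0, 83.1),
  SVI     = rep(c(0.4701, 0.6146, 0.5185, 0.4057, 0.529, 0.8774), each = 2)
)

p <- ggplot(df, aes(x = County, y = Acres, fill = Habitat, alpha = SVI)) +
  geom_col(position = position_dodge()) +
  # One base colour per habitat; SVI then controls how saturated it looks
  scale_fill_manual(values = c(Seagrass = "deeppink", Saltmarsh = "royalblue")) +
  # Keep the lowest SVI visible instead of fading to fully transparent
  scale_alpha_continuous(range = c(0.3, 1))
p
```

The lower alpha bound is a judgment call: a value near 0 would make low-SVI bars almost invisible.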
Another option would be to use facets, which then allows you to vary colors more widely across County while maintaining the distinction between Habitats. Lots of different options for color scales. Some info here: https://ggplot2-book.org/scale-colour.html
ggplot(data=df, aes(x=County, y=Acres, fill=SVI)) +
geom_bar(stat="identity", position=position_dodge()) +
facet_wrap(~Habitat) +
scale_fill_viridis_c()
Data, with a tweak to make Acres and SVI numeric
County <- rep(c("Alameda", "Clallam", "Clatsop", "Contra Costa", "Coos", "Curry"), each=2)
Habitat <- rep(c("Seagrass", "Saltmarsh"), times = 6)
Acres <- c("892.03", "6841.43", "5092.35","214.74", "0", "463.06","3165.39","2159.99","263.21", "12.53", "0","83.1")
SVI<-rep(c("0.4701", "0.6146","0.5185", "0.04057", "0.529","0.8774"), each=2)
df <- data.frame(County, Habitat, Acres = as.numeric(Acres), SVI = as.numeric(SVI))
You will first need to convert your character variables to numbers:
df$Acres <- as.numeric(df$Acres)
df$SVI <- as.numeric(df$SVI)
Using two fill scales for dodged bars is tricky. It requires two separate layers, which will not dodge automatically, so they need a numeric axis that is faked to look discrete. It also requires two gradient fill scales, which ggplot2 can't do natively but which the ggnewscale extension package makes possible:
library(ggnewscale)
library(ggplot2)
ggplot(data = subset(df, Habitat == "Seagrass"),
aes(x = as.numeric(factor(County)) - 0.2, y = Acres, fill = SVI)) +
geom_col(width = 0.35, color = "gray50") +
scale_fill_gradient(low = "hotpink", high = "white", name = "Seagrass") +
ggnewscale::new_scale_fill() +
geom_col(data = subset(df, Habitat == "Saltmarsh"), width = 0.35,
aes(x = as.numeric(factor(County)) + 0.2, fill = SVI),
color = "gray50") +
scale_fill_gradient(low = "navy", high = "lightblue", name = "Saltmarsh") +
scale_x_continuous(breaks = seq_along(levels(factor(df$County))),
labels = levels(factor(df$County)), name = "County") +
theme_minimal(base_size = 16) +
theme(panel.grid.major.x = element_blank())

Creating a legend with shapes using ggplot2

I have created the following code for a graph in which four fitted lines and corresponding points are plotted. I have problems with the legend. For some reason I cannot find a way to assign the different shapes of the points to a variable name. Also, the colours do not line up with the actual colours in the graph.
y1 <- c(1400,1200,1100,1000,900,800)
y2 <- c(1300,1130,1020,970,830,820)
y3 <- c(1340,1230,1120,1070,940,850)
y4 <- c(1290,1150,1040,920,810,800)
df <- data.frame(x,y1,y2,y3,y4)
g <- ggplot(df, aes(x=x), shape="shape") +
geom_smooth(aes(y=y1), colour="red", method="auto", se=FALSE) + geom_point(aes(y=y1),shape=14) +
geom_smooth(aes(y=y2), colour="blue", method="auto", se=FALSE) + geom_point(aes(y=y2),shape=8) +
geom_smooth(aes(y=y3), colour="green", method="auto", se=FALSE) + geom_point(aes(y=y3),shape=6) +
geom_smooth(aes(y=y4), colour="yellow", method="auto", se=FALSE) + geom_point(aes(y=y4),shape=2) +
ylab("x") + xlab("y") + labs(title="overview")
geom_line(aes(y=1000), linetype = "dashed")
theme_light() +
theme(plot.title = element_text(color="black", size=12, face="italic", hjust = 0.5)) +
scale_shape_binned(name="Value g", values=c(y1="14",y2="8",y3="6",y4="2"))
print(g)
I am wondering why the colours don't match up and how I can construct such a legend that it is clear which shape corresponds to which variable name.
While you can add the legend manually via scale_shape_manual, perhaps the adequate solution would be to reshape your data (try using tidyr::pivot_longer() on y1:y4 variables), and then assigning the resulting variable to the shape aesthetic (you can then manually set the colors to your liking). You would then need to use a single geom_point() and geom_smooth() instead of four of each.
Also, you're missing a reproducible example (what are the values of x?), and your code emits some warnings while trying to perform loess smoothing (because there are fewer data points than it needs).
Update (2021-12-12)
Here's a reproducible example in which we reshape the original data and feed it to ggplot using its aes() function to automatically plot different geom_point and geom_smooth for each "y group". I made up the values for the x variable.
library(ggplot2)
library(tidyr)
x <- 1:6
y1 <- c(1400,1200,1100,1000,900,800)
y2 <- c(1300,1130,1020,970,830,820)
y3 <- c(1340,1230,1120,1070,940,850)
y4 <- c(1290,1150,1040,920,810,800)
df <- data.frame(x,y1,y2,y3,y4)
data2 <- df %>%
pivot_longer(y1:y4, names_to = "group", values_to = "y")
ggplot(data2, aes(x, y, color = group, shape = group)) +
geom_point(size = 3) + # increased size for increased visibility
geom_smooth(method = "auto", se = FALSE)
Run the code line by line in RStudio and inspect data2 along the way. I think it'll make more sense then. Here's the resulting output:
Another update
Freek19, in your second example you'll need to specify both the shape and color scales manually, so that ggplot2 considers them to be the same, like so:
library(ggplot2)
data <- ... # from your previous example
ggplot(data, aes(x, y, shape = group, color = group)) +
geom_smooth() +
geom_point(size = 3) +
scale_shape_manual("Program type", values=c(1, 2, 3,4,5)) +
scale_color_manual("Program type", values=c(1, 2, 3,4,5))
Hope this helps.
I managed to get close to what I want, using:
library(ggplot2)
data <- data.frame(x = c(0,0.02,0.04,0.06,0.08,0.1),
y = c(1400,1200,1100,1000,910,850, #y1
1300,1130,1010,970,890,840, #y2
1200,1080,980,950,880,820, #y3
1100,1050,960,930,830,810, #y4
1050,1000,950,920,810,800), #y5
group = rep(c("5%","6%","7%","8%","9%"), each = 6))
data
Values <- ggplot(data, aes(x, y, shape = group, color = group)) + # Create line plot with default colors
geom_smooth(aes(color=group)) + geom_point(aes(shape=group),size=3) +
scale_shape_manual(values=c(1, 2, 3,4,5))+
geom_line(aes(y=1000), linetype = "dashed") +
ylab("V(c)") + xlab("c") + labs(title="Valuation")+
theme_light() +
theme(plot.title = element_text(color="black", size=12, face="italic", hjust = 0.5))+
labs(group="Program Type")
Values
I am now stuck with two legends. I want to change the names of both, because otherwise they overlap. However, I am not sure how to do this.
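One possible way to end up with a single legend is to give the colour and shape scales the same title through labs(), since ggplot2 merges legends that share a title and the same entries. The sketch below is an assumption-laden variant of the code above: it uses a linear fit (method = "lm") rather than the default smoother to avoid loess warnings on six points per group, and the title "Program type" is just an example.

```r
library(ggplot2)

data <- data.frame(
  x = rep(c(0, 0.02, 0.04, 0.06, 0.08, 0.1), times = 5),
  y = c(1400, 1200, 1100, 1000, 910, 850,   # y1
        1300, 1130, 1010, 970, 890, 840,    # y2
        1200, 1080, 980, 950, 880, 820,     # y3
        1100, 1050, 960, 930, 830, 810,     # y4
        1050, 1000, 950, 920, 810, 800),    # y5
  group = rep(c("5%", "6%", "7%", "8%", "9%"), each = 6)
)

p <- ggplot(data, aes(x, y, shape = group, colour = group)) +
  geom_point(size = 3) +
  # Assumption: a straight-line fit is acceptable for illustration
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_shape_manual(values = c(1, 2, 3, 4, 5)) +
  # Same title for both aesthetics, so the two legends can be merged
  labs(colour = "Program type", shape = "Program type")
p
```

Note that labs(group = ...) has no visible effect, because the group aesthetic has no legend of its own.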

How to create legend with differing alphas for multiple geom_line plots in ggplot2 (R)

I have the following data on school enrollment for two years. I want to highlight data from school H in my plot and in the legend by giving it a different alpha.
library(tidyverse)
schools <- c("A","B","C","D","E",
"F","G","H","I","J")
yr2010 <- c(601,809,604,601,485,485,798,662,408,451)
yr2019 <- c(971,1056,1144,933,732,833,975,617,598,822)
data <- data.frame(schools,yr2010,yr2019)
I did some data management to get the data ready for plotting.
data2 <- data %>%
gather(key = "year", value = "students", 2:3)
data2a <- data2 %>%
filter(schools != "H")
data2b <- data2 %>%
filter(schools == "H")
Then I tried to graph the data using two separate geom_line plots, one for school H with default alpha and size=1.5, and one for the remaining schools with alpha=.3 and size=1.
ggplot(data2, aes(x=year,y=students,color=schools,group=schools)) +
theme_classic() +
geom_line(data = data2a, alpha=.3, size=1) +
scale_color_manual(values=c("red","orange","green","skyblue","aquamarine","purple",
"pink","brown","black")) +
geom_line(data = data2b, color="blue", size=1.5)
However, the school I want to highlight is not included in the legend. So I tried to include the color of school H in scale_color_manual instead of in the geom_line call.
ggplot(data2, aes(x=year,y=students,color=schools,group=schools)) +
theme_classic() +
geom_line(data = data2a, alpha=.3, size=1) +
scale_color_manual(values=c("red","orange","green","skyblue","aquamarine","purple",
"pink","blue","brown","black")) +
geom_line(data = data2b, size=1.5)
However, now the alphas in the legend are all the same, which doesn't highlight school H as much as I'd like.
How can I call the plot so that the legend matches the alpha of the line itself for all schools?
You need to map alpha and size to schools inside aes(), just as you did for color. Then you can use scale_alpha_manual and scale_size_manual to set the values you need. This also removes the need to create data2a and data2b.
See below code:
ggplot(data2, aes(x=year,y=students,color=schools,group=schools,
alpha=schools, size = schools)) +
theme_classic() +
geom_line() +
scale_color_manual(values=c("red","orange","green","skyblue","aquamarine","purple",
"pink","blue","brown","black")) +
scale_alpha_manual(values = c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,NA, 0.3, 0.3)) +
#for the default alpha, you can write 1 or NA
scale_size_manual(values= c(1,1,1,1,1,1,1,1.5,1,1))
I hope it will be useful.
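Typing out ten alpha values by position is easy to get wrong, so a sketch of a more robust variant builds named vectors from the school names instead. The variable names highlight, alpha_vals and width_vals are hypothetical. The code also assumes ggplot2 3.4 or later, where line thickness is mapped through linewidth rather than size.

```r
library(ggplot2)

schools <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
data2 <- data.frame(
  schools  = rep(schools, times = 2),
  year     = rep(c("yr2010", "yr2019"), each = 10),
  students = c(601, 809, 604, 601, 485, 485, 798, 662, 408, 451,
               971, 1056, 1144, 933, 732, 833, 975, 617, 598, 822)
)

highlight <- "H"  # hypothetical: the school to emphasise
# Named vectors, so each value is matched to its school by name, not by position
alpha_vals <- setNames(ifelse(schools == highlight, 1, 0.3), schools)
width_vals <- setNames(ifelse(schools == highlight, 1.5, 1), schools)

p <- ggplot(data2, aes(x = year, y = students, group = schools,
                       colour = schools, alpha = schools, linewidth = schools)) +
  geom_line() +
  scale_alpha_manual(values = alpha_vals) +
  scale_linewidth_manual(values = width_vals) +
  theme_classic()
p
```

Changing the highlighted school then only requires editing the highlight variable.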

ggplot2 manual legend inside a plot

When I run the code below, a density plot and histogram are created. I've added two vertical lines to show the mean and median. I want to display a legend ("Mean" as a dashed red line and "Median" as a green line) in the top-right corner of the plot. You can run this code as is, because the USArrests data frame is already available in R.
ggplot(USArrests,aes(x=Murder)) +
geom_histogram(aes(y=..density..),binwidth=.5,col="black",fill="white") +
geom_density(alpha=.2,fill="coral") +
geom_vline(aes(xintercept=mean(Murder,na.rm=T)),color="red",linetype="dashed",size=1) +
geom_vline(aes(xintercept=median(Murder,na.rm=T)),color="green",size=1)
My question is shall I use theme() or something else to display legend in my plot?
No need for extra data.frames.
library(ggplot2)
ggplot(USArrests,aes(x=Murder)) +
geom_histogram(aes(y=..density..),binwidth=.5,col="black",fill="white") +
geom_density(alpha=.2,fill="coral") +
geom_vline(aes(xintercept=mean(Murder,na.rm=TRUE), color="mean", linetype="mean"), size=1) +
geom_vline(aes(xintercept=median(Murder,na.rm=TRUE), color="median", linetype="median"), size=1) +
scale_color_manual(name=NULL, values=c(mean="red", median="green"), drop=FALSE) +
scale_linetype_manual(name=NULL, values=c(mean="dashed", median="solid")) +
theme(legend.position=c(0.9, 0.9))
You're probably better off creating an additional data.frame of the summary statistics and then adding this to the plot, instead of trying to fiddle around with manually creating each legend element. The legend position can be adjusted with theme(legend.position = c()).
library("ggplot2")
library("reshape2")
library("dplyr")
# Summary data.frame
summary_df <- USArrests %>%
summarise(Mean = mean(Murder), Median = median(Murder)) %>%
melt(variable.name="statistic")
# Specifying colors and linetypes for the legend since you wanted to map both color and linetype
# to the same variable.
summary_cols <- c("Mean" = "red", "Median" = "green")
summary_linetypes <- c("Mean" = 2, "Median" = 1)
ggplot(USArrests,aes(x=Murder)) +
geom_histogram(aes(y=..density..),binwidth=.5,col="black",fill="white") +
geom_density(alpha=.2,fill="coral") +
geom_vline(data = summary_df, aes(xintercept = value, color = statistic,
lty = statistic)) +
scale_color_manual(values = summary_cols) +
scale_linetype_manual(values = summary_linetypes) +
theme(legend.position = c(0.85,0.85))
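If you prefer not to load reshape2 and dplyr just for two numbers, the summary data frame can also be built directly in base R. This is a sketch rather than a drop-in replacement. It assumes ggplot2 3.3 or later for after_stat(), and it leaves the legend in its default position.

```r
library(ggplot2)

# Two-row summary table: one row per vertical line
summary_df <- data.frame(
  statistic = c("Mean", "Median"),
  value     = c(mean(USArrests$Murder), median(USArrests$Murder))
)

p <- ggplot(USArrests, aes(x = Murder)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "coral") +
  # Mapping colour and linetype to the same column yields one combined legend
  geom_vline(data = summary_df,
             aes(xintercept = value, colour = statistic, linetype = statistic)) +
  scale_colour_manual(values = c(Mean = "red", Median = "green")) +
  scale_linetype_manual(values = c(Mean = "dashed", Median = "solid"))
p
```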

Overlaying histograms with ggplot2 in R

I am new to R and am trying to plot 3 histograms onto the same graph.
Everything worked fine, but my problem is that you don't see where 2 histograms overlap - they look rather cut off.
When I make density plots, it looks perfect: each curve is surrounded by a black frame line, and colours look different where curves overlap.
Can someone tell me if something similar can be achieved with the histograms in the 1st picture? This is the code I'm using:
lowf0 <-read.csv (....)
mediumf0 <-read.csv (....)
highf0 <-read.csv(....)
lowf0$utt<-'low f0'
mediumf0$utt<-'medium f0'
highf0$utt<-'high f0'
histogram<-rbind(lowf0,mediumf0,highf0)
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
Using @joran's sample data,
ggplot(dat, aes(x=xx, fill=yy)) + geom_histogram(alpha=0.2, position="identity")
note that the default position of geom_histogram is "stack."
see "position adjustment" of this page:
geom_histogram documentation
Your current code:
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
is telling ggplot to construct one histogram using all the values in f0 and then color the bars of this single histogram according to the variable utt.
What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:
ggplot(histogram, aes(f0)) +
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)
Here's a concrete example with some output:
dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))
ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)
which produces something like this:
Edited to fix typos; you wanted fill, not colour.
While only a few lines are required to plot multiple/overlapping histograms in ggplot2, the results aren't always satisfactory. Borders and coloring need to be used properly so that the eye can differentiate between histograms.
The following functions balance border colors, opacities, and superimposed density plots to enable the viewer to differentiate among distributions.
Single histogram:
plot_histogram <- function(df, feature) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)))) +
geom_histogram(aes(y = ..density..), alpha=0.7, fill="#33AADE", color="black") +
geom_density(alpha=0.3, fill="red") +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
print(plt)
}
Multiple histogram:
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
Usage:
Simply pass your data frame into the above functions along with desired arguments:
plot_histogram(iris, 'Sepal.Width')
plot_multi_histogram(iris, 'Sepal.Width', 'Species')
The extra parameter in plot_multi_histogram is the name of the column containing the category labels.
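The functions above look columns up with eval(parse(text = ...)). A possibly cleaner sketch uses the .data pronoun, which ggplot2 supports inside aes() for column names held in strings. The function name plot_multi_histogram_tidy and the choice of bins = 30 are assumptions made for this illustration, and after_stat() assumes ggplot2 3.3 or later.

```r
library(ggplot2)

# Hypothetical variant of plot_multi_histogram that avoids eval(parse())
plot_multi_histogram_tidy <- function(df, feature, label_column) {
  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    geom_histogram(aes(y = after_stat(density)), alpha = 0.7,
                   position = "identity", colour = "black", bins = 30) +
    geom_density(alpha = 0.7) +
    # labs() sets the legend title directly, so guides() is not needed
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_tidy(iris, "Sepal.Width", "Species")
p
```

A side effect of .data[[...]] is that a misspelled column name fails with a clear error instead of silently evaluating arbitrary text.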
We can see this more dramatically by creating a dataframe with many different distribution means:
a <-data.frame(n=rnorm(1000, mean = 1), category=rep('A', 1000))
b <-data.frame(n=rnorm(1000, mean = 2), category=rep('B', 1000))
c <-data.frame(n=rnorm(1000, mean = 3), category=rep('C', 1000))
d <-data.frame(n=rnorm(1000, mean = 4), category=rep('D', 1000))
e <-data.frame(n=rnorm(1000, mean = 5), category=rep('E', 1000))
f <-data.frame(n=rnorm(1000, mean = 6), category=rep('F', 1000))
many_distros <- do.call('rbind', list(a,b,c,d,e,f))
Passing data frame in as before (and widening chart using options):
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, 'n', 'category')
To add a separate vertical line for each distribution:
plot_multi_histogram <- function(df, feature, label_column, means) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(xintercept=means, color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
The only change over the previous plot_multi_histogram function is the addition of means to the parameters, and changing the geom_vline line to accept multiple values.
Usage:
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, "n", 'category', c(1, 2, 3, 4, 5, 6))
Since I set the means explicitly in many_distros, I can simply pass them in. Alternatively, you can calculate them inside the function and use them there.
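As a rough sketch of that alternative, the per-group means can be computed with tapply() inside the function and handed to geom_vline(). The function name plot_multi_histogram_auto is hypothetical, and the .data pronoun and after_stat() are substitutions for the eval(parse()) and ..density.. idioms used above (they assume ggplot2 3.3 or later).

```r
library(ggplot2)

# Hypothetical variant that derives one dashed line per category from the data
plot_multi_histogram_auto <- function(df, feature, label_column) {
  # Mean of the feature within each category
  means <- tapply(df[[feature]], df[[label_column]], mean)

  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    geom_histogram(aes(y = after_stat(density)), alpha = 0.7,
                   position = "identity", colour = "black", bins = 30) +
    geom_density(alpha = 0.7) +
    geom_vline(xintercept = as.numeric(means), colour = "black",
               linetype = "dashed") +
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_auto(iris, "Sepal.Width", "Species")
p
```

With many_distros from above, the call would be plot_multi_histogram_auto(many_distros, "n", "category"), and the lines should fall near 1 through 6.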
