ggplot2 manual legend inside a plot - r

When I run the code below, it creates a density plot and a histogram. I've added two vertical lines to show the mean and the median. I want to display a legend ("Mean" as a dashed red line and "Median" as a green line) in the top-right corner of the plot. You can run this code as is, since the USArrests data set ships with R.
ggplot(USArrests,aes(x=Murder)) +
geom_histogram(aes(y=..density..),binwidth=.5,col="black",fill="white") +
geom_density(alpha=.2,fill="coral") +
geom_vline(aes(xintercept=mean(Murder,na.rm=T)),color="red",linetype="dashed",size=1) +
geom_vline(aes(xintercept=median(Murder,na.rm=T)),color="green",size=1)
My question is: should I use theme() or something else to display a legend in my plot?

No need for extra data.frames.
library(ggplot2)
ggplot(USArrests,aes(x=Murder)) +
geom_histogram(aes(y=..density..),binwidth=.5,col="black",fill="white") +
geom_density(alpha=.2,fill="coral") +
geom_vline(aes(xintercept=mean(Murder,na.rm=TRUE), color="mean", linetype="mean"), size=1) +
geom_vline(aes(xintercept=median(Murder,na.rm=TRUE), color="median", linetype="median"), size=1) +
scale_color_manual(name=NULL, values=c(mean="red", median="green"), drop=FALSE) +
scale_linetype_manual(name=NULL, values=c(mean="dashed", median="solid")) +
theme(legend.position=c(0.9, 0.9))
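On recent ggplot2 releases the code above still runs but emits deprecation warnings. Below is a minimal sketch of the same plot in the newer spelling, assuming ggplot2 3.5.0 or later: after_stat() replaces ..density.., linewidth replaces size for lines, and legend.position.inside carries the coordinates. The object name p is only for illustration.

```r
library(ggplot2)

# Assumes ggplot2 >= 3.5.0 (legend.position.inside) and >= 3.4.0 (linewidth).
p <- ggplot(USArrests, aes(x = Murder)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "coral") +
  # Mapping colour and linetype to constant strings creates the legend entries.
  geom_vline(aes(xintercept = mean(Murder, na.rm = TRUE),
                 colour = "mean", linetype = "mean"), linewidth = 1) +
  geom_vline(aes(xintercept = median(Murder, na.rm = TRUE),
                 colour = "median", linetype = "median"), linewidth = 1) +
  # Identical names (NULL) and labels let ggplot2 merge both scales into one legend.
  scale_colour_manual(name = NULL, values = c(mean = "red", median = "green")) +
  scale_linetype_manual(name = NULL, values = c(mean = "dashed", median = "solid")) +
  theme(legend.position = "inside", legend.position.inside = c(0.9, 0.9))

p
```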

You're probably better off creating an additional data.frame of the summary statistics
and then adding this to the plot instead of trying to fiddle around with manually creating
each legend element. Legend position can be adjusted with theme(legend.position = c())
library("ggplot2")
library("reshape2")
library("dplyr")
# Summary data.frame
summary_df <- USArrests %>%
summarise(Mean = mean(Murder), Median = median(Murder)) %>%
melt(variable.name="statistic")
# Specifying colors and linetypes for the legend since you wanted to map both color and linetype
# to the same variable.
summary_cols <- c("Mean" = "red", "Median" = "green")
summary_linetypes <- c("Mean" = 2, "Median" = 1)
ggplot(USArrests,aes(x=Murder)) +
geom_histogram(aes(y=..density..),binwidth=.5,col="black",fill="white") +
geom_density(alpha=.2,fill="coral") +
geom_vline(data = summary_df, aes(xintercept = value, color = statistic,
lty = statistic)) +
scale_color_manual(values = summary_cols) +
scale_linetype_manual(values = summary_linetypes) +
theme(legend.position = c(0.85,0.85))
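If you would rather not load reshape2 and dplyr just for the summary table, the same two-row data frame can be built in base R. This is only a sketch; the column names statistic and value are chosen to match summary_df above, and the histogram layer is left out for brevity.

```r
library(ggplot2)

# One row per statistic; column names match the summary_df used above.
summary_df <- data.frame(
  statistic = c("Mean", "Median"),
  value     = c(mean(USArrests$Murder), median(USArrests$Murder))
)

p <- ggplot(USArrests, aes(x = Murder)) +
  geom_density(alpha = 0.2, fill = "coral") +
  # Each row of summary_df becomes one vertical line with its own legend key.
  geom_vline(data = summary_df,
             aes(xintercept = value, colour = statistic, linetype = statistic)) +
  scale_colour_manual(values = c(Mean = "red", Median = "green")) +
  scale_linetype_manual(values = c(Mean = 2, Median = 1))

p
```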

Related

How to add percentages on top of an histogram when data is grouped

This is not my data (for confidentiality reasons), but I have tried to create a reproducible example using a dataset included in the ggplot2 library. I have a histogram summarizing the value of some variable by group (a factor with 2 levels). First, I wanted proportions of the total rather than counts, so I used this code:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>% as.data.frame() %>% filter(cut=="Premium" | cut=="Ideal")
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="Count") +
theme_bw() + theme(legend.position="none")
It gave me this as a result.
The issue is that I would like to print the numeric percentages on top of the bins and haven't found a way to do so.
As I saw it done for printing counts elsewhere, I attempted to print them using stat_bin(), including the same y and label values as the y in geom_histogram, thinking it would print the right numbers:
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
stat_bin(aes(y=after_stat(width*density),label=after_stat(width*density*100)),geom="text",vjust=-.5) +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="%") +
theme_bw() + theme(legend.position="none")
However, it prints far more values than there are bins, the values do not appear consistent with the bar heights, and they are not placed according to vjust=-.5, which should make them appear slightly above the bars.
What am I missing here? I know that if there was no grouping variable/facet_wrap, I could use after_stat(count/sum(count)) instead of after_stat(width*density) and it seems that it would have fixed my issue. But I need the histograms for both groups to appear next to each other. Thanks in advance!
You have to use the same arguments in stat_bin as for the histogram when adding your labels to get same binning for both layers and to align the labels with the bars:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>%
as.data.frame() %>%
filter(cut == "Premium" | cut == "Ideal")
ggplot(df_example, aes(x = z, fill = cut)) +
geom_histogram(aes(y = after_stat(width * density)),
binwidth = 1, center = 0.5, col = "black"
) +
stat_bin(
aes(
y = after_stat(width * density),
label = scales::number(after_stat(width * density), scale = 100, accuracy = 1)
),
geom = "text", binwidth = 1, center = 0.5, vjust = -.25
) +
facet_wrap(~cut) +
scale_x_continuous(breaks = seq(0, 9, by = 1)) +
scale_y_continuous(labels = scales::number_format(scale = 100)) +
scale_fill_manual(values = c("#CC79A7", "#009E73")) +
labs(x = "Depth (mm)", y = "%") +
theme_bw() +
theme(legend.position = "none")
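The reason width * density gives per-facet percentages is that stat_bin computes density separately for each group as count / (group total * bin width), so multiplying by the width leaves each bin's share of its own group. A rough base-R cross-check of the numbers that should appear on the labels, assuming bins of width 1 starting at 0 (which is what binwidth = 1, center = 0.5 amounts to); the names breaks and props are mine:

```r
library(ggplot2)  # used here only for the diamonds data set

df_example <- subset(as.data.frame(diamonds), cut %in% c("Premium", "Ideal"))
df_example$cut <- droplevels(df_example$cut)  # drop the unused cut levels

# Bin edges 0, 1, 2, ... up to the first integer at or above the largest z.
breaks <- seq(0, ceiling(max(df_example$z)), by = 1)

# Per-group share of observations in each bin (count / group total).
props <- sapply(split(df_example$z, df_example$cut), function(z) {
  counts <- table(cut(z, breaks = breaks, include.lowest = TRUE))
  counts / sum(counts)
})

round(100 * props)  # percentages per bin, one column per cut
```

Each column sums to 1 before rounding, which is the within-group normalisation that after_stat(count / sum(count)) would not give you once a grouping variable is present.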

Creating a legend with shapes using ggplot2

I have created the following code for a graph in which four fitted lines and corresponding points are plotted. I have problems with the legend. For some reason I cannot find a way to assign the different shapes of the points to a variable name. Also, the colours do not line up with the actual colours in the graph.
y1 <- c(1400,1200,1100,1000,900,800)
y2 <- c(1300,1130,1020,970,830,820)
y3 <- c(1340,1230,1120,1070,940,850)
y4 <- c(1290,1150,1040,920,810,800)
df <- data.frame(x,y1,y2,y3,y4)
g <- ggplot(df, aes(x=x), shape="shape") +
geom_smooth(aes(y=y1), colour="red", method="auto", se=FALSE) + geom_point(aes(y=y1),shape=14) +
geom_smooth(aes(y=y2), colour="blue", method="auto", se=FALSE) + geom_point(aes(y=y2),shape=8) +
geom_smooth(aes(y=y3), colour="green", method="auto", se=FALSE) + geom_point(aes(y=y3),shape=6) +
geom_smooth(aes(y=y4), colour="yellow", method="auto", se=FALSE) + geom_point(aes(y=y4),shape=2) +
ylab("x") + xlab("y") + labs(title="overview")
geom_line(aes(y=1000), linetype = "dashed")
theme_light() +
theme(plot.title = element_text(color="black", size=12, face="italic", hjust = 0.5)) +
scale_shape_binned(name="Value g", values=c(y1="14",y2="8",y3="6",y4="2"))
print(g)
I am wondering why the colours don't match up and how I can construct such a legend that it is clear which shape corresponds to which variable name.
While you can add the legend manually via scale_shape_manual, perhaps the adequate solution would be to reshape your data (try using tidyr::pivot_longer() on y1:y4 variables), and then assigning the resulting variable to the shape aesthetic (you can then manually set the colors to your liking). You would then need to use a single geom_point() and geom_smooth() instead of four of each.
Also, you're missing a reproducible example (what are the values of x?) and your code emits some warnings while trying to perform loess smoothing (because there are fewer data points than are needed to perform it).
Update (2021-12-12)
Here's a reproducible example in which we reshape the original data and feed it to ggplot using its aes() function to automatically plot different geom_point and geom_smooth for each "y group". I made up the values for the x variable.
library(ggplot2)
library(tidyr)
x <- 1:6
y1 <- c(1400,1200,1100,1000,900,800)
y2 <- c(1300,1130,1020,970,830,820)
y3 <- c(1340,1230,1120,1070,940,850)
y4 <- c(1290,1150,1040,920,810,800)
df <- data.frame(x,y1,y2,y3,y4)
data2 <- df %>%
pivot_longer(y1:y4, names_to = "group", values_to = "y")
ggplot(data2, aes(x, y, color = group, shape = group)) +
geom_point(size = 3) + # increased size for increased visibility
geom_smooth(method = "auto", se = FALSE)
Run the code line by line in RStudio and use it to inspect data2; I think it'll make more sense that way. Here's the resulting output:
Another update
Freek19, in your second example you'll need to specify both the shape and color scales manually, so that ggplot2 considers them to be the same, like so:
library(ggplot2)
data <- ... # from your previous example
ggplot(data, aes(x, y, shape = group, color = group)) +
geom_smooth() +
geom_point(size = 3) +
scale_shape_manual("Program type", values=c(1, 2, 3,4,5)) +
scale_color_manual("Program type", values=c(1, 2, 3,4,5))
Hope this helps.
I managed to get close to what I want, using:
library(ggplot2)
data <- data.frame(x = c(0,0.02,0.04,0.06,0.08,0.1),
y = c(1400,1200,1100,1000,910,850, #y1
1300,1130,1010,970,890,840, #y2
1200,1080,980,950,880,820, #y3
1100,1050,960,930,830,810, #y4
1050,1000,950,920,810,800), #y5
group = rep(c("5%","6%","7%","8%","9%"), each = 6))
data
Values <- ggplot(data, aes(x, y, shape = group, color = group)) + # Create line plot with default colors
geom_smooth(aes(color=group)) + geom_point(aes(shape=group),size=3) +
scale_shape_manual(values=c(1, 2, 3,4,5))+
geom_line(aes(y=1000), linetype = "dashed") +
ylab("V(c)") + xlab("c") + labs(title="Valuation")+
theme_light() +
theme(plot.title = element_text(color="black", size=12, face="italic", hjust = 0.5))+
labs(group="Program Type")
Values
I am now only stuck with two legends. I want to change the name of both, because otherwise they overlap, but I am not sure how to do this.
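A possible way to end up with a single legend, sketched under the assumption that colour and shape are mapped to the same variable: give both aesthetics the same title in labs(). labs(group = ...) has no effect on legends because group is not an aesthetic with a guide. In this sketch geom_smooth() is swapped for geom_line() to avoid the loess warnings on six points per group, and geom_hline() draws the reference line.

```r
library(ggplot2)

data <- data.frame(
  x = c(0, 0.02, 0.04, 0.06, 0.08, 0.1),   # recycled once per group
  y = c(1400, 1200, 1100, 1000, 910, 850,
        1300, 1130, 1010, 970, 890, 840,
        1200, 1080, 980, 950, 880, 820,
        1100, 1050, 960, 930, 830, 810,
        1050, 1000, 950, 920, 810, 800),
  group = rep(c("5%", "6%", "7%", "8%", "9%"), each = 6)
)

p <- ggplot(data, aes(x, y, shape = group, colour = group)) +
  geom_line() +
  geom_point(size = 3) +
  geom_hline(yintercept = 1000, linetype = "dashed") +
  scale_shape_manual(values = c(1, 2, 3, 4, 5)) +
  # Same title for both aesthetics, so ggplot2 can merge them into one legend.
  labs(x = "c", y = "V(c)", title = "Valuation",
       colour = "Program type", shape = "Program type") +
  theme_light()

p
```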

ggplot specific thick line

How can I plot one line thicker than the other? I tried geom_line(size=X), but that increases the thickness of both lines. Let's say I would like to increase the thickness of the line for the first column; how would I approach this?
a <- (cbind(rnorm(100),rnorm(100))) #nav[,1:10]
sa <- stack(as.data.frame(a))
sa$x <- rep(seq_len(nrow(a)), ncol(a))
require("ggplot2")
p<-qplot(x, values, data = sa, group = ind, colour = ind, geom = "line")
p + theme(legend.position = "none")+ylab("Millions")+xlab("Age")+
geom_line( size = 1.5)
You need to map line thickness to the variable:
p + geom_line(aes(size = ind))
To control the thickness use scale_size_manual():
p + geom_line(aes(size = ind)) +
scale_size_manual(values = c(0.1, 1))
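In ggplot2 3.4.0 and later, line thickness is controlled by the linewidth aesthetic rather than size, and qplot() is deprecated. A sketch of the same idea in that newer spelling, assuming those versions; the seed and the widths 1.5 and 0.5 are arbitrary choices of mine:

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the random series are reproducible
a  <- cbind(rnorm(100), rnorm(100))
sa <- stack(as.data.frame(a))            # columns: values, ind (levels V1, V2)
sa$x <- rep(seq_len(nrow(a)), ncol(a))

p <- ggplot(sa, aes(x, values, colour = ind, linewidth = ind)) +
  geom_line() +
  # Named values make it explicit which series gets which thickness.
  scale_linewidth_manual(values = c(V1 = 1.5, V2 = 0.5)) +
  labs(x = "Age", y = "Millions") +
  theme(legend.position = "none")

p
```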

Add arbitrary series with legend in ggplot2?

I have a bunch of data - three timeseries (model group means), coloured by group, with standard deviation represented by geom_ribbon. By default they have a nice legend on the side. I also have a single timeseries of observations, that I want to overlay over the plot (without the geom_ribbon), like this:
df <- data.frame(year=1991:2010, group=c(rep('group1',20), rep('group2',20), rep('group3',20)), mean=c(cumsum(abs(rnorm(20))),cumsum(abs(rnorm(20))),cumsum(abs(rnorm(20)))),sd=3+rnorm(60))
obs_df <- data.frame(year=1991:2010, value=cumsum(abs(rnorm(20))))
ggplot(df, aes(x=year, y=mean)) + geom_line(aes(colour=group)) + geom_ribbon(aes(ymax=mean+sd, ymin=mean-sd, fill=group), alpha = 0.2) +geom_line(data=obs_df, aes(x=year, y=value))
But the observations do not appear in the legend, because that line is not mapped to a colour (I want it black). How can I add the observations to the legend?
First, create a combined data frame of df and obs_df:
dat <- rbind(df, data.frame(year = obs_df$year,
group = "obs", mean = obs_df$value, sd = 0))
Plot:
ggplot(dat, aes(x=year, y=mean)) +
geom_line(aes(colour=group)) +
geom_ribbon(aes(ymax=mean+sd, ymin=mean-sd, fill=group), alpha = 0.2) +
scale_colour_manual(values = c("red", "green", "blue", "black")) +
scale_fill_manual(values = c("red", "green", "blue", NA))
I'm guessing you made an error with your construction of 'obs_df'. If you create it with year = 1991:2010 it makes more sense in the context of the rest of the data and it gives you the plot you are hoping for with the ggplot call unchanged.
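If you prefer to keep df and obs_df separate, another option is to map colour to a constant string inside aes() for the observation layer, which creates a legend entry that a manual scale can then colour black. This is a sketch of that alternative, not the approach used in the answer above; the seed is arbitrary, and the fill legend is hidden so that only the colour legend (with four entries) remains.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for reproducible random data
df <- data.frame(
  year  = 1991:2010,                                   # recycled for each group
  group = rep(c("group1", "group2", "group3"), each = 20),
  mean  = c(cumsum(abs(rnorm(20))), cumsum(abs(rnorm(20))), cumsum(abs(rnorm(20)))),
  sd    = 3 + rnorm(60)
)
obs_df <- data.frame(year = 1991:2010, value = cumsum(abs(rnorm(20))))

p <- ggplot(df, aes(x = year, y = mean)) +
  geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd, fill = group), alpha = 0.2) +
  geom_line(aes(colour = group)) +
  # colour = "obs" is a constant mapped through the colour scale, not a literal colour.
  geom_line(data = obs_df, aes(y = value, colour = "obs")) +
  scale_colour_manual(values = c(group1 = "red", group2 = "green",
                                 group3 = "blue", obs = "black")) +
  scale_fill_manual(values = c(group1 = "red", group2 = "green", group3 = "blue")) +
  guides(fill = "none")

p
```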

Overlaying histograms with ggplot2 in R

I am new to R and am trying to plot 3 histograms onto the same graph.
Everything worked fine, but my problem is that you don't see where 2 histograms overlap - they look rather cut off.
When I make density plots, it looks perfect: each curve is surrounded by a black frame line, and colours look different where curves overlap.
Can someone tell me if something similar can be achieved with the histograms in the 1st picture? This is the code I'm using:
lowf0 <-read.csv (....)
mediumf0 <-read.csv (....)
highf0 <-read.csv(....)
lowf0$utt<-'low f0'
mediumf0$utt<-'medium f0'
highf0$utt<-'high f0'
histogram<-rbind(lowf0,mediumf0,highf0)
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
Using @joran's sample data,
ggplot(dat, aes(x=xx, fill=yy)) + geom_histogram(alpha=0.2, position="identity")
Note that the default position of geom_histogram is "stack"; see the "position adjustment" section of the geom_histogram documentation.
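To see the difference between the two positions numerically, here is a small self-contained sketch. The data mirror the sample data frame used elsewhere in this thread; the seed, binwidth, and black border are my own choices. layer_data() returns the computed bars for a layer.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed
dat <- data.frame(
  xx = c(runif(100, 20, 50), runif(100, 40, 80), runif(100, 0, 30)),
  yy = rep(letters[1:3], each = 100)
)

# Default position = "stack": bars of different groups sit on top of each other.
p_stack <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(alpha = 0.2, binwidth = 2, colour = "black")

# position = "identity": every group's bars start at zero and overlap.
p_identity <- ggplot(dat, aes(x = xx, fill = yy)) +
  geom_histogram(alpha = 0.2, binwidth = 2, colour = "black",
                 position = "identity")

stacked  <- layer_data(p_stack)
overlaid <- layer_data(p_identity)

range(stacked$ymin)   # some bars start above zero
range(overlaid$ymin)  # all bars start at zero
```

The black border (colour = "black") is what gives each histogram a visible outline, similar to the framed look of the density plots.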
Your current code:
ggplot(histogram, aes(f0, fill = utt)) + geom_histogram(alpha = 0.2)
is telling ggplot to construct one histogram using all the values in f0 and then color the bars of this single histogram according to the variable utt.
What you want instead is to create three separate histograms, with alpha blending so that they are visible through each other. So you probably want to use three separate calls to geom_histogram, where each one gets its own data frame and fill:
ggplot(histogram, aes(f0)) +
geom_histogram(data = lowf0, fill = "red", alpha = 0.2) +
geom_histogram(data = mediumf0, fill = "blue", alpha = 0.2) +
geom_histogram(data = highf0, fill = "green", alpha = 0.2)
Here's a concrete example with some output:
dat <- data.frame(xx = c(runif(100,20,50),runif(100,40,80),runif(100,0,30)),yy = rep(letters[1:3],each = 100))
ggplot(dat,aes(x=xx)) +
geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)
which produces something like this:
Edited to fix typos; you wanted fill, not colour.
While only a few lines are required to plot multiple/overlapping histograms in ggplot2, the results aren't always satisfactory. There needs to be proper use of borders and coloring to ensure the eye can differentiate between histograms.
The following functions balance border colors, opacities, and superimposed density plots to enable the viewer to differentiate among distributions.
Single histogram:
plot_histogram <- function(df, feature) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)))) +
geom_histogram(aes(y = ..density..), alpha=0.7, fill="#33AADE", color="black") +
geom_density(alpha=0.3, fill="red") +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
print(plt)
}
Multiple histogram:
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
Usage:
Simply pass your data frame into the above functions along with desired arguments:
plot_histogram(iris, 'Sepal.Width')
plot_multi_histogram(iris, 'Sepal.Width', 'Species')
The extra parameter in plot_multi_histogram is the name of the column containing the category labels.
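As an aside, eval(parse(text = ...)) can be avoided: ggplot2's aes() supports the .data pronoun, so a column named by a string can be written as .data[[feature]]. Below is a hypothetical variant of the function above written that way (the name plot_multi_histogram_tidy is mine, and after_stat() and linewidth assume ggplot2 3.4.0 or later). As in the original, the dashed line marks the overall mean.

```r
library(ggplot2)

# Hypothetical variant of plot_multi_histogram; not the answer author's code.
plot_multi_histogram_tidy <- function(df, feature, label_column) {
  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    geom_histogram(aes(y = after_stat(density)), alpha = 0.7,
                   position = "identity", colour = "black", bins = 30) +
    geom_density(alpha = 0.7) +
    # Overall mean of the feature, drawn once.
    geom_vline(xintercept = mean(df[[feature]]), colour = "black",
               linetype = "dashed", linewidth = 1) +
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_tidy(iris, "Sepal.Width", "Species")
p
```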
We can see this more dramatically by creating a dataframe with many different distribution means:
a <-data.frame(n=rnorm(1000, mean = 1), category=rep('A', 1000))
b <-data.frame(n=rnorm(1000, mean = 2), category=rep('B', 1000))
c <-data.frame(n=rnorm(1000, mean = 3), category=rep('C', 1000))
d <-data.frame(n=rnorm(1000, mean = 4), category=rep('D', 1000))
e <-data.frame(n=rnorm(1000, mean = 5), category=rep('E', 1000))
f <-data.frame(n=rnorm(1000, mean = 6), category=rep('F', 1000))
many_distros <- do.call('rbind', list(a,b,c,d,e,f))
Passing data frame in as before (and widening chart using options):
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, 'n', 'category')
To add a separate vertical line for each distribution:
plot_multi_histogram <- function(df, feature, label_column, means) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)), fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.7, position="identity", aes(y = ..density..), color="black") +
geom_density(alpha=0.7) +
geom_vline(xintercept=means, color="black", linetype="dashed", size=1) +
labs(x=feature, y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
The only change over the previous plot_multi_histogram function is the addition of means to the parameters, and changing the geom_vline line to accept multiple values.
Usage:
options(repr.plot.width = 20, repr.plot.height = 8)
plot_multi_histogram(many_distros, "n", 'category', c(1, 2, 3, 4, 5, 6))
Result:
Since I set the means explicitly in many_distros I can simply pass them in. Alternatively you can simply calculate these inside the function and use them that way.
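A sketch of that second option, computing one mean per category inside the function instead of passing them in. The helper group_means and the function name plot_multi_histogram_auto are hypothetical names introduced here, and the .data pronoun is used in place of eval(parse()).

```r
library(ggplot2)

# Hypothetical helper: one mean per category, as a named numeric vector.
group_means <- function(df, feature, label_column) {
  vapply(split(df[[feature]], df[[label_column]]), mean, numeric(1))
}

# Hypothetical variant that derives the vertical lines from the data itself.
plot_multi_histogram_auto <- function(df, feature, label_column) {
  means <- group_means(df, feature, label_column)
  ggplot(df, aes(x = .data[[feature]], fill = .data[[label_column]])) +
    geom_histogram(aes(y = after_stat(density)), alpha = 0.7,
                   position = "identity", colour = "black", bins = 30) +
    geom_density(alpha = 0.7) +
    # One dashed line per category mean.
    geom_vline(xintercept = unname(means), colour = "black", linetype = "dashed") +
    labs(x = feature, y = "Density", fill = label_column)
}

p <- plot_multi_histogram_auto(iris, "Sepal.Width", "Species")
p
```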
