Nested legend based on colour and shape - r

I want to make an xy plot of nested groups (Group and Subgroup) where points are colored by Group and shaped by Subgroup. A minimal example is below:
DATA<-data.frame(
Group=c(rep("group1",10),rep("group2",10),rep("group3",10) ),
Subgroup = c(rep(c("1.1","1.2"),5), rep(c("2.1","2.2"),5), rep(c("3.1","3.2"),5)),
x=c(rnorm(10, mean=5),rnorm(10, mean=10),rnorm(10, mean=15)),
y=c(rnorm(10, mean=3),rnorm(10, mean=4),rnorm(10, mean=5))
)
ggplot(DATA, aes(x=x, y=y,colour=Group, shape=Subgroup) ) +
geom_point(size=3)
However, because in reality I have many more subgroups than can easily be distinguished with the available shapes, I want to repeat the same shapes within each Group. Below is the same code but with an additional column (Shape) specifying the shape:
DATA<-data.frame(
Group=c(rep("group1",10),rep("group2",10),rep("group3",10) ),
Subgroup = c(rep(c("1.1","1.2"),5), rep(c("2.1","2.2"),5), rep(c("3.1","3.2"),5)),
Shape = as.character(c(rep(c(1,2),15) ) ),
x=c(rnorm(10, mean=5),rnorm(10, mean=10),rnorm(10, mean=15)),
y=c(rnorm(10, mean=3),rnorm(10, mean=4),rnorm(10, mean=5))
)
ggplot(DATA, aes(x=x, y=y,colour=Group, shape=Shape) ) +
geom_point(size=3)
Now the shapes and colours are as I want them. However, the legend no longer lists the subgroups. What I want is a legend that lists all subgroups under each respective Group. Something like:
Group1
1.1
1.2
Group2
2.1
2.2
Group3
3.1
3.2
(Ideally, this would be a single nested legend. If nested legends are not possible, perhaps they can be three separate legends with the Groups as titles)
Is this something that can be achieved, and how?
Thanks

One option to achieve your desired result is the ggnewscale package, which allows multiple scales and legends for the same aesthetic.
To this end we have to
split the data by Group and plot each Group via a separate geom_point layer.
Additionally, each Group gets its own shape scale and legend, which we achieve via ggnewscale::new_scale.
Instead of mapping the color aesthetic, we set the color for each group as an argument, for which I use a named vector of colors.
Instead of copying and pasting the code for each group, I use purrr::imap to loop over the split dataset and add the layers dynamically.
One more note: by default the order of legends is set via a "magic algorithm". To get the groups in the right order we have to set the order explicitly via guide_legend.
library(ggplot2)
library(ggnewscale)
library(dplyr)
library(purrr)
library(tibble)
DATA_split <- split(DATA, DATA$Group)
# Vector of colors and shapes
colors <- setNames(scales::hue_pal()(length(DATA_split)), names(DATA_split))
shapes <- setNames(scales::shape_pal()(length(unique(DATA$Shape))), unique(DATA$Shape))
ggplot(mapping = aes(x = x, y = y)) +
  purrr::imap(DATA_split, function(x, y) {
    # Get Labels
    labels <- x[c("Shape", "Subgroup")] %>%
      distinct(Shape, Subgroup) %>%
      deframe()
    # Get order
    order <- as.numeric(gsub("^.*?(\\d+)$", "\\1", y))
    list(
      geom_point(data = x, aes(shape = Shape), color = colors[[y]], size = 3),
      scale_shape_manual(values = shapes, labels = labels, name = y, guide = guide_legend(order = order)),
      new_scale("shape")
    )
  })
DATA
set.seed(123)
DATA <- data.frame(
Group = c(rep("group1", 10), rep("group2", 10), rep("group3", 10)),
Subgroup = c(rep(c("1.1", "1.2"), 5), rep(c("2.1", "2.2"), 5), rep(c("3.1", "3.2"), 5)),
Shape = as.character(c(rep(c(1, 2), 15))),
x = c(rnorm(10, mean = 5), rnorm(10, mean = 10), rnorm(10, mean = 15)),
y = c(rnorm(10, mean = 3), rnorm(10, mean = 4), rnorm(10, mean = 5))
)
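As a small sketch of what the loop above feeds into each scale, the following base R snippet builds the same per-group lookup (Shape value to Subgroup label) that distinct() and deframe() produce, and shows how the legend order is parsed from the group name. The names label_lookup and legend_order are introduced here for illustration only.

```r
# Same grouping columns as DATA above (x and y are not needed here)
DATA <- data.frame(
  Group = rep(c("group1", "group2", "group3"), each = 10),
  Subgroup = c(rep(c("1.1", "1.2"), 5), rep(c("2.1", "2.2"), 5), rep(c("3.1", "3.2"), 5)),
  Shape = as.character(rep(c(1, 2), 15)),
  stringsAsFactors = FALSE
)

# One named vector per group: names are Shape values, values are Subgroup labels
label_lookup <- lapply(split(DATA, DATA$Group), function(d) {
  d <- unique(d[c("Shape", "Subgroup")])
  setNames(d$Subgroup, d$Shape)
})
print(label_lookup$group2)

# Legend order: the trailing digits of each group name
legend_order <- as.numeric(gsub("^.*?(\\d+)$", "\\1", names(label_lookup)))
print(legend_order)
```

Because every group reuses the Shape values "1" and "2", each scale_shape_manual call can share the same values vector and differ only in its labels and title.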

Related

Indicating the maximum values and adding corresponding labels on a ggplot

ggplot(data = dat) + geom_line(aes(x=foo,y=bar)) +geom_line(aes(x=foo_land,y=bar_land))
which creates a plot like the following:
I want to try and indicate the maximum values on this plot as well as add corresponding labels to the axis like:
The data for the maximum x and y values is stored in the dat file.
I was attempting to use
geom_hline() + geom_vline()
but I couldn't get this to work. I also believe that these lines will continue through the rest of the plot, which is not what I am trying to achieve. I should note that I would like to indicate the maximum y-value and its corresponding x value. The x-value is not indicated here since it is already labelled on the axis.
Reproducible example:
library(ggplot2)
col1 <- c(1,2,3)
col2 <- c(2,9,6)
df <- data.frame(col1,col2)
ggplot(data = df) +
geom_line(aes(x=col1,y=col2))
I would like to include a line which travels up from 2 on the x-axis and horizontally to the y-axis indicating the point 9, the maximum value of this graph.
Here's a start, although it does not make the axis text red where that maximal point is:
MaxLines <- data.frame(col1 = c(rep(df$col1[which.max(df$col2)], 2),
-Inf),
col2 = c(-Inf, rep(max(df$col2), 2)))
MaxLines creates an object that says where each of three points should be for two segments.
ggplot(data = df) +
  geom_line(aes(x = col1, y = col2)) +
  geom_path(data = MaxLines, aes(x = col1, y = col2),
            inherit.aes = FALSE, color = "red") +
  scale_x_continuous(breaks = c(seq(1, 3, by = 0.5), df$col1[which.max(df$col2)])) +
  scale_y_continuous(breaks = c(seq(2, 9, by = 2), max(df$col2)))
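To make the geometry concrete, here is a short sketch that rebuilds MaxLines from the example data and prints its three vertices. The helper name i_max is introduced here only for readability.

```r
df <- data.frame(col1 = c(1, 2, 3), col2 = c(2, 9, 6))

i_max <- which.max(df$col2)  # row holding the largest y value

# Three vertices: bottom edge -> maximum point -> left edge
MaxLines <- data.frame(
  col1 = c(rep(df$col1[i_max], 2), -Inf),
  col2 = c(-Inf, rep(max(df$col2), 2))
)
print(MaxLines)
```

The path runs from (2, -Inf) up to (2, 9) and then left to (-Inf, 9). ggplot2 places infinite position values at the panel edge, so the lines stop at the maximum point instead of crossing the whole plot as geom_hline() and geom_vline() would.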

R colour code plot by rownames for principal component analysis

I am attempting to complete a principal component analysis on a set of data containing columns of numeric data.
Assuming a dataset like this (in reality I have a preconfigured data frame; this one is for reproducibility):
v1 <- c(1,2,3,4,5,6,7)
v2 <- c(3,6,2,5,2,4,9)
v3 <- c(6,1,4,2,3,7,5)
dataset <-data.frame(v1,v2,v3)
row.names(dataset) <-c('New York', 'Seattle', 'Washington DC', 'Dallas', 'Chicago','Los Angeles','Minneapolis')
I have run my principal component analysis and successfully plotted it:
pca=prcomp(dataset,scale=TRUE)
plot(pca$x[,1], pca$x[,2],
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],cex=0.7,pos=3,col="darkgrey")
What I want to do however is colour code my data points based on the city, which is the row names of my dataset. I also want to use these cities (i.e. rownames) as labels.
I've tried the following, but neither attempt has worked:
## attempt 1 - I get row labels, but no chart
plot(pca$x[,1], pca$x[,2],col=rownames(dataset),pch=rownames(dataset),
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],labels=rownames(dataset),cex=0.7,pos=3,col="darkgrey")
## attempt 2
datasetwithcity = rownames_to_column(dataset, var = "city")
head(datasetwithcity)
OnlyCities=datasetwithcity[,1]
OnlyCities
# this didn't work:
City_Labels=as.numeric(OnlyCities)
head(City_Labels)
# gets city labels, but loses points and no colour
plot(pca$x[,1], pca$x[,2],col=City_Labels,pch=City_Labels,
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],labels=rownames(dataset),
cex=0.7,pos=3,col="darkgrey")
There are many different ways to do this.
In base R, you could do:
plot(pca$x[,1], pca$x[,2],
xlab="First PC",ylab="Second PC", col = seq(nrow(pca$x)),
xlim = c(-2.5, 2.5), ylim = c(-2, 2))
text(pca$x[,1], pca$x[,2],cex=0.7,pos=3,col="darkgrey")
text(x = pca$x[,1], y = pca$x[,2], labels = rownames(pca$x), pos = 1)
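As a side note on why the second attempt in the question fails, this sketch (my reading of the problem, not part of the original answer) shows that as.numeric() on city names yields only NA, whereas going through a factor gives usable integer codes. The name city_codes is hypothetical.

```r
cities <- c("New York", "Seattle", "Washington DC", "Dallas",
            "Chicago", "Los Angeles", "Minneapolis")

# Character strings that are not numbers become NA (with a warning)
print(suppressWarnings(as.numeric(cities)))

# Going through a factor gives one integer code per distinct city
city_codes <- as.integer(factor(cities))
print(city_codes)
```

Passing col = city_codes to plot() would then give one colour per distinct city. Since every city appears once here, seq(nrow(pca$x)) achieves the same effect; note that the default base palette has 8 colours and recycles beyond that.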
Personally, I think the resulting aesthetics are nicer (and easier to change to suit your needs) with ggplot. The code is also a bit easier to read once you get used to the syntax.
library(ggplot2)
df <- as.data.frame(pca$x)
df$city <- rownames(df)
ggplot(df, aes(PC1, PC2, color = city)) +
geom_point(size = 3) +
geom_text(aes(label = city) , vjust = 2) +
lims(x = c(-2.5, 2.5), y = c(-2, 2)) +
theme_bw() +
theme(legend.position = "none")
Created on 2021-10-28 by the reprex package (v2.0.0)

How to create separate facets for different measurements with tidyverse?

I am a novice programmer looking to plot highly grouped variables. Specifically, I am trying to plot a variable that is grouped by 5 other variables. Below is an example data that I am working with.
library(ggplot2)
library(tibble)
set.seed(42)
mydf <- tibble(
grp = rep(c('A', 'B'), length.out = 32, each = 16),
sex = rep(c('M', 'F'), length.out = 32, each = 8),
cond = rep(c('Wet', 'Dry'), length.out = 32, each = 4),
measure = rep(c('Tempature', 'Volume'), length.out = 32, each = 2),
kind = rep(c('Experimental', 'Control'), length.out = 32, each = 1),
value = rnorm(32) * 100,
)
ggplot(mydf, aes(x = grp, y = value, col = cond)) +
geom_point() +
facet_wrap(sex~measure + kind)
However, the output is quite messy. Would it be possible to create separate faceted plots for each measurement? What would be a proper way to graph this type of data?
Thank you
For ease of comparison, I would facet on no more than two variables. I would also use facet_grid() rather than facet_wrap() in such cases, as I think it's just easier to keep track of the different facet dimensions if they are on separate axes.
In your case, you want to distinguish measurements for 5 binary variables.
grp
sex
cond
measure
kind
With "grp" on the x-axis, "cond" distinguished by colour, and 2 of the remaining 3 on facets, we'll need to introduce another aesthetic parameter to distinguish the last variable.
In this case, since there aren't too many points to plot, I suggest shape.
ggplot(mydf, aes(x = grp, y = value,
color = cond,
shape = kind)) +
geom_point(size = 5, stroke = 2) +
facet_grid(sex~measure) +
scale_shape_manual(values = c("Control" = 4, "Experimental" = 16),
breaks = c("Experimental", "Control"))
The use of a filled shape vs an un-filled shape makes Experimental points visually distinct from Control points. You can check out other shape options here.
Note that if there are many different values in your grouping variables (e.g. 5 categories along the x-axis, 6 different colours, 20 facet combinations, etc.), or many points within each facet, the plot will look very busy, and you may want to split into separate plots rather than keep everything together.
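If you do decide to split into separate plots, one possible sketch is to split the data by measure and build one plot per piece. This assumes the mydf columns from the question (rebuilt here as a plain data frame so the snippet is self-contained); the name plots is hypothetical.

```r
library(ggplot2)

set.seed(42)
mydf <- data.frame(
  grp = rep(c("A", "B"), each = 16, length.out = 32),
  sex = rep(c("M", "F"), each = 8, length.out = 32),
  cond = rep(c("Wet", "Dry"), each = 4, length.out = 32),
  measure = rep(c("Tempature", "Volume"), each = 2, length.out = 32),
  kind = rep(c("Experimental", "Control"), each = 1, length.out = 32),
  value = rnorm(32) * 100,
  stringsAsFactors = FALSE
)

# One ggplot object per measurement, each faceted by sex only
plots <- lapply(split(mydf, mydf$measure), function(d) {
  ggplot(d, aes(x = grp, y = value, color = cond, shape = kind)) +
    geom_point(size = 3) +
    facet_grid(sex ~ .) +
    ggtitle(d$measure[1])
})
print(names(plots))
```

Each element of plots can be printed or saved on its own, e.g. plots$Volume, which also lets each measurement keep its own y-axis range.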

How to add a trendline to a boxplot of counts(y axis) and ids(x axis) when x axis is ordered

df1 <- data.frame(a=c(1,4,7),
b=c(3, 5, 6),
c=c(1, 1, 4),
d=c(2 ,6 ,3))
df2<-data.frame(id=c("a","f","f","b","b","c","c","c","d","d"),
var=c(12,20,15,18,10,30,5,8,5,5))
mediorder <- with(df2, reorder(id, -var, median))
boxplot(var~mediorder, data = df2)
fc = levels(as.factor(mediorder))
ndf1= df1[,intersect(fc, colnames(df1))]
ln<-lm( #confused here
boxplot(ndf1)
abline(ln)
I have the above boxplot (ndf1) with an x-axis ordered according to medians from another data frame, and I would like to add a trendline to it.
I am confused since it doesn't have an x and y variable to refer to, just columns with counts. Also the ordering is causing me problems.
EDITED for clarification...
I am building on the question here: How to match an ordered list (e.g., levels(as.factor(x)) ) to another dataframe in which only some columns match?
All I would like to do is fit a trend line to ndf1
Something like this should do. It's fairly easy using ggplot2. However, your data and question are a bit confusing: for example, some factor levels (such as a) have only one data point. Is this what you want?
df2$id <- factor(df2$id , levels = levels(mediorder))
library(ggplot2)
ggplot(data = df2, aes(x = id, y = var)) + geom_boxplot() +
  geom_smooth(method = "lm", aes(group = 1), se = FALSE)
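To show what the trend line is fitted against, here is a base R sketch under the assumption that the ordered ids are treated as integer positions 1, 2, 3, ... along the x-axis (which, as far as I know, is also what geom_smooth() sees once aes(group = 1) puts all points in one group). The name fit is hypothetical.

```r
df2 <- data.frame(
  id = c("a", "f", "f", "b", "b", "c", "c", "c", "d", "d"),
  var = c(12, 20, 15, 18, 10, 30, 5, 8, 5, 5)
)

# Order the ids by decreasing median, as in the question
mediorder <- with(df2, reorder(id, -var, median))
print(levels(mediorder))

# Regress on the integer position of each id along the ordered axis
fit <- lm(var ~ as.integer(mediorder), data = df2)
print(coef(fit))
```

In base graphics the boxes are drawn at x = 1, 2, ..., so calling boxplot(var ~ mediorder, data = df2) followed by abline(fit) should overlay the same kind of line.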

adding line segments to existing facet grid ggplot r

I'm trying to plot distribution of species between 2 different habitat types (hab 1 and hab 2). Some of my species secondarily use some habitats, so I have a separate column for secondary hab1 (hab1.sec). To visualise their distribution across the two habitats and different depths, I am using a facet_grid between hab1 and hab2. Example code as below:
# example code
set.seed(101)
ID <- seq(1,20, by=1) ## ID for plotting
species <- sample(letters, size=20) ## arbitrary species
## different habitat types in hab.1
hab1 <- c("coastal","shelf","slope","open.ocean","seamount")
hab1.pri <- sample(hab1, size = 20, replace = T)
## secondarily used habitats, may not be present for some species
hab.sec <- c("coastal","shelf","slope","open.ocean","seamount", NA)
hab1.sec <- sample(hab.sec, size = 20, replace = T)
## habitat types for hab.2
hab2 <- c("epipelagic","benthopelagic","epibenthic","benthic")
hab.2 <- sample(hab2, size = 20, replace = T)
## arbitrary depth values
dep.min <- sample(seq(0,1000), size = 20, replace = T)
dep.max <- sample(seq(40, 1500), size = 20, replace = T)
# make data frame
dat <- data.frame(ID, species, hab1.pri, hab1.sec, hab.2,dep.min, dep.max)
# ggplot with facet grid
p <- ggplot(data = dat) +
  geom_segment(aes(x = as.factor(ID), xend = as.factor(ID), y = dep.min, yend = dep.max), size = 2, data = dat) +
  scale_y_reverse(breaks = c(0, 200, 1000, 1500)) +
  facet_grid(hab.2 ~ hab1.pri, scales = "free", space = "free") +
  theme_bw()
I would like to add segments for hab1.sec within the existing facet grid. I have tried this code:
p+ geom_segment(aes(x=as.factor(ID),xend=as.factor(ID),y=dep.min, yend=dep.max),linetype=2,data = dat)+facet_wrap(~hab1.sec)
But doing this produces a new graph.
Is there a better way to add those extra lines to the existing grid (preferably as dashed lines)?
I'd be really grateful for any help with this!
Thanks a lot, in advance!
What about combining the primary and secondary habitats into one variable and mapping that variable to an aesthetic?
Note I'm using tidyr and dplyr tools here because they help a lot in cases like this.
library(dplyr)
library(tidyr)
dat %>%
  gather(hab1, value, -ID, -species, -(hab.2:dep.max)) %>%
  ggplot() +
  geom_segment(aes(x = as.factor(ID), xend = as.factor(ID), y = dep.min, yend = dep.max, linetype = hab1), size = 2) +
  scale_y_reverse(breaks = c(0, 200, 1000, 1500)) +
  facet_grid(hab.2 ~ value, scales = "free", space = "free") +
  theme_bw()
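For reference, here is a sketch of the same reshaping step using tidyr::pivot_longer(), the newer counterpart of gather(). It runs on a small made-up data frame (small is hypothetical and only mimics the column names of dat) so the long format is easy to inspect.

```r
library(tidyr)

# Hypothetical three-species subset with the same column names as dat
small <- data.frame(
  ID = 1:3,
  hab1.pri = c("coastal", "shelf", "slope"),
  hab1.sec = c("shelf", NA, "coastal"),
  dep.min = c(0, 100, 400),
  dep.max = c(50, 300, 900),
  stringsAsFactors = FALSE
)

# Stack primary and secondary habitat into one column, keeping track of
# which original column each row came from
long <- pivot_longer(
  small,
  cols = c(hab1.pri, hab1.sec),
  names_to = "hab1",
  values_to = "value",
  values_drop_na = TRUE  # drop species without a secondary habitat
)
print(long)
```

Each species now contributes one row per habitat it uses, so mapping linetype to hab1 and faceting on value draws primary and secondary segments in the same grid. Dropping the NA rows is my own choice here; without it, species lacking a secondary habitat would likely end up in an extra NA facet.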
