How to highlight specific lines in specific groups with ggplot2? - r

I'm trying to highlight (change the color) specific lines in a plot.
The input data looks like this:
dt <- data.frame(Marker = paste0('m', rep(seq(1,10), 10)),
Year = rep(1990:1999, each = 10),
Ahat = rnorm(100, 0.5, 0.1)) %>%
mutate(Group = if_else(Marker %in% c("m1", "m2", "m3"), "A",
if_else(Marker %in% c("m4", "m5", "m6"), "B",
if_else(Marker %in% c("m7", "m8"), "C", "D")) ) )
And the general plot can be created by:
ggplot(dt, aes(x = Year, y = Ahat, group = interaction(as.factor(Group), Marker), color = as.factor(Group) ) ) +
geom_line(alpha = 0.5, size = 0.5) +
theme_classic() +
scale_y_continuous(name = "Predicted Value", breaks = pretty_breaks()) +
scale_colour_manual(name = "Groups", values = c("black", "red", "blue", "orange")) +
facet_wrap(~Group)
What I'd like to do is to highlight (e.g. make some lines black) some specific lines in specific groups (e.g. "m1" and "m9").
I've tried using something like this gghighlight(Marker %in% c("m1", "m9")), but it doesn't work.
I'd like to have something like this (sorry for my poor drawing skills):
Any suggestion?
P.S: My real data has 50K markers.
Thank you!

One option would be to first group data in subgroups (nesting in the dataframe) and then build the plots...
library(tidyverse)
library(scales)
library(patchwork)
# 1. Create dataframe ----
dt <- data.frame(Marker = as.factor(paste0('m', rep(seq(1,10), 10))),
Year = rep(1990:1999, each = 10),
Ahat = rnorm(100, 0.5, 0.1)) %>%
mutate(Group = case_when(
Marker %in% c("m1", "m2", "m3") ~ "A",
Marker %in% c("m4", "m5", "m6") ~ "B",
Marker %in% c("m7", "m8") ~ "C",
TRUE ~ "D"))
# 2. Function to choose which Market of sub_df should be Highlight
getHighlightMarketBasedOntAhatValue <- function(sub_dt) {
sub_dt <- sub_dt %>%
group_by(Marker) %>%
mutate(mean_Ahat = mean(Ahat))
# using mean to choose Ahat is just a doomed example... also instead of a single value you could get an array of values.
# Here I am not using the index...1, 2... any more (as was in first solution), but the factor itself.
highlightMarket <- first(sub_dt$Marker[sub_dt$mean_Ahat == max(sub_dt$mean_Ahat)])
}
# 3. Function to build plot for sub_df
my_plot <- function(sub_dt, highlighted_one) {
custom_pallete = rep("grey", length(levels(sub_dt$Marker)))
names(custom_pallete) <- levels(sub_dt$Marker)
custom_pallete[highlighted_one] = "blue"
dt %>% ggplot(aes(x = Year,
y = Ahat,
color = as.factor(Marker))) +
geom_line(alpha = 0.5, size = 0.5) +
theme_classic() +
scale_y_continuous(name = "Predicted Value", breaks = pretty_breaks()) +
scale_colour_manual(name = "Marker", values = custom_pallete)
}
# 4. Main ----
# 4.1 Nesting ----
nested_dt <- dt %>%
group_by(Group) %>%
nest()
# 4.2 Choosing highlight Market for each subgroup ----
nested_dt <- nested_dt %>%
mutate(highlighted_one = getHighlightMarketBasedOntAhatValue(data[[1]]))
# 4.3 Build plots ----
nested_dt <- nested_dt %>%
mutate(plot = map2(.x = data,
.y = highlighted_one,
.f = ~ my_plot(.x, .y)))
# 4.4 Use patchwork ... ----
# to combine plots ... see patchwork help to find out how to
# manage titles, labels, etc.
nested_dt %>% pull(plot) %>% patchwork::wrap_plots()
```

One way could be to set color as Marker.
Then you can change the color of the Marker in this line
scale_colour_manual(name = "Groups", values = c("black", "red", "blue", "orange", "green", "black", "red", "blue", "orange", "green")) +
Change the colors as you like:
ggplot(dt, aes(x = Year, y = Ahat, group = interaction(as.factor(Group), Marker), color = Marker ) ) +
geom_line(alpha = 0.5, size = 0.5) +
theme_classic() +
scale_y_continuous(name = "Predicted Value", breaks = pretty_breaks()) +
scale_colour_manual(name = "Groups", values = c("black", "red", "blue", "orange", "green",
"black", "red", "blue", "orange", "green")) +
facet_wrap(~Group)

Related

ggplot line plot with one group`s lines on top

I am making a line plot of several groups and want to make a visualization where one of the groups lines are highlighted
ggplot(df) + geom_line(aes(x=timepoint ,y=var, group = participant_id, color=color)) +
scale_color_identity(labels = c(red = "g1",gray90 = "Other"),guide = "legend")
However, the group lines are partially obscured by the other groups lines
How can I make these lines always on top of other groups lines?
The simplest way to do this is to plot the gray and red groups on different layers.
First, let's try to replicate your problem with a dummy data set:
set.seed(1)
df <- data.frame(
participant_id = rep(1:50, each = 25),
timepoint = factor(rep(0:24, 50)),
var = c(replicate(50, runif(1, 50, 200) + runif(25, 0.3, 1.5) *
sin(0:24/(0.6*pi))^2/seq(0.002, 0.005, length = 25))),
color = rep(sample(c("red", "gray90"), 50, TRUE, prob = c(1, 9)), each = 100)
)
Now we apply your plotting code:
library(ggplot2)
ggplot(df) +
geom_line(aes(x=timepoint ,y=var, group = participant_id, color = color)) +
scale_color_identity(labels = c(red = "g1", gray90 = "Other"),
guide = "legend") +
theme_classic()
This looks broadly similar to your plot. If instead we plot in different layers, we get:
ggplot(df, aes(timepoint, var, group = participant_id)) +
geom_line(data = df[df$color == "gray90",], aes(color = "Other")) +
geom_line(data = df[df$color == "red",], aes(color = "gl")) +
scale_color_manual(values = c("red", "gray90")) +
theme_classic()
Created on 2022-06-20 by the reprex package (v2.0.1)
You can use factor releveling to bring the line (-s) of interest to front.
First, let's plot the data as is, with the red line partly hidden by others.
library(ggplot2)
library(dplyr)
set.seed(13)
df <-
data.frame(timepoint = rep(c(1:100), 20),
participant_id = paste0("p_", sort(rep(c(1:20), 100))),
var = abs(rnorm(2000, 200, 50) - 200),
color = c(rep("red", 100), rep("gray90", 1900)))
ggplot(df) +
geom_line(aes(x = timepoint ,
y = var,
group = participant_id, color = color)) +
scale_color_identity(labels = c(red = "g1", gray90 = "Other"),
guide = "legend")
Now let's bring p_1 to front by making it the last factor level.
df %>%
mutate(participant_id = factor(participant_id)) %>%
mutate(participant_id = relevel(participant_id, ref = "p_1")) %>%
mutate(participant_id = factor(participant_id, levels = rev(levels(participant_id)))) %>%
ggplot() +
geom_line(aes(x=timepoint,
y=var,
group = participant_id,
color = color)) +
scale_color_identity(labels = c(red = "g1", gray90 = "Other"),
guide = "legend")

How to migrate `base R` location plot to `ggplot2` and avoid `for` loop?

I'm interested in visualizing the location of certain words in sentences. Say I have 500 sentences between 3-5 words long and want to visualize the location of word A in each sentence:
Data:
set.seed(123)
w1 <- sample(LETTERS[1:3], 1000, replace = TRUE)
w2 <- sample(LETTERS[1:5], 1000, replace = TRUE)
w3 <- sample(LETTERS[1:6], 1000, replace = TRUE)
w4 <- sample(c(NA,LETTERS[1:7]), 1000, replace = TRUE)
w5 <- sample(c(NA,LETTERS[1:8]), 1000, replace = TRUE)
df <- data.frame(
position = rep(1:5, each = 1000), # position of word in sentence
word = c(w1, w2, w3, w4, w5) # the words in the sentences
)
I can produce the location plot in base R. But the code involves a very slow for loop and does not have the aesthetic qualities of ggplot2. So how can the same type of visualization be produced faster and in ggplot2?
This is the code that produces the location plot in base R:
# Plot dimensions:
x <- rep(1:5, 100)
y <- 1:500
# Plot parameters:
par(mar=c(2,1.5,1,1.5), par(xpd = T))
# Plot:
plot(y ~ x, type = "n", frame = F, axes = F, ylab="", xlab="",
main="Location of word 'A' in sentences", cex.main=0.9)
axis(1, at=seq(1:5), labels=c("w1", "w2", "w3", "w4", "w5"), cex.axis=0.9)
# Legend:
legend(2.25, 530, c("A", "other", "NA"), fill=c("blue", "orange", "black"),
horiz = T, cex = 0.7, bty = "n", border = "white")
# For loop to print 'A' as color in positions:
for(i in unique(df$position)){
text(i, 1:500, "__________", cex = 1,
col = ifelse(df[df$position==i,]$word=="A", "blue", "orange"))
}
For example using geom_segment, and then with a conditional aesthetic.
I am using ggh4x for the truncated axis.
library(tidyverse)
library(ggh4x)
df <-
df %>% group_by(position) %>%mutate(index = row_number())
ggplot(df, aes( color = word=="A")) +
geom_segment(aes(x = position-.4, xend = position+.4,
y = index, yend = index),
key_glyph= "rect") +
scale_color_manual(name = NULL,
values=c(`TRUE` = "blue", `FALSE` = "orange"),
labels = c(`TRUE` = "A", `FALSE` = "other"),
na.value="black")+
guides(x = "axis_truncated") +
scale_x_continuous(breaks = 1:5, labels = paste0("w", 1:5))+
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
plot.title = element_text(hjust = .5),
legend.position = "top") +
labs( y = NULL, x = NULL, title = "Location of A")
Here's an initial attempt. (I'm not quite clear, are you looking to show just the first 500 of the 1000 sentences?)
My approach here is to first summarize the data in terms of contiguous sections that are A / other / NA. This way, the plot area is filled exactly without needing to tweak line thickness, and it should plot more quickly by reducing the number of plotted elements.
library(dplyr)
df_plot <- df %>%
mutate(A_spots = case_when(word == "A" ~ "A",
word != "A" ~ "other",
TRUE ~ "NA")) %>%
group_by(position) %>%
mutate(col_chg = A_spots != lag(A_spots, default = ""),
group_num = cumsum(col_chg)) %>%
ungroup() %>%
count(position, group_num, A_spots)
library(ggplot2)
ggplot(df_plot, aes(position, n, fill = A_spots, group = group_num)) +
geom_col() +
scale_x_continuous(name = NULL, breaks = 1:5, #stolen from #tjebo's answer
labels = paste0("w", 1:5))+
scale_fill_manual(
values = c("A" = "blue","other" = "orange", "NA" = "black")) +
labs(title = "Location of word 'A' in sentences") +
theme_minimal()

ggplot add variable to legend without including in plot (when using alpha)

I want to add a variable to the legend without including it in the plot.
I think problem doesn't occur when I don't use alpha(see: How do I add a variable to the legend without including it in the graph?)
library(tidyverse)
name_color <- c('black', "blue", "orange", "pink")
names(name_color) <- letters[1:4]
tibble(name = rep(letters[1:4], each = 2),
respond = rep(c("yes", "no"), 4),
n = rep(50, 8),
me = "i") %>%
filter(name != "c") %>%
ggplot(aes(me, n, fill = name, alpha = respond)) +
facet_wrap(~name) +
geom_bar(stat = "identity") +
scale_fill_manual(values = name_color, drop = FALSE)
The issue has nothing to do with alpha. The problem is the class of your data. When you use tibble to create your data, the name column is of class character. You need a factor class to "remember" the unused levels:
name_color <- c('black', "blue", "orange", "pink")
names(name_color) <- letters[1:4]
d = tibble(name = rep(letters[1:4], each = 2),
respond = rep(c("yes", "no"), 4),
n = rep(50, 8),
me = "i") %>%
class(d$name)
# [1] "character"
d %>% mutate(name = factor(name)) %>%
filter(name != "c") %>%
ggplot(aes(me, n, fill = name, alpha = respond)) +
facet_wrap(~name) +
geom_bar(stat = "identity") +
scale_fill_manual(values = name_color, drop = FALSE)
In the original question you link, you had the factor conversion explicitly, which is why it worked.
... %>% mutate(
gear = factor(gear),
vs = factor(vs)
) %>% ...

Plotting two densities with vertical lines and correct legend

I want to draw two densities with two vertical lines for the averages.
The legend is once to denote the densities and once the vertical
lines.
I tried the code below. However, only one legend appears and the labeling is wrong.
Can anyone help me?
set.seed(1234)
data <- data.frame(value = rnorm(n = 10000, mean = 50, sd = 20),
type = sample(letters[1:2], size = 10000, replace = TRUE))
data$value[data$type == "b"] <- data$value[data$type == "b"] + 50
mean.a <- mean(data$value[data$type == "a"])
mean.b <- mean(data$value[data$type == "b"])
library(ggplot2)
gp <- ggplot(data = data, aes(x = value))
gp <- gp + geom_density(aes(fill = type), color = "black", alpha=0.3, lwd = 1.0, show.legend = TRUE)
gp <- gp + scale_fill_manual(breaks = 1:2, name = "Density", values = c("a" = "green", "b" = "blue"), labels = c("a" = "Density a", "b" = "Density b") )
gp <- gp + geom_vline(aes(color="mean.a", xintercept=mean.a), linetype="solid", size=1.0, show.legend = NA)
gp <- gp + geom_vline(aes(color="mean.b", xintercept=mean.b), linetype="dashed", size=1.0, show.legend = NA)
gp <- gp + scale_color_manual(name = "", values = c("mean.a" = "red", "mean.b" = "darkblue"), labels = c("mean.a" = "Mean.A", "mean.b" = "Mean.B"))
gp <- gp + theme(legend.position="top")
gp
Here are a couple ways to do it. I'm not sure, but I think some of the difficulty comes from having more than one geom_vline and trying to hard-code values in the aes. You're building three scales here: fill for the density curves, and color and linetype for the vertical lines. But you're aiming (correct me if I'm misreading) for two legends.
The easiest way to deal with getting the proper legends is to make a small data frame for the means, rather than individual values for each mean. You can do this easily with dplyr to calculate means for each type.
library(tidyverse)
set.seed(1234)
data <- data.frame(value = rnorm(n = 10000, mean = 50, sd = 20),
type = sample(letters[1:2], size = 10000, replace = TRUE))
data$value[data$type == "b"] <- data$value[data$type == "b"] + 50
means <- group_by(data, type) %>%
summarise(mean = mean(value))
means
#> # A tibble: 2 x 2
#> type mean
#> <fct> <dbl>
#> 1 a 50.3
#> 2 b 99.9
Then when you plot, you can make a single geom_vline call, assigning the means data frame and allowing the aesthetics you want—color and linetype—to be scaled based on this data. The trick then is reconciling the names and labels: if you don't set the same legend name and labels for both the color and linetype scales, you'll have two legends for the lines. Set them the same, and you get a single legend for the mean lines.
ggplot(data, aes(x = value)) +
geom_density(aes(fill = type), alpha = 0.3) +
geom_vline(aes(xintercept = mean, color = type, linetype = type), data = means) +
scale_color_manual(values = c("red", "darkblue"), labels = c("Mean.A", "Mean.B"), name = NULL) +
scale_linetype_discrete(labels = c("Mean.A", "Mean.B"), name = NULL) +
scale_fill_manual(values = c(a = "green", b = "blue"), name = "Density")
The second way is to just add a step to creating the means data frame where you label the types the way you want later, i.e. "Mean.A" instead of just "a". Then you don't need to adjust labels, and you can skip the linetype scale—unless you want to change linetypes manually—and then just remove the name for that legend for both color and linetype in your labs.
means2 <- group_by(data, type) %>%
summarise(mean = mean(value)) %>%
mutate(type = paste("Mean", str_to_upper(type), sep = "."))
means2
#> # A tibble: 2 x 2
#> type mean
#> <chr> <dbl>
#> 1 Mean.A 50.3
#> 2 Mean.B 99.9
ggplot(data, aes(x = value)) +
geom_density(aes(fill = type), alpha = 0.3) +
geom_vline(aes(xintercept = mean, color = type, linetype = type), data = means2) +
scale_color_manual(values = c(Mean.A = "red", Mean.B = "darkblue")) +
scale_fill_manual(values = c(a = "green", b = "blue"), name = "Density") +
labs(color = NULL, linetype = NULL)
Created on 2018-06-05 by the reprex package (v0.2.0).

Reordering legend items in ggplot from two different datasets and layers

I'm combining two layers in ggplot that were created from two different data sets and want to control the order in which the legend appears.
With example data and code:
base <-
data.frame(idea_num = c(1, 2),
value = c(-50, 90),
it_cost = c(30, 10))
group <-
data.frame(idea_num = c(1, 1, 2, 2),
group = c("a", "b", "a", "b"),
is_primary = c(TRUE, FALSE, FALSE, TRUE),
group_value = c(-40, -10, 20, 70))
base %>%
left_join(group) %>%
arrange(desc(value)) %>%
mutate(idea_num = idea_num %>% factor(levels = unique(idea_num)),
is_primary = is_primary %>% factor(levels = c("TRUE", "FALSE"))) %>%
ggplot(aes(x = idea_num, y = group_value, fill = is_primary)) +
geom_bar(stat = "identity") +
geom_bar(data = base %>%
arrange(desc(value)) %>%
mutate(idea_num = idea_num %>% factor(levels = unique(idea_num))),
aes(x = idea_num, y = it_cost, alpha = 0.1, fill = "it_cost"),
stat = "identity") +
scale_fill_manual(name = "Group", labels = c("TRUE" = "Primary", "FALSE" = "Secondary", "it_cost" = "IT Cost"),
values = c("TRUE" = "blue", "FALSE" = "red", "it_cost" = "black")) +
scale_alpha(guide = "none") +
theme(legend.position = "bottom")
I get a figure
but I'd like the legend to appear in the order of Primary, Secondary, IT Cost.
Were all of the numbers I'm trying to plot part of the same grand number, I could easily melt the dataframe and sum everything; however, the values from the group$group_value need to be displayed separate from base$it_cost.
If I plot only the values from teh first layer, i.e.,
base %>%
left_join(group) %>%
arrange(desc(value)) %>%
mutate(idea_num = idea_num %>% factor(levels = unique(idea_num)),
is_primary = is_primary %>% factor(levels = c("TRUE", "FALSE"))) %>%
ggplot(aes(x = idea_num, y = group_value, fill = is_primary)) +
geom_bar(stat = "identity") +
scale_fill_manual(name = "Group", labels = c("TRUE" = "Primary", "FALSE" = "Secondary"),
values = c("TRUE" = "blue", "FALSE" = "red")) +
theme(legend.position = "bottom")
I get a figure I expect
How can I add the second layer and adjust the ordering of the legend boxes? I do not believe that this question or this question are entirely relevant to mine as the former is dealing with levels of a factor and the latter deals with ordering of multiple legends.
Can I do what I'd like to do? Is there a better way of constructing this plot?
use scale_fill_manual(..., limit=, ...):
... +
scale_fill_manual(name = "Group",
labels = c("TRUE" = "Primary", "FALSE" = "Secondary", "it_cost" = "IT Cost"),
limits = c("TRUE", "FALSE", "it_cost"),
values = c("TRUE" = "blue", "FALSE" = "red", "it_cost" = "black")) +
...
This gives:
That said, I think you may want to consider a few different approaches:
A: why do you create your data in such a complex way, ending up multiple observations of IT Costs for the same idea number? I don't know your data, you may well have your reasons, but a simple dataset along the lines:
idea_num value type
1 1 -40 Primary
2 1 -10 Secondary
3 2 20 Secondary
4 2 70 Primary
5 1 -50 IT Cost
6 2 90 IT Cost
would simplify the things quite a bit.
B: Why do you want to stack/overplot these two separate barplots? I would do position="dodge" instead to have separate bars.
df2 <- base %>%
left_join(group) %>%
mutate(is_primary=paste0("pri_", is_primary+0)) %>%
spread(is_primary, group_value) %>%
gather(yvar, y, it_cost, pri_0, pri_1)
df2$yvar <- factor(df2$yvar, levels=c("pri_0", "pri_1", "it_cost"),
labels=c("Primary", "Secondary", "IT Cost"))
df2$idea_num <- factor(df2$idea_num, levels=c(2, 1))
ggplot(df2, aes(idea_num, y, fill=yvar)) +
geom_bar(stat="identity") +
scale_fill_manual("Group", values=c("blue", "red", "black")) +
scale_alpha(guide = "none") +
theme(legend.position = "bottom")

Resources