I am plotting max_temperature (mean_tmax) against rainfall (mean_rain) in a mirrored barplot: max temp displayed upwards, rain values downwards on the negative scale. These two are stored in the "name" variable.
To highlight the highest values out of the 32 years plotted, I created two vectors colVecTmax, colVecRain. They return a color vector of length 32 each, with the index of max values marked differently.
But when adding these two vectors to fill within geom_bar(), it turns out that ggplot stops counting the top after 16 bars, and moves down to the negative scale to continue. So it does not count by the name (mean_tmax, or mean_rain) variable.
This messes up the plot, and I am not sure how to get ggplot to count through the top bars for max_temperature first, coloring by colVecTmax, and then move down to do the same for rain on the negative scale with colVecRain.
Can anyone give a hint on how to solve this?
colVecTmax <- rep("orange",32)
colVecTmax[which.max(as.numeric(unlist(df.long[df.long$place=="sheffield" & df.long$name == "mean_tmax",4])))] <- "blue"
colVecRain <- rep("grey",32)
colVecRain[which.max(as.numeric(unlist(df.long[df.long$place=="sheffield" & df.long$name == "mean_rain",4])))] <- "blue"
ggplot(df.long[df.long$name %in% c('mean_rain', 'mean_tmax'), ] %>% filter(place== "sheffield")%>%
group_by(name) %>% mutate(value = case_when(
name == 'mean_rain' ~ value/10 * -1,
TRUE ~ value)) %>% mutate(place==str_to_sentence(placenames)) %>%
mutate(name = recode(name,'mean_rain' = "rainfall" , "mean_tmax" = "max temp"))
, aes(x = yyyy, y = value, fill=name))+
geom_bar(stat="identity", position="identity", fill=c(colVecTmax,colVecRain))+
labs(x="Year", y=expression("Rain in cm, temperature in ("*~degree *C*")"))+
geom_smooth(colour="black", lwd=0.5,se=F)+
scale_y_continuous(breaks = seq(-30, 30 , 5))+
scale_x_continuous(breaks = seq(1990, 2025, 5))+
guides(fill= guide_legend(title=NULL))+
scale_fill_discrete(labels=c("Max temperature", "Rainfall"))+
guides(fill=guide_legend(reverse=T), res=96)
With ggplot2 there are much easier and less error-prone ways to assign colors. Instead of creating color vectors and passing them to the color or fill argument, you can map a variable to the aesthetic (which you have basically already done) and assign the desired colors with a manual scale, e.g. scale_fill_manual. The same approach works when you want to highlight some values: create additional categories. In the code below I append "_max" to the name for the observations with the maximum temperature or rainfall and assign "blue" to these categories. Because this adds categories, I use the breaks argument of scale_fill_manual so that the max categories do not show up in the legend.
Using some fake random example data:
# Create example data
set.seed(123)
df.long <- data.frame(
name = rep(c("mean_rain", "mean_tmax"), each = 30),
place = "sheffield",
yyyy = rep(1991:2020, 2),
value = c(runif(30, 40, 100), runif(30, 12, 16))
)
library(ggplot2)
library(dplyr)
df_plot <- df.long %>%
filter(name %in% c("mean_rain", "mean_tmax")) |>
filter(place == "sheffield") %>%
mutate(value = case_when(
name == "mean_rain" ~ -value / 10,
TRUE ~ value
)) |>
# Maximum values
group_by(name) |>
mutate(name = ifelse(abs(value) >= max(abs(value)), paste(name, "max", sep = "_"), name))
ggplot(df_plot, aes(x = yyyy, y = value, fill = name)) +
geom_col(position = "identity") +
geom_smooth(colour = "black", lwd = 0.5, se = F) +
scale_y_continuous(breaks = seq(-30, 30, 5), labels = abs) +
scale_x_continuous(breaks = seq(1990, 2025, 5)) +
scale_fill_manual(
values = c(
mean_rain = "grey", mean_tmax = "orange",
mean_rain_max = "blue", mean_tmax_max = "blue"
),
labels = c(mean_tmax = "Max temperature", mean_rain = "Rainfall"),
breaks = c("mean_rain", "mean_tmax")
) +
labs(x = "Year", y = expression("Rain in cm, temperature in (" * ~ degree * C * ")"), fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))
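A vector passed to fill outside aes() is matched to the rows of the data by position, not by the name variable, which is why the colors in the question ended up on the wrong bars. As a rough alternative sketch (not part of the answer above), the color can be stored per row in the data frame and used as-is with scale_fill_identity(), so it stays aligned whatever the row order is. The column names base_col and bar_col are made up for this illustration, and the data is the same fake example data as above.

```r
library(ggplot2)
library(dplyr)

# Same fake example data as in the answer above
set.seed(123)
df.long <- data.frame(
  name = rep(c("mean_rain", "mean_tmax"), each = 30),
  place = "sheffield",
  yyyy = rep(1991:2020, 2),
  value = c(runif(30, 40, 100), runif(30, 12, 16))
)

df_cols <- df.long %>%
  group_by(name) %>%
  mutate(
    # base_col / bar_col are hypothetical helper columns
    base_col = if_else(name == "mean_tmax", "orange", "grey"),
    # mark the maximum of each series (computed before rain is mirrored)
    bar_col = if_else(value == max(value), "blue", base_col),
    # mirror rainfall onto the negative scale, in cm
    value = if_else(name == "mean_rain", -value / 10, value)
  ) %>%
  ungroup()

p <- ggplot(df_cols, aes(x = yyyy, y = value, fill = bar_col)) +
  geom_col(position = "identity") +
  scale_fill_identity() +  # use the color names stored in bar_col directly
  scale_y_continuous(labels = abs) +
  labs(x = "Year", y = "Rain in cm, temperature in degrees C")
p
```

Note that scale_fill_identity() draws no legend by default, so the manual-scale approach above is the better fit when a legend is needed.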
I am trying to create a plot with two y axis, each of different values.
The moment I add the data of the second axis, the first axis gets extra (unwanted) area.
Any idea how to fix this?
I created some simple example, just to show what I do wrong.
Thank you!
library(tidyverse)
starwars <- starwars[,c(2,3,7)] %>% drop_na() %>%
dplyr::filter(mass > 40 & mass < 200) %>%
mutate(some_y = height/max(height))
viz <- starwars %>% ggplot() +
geom_line(aes(x = birth_year, y = mass), colour = "blue") +
labs(ylim = c(70,150)) #until here, the plot is fine
viz +
geom_line(data = starwars, #starting from here, it jumps up
aes(x = birth_year, y = some_y*100), colour = "red") +
scale_y_continuous(sec.axis = sec_axis(~ ., name = "2nd y",
labels = seq(0, 100, length.out = 11), breaks = seq(0, 100, length.out = 11)))
first, it works fine: (viz)
Then, it jumps up: (viz + ...)
TL;DR: with plot labels using geom_label etc., is it possible to use different data for the calculation of positions (using position_stack or similar functions) than for the display of the label itself? Or, less generally, is it possible to subset the label data after positions have been calculated?
I have some time series data for many different subjects. Observations took place at multiple time points, which are the same for each subject. I would like to plot this data as a stacked area plot, where the height of a subject's curve at each time point corresponds to the observed value for that subject at that time point. Crucially, I also need to add labels to identify each subject.
However, the trivial solution of adding one label at each observation makes the plot unreadable, so I would like to limit the displayed labels to the "most important" subjects (the ones that have the highest peak), as well as only display a label at the respective peak. This subsetting of the labels themselves is not a problem either, but I cannot figure out how to then position the (subset of) labels correctly so they match with the stacked area chart.
Here is some example code, which should work out of the box with tidyverse installed, to illustrate my issue. First, we generate some data which has the same structure as mine:
library(tidyverse)
set.seed(0)
# Generate some data
num_subjects = 50
num_timepoints = 10
labels = paste(sample(words, num_subjects), sample(fruit, num_subjects), sep = "_")
col_names = c("name", paste0("timepoint_", c(1:num_timepoints)))
df = bind_rows(map(labels,
~c(., cumsum(rnorm(num_timepoints))) %>%
set_names(col_names))) %>%
pivot_longer(starts_with("timepoint_"), names_to = "timepoint", names_prefix = "timepoint_") %>%
mutate(across(all_of(c("timepoint", "value")), as.numeric)) %>%
mutate(value = if_else(value < 0, 0, value)) %>%
group_by(name) %>% mutate(peak = max(value)) %>% ungroup()
Now, we can trivially make a simple stacked area plot without labels:
# Plot (without labels)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
scale_fill_viridis_d()
Plot without labels (it appears that I currently cannot embed images, which is very unfortunate as they are extremely illustrative here...)
It is also not too hard to add non-specific labels to this data. They can easily be made to appear at the correct position — so the center of the label is at the middle of the area for each time point and subject — using position_stack:
# Plot (all labels, positions are correct but the plot is basically unreadable)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with a label at each observation
However, as noted before, the labels almost entirely obscure the plot itself. So my approach would be to only show labels at the peaks, and only for the 10 subjects with the highest peaks:
# Plot (only show labels at the peak for the 10 highest peaks, readable but positions are wrong)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>% slice_max(value, n = 1) %>% ungroup() %>%
slice_max(value, n = max_labels)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(data = df_labels, mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with only a subset of labels
This code also works fine, but it is apparent that the labels no longer show up at the correct positions, but are instead too low on the plot, especially for the subjects which would otherwise be higher up. (The only subject where the position is correct is work_eggplant.) This makes perfect sense, as the data used for calculation of position_stack are now only a subset of the original data, so the observations which would receive no labels are not considered when stacking. This can be illustrated by zeroing out all the observations which would not receive a label:
df_zeroed = anti_join(df %>% mutate(value = 0),
df_labels,
by = c("name", "timepoint")) %>% bind_rows(df_labels)
ggplot(df_zeroed,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(data = df_labels, mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with unlabeled observations zeroed out
So now my question is, how can this problem be solved? Is there a way to use the original data for the positioning, but the subset data for the actual display of the labels?
Maybe this is what you are looking for. To achieve the desired result you could
use the whole dataset for plotting the labels to get the right positions,
use an empty string "" for the non-desired labels,
set the fill and color of the non-desired labels to "transparent".
# Plot (only show labels at the peak for the 10 highest peaks; positions are computed from the full data)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>%
slice_max(value, n = 1) %>%
ungroup() %>%
slice_max(value, n = max_labels) %>%
mutate(label = name)
df1 <- df %>%
left_join(df_labels) %>%
replace_na(list(label = ""))
#> Joining, by = c("name", "timepoint", "value", "peak")
ggplot(df1,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = as.character(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(mapping = aes(
label = label,
fill = ifelse(label != "", as.character(peak), NA_character_),
color = ifelse(label != "", "black", NA_character_)),
show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d(na.value = "transparent") +
scale_color_manual(values = c("black" = "black"), na.value = "transparent")
EDIT: If you want the fill colors to correspond to the value of peak, then
a simple solution would be to map peak to fill instead of factor(peak) and use fill = ifelse(label != "", peak, NA_real_) in geom_label. However, in that case you have to switch to a continuous fill scale.
Since I guess you had a good reason to use a discrete scale, another option would be to make peak an ordered factor. This approach, however, is not that simple. To make it work I first reorder factor(peak) according to peak, add an additional NA level, and make use of an auxiliary variable peak1 to fill the labels. However, as we now have two different variables mapped to fill, I would suggest using a second fill scale via ggnewscale::new_scale_fill to achieve the desired result:
library(tidyverse)
set.seed(0)
#cumsum(rnorm(num_timepoints)) * 3
# Generate some data
num_subjects = 50
num_timepoints = 10
labels = paste(sample(words, num_subjects), sample(fruit, num_subjects), sep = "_")
col_names = c("name", paste0("timepoint_", c(1:num_timepoints)))
df = bind_rows(map(labels,
~c(., cumsum(rnorm(num_timepoints)) * 3) %>%
set_names(col_names))) %>%
pivot_longer(starts_with("timepoint_"), names_to = "timepoint", names_prefix = "timepoint_") %>%
mutate(across(all_of(c("timepoint", "value")), as.numeric)) %>%
mutate(value = if_else(value < 0, 0, value)) %>%
group_by(name) %>% mutate(peak = max(value)) %>% ungroup()
# Plot (only show labels at the peak for the 10 highest peaks; positions are computed from the full data)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>%
slice_max(value, n = 1) %>%
ungroup() %>%
slice_max(value, n = max_labels) %>%
mutate(label = name)
df1 <- df %>%
left_join(df_labels) %>%
replace_na(list(label = ""))
#> Joining, by = c("name", "timepoint", "value", "peak")
df2 <- df1 %>%
mutate(
# Make ordered factor
peak = fct_reorder(factor(peak), peak),
# Add NA level to peak
peak = fct_expand(peak, NA),
# Auxiliary variable to set the fill to NA for non-desired labels
peak1 = if_else(label != "", peak, factor(NA)))
ggplot(df2, mapping = aes(x = factor(timepoint), y = value, group = name, fill = peak)) +
geom_area(show.legend = TRUE, position = "stack", colour = "gray25") +
scale_fill_viridis_d(na.value = "transparent") +
# Add a second fill scale
ggnewscale::new_scale_fill() +
geom_label(mapping = aes(
label = label,
fill = peak1,
color = ifelse(label != "", "black", NA_character_)),
show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d(na.value = "transparent") +
scale_color_manual(values = c("black" = "black"), na.value = "transparent")
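Another possible route, sketched here under assumptions rather than taken from the answer above, is to compute the stacked label positions by hand on the full data and subset afterwards, so the labels no longer need position_stack at all. This assumes ggplot2's default stacking order (groups are stacked in reverse order, so the last factor level ends up at the bottom). The toy data and the column names ymax and ymid are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: 3 subjects observed at 3 time points
toy <- data.frame(
  name = rep(c("a", "b", "c"), each = 3),
  timepoint = rep(1:3, times = 3),
  value = c(1, 2, 1, 2, 4, 2, 3, 1, 5)
) %>%
  mutate(name = factor(name))

# Mimic position_stack(vjust = 0.5) on the FULL data: within each time point,
# stack groups in reverse factor order and take the midpoint of each band
toy_pos <- toy %>%
  group_by(timepoint) %>%
  arrange(desc(as.integer(name)), .by_group = TRUE) %>%
  mutate(ymax = cumsum(value), ymid = ymax - value / 2) %>%
  ungroup()

# Subset only after the positions exist: one label per subject, at its peak
toy_labels <- toy_pos %>%
  group_by(name) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

p <- ggplot(toy, aes(x = timepoint, y = value, group = name, fill = name)) +
  geom_area(position = "stack", colour = "gray25") +
  # labels use the precomputed midpoints, so no position adjustment is needed
  geom_label(data = toy_labels, aes(y = ymid, label = name),
             fill = "white", show.legend = FALSE)
p
```

If the labels land in the wrong bands, the stacking order assumption does not hold for the ggplot2 version in use, and the arrange() call would need to be flipped.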
So I have a dataframe with 2 columns: "ID" and "Score".
ID contains the name of a simulation, and each simulation has 58 different scores that are listed in the column Score.
There are 10 simulations.
I am doing a geom_density plot:
my_dataframe %>%
ggplot(aes(x=`Score`), xlim = c(0, 1)) +
geom_density(aes(color = ID)) +
theme_bw() +
labs(title = "Scores")
https://imgur.com/a/9DUTmWw
How can I tell ggplot that I want the curves of Simulation1 and Simulation2 to look different from the others? I want them to be red and drawn with a greater line width than all the other ones.
Thank you for your help,
Best,
Maxime
Something like this?
my_dataframe %>% mutate(group = ifelse(ID %in% c(1,2), 'special', 'NonSpecial')) %>%
ggplot(aes(x=`Score`, lty = group), xlim = c(0, 1)) +
geom_density(aes(color = ID)) +
theme_bw() +
labs(title = "Scores")
I used this data:
my_dataframe <- data.frame(ID = factor(sample(1:4, 100, T)), Score = sin(1:100))
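To get the two curves in red with a thicker line, as the question asks, one option is to map a logical "highlight" flag to both colour and line width and set the values with manual scales. This is only a sketch: the data below is made up to match the description (10 simulations, 58 scores each), the highlight column is a hypothetical helper, and the linewidth aesthetic assumes ggplot2 3.4.0 or later (older versions use size instead).

```r
library(ggplot2)
library(dplyr)

# Made-up data matching the description: 10 simulations x 58 scores
set.seed(1)
my_dataframe <- data.frame(
  ID = rep(paste0("Simulation", 1:10), each = 58),
  Score = runif(580)
)

# Flag the simulations to emphasise (hypothetical helper column)
highlight_ids <- c("Simulation1", "Simulation2")
plot_data <- my_dataframe %>%
  mutate(highlight = ID %in% highlight_ids)

p <- ggplot(plot_data,
            aes(x = Score, group = ID, colour = highlight, linewidth = highlight)) +
  geom_density() +
  # red and thick for the flagged curves, grey and thin for the rest
  scale_colour_manual(values = c(`TRUE` = "red", `FALSE` = "grey60"), guide = "none") +
  scale_linewidth_manual(values = c(`TRUE` = 1.2, `FALSE` = 0.4), guide = "none") +
  coord_cartesian(xlim = c(0, 1)) +
  theme_bw() +
  labs(title = "Scores")
p
```

Keeping group = ID ensures one density curve per simulation even though colour only distinguishes two classes.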
I need help with plotting > 741 lines in a ggplot.
The color of a given line should not change along its length, i.e. the line color should be determined only by the final value of eci.
I would like to display the name (in the code example “unit”) of each line at the beginning and the end of each line.
Of course, over 700 lines are hard to distinguish with the naked eye, but are there any suggestions for making the lines more distinguishable?
df <- data.frame(unit=rep(1:741, 4),
year=rep(c(2012, 2013, 2014, 2015), each=741),
eci=round(runif(2964, 1, 741), digits = 0))
g = ggplot(data = df, aes(x=year, y=eci, group=unit)) +
geom_line(aes(colour=eci), size=0.01) +
scale_colour_gradientn(colours = terrain.colors(10)) +
geom_point(aes(colour=eci), size=0.04)
# The colour of the line should be determined by all eci for which year=2015
One way to achieve your desired result is creating new columns with extra information to use when plotting with ggplot2.
With dplyr, we group data by unit, and then arrange it, so we can create a column that stores the value of the last eci, and two columns with labels for the first and last year, so we can add them as text to the plot.
df_new <- df %>%
group_by(unit) %>%
arrange(unit, year, eci) %>%
mutate(last_eci = last(eci),
first_year = ifelse(year == 2012, unit, ""),
last_year = ifelse(year == 2015, unit, ""))
Then, we plot it.
ggplot(data = df_new,
aes(x = year, y = eci, group = unit, colour = last_eci)) +
geom_line(size = 0.01) +
geom_text(aes(label = first_year), nudge_x = -0.05, color = "black") +
geom_text(aes(label = last_year), nudge_x = 0.05, color = "black") +
scale_colour_gradientn(colours = terrain.colors(10)) +
geom_point(aes(colour = eci), size = 0.04)
Of course, looking at the resulting plot it's easy to see that trying to plot >700 lines of different colors and >1400 labels in a single plot is not very advisable.
I'd use relevant subsets of df, so that we produce plots that help us better understand the data.
df_new %>%
filter(unit %in% c(1:10)) %>%
ggplot(data = .,
aes(x = year, y = eci, group = unit, colour = last_eci)) +
geom_line(size = 0.01) +
geom_text(aes(label = first_year), nudge_x = -0.05, color = "black") +
geom_text(aes(label = last_year), nudge_x = 0.05, color = "black") +
scale_colour_gradientn(colours = terrain.colors(10)) +
geom_point(aes(colour = eci), size = 0.04)
For better readability, I have opted for a 10-line example, using the directlabels-package.
library(ggplot2)
library(dplyr)
library(directlabels)
set.seed(95)
l <- 10
df1 <- data.frame(unit=rep(1:l, 4),
year=rep(c(2012, 2013, 2014, 2015), each=l),
eci=round(runif(4*l, 1, l), digits = 0))
df2 <- df1 %>% filter (year == 2015) %>% select(-year, end = eci)
df <- left_join(df1,df2, by = "unit")
g <-
ggplot(data = df, aes(x=year,
y=eci,
group=unit)) +
geom_line(aes(colour=end), size=0.01) +
scale_colour_gradientn(colours = terrain.colors(10)) +
geom_point(aes(colour=eci), size=0.04) +
geom_dl(aes(label = unit,color = end), method = list(dl.combine("first.points", "last.points"), cex = 0.8))
g
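Half a year later, I think there is a much easier solution based on parcoord() (from the MASS package) applied to a wide df.
library(MASS)  # provides parcoord()
library(dplyr) # provides %>% and arrange()
set.seed(95)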
l <- 1000 # really 1000 observations per year this time
df1 <- data.frame(unit=rep(1:l, 4),
year=rep(c(2012, 2013, 2014, 2015), each=l),
eci=round(runif(4*l, 1, l), digits = 0))
df1 <- tidyr::spread(df1, year, eci) # change from long to wide
df1 <- df1 %>%
dplyr::arrange(desc(`2015`)) # Assign after which column (year) rows should be ordered
# create 10 different colors, each repeated 100 times
my_colors=rep(terrain.colors(11)[-1], each=100)
parcoord(df1[, c(2:5)] , col= my_colors)
This is more efficient and easily scalable.
I'm trying to create a series of bar charts (to be replicated for multiple sites) that highlight the difference between the main site and the satellite locations. I can come somewhat close using geom_point, but I'd like to have them represented as bar charts, where the bar starts at the lowest point, there are labels for the main site and satellite locations, as well as the difference between them. Here is some sample code and screenshots of what I have, and an idea of what I'd like it to look like.
library(ggplot2)
library(dplyr)
site <- c("Site A", "Main Site", "Site A", "Main Site", "Site A", "Main Site")
year <- c("2013", "2013", "2014", "2014","2015", "2015" )
value <- c(57, 74, 60, 50, 60, 68)
df <- data.frame (site, year, value)
df %>%
mutate (label = paste0(site, " (", value, ")")) %>%
ggplot (aes (x = year, y = value, group = site, colour = site)) +
geom_point (size = 0.5) +
scale_y_continuous(limits = c (0,100)) +
geom_text (aes(label = label))
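Using the comment from @Gregor I managed to come up with something that will work. It probably isn't the most elegant solution, but it will work for now.
df %>%
# drop spaces from site names and make year numeric so the columns below exist
mutate (site = gsub(" ", "", site), year = as.numeric(as.character(year))) %>%
tidyr::spread(site, value) %>%
mutate (diff = SiteA - MainSite) %>%
# the range syntax below is car::recode(), not dplyr::recode()
mutate (AboveBelow = car::recode (diff," -100:-1 = 'Below';
0 = 'No Difference';
1:100 = 'Above'")) %>%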
ggplot() +
scale_x_continuous(name = "Year", breaks = c (2013, 2014, 2015)) +
scale_y_continuous(name = "Percentage", limits = c(0,100)) +
geom_rect (aes (xmin = year - 0.33, xmax = year + 0.33, ymin = SiteA, ymax = MainSite, fill = AboveBelow)) +
geom_text (aes (x = year, y = ifelse (diff < 0, MainSite + 5, MainSite - 3), label = paste0("MainSite - ", MainSite))) +
geom_text (aes (x = year, y = ifelse (diff < 0, SiteA - 3, SiteA +5), label = paste0("SiteA - ", SiteA))) +
geom_text (aes (x = year, y = MainSite + (diff/2), label = diff)) +
scale_fill_manual(values = c("green", "red", "white" ))
Gives me this:
Following up on the comment from @gregor, you can try the code below (note that dcast is from reshape2, and note the heavy use of dplyr):
df %>%
dcast(year~site) %>%
mutate(midpt = (`Main Site` + `Site A`)/2
, dir = factor( (`Main Site` - `Site A`) > 0
, levels = c(FALSE,TRUE)
, labels = c("Negative", "Positive"))
, diff = abs(`Main Site` - `Site A`)) %>%
ggplot(aes(x = year
, y = midpt
, fill = dir
, height = diff)) +
geom_tile() +
scale_fill_manual(values = c("Positive" = "darkgreen"
, "Negative" = "red3"))
If you have more than 2 sites, you would likely want a more flexible solution, probably using dplyr directly.
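As a rough sketch of what such a dplyr-based version could look like for more than two sites: join every satellite site to the main site's value for the same year, then draw one rectangle per site and year in its own facet. "Site B" and its values are invented for illustration, and main_value, diff and dir are hypothetical column names.

```r
library(dplyr)
library(ggplot2)

# Example data; "Site B" and its values are made up for illustration
df <- data.frame(
  site = rep(c("Main Site", "Site A", "Site B"), times = 3),
  year = rep(c(2013, 2014, 2015), each = 3),
  value = c(74, 57, 80, 50, 60, 45, 68, 60, 68)
)

# Main-site value per year, to compare every satellite site against
main <- df %>%
  filter(site == "Main Site") %>%
  select(year, main_value = value)

df_diff <- df %>%
  filter(site != "Main Site") %>%
  left_join(main, by = "year") %>%
  mutate(
    diff = value - main_value,
    dir = case_when(
      diff > 0 ~ "Above",
      diff < 0 ~ "Below",
      TRUE ~ "No difference"
    )
  )

p <- ggplot(df_diff) +
  # bar spans from the main-site value to the satellite-site value
  geom_rect(aes(xmin = year - 0.33, xmax = year + 0.33,
                ymin = main_value, ymax = value, fill = dir)) +
  geom_text(aes(x = year, y = (value + main_value) / 2, label = diff)) +
  facet_wrap(~ site) +
  scale_fill_manual(values = c(Above = "darkgreen", Below = "red3",
                               `No difference` = "grey80")) +
  scale_x_continuous(name = "Year", breaks = 2013:2015) +
  scale_y_continuous(name = "Percentage", limits = c(0, 100))
p
```

Because the comparison is a join rather than a reshape to wide format, adding further sites only adds rows and facets, not new columns.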