I have the following dataset:
Data:
test <- data.frame(
cluster = c("1", "2", "3", "1", "2", "3", "1", "2", "3"),
variable = c("age", "age", "age", "speed", "speed", "speed", "price", "price", "price"),
value = c(0.33,0.12,0.98,0.77,0.7,0.6,0.11,0.04,0.15))
test$variable <- factor(test$variable, levels = c("age","speed","price"))
Code:
library(ggplot2)
library(dplyr) # provides the %>% pipe
test %>%
ggplot(aes(x = cluster, y = value ,fill = variable ,group = (cluster))) +
geom_col(position = "stack", color = "black", alpha = .75) +
coord_flip()
I am trying to order the bars by the value of one level of variable, for example "age". This is the code I used to draw the chart. I already tried the order function, but that doesn't seem to be possible within the "fill" argument.
I think the problem is that "age" itself is just a value of "variable".
It should look like the following:
Is it at all possible to display something like this with ggplot, or do I need another package?
You've adjusted the level order of variable, which will affect the order of the fill colors within each bar. To change the order of the axis where you mapped x = cluster, we need to adjust the order of the levels of cluster. As a one-off, you can do this manually. It's a little bit more work to do it responsively:
Manually:
test$cluster = factor(test$cluster, levels = c(2, 1, 3))
Calculating the right order:
library(dplyr)
level_order = test %>%
filter(variable == "age") %>%
group_by(cluster) %>%
summarize(val = sum(value), .groups = "drop") %>%
arrange(val) %>%
pull(cluster)
test = mutate(test, cluster = factor(cluster, levels = level_order))
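As a quick check of the computed order, here is a minimal base R sketch (not part of the original answer) that rebuilds the example data and derives the same level order without dplyr. The helper name age_rows is only illustrative.

```r
# Rebuild the example data from the question
test <- data.frame(
  cluster = c("1", "2", "3", "1", "2", "3", "1", "2", "3"),
  variable = c("age", "age", "age", "speed", "speed", "speed", "price", "price", "price"),
  value = c(0.33, 0.12, 0.98, 0.77, 0.7, 0.6, 0.11, 0.04, 0.15)
)

# Keep only the rows for the variable that should drive the ordering
age_rows <- test[test$variable == "age", ]

# Clusters sorted by their "age" value, smallest first
level_order <- age_rows$cluster[order(age_rows$value)]

# Apply the order to the column that is mapped to x
test$cluster <- factor(test$cluster, levels = level_order)
levels(test$cluster)
```

With this data the levels come out as 2, 1, 3, which matches the manual version. Sorting with order(-age_rows$value) instead would reverse the bars.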
TL;DR: with plot labels using geom_label etc., is it possible to use different data for the calculation of positions of using position_stack or similar functions, than for the display of the label itself? Or, less generally, is it possible to subset the label data after positions have been calculated?
I have some time series data for many different subjects. Observations took place at multiple time points, which are the same for each subject. I would like to plot this data as a stacked area plot, where the height of a subject's curve at each time point corresponds to the observed value for that subject at that time point. Crucially, I also need to add labels to identify each subject.
However, the trivial solution of adding one label at each observation makes the plot unreadable, so I would like to limit the displayed labels to the "most important" subjects (the ones that have the highest peak), as well as only display a label at the respective peak. This subsetting of the labels themselves is not a problem either, but I cannot figure out how to then position the (subset of) labels correctly so they match with the stacked area chart.
Here is some example code, which should work out of the box with tidyverse installed, to illustrate my issue. First, we generate some data which has the same structure as mine:
library(tidyverse)
set.seed(0)
# Generate some data
num_subjects = 50
num_timepoints = 10
labels = paste(sample(words, num_subjects), sample(fruit, num_subjects), sep = "_")
col_names = c("name", paste0("timepoint_", c(1:num_timepoints)))
df = bind_rows(map(labels,
~c(., cumsum(rnorm(num_timepoints))) %>%
set_names(col_names))) %>%
pivot_longer(starts_with("timepoint_"), names_to = "timepoint", names_prefix = "timepoint_") %>%
mutate(across(all_of(c("timepoint", "value")), as.numeric)) %>%
mutate(value = if_else(value < 0, 0, value)) %>%
group_by(name) %>% mutate(peak = max(value)) %>% ungroup()
Now, we can trivially make a simple stacked area plot without labels:
# Plot (without labels)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
scale_fill_viridis_d()
Plot without labels (it appears that I currently cannot embed images, which is very unfortunate as they are extremely illustrative here...)
It is also not too hard to add non-specific labels to this data. They can easily be made to appear at the correct position — so the center of the label is at the middle of the area for each time point and subject — using position_stack:
# Plot (all labels, positions are correct but the plot is basically unreadable)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with a label at each observation
However, as noted before, the labels almost entirely obscure the plot itself. So my approach would be to only show labels at the peaks, and only for the 10 subjects with the highest peaks:
# Plot (only show labels at the peak for the 10 highest peaks, readable but positions are wrong)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>% slice_max(value, n = 1) %>% ungroup() %>%
slice_max(value, n = max_labels)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(data = df_labels, mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with only a subset of labels
This code also works fine, but it is apparent that the labels no longer show up at the correct positions, but are instead too low on the plot, especially for the subjects which would otherwise be higher up. (The only subject where the position is correct is work_eggplant.) This makes perfect sense, as the data used for calculation of position_stack are now only a subset of the original data, so the observations which would receive no labels are not considered when stacking. This can be illustrated by zeroing out all the observations which would not receive a label:
df_zeroed = anti_join(df %>% mutate(value = 0),
df_labels,
by = c("name", "timepoint")) %>% bind_rows(df_labels)
ggplot(df_zeroed,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(data = df_labels, mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with unlabeled observations zeroed out
So now my question is, how can this problem be solved? Is there a way to use the original data for the positioning, but the subset data for the actual display of the labels?
Maybe this is what you are looking for. To achieve the desired result you could:
- use the whole dataset for plotting the labels to get the right positions,
- use an empty string "" for the non-desired labels,
- set the fill and color of non-desired labels to "transparent".
# Plot (only show labels at the peak for the 10 highest peaks; the full data is used so the positions stay correct)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>%
slice_max(value, n = 1) %>%
ungroup() %>%
slice_max(value, n = max_labels) %>%
mutate(label = name)
df1 <- df %>%
left_join(df_labels) %>%
replace_na(list(label = ""))
#> Joining, by = c("name", "timepoint", "value", "peak")
ggplot(df1,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = as.character(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(mapping = aes(
label = label,
fill = ifelse(label != "", as.character(peak), NA_character_),
color = ifelse(label != "", "black", NA_character_)),
show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d(na.value = "transparent") +
scale_color_manual(values = c("black" = "black"), na.value = "transparent")
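The join step above can be checked on a tiny made-up dataset. This is only a sketch under assumptions: the names toy, toy_labels and toy_full and all the values are invented for illustration, and with_ties = FALSE is added so the result does not depend on ties.

```r
library(dplyr)
library(tidyr)

# Toy data: three subjects observed at three time points (made-up values)
toy <- tibble(
  name = rep(c("a", "b", "c"), each = 3),
  timepoint = rep(1:3, times = 3),
  value = c(1, 5, 2, 3, 4, 9, 2, 1, 0.5)
)

# One row per subject at its peak, then keep the two highest peaks
toy_labels <- toy %>%
  group_by(name) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  slice_max(value, n = 2, with_ties = FALSE) %>%
  mutate(label = name)

# Join back onto the full data so every row survives; unlabeled rows get ""
toy_full <- toy %>%
  left_join(toy_labels, by = c("name", "timepoint", "value")) %>%
  replace_na(list(label = ""))
```

Because toy_full still has one row per observation, position_stack() sees the complete stack, while only the rows with a non-empty label draw anything visible.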
EDIT: If you want the fill colors to correspond to the value of peak, then
- a simple solution would be to map peak on fill instead of factor(peak) and use fill = ifelse(label != "", peak, NA_real_) in geom_label. However, in that case you have to switch to a continuous fill scale.
- as I guess you had a good reason to use a discrete scale, another option would be to make peak an ordered factor. This approach, however, is not that simple. To make it work, I first reorder factor(peak) according to peak, add an additional NA level, and use an auxiliary variable peak1 to fill the labels. However, as we then have two different variables mapped on fill, I would suggest using a second fill scale via ggnewscale::new_scale_fill to achieve the desired result:
library(tidyverse)
set.seed(0)
#cumsum(rnorm(num_timepoints)) * 3
# Generate some data
num_subjects = 50
num_timepoints = 10
labels = paste(sample(words, num_subjects), sample(fruit, num_subjects), sep = "_")
col_names = c("name", paste0("timepoint_", c(1:num_timepoints)))
df = bind_rows(map(labels,
~c(., cumsum(rnorm(num_timepoints)) * 3) %>%
set_names(col_names))) %>%
pivot_longer(starts_with("timepoint_"), names_to = "timepoint", names_prefix = "timepoint_") %>%
mutate(across(all_of(c("timepoint", "value")), as.numeric)) %>%
mutate(value = if_else(value < 0, 0, value)) %>%
group_by(name) %>% mutate(peak = max(value)) %>% ungroup()
# Plot (only show labels at the peak for the 10 highest peaks; the full data is used so the positions stay correct)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>%
slice_max(value, n = 1) %>%
ungroup() %>%
slice_max(value, n = max_labels) %>%
mutate(label = name)
df1 <- df %>%
left_join(df_labels) %>%
replace_na(list(label = ""))
#> Joining, by = c("name", "timepoint", "value", "peak")
df2 <- df1 %>%
mutate(
# Make ordered factor
peak = fct_reorder(factor(peak), peak),
# Add NA level to peak
peak = fct_expand(peak, NA),
# Auxiliary variable to set the fill to NA for non-desired labels
peak1 = if_else(label != "", peak, factor(NA)))
ggplot(df2, mapping = aes(x = factor(timepoint), y = value, group = name, fill = peak)) +
geom_area(show.legend = TRUE, position = "stack", colour = "gray25") +
scale_fill_viridis_d(na.value = "transparent") +
# Add a second fill scale
ggnewscale::new_scale_fill() +
geom_label(mapping = aes(
label = label,
fill = peak1,
color = ifelse(label != "", "black", NA_character_)),
show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d(na.value = "transparent") +
scale_color_manual(values = c("black" = "black"), na.value = "transparent")
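The simpler first option from the edit (mapping peak itself to fill and switching to a continuous scale) is not spelled out in code above. A minimal sketch of how it might look follows, assuming a data frame shaped like df1. The toy data and its values are made up for illustration.

```r
library(ggplot2)

# Toy data in the same shape as df1: one row per subject and time point,
# with `label` non-empty only where a label should be drawn (made-up values)
toy <- data.frame(
  name = rep(c("a", "b", "c"), each = 3),
  timepoint = rep(1:3, times = 3),
  value = c(1, 5, 2, 3, 4, 9, 2, 1, 0.5),
  peak = rep(c(5, 9, 2), each = 3),
  label = c("", "a", "", "", "", "b", "", "", "")
)

p <- ggplot(toy, aes(x = factor(timepoint), y = value, group = name, fill = peak)) +
  geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
  geom_label(
    aes(label = label,
        # Numeric fill for wanted labels, NA (drawn transparent) otherwise
        fill = ifelse(label != "", peak, NA_real_),
        color = ifelse(label != "", "black", NA_character_)),
    show.legend = FALSE, position = position_stack(vjust = 0.5)
  ) +
  # Continuous scale, since peak is now mapped as a number
  scale_fill_viridis_c(na.value = "transparent") +
  scale_color_manual(values = c("black" = "black"), na.value = "transparent")
```

Calling print(p) draws the plot. Since both layers share one continuous fill scale here, no second scale from ggnewscale is needed for this variant.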
I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse)
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # lightness fraction, decreasing with group index (the +2 keeps it above 0)
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
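One small variation, offered as a sketch and not as part of the original answer: scale_fill_manual() matches an unnamed values vector to the factor levels by position, so a named vector can be a little more robust if the lookup table is ever reordered. The table below is a hypothetical stand-in for area_colors with made-up hue and lightness values.

```r
# Hypothetical lookup table in the same shape as area_colors:
# two categories (hues 180 and 360), three groups each, getting darker
area_colors <- data.frame(
  group = letters[1:6],
  hue = rep(c(180, 360), each = 3),
  light = rep(c(80, 60, 40), times = 2)
)

# hcl() from grDevices turns hue and lightness into hex color strings
area_colors$hex <- hcl(h = area_colors$hue, l = area_colors$light)

# Named vector: names are the factor levels, values are the colors
fill_values <- setNames(area_colors$hex, area_colors$group)
```

The result could then be passed as scale_fill_manual(values = fill_values), in which case the colors are matched to levels by name.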
The legend for my bar graph currently lists all the items in the graph in one long list. I would like to have the legend group itself by each column.
The number of columns is dynamic so the legend must be able to adjust accordingly.
library("phyloseq"); packageVersion("phyloseq")
library(ggplot2)
library(scales)
data("GlobalPatterns")
TopNOTUs <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:50])
gp.ch <- prune_species(TopNOTUs, GlobalPatterns)
gp.ch = subset_taxa(gp.ch, Genus != "NA")
mdf = psmelt(gp.ch)
# Create a ggplot similar to
library("ggplot2")
mdf$group <- paste0(mdf$Phylum, "-", mdf$Genus, sep = "")
colours <-ColourPalleteMulti(mdf, "Phylum", "Genus")
# Plot results
ggplot(mdf, aes(Phylum)) +
geom_bar(aes(fill = group), colour = "grey", position = "stack")
Right now the legend prints the items:
Actinobacteria-Bifidobacterium
Actinobacteria-Rothia
Bacteriodetes-Alistipes
Bacteriodetes-Bacteroides
...
I would like it to print:
Actinobacteria
-Bifidobacterium
-Rothia
Bacteriodetes
-Alistipes
-Bacteroides
...
This is hacky but might work for you. First, using mtcars dataset, I add dummy rows to the data representing the groupings, then assign a factor level to each of the groupings and component categories. Finally, I hack the alpha in the legend so that grouping headers have transparent colors and look hidden.
# Fake data sample
library(tidyverse)
cars_sample <- mtcars %>%
rownames_to_column(var = "name") %>%
mutate(make = word(name, end = 1),
model = word(name, start = 2, end = -1)) %>%
filter(make %in% c("Mazda", "Merc", "Hornet")) %>%
select(name, make, model, mpg, wt)
# Add rows for groups and make a factor for each group and each component
cars_sample_fct <- cars_sample %>%
bind_rows( cars_sample %>% count(make) %>% mutate(model = make, name = "")) %>%
arrange(make, name) %>%
mutate(name_fct = fct_inorder(if_else(name == "", make, paste0("- ", model))))
# Plot with transparent grouping legend labels
ggplot(cars_sample_fct, aes(wt, mpg, color = name_fct)) +
geom_point() +
scale_color_discrete(name = "Car") +
guides(color = guide_legend(
override.aes = list(size = 5,
alpha = cars_sample_fct$name != "")))
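The alpha vector above is built from the data rows, which lines up with the legend keys here because every row has its own factor level. As a hedged alternative, the same vector can be derived from the factor levels directly, so it still has one entry per legend key when several rows share a level. The levels below are a hypothetical stand-in for name_fct.

```r
# Hypothetical legend levels in the same style as name_fct:
# group headers have no prefix, members start with "- "
legend_levels <- c("Hornet", "- 4 Drive", "- Sportabout", "Mazda", "- RX4", "- RX4 Wag")
name_fct <- factor(legend_levels, levels = legend_levels)

# One alpha per legend key: 0 hides a header's key, 1 shows a member's key
legend_alpha <- as.numeric(startsWith(levels(name_fct), "- "))
legend_alpha
```

This vector could be used as override.aes = list(size = 5, alpha = legend_alpha) in guide_legend().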
I wanted to do something like this
Add multiple comparisons using ggsignif or ggpubr for subgroups with no labels on x-axis
I got this far:
Packages and Example data
library(tidyverse)
library(ggpubr)
library(ggpol)
library(ggsignif)
example.df <- data.frame(species = sample(c("primate", "non-primate"), 50, replace = TRUE),
treated = sample(c("Yes", "No"), 50, replace = TRUE),
gender = sample(c("male", "female"), 50, replace = TRUE),
var1 = rnorm(50, 100, 5))
Levels
example.df$species <- factor(example.df$species,
levels = c("primate", "non-primate"), labels = c("p", "np"))
example.df$treated <- factor(example.df$treated,
levels = c("No", "Yes"), labels = c("N","Y"))
example.df$gender <- factor(example.df$gender,
levels = c("male", "female"), labels = c("M", "F"))
Since I have had no luck getting either ggsignif or ggpubr to place the significance brackets correctly when the groups they refer to are not explicitly named on the x-axis (they are subgroups of each x-axis variable and are indicated only in the fill legend), I tried this instead.
example.df %>%
unite(groups, species, treated, remove = F, sep= "\n") %>%
{ggplot(., aes(groups, var1, fill= treated)) +
geom_boxjitter() +
facet_wrap(~ gender, scales = "free") +
ggsignif::geom_signif(comparisons = combn(sort(unique(.$groups)), 2, simplify = F),
step_increase = 0.1)}
I get this,
Faceted plot with significance values computed for every group
However, the order of the combined groups on the x-axis is not how I want it. I want to order it as p/N, np/N, p/Y, np/Y for each facet.
How do I do this? Any help is greatly appreciated.
Edit: Creating a new variable using mutate and making it an ordered factor with my preferred plotting order solves this.
example.df %>%
unite(groups, species, treated, remove = F, sep= "\n") %>%
mutate(groups2 = factor(groups, levels = c("p\nN", "np\nN", "p\nY", "np\nY"),
ordered = TRUE)) %>%
{ggplot(., aes(groups2, var1, fill= treated)) +
geom_boxjitter() +
facet_wrap(~gender,scales = "free") +
ggsignif::geom_signif(comparisons = combn(sort(unique(.$groups2)), 2, simplify = F),
step_increase = 0.1)}
But I am still looking for solutions to not having to use unite at all and keeping the original factors and still get significance values to plot using ggsignif or ggpubr.
The default parameters for interaction (from the base package) appear to give the factor ordering you are looking for:
example.df %>%
mutate(groups = interaction(species, treated, sep = "\n")) %>%
{ggplot(., aes(groups, var1, fill= treated)) +
geom_boxjitter() +
facet_wrap(~ gender, scales = "free") +
geom_signif(comparisons = combn(sort(as.character(unique(.$groups))), 2, simplify = F),
step_increase = 0.1)}
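To see why this works, here is a small base R sketch using the same factor levels as the question. With the default lex.order = FALSE, interaction() lets the first factor vary fastest, which yields the requested p/N, np/N, p/Y, np/Y order.

```r
species <- factor(c("p", "np", "p", "np"), levels = c("p", "np"))
treated <- factor(c("N", "N", "Y", "Y"), levels = c("N", "Y"))

# With the default lex.order = FALSE the first factor varies fastest
groups <- interaction(species, treated, sep = "\n")
levels(groups)
```

Swapping the argument order, or setting lex.order = TRUE, would group by species first instead.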
I have example data below:
eg_data <- data.frame(period = c("1&2", "1&2","1", "1", "2","2"), size = c("big", "small", "big", "small","big", "small"), trip=c(1000, 250, 600, 100, 400, 150))
I want to make a stacked bar chart, where I have both periods as the first bar, period one as second, and period two as third. This is specified in the data as they are entered, but when I run the ggplot bar command, R decides that period one is a better candidate for first position.
ggplot() +
geom_bar(data = eg_data, aes(y = trip, x = period, fill = size),
stat = "identity",position = 'stack')
First, why does R feel the need to display data in an order other than how I fed it in, and second, how do I correct this, i.e. specify which groupings I want and in what order?
All help is appreciated, thank you.
We can convert the column to a factor whose levels are the unique values of that column. The levels are then not sorted alphabetically but kept in the order in which each value of 'period' first appears in the data.
library(tidyverse)
eg_data %>%
mutate(period = factor(period, levels = unique(period))) %>%
ggplot() +
geom_bar(aes(y = trip, x = period, fill = size),
stat="identity",position='stack')
EDIT: a solution with base R would be as follows:
eg_data$period <- factor(eg_data$period, levels = c("1&2", "1", "2"))
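On the "why" part of the question: a character column has no inherent order, so when it is treated as a discrete variable the values end up sorted, just as factor() sorts its levels by default. A minimal sketch with the period values from the question:

```r
period <- c("1&2", "1&2", "1", "1", "2", "2")

# Default: levels are the sorted unique values, so "1" comes before "1&2"
levels(factor(period))

# Levels in order of first appearance keep the data's own order
levels(factor(period, levels = unique(period)))
```

forcats::fct_inorder(period) gives the same first-appearance order, if the tidyverse is already loaded.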