I want to draw box plots for two groups (A and B) and display the mean value on each box.
My data has 30 rows and 2 columns: each row contains one value for group A (column 1) and one for group B (column 2).
I made the box plot with the base graphics boxplot function:
boxplot(Data_Q4$Group.A,Data_Q4$Group.B,names=c("group A","group B"))
but it seems that adding a mean point to the box plot requires ggplot2.
I tried many things, but I always get an error message:
! Aesthetics must be either length 1 or the same as the data (30): x...
It seems my problem comes from the y axis. I need it to take the data from columns A and B, but I don't know how to do this.
If my data had a value column and a group column (A or B for each row), it would work, but I don't know how to rearrange it so that I get 2 columns (value and group) and 60 rows holding the values of both groups.
Then I could run Data_Q4 %>% ggplot(aes(x=group,y=value))+geom_boxplot()+stat_summary(fun=mean)
and I think it would be fine.
So my problem is how to rearrange my data frame so that I can use ggplot2 to draw the box plot.
Thanks for your help!
Here is my data:
dput(Data_Q4) structure(list(Group.A = c(1.25310535, 0.5546414, 0.301283, 1.29312466, 0.99455579, 0.5141743, 2.0078324, 0.42224244, 2.17877257, 3.21778902, 0.55782935, 0.59461765, 0.97739581, 0.20986658, 0.30944786, 1.10593627, 0.77418776, 0.08967408, 1.10817666, 0.24726425, 1.57198685, 4.83281274, 0.43113213, 2.73038931, 1.13683142, 0.81336825, 0.83700649, 1.7847654, 2.31247163, 2.90988727), Group.B = c(2.94928948, 0.70302878, 0.69016263, 1.25069011, 0.43649776, 0.22462232, 0.39231981, 1.5763435, 0.42792839, 0.19608026, 0.37724368, 0.07071508, 0.03962611, 0.38580831, 2.63928857, 0.78220807, 0.66454197, 0.9568569, 0.02484568, 0.21600677, 0.88031195, 0.13567357, 0.68181725, 0.20116062, 0.4834762, 0.50102846, 0.15668497, 0.71992076, 0.68549794, 0.86150777)), class = "data.frame", row.names = c(NA, -30L))
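As an aside, base graphics can mark the means too, so ggplot2 is not strictly required for this. A minimal sketch, assuming a small made-up data frame (called wide here, not part of the question) with the same two-column layout as Data_Q4:

```r
# Made-up stand-in for Data_Q4: one column per group
wide <- data.frame(
  Group.A = c(1.25, 0.55, 0.30, 1.29, 0.99),
  Group.B = c(2.95, 0.70, 0.69, 1.25, 0.44)
)

# Column means, in the same order as the boxes
group_means <- colMeans(wide)

# Draw the boxes, then overlay one point per box (boxes sit at x = 1, 2)
boxplot(wide$Group.A, wide$Group.B, names = c("group A", "group B"))
points(seq_along(group_means), group_means, pch = 18, col = "darkred", cex = 1.5)
```

With the real data, the same two calls should work after replacing wide with Data_Q4.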
First I create some random data:
df <- data.frame(group = rep(c("A", "B"), 15),
value = runif(30, 0, 10))
You can use the following code:
library(tidyverse)
ggplot(data = df,
aes(x = group, y = value)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
Output:
The red dots represent the mean.
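If the numeric mean should also be printed next to each point, a second stat_summary layer with a text geom is one option. This is a sketch under the assumption of ggplot2 3.3.0 or later (for the fun argument and after_stat()); the data are random, as above:

```r
library(ggplot2)

# Random example data in long format (one row per observation)
set.seed(1)
df <- data.frame(group = rep(c("A", "B"), each = 15),
                 value = runif(30, 0, 10))

p <- ggplot(df, aes(x = group, y = value)) +
  geom_boxplot() +
  # mean as a point
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               color = "darkred") +
  # mean as a text label; after_stat(y) is the value computed by the stat
  stat_summary(fun = mean, geom = "text",
               aes(label = round(after_stat(y), 2)),
               vjust = -0.8, color = "darkred")
p
```

The vjust value only nudges the label above the point and may need tuning for other data.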
Using your data:
You can use the following code:
library(tidyverse)
library(reshape)
Data_Q4 %>%
melt() %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
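The reshaping step can also be done with tidyr::pivot_longer instead of reshape::melt. A minimal sketch, assuming a made-up data frame (wide) shaped like Data_Q4; the column names group and value are chosen here for illustration:

```r
library(tidyr)
library(ggplot2)

# Made-up stand-in for Data_Q4: one column per group
wide <- data.frame(
  Group.A = c(1.25, 0.55, 0.30),
  Group.B = c(2.95, 0.70, 0.69)
)

# Wide to long: one row per value, with the source column name in "group"
long <- pivot_longer(wide, cols = everything(),
                     names_to = "group", values_to = "value")

p <- ggplot(long, aes(x = group, y = value)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               color = "darkred")
p
```

Applied to the 30-row Data_Q4, this would give the 60-row, 2-column layout described in the question.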
On the same ggplot figure, I am trying to have the points (from geom_point), the lines (from geom_line) and the error bars (from geom_errorbar) on the same "plane" (i.e. not overlapping) for each factor level.
As you can see, the "layering" of the error bars does not follow the "layering" of the lines (not to mention the points).
Here is a reproducible example:
# reproducible example
# package
library(dplyr)
library(ggplot2)
# generate the data
set.seed(244)
d1 <- data.frame(time_serie = as.factor(rep(rep(1:3, each = 6), 3)),
treatment = as.factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
value = runif(54, 1, 10))
# create the error intervals
d2 <- d1 %>%
dplyr::group_by(time_serie,treatment) %>%
dplyr::summarise(mean_value = mean(value),
SE_value = sd(value/sqrt(length(value)))) %>%
as.data.frame()
# plot
p1 <- ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2)
p1
p1a <- p1 + geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value), width = .2, position = position_dodge(0.3), size =1) +
geom_point(aes(), position = position_dodge(0.3), size = 3) +
geom_line(aes(color = treatment), position=position_dodge(0.3), size =1)
p1a
Any idea?
Any help would be greatly appreciated :)
Thanks a lot!
Valérian
Up front: this is a partial answer that has two notable issues still to fix (see the end). Edit: the two issues have been resolved, see the far bottom.
I'll change the "dodge" slightly to clarify the point, identify an area of concern, and demonstrate a suggested workaround.
# generate the data
set.seed(244)
d1 <- data.frame(time_serie = as.factor(rep(rep(1:3, each = 6), 3)),
treatment = as.factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
value = runif(54, 1, 10))
# create the error intervals
d2 <- d1 %>%
dplyr::group_by(time_serie,treatment) %>%
dplyr::summarise(mean_value = mean(value),
SE_value = sd(value/sqrt(length(value)))) %>%
dplyr::arrange(desc(treatment)) %>%
as.data.frame()
# plot
ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2) +
geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, position = position_dodge(0.03), size = 2) +
geom_point(aes(), position = position_dodge(0.03), size = 3) +
geom_line(aes(color = treatment), position = position_dodge(0.03), size = 2)
Namely, I'll assume that we want HIGH (red) points/lines/error-bars as the top-most layer, masked by nothing. We can see a clear violation of this in the right-most bar: the red dot is over the green errorbar but under the green line.
Unless/until there is an aes(layer=..) aesthetic (there is not afaik), you need to add layers one treatment at a time. While one could hard-code this with nine geoms, you can automate this with lapply. Note that ggplot(.) + list(geom1,geom2,geom3) works just fine, even with nested lists.
I'll control the order of layers with rev(levels(d2$treatment)), assuming that you want LOW as the bottom-most layer (ergo added first). The order of geoms within the list is what defines their layers. Technically we still have a single treatment's errorbar, point, and line on different layers, but they are consecutive so appear to be the same.
ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2) +
lapply(rev(levels(d2$treatment)), function(trtmnt) {
list(
geom_errorbar(data = ~ subset(., treatment == trtmnt),
aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, position = position_dodge(0.03), size = 2),
geom_point(data = ~ subset(., treatment == trtmnt), aes(), position = position_dodge(0.03), size = 3),
geom_line(data = ~ subset(., treatment == trtmnt), position = position_dodge(0.03), size = 2)
)
})
(Side note: I use levels(d2$treatment) and data=~subset(., treatment==trtmnt) here, but that's just one way to do it. Another would be lapply(split(d2, d2$treatment), function(x) ...) and use data=x in all of the inner geoms. This latter method allows for multi-variable grouping, if desired. I see no immediate advantage to one over the other.)
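The split() variant mentioned in the side note could look roughly like this. It is only a sketch: the summary data are made up (same column names as d2), and rev() is applied to the split list so that the first factor level (HIGH) is added last and ends up on top:

```r
library(ggplot2)

# Made-up summary data with the same columns as d2
d2 <- data.frame(
  time_serie = factor(rep(1:3, times = 3)),
  treatment  = factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 3)),
  mean_value = c(5, 6, 7, 4, 6, 5, 3, 5, 6),
  SE_value   = 0.5
)

# One list of three geoms per treatment; each geom gets its own data subset
layers <- lapply(rev(split(d2, d2$treatment)), function(x) {
  list(
    geom_errorbar(data = x,
                  aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
                  width = 0.2),
    geom_point(data = x, size = 3),
    geom_line(data = x)
  )
})

# Adding the (nested) list appends the layers in list order
p <- ggplot(d2, aes(x = time_serie, y = mean_value,
                    color = treatment, group = treatment)) +
  layers
p
```

Splitting on several columns at once, for example split(d2, list(d2$treatment, d2$other)) with a hypothetical second column, is what makes this form convenient for multi-variable grouping.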
The problems with this:
1. The order of the legend is not consistent with the order of the factor levels; somehow that is lost. (To be clear, I don't demonstrate this very well here: I can move "medium" to the middle of the legend using levels<-, and it works with the non-lapply rendering code with incorrect layering, but it is again lost with the lapply-geoms.)
2. position_dodge no longer has awareness of the other treatments, so it does not dodge the other errorbars. The only way around this is to dodge manually before plotting, as shown below.
1: Order of legend elements
This one was solved in the question "lapply'd geoms lose factor-ordering": we just need to add scale_color_discrete(drop=FALSE).
2: Dodging
The dodge issue can be fixed by using real numerics in the x aesthetic. This is kind of a hack, as it is no longer done by ggplot2 but controlled externally. It's also applying an offset and not dodging, per se. But it does get the desired results.
d2$time_serie2 <- as.integer(as.character(d2$time_serie)) + as.numeric(d2$treatment)/10
ggplot(aes(x = time_serie2, y = mean_value, color = treatment, group = treatment), data = d2) +
lapply(rev(levels(d2$treatment)), function(trtmnt) {
list(
geom_errorbar(data = ~ subset(., treatment == trtmnt),
aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, size = 2),
geom_point(data = ~ subset(., treatment == trtmnt), aes(), size = 3),
geom_line(data = ~ subset(., treatment == trtmnt), size = 2)
)
}) +
scale_color_discrete(drop = FALSE)
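One side effect of the numeric x is that every group is shifted to the right and the axis shows continuous breaks. A possible refinement, sketched here with made-up summary data, is to center the offsets around zero and put the breaks back on the original time points (the 0.1 step is an arbitrary choice):

```r
library(ggplot2)

# Made-up summary data with the same columns as d2
d2 <- data.frame(
  time_serie = factor(rep(1:3, times = 3)),
  treatment  = factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 3)),
  mean_value = c(5, 6, 7, 4, 6, 5, 3, 5, 6),
  SE_value   = 0.5
)

# Offsets centered on zero: -0.1, 0, 0.1 for three treatments
n_trt  <- nlevels(d2$treatment)
offset <- (as.numeric(d2$treatment) - (n_trt + 1) / 2) * 0.1
d2$time_serie2 <- as.integer(as.character(d2$time_serie)) + offset

p <- ggplot(d2, aes(x = time_serie2, y = mean_value,
                    color = treatment, group = treatment)) +
  geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
                width = 0.05) +
  geom_point(size = 3) +
  geom_line() +
  # restore the original, discrete-looking axis labels
  scale_x_continuous(breaks = 1:3, labels = levels(d2$time_serie))
p
```

The per-treatment lapply layering from above can be combined with this; it is left out here to keep the sketch short.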
TL;DR: with plot labels using geom_label etc., is it possible to use different data for calculating the positions (with position_stack or similar functions) than for displaying the labels themselves? Or, less generally, is it possible to subset the label data after the positions have been calculated?
I have some time series data for many different subjects. Observations took place at multiple time points, which are the same for each subject. I would like to plot this data as a stacked area plot, where the height of a subject's curve at each time point corresponds to the observed value for that subject at that time point. Crucially, I also need to add labels to identify each subject.
However, the trivial solution of adding one label at each observation makes the plot unreadable, so I would like to limit the displayed labels to the "most important" subjects (the ones that have the highest peak), as well as only display a label at the respective peak. This subsetting of the labels themselves is not a problem either, but I cannot figure out how to then position the (subset of) labels correctly so they match with the stacked area chart.
Here is some example code, which should work out of the box with tidyverse installed, to illustrate my issue. First, we generate some data which has the same structure as mine:
library(tidyverse)
set.seed(0)
# Generate some data
num_subjects = 50
num_timepoints = 10
labels = paste(sample(words, num_subjects), sample(fruit, num_subjects), sep = "_")
col_names = c("name", paste0("timepoint_", c(1:num_timepoints)))
df = bind_rows(map(labels,
~c(., cumsum(rnorm(num_timepoints))) %>%
set_names(col_names))) %>%
pivot_longer(starts_with("timepoint_"), names_to = "timepoint", names_prefix = "timepoint_") %>%
mutate(across(all_of(c("timepoint", "value")), as.numeric)) %>%
mutate(value = if_else(value < 0, 0, value)) %>%
group_by(name) %>% mutate(peak = max(value)) %>% ungroup()
Now, we can trivially make a simple stacked area plot without labels:
# Plot (without labels)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
scale_fill_viridis_d()
Plot without labels (it appears that I currently cannot embed images, which is very unfortunate as they are extremely illustrative here...)
It is also not too hard to add non-specific labels to this data. They can easily be made to appear at the correct position — so the center of the label is at the middle of the area for each time point and subject — using position_stack:
# Plot (all labels, positions are correct but the plot is basically unreadable)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with a label at each observation
However, as noted before, the labels almost entirely obscure the plot itself. So my approach would be to only show labels at the peaks, and only for the 10 subjects with the highest peaks:
# Plot (only show labels at the peak for the 10 highest peaks, readable but positions are wrong)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>% slice_max(value, n = 1) %>% ungroup() %>%
slice_max(value, n = max_labels)
ggplot(df,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(data = df_labels, mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with only a subset of labels
This code also works fine, but it is apparent that the labels no longer show up at the correct positions, but are instead too low on the plot, especially for the subjects which would otherwise be higher up. (The only subject where the position is correct is work_eggplant.) This makes perfect sense, as the data used for calculation of position_stack are now only a subset of the original data, so the observations which would receive no labels are not considered when stacking. This can be illustrated by zeroing out all the observations which would not receive a label:
df_zeroed = anti_join(df %>% mutate(value = 0),
df_labels,
by = c("name", "timepoint")) %>% bind_rows(df_labels)
ggplot(df_zeroed,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = factor(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(data = df_labels, mapping = aes(label = name), show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d()
Plot with unlabeled observations zeroed out
So now my question is, how can this problem be solved? Is there a way to use the original data for the positioning, but the subset data for the actual display of the labels?
Maybe this is what you are looking for. To achieve the desired result you could:
1. use the whole dataset for plotting the labels to get the right positions,
2. use an empty string "" for the unwanted labels,
3. set the fill and color of the unwanted labels to "transparent".
# Plot (only show labels at the peak for the 10 highest peaks)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>%
slice_max(value, n = 1) %>%
ungroup() %>%
slice_max(value, n = max_labels) %>%
mutate(label = name)
df1 <- df %>%
left_join(df_labels) %>%
replace_na(list(label = ""))
#> Joining, by = c("name", "timepoint", "value", "peak")
ggplot(df1,
mapping = aes(x = factor(timepoint), y = value, group = name, fill = as.character(peak))) +
geom_area(show.legend = FALSE, position = "stack", colour = "gray25") +
geom_label(mapping = aes(
label = label,
fill = ifelse(label != "", as.character(peak), NA_character_),
color = ifelse(label != "", "black", NA_character_)),
show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d(na.value = "transparent") +
scale_color_manual(values = c("black" = "black"), na.value = "transparent")
EDIT: If you want the fill colors to correspond to the value of peak, then
a simple solution would be to map peak on fill instead of factor(peak) and use fill = ifelse(label != "", peak, NA_real_) in geom_label. However, in that case you have to switch to a continuous fill scale.
As I guess that you had a good reason to use a discrete scale, another option would be to make peak an ordered factor. This approach, however, is not that simple. To make it work, I first reorder factor(peak) according to peak, add an additional NA level and use an auxiliary variable peak1 to fill the labels. However, as we now have two different variables mapped on fill, I would suggest using a second fill scale via ggnewscale::new_scale_fill to achieve the desired result:
library(tidyverse)
set.seed(0)
#cumsum(rnorm(num_timepoints)) * 3
# Generate some data
num_subjects = 50
num_timepoints = 10
labels = paste(sample(words, num_subjects), sample(fruit, num_subjects), sep = "_")
col_names = c("name", paste0("timepoint_", c(1:num_timepoints)))
df = bind_rows(map(labels,
~c(., cumsum(rnorm(num_timepoints)) * 3) %>%
set_names(col_names))) %>%
pivot_longer(starts_with("timepoint_"), names_to = "timepoint", names_prefix = "timepoint_") %>%
mutate(across(all_of(c("timepoint", "value")), as.numeric)) %>%
mutate(value = if_else(value < 0, 0, value)) %>%
group_by(name) %>% mutate(peak = max(value)) %>% ungroup()
# Plot (only show labels at the peak for the 10 highest peaks)
max_labels = 10 # how many labels to show
df_labels = df %>%
group_by(name) %>%
slice_max(value, n = 1) %>%
ungroup() %>%
slice_max(value, n = max_labels) %>%
mutate(label = name)
df1 <- df %>%
left_join(df_labels) %>%
replace_na(list(label = ""))
#> Joining, by = c("name", "timepoint", "value", "peak")
df2 <- df1 %>%
mutate(
# Make ordered factor
peak = fct_reorder(factor(peak), peak),
# Add NA level to peak
peak = fct_expand(peak, NA),
# Auxiliary variable to set the fill to NA for non-desired labels
peak1 = if_else(label != "", peak, factor(NA)))
ggplot(df2, mapping = aes(x = factor(timepoint), y = value, group = name, fill = peak)) +
geom_area(show.legend = TRUE, position = "stack", colour = "gray25") +
scale_fill_viridis_d(na.value = "transparent") +
# Add a second fill scale
ggnewscale::new_scale_fill() +
geom_label(mapping = aes(
label = label,
fill = peak1,
color = ifelse(label != "", "black", NA_character_)),
show.legend = FALSE, position = position_stack(vjust = 0.5)) +
scale_fill_viridis_d(na.value = "transparent") +
scale_color_manual(values = c("black" = "black"), na.value = "transparent")
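A different route, sketched here as an alternative and not taken from the answer above, is to compute the stacked label positions yourself from the full data and only then subset. This assumes ggplot2's default stacking order, in which the group with the highest group index sits at the bottom of the stack; the tiny data set and the column name label_y are made up for illustration:

```r
library(dplyr)
library(ggplot2)

# Tiny made-up data: 3 subjects, 3 time points
df <- data.frame(
  name      = rep(c("a", "b", "c"), each = 3),
  timepoint = rep(1:3, times = 3),
  value     = c(1, 3, 2,  2, 2, 5,  4, 1, 1)
)

# Stack manually within each time point, bottom-up in reverse group order
# (assumption: this mirrors the default order used by position_stack)
stacked <- df %>%
  group_by(timepoint) %>%
  arrange(desc(name), .by_group = TRUE) %>%
  mutate(ymax    = cumsum(value),
         label_y = ymax - value / 2) %>%   # vertical middle of each band
  ungroup()

# Subset AFTER the positions are known: one label per subject, at its peak
df_labels <- stacked %>%
  group_by(name) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

p <- ggplot(df, aes(x = timepoint, y = value, group = name, fill = name)) +
  geom_area(position = "stack", colour = "gray25") +
  # labels use the precomputed y, so no position adjustment is applied here
  geom_label(data = df_labels, aes(y = label_y, label = name),
             show.legend = FALSE)
p
```

A further slice_max(value, n = max_labels) on df_labels would then restrict the labels to the highest peaks without moving them.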
Using the example code from the ggpubr package, the ggdotchart function does not create separate segments as shown in the example; instead there is only a single segment, although the dots seem to be placed correctly. Does anyone have any tips on what the problem may be? I thought it might be due to factors, or tibbles vs. data frames, but I haven't been able to determine the cause.
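Code:
library(dplyr)   # %>%, filter, group_by, summarise
library(ggplot2) # diamonds data set
library(ggpubr)  # ggdotchart, theme_pubclean
df <- diamonds %>%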
filter(color %in% c("J", "D")) %>%
group_by(cut, color) %>%
summarise(counts = n())
ggdotchart(df, x = "cut", y ="counts",
color = "color", palette = "jco", size = 3,
add = "segment",
add.params = list(color = "lightgray", size = 1.5),
position = position_dodge(0.3),
ggtheme = theme_pubclean()
)
With the expected output of:
But instead I am getting:
Here is a way to get your desired plot without ggpubr::ggdotchart. The issue seems to be that geom_segment does not allow dodging, as discussed here: R - ggplot dodging geom_lines and here: how to jitter/dodge geom_segments so they remain parallel?.
# your data
df <- diamonds %>%
filter(color %in% c("J", "D")) %>%
group_by(cut, color) %>%
summarise(counts = n())
The first step is to expand your data. We will need this when we call geom_line, which allows for dodging. I took this idea from Stibu's answer. We create a copy of df, called df2, and set its counts column to 0. Finally we use bind_rows to create a single data frame from df and df2.
df2 <- df
df2$counts <- 0
df_out <- dplyr::bind_rows(df, df2)
df_out
Then I use ggplot to create / replicate your desired output.
ggplot(df_out, aes(x = cut, y = counts)) +
geom_line(
aes(col = color), # needed for dodging, we'll later change colors to "lightgrey"
position = position_dodge(width = 0.3),
show.legend = FALSE,
size = 1.5
) +
geom_point(
aes(fill = color),
data = subset(df_out, counts > 0),
col = "transparent",
shape = 21,
size = 3,
position = position_dodge(width = 0.3)
) +
scale_color_manual(values = c("lightgray", "lightgray")) + #change line colors
ggpubr::fill_palette(palette = "jco") +
ggpubr::theme_pubclean()
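Another option that avoids duplicating the rows is geom_linerange, which, unlike geom_segment, can be dodged. This is a sketch with made-up counts (not the real diamonds summary) and assumes ggplot2 3.4.0 or later for the linewidth argument:

```r
library(ggplot2)

# Made-up counts with the same structure as the summarised diamonds data
counts_df <- data.frame(
  cut    = factor(rep(c("Fair", "Good", "Ideal"), each = 2),
                  levels = c("Fair", "Good", "Ideal")),
  color  = rep(c("D", "J"), times = 3),
  counts = c(150, 100, 600, 300, 2800, 900)
)

p <- ggplot(counts_df, aes(x = cut, y = counts, group = color)) +
  # vertical range from 0 to the count, dodged by group
  geom_linerange(aes(ymin = 0, ymax = counts),
                 colour = "lightgray", linewidth = 1.5,
                 position = position_dodge(width = 0.3)) +
  # points dodged by the same width so they sit on top of their ranges
  geom_point(aes(colour = color), size = 3,
             position = position_dodge(width = 0.3))
p
```

The shared group aesthetic is what lets both layers dodge consistently.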
There is an extra "group" argument you need!
df <- diamonds %>%
dplyr::filter(color %in% c("J", "D")) %>%
dplyr::group_by(cut, color) %>%
dplyr::summarise(counts = n())
ggdotchart(df, x = "cut", y ="counts",
color = "color", group="color", # here it is
palette = "jco", size = 3,
add = "segment",
add.params = list(color = "lightgray", size = 1.5),
position = position_dodge(0.3),
ggtheme = theme_pubclean()
)
I am new to R and looking to make a few adjustments to a side by side box plot I made in R. Below is some simplified code.
name <- c('a','a','a','a','a','a','a','a','a','a','b','b','b','b','c','c','c','c','c','c','c','c','c','c')
category <- c('y','y','y','y','y','x','x','x','x','x','x','y','x','y','x','x','x','x','x','y','y','y','y','y')
value <- c(10,20,30,40,50,60,70,80,90,100,40,50,60,70,10,20,30,40,50,60,70,80,90,100)
graphA <- data.frame(name, category, value)
ggplot(graphA, aes(x=name, y=value, fill = category))+
geom_boxplot(width = 0.5, position=position_dodge(0.75))+
scale_fill_grey(start = 0.8, end = 0.5)
Which looks great. But I want to reverse the order of the categories so that the 'y' category is plotted first. I tried running this line of code:
graphA$category <- factor(graphA$category, values = c('y','x'))
But I get an error that reads
"Error in factor(graphA$category, values = c("y", "x")) :
unused argument (values = c("y", "x"))"
I would also like to replace boxes for category b with two sets of colored dots because for that category I do not have enough points to call it a data distribution.
Any guidance you can provide is greatly appreciated!
First, you need the levels argument in factor
graphA$category <- factor(graphA$category, levels = c('y','x'))
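To see what the levels argument changes, here is a small base R sketch with a made-up vector: the values stay the same, only the level order (and therefore the plotting order) differs.

```r
category <- c("y", "x", "x", "y")

# Default: levels are sorted alphabetically
f_default <- factor(category)
levels(f_default)
#> [1] "x" "y"

# Explicit order: "y" becomes the first level
f_custom <- factor(category, levels = c("y", "x"))
levels(f_custom)
#> [1] "y" "x"

# The underlying integer codes follow the level order
as.integer(f_custom)
#> [1] 1 2 2 1
```

ggplot2 draws discrete fill groups in level order, which is why reordering the levels moves the 'y' boxes to the left of each pair.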
And here is a way you could plot the values for name "b" as points. Note that fill only has an effect for shapes 21 to 25 in geom_point.
ggplot(subset(graphA, name != "b"), aes(x = name, y = value, fill = category)) +
geom_boxplot(width = 0.5, position = position_dodge(0.75)) +
geom_point(
data = subset(graphA, name == "b"),
size = 4,
shape = 21,
position = position_dodge(width = .4)
) +
scale_fill_grey(start = 0.8, end = 0.5)