How to specify unique geom assignments to facets? - r

Below I have simulated a dataset where an assignment was given to 5 groups of individuals on 5 different days (a new group with 200 new individuals each day). TrialStartDate denotes the date on which the assignment was given to each individual (ID), and TrialFinishDate denotes when each individual finished the assignment.
set.seed(123)
data <-
data.frame(
TrialStartDate = rep(c(sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by="day"), 5)), each = 200),
TrialFinishDate = sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by = "day"), 1000,replace = T),
ID = seq(1,1000, 1)
)
I am interested in comparing how long individuals took to complete the trial depending on when they started the trial (i.e., assuming TrialStartDate has an effect on the length of time it takes to complete the trial).
To visualize this, I want to make a barplot showing counts of IDs on each TrialFinishDate where bars are colored by TrialStartDate (since each TrialStartDate acts as a grouping variable). The best I have come up with so far is by faceting like this:
library(dplyr)
library(ggplot2)
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
facet_wrap(~TrialStartDate, ncol = 1)
However, I also want to add a vertical line to each facet showing when the TrialStartDate was for each group (preferably colored the same as the bars). When attempting to add vertical lines with geom_vline, it adds all the lines to each facet:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(xintercept = unique(data$TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
How can we make the vertical lines unique to the respective group in each facet?

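You're specifying xintercept outside of aes, so the faceting is not respected: a constant vector of intercepts is drawn in every panel. Mapping it inside aes ties each intercept to the rows of its own facet.
This should do the trick: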
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(aes(xintercept = TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
Note geom_vline(aes(xintercept = TrialStartDate))
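
A variation on the same idea, as a minimal sketch: give geom_vline its own data frame with one row per start date, so each facet draws a single line instead of one per bar, and map colour to the start date so the line matches the bars. The names `days`, `counts` and `starts` are made up for this illustration, and `count()` plus `geom_col()` stand in for the group_by/count and geom_bar(stat = "identity") steps above.

```r
library(dplyr)
library(ggplot2)

# Simulated data along the lines of the question
set.seed(123)
days <- seq(as.Date("2019-02-01"), as.Date("2019-02-15"), by = "day")
data <- data.frame(
  TrialStartDate  = rep(sample(days, 5), each = 200),
  TrialFinishDate = sample(days, 1000, replace = TRUE),
  ID              = seq_len(1000)
)

# Number of IDs per start/finish combination (column n)
counts <- count(data, TrialStartDate, TrialFinishDate)

# One row per group; the shared facet variable ties each line to its own panel
starts <- distinct(data, TrialStartDate)

p <- ggplot(counts, aes(x = TrialFinishDate, y = n, fill = factor(TrialStartDate))) +
  geom_col() +
  geom_vline(data = starts,
             aes(xintercept = TrialStartDate, colour = factor(TrialStartDate)),
             linetype = "dashed") +
  facet_wrap(~TrialStartDate, ncol = 1)
print(p)
```

Because `starts` contains the faceting column, ggplot2 places each row only in the matching panel.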

Related

How to draw a line plot when dates are in two different columns?

Let's say I have the following dataset:
dt <- data.frame(id= c(1),
parameter= c("a","b","c"),
start_day = c(1,8,4),
end_day = c(16,NA,30))
I want to create a type of line chart such that "parameter" column is my y-axis and both ("start_day" and "end_day") are on my x-axis.
Also, if both "start_day" and "end_day" have values, they should be connected by a line. If there is no "end_day" (as for parameter "b"), the "start_day" should be connected to an arrow indicating that there is no "end_day" for that parameter. (I know it sounds confusing, but I will give an example to clarify.)
I know that for line chart I need to have all the dates in one column. But in my data frame I have two separate columns (start and end dates).
So I think line chart is not the proper tool for this case and instead I tried swimmer_points_from_lines and swimmer_arrows.
I added a new column named "cont" to be used in swimmer_arrows.
dt$cont <- with(dt, ifelse(end_day > 0 ,1, 0))
ggplot(data = dt)+
swimmer_points_from_lines(df= dt,
id="parameter",
start= "start_day",
end="end_day")+
swimmer_arrows(df_arrows = dt,
id="parameter",
arrow_start = "start_day",
cont = "cont")+
coord_flip()
The outcome is as follows:
What I am looking for at this point is a way to draw a line between "start_day" and "end_day" (given that both exist). If there is no "end_day", I want an arrow to indicate that there is no end date (the exact opposite of what I am getting right now).
Any help is much appreciated.
To draw a line between points where the values are in separate columns you could use geom_segment. To add an arrow to obs with no end date one option would be to split your data into two parts, one with non-missing and one with missing end-dates, and use two geom_segments:
library(ggplot2)
ggplot(dt, aes(y = parameter)) +
geom_segment(data = ~subset(.x, !is.na(end_day)), aes(x = start_day, xend = end_day, yend = parameter)) +
geom_segment(data = ~subset(.x, is.na(end_day)), aes(x = start_day, xend = start_day + 1, yend = parameter),
arrow = arrow(type = "closed", length = unit(0.1, "inches"))) +
geom_point(aes(x = start_day, shape = "start_day"), size = 3, color = "red") +
geom_point(aes(x = end_day, shape = "end_day"), size = 3, color = "red") +
scale_shape_manual(values = c(16, 17))
#> Warning: Removed 1 rows containing missing values (geom_point).
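
The `data = ~subset(.x, ...)` form filters the plot data separately for each layer. As a sketch, the same split can be written out explicitly, which makes it easy to inspect the two pieces. The names `finished` and `ongoing` are made up here, and `na.rm = TRUE` is added only to silence the warning about the missing end point.

```r
library(ggplot2)

dt <- data.frame(id = 1,
                 parameter = c("a", "b", "c"),
                 start_day = c(1, 8, 4),
                 end_day   = c(16, NA, 30))

# Rows with a known end get a plain segment; rows without one get an arrow
finished <- subset(dt, !is.na(end_day))
ongoing  <- subset(dt, is.na(end_day))

p <- ggplot(dt, aes(y = parameter)) +
  geom_segment(data = finished,
               aes(x = start_day, xend = end_day, yend = parameter)) +
  geom_segment(data = ongoing,
               aes(x = start_day, xend = start_day + 1, yend = parameter),
               arrow = arrow(type = "closed", length = unit(0.1, "inches"))) +
  geom_point(aes(x = start_day, shape = "start_day"), size = 3, color = "red") +
  # na.rm = TRUE drops the missing end point without a warning
  geom_point(aes(x = end_day, shape = "end_day"), size = 3, color = "red",
             na.rm = TRUE) +
  scale_shape_manual(values = c(start_day = 16, end_day = 17))
print(p)
```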

display mean value (rearrange data frame?)

I want to boxplot two groups (A and B) and display the mean value on each box plot.
I have 30 lines and 2 columns : each line contains the value of group A (col 1) and group B (col 2).
I made a boxplot with the base graphics boxplot function:
boxplot(Data_Q4$Group.A,Data_Q4$Group.B,names=c("group A","group B"))
but it seems that adding a mean point to the boxplot requires ggplot2.
I tried many things but I always get an error message:
! Aesthetics must be either length 1 or the same as the data (30): x...
It seems my problem comes from the y axis. I need it to take the data from columns A and B, but I don't know how to do this.
If my data had a value column and a group column (A or B for each line) it would work, but I don't know how to rearrange it so that I get 2 columns (value and group) and 60 lines with the values of both groups.
Then I could do dataQ4 %>% ggplot(aes(x = group, y = value)) + geom_boxplot() + stat_summary(fun = mean)
and I think it would be OK.
So my problem is how to rearrange my data frame so that I can use ggplot to boxplot it.
Thanks for your help!
I share here my data :
dput(Data_Q4) structure(list(Group.A = c(1.25310535, 0.5546414, 0.301283, 1.29312466, 0.99455579, 0.5141743, 2.0078324, 0.42224244, 2.17877257, 3.21778902, 0.55782935, 0.59461765, 0.97739581, 0.20986658, 0.30944786, 1.10593627, 0.77418776, 0.08967408, 1.10817666, 0.24726425, 1.57198685, 4.83281274, 0.43113213, 2.73038931, 1.13683142, 0.81336825, 0.83700649, 1.7847654, 2.31247163, 2.90988727), Group.B = c(2.94928948, 0.70302878, 0.69016263, 1.25069011, 0.43649776, 0.22462232, 0.39231981, 1.5763435, 0.42792839, 0.19608026, 0.37724368, 0.07071508, 0.03962611, 0.38580831, 2.63928857, 0.78220807, 0.66454197, 0.9568569, 0.02484568, 0.21600677, 0.88031195, 0.13567357, 0.68181725, 0.20116062, 0.4834762, 0.50102846, 0.15668497, 0.71992076, 0.68549794, 0.86150777)), class = "data.frame", row.names = c(NA, -30L))
First I create some random data:
df <- data.frame(group = rep(c("A", "B"), 15),
value = runif(30, 0, 10))
You can use the following code:
library(tidyverse)
ggplot(data = df,
aes(x = group, y = value)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
Output:
The red dots represent the mean.
Using your data, you can use the following code:
library(tidyverse)
library(reshape)
Data_Q4 %>%
melt() %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE)
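
If you would rather not load reshape, the wide-to-long step can also be done in base R with stack(). This is a sketch under some assumptions: only the first five rows of the posted data are used (rounded to two decimals) to keep it short, the name `long` is made up, and `fun = mean` assumes ggplot2 3.3.0 or later, where `fun.y` was deprecated.

```r
library(ggplot2)

# First five rows of the posted data, rounded, to keep the example short
Data_Q4 <- data.frame(
  Group.A = c(1.25, 0.55, 0.30, 1.29, 0.99),
  Group.B = c(2.95, 0.70, 0.69, 1.25, 0.44)
)

# stack() returns the columns `values` and `ind`; rename them for readability
long <- stack(Data_Q4)
names(long) <- c("value", "group")

p <- ggplot(long, aes(x = group, y = value)) +
  geom_boxplot() +
  # One point per group at the group mean
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               color = "darkred")
print(p)
```

With the full data frame, `long` would have 60 rows, which is the two-column layout asked for in the question.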

ggplot2 – issue with overlay of lines and errorbars

On the same ggplot figure, I am trying to have the points (from geom_point), the lines (from geom_line) and the errorbars (from geom_errorbar) on the same "plane" (i.e. not overlapping), this for each factor.
As you can see, the "layering" of the errorbars does not follow the "layering" of the lines (not to mention the points).
Here is a reproducible example:
# reproducible example
# package
library(dplyr)
library(ggplot2)
# generate the data
set.seed(244)
d1 <- data.frame(time_serie = as.factor(rep(rep(1:3, each = 6), 3)),
treatment = as.factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
value = runif(54, 1, 10))
# create the error intervals
d2 <- d1 %>%
dplyr::group_by(time_serie,treatment) %>%
dplyr::summarise(mean_value = mean(value),
SE_value = sd(value/sqrt(length(value)))) %>%
as.data.frame()
# plot
p1 <- ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2)
p1
p1a <- p1 + geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value), width = .2, position = position_dodge(0.3), size =1) +
geom_point(aes(), position = position_dodge(0.3), size = 3) +
geom_line(aes(color = treatment), position=position_dodge(0.3), size =1)
p1a
Any idea?
Any help would be greatly appreciated :)
Thanks a lot!
Valérian
Up front: this is a partial answer that has two notable issues still to fix (see the end). Edit: the two issues have been resolved, see the far bottom.
I'll change the "dodge" slightly to clarify the point, identify an area of concern, and demonstrate a suggested workaround.
# generate the data
set.seed(244)
d1 <- data.frame(time_serie = as.factor(rep(rep(1:3, each = 6), 3)),
treatment = as.factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
value = runif(54, 1, 10))
# create the error intervals
d2 <- d1 %>%
dplyr::group_by(time_serie,treatment) %>%
dplyr::summarise(mean_value = mean(value),
SE_value = sd(value/sqrt(length(value)))) %>%
dplyr::arrange(desc(treatment)) %>%
as.data.frame()
# plot
ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2) +
geom_errorbar(aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, position = position_dodge(0.03), size = 2) +
geom_point(aes(), position = position_dodge(0.03), size = 3) +
geom_line(aes(color = treatment), position = position_dodge(0.03), size = 2)
Namely, I'll assume that we want HIGH (red) points/lines/error-bars as the top-most layer, masked by nothing. We can see a clear violation of this in the right-most bar: the red dot is over the green errorbar but under the green line.
Unless/until there is an aes(layer=..) aesthetic (there is not afaik), you need to add layers one treatment at a time. While one could hard-code this with nine geoms, you can automate this with lapply. Note that ggplot(.) + list(geom1,geom2,geom3) works just fine, even with nested lists.
I'll control the order of layers with rev(levels(d2$treatment)), assuming that you want LOW as the bottom-most layer (ergo added first). The order of geoms within the list is what defines their layers. Technically we still have a single treatment's errorbar, point, and line on different layers, but they are consecutive so appear to be the same.
ggplot(aes(x = time_serie, y = mean_value, color = treatment, group = treatment), data=d2) +
lapply(rev(levels(d2$treatment)), function(trtmnt) {
list(
geom_errorbar(data = ~ subset(., treatment == trtmnt),
aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, position = position_dodge(0.03), size = 2),
geom_point(data = ~ subset(., treatment == trtmnt), aes(), position = position_dodge(0.03), size = 3),
geom_line(data = ~ subset(., treatment == trtmnt), position = position_dodge(0.03), size = 2)
)
})
(Side note: I use levels(d2$treatment) and data=~subset(., treatment==trtmnt) here, but that's just one way to do it. Another would be lapply(split(d2, d2$treatment), function(x) ...) and use data=x in all of the inner geoms. This latter method allows for multi-variable grouping, if desired. I see no immediate advantage to one over the other.)
The problems with this:
1. The order of the legend is not consistent with the order of levels of the factor; somehow that is lost. (To be clear, I don't demonstrate this very well here: I can move "medium" to the middle of the legend using levels<-, and it works with the non-lapply rendering code with incorrect layering, but it is again lost with the lapply-geoms.)
2. position_dodge no longer has awareness of the other treatments, so it does not dodge the other errorbars. The only way around this is to dodge manually before plotting, shown below.
1: Order of legend elements
This one was solved in "lapply'd geoms lose factor-ordering", where we just need to add scale_color_discrete(drop=FALSE).
2: Dodging
The dodge issue can be fixed by using real numerics in the x aesthetic. This is kind of a hack, as it is no longer done by ggplot2 but controlled externally. It's also applying an offset and not dodging, per se. But it does get the desired results.
d2$time_serie2 <- as.integer(as.character(d2$time_serie)) + as.numeric(d2$treatment)/10
ggplot(aes(x = time_serie2, y = mean_value, color = treatment, group = treatment), data = d2) +
lapply(rev(levels(d2$treatment)), function(trtmnt) {
list(
geom_errorbar(data = ~ subset(., treatment == trtmnt),
aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
width = 0.2, size = 2),
geom_point(data = ~ subset(., treatment == trtmnt), aes(), size = 3),
geom_line(data = ~ subset(., treatment == trtmnt), size = 2)
)
}) +
scale_color_discrete(drop = FALSE)
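
For reference, here is a sketch of the split() variant mentioned in the side note, combined with the manual offset and scale_color_discrete(drop = FALSE). The explicit `draw_order` vector is an assumption of this sketch (it puts HIGH on top) and `layers` is a made-up name. The standard error is written as sd(value) / sqrt(n()), which gives the same number as the original expression, and the line sizes are left at their defaults.

```r
library(dplyr)
library(ggplot2)

set.seed(244)
d1 <- data.frame(time_serie = factor(rep(rep(1:3, each = 6), 3)),
                 treatment  = factor(rep(c("HIGH", "MEDIUM", "LOW"), each = 18)),
                 value      = runif(54, 1, 10))

# Mean and standard error per time point and treatment
d2 <- d1 %>%
  group_by(time_serie, treatment) %>%
  summarise(mean_value = mean(value),
            SE_value   = sd(value) / sqrt(n()),
            .groups    = "drop") %>%
  as.data.frame()

# Manual offset in place of position_dodge()
d2$time_serie2 <- as.integer(as.character(d2$time_serie)) +
  as.numeric(d2$treatment) / 10

# Bottom-to-top drawing order (assumed here): HIGH is added last, so it is on top
draw_order <- c("LOW", "MEDIUM", "HIGH")

# One errorbar/point/line triple per treatment, each with its own data
layers <- lapply(unname(split(d2, d2$treatment)[draw_order]), function(x) {
  list(
    geom_errorbar(data = x,
                  aes(ymin = mean_value - SE_value, ymax = mean_value + SE_value),
                  width = 0.2),
    geom_point(data = x, size = 3),
    geom_line(data = x)
  )
})

p <- ggplot(d2, aes(x = time_serie2, y = mean_value,
                    color = treatment, group = treatment)) +
  layers +
  scale_color_discrete(drop = FALSE)
print(p)
```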

Generating multiple lines for repeat observations in only some factor levels

I am generating density plots for observations. The observations belong to a species and some are also connected to an individual ID.
With the data below, I want to generate a line for each level of IndID for species One and Two, and only a single line for Species Three, which does not include IndID. There are related questions on SO, but not with reproducible data and looking for different results.
library(ggplot2)
set.seed(1)
dat <- data.frame(Species = c(rep(c("One", "Two"), each = 2, length = 30), rep("Three",50)),
IndID = c(rep(letters[1:5],each = 6),rep(NA,50) ),
Value = sample(1:20, replace = T))
Keeping the color aesthetic at the Species level, I want to create multiple lines for Species One and Two (green and red) and a single blue line for Species Three.
ggplot(dat, aes(Value)) + geom_density(aes(color = Species), size = 1.25) +
scale_colour_manual(values = c("darkgreen","blue", "red"))
If you want to be able to tell them apart, you can set the linetype to IndID. Note, however, that you will need to change the NA to some other value to (easily) get it to plot.
I also expanded your data a little bit to give enough values per individual to show meaningful lines. I also used geom_line(stat = "density") instead of geom_density() because it omits the line along the bottom and gives legends with lines instead of boxes.
set.seed(1)
dat <- data.frame(Species = c(rep(c("One", "Two"), each = 2, length = 60), rep("Three",50)),
IndID = c(rep(letters[1:5],each = 12),rep("NA",50) ),
Value = sample(1:20, 110, replace = T))
ggplot(dat
, aes(x = Value
, color = Species
, linetype = IndID)) +
geom_line(stat = "density"
, size = 1.25) +
scale_colour_manual(values = c("darkgreen","blue", "red"))
gives
If you want the lines to all be solid, you can run:
ggplot(dat
, aes(x = Value
, color = Species
, linetype = IndID)) +
geom_line(stat = "density"
, size = 1.25) +
scale_colour_manual(values = c("darkgreen","blue", "red")) +
scale_linetype_manual(values = rep("solid", 6)) +
guides(linetype = "none")
(or use group as @Henrik suggested in zir comment)
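
A sketch of that group-based variant, assuming the missing IndID is recoded to a placeholder string (here "none", an arbitrary choice): grouping by the Species/IndID combination draws one solid line per individual without adding a linetype legend. The named colour vector is only there to make the species-to-colour assignment explicit.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(
  Species = c(rep(c("One", "Two"), each = 2, length.out = 60), rep("Three", 50)),
  IndID   = c(rep(letters[1:5], each = 12), rep("none", 50)),
  Value   = sample(1:20, 110, replace = TRUE)
)

# One density line per Species/IndID combination, coloured by Species only
p <- ggplot(dat, aes(x = Value, colour = Species,
                     group = interaction(Species, IndID))) +
  geom_line(stat = "density") +
  scale_colour_manual(values = c(One = "darkgreen", Three = "blue", Two = "red"))
print(p)
```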

R: In ggplot, how to add multiple text labels on the y-axis for each of multiple dates on the x-axis

I am making a very wide chart that, when output as a PNG file, takes up several thousand pixels in the x-axis; there is about 20 years of daily data. (This may or may not be regarded as good practise, but it is for my own use, not for publication.) Because the chart is so wide, the y-axis disappears from view as you scroll through the chart. Accordingly I want to add labels to the plot at 2-yearly intervals to show the values on the y-axis. The resulting chart looks like the one below, except that in the interests of keeping it compact I have used only 30 days of fake data and put labels roughly every 10th day:
This works more or less as required, but I wonder if there is a better way of approaching it. In this chart (see code below) I have a column for each of the 3 y-axis values of 120, 140 and 160. The real data has many more levels, so I would end up with 15 calls to geom_text to put everything on the plot area.
Q. Is there a simpler way to splat all 20-odd dates, with 15 labels per date, on to the chart at once?
require(ggplot2)
set.seed(12345)
mydf <- data.frame(mydate = seq(as.Date('2012-01-01'), as.Date('2012-01-31'), by = 'day'),
price = runif(31, min = 100, max = 200))
mytext <- data.frame(mydate = as.Date(c('2012-01-10', '2012-01-20')),
col1 = c(120, 120), col2 = c(140,140), col3 = c(160,160))
p <- ggplot(data = mydf) +
geom_line(aes(x = mydate, y = price), colour = 'red', size = 0.8) +
geom_text(data = mytext, aes(x = mydate, y = col1, label = col1), size = 4) +
geom_text(data = mytext, aes(x = mydate, y = col2, label = col2), size = 4) +
geom_text(data = mytext, aes(x = mydate, y = col3, label = col3), size = 4)
print(p)
ggplot2 likes data to be in long format, so melt()ing your text into long format lets you make a single call to geom_text():
require(reshape2)
mytext.m <- melt(mytext, id.vars = "mydate")
Then your plotting command becomes:
ggplot(data = mydf) +
geom_line(aes(x = mydate, y = price), colour = 'red', size = 0.8) +
geom_text(data = mytext.m, aes(x = mydate, y = value, label = value), size = 4)
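
With many dates and many levels, it may be easier to build the long label table directly rather than typing a wide one and melting it. A minimal sketch, in which `label_dates`, `label_values` and `label_grid` are made-up names:

```r
library(ggplot2)

set.seed(12345)
mydf <- data.frame(mydate = seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "day"),
                   price  = runif(31, min = 100, max = 200))

label_dates  <- as.Date(c("2012-01-10", "2012-01-20"))
label_values <- c(120, 140, 160)

# Every date paired with every value, one row per label
label_grid <- data.frame(
  mydate = rep(label_dates, each = length(label_values)),
  value  = rep(label_values, times = length(label_dates))
)

p <- ggplot(mydf, aes(x = mydate, y = price)) +
  geom_line(colour = "red") +
  # x is inherited from the plot; y and label come from the label table
  geom_text(data = label_grid, aes(y = value, label = value), size = 4)
print(p)
```

For the real chart, `label_dates` could be generated with seq() at 2-year steps and `label_values` with seq() over the price range.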
