Create subplots with ggplot2 - r

So I need to work with the hflights package and make subplots of every single weekday to show the delay of the airplanes (cancelled flights excluded). The problem is I'm not able to reproduce both x- and y-axis (x: month & y: delay in in min). I tried to use facet_wrap and facet_grid, but I'm not familiar to those function, because of not using ggplot2 that often.

The plot will be clearer if you name the months and weekdays, arrange them in the correct order, and use a logarithmic scale on the y axis. You can use facet_grid to create subplots for each weekday.
library(hflights)
library(tidyverse)
weekday <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday')
hflights %>%
mutate(WeekDay = factor(weekday[DayOfWeek], weekday),
Month = factor(month.abb[Month], month.abb)) %>%
filter(Cancelled == 0) %>%
ggplot(aes(Month, DepDelay)) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(.~WeekDay) +
labs(x = 'Month', y = 'Departure delay (log scale)')
To get a single line going through each panel, you need to have an average for each unique combination of month and day. The simplest way to get this is via geom_smooth
hflights %>%
mutate(WeekDay = factor(weekday[DayOfWeek], weekday),
Month = factor(month.abb[Month], month.abb)) %>%
filter(Cancelled == 0) %>%
ggplot(aes(Month, DepDelay, group = WeekDay)) +
geom_smooth(se = FALSE, color = 'black') +
facet_wrap(.~WeekDay) +
labs(x = 'Month', y = 'Departure delay (log scale)')
Though you can also summarize the data yourself and use geom_line
hflights %>%
filter(Cancelled == 0) %>%
mutate(WeekDay = factor(weekday[DayOfWeek], weekday),
Month = factor(month.abb[Month], month.abb)) %>%
group_by(Month, WeekDay) %>%
summarize(Delay = mean(DepDelay)) %>%
ggplot(aes(Month, Delay, group = WeekDay)) +
geom_line(color = 'black') +
facet_wrap(.~WeekDay) +
labs(x = 'Month', y = 'Departure delay (log scale)')

now since you didn't post the code let's assume you have saved a plot that plots everything in one plot under a.
a + facet_grid(rows = vars(weekday))
("weekday" is the column name where the weekdays are in, replace it if they are named diffrently)
If this isn't what you were searching for, it would be great if you could post some code...

Suppose you want to show the ArrDelay,
hflights %>%
filter(Cancelled!=1) %>%
ggplot(aes(x=as.factor(Month), y=mean(ArrDelay,na.rm=T)))+
geom_col()+
labs(x='Month',y='Mean arrival Delay')+
facet_wrap(~DayOfWeek)

Related

Difference in n() count and geom_col graph likely resulted from group_by(), but why and how?

Sorry in advance for I'm an R newbie. So I was working on Divvy Bike Share data (details see here. Here is a subset of my df:
I wanted to visualize the total ridership count (how many times bikes are used) as compressed and shown in a week. I tried two blocks of codes, with the only difference being summarize() - the second one has "month" inside the function. I don't understand what resulted in this difference in y-axis values in the two graphs.
p1 <- df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p1
p2 <- df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n()) %>%
ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Total Rides by Days in a Week", subtitle = "Casual Customers vs. Members", y = "ride times count (in thousands)") +
theme(axis.title.x = element_blank()) +
scale_fill_discrete(name = "") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k"))
p2
I tested the tables generated before a plot is visualized, so I tried the following blocks:
df %>%
group_by(member_casual, day_of_week) %>%
summarize(total_rides = n())
df %>%
group_by(member_casual, day_of_week, month) %>%
summarize(total_rides = n())
I guess I understand by adding more elements in group_by, the resulting table will become more catagorized or "grouped". However, the total should always be the same, no? For example, if you add up all the casual & Sundays (as separated into 12 months) in tibble 2, you'll get exactly the number in tibble 1 - 392107, the same number as shown in p1, not p2. So this exacerbated my confusion.
So in a word, I have two questions:
Why the difference in p1 and p2? How could I have avoided such errors in the future?
Where does the numbers come in p2?
Any advice would be greatly appreciated. Thank you!
You’re assuming that the counts for each month will be stacked, so that together the column will show the total across all months. But in fact the counts are overplotted in front of one another, so only the highest month-count is visible. You can see this is the case if you add a border and make your columns transparent. Using mpg as an example, with cyl as the “extra” grouping variable:
library(dplyr)
library(ggplot2)
mpg %>%
count(drv, year, cyl) %>%
ggplot(aes(year, n, fill = drv)) +
geom_col(
position = "dodge",
color = "black",
alpha = .5
)
NB: count(x) is a shortcut for group_by(x) %>% summarize(n = n()).

ggplot2: Facetting by year and aligning x-axis dates by month

I am trying to plot daily data with days of the week (Monday:Sunday) on the y-axis, week of the year on the x-axis with monthly labels (January:December), and facet by year with each facet as its own row. I want the week of the year to align between the facets. I also want each tile to be square.
Here is a toy dataset to work with:
my_data <- tibble(Date = seq(
as.Date("1/11/2013", "%d/%m/%Y"),
as.Date("31/12/2014", "%d/%m/%Y"),
"days"),
Value = runif(length(VectorofDates)))
One solution I came up with is to use lubridate::week() to number the weeks and plot by week. This correctly aligns the x-axis between the facets. The problem is, I can't figure out how to label the x-axis with monthly labels.
my_data %>%
mutate(Week = week(Date)) %>%
mutate(Weekday = wday(Date, label = TRUE, week_start = 1)) %>%
mutate(Year = year(Date)) %>%
ggplot(aes(fill = Value, x = Week, y = Weekday)) +
geom_tile() +
theme_bw() +
facet_grid(Year ~ .) +
coord_fixed()
Alternatively, I tried plotting by the first day of the week using lubridate::floor_date and lubridate::round_date. In this solution, the x-axis is correctly labeled, but the weeks don't align between the two years. Also, the tiles aren't perfectly square, though I think this could be fixed by playing around with the coord_fixed ratio.
my_data %>%
mutate(Week = floor_date(Date, week_start = 1),
Week = round_date(Week, "week", week_start = 1)) %>%
mutate(Weekday = wday(Date, label = TRUE, week_start = 1)) %>%
mutate(Year = year(Date)) %>%
ggplot(aes(fill = Value, x = Week, y = Weekday)) +
geom_tile() +
theme_bw() +
facet_grid(Year ~ .) +
scale_x_datetime(name = NULL, labels = label_date("%b")) +
coord_fixed(7e5)
Any suggestions of how to get the columns to align correctly by week of the year while labeling the months correctly?
The concept is a little flawed, since the same week of the year is not guaranteed to fall in the same month. However, you can get a "close enough" solution by using the labels and breaks argument of scale_x_continuous. The idea here is to write a function which takes a number of weeks, adds 7 times this value as a number of days onto an arbitrary year's 1st January, then format it as month name only using strftime:
my_data %>%
mutate(Week = week(Date)) %>%
mutate(Weekday = wday(Date, label = TRUE, week_start = 1)) %>%
mutate(Year = year(Date)) %>%
ggplot(aes(fill = Value, x = Week, y = Weekday)) +
geom_tile() +
theme_bw() +
facet_grid(Year ~ .) +
coord_fixed() +
scale_x_continuous(labels = function(x) {
strftime(as.Date("2000-01-01") + 7 * x, "%B")
}, breaks = seq(1, 52, 4.2))
Another option if you're sick of reinventing the wheel is to use the calendarHeat function in the Github-only makeR package:
install_github("jbryer/makeR")
library(makeR)
calendarHeat(my_data$Date, my_data$Value)

How to order time in y axis

I have a data frame (a tibble) like this:
library(tidyverse)
library(lubridate)
x = tibble(date=c("2022-04-25 07:04:07", "2022-04-25 07:09:07", "2022-04-25 07:14:07", "2022-04-26 07:04:07"),
value=c("on", "off", "on", "off"))
x$day<- as.factor(day(x$date))
x$time <- paste0(str_pad(hour(x$date),2,pad="0"),":",str_pad(minute(x$date),2,pad="0"))
When I plot the data:
x %>% ggplot() + geom_col(aes(x=day,y=time, fill=value))
the times in the y axis do not follow the bars. Each time data is supposed to be side by side with each bar segment.
I tried using as.factor(time) but that didn't solve.
I also tried to add a numeric scale:
x = tibble(date=c("2022-04-25 07:04:07", "2022-04-25 07:09:07", "2022-04-25 07:14:07", "2022-04-26 07:04:07"),
fake_y=c(1,1,1,1)
value=c("on", "off", "on", "off"))
x %>% ggplot() + geom_col(aes(x=day,y=fake_y, fill=value))
but then the order of the on/off bars is lost.
How can I fix this?
Since you are looking for a time line, you would probably be best with geom_segment rather than geom_col. The reason is that since you might have multiple 'on' or 'off' values in a single day, it would be difficult to get these to stack correctly. You would also need to diff the on-off times to get them to stack. Furthermore, your labels would be wrong using columns if "off" represents the time of going from an on state to an off state.
When working with times in R, it is often best to keep them in time format for plotting. If you convert times to character strings before plotting, they will be interpreted as factor levels, and therefore will not be proportionately spaced correctly.
Since you want to have the day along one axis, you will need quite a bit of data manipulation to ensure that you record the state at the start of each day and the end of each day, but it can be achieved by doing:
p <- x %>%
mutate(date = as.POSIXct(date)) %>%
mutate(day = as.factor(day(date))) %>%
group_by(day) %>%
group_modify(~ add_row(.x,
date = floor_date(as.POSIXct(first(.x$date)), 'day'),
value = ifelse(first(.x$value) == 'on', 'off', 'on'),
.before = 1)) %>%
group_modify(~ add_row(.x,
date = ceiling_date(as.POSIXct(last(.x$date)), 'day') - 1,
value = last(.x$value))) %>%
mutate(ends = lead(date)) %>%
filter(!is.na(ends)) %>%
mutate(date = hms::as_hms(date), ends = hms::as_hms(ends)) %>%
ggplot(aes(x = day, y = date)) +
geom_segment(aes(xend = day, yend = ends, color = value),
size = 20) +
coord_cartesian(ylim = c(25120, 26500)) +
labs(y = 'time') +
guides(color = guide_legend(override.aes = list(size = 8)))
p
And of course, you can easily flip the co-ordinates if you wish, and apply theme elements to make the plot more appealing:
p + coord_flip(ylim = c(25120, 26500)) +
scale_color_manual(values = c('deepskyblue4', 'orange')) +
theme_light(base_size = 16)

Plotting a line graph by datetime with a histogram/bar graph by date

I'm relatively new to R and could really use some help with some pretty basic ggplot2 work.
I'm trying to visualize total number of submissions on a graph, showing the overall total in a line graph and the daily total in a histogram (or bar graph) on top of it. I'm not sure how to add breaks or bins to the histogram so that it takes the submission datetime column and makes each bar the daily total.
I tried adding a column that converts the datetime into just date and plots based on that, but I'd really like the line graph to include the time.
Here's what I have so far:
df <- df %>%
mutate(datetime = lubridate::mdy_hm(datetime))%>%
mutate(date = lubridate::as_date(datetime))
#sort by datetime
df <- df %>%
arrange(datetime)
#add total number of submissions
df <- df %>%
mutate(total = row_number())
#ggplot
line_plus_histo <- df%>%
ggplot() +
geom_histogram(data = df, aes(x=datetime)) +
geom_line(data = df, aes(x=datetime, y=total), col = "red") +
stat_bin(data = df, aes(x=date), geom = "bar") +
labs(
title="Submissions by Day",
x="Date",
y="Submissions",
legend=NULL)
line_plus_histo
As you can see, I'm also calculating the total number of submissions by sorting by time and then adding a column with the row number. So if you can help me use a better method I'd really appreciate it.
Please, find below the line plus histogram of time v. submissions:
Here's the pastebin link with my data
You can extend your data manipulation by:
df <- df |>
mutate(datetime = lubridate::mdy_hm(datetime)) |>
arrange(datetime) |>
mutate(midday = as_datetime(floor_date(as_date(datetime), unit = "day") + 0.5)) |>
mutate(totals = row_number()) |>
group_by(midday) |>
mutate(N = n())|>
ungroup()
then use midday for bars and datetime for line:
df%>%
ggplot() +
geom_bar(data = df, aes(x = midday)) +
geom_line(data = df, aes(x=datetime, y=totals), col = "red") +
labs(
title="Submissions by Day",
x="Date",
y="Submissions",
legend=NULL)
PS. Sorry for Polish locales on X axis.
PS2. With geom_bar it looks much better
Created on 2022-02-03 by the reprex package (v2.0.1)

Why can't I get the right horizontal axis labels on my ggplot2 chart?

I am trying to do a faceted plot of a grouped dataframe with ggplot2, using geom_line(). My dataframe has a Date column and I would like to have dates on the horizontal axis. If I just use Date in aes(x=Date, ...) I get nice labels on the horizontal axis. However, the line has an almost horizontal section where the date jumps from the end of one group to the beginning of the next group. This code and chart shows that:
dts <- seq.Date(as.Date("2020-01-01"), as.Date("2021-12-31"), by="day")
mos <- sapply(dts, month)
df <- data.frame(Date=dts, Month=mos)
nr <- nrow(df)
df$X <- rep(1, nr)
df %>%
group_by(Month) -> dfgrp
dfgrp %>%
group_by(Month) %>%
mutate(Time = Date[1:n()],
Z = cumsum(X)) %>%
ggplot(aes(x=Date, y=Z)) +
geom_line(color="darkgreen", size=0.5) +
facet_grid(. ~ Month, scale="free_x") +
theme(axis.text.x = element_text(angle=45, size=7))
I would not like my chart to have those almost-horizontal lines when the date changes by a large amount. I was able to generate a chart without those lines using integers on aes() as follows:
dfgrp %>%
mutate(Time = 1:n() %>% as.integer(),
Z = cumsum(X)) %>%
ggplot(aes(x=Time, y=Z)) +
geom_line(color="darkgreen", size=0.5) +
facet_grid(. ~ Month, scale="free_x") +
scale_x_continuous(breaks = seq(from=1, to=nr, by=10) %>% as.integer(),
labels = function(x) as.character(dfgrp$Date[x])) +
theme(axis.text.x = element_text(angle=45, size=7))
The line on the chart looks like I want it but the dates on the horizontal axis are not correct: they end in February 2020 in every facet while the dates in the dataframe end in December 2021 and the dates in the first chart begin and end on different months in different facets.
I tried many things but nothing worked. Any suggestions on how to have a chart with dates like in the first chart above and lines like in the second chart above?
Help will be much appreciated.
You may want to adjust the dates to be in the same year, but noting the original year as a variable:
library(lubridate)
dfgrp %>%
group_by(Month) %>%
mutate(year = year(Date),
adj_date = ymd(paste(2020, month(Date), day(Date)))) %>%
# 2020 was leap year so 2/29 won't be lost
mutate(Time = Date[1:n()],
Z = cumsum(X)) %>%
ggplot(aes(x=adj_date, y=Z, color = year, group = year)) +
geom_line(size=0.5) +
facet_grid(. ~ Month, scale="free_x") +
theme(axis.text.x = element_text(angle=45, size=7))

Resources