R: ggplot of average daily counts by month - r

I am trying to plot average daily trip counts by month. However, I am struggling to work out how to plot the mean number of trips per day for each month instead of the total monthly trips.
The days of the week and months have already been converted from numeric type to abbreviations and have also been ordered (type: ).
Here's what I've tried for the plot.
by_day <- df_temp %>%
group_by(Start.Day)
ggplot(by_day, aes(x=Start.Month,
fill=Start.Month)) +
geom_bar() +
scale_fill_brewer(palette = "Paired") +
labs(title="Number of Daily Trips by Month",
x=" ",
y="Number of Daily Trips")
Here's the plot I am trying to replicate:

You are almost there. Since you did not share a reproducible example, I have simulated your data. You may need to adapt the variable names and/or correct my assumptions.
{lubridate} is a powerful package for date-time crunching. It comes in handy when working with dates, binning dates for summaries, etc.
library(dplyr)   # for %>%, mutate(), group_by(), summarise()
library(ggplot2)
# simulating your data
## a series of dates from June through October
days <- seq(from = lubridate::ymd("2020-06-01")
,to = lubridate::ymd("2020-10-30")
,by = "1 day")
## random trips on each day
set.seed(666)
trips <- sample(2000:5000, length(days), replace = TRUE)
# putting things together in a data frame
df_temp <- data.frame(date = days, counts = trips) %>%
# I assume the variable Start.Month is the monthly bin
# let's use lubridate to "bin" the month from the date
mutate(Start.Month = lubridate::floor_date(date, unit = "month"))
# aggregate trips for each month, calculate average daily trips
by_month <- df_temp %>%
group_by(Start.Month) %>% # group by the binning variable
summarise(Avg.Trips = mean(counts)) # calculate the mean for each group
ggplot( data = by_month
, aes(x = Start.Month, y = Avg.Trips
, fill=as.factor(Start.Month)) # to work with a discrete palette, factorise
) +
# ------------ bar layer -----------------------------------------
## instead of geom_bar(... stat = "identity"), you can use geom_col()
## and define the fill colour
geom_col() +
scale_fill_brewer(palette = "Paired") +
# ------------ if you like provide context with annotation -------
geom_text(aes(label = Avg.Trips %>% round(2)), vjust = 1) +
# ------------ finalise plot with labels, theme, etc.
labs(title="Number of Daily Trips by Month",
x=NULL, # setting an unused lab to NULL is better than printing empty " "!
y="Number of Daily Trips"
) +
theme_minimal() +
theme(legend.position = "none") # to suppress colour legend
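The simulation above assumes one row per day with a ready-made count. If the real data instead has one row per trip (an assumption, since the question does not show the data), the daily counts need to be computed first. A minimal sketch, where `trips_raw` and `start_time` are hypothetical names:

```r
library(dplyr)
library(lubridate)

# hypothetical raw data: one row per trip, spread over 90 days from 1 June 2020
set.seed(42)
trips_raw <- data.frame(
  start_time = as.POSIXct("2020-06-01", tz = "UTC") +
    sort(runif(5000, 0, 60 * 60 * 24 * 90))
)

# step 1: count trips per calendar day
# (days without any trip do not appear here and so do not lower the mean)
daily <- trips_raw %>%
  mutate(date = as_date(start_time)) %>%
  count(date, name = "trips")

# step 2: average the daily counts within each month
by_month <- daily %>%
  mutate(Start.Month = floor_date(date, unit = "month")) %>%
  group_by(Start.Month) %>%
  summarise(Avg.Trips = mean(trips))

by_month
```

The resulting `by_month` has the same two columns as in the answer, so it can be passed to the same ggplot() call.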

Related

How to unclutter the x-axis in a plot

Using the R programming language, I create some time series data (daily measurements, over a period of 20 years). I aggregated this data at monthly time periods and then produced a graph:
library(ggplot2)
library(xts)
library(scales)
set.seed(123)
day = seq(as.Date("2000/1/1"), as.Date("2020/1/1"),by="day")
day <- format(as.Date(day), "%Y/%m/%d")
amount <- rnorm(7306 ,100,10)
data <- data.frame(day, amount)
y.mon<-aggregate(amount~format(as.Date(day),
format="%Y/%m"),data=data, FUN=sum)
y.mon$d = y.mon$`format(as.Date(day), format = "%Y/%m")`
ggplot(y.mon, aes(x = d, y=amount))+
geom_line(aes(group=1))
Right now, the x-axis is completely unreadable. Is there a way to "unclutter" the x-axis? Perhaps "slant" the dates or show the dates at intervals of 4 month periods? I can completely delete the x-axis but ideally I would like to keep it there for reference.
At the end of the graph, there is a huge downwards "spike". I think this is because the data is aggregated every month - and since the last day the data is available at is "Jan-01-2020", this causes the "downwards spike". Is it possible to "query" the "y.mon" object so that the graph is made only until the last "complete" time period? This "spike" is deceiving, someone might look at the graph and think a big anomaly happened in Jan-2020, but it's actually because there is only 1 measurement at this time.
Thanks
You can also try:
library(ggplot2)
library(xts)
library(scales)
set.seed(123)
#Data
day = seq(as.Date("2000/1/1"), as.Date("2020/1/1"),by="day")
amount <- rnorm(7306 ,100,10)
data <- data.frame(day, amount)
#Aggregate
y.mon<-aggregate(amount~format(as.Date(day),
format="%Y/%m"),data=data, FUN=sum)
#Count days
y.mon2<-aggregate(amount~format(as.Date(day),
format="%Y/%m"),data=data,
FUN=function(x) length(x))
names(y.mon2)[2]<-'N'
#Format and merge to add N
y.mon$d = y.mon$`format(as.Date(day), format = "%Y/%m")`
mmon <- merge(y.mon,y.mon2)
#Add a dummy date
mmon$d <- as.Date(paste0(mmon$d,'/01'),'%Y/%m/%d')
#Plot
ggplot(subset(mmon,N!=1), aes(x = d, y=amount))+
geom_line(aes(group=1))+
scale_x_date(date_breaks = '4 month',date_labels = '%Y-%m',
expand = c(0,0))+
theme(axis.text.x = element_text(angle = 90))
Output:
Update: the same code as before, changing only the axis labels to yearly breaks:
#Plot Update
ggplot(subset(mmon,N!=1), aes(x = d, y=amount))+
geom_line(aes(group=1))+
scale_x_date(date_breaks = '12 month',date_labels = '%Y',
expand = c(0,0))+
theme(axis.text.x = element_text(angle = 90))
Output:
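The filter N != 1 works for this data because the only incomplete month holds a single day. A more general sketch, assuming a month counts as "complete" when it has one row for every calendar day, compares the number of rows per month with lubridate::days_in_month(). The names `month_start`, `monthly` and `complete` are illustrative only:

```r
library(lubridate)

set.seed(123)
day <- seq(as.Date("2000-01-01"), as.Date("2020-01-01"), by = "day")
data <- data.frame(day, amount = rnorm(length(day), 100, 10))

# bin each day to the first day of its month
data$month_start <- floor_date(data$day, unit = "month")

# monthly totals and number of observed days per month
monthly <- aggregate(amount ~ month_start, data = data, FUN = sum)
n_obs <- aggregate(amount ~ month_start, data = data, FUN = length)
names(n_obs)[2] <- "N"
monthly <- merge(monthly, n_obs, by = "month_start")

# keep only months where every calendar day has an observation
expected <- as.integer(days_in_month(monthly$month_start))
complete <- monthly[monthly$N == expected, ]

range(complete$month_start)
```

`complete` can then replace subset(mmon, N != 1) in the plotting code, with `month_start` as the x variable.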

Plotting multiple geom_lines at the same time and shift them on the X axis in R

I have a dataset that has individual records with ZIP Codes and Installation Dates. So far, I was able to plot records in a ZIP Code with these steps:
1. Subset a ZIP Code
2. Sort the records by Date
3. Create a new column and assign increasing values (by 1) for the next row.
4. Plot this last field by Date
The result looks like this:
Now, what I want to do is have multiple ZIP Code geom_lines in the same figure. Each ZIP Code area has a different first record date, and I would like all of them to start at the same point on the X-axis.
Here's a failed attempt. I want these lines to start at the same point on the X-axis:
I am looking for ideas on how to proceed.
Thanks!
Let's try to roughly emulate your data structure since your question does not include any data:
library(ggplot2)
set.seed(69)
df <- data.frame(
ZIP = c(rep("A", 1000), rep("B", 687)),
Count = c(cumsum(round(runif(1000, 0, 0:999))),
cumsum(round(runif(687, 0, 0:686) * 4))),
Date = c(seq(as.POSIXct("2007-09-01"), by = "1 week", length.out = 1000),
seq(as.POSIXct("2013-08-31"), by = "1 week", length.out = 687)))
ggplot(df, aes(Date, Count, colour = ZIP)) +
geom_line() +
scale_colour_manual(values = c("blue", "red"))
Now clearly, if we want these lines to start at the same position on the x axis, the x axis can no longer show the absolute date; it has to show the time since the first record instead. So we need to calculate this elapsed time for each group. The dplyr package can help us here:
library(dplyr)
df %>%
group_by(ZIP) %>%
mutate(Day = as.numeric(difftime(Date, min(Date), units = "days"))) %>%
ggplot(aes(Day, Count, colour = ZIP)) +
geom_line() +
labs(x = "Day since first record") +
scale_colour_manual(values = c("blue", "red"))
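To see what the per-group shift does, here is a small worked sketch on a toy data frame (the values are made up for illustration). It uses base R's ave() instead of dplyr, which is only one possible alternative:

```r
# toy data: two ZIP groups with different first record dates
df_small <- data.frame(
  ZIP  = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2007-09-01", "2007-09-08", "2007-09-15",
                   "2013-08-31", "2013-09-07"))
)

# earliest date per ZIP, repeated for every row of that ZIP
# (dates are converted to days since 1970-01-01 so that ave() can handle them)
first_day <- ave(as.numeric(df_small$Date), df_small$ZIP, FUN = min)

# days elapsed since the first record of each ZIP
df_small$Day <- as.numeric(df_small$Date) - first_day

df_small
```

Every group now starts at Day 0, so lines drawn against `Day` share the same origin on the x axis.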

R - Formatting data per month and facet wrapping per year

I am practicing with R and have hit a speedbump while trying to create a graph of airline passengers per month.
I want to show a separate monthly line graph for each year from 1949 to 1960, the period for which data was recorded. To do this I have used ggplot to create a line graph with the values per month. This works fine; however, when I try to separate this by year using facet_wrap() and formatting the current month field, facet_wrap(format(air$month[seq(1, length(air$month), 12)], "%Y")), it returns this:
Graph returned
I have also tried to format the facet by inputting my own sequence for the years: rep(c(1949:1960), each = 12). This returns a different result which is better but still wrong:
Second graph
Here is my code:
air = data.frame(
month = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by="months"),
air = as.vector(AirPassengers)
)
ggplot(air, aes(x = month, y = air)) +
geom_point() +
labs(x = "Month", y = "Passengers (in thousands)", title = "Total passengers per month, 1949 - 1960") +
geom_smooth(method = lm, se = F) +
geom_line() +
scale_x_date(labels = date_format("%b"), breaks = "12 month") +
facet_wrap(format(air$month[seq(1, length(air$month), 12)], "%Y"))
#OR
facet_wrap(rep(c(1949:1960), each = 12))
So how do I make an individual graph per year?
Thanks!
In the second try you were really close. The main problem is that you are trying to make a faceted plot in which every panel has different x-axis values (dates that include the year). An easy fix is to transform the dates to a "common" x-axis scale and then do the faceted plot. Here is the code that should output the desired plot.
library(tidyverse)
library(lubridate)
air %>%
# Get the year value to use it for the faceted plot
mutate(year = year(month),
# Get the month-day dates and set all dates with a dummy year (2021 in this case)
# This will get all your dates in a common x axis scale
month_day = as_date(paste(2021,month(month),day(month), sep = "-"))) %>%
# Do the same plot, just change the x variable to month_day
ggplot(aes(x = month_day,
y = air)) +
geom_point() +
labs(x = "Month",
y = "Passengers (in thousands)",
title = "Total passengers per month, 1949 - 1960") +
geom_smooth(method = "lm",
se = FALSE) +
geom_line() +
# Set the breaks to 1 month
scale_x_date(labels = scales::date_format("%b"),
breaks = "1 month") +
# Use the year variable to do the faceted plot
facet_wrap(~year) +
# Optionally rotate the x-axis labels by 90° for a cleaner plot
theme(axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1))
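As an alternative to the dummy-year trick, the month number can serve as the common x axis directly. This is only a sketch of that variation, not the approach used above; `month_num` is a name introduced here for illustration:

```r
library(ggplot2)

air <- data.frame(
  month = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "month"),
  air = as.vector(AirPassengers)
)

# split the date into a faceting variable (year) and a shared x value (1..12)
air$year <- as.integer(format(air$month, "%Y"))
air$month_num <- as.integer(format(air$month, "%m"))

p <- ggplot(air, aes(x = month_num, y = air)) +
  geom_line() +
  geom_point() +
  # label the numeric positions 1..12 with abbreviated month names
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(x = "Month", y = "Passengers (in thousands)") +
  facet_wrap(~year)

p
```

Because the x axis is a plain number here, no date scale is needed, at the cost of losing date-aware breaks.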

How to column bind the legend to a computed column in ggplot2?

The code plots data with computed weekly regression lines.
I would like to combine the legend with weekly doubling times, computed from the weekly slopes.
Nice to solve question: I could get the weekly regression lines with a geom_smooth.
However, I could not extract the slope coefficient (to compute the doubling time) from the geom_smooth. I therefore had to do equivalent regressions outside the ggplot portion.
Any suggestions to do this more elegantly?
Main question: How can I combine the legend with the column of computed doubling times?
With a lot of fiddling I can place the legend sort of next to these computed doubling times.
It does not look nice and when I include another datapoint I have to start fiddling all over again. Suggestions will be appreciated. Thank you.
library(ggplot2)
library(gridExtra)
# Input data: Daily number of cases starting at day0
cases <- c(1,1,2,3,7,10,13,16,24,38,51,62,85,116,150,202,240,274,402,554,709, 927)
day0 <- as.Date("2020-03-04")
# actual dates by counting from day0
dates <- day0 + 1:length(cases)
# week number as factor to obtain regression line for each week
week <- as.factor(1 + (1:length(cases) ) %/% 7)
# tibble with daily data, also with week number
datatib <- tibble( dates, cases, week)
# tibble with computed doubling time per week
resulttib <- tibble(Week=unique(week), Doubling_Time=NA)
# linear regression on log of dependent variable
for (wk in unique(week) ) {
resulttib[wk,'Doubling_Time'] <-
round( log(2) / lm(log(cases) ~ dates, data=datatib[week==wk,] )$coef['dates'], 2 )
}
# insert row at top for second line of column heading
resulttib <- add_row(resulttib, Week = '', Doubling_Time = '(days)', .before = 1)
doublingtime = tableGrob(resulttib[,'Doubling_Time'], rows=NULL)
gp <-
ggplot(datatib, aes(dates, cases, color = week ) ) +
geom_point() +
geom_smooth( method = "lm", se = FALSE) +
scale_x_date() +
scale_y_continuous(trans="log10") +
labs(x = "", y = "Number of Cases") +
ggtitle("Number of Cases with Weekly Doubling Times") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position=c(0.75,0),
legend.justification=c(1.2, -0.1), legend.text=element_text(size=14) ) +
annotation_custom( doublingtime,
xmin=dates[length(cases)]-2, xmax=dates[length(cases)], ymin=-2.65 )
As an answer to your main question, try this. I simply joined the doubling time to your main data frame and created a new variable combining the week number and the doubling time. Color is then mapped to this new variable.
Concerning your second question: there are ways to compute the slope from the computed values of geom_smooth/stat_smooth. However, in my opinion your approach of computing the slopes outside ggplot is the easier way to solve the kind of problem you are trying to solve.
library(ggplot2)
library(dplyr)
library(gridExtra)
# Input data: Daily number of cases starting at day0
cases <- c(1,1,2,3,7,10,13,16,24,38,51,62,85,116,150,202,240,274,402,554,709, 927)
day0 <- as.Date("2020-03-04")
# actual dates by counting from day0
dates <- day0 + 1:length(cases)
# week number as factor to obtain regression line for each week
week <- as.factor(1 + (1:length(cases) ) %/% 7)
# tibble with daily data, also with week number
datatib <- tibble( dates, cases, week)
# tibble with computed doubling time per week
resulttib <- tibble(Week=unique(week), Doubling_Time=NA)
# linear regression on log of dependent variable
for (wk in unique(week) ) {
resulttib[wk,'Doubling_Time'] <-
round( log(2) / lm(log(cases) ~ dates, data=datatib[week==wk,] )$coef['dates'], 2 )
}
# insert row at top for second line of column heading
#resulttib <- add_row(resulttib, Week = '', Doubling_Time = '(days)', .before = 1)
#doublingtime = tableGrob(resulttib[,'Doubling_Time'], rows=NULL)
datatib1 <- datatib %>%
left_join(resulttib, by = c("week" = "Week")) %>%
mutate(week1 = paste0(week, " (", Doubling_Time, ")"))
gp <-
ggplot(datatib1, aes(dates, cases, color = week1 ) ) +
geom_point() +
geom_smooth( method = "lm", se = FALSE) +
scale_x_date() +
scale_y_continuous(trans="log10") +
labs(x = "", y = "Number of Cases") +
ggtitle("Number of Cases with Weekly Doubling Times") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(
legend.position = c(.95, .05),
legend.justification = c("right", "bottom"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
) +
labs(color = "Week (Doubling time in days)")
gp
Created on 2020-03-27 by the reprex package (v0.3.0)
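If the for loop feels clumsy, the per-week regressions can also be written with split() and sapply(). This is a sketch of one possible tidy-up using base R only, with the same input data as above; `doubling` is a name introduced here:

```r
cases <- c(1,1,2,3,7,10,13,16,24,38,51,62,85,116,150,202,240,274,402,554,709,927)
day0 <- as.Date("2020-03-04")
dates <- day0 + seq_along(cases)
week <- as.factor(1 + seq_along(cases) %/% 7)
datadf <- data.frame(dates, cases, week)

# one log-linear fit per week; doubling time = log(2) / slope (in days)
doubling <- sapply(
  split(datadf, datadf$week),
  function(d) log(2) / coef(lm(log(cases) ~ dates, data = d))[["dates"]]
)

round(doubling, 2)
```

The result is a named numeric vector with one entry per week, which can be turned into a data frame and joined to the daily data in the same way as `resulttib`.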

ggplot2: plotting time series data by month & week

I'm trying to plot time series data by week and month; ideally, I think, I'd like to use boxplots to visualise daily data binned by week. While I can change the labels and gridlines on the x-axis using scale_x_date, that won't affect the points in the plot.
Here's a demonstration of the problem and my current (clumsy) solution.
library(zoo)
library(ggplot2)
d = as.Date(c(as.Date("2007-06-01"):as.Date("2008-05-31"))) # using zoo to reformat numeric
x = runif(366, min = 0, max = 100)
df = data.frame(d,x)
# PROBLEM #
p = ggplot(df, aes(d, x))
p + geom_point()
p + geom_boxplot() # more or less useless
# CURRENT FIX #
df$Year.Month <- format(df$d, "%Y-%m")
p = ggplot(df, aes(Year.Month, x))
p + geom_point(alpha = 0.75)
p + geom_boxplot() # where I'm trying to get to...
I feel certain that there's a more elegant way to do this from within ggplot. Am I right?
@shadow's answer below is much neater. But is there a way to do this using binning? Using stats in some form, perhaps?
You can treat Dates as dates in R, and use scale_x_date() in ggplot to get the x-labels you want.
Also, I find it easier to just create a new factor-like variable called "Month" to group the boxplots by month. In this case I used lubridate to accomplish the task.
If you do not go through the trouble of creating a new variable "Month", your boxplot will be drawn at the 15th of the month, which makes the plot a bit harder to read.
library(magrittr)
library(lubridate)
library(dplyr)
df %>%
mutate(Date2 = as.Date(paste0("2000-", month(d), "-", "01"))) %>%
mutate(Month = lubridate::month(d)) %>%
ggplot(aes(Date2, x, group=Month)) +
geom_boxplot() +
scale_x_date(date_breaks="1 month", date_labels = "%b")
If you do not create the variable "Month", boxplots won't align nicely with the x tick marks:
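The question also asks about binning daily data by week. The same grouping idea can be applied with lubridate::floor_date(), as in this sketch; `week_start` is a name introduced here, and the data are simulated as in the question but with a fixed seed:

```r
library(ggplot2)
library(lubridate)

set.seed(1)
df <- data.frame(d = seq(as.Date("2007-06-01"), as.Date("2008-05-31"), by = "day"))
df$x <- runif(nrow(df), min = 0, max = 100)

# bin each day to the first day of its week
# (weeks start on Sunday unless lubridate's week_start option is changed)
df$week_start <- floor_date(df$d, unit = "week")

# one boxplot per week, positioned at the start of that week
p <- ggplot(df, aes(x = week_start, y = x, group = week_start)) +
  geom_boxplot() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

p
```

Here the boxes are positioned by the real dates, so the x axis keeps its chronological order from June to May rather than being collapsed onto a dummy year.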
