How to column bind the legend to a computed column in ggplot2?

The code below plots the data with computed weekly regression lines.
I would like to combine the legend with the weekly doubling times, which are computed from the weekly slopes.
Nice-to-solve question: I can get the weekly regression lines with geom_smooth().
However, I could not extract the slope coefficients (needed to compute the doubling times) from geom_smooth(), so I had to run equivalent regressions outside the ggplot call.
Any suggestions for doing this more elegantly?
Main question: How can I combine the legend with the column of computed doubling times?
With a lot of fiddling I can place the legend more or less next to these computed doubling times.
It does not look nice, though, and as soon as I add another data point I have to start fiddling all over again. Suggestions will be appreciated. Thank you.
library(ggplot2)
library(gridExtra)
library(tibble)   # for tibble() and add_row()

# Input data: daily number of cases starting at day0
cases <- c(1, 1, 2, 3, 7, 10, 13, 16, 24, 38, 51, 62, 85, 116, 150, 202, 240, 274, 402, 554, 709, 927)
day0  <- as.Date("2020-03-04")

# actual dates by counting from day0
dates <- day0 + 1:length(cases)

# week number as factor to obtain a regression line for each week
week <- as.factor(1 + (1:length(cases)) %/% 7)

# tibble with daily data, also with week number
datatib <- tibble(dates, cases, week)

# tibble with computed doubling time per week
resulttib <- tibble(Week = unique(week), Doubling_Time = NA)

# linear regression on log of dependent variable
for (wk in unique(week)) {
  resulttib[wk, 'Doubling_Time'] <-
    round(log(2) / lm(log(cases) ~ dates, data = datatib[week == wk, ])$coef['dates'], 2)
}

# insert row at top for second line of column heading
resulttib <- add_row(resulttib, Week = '', Doubling_Time = '(days)', .before = 1)
doublingtime <- tableGrob(resulttib[, 'Doubling_Time'], rows = NULL)

gp <-
  ggplot(datatib, aes(dates, cases, color = week)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_date() +
  scale_y_continuous(trans = "log10") +
  labs(x = "", y = "Number of Cases") +
  ggtitle("Number of Cases with Weekly Doubling Times") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = c(0.75, 0),
        legend.justification = c(1.2, -0.1),
        legend.text = element_text(size = 14)) +
  annotation_custom(doublingtime,
                    xmin = dates[length(cases)] - 2, xmax = dates[length(cases)], ymin = -2.65)

As an answer to your main question: try this. I simply joined the doubling times to your main data frame and created a new variable that combines the week number and the doubling time. Color is then mapped to this new variable.
Concerning your second question: there are ways to compute the slope from the values computed by geom_smooth/stat_smooth. However, in my opinion your approach of computing the slopes outside ggplot is the easier route for the kind of problem you are trying to solve.
library(ggplot2)
library(dplyr)
library(gridExtra)

# Input data: Daily number of cases starting at day0
cases <- c(1, 1, 2, 3, 7, 10, 13, 16, 24, 38, 51, 62, 85, 116, 150, 202, 240, 274, 402, 554, 709, 927)
day0  <- as.Date("2020-03-04")

# actual dates by counting from day0
dates <- day0 + 1:length(cases)

# week number as factor to obtain regression line for each week
week <- as.factor(1 + (1:length(cases)) %/% 7)

# tibble with daily data, also with week number
datatib <- tibble(dates, cases, week)

# tibble with computed doubling time per week
resulttib <- tibble(Week = unique(week), Doubling_Time = NA)

# linear regression on log of dependent variable
for (wk in unique(week)) {
  resulttib[wk, 'Doubling_Time'] <-
    round(log(2) / lm(log(cases) ~ dates, data = datatib[week == wk, ])$coef['dates'], 2)
}

# insert row at top for second line of column heading
# resulttib <- add_row(resulttib, Week = '', Doubling_Time = '(days)', .before = 1)
# doublingtime <- tableGrob(resulttib[, 'Doubling_Time'], rows = NULL)

datatib1 <- datatib %>%
  left_join(resulttib, by = c("week" = "Week")) %>%
  mutate(week1 = paste0(week, " (", Doubling_Time, ")"))

gp <-
  ggplot(datatib1, aes(dates, cases, color = week1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_date() +
  scale_y_continuous(trans = "log10") +
  labs(x = "", y = "Number of Cases") +
  ggtitle("Number of Cases with Weekly Doubling Times") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(
    legend.position = c(.95, .05),
    legend.justification = c("right", "bottom"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)
  ) +
  labs(color = "Week (Doubling time in days)")
gp
Created on 2020-03-27 by the reprex package (v0.3.0)
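For completeness, here is a minimal sketch of the geom_smooth route mentioned above: ggplot_build() exposes the fitted points of every layer, so the per-week slopes (and hence doubling times) can be recovered from the built plot. This assumes the plot object gp built above, with geom_smooth() as its second layer; note the fitted y values are on the log10 scale because of scale_y_continuous(trans = "log10").
built     <- ggplot_build(gp)
smooth_df <- built$data[[2]]   # layer 2 = geom_smooth; has columns x, y, group

slopes <- sapply(split(smooth_df, smooth_df$group), function(d) {
  # the fit is linear, so any two fitted points give the slope (per day)
  (d$y[nrow(d)] - d$y[1]) / (d$x[nrow(d)] - d$x[1])
})

# y is log10(cases), so doubling time in days = log10(2) / slope
round(log10(2) / slopes, 2)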

Related

R: ggplot of average daily counts by month

I am trying to plot average daily trip counts by month. However, I am struggling to work out how to plot only the mean number of trips per day for each month, rather than the total monthly trips.
The days of the week and the months have already been converted from numeric type to abbreviations and have also been ordered (type: ).
Here is what I have tried for the plot.
by_day <- df_temp %>%
  group_by(Start.Day)

ggplot(by_day, aes(x = Start.Month,
                   fill = Start.Month)) +
  geom_bar() +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Number of Daily Trips by Month",
       x = " ",
       y = "Number of Daily Trips")
Here is the plot I am trying to replicate: (image omitted)
You are almost there. Since you did not share a reproducible example, I have simulated your data. You may need to adapt the variable names and/or correct my assumptions.
{lubridate} is a powerful package for date-time crunching. It comes in handy when working with dates and when binning dates for summaries, etc.
library(dplyr)
library(ggplot2)

# simulating your data
## a series of dates from June through October
days <- seq(from = lubridate::ymd("2020-06-01"),
            to   = lubridate::ymd("2020-10-30"),
            by   = "1 day")

## random trips on each day
set.seed(666)
trips <- sample(2000:5000, length(days), replace = TRUE)

# putting things together in a data frame
df_temp <- data.frame(date = days, counts = trips) %>%
  # I assume the variable Start.Month is the monthly bin
  # let's use lubridate to "bin" the month from the date
  mutate(Start.Month = lubridate::floor_date(date, unit = "month"))

# aggregate trips for each month, calculate average daily trips
by_month <- df_temp %>%
  group_by(Start.Month) %>%               # group by the binning variable
  summarise(Avg.Trips = mean(counts))     # calculate the mean for each group

ggplot(data = by_month,
       aes(x = Start.Month, y = Avg.Trips,
           fill = as.factor(Start.Month))   # to work with a discrete palette, factorise
       ) +
  # ------------ bar layer -----------------------------------------
  ## instead of geom_bar(... stat = "identity"), you can use geom_col()
  ## and define the fill colour
  geom_col() +
  scale_fill_brewer(palette = "Paired") +
  # ------------ if you like, provide context with an annotation ----
  geom_text(aes(label = Avg.Trips %>% round(2)), vjust = 1) +
  # ------------ finalise plot with labels, theme, etc. -------------
  labs(title = "Number of Daily Trips by Month",
       x = NULL,   # setting an unused lab to NULL is better than printing an empty " "!
       y = "Number of Daily Trips"
       ) +
  theme_minimal() +
  theme(legend.position = "none")   # to suppress the colour legend
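If the raw data has one row per trip (which is what the original geom_bar() call suggests), a two-step aggregation along the same lines also works: count the trips per day first, then average those daily counts per month. This is only a sketch; df_temp, Start.Month and Start.Day are taken from the question, and Start.Day is assumed to identify a single calendar day within its month.
library(dplyr)
library(ggplot2)

avg_by_month <- df_temp %>%
  count(Start.Month, Start.Day, name = "trips_per_day") %>%   # total trips per day
  group_by(Start.Month) %>%
  summarise(Avg.Trips = mean(trips_per_day))                  # mean daily trips per month

ggplot(avg_by_month, aes(x = Start.Month, y = Avg.Trips, fill = Start.Month)) +
  geom_col() +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Number of Daily Trips by Month", x = NULL, y = "Number of Daily Trips") +
  theme_minimal() +
  theme(legend.position = "none")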

ggplot2 comparison of time periods

I need to visualize and compare two equally long sales periods, 2018/2019 and 2019/2020. Both periods begin at week 44 and end at week 36 of the following year. If I plot against the year-week, the two periods are drawn one after the other instead of on top of each other. If I use only the week number, the values are sorted as a continuum from 1 to 52, so the seasonal order (44, ..., 52, 1, ...) is lost and the graph does not make sense. Can you think of a solution?
Thank you
Data:
library(ggplot2)
library(tsibble)   # for yearweek()

set.seed(1)
df1 <- data.frame(sells = runif(44),
                  week = c(44:52, 1:35),
                  YW = yearweek(seq(as.Date("2018-11-01"), as.Date("2019-08-31"), by = "1 week")),
                  period = "18/19")
df2 <- data.frame(sells = runif(44),
                  week = c(44:52, 1:35),
                  YW = yearweek(seq(as.Date("2019-11-01"), as.Date("2020-08-31"), by = "1 week")),
                  period = "19/20")

# yearweek on the x axis: the two periods are drawn one after the other
ggplot(df1, aes(YW, sells)) +
  geom_line(aes(color = "Period 18/19")) +
  geom_line(data = df2, aes(color = "Period 19/20")) +
  labs(color = "Legend text")

# week number on the x axis: the weeks run 1..52 as a continuum, not split by year
ggplot(df1, aes(week, sells)) +
  geom_line(aes(color = "Period 18/19")) +
  geom_line(data = df2, aes(color = "Period 19/20")) +
  labs(color = "Legend text")
Another alternative is to facet it. This will require combining the two data sets into one while preserving the data source. (This is commonly a better way of dealing with it in general, anyway.)
(I don't have tsibble installed, so my YW just has seq(...) with no yearweek(); it should translate.)
ggplot(dplyr::bind_rows(tibble::lst(df1, df2), .id = "id"), aes(YW, sells)) +
  geom_line(aes(color = id)) +
  facet_wrap(id ~ ., scales = "free_x", ncol = 1)
In place of dplyr::bind_rows, one might also use data.table::rbindlist(..., idcol = "id") or do.call(rbind, ...), though with the latter you will need to assign the id externally, as sketched below.
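A minimal sketch of that, reusing df1 and df2 from the question:
# assign the id column by hand, then bind
df1$id <- "df1"
df2$id <- "df2"
combined <- do.call(rbind, list(df1, df2))
# the faceted plot above then works unchanged with `combined` in place of the bind_rows() call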
One more note: the default formatting of the x-axis obscures the year of the data. If this is relevant/important (and not apparent elsewhere), then use ggplot2's normal mechanism for forcing labels, e.g.,
... +
  scale_x_date(labels = function(z) format(z, "%Y-%m"))
In the unlikely case that you do not have tibble::lst available, you can replace it with list(df1 = df1, df2 = df2) or similar.
If you want to keep the x axis as a numeric scale, you can do:
ggplot(df1, aes((week + 9) %% 52, sells)) +
  geom_line(aes(color = "Period 18/19")) +
  geom_line(data = df2, aes(color = "Period 19/20")) +
  scale_x_continuous(breaks = 1:52,
                     labels = function(x) ifelse(x == 9, 52, (x - 9) %% 52),
                     name = "week") +
  labs(color = "Legend text")
Try this. You can format your week variable as an ordered factor to keep the desired order. Here is the code:
library(ggplot2)
library(tsibble)

# Data
df1$week <- factor(df1$week, levels = unique(df1$week), ordered = TRUE)
df2$week <- factor(df2$week, levels = unique(df2$week), ordered = TRUE)

# Plot
ggplot(df1, aes(week, sells)) +
  geom_line(aes(color = "Period 18/19", group = 1)) +
  geom_line(data = df2, aes(color = "Period 19/20", group = 1)) +
  labs(color = "Legend text")

R - Formatting data per month and facet wrapping per year

I am practicing with R and have hit a speed bump while trying to create a graph of airline passengers per month.
I want to show a separate monthly line graph for each year from 1949 to 1960, the years for which data has been recorded. To do this I have used ggplot to create a line graph of the values per month. This works fine; however, when I try to separate it by year with facet_wrap() on a formatted month field, facet_wrap(format(air$month[seq(1, length(air$month), 12)], "%Y")), the facets come out wrong (first graph, image omitted).
I have also tried to build the facet variable from my own sequence of years, rep(c(1949:1960), each = 12). This returns a different result, which is better but still wrong (second graph, image omitted).
Here is my code:
library(ggplot2)
library(scales)   # for date_format()

air <- data.frame(
  month = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "months"),
  air   = as.vector(AirPassengers)
)

ggplot(air, aes(x = month, y = air)) +
  geom_point() +
  labs(x = "Month", y = "Passengers (in thousands)", title = "Total passengers per month, 1949 - 1960") +
  geom_smooth(method = lm, se = F) +
  geom_line() +
  scale_x_date(labels = date_format("%b"), breaks = "12 month") +
  facet_wrap(format(air$month[seq(1, length(air$month), 12)], "%Y"))
  # OR
  # facet_wrap(rep(c(1949:1960), each = 12))
So how do I make an individual graph per year?
Thanks!
In your second try you were really close. The main problem is that you are trying to make a faceted plot where each panel has different x-axis values (dates that include the year). An easy fix is to transform the dates to a common x-axis scale and then facet. Here is the code that should output the desired plot.
library(tidyverse)
library(lubridate)

air %>%
  # Get the year value to use it for the facetted plot
  mutate(year = year(month),
         # Get the month-day dates and set all dates with a dummy year (2021 in this case)
         # This will get all your dates in a common x axis scale
         month_day = as_date(paste(2021, month(month), day(month), sep = "-"))) %>%
  # Do the same plot, just change the x variable to month_day
  ggplot(aes(x = month_day,
             y = air)) +
  geom_point() +
  labs(x = "Month",
       y = "Passengers (in thousands)",
       title = "Total passengers per month, 1949 - 1960") +
  geom_smooth(method = lm,
              se = F) +
  geom_line() +
  # Set the breaks to 1 month
  scale_x_date(labels = scales::date_format("%b"),
               breaks = "1 month") +
  # Use the year variable to do the facetted plot
  facet_wrap(~year) +
  # You could set the x axis text at a 90° angle to get a cleaner plot
  theme(axis.text.x = element_text(angle = 90,
                                   vjust = 0.5,
                                   hjust = 1))
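A slightly shorter variant of the same idea (a sketch, assuming the same air data frame) maps the month of the year directly to x instead of building a dummy-year date; geom_smooth() is omitted here because x is then a factor.
library(dplyr)
library(lubridate)
library(ggplot2)

air %>%
  mutate(year = year(month),
         month_of_year = month(month, label = TRUE)) %>%   # ordered factor Jan..Dec
  ggplot(aes(x = month_of_year, y = air, group = year)) +
  geom_point() +
  geom_line() +
  labs(x = "Month", y = "Passengers (in thousands)",
       title = "Total passengers per month, 1949 - 1960") +
  facet_wrap(~year)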

Superpose densities, non-exclusive subsets

I need to have several density functions on a single plot. Each density corresponds to a subset of my overall dataset. The subsets are defined by the value taken by one of the variables in the dataset.
Concretely, I would like to draw a density function for the 1, 3, and 10 year horizons. Of course, the 10-year horizon includes the shorter ones. Likewise, the 3-year horizon density should also include the data from the last year.
The subsets need to correspond to data[period == 1, ], data[period <= 3, ], and data (all periods up to 10 years).
I have managed to do this by layering geom_density calls on top of each other, i.e., by redefining the data for each layer.
ggplot() +
  geom_density(data = data[period <= 3, ], aes(x = BEST_CUR_EV_TO_EBITDA), alpha = .2, fill = "red") +
  geom_density(data = data[period == 1, ], aes(x = BEST_CUR_EV_TO_EBITDA), alpha = .2, fill = "grey") +
  geom_density(data = data, aes(x = BEST_CUR_EV_TO_EBITDA), alpha = .2, fill = "green")
It works fine, but I feel this is not the right way to do it (and indeed, it makes, e.g., the creation of a legend cumbersome).
On the other hand, doing it like this:
ggplot(data, aes(x = BEST_CUR_EV_TO_EBITDA, color = period)) +
  geom_density(alpha = .2, fill = "blue")
will not do, because then the periods are treated as mutually exclusive.
Is there a way to specify aes(color) based on the value taken by period when the subsets overlap?
Running code:
library(data.table)
library(lubridate)
library(ggplot2)

YEARS   <- 10
today   <- Sys.Date()
lastYr  <- Sys.Date() - years(1)
last3Yr <- Sys.Date() - years(3)

start.date <- Sys.Date() - years(YEARS)
date <- seq(start.date, Sys.Date(), by = 1)

BEST_CUR_EV_TO_EBITDA <- rnorm(length(date), 3, 1)

data <- cbind.data.frame(date, BEST_CUR_EV_TO_EBITDA)
data <- cbind.data.frame(data, period = rep(10, nrow(data)))

subPeriods <- function(aDf, from, to, value){
  aDf[aDf$date >= from & aDf$date <= to, "period"] <- value
  return(aDf)
}

data <- subPeriods(data, last3Yr, today, 3)
data <- subPeriods(data, lastYr, today, 1)
data <- data.table(data)

colScale <- scale_colour_manual(
  name = "horizon",
  values = c("1 Y" = "grey", "3 Y" = "red", "10 Y" = "green"))

ggplot() +
  geom_density(data = data[period <= 3, ], aes(x = BEST_CUR_EV_TO_EBITDA), alpha = .2, fill = "red") +
  geom_density(data = data[period == 1, ], aes(x = BEST_CUR_EV_TO_EBITDA), alpha = .2, fill = "grey") +
  geom_density(data = data, aes(x = BEST_CUR_EV_TO_EBITDA), alpha = .2, fill = "green") +
  colScale
One of the ways to deal with dependent grouping is to create an independent grouping based on the existing groups. The way I opted to do it below is by creating three new columns (period_one, period_three and period_ten) with the mutate function, where
period_one = BEST_CUR_EV_TO_EBITDA values for period == 1
period_three = BEST_CUR_EV_TO_EBITDA values for period <= 3
period_ten = BEST_CUR_EV_TO_EBITDA values for all periods
These columns are then converted into long format using the gather function, where the column names (period_one, period_three and period_ten) are stacked in the "period" variable and the corresponding values go into the column "val".
library(dplyr)
library(tidyr)

df2 <- data %>%
  mutate(period_one   = ifelse(period == 1, BEST_CUR_EV_TO_EBITDA, NA),
         period_three = ifelse(period <= 3, BEST_CUR_EV_TO_EBITDA, NA),
         period_ten   = BEST_CUR_EV_TO_EBITDA) %>%
  select(date, starts_with("period_")) %>%
  gather(period, val, period_one, period_three, period_ten)
With the long format and its independent grouping, the ggplot call is straightforward:
ggplot(df2, aes(val, fill = period)) + geom_density(alpha = .2)
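gather() still works but has been superseded by tidyr::pivot_longer(); an equivalent sketch of the same reshaping (assuming the data object built in the question) would be:
library(dplyr)
library(tidyr)
library(ggplot2)

df2 <- data %>%
  mutate(period_one   = ifelse(period == 1, BEST_CUR_EV_TO_EBITDA, NA),
         period_three = ifelse(period <= 3, BEST_CUR_EV_TO_EBITDA, NA),
         period_ten   = BEST_CUR_EV_TO_EBITDA) %>%
  select(date, starts_with("period_")) %>%
  pivot_longer(starts_with("period_"), names_to = "period", values_to = "val",
               values_drop_na = TRUE)   # drop the NAs padding the shorter horizons

ggplot(df2, aes(val, fill = period)) + geom_density(alpha = .2)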

R Script to average value over every <x> days

I'm having an issue working out how to calculate an average over "x" days. If I plot this CSV file over a full year, there is too much data for the line plot to display cleanly (screenshot attached). I would like to average the data over every few days (2 days, a week, etc.) so the line graph is easier to read. Any advice on how to solve this in R?
results.csv
POSTS,PROVIDER,TYPE,DATE
29337,FTP,BLOG,2010-01-01
26725,FTP,BLOG,2010-01-02
27480,FTP,BLOG,2010-01-03
31187,FTP,BLOG,2010-01-04
31488,FTP,BLOG,2010-01-05
32461,FTP,BLOG,2010-01-06
33675,FTP,BLOG,2010-01-07
38897,FTP,BLOG,2010-01-08
37122,FTP,BLOG,2010-01-09
41365,FTP,BLOG,2010-01-10
51760,FTP,BLOG,2010-01-11
50859,FTP,BLOG,2010-01-12
53765,FTP,BLOG,2010-01-13
56836,FTP,BLOG,2010-01-14
59698,FTP,BLOG,2010-01-15
52095,FTP,BLOG,2010-01-16
57154,FTP,BLOG,2010-01-17
80755,FTP,BLOG,2010-01-18
227464,FTP,BLOG,2010-01-19
394510,FTP,BLOG,2010-01-20
371303,FTP,BLOG,2010-01-21
370450,FTP,BLOG,2010-01-22
268703,FTP,BLOG,2010-01-23
267252,FTP,BLOG,2010-01-24
375712,FTP,BLOG,2010-01-25
381041,FTP,BLOG,2010-01-26
380948,FTP,BLOG,2010-01-27
373140,FTP,BLOG,2010-01-28
361874,FTP,BLOG,2010-01-29
265178,FTP,BLOG,2010-01-30
269929,FTP,BLOG,2010-01-31
R Script
library(ggplot2)

# note: this script targets an old version of ggplot2
# (opts(), theme_text(), theme_line() and theme_blank() were removed in later releases)
data <- read.csv("results.csv", header = TRUE)
dts  <- as.POSIXct(data$DATE, format = "%Y-%m-%d")
attach(data)

a <- ggplot(dataframe, aes(dts, POSTS/1000, fill = TYPE)) +
  opts(title = "Report") +
  labs(x = NULL, y = "Posts (k)", fill = NULL)
b <- a + geom_bar(stat = "identity", position = "stack")
plot_theme <- theme_update(axis.text.x = theme_text(angle = 90, hjust = 1),
                           panel.grid.major = theme_line(colour = "grey90"),
                           panel.grid.minor = theme_blank(),
                           panel.background = theme_blank(),
                           axis.ticks = theme_blank(),
                           legend.position = "none")
c <- b + facet_grid(TYPE ~ ., scale = "free_y")
d <- c + scale_x_datetime(major = "1 months", format = "%Y %b")
ggsave(filename = "/root/results.png", height = 14, width = 14, dpi = 600)
Try this :
Average <- function(Data, n){
  # Make an index to be used for aggregating
  ID <- as.numeric(as.factor(Data$DATE)) - 1
  ID <- ID %/% n
  # aggregate over ID and TYPE for all numeric data
  out <- aggregate(Data[sapply(Data, is.numeric)],
                   by = list(ID, Data$TYPE),
                   FUN = mean)
  # format output
  names(out)[1:2] <- c("dts", "TYPE")
  # add the correct dates as the beginning of every period
  out$dts <- as.POSIXct(Data$DATE[(out$dts * n) + 1])
  out
}
dataframe <- Average(Data, 3)
This works with the plot script you gave.
Some remarks:
never ever call a variable after a function name (data, c, ...)
avoid the use of attach(). If you do, add detach() afterwards, or you'll get into trouble at some point. It is better to use the functions with() and within(); see the sketch below.
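A minimal sketch of that suggestion, reusing the DATE column from the question:
# compute a value without attaching the data frame
dts <- with(data, as.POSIXct(DATE, format = "%Y-%m-%d"))
# or add/modify a column inside the data frame itself
data <- within(data, dts <- as.POSIXct(DATE, format = "%Y-%m-%d"))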
The TTR package also has several moving average functions that will do this with a single statement:
library(TTR)
mavg.3day <- SMA(data$POSTS, n=3) # Simple moving average
Substitute a different value of 'n' for your desired moving average length.
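As a usage sketch (column names taken from the question's results.csv; POSTS_7day is a made-up name), the smoothed series can be added back to the data frame and plotted instead of the raw counts. Note that the first n - 1 values of an n-period SMA are NA.
library(TTR)
library(ggplot2)

data <- read.csv("results.csv", header = TRUE)
data$POSTS_7day <- SMA(data$POSTS, n = 7)   # 7-day simple moving average; first 6 values are NA

ggplot(data, aes(as.Date(DATE), POSTS_7day / 1000)) +
  geom_line() +
  labs(x = NULL, y = "Posts (k), 7-day average")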
