How to aggregate data from years to decades and plot them? - r

This is the graph that I would like to reproduce:
but to do that I have to change the Year column, because the x axis of the graph is in decades. How could I accomplish this?
This is what I did to extract the data from the site (https://ourworldindata.org/famines) :
library(rvest)
library(dplyr)
library(tidyr)
library(ggplot2)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>%
html_table() %>% . [[1]]
data1 <- col_table %>%
select(Year, `Excess Mortality midpoint`)
Year `Excess Mortality midpoint`
<chr> <chr>
1 1846–52 1,000,000
2 1860-1 2,000,000
3 1863-67 30,000
4 1866-7 961,043
5 1868 100,000
6 1868-70 1,500,000
7 1870–1871 1,000,000
8 1876–79 750,000
9 1876–79 7,176,346
10 1877–79 11,000,000
# ... with 67 more rows

Firstly, to convert the periods to decades, you need to extract a year for each period, based on which the calculation will be made. From your comment above, it looks like you need to extract the end year for each period. Given the data, regular expressions are used below to do this (and packages dplyr and stringr).
col_table <- col_table %>%
mutate(Year = case_when(
grepl("^\\d{4}$",Year) ~ Year,
grepl("\\d{4}[–-]\\d{4}",Year) ~ str_sub(Year, start= -4),
grepl("\\d{4}[–-]\\d{2}$",Year) ~ paste0(str_sub(Year,1,2),str_sub(Year,-2)),
grepl("\\d{4}[–-]\\d{1}$",Year) ~ paste0(str_sub(Year,1,3),str_sub(Year,-1))))
This code detects the different cases and extracts the proper year. Below are examples of every case present in the dataset, with the result the code produces for each.
1868 -> 1868
1878-1880 -> 1880
1846–52 -> 1852
1860-1 -> 1861
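As a cross-check on the four cases above, here is a minimal base-R sketch that handles them with a single rule: the end year borrows whatever leading digits it is missing from the start year. The helper name end_year is hypothetical and not part of the code above; it assumes every period is either a single year or two years separated by a hyphen or an en dash (written as \u2013 in the code).

```r
# Hypothetical helper: return the end year of a period string such as "1846-52".
end_year <- function(period) {
  parts <- strsplit(period, "[\u2013-]")   # split on en dash or hyphen
  vapply(parts, function(p) {
    start <- p[1]
    end <- p[length(p)]
    # Prefix the end year with the leading digits it is missing
    prefix <- substr(start, 1, nchar(start) - nchar(end))
    as.integer(paste0(prefix, end))
  }, integer(1))
}

end_year(c("1868", "1878-1880", "1846\u{2013}52", "1860-1"))
```

For a single year the prefix is empty, so the value is returned unchanged.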
Now we have the year, so the next step is to extract the decade. To do so, we need to make sure that Year column is numeric and apply the necessary calculation (check here for it: https://stackoverflow.com/a/48966643/8864619)
col_table <- col_table %>%
mutate(Decade = as.numeric(Year) - as.numeric(Year) %% 10)
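To see what the modulo step does, here is a small sketch on a made-up vector of years (the values are illustrative, not taken from the table). Subtracting the remainder of a division by 10 rounds each year down to the start of its decade.

```r
# Made-up example years
years <- c(1852, 1861, 1868, 1880, 1900)

# year %% 10 is the last digit; removing it floors the year to its decade
decade <- years - years %% 10
decade
```

The same result can be obtained with floor(years / 10) * 10.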
To reproduce the plot we need to group by decade and make sure that the Excess Mortality midpoint column is numeric to be able to get the sum of victims per decade.
col_table <- col_table %>%
mutate(`Excess Mortality midpoint` = as.numeric(gsub(",", "", `Excess Mortality midpoint`))) %>%
group_by(Decade) %>%
summarize(val = sum(`Excess Mortality midpoint`)) %>%
ungroup()
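The same two steps (stripping the thousands separators, then summing per decade) can be sketched in base R on a few made-up values. The vectors deaths_chr and decade below are hypothetical stand-ins for the table columns.

```r
# Made-up stand-ins for the mortality and decade columns
deaths_chr <- c("1,000,000", "30,000", "961,043")
decade <- c(1850, 1860, 1860)

# Remove the commas so the strings can be converted to numbers
deaths <- as.numeric(gsub(",", "", deaths_chr))

# Sum the values within each decade
per_decade <- tapply(deaths, decade, sum)
per_decade
```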
For the plot itself, ggplot2 is used:
ylab <- c(5, 10, 15, 20, 25)
options(scipen=999)
p <- ggplot(data = col_table, aes(x=factor(Decade),y=val)) +
geom_bar(stat = "identity", fill = "navy") +
scale_x_discrete(labels = col_table %>% distinct(Decade) %>% mutate(Decade = paste0(Decade,"s")) %>% pull()) +
geom_text(aes(label=format(val,big.mark=",")), size=2,vjust=-0.3) +
scale_y_continuous(labels = paste(ylab, "millions"),breaks = 10^6 * ylab) +
ggtitle('Famine victims worldwide')+
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(size = 0.05, linetype = 'solid',
colour = "black"),
axis.title.x = element_blank(),
axis.title.y = element_blank())
p
Putting everything together, the following code extracts a year for each period, derives the relevant decade, sums the victims per decade, and then creates the plot:
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>% html_table() %>% . [[1]]
col_table <- col_table %>%
mutate(Year = case_when(
grepl("^\\d{4}$",Year) ~Year,
grepl("\\d{4}[–-]\\d{4}",Year) ~ str_sub(Year, start= -4),
grepl("\\d{4}[–-]\\d{2}$",Year) ~ paste0(str_sub(Year,1,2),str_sub(Year,-2)),
grepl("\\d{4}[–-]\\d{1}$",Year) ~ paste0(str_sub(Year,1,3),str_sub(Year,-1)))) %>%
mutate(Decade = as.numeric(Year) - as.numeric(Year)%%10) %>%
mutate(`Excess Mortality midpoint` = as.numeric(gsub(",", "", `Excess Mortality midpoint`))) %>%
group_by(Decade) %>%
summarize(val = sum(`Excess Mortality midpoint`)) %>%
ungroup()
ylab <- c(5, 10, 15, 20, 25)
options(scipen=999)
p <- ggplot(data = col_table, aes(x=factor(Decade),y=val)) +
geom_bar(stat = "identity", fill = "navy") +
scale_x_discrete(labels = col_table %>% distinct(Decade) %>% mutate(Decade = paste0(Decade,"s")) %>% pull()) +
geom_text(aes(label=format(val,big.mark=",")), size=2,vjust=-0.3) +
scale_y_continuous(labels = paste(ylab, "millions"),breaks = 10^6 * ylab) +
ggtitle('Famine victims worldwide')+
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(size = 0.05, linetype = 'solid',
colour = "black"),
axis.title.x = element_blank(),
axis.title.y = element_blank())
p
Here's the reproduced plot:

First, strsplit, make a proper year matrix, combine back with famines divided by number of years and reshape to long format (lines 1:6). Next, aggregate sums by decade and barplot it.
r <- strsplit(data1$Year, '-|–|, ') |>
rapply(\(y) unlist(lapply(y, \(x) f(max(as.numeric(y)), x))), how='r') |>
{\(.) t(sapply(., \(x) `length<-`(x, max(lengths(.)))))}() |>
{\(.) cbind(`colnames<-`(., paste0('year.', seq_len(dim(.)[2]))),
n=dim(.)[2] - rowSums(is.na(.)))}() |>
{\(.) data.frame(., f=as.numeric(gsub('\\D', '',
data1$`Excess Mortality midpoint`))/
.[, 'n'])}()|>
reshape(1:3, direction='long') |>
stats:::aggregate.formula(formula=f ~ as.integer(substr(year, 1, 3)),
FUN=sum) |>
t()
## plot
op <- par(mar=c(5, 5, 4, 2)+.1) ## set/store old pars
b <- barplot(r, axes=FALSE, ylim=c(0, max(r[2, ])*1.05),
main='Famine victims')
asq <- seq(0, max(axTicks(2)), 2e6) ## tick positions; must exist before abline() uses them
abline(h=asq, col='lightgrey', lty=3)
barplot(r, names.arg=paste0(r[1, ], '0s'), col='#20254c',
cex.names=.8, axes=FALSE, add=TRUE)
axis(2, asq, labels=FALSE)
mtext(paste(asq/1e6, 'Million'), 2, 1, at=asq, las=2)
text(b, r[2, ] + 5e5, labels=formatC(r[2, ], format='d', big.mark=','), cex=.7)
box()
par(op) ## restore old pars
In line 2, I used this helper function f() to fill up the pseudo-years:
f <- \(x1, x2, n1=nchar(x1)) {
u <- lapply(list(x1, x2), as.character)
s <- c(n1 - nchar(u[[2]]) + 1L, n1)
as.integer(`substr<-`(u[[1]], s[1], s[2], u[[2]]))
}
You can refine the aggregation method yourself to make the result exactly look like the original, but maybe this is better :)

Related

Plot missing legend when multiple plotting

I am trying to make multiple plots of my data whilst colour coding them based on a grouping. I want to include legends on each plot that go outside the scope of the box. This works; however, my first plot has no legend while the rest of my plots do.
For example:
library(faraway)
library(tidyverse)
library(glue)
data(savings)
group_data <- mapply(function(x, y) {
savings %>% mutate(test = ifelse(.[, y] > x, "Group 1 (GT)", "Group 2 (LT)"))
}, val, names(val), SIMPLIFY = FALSE) %>%
mapply(function(a,z) {
a %>% `colnames<-`(c(names(.)[-length(.)], glue("{z}_group")))
}, ., names(.), SIMPLIFY = FALSE) %>%
Reduce(cbind, .) %>%
.[, !duplicated(names(.))]
nn <- length(val)
ng <- names(group_data)[(length(group_data)-nn+1):length(group_data)]
n2 <- n2mfrow(nn, 2)
par(mfrow=n2, xpd=TRUE)
mapply(function(q, w){
form <- reformulate(q, response='sr')
plot(form, data=group_data, col=c('red', 'blue')[as.factor(group_data[,w])], pch=c(19, 19))
legend( x=0, 26,
legend=c("Group 1 (GT)","Group 2 (LT)"),
col=c("red","blue"), lwd=1, lty=c(0,0),
pch=c(19,19), bty='n' )
},names(val),ng, SIMPLIFY=FALSE)
Plots the following:
The data val:
list(pop15 = 35, pop75 = 2.5, dpi = 2000, ddpi = 7)
In response to the comments by @Harre, the following manipulates x for the missing legend:
if(q == 'pop15'){
legend( x=21, 26,
legend=c("Group 1 (GT)","Group 2 (LT)"),
col=c("red","blue"), lwd=1, lty=c(0,0),
pch=c(19,19), bty='n' )} else{
legend( x=0, 26,
legend=c("Group 1 (GT)","Group 2 (LT)"),
col=c("red","blue"), lwd=1, lty=c(0,0),
pch=c(19,19), bty='n' )
}
And I see all 4 legends now.
Unfortunately, if I add more columns like so:
savings$status <- savings$pop15+1
val <- c(val, status=list(37))
Then repeat the code I get the following:
@Harre's answer led me to the right solution:
group_data <- mapply(function(x, y) {
savings %>% mutate(group = ifelse(.[, y] > x, "Group 1 (GT)", "Group 2 (LT)"))
}, val, names(val), SIMPLIFY = FALSE) %>%
mapply(function(a,z) {
a %>% `colnames<-`(c(names(.)[-length(.)], glue("{z}_group")))
}, ., names(.), SIMPLIFY = FALSE) %>%
Reduce(cbind, .) %>%
.[, !duplicated(names(.))] %>% pivot_longer(-c(1:(length(.)-nn))) %>% dplyr::select(group=value) %>% cbind.data.frame(savings %>% pivot_longer(-c(1)), .)
val_hline <- val %>% unlist() %>% data.frame(hline=.) %>% rownames_to_column() %>% `colnames<-`(c('name', 'hline'))
kop <- inner_join(group_data, val_hline, by='name')
kop %>% ggplot(aes(x = value, y = sr, color = group)) +
geom_point() +
facet_wrap(name ~ ., scales = "free") + theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
strip.background = element_blank(),
panel.border = element_rect(colour = "black", fill = NA),
legend.position = "bottom") +
stat_smooth(method='lm') +
geom_vline(aes(xintercept=hline))
A suggested ggplot-solution, in the case you want to explore:
savings |>
pivot_longer(-sr) |>
# I have collected your val's here for illustration; feel free to use the lists
mutate(group = case_when(name == "pop15" & value > 35 ~ "Group 1 (GT)",
name == "pop75" & value > 2.5 ~ "Group 1 (GT)",
name == "dpi" & value > 2000 ~ "Group 1 (GT)",
name == "ddpi" & value > 7 ~ "Group 1 (GT)",
TRUE ~ "Group 2 (LT)")) |>
ggplot(aes(x = value, y = sr, color = group)) +
geom_point() +
facet_wrap(name ~ ., scales = "free") +
theme(legend.position = "bottom")
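If you would rather not write one case_when() line per variable, the thresholds can be looked up by name instead. The sketch below uses base R and a small made-up data frame in the long shape that pivot_longer(-sr) produces (a name column and a value column). The names thresholds and long are hypothetical, and only two of the four variables are included.

```r
# Thresholds keyed by variable name (two of the values from val)
thresholds <- c(pop15 = 35, pop75 = 2.5)

# Made-up long-format data: one row per (variable, value) pair
long <- data.frame(
  name  = c("pop15", "pop15", "pop75", "pop75"),
  value = c(30, 40, 1, 3)
)

# Look up each row's threshold by its variable name, then compare
cutoff <- unname(thresholds[long$name])
long$group <- ifelse(long$value > cutoff, "Group 1 (GT)", "Group 2 (LT)")
long
```

Adding a new variable then only requires a new entry in thresholds.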

Write a function to plot original value, mom and yoy change for time series data in 3 subplots [duplicate]

Given a sample of two monthly time series from this link,
I need to create one plot containing 3 subplots: plot1 for the original values, plot2 for month-over-month changes, and plot3 for year-over-year changes.
I'm able to draw the plot with the code below, but the code is too redundant. How could I achieve this in a more concise way? Thanks.
library(xlsx)
library(ggplot2)
library(reshape)
library(dplyr)
library(tidyverse)
library(lubridate)
library(cowplot)
library(patchwork)
df <- read.xlsx('./sample_data.xlsx', 'Sheet1')
colnames(df)
# df
cols <- c('food_index', 'energy_index')
df <- df %>% mutate(date=as.Date(date)) %>%
mutate(across(-contains('date'), as.numeric)) %>%
mutate(date= floor_date(date, 'month')) %>%
group_by(date) %>%
summarise_at(vars(cols), funs(mean(., na.rm=TRUE))) %>%
mutate(across(cols, list(yoy = ~(. - lag(., 12))/lag(., 12)))*100) %>%
mutate(across(cols, list(mom = ~(. - lag(., 1))/lag(., 1)))*100) %>%
filter(date >= '2018-01-01' & date <= '2021-12-31') %>%
as.data.frame()
df1 <- df %>%
select(!grep('mom|yoy', names(df)))
df1_long <- melt(df1, id.vars = 'date')
plot1 <- ggplot(df1_long[!is.na(df1_long$value), ],
aes(x = date,
y = value,
col = variable)) +
geom_line(size=0.6, alpha=0.5) +
geom_point(size=1, alpha=0.8) +
labs(
x='',
y='Unit: $'
)
# MoM changes
df2 <- df %>%
select(grep('date|mom', names(df)))
df2_long <- melt(df2, id.vars = 'date')
plot2 <- ggplot(df2_long[!is.na(df2_long$value), ],
aes(x = date,
y = value,
col = variable)) +
geom_line(size=0.6, alpha=0.5) +
geom_point(size=1, alpha=0.8) +
labs(
x='',
y='Unit: %'
)
# YoY changes
df3 <- df %>%
select(grep('date|yoy', names(df)))
df3_long <- melt(df3, id.vars = 'date')
plot3 <- ggplot(df3_long[!is.na(df3_long$value), ],
aes(x = date,
y = value,
col = variable)) +
geom_line(size=0.6, alpha=0.5) +
geom_point(size=1, alpha=0.8) +
labs(
x='',
y='Unit: %'
)
plot <- plot1 + plot2 + plot3 + plot_layout(ncol=1)
# plot <- plot_grid(plot1, plot2, plot3, labels = c('Value', 'MoM', 'YoY'), label_size = 12)
plot
Out:
The expected result will be similar to the plot below (the upper plot will display the original data, the middle plot will display the mom changes data, and the lower plot will display the yoy changes data):
References:
https://waterdata.usgs.gov/blog/beyond-basic-plotting/
http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
Side-by-side plots with ggplot2
Maybe this is what you are looking for? By reshaping your data to long format and combining a plot function with e.g. purrr::map2, you can achieve your desired result without duplicating your code:
Using some fake random example data to mimic your true data:
library(tidyr)
library(dplyr)
library(ggplot2)
df_long <- df |>
rename(food_index_raw = food_index, energy_index_raw = energy_index) |>
pivot_longer(-date, names_to = c("variable", ".value"), names_pattern = "^(.*?_index)_(.*)$")
plot_fun <- function(x, y, ylab) {
x <- x |>
select(date, variable, value = .data[[y]]) |>
filter(!is.na(value))
ggplot(
x,
aes(
x = date,
y = value,
col = variable
)
) +
geom_line(size = 0.6, alpha = 0.5) +
geom_point(size = 1, alpha = 0.8) +
labs(
x = "",
y = ylab
)
}
yvars <- c("raw", "mom", "yoy")
ylabs <- paste0("Unit: ", c("$", "%", "%"))
plots <- purrr::map2(yvars, ylabs, plot_fun, x = df_long)
library(patchwork)
wrap_plots(plots) + plot_layout(ncol = 1)
DATA
set.seed(123)
date <- seq.POSIXt(as.POSIXct("2017-01-31"), as.POSIXct("2022-12-31"), by = "month")
food_index <- runif(length(date))
energy_index <- runif(length(date))
df <- data.frame(date, food_index, energy_index)
EDIT Adding subtitles to each plot when using patchwork is (as of the moment) a bit tricky. What I would do in this case would be to use a faceting "hack". To this end I slightly adjusted the function to take a subtitle argument and switched to purrr::pmap:
library(tidyr)
library(dplyr)
library(ggplot2)
df_long <- df |>
rename(food_index_raw = food_index, energy_index_raw = energy_index) |>
pivot_longer(-date, names_to = c("variable", ".value"), names_pattern = "^(.*?_index)_(.*)$")
plot_fun <- function(x, y, ylab, subtitle) {
x <- x |>
select(date, variable, value = .data[[y]]) |>
filter(!is.na(value))
ggplot(
x,
aes(
x = date,
y = value,
col = variable
)
) +
geom_line(size = 0.6, alpha = 0.5) +
geom_point(size = 1, alpha = 0.8) +
facet_wrap(~.env$subtitle) +
labs(
x = "",
y = ylab
) +
theme(strip.background = element_blank(), strip.text.x = element_text(hjust = 0))
}
yvars <- c("raw", "mom", "yoy")
ylabs <- paste0("Unit: ", c("$", "%", "%"))
subtitle <- c("Original", "Month-to-Month", "Year-to-Year")
plots <- purrr::pmap(list(y = yvars, ylab = ylabs, subtitle = subtitle), plot_fun, x = df_long)
library(patchwork)
wrap_plots(plots) + plot_layout(ncol = 1)
The target output is done with facets rather than stitching plots together. You could do this too if you like, but it requires reshaping your data in a different way. Which approach you take is really a matter of taste.
library(ggplot2)
library(dplyr)
yoy <- function(x) 100 * (x - lag(x, 12)) / lag(x, 12)
mom <- function(x) 100 * (x - lag(x)) / lag(x)
df %>%
mutate(date = as.Date(date, origin = "1899-12-30"),
`Actual value (Dollars).Food Index` = food_index,
`Month-on-month change (%).Food Index` = mom(food_index),
`Year-on-year change (%).Food Index` = yoy(food_index),
`Actual value (Dollars).Energy Index` = energy_index,
`Month-on-month change (%).Energy Index` = mom(energy_index),
`Year-on-year change (%).Energy Index` = yoy(energy_index)) %>%
select(-food_index, -energy_index) %>%
tidyr::pivot_longer(-1) %>%
filter(date > as.Date("2018-01-01")) %>%
tidyr::separate(name, into = c("series", "index"), sep = "\\.") %>%
ggplot(aes(date, value, color = index)) +
geom_point(na.rm = TRUE) +
geom_line() +
facet_grid(series~., scales = "free_y") +
theme_bw(base_size = 16)
Reproducible data taken from link in question
df <- structure(list(date = c(42766, 42794, 42825, 42855, 42886, 42916,
42947, 42978, 43008, 43039, 43069, 43100, 43131, 43159, 43190,
43220, 43251, 43281, 43312, 43343, 43373, 43404, 43434, 43465,
43496, 43524, 43555, 43585, 43616, 43646, 43677, 43708, 43738,
43769, 43799, 43830, 43861, 43890, 43921, 43951, 43982, 44012,
44043, 44074, 44104, 44135, 44165, 44196, 44227, 44255, 44286,
44316, 44347, 44377, 44408, 44439, 44469, 44500, 44530, 44561
), food_index = c(58.53, 61.23, 55.32, 55.34, 61.73, 56.91, 54.27,
59.08, 60.11, 66.01, 60.11, 63.41, 69.8, 72.45, 81.11, 89.64,
88.64, 88.62, 98.27, 111.11, 129.39, 140.14, 143.44, 169.21,
177.39, 163.88, 135.07, 151.28, 172.81, 143.82, 162.13, 172.22,
176.67, 179.3, 157.27, 169.12, 192.51, 194.2, 179.4, 169.1, 193.17,
174.92, 181.92, 188.41, 192.14, 203.41, 194.19, 174.3, 174.86,
182.33, 182.82, 185.36, 192.41, 195.59, 202.6, 201.51, 225.01,
243.78, 270.67, 304.57), energy_index = c(127.36, 119.87, 120.96,
112.09, 112.19, 109.24, 109.56, 106.89, 109.35, 108.35, 112.39,
117.77, 119.52, 122.24, 120.91, 125.41, 129.72, 135.25, 139.33,
148.6, 169.62, 184.23, 204.38, 198.55, 189.29, 202.47, 220.23,
240.67, 263.12, 249.74, 240.84, 243.42, 261.2, 256.76, 258.69,
277.98, 289.63, 293.46, 310.81, 318.68, 310.04, 302.17, 298.62,
260.92, 269.29, 258.84, 241.68, 224.18, 216.36, 226.57, 235.98,
253.86, 267.37, 261.99, 273.37, 280.91, 291.84, 297.88, 292.78,
289.79)), row.names = c(NA, 60L), class = "data.frame")
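To make the two percentage-change formulas concrete, here is a small base-R sketch on a made-up series of 14 monthly values. The helpers lag_n and pct_change are hypothetical names; lag_n only mimics what dplyr::lag() does for a numeric vector and assumes n is smaller than the length of x.

```r
# Hypothetical stand-in for dplyr::lag(): shift x by n positions, padding with NA
lag_n <- function(x, n = 1) c(rep(NA, n), head(x, -n))

# Percentage change relative to the value n periods earlier
pct_change <- function(x, n) 100 * (x - lag_n(x, n)) / lag_n(x, n)

# Made-up monthly series with 14 observations
x <- c(100, 110, 121, rep(121, 10), 150)

mom <- pct_change(x, 1)   # month over month: compare with the previous month
yoy <- pct_change(x, 12)  # year over year: compare with the same month a year earlier
round(mom, 2)
round(yoy, 2)
```

The first value of mom and the first 12 values of yoy are NA because there is nothing to compare them with, which is why the plots start later than the raw data.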

R: How to make multiple plots in a graph using unique IDs

I have a dataframe df with 4 unique UID - 1001,1002,1003,1004.
I want to write a user-defined function in R that does the following:
Plots Turbidity against Time for each unique UID. Turbidity values are the ones in the Time_1, Time_2 and Time_3 columns. For example, UID = 1001 will have 4 plots in one graph
Add a legend to each graph such as M-L, F-L, M-R, and F-R (from columns Gen and Type)
Add a title to each graph. For example- UID:1001
Export the graphs as pdf, jpeg or tiff files - 4 graphs per page
# dataset
Gen <- c('M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F')
Site <- rep('FRX',length(Gen))
Type <- c('L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R')
UID <- c(1001,1002,1003,1004,1001,1002,1003,1004,1001,1002,1003,1004,1001,1002,1003,1004)
Time_1 <- c(100.78,112.34,108.52,139.19,149.02,177.77,79.18,89.10,106.78,102.34,128.52,119.19,129.02,147.77,169.18,170.11)
Time_2 <- c(150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,247.73,269.91,249.19)
Time_3 <- c(250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.19)
df <- data.frame(Gen,Site,Type,UID,Time_1,Time_2,Time_3)
df
My attempt
library(ggplot2)
library(tidyr)
# See below for my thoughts/attempt- I am open to other R libraries and approaches
graphplotter <-function(x){
# 1. Convert from wide to long
data_long <- gather(df, time, turbidity, Time_1:Time_3, factor_key=TRUE)
data_long
#2. plot for each unique UID- 1001 to 1004 and add legend
basic <- ggplot(datalong, aes(time, turbidity, shape=Tree)) + geom_point() + geom_line()
basic + theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6))
#3. add title
print(basic+ labs( title= "UID: 1001, Tubidity against time", y="turbidity", x = "Time in hours"))
#4. export as pdf
pdf("turbdity-time.pdf")
par(mfrow = c(2, 2)) ## set the layout to be 2 by 2
sapply(1:4, function(i) plot(basic[,i]))
dev.off()
}
I want all four graphs to look something like this (ignore the circumference and age, should be turbidity and time).
Thanks
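I would use facet_wrap to draw one panel per UID:
library(dplyr) # for mutate(); ggplot2 and tidyr are already loaded in the question
graphplotter <-function(x){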
x %>%
gather(., time, turbidity, Time_1:Time_3, factor_key=TRUE) %>%
mutate(label = (paste0(Gen, "-", Type))) %>%
#group_by(UID) %>%
ggplot(aes(color = label)) + geom_point(aes(time, turbidity, shape = label, group = label)) +
geom_line(aes(time, turbidity, group = label)) + facet_wrap(~UID) + theme(
legend.position = c(1, 1),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(1, 1, 1, 1),
legend.text = element_text(size = 7))
}
graphplotter(df)

R - How to 'create' or plot missing data?

I have a dataset, AIS_dat, which looks at the number of boats (BoatCount) present at three sites (Site) on different days of the week (Day), before and during a Covid lockdown.
rm(list = ls())
setwd('K:/SoundTrap/Boats/PSD Output/Duty cycle data/TOL analysis')
getwd()
AIS_dat<-read.csv("AllSitesConcat_dBcalcs_50-24000Hz_matchedCameraCounts.csv")
str(AIS_dat)
#set factors
AIS_dat$Lockdown <- as.factor(AIS_dat$Lockdown)
#change order of sites
AIS_dat$Site<-factor(AIS_dat$Site,
level=c('Kawau','Tiritiri','Noises'))
#change order of days
AIS_dat$Day<-factor(AIS_dat$Day,
level=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun'))
#hour of day as factor
AIS_dat$Hour <- as.factor(AIS_dat$Hour)
#Look at variation between sites
bp<-ggplot(AIS_dat, aes(x=Day,y=BoatCount,fill=factor(Site))) +
geom_boxplot()+
ylab(expression("Number of Boats"))+
xlab("Day of Week")+
scale_fill_manual(values = get_pal("Kereru"),
name="Site") +
theme_bw()
bp
bp<-bp+theme(axis.text.x = element_text(angle = 0,size=14),
axis.text.y = element_text(size=14),
axis.title.x = element_text(size=14),
axis.title.y =element_text(size=14),
#legend.title = element_text(size = 14),
#legend.text = element_text(size = 14)
) #rotate x-axis labels
bp<-bp+facet_grid(rows=vars(Lockdown)) #separate plot for each season
bp
The plot looks like this
...which is awesome. BUT, because there is no data for Tiritiri under 'During', we only see two sites, and it is a bit misleading as the order of the sites then changes. I'd like to have missing data in this panel to make it clear which site is which. How would I do this? Would I have to add missing data to my original datasheet somehow? I tried this but it didn't work, so if that is the best approach I am not sure which way to do it?
The dataset looks like this (but has 17143 rows):
You can combine site and day into a new column used for the x axis ticks. By default, facet_grid uses the same x scale in every panel, so the tick positions stay fixed across panels:
library(tidyverse)
set.seed(1337)
before_data <-
tribble(
~step, ~Site,
"before", "K",
"before", "T",
"before", "N"
) %>%
mutate(
value = rnorm(10) %>% list()
) %>%
unnest(value)
during_data <-
tribble(
~step, ~Site,
"during", "K",
"during", "N"
) %>%
mutate(
value = rnorm(10) %>% list()
) %>%
unnest(value)
data <- bind_rows(before_data, during_data) %>% mutate(day = "Monday")
data %>%
mutate(x = paste0(day, Site)) %>%
ggplot(aes(x, value, color = Site)) +
geom_boxplot() +
facet_grid(~step)
Created on 2021-09-13 by the reprex package (v2.0.1)

Use scale_x_continuous with labeller function that also takes a data frame as an argument as well as default breaks

Here's a code block:
# scale the log of price per group (cut)
my_diamonds <- diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
mutate(scaled_log_price = scale(log_price) %>% as.numeric) %>% # scale within each group as opposed to overall
nest() %>%
mutate(mean_log_price = map_dbl(data, ~ .x$log_price %>% mean)) %>%
mutate(sd_log_price = map_dbl(data, ~ .x$log_price %>% sd)) %>%
unnest %>%
select(cut, price, scaled_log_price:sd_log_price) %>%
ungroup
# for each cut, find the back transformed actual values (exp) of each unit of zscore between -3:3
for (i in -3:3) {
my_diamonds <- my_diamonds %>%
mutate(!! paste0('mean_', ifelse(i < 0 , 'less_', 'plus_'), abs(i), 'z') := map2(.x = mean_log_price, .y = sd_log_price, ~ (.x + (i * .y)) %>% exp) %>% unlist)
}
my_diamonds_split <- my_diamonds %>% group_split(cut)
split_names <- my_diamonds %>% mutate(cut = as.character(cut)) %>% group_keys(cut) %>% pull(cut)
names(my_diamonds_split) <- split_names
I now have a variable my_diamonds_split that is a list of data frames. I would like to loop over these data frames and each time create a new ggplot.
I can use a custom labeller function with a single df, but I don't know how to do this within a loop:
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(ex_df$price) * x + mean(ex_df$price)))
}
ex_df <- my_diamonds_split$Ideal
ex_df %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller, limits = c(-3, 3))
This creates a plot for the 'Ideal' cut of diamonds. Each tick on the x axis shows two values: the z-score (-2, 0 and 2) and the corresponding raw dollar value (3.8K, 3.9K and 11.8K).
When I define the labeller function, I must specify the df to scale with. I tried using the dot in place of ex_df, hoping that ggplot would pick up the piped-in data frame on each iteration:
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(.$price) * x + mean(.$price)))
}
ex_df <- my_diamonds_split$Ideal
ex_df %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller, limits = c(-3, 3))
Returns:
Error in is.data.frame(x) : object '.' not found
I then tried writing the function to accept an argument for the df to scale with:
labeller <- function(x, df) {
paste0(x,"\n", scales::dollar(sd(df$price) * x + mean(df$price)))
}
ex_df <- my_diamonds_split$Ideal
ex_df %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller(df = ex_df), limits = c(-3, 3)) # because when it comes to running in real life, I will try something like labeller(df = my_diamonds_split[[i]])
Error in paste0(x, "\n", scales::dollar(sd(df$price) * x + mean(df$price))) :
argument "x" is missing, with no default
Bearing in mind that the scaling must be done per iteration, how could I loop over my_diamonds_split, and on each iteration generate a ggplot per above?
labeller <- function(x) {
# how can I make df variable
paste0(x,"\n", scales::dollar(sd(df$price) * x + mean(df$price)))
}
for (i in split_names) {
my_diamonds_split[[i]] %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller, # <--- here, labeller must be defined with df$price except that will differ on each iteration
limits = c(-3, 3))
}
There's a hacky way to get this result in facets. Basically, after converting to z scores, you add different amounts (say, multiples of 1000) to each group's z scores. Then you set all the breaks to this collection of points and label them with pre-calculated labels.
library(ggplot2)
library(dplyr)
f <- function(x) {
y <- diamonds$price[diamonds$cut == x]
paste(seq(-3, 3), scales::dollar(round(mean(y) + seq(-3, 3) * sd(y))), sep = "\n")
}
breaks <- as.vector(sapply(levels(diamonds$cut), f))
diamonds %>%
group_by(cut) %>%
mutate(z = scale(price) + 3 + 1000 * as.numeric(cut)) %>%
ggplot(aes(z)) +
geom_point(aes(x = z - 2, y = 1), alpha = 0) +
geom_density() +
scale_x_continuous(breaks = as.vector(sapply(1:5 * 1000, "+", 0:6)),
labels = breaks) +
facet_wrap(vars(cut), scales = "free_x") +
theme(text = element_text(size = 16),
axis.text.x = element_text(size = 6))
You would have to increase the plot size to make the dollar values more visible of course.
Created on 2020-08-04 by the reprex package (v0.3.0)
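For the loop version asked about in the question, one possible approach is a function factory: a function that takes the data frame and returns a one-argument labeller bound to it. The sketch below is an assumption about how this could be done, not part of the answer above. make_labeller and toy are hypothetical names, the label formula mirrors the one in the question, and formatC() stands in for scales::dollar() so the example runs in base R.

```r
# Hypothetical factory: returns a labeller that remembers df's mean and sd
make_labeller <- function(df) {
  m <- mean(df$price)
  s <- sd(df$price)
  function(x) {
    dollars <- formatC(s * x + m, format = "f", digits = 0, big.mark = ",")
    paste0(x, "\n", "$", dollars)
  }
}

# Made-up data frame standing in for one element of my_diamonds_split
toy <- data.frame(price = c(100, 200, 300))
lab <- make_labeller(toy)
lab(c(-1, 0, 1))
```

Inside the loop this could be passed as scale_x_continuous(labels = make_labeller(my_diamonds_split[[i]]), limits = c(-3, 3)). Note that a ggplot created inside a for loop has to be wrapped in print() to be drawn.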
