R: Calculate percentages within groups

I have a list of interviews conducted by two survey institutes A + B over a long period of time (several years) and a corresponding date variable:
date_of_interview institute
--------------------------
2021-04-01 A
2021-04-01 A
2021-04-02 A
2021-04-02 A
2021-04-02 A
2021-04-02 B
2021-04-02 B
2021-04-02 B
etc.
All interviews should be evenly distributed over the weekdays (Monday to Friday). In order to check this, I would like to create the following graphic with a time variable on the x-axis (calendar weeks from 1-52):
library(tidyverse)
df <- df %>% mutate(weekday = format(date_of_interview, "%u"),
week = format(date_of_interview, "%V"))
However, I am struggling to calculate the percentages of the weekdays within the week groups. Each weekday (Mon-Fri) should account for around 20%.
ggplot(df, aes(x = week, fill = weekday, group = weekday)) +
geom_bar(position = "stack") +
facet_wrap(institute ~.)

From what I understand you want each facet to be an institute, each group per facet to be a weekday, and the filling to be the weekdays themselves. You can shuffle them around to suit your requirement if I have misunderstood.
library(dplyr)
library(ggplot2)
df <- df %>%
mutate(
week = format(date_of_interview, "%V"),
weekday = format(date_of_interview, "%u"),
.keep='unused'
) %>%
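group_by(institute, week, weekday) %>%
# count interviews per institute, week and weekday; keep the (institute, week) grouping
summarise(n = n(), .groups = 'drop_last') %>%
# share of each weekday within its institute/week group
mutate(freq = n / sum(n), .keep='unused') %>%
ungroup()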
ggplot(df, aes(x=week, y=freq, fill=weekday)) +
geom_bar(stat='identity') +
facet_wrap(institute ~.)
I tested on this dataframe:
df <- data.frame(
date_of_interview = as.Date(c(
'2021-04-01', '2021-04-01', '2021-04-02', '2021-04-02',
'2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02',
'2021-04-09', '2021-04-10', '2021-04-11')),
institute = c('A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A')
)
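As a quick check of the within-group logic, here is a base-R sketch (not part of the answer above) that computes the same shares on the test data frame and confirms that they add up to 1 in every institute/week group. The names counts and totals are only illustrative.

```r
# Test data from the answer above
df <- data.frame(
  date_of_interview = as.Date(c(
    "2021-04-01", "2021-04-01", "2021-04-02", "2021-04-02",
    "2021-04-02", "2021-04-02", "2021-04-02", "2021-04-02",
    "2021-04-09", "2021-04-10", "2021-04-11")),
  institute = c("A", "A", "A", "A", "A", "B", "B", "B", "A", "A", "A")
)
df$week    <- format(df$date_of_interview, "%V")  # ISO week number
df$weekday <- format(df$date_of_interview, "%u")  # 1 = Monday ... 7 = Sunday

# Count interviews per institute, week and weekday
counts <- aggregate(n ~ institute + week + weekday,
                    data = transform(df, n = 1), FUN = sum)

# Share of each weekday within its institute/week group
counts$freq <- counts$n / ave(counts$n, counts$institute, counts$week, FUN = sum)

# Every institute/week group should add up to 1
totals <- aggregate(freq ~ institute + week, data = counts, FUN = sum)
print(totals)
```

If the totals are not all 1, the grouping used for sum(n) is probably not the one you intended.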

Related

Plot time series of different years together

I am trying to compare different years' variables but I am having trouble plotting them together.
The time series is a temperature series which can be found in https://github.com/gonzalodqa/timeseries as temp.csv
I would like to plot something like the image but I find it difficult to subset the months between the years and then combine the lines in the same plot under the same months
If someone can give some advice or point me in the right direction I would really appreciate it
You can try this way.
The first chart shows all the available temperatures, the second chart is aggregated by month.
In the first chart, we force the same year so that ggplot will plot them aligned, but we separate the lines by colour.
For the second one, we just use month as x variable and year as colour variable.
Note that:
with scale_x_datetime we can hide the year, so that no one can see that we forced the year 2020 onto every observation
with scale_x_continuous we can show the names of the months instead of their numbers
[just try running the charts with and without scale_x_... to see what I mean]
month.abb is a useful built-in constant holding the abbreviated month names.
# read data
df <- readr::read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
# libraries
library(ggplot2)
library(dplyr)
# line chart by datetime
df %>%
# make datetime: force unique year
mutate(datetime = lubridate::make_datetime(2020, month, day, hour, minute, second)) %>%
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = factor(year))) +
scale_x_datetime(breaks = lubridate::make_datetime(2020,1:12), labels = month.abb) +
labs(title = "Temperature by Datetime", colour = "Year")
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
ggplot() +
geom_line(aes(x = month, y = T42, colour = factor(year))) +
scale_x_continuous(breaks = 1:12, labels = month.abb, minor_breaks = NULL) +
labs(title = "Average Temperature by Month", colour = "Year")
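The "force the same year" idea behind the first chart can be seen in isolation with a few made-up dates. This is only a base-R sketch of the principle (the answer itself uses lubridate::make_datetime); the variable names are illustrative.

```r
# Observations from different years (made-up dates)
d <- as.Date(c("2018-03-15", "2019-03-15", "2019-11-02"))
original_year <- format(d, "%Y")

# Replace the year by 2020 but keep month and day, so that
# the same calendar day lands on the same x position
aligned <- as.Date(format(d, "2020-%m-%d"))

print(data.frame(original_year, aligned))
```

Using a leap year such as 2020 as the common year means that a 29 February in the data still maps to a valid date.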
In case you want your chart to start from July, you can use this code instead:
months_order <- c(7:12,1:6)
# line chart by month
df %>%
# average by year-month
group_by(year, month) %>%
summarise(T42 = mean(T42, na.rm = TRUE), .groups = "drop") %>%
# create new groups starting from each July
group_by(neworder = cumsum(month == 7)) %>%
# keep only complete years
filter(n() == 12) %>%
# give new names to groups
mutate(years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# reorder months
mutate(month = factor(month, levels = months_order, labels = month.abb[months_order], ordered = TRUE)) %>%
# plot
ggplot() +
geom_line(aes(x = month, y = T42, colour = years, group = years)) +
labs(title = "Average Temperature by Month", colour = "Year")
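The cumsum(month == 7) step is what splits the data into July-to-June seasons. A small sketch on a made-up, chronologically ordered sequence of months shows what it produces (the trick assumes the rows are sorted by year and month):

```r
# Months from July 2018 to June 2020, in chronological order
month <- c(7:12, 1:12, 1:6)
year  <- c(rep(2018, 6), rep(2019, 12), rep(2020, 6))

# The counter increases by one at every July, so each
# July-to-June block gets its own group number
neworder <- cumsum(month == 7)

# Label each group by the calendar years it spans
label <- tapply(year, neworder, function(y) paste(unique(y), collapse = " / "))
print(table(neworder))
print(label)
```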
EDIT
To have something similar to the first plot but starting from July, you could use the following code:
# libraries
library(ggplot2)
library(dplyr)
library(lubridate)
# custom months order
months_order <- c(7:12,1:6)
# fake dates for plot
# note: choose year 4 so that 29 Feb, which exists only in leap years, is included
dates <- make_datetime(c(rep(3,6), rep(4,6)), months_order)
# line chart by datetime
df %>%
# create date time
mutate(datetime = make_datetime(year, month, day, hour, minute, second)) %>%
# filter years of interest
filter(datetime >= make_datetime(2018,7), datetime < make_datetime(2020,7)) %>%
# create increasing group after each july
group_by(year, month) %>%
mutate(dummy = month(datetime) == 7 & datetime == min(datetime)) %>%
ungroup() %>%
mutate(dummy = cumsum(dummy)) %>%
# force unique years and create custom name
group_by(dummy) %>%
mutate(datetime = datetime - years(year - 4) - years(month>=7),
years = paste(unique(year), collapse = " / ")) %>%
ungroup() %>%
# plot
ggplot() +
geom_line(aes(x = datetime, y = T42, colour = years)) +
scale_x_datetime(breaks = dates, labels = month.abb[months_order]) +
labs(title = "Temperature by Datetime", colour = "Year")
To order the months differently and aggregate the values over pairs of years, you have to work a bit on your data before plotting:
library(dplyr) # work data
library(ggplot2) # plots
library(lubridate) # date
library(readr) # fetch data
# your data
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df %>%
mutate(date = make_date(year, month,day)) %>%
# reorder month
group_by(month_2 = factor(as.character(month(date, label = T, locale = Sys.setlocale("LC_TIME", "English"))),
levels = c('Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun')),
# group years as you like
year_2 = ifelse( year(date) %in% (2018:2019), '2018/2019', '2020/2021')) %>%
# you can put whatever aggregation function you need
summarise(val = mean(T42, na.rm = T)) %>%
# plot it!
ggplot(aes(x = month_2, y = val, color = year_2, group = year_2)) +
geom_line() +
ylab('T42') +
xlab('month') +
theme_light()
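The reordering of the x-axis above relies on factor levels: ggplot2 draws a discrete axis in the order of the levels, not alphabetically or by calendar. A minimal base-R sketch with three month labels (values chosen only for illustration):

```r
# Levels starting in July instead of January
months_order <- c(7:12, 1:6)
m <- factor(c("Jan", "Jul", "Dec"), levels = month.abb[months_order])

# Sorting follows the level order, so Jul comes first and Jan last
print(sort(m))
print(as.integer(m))  # position of each value within the July-first ordering
```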
A slightly different solution that avoids the trick of forcing all dates to 2020.
library(tidyverse)
library(lubridate)
df <- read_csv2("https://raw.githubusercontent.com/gonzalodqa/timeseries/main/temp.csv")
df <- df |>
filter(year %in% c(2018, 2019, 2020)) |>
mutate(year = factor(year),
# zero-pad month and day to two digits so the discrete axis sorts chronologically
month = sprintf("%02d", month),
day = sprintf("%02d", day),
month_day = paste0(month, "-", day))
df |> ggplot(aes(x=month_day, y=T42, group=year, col=year)) +
geom_line() +
scale_x_discrete(breaks = c("01-01", "02-01", "03-01", "04-01", "05-01", "06-01", "07-01", "08-01", "09-01", "10-01", "11-01", "12-01"))
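Since month_day is plotted on a discrete axis, its values are ordered as text, so month and day each need a fixed width of two digits. A small sketch with made-up day values shows the difference between prepending "0" unconditionally and padding with sprintf():

```r
day <- c(1, 2, 10)

naive  <- paste0(0, day)         # "01" "02" "010": day 10 gets three characters
padded <- sprintf("%02d", day)   # "01" "02" "10"

# Text sorting puts "010" between "01" and "02" ...
print(sort(naive))
# ... whereas the fixed-width version sorts chronologically
print(sort(padded))
```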

Counting observations in each year from a date range in dplyr

Let's say I have a data.frame consisting of industry type and starting and ending dates (e.g. for an employee).
mydf <- data.frame(industry = c("Government", "Education", "Military", "Private Sector", "Government", "Private Sector"),
start_date = c("2014-01-01", "2016-02-01", "2012-11-01", "2013-03-01", "2012-12-01", "2011-12-01"),
end_date = c("2020-12-01", "2016-10-01", "2014-01-01", "2016-10-01", "2015-10-01", "2014-09-01"))
> mydf
industry start_date end_date
1 Government 2014-01-01 2020-12-01
2 Education 2016-02-01 2016-10-01
3 Military 2012-11-01 2014-01-01
4 Private Sector 2013-03-01 2016-10-01
5 Government 2012-12-01 2015-10-01
6 Private Sector 2011-12-01 2014-09-01
I'd like to create a stacked ggplot bar chart in which each unique year in the start_date column is on the X axis (e.g. 2011-2016) and the y axis represents the total number of observations (the row count) represented in a given industry for that year.
I'm not sure what the right way to manipulate the data.frame to allow for this. Presumably I'd need to manipulate the data to have columns for industry year and count. But I'm not sure how to produce the year columns from a date range. Any ideas?
Convert the date columns to Date, create the 'date' sequence from the 'start_date' to 'end_date' for each row with map2 (from purrr), unnest the list output, count the year and plot with geom_bar
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(lubridate) # for year()
mydf %>%
mutate(across(c(start_date, end_date), as.Date)) %>%
transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>%
unnest(c(date)) %>%
count(industry, year = factor(year(date))) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_col() +
theme_bw()
If the plot should be separate for each 'industry'
mydf %>%
mutate(across(c(start_date, end_date), as.Date)) %>%
transmute(industry, date = map2(start_date, end_date, seq, by = 'day')) %>%
unnest(c(date)) %>%
count(industry, year = factor(year(date))) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_col() +
facet_wrap(~ industry) +
theme_bw()
As @IanCampbell suggested, the by for seq can be 'year'
mydf %>%
mutate(across(c(start_date, end_date), as.Date)) %>%
transmute(industry, date = map2(start_date, end_date, seq, by = 'year')) %>%
unnest(c(date)) %>%
count(industry, year = factor(year(date))) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_col() +
facet_wrap(~ industry) +
theme_bw()
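One thing worth checking with by = 'year': seq() steps in whole years from the start date, so the calendar year of end_date can be skipped when the end falls earlier in the year than the start. A small sketch using the dates of the "Military" row:

```r
start <- as.Date("2012-11-01")
end   <- as.Date("2014-01-01")

# Yearly steps from 2012-11-01: the next step after 2013-11-01 would overshoot the end
by_year <- seq(start, end, by = "year")
print(format(by_year, "%Y"))

# Daily steps touch every calendar year in the range
by_day <- seq(start, end, by = "day")
print(unique(format(by_day, "%Y")))
```

Note also that with by = 'day' the count n is the number of days per industry and year, not the number of rows.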
Is this what you're looking for?
I would recommend using purrr::pmap to create a new data frame with one row for each year based on each row of the original data.
We can use the purrr::pmap_dfr to automatically return a single data frame bound by row.
We can use the ~with(list(...), ) trick to be able to reference columns by name.
Then we can use dplyr::count to count by combinations of columns. Then it's easy.
library(dplyr)
library(purrr)
library(lubridate)
library(ggplot2)
mydf %>%
mutate(across(c(start_date, end_date), as.Date),
start_year = year(start_date),
end_year = year(end_date)) %>%
pmap_dfr(~with(list(...),data.frame(industry,
year = seq(start_year, end_year)))) %>%
count(year, industry) %>%
ggplot(aes(x = year, y = n, fill = industry)) +
geom_bar(stat="identity")
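For reference, the same "one row per year" expansion can be sketched in base R without purrr. This is only an illustration of the idea, with the helper names (years, expanded, counts) made up here; each row counts once for every calendar year between its start and end year.

```r
mydf <- data.frame(
  industry = c("Government", "Education", "Military",
               "Private Sector", "Government", "Private Sector"),
  start_date = as.Date(c("2014-01-01", "2016-02-01", "2012-11-01",
                         "2013-03-01", "2012-12-01", "2011-12-01")),
  end_date = as.Date(c("2020-12-01", "2016-10-01", "2014-01-01",
                       "2016-10-01", "2015-10-01", "2014-09-01"))
)

start_year <- as.integer(format(mydf$start_date, "%Y"))
end_year   <- as.integer(format(mydf$end_date, "%Y"))

# One integer sequence of years per row
years <- Map(seq, start_year, end_year)

# Repeat each industry once per year it covers
expanded <- data.frame(industry = rep(mydf$industry, lengths(years)),
                       year = unlist(years))

# Row count per industry and year, dropping empty combinations
counts <- as.data.frame(table(expanded), responseName = "n")
counts <- counts[counts$n > 0, ]
print(counts)
```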

Using group_by, nesting(), complete() and compute time interval over a huge data set in R

I struggle on my data for a long time and I don't know how to solve my problem. I work on nutritional data, that can be faked by this data set:
library(tidyverse)
library(lubridate)
# Used for data generation
groupFunction <- function(cat){
case_when(
cat == "apple" ~ "food",
cat == "bread" ~ "food",
cat == "cheese" ~ "food",
cat == "chocolate" ~ "candy",
cat == "water" ~ "drink",
cat == "tea" ~ "drink"
)
}
# Generate the data
set.seed(0)
fakeData <- tibble(
id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
eaten_at = sample(seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/01/05'), by="15 min"), 40),
category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
group = groupFunction(category),
amount = sample(10:100, 40)
)
# For every id, for each day, every hour and each category: sum the eaten amount,
# and keep zero amounts so they are counted in the mean calculation in step 2!
# PROBLEM: we lose time intervals where a given id didn't eat anything, which will
# bias the mean calculation in step 2!
step1 <- fakeData %>%
mutate(eaten_at_hour = hour(eaten_at)) %>%
group_by(id, eaten_at, eaten_at_hour, category, group) %>%
summarise(eaten_amount = sum(amount)) %>%
ungroup() %>%
complete(nesting(id, eaten_at, eaten_at_hour),
nesting(category, group),
fill = list(eaten_amount = 0))
# For every id, average the eaten amount over the days for every hour interval.
# As before, keep zero counts so they are counted in the mean calculation in step 3!
step2 <- step1 %>%
group_by(id, eaten_at_hour, category, group) %>%
summarise(mean_per_id = mean(eaten_amount)) %>%
ungroup() %>%
complete(nesting(id, eaten_at_hour),
nesting(category, group),
fill = list(mean_per_id = 0))
# Mean over all id
step3 <- step2 %>%
group_by(eaten_at_hour, category, group) %>%
summarise(mean_for_all = mean(mean_per_id)) %>%
ungroup()
# Plot the data
ggplot(step3, aes(x=eaten_at_hour, y=category, color = mean_for_all, shape = group)) +
geom_point( size = 3) +
scale_color_gradient(low="blue", high="red", "Mean eaten\namount [g]")
What I want to build is a plot with 1h time interval on the x axis and the different food categories on the y axis, with mean eaten amount for all ids for each X min period over 24h (i.e. time interval must be flexible). I would like a plot looking like this:
My thought was to compute:
for every id,
for every day this id has eaten something,
for every time interval of X hour (even if the id didn't eat anything),
and for every food category:
-> sum the eaten amount
Then:
for each id,
for each category,
and for each 1 hour interval over the days of participation:
-> average the eaten amount
Then:
-> average all id so that we get for each category and for each 1 hour interval of a 24h period, the mean eaten amount
For this I use the group_by(), nesting() and complete() functions. But I have 3 problems:
I want to be able to set the desired time interval; it can be 15 min but also 2 hours. I haven't found a solution to this yet.
I need to have all time intervals for all ids even if they didn't eat anything (so amount = 0), because when I average over the days or across ids, the mean would be biased if I don't include the zero counts.
My actual data set includes about 100k rows, so I think that my way of doing it would not be the most appropriate in terms of efficiency. Furthermore, I want to design a shiny app for this data where a user could set the time interval manually, for example, which means the plot must be computed again and again (a lot of work for a computer when the code is not efficient...)
I'm aware that my question is geared towards a very specific problem, but since I'm really stuck I would highly appreciate any help/inputs/ideas on any of these points. Thanks a lot!
I'm not sure I fully understood your problem, but here is a draft of an answer.
First, a tricky way to group on an interval is to floor the hour (using lubridate::hour) divided by the step, and then multiply the result by the step. Then, I grouped by the id, hour, and group to sum and then by only hour and group to compute the mean.
eaten_n_hours = 2
df = fakeData %>%
mutate(hour = floor(hour(eaten_at)/eaten_n_hours)*eaten_n_hours) %>%
group_by(id, hour, group) %>%
summarise(amount = sum(amount, na.rm=TRUE)) %>%
group_by(hour, group) %>%
summarise(amount_m = mean(amount, na.rm=T),
amount_sd = sd(amount, na.rm=T)) %>%
identity()
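To make the flooring step concrete, here is what it does to a few made-up hour values with a 2-hour step: every hour is mapped to the start of the interval it falls in.

```r
eaten_n_hours <- 2
hour <- c(0, 1, 2, 3, 13, 23)

# Divide by the step, round down, multiply back: start of each interval
bin <- floor(hour / eaten_n_hours) * eaten_n_hours
print(bin)  # 0 0 2 2 12 22
```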
Then, you can plot the whole thing like this:
breaks_hour = seq(min(df$hour), max(df$hour)+1, eaten_n_hours)
ggplot(df, aes(x=hour, y=amount_m, group=group, color=group, fill=group))+
geom_col(position="dodge") +
# geom_errorbar(aes(ymin=amount_m-amount_sd, ymax=amount_m+amount_sd), position="dodge") +
scale_x_binned(breaks=breaks_hour)
This is not the prettiest plot ever, but I'm not sure whether that is due to my misunderstanding of the problem or to the example fakeData.
EDIT
I'm not familiar with tiles, but you can try using geom_tile this way. Also, using scales::breaks_width allows for a flexible time interval.
ggplot(df, aes(x=hour, y=group, fill=amount_m))+
geom_tile()+
scale_x_binned(breaks=scales::breaks_width(3)) # try other values
So I figured out a way to do this (thanks @Dan Chaltiel). It's surely not perfect, but I'll post it here so it can be useful to others or serve as a basis for discussion:
library(tidyverse)
library(lubridate)
# Used for data generation
groupFunction <- function(cat){
case_when(
cat == "apple" ~ "food",
cat == "bread" ~ "food",
cat == "cheese" ~ "food",
cat == "chocolate" ~ "candy",
cat == "water" ~ "drink",
cat == "tea" ~ "drink"
)
}
# Generate the data
set.seed(0)
fakeData <- tibble(
id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
eaten_at = sample(seq(as.POSIXct('2020/01/01 22:00:00'), as.POSIXct('2020/01/05'), by="17 min"), 40),
category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
group = groupFunction(category),
amount = sample(10:100, 40)
)
# Set time interval in minutes here (0-60 min only):
set_time <- 60
# Generate time sequence for one day (1440 minutes), with the desired interval input. Then set it as factor.
timeLevels <- seq(from = as.POSIXct("2020-1-1 0:00"), by = paste(set_time, "min", sep = " "), length.out = 1440/set_time)
timeLevels <- paste(hour(timeLevels), minute(timeLevels), sep = ":")
# Calculate the means, keeping zero counts
toPlot <- fakeData %>%
mutate(eaten_hour = floor_date(eaten_at, unit = paste(set_time, "min", sep = " ")),
eaten_hour = paste(hour(eaten_hour), minute(eaten_hour), sep = ":"),
eaten_hour = factor(eaten_hour, levels = timeLevels),
eaten_date = date(eaten_at)) %>%
group_by(eaten_date, eaten_hour, category, group) %>%
summarise(sum_amount = sum(amount)) %>%
ungroup() %>%
complete(eaten_date, eaten_hour, nesting(category, group), fill = list(sum_amount = 0)) %>%
group_by(eaten_hour, category, group) %>%
summarise(mean_amount = mean(sum_amount)) %>%
ungroup()
# Plot the data
gg <- ggplot(toPlot, aes(x=eaten_hour, y=category, fill=mean_amount))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + coord_equal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
gg <- gg + labs(x = "Time of the day", y = NULL, title = "Mean eaten quantity over one day", fill = "Mean amount [g]")
gg
Output looks like:
Still open to any input on how to improve my code!
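The two key ideas in the code above (binning timestamps into intervals of set_time minutes, and keeping empty intervals through factor levels) can be sketched in base R as follows. This is a simplified illustration on made-up timestamps, not a drop-in replacement; it works on the minute of the day, so under the assumption that set_time divides 1440 it should also handle intervals longer than 60 minutes.

```r
set_time <- 30  # bin width in minutes (assumed to divide 1440)

# Made-up meal times on three different days
eaten_at <- as.POSIXct(c("2020-01-01 22:10:00",
                         "2020-01-02 22:40:00",
                         "2020-01-03 07:05:00"), tz = "UTC")

# Minute of the day, then floor to the start of its bin
minute_of_day <- as.integer(format(eaten_at, "%H")) * 60L +
  as.integer(format(eaten_at, "%M"))
bin_start <- (minute_of_day %/% set_time) * set_time

# All possible bins of one day as factor levels, so that
# bins without any observation are kept with a count of 0
all_bins <- seq(0, 1440 - set_time, by = set_time)
bin_factor <- factor(bin_start, levels = all_bins)

counts <- table(bin_factor)
print(counts[counts > 0])
print(length(counts))  # number of bins per day, including empty ones
```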

How can I sum up values over time and deduct them as soon as they are not needed anymore?

Hi, and apologies if this has been asked already; I searched for an hour and could not find anything, so I'll ask:
I have data in this style (sorry for the horrible formatting, how can I make these prettier?):
person start_time end_time amount
A 2019-10-04 2020-04-21 10
A 2019-12-10 2020-01-09 20
B 2019-11-04 2020-08-21 30
B 2019-12-10 2020-01-20 15
C 2019-12-20 2020-03-19 5
So, I want to be able to plot the sum of the amount per person with ggplot2 over time until today (i.e. Sys.Date()).
This means, for person A, the plot should show 10 from 2019-10-04 until 2019-12-10, and afterwards it should jump to 30 (10+20). This is until 2020-01-09 (since this is in the past), where the amount should go back to 10.
Similarly, for person B the amount should be 30 between 2019-11-04 and 2019-12-10, afterwards it should be 45, and fall back to 30 on 2020-01-20.
I tried something along:
SumAmount <- data %>%
group_by(person,start_time,end_time) %>%
summarise(cumulatedAmount = sum(amount))
But this isn't what I need...
Thanks a lot and apologies again for the poor formatting.
Here is one idea. We can calculate the total amount of each date and then plot the total amount.
library(tidyverse)
library(lubridate)
dat2 <- dat %>%
# Convert to date class
mutate_at(vars(ends_with("time")), ymd) %>%
# Create a date sequence and expand it
mutate(Date = map2(start_time, end_time, seq.Date, by = 1)) %>%
unnest(cols = Date) %>%
# Calculate the total amount for each date
group_by(person, Date) %>%
summarize(amount = sum(amount))
ggplot(dat2, aes(x = Date, y = amount, color = person)) +
geom_point() +
geom_line()
Here is another option. It expands the data frame by date in the same way; after that, we can use stat_summary to sum and plot the data.
library(tidyverse)
library(lubridate)
dat2 <- dat %>%
# Convert to date class
mutate_at(vars(ends_with("time")), ymd) %>%
# Create a date sequence and expand it
mutate(Date = map2(start_time, end_time, seq.Date, by = 1)) %>%
unnest(cols = Date)
ggplot(dat2, aes(x = Date, y = amount, color = person)) +
stat_summary(fun = sum, geom = "path") # 'fun' replaces the deprecated 'fun.y' (ggplot2 >= 3.3.0)
Update
This solution adds a row with amount 0 on the day after the last date for each person, so that the line drops back to zero at the end.
library(tidyverse)
library(lubridate)
dat2 <- dat %>%
# Convert to date class
mutate_at(vars(ends_with("time")), ymd) %>%
# Create a date sequence and expand it
mutate(Date = map2(start_time, end_time, seq.Date, by = 1)) %>%
unnest(cols = Date) %>%
# Calculate the total amount for each date
group_by(person, Date) %>%
summarize(amount = sum(amount))
dat3 <- dat2 %>%
# Find the last date for each person
filter(Date == max(Date)) %>%
# Add one day to the last date for each person
# Set amount to be 0
mutate(Date = Date + 1, amount = 0)
# Combine data frames
dat4 <- bind_rows(dat2, dat3)
ggplot(dat4, aes(x = Date, y = amount, color = person)) +
geom_point() +
geom_line()
DATA
dat <- read.table(text = "person start_time end_time amount
A '2019-10-04' '2020-04-21' 10
A '2019-12-10' '2020-01-09' 20
B '2019-11-04' '2020-08-21' 30
B '2019-12-10' '2020-01-20' 15
C '2019-12-20' '2020-03-19' 5",
stringsAsFactors = FALSE, header = TRUE)
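Not taken from the answers above, but another way to think about "add at the start, deduct at the end" is an event table: each row contributes +amount on its start date and -amount on its end date, and a running sum over the sorted events gives the level at each change point. A minimal base-R sketch for person A only (the names events, change and total are made up here):

```r
# The two rows of person A
a <- data.frame(
  start_time = as.Date(c("2019-10-04", "2019-12-10")),
  end_time   = as.Date(c("2020-04-21", "2020-01-09")),
  amount     = c(10, 20)
)

# +amount when an interval starts, -amount when it ends
events <- data.frame(date   = c(a$start_time, a$end_time),
                     change = c(a$amount, -a$amount))

# Running total over the events in chronological order
events <- events[order(events$date), ]
events$total <- cumsum(events$change)
print(events)
```

In this sketch the amount drops on the end date itself, as described in the question, whereas the expanded-sequence approach treats the end date as still active. The result could be drawn with geom_step(), and for several persons the running sum would have to be computed per person.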

ggplot using grouped date variables (such as year_month)

I feel like this should be an easy task for ggplot, tidyverse, lubridate, but I cannot seem to find an elegant solution.
GOAL: Create a bar graph of my data aggregated/summarized/grouped_by year and month.
#Libraries
library(tidyverse)
library(lubridate)
# Data
date <- sample(seq(as_date('2013-06-01'), as_date('2014-5-31'), by="day"), 10000, replace = TRUE)
value <- rnorm(10000)
df <- tibble(date, value)
# Summarise
df2 <- df %>%
mutate(year = year(date), month = month(date)) %>%
unite(year_month,year,month) %>%
group_by(year_month) %>%
summarise(avg = mean(value),
cnt = n())
# Plot
ggplot(df2) +
geom_bar(aes(x=year_month, y = avg), stat = 'identity')
When I create the year_month variable, it naturally becomes a character variable instead of a date variable. I have also tried grouping by year(date), month(date) but then I can't figure out how to use two variables as the x-axis in ggplot. Perhaps this could be solved by flooring the dates to the first day of the month...?
You were really close. The missing pieces are floor_date() and scale_x_date():
library(tidyverse)
library(lubridate)
date <- sample(seq(as_date('2013-06-01'), as_date('2014-5-31'), by = "day"),
10000, replace = TRUE)
value <- rnorm(10000)
df <- tibble(date, value) %>%
group_by(month = floor_date(date, unit = "month")) %>%
summarize(avg = mean(value))
ggplot(df, aes(x = month, y = avg)) +
geom_bar(stat = "identity") +
scale_x_date(NULL, date_labels = "%b %y", breaks = "month")
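What floor_date(date, unit = "month") contributes is that every date is mapped to the first day of its month while staying a Date, so ggplot can still treat the x-axis as a time axis. A rough base-R equivalent on three made-up dates, just to show the effect:

```r
d     <- as.Date(c("2013-06-17", "2013-06-30", "2014-01-05"))
value <- c(1, 3, 10)

# First day of the month, still of class Date
month_start <- as.Date(format(d, "%Y-%m-01"))
print(month_start)

# Average per month, grouped by the floored date
avg <- tapply(value, format(month_start), mean)
print(avg)
```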
