Plotting single-variable duration time series data with ggplot2 - r

I have a list of companies, each with a start and end date for an event. I want to plot a figure with the date on the x-axis and the count of companies currently undergoing the event on the y-axis. The only way I can think of doing this at the moment is generating a column for every day, coding it 1/0 for whether or not that day falls between the start and end date for each company, and then reshaping. Is there a more efficient way to produce this?
Here's some example data:
set.seed(123)
df <- data.frame(id = sample(100:500, 100, replace = F))
df$start <- sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100)
df$end <- df$start + sample(1:50, 100, replace = TRUE)

Here's another option, though I doubt it's much more efficient than what you're already doing. It also generates all of the days and then identifies whether or not each observation is "active" on a given day.
library(tibble)

outdf <- tibble(
  date = seq(min(df$start), max(df$end), by = 1),
  num = rowSums(outer(date,
                      1:nrow(df),
                      function(x, y) x > df$start[y] & x < df$end[y]))
)
outdf
# # A tibble: 404 x 2
# date num
# <date> <dbl>
# 1 2020-01-05 0
# 2 2020-01-06 1
# 3 2020-01-07 1
# 4 2020-01-08 1
# 5 2020-01-09 1
# 6 2020-01-10 1
# 7 2020-01-11 2
# 8 2020-01-12 2
# 9 2020-01-13 2
# 10 2020-01-14 2
# # … with 394 more rows
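Either way, once you have a date/count data frame like outdf, the ggplot2 side is straightforward. A minimal sketch (assuming ggplot2 is installed):

library(ggplot2)

# one value per day: the number of companies whose event spans that date
ggplot(outdf, aes(x = date, y = num)) +
  geom_line() +
  labs(x = "Date", y = "Companies undergoing event")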

Related

Create events in weekly timeseries

I want to create a table with two columns. The first one represents the working weeks, named time_axis in my example below.
The second column is also a sequence of Dates, representing particular events in a year, called bank_holidays. Each of those dates gets a value of one to signal its presence.
What I need is a table where the first column, time_axis, remains unchanged and the second column is a vector of ones and zeros: zeros for weeks that contain no date from bank_holidays, and ones for weeks that do contain one of those dates. Every week starts on the date given in time_axis.
library(xts)
time_axis <- seq(as.Date("2017-01-21"), length = 10, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18"))
bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
The desired outcome:
df <- data.frame(time_axis = c("2017-01-20", "2017-01-27", "2017-02-03", "2017-02-10", "2017-02-17",
                               "2017-02-24", "2017-03-03", "2017-03-10", "2017-03-17", "2017-03-24"),
                 bank_holidays = c(0, 1, 0, 1, 1, 0, 0, 0, 1, 0))
df
Any idea on how to do this?
Thank you.
Something to bear in mind that is not obvious from the data: the weeks in time_axis start on Saturday. Therefore, 2017-01-21 is not the end of the 3rd week (as it would be if the weeks started on Monday) but already belongs to the 4th week.
Using strftime, "%V" gives the ISO 8601 week numbers, which you can match on.
res <- data.frame(time_axis,
                  bank_holidays = +(strftime(time_axis, "%V") %in%
                                      strftime(index(bank_holidays), "%V")))
res
# time_axis bank_holidays
# 1 2017-01-20 0
# 2 2017-01-27 0
# 3 2017-02-03 1
# 4 2017-02-10 0
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 1
# 10 2017-03-24 0
Edit
To use the custom working weeks whose starts are defined in the time_axis variable, the simplest thing is probably to compare whether the bank_holidays are greater than or equal to those week starts. Counting the TRUEs with colSums then gives, for each holiday, the index of the week it falls into, which is set to 1.
res <- data.frame(time_axis, bank_holidays = 0)  ## initialize column with 0
res$bank_holidays[colSums(sapply(index(bank_holidays), `>=`, time_axis))] <- 1  ## set matching weeks to 1
res
# time_axis bank_holidays
# 1 2017-01-21 0
# 2 2017-01-28 1
# 3 2017-02-04 0
# 4 2017-02-11 1
# 5 2017-02-18 1
# 6 2017-02-25 0
# 7 2017-03-04 0
# 8 2017-03-11 0
# 9 2017-03-18 1
# 10 2017-03-25 0
This works for me; it's a little longer than the previous answer, but you can see what's happening at each step.
I need the start and end dates so I can fill in the rest of the dates in between, which is why I'm selecting 11 weeks instead of the 10 in your example. I'm also matching against the plain date vector for bank holidays instead of the xts object.
library(xts)
library(tidyverse)
time_axis <- seq(as.Date("2017-01-20"), length = 11, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18")) # I'll work with the vector
#bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
df <- tibble() # start with an empty tibble
for (i in 1:(length(time_axis) - 1)) {                                    # run a loop over the 10 weeks
  df <- seq(as.Date(time_axis[i]), as.Date(time_axis[i + 1]), "days") %>% # fill in the dates between the two week starts
    enframe(name = NULL) %>%                                              # convert to a data frame (tibble)
    mutate(week = as.Date(time_axis[i])) %>%                              # add a column indicating which week each date belongs to
    bind_rows(df)                                                         # bind the rows to the previous data frame
}
Now I take df and check whether each generated date matches one of the given holidays: 1 if present, 0 otherwise.
Then I group_by the week column (our given weeks above) and summarise to get the count per week.
df %>%
  mutate(bank_holidays_presence = if_else(value %in% bank_holidays, 1, 0)) %>%
  group_by(week) %>%
  summarise(sum = sum(bank_holidays_presence))
# A tibble: 10 x 2
# week sum
# <date> <dbl>
# 1 2017-01-20 0
# 2 2017-01-27 1
# 3 2017-02-03 0
# 4 2017-02-10 1
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 0
#10 2017-03-24 0
The advantage of this method is that even if you have more than one holiday in a particular week, it gives you the count rather than mere presence or absence.

How to generate monthly time series data with NA values for running examples?

I apologize in advance for the simplicity of this question.
How can I generate a monthly time series data set using something like set.seed()? I have a question about the results from two packages but need to create a sample data set to show as an example. My data set needs to have some NA values in it.
Regards,
Simon
Here's a random list of 1000 dates +- 5 years from today with some missing data using the simstudy package (please provide sample data and expected output for a more specific answer):
library(simstudy)
library(dplyr)
library(lubridate)
set.seed(1724)
# define data
def <- defData(varname = "tmp", dist = "uniform", formula = "0;1") # simstudy seems to crash when adding missing data with only 1 column
def <- defData(def, varname = "date", dist = "uniform", formula = "-5;5") # +- 5 years
df_full <- genData(1000, def)
##### missing data ----
defM <- defMiss(varname = "date", formula = 0.1, logit.link = F)
df_missing <- genMiss(df_full, defM, idvars = "id")
# Create data with missing values
df <- genObs(df_full, df_missing, idvars = "id")
df %>%
  as_tibble() %>%
  select(-tmp) %>%
  mutate(date = ymd(floor_date(as.POSIXct(Sys.Date()) + date * 365 * 24 * 60 * 60, unit = "day")), # +- 5 years from today
         month = format(date, "%Y-%m"))
# A tibble: 1,000 x 3
id date month
<int> <date> <chr>
1 1 NA NA
2 2 2021-09-12 2021-09
3 3 2023-11-08 2023-11
4 4 2015-03-02 2015-03
5 5 2021-08-12 2021-08
6 6 2021-10-20 2021-10
7 7 2017-05-17 2017-05
8 8 2019-04-12 2019-04
9 9 NA NA
10 10 NA NA
# ... with 990 more rows
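If you only need a small, self-contained monthly series without extra packages, a minimal base-R sketch along these lines would also do (the column names and NA count here are just placeholders):

set.seed(123)
# 36 months of data starting in January 2018
dates <- seq(as.Date("2018-01-01"), by = "month", length.out = 36)
values <- rnorm(36)
# blank out a handful of observations at random to get NA values
values[sample(36, size = 4)] <- NA
monthly_df <- data.frame(month = dates, value = values)
head(monthly_df)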

R: cumulative total at a daily level

I have the following dataset:
I want to measure the cumulative total at a daily level, so the result would include a row for every calendar day.
I can use dplyr's cumsum function, but the count for "missing days" won't show up. For example, the date 1/3/18 does not exist in the original data frame. I want this missing date to appear in the resulting data frame, with its cumulative sum equal to that of the last known date, i.e. 1/2/18, where the sum is 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
          Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
dplyr
library(dplyr)
x %>%
  right_join(daterange) %>%
  mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"), format = "%m/%d/%y"),
                Count = c(2, 3, 5, 1, 6))
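As a variation on the dplyr approach, tidyr::complete() can fill in the missing dates within the same pipeline; a minimal sketch (assuming tidyr >= 1.0 is installed):

library(dplyr)
library(tidyr)

x %>%
  complete(Date = seq(min(Date), max(Date), by = "1 day"),  # add the missing days
           fill = list(Count = 0)) %>%                      # give them a Count of 0
  mutate(Count = cumsum(Count))                             # running total carries over the gaps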

How to perform an R equivalent of Excel's COUNTIFS function across multiple variables in a data frame

I'm working on a project and trying to create a plot of the number of open cases we have on any given date. An example of the data table is as follows.
case_files <- tibble(case_id = 1:10,
date_opened = c("2017-1-1",
"2017-1-1",
"2017-3-4",
"2017-4-4",
"2017-5-5",
"2017-5-6",
"2017-6-7",
"2017-6-6",
"2017-7-8",
"2017-7-8"),
date_closed = c("2017-4-1",
"2017-4-1",
"2017-5-4",
"2017-7-4",
"2017-7-5",
"2017-7-6",
"2017-8-7",
"2017-8-6",
"2017-9-8",
"2017-10-8"))
case_files$date_opened <- as.Date(case_files$date_opened)
case_files$date_closed <- as.Date(case_files$date_closed)
What I'm trying to do is create another data frame with the dates from the past year and the number of cases that are considered "Open" on each date. I would then be able to plot from this data frame.
daily_open_cases <- tibble(n = 0:365,
date = today() - n,
qty_open = .....)
Cases are considered Open on dates on or after the date_opened AND on or before the date_closed.
I've considered doing conditional subsetting and then using nrow(), but can't seem to get it to work. There must be an easier way to do this. I can do this easily in Excel using the COUNTIFS function.
Thanks!
The Excel function basically does a sum of logical 1's and 0's, which is easy to do in R with the sum function. I'd build a structure that holds all the dates and then march through those dates, summing up the logical vectors produced by the two inequalities below across all rows of the case_files structure. The &-function in R is vectorized:
daily_open_cases <- tibble(dt = as.Date("2017-01-01") + 0:365,
                           qty_open = NA)
daily_open_cases$qty_open <- sapply(daily_open_cases$dt,
                                    function(d) sum(case_files$date_opened <= d & case_files$date_closed >= d))
> head( daily_open_cases)
# A tibble: 6 x 2
dt qty_open
<date> <int>
1 2017-01-01 2
2 2017-01-02 2
3 2017-01-03 2
4 2017-01-04 2
5 2017-01-05 2
6 2017-01-06 2
>
Here's a 'tidyverse' solution; the approach is the same as 42's, I just used dplyr's group_by and mutate instead of base R's sapply.
library(tidyverse)
library(magrittr)
days_files <- tibble(
date = as.Date("2017-01-01")+0:365,
no_open = NA_integer_
)
days_files %<>%
group_by(date) %>%
mutate(
no_open = sum(case_files$date_opened <= date & case_files$date_closed >= date)
)
# A tibble: 366 x 2
# Groups: date [366]
date no_open
<date> <int>
1 2017-01-01 2
2 2017-01-02 2
3 2017-01-03 2
4 2017-01-04 2
5 2017-01-05 2
6 2017-01-06 2
7 2017-01-07 2
8 2017-01-08 2
9 2017-01-09 2
10 2017-01-10 2
# ... with 356 more rows
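To get from either of these data frames to the plot the question asks for, a minimal ggplot2 sketch (using the days_files tibble from above; ggplot2 is loaded along with the tidyverse):

ggplot(days_files, aes(x = date, y = no_open)) +
  geom_line() +
  labs(x = "Date", y = "Open cases")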

Apply function to get months from today's date

I'm trying to create a column in a dataset that tells me the (approximate) number of months a customer has been with the company.
This is my current attempt:
dat <- data.frame(ID = c(1:4), start.date = as.Date(c('2015-04-09', '2014-03-24', '2016-07-01', '2011-02-02')))
dat$months.customer <- apply(dat[2], 1, function(x) (as.numeric(Sys.Date())- as.numeric(x))/30)
It's returning all NAs
You can use difftime:
dat$months.customer <-
  as.numeric(floor(difftime(Sys.Date(), dat$start.date, units = "days") / 30))
# ID start.date months.customer
# 1 1 2015-04-09 16
# 2 2 2014-03-24 29
# 3 3 2016-07-01 1
# 4 4 2011-02-02 67
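If you want whole calendar months rather than 30-day blocks, a lubridate-based sketch (assuming the lubridate package is available) is one alternative:

library(lubridate)

# number of complete calendar months between the start date and today
dat$months.customer <- interval(dat$start.date, Sys.Date()) %/% months(1)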
