I have a dataframe containing dates when a given event occurred. Some events go on for several days, and I want to summarise each event based on its start date and its total length (in days).
I want to go from this:
Date
2020-01-01
2020-01-02
2020-01-03
2020-01-15
2020-01-20
2020-01-21
To this:
StartDate   EventLength
2020-01-01  3
2020-01-15  1
2020-01-20  2
I've tried various approaches with aggregate, ave, seq_along and lag, but I haven't managed to get a count of event length that resets when the dates aren't sequential.
Code for the example data frame in case it's helpful:
Date <- c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-15", "2020-01-20", "2020-01-21")
df <- data.frame(Date)
df$Date <- as.Date(df$Date, origin = "1970-01-01")
You can split by cumsum(c(0, diff(df$Date) != 1)), then take the first date of each group and combine it with the group's length. This assumes the dates are sorted.
do.call(rbind, lapply(split(df$Date, cumsum(c(0, diff(df$Date) != 1))),
function(x) data.frame(StartDate=x[1], EventLength=length(x))))
# StartDate EventLength
#0 2020-01-01 3
#1 2020-01-15 1
#2 2020-01-20 2
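To make the grouping step less opaque, here is a minimal sketch that spells out the intermediate vectors on the example data. The names gaps, new_event and event_id are mine, not from the answer above, and the dates are assumed to be sorted in ascending order.

```r
# Example data from the question (must be sorted ascending).
dates <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                   "2020-01-15", "2020-01-20", "2020-01-21"))

# Gap in days between each date and the one before it.
gaps <- as.numeric(diff(dates))

# TRUE wherever a new event starts (the gap is not exactly one day).
# The first date never starts a *new* group, hence the leading FALSE.
new_event <- c(FALSE, gaps != 1)

# Running count of event starts: one id per event, beginning at 0.
event_id <- cumsum(new_event)

# First date of each event and number of rows per event.
result <- data.frame(
  StartDate   = dates[!duplicated(event_id)],
  EventLength = tabulate(event_id + 1L)
)
print(result)
```

The ids themselves carry no meaning; all that matters is that they change exactly where consecutive dates are more than one day apart.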
or another option using rle:
i <- cumsum(c(0, diff(df$Date) != 1))
data.frame(StartDate = df$Date[c(1, diff(i)) == 1], EventLength=rle(i)$lengths)
# StartDate EventLength
#1 2020-01-01 3
#2 2020-01-15 1
#3 2020-01-20 2
I propose a dplyr approach, which is incidentally very similar to @Rui's approach:
library(dplyr)
df %>% mutate(dummy = c(0, diff(Date))) %>%
group_by(grp = cumsum(dummy != 1)) %>%
summarise(Date = first(Date),
event_count = n(), .groups = 'drop')
# A tibble: 3 x 3
grp Date event_count
<int> <date> <int>
1 1 2020-01-01 3
2 2 2020-01-15 1
3 3 2020-01-20 2
Here is a base R solution with a cumsum trick followed by ave/table.
d <- c(0, diff(df$Date) != 1)
res <- ave(df$Date, cumsum(d), FUN = function(x) x[1])
res <- as.data.frame(table(res))
names(res) <- c("Date", "EventLength")
res
# Date EventLength
#1 2020-01-01 3
#2 2020-01-15 1
#3 2020-01-20 2
Hi I have this dataset:
Data_List <- "I611|I613|I614|I639"
df <-
read.table(textConnection("ID Code Date1 Date2
A I611 01/01/2021 03/01/2021
A L111 04/01/2021 09/01/2021
B L111 01/01/2021 03/01/2021
B Z538 08/01/2021 11/01/2021
C I613 09/08/2021 09/09/2021
C I639 10/09/2021 18/09/2021
C I639 19/11/2021 22/11/2021
D L111 01/01/2021 02/01/2021
D I639 03/01/2021 04/01/2021
D B111 11/01/2021 14/01/2021"), header=TRUE)
What I am looking to do is filter rows where 'Code' value is within Data_List and the value in the row immediately below it for 'Date1' column is within 30 days of the value in Date2 column of original row (the row that matches its 'Code' value in Data_List). If these conditions are matched, both rows should be kept (and any subsequent rows where these conditions are matched for each grouped ID).
So from the above dataset we would end up with the below:
ID  Code  Date1       Date2
A   I611  01/01/2021  03/01/2021
A   L111  04/01/2021  09/01/2021
C   I613  09/08/2021  09/09/2021
C   I639  10/09/2021  18/09/2021
D   I639  03/01/2021  04/01/2021
D   B111  11/01/2021  14/01/2021
So for ID A we have 2 retained records as first record is a match to Data_List for 'Code' value and below column Date1 is within 30 days of the first record's Date2 so both are retained.
For ID B we have no retained records as there are only two records and the first record is not within Data_List for 'Code' value.
For ID C we have 2 retained records, as the first record's 'Code' matches Data_List and the second record's Date1 is within 30 days of the first record's Date2. The third ID = 'C' record is not retained, as its Date1 value is not within 30 days of Date2 in the record above it.
For ID D we again have 2 matches but this is for the 2nd and 3rd rows which apply, thus demonstrating that sometimes the 'loop' does not start on row 1.
If anyone could assist with this filtering that would be very much appreciated, thank you.
I propose this. It matches the requirements as I understand them, though it does not match the expected output.
df <-
read.table(textConnection("ID Code Date1 Date2
A I611 01/01/2021 03/01/2021
A L111 04/01/2021 09/01/2021
B L111 01/01/2021 03/01/2021
B Z538 08/01/2021 11/01/2021
C I613 09/08/2021 09/09/2021
C I639 10/09/2021 18/09/2021
C I639 19/11/2021 22/11/2021
D L111 01/01/2021 02/01/2021
D I639 03/01/2021 04/01/2021
D B111 11/01/2021 14/01/2021"), header=TRUE)
Data_List <- "I611|I613|I614|I639"
library(data.table)
library(lubridate)
Data_List <- base::strsplit(Data_List, split = "|", fixed = TRUE)[[1]]
setDT(df)
df[, `:=`(Date1 = dmy(Date1), Date2 = dmy(Date2), keep = FALSE) ]
df[, next_date := shift(Date2, 1L, type="lead")]
df[, date_limit := (Date1 + duration(30, 'days')) ]
df[(next_date < date_limit) & (Code %in% Data_List) , keep := TRUE]
df[, previous_id := shift(ID, 1L, type="lag")]
df[, previous_keep := shift(keep, 1L, type="lag")]
df[keep | ((ID == previous_id) & previous_keep), .(ID, Code, Date1, Date2)]
Here's an approach using tidyverse and lubridate:
library(tidyverse)
library(lubridate)
We first transform Data_List into a character vector of codes and coerce df to a tibble with the date columns parsed as dates.
Data_List <- unlist(str_split("I611|I613|I614|I639", "\\|"))
df <- df %>% as_tibble() %>% mutate(across(starts_with("Date"), dmy))
Next, we group df by ID and filter accordingly. cs serves as an identifier for rows to be removed since they don't have preceding rows with Code matching an entry in Data_List in their group (cs is 0 for these rows). The second condition is for the 30-days window.
df %>% group_by(ID) %>%
mutate(cs = cumsum(Code %in% Data_List)) %>%
filter(
cs != 0,
(lead(Date1) - Date2) <= 30 | (Date1 - lag(Date2)) <= 30
) %>%
select(-cs) %>% ungroup()
Result:
# A tibble: 6 × 4
ID Code Date1 Date2
<chr> <chr> <date> <date>
1 A I611 2021-01-01 2021-01-03
2 A L111 2021-01-04 2021-01-09
3 C I613 2021-08-09 2021-09-09
4 C I639 2021-09-10 2021-09-18
5 D I639 2021-01-03 2021-01-04
6 D B111 2021-01-11 2021-01-14
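For readers who prefer not to load the tidyverse, the same two conditions can be sketched in base R. This is only an illustration of the logic above on the question's data; the helper names (codes, seen, near_next, near_prev, keep) are mine.

```r
df <- read.table(text = "ID Code Date1 Date2
A I611 01/01/2021 03/01/2021
A L111 04/01/2021 09/01/2021
B L111 01/01/2021 03/01/2021
B Z538 08/01/2021 11/01/2021
C I613 09/08/2021 09/09/2021
C I639 10/09/2021 18/09/2021
C I639 19/11/2021 22/11/2021
D L111 01/01/2021 02/01/2021
D I639 03/01/2021 04/01/2021
D B111 11/01/2021 14/01/2021", header = TRUE)

codes <- strsplit("I611|I613|I614|I639", "|", fixed = TRUE)[[1]]
df$Date1 <- as.Date(df$Date1, "%d/%m/%Y")
df$Date2 <- as.Date(df$Date2, "%d/%m/%Y")
n <- nrow(df)

# Condition 1: a listed code has appeared at or before this row within the ID.
seen <- ave(as.integer(df$Code %in% codes), df$ID, FUN = cumsum) > 0

# Condition 2: the next row (same ID) starts within 30 days of this row's Date2,
# or this row starts within 30 days of the previous row's Date2.
same_next <- c(df$ID[-1] == df$ID[-n], FALSE)
gap_next  <- c(as.numeric(df$Date1[-1] - df$Date2[-n]), NA)
near_next <- same_next & gap_next <= 30
near_prev <- c(FALSE, near_next[-n])

keep <- seen & (near_next | near_prev)
print(df[keep, ])
```

Like the grouped filter above, this keeps a row when either neighbour within the same ID is close enough, so it shares that answer's reading of the requirements.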
library(data.table)
setDT(df, key = "ID")
Data_List = unlist(strsplit(Data_List, "|", fixed = TRUE))
dvars = paste0('Date', 1:2)
df[, (dvars) := lapply(.SD, as.Date, "%d/%m/%Y"), .SDcols = dvars]
df[, keep := {
tmp = Code %chin% Data_List
replace(tmp, seq_along(Code) > which.max(tmp), NA)
},
by = ID]
df[, connected := (Date1 - shift(Date2, fill = NA)) <= 30, by = ID]
df[!sapply(!keep, isTRUE), .SD[any(keep)], by = ID
][, .SD[cummin((fcoalesce(keep, connected)))], by = ID
][, !c("keep", "connected")]
# Key: <ID>
# ID Code Date1 Date2
# <char> <char> <Date> <Date>
# 1: A I611 2021-01-01 2021-01-03
# 2: A I611 2021-01-01 2021-01-03
# 3: C I613 2021-08-09 2021-09-09
# 4: C I613 2021-08-09 2021-09-09
# 5: D I639 2021-01-03 2021-01-04
# 6: D I639 2021-01-03 2021-01-04
Data_List is already a valid regex pattern, so you can pass it straight to grepl() to find where Code matches.
cumany() from dplyr identifies the rows at or after the first match of Code within each group.
library(dplyr)
df %>%
mutate(across(contains('Date'), ~ as.Date(.x, '%d/%m/%Y'))) %>%
group_by(ID) %>%
filter(cumany(grepl(Data_List, Code)),
lead(Date1) - Date2 <= 30 | Date1 - lag(Date2) <= 30) %>%
ungroup()
# # A tibble: 6 × 4
# ID Code Date1 Date2
# <chr> <chr> <date> <date>
# 1 A I611 2021-01-01 2021-01-03
# 2 A L111 2021-01-04 2021-01-09
# 3 C I613 2021-08-09 2021-09-09
# 4 C I639 2021-09-10 2021-09-18
# 5 D I639 2021-01-03 2021-01-04
# 6 D B111 2021-01-11 2021-01-14
I have 2 tables
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
I would like to know the number of observation periods that occur on each day of df1, based on the start/stop dates in df2. E.g. on 1/1/2020, observations a and b were in progress, hence "2".
The expected output would be as follows:
I've tried using sums
df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end)>=df1$dates)
But that only sums up the entire column values
I've then tried to create a custom function for this:
df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))
But it returns an NA value.
I then tried to embed an "ifelse" within it, but got the same issue with NAs
apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))
Can anyone suggest what the issue is? Thanks!
edit: an interval join was suggested which is not what I'm trying to get - I think naming the observations with a numeric label was what caused confusion. I am trying to find out the TOTAL number of observations with periods that fall within the day, as compared to doing a 1:1 match.
Regards
Sing
Define the comparison in a function f and pass it through outer; rowSums of the resulting logical matrix then gives the count you're looking for.
f <- \(x, y) df1[x, 1] >= df2[y, 2] & df1[x, 1] <= df2[y, 3]
cbind(df1, number=rowSums(outer(1:nrow(df1), 1:nrow(df2), f)))
# dates number
# 1 2020-01-01 2
# 2 2020-01-02 2
# 3 2020-01-03 1
# 4 2020-01-04 0
# 5 2020-01-05 1
# 6 2020-01-06 1
# 7 2020-01-07 1
# 8 2020-01-08 1
# 9 2020-01-09 1
# 10 2020-01-10 2
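If outer() feels heavy, the same count can be sketched with a plain loop over row indices. Indexing df1$dates[i] keeps each element a Date, so every comparison is Date against Date. This is a minimal sketch on the question's example data, not a tuned solution.

```r
df1 <- data.frame(dates = seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "days"))
df2 <- data.frame(
  observations = c("a", "b", "c", "d"),
  start = as.Date(c("2019-12-30", "2020-01-01", "2020-01-05", "2020-01-10")),
  end   = as.Date(c("2020-01-03", "2020-01-02", "2020-01-12", "2020-01-14"))
)

# For each day, count the observation periods that contain it (both ends inclusive).
df1$number <- vapply(
  seq_len(nrow(df1)),
  function(i) sum(df2$start <= df1$dates[i] & df2$end >= df1$dates[i]),
  integer(1)
)
print(df1)
```

For large tables this does nrow(df1) * nrow(df2) comparisons, the same amount of work as the outer() version.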
Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. This approach is similar to "Left Join Subset of Column Based on Date Interval", but with two important differences: it uses summarise() instead of filter() to avoid 'losing' dates where "number" == 0, and it joins by 'character()' because the two datasets have no columns in common:
library(dplyr)
library(lubridate)
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"),
as.Date("2020-1-10"),
by = "days")))
df2 = data.frame("observations" = c("1", "2", "3", "4"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
df1 %>%
full_join(df2, by = character()) %>%
mutate(number = dates %within% interval(start, end)) %>%
group_by(dates) %>%
summarise(number = sum(number))
#> # A tibble: 10 × 2
#> dates number
#> <date> <dbl>
#> 1 2020-01-01 2
#> 2 2020-01-02 2
#> 3 2020-01-03 1
#> 4 2020-01-04 0
#> 5 2020-01-05 1
#> 6 2020-01-06 1
#> 7 2020-01-07 1
#> 8 2020-01-08 1
#> 9 2020-01-09 1
#> 10 2020-01-10 2
Created on 2022-06-27 by the reprex package (v2.0.1)
Does this approach work with your actual data?
One of my tables has dates in its last two columns:
dat<- data.frame(a = c(rep("x",3)),
date1=c(seq(as.Date("2018-01-01"), as.Date("2018-01-3"), 1)),
date2=c(seq(as.Date("2018-01-08"), as.Date("2018-01-10"), 1)))
a date1 date2
1 x 2018-01-01 2018-01-08
2 x 2018-01-02 2018-01-09
3 x 2018-01-03 2018-01-10
My other table records what kind of day each date is:
cal <- data.frame(dt = c(seq(as.Date("2018-01-01"), as.Date("2018-01-10"),1)),
day = c(rep("workday",5), rep("holiday",1), rep("weekend",4)))
How can I add a new column to table 1 (dat) that counts only the workdays falling within the range given by columns 2 and 3?
Example output with 4 columns. The last column is the number of workdays for the date range in previous two columns
a date1 date2 countdown
1 x 2018-01-01 2018-01-08 5
2 x 2018-01-02 2018-01-09 4
3 x 2018-01-03 2018-01-10 3
data.table solution
library( data.table )
#set data to data.table format
setDT(dat); setDT(cal)
setkey(dat, date1, date2 )
dat[dat,
N := { val = cal[ day == "workday" & dt >= i.date1 & dt <= i.date2 ]
list( nrow( val ) ) },
by = .EACHI ]
# a date1 date2 N
# 1: x 2018-01-01 2018-01-08 5
# 2: x 2018-01-02 2018-01-09 4
# 3: x 2018-01-03 2018-01-10 3
update
data.table::foverlaps() solution
library( data.table )
#set data to data.table format
setDT(dat); setDT(cal)
#create dummy date
cal[,dt2 := dt]
#set keys
setkey( dat, date1, date2 )
setkey( cal, dt, dt2 )
#overlap join
ans <- foverlaps( dat, cal )
#summarise
ans[, .( countdown = uniqueN( dt[day == "workday"] ) ), by = .(a, date1, date2)][]
# a date1 date2 countdown
# 1: x 2018-01-01 2018-01-08 5
# 2: x 2018-01-02 2018-01-09 4
# 3: x 2018-01-03 2018-01-10 3
A way using tidyverse functions:
1. Create a sequence of days between date1 and date2.
2. Get the data in long format.
3. Left join the result with the cal data frame.
4. Calculate the number of workdays for each row.
library(dplyr)
dat %>%
mutate(row = row_number(),
dt = purrr::map2(date1, date2, seq, by = '1 day')) %>%
tidyr::unnest(dt) %>%
left_join(cal, by = 'dt') %>%
group_by(row, a, date1, date2) %>%
summarise(countdown = sum(day == 'workday')) %>%
ungroup() %>%
select(-row)
# a date1 date2 countdown
# <chr> <date> <date> <int>
#1 x 2018-01-01 2018-01-08 5
#2 x 2018-01-02 2018-01-09 4
#3 x 2018-01-03 2018-01-10 3
A base R option
within(
dat,
countdown <- sapply(
1:nrow(dat),
function(k) sum(cal$day == "workday" & !is.na(cut(cal$dt, c(date1[k], date2[k]))))
)
)
giving
a date1 date2 countdown
1 x 2018-01-01 2018-01-08 5
2 x 2018-01-02 2018-01-09 4
3 x 2018-01-03 2018-01-10 3
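As far as I know, cut() on Dates defaults to right = FALSE, so the interval built above would not include date2 itself; that makes no difference here because every date2 falls on a weekend. The sketch below is a variant of my own that uses explicit comparisons so both endpoints are included.

```r
dat <- data.frame(
  a     = rep("x", 3),
  date1 = seq(as.Date("2018-01-01"), as.Date("2018-01-03"), by = 1),
  date2 = seq(as.Date("2018-01-08"), as.Date("2018-01-10"), by = 1)
)
cal <- data.frame(
  dt  = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1),
  day = c(rep("workday", 5), rep("holiday", 1), rep("weekend", 4))
)

# Count calendar rows that are workdays and lie in [date1, date2], ends inclusive.
dat$countdown <- vapply(
  seq_len(nrow(dat)),
  function(k) sum(cal$day == "workday" & cal$dt >= dat$date1[k] & cal$dt <= dat$date2[k]),
  integer(1)
)
print(dat)
```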
Additional solution
library(tidyverse)
# v1
dat %>%
rowwise() %>%
mutate(int_date = list(seq(date1, date2, "1 day"))) %>%
unnest(int_date) %>%
left_join(cal, by = c("int_date" = "dt")) %>%
filter(day == "workday") %>%
group_by(a, date1, date2) %>%
count
# v2
dat %>%
rowwise() %>%
mutate(int_date = list(seq(date1, date2, "1 day")),
out = sum(unlist(int_date) %in% cal$dt[cal$day == "workday"])) %>%
select(-int_date)
# v3 (using @Ronak Shah's hint with a `map`)
dat %>%
mutate(int_date = map2(date1, date2, seq, "1 day"),
out = map_dbl(int_date, ~ sum(.x %in% cal$dt[cal$day == "workday"]))) %>%
select(-int_date)
# A tibble: 3 x 4
# Rowwise:
a date1 date2 out
<chr> <date> <date> <int>
1 x 2018-01-01 2018-01-08 5
2 x 2018-01-02 2018-01-09 4
3 x 2018-01-03 2018-01-10 3
I have a dataframe like this:
source_data <-
data.frame(
id = c(seq(1,3)),
start = c(as.Date("2020-04-04"), as.Date("2020-04-02"), as.Date("2020-04-03")),
end = c(as.Date("2020-04-08"), as.Date("2020-04-05"), as.Date("2020-04-05"))
)
I want to create a date sequence for each id, i.e. one row for each day between the start and end dates, and put it in another dataframe. The result should look like this:
result <-
data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
date = c(
as.Date("2020-04-04"),
as.Date("2020-04-05"),
as.Date("2020-04-06"),
as.Date("2020-04-07"),
as.Date("2020-04-08"),
as.Date("2020-04-02"),
as.Date("2020-04-03"),
as.Date("2020-04-04"),
as.Date("2020-04-05"),
as.Date("2020-04-03"),
as.Date("2020-04-04"),
as.Date("2020-04-05")
)
)
I started with this date sequence, but how to join my source_data dataframe there?
solution <-
data.frame(
date = seq(min(source_data$start), max(source_data$end), by = 1)
)
We can use map2 to create the sequence between each corresponding 'start', 'end' dates and then unnest the list column
library(dplyr)
library(purrr)
library(tidyr)
source_data %>%
transmute(id, date = map2(start, end, seq, by = '1 day')) %>%
unnest(c(date))
# A tibble: 12 x 2
# id date
# <int> <date>
# 1 1 2020-04-04
# 2 1 2020-04-05
# 3 1 2020-04-06
# 4 1 2020-04-07
# 5 1 2020-04-08
# 6 2 2020-04-02
# 7 2 2020-04-03
# 8 2 2020-04-04
# 9 2 2020-04-05
#10 3 2020-04-03
#11 3 2020-04-04
#12 3 2020-04-05
Or using data.table
library(data.table)
setDT(source_data)[, .(date = seq(start, end, by = '1 day')), by = id]
Additional option with base R
lst1 <- Map(seq, source_data$start, source_data$end, MoreArgs = list(by = '1 day'))
data.frame(id = rep(source_data$id, lengths(lst1)), date = do.call(c, lst1))
Another base R solution
result <- do.call(rbind,
c(make.row.names = FALSE,
lapply(split(source_data,source_data$id),
function(v) with(v,data.frame(id = id, date = seq(start,end,by = 1))))))
which yields
> result
id date
1 1 2020-04-04
2 1 2020-04-05
3 1 2020-04-06
4 1 2020-04-07
5 1 2020-04-08
6 2 2020-04-02
7 2 2020-04-03
8 2 2020-04-04
9 2 2020-04-05
10 3 2020-04-03
11 3 2020-04-04
12 3 2020-04-05
Additional option
library(dplyr)
library(tidyr)
source_data %>%
rowwise() %>%
mutate(out = list(seq.Date(start, end, "day"))) %>%
unnest(out) %>%
select(-c(start, end))
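A fully vectorised base R variant is also possible. It swaps the per-row seq() calls for rep() and sequence(); this is a different technique from the answers above, sketched here on the question's data and assuming every end is on or after its start.

```r
source_data <- data.frame(
  id    = 1:3,
  start = as.Date(c("2020-04-04", "2020-04-02", "2020-04-03")),
  end   = as.Date(c("2020-04-08", "2020-04-05", "2020-04-05"))
)

# Number of days covered by each id, both ends included.
n <- as.integer(source_data$end - source_data$start) + 1L

# Repeat each start date n times and add the offsets 0, 1, ..., n - 1.
result <- data.frame(
  id   = rep(source_data$id, n),
  date = rep(source_data$start, n) + sequence(n) - 1L
)
print(result)
```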
Given a table
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017
I'm trying to split by Calendar month as the following table
id start end
1 22/03/2016 31/03/2016
1 01/04/2016 30/04/2016
1 01/05/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 30/09/2017
3 01/10/2017 31/10/2017
3 01/11/2017 30/11/2017
3 01/12/2017 25/12/2017
I'm trying to modify a code extract from "how to split rows of a dataframe in multiple rows based on start date and end date?", but I haven't been able to adapt it correctly. The problem mostly shows up in months with 30 days. It may be an easy fix, but I'm not yet familiar with regular expressions.
#sample data
df <- data.frame("starting_date" = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
"end_date" = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
col3=c('1','2', '3'))
df1 <- df[,1:2] %>%
rowwise() %>%
do(rbind(data.frame(matrix(as.character(c(
.$starting_date,
seq(.$starting_date, .$end_date, by=1)[grep("\\d{4}-\\d{2}-31|\\d{4}-\\d{2}-01", seq(.$starting_date, .$end_date, by=1))],
.$end_date)), ncol=2, byrow=T))
)
) %>%
data.frame() %>%
`colnames<-`(c("starting_date", "end_date")) %>%
mutate(starting_date= as.Date(starting_date, format= "%Y-%m-%d"),
end_date= as.Date(end_date, format= "%Y-%m-%d"))
#add temporary columns to the original and expanded date column dataframes
df$row_idx <- seq(1:nrow(df))
df$temp_col <- (year(df$end_date) - year(df$starting_date)) +1
df1 <- cbind(df1,row_idx = rep(df$row_idx,df$temp_col))
#join both dataframes to get the final result
final_df <- left_join(df1,df[,3:(ncol(df)-1)],by="row_idx") %>%
select(-row_idx)
final_df
If anyone knows how to modify the code or a better way to do it I will be very grateful.
We assume there is an error in the sample output in the question, since its third row spans parts of two months and so should be split into two rows.
Define Seq, which takes a single start Date and a single end Date and produces a data.frame with start and end columns, then run it on each id using group_by:
library(dplyr)
library(zoo)
Seq <- function(start, end) {
ym <- seq(as.yearmon(start), as.yearmon(end), 1/12)
starts <- pmax(start, as.Date(ym, frac = 0))
ends <- pmin(end, as.Date(ym, frac = 1))
unique(data.frame(start = starts, end = ends))
}
fmt <- "%d/%m/%Y"
DF %>%
mutate(start = as.Date(start, fmt), end = as.Date(end, fmt)) %>%
group_by(id) %>%
do(Seq(.$start, .$end)) %>%
ungroup
giving:
# A tibble: 9 x 3
id start end
<int> <date> <date>
1 1 2016-03-22 2016-03-31
2 1 2016-04-01 2016-04-30
3 1 2016-05-01 2016-05-31
4 1 2016-06-01 2016-06-05
5 2 2016-08-17 2016-08-29
6 3 2017-09-22 2017-09-30
7 3 2017-10-01 2017-10-31
8 3 2017-11-01 2017-11-30
9 3 2017-12-01 2017-12-25
Note
The input DF in reproducible form:
Lines <- "
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017"
DF <- read.table(text = Lines, header = TRUE)
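For comparison, the same clamping idea can be sketched without zoo. The helper split_by_month below is hypothetical (my own name) and assumes it receives one start Date and one end Date with start <= end.

```r
split_by_month <- function(start, end) {
  # First day of every calendar month touched by [start, end].
  first_of_start <- as.Date(format(start, "%Y-%m-01"))
  month_starts <- seq(first_of_start, end, by = "month")
  # Last day of a month = first day of the following month minus one day.
  next_starts <- seq(first_of_start, by = "month",
                     length.out = length(month_starts) + 1)[-1]
  month_ends <- next_starts - 1
  # Clamp the first and last pieces to the original range.
  data.frame(start = pmax(month_starts, start), end = pmin(month_ends, end))
}

DF <- data.frame(
  id    = 1:3,
  start = as.Date(c("22/03/2016", "17/08/2016", "22/09/2017"), "%d/%m/%Y"),
  end   = as.Date(c("05/06/2016", "29/08/2016", "25/12/2017"), "%d/%m/%Y")
)

pieces <- lapply(seq_len(nrow(DF)), function(i) {
  cbind(id = DF$id[i], split_by_month(DF$start[i], DF$end[i]))
})
result <- do.call(rbind, pieces)
print(result)
```

Anchoring the monthly sequence on the first of the month is what guarantees one piece per calendar month, whatever the day of the start date.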
So there's probably a more elegant way to accomplish this, and I feel like I've seen similar questions but could not find a duplicate quickly, so here goes...
SETUP
library(tidyverse)
library(lubridate)
df <- data.frame(
id = c('1', '2', '3'),
starting_date = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
end_date = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
stringsAsFactors = FALSE
)
df
#> id starting_date end_date
#> 1 1 2016-03-22 2016-06-05
#> 2 2 2016-08-17 2016-08-29
#> 3 3 2017-09-12 2017-12-25
SOLUTION
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest(date_seq) %>%
mutate(row = row_number()) %>%
mutate(
new_end_date = if_else(row == max(row), end_date, date_seq),
new_start_date = if_else(row == min(row), starting_date, floor_date(new_end_date, "month"))
) %>%
select(
id, new_start_date, new_end_date
)
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id new_start_date new_end_date
#> <chr> <date> <date>
#> 1 1 2016-03-22 2016-03-31
#> 2 1 2016-04-01 2016-04-30
#> 3 1 2016-06-01 2016-06-05
#> 4 2 2016-08-17 2016-08-29
#> 5 3 2017-09-12 2017-09-30
#> 6 3 2017-10-01 2017-10-31
#> 7 3 2017-11-01 2017-11-30
#> 8 3 2017-12-01 2017-12-25
EXPLANATION
Much of what's going on here takes place in the first mutate call which creates date_seq. To understand it, consider the following:
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month")
# [1] "2016-03-22" "2016-04-22" "2016-05-22"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month")
# [1] "2016-04-01" "2016-05-01" "2016-06-01"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month") - 1
# [1] "2016-03-31" "2016-04-30" "2016-05-31"
So basically, this creates a sequence of "end-of-month" dates between the original start and end dates. Putting it in a list-column lets us organize by id so that we unnest appropriately. Check out the output after the unnest():
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest(date_seq)
From there I hope things are relatively straightforward. The row_number probably could have been replaced with something fancier like a first/last, but I thought this might be easier to follow.
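One thing worth checking in this approach: stepping by month from the start date yields one date per month elapsed, which can be fewer than the number of calendar months the range touches. That may be why the result shown earlier has no row for May 2016. A small base R sketch (the names stepped and anchored are mine) makes the difference visible:

```r
start <- as.Date("2016-03-22")
end   <- as.Date("2016-06-05")

# Stepping from the start date: 22 Mar, 22 Apr, 22 May (22 Jun is past the end).
stepped <- seq(start, end, by = "month")

# Stepping from the first of the start month: 1 Mar, 1 Apr, 1 May, 1 Jun.
anchored <- seq(as.Date(format(start, "%Y-%m-01")), end, by = "month")

print(length(stepped))
print(length(anchored))
```

Whenever the end date's day of month is earlier than the start date's, the first sequence comes up one month short, so anchoring on the first of the month is the safer basis for month-by-month splits.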