R - dataframe with sequence from another dataframe

I have a dataframe like this:
source_data <-
data.frame(
id = c(seq(1,3)),
start = c(as.Date("2020-04-04"), as.Date("2020-04-02"), as.Date("2020-04-03")),
end = c(as.Date("2020-04-08"), as.Date("2020-04-05"), as.Date("2020-04-05"))
)
I want to create a date sequence for each id, i.e. create each day between the start and end dates, and put it in another dataframe. So the result should look like this:
result <-
data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
date = c(
as.Date("2020-04-04"),
as.Date("2020-04-05"),
as.Date("2020-04-06"),
as.Date("2020-04-07"),
as.Date("2020-04-08"),
as.Date("2020-04-02"),
as.Date("2020-04-03"),
as.Date("2020-04-04"),
as.Date("2020-04-05"),
as.Date("2020-04-03"),
as.Date("2020-04-04"),
as.Date("2020-04-05")
)
)
I started with this date sequence, but how do I join my source_data dataframe to it?
solution <-
data.frame(
date = seq(min(source_data$start), max(source_data$end), by = 1)
)

We can use map2 to create the sequence between each corresponding 'start' and 'end' date and then unnest the list column.
library(dplyr)
library(purrr)
library(tidyr)
source_data %>%
transmute(id, date = map2(start, end, seq, by = '1 day')) %>%
unnest(c(date))
# A tibble: 12 x 2
# id date
# <int> <date>
# 1 1 2020-04-04
# 2 1 2020-04-05
# 3 1 2020-04-06
# 4 1 2020-04-07
# 5 1 2020-04-08
# 6 2 2020-04-02
# 7 2 2020-04-03
# 8 2 2020-04-04
# 9 2 2020-04-05
#10 3 2020-04-03
#11 3 2020-04-04
#12 3 2020-04-05
Or using data.table
library(data.table)
setDT(source_data)[, .(date = seq(start, end, by = '1 day')), by = id]
Additional option with base R
lst1 <- Map(seq, source_data$start, source_data$end, MoreArgs = list(by = '1 day'))
data.frame(id = rep(source_data$id, lengths(lst1)), date = do.call(c, lst1))
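A side note on the do.call(c, lst1) step: unlist(lst1) would drop the Date class and return bare day numbers, while do.call(c, lst1) keeps proper dates. A quick hedged check:
unlist(lst1)[1:3]
#> [1] 18356 18357 18358
do.call(c, lst1)[1:3]
#> [1] "2020-04-04" "2020-04-05" "2020-04-06"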

Another base R solution
result <- do.call(rbind,
                  c(make.row.names = FALSE,
                    lapply(split(source_data, source_data$id),
                           function(v) with(v, data.frame(id = id, date = seq(start, end, by = 1))))))
which yields
> result
id date
1 1 2020-04-04
2 1 2020-04-05
3 1 2020-04-06
4 1 2020-04-07
5 1 2020-04-08
6 2 2020-04-02
7 2 2020-04-03
8 2 2020-04-04
9 2 2020-04-05
10 3 2020-04-03
11 3 2020-04-04
12 3 2020-04-05

An additional option (note that unnest() needs tidyr as well):
library(dplyr)
library(tidyr)
source_data %>%
rowwise() %>%
mutate(out = list(seq.Date(start, end, "day"))) %>%
unnest(out) %>%
select(-c(start, end))
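For completeness, the join route the question started with also works: build the full date sequence once, cross-join it to source_data, and keep only the dates inside each id's range. A minimal base R sketch:
# merge() with no common columns returns the Cartesian product of ids and dates
joined <- merge(source_data, solution)
# keep only the dates between each id's start and end, dropping the range columns
result <- subset(joined, date >= start & date <= end, select = c(id, date))
result <- result[order(result$id, result$date), ]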

Related

remove rows with overlapped dates and keep longest time interval in R using dplyr or sqldf

I need to remove rows with overlapped dates and keep the x value which is maximum among the overlapped dates. Here is a data frame:
data.frame(time_left = c("2011-08-05",
"2011-07-25",
"2017-08-20",
"2017-08-20",
"2017-10-09",
"2019-06-01"),
time_right= c("2011-09-14",
"2011-09-01",
"2017-09-12",
"2017-09-26",
"2017-10-15",
"2019-11-05"),
x = c(114,20,10,1,5,100) ) -> df
so my input is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
3 2017-08-20 2017-09-12 10
4 2017-08-20 2017-09-26 1
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
and my desired output is:
time_left time_right x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01 20
4 2017-08-20 2017-09-26 10
5 2017-10-09 2017-10-15 5
6 2019-06-01 2019-11-05 100
I appreciate any help.
@Maël brought this issue to my attention over on the ivs issue page https://github.com/DavisVaughan/ivs/issues/20.
I think this can be very elegantly and efficiently solved with ivs, but it is a bit hard to come up with the solution, so I'll probably add a helper to do this more easily.
This solution works with "recursive" containers too, i.e. where range A contains range B, but then range C also contains range A, so you really only want to list range C. I've described this in more detail with examples here https://github.com/DavisVaughan/ivs/issues/20#issuecomment-1234479783.
library(ivs)
library(dplyr)
library(vctrs)
df <- tibble(
time_left = as.Date(c(
"2011-08-05", "2011-07-25", "2017-08-20",
"2017-08-20", "2017-10-09", "2019-06-01"
)),
time_right = as.Date(c(
"2011-09-14", "2011-09-01", "2017-09-12",
"2017-09-26", "2017-10-15", "2019-11-05"
)),
x = c(114, 20, 10, 1, 5, 100)
)
df <- df %>%
mutate(range = iv(time_left, time_right), .keep = "unused")
df
#> # A tibble: 6 × 2
#> x range
#> <dbl> <iv<date>>
#> 1 114 [2011-08-05, 2011-09-14)
#> 2 20 [2011-07-25, 2011-09-01)
#> 3 10 [2017-08-20, 2017-09-12)
#> 4 1 [2017-08-20, 2017-09-26)
#> 5 5 [2017-10-09, 2017-10-15)
#> 6 100 [2019-06-01, 2019-11-05)
iv_locate_max_containment <- function(x) {
  # Find all locations where the range "contains" any other range
  # (including itself)
  locs <- iv_locate_overlaps(x, x, type = "contains")

  # Find the "top" ranges, i.e. the containers that aren't contained
  # by any other containers
  top <- !vec_duplicate_detect(locs$haystack)
  top <- vec_slice(locs$haystack, top)
  top <- vec_in(locs$needles, top)

  locs <- vec_slice(locs, top)
  locs
}
# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#> needles haystack
#> 1 1 1
#> 2 2 2
#> 3 4 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
iv_align(df$range, df$x, locations = locs) %>%
rename(range = needles) %>%
group_by(range) %>%
summarise(x = max(haystack))
#> # A tibble: 5 × 2
#> range x
#> <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01) 20
#> 2 [2011-08-05, 2011-09-14) 114
#> 3 [2017-08-20, 2017-09-26) 10
#> 4 [2017-10-09, 2017-10-15) 5
#> 5 [2019-06-01, 2019-11-05) 100
Created on 2022-09-01 with reprex v2.0.2
This may sound a little verbose; however, this could also be a solution:
First we identify those observations that are potentially overlapped.
Then we group the similar ones.
In each group we choose the minimum time_left and maximum time_right and x.
library(tidyverse)
df %>%
mutate(across(starts_with('time'), ymd),
intv = interval(time_left, time_right),
id = row_number()) %>%
mutate(id2 = map2(intv, id, ~ if (any(.x %within% intv[intv != .x])) {
id[which(.x %within% intv[intv != .x]) + 1]
} else {
.y
})) %>%
group_by(id2) %>%
summarise(time_left = min(time_left),
across(c(time_right, x), max)) %>%
select(!(id2))
# A tibble: 4 × 3
time_left time_right x
<date> <date> <dbl>
1 2011-08-05 2011-09-14 114
2 2017-08-20 2017-09-26 10
3 2017-10-09 2017-10-15 5
4 2019-06-01 2019-11-05 100
I combined Anoushiravan's solution with this
How do I determine in R if a date interval overlaps another date interval for the same individual in a data frame?
and I think it is working now.
df %>%
mutate(id = row_number(), days = as.numeric(as.Date(time_right) - as.Date(time_left)) ) %>%
mutate(Int = interval(time_left, time_right),
within = map(seq_along(Int), function(x){
y = setdiff(seq_along(Int), x)
if(any(id[which((Int[x] %within% Int[y]))+1])){
return(id[days == max(days[which((Int[x] %within% Int[y]))+1])])
}else{ return(0)}
})
) %>%
mutate(within = ifelse(within > 0 , within, id)) %>%
group_by(within) %>%
summarise(time_left = min(time_left), time_right = max(time_right), x = max(x)) %>%
select(!within)
But it still has some bugs. For the following df, this code will not work unless I change the order of the records.
df = data.frame(time_left = c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"),
time_right = c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"),
x = c(10,100,200,20))
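For what it's worth, the ivs-based helper from the accepted answer above does not depend on row order, so one hedged option is to point it at this df as well. A sketch, assuming iv_locate_max_containment() and the ivs/dplyr/vctrs setup shown earlier are already loaded:
library(ivs)
library(dplyr)
# rebuild the problematic df with proper Date columns and an iv range column
df2 <- tibble(
  time_left  = as.Date(c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26")),
  time_right = as.Date(c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31")),
  x = c(10, 100, 200, 20)
) %>%
  mutate(range = iv(time_left, time_right), .keep = "unused")
# locate the maximal containing ranges (helper defined in the accepted answer)
locs <- iv_locate_max_containment(df2$range)
# align and keep the maximum x per containing range; this should collapse all
# four rows into the single containing interval regardless of the input order
iv_align(df2$range, df2$x, locations = locs) %>%
  rename(range = needles) %>%
  group_by(range) %>%
  summarise(x = max(haystack))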

Multiple string replacement, decimals to quarters

I want to replace .00 with -Q1, .25 with -Q2, .50 with -Q3, and .75 with -Q4 as given below. However, my code is not working as expected. Any hints?
library(tidyverse)
dt1 <-
tibble(Date = c(2015.00, 2015.25, 2015.50, 2015.75))
dt1
# A tibble: 4 x 1
Date
<dbl>
1 2015
2 2015.
3 2016.
4 2016.
dt1 %>%
pull(Date)
[1] 2015.00 2015.25 2015.50 2015.75
dt1 %>%
mutate(Date1 = str_replace_all(string = Date, pattern = c(".00" = "-Q1", ".25" = "-Q2", ".50" = "-Q3", ".75" = "-Q4")))
# A tidytable: 4 × 2
Date Date1
<dbl> <chr>
1 2015 2015
2 2015. 2015-Q2
3 2016. 2015.5
4 2016. 2015-Q4
There is also a zoo function for that:
library(tidyverse)
library(zoo)
dt1 <-
tibble(Date = c(2015.00, 2015.25, 2015.50, 2015.75))
dt1 %>%
mutate(Date1 = format.yearqtr(Date, format = "%Y.Q%q") )
# Date Date1
# <dbl> <chr>
# 1 2015 2015.Q1
# 2 2015. 2015.Q2
# 3 2016. 2015.Q3
# 4 2016. 2015.Q4
You may also use integer division (%/%) and the modulo operator (%%) together:
paste0(dt1$Date %/% 1, '-Q',(dt1$Date %% 1)*4 +1)
[1] "2015-Q1" "2015-Q2" "2015-Q3" "2015-Q4"
Thus, using it in piped syntax as
dt1 %>%
mutate(date1 = paste0(Date %/% 1, '-Q',(Date %% 1)*4 +1))
# A tibble: 4 x 2
Date date1
<dbl> <chr>
1 2015 2015-Q1
2 2015. 2015-Q2
3 2016. 2015-Q3
4 2016. 2015-Q4
Here is a quick fix:
dt1 %>%
mutate(Date1 = str_replace_all(format(Date, nsmall = 2),
pattern = c(".00" = "-Q1", ".25" = "-Q2", ".50" = "-Q3", ".75" = "-Q4")))
The problem is that 2015.00 is first transformed to character at which point it becomes 2015. Therefore, the string replacement fails.
You can see this by trying as.character(2015.00).
However, this can easily be fixed by using format to format the number first.
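A quick illustration of that point: the character coercion drops the trailing zeros, while format() keeps two decimal places so the patterns can match.
as.character(2015.00)
#> [1] "2015"
format(2015.00, nsmall = 2)
#> [1] "2015.00"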
vec <- c("00" = "-Q1", "25" = "-Q2", "50" = "-Q3", "75" = "-Q4")
dt1 %>%
mutate(new = paste0(Date %/% 1, vec[sprintf("%02d", Date %% 1 * 100)]))
Date new
<dbl> <chr>
1 2015 2015-Q1
2 2015. 2015-Q2
3 2016. 2015-Q3
4 2016. 2015-Q4
library(tidyverse)
dt1 <-
as.character(c(2015.00, 2015.25, 2015.50, 2015.75))
dt1 <- if_else(str_detect(dt1, '\\.', negate = TRUE),
paste0(dt1, '.00'), #If condition TRUE
dt1) #if condition FALSE
value_before <- c("\\.00","\\.25","\\.5","\\.75" )
value_after <- c("-Q1", "-Q2","-Q3", "-Q4")
tibble(Date = str_replace(dt1, value_before, value_after))
#> # A tibble: 4 x 1
#> Date
#> <chr>
#> 1 2015-Q1
#> 2 2015-Q2
#> 3 2015-Q3
#> 4 2015-Q4
Created on 2021-06-01 by the reprex package (v2.0.0)
A solution with dplyr and tidyr:
Prepare the decimals for further processing with format
separate and recode the quarter digits to -Q1 through -Q4 with mutate
unite
library(tidyr)
library(dplyr)
dt1 %>%
mutate(Date = format(round(Date, digits=2), nsmall = 2)) %>%
separate(Date, into = c("Year", "Quarter"), remove=FALSE) %>%
mutate(Quarter = recode(Quarter, "00" = "-Q1", "25" = "-Q2", "50" = "-Q3", "75" = "-Q4")) %>%
unite("new", Year:Quarter, sep = "")
Output:
Date new
<chr> <chr>
1 2015.00 2015-Q1
2 2015.25 2015-Q2
3 2015.50 2015-Q3
4 2015.75 2015-Q4

Counting sequential dates in R to determine the length of an event

I have a dataframe containing dates when a given event occurred. Some events go on for several days, and I want to summarise each event based on its start date and its total length (in days).
I want to go from this:
Date
2020-01-01
2020-01-02
2020-01-03
2020-01-15
2020-01-20
2020-01-21
To this:
StartDate    EventLength
2020-01-01   3
2020-01-15   1
2020-01-20   2
I've tried various approaches with aggregate, ave, seq_along and lag, but I haven't managed to get a count of event length that resets when the dates aren't sequential.
Code for the example data frame in case it's helpful:
Date <- c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-15", "2020-01-20", "2020-01-21")
df <- data.frame(Date)
df$Date <- as.Date(df$Date, origin = "1970-01-01")
You can split by cumsum(c(0, diff(df$Date) != 1)) and then take the first date and combine it with the length, assuming the dates are sorted.
do.call(rbind, lapply(split(df$Date, cumsum(c(0, diff(df$Date) != 1))),
function(x) data.frame(StartDate=x[1], EventLength=length(x))))
# StartDate EventLength
#0 2020-01-01 3
#1 2020-01-15 1
#2 2020-01-20 2
or another option using rle:
i <- cumsum(c(0, diff(df$Date) != 1))
data.frame(StartDate = df$Date[c(1, diff(i)) == 1], EventLength=rle(i)$lengths)
# StartDate EventLength
#1 2020-01-01 3
#2 2020-01-15 1
#3 2020-01-20 2
I propose a dplyr approach which is incidentally very similar to @Rui's approach.
df %>% mutate(dummy = c(0, diff(Date))) %>%
group_by(grp = cumsum(dummy != 1)) %>%
summarise(Date = first(Date),
event_count = n(), .groups = 'drop')
# A tibble: 3 x 3
grp Date event_count
<int> <date> <int>
1 1 2020-01-01 3
2 2 2020-01-15 1
3 3 2020-01-20 2
Here is a base R solution with a cumsum trick followed by ave/table.
d <- c(0, diff(df$Date) != 1)
res <- ave(df$Date, cumsum(d), FUN = function(x) x[1])
res <- as.data.frame(table(res))
names(res) <- c("Date", "EventLength")
res
# Date EventLength
#1 2020-01-01 3
#2 2020-01-15 1
#3 2020-01-20 2

Grouping rows on multiple conditions

I have got a follow up question on my previous question about grouping rows on multiple conditions (Previous question).
I was wondering how I can group observations within 31 days of the first date. More importantly, after the 31 days have passed, the next date within the same group becomes the 'new' first date of that group. Furthermore, after each 'purchase' the grouping should also stop, and the next observation after the purchase becomes the 'new' first day of that group.
Let me illustrate it with an example:
example <- structure(
list(
userID = c(1,1,1,1,1,1,2,2,2,2),
date = structure(
c(
18168, #2019-09-29
18189, #2019-10-20
18197, #2019-10-28
18205, #2019-11-05
18205, #2019-11-05
18217, #2019-11-17
18239, #2019-12-09
18270, #2020-01-09
18271, #2020-01-10
18275 #2020-01-14
),
class = "Date"
),
purchase = c(0,0, 0, 0, 0, 1, 0, 0, 1, 0)
),
row.names = c(NA, 10L),
class = "data.frame"
)
Desired outcome:
Outcome <- data.frame(
userID = c(1,1,2,2,2),
date.start = c("2019-09-29", "2019-11-05", "2019-12-09", "2020-01-10", "2020-01-14"),
date.end = c("2019-10-28", "2019-11-17", "2020-01-09", "2020-01-10", "2020-01-14"),
purchase = c(0, 1, 0, 1, 0)
)
Thanks in advance! :)
Like my answer on the linked question, I again suggest an accumulate strategy here.
library(tidyverse)
example
#> userID date purchase
#> 1 1 2019-09-29 0
#> 2 1 2019-10-20 0
#> 3 1 2019-10-28 0
#> 4 1 2019-11-05 0
#> 5 1 2019-11-05 0
#> 6 1 2019-11-17 1
#> 7 2 2019-12-09 0
#> 8 2 2020-01-09 0
#> 9 2 2020-01-10 1
#> 10 2 2020-01-14 0
example %>% group_by(userID) %>%
group_by(grp = unlist(accumulate2(date, purchase[-n()], ~ if(as.numeric(..2 - ..1) < 31 & ..3 != 1) ..1 else ..2)),
grp = with(rle(grp), rep(seq_along(lengths), lengths)), .add = T) %>%
summarise(start.date = first(date),
last.date = last(date), .groups = 'drop')
#> # A tibble: 5 x 4
#> userID grp start.date last.date
#> <dbl> <int> <date> <date>
#> 1 1 1 2019-09-29 2019-10-28
#> 2 1 2 2019-11-05 2019-11-17
#> 3 2 3 2019-12-09 2019-12-09
#> 4 2 4 2020-01-09 2020-01-10
#> 5 2 5 2020-01-14 2020-01-14
Created on 2021-06-13 by the reprex package (v2.0.0)
We could also use the following solution:
library(dplyr)
library(data.table)
example %>%
group_by(grp = cumsum(ifelse(lag(purchase, default = 0) == 1, 1, 0))) %>%
mutate(grp2 = cumsum(as.numeric(date - lag(date, default = first(date)))) > 30) %>%
ungroup() %>%
mutate(grp2 = data.table::rleid(grp2)) %>%
group_by(userID, grp, grp2) %>%
summarise(first = first(date), last = last(date), .groups = "drop") %>%
select(-grp)
# A tibble: 5 x 4
userID grp2 first last
<dbl> <int> <date> <date>
1 1 1 2019-09-29 2019-10-28
2 1 2 2019-11-05 2019-11-17
3 2 3 2019-12-09 2019-12-09
4 2 4 2020-01-09 2020-01-10
5 2 5 2020-01-14 2020-01-14
Because there are dependencies between when one time period ends and the next one starts (given a date, you can only tell if it is the start, middle, or end of a period after investigating every prior record), I cannot see any better way of doing this than using a for loop.
Something like the following:
# create output column
example = example %>% mutate(grouping = NA)
# setup tracking variables
current_date = as.Date('1900-01-01')
current_id = -1
prev_purchase = 0
current_group = 0
for(ii in 1:nrow(example)){
  # reset on new identity OR on purchase OR on 31 days elapsed
  if(example$userID[ii] != current_id           # new identity
     || prev_purchase == 1                      # just had a purchase
     || example$date[ii] - current_date > 31){  # more than 31 days elapsed
    current_date = example$date[ii]
    current_id = example$userID[ii]
    prev_purchase = example$purchase[ii]
    current_group = current_group + 1
    example$grouping[ii] = current_group
  }
  # otherwise step forwards
  else {
    prev_purchase = example$purchase[ii]
    example$grouping[ii] = current_group
  }
}
One advantage of this approach is that you can pause after the for loop and check whether the groupings are as expected. The groups can then be collapsed to the requested output using:
output = example %>%
group_by(userID, grouping) %>%
summarise(date.start = min(date),
date.end = max(date),
purchase = max(purchase)) %>%
select(-grouping)

Expand start and end dates into a sequence of beginning and ending dates by calendar month

Given a table
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017
I'm trying to split by calendar month, as in the following table:
id start end
1 22/03/2016 31/03/2016
1 01/04/2016 30/04/2016
1 01/05/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 30/09/2017
3 01/10/2017 31/10/2017
3 01/11/2017 30/11/2017
3 01/12/2017 25/12/2017
I'm trying to modify a code extract from how to split rows of a dataframe in multiple rows based on start date and end date?, but I have not been able to modify the code correctly. The problem is generally with months that have 30 days; it may be easy, but I am not yet familiar with regular expressions.
#sample data
df <- data.frame("starting_date" = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
"end_date" = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
col3=c('1','2', '3'))
df1 <- df[,1:2] %>%
rowwise() %>%
do(rbind(data.frame(matrix(as.character(c(
.$starting_date,
seq(.$starting_date, .$end_date, by=1)[grep("\\d{4}-\\d{2}-31|\\d{4}-\\d{2}-01", seq(.$starting_date, .$end_date, by=1))],
.$end_date)), ncol=2, byrow=T))
)
) %>%
data.frame() %>%
`colnames<-`(c("starting_date", "end_date")) %>%
mutate(starting_date= as.Date(starting_date, format= "%Y-%m-%d"),
end_date= as.Date(end_date, format= "%Y-%m-%d"))
#add temporary columns to the original and expanded date column dataframes
df$row_idx <- seq(1:nrow(df))
df$temp_col <- (year(df$end_date) - year(df$starting_date)) +1
df1 <- cbind(df1,row_idx = rep(df$row_idx,df$temp_col))
#join both dataframes to get the final result
final_df <- left_join(df1,df[,3:(ncol(df)-1)],by="row_idx") %>%
select(-row_idx)
final_df
If anyone knows how to modify the code or a better way to do it I will be very grateful.
We assume there is an error in the sample output in the question since the third row spans parts of two months and so should be split into two rows.
Define Seq which given one start and end Date variables produces a data.frame of start and end columns and then run it on each id using group_by:
library(dplyr)
library(zoo)
Seq <- function(start, end) {
  ym <- seq(as.yearmon(start), as.yearmon(end), 1/12)
  starts <- pmax(start, as.Date(ym, frac = 0))
  ends <- pmin(end, as.Date(ym, frac = 1))
  unique(data.frame(start = starts, end = ends))
}
fmt <- "%d/%m/%Y"
DF %>%
mutate(start = as.Date(start, fmt), end = as.Date(end, fmt)) %>%
group_by(id) %>%
do(Seq(.$start, .$end)) %>%
ungroup
giving:
# A tibble: 9 x 3
id start end
<int> <date> <date>
1 1 2016-03-22 2016-03-31
2 1 2016-04-01 2016-04-30
3 1 2016-05-01 2016-05-31
4 1 2016-06-01 2016-06-05
5 2 2016-08-17 2016-08-29
6 3 2017-09-22 2017-09-30
7 3 2017-10-01 2017-10-31
8 3 2017-11-01 2017-11-30
9 3 2017-12-01 2017-12-25
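As a quick check, Seq can also be run on a single pair of dates (a sketch relying on the helper and zoo functions defined above); it should reproduce the four rows shown for id 1:
Seq(as.Date("2016-03-22"), as.Date("2016-06-05"))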
Note
The input DF in reproducible form:
Lines <- "
id start end
1 22/03/2016 05/06/2016
2 17/08/2016 29/08/2016
3 22/09/2017 25/12/2017"
DF <- read.table(text = Lines, header = TRUE)
So there's probably a more elegant way to accomplish this, and I feel like I've seen similar questions, but I could not find a duplicate quickly, so here goes...
SETUP
library(tidyverse)
library(lubridate)
df <- data.frame(
id = c('1', '2', '3'),
starting_date = as.Date(c("2016-03-22", "2016-08-17", "2017-09-12")),
end_date = as.Date(c("2016-06-05", "2016-08-29", "2017-12-25")),
stringsAsFactors = FALSE
)
df
#> id starting_date end_date
#> 1 1 2016-03-22 2016-06-05
#> 2 2 2016-08-17 2016-08-29
#> 3 3 2017-09-12 2017-12-25
SOLUTION
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest() %>%
mutate(row = row_number()) %>%
mutate(
new_end_date = if_else(row == max(row), end_date, date_seq),
new_start_date = if_else(row == min(row), starting_date, floor_date(new_end_date, "month"))
) %>%
select(
id, new_start_date, new_end_date
)
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id new_start_date new_end_date
#> <chr> <date> <date>
#> 1 1 2016-03-22 2016-03-31
#> 2 1 2016-04-01 2016-04-30
#> 3 1 2016-06-01 2016-06-05
#> 4 2 2016-08-17 2016-08-29
#> 5 3 2017-09-12 2017-09-30
#> 6 3 2017-10-01 2017-10-31
#> 7 3 2017-11-01 2017-11-30
#> 8 3 2017-12-01 2017-12-25
EXPLANATION
Much of what's going on here takes place in the first mutate call which creates date_seq. To understand it, consider the following:
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month")
# [1] "2016-03-22" "2016-04-22" "2016-05-22"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month")
# [1] "2016-04-01" "2016-05-01" "2016-06-01"
seq.Date(ymd("2016-03-22"), ymd("2016-06-05"), by = "month") %>%
ceiling_date("month") - 1
# [1] "2016-03-31" "2016-04-30" "2016-05-31"
So basically, create a sequence of "end-of-month" dates between the original start and end dates. Putting this in a list-column allows us to organize by the id so that we unnest appropriately. Check out the output after the end of the unnest():
df %>%
group_by(id) %>%
mutate(
date_seq = list(seq.Date(starting_date, end_date, by = "month") %>% ceiling_date("month") - 1)
) %>%
unnest()
From there I hope things are relatively straightforward. The row_number probably could have been replaced with something fancier like a first/last, but I thought this might be easier to follow.
