Group consecutive working day dates (excluding weekends and holidays) - r

I have data on different persons (ID), the dates they have worked (Date), and how many hours they have worked each date (Hours).
Monday - Friday are considered working days. For each ID, I want to group consecutive working days. Weekends and holidays should be omitted when deciding if working days are consecutive.
Some examples:
If a person worked Monday, Tuesday and Wednesday, then skipped Thursday, and worked again on Friday, then Monday - Wednesday is considered one group and Friday another group.
If a person worked Thursday & Friday one week and Monday & Tuesday the next week, then these four days should be in the same group. Thus, the days in the weekend are omitted when checking if working days are consecutive.
If a person worked Monday - Friday of one week and Monday - Friday of the next week, I'd count those whole two weeks as consecutive working days.
I would also like to take into account common US Holidays (e.g. New Year), such that 12/31/2020 to 1/4/2021 would still count as consecutive workdays.
Once the grouping variable is created, I want to sum the working hours in each group.
Example data:
df1 = structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3,
3, 3), Date = structure(c(18781, 18782, 18785, 18750, 18687,
18688, 18626, 18627, 18631, 18634), class = "Date"), Hours = c(8,
8, 8, 16, 8, 8, 8, 8, 8, 8)), row.names = c(NA, -10L), class = "data.frame")
ID Date Hours
1 1 2021-06-03 8
2 1 2021-06-04 8
3 1 2021-06-07 8
4 1 2021-05-03 16
5 2 2021-03-01 8
6 2 2021-03-02 8
7 3 2020-12-30 8
8 3 2020-12-31 8
9 3 2021-01-04 8
10 3 2021-01-07 8
I imagine my output to look something like this:
ID Date1 Date2 Hours
1 1 2021-06-03 2021-06-07 24
# the weekend, June 5-6, is omitted
# when the group of consecutive working days is created
2 1 2021-05-03 2021-05-03 16
3 2 2021-03-01 2021-03-02 16
4 3 2020-12-30 2021-01-04 24
# the public holiday (Jan 1) and the weekend (Jan 2-3) are omitted
5 3 2021-01-07 2021-01-07 8
My top priority is to at least get the consecutive work week figured out, the holiday portion would be an added bonus.

You may use RQuantLib::businessDaysBetween. For each ID (by = ID), calculate the number of business days between consecutive rows, i.e. provide the "lag" (head(Date, -1)) and "lead" (tail(Date, -1)) vectors as from and to dates. Pick a relevant calendar (see Details in ?businessDaysBetween).
For each ID and run of consecutive business days (by = .(ID, g = cumsum(d != 1L))), select the first and last date (from = Date[1], to = Date[.N]) and sum the Hours (sum(Hours)).
library(data.table)
library(RQuantLib)
setDT(df1)
# business days between consecutive rows within each ID (the first row gets a 1)
df1[ , d := c(1, businessDaysBetween(calendar = "UnitedStates",
                                     from = head(Date, -1), to = tail(Date, -1))),
     by = ID]
# start a new group whenever the gap is not exactly one business day
df1[ , .(from = Date[1], to = Date[.N], Hours = sum(Hours)),
     by = .(ID, g = cumsum(d != 1L))]
# ID g from to Hours
# 1: 1 0 2021-06-03 2021-06-07 24
# 2: 1 1 2021-05-03 2021-05-03 16
# 3: 2 1 2021-03-01 2021-03-02 16
# 4: 3 1 2020-12-30 2021-01-04 24
# 5: 3 2 2021-01-07 2021-01-07 8
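If you are unsure which calendar fits your data, a quick sanity check is RQuantLib::isBusinessDay (a small sketch, not part of the original answer; it assumes the same "UnitedStates" calendar as above):
library(RQuantLib)
# New Year's Day 2021 should not be a business day; the following Monday should be
isBusinessDay(calendar = "UnitedStates",
              dates = as.Date(c("2021-01-01", "2021-01-04")))
# expected: FALSE  TRUE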
More convoluted solution (pre-businessDaysBetween):
Create a full sequence of dates within each ID (df1[ , .(Date = seq(min(Date), max(Date), by = "1 day")), by = ID]). Join with the original data on ID and Date (df1[..., on = .(ID, Date)]). For dates not present in the original data, i.e. the gaps between (originally) consecutive days, Hours will be NA.
Within each ID (by = ID), create a run length index based on missing Hours (rleid(is.na(Hours))). For rows with missing Hours (d[is.na(Hours)]), i.e. the gaps in the original time series, for each ID and run (by = .(ID, r)), check if all dates are either a weekend (wday(Date) %in% c(1, 7)) or (|) a public holiday* (Date %in% as.Date(holidayNYSE(unique(year(Date))))), and create an index variable, ix.
For original rows and weekend/holiday gaps (!is.na(Hours) | ix), create a grouping variable of consecutive dates (g = cumsum(c(TRUE, diff(Date) != 1L))). For each ID and run of consecutive dates (by = .(ID, g)), select the first and last date (from = Date[1], to = Date[.N]) and sum the Hours (sum(Hours, na.rm = TRUE)).
library(data.table)
library(timeDate)
setDT(df1)
# full daily sequence per ID, joined with the original data (gap days get Hours = NA)
d = df1[df1[ , .(Date = seq(min(Date), max(Date), by = "1 day")), by = ID],
        on = .(ID, Date)]
# run length index separating original rows from gaps
d[ , r := rleid(is.na(Hours)), by = ID]
# for each gap, flag whether it consists solely of weekends and/or NYSE holidays
d[is.na(Hours), ix := all(
    wday(Date) %in% c(1, 7) |
    Date %in% as.Date(holidayNYSE(unique(year(Date)))))
  , by = .(ID, r)]
# keep original rows and weekend/holiday gaps, group consecutive dates, then aggregate
d[!is.na(Hours) | ix, .(Date, Hours, g = cumsum(c(TRUE, diff(Date) != 1L))),
  by = ID][
  , .(from = Date[1], to = Date[.N],
      Hours = sum(Hours, na.rm = TRUE)),
  by = .(ID, g)]
# ID g from to Hours
# 1: 1 1 2021-05-03 2021-05-03 16
# 2: 1 2 2021-06-03 2021-06-07 24
# 3: 2 1 2021-03-01 2021-03-02 16
# 4: 3 1 2020-12-30 2021-01-04 24
# 5: 3 2 2021-01-07 2021-01-07 8
*Please see the timeDate manual for other definitions of holiday.
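For example, a short sketch of what timeDate offers (assuming the package is installed; listHolidays() and holidayNYSE() are both exported by timeDate):
library(timeDate)
holidayNYSE(2021)   # the NYSE holidays used above, for 2021
listHolidays("US")  # names of the built-in US holiday functions, e.g. USNewYearsDay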

Related

Creating start date based on end date of prior observation for multiple observations by group

I have a dataset of episodes by ID, where for each person I have a start date and a length of the episode (df.have).
I'd like to create start and end dates for each episode where the start date for one episode is one day after the end date for the previous episode (df.want)
I know I need to lag the prior start date but I don't know how to do that repeatedly (i.e., I can do it for the second episode, but not the third).
df.have <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
episode_num=c(1,2,3,1,2,3,1,2,3),
start_date=as.Date(c("1/1/2001", NA, NA, "5/1/2001", NA, NA, "10/1/1001", NA, NA), "%m/%d/%y"),
episode_length=c(10,4,5,20,3,2,1,9,8))
df.want <- df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
episode_num=c(1,2,3,1,2,3,1,2,3),
start_date=as.Date(c("1/1/01", "1/12/01","1/17/01","5/1/01","5/22/01","5/26/01","10/1/01","10/3/01","10/13/01"),"%m/%d/%y"),
end_date= as.Date(c("1/11/01","1/16/01","1/22/01","5/21/01","5/25/01","5/28/01","10/2/01","10/12/01","10/21/01"), "%m/%d/%y"),
episode_length=c(10,4,5,20,3,2,1,9,8))
Unless you want to split() your input data set by ID and loop over the resulting list, there will ultimately be some kind of grouping involved. An easy way to solve this is to switch to data.table and calculate the remaining start dates per group by adding the cumsum() of the episode lengths to the initial start date per ID. The end dates can then be calculated by adding the episode lengths to the resulting start dates.
Note that this approach assumes that episode numbers per ID are in order.
library(data.table)
dt.have = data.table(
df.have
)
## calculate remaining start dates per id
dt.have[
, start_date := min(
start_date
, na.rm = TRUE
) + c(
0
, cumsum(
episode_length[-.N] + 1 # exclude last episode_length per group from `cumsum()`
)
)
, by = id
]
## append end dates
dt.have[
, end_date := start_date + episode_length
]
If you don't feel comfortable working with data.table, you can simply convert the output object back to data.frame using
data.frame(
dt.have
)
# id episode_num start_date episode_length end_date
# 1 1 2020-01-01 10 2020-01-11
# 1 2 2020-01-12 4 2020-01-16
# 1 3 2020-01-17 5 2020-01-22
# 2 1 2020-05-01 20 2020-05-21
# 2 2 2020-05-22 3 2020-05-25
# 2 3 2020-05-26 2 2020-05-28
# 3 1 2010-10-01 1 2010-10-02
# 3 2 2010-10-03 9 2010-10-12
# 3 3 2010-10-13 8 2010-10-21
Since this was explicitly asked in the comments by @TomHoel, here's the dplyr analog of the above code (just replace the native |> with %>% for R versions below 4.1.0):
library(dplyr)
df.have |>
# group by id
group_by(id) |>
mutate(
# calculate remaining start dates per id
start_date = min(
start_date
, na.rm = TRUE
) + c(
0
, cumsum(
episode_length[-n()] + 1 # exclude last episode_length per group from `cumsum()`
)
)
# append end date
, end_date = start_date + episode_length
)

Group consecutive dates [duplicate question, but can't make it work with my data]

I have a database with 142 columns, one of which is called "Date" (of class POSIXct); from it I'd like to make a new column that groups consecutive dates together. Dates with more than 2 days separating one another should be put into separate groups.
I'd also like to name the levels of the group with the name of the month the consecutive dates start in (for example: Jan. 3rd, 2018 -> Jan. 12th, 2018 = group level called "January sampling event"; Feb. 27th, 2018 -> March 1st, 2018 = group level called "February sampling event"; etc.).
I've seen very similar questions like Group consecutive dates in R and R: group dates that are next to each other, but just can't get it to work for my data.
EDIT:
My example data (the last row shows that dates separated by over a year get grouped together, for some reason):
> dput(df)
structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
My attempt:
df$group <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Remove time from date time
It's hard to tell exactly what the problem is without seeing your data (or similar example data), but my guess is that the date-time format (the 00:00:00 part) is messing up as.Date.
One solution would be to extract just the date part and then try again with that:
# here are your date times
date_time <- "2018-01-03 00:00:00"
# this looks for 4 digits between 0 and 9, followed by a dash, followed by 2 digits between 0 and 9, followed by a dash, followed by 2 digits between 0 and 9
date_pattern <- " ?([0-9]{4}-[0-9]{2}-[0-9]{2}) ?"
#need this library
library(stringr)
library(magrittr) #for pipes
#this pulls out text matching the pattern we specified in date pattern
date_new <- str_extract(date_time, date_pattern) %>%
str_squish() # this removes white space
# this is the new date without the time
date_new
# then we convert to as date
date_new <- as.Date(date_new)
See if converting your date column to just dates and then rerunning your grouping works.
If you have dates in different formats and need to adapt the regular expression, here's something about regular expressions: https://stackoverflow.com/a/49286794/16502170
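For completeness, a minimal sketch of that suggestion (assuming df$Date really is POSIXct and reusing the grouping attempt from the question):
# for a POSIXct column, as.Date() drops the time component directly
df$Date <- as.Date(df$Date)
# rerun the grouping: a new group starts whenever the gap exceeds one day
df$group <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))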
Group dates
Let's start with an example data frame that contains a date column
# here's a bunch of example dates:
library(lubridate)
dates2 <- seq.Date(as.Date("2018-03-01"),by="days",length.out = 60)
#here's the dataframe
exampl_df <- data.frame(animals = rep(c("cats","dogs","rabbits"),20), dates=dates2,
numbers= rep(1:3,20))
Here's what it looks like:
head(exampl_df)
animals dates numbers
1 cats 2018-03-01 1
2 dogs 2018-03-02 2
3 rabbits 2018-03-03 3
4 cats 2018-03-04 1
5 dogs 2018-03-05 2
6 rabbits 2018-03-06 3
Then let's make a sequence of every day between the minimum and maximum date in the sequence. This step is important because there may be missing dates in our data that we still want to count towards the separation between days.
# this is a day by day sequence from the earliest day in your data to the latest day
date_sequence <- seq.Date(from = min(dates2),max(dates2),by="day")
Then let's make a sequence of numbers each repeated seven times. If you wanted to group every three days, you could change each to 3. Then the length.out= length(date_sequence) tells R to make this vector have as many entries as the min to max date sequence has:
# and then if you want a new group every seven days you can make this number sequence
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
Then let's attach the groups to the date_sequence to make a grouping index
date_grouping_index <- data.frame(a=date_sequence,b=groups)
then you can do a join to attach the groups to the original dataframe
library(dplyr)
example_df2 <- exampl_df %>%
inner_join(date_grouping_index, by=c("dates"="a"))
This is what we get:
head(example_df2,n=10)
animals dates numbers b
1 cats 2018-03-01 1 1
2 dogs 2018-03-02 2 1
3 rabbits 2018-03-03 3 1
4 cats 2018-03-04 1 1
5 dogs 2018-03-05 2 1
6 rabbits 2018-03-06 3 1
7 cats 2018-03-07 1 1
8 dogs 2018-03-08 2 2
9 rabbits 2018-03-09 3 2
10 cats 2018-03-10 1 2
Then you should be able to group_by() or aggregate() your data using column b
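For instance, a short sketch of that last step on the example data built above (the chosen summaries are just illustrative):
library(dplyr)
example_df2 %>%
  group_by(b) %>%
  summarise(start = min(dates),
            end = max(dates),
            total = sum(numbers))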
Using the data provided in the question
#original data
df <- structure(list(Date = structure(c(17534, 17535, 17536, 17537,
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1,
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
#plus extra step
df$group2 <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Method described above
date_sequence <- seq.Date(from = min(df$Date),max(df$Date),by="day")
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
date_grouping_index <- data.frame(a=date_sequence,groups=groups)
example_df2<- df %>%
inner_join(date_grouping_index, by=c("Date"="a"))
Looks like it worked?
example_df2
# A tibble: 9 x 4
Date group group2 groups
<date> <dbl> <dbl> <int>
1 2018-01-03 1 1 1
2 2018-01-04 1 1 1
3 2018-01-05 1 1 1
4 2018-01-06 1 1 1
5 2020-01-18 2 2 107
6 2020-01-19 2 2 107
7 2020-01-20 2 2 107
8 2020-01-21 2 2 107
9 2019-02-05 2 2 57
Here's something you could do to make group names with the date and year in them:
example_df2$group_name <- paste0("sampling number ",
example_df2$groups,
" (",
month.name[month(example_df2$Date)],
"-",
year(example_df2$Date),
")")

How to calculate sumifs across a date range referencing two tables of unequal length?

I have two data tables. The first, DT1, is grouped by ID. It has a Week column that contains an instance of every consecutive week in a large time frame for every ID. It also contains a Units column. It looks something like this simplified version:
ID Week Units
A 1/1/2019 5
A 1/8/2019 9
A 1/15/2019 0
A 1/22/2019 1
B 1/1/2019 1
B 1/8/2019 32
B 1/15/2019 2
B 1/22/2019 6
C 1/1/2019 0
C 1/8/2019 0
C 1/15/2019 8
C 1/22/2019 3
The second table, DF2, has a time range window for every ID. This is represented by a start date column and a stop date column. It looks something like this simplified version:
ID Start Date Stop Date
A 1/1/2019 1/8/2019
B 1/8/2019 1/22/2019
C 1/8/2019 1/15/2019
I would like to calculate the sum of the units column for every date range/buyer combination in DF2. My desired output would be:
ID Start_Date Stop_Date sumUnits
A 1/1/2019 1/8/2019 14
B 1/8/2019 1/22/2019 40
C 1/8/2019 1/15/2019 8
Is there a way to calculate this type of sum in R?
I have referenced the article In R: how to sum a variable by group between two dates, as well as attempted the interval function and a non-equi join.
DT1[DT2[DT1, sum(x), on = .(Units, Week >= Stop_Date, Week <= Stop_Date),
        by = .EACHI], newvar := V1, on = .(Units, Start_Date = Week)]
It returns the message, "Week" not in i. How do I solve the problem?
If you don't mind using dplyr instead of data.table, you can use the fuzzyjoin package:
library(dplyr)
library(fuzzyjoin)
DF1 = data.frame(
ID = c(rep("A", 4), rep("B", 4), rep("C", 4)),
Week = rep(as.Date(c("1/1/2019", "1/8/2019", "1/15/2019", "1/22/2019"), format = "%m/%d/%Y")),
Units = c(5, 9, 0, 1, 1, 32, 2, 6, 0, 0, 8, 3)
)
DF2 = data.frame(
ID = c("A", "B", "C"),
Start.Date = as.Date(c("1/1/2019", "1/8/2019", "1/8/2019"), format = "%m/%d/%Y"),
Stop.Date = as.Date(c("1/8/2019", "1/22/2019", "1/15/2019"), format = "%m/%d/%Y")
)
fuzzy_inner_join(
DF1, DF2,
by = c("ID", "Week" = "Start.Date", "Week" = "Stop.Date"),
match_fun = list(`==`, `>=`, `<=`)
) %>%
group_by(ID.x, Start.Date, Stop.Date) %>%
summarize(sumUnits = sum(Units))
Here's a variation on @akrun's answer that works:
library(data.table)
setDT(DF1)
setDT(DF2)
DF2[, v :=
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by=.EACHI]$V1
]
ID Start.Date Stop.Date v
1: A 2019-01-01 2019-01-08 14
2: B 2019-01-08 2019-01-22 40
3: C 2019-01-08 2019-01-15 8
I am using the input data as created in @A.S.K.'s answer.
To see how it works, try running simpler parts of it:
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by=.EACHI]$V1
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by=.EACHI]
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date)]

Filtering observations based on specific date condition using data.table

I have a set of observations, which are recorded every time a user has taken an action. I want to filter only those observations from a user which are six or more months apart.
So, if a user has taken this action on "2018-01-01", "2018-03-01" and "2018-07-01", I only want to keep only "2018-01-01" and "2018-07-01".
Similarly, if a user has taken an action on "2018-01-01", "2018-03-01", "2018-07-01" and "2019-03-01" I want to keep only "2018-01-01", "2018-07-01", "2019-03-01".
So far, I have produced long and unworkable code.
# What I want to achieve
library(data.table)
dataIhave <- data.table(id = c(1, 1, 1, 1, 2, 2, 3, 4),
dates = c("2018-01-01",
"2018-03-01",
"2018-07-01",
"2019-01-01",
"2018-01-03",
"2018-07-02",
"2018-02-01",
"2018-02-01"))
dataIwant <- data.table(id = c(1, 1, 1, 2, 3, 4),
dates = c("2018-01-01",
"2018-07-01",
"2019-01-01",
"2018-01-01",
"2018-02-01",
"2018-02-01"))
This is a rolling-join variant of @Uwe's answer:
library(lubridate)
dataIhave[, dates := as.IDate(dates)]
ids = unique(dataIhave$id)
dataIhave[, seq := NA_integer_]
s = 1L
w = dataIhave[.(ids), on=.(id), mult="first", which = TRUE]
dataIhave[w, seq := s]
while (TRUE){
w = dataIhave[
dataIhave[w, .(id, dates = dates %m+% months(6))],
on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE
]
if (!length(w)) break
s = s + 1L
dataIhave[w, seq := s]
}
dataIhave[!is.na(seq)]
id dates seq
1: 1 2018-01-01 1
2: 1 2018-07-01 2
3: 1 2019-01-01 3
4: 2 2018-01-03 1
5: 3 2018-02-01 1
6: 4 2018-02-01 1
The loop takes rows w defined per id, steps their dates forward six months, and takes the next row found, if any. The arguments to the join are:
The tables, with join syntax x[i, ...]
x = dataIhave
i = dataIhave[w, .(id, dates = dates %m+% months(6))]
on = .(id, dates): columns to match by
roll = -Inf: find the next match on the last column in on= (see the small illustration after this list)
nomatch = 0: if no match is found, skip
which = TRUE: return matched row number
Additionally, if there are duplicate dates (see the second example in @Uwe's post):
mult = "first": take only the first match for each row of i
In selecting the first row by id before the loop, I'm assuming the data is sorted by dates within id (so I'm not using order as @Uwe's answer does).
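A tiny illustration of the roll = -Inf ("next match") behaviour on made-up data, not taken from the answer above:
library(data.table)
x <- data.table(id = 1L, dates = as.IDate(c("2018-01-01", "2018-03-01", "2018-07-01")))
i <- data.table(id = 1L, dates = as.IDate("2018-02-15"))
# 2018-02-15 falls between rows 1 and 2 of x; roll = -Inf matches it to the next
# available date, 2018-03-01, so the returned row number is 2
x[i, on = .(id, dates), roll = -Inf, which = TRUE]
# [1] 2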
If I understand correctly, the OP wants to drop those dates which are less than six months apart from the beginning of a period, and to start a new period at the first date that is more than six months after the beginning of the previous period (separately for each id).
I have no idea how this can be accomplished by a non-recursive rolling or non-equi join as there is no fixed grid of dates. So, I believe it requires a kind of recursive approach, somehow. Here is one possibility:
library(data.table)
library(lubridate)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := TRUE]
dataIhave[order(id, dates)
, keep := {
start <- dates[1L]
for (i in tail(seq_along(dates), -1L)) {
if (dates[i] < start %m+% months(6)) {
keep[i] <- FALSE
} else {
start <- dates[i]
}
}
keep
}, by = id][]
id dates keep
1: 1 2018-01-01 TRUE
2: 1 2018-03-01 FALSE
3: 1 2018-07-01 TRUE
4: 1 2019-01-01 TRUE
5: 2 2018-01-03 TRUE
6: 2 2018-07-02 FALSE
7: 3 2018-02-01 TRUE
8: 4 2018-02-01 TRUE
Finally,
dataIhave[(keep), -"keep"]
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
2nd test case
The crucial point here is to detect the beginning of a new period (within each id).
As an additional test case I have added two dates to id == 1: 2018-07-01 and 2018-07-02. The first one is a duplicate. Both added dates should be removed, as they lie within the second 6-month period starting at 2018-07-01.
dataIhave <- fread("
id dates
1 2018-01-01
1 2018-03-01
1 2018-07-01
1 2018-07-01
1 2018-07-02
1 2019-01-01
2 2018-01-03
2 2018-07-02
3 2018-02-01
4 2018-02-01")
Indeed, the code above returns the same output as with OP's original test case.
Remove rows only within the first six months for each id
If the question is interpreted as removing only the entries within the first 6-month period for each id and keeping all dates after 6 months, this can be achieved by
dataIhave[!dataIhave[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1]
which returns
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2018-07-01
4: 1 2018-07-02
5: 1 2019-01-01
6: 2 2018-01-03
7: 3 2018-02-01
8: 4 2018-02-01
for the second test case. (Note that this is a streamlined version of Jaap's answer.)
Another variant:
library(lubridate)
library(data.table)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := dates >= dates[1] %m+% months(6), by = id
][dataIhave[, .I[1], by = id][[2]], keep := TRUE
][!!keep, -"keep"]
which gives:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
Using non-equi join and igraph to avoid implicit loops and recursion:
#data prep
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
setorder(dataIhave[, rn:=rowid(id)], id, dates)
dataIhave[, end := as.IDate(sapply(dates,
function(d) seq(d, by="6 months", length.out=2L)[2L]))]
#non-equi self join to find first date that is after 6months
nonequi <- dataIhave[dataIhave, on=.(id, dates>=end), mult="first", by=.EACHI,
.(i.id, i.rn, x.rn, i.dates, x.dates)]
library(igraph)
nonequi[, {
#create graph from the previous join
g <- graph_from_data_frame(.SD[, .(i.rn, x.rn)])
#plot(g)
#find the leaf nodes
leaf <- sapply(V(g), function(x) length(neighbors(g,x))==0L)
#from the first date (i.e. node = V(g)["1"]), find the path starting from this date.
path <- get.all.shortest.paths(g, V(g)["1"], leaf)$res
#return all dates (i.e. nodes) in this path
.(dates=i.dates[i.rn %in% na.omit(V(g)[path[[1L]]]$name)])
},
by=.(id=i.id)]
output:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
Or a recursive approach similar to Uwe's solution:
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
unique(dataIhave[,
.(dates=as.IDate(Reduce(
function(x, y) if (y >= seq(x, by="6 months", length.out=2L)[2L]) y else x,
dates,
accumulate=TRUE))),
.(id)])
output:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
library(lubridate)
library(data.table)
dataIhave[, dates := ymd(dates)]
# gap in weeks to the previous action within each id (NA for the first action, which is kept)
dataIhave[, difDates := as.numeric(difftime(dates, shift(dates), units = "weeks")), by = id]
dataIhave[is.na(difDates) | difDates >= 24, .(id, dates)]
Does this produce the result you want?
Months have irregular durations, so you'll have to stick to a time unit of fixed duration.
You can also check ?lubridate::interval, lubridate::as.duration and this question: Time difference in years with lubridate?
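On that note, a small illustration of the two kinds of arithmetic (a sketch; lubridate's %m+% rolls back to the last valid day of the month, while a fixed-length offset does not):
library(lubridate)
ymd("2018-08-31") %m+% months(6)  # "2019-02-28": calendar months, end-of-month safe
ymd("2018-08-31") + days(182)     # "2019-03-01": a fixed 26-week offset instead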

Merging data.table rows based on dates

Problem:
I have records with a start and end date for an intervention and I want to merge the rows according to the following rule:
For each ID, any intervention that begins within one year of the last intervention ending, merge the rows so that the start_date is the earliest start date of the two rows, and the end_date is the latest end_date of the two rows.
I also want to keep track of intervention IDs if they are merged.
There can be five scenarios:
Two rows have the same start date, but different end dates.
Start date....End date
Start date.........End date
The period between row 2's start and end date lies within the period of row 1's start and end date.
Start date...................End date
.......Start date...End date
Row 2's intervention starts within Row 1's intervention period but ends later.
Start date.....End date
.....Start date.............End date
Row 2 starts within one year of the end of Row 1.
Start date....End date
......................|....<= 1 year....|Start date...End date
Row 2 starts over one year after the end of Row 1.
Start date...End date
.....................|........ > 1 year..........|Start date...End date
I want to merge rows in cases 1 to 4 but not 5.
Data:
library(data.table)
sample_data <- data.table(id = c(rep(11, 3), rep(21, 4)),
start_date = as.Date(c("2013-01-01", "2013-01-01", "2013-02-01", "2013-01-01", "2013-02-01", "2013-12-01", "2015-06-01")),
end_date = as.Date(c("2013-06-01", "2013-07-01", "2013-05-01", "2013-07-01", "2013-09-01", "2014-01-01", "2015-12-01")),
intervention_id = as.character(1:7),
all_ids = as.character(1:7))
> sample_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-06-01 1 1
2: 11 2013-01-01 2013-07-01 2 2
3: 11 2013-02-01 2013-05-01 3 3
4: 21 2013-01-01 2013-07-01 4 4
5: 21 2013-02-01 2013-09-01 5 5
6: 21 2013-12-01 2014-01-01 6 6
7: 21 2015-06-01 2015-12-01 7 7
The final result should look like:
> merged_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2015-06-01 2015-12-01 7 7
I'm not sure if the all_ids column is the best way to keep track of the intervention_id's so open to ideas for that. (The intervention_id's don't need to be in order in the all_ids column.)
It doesn't matter what the value of the intervention_id column is where rows have been merged.
What I tried:
I started off by writing a function to deal with only those cases where the start date is the same. It's a very non-R, non-data.table way of doing it and therefore very inefficient.
mergestart <- function(unmerged) {
n <- nrow(unmerged)
mini_merged <- data.table(id = double(n),
start_date = as.Date(NA),
end_date = as.Date(NA),
intervention_id = character(n),
all_ids = character(n))
merge_a <- function(unmerged, un_i, merged, m_i, no_of_records) {
merged[m_i] <- unmerged[un_i]
un_i <- un_i + 1
while (un_i <= no_of_records) {
if(merged[m_i]$start_date == unmerged[un_i]$start_date) {
merged[m_i]$end_date <- max(merged[m_i]$end_date, unmerged[un_i]$end_date)
merged[m_i]$all_ids <- paste0(merged[m_i]$all_ids, ",", unmerged[un_i]$intervention_id)
un_i <- un_i + 1
} else {
m_i <- m_i + 1
merged[m_i] <- unmerged[un_i]
un_i <- un_i + 1
merge_a(unmerged, un_i, merged, m_i, (no_of_records - un_i))
}
}
return(merged)
}
mini_merged <- merge_a(unmerged, 1, mini_merged, 1, n)
return(copy(mini_merged[id != 0]))
}
Using this function on just one id gives:
> mergestart(sample_data[id == 11])
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 11 2013-02-01 2013-05-01 3 3
To use the function on the whole dataset:
n <- nrow(sample_data)
all_merged <- data.table(id = double(n),
start_date = as.Date(NA),
end_date = as.Date(NA),
intervention_id = character(n),
all_ids = character(n))
start_i <- 1
for (i in unique(sample_data$id)) {
id_merged <- mergestart(sample_data[id == i])
end_i <- start_i + nrow(id_merged) - 1
all_merged[start_i:end_i] <- copy(id_merged)
start_i <- end_i
}
all_merged <- all_merged[id != 0]
> all_merged
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 21 2013-01-01 2013-07-01 4 4
3: 21 2013-02-01 2013-09-01 5 5
4: 21 2013-12-01 2014-01-01 6 6
5: 21 2015-06-01 2015-12-01 7 7
I also had a look at rolling joins but still can't figure out how to use them in this situation.
This answer https://stackoverflow.com/a/48747399/6170115 looked promising but I don't know how to integrate all the other conditions and track the intervention IDs with this method.
Can anyone point me in the right direction?
There are related questions How to flatten / merge overlapping time periods and Consolidate rows based on date ranges but none of them has the additional requirements posed by the OP.
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := {
tmp <- as.integer(start_date)
cumsum(tmp > shift(cummax(tmp + 365L), type = "lag", fill = 0L))
}, by = id][]
id start_date end_date intervention_id all_ids period
1: 11 2013-01-01 2013-06-01 1 1 1
2: 11 2013-01-01 2013-07-01 2 2 1
3: 11 2013-02-01 2013-05-01 3 3 1
4: 21 2013-01-01 2013-07-01 4 4 1
5: 21 2013-02-01 2013-09-01 5 5 1
6: 21 2013-12-01 2014-01-01 6 6 1
7: 21 2015-06-01 2015-12-01 7 7 2
For the sake of simplicity, it is assumed that one year has 365 days, which ignores leap years with 366 days. If leap years are to be considered, more sophisticated date arithmetic is required.
Unfortunately, cummax() has no method for arguments of class Date or IDate (data.table's integer version). Therefore, the coercion from Date to integer is required.
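A quick demonstration of that limitation (a sketch added for illustration; the integers are days since 1970-01-01):
d <- as.Date(c("2013-01-01", "2013-07-01", "2013-02-01"))
# cummax(d)             # errors: 'cummax' not defined for "Date" objects
cummax(as.integer(d))   # 15706 15887 15887
as.Date(cummax(as.integer(d)), origin = "1970-01-01")
# "2013-01-01" "2013-07-01" "2013-07-01"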
# aggregate
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
Edit: Correction
I just noted that I had misinterpreted the OP's requirements. The OP has requested (emphasis mine):
For each ID, any intervention that begins within one year of the last
intervention ending, merge the rows so that the start_date is the
earliest start date of the two rows, and the end_date is the latest
end_date of the two rows.
The solution above looks for gaps of one year in the sequence of start_date but not in the sequence of start_date and the preceding end_date, as requested. The corrected version is:
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := cumsum(
as.integer(start_date) > shift(
cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
, by = id][]
# aggregate
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
The result for the given sample dataset is identical for both versions, which caused the error to slip through unrecognized.
Benchmark
The OP has mentioned in a comment that using lubridate's date arithmetic dramatically increased run times.
According to my benchmark below, the penalty of using end_date %m+% years(1) is not that large. I have benchmarked three versions of the code:
v_1 is the corrected version from above.
v_2 pulls the type conversion and the date arithmetic out of the grouping part and creates two helper columns in advance.
v_3 is like v_2 but uses end_date %m+% years(1).
The benchmark is repeated for different problem sizes, i.e., total number of rows. Also, the number of different ids is varied as grouping may have an effect on performance. According to the OP, his full dataset of 500 k rows has 250 k unique ids which corresponds to an id_share of 0.5 (50%). In the benchmark id_shares of 0.5, 0.2, and 0.01 (50%, 20%, 1%) are simulated.
As sample_data is modified, each run starts with a fresh copy.
library(bench)
library(lubridate) # needed for %m+% years(1) in v_3
library(magrittr)
bm <- press(
id_share = c(0.5, 0.2, 0.01),
n_row = c(1000L, 10000L, 1e5L),
{
n_id <- max(1L, as.integer(n_row * id_share))
print(sprintf("Number of ids: %i", n_id))
set.seed(123L)
sample_data_0 <- lapply(seq(n_id), function(.id) data.table(
start_date = as.IDate("2000-01-01") + cumsum(sample(0:730, n_row / n_id, TRUE))
)) %>%
rbindlist(idcol = "id") %>%
.[, end_date := start_date + sample(30:360, n_row, TRUE)] %>%
.[, intervention_id := as.character(.I)]
mark(
v_1 = {
sample_data <- copy(sample_data_0)
setorder(sample_data, id, start_date, end_date)
sample_data[, period := cumsum(
as.integer(start_date) > shift(
cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
, by = id]
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
},
v_2 = {
sample_data <- copy(sample_data_0)
setorder(sample_data, id, start_date, end_date)
sample_data[, `:=`(start = as.integer(start_date),
end = as.integer(end_date) + 365)]
sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
, by = id]
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
},
v_3 = {
sample_data <- copy(sample_data_0)
setorder(sample_data, id, start_date, end_date)
sample_data[, `:=`(start = as.integer(start_date),
end = as.integer(end_date %m+% years(1)))]
sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
, by = id]
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
},
check = FALSE,
min_iterations = 3
)
}
)
ggplot2::autoplot(bm)
The result shows that the number of groups, i.e., the number of unique ids, has a stronger effect on the run time than the choice of code version. In case of many groups, creating the helper columns before grouping (v_2) gains performance.
