Merging data.table rows based on dates - r

Problem:
I have records with a start and end date for an intervention and I want to merge the rows according to the following rule:
For each ID, any intervention that begins within one year of the last intervention ending, merge the rows so that the start_date is the earliest start date of the two rows, and the end_date is the latest end_date of the two rows.
I also want to keep track of intervention IDs if they are merged.
There can be five scenarios:
Two rows have the same start date, but different end dates.
Start date....End date
Start date.........End date
The period between row 2's start and end date lies within the period of row 1's start and end date.
Start date...................End date
.......Start date...End date
Row 2's intervention starts within Row 1's intervention period but ends later.
Start date.....End date
.....Start date.............End date
Row 2 starts within one year of the end of Row 1.
Start date....End date
......................|....<= 1 year....|Start date...End date
Row 2 starts over one year after the end of Row 1.
Start date...End date
.....................|........ > 1 year..........|Start date...End date
I want to merge rows in cases 1 to 4 but not 5.
Data:
library(data.table)
sample_data <- data.table(id = c(rep(11, 3), rep(21, 4)),
start_date = as.Date(c("2013-01-01", "2013-01-01", "2013-02-01", "2013-01-01", "2013-02-01", "2013-12-01", "2015-06-01")),
end_date = as.Date(c("2013-06-01", "2013-07-01", "2013-05-01", "2013-07-01", "2013-09-01", "2014-01-01", "2015-12-01")),
intervention_id = as.character(1:7),
all_ids = as.character(1:7))
> sample_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-06-01 1 1
2: 11 2013-01-01 2013-07-01 2 2
3: 11 2013-02-01 2013-05-01 3 3
4: 21 2013-01-01 2013-07-01 4 4
5: 21 2013-02-01 2013-09-01 5 5
6: 21 2013-12-01 2014-01-01 6 6
7: 21 2015-06-01 2015-12-01 7 7
The final result should look like:
> merged_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2015-06-01 2015-12-01 7 7
I'm not sure if the all_ids column is the best way to keep track of the intervention_ids, so I'm open to ideas for that. (The intervention_ids don't need to be in any particular order in the all_ids column.)
It doesn't matter what the value of the intervention_id column is in rows that have been merged.
What I tried:
I started off by writing a function to deal with only those cases where the start date is the same. It's a very non-R, non-data.table way of doing it and therefore very inefficient.
mergestart <- function(unmerged) {
  n <- nrow(unmerged)
  mini_merged <- data.table(id = double(n),
                            start_date = as.Date(NA),
                            end_date = as.Date(NA),
                            intervention_id = character(n),
                            all_ids = character(n))
  merge_a <- function(unmerged, un_i, merged, m_i, no_of_records) {
    merged[m_i] <- unmerged[un_i]
    un_i <- un_i + 1
    while (un_i <= no_of_records) {
      if (merged[m_i]$start_date == unmerged[un_i]$start_date) {
        merged[m_i]$end_date <- max(merged[m_i]$end_date, unmerged[un_i]$end_date)
        merged[m_i]$all_ids <- paste0(merged[m_i]$all_ids, ",", unmerged[un_i]$intervention_id)
        un_i <- un_i + 1
      } else {
        m_i <- m_i + 1
        merged[m_i] <- unmerged[un_i]
        un_i <- un_i + 1
        merge_a(unmerged, un_i, merged, m_i, (no_of_records - un_i))
      }
    }
    return(merged)
  }
  mini_merged <- merge_a(unmerged, 1, mini_merged, 1, n)
  return(copy(mini_merged[id != 0]))
}
Using this function on just one id gives:
> mergestart(sample_data[id == 11])
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 11 2013-02-01 2013-05-01 3 3
To use the function on the whole dataset:
n <- nrow(sample_data)
all_merged <- data.table(id = double(n),
                         start_date = as.Date(NA),
                         end_date = as.Date(NA),
                         intervention_id = character(n),
                         all_ids = character(n))
start_i <- 1
for (i in unique(sample_data$id)) {
  id_merged <- mergestart(sample_data[id == i])
  end_i <- start_i + nrow(id_merged) - 1
  all_merged[start_i:end_i] <- copy(id_merged)
  start_i <- end_i
}
all_merged <- all_merged[id != 0]
> all_merged
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 21 2013-01-01 2013-07-01 4 4
3: 21 2013-02-01 2013-09-01 5 5
4: 21 2013-12-01 2014-01-01 6 6
5: 21 2015-06-01 2015-12-01 7 7
I also had a look at rolling joins but still can't work out how to use them in this situation.
This answer https://stackoverflow.com/a/48747399/6170115 looked promising but I don't know how to integrate all the other conditions and track the intervention IDs with this method.
Can anyone point me in the right direction?

There are related questions How to flatten / merge overlapping time periods and Consolidate rows based on date ranges, but none of them has the additional requirements posed by the OP.
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := {
tmp <- as.integer(start_date)
cumsum(tmp > shift(cummax(tmp + 365L), type = "lag", fill = 0L))
}, by = id][]
id start_date end_date intervention_id all_ids period
1: 11 2013-01-01 2013-06-01 1 1 1
2: 11 2013-01-01 2013-07-01 2 2 1
3: 11 2013-02-01 2013-05-01 3 3 1
4: 21 2013-01-01 2013-07-01 4 4 1
5: 21 2013-02-01 2013-09-01 5 5 1
6: 21 2013-12-01 2014-01-01 6 6 1
7: 21 2015-06-01 2015-12-01 7 7 2
For the sake of simplicity, it is assumed that one year has 365 days, which ignores leap years with 366 days. If leap years are to be considered, more sophisticated date arithmetic is required.
Unfortunately, cummax() has no method for arguments of class Date or IDate (data.table's integer version of Date). Therefore, the coercion from Date to integer is required.
# aggregate
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
Edit: Correction
I just noted that I had misinterpreted OP's requirements. The OP has requested (emphasis mine):
For each ID, any intervention that begins within one year of the last
intervention ending, merge the rows so that the start_date is the
earliest start date of the two rows, and the end_date is the latest
end_date of the two rows.
The solution above looks for gaps of one year in the sequence of start_date, but not between each start_date and the preceding end_date as requested. The corrected version is:
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := cumsum(
as.integer(start_date) > shift(
cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
, by = id][]
# aggregate
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
The result for the given sample dataset is identical for both versions, which is why the error slipped through unnoticed.
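To see how the corrected period counter behaves, here is a small standalone illustration using id 21's intervals from the sample data (my addition, not part of the original answer): a new period starts whenever the current start date lies more than 365 days beyond the running maximum of all earlier end dates.
# id 21's intervals from the sample data; shift() and cummax() as in the answer above (data.table loaded)
starts <- as.integer(as.Date(c("2013-01-01", "2013-02-01", "2013-12-01", "2015-06-01")))
ends   <- as.integer(as.Date(c("2013-07-01", "2013-09-01", "2014-01-01", "2015-12-01")))
new_period <- starts > shift(cummax(ends + 365L), type = "lag", fill = 0L)
cumsum(new_period)
# [1] 1 1 1 2  -> the first three intervals collapse into one period, the last one starts a new period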
Benchmark
The OP has mentioned in a comment that using lubridate's date arithmetic dramatically increased run times.
According to my benchmark below, the penalty for using end_date %m+% years(1) is not that large. I have benchmarked three versions of the code:
v_1 is the corrected version from above.
v_2 pulls the type conversion and the date arithmetic out of the grouping part and creates two helper columns in advance.
v_3 is like v_2 but uses end_date %m+% years(1).
The benchmark is repeated for different problem sizes, i.e., the total number of rows. Also, the number of distinct ids is varied, as grouping may have an effect on performance. According to the OP, his full dataset of 500 k rows has 250 k unique ids, which corresponds to an id_share of 0.5 (50%). In the benchmark, id_shares of 0.5, 0.2, and 0.01 (50%, 20%, and 1%) are simulated.
As sample_data is modified, each run starts with a fresh copy.
library(bench)
library(lubridate) # needed for %m+% and years() in v_3
library(magrittr)
bm <- press(
id_share = c(0.5, 0.2, 0.01),
n_row = c(1000L, 10000L, 1e5L),
{
n_id <- max(1L, as.integer(n_row * id_share))
print(sprintf("Number of ids: %i", n_id))
set.seed(123L)
sample_data_0 <- lapply(seq(n_id), function(.id) data.table(
start_date = as.IDate("2000-01-01") + cumsum(sample(0:730, n_row / n_id, TRUE))
)) %>%
rbindlist(idcol = "id") %>%
.[, end_date := start_date + sample(30:360, n_row, TRUE)] %>%
.[, intervention_id := as.character(.I)]
mark(
v_1 = {
sample_data <- copy(sample_data_0)
setorder(sample_data, id, start_date, end_date)
sample_data[, period := cumsum(
as.integer(start_date) > shift(
cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
, by = id]
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
},
v_2 = {
sample_data <- copy(sample_data_0)
setorder(sample_data, id, start_date, end_date)
sample_data[, `:=`(start = as.integer(start_date),
end = as.integer(end_date) + 365)]
sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
, by = id]
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
},
v_3 = {
sample_data <- copy(sample_data_0)
setorder(sample_data, id, start_date, end_date)
sample_data[, `:=`(start = as.integer(start_date),
end = as.integer(end_date %m+% years(1)))]
sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
, by = id]
sample_data[, .(start_date = start_date[1L],
end_date = max(end_date),
intervention_id = intervention_id[1L],
all_ids = toString(intervention_id)),
by = .(id, period)]
},
check = FALSE,
min_iterations = 3
)
}
)
ggplot2::autoplot(bm)
The result shows that the number of groups, i.e., the number of unique ids, has a stronger effect on the run time than the differences between the code versions. In the case of many groups, creating the helper columns before grouping (v_2) improves performance.

Related

Using foverlaps to truncate episodes

I'm trying to wrap my head around how to use data.table::foverlaps() to generate new data tables. In one application, I would like to use foverlaps to identify gaps and then use this information to truncate my original data table.
Suppose that I have a dataset (df1) of 2 employees (id) at a company with date ranges (start_date and end_date) for the periods in which they work on different projects
(proj_id; either "A", "B" or "C").
library(data.table)
library(lubridate)
df1<-data.table(id = rep(1:2,each=3),
start_date = ymd(c("1998-04-03","1999-03-08","2000-08-13",
"2005-03-03","2007-10-12","2014-02-23")),
end_date = ymd(c("1999-03-07","2000-08-12","2021-04-23",
"2007-09-05","2014-02-22","2019-05-04")),
proj_id = c("A","B","A","B","C","A"))
> df1
id start_date end_date proj_id
1: 1 1998-04-03 1999-03-07 A
2: 1 1999-03-08 2000-08-12 B
3: 1 2000-08-13 2021-04-23 A
4: 2 2005-03-03 2007-09-05 B
5: 2 2007-10-12 2014-02-22 C
6: 2 2014-02-23 2019-05-04 A
Now I have another dataset (df2) that specifies the time that I want to truncate from df1.
df2 <- data.table(id = 1:2,
start_date = ymd("1998-07-20", "2006-06-12"),
end_date = ymd("1998-08-15", "2016-04-08"))
> df2
id start_date end_date
1: 1 1998-07-20 1998-08-15
2: 2 2006-06-12 2016-04-08
I can then use data.table::foverlaps() to identify the overlapping episodes:
> setkey(df1,id,start_date,end_date)
> foverlaps(df2, df1, type="any",
+ by.x=c("id","start_date","end_date"))
id start_date end_date proj_id i.start_date i.end_date
1: 1 1998-04-03 1999-03-07 A 1998-07-20 1998-08-15
2: 2 2005-03-03 2007-09-05 B 2006-06-12 2016-04-08
3: 2 2007-10-12 2014-02-22 C 2006-06-12 2016-04-08
4: 2 2014-02-23 2019-05-04 A 2006-06-12 2016-04-08
I would now like to use this data to generate a new version of df1, where I generate new episodes by truncating the gaps identified above. My desired DT is therefore:
id start_date end_date proj_id
1: 1 1998-04-03 1998-07-19 A
2: 1 1998-08-16 1999-03-07 A
3: 1 1999-03-08 2000-08-12 B
4: 1 2000-08-13 2021-04-23 A
5: 2 2005-03-03 2006-06-11 B
6: 2 2016-04-09 2019-05-04 A
There may be alternatives that work better, but this could work based on your foverlaps result.
Assume you created another data.table called df3 with your foverlaps result:
df3 <- foverlaps(df2, df1, type = "any", by.x = c("id", "start_date", "end_date"))
Then you could iterate through each row, and add 0, 1, or 2 date ranges depending on overlap (truncate at end, or beginning, or entire range is blocked out).
dt <- data.table(start_date = as.Date(character()), end_date = as.Date(character()),
                 id = numeric(), proj_id = character())
for (i in seq_len(nrow(df3))) {
  if (df3$start_date[i] < df3$i.start_date[i]) {
    dt <- rbind(dt, data.table(start_date = df3$start_date[i], end_date = df3$i.start_date[i] - 1,
                               id = df3$id[i], proj_id = df3$proj_id[i]))
  }
  if (df3$end_date[i] > df3$i.end_date[i]) {
    dt <- rbind(dt, data.table(start_date = df3$i.end_date[i] + 1, end_date = df3$end_date[i],
                               id = df3$id[i], proj_id = df3$proj_id[i]))
  }
}
Finally, you can remove the foverlaps results from your initial df1, since new ranges have been determined for those (using fsetdiff). Then you can add the new ranges back.
rbind(fsetdiff(df1, df3[,1:4]), dt)[order(id, start_date)]
Output
id start_date end_date proj_id
1: 1 1998-04-03 1998-07-19 A
2: 1 1998-08-16 1999-03-07 A
3: 1 1999-03-08 2000-08-12 B
4: 1 2000-08-13 2021-04-23 A
5: 2 2005-03-03 2006-06-11 B
6: 2 2016-04-09 2019-05-04 A
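If the row-by-row loop becomes a bottleneck on larger data, the same truncation can be expressed without a loop. The following is only a sketch of that idea (not part of the original answer), assuming df1, df2 and df3 as defined above; it builds the left and right remainders of each overlapped episode in two vectorized steps and should give the same result.
# remainder before the blocked window (only where the episode starts earlier)
left  <- df3[start_date < i.start_date,
             .(id, start_date, end_date = i.start_date - 1L, proj_id)]
# remainder after the blocked window (only where the episode ends later)
right <- df3[end_date > i.end_date,
             .(id, start_date = i.end_date + 1L, end_date, proj_id)]
# replace the overlapped episodes in df1 by their truncated remainders
rbind(fsetdiff(df1, df3[, .(id, start_date, end_date, proj_id)]),
      left, right)[order(id, start_date)]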

Group consecutive working day dates (excluding weekends and holidays)

I have data on different persons (ID), the dates they have worked (Date), and how many hours they have worked each date (Hours).
Monday - Friday are considered working days. For each ID, I want to group consecutive working days. Weekends and holidays should be omitted when deciding if working days are consecutive.
Some examples:
If a person worked Monday, Tuesday and Wednesday, then skipped Thursday, and worked again on Friday, then Monday - Wednesday is considered one group and Friday another group.
If a person worked Thursday & Friday one week and Monday & Tuesday the next week, then these four days should be in the same group. Thus, the days in the weekend are omitted when checking if working days are consecutive.
If Monday - Friday of one week and Monday - Friday of the next week, then I'd count those whole two weeks as consecutive working days.
I would also like to take into account common US Holidays (e.g. New Year), such that 12/31/2020 to 1/4/2021 would still count as consecutive workdays.
Once the grouping variable is created, I want to sum the working hours in each group.
Example data:
df1 = structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3,
3, 3), Date = structure(c(18781, 18782, 18785, 18750, 18687,
18688, 18626, 18627, 18631, 18634), class = "Date"), Hours = c(8,
8, 8, 16, 8, 8, 8, 8, 8, 8)), row.names = c(NA, -10L), class = "data.frame")
ID Date Hours
1 1 2021-06-03 8
2 1 2021-06-04 8
3 1 2021-06-07 8
4 1 2021-05-03 16
5 2 2021-03-01 8
6 2 2021-03-02 8
7 3 2020-12-30 8
8 3 2020-12-31 8
9 3 2021-01-04 8
10 3 2021-01-07 8
I imagine my output to look something like this:
ID Date1 Date2 Hours
1 1 2021-06-03 2021-06-07 24
# the weekend, June 5-6, is omitted
# when the group of consecutive working days is created
2 1 2021-05-03 2021-05-03 16
3 2 2021-03-01 2021-03-02 16
4 3 2020-12-30 2021-01-04 24
# the public holiday (Jan 1) and the weekend (Jan 2-3) are omitted
5 3 2021-01-07 2021-01-07 8
My top priority is to at least get the consecutive work week figured out, the holiday portion would be an added bonus.
You may use RQuantLib::businessDaysBetween. For each ID (by = ID), calculate the number of business days between each row, i.e. provide vectors of "lag" (head(Date, -1)) and "lead" (tail(Date, -1)) as from and to dates. Pick a relevant calendar (see Details in ?businessDaysBetween)
For each ID and run of consecutive business days (by = .(ID, g = cumsum(d != 1L))), select the first and last date (from = Date[1], to = Date[.N]) and sum the Hours (sum(Hours))
library(data.table)
library(RQuantLib)
setDT(df1)
df1[ , d := c(1, businessDaysBetween(calendar = "UnitedStates",
from = head(Date, -1), to = tail(Date, -1))),
by = ID]
df1[ , .(from = Date[1], to = Date[.N], Hours = sum(Hours)),
by = .(ID, g = cumsum(d != 1L))]
# ID g from to Hours
# 1: 1 0 2021-06-03 2021-06-07 24
# 2: 1 1 2021-05-03 2021-05-03 16
# 3: 2 1 2021-03-01 2021-03-02 16
# 4: 3 1 2020-12-30 2021-01-04 24
# 5: 3 2 2021-01-07 2021-01-07 8
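As a quick sanity check of the calendar logic (my addition, not part of the original answer): the gap between 2020-12-31 and 2021-01-04 should come out as a single business day, because the New Year's Day holiday and the following weekend are skipped by the "UnitedStates" calendar.
library(RQuantLib)
# expected: 1, i.e. the two dates are treated as consecutive working days
businessDaysBetween(calendar = "UnitedStates",
                    from = as.Date("2020-12-31"), to = as.Date("2021-01-04"))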
More convoluted solution (pre-businessDaysBetween):
Create a full sequence of dates within each ID (df1[ , .(Date = seq(min(Date), max(Date), by = "1 day")), by = ID]). Join with the original data on ID and Date (df1[..., on = .(ID, Date)]). For dates not present in the original data, i.e. gaps between (originally) consecutive days, Hours will be NA.
Within each ID (by = ID), create a run length index based on missing Hours (rleid(is.na(Hours))). For rows with missing Hours (d[is.na(Hours)]), i.e. the gaps in the original time series, for each ID and run (by = .(ID, r)), check if all dates are either a weekend (wday(Date) %in% c(1, 7)) or (|) a public holiday* (Date %in% as.Date(holidayNYSE(unique(year(Date))))), and create an index variable, ix.
For original rows and weekend/holiday gaps (!is.na(Hours) | ix), create a grouping variable of consecutive dates (g = cumsum(c(TRUE, diff(Date) != 1L))). For each ID and run of consecutive dates (by = .(ID, g)), select the first and last date (from = Date[1], to = Date[.N]) and sum the Hours (sum(Hours, na.rm = TRUE))
library(data.table)
library(timeDate)
setDT(df1)
d = df1[df1[ , .(Date = seq(min(Date), max(Date), by = "1 day")), by = ID],
on = .(ID, Date)]
d[ , r := rleid(is.na(Hours)), by = ID]
d[is.na(Hours), ix := all(
wday(Date) %in% c(1, 7) |
Date %in% as.Date(holidayNYSE(unique(year(Date)))))
, by = .(ID, r)]
d[!is.na(Hours) | ix, .(Date, Hours, g = cumsum(c(TRUE, diff(Date) != 1L))),
by = ID][
, .(from = Date[1], to = Date[.N],
Hours = sum(Hours, na.rm = TRUE)),
by = .(ID, g)]
# ID g from to Hours
# 1: 1 1 2021-05-03 2021-05-03 16
# 2: 1 2 2021-06-03 2021-06-07 24
# 3: 2 1 2021-03-01 2021-03-02 16
# 4: 3 1 2020-12-30 2021-01-04 24
# 5: 3 2 2021-01-07 2021-01-07 8
*Please see the timeDate manual for other definitions of holiday.
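If you need a holiday set other than the NYSE calendar, only the membership test changes. Here is a sketch with a hand-made vector of holidays (my_holidays is a hypothetical example, not part of the original answer):
# any Date vector can stand in for holidayNYSE() in the membership test
my_holidays <- as.Date(c("2021-01-01", "2021-07-05"))
d[is.na(Hours), ix := all(wday(Date) %in% c(1, 7) | Date %in% my_holidays),
  by = .(ID, r)]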

How to calculate sumifs across a date range referencing two tables of unequal length?

I have two data tables. The first, DT1, is grouped by ID. It has a Week column that contains an instance of every consecutive week in a large time frame for every ID. It also contains a Units column. It looks something like this simplified version:
ID Week Units
A 1/1/2019 5
A 1/8/2019 9
A 1/15/2019 0
A 1/22/2019 1
B 1/1/2019 1
B 1/8/2019 32
B 1/15/2019 2
B 1/22/2019 6
C 1/1/2019 0
C 1/8/2019 0
C 1/15/2019 8
C 1/22/2019 3
The second table, DF2, has a time range window for every ID. This is represented by a start date column and a stop date column. It looks something like this simplified version:
ID Start Date Stop Date
A 1/1/2019 1/8/2019
B 1/8/2019 1/22/2019
C 1/8/2019 1/15/2019
I would like to calculate the sum of the units column for every date range/buyer combination in DF2. My desired output would be:
ID Start_Date Stop_Date sumUnits
A 1/1/2019 1/8/2019 14
B 1/8/2019 1/22/2019 40
C 1/8/2019 1/15/2019 8
Is there a way to calculate this type of sum in R?
I have referenced the article In R: how to sum a variable by group between two dates, and have attempted the interval function and a non-equi join.
DT1[DT2[DT1, sum(x), on = .(Units, Week >= Stop_Date, Week <= Stop_Date), by = .EACHI],
    newvar := V1, on = .(Units, Start_Date = Week)]
It returns the message, "Week" not in i. How do I solve the problem?
If you don't mind using dplyr instead of data.table, you can use the fuzzyjoin package:
library(dplyr)
library(fuzzyjoin)
DF1 = data.frame(
ID = c(rep("A", 4), rep("B", 4), rep("C", 4)),
Week = rep(as.Date(c("1/1/2019", "1/8/2019", "1/15/2019", "1/22/2019"), format = "%m/%d/%Y")),
Units = c(5, 9, 0, 1, 1, 32, 2, 6, 0, 0, 8, 3)
)
DF2 = data.frame(
ID = c("A", "B", "C"),
Start.Date = as.Date(c("1/1/2019", "1/8/2019", "1/8/2019"), format = "%m/%d/%Y"),
Stop.Date = as.Date(c("1/8/2019", "1/22/2019", "1/15/2019"), format = "%m/%d/%Y")
)
fuzzy_inner_join(
DF1, DF2,
by = c("ID", "Week" = "Start.Date", "Week" = "Stop.Date"),
match_fun = list(`==`, `>=`, `<=`)
) %>%
group_by(ID.x, Start.Date, Stop.Date) %>%
summarize(sumUnits = sum(Units))
I have referenced the article, In R: how to sum a variable by group between two dates, as well as attempted the interval function and a non-equi join.
Here's a variation on @akrun's answer that works:
library(data.table)
setDT(DF1)
setDT(DF2)
DF2[, v :=
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by=.EACHI]$V1
]
ID Start.Date Stop.Date v
1: A 2019-01-01 2019-01-08 14
2: B 2019-01-08 2019-01-22 40
3: C 2019-01-08 2019-01-15 8
I am using the input data as created in @A.S.K.'s answer.
To see how it works, try running simpler parts of it:
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by=.EACHI]$V1
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by=.EACHI]
DF1[DF2, on=.(ID, Week >= Start.Date, Week <= Stop.Date)]
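Two things can be confusing when inspecting those intermediate results (my notes, not part of the original answer): in a data.table join the join columns display the values from the i table, and an unnamed aggregate is returned in a column called V1, which is why the full expression extracts $V1.
# the join columns of the result take their values from i (DF2 here),
# so don't expect to see DF1's original Week values in the output
DF1[DF2, on = .(ID, Week >= Start.Date, Week <= Stop.Date)]
# by = .EACHI aggregates once per row of DF2; the unnamed sum(x.Units) comes back
# as a column named V1, hence the $V1 in the assignment above
DF1[DF2, on = .(ID, Week >= Start.Date, Week <= Stop.Date), sum(x.Units), by = .EACHI]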

Filtering observations based on specific date condition using data.table

I have a set of observations, which are recorded every time a user has taken an action. I want to filter only those observations from a user which are six or more months apart.
So, if a user has taken this action on "2018-01-01", "2018-03-01" and "2018-07-01", I only want to keep only "2018-01-01" and "2018-07-01".
Similarly, if a user has taken an action on "2018-01-01", "2018-03-01", "2018-07-01" and "2019-03-01" I want to keep only "2018-01-01", "2018-07-01", "2019-03-01".
So far, I have produced long and unworkable code.
# What I want to achieve
library(data.table)
dataIhave <- data.table(id = c(1, 1, 1, 1, 2, 2, 3, 4),
dates = c("2018-01-01",
"2018-03-01",
"2018-07-01",
"2019-01-01",
"2018-01-03",
"2018-07-02",
"2018-02-01",
"2018-02-01"))
dataIwant <- data.table(id = c(1, 1, 1, 2, 3, 4),
dates = c("2018-01-01",
"2018-07-01",
"2019-01-01",
"2018-01-01",
"2018-02-01",
"2018-02-01"))
This is a rolling-join variant of @Uwe's answer:
library(lubridate)
dataIhave[, dates := as.IDate(dates)]
ids = unique(dataIhave$id)
dataIhave[, seq := NA_integer_]
s = 1L
w = dataIhave[.(ids), on=.(id), mult="first", which = TRUE]
dataIhave[w, seq := s]
while (TRUE) {
  w = dataIhave[
    dataIhave[w, .(id, dates = dates %m+% months(6))],
    on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE
  ]
  if (!length(w)) break
  s = s + 1L
  dataIhave[w, seq := s]
}
dataIhave[!is.na(seq)]
id dates seq
1: 1 2018-01-01 1
2: 1 2018-07-01 2
3: 1 2019-01-01 3
4: 2 2018-01-03 1
5: 3 2018-02-01 1
6: 4 2018-02-01 1
The loop takes rows w defined per id, steps their dates forward six months, and takes the next row found, if any. The arguments to the join are:
The tables, with join syntax x[i, ...]
x = dataIhave
i = dataIhave[w, .(id, dates = dates %m+% months(6))]
on = .(id, dates): columns to match by
roll = -Inf: find the next match on the last column in on=
nomatch = 0: if no match is found, skip
which = TRUE: return matched row number
Additionally, if there are duplicate dates (see the second example in @Uwe's post):
mult = "first": take only the first match for each row of i
In selecting the first row by id before the loop, I'm assuming the data is sorted by dates within id (so I'm not using order as @Uwe's answer does).
If I understand correctly, the OP wants to drop those dates which are less than six months after the beginning of a period, and to start a new period at the first date that is six months or more after the beginning of the previous period (separately for each id).
I have no idea how this can be accomplished by a non-recursive rolling or non-equi join as there is no fixed grid of dates. So, I believe it requires a kind of recursive approach, somehow. Here is one possibility:
library(data.table)
library(lubridate)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := TRUE]
dataIhave[order(id, dates),
          keep := {
            start <- dates[1L]
            for (i in tail(seq_along(dates), -1L)) {
              if (dates[i] < start %m+% months(6)) {
                keep[i] <- FALSE
              } else {
                start <- dates[i]
              }
            }
            keep
          }, by = id][]
id dates keep
1: 1 2018-01-01 TRUE
2: 1 2018-03-01 FALSE
3: 1 2018-07-01 TRUE
4: 1 2019-01-01 TRUE
5: 2 2018-01-03 TRUE
6: 2 2018-07-02 FALSE
7: 3 2018-02-01 TRUE
8: 4 2018-02-01 TRUE
Finally,
dataIhave[(keep), -"keep"]
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
2nd test case
The crucial point here is to detect the beginning of a new period (within each id).
As an additional test case I have added two dates to id == 1,
2018-07-01 and 2018-07-02.
2018-07-01 is a duplicate. Both dates should be removed as they lie within the second 6-month period starting at 2018-07-01.
dataIhave <- fread("
id dates
1 2018-01-01
1 2018-03-01
1 2018-07-01
1 2018-07-01
1 2018-07-02
1 2019-01-01
2 2018-01-03
2 2018-07-02
3 2018-02-01
4 2018-02-01")
Indeed, the code above returns the same output as with OP's original test case.
Remove rows only within the first six months for each id
If the question is interpreted as removing only the entries within the first 6-month period for each id and keeping all the dates after 6 months, this can be achieved by
dataIhave[!dataIhave[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1]
which returns
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2018-07-01
4: 1 2018-07-02
5: 1 2019-01-01
6: 2 2018-01-03
7: 3 2018-02-01
8: 4 2018-02-01
for the second test case. (Note that this is a streamlined version of Jaap's answer.)
Another variant:
library(lubridate)
library(data.table)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := dates >= dates[1] %m+% months(6), by = id
][dataIhave[, .I[1], by = id][[2]], keep := TRUE
][!!keep, -"keep"]
which gives:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
Using non-equi join and igraph to avoid implicit loops and recursion:
#data prep
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
setorder(dataIhave[, rn:=rowid(id)], id, dates)
dataIhave[, end := as.IDate(sapply(dates,
function(d) seq(d, by="6 months", length.out=2L)[2L]))]
#non-equi self join to find first date that is after 6months
nonequi <- dataIhave[dataIhave, on=.(id, dates>=end), mult="first", by=.EACHI,
.(i.id, i.rn, x.rn, i.dates, x.dates)]
library(igraph)
nonequi[, {
#create graph from the previous join
g <- graph_from_data_frame(.SD[, .(i.rn, x.rn)])
#plot(g)
#find the leaf nodes
leaf <- sapply(V(g), function(x) length(neighbors(g,x))==0L)
#from the first date (i.e. node = V(g)["1"]), find the path starting from this date.
path <- get.all.shortest.paths(g, V(g)["1"], leaf)$res
#return all dates (i.e. nodes) in this path
.(dates=i.dates[i.rn %in% na.omit(V(g)[path[[1L]]]$name)])
},
by=.(id=i.id)]
output:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
Or a recursive approach similar to Uwe's solution:
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
unique(dataIhave[,
.(dates=as.IDate(Reduce(
function(x, y) if (y >= seq(x, by="6 months", length.out=2L)[2L]) y else x,
dates,
accumulate=TRUE))),
.(id)])
output:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
library(lubridate)
library(data.table)
dataIhave[, dates := ymd(dates)]
# assuming the intent was the gap (in weeks) to the previous action within each id
dataIhave[, difDates := as.numeric(difftime(dates, shift(dates, fill = dates[1]), units = "weeks")), by = id]
dataIhave[difDates >= 24 | difDates == 0, .(id, dates)]
Does this produce the result you want?
Months have irregular durations so you'll have to stick to a time unit of fixed duration.
You can also check ?lubridate::interval, lubridate::as.duration and this question: Time difference in years with lubridate?
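A small illustration of the difference (my example, not from the linked question): adding six calendar months with lubridate's %m+% is not the same as adding a fixed number of days or weeks.
library(lubridate)
as.Date("2018-08-31") %m+% months(6)  # "2019-02-28": calendar months, clamped to the month end
as.Date("2018-08-31") + days(182)     # "2019-03-01": a fixed duration of 26 weeks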

How to remove rows from a dataframe that have overlapping start and end dates in R

I would like to remove products from a dataframe that have overlapping start and end dates to avoid duplicates in a subsequent step.
Example data:
library(dplyr)
d <-
bind_rows(
data.frame(product = 1,
start_date = as.Date("2016-01-01"),
end_date = as.Date("2016-01-10"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-02"),
end_date = as.Date("2016-01-04"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-05"),
end_date = as.Date("2016-06-09"),
stringsAsFactors = FALSE),
data.frame(product = 2,
start_date = as.Date("2016-01-03"),
end_date = as.Date("2016-01-07"),
stringsAsFactors = FALSE)
)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-02 2016-01-04
3 1 2016-01-05 2016-06-09
4 2 2016-01-03 2016-01-07
From this example I would like to remove rows 2 and 3 because of the overlaps.
I've used the lag function to remove overlaps that are next to each other:
d_cleaned <-
d %>%
arrange(product, start_date, end_date) %>%
mutate(overlapping = product == lag(product) & start_date <= lag(end_date) & end_date >= lag(start_date)) %>% # define overlaps
mutate(overlapping = ifelse(is.na(overlapping), FALSE, overlapping)) %>% # dont delete the first row
filter(overlapping == FALSE) %>% # remove overlaps
select(-overlapping)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-05 2016-06-09
3 2 2016-01-03 2016-01-07
As can be seen above this step removes overlaps on consecutive rows but not all.
I can solve this with a loop, but I was hoping that someone might be able to suggest a non-looping solution as the dataframe is quite large and each step takes a while.
Using non-equi joins from the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
setDT(d) # convert 'd' to a data.table by reference
idx = d[d, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE]
d[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
# product start_date end_date
# 1: 1 2016-01-01 2016-01-10
# 2: 1 2016-06-10 2016-06-12
# 3: 2 2016-01-03 2016-01-07
For each row in d (the one inside the square bracket), we find any kind of overlap with d (on the outside), i.e., a self-join, based on the condition provided to the on argument, and we extract the index of the first overlap (because which=TRUE and mult="first").
If and only if the first overlap is with itself, we return them. We discard all other intervals.
To install devel version, see installation instructions here.
Here's a benchmark on slightly more rows (the data is not by any means large):
set.seed(1L)
require(data.table) # v1.9.7+
dates = as.Date(sample(16000:17000, 1e5, TRUE), origin="1970-01-01")
dt = data.table(product=sample(100, 1e5, TRUE),
start_date = sample(dates, 1e5, TRUE),
end_date = sample(dates, 1e5, TRUE))
dt[, `:=`(start_date = pmin(start_date, end_date),
end_date = pmax(start_date, end_date))]
system.time({
idx = dt[dt, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE, verbose=TRUE]
ans = dt[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
})
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0.041 secs
# Recomputing forder with non-equi ids ... done in 0.005 secs
# Found 178 non-equi group(s) ...
# Starting bmerge ...done in 2.359 secs
# user system elapsed
# 2.402 0.011 2.421
head(ans)
# product start_date end_date
# 1: 71 2015-12-04 2016-03-22
# 2: 71 2014-04-12 2015-05-01
# 3: 32 2013-11-23 2015-03-18
# 4: 56 2014-07-29 2015-12-26
# 5: 88 2015-03-08 2015-03-21
# 6: 69 2014-10-31 2015-07-05
nrow(ans)
# [1] 186
I believe the following will work
d <- cbind(ID=1:nrow(d),d)
d_cleaned <- d[rep(1:nrow(d), times=nrow(d)),] %>% ## 1
setNames(.,paste0(names(.),"_other")) %>% ## 2
bind_cols(d[rep(1:nrow(d), each=nrow(d)),], .) %>% ## 3
arrange(product,start_date,end_date) %>% ## 4
filter(product == product_other) %>% ## 5
mutate(overlapping = ID_other < ID &
start_date <= end_date_other &
end_date >= start_date_other) %>% ## 6
group_by(ID) %>%
filter(all(overlapping==FALSE)) %>% ## 7
ungroup() %>%
select(product,start_date,end_date) %>%
distinct()
print(d_cleaned)
### A tibble: 2 x 3
## product start_date end_date
## <dbl> <date> <date>
##1 1 2016-01-01 2016-01-10
##2 2 2016-01-03 2016-01-07
First, add a column of IDs that identifies the rows of the data frame; this is used later with group_by to determine whether a row overlaps with any other row. The key is to be able to consider all distinct pairs of rows with the same product when testing for overlap. The above code does this by expanding the data as in an outer join. Specifically,
Replicate d nrow(d) times
Change the names of the columns by appending _other to them so that they can be referenced separately from the original column names in the overlap test
Replicate each row of d nrow(d) times and append the result from (2) as new columns
The result of (3) have rows that enumerates all pairs of rows from the original data frame. Then:
Sort them as you did.
Consider only pairs where the product matches. Do this first to minimize not needed comparisons later
Do the overlap test. Here the comparison is made only against earlier rows of the original data frame (ID_other < ID). This considers all lags while avoiding flagging a row against itself (every row trivially overlaps itself).
Group by the ID (each row in original data frame) and keep those for which all overlapping is FALSE
At this point, the result contains only the non-overlapping rows in the original data frame. However, there are all those extra columns, and there are duplicates where multiple rows overlap with a row. The rest of the code cleans that up.
I have tested it with the following data (augmented yours to add a few more test conditions, but far from exhaustive):
d <- structure(list(product = c(1, 1, 1, 1, 1, 2, 2), start_date = structure(c(16801,
16802, 16805, 16811, 16962, 16803, 16806), class = "Date"), end_date = structure(c(16810,
16804, 16961, 16961, 16964, 16807, 16810), class = "Date")), .Names = c("product",
"start_date", "end_date"), row.names = c(NA, -7L), class = "data.frame")
and got the following results:
# A tibble: 3 x 3
product start_date end_date
<dbl> <date> <date>
1 1 2016-01-01 2016-01-10
2 1 2016-06-10 2016-06-12
3 2 2016-01-03 2016-01-07
Hope this helps.
