Increasing code execution time-efficiency using data.table and for-loop - r

Problem: How can I make the for-loop in the code below run more efficiently? For this toy example it finishes in a reasonable amount of time. However, unique_ids will in practice be a vector of roughly 8000 entries, and the for-loop then slows the computation down heavily. Any ideas? Many thanks!
Purpose:
Retrospectively cluster the IIDs for each day into hop and top, based on the calculation logic in the for-loop.
Initial Data:
IID ENTRY FINISH TARGET max_finish_target_date
1: 1 2020-02-11 2020-02-19 2020-02-15 2020-02-19
2: 2 2020-02-13 2020-02-17 2020-02-19 2020-02-19
Final (Target) Data:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 hop
3: 1 2020-02-12 hop
4: 1 2020-02-13 hop
5: 1 2020-02-14 hop
6: 1 2020-02-15 hop
7: 1 2020-02-16 top
8: 1 2020-02-17 top
9: 1 2020-02-18 top
10: 1 2020-02-19 top
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 hop
15: 2 2020-02-14 hop
16: 2 2020-02-15 hop
17: 2 2020-02-16 hop
18: 2 2020-02-17 hop
19: 2 2020-02-18
20: 2 2020-02-19
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 hop
27: 3 2020-02-16 hop
28: 3 2020-02-17 top
29: 3 2020-02-18 top
30: 3 2020-02-19 top
Code
rm(list = ls())
library(data.table)
library(lubridate)  # needed for ymd() and ymd_hms() below
# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
                         ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
                         FINISH = c("2020-02-19", "2020-02-17", ""),
                         TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))
initial_dt[, ":="(ENTRY = ymd(ENTRY),
                  FINISH = ymd(FINISH),
                  TARGET = ymd(TARGET))]
initial_dt[is.na(FINISH), FINISH := as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d")]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]
# Specify target data shape and output format
unique_ids <- c(1, 2, 3)
dts <- seq(as.Date("2020-02-10", format = "%Y-%m-%d"), as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d"), by = "days")
ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)
final_dt <- data.table(IID = ids,
                       Dates = rep(dts, times = len))
# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids){
  final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY], initial_dt[IID == d_id, max_finish_target_date])),
           ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) & (Dates <= initial_dt[IID == d_id, max_finish_target_date]),
                               "hop",
                               "top")]
}

Your loop doesn't produce the output you show. The following non-equi joins produce that output but could easily be adjusted for other rules (e.g. those from your for loop):
final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
These joins should be very fast.
Result:
# IID Dates ind_frist
# 1: 1 2020-02-10 <NA>
# 2: 1 2020-02-11 hop
# 3: 1 2020-02-12 hop
# 4: 1 2020-02-13 hop
# 5: 1 2020-02-14 hop
# 6: 1 2020-02-15 hop
# 7: 1 2020-02-16 top
# 8: 1 2020-02-17 top
# 9: 1 2020-02-18 top
#10: 1 2020-02-19 top
#11: 2 2020-02-10 <NA>
#12: 2 2020-02-11 <NA>
#13: 2 2020-02-12 <NA>
#14: 2 2020-02-13 hop
#15: 2 2020-02-14 hop
#16: 2 2020-02-15 hop
#17: 2 2020-02-16 hop
#18: 2 2020-02-17 hop
#19: 2 2020-02-18 <NA>
#20: 2 2020-02-19 <NA>
#21: 3 2020-02-10 <NA>
#22: 3 2020-02-11 <NA>
#23: 3 2020-02-12 <NA>
#24: 3 2020-02-13 <NA>
#25: 3 2020-02-14 <NA>
#26: 3 2020-02-15 hop
#27: 3 2020-02-16 hop
#28: 3 2020-02-17 top
#29: 3 2020-02-18 top
#30: 3 2020-02-19 top
# IID Dates ind_frist
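To sanity-check the speed claim at the scale mentioned in the question (~8000 IDs), here is a standalone sketch with made-up per-ID windows; the sizes, date offsets and variable names are assumptions, not part of the question's data:

```r
library(data.table)
set.seed(1)
n <- 8000
dts <- seq(as.Date("2020-02-10"), as.Date("2020-02-19"), by = "day")
# made-up per-ID windows: ENTRY <= TARGET and ENTRY <= FINISH by construction
initial <- data.table(IID = 1:n, ENTRY = sample(dts, n, replace = TRUE))
initial[, TARGET := ENTRY + sample(0:5, n, replace = TRUE)]
initial[, FINISH := ENTRY + sample(0:8, n, replace = TRUE)]
final <- CJ(IID = initial$IID, Dates = dts)   # 80,000 rows
system.time({
  final[initial, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
  final[initial, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
})
```

On typical hardware the two update joins classify all 80,000 rows in a fraction of a second, without any per-ID loop.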

A possible alternative using a data.table join:
final_dt[initial_dt,
         on = .(IID),
         ind_frist := c("", "top", "hop")[1L + (Dates > TARGET & Dates <= max_finish_target_date) +
                                            Dates %between% .(ENTRY, max_finish_target_date)]][]
which gives:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 top
3: 1 2020-02-12 top
4: 1 2020-02-13 top
5: 1 2020-02-14 top
6: 1 2020-02-15 top
7: 1 2020-02-16 hop
8: 1 2020-02-17 hop
9: 1 2020-02-18 hop
10: 1 2020-02-19 hop
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 top
15: 2 2020-02-14 top
16: 2 2020-02-15 top
17: 2 2020-02-16 top
18: 2 2020-02-17 top
19: 2 2020-02-18 top
20: 2 2020-02-19 top
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 top
27: 3 2020-02-16 top
28: 3 2020-02-17 hop
29: 3 2020-02-18 hop
30: 3 2020-02-19 hop
This is the same as the output of the for-loop.
Some explanation: the expression 1L + (Dates > TARGET & Dates <= max_finish_target_date) + Dates %between% .(ENTRY, max_finish_target_date) creates an index vector of ones, twos and threes with the same length as the number of rows of final_dt. Putting that vector between square brackets after c("", "top", "hop") maps each one to an empty string, each two to "top" and each three to "hop".
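The indexing trick is easiest to see on a small standalone example; the dates below are made up to hit all three cases, with FINISH playing the role of max_finish_target_date:

```r
library(data.table)  # for %between%
ENTRY  <- as.Date("2020-02-11")
TARGET <- as.Date("2020-02-15")
FINISH <- as.Date("2020-02-19")
Dates  <- as.Date(c("2020-02-10", "2020-02-12", "2020-02-16"))
# 1 = outside the window, 2 = inside but not past TARGET, 3 = past TARGET
idx <- 1L + (Dates > TARGET & Dates <= FINISH) +
  Dates %between% list(ENTRY, FINISH)
c("", "top", "hop")[idx]
# ""    "top"  "hop"
```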

Related

Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table( id = seq(1,21),
type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
"2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
"2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
"2020-05-28", "2020-08-25", "2020-12-14")),
end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
"2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
"2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
"2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
geom_segment(size = 2) +
xlab("") +
ylab("") +
theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14, however, intervention type 2 is also happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if also possible, I would like to organise the data such that I get all the separate time periods extracted, meaning the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-14
2: 2 3 2015-04-15 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers using dplyr, or answers that just mark the overlapping time periods, but not one for my specific case.
Is there an efficient way to calculate this using data.table?
This method does a small explosion by looking at every date in the range, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr) # for the %>% pipe used below
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
  .[, .N, by = .(start_dt, type)] %>%
  .[!is.na(type), ] %>%
  dcast(start_dt ~ type, value.var = "N") %>%
  .[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt")] %>%
  .[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
        start_dt = min(start_dt), end_dt = max(start_dt)), by = r]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
It drops the id field; I don't know how to map it back well to your original data.
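With the collapsed periods in hand, the day totals the question asks for are one grouped sum away. A standalone sketch on a hypothetical mini result table (the three rows mirror the first periods above, so the totals match the question's own walkthrough of 95 + 45 pure days and 21 overlap days):

```r
library(data.table)
# hypothetical mini version of the collapsed output (type 3 = overlap)
periods <- data.table(
  type     = c("1", "3", "1"),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-05-05")),
  end_dt   = as.Date(c("2015-04-13", "2015-05-04", "2015-06-18"))
)
# total days per type, counting both endpoints
periods[, .(days = sum(as.integer(end_dt - start_dt) + 1L)), by = type]
# type 1: 140 days (95 + 45); type 3 (overlap): 21 days
```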
@r2evans' solution is more complete, but if you want to explore the use of foverlaps you can start with something like this:
#split into two frames
data = split(data,by="type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# create the rows that have overlaps
overlap = foverlaps(data[[1]],data[[2]], type="any", nomatch=0)
# get the overlapping time periods
overlap[, .(start_dt = max(start_dt,i.start_dt), end_dt=min(end_dt,i.end_dt)), by=1:nrow(overlap)][,type:=3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days is 1492.

Define periods/episodes of exposure with overlapping and concatenated intervals of time

I'm trying to identify periods/episodes of exposure to a drug from its prescriptions. If two prescriptions are separated by more than 30 days, the later one is considered the start of a new period/episode of exposure. Prescriptions can overlap for some time or be consecutive; if the gap between two consecutive prescriptions is not greater than 30 days, it is not considered a new episode.
I have data like this:
id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference between each start and the previous end observation (last_diffdays):
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows when an overlap happens (negative values) or not (positive values). I think an ifelse/fcase statement here would be a bad idea and I'm not comfortable doing it.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposure (noexp_days) from the previous prescription.
I don't know if I'm on the right path, but I think I need to calculate the noexp_days variable and then take cumsum((noexp_days)>30)+1.
If there is a much better solution I'm not seeing, or any other possibility I haven't considered, I'd appreciate reading about it.
Thanks in advance for any help! :)
Try :
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]
DT[, period := cumsum(noexp_days > 30) + 1, id]
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3
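The episode-numbering idea can be seen in isolation on a plain vector of gap days (a standalone sketch; the gaps are taken from id 3 in the output above):

```r
# gap in days before each prescription; a gap above 30 starts a new episode
noexp_days <- c(0, 16, 15, 32, 61)
period <- cumsum(noexp_days > 30) + 1
period
# 1 1 1 2 3   (matches id 3 in the output above)
```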

Generating test data in R

I am trying to generate this table as one of the inputs to a test.
id diff d
1: 1 2 2020-07-31
2: 1 1 2020-08-01
3: 1 1 2020-08-02
4: 1 1 2020-08-03
5: 1 1 2020-08-04
6: 2 2 2020-07-31
7: 2 1 2020-08-01
8: 2 1 2020-08-02
9: 2 1 2020-08-03
10: 2 1 2020-08-04
11: 3 2 2020-07-31
12: 3 1 2020-08-01
13: 3 1 2020-08-02
14: 3 1 2020-08-03
15: 3 1 2020-08-04
16: 4 2 2020-07-31
17: 4 1 2020-08-01
18: 4 1 2020-08-02
19: 4 1 2020-08-03
20: 4 1 2020-08-04
21: 5 2 2020-07-31
22: 5 1 2020-08-01
23: 5 1 2020-08-02
24: 5 1 2020-08-03
25: 5 1 2020-08-04
id diff d
I have done it like this -
input1 = data.table(id=as.character(1:5), diff=1)
input1 = input1[,.(d=seq(as.Date('2020-07-31'), by='days', length.out = 5)),.(id, diff)]
input1[d == '2020-07-31']$diff = 2
diff is basically the number of days to the next weekday. E.g. 31st Jul 2020 is a Friday, hence diff is 2, which is the diff to the next weekday, Monday. For the others it will be 1.
Is there a more R-idiomatic way of doing this?
I personally don't like that I had to generate the date sequence for each of the ids separately, or the hardcoding of the diff that I had to do in the input for 31st July. Is there a more generic way of doing this without the hardcoding?
We can create all combinations of dates and id using crossing, and create the diff column based on whether the weekday is "Friday".
library(dplyr)
tidyr::crossing(id = 1:5, d = seq(as.Date('2020-07-31'),
by='days', length.out = 5)) %>%
mutate(diff = as.integer(weekdays(d) == 'Friday') + 1)
Similar logic using base R's expand.grid:
transform(expand.grid(id = 1:5,
d = seq(as.Date('2020-07-31'), by='days', length.out = 5)),
diff = as.integer(weekdays(d) == 'Friday') + 1)
and CJ in data.table:
library(data.table)
df <- CJ(id = 1:5, d = seq(as.Date('2020-07-31'), by='days', length.out = 5))
df[, diff := as.integer(weekdays(d) == 'Friday') + 1]

Separating non-overlapping intervals within groups and counting in R

Using R, I have inpatient data that I have grouped by DNA strain (of the pathogen), clinic of inpatient stay, and overlapping timeframe of the stay to determine if transmission is possible.
I need to sequentially number the overlapping groups. This would appear quite simple, but two issues:
Everything I have found on SO or elsewhere talks about numbering rows within groups. I need each row in a group the same number and the groups themselves to be counted.
Whatever approach would accomplish that initially seemed simple enough with a %>% group_by(strain, clinic) %>%, but this doesn't account for non-overlapping time intervals.
I have tried several approaches and searched before finally giving in and posting here (none of my attempts are even worth posting to waste your time on). The code below is an example of the data I have (have) and the data I want (want). Note that for strain B, all patients are in Clinic_1, but there are two groups due to a separation in time intervals.
Any advice would be much appreciated.
have <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
strain=c(rep("A",4),rep("B",5)),
clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/21","2020/03/06","2020/03/18")),
dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/22","2020/03/21","2020/03/22"))
)
want <- data.frame(have,overlap_number=c(1,1,2,2,3,3,3,4,4))
#How the final data would look
> View(want)
id strain clinic datein dateout overlap_number
1 K01 A Clinic_1 2020-01-01 2020-01-05 1
2 K02 A Clinic_1 2020-01-03 2020-01-16 1
3 K03 A Clinic_2 2020-02-03 2020-02-09 2
4 K04 A Clinic_2 2020-02-09 2020-02-19 2
5 K05 B Clinic_1 2020-02-18 2020-02-27 3
6 K06 B Clinic_1 2020-02-20 2020-02-23 3
7 K07 B Clinic_1 2020-02-21 2020-02-22 3
8 K08 B Clinic_1 2020-03-06 2020-03-21 4
9 K09 B Clinic_1 2020-03-18 2020-03-22 4
An alternative dataset based on Akrun's comment, changing dates slightly for K07:
have2 <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
strain=c(rep("A",4),rep("B",5)),
clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/25","2020/03/06","2020/03/18")),
dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/29","2020/03/21","2020/03/22"))
)
#Output:
#> have2 %>%
#+   mutate(overlap_number = rleid(strain, clinic,
#+            cumsum(datein > lag(dateout, default = first(dateout)))))
# id strain clinic datein dateout overlap_number
#1 K01 A Clinic_1 2020-01-01 2020-01-05 1
#2 K02 A Clinic_1 2020-01-03 2020-01-16 1
#3 K03 A Clinic_2 2020-02-03 2020-02-09 2
#4 K04 A Clinic_2 2020-02-09 2020-02-19 2
#5 K05 B Clinic_1 2020-02-18 2020-02-27 3
#6 K06 B Clinic_1 2020-02-20 2020-02-23 3
#7 K07 B Clinic_1 2020-02-25 2020-02-29 4 ## treats this as single, should be 3
#8 K08 B Clinic_1 2020-03-06 2020-03-21 5 ## should be 4
#9 K09 B Clinic_1 2020-03-18 2020-03-22 5 ## should be 4
An option using data.table:
setkey(setDT(have), clinic, strain, datein, dateout)
have[, g := cumsum(c(0L, (shift(datein, -1L) > cummax(as.integer(dateout)))[-.N])),
     by = .(clinic, strain)][, g := rleid(clinic, strain, g)]
Also:
have[, g02 := cumsum(datein > shift(cummax(as.integer(dateout)), fill = dateout[1L])),
     by = .(clinic, strain)][, g2 := rleid(clinic, strain, g02)]
output:
id strain clinic datein dateout g g2
1: K01 A Clinic_1 2020-01-01 2020-01-05 1 1
2: K02 A Clinic_1 2020-01-03 2020-01-16 1 1
3: K05 B Clinic_1 2020-02-18 2020-02-27 2 2
4: K06 B Clinic_1 2020-02-20 2020-02-23 2 2
5: K07 B Clinic_1 2020-02-21 2020-02-22 2 2
6: K08 B Clinic_1 2020-03-06 2020-03-21 3 3
7: K09 B Clinic_1 2020-03-18 2020-03-22 3 3
8: K03 A Clinic_2 2020-02-03 2020-02-09 4 4
9: K04 A Clinic_2 2020-02-09 2020-02-19 4 4
Idea on the cummax came from David Arenburg's post: How to flatten / merge overlapping time periods
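The cummax trick can be isolated on one group's sorted intervals; a standalone sketch using the strain B dates from the example:

```r
library(data.table)  # for shift()
# strain B intervals, already sorted by datein
datein  <- as.Date(c("2020-02-18", "2020-02-20", "2020-02-21",
                     "2020-03-06", "2020-03-18"))
dateout <- as.Date(c("2020-02-27", "2020-02-23", "2020-02-22",
                     "2020-03-21", "2020-03-22"))
# a new cluster starts where the next start exceeds the running max end
g <- cumsum(c(0L, (shift(datein, -1L) > cummax(as.integer(dateout)))[-length(datein)])) + 1L
g
# 1 1 1 2 2
```

The running maximum is what handles intervals fully contained in an earlier one (K07 inside K05): comparing against the previous end alone would split them incorrectly.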
We can use rleid
library(dplyr)
library(data.table)
have %>%
mutate(overlap_number = rleid(strain, clinic,
cumsum(datein > lag(dateout, default = first(dateout)))))
# id strain clinic datein dateout overlap_number
#1 K01 A Clinic_1 2020-01-01 2020-01-05 1
#2 K02 A Clinic_1 2020-01-03 2020-01-16 1
#3 K03 A Clinic_2 2020-02-03 2020-02-09 2
#4 K04 A Clinic_2 2020-02-09 2020-02-19 2
#5 K05 B Clinic_1 2020-02-18 2020-02-27 3
#6 K06 B Clinic_1 2020-02-20 2020-02-23 3
#7 K07 B Clinic_1 2020-02-21 2020-02-22 3
#8 K08 B Clinic_1 2020-03-06 2020-03-21 4
#9 K09 B Clinic_1 2020-03-18 2020-03-22 4

How to find max of a column within a particular daterange using data.table

I have a dataset with contractID, date and DaysPastDue (DPD) information. How do I look forward, say, 12 months from each row and identify the max DPD within that window for that contractID?
The data set looks like this
Contract_number Date DPD
1: a 2014-03-01 14
2: a 2014-03-01 5
3: a 2014-10-01 6
4: a 2014-10-01 16
5: a 2015-12-01 4
6: a 2015-12-01 17
7: a 2016-09-01 16
8: a 2016-09-01 15
9: a 2016-10-01 3
10: a 2016-10-01 8
11: b 2014-05-01 18
12: b 2014-05-01 9
13: b 2014-08-01 2
14: b 2014-08-01 14
Code for generating this dataset
library(data.table)
set.seed(123)
dummy_data= data.table(Contract_number = letters[1:4],
Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
DPD=sample.int(20:50, 40, replace = TRUE)
)
dummy_data[order(Contract_number,Date)]
I have a dplyr solution to this; wondering if there is a more concise data.table way to do it?
library(dplyr)
library(lubridate) # for months()
max_dpd_data <- dummy_data %>% left_join(dummy_data, by = "Contract_number") %>%
  filter(Date.y > Date.x & Date.y <= (Date.x + months(12))) %>%
  group_by(Contract_number, Date.x) %>%
  summarise(Max_DPD_12_M = max(DPD.y), N_Mnths_Future = n()) %>%
  rename(Date = "Date.x")
dummy_data1 <- left_join(dummy_data, max_dpd_data, by = c("Contract_number", "Date"))
I also do not want to go the route of using expand.grid to fill in missing months and then using shift.
I figure you are looking for something like this
> library(data.table)
> set.seed(123)
> dummy_data= data.table(Contract_number = letters[1:4],
+ Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
+ DPD=sample.int(20:50, 40, replace = TRUE)
+ )
>
> # You can use setkey() to sort
> setkey(dummy_data, Contract_number, Date)
> dummy_data
Contract_number Date DPD
1: a 2014-05-01 16
2: a 2014-05-01 3
3: a 2014-11-01 18
4: a 2014-11-01 3
5: a 2015-05-01 14
6: a 2015-05-01 16
7: a 2016-07-01 14
8: a 2016-07-01 4
9: a 2016-09-01 6
10: a 2016-09-01 6
11: b 2014-01-01 5
12: b 2014-01-01 16
13: b 2014-02-01 15
14: b 2014-02-01 3
15: b 2015-01-01 3
16: b 2015-01-01 18
17: b 2016-04-01 14
18: b 2016-04-01 9
19: b 2016-10-01 16
20: b 2016-10-01 3
21: c 2014-03-01 1
22: c 2014-03-01 12
23: c 2014-06-01 7
24: c 2014-06-01 18
25: c 2015-02-01 13
26: c 2015-02-01 9
27: c 2015-04-01 11
28: c 2015-04-01 5
29: c 2016-01-01 20
30: c 2016-01-01 1
31: d 2014-12-01 19
32: d 2014-12-01 9
33: d 2015-07-01 10
34: d 2015-07-01 5
35: d 2015-12-01 5
36: d 2015-12-01 8
37: d 2016-02-01 12
38: d 2016-02-01 10
39: d 2016-06-01 20
40: d 2016-06-01 8
Contract_number Date DPD
>
> # Add yearmonth decimal column
> dummy_data[, ym := as.integer(format(Date, "%Y%m"))][
+ , ym := (ym %/% 100) + (ym %% 100) / 12][, ym_less_one := ym - 1][
+ , ym2 := ym]
>
> dummy_data <- dummy_data[
+ dummy_data, on = c("Contract_number", "ym>ym", "ym_less_one<=ym2"),
+ .(Date = first(i.Date), DPD = first(i.DPD), max_DPD = max(DPD)),
+ by =.EACHI][, c("ym", "ym_less_one") := NULL]
>
> print(dummy_data)
Contract_number Date DPD max_DPD
1: a 2014-05-01 16 18
2: a 2014-05-01 3 18
3: a 2014-11-01 18 16
4: a 2014-11-01 3 16
5: a 2015-05-01 14 NA
6: a 2015-05-01 16 NA
7: a 2016-07-01 14 6
8: a 2016-07-01 4 6
9: a 2016-09-01 6 NA
10: a 2016-09-01 6 NA
11: b 2014-01-01 5 18
12: b 2014-01-01 16 18
13: b 2014-02-01 15 18
14: b 2014-02-01 3 18
15: b 2015-01-01 3 NA
16: b 2015-01-01 18 NA
17: b 2016-04-01 14 16
18: b 2016-04-01 9 16
19: b 2016-10-01 16 NA
20: b 2016-10-01 3 NA
21: c 2014-03-01 1 18
22: c 2014-03-01 12 18
23: c 2014-06-01 7 13
24: c 2014-06-01 18 13
25: c 2015-02-01 13 20
26: c 2015-02-01 9 20
27: c 2015-04-01 11 20
28: c 2015-04-01 5 20
29: c 2016-01-01 20 NA
30: c 2016-01-01 1 NA
31: d 2014-12-01 19 10
32: d 2014-12-01 9 10
33: d 2015-07-01 10 20
34: d 2015-07-01 5 20
35: d 2015-12-01 5 20
36: d 2015-12-01 8 20
37: d 2016-02-01 12 20
38: d 2016-02-01 10 20
39: d 2016-06-01 20 NA
40: d 2016-06-01 8 NA
Contract_number Date DPD max_DPD
I am not sure whether or not you want the month of the observation itself included within the 12-month period. Further, there might be some issues with the >= operations combined with the floating-point representation of the year-month decimals. A solution may be to subtract some_factor * .Machine$double.eps from the ym_less_one column.
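A date-based variant of the same non-equi self-join avoids the year-month decimals and the floating-point caveat entirely. This is a sketch on made-up rows, not the answer's exact method: the 365-day window and the names upper, look, d, dpd and Max_DPD_12_M are all assumptions:

```r
library(data.table)
dd <- data.table(
  Contract_number = c("a", "a", "a", "b", "b"),
  Date = as.Date(c("2014-03-01", "2014-10-01", "2015-12-01",
                   "2014-05-01", "2014-08-01")),
  DPD  = c(14, 16, 4, 18, 2)
)
dd[, upper := Date + 365]          # assumed look-forward bound of ~12 months
look <- dd[, .(Contract_number, d = Date, dpd = DPD)]
# for each row of dd, max DPD over strictly later rows within the window;
# rows with no future observation get NA
res <- look[dd, on = .(Contract_number, d > Date, d <= upper),
            .(Max_DPD_12_M = max(dpd)), by = .EACHI]
dd[, Max_DPD_12_M := res$Max_DPD_12_M]
dd$Max_DPD_12_M
# 16 NA NA  2 NA
```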