Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) during which there is no overlap between the two types, and how much overlap there is.
Here's an example dataset:
library(data.table)
data <- data.table(
  id = seq(1, 21),
  type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
                       "2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
                       "2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
                       "2020-05-28", "2020-08-25", "2020-12-14")),
  end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
                     "2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
                     "2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
                     "2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
       aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
  geom_segment(size = 2) +
  xlab("") +
  ylab("") +
  theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14, however, intervention type 2 is also happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
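Those counts are just inclusive date differences, which is easy to verify:
as.integer(as.Date("2015-04-13") - as.Date("2015-01-09")) + 1  # 95 days of "pure" type 1
as.integer(as.Date("2015-05-04") - as.Date("2015-04-14")) + 1  # 21 days of overlap
as.integer(as.Date("2015-06-18") - as.Date("2015-05-05")) + 1  # 45 more days of "pure" type 1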
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if possible, I would like to organise the data such that all the separate time periods are extracted, meaning that the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-13
2: 2 3 2015-04-14 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
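For instance, assuming a data_adjusted table of that shape exists, the totals per type would be a one-liner (following the inclusive day counting above):
data_adjusted[, .(days = sum(as.integer(end_dt - start_dt) + 1)), by = type]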
I have found similar answers using dplyr, or answers that just mark overlapping time periods, but not one that covers my specific case.
Is there an efficient way to calculate this using data.table?

This method explodes the data into one row per day in the range, so it may not scale well if your data gets large.
library(data.table)
library(magrittr)  # for the %>% pipe

alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
  .[, .N, by = .(start_dt, type)] %>%                    # count interventions per day and type
  .[!is.na(type), ] %>%
  dcast(start_dt ~ type, value.var = "N") %>%            # one column per type
  .[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt")] %>%  # consecutive days with the same per-type counts
  .[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
        start_dt = min(start_dt), end_dt = max(start_dt)), by = r]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
It drops the id field; I don't know of a clean way to map the new periods back to your original rows.
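If you do need the ids, one option (a sketch, untested beyond this example; res is assumed to hold the result table above) is a non-equi join of the new periods back onto the original intervals, giving one row per contributing id:
res[, `:=`(s = start_dt, e = end_dt)]  # copies, since the join reuses these names
data[res, on = .(start_dt <= e, end_dt >= s),
     .(r = i.r, type = i.type, start_dt = i.s, end_dt = i.e, id = x.id)]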

@r2evans's solution is more complete, but if you want to explore the use of foverlaps you can start with something like this:
# split into two frames, one per type
data <- split(data, by = "type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# keep the rows that have overlaps
overlap <- foverlaps(data[[1]], data[[2]], type = "any", nomatch = 0)
# reduce each match to the overlapping time period
overlap[, .(start_dt = max(start_dt, i.start_dt), end_dt = min(end_dt, i.end_dt)),
        by = 1:nrow(overlap)][, type := 3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days, computed as end_dt - start_dt, is 1492 (the question's inclusive counting would add one day per period).
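From there the totals follow directly. A sketch, assuming the collapsed periods above are assigned to overlap_periods, that data is still the split list from above, and that intervals of the same type never overlap each other:
overlap_periods <- overlap[, .(start_dt = max(start_dt, i.start_dt),
                               end_dt = min(end_dt, i.end_dt)), by = 1:nrow(overlap)]
overlap_days <- overlap_periods[, sum(as.integer(end_dt - start_dt))]  # 1492
# total days per type across the original intervals, minus the overlap = "pure" days
totals <- rbindlist(data)[, .(total = sum(as.integer(end_dt - start_dt))), by = type]
totals[, pure := total - overlap_days]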

Related

Define periods/episodes of exposure with overlapping and concatenated intervals of time

I'm trying to identify periods/episodes of exposure to a drug from prescription records. If two prescriptions are separated by more than 30 days, the later one starts a new period/episode of exposure. Prescriptions can overlap in time or be consecutive; a gap of 30 days or less between consecutive prescriptions does not start a new episode.
I have data like this:
id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference between each start and the previous end observation (last_diffdays):
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows when an overlap happens (negative values) or not (positive values). I think an ifelse/fcase statement here would be a bad idea and I'm not comfortable doing it.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposure (noexp_days) since the previous prescription.
I don't know if I'm on the right path, but I think I need to calculate the noexp_days variable and then do cumsum(noexp_days > 30) + 1.
If there is a much better solution I haven't seen, or any other possibility I haven't considered, I'd appreciate reading about it.
Thanks in advance for any help! :)
Try:
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]  # clamp overlaps (negative gaps) to 0
DT[, period := cumsum(noexp_days > 30) + 1, id]         # new period whenever the gap exceeds 30 days
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3
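The key idea is that the cumulative sum of a logical flag increments exactly at the rows where a new episode starts; a minimal illustration:
cumsum(c(FALSE, FALSE, TRUE, FALSE, TRUE)) + 1  # 1 1 2 2 3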

Increasing code execution time-efficiency using data.table and for-loop

Problem: How can I make the for-loop in the code below run more time-efficiently? For this toy example it works in a reasonable amount of time, but unique_ids will be a vector of approximately 8000 entries and the for-loop slows the computation down heavily. Any ideas? Many thanks!
Purpose:
Retrospectively cluster IIDs for each day into hop and top, based on the calculation logic in the for-loop below.
Initial Data:
IID ENTRY FINISH TARGET max_finish_target_date
1: 1 2020-02-11 2020-02-19 2020-02-15 2020-02-19
2: 2 2020-02-13 2020-02-17 2020-02-19 2020-02-19
Final (Target) Data:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 hop
3: 1 2020-02-12 hop
4: 1 2020-02-13 hop
5: 1 2020-02-14 hop
6: 1 2020-02-15 hop
7: 1 2020-02-16 top
8: 1 2020-02-17 top
9: 1 2020-02-18 top
10: 1 2020-02-19 top
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 hop
15: 2 2020-02-14 hop
16: 2 2020-02-15 hop
17: 2 2020-02-16 hop
18: 2 2020-02-17 hop
19: 2 2020-02-18
20: 2 2020-02-19
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 hop
27: 3 2020-02-16 hop
28: 3 2020-02-17 top
29: 3 2020-02-18 top
30: 3 2020-02-19 top
Code
rm(list = ls())
library(data.table)
library(lubridate)  # for ymd() / ymd_hms()

# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
                         ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
                         FINISH = c("2020-02-19", "2020-02-17", ""),
                         TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))
initial_dt[, ":="(ENTRY = ymd(ENTRY),
                  FINISH = ymd(FINISH),
                  TARGET = ymd(TARGET))]
initial_dt[is.na(FINISH), FINISH := as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d")]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]

# Specify target data shape and output format
unique_ids <- c(1, 2, 3)
dts <- seq(as.Date("2020-02-10", format = "%Y-%m-%d"),
           as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d"), by = "days")
ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)
final_dt <- data.table(IID = ids,
                       Dates = rep(dts, times = len))

# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids){
  final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY],
                                              initial_dt[IID == d_id, max_finish_target_date])),
           ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) &
                                 (Dates <= initial_dt[IID == d_id, max_finish_target_date]),
                               "hop",
                               "top")]
}
Your loop doesn't produce the output you show. The following non-equi joins produce that output but could easily be adjusted for other rules (e.g. those from your for loop):
final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)  # full IID x date grid
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]  # mark [ENTRY, FINISH]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]  # overwrite (TARGET, FINISH]
These joins should be very fast. Note that the order matters: the second join overwrites "hop" with "top" exactly where Dates > TARGET.
Result:
# IID Dates ind_frist
# 1: 1 2020-02-10 <NA>
# 2: 1 2020-02-11 hop
# 3: 1 2020-02-12 hop
# 4: 1 2020-02-13 hop
# 5: 1 2020-02-14 hop
# 6: 1 2020-02-15 hop
# 7: 1 2020-02-16 top
# 8: 1 2020-02-17 top
# 9: 1 2020-02-18 top
#10: 1 2020-02-19 top
#11: 2 2020-02-10 <NA>
#12: 2 2020-02-11 <NA>
#13: 2 2020-02-12 <NA>
#14: 2 2020-02-13 hop
#15: 2 2020-02-14 hop
#16: 2 2020-02-15 hop
#17: 2 2020-02-16 hop
#18: 2 2020-02-17 hop
#19: 2 2020-02-18 <NA>
#20: 2 2020-02-19 <NA>
#21: 3 2020-02-10 <NA>
#22: 3 2020-02-11 <NA>
#23: 3 2020-02-12 <NA>
#24: 3 2020-02-13 <NA>
#25: 3 2020-02-14 <NA>
#26: 3 2020-02-15 hop
#27: 3 2020-02-16 hop
#28: 3 2020-02-17 top
#29: 3 2020-02-18 top
#30: 3 2020-02-19 top
# IID Dates ind_frist
A possible alternative using a data.table join:
final_dt[initial_dt,
         on = .(IID),
         ind_frist := c("", "top", "hop")[1L + (Dates > TARGET & Dates <= max_finish_target_date) +
                                            Dates %between% .(ENTRY, max_finish_target_date)]][]
which gives:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 top
3: 1 2020-02-12 top
4: 1 2020-02-13 top
5: 1 2020-02-14 top
6: 1 2020-02-15 top
7: 1 2020-02-16 hop
8: 1 2020-02-17 hop
9: 1 2020-02-18 hop
10: 1 2020-02-19 hop
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 top
15: 2 2020-02-14 top
16: 2 2020-02-15 top
17: 2 2020-02-16 top
18: 2 2020-02-17 top
19: 2 2020-02-18 top
20: 2 2020-02-19 top
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 top
27: 3 2020-02-16 top
28: 3 2020-02-17 hop
29: 3 2020-02-18 hop
30: 3 2020-02-19 hop
This is the same as the output of the for-loop.
Some explanation: the part 1L + (Dates > TARGET & Dates <= max_finish_target_date) + Dates %between% .(ENTRY, max_finish_target_date) creates an index vector of ones, twos and threes with the same length as the number of rows of final_dt; putting that between square brackets after c("", "top", "hop") returns an empty string for each one, "top" for each two and "hop" for each three.
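The lookup-by-index trick in isolation:
# summing logicals on top of 1L yields indices 1, 2 or 3 into the lookup vector
idx <- 1L + c(FALSE, FALSE, TRUE) + c(FALSE, TRUE, TRUE)  # 1 2 3
c("", "top", "hop")[idx]                                  # ""  "top"  "hop"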

Calculate maximum date interval - R

The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular, and I'm trying to calculate the uninterrupted interval in days starting from the first start date per group.
Example data:
data <- data.frame(
  id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
  start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
                    "2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
                    "2010-12-08", "2011-03-09")),
  stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
                   "2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
                   "2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap between rows 6 and 7 and to stop the interval at that point (giving 34 days). An interval from 2018-10-01 to 2018-10-01 would be counted as 1 day.
My usual lubridate approaches don't work with this example (interval %within% lag(interval)).
Any idea?
library(magrittr)
library(data.table)
setDT(data)

first_int <- function(start, stop){
  # flag starts that come after the previous stop (a gap),
  # then keep only the first gap-free run of rows
  ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
  list(start = min(start[ind]),
       stop  = max(stop[ind]))
}

newdata <-
  data[, first_int(start, stop), by = id] %>%
  .[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
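To see what first_int keeps, here is the gap flag for the five id == 3 rows of the example data:
library(data.table)
start <- as.Date(c("2015-04-10", "2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14"))
stop  <- as.Date(c("2015-04-13", "2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15"))
(start - shift(stop, fill = Inf)) > 0              # FALSE FALSE FALSE TRUE TRUE
rleid((start - shift(stop, fill = Inf)) > 0) == 1  # TRUE TRUE TRUE FALSE FALSE: rows 1-3 kept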

Cumsum reset at certain values [duplicate]

This question already has answers here:
Cumsum with reset when 0 is encountered and by groups
(2 answers)
Cumulative sum that resets when 0 is encountered
(4 answers)
Closed 5 years ago.
I have the following dataframe
x y count
1 1 2018-02-24 4.031540
2 2 2018-02-25 5.244303
3 3 2018-02-26 5.441465
4 NA 2018-02-27 4.164104
5 5 2018-02-28 5.172919
6 6 2018-03-01 5.591410
7 7 2018-03-02 4.691716
8 8 2018-03-03 5.465360
9 9 2018-03-04 3.269378
10 NA 2018-03-05 5.300679
11 11 2018-03-06 5.489664
12 12 2018-03-07 4.423334
13 13 2018-03-08 3.808764
14 14 2018-03-09 6.450136
15 15 2018-03-10 5.541785
16 16 2018-03-11 4.762889
17 17 2018-03-12 5.511649
18 18 2018-03-13 6.795386
19 19 2018-03-14 6.615762
20 20 2018-03-15 4.749151
I want to take the cumsum of the count column, but I want the cumsum to restart when the x value is NA. I've tried the following:
df$cum_sum <- ifelse(is.na(df$x) == FALSE, cumsum(df$count), 0)
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 0.000000
5 5 2018-02-28 5.172919 24.054331
6 6 2018-03-01 5.591410 29.645741
7 7 2018-03-02 4.691716 34.337458
8 8 2018-03-03 5.465360 39.802817
9 9 2018-03-04 3.269378 43.072195
10 NA 2018-03-05 5.300679 0.000000
11 11 2018-03-06 5.489664 53.862538
12 12 2018-03-07 4.423334 58.285871
13 13 2018-03-08 3.808764 62.094635
14 14 2018-03-09 6.450136 68.544771
15 15 2018-03-10 5.541785 74.086556
16 16 2018-03-11 4.762889 78.849445
17 17 2018-03-12 5.511649 84.361094
18 18 2018-03-13 6.795386 91.156480
19 19 2018-03-14 6.615762 97.772242
20 20 2018-03-15 4.749151 102.521394
The result is the cum_sum column is 0 at the NA values, but the cumsum doesn't reset. How can I fix this?
A possible solution: cumsum(is.na(dat$x)) builds a group index that increments at every NA, and ave() then computes the cumulative sum within each of those groups:
dat$cum_sum <- ave(dat$count, cumsum(is.na(dat$x)), FUN = cumsum)
which gives:
> dat
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 4.164104
5 5 2018-02-28 5.172919 9.337023
6 6 2018-03-01 5.591410 14.928433
7 7 2018-03-02 4.691716 19.620149
8 8 2018-03-03 5.465360 25.085509
9 9 2018-03-04 3.269378 28.354887
10 NA 2018-03-05 5.300679 5.300679
11 11 2018-03-06 5.489664 10.790343
12 12 2018-03-07 4.423334 15.213677
13 13 2018-03-08 3.808764 19.022441
14 14 2018-03-09 6.450136 25.472577
15 15 2018-03-10 5.541785 31.014362
16 16 2018-03-11 4.762889 35.777251
17 17 2018-03-12 5.511649 41.288900
18 18 2018-03-13 6.795386 48.084286
19 19 2018-03-14 6.615762 54.700048
20 20 2018-03-15 4.749151 59.449199
Or with dplyr:
library(dplyr)
dat %>%
  group_by(grp = cumsum(is.na(x))) %>%
  mutate(cum_sum = cumsum(count)) %>%
  ungroup() %>%
  select(-grp)
Here is a data.table version (note that it leaves cum_sum as NA on the NA rows, unlike the ave() solution above):
plouf <- setDT(df)
plouf[, group := cumsum(is.na(x))]
plouf[!is.na(x), cum_sum := cumsum(count), by = group]
x y count group cum_sum
1: 1 2018-02-24 4.031540 0 4.031540
2: 2 2018-02-25 5.244303 0 9.275843
3: 3 2018-02-26 5.441465 0 14.717308
4: NA 2018-02-27 4.164104 1 NA
5: 5 2018-02-28 5.172919 1 5.172919
6: 6 2018-03-01 5.591410 1 10.764329
7: 7 2018-03-02 4.691716 1 15.456045
8: 8 2018-03-03 5.465360 1 20.921405
9: 9 2018-03-04 3.269378 1 24.190783
10: NA 2018-03-05 5.300679 2 NA
11: 11 2018-03-06 5.489664 2 5.489664
12: 12 2018-03-07 4.423334 2 9.912998
13: 13 2018-03-08 3.808764 2 13.721762
14: 14 2018-03-09 6.450136 2 20.171898
15: 15 2018-03-10 5.541785 2 25.713683
16: 16 2018-03-11 4.762889 2 30.476572
17: 17 2018-03-12 5.511649 2 35.988221
18: 18 2018-03-13 6.795386 2 42.783607
19: 19 2018-03-14 6.615762 2 49.399369
20: 20 2018-03-15 4.749151 2 54.148520
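If you'd rather have the NA rows carry their own count, matching the ave() output above, drop the !is.na(x) filter:
plouf[, cum_sum := cumsum(count), by = group]  # each NA row starts its group's running total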

How to find max of a column within a particular daterange using data.table

I have a dataset with contract ID, date and days-past-due (DPD) information. For each row, how do I look forward, say, 12 months and identify the maximum DPD for that contract ID?
The data set looks like this
Contract_number Date DPD
1: a 2014-03-01 14
2: a 2014-03-01 5
3: a 2014-10-01 6
4: a 2014-10-01 16
5: a 2015-12-01 4
6: a 2015-12-01 17
7: a 2016-09-01 16
8: a 2016-09-01 15
9: a 2016-10-01 3
10: a 2016-10-01 8
11: b 2014-05-01 18
12: b 2014-05-01 9
13: b 2014-08-01 2
14: b 2014-08-01 14
Code for generating this dataset
library(data.table)
set.seed(123)
dummy_data= data.table(Contract_number = letters[1:4],
Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
DPD=sample.int(20:50, 40, replace = TRUE)
)
dummy_data[order(Contract_number,Date)]
I have a dplyr solution for this, and I'm wondering if there is a more concise data.table way to do it:
max_dpd_data <- dummy_data %>%
  left_join(dummy_data, dummy_data, by = "Contract_number") %>%
  filter(Date.y > Date.x & Date.y <= (Date.x + months(12))) %>%
  group_by(Contract_number, Date.x) %>%
  summarise(Max_DPD_12_M = max(DPD.y), N_Mnths_Future = n()) %>%
  rename(Date = "Date.x")
dummy_data1 <- left_join(dummy_data, max_dpd_data, by = c("Contract_number", "Date"))
I also do not want to go the route of using expand.grid to fill in missing months and then using shift.
I figure you are looking for something like this:
> library(data.table)
> set.seed(123)
> dummy_data= data.table(Contract_number = letters[1:4],
+ Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
+ DPD=sample.int(20:50, 40, replace = TRUE)
+ )
>
> # You can use setkey to sort
> setkey(dummy_data, Contract_number, Date)
> dummy_data
Contract_number Date DPD
1: a 2014-05-01 16
2: a 2014-05-01 3
3: a 2014-11-01 18
4: a 2014-11-01 3
5: a 2015-05-01 14
6: a 2015-05-01 16
7: a 2016-07-01 14
8: a 2016-07-01 4
9: a 2016-09-01 6
10: a 2016-09-01 6
11: b 2014-01-01 5
12: b 2014-01-01 16
13: b 2014-02-01 15
14: b 2014-02-01 3
15: b 2015-01-01 3
16: b 2015-01-01 18
17: b 2016-04-01 14
18: b 2016-04-01 9
19: b 2016-10-01 16
20: b 2016-10-01 3
21: c 2014-03-01 1
22: c 2014-03-01 12
23: c 2014-06-01 7
24: c 2014-06-01 18
25: c 2015-02-01 13
26: c 2015-02-01 9
27: c 2015-04-01 11
28: c 2015-04-01 5
29: c 2016-01-01 20
30: c 2016-01-01 1
31: d 2014-12-01 19
32: d 2014-12-01 9
33: d 2015-07-01 10
34: d 2015-07-01 5
35: d 2015-12-01 5
36: d 2015-12-01 8
37: d 2016-02-01 12
38: d 2016-02-01 10
39: d 2016-06-01 20
40: d 2016-06-01 8
Contract_number Date DPD
>
> # Add year-month decimal column, e.g. 2014-05 -> 2014 + 5/12
> dummy_data[, ym := as.integer(format(Date, "%Y%m"))][
+ , ym := (ym %/% 100) + (ym %% 100) / 12][, ym_less_one := ym - 1][
+ , ym2 := ym]
>
> dummy_data <- dummy_data[
+ dummy_data, on = c("Contract_number", "ym>ym", "ym_less_one<=ym2"),
+ .(Date = first(i.Date), DPD = first(i.DPD), max_DPD = max(DPD)),
+ by = .EACHI][, c("ym", "ym_less_one") := NULL]
>
> print(dummy_data)
Contract_number Date DPD max_DPD
1: a 2014-05-01 16 18
2: a 2014-05-01 3 18
3: a 2014-11-01 18 16
4: a 2014-11-01 3 16
5: a 2015-05-01 14 NA
6: a 2015-05-01 16 NA
7: a 2016-07-01 14 6
8: a 2016-07-01 4 6
9: a 2016-09-01 6 NA
10: a 2016-09-01 6 NA
11: b 2014-01-01 5 18
12: b 2014-01-01 16 18
13: b 2014-02-01 15 18
14: b 2014-02-01 3 18
15: b 2015-01-01 3 NA
16: b 2015-01-01 18 NA
17: b 2016-04-01 14 16
18: b 2016-04-01 9 16
19: b 2016-10-01 16 NA
20: b 2016-10-01 3 NA
21: c 2014-03-01 1 18
22: c 2014-03-01 12 18
23: c 2014-06-01 7 13
24: c 2014-06-01 18 13
25: c 2015-02-01 13 20
26: c 2015-02-01 9 20
27: c 2015-04-01 11 20
28: c 2015-04-01 5 20
29: c 2016-01-01 20 NA
30: c 2016-01-01 1 NA
31: d 2014-12-01 19 10
32: d 2014-12-01 9 10
33: d 2015-07-01 10 20
34: d 2015-07-01 5 20
35: d 2015-12-01 5 20
36: d 2015-12-01 8 20
37: d 2016-02-01 12 20
38: d 2016-02-01 10 20
39: d 2016-06-01 20 NA
40: d 2016-06-01 8 NA
Contract_number Date DPD max_DPD
I am not sure whether or not you want the month of the observation itself included within the 12-month window. Further, there might be issues with the >= operations combined with floating-point precision. A possible fix is to subtract some_factor * .Machine$double.eps from the ym_less_one column.
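For instance, a sketch of that epsilon guard (the factor here is an arbitrary safety margin and would need to scale with the magnitude of ym):
dummy_data[, ym_less_one := ym - 1 - 1e4 * .Machine$double.eps]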
