Separating non-overlapping intervals within groups and counting in R

Using R, I have inpatient data that I have grouped by DNA strain (of the pathogen), clinic of inpatient stay, and overlapping timeframe of the stay to determine if transmission is possible.
I need to sequentially number the overlapping groups. This would appear quite simple, but there are two issues:
Everything I have found on SO or elsewhere talks about numbering rows within groups. I need every row in a group to have the same number, and the groups themselves to be counted.
That initially seemed simple enough with a %>% group_by(strain, clinic) %>%, but this doesn't account for non-overlapping time intervals.
I have tried several approaches and searches before finally giving in and posting here (none of my attempts are worth posting here and wasting your time). The code below is an example of the data I have (have) and the data I want (want). Note that for strain B, all patients are in Clinic_1, but there are two groups due to a separation in time intervals.
Any advice would be much appreciated.
have <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
strain=c(rep("A",4),rep("B",5)),
clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/21","2020/03/06","2020/03/18")),
dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/22","2020/03/21","2020/03/22"))
)
want <- data.frame(have,overlap_number=c(1,1,2,2,3,3,3,4,4))
#How the final data would look
> View(want)
id strain clinic datein dateout overlap_number
1 K01 A Clinic_1 2020-01-01 2020-01-05 1
2 K02 A Clinic_1 2020-01-03 2020-01-16 1
3 K03 A Clinic_2 2020-02-03 2020-02-09 2
4 K04 A Clinic_2 2020-02-09 2020-02-19 2
5 K05 B Clinic_1 2020-02-18 2020-02-27 3
6 K06 B Clinic_1 2020-02-20 2020-02-23 3
7 K07 B Clinic_1 2020-02-21 2020-02-22 3
8 K08 B Clinic_1 2020-03-06 2020-03-21 4
9 K09 B Clinic_1 2020-03-18 2020-03-22 4
An alternative dataset based on Akrun's comment, changing dates slightly for K07:
have2 <- data.frame(id=c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
strain=c(rep("A",4),rep("B",5)),
clinic=c(rep("Clinic_1",2),rep("Clinic_2",2),rep("Clinic_1",5)),
datein=as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18","2020/02/20","2020/02/25","2020/03/06","2020/03/18")),
dateout=as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27","2020/02/23","2020/02/29","2020/03/21","2020/03/22"))
)
#Output:
#> have2 %>%
#+ mutate(overlap_number = rleid(strain, clinic,
#+ cumsum(datein > lag(dateout, default = first(dateout)))))
# id strain clinic datein dateout overlap_number
#1 K01 A Clinic_1 2020-01-01 2020-01-05 1
#2 K02 A Clinic_1 2020-01-03 2020-01-16 1
#3 K03 A Clinic_2 2020-02-03 2020-02-09 2
#4 K04 A Clinic_2 2020-02-09 2020-02-19 2
#5 K05 B Clinic_1 2020-02-18 2020-02-27 3
#6 K06 B Clinic_1 2020-02-20 2020-02-23 3
#7 K07 B Clinic_1 2020-02-25 2020-02-29 4 ## treats this as single, should be 3
#8 K08 B Clinic_1 2020-03-06 2020-03-21 5 ## should be 4
#9 K09 B Clinic_1 2020-03-18 2020-03-22 5 ## should be 4

An option using data.table:
library(data.table)
setkey(setDT(have), clinic, strain, datein, dateout)
have[, g := cumsum(c(0L, (shift(datein, -1L) > cummax(as.integer(dateout)))[-.N])),
.(clinic, strain)][,
g := rleid(clinic, strain, g)]
Also:
have[, g02 := cumsum(datein > shift(cummax(as.integer(dateout)), fill=dateout[1L])),
.(clinic, strain)][,
g2 := rleid(clinic, strain, g02)]
output:
id strain clinic datein dateout g g2
1: K01 A Clinic_1 2020-01-01 2020-01-05 1 1
2: K02 A Clinic_1 2020-01-03 2020-01-16 1 1
3: K05 B Clinic_1 2020-02-18 2020-02-27 2 2
4: K06 B Clinic_1 2020-02-20 2020-02-23 2 2
5: K07 B Clinic_1 2020-02-21 2020-02-22 2 2
6: K08 B Clinic_1 2020-03-06 2020-03-21 3 3
7: K09 B Clinic_1 2020-03-18 2020-03-22 3 3
8: K03 A Clinic_2 2020-02-03 2020-02-09 4 4
9: K04 A Clinic_2 2020-02-09 2020-02-19 4 4
The cummax idea came from David Arenburg's post: How to flatten / merge overlapping time periods

We can use rleid
library(dplyr)
library(data.table)
have %>%
mutate(overlap_number = rleid(strain, clinic,
cumsum(datein > lag(dateout, default = first(dateout)))))
# id strain clinic datein dateout overlap_number
#1 K01 A Clinic_1 2020-01-01 2020-01-05 1
#2 K02 A Clinic_1 2020-01-03 2020-01-16 1
#3 K03 A Clinic_2 2020-02-03 2020-02-09 2
#4 K04 A Clinic_2 2020-02-09 2020-02-19 2
#5 K05 B Clinic_1 2020-02-18 2020-02-27 3
#6 K06 B Clinic_1 2020-02-20 2020-02-23 3
#7 K07 B Clinic_1 2020-02-21 2020-02-22 3
#8 K08 B Clinic_1 2020-03-06 2020-03-21 4
#9 K09 B Clinic_1 2020-03-18 2020-03-22 4
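Comparing each datein only with the previous row's dateout misses a stay that is nested inside a longer, earlier one, which is what happens to K07 in have2. Below is a minimal base R sketch of the running-maximum idea. It assumes the rows can be sorted by strain, clinic and datein; the helper name starts_block is made up for illustration.

```r
have2 <- data.frame(
  id      = c("K01","K02","K03","K04","K05","K06","K07","K08","K09"),
  strain  = c(rep("A", 4), rep("B", 5)),
  clinic  = c(rep("Clinic_1", 2), rep("Clinic_2", 2), rep("Clinic_1", 5)),
  datein  = as.Date(c("2020/01/01","2020/01/03","2020/02/03","2020/02/09","2020/02/18",
                      "2020/02/20","2020/02/25","2020/03/06","2020/03/18")),
  dateout = as.Date(c("2020/01/05","2020/01/16","2020/02/09","2020/02/19","2020/02/27",
                      "2020/02/23","2020/02/29","2020/03/21","2020/03/22"))
)

# Sort so each strain/clinic group is contiguous and ordered by admission date
have2 <- have2[order(have2$strain, have2$clinic, have2$datein, have2$dateout), ]

# Hypothetical helper: TRUE where a stay begins after every earlier stay
# in the same group has ended, i.e. where a new overlap block starts
starts_block <- function(din, dout) {
  latest_out_so_far <- c(-Inf, head(cummax(dout), -1))
  din > latest_out_so_far
}

grp <- interaction(have2$strain, have2$clinic, drop = TRUE)
new_block <- unsplit(
  Map(starts_block,
      split(as.numeric(have2$datein), grp),
      split(as.numeric(have2$dateout), grp)),
  grp
)

# The first row of every group is always a block start, so a running count
# over the sorted data numbers the blocks sequentially
have2$overlap_number <- cumsum(new_block)
have2[, c("id", "strain", "clinic", "overlap_number")]
```

On have2 this numbers the rows 1, 1, 2, 2, 3, 3, 3, 4, 4, so K07 stays in block 3 because its datein is still before K05's dateout.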

Related

Dates subtraction from different rows of data frame

I have a big data frame (dim: 12867779x5) that looks like this:
id   group  date1       date2       icf
id1  2      2020-03-17  2019-06-05
id1  3      2020-04-03  2019-05-09
id2  2      2020-04-10  2019-07-04
id2  3      2021-04-1   2020-06-01
id3  1      2020-04-13  2019-07-07
id3  2      2021-04-10  2020-06-01
id3  3      2020-04-10  2019-07-04
id3  4      2021-04-13  2020-06-01
Desired output:
id   group  date1       date2       icf
id1  3      2020-04-03  2019-05-09  0
id2  2      2020-04-10  2019-07-04  52
id2  3      2021-04-01  2020-06-01  0
id3  1      2020-04-13  2019-07-07  49
id3  2      2021-04-10  2020-06-01  -646
id3  3      2020-04-10  2019-07-04  52
id3  4      2021-04-13  2020-06-01  0
To calculate icf I need to check whether the ids in rows i and i+1 are the same. If so, icf = date2(i+1) - date1(i); otherwise icf = 0.
I wrote this function to calculate icf, but it's too slow. I'm looking for ways to speed it up. I was thinking about using the apply function, but I have no idea how to rewrite this icfCalculation function.
icfCalculation <- function(dataFrame){
nr <- nrow(dataFrame) - 1
for (i in 1:nr) {
if(dataFrame[i, 1] == dataFrame[i+1, 1]){
dataFrame[i,5] = dataFrame[i+1, 4] - dataFrame[i, 3]
}
else{
dataFrame[i,5] = 0
}
}
return(dataFrame)
}
Thanks for including the expected output. This result is not the same as what you posted, but it does give the same results as your function and should be significantly quicker thanks to data.table's internal optimisations:
library(data.table)
# Read in data
dat <- read.table(text = "id group date1 date2
id1 2 2020-03-17 2019-06-05
id1 3 2020-04-03 2019-05-09
id2 2 2020-04-10 2019-07-04
id2 3 2021-04-1 2020-06-01
id3 1 2020-04-13 2019-07-07
id3 2 2021-04-10 2020-06-01
id3 3 2020-04-10 2019-07-04
id3 4 2021-04-13 2020-06-01",
header = TRUE,
colClasses = c("character", "character", "Date", "Date")
)
# Make it a data.table
setDT(dat)
dat[, icf := fifelse(
id == shift(id, type = "lead"),
as.integer(
shift(date2, type = "lead") - date1
),
0L)
]
dat
# id group date1 date2 icf
# 1: id1 2 2020-03-17 2019-06-05 -313
# 2: id1 3 2020-04-03 2019-05-09 0
# 3: id2 2 2020-04-10 2019-07-04 52
# 4: id2 3 2021-04-01 2020-06-01 0
# 5: id3 1 2020-04-13 2019-07-07 49
# 6: id3 2 2021-04-10 2020-06-01 -646
# 7: id3 3 2020-04-10 2019-07-04 52
# 8: id3 4 2021-04-13 2020-06-01 NA
If you want the last NA to be 0, just add dat$icf[nrow(dat)] <- 0.
library(dplyr)
library(tidyr)
df %>%
mutate(icf = replace_na(ifelse(id == lead(id), lead(date2) - date1, 0), 0))
Rather than use tidyr::replace_na you could also specify the default argument of lead.
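As a rough sketch of that second option: giving lead(id) a default makes the comparison FALSE on the last row, so ifelse falls through to 0 without any NA handling. The empty-string default is an assumption here (it only works if no real id is ""), and the data frame below is rebuilt from the question's sample rows.

```r
library(dplyr)

df <- data.frame(
  id    = c("id1", "id1", "id2", "id2", "id3", "id3", "id3", "id3"),
  group = c(2, 3, 2, 3, 1, 2, 3, 4),
  date1 = as.Date(c("2020-03-17", "2020-04-03", "2020-04-10", "2021-04-01",
                    "2020-04-13", "2021-04-10", "2020-04-10", "2021-04-13")),
  date2 = as.Date(c("2019-06-05", "2019-05-09", "2019-07-04", "2020-06-01",
                    "2019-07-07", "2020-06-01", "2019-07-04", "2020-06-01"))
)

res <- df %>%
  mutate(icf = ifelse(id == lead(id, default = ""),      # FALSE on the last row
                      as.numeric(lead(date2) - date1),   # difference in days
                      0))
res
```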
Base R
A base R approach would be something like:
df$icf <- with(df, ifelse(id == c(id[2:nrow(df)], NA), c(date2[2:nrow(df)], NA) - date1, 0))
Output
id group date1 date2 icf
1 id1 2 2020-03-17 2019-06-05 -313
2 id1 3 2020-04-03 2019-05-09 0
3 id2 2 2020-04-10 2019-07-04 52
4 id2 3 2021-04-01 2020-06-01 0
5 id3 1 2020-04-13 2019-07-07 49
6 id3 2 2021-04-10 2020-06-01 -646
7 id3 3 2020-04-10 2019-07-04 52
8 id3 4 2021-04-13 2020-06-01 0

Create new column based on cumulative/rolling values in grouping column

Edit: Unfortunately, I simplified my needs and data too much. I will update the question below.
I have a df similar to the code below. I need to create a new column called first_funding_date that is equal to the value of fund.date, where sigma==0, until the next time sigma==0. In the example df below, first_fund_date should be a vector with the first observation equal to "2019/05/22", the following 3 observations equal to "2020/09/05", and the final 4 equal to "2019/11/30".
set.seed(111)
df <- data.frame(id = c(1,1,3,4,5,6,2,7),
fund.date = sample(seq(as.Date('2018/01/01'),
as.Date('2021/01/01'), by="day"), 8),
sigma = c(0,0,1,2,0,1,2,3))
%>% mutate(first_fund_date = ??? )
I also need to create a column called last_funding_date that is equal to fund.date, for the rolling max of sigma. The first 4 observations should be "2020/03/03" and the last 4 should be "2020/12/04".
library(dplyr)
df %>%
mutate(first_fund_date = fund.date[sigma==0],
last_funding_date = fund.date[sigma==max(sigma)])
id fund.date sigma first_fund_date last_funding_date
1 1 2019-05-22 1 2020-09-05 2018-03-10
2 2 2020-09-05 0 2020-09-05 2018-03-10
3 3 2018-06-24 1 2020-09-05 2018-03-10
4 4 2020-03-03 2 2020-09-05 2018-03-10
5 5 2019-11-30 3 2020-09-05 2018-03-10
6 6 2018-03-10 4 2020-09-05 2018-03-10
The key here is to create index variables to group_by with cumsum(sigma==0) and cumsum(sigma < lag(sigma)).
library(dplyr)
df %>%
group_by(index = cumsum(sigma==0))%>%
mutate(first_fund.date = first(fund.date))%>%
group_by(index_2 = cumsum(sigma < lag(sigma, default = Inf)))%>%
mutate(last_fund.date = last(fund.date))%>%
ungroup()%>%
select(-contains('index'))
# A tibble: 8 × 5
id fund.date sigma first_fund.date last_fund.date
<dbl> <date> <dbl> <date> <date>
1 1 2019-05-22 0 2019-05-22 2020-03-03
2 1 2020-09-05 0 2020-09-05 2020-03-03
3 3 2018-06-24 1 2020-09-05 2020-03-03
4 4 2020-03-03 2 2020-09-05 2020-03-03
5 5 2019-11-30 0 2019-11-30 2020-12-04
6 6 2018-03-10 1 2019-11-30 2020-12-04
7 2 2018-11-01 2 2019-11-30 2020-12-04
8 7 2020-12-04 3 2019-11-30 2020-12-04
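To see why those two indices work, it can help to print them next to sigma. This is a small base R sketch; prev_sigma is only a stand-in for dplyr's lag(sigma, default = Inf).

```r
sigma <- c(0, 0, 1, 2, 0, 1, 2, 3)

# A new "first funding" block starts at every sigma == 0
index <- cumsum(sigma == 0)

# Base R stand-in for dplyr::lag(sigma, default = Inf)
prev_sigma <- c(Inf, head(sigma, -1))

# A new "last funding" block starts whenever sigma drops below its previous value
index_2 <- cumsum(sigma < prev_sigma)

data.frame(sigma, index, index_2)
```

Here index comes out as 1, 2, 2, 2, 3, 3, 3, 3 and index_2 as 1, 1, 1, 1, 2, 2, 2, 2, which is why first() and last() within those groups pick the dates shown above.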

Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table( id = seq(1,21),
type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
"2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
"2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
"2020-05-28", "2020-08-25", "2020-12-14")),
end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
"2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
"2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
"2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
geom_segment(size = 2) +
xlab("") +
ylab("") +
theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14 however, also intervention type 2 is happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if possible, I would like to organise the data so that all the separate time periods are extracted, meaning that the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-13
2: 2 3 2015-04-14 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers using dplyr, or answers that just mark overlapping time periods, but I have not found an answer to my specific case.
Is there an efficient way to calculate this using data.table?
This method does a small explosion of looking at all dates in the range, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr) # provides the %>% pipe used below
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
.[, .N, by = .(start_dt, type) ] %>%
.[ !is.na(type), ] %>%
dcast(start_dt ~ type, value.var = "N") %>%
.[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt") ] %>%
.[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
start_dt = min(start_dt), end_dt = max(start_dt)), by = r ]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
It drops the id field, I don't know how to map it well back to your original data.
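The per-type totals the question asks for can then be summed from a segment table like this one. The sketch below is illustrative only: it hard-codes the first four rows of the output above, and it counts both endpoints of each period, as the question does.

```r
segments <- data.frame(
  type     = c("1", "3", "1", "3"),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-05-05", "2015-06-19")),
  end_dt   = as.Date(c("2015-04-13", "2015-05-04", "2015-06-18", "2015-08-27"))
)

# Both endpoints belong to the period, hence the + 1
segments$days <- as.numeric(segments$end_dt - segments$start_dt) + 1

# Total days per type (type "3" = overlap)
totals <- tapply(segments$days, segments$type, sum)
totals
```

For these four rows the type 1 total is 140 days (95 + 45), matching the figure worked out in the question, and the overlap total is 91 days (21 + 70).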
@r2evans's solution is more complete, but if you want to explore the use of foverlaps you can start with something like this:
#split into two frames
data = split(data,by="type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# create the rows that have overlaps
overlap = foverlaps(data[[1]],data[[2]], type="any", nomatch=0)
# get the overlapping time periods
overlap[, .(start_dt = max(start_dt,i.start_dt), end_dt=min(end_dt,i.end_dt)), by=1:nrow(overlap)][,type:=3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days is 1492.

R's padr package claiming the "datetime variable does not vary" when it does vary

library(tidyverse)
library(lubridate)
library(padr)
df
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
I have an R data frame named df shown above. I want to concentrate on row numbers 5 and 6. I can usually use the package padr to pad the months in between rows 5 and 6. The pad() function of the padr will basically add rows at intervals the user specifies, best shown as the added rows "X" below.
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> X 262-179 NA NA NA 2019-09-01
#> X 262-179 NA NA NA 2019-10-01
#> X 262-179 NA NA NA 2019-11-01
#> X 262-179 NA NA NA 2019-12-01
#> X 262-179 NA NA NA 2020-01-01
#> X 262-179 NA NA NA 2020-02-01
#> X 262-179 NA NA NA 2020-03-01
#> X 262-179 NA NA NA 2020-04-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
To get there I usually issue a command, such as is shown below, and it works fine in padr. But it doesn't work in my specific example, and instead yields the warning shown below.
df %>% pad(group = "Scar_Id", by = "YrMo", interval = "month")
#> # A tibble: 828 x 5
#> Scar_Id Code Type Value YrMo
#> <chr> <chr> <chr> <date> <date>
#> 1 0070-179 AA Start_Date 2020-04-22 2020-04-01
#> 2 0070-179 AA Closure_Date 2020-05-23 2020-05-01
#> 3 1139-179 AA Start_Date 2020-04-23 2020-04-01
#> 4 1139-179 AA Closure_Date 2020-05-23 2020-05-01
#> 5 262-179 AA Start_Date 2019-08-29 2019-08-01
#> 6 262-179 AA Closure_Date 2020-05-23 2020-05-01
#> 7 270-179 AA Start_Date 2019-08-29 2019-08-01
#> 8 270-179 AA Closure_Date 2020-05-23 2020-05-01
#> 9 476-179 BB Start_Date 2019-09-04 2019-09-01
#> 10 476-179 BB Closure_Date 2019-11-04 2019-11-01
#> # ... with 818 more rows
#> Warning message:
#> datetime variable does not vary for 537 of the groups, no padding applied on this / these group(s)
Why does it claim that "the datetime variable does not vary" for rows 5 and 6, when the datetime does indeed vary? For row 5 the value of YrMo is "2019-08-01" and for row 6 it is "2020-05-01". To state the obvious, "2019-08-01" differs from "2020-05-01".
Any ideas what went wrong? I tried to create a reproducible example and could not. The basic examples I created all work as expected (as I describe). Hopefully these clues can help somebody determine what is going on.

Increasing code execution time-efficiency using data.table and for-loop

Problem: How can I make the for-loop in the code below run more time-efficiently? For this toy example it works in a reasonable amount of time. However, unique_ids will be a vector of approximately 8000 entries, and the for-loop heavily slows down the computation. Any ideas? Many thanks!
Purpose:
Cluster retrospectively IIDs for each day into hop and top based on calculation logic in for-loop.
Initial Data:
IID ENTRY FINISH TARGET max_finish_target_date
1: 1 2020-02-11 2020-02-19 2020-02-15 2020-02-19
2: 2 2020-02-13 2020-02-17 2020-02-19 2020-02-19
Final (Target) Data:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 hop
3: 1 2020-02-12 hop
4: 1 2020-02-13 hop
5: 1 2020-02-14 hop
6: 1 2020-02-15 hop
7: 1 2020-02-16 top
8: 1 2020-02-17 top
9: 1 2020-02-18 top
10: 1 2020-02-19 top
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 hop
15: 2 2020-02-14 hop
16: 2 2020-02-15 hop
17: 2 2020-02-16 hop
18: 2 2020-02-17 hop
19: 2 2020-02-18
20: 2 2020-02-19
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 hop
27: 3 2020-02-16 hop
28: 3 2020-02-17 top
29: 3 2020-02-18 top
30: 3 2020-02-19 top
Code
rm(list = ls())
library(data.table)
library(lubridate) # provides ymd() and ymd_hms() used below
# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
FINISH = c("2020-02-19", "2020-02-17", ""),
TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))
initial_dt[, ":="(ENTRY = ymd(ENTRY),
FINISH = ymd(FINISH),
TARGET = ymd(TARGET))]
initial_dt[is.na(FINISH), FINISH := as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d")]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]
# Specify target data shape and output format
unique_ids <- c(1, 2, 3)
dts <- seq(as.Date("2020-02-10", format = "%Y-%m-%d"), as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d"), by = "days")
ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)
final_dt <- data.table(IID = ids,
Dates = rep(dts, times = len))
# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids){
final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY], initial_dt[IID == d_id, max_finish_target_date])),
ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) & (Dates <= initial_dt[IID == d_id, max_finish_target_date]),
"hop",
"top")]
}
Your loop doesn't produce the output you show. The following non-equi joins produce that output but could easily be adjusted for other rules (e.g. those from your for loop):
final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
These joins should be very fast.
Result:
# IID Dates ind_frist
# 1: 1 2020-02-10 <NA>
# 2: 1 2020-02-11 hop
# 3: 1 2020-02-12 hop
# 4: 1 2020-02-13 hop
# 5: 1 2020-02-14 hop
# 6: 1 2020-02-15 hop
# 7: 1 2020-02-16 top
# 8: 1 2020-02-17 top
# 9: 1 2020-02-18 top
#10: 1 2020-02-19 top
#11: 2 2020-02-10 <NA>
#12: 2 2020-02-11 <NA>
#13: 2 2020-02-12 <NA>
#14: 2 2020-02-13 hop
#15: 2 2020-02-14 hop
#16: 2 2020-02-15 hop
#17: 2 2020-02-16 hop
#18: 2 2020-02-17 hop
#19: 2 2020-02-18 <NA>
#20: 2 2020-02-19 <NA>
#21: 3 2020-02-10 <NA>
#22: 3 2020-02-11 <NA>
#23: 3 2020-02-12 <NA>
#24: 3 2020-02-13 <NA>
#25: 3 2020-02-14 <NA>
#26: 3 2020-02-15 hop
#27: 3 2020-02-16 hop
#28: 3 2020-02-17 top
#29: 3 2020-02-18 top
#30: 3 2020-02-19 top
# IID Dates ind_frist
A possible alternative using a data.table join:
final_dt[initial_dt
, on = .(IID)
, ind_frist := c("", "top","hop")[1L + (Dates > TARGET & Dates <= max_finish_target_date) +
Dates %between% .(ENTRY, max_finish_target_date)]][]
which gives:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 top
3: 1 2020-02-12 top
4: 1 2020-02-13 top
5: 1 2020-02-14 top
6: 1 2020-02-15 top
7: 1 2020-02-16 hop
8: 1 2020-02-17 hop
9: 1 2020-02-18 hop
10: 1 2020-02-19 hop
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 top
15: 2 2020-02-14 top
16: 2 2020-02-15 top
17: 2 2020-02-16 top
18: 2 2020-02-17 top
19: 2 2020-02-18 top
20: 2 2020-02-19 top
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 top
27: 3 2020-02-16 top
28: 3 2020-02-17 hop
29: 3 2020-02-18 hop
30: 3 2020-02-19 hop
This is the same as the output of the for-loop.
Some explanation: the part 1L + (Dates > TARGET & Dates <= max_finish_target_date) + Dates %between% .(ENTRY, max_finish_target_date) creates an index vector of ones, twos and threes with the same length as the number of rows of final_dt. If you put that between square brackets after c("", "top","hop"), you get an empty string for each one, "top" for each two and "hop" for each three.
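The indexing trick can be tried on its own outside the join. This is a small base R sketch using the values for IID 1 from the sample data; the explicit >= / <= comparison stands in for data.table's %between%, which includes both bounds by default.

```r
Dates  <- seq(as.Date("2020-02-10"), as.Date("2020-02-19"), by = "day")
ENTRY  <- as.Date("2020-02-11")
TARGET <- as.Date("2020-02-15")
max_finish_target_date <- as.Date("2020-02-19")

# TRUE counts as 1 and FALSE as 0 when logicals are added
in_window    <- Dates >= ENTRY & Dates <= max_finish_target_date
after_target <- Dates > TARGET & Dates <= max_finish_target_date

# 1 = outside the window, 2 = in the window up to TARGET, 3 = after TARGET
idx    <- 1L + after_target + in_window
labels <- c("", "top", "hop")[idx]

data.frame(Dates, idx, labels)
```

The first date gets index 1 (empty string), 2020-02-11 to 2020-02-15 get index 2 ("top"), and the remaining dates get index 3 ("hop"), matching the rows for IID 1 in the output above.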
