R data.table cumulative sum over time intervals

I have a table with values that exist during specific time intervals. I want a field that will sum up over all values for a given ID that exist during the start time of that interval.
Here is a pared-down example:
library(data.table)
x = data.table(ID = c(rep(1, 5), rep(2, 5)),
               DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                                    '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
               DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                                  '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
               VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))
The table looks like this:
x
ID DT_START DT_END VALUE
1: 1 2017-05-28 2018-05-28 300
2: 1 2017-05-29 2018-05-29 400
3: 1 2017-07-03 2018-07-03 200
4: 1 2018-05-28 2018-05-29 100
5: 1 2018-05-29 2019-05-28 150
6: 2 2019-07-03 2019-10-08 250
7: 2 2019-10-08 2020-07-03 350
8: 2 2020-05-28 2021-05-28 50
9: 2 2020-05-29 2021-05-29 10
10: 2 2020-07-03 2020-10-03 45
In the first row, that's the first start date for that ID and there are no equal dates, so the cumulative value would be just 300. By the second row, we now add the 300+400 to get 700, because as of 5/29/2017, both the 400 and the 300 were active for ID 1. The full desired output vector is obtained using the following code:
x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)), by = .(ID, DT_START)]
x
ID DT_START DT_END VALUE VALUE_CUM
1: 1 2017-05-28 2018-05-28 300 300
2: 1 2017-05-29 2018-05-29 400 700
3: 1 2017-07-03 2018-07-03 200 900
4: 1 2018-05-28 2018-05-29 100 700
5: 1 2018-05-29 2019-05-28 150 350
6: 2 2019-07-03 2019-10-08 250 250
7: 2 2019-10-08 2020-07-03 350 350
8: 2 2020-05-28 2021-05-28 50 400
9: 2 2020-05-29 2021-05-29 10 410
10: 2 2020-07-03 2020-10-03 45 105
This is great but takes way too long on my huge data table with millions of rows. Any ideas for how to do this more elegantly so that it runs faster?
Thanks!

Here is a possible way to do it:
y <- x[x, .(DT_END2 = i.DT_END,
            VALUE = i.VALUE,
            VALUE_CUM = sum(x.VALUE)),
       on = .(ID, DT_START <= DT_START, DT_END > DT_START), by = .EACHI]
# in the non-equi join, DT_END is overwritten by the values of DT_START,
# so DT_END2 backs up the correct DT_END values
y[, DT_END := DT_END2][, DT_END2 := NULL]
# ID DT_START DT_END VALUE VALUE_CUM
# 1: 1 2017-05-28 2018-05-28 300 300
# 2: 1 2017-05-29 2018-05-29 400 700
# 3: 1 2017-07-03 2018-07-03 200 900
# 4: 1 2018-05-28 2018-05-29 100 700
# 5: 1 2018-05-29 2019-05-28 150 350
# 6: 2 2019-07-03 2019-10-08 250 250
# 7: 2 2019-10-08 2020-07-03 350 350
# 8: 2 2020-05-28 2021-05-28 50 400
# 9: 2 2020-05-29 2021-05-29 10 410
# 10: 2 2020-07-03 2020-10-03 45 105
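As a quick sanity check (a sketch assuming the question's slow code has already been run, so x carries its VALUE_CUM column), the two results can be compared row by row; a by = .EACHI join returns rows in the order of the outer x:

```r
# both columns follow x's original row order, so a direct comparison works
all.equal(x$VALUE_CUM, y$VALUE_CUM)  # should be TRUE
```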

Related

Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table(
  id = seq(1, 21),
  type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
                       "2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
                       "2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
                       "2020-05-28", "2020-08-25", "2020-12-14")),
  end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
                     "2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
                     "2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
                     "2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
geom_segment(size = 2) +
xlab("") +
ylab("") +
theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14 however, also intervention type 2 is happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if also possible, I would like to organise the data such that I get all the separate time periods extracted, meaning that the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-14
2: 2 3 2015-04-15 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers using dplyr, or for just marking overlapping time periods, but I have not found an answer to my specific case.
Is there an efficient way to calculate this using data.table?
This method does a small explosion of looking at all dates in the range, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr)  # for the %>% pipe, which data.table itself does not export
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
  .[, .N, by = .(start_dt, type)] %>%
  .[!is.na(type), ] %>%
  dcast(start_dt ~ type, value.var = "N") %>%
  .[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt")] %>%
  .[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
        start_dt = min(start_dt), end_dt = max(start_dt)), by = r]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
It drops the id field; I don't know how to map it back to your original data well.
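If only the day totals are needed, they can be computed from the collapsed table above; assuming that result is stored in a variable named res (a name not used in the answer), something like:

```r
# total days per segment type; + 1 because both endpoints are inclusive
res[, .(days = sum(as.integer(end_dt - start_dt) + 1L)), by = type]
```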
@r2evans' solution is more complete, but if you want to explore the use of foverlaps, you can start with something like this:
# split into two frames, one per type
data = split(data, by = "type")
# key the second frame
setkey(data[[2]], start_dt, end_dt)
# create the rows that have overlaps
overlap = foverlaps(data[[1]], data[[2]], type = "any", nomatch = 0)
# get the overlapping time periods
overlap[, .(start_dt = max(start_dt, i.start_dt), end_dt = min(end_dt, i.end_dt)),
        by = 1:nrow(overlap)][, type := 3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days is 1492.

Generating test data in R

I am trying to generate this table as one of the inputs to a test.
id diff d
1: 1 2 2020-07-31
2: 1 1 2020-08-01
3: 1 1 2020-08-02
4: 1 1 2020-08-03
5: 1 1 2020-08-04
6: 2 2 2020-07-31
7: 2 1 2020-08-01
8: 2 1 2020-08-02
9: 2 1 2020-08-03
10: 2 1 2020-08-04
11: 3 2 2020-07-31
12: 3 1 2020-08-01
13: 3 1 2020-08-02
14: 3 1 2020-08-03
15: 3 1 2020-08-04
16: 4 2 2020-07-31
17: 4 1 2020-08-01
18: 4 1 2020-08-02
19: 4 1 2020-08-03
20: 4 1 2020-08-04
21: 5 2 2020-07-31
22: 5 1 2020-08-01
23: 5 1 2020-08-02
24: 5 1 2020-08-03
25: 5 1 2020-08-04
id diff d
I have done it like this -
input1 = data.table(id=as.character(1:5), diff=1)
input1 = input1[,.(d=seq(as.Date('2020-07-31'), by='days', length.out = 5)),.(id, diff)]
input1[d == '2020-07-31']$diff = 2
diff is basically the number of days to the next weekday. E.g., 31st Jul 2020 is a Friday, hence diff is 2, which is the diff to the next weekday, Monday. For the others it will be 1.
Is there a more R-idiomatic way of doing this?
I personally don't like that I had to generate the date sequence for each of the ids separately, or the hardcoding of the diff that I have to do in the input for 31st July. Is there a more generic way of doing this without the hardcoding?
We can create all combination of dates and id using crossing and create diff column based on whether the weekday is "Friday".
library(dplyr)
tidyr::crossing(id = 1:5,
                d = seq(as.Date('2020-07-31'), by = 'days', length.out = 5)) %>%
  mutate(diff = as.integer(weekdays(d) == 'Friday') + 1)
Similar logic using base R expand.grid :
transform(expand.grid(id = 1:5,
                      d = seq(as.Date('2020-07-31'), by = 'days', length.out = 5)),
          diff = as.integer(weekdays(d) == 'Friday') + 1)
and CJ in data.table :
library(data.table)
df <- CJ(id = 1:5, d = seq(as.Date('2020-07-31'), by='days', length.out = 5))
df[, diff := as.integer(weekdays(d) == 'Friday') + 1]
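One caveat: weekdays() returns locale-dependent names, so the == 'Friday' test silently fails in a non-English session. A locale-independent variant along the same lines (a sketch, not from the answers above) keys off the ISO weekday number instead:

```r
library(data.table)
df <- CJ(id = 1:5, d = seq(as.Date('2020-07-31'), by = 'days', length.out = 5))
# format(d, "%u") gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# so "5" identifies Friday regardless of the session locale
df[, diff := as.integer(format(d, "%u") == "5") + 1L]
```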

Match rows with the same or close start and end date in data.table r

Following data.table
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24"),
end_date=c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24"),
variable1=c("a","c","c","d","a",NA,"a","a","b"))
df
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-08-01 2019-12-01 c
3: 2 2019-07-12 2019-07-30 c
4: 2 2017-05-24 2017-11-24 d
5: 3 2016-05-08 2017-07-25 a
6: 3 2017-08-01 2018-08-01 <NA>
7: 4 2019-06-12 2019-12-12 a
8: 4 2017-02-24 2017-08-24 a
9: 4 2017-08-24 2018-08-24 b
Within the same ID, I want to compare the start_date and end_date. If the end_date of one row is within 30 days of the start_date of another row, I want to combine the rows. So that it looks like this:
id start_date end_date variable1
1: 1 2019-05-08 2019-09-08 a
2: 2 2019-07-12 2019-12-01 c
3: 2 2017-05-24 2017-11-24 d
4: 3 2016-05-08 2018-08-01 a
5: 4 2019-06-12 2019-12-12 a
6: 4 2017-02-24 2017-08-24 a
7: 4 2017-08-24 2018-08-24 b
If the other variables of the rows are the same, rows should be combined with the earliest start_date and latest end_date as id number 2. If the variable1 is NA it should be replaced with values from the matching row as id number 3. If the variable1 has different values, rows should remain separate as id number 4.
The data.table contains more variables and objects than displayed here. Preferable a function in data.table.
It's not clear what happens if an id has 3 overlapping rows with variable1 = c('a', NA, 'b'): what should variable1 be for the NA row in this case, a or b?
If we just choose the first variable1 when there are multiple matches, here is an option to first fill the NA and then borrow the idea from David Aurenburg's solution here
setorder(df, id, start_date, end_date)
df[, end_d := end_date + 30L]
# fill NA variable1 from the first non-NA row of the same id whose padded window covers this start_date
df[is.na(variable1), variable1 :=
     df[!is.na(variable1)][.SD, on = .(id, start_date <= start_date, end_d >= start_date),
                           mult = "first", x.variable1]]
# a new group g starts when the next start_date lies beyond every (padded) end date seen so far
df[, g := c(0L, cumsum(shift(start_date, -1L) > cummax(as.integer(end_d)))[-.N]), id][,
   .(start_date = min(start_date), end_date = max(end_date)), .(id, variable1, g)]
output:
id variable1 g start_date end_date
1: 1 a 0 2019-05-08 2019-09-08
2: 2 d 0 2017-05-24 2017-11-24
3: 2 c 1 2019-07-12 2019-12-01
4: 3 a 0 2016-05-08 2018-08-01
5: 4 a 0 2017-02-24 2017-08-24
6: 4 b 0 2017-08-24 2018-08-24
7: 4 a 1 2019-06-12 2019-12-12
data:
library(data.table)
df <- data.table(id=c(1,2,2,2,3,3,4,4,4),
start_date=as.IDate(c("2019-05-08","2019-08-01","2019-07-12","2017-05-24","2016-05-08","2017-08-01","2019-06-12","2017-02-24","2017-08-24")),
end_date=as.IDate(c("2019-09-08","2019-12-01","2019-07-30","2017-11-24","2017-07-25","2018-08-01","2019-12-12","2017-08-24","2018-08-24")),
variable1=c("a","c","c","d","a",NA,"a","a","b"))
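The cumsum(shift(...) > cummax(...)) line is the core trick: a new group starts whenever the next interval begins after every end date seen so far. A minimal standalone illustration on hypothetical data:

```r
library(data.table)
s <- as.IDate(c("2020-01-01", "2020-01-10", "2020-03-01"))
e <- as.IDate(c("2020-01-15", "2020-02-01", "2020-03-10"))
# TRUE where the next start lies beyond the running maximum of the ends
c(0L, cumsum(shift(s, -1L) > cummax(as.integer(e)))[-length(s)])
# 0 0 1: the first two intervals chain into one group, the third starts a new one
```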

Fill incomplete time series in data.table

Question: In data.table is there any way to fill an incomplete date sequence with zeros? For instance, in the toy example some dates do not show up and I want to have a complete date sequence with y = 0 for these cases. Is there something like a forward fill?
Remark: Note I do not want to use merges where you create first the full date sequence and then merge it back to the initial data.table object (I think this is inefficient and rather slow).
library(data.table)
dt <- data.table(
x = c("2020-03-28", "2020-03-29", "2020-03-31", "2020-04-05"),
y = c(1, 5, 3, 70)
)
## Output:
x y
1: 2020-03-28 1
2: 2020-03-29 5
3: 2020-03-31 3
4: 2020-04-05 70
## Desired Output:
x y
1: 2020-03-28 1
2: 2020-03-29 5
3: 2020-03-30 0
4: 2020-03-31 3
5: 2020-04-01 0
6: 2020-04-02 0
7: 2020-04-03 0
8: 2020-04-04 0
9: 2020-04-05 70
How about this?
# convert to data.table's integer date type
dt[ , x := as.IDate(x)]
# find the range of dates
date_bounds = range(dt$x)
# construct a sequence of all dates
# NB: this will be integers as attributes are stripped
all_dates = date_bounds[1L]:date_bounds[2L]
# construct a table with the missing dates,
# with y filled to 0
missing = data.table(
# as.IDate uses the right origin for integer input
x = as.IDate(setdiff(all_dates, dt$x)),
y = 0
)
dt = rbind(dt, missing)
# x y
# <IDat> <num>
# 1: 2020-03-28 1
# 2: 2020-03-29 5
# 3: 2020-03-31 3
# 4: 2020-04-05 70
# 5: 2020-03-30 0
# 6: 2020-04-01 0
# 7: 2020-04-02 0
# 8: 2020-04-03 0
# 9: 2020-04-04 0
Afterwards you can setorder(dt, x) if you want the dates to be in order.
Use CJ and tidyr::full_seq to create a join data table.
dt[, x := as.Date(x)] # convert x to the Date type
dt2 <- dt[CJ(x = tidyr::full_seq(x, 1)), on = .(x)] # create the full sequence
dt2[is.na(y), y := 0] # fill NAs with 0s
dt2
# x y
# 1: 2020-03-28 1
# 2: 2020-03-29 5
# 3: 2020-03-30 0
# 4: 2020-03-31 3
# 5: 2020-04-01 0
# 6: 2020-04-02 0
# 7: 2020-04-03 0
# 8: 2020-04-04 0
# 9: 2020-04-05 70
You could use complete from tidyr :
library(dplyr)
library(tidyr)
dt %>%
mutate(x = as.Date(x)) %>%
complete(x = seq(min(x), max(x), by = "day"), fill = list(y = 0))
# x y
# <date> <dbl>
#1 2020-03-28 1
#2 2020-03-29 5
#3 2020-03-30 0
#4 2020-03-31 3
#5 2020-04-01 0
#6 2020-04-02 0
#7 2020-04-03 0
#8 2020-04-04 0
#9 2020-04-05 70
You could also try this:
dt[, x := as.IDate(x)]
dt[.(seq(min(x), max(x), 1)), .(y = fifelse(is.na(y), 0, y)), .EACHI, on = "x"]
# x y
# 1: 2020-03-28 1
# 2: 2020-03-29 5
# 3: 2020-03-30 0
# 4: 2020-03-31 3
# 5: 2020-04-01 0
# 6: 2020-04-02 0
# 7: 2020-04-03 0
# 8: 2020-04-04 0
# 9: 2020-04-05 70

R Coding: Creating a complex conditional summary

I have a (cut-down) table containing the following pieces of information relating to a credit application process:
Date of Application
Email Address
The table can contain the same email address multiple times but with a different application date (it can be assumed that the same person has applied multiple times).
I would like to add a third column that tells me how many other applications have been seen with the same email address in the 90 days prior to application date.
How would I do this in R? Creating a summary by email address would be straightforward but adding the 90 day condition is for me the tricky part.
Coming from SAS I'd sort the table by email address and then use a lag function but any help with R would be massively helpful.
Thanks for reading.
A reproducible example would have been pretty helpful here but here's my best shot without it. What you're asking for could be done many ways. The easiest programming way is probably using a for loop over the rows of the data.
library(data.table)
library(lubridate)
set.seed(124)
emails <- 'None'
dates <- ymd('1900/01/01')
n_email = 500
for (i in seq_len(n_email)) {
  n <- rpois(1, 3) + 1
  d <- sample(seq(ymd('2018/01/01'), ymd('2019/09/01'), by = 'day'), n)
  emails <- c(emails, rep(as.character(i), n))
  dates <- c(dates, d)
}
dat <- data.table(emails, dates)
dat <- dat[order(emails, dates)]
dat[,counts := 0][]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 0
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 0
#> 1943: 99 2018-07-07 0
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 0
#> 1946: None 1900-01-01 0
for (i in 1:nrow(dat)) {
  diffs = difftime(dat[i, dates], dat[emails == dat[i, emails], dates], units = 'days')
  count = sum(diffs < 90 & diffs > 0)
  dat[i, counts := count]
}
dat[]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 1
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 1
#> 1943: 99 2018-07-07 2
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 1
#> 1946: None 1900-01-01 0
dat[emails %in% dat[counts > 3,emails]][order(emails, dates)]
#> emails dates counts
#> 1: 396 2018-05-27 0
#> 2: 396 2018-07-10 1
#> 3: 396 2018-10-02 1
#> 4: 396 2019-02-13 0
#> 5: 396 2019-04-21 1
#> 6: 396 2019-04-22 2
#> 7: 396 2019-04-27 3
#> 8: 396 2019-05-02 4
#> 9: 396 2019-06-13 4
#> 10: 496 2018-03-06 0
#> 11: 496 2019-01-31 0
#> 12: 496 2019-04-08 1
#> 13: 496 2019-06-10 1
#> 14: 496 2019-06-24 2
#> 15: 496 2019-07-11 2
#> 16: 496 2019-07-23 3
#> 17: 496 2019-08-25 4
#> 18: 56 2018-11-16 0
#> 19: 56 2019-02-27 0
#> 20: 56 2019-04-09 1
#> 21: 56 2019-04-13 2
#> 22: 56 2019-04-25 3
#> 23: 56 2019-05-13 4
#> emails dates counts
Created on 2019-09-18 by the reprex package (v0.3.0)
However, here's a more concise, efficient way to do it as well, making more use of data.table's capabilities. Note that this way doesn't require pre-sorting.
library(data.table)
library(lubridate)
set.seed(124)
emails <- 'None'
dates <- ymd('1900/01/01')
n_email = 500
for (i in seq_len(n_email)) {
  n <- rpois(1, 3) + 1
  d <- sample(seq(ymd('2018/01/01'), ymd('2019/09/01'), by = 'day'), n)
  emails <- c(emails, rep(as.character(i), n))
  dates <- c(dates, d)
}
dat <- data.table(emails, dates)
dat <- dat[sample(seq_len(nrow(dat)))]
dat
#> emails dates
#> 1: 70 2018-12-21
#> 2: 416 2018-10-02
#> 3: 289 2018-12-14
#> 4: 87 2018-03-02
#> 5: 441 2018-12-08
#> ---
#> 1942: 365 2018-01-25
#> 1943: 200 2019-02-02
#> 1944: 14 2019-03-20
#> 1945: 166 2018-06-20
#> 1946: 161 2018-02-07
dat[order(dates),
    counts := sapply(1:.N, FUN = function(i) {
      if (i == 1) return(0)
      x = c(0, diff(dates))
      days = 0
      place = i
      ret = 0
      while (days < 90 & place > 1) {
        if (x[place] + days < 90) ret = ret + 1
        days = days + x[place]
        place = place - 1
      }
      ret
    }),
    emails][order(emails, dates)]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 1
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 1
#> 1943: 99 2018-07-07 2
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 1
#> 1946: None 1900-01-01 0
dat[emails %in% dat[counts > 3,emails]][order(emails, dates)]
#> emails dates counts
#> 1: 396 2018-05-27 0
#> 2: 396 2018-07-10 1
#> 3: 396 2018-10-02 1
#> 4: 396 2019-02-13 0
#> 5: 396 2019-04-21 1
#> 6: 396 2019-04-22 2
#> 7: 396 2019-04-27 3
#> 8: 396 2019-05-02 4
#> 9: 396 2019-06-13 4
#> 10: 496 2018-03-06 0
#> 11: 496 2019-01-31 0
#> 12: 496 2019-04-08 1
#> 13: 496 2019-06-10 1
#> 14: 496 2019-06-24 2
#> 15: 496 2019-07-11 2
#> 16: 496 2019-07-23 3
#> 17: 496 2019-08-25 4
#> 18: 56 2018-11-16 0
#> 19: 56 2019-02-27 0
#> 20: 56 2019-04-09 1
#> 21: 56 2019-04-13 2
#> 22: 56 2019-04-25 3
#> 23: 56 2019-05-13 4
#> emails dates counts
Created on 2019-09-18 by the reprex package (v0.3.0)
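A loop-free alternative (a sketch, not part of the answers above; win_start is a hypothetical helper column, and the four-row dat mimics the first rows of the output) counts each row's prior same-email applications with a self non-equi join:

```r
library(data.table)
dat <- data.table(emails = c("1", "1", "10", "10"),
                  dates  = as.IDate(c("2018-06-16", "2019-02-15", "2018-09-08", "2018-09-26")))
dat[, win_start := dates - 90L]
# for each row, count same-email rows whose date falls strictly inside (dates - 90, dates)
dat[, counts := dat[dat, on = .(emails, dates > win_start, dates < dates),
                    sum(!is.na(x.dates)), by = .EACHI]$V1]
dat[, win_start := NULL]
```

For these rows this reproduces the counts 0, 0, 0, 1 shown in the output above, using the same strict bounds as the loop's diffs < 90 & diffs > 0 condition.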
