R Coding: Creating a complex conditional summary

I have a (cut-down) table containing the following pieces of information relating to a credit application process:
Date of Application
Email Address
The table can contain the same email address multiple times but with a different application date (it can be assumed that the same person has applied multiple times).
I would like to add a third column that tells me how many other applications have been seen with the same email address in the 90 days prior to the application date.
How would I do this in R? Creating a summary by email address would be straightforward, but adding the 90-day condition is the tricky part for me.
Coming from SAS, I'd sort the table by email address and then use a lag function, but any help with R would be massively appreciated.
Thanks for reading.

A reproducible example would have been pretty helpful here, but here's my best shot without one. What you're asking for could be done many ways. The easiest to program is probably a for loop over the rows of the data.
library(data.table)
library(lubridate)

set.seed(124)

# Simulate the data: each of 500 emails gets one or more random application
# dates; the vectors are seeded with a single placeholder row.
emails <- 'None'
dates <- ymd('1900/01/01')
n_email <- 500
for (i in seq_len(n_email)) {
  n <- rpois(1, 3) + 1
  d <- sample(seq(ymd('2018/01/01'), ymd('2019/09/01'), by = 'day'), n)
  emails <- c(emails, rep(as.character(i), n))
  dates <- c(dates, d)
}
dat <- data.table(emails, dates)
dat <- dat[order(emails, dates)]
dat[, counts := 0][]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 0
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 0
#> 1943: 99 2018-07-07 0
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 0
#> 1946: None 1900-01-01 0
# For each row, count applications from the same email that occurred
# strictly within the 90 days before that row's application date
for (i in 1:nrow(dat)) {
  diffs <- difftime(dat[i, dates], dat[emails == dat[i, emails], dates], units = 'days')
  count <- sum(diffs < 90 & diffs > 0)
  dat[i, counts := count]
}
dat[]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 1
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 1
#> 1943: 99 2018-07-07 2
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 1
#> 1946: None 1900-01-01 0
dat[emails %in% dat[counts > 3,emails]][order(emails, dates)]
#> emails dates counts
#> 1: 396 2018-05-27 0
#> 2: 396 2018-07-10 1
#> 3: 396 2018-10-02 1
#> 4: 396 2019-02-13 0
#> 5: 396 2019-04-21 1
#> 6: 396 2019-04-22 2
#> 7: 396 2019-04-27 3
#> 8: 396 2019-05-02 4
#> 9: 396 2019-06-13 4
#> 10: 496 2018-03-06 0
#> 11: 496 2019-01-31 0
#> 12: 496 2019-04-08 1
#> 13: 496 2019-06-10 1
#> 14: 496 2019-06-24 2
#> 15: 496 2019-07-11 2
#> 16: 496 2019-07-23 3
#> 17: 496 2019-08-25 4
#> 18: 56 2018-11-16 0
#> 19: 56 2019-02-27 0
#> 20: 56 2019-04-09 1
#> 21: 56 2019-04-13 2
#> 22: 56 2019-04-25 3
#> 23: 56 2019-05-13 4
#> emails dates counts
Created on 2019-09-18 by the reprex package (v0.3.0)
However, here's a more concise and efficient way that makes better use of data.table's capabilities. Note that this approach doesn't require a pre-sort.
library(data.table)
library(lubridate)

set.seed(124)

emails <- 'None'
dates <- ymd('1900/01/01')
n_email <- 500
for (i in seq_len(n_email)) {
  n <- rpois(1, 3) + 1
  d <- sample(seq(ymd('2018/01/01'), ymd('2019/09/01'), by = 'day'), n)
  emails <- c(emails, rep(as.character(i), n))
  dates <- c(dates, d)
}
dat <- data.table(emails, dates)
# Shuffle the rows to show that no pre-sort is required
dat <- dat[sample(seq_len(nrow(dat)))]
dat
dat
#> emails dates
#> 1: 70 2018-12-21
#> 2: 416 2018-10-02
#> 3: 289 2018-12-14
#> 4: 87 2018-03-02
#> 5: 441 2018-12-08
#> ---
#> 1942: 365 2018-01-25
#> 1943: 200 2019-02-02
#> 1944: 14 2019-03-20
#> 1945: 166 2018-06-20
#> 1946: 161 2018-02-07
dat[order(dates),
    counts := sapply(1:.N, FUN = function(i) {
      if (i == 1) return(0)
      # Day gaps between consecutive applications within this email group
      x <- c(0, diff(dates))
      days <- 0
      place <- i
      ret <- 0
      # Walk backwards, accumulating gaps; every earlier application still
      # within the 90-day window adds one to the count
      while (days < 90 & place > 1) {
        if (x[place] + days < 90) ret <- ret + 1
        days <- days + x[place]
        place <- place - 1
      }
      ret
    }),
    by = emails][order(emails, dates)]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 1
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 1
#> 1943: 99 2018-07-07 2
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 1
#> 1946: None 1900-01-01 0
dat[emails %in% dat[counts > 3,emails]][order(emails, dates)]
#> emails dates counts
#> 1: 396 2018-05-27 0
#> 2: 396 2018-07-10 1
#> 3: 396 2018-10-02 1
#> 4: 396 2019-02-13 0
#> 5: 396 2019-04-21 1
#> 6: 396 2019-04-22 2
#> 7: 396 2019-04-27 3
#> 8: 396 2019-05-02 4
#> 9: 396 2019-06-13 4
#> 10: 496 2018-03-06 0
#> 11: 496 2019-01-31 0
#> 12: 496 2019-04-08 1
#> 13: 496 2019-06-10 1
#> 14: 496 2019-06-24 2
#> 15: 496 2019-07-11 2
#> 16: 496 2019-07-23 3
#> 17: 496 2019-08-25 4
#> 18: 56 2018-11-16 0
#> 19: 56 2019-02-27 0
#> 20: 56 2019-04-09 1
#> 21: 56 2019-04-13 2
#> 22: 56 2019-04-25 3
#> 23: 56 2019-05-13 4
#> emails dates counts
Created on 2019-09-18 by the reprex package (v0.3.0)
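If you want to avoid explicit looping altogether, here's a sketch (my addition, not part of either approach above) using a data.table non-equi self-join; the win_start helper column is introduced only to express the window, and the result should reproduce the loop's counts.
# My sketch: count, for each row, same-email applications strictly within
# the previous 90 days (the same 0 < diff < 90 rule as the loop above).
dat[, win_start := dates - 90]
dat[, counts2 := dat[dat,
                     on = .(emails, dates < dates, dates > win_start),
                     by = .EACHI,
                     sum(!is.na(x.dates))]$V1]
dat[, win_start := NULL]
With by = .EACHI the join returns one row per application in its original order, so the counts line up with dat; unmatched rows sum to zero because x.dates is NA for them.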

Related

Extract overlapping and non-overlapping time periods using R (data.table)

I have a dataset containing time periods during which an intervention is happening. We have two types of interventions. I have the start and end date of each intervention. I would now like to extract the time (in days) when there is no overlap between the two types and how much overlap there is.
Here's an example dataset:
data <- data.table(
  id = seq(1, 21),
  type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24",
                       "2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09",
                       "2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
                       "2020-05-28", "2020-08-25", "2020-12-14")),
  end_dt = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09",
                     "2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15",
                     "2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29",
                     "2020-08-19", "2020-11-16", "2021-03-07")))
> data
id type start_dt end_dt
1: 1 1 2015-01-09 2017-07-24
2: 2 2 2015-04-14 2015-05-04
3: 3 2 2015-06-19 2015-08-27
4: 4 2 2015-10-30 2015-11-19
5: 5 2 2016-03-01 2016-03-21
6: 6 2 2016-05-24 2016-06-09
7: 7 2 2016-08-03 2017-07-18
8: 8 2 2017-08-18 2019-02-21
9: 9 1 2017-08-18 2018-01-23
10: 10 1 2018-02-01 2018-04-25
11: 11 1 2018-05-07 2018-07-29
12: 12 1 2018-08-09 2019-01-15
13: 13 1 2019-01-31 2019-04-24
14: 14 2 2019-03-22 2019-09-13
15: 15 1 2019-05-16 2019-10-13
16: 16 2 2019-11-04 2020-12-23
17: 17 1 2019-11-04 2020-01-26
18: 18 1 2020-02-06 2020-04-29
19: 19 1 2020-05-28 2020-08-19
20: 20 1 2020-08-25 2020-11-16
21: 21 1 2020-12-14 2021-03-07
Here's a plot of the data for a better view of what I want to know:
library(ggplot2)
ggplot(data = data,
       aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +
  geom_segment(size = 2) +
  xlab("") +
  ylab("") +
  theme_bw()
I'll describe the first part of the example: we have an intervention of type 1 from 2015-01-09 until 2017-07-24. From 2015-04-14, however, intervention type 2 is also happening. This means that we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days.
Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1 from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have had (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. Then we continue like this for the entire time period.
I would like to know the total time (in days) of "pure" type 1, "pure" type 2 and overlap.
Alternatively, if also possible, I would like to organise the data such that all the separate time periods are extracted, meaning that the data would look something like this (type 3 = overlap):
> data_adjusted
id type start_dt end_dt
1: 1 1 2015-01-09 2015-04-14
2: 2 3 2015-04-15 2015-05-04
3: 3 1 2015-05-05 2015-06-18
4: 4 3 2015-06-19 2015-08-27
........
The time in days spent in each intervention type can then easily be calculated from data_adjusted.
I have found similar answers using dplyr, or answers that just mark overlapping time periods, but not one for my specific case.
Is there an efficient way to calculate this using data.table?
This method does a small explosion, looking at every date in the range, so it may not scale very well if your data gets large.
library(data.table)
library(magrittr)  # for the %>% pipe used below

# One row per day across the whole range
alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
  .[, .N, by = .(start_dt, type)] %>%
  .[!is.na(type), ] %>%
  dcast(start_dt ~ type, value.var = "N") %>%
  .[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt")] %>%
  .[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
        start_dt = min(start_dt), end_dt = max(start_dt)), by = r]
# r type start_dt end_dt
# <int> <char> <Date> <Date>
# 1: 1 1 2015-01-09 2015-04-13
# 2: 2 3 2015-04-14 2015-05-04
# 3: 3 1 2015-05-05 2015-06-18
# 4: 4 3 2015-06-19 2015-08-27
# 5: 5 1 2015-08-28 2015-10-29
# 6: 6 3 2015-10-30 2015-11-19
# 7: 7 1 2015-11-20 2016-02-29
# 8: 8 3 2016-03-01 2016-03-21
# 9: 9 1 2016-03-22 2016-05-23
# 10: 10 3 2016-05-24 2016-06-09
# 11: 11 1 2016-06-10 2016-08-02
# 12: 12 3 2016-08-03 2017-07-18
# 13: 13 1 2017-07-19 2017-07-24
# 14: 14 3 2017-08-18 2018-01-23
# 15: 15 2 2018-01-24 2018-01-31
# 16: 16 3 2018-02-01 2018-04-25
# 17: 17 2 2018-04-26 2018-05-06
# 18: 18 3 2018-05-07 2018-07-29
# 19: 19 2 2018-07-30 2018-08-08
# 20: 20 3 2018-08-09 2019-01-15
# 21: 21 2 2019-01-16 2019-01-30
# 22: 22 3 2019-01-31 2019-02-21
# 23: 23 1 2019-02-22 2019-03-21
# 24: 24 3 2019-03-22 2019-04-24
# 25: 25 2 2019-04-25 2019-05-15
# 26: 26 3 2019-05-16 2019-09-13
# 27: 27 1 2019-09-14 2019-10-13
# 28: 28 3 2019-11-04 2020-01-26
# 29: 29 2 2020-01-27 2020-02-05
# 30: 30 3 2020-02-06 2020-04-29
# 31: 31 2 2020-04-30 2020-05-27
# 32: 32 3 2020-05-28 2020-08-19
# 33: 33 2 2020-08-20 2020-08-24
# 34: 34 3 2020-08-25 2020-11-16
# 35: 35 2 2020-11-17 2020-12-13
# 36: 36 3 2020-12-14 2020-12-23
# 37: 37 1 2020-12-24 2021-03-07
# r type start_dt end_dt
This drops the id field; I don't know of a good way to map it back to your original data.
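A possible follow-up (my addition, not part of the original answer): assuming the chained result above is saved to a variable, say segments, the day totals the question asks for become a single grouped sum, counting days inclusively.
# segments is assumed to hold the type/start_dt/end_dt table built above
segments[, .(total_days = sum(as.integer(end_dt - start_dt) + 1)), by = type][order(type)]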
@r2evans' solution is more complete, but if you want to explore the use of foverlaps you can start with something like this:
# Split into two frames, one per type
data <- split(data, by = "type")
# Key the second frame
setkey(data[[2]], start_dt, end_dt)
# Create the rows that have overlaps
overlap <- foverlaps(data[[1]], data[[2]], type = "any", nomatch = 0)
# Get the overlapping time periods
overlap[, .(start_dt = max(start_dt, i.start_dt), end_dt = min(end_dt, i.end_dt)),
        by = 1:nrow(overlap)][, type := 3]
Output:
nrow start_dt end_dt type
1: 1 2015-04-14 2015-05-04 3
2: 2 2015-06-19 2015-08-27 3
3: 3 2015-10-30 2015-11-19 3
4: 4 2016-03-01 2016-03-21 3
5: 5 2016-05-24 2016-06-09 3
6: 6 2016-08-03 2017-07-18 3
7: 7 2017-08-18 2018-01-23 3
8: 8 2018-02-01 2018-04-25 3
9: 9 2018-05-07 2018-07-29 3
10: 10 2018-08-09 2019-01-15 3
11: 11 2019-01-31 2019-02-21 3
12: 12 2019-03-22 2019-04-24 3
13: 13 2019-05-16 2019-09-13 3
14: 14 2019-11-04 2020-01-26 3
15: 15 2020-02-06 2020-04-29 3
16: 16 2020-05-28 2020-08-19 3
17: 17 2020-08-25 2020-11-16 3
18: 18 2020-12-14 2020-12-23 3
The sum of those overlap days is 1492.

Define periods/episodes of exposure with overlapping and concatenated intervals of time

I'm trying to identify periods/episodes of exposure to a drug from its prescriptions. If two prescriptions are separated by more than 30 days, that starts a new period/episode of exposure. Prescriptions can overlap in time or be consecutive. Gaps accumulated across several consecutive prescriptions do not trigger a new episode, even if together they exceed 30 days; only a single gap longer than 30 days does.
I have data like this:
id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference between each start and the previous observation's end (last_diffdays):
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows when an overlap happens (negative values) or not (positive values). I think an ifelse/fcase chain here would be a bad idea, and I'm not comfortable writing one.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposure (noexp_days) since the previous prescription.
I don't know if I'm on the right path, but I think I need to calculate the noexp_days variable and then do cumsum(noexp_days > 30) + 1.
If there is a much better solution I'm not seeing, or any other possibility I haven't considered, I would appreciate reading about it.
Thanks in advance for any help! :)
Try:
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]
DT[, period := cumsum(noexp_days > 30) + 1, id]
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3
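A possible next step (my addition, not part of the original answer) is to collapse each episode to a single row, taking the earliest start and latest end within each id/period:
# One row per exposure episode (hypothetical follow-up)
DT[, .(episode_start = min(start), episode_end = max(end)), by = .(id, period)]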

R data.table cumulative sum over time intervals

I have a table with values that exist during specific time intervals. I want a field that, for a given ID, sums all values whose intervals are active at the start time of that row's interval.
Here is a pared-down example:
x <- data.table(
  ID = c(rep(1, 5), rep(2, 5)),
  DT_START = as.Date(c('2017-05-28', '2017-05-29', '2017-07-03', '2018-05-28', '2018-05-29',
                       '2019-07-03', '2019-10-08', '2020-05-28', '2020-05-29', '2020-07-03')),
  DT_END = as.Date(c('2018-05-28', '2018-05-29', '2018-07-03', '2018-05-29', '2019-05-28',
                     '2019-10-08', '2020-07-03', '2021-05-28', '2021-05-29', '2020-10-03')),
  VALUE = c(300, 400, 200, 100, 150, 250, 350, 50, 10, 45))
The table looks like this:
x
ID DT_START DT_END VALUE
1: 1 2017-05-28 2018-05-28 300
2: 1 2017-05-29 2018-05-29 400
3: 1 2017-07-03 2018-07-03 200
4: 1 2018-05-28 2018-05-29 100
5: 1 2018-05-29 2019-05-28 150
6: 2 2019-07-03 2019-10-08 250
7: 2 2019-10-08 2020-07-03 350
8: 2 2020-05-28 2021-05-28 50
9: 2 2020-05-29 2021-05-29 10
10: 2 2020-07-03 2020-10-03 45
In the first row, that's the first start date for that ID and there are no equal dates, so the cumulative value would be just 300. By the second row, we now add the 300+400 to get 700, because as of 5/29/2017, both the 400 and the 300 were active for ID 1. The full desired output vector is obtained using the following code:
x[, VALUE_CUM := sum(x$VALUE * ifelse(x$ID == ID & x$DT_START <= DT_START & x$DT_END > DT_START, 1, 0)),
  by = .(ID, DT_START)]
x
ID DT_START DT_END VALUE VALUE_CUM
1: 1 2017-05-28 2018-05-28 300 300
2: 1 2017-05-29 2018-05-29 400 700
3: 1 2017-07-03 2018-07-03 200 900
4: 1 2018-05-28 2018-05-29 100 700
5: 1 2018-05-29 2019-05-28 150 350
6: 2 2019-07-03 2019-10-08 250 250
7: 2 2019-10-08 2020-07-03 350 350
8: 2 2020-05-28 2021-05-28 50 400
9: 2 2020-05-29 2021-05-29 10 410
10: 2 2020-07-03 2020-10-03 45 105
This works, but it takes way too long on my huge data table with millions of rows. Any ideas for how to do this more elegantly so it runs faster?
Thanks!
Here is a possible way to do it:
y <- x[x,
       .(DT_END2 = i.DT_END, VALUE = i.VALUE, VALUE_CUM = sum(x.VALUE)),
       on = .(ID, DT_START <= DT_START, DT_END > DT_START),
       by = .EACHI]
# In the join result, DT_END is overwritten by the values of DT_START,
# so DT_END2 backs up the correct DT_END values.
y[, DT_END := DT_END2][, DT_END2 := NULL]
y
# ID DT_START DT_END VALUE VALUE_CUM
# 1: 1 2017-05-28 2018-05-28 300 300
# 2: 1 2017-05-29 2018-05-29 400 700
# 3: 1 2017-07-03 2018-07-03 200 900
# 4: 1 2018-05-28 2018-05-29 100 700
# 5: 1 2018-05-29 2019-05-28 150 350
# 6: 2 2019-07-03 2019-10-08 250 250
# 7: 2 2019-10-08 2020-07-03 350 350
# 8: 2 2020-05-28 2021-05-28 50 400
# 9: 2 2020-05-29 2021-05-29 10 410
# 10: 2 2020-07-03 2020-10-03 45 105
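A variant sketch (my addition, not from the original answer) that skips the DT_END2 backup by assigning the aggregated sum straight back to x by reference:
# Non-equi self-join; by = .EACHI yields one result per row of the inner x,
# in its original order, so the sums line up with x's rows.
x[, VALUE_CUM2 := x[x,
                    on = .(ID, DT_START <= DT_START, DT_END > DT_START),
                    by = .EACHI,
                    sum(x.VALUE)]$V1]
VALUE_CUM2 should match the VALUE_CUM computed above.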

What is the best way to use dplyr/dtplyr to make complex updates to a data.table

We have written a package to analyse a large number of events in relation to time windows. To do the analysis we need to establish a number of attributes of the windows and cross-references between them.
This has been done using data.table in its native syntax. Examples of some of the steps are included in the reprex below.
We are now looking to re-frame this package using dplyr/dtplyr for readability and for sharing with other parties.
While I can write the 'queries' in dplyr syntax, I am not seeing a tidyverse way to apply updates to the underlying tables (adding columns, updating rows, etc.) without repeatedly creating and replacing copies.
When the data is large, the 'update in place' features of data.table are very desirable. Is there a way to take advantage of this in the dplyr syntax? (I have hit barriers with immutable = FALSE and attempts to use rows_update().)
library(data.table)
set.seed(123)

# Create a table of events with a timestamp and an event type
# (501 events randomly generated over the previous 30 days)
DT1 <- data.table(
  timeStamp = as.POSIXct('2021-03-25') - as.integer(runif(501) * 60 * 1440 * 30),
  eventType = c('A', 'B', 'C'))
setkey(DT1, timeStamp)
print(DT1)
#> timeStamp eventType
#> 1: 2021-02-23 00:42:37 A
#> 2: 2021-02-23 04:21:43 A
#> 3: 2021-02-23 05:23:51 C
#> 4: 2021-02-23 06:45:36 C
#> 5: 2021-02-23 08:34:32 B
#> ---
#> 497: 2021-03-24 11:32:09 A
#> 498: 2021-03-24 13:49:53 B
#> 499: 2021-03-24 14:26:55 C
#> 500: 2021-03-24 18:11:33 C
#> 501: 2021-03-24 20:13:51 A
#Create a table of time windows, one for each date represented, with an early and late time for each.
#Assign each window a class (in this example, the most common eventType on that date).
DT2 <- DT1[, keyby = .(date = lubridate::date(timeStamp)),
           .(earlyTime = min(timeStamp - 1),
             lateTime = max(timeStamp + 1),
             as = sum(eventType == 'A'),
             bs = sum(eventType == 'B'),
             cs = sum(eventType == 'C'))][
  , .(date,
      earlyTime,
      lateTime,
      class = ifelse(as >= bs & as >= cs, 'A', ifelse(bs >= cs, 'B', 'C')))]
print(head(DT2))
#> date earlyTime lateTime class
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A
#Give each row in DT2 a row number (so that we can readily cross-reference between rows)
DT2[order(lateTime), rn := .I]
#For each row, get the row number of the previous instance of this class
DT2[order(class, rn), prevOfClass := shift(rn, 1), by=.(class)]
print(head(DT2))
#> date earlyTime lateTime class rn prevOfClass
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B 1 NA
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B 2 1
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A 3 NA
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B 4 2
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A 5 3
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A 6 5
#For each row that is not a 'C', find the previous and next instances of a C-type row.
#Note that when we assigned rn we ensured that the rows were in ascending time order,
#so rn can be used as a proxy for sorting by time.
DT2[class == 'C'][DT2[class != 'C'],
                  on = .(rn > rn),
                  by = .EACHI,
                  .(rn = i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))]
#> rn rn nextC prevC
#> 1: 1 1 8 NA
#> 2: 2 2 8 NA
#> 3: 3 3 8 NA
#> 4: 4 4 8 NA
#> 5: 5 5 8 NA
#> 6: 6 6 8 NA
#> 7: 7 7 8 NA
#> 8: 9 9 13 8
#> 9: 10 10 13 8
#> 10: 11 11 13 8
#> 11: 12 12 13 8
#> 12: 14 14 16 13
#> 13: 15 15 16 13
#> 14: 17 17 26 16
#> 15: 18 18 26 16
#> 16: 19 19 26 16
#> 17: 20 20 26 16
#> 18: 21 21 26 16
#> 19: 22 22 26 16
#> 20: 23 23 26 16
#> 21: 24 24 26 16
#> 22: 25 25 26 16
#> 23: 28 28 30 27
#> 24: 29 29 30 27
#> rn rn nextC prevC
#But I want to add this information as additional columns to the base table
DT2[DT2[class == 'C'][DT2[class != 'C'],
                      on = .(rn > rn),
                      by = .EACHI,
                      .(rn = i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))],
    on = .(rn),
    ':='(nextC = i.nextC, prevC = i.prevC)]
print(DT2[,.(rn, date, class, prevOfClass, nextC, prevC)])
#> rn date class prevOfClass nextC prevC
#> 1: 1 2021-02-23 B NA 8 NA
#> 2: 2 2021-02-24 B 1 8 NA
#> 3: 3 2021-02-25 A NA 8 NA
#> 4: 4 2021-02-26 B 2 8 NA
#> 5: 5 2021-02-27 A 3 8 NA
#> 6: 6 2021-02-28 A 5 8 NA
#> 7: 7 2021-03-01 A 6 8 NA
#> 8: 8 2021-03-02 C NA NA NA
#> 9: 9 2021-03-03 A 7 13 8
#> 10: 10 2021-03-04 A 9 13 8
#> 11: 11 2021-03-05 B 4 13 8
#> 12: 12 2021-03-06 A 10 13 8
#> 13: 13 2021-03-07 C 8 NA NA
#> 14: 14 2021-03-08 A 12 16 13
#> 15: 15 2021-03-09 B 11 16 13
#> 16: 16 2021-03-10 C 13 NA NA
#> 17: 17 2021-03-11 A 14 26 16
#> 18: 18 2021-03-12 B 15 26 16
#> 19: 19 2021-03-13 A 17 26 16
#> 20: 20 2021-03-14 B 18 26 16
#> 21: 21 2021-03-15 A 19 26 16
#> 22: 22 2021-03-16 A 21 26 16
#> 23: 23 2021-03-17 A 22 26 16
#> 24: 24 2021-03-18 A 23 26 16
#> 25: 25 2021-03-19 B 20 26 16
#> 26: 26 2021-03-20 C 16 NA NA
#> 27: 27 2021-03-21 C 26 NA NA
#> 28: 28 2021-03-22 B 25 30 27
#> 29: 29 2021-03-23 A 24 30 27
#> 30: 30 2021-03-24 C 27 NA NA
#> rn date class prevOfClass nextC prevC
#What would be the best approach to this using dplyr / dtplyr syntax?
#In practice there are many hundreds of thousands of rows in the tables
#and there are many more updates and enrichments that need to be applied,
#some of which add new columns, while others update just a few rows
#in a column.
#So 'mutate in place/by reference' is highly desirable
Created on 2021-03-25 by the reprex package (v1.0.0)
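No accepted answer is reproduced here, but one avenue worth testing (a sketch of my own, not from the post) is dtplyr's immutable = FALSE option, which permits dtplyr to modify the input data.table in place where the translation allows it:
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(data.table)

# With immutable = FALSE, dtplyr may translate mutate() into a by-reference
# `:=` on DT2 itself rather than copy(DT2)[...]; show_query() reveals the
# generated data.table call before anything is executed.
dt2 <- lazy_dt(DT2, immutable = FALSE)
dt2 %>%
  mutate(isC = class == 'C') %>%
  show_query()
Whether this covers the row-subset updates and join-updates in the reprex depends on the dtplyr version, so treat it as a starting point rather than a full answer.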

R vlookup combined with if...and

As a beginner in R, I'm having trouble with an issue that is complex for me.
I want to add a new column containing a "1" when data$Date falls between (or exactly on) lookup$Begin and lookup$End. Identification_no is the key for both data sets.
If data$Date is not between lookup$Begin and lookup$End, the new column should contain a "0".
The two data frames have different numbers of observations.
Here's my basic data frame:
> data
# A tibble: 6 x 2
Date Identification_no
* <date> <dbl>
1 2018-08-25 13
2 2018-02-03 54
3 2018-09-01 31
4 2018-11-10 54
5 2018-08-04 60
6 2018-07-07 58
Here's my lookup data frame:
> lookup
# A tibble: 6 x 3
Begin End Identification_no
* <date> <date> <dbl>
1 2017-01-26 2017-01-26 53
2 2017-01-26 2017-01-26 53
3 2017-01-26 2017-01-26 53
4 2017-01-26 2017-01-26 53
5 2017-01-26 2017-01-26 53
6 2017-01-26 2017-01-26 53
Thanks for your inputs in advance.
EDIT: new sample data
> data
# A tibble: 6 x 2
Date Identification_no
<date> <dbl>
1 2018-08-25 13
2 2018-02-03 54
3 2018-09-01 31
4 2018-11-10 54
5 2018-08-04 60
6 2018-07-07 58
> lookup
# A tibble: 6 x 3
Begin End Identification_no
<date> <date> <dbl>
1 2018-08-20 2018-08-27 13
2 2018-09-01 2018-09-08 53
3 2018-01-09 2018-01-23 20
4 2018-10-16 2018-10-30 4
5 2017-12-22 2017-12-29 54
6 2017-10-31 2017-11-07 66
Result using the method described below:
> final
Begin End Identification_no match_col
1: 2018-08-25 2018-08-25 13 1
2: 2018-02-03 2018-02-03 54 0
3: 2018-09-01 2018-09-01 31 0
4: 2018-11-10 2018-11-10 54 0
5: 2018-08-04 2018-08-04 60 0
6: 2018-07-07 2018-07-07 58 0
Works perfectly fine - thanks for your solution.
Best regards,
Paul
Could do:
library(data.table)
setDT(data)[, Date := as.Date(Date)]
setDT(lookup)[, `:=`(Begin = as.Date(Begin), End = as.Date(End), match_col = 1)]
final <- unique(lookup, by = c("Begin", "End", "Identification_no"))[
  data, on = .(Begin <= Date, End >= Date, Identification_no)][
    is.na(match_col), match_col := 0]
On your example dataset, this would give:
final
Begin End Identification_no match_col
1: 2018-08-25 2018-08-25 13 0
2: 2018-02-03 2018-02-03 54 0
3: 2018-09-01 2018-09-01 31 0
4: 2018-11-10 2018-11-10 54 0
5: 2018-08-04 2018-08-04 60 0
6: 2018-07-07 2018-07-07 58 0
... but only because there's genuinely no match in the original sample data.
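A variant worth noting (my addition, not part of the original answer): joining in the other direction updates data by reference and keeps its original columns, so Begin/End don't get overwritten in the output.
# Hypothetical alternative: flag rows of `data` directly via a non-equi
# join-update; rows with no match keep the default 0.
data[, match_col := 0]
data[lookup, on = .(Identification_no, Date >= Begin, Date <= End), match_col := 1]
Duplicates in lookup are harmless here, since matched rows are simply set to 1 however many times they match.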
