I have a set of observations, each recorded when a user took an action. I want to keep only those observations for a user that are six or more months apart.
So, if a user has taken this action on "2018-01-01", "2018-03-01" and "2018-07-01", I want to keep only "2018-01-01" and "2018-07-01".
Similarly, if a user has taken an action on "2018-01-01", "2018-03-01", "2018-07-01" and "2019-03-01" I want to keep only "2018-01-01", "2018-07-01", "2019-03-01".
So far, I have produced long and unworkable code.
# What I want to achieve
library(data.table)
dataIhave <- data.table(id = c(1, 1, 1, 1, 2, 2, 3, 4),
                        dates = c("2018-01-01",
                                  "2018-03-01",
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-03",
                                  "2018-07-02",
                                  "2018-02-01",
                                  "2018-02-01"))
dataIwant <- data.table(id = c(1, 1, 1, 2, 3, 4),
                        dates = c("2018-01-01",
                                  "2018-07-01",
                                  "2019-01-01",
                                  "2018-01-03",
                                  "2018-02-01",
                                  "2018-02-01"))
This is a rolling-join variant of @Uwe's answer:
library(lubridate)
dataIhave[, dates := as.IDate(dates)]
ids = unique(dataIhave$id)
dataIhave[, seq := NA_integer_]
s = 1L
w = dataIhave[.(ids), on=.(id), mult="first", which = TRUE]
dataIhave[w, seq := s]
while (TRUE) {
  w = dataIhave[
    dataIhave[w, .(id, dates = dates %m+% months(6))],
    on = .(id, dates), roll = -Inf, nomatch = 0, which = TRUE
  ]
  if (!length(w)) break
  s = s + 1L
  dataIhave[w, seq := s]
}
dataIhave[!is.na(seq)]
id dates seq
1: 1 2018-01-01 1
2: 1 2018-07-01 2
3: 1 2019-01-01 3
4: 2 2018-01-03 1
5: 3 2018-02-01 1
6: 4 2018-02-01 1
The loop takes rows w defined per id, steps their dates forward six months, and takes the next row found, if any. The arguments to the join are:
The tables, with join syntax x[i, ...]
x = dataIhave
i = dataIhave[w, .(id, dates = dates %m+% months(6))]
on = .(id, dates): columns to match by
roll = -Inf: find the next match on the last column in on=
nomatch = 0: if no match is found, skip
which = TRUE: return matched row number
Additionally, if there are duplicate dates (see the second example in @Uwe's post):
mult = "first": take only the first match for each row of i
In selecting the first row by id before the loop, I'm assuming the data is sorted by dates within id (so I'm not using order as @Uwe's answer does).
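If roll = -Inf is unfamiliar, here is a minimal illustration on made-up data (not part of the question): a date in i with no exact match is matched to the next x date at or after it.
library(data.table)
x <- data.table(dates = as.IDate(c("2018-01-01", "2018-07-01", "2019-01-01")), val = 1:3)
i <- data.table(dates = as.IDate("2018-06-15"))
x[i, on = .(dates), roll = -Inf]
#         dates val
# 1: 2018-06-15   2    # rolled forward to the row for 2018-07-01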
If I understand correctly, the OP wants to drop those dates which are less than six months after the beginning of a period and to start a new period at the first date that is at least six months after the beginning of the previous period (separately for each id).
I have no idea how this can be accomplished by a non-recursive rolling or non-equi join as there is no fixed grid of dates. So, I believe it requires a kind of recursive approach, somehow. Here is one possibility:
library(data.table)
library(lubridate)

dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := TRUE]

dataIhave[order(id, dates), keep := {
  start <- dates[1L]
  for (i in tail(seq_along(dates), -1L)) {
    if (dates[i] < start %m+% months(6)) {
      keep[i] <- FALSE
    } else {
      start <- dates[i]
    }
  }
  keep
}, by = id][]
id dates keep
1: 1 2018-01-01 TRUE
2: 1 2018-03-01 FALSE
3: 1 2018-07-01 TRUE
4: 1 2019-01-01 TRUE
5: 2 2018-01-03 TRUE
6: 2 2018-07-02 FALSE
7: 3 2018-02-01 TRUE
8: 4 2018-02-01 TRUE
Finally,
dataIhave[(keep), -"keep"]
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
2nd test case
The crucial point here is to detect the beginning of a new period (within each id).
As an additional test case I have added two dates to id == 1:
2018-07-01 and 2018-07-02.
2018-07-01 is a duplicate. Both added dates should be removed as they lie within the second 6-month period, which starts at 2018-07-01.
dataIhave <- fread("
id dates
1 2018-01-01
1 2018-03-01
1 2018-07-01
1 2018-07-01
1 2018-07-02
1 2019-01-01
2 2018-01-03
2 2018-07-02
3 2018-02-01
4 2018-02-01")
Indeed, the code above returns the same output as with OP's original test case.
Remove rows only within the first six months for each id
If the question is interpreted as removing entries only within the first 6-month period for each id and keeping all the dates after 6 months, this can be achieved by
dataIhave[!dataIhave[, .I[dates < dates[1L] %m+% months(6L)][-1L], by = id]$V1]
which returns
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2018-07-01
4: 1 2018-07-02
5: 1 2019-01-01
6: 2 2018-01-03
7: 3 2018-02-01
8: 4 2018-02-01
for the second test case. (Note that this is a streamlined version of Jaap's answer.)
Another variant:
library(lubridate)
library(data.table)
dataIhave[, dates := as.Date(dates)]
dataIhave[, keep := dates >= dates[1] %m+% months(6), by = id
][dataIhave[, .I[1], by = id][[2]], keep := TRUE
][!!keep, -"keep"]
which gives:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
Using non-equi join and igraph to avoid implicit loops and recursion:
# data prep
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
setorder(dataIhave[, rn := rowid(id)], id, dates)
dataIhave[, end := as.IDate(sapply(dates,
                                   function(d) seq(d, by="6 months", length.out=2L)[2L]))]

# non-equi self join to find the first date that is at least 6 months later
nonequi <- dataIhave[dataIhave, on=.(id, dates>=end), mult="first", by=.EACHI,
                     .(i.id, i.rn, x.rn, i.dates, x.dates)]
library(igraph)
nonequi[, {
    # create a graph from the previous join
    g <- graph_from_data_frame(.SD[, .(i.rn, x.rn)])
    # plot(g)

    # find the leaf nodes
    leaf <- sapply(V(g), function(x) length(neighbors(g, x)) == 0L)

    # from the first date (i.e. node V(g)["1"]), find the path starting from this date
    path <- get.all.shortest.paths(g, V(g)["1"], leaf)$res

    # return all dates (i.e. nodes) in this path
    .(dates = i.dates[i.rn %in% na.omit(V(g)[path[[1L]]]$name)])
  },
  by = .(id = i.id)]
output:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
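To make the graph step concrete, this is the edge list the non-equi join produces for id == 1 (an illustrative peek, not part of the original answer; it assumes the question's original sample data): each row points to the first row dated at least six months later, and row 4 has no such row, so the path 1 -> 3 -> 4 is exactly the set of rows to keep.
nonequi[i.id == 1, .(i.rn, x.rn)]
#    i.rn x.rn
# 1:    1    3
# 2:    2    4
# 3:    3    4
# 4:    4   NA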
Or a recursive approach similar to Uwe's solution:
dataIhave[, dates := as.IDate(dates, format="%Y-%m-%d")]
unique(dataIhave[,
  .(dates = as.IDate(Reduce(
      function(x, y) if (y >= seq(x, by="6 months", length.out=2L)[2L]) y else x,
      dates,
      accumulate=TRUE))),
  .(id)])
output:
id dates
1: 1 2018-01-01
2: 1 2018-07-01
3: 1 2019-01-01
4: 2 2018-01-03
5: 3 2018-02-01
6: 4 2018-02-01
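As an illustration (not part of the answer), this is what the accumulated Reduce() returns for id == 1 before unique() collapses the repeats, assuming dataIhave holds the question's original sample data: the last kept date is carried forward until a date at least six months later is reached.
dataIhave[id == 1,
          as.IDate(Reduce(
            function(x, y) if (y >= seq(x, by = "6 months", length.out = 2L)[2L]) y else x,
            dates,
            accumulate = TRUE))]
# [1] "2018-01-01" "2018-01-01" "2018-07-01" "2019-01-01"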
library(lubridate)
library(data.table)
dataIhave[, dates := ymd(dates)]
# weeks elapsed since the user's previous action
dataIhave[, difDates := as.numeric(difftime(dates, shift(dates), units = "weeks")), by = id]
dataIhave[is.na(difDates) | difDates >= 24, .(id, dates)]
Does this produce the result you want?
Months have irregular durations so you'll have to stick to a time unit of fixed duration.
You can also check ?lubridate::interval, lubridate::as.duration and this question: Time difference in years with lubridate?
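For example, a hedged sketch of checking whether two dates are at least six calendar months apart with lubridate intervals (the dates are made up for illustration):
library(lubridate)
d1 <- ymd("2018-01-01")
d2 <- ymd("2018-07-01")
interval(d1, d2) / months(1)                         # 6
time_length(interval(d1, d2), unit = "months") >= 6  # TRUE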
Problem:
I have records with a start and end date for an intervention and I want to merge the rows according to the following rule:
For each ID, any intervention that begins within one year of the last intervention ending, merge the rows so that the start_date is the earliest start date of the two rows, and the end_date is the latest end_date of the two rows.
I also want to keep track of intervention IDs if they are merged.
There can be five scenarios:
Two rows have the same start date, but different end dates.
Start date....End date
Start date.........End date
The period between row 2's start and end date lies within the period of row 1's start and end date.
Start date...................End date
.......Start date...End date
Row 2's intervention starts within Row 1's intervention period but ends later.
Start date.....End date
.....Start date.............End date
Row 2 starts within one year of the end of Row 1.
Start date....End date
......................|....<= 1 year....|Start date...End date
Row 2 starts over one year after the end of Row 1.
Start date...End date
.....................|........ > 1 year..........|Start date...End date
I want to merge rows in cases 1 to 4 but not 5.
Data:
library(data.table)
sample_data <- data.table(id = c(rep(11, 3), rep(21, 4)),
                          start_date = as.Date(c("2013-01-01", "2013-01-01", "2013-02-01", "2013-01-01", "2013-02-01", "2013-12-01", "2015-06-01")),
                          end_date = as.Date(c("2013-06-01", "2013-07-01", "2013-05-01", "2013-07-01", "2013-09-01", "2014-01-01", "2015-12-01")),
                          intervention_id = as.character(1:7),
                          all_ids = as.character(1:7))
> sample_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-06-01 1 1
2: 11 2013-01-01 2013-07-01 2 2
3: 11 2013-02-01 2013-05-01 3 3
4: 21 2013-01-01 2013-07-01 4 4
5: 21 2013-02-01 2013-09-01 5 5
6: 21 2013-12-01 2014-01-01 6 6
7: 21 2015-06-01 2015-12-01 7 7
The final result should look like:
> merged_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2015-06-01 2015-12-01 7 7
I'm not sure if the all_ids column is the best way to keep track of the intervention_ids, so I'm open to ideas for that. (The intervention_ids don't need to be in order in the all_ids column.)
It doesn't matter what the value of the intervention_id column is where rows have been merged.
What I tried:
I started off by writing a function to deal with only those cases where the start date is the same. It's a very non-R, non-data.table way of doing it and therefore very inefficient.
mergestart <- function(unmerged) {
  n <- nrow(unmerged)
  mini_merged <- data.table(id = double(n),
                            start_date = as.Date(NA),
                            end_date = as.Date(NA),
                            intervention_id = character(n),
                            all_ids = character(n))

  merge_a <- function(unmerged, un_i, merged, m_i, no_of_records) {
    merged[m_i] <- unmerged[un_i]
    un_i <- un_i + 1
    while (un_i <= no_of_records) {
      if (merged[m_i]$start_date == unmerged[un_i]$start_date) {
        merged[m_i]$end_date <- max(merged[m_i]$end_date, unmerged[un_i]$end_date)
        merged[m_i]$all_ids <- paste0(merged[m_i]$all_ids, ",", unmerged[un_i]$intervention_id)
        un_i <- un_i + 1
      } else {
        m_i <- m_i + 1
        merged[m_i] <- unmerged[un_i]
        un_i <- un_i + 1
        merge_a(unmerged, un_i, merged, m_i, (no_of_records - un_i))
      }
    }
    return(merged)
  }

  mini_merged <- merge_a(unmerged, 1, mini_merged, 1, n)
  return(copy(mini_merged[id != 0]))
}
Using this function on just one id gives:
> mergestart(sample_data[id == 11])
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 11 2013-02-01 2013-05-01 3 3
To use the function on the whole dataset:
n <- nrow(sample_data)
all_merged <- data.table(id = double(n),
                         start_date = as.Date(NA),
                         end_date = as.Date(NA),
                         intervention_id = character(n),
                         all_ids = character(n))

start_i <- 1
for (i in unique(sample_data$id)) {
  id_merged <- mergestart(sample_data[id == i])
  end_i <- start_i + nrow(id_merged) - 1
  all_merged[start_i:end_i] <- copy(id_merged)
  start_i <- end_i
}
all_merged <- all_merged[id != 0]
> all_merged
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 21 2013-01-01 2013-07-01 4 4
3: 21 2013-02-01 2013-09-01 5 5
4: 21 2013-12-01 2014-01-01 6 6
5: 21 2015-06-01 2015-12-01 7 7
I also had a look at rolling joins but still can't work out how to use them in this situation.
This answer https://stackoverflow.com/a/48747399/6170115 looked promising but I don't know how to integrate all the other conditions and track the intervention IDs with this method.
Can anyone point me in the right direction?
There are related questions How to flatten / merge overlapping time periods and Consolidate rows based on date ranges but none of them has the additional requirements posed by the OP.
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := {
  tmp <- as.integer(start_date)
  cumsum(tmp > shift(cummax(tmp + 365L), type = "lag", fill = 0L))
}, by = id][]
id start_date end_date intervention_id all_ids period
1: 11 2013-01-01 2013-06-01 1 1 1
2: 11 2013-01-01 2013-07-01 2 2 1
3: 11 2013-02-01 2013-05-01 3 3 1
4: 21 2013-01-01 2013-07-01 4 4 1
5: 21 2013-02-01 2013-09-01 5 5 1
6: 21 2013-12-01 2014-01-01 6 6 1
7: 21 2015-06-01 2015-12-01 7 7 2
For the sake of simplicity, it is assumed that one year has 365 days, which ignores leap years with 366 days. If leap years are to be considered, more sophisticated date arithmetic is required.
Unfortunately, cummax() has no method for arguments of class Date or IDate (data.table's integer version). Therefore, the coercion from Date to integer is required.
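To see what the period detection does, here is an illustrative breakdown (not part of the answer; the column names are ad hoc) for id == 21: a new period starts whenever a start_date lies more than 365 days after every earlier start_date.
sample_data[id == 21, {
  tmp <- as.integer(start_date)
  .(start_date,
    prev_max_plus_1y = shift(cummax(tmp + 365L), type = "lag", fill = 0L),
    new_period       = tmp > shift(cummax(tmp + 365L), type = "lag", fill = 0L))
}]
# new_period is TRUE, FALSE, FALSE, TRUE here, so cumsum() yields the periods 1, 1, 1, 2 shown above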
# aggregate
sample_data[, .(start_date = start_date[1L],
                end_date = max(end_date),
                intervention_id = intervention_id[1L],
                all_ids = toString(intervention_id)),
            by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
Edit: Correction
I just noticed that I had misinterpreted the OP's requirements. The OP has requested (emphasis mine):
For each ID, any intervention that begins within one year of the last
intervention ending, merge the rows so that the start_date is the
earliest start date of the two rows, and the end_date is the latest
end_date of the two rows.
The solution above looks for gaps of one year in the sequence of start_date but not between each start_date and the preceding end_date as requested. The corrected version is:
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)

# find periods
sample_data[, period := cumsum(
    as.integer(start_date) > shift(
      cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
  , by = id][]

# aggregate
sample_data[, .(start_date = start_date[1L],
                end_date = max(end_date),
                intervention_id = intervention_id[1L],
                all_ids = toString(intervention_id)),
            by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
The result for the given sample dataset is identical for both versions which caused the error to slip through unrecognized.
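To see the difference, here is a hedged counterexample on made-up data (not from the OP): the second intervention starts more than a year after the first start_date but within a year of the first end_date, so it should be merged.
check <- data.table(id = 1L,
                    start_date = as.Date(c("2013-01-01", "2014-03-01")),
                    end_date   = as.Date(c("2013-06-01", "2014-04-01")))
# original, start_date-based rule: periods 1, 2 (wrongly split)
check[, period_v0 := cumsum(as.integer(start_date) >
        shift(cummax(as.integer(start_date) + 365L), type = "lag", fill = 0L)), by = id]
# corrected, end_date-based rule: periods 1, 1 (merged as requested)
check[, period_v1 := cumsum(as.integer(start_date) >
        shift(cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L)), by = id][]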
Benchmark
The OP has mentioned in a comment that using lubridate's date arithmetic dramatically increased run times.
According to my benchmark below, the penalty of using end_date %m+% years(1) is not that large. I have benchmarked three versions of the code:
v_1 is the corrected version from above.
v_2 pulls the type conversion and the date arithmetic out of the grouping part and creates two helper columns in advance.
v_3 is like v_2 but uses end_date %m+% years(1).
The benchmark is repeated for different problem sizes, i.e., total number of rows. Also, the number of different ids is varied as grouping may have an effect on performance. According to the OP, his full dataset of 500 k rows has 250 k unique ids which corresponds to an id_share of 0.5 (50%). In the benchmark id_shares of 0.5, 0.2, and 0.01 (50%, 20%, 1%) are simulated.
As sample_data is modified, each run starts with a fresh copy.
library(bench)
library(data.table)
library(lubridate)
library(magrittr)

bm <- press(
  id_share = c(0.5, 0.2, 0.01),
  n_row = c(1000L, 10000L, 1e5L),
  {
    n_id <- max(1L, as.integer(n_row * id_share))
    print(sprintf("Number of ids: %i", n_id))
    set.seed(123L)
    sample_data_0 <- lapply(seq(n_id), function(.id) data.table(
      start_date = as.IDate("2000-01-01") + cumsum(sample(0:730, n_row / n_id, TRUE))
    )) %>%
      rbindlist(idcol = "id") %>%
      .[, end_date := start_date + sample(30:360, n_row, TRUE)] %>%
      .[, intervention_id := as.character(.I)]
    mark(
      v_1 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, period := cumsum(
            as.integer(start_date) > shift(
              cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
          , by = id]
        sample_data[, .(start_date = start_date[1L],
                        end_date = max(end_date),
                        intervention_id = intervention_id[1L],
                        all_ids = toString(intervention_id)),
                    by = .(id, period)]
      },
      v_2 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, `:=`(start = as.integer(start_date),
                           end = as.integer(end_date) + 365)]
        sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
                    , by = id]
        sample_data[, .(start_date = start_date[1L],
                        end_date = max(end_date),
                        intervention_id = intervention_id[1L],
                        all_ids = toString(intervention_id)),
                    by = .(id, period)]
      },
      v_3 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, `:=`(start = as.integer(start_date),
                           end = as.integer(end_date %m+% years(1)))]
        sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
                    , by = id]
        sample_data[, .(start_date = start_date[1L],
                        end_date = max(end_date),
                        intervention_id = intervention_id[1L],
                        all_ids = toString(intervention_id)),
                    by = .(id, period)]
      },
      check = FALSE,
      min_iterations = 3
    )
  }
)
ggplot2::autoplot(bm)
The result shows that the number of groups, i.e., the number of unique ids, has a stronger effect on the run time than the differences between the code versions. In the case of many groups, creating the helper columns before grouping (v_2) gains performance.
Consider the following problem: I have two dataframes cases and events. For every case there can be several events (1:n). events$caseId (foreign key) refers to cases$id (primary key):
cases       events
------      ------
id          id
date        caseId
var1        date
var2        var1
...         var2
            ...
Now, I'd like to create a new column in cases giving the number of events for every case (i.e., where events$caseId equals cases$id).
# Sample data
cases<-data.frame(id=1:5, date=c("2017-01-02","2016-02-03","2015-02-12","2016-01-03","2016-08-09"), var1=sample(c("A", "B"), 5, replace=T))
events<-data.frame(id=1:10, date=c("2017-01-01","2016-12-12","2016-07-04","2017-04-03","2015-02-13","2015-01-01","2013-05-07","2015-12-25","2016-05-04","2016-10-11"), caseId=c(1,1,1,1,3,3,3,4,5,5))
# Calculate the number of events for every caseId
library(tidyverse)
events %>%
count(caseId) %>%
right_join(cases, by = c("caseId" = "id"))
The problem I have is that I want to count only those events that happened after the date specified in the case-dataframe (i.e. events$date > cases$date).
I would appreciate any help.
You could do a non-equi join with the data.table package (the date columns need to be converted to Date format first; see below under the Data heading):
library(data.table)
setDT(cases)
setDT(events)
cases[events, on = .(id = caseId, date < date), events := .N, by = .EACHI][]
The result:
id date var1 events
1: 1 2017-01-02 B 1
2: 2 2016-02-03 A NA
3: 3 2015-02-12 A 1
4: 4 2016-01-03 B NA
5: 5 2016-08-09 B 1
A variant of this, which gives a count of 0 instead of NA when a case has no later events:
cases[, events := events[cases, on = .(caseId = id, date > date), .N, by = .EACHI]$N][]
The result of that:
id date var1 events
1: 1 2017-01-02 A 1
2: 2 2016-02-03 A 0
3: 3 2015-02-12 A 1
4: 4 2016-01-03 A 0
5: 5 2016-08-09 A 1
Data:
cases <- data.frame(id=1:5, date=c("2017-01-02","2016-02-03","2015-02-12","2016-01-03","2016-08-09"),
var1=sample(c("A", "B"), 5, replace=T))
events <- data.frame(id=1:10,
date=c("2017-01-01","2016-12-12","2016-07-04","2017-04-03","2015-02-13","2015-01-01","2013-05-07","2015-12-25","2016-05-04","2016-10-11"),
caseId=c(1,1,1,1,3,3,3,4,5,5))
cases$date <- as.Date(cases$date)
events$date <- as.Date(events$date)
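For completeness, a hedged dplyr sketch along the lines of the OP's own attempt (it uses the converted data above; the column name case_date is ad hoc): join the case date onto the events, keep only the later events, then count.
library(dplyr)
events %>%
  inner_join(cases %>% select(id, case_date = date), by = c("caseId" = "id")) %>%
  filter(date > case_date) %>%
  count(caseId) %>%
  right_join(cases, by = c("caseId" = "id"))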
I have a dataset consisting of the dates when a person left the network. A person can leave the network multiple times, as they may join it again after leaving. The following code replicates the scenario.
library(data.table)
Leaving_Date <- data.table(Id = c(1,2,3,4,3,5),
                           Date = as.Date(c("2017-01-01","2017-02-03","2017-01-01",
                                            "2017-03-10","2017-02-09","2017-02-05")))
(Ids repeat in this table, as a person can leave the network multiple times if they join it again.)
> Leaving_Date
Id Date
1: 1 2017-01-01
2: 2 2017-02-03
3: 3 2017-01-01
4: 4 2017-03-10
5: 3 2017-02-09
6: 5 2017-02-05
I have another dataset giving the dates when a particular person was followed up, which can be before or after they left the network. The following code replicates the scenario.
FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
                        Date = as.Date(c("2016-10-01","2017-02-04",
                                         "2017-01-17","2017-02-23", "2017-03-03",
                                         "2017-02-10","2017-02-11","2017-01-01",
                                         "2017-01-15","2017-01-01")))
> FOLLOWUPs
Id Date
1: 1 2016-10-01
2: 2 2017-02-04
3: 3 2017-01-17
4: 2 2017-02-23
5: 2 2017-03-03
6: 3 2017-02-10
7: 3 2017-02-11
8: 4 2017-01-01
9: 1 2017-01-15
10: 5 2017-01-01
Now I want to look up each case in Leaving_Date, find the dates when they were followed up, and create three columns (SevenDay, FourteenDay, ThirtyDay) indicating, with 0s and 1s, the time period of the follow-up (in case there was any). I am using the following code:
SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()

for (i in 1:nrow(Leaving_Date)) {
  sub_data <- FOLLOWUPs[Id == Leaving_Date[i, 1]]
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 7)]) == 0) {
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP, 0)
  } else {
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP, 1)
  }
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 14)]) == 0) {
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP, 0)
  } else {
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP, 1)
  }
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 30)]) == 0) {
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP, 0)
  } else {
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP, 1)
  }
}

Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)
Final Data
> Leaving_Date
Id Date SEVENDAY FOURTEENDAY THIRTYDAY
1: 1 2017-01-01 0 0 1
2: 2 2017-02-03 1 1 1
3: 3 2017-01-01 0 0 1
4: 4 2017-03-10 0 0 0
5: 3 2017-02-09 1 1 1
6: 5 2017-02-05 0 0 0
This code is very inefficient as I have to run it for 100k observations and it takes a lot of time. Is there any efficient way to do this?
Using a non-equi join:
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]
Id Date n
1: 1 2017-01-01 14 days
2: 2 2017-02-03 1 days
3: 3 2017-01-01 16 days
4: 4 2017-03-10 NA days
5: 3 2017-02-09 1 days
6: 5 2017-02-05 NA days
Switching from Date to IDate will probably make this about twice as fast. See ?IDate.
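In case it helps, a small sketch of that suggestion (it converts both tables in place):
Leaving_Date[, Date := as.IDate(Date)]
FOLLOWUPs[, Date := as.IDate(Date)]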
I think it's best to stop here, but n can be compared against 7, 14, 30 if necessary, like
Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]
Id Date n bin
1: 1 2017-01-01 14 days 30
2: 2 2017-02-03 1 days 7
3: 3 2017-01-01 16 days 30
4: 4 2017-03-10 NA days NA
5: 3 2017-02-09 1 days 7
6: 5 2017-02-05 NA days NA
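If the OP's original 0/1 indicator columns are preferred, a hedged follow-up sketch (it assumes the n column computed above and treats NA as "no follow-up"):
Leaving_Date[, `:=`(SEVENDAY    = as.integer(!is.na(n) & n < 7),
                    FOURTEENDAY = as.integer(!is.na(n) & n < 14),
                    THIRTYDAY   = as.integer(!is.na(n) & n < 30))][]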
Side note: Please don't give tables names like this.
I think this does what you are looking for using dplyr.
It does an 'inner join' by Id - generating all combinations of dates in the two data frames for a given Id - then calculates the date differences, groups by Id, then checks whether there are values falling in the ranges for your three categories.
library(dplyr)
Leaving_Date2 <- Leaving_Date %>%
  inner_join(FOLLOWUPs %>% rename(FU_Date = Date)) %>%
  mutate(datediff = as.numeric(FU_Date - Date)) %>%
  group_by(Id, Date) %>%
  summarise(SEVENDAY = as.numeric(any(datediff %in% 0:6)),
            FOURTEENDAY = as.numeric(any(datediff %in% 0:13)),
            THIRTYDAY = as.numeric(any(datediff %in% 0:29)))
We can do this as a query instead of a loop. First, I cleaned your data.tables a bit because I was getting confused by the variable names.
To make things easier for the comparison step, we first pre-compute the follow-up date limits for the 7, 14 and 30 day thresholds.
library(dplyr)
dt_leaving_neat = Leaving_Date %>%
  mutate(.id = 1:n()) %>%
  mutate(limit_07 = Date + 7) %>%
  mutate(limit_14 = Date + 14) %>%
  mutate(limit_30 = Date + 30) %>%
  rename(id = .id, id_person = Id, leaving_date = Date)

dt_follow_neat = FOLLOWUPs %>%
  select(id_person = Id, followed_up_date = Date)
The actual operation is just a query. It's written out in dplyr for readability, but if speed is a main concern of yours, you could translate it to data.table. I'd recommend running each step in the pipeline to make sure you understand what's going on.
dt_followed_up = dt_leaving_neat %>%
  tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
  left_join(dt_follow_neat, by = "id_person") %>%
  mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
  select(id, id_person, leaving_date, follow_up, followed_up) %>%
  filter(followed_up == TRUE) %>%
  unique() %>%
  tidyr::spread(follow_up, followed_up, fill = 0) %>%
  select(id, id_person, leaving_date, limit_07, limit_14, limit_30)
The idea is to join the leaving dates to the follow up dates and check whether the follow up date is within the threshold (and also after the leaving date, as presumably you can't follow up before leaving).
Then some final cleaning to return your desired format. You can use select or rename to change the column names back too.
dt_result = dt_leaving_neat %>%
  select(id, id_person, leaving_date) %>%
  left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))
dt_result[is.na(dt_result)] = 0
Result
> dt_result
id id_person leaving_date limit_07 limit_14 limit_30
1 1 1 2017-01-01 0 0 1
2 2 2 2017-02-03 1 1 1
3 3 3 2017-01-01 0 0 1
4 4 4 2017-03-10 0 0 0
5 5 3 2017-02-09 1 1 1
6 6 5 2017-02-05 0 0 0
And following Andrew's answer, an equivalent data.table solution is
FOLLOWUPs[Leaving_Date, on = "Id", .(Id, follow_date = Date, leaving_date = i.Date)
          ][, diff := follow_date - leaving_date
          ][, .(seven = any(diff %in% 0:6),
                fourteen = any(diff %in% 0:13),
                thirty = any(diff %in% 0:29)),
            .(Id, leaving_date)]
I would like to remove products from a dataframe that have overlapping start and end dates to avoid duplicates in a subsequent step.
Example data:
library(dplyr)
d <- bind_rows(
  data.frame(product = 1,
             start_date = as.Date("2016-01-01"),
             end_date = as.Date("2016-01-10"),
             stringsAsFactors = FALSE),
  data.frame(product = 1,
             start_date = as.Date("2016-01-02"),
             end_date = as.Date("2016-01-04"),
             stringsAsFactors = FALSE),
  data.frame(product = 1,
             start_date = as.Date("2016-01-05"),
             end_date = as.Date("2016-06-09"),
             stringsAsFactors = FALSE),
  data.frame(product = 2,
             start_date = as.Date("2016-01-03"),
             end_date = as.Date("2016-01-07"),
             stringsAsFactors = FALSE)
)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-02 2016-01-04
3 1 2016-01-05 2016-06-09
4 2 2016-01-03 2016-01-07
From this example I would like to remove rows 2 and 3 because of the overlaps.
I've used the lag function to remove overlaps that are next to each other:
d_cleaned <- d %>%
  arrange(product, start_date, end_date) %>%
  mutate(overlapping = product == lag(product) & start_date <= lag(end_date) & end_date >= lag(start_date)) %>% # define overlaps
  mutate(overlapping = ifelse(is.na(overlapping), FALSE, overlapping)) %>% # don't delete the first row
  filter(overlapping == FALSE) %>% # remove overlaps
  select(-overlapping)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-05 2016-06-09
3 2 2016-01-03 2016-01-07
As can be seen above this step removes overlaps on consecutive rows but not all.
I can solve this with a loop, but I was hoping that someone might be able to suggest a non-looping solution as the dataframe is quite large and each step takes a while.
Using non-equi joins from the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
setDT(d) # convert 'd' to a data.table by reference
idx = d[d, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE]
d[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
# product start_date end_date
# 1: 1 2016-01-01 2016-01-10
# 2: 1 2016-06-10 2016-06-12
# 3: 2 2016-01-03 2016-01-07
For each row in d (the one inside the square bracket), we find any kind of overlap with d (on the outside), i.e., a self-join, based on the condition provided to the on argument, and we extract the index of the first overlap (because which=TRUE and mult="first").
If and only if the first overlap is with itself, we return them. We discard all other intervals.
To install devel version, see installation instructions here.
Here's a benchmark on slightly more rows (the data is not by any means large):
set.seed(1L)
require(data.table) # v1.9.7+
dates = as.Date(sample(16000:17000, 1e5, TRUE), origin="1970-01-01")
dt = data.table(product = sample(100, 1e5, TRUE),
                start_date = sample(dates, 1e5, TRUE),
                end_date = sample(dates, 1e5, TRUE))
dt[, `:=`(start_date = pmin(start_date, end_date),
          end_date = pmax(start_date, end_date))]
system.time({
idx = dt[dt, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE, verbose=TRUE]
ans = dt[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
})
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0.041 secs
# Recomputing forder with non-equi ids ... done in 0.005 secs
# Found 178 non-equi group(s) ...
# Starting bmerge ...done in 2.359 secs
# user system elapsed
# 2.402 0.011 2.421
head(ans)
# product start_date end_date
# 1: 71 2015-12-04 2016-03-22
# 2: 71 2014-04-12 2015-05-01
# 3: 32 2013-11-23 2015-03-18
# 4: 56 2014-07-29 2015-12-26
# 5: 88 2015-03-08 2015-03-21
# 6: 69 2014-10-31 2015-07-05
nrow(ans)
# [1] 186
I believe the following will work
d <- cbind(ID = 1:nrow(d), d)
d_cleaned <- d[rep(1:nrow(d), times = nrow(d)), ] %>%        ## 1
  setNames(., paste0(names(.), "_other")) %>%                ## 2
  bind_cols(d[rep(1:nrow(d), each = nrow(d)), ], .) %>%      ## 3
  arrange(product, start_date, end_date) %>%                 ## 4
  filter(product == product_other) %>%                       ## 5
  mutate(overlapping = ID_other < ID &
           start_date <= end_date_other &
           end_date >= start_date_other) %>%                 ## 6
  group_by(ID) %>%
  filter(all(overlapping == FALSE)) %>%                      ## 7
  ungroup() %>%
  select(product, start_date, end_date) %>%
  distinct()
print(d_cleaned)
## # A tibble: 2 x 3
## product start_date end_date
## <dbl> <date> <date>
##1 1 2016-01-01 2016-01-10
##2 2 2016-01-03 2016-01-07
First, add a column of IDs that identifies the rows of the original data frame; this is used later with group_by to determine whether a row overlaps any other row. The key is to be able to consider all distinct pairs of rows with the same product when testing for overlap. The above code does this by expanding the data as in an outer join. Specifically,
Replicate d nrow(d) times
Change the names of the columns by appending _other to them so that they can be referenced separately from the original column names in the overlap test
Replicate each row of d nrow(d) times and append the result from (2) as new columns
The result of (3) has rows that enumerate all pairs of rows from the original data frame. Then:
Sort them as you did.
Consider only pairs where the product matches. Do this first to minimize not needed comparisons later
Do the overlap test. Here the comparison is only made with respect to earlier rows in the original data frame (ID_other < ID). This considers all lags while avoiding comparing a row with itself (every row would trivially overlap itself).
Group by the ID (each row in original data frame) and keep those for which all overlapping is FALSE
At this point, the result contains only the non-overlapping rows in the original data frame. However, there are all those extra columns, and there are duplicates where multiple rows overlap with a row. The rest of the code cleans that up.
I have tested it with the following data (augmented yours to add a few more test conditions, but far from exhaustive):
d <- structure(list(product = c(1, 1, 1, 1, 1, 2, 2), start_date = structure(c(16801,
16802, 16805, 16811, 16962, 16803, 16806), class = "Date"), end_date = structure(c(16810,
16804, 16961, 16961, 16964, 16807, 16810), class = "Date")), .Names = c("product",
"start_date", "end_date"), row.names = c(NA, -7L), class = "data.frame")
and got the following results:
# A tibble: 3 x 3
product start_date end_date
<dbl> <date> <date>
1 1 2016-01-01 2016-01-10
2 1 2016-06-10 2016-06-12
3 2 2016-01-03 2016-01-07
Hope this helps.