My question is about how to calculate the number of days since an event last occurred in R.
Below is a minimal example of the data:
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","23/05/2001","26/08/2001"), "%d/%m/%Y"),
event=c(0,0,1,0,1,1,0))
date event
1 2000-07-06 0
2 2000-09-15 0
3 2000-10-15 1
4 2001-01-03 0
5 2001-03-17 1
6 2001-05-23 1
7 2001-08-26 0
The binary variable event is 1 when the event occurred and 0 otherwise. Observations are repeated at different times (date).
The expected output, with the days since the last event (tae), is as follows:
date event tae
1 2000-07-06 0 NA
2 2000-09-15 0 NA
3 2000-10-15 1 0
4 2001-01-03 0 80
5 2001-03-17 1 153
6 2001-05-23 1 67
7 2001-08-26 0 95
I have looked around for answers to similar problems, but they don't address my specific problem. I have tried to implement ideas from a similar post (Calculate elapsed time since last event), and below is the closest I got to a solution:
library(dplyr)
df %>%
mutate(tmp_a = c(0, diff(date)) * !event,
tae = cumsum(tmp_a))
This yields the output below, which is not quite what I expected:
date event tmp_a tae
1 2000-07-06 0 0 0
2 2000-09-15 0 71 71
3 2000-10-15 1 0 71
4 2001-01-03 0 80 151
5 2001-03-17 1 0 151
6 2001-05-23 1 0 151
7 2001-08-26 0 95 246
Any assistance on how to fine tune this or a different approach would be greatly appreciated.
You could try something like this:
# make an index of the latest events
last_event_index <- cumsum(df$event) + 1
# shift it by one to the right
last_event_index <- c(1, head(last_event_index, -1))
# get the dates of the events and index the vector with the last_event_index,
# added an NA as the first date because there was no event
last_event_date <- c(as.Date(NA), df[which(df$event==1), "date"])[last_event_index]
# subtract the date of the last event from the row's date
df$tae <- df$date - last_event_date
df
# date event tae
#1 2000-07-06 0 NA days
#2 2000-09-15 0 NA days
#3 2000-10-15 1 NA days
#4 2001-01-03 0 80 days
#5 2001-03-17 1 153 days
#6 2001-05-23 1 67 days
#7 2001-08-26 0 95 days
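The result above leaves the first event row as NA, whereas the question asks for 0 there. A minimal sketch of one way to get the expected column, assuming the data is already sorted by date; the names n_prev, event_dates and prev_date are my own and not part of the answer:

```r
df <- data.frame(
  date  = as.Date(c("2000-07-06", "2000-09-15", "2000-10-15", "2001-01-03",
                    "2001-03-17", "2001-05-23", "2001-08-26")),
  event = c(0, 0, 1, 0, 1, 1, 0)
)

# number of events strictly before each row
n_prev <- cumsum(df$event) - df$event
# dates of all events, with a leading NA for "no earlier event"
event_dates <- df$date[df$event == 1]
prev_date <- c(as.Date(NA), event_dates)[n_prev + 1]

# days since the previous event, as plain numbers
df$tae <- as.numeric(df$date - prev_date)
# the very first event has no predecessor: report 0 instead of NA
df$tae[df$event == 1 & n_prev == 0] <- 0
df
```

Rows before the first event stay NA because there is nothing to measure from.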
It's painful and you lose performance, but you can do it with a for loop:
datas <- read.table(text = "date event
2000-07-06 0
2000-09-15 0
2000-10-15 1
2001-01-03 0
2001-03-17 1
2001-05-23 1
2001-08-26 0", header = TRUE, stringsAsFactors = FALSE)
datas <- transform(datas, date = as.Date(date))
lastEvent <- NA
tae <- rep(NA, length(datas$event))
for (i in 2:length(datas$event)) {
if (datas$event[i-1] == 1) {
lastEvent <- datas$date[i-1]
}
tae[i] <- datas$date[i] - lastEvent
# To set the first occurring event as 0 and not NA
if (datas$event[i] == 1 && sum(datas$event[1:(i-1)] == 1) == 0) {
tae[i] <- 0
}
}
cbind(datas, tae)
date event tae
1 2000-07-06 0 NA
2 2000-09-15 0 NA
3 2000-10-15 1 0
4 2001-01-03 0 80
5 2001-03-17 1 153
6 2001-05-23 1 67
7 2001-08-26 0 95
Old question, but I was experimenting with rolling joins and found this interesting.
library(data.table)
setDT(df)
setkey(df, date)
# rolling self-join to attach last event time
df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
# find difference between record and previous event == 1 record
df[, tae := difftime(lastevent, shift(lastevent, 1L, type = "lag"), units = "days")]
# difftime for the simple case between date and the joined previous event
df[event == 0, tae := difftime(date, lastevent, units = "days")]
> df
date lastevent event tae
1: 2000-07-06 <NA> 0 NA days
2: 2000-09-15 <NA> 0 NA days
3: 2000-10-15 2000-10-15 1 NA days
4: 2001-01-03 2000-10-15 0 80 days
5: 2001-03-17 2001-03-17 1 153 days
6: 2001-05-23 2001-05-23 1 67 days
7: 2001-08-26 2001-05-23 0 95 days
I'm way late to the party, but I used tidyr::fill to make this easier. You essentially convert your non-events to missing values, use fill to carry the last event date forward over the NAs, then subtract the last event date from the current date.
I've tested this with an integer date column, so it might need some tweaking for a Date-type date column, especially the use of NA_integer_. (I'm not sure what the underlying type is for Date objects; I'm guessing NA_real_.)
df %>%
mutate(
event = as.logical(event),
last_event = if_else(event, true = date, false = NA_integer_)) %>%
fill(last_event) %>%
mutate(event_age = date - last_event)
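For a Date-type column, a missing value of the same class is needed in if_else. A minimal sketch of the same idea on the question's data, assuming as.Date(NA) as the placeholder and converting the result to plain numbers; the name res is mine:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  date  = as.Date(c("2000-07-06", "2000-09-15", "2000-10-15", "2001-01-03",
                    "2001-03-17", "2001-05-23", "2001-08-26")),
  event = c(0, 0, 1, 0, 1, 1, 0)
)

res <- df %>%
  # keep the date on event rows, a Date-typed NA elsewhere
  mutate(last_event = if_else(event == 1, date, as.Date(NA))) %>%
  # carry the most recent event date forward
  fill(last_event) %>%
  mutate(event_age = as.numeric(date - last_event))
res
```

Note that with this approach an event row counts itself as the last event, so event_age is 0 on every event row rather than the gap to the previous event.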
I had a similar issue and was able to solve it by combining some of the ideas above. The main difference in my case is that there are multiple customers (a through n), each with their own events (for me, purchases). I wanted the cumulative totals for all these purchases as well as the date of the last activity. The main way I solved this was to create an index data frame to join with the main data frame, similar to the top-rated answer above. See the reproducible code below.
library(tidyverse)
rm(list=ls())
#generate repeatable code sample dataframe
df <- as.data.frame(sample(rep(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 12), each = 4),36))
df$subtotal <- sample(1:100, 36)
df$cust <- sample(rep(c("a", "b", "c", "d", "e", "f"), each=12), 36)
colnames(df) <- c("dates", "subtotal", "cust")
#add a "key" based on date and event
df$datekey <- paste0(df$dates, df$cust)
#The following 2 lines are specific to my own analysis but added to show depth
df_total_visits <- df %>% select(dates, cust) %>% distinct() %>% group_by(cust) %>% tally(n= "total_visits") %>% mutate(variable = 1)
df_order_bydate <- df %>% select(dates, cust) %>% group_by(dates, cust) %>% tally(n= "day_orders")
df <- left_join(df, df_total_visits)
df <- left_join(df, df_order_bydate) %>% arrange(dates)
# Now we will add the index, the arrange from the previous line is super important if your data is not already ordered by date
cummulative_groupping <- df %>% select(datekey, cust, variable, subtotal) %>% group_by(datekey) %>% mutate(spending = sum(subtotal)) %>% distinct(datekey, .keep_all = T) %>% select(-subtotal)
cummulative_groupping <- cummulative_groupping %>% group_by(cust) %>% mutate(cumulative_visits = cumsum(variable),
cumulative_spend = cumsum(spending))
df <- left_join(df, cummulative_groupping) %>% select(-variable)
#using the cumulative visits as the index, if we add one to this number we can then join it again on our dataframe
last_date_index <- df %>% select(dates, cust, cumulative_visits)
last_date_index$cumulative_visits <- last_date_index$cumulative_visits + 1
colnames(last_date_index) <- c("last_visit_date", "cust", "cumulative_visits")
df <- left_join(df, last_date_index, by = c("cust", "cumulative_visits"))
#the difference between the date and last visit answers the original posters question. NAs will return as NA
df$toa <- df$dates - df$last_visit_date
This answer works in the cases where the same event occurs on the same day (either bad data hygiene OR if multiple vendors/cust go to that event). Thank you for viewing my answer. This is actually my first post on Stack.
Background
I've got an R dataframe d:
d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
event = c(1,1,0,0,1,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
stringsAsFactors=FALSE)
As you can see, it's got 3 distinct people in the ID column, and they've either had or not had an event, along with a date their event status was recorded (event_date).
The Problem
I'd like to create a new variable / column, event_within_interval, which assigns 1 to all the cells of a given ID if that ID has 2 or more event=1 within 180 days of their first event=1.
Let me explain further: both ID=a and ID=c have 2 or more events each, but only ID=c has their second event within 180 days of their first (so here, the 4/7/2013 - 3/14/2013 = 24 days for ID=c).
The problem is that I'm not sure how to tell R this idea of "if the second happens within 180 days of the first event=1".
What I'd like
Here's what I'm looking for:
want <- data.frame(ID = c("a","a","b","b","c","c","c"),
event = c(1,1,0,0,1,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
event_within_interval = c(0,0,0,0,1,1,1),
stringsAsFactors=FALSE)
What I've tried
I've only got the beginnings of an attempt thus far:
d <- d %>%
mutate(event_within_interval = ID %in% if_else(d$event == 1, 1, 0))
But this doesn't give me what I'd like, as you can tell if you run the code.
I've set the thing up as an if_else, but I'm not sure where to go from here.
UPDATE: I've edited both reproducible examples (what I've got and what I want) to emphasize the fact that the desired date interval needs to be between the first event and the second event, not the first event and the last event. (A couple of users submitted examples using last, which worked for the previous iteration of the reproducible example but wouldn't have worked on the real dataset.)
What about using the lubridate and data.table packages?
library(data.table)
library(lubridate)
d <- data.frame(ID = c("a","a","b","b", "c","c"),
event = c(1,1,0,0,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07")),
stringsAsFactors=FALSE)
d <- data.table(d)
d <- d[, event_within_interval := 0]
timeInterval <- interval(start = "2013-03-14", end = "2013-04-07")
d <- d[event == 1 & event_date %within% timeInterval, event_within_interval := 1]
d
# ID event event_date event_within_interval
# 1: a 1 2011-01-01 0
# 2: a 1 2012-08-21 0
# 3: b 0 2011-12-23 0
# 4: b 0 2011-12-31 0
# 5: c 1 2013-03-14 1
# 6: c 1 2013-04-07 1
This is good fun.
Scenario 1
My approach would be to:
- group events by ID
- check whether the span of days between the current date and the initial date is under 180
- check whether the sum of events is greater than or equal to two: sum(event) >= 2
- return one for the event only if both conditions are met
For readability, I've returned values of conditions in the data as test_* variables.
d %>%
group_by(ID) %>%
mutate(test_interval = event_date - min(event_date) < 180,
test_sum_events = sum(event) >= 2,
event_within_interval = if_else(test_interval & test_sum_events,
1, 0)) %>%
ungroup()
Scenario 2
In this scenario, the data is sorted by event_date within ID, and the difference between the first event and the second event has to be at most 180 days. The remaining events are ignored.
d %>%
group_by(ID) %>%
arrange(event_date) %>%
mutate(
# Check the difference between first event: min(event_date) and
# second event: event_date[2]
test_interval_first_two = event_date[2] - min(event_date) <= 180,
test_sum_events = sum(event) >= 2,
event_within_interval = if_else(
test_interval_first_two & test_sum_events, 1, 0)
) %>%
ungroup()
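Scenario 2 takes event_date[2] as the second event, which holds only when every row in a group is an event. A hedged sketch of a variant that looks at event == 1 rows only, using the question's seven-row d; the names ev and res are my own:

```r
library(dplyr)

d <- data.frame(
  ID = c("a", "a", "b", "b", "c", "c", "c"),
  event = c(1, 1, 0, 0, 1, 1, 1),
  event_date = as.Date(c("2011-01-01", "2012-08-21", "2011-12-23", "2011-12-31",
                         "2013-03-14", "2013-04-07", "2014-07-14")),
  stringsAsFactors = FALSE
)

res <- d %>%
  group_by(ID) %>%
  arrange(event_date, .by_group = TRUE) %>%
  mutate(event_within_interval = {
    # dates of actual events for this ID, in chronological order
    ev <- event_date[event == 1]
    # 1 if there is a second event at most 180 days after the first
    as.integer(length(ev) >= 2 && as.numeric(ev[2] - ev[1]) <= 180)
  }) %>%
  ungroup()
res
```

The && short-circuits, so IDs with fewer than two events never reach the date subtraction.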
You can first group_by the ID column, so that the day differences are calculated within the same ID. Then, in the if_else condition, require both sum() > 1 and a day difference <= 180.
Here I assume there are only two "events" or rows per ID.
library(dplyr)
d %>%
group_by(ID) %>%
mutate(event_within_interval = if_else(sum(event) > 1 & last(event_date) - first(event_date) <= 180, 1L, 0L))
# A tibble: 6 x 4
# Groups: ID [3]
ID event event_date event_within_interval
<chr> <dbl> <date> <int>
1 a 1 2011-01-01 0
2 a 1 2012-08-21 0
3 b 0 2011-12-23 0
4 b 0 2011-12-31 0
5 c 1 2013-03-14 1
6 c 1 2013-04-07 1
Here is how we could do it. This example adds an extra column, interval, to show the span between the first and last dates, and then uses an ifelse statement.
library(dplyr)
d %>%
group_by(ID) %>%
mutate(interval = last(event_date)- first(event_date),
event_within_interval = ifelse(event == 1 &
interval < 180, 1, 0))
ID event event_date interval event_within_interval
<chr> <dbl> <date> <drtn> <dbl>
1 a 1 2011-01-01 598 days 0
2 a 1 2012-08-21 598 days 0
3 b 0 2011-12-23 8 days 0
4 b 0 2011-12-31 8 days 0
5 c 1 2013-03-14 24 days 1
6 c 1 2013-04-07 24 days 1
I want to create a table with two columns. The first represents the working weeks, named time_axis in my example below.
The second column is also a sequence of dates, representing particular events in a year, called bank_holidays. Each of these dates gets a value of one to signal its presence.
What I need is a table where the first column, time_axis, remains unchanged and the second column is a vector of ones and zeros: ones for the weeks that contain a date in bank_holidays and zeros everywhere else. Every week starts on the date given in time_axis.
library(xts)
time_axis <- seq(as.Date("2017-01-21"), length = 10, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18"))
bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
The desired outcome:
df <- data.frame ( time_axis = c("2017-01-20", "2017-01-27", "2017-02-03", "2017-02-10", "2017-02-17", "2017-02-24", "2017-03-03", "2017-03-10", "2017-03-17", "2017-03-24"), bank_holidays = c(0, 1, 0,1,1,0,0,0,1,0))
df
Any idea on how to do this?
Thank you.
Something to bear in mind that is not obvious from the data: the weeks in time_axis start on Saturday. Therefore, 2017-01-21 is not the end of the 3rd week (as it would be if the week started on Monday) but already the start of the 4th week.
Using strftime, "%V" gives the ISO 8601 week numbers, which you can match on.
res <- data.frame(time_axis,
bank_holidays = +(strftime(time_axis, "%V") %in%
strftime(index(bank_holidays), "%V")))
res
# time_axis bank_holidays
# 1 2017-01-20 0
# 2 2017-01-27 0
# 3 2017-02-03 1
# 4 2017-02-10 0
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 1
# 10 2017-03-24 0
Edit
To use the custom working weeks whose start dates are defined in the time_axis variable, the simplest thing is probably to check, for each date in bank_holidays, which week starts it is greater than or equal to. Counting the TRUEs with colSums then gives the index of the week to set to 1.
res <- data.frame(time_axis, bank_holidays=0) ## init. column with `0`
res$bank_holidays[colSums(sapply(index(bank_holidays), `>=`, time_axis))] <- 1 ## set matches to 1
res
# time_axis bank_holidays
# 1 2017-01-21 0
# 2 2017-01-28 1
# 3 2017-02-04 0
# 4 2017-02-11 1
# 5 2017-02-18 1
# 6 2017-02-25 0
# 7 2017-03-04 0
# 8 2017-03-11 0
# 9 2017-03-18 1
# 10 2017-03-25 0
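The same "which week start is this date on or after" lookup can also be written with findInterval. A minimal sketch, assuming each week runs for 7 days from its start date and that holidays outside the covered range should be ignored; the names holidays, week_end, in_range and idx are my own:

```r
time_axis <- seq(as.Date("2017-01-21"), length.out = 10, by = "week")
holidays  <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18"))

# the last week is assumed to end 7 days after its start
week_end <- max(time_axis) + 7
in_range <- holidays >= min(time_axis) & holidays < week_end

# index of the week start that each in-range holiday falls on or after
idx <- findInterval(as.numeric(holidays[in_range]), as.numeric(time_axis))

res <- data.frame(
  time_axis,
  bank_holidays = as.integer(seq_along(time_axis) %in% idx)
)
res
```

With the dates exactly as given in the question, the 2018-03-18 holiday lies outside the ten weeks and is dropped by the range filter.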
This works for me. It is a little longer than the previous answer, but you can see what is happening at each step.
I need a start date and an end date for each week so that I can fill in the dates in between, so I'm selecting 11 weeks instead of the 10 in your example. I also match against the plain vector of bank holidays instead of the xts object.
library(xts)
library(tidyverse)
time_axis <- seq(as.Date("2017-01-20"), length = 11, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18")) # I'll work with the vector
#bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
df <- tibble() # I'm creating an empty tibble
for(i in 1:(length(time_axis)-1)) { # running a for loop for 10 weeks
df <- seq(as.Date(time_axis[i]), as.Date(time_axis[i+1]) - 1, "days") %>% # dates in this week; the end date is excluded because it starts the next week
enframe(name = NULL) %>% #converting it into a data frame (tibble)
mutate(week = as.Date(time_axis[i])) %>% # creating a new column indicating which week the given date belong to
bind_rows(df) # binding the rows to previous dataframe
}
Now I take df and check whether each generated date matches one of the given holidays: 1 if present, 0 otherwise.
Then I group_by the week column, which holds the weeks given above, and summarise to find the sum.
df %>%
mutate(bank_holidays_presence = if_else(value %in% bank_holidays, 1, 0)) %>%
group_by(week) %>%
summarise(sum = bank_holidays_presence %>% sum())
# A tibble: 10 x 2
# week sum
# <date> <dbl>
# 1 2017-01-20 0
# 2 2017-01-27 1
# 3 2017-02-03 0
# 4 2017-02-10 1
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 0
#10 2017-03-24 0
The advantage of this method is that even if you have more than one holiday in a particular week, it gives the count rather than mere presence or absence.
I have clinical data that records a patient at three time points with a disease outcome indicated by a binary variable. It looks something like this
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data
I want to create an onset variable, so for each patient it would code a 1 for the time which the patient first got the disease, but would then be a 0 for any time period before or a time period after (even if that patient still had the disease). For the example data it should now look like this.
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
outcome_onset <- c(0,1,0,0,0,1,1,0,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome,
outcome_onset=outcome_onset)
Data
Therefore I would like some code/ some help automating the creation of the outcome_onset variable.
Here is an option with cumsum to create a logical vector after grouping by the 'patientid'
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(cumsum(outcome) == 1 & outcome == 1))
Or use match and %in%
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(row_number() %in% match(1, outcome)))
We can use which.max to get the index of the first 1 in the outcome variable, set that row to 1 and the rest to 0.
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = as.integer(row_number() %in% which.max(outcome)),
outcome_onset = replace(outcome_onset, is.na(outcome), NA))
# patientid time outcome outcome_onset
# <dbl> <dbl> <dbl> <int>
#1 100 1 0 0
#2 100 2 1 1
#3 100 3 1 0
#4 101 1 0 0
#5 101 2 0 0
#6 101 3 1 1
#7 102 1 1 1
#8 102 2 1 0
#9 102 3 0 0
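One edge case worth checking is a patient who never has the disease: which.max returns 1 for an all-zero vector, so the first row would be flagged. A hedged sketch that avoids this by flagging only the row where outcome is 1 and the running count of 1s is exactly one; patient 103 is a hypothetical addition to the question's data, and res is my own name:

```r
library(dplyr)

Data <- data.frame(
  patientid = rep(c(100, 101, 102, 103), each = 3),
  time      = rep(1:3, times = 4),
  # patient 103 is hypothetical and never has the disease
  outcome   = c(0, 1, 1,  0, 0, 1,  1, 1, 0,  0, 0, 0)
)

res <- Data %>%
  group_by(patientid) %>%
  # onset = a row with outcome 1 that is also the first 1 seen so far
  mutate(outcome_onset = as.integer(outcome == 1 & cumsum(outcome == 1) == 1)) %>%
  ungroup()
res
```

The extra outcome == 1 check also keeps later 0 rows from being flagged when a patient recovers after a single positive record.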
I have a data table that looks like
|userId|36|37|38|39|40|
|1|1|0|3|0|0|
|2|3|0|0|0|1|
Where each numbered column (36-40) represents a week number. I want to calculate the number of weeks between the first occurrence of a non-zero value and the last.
For instance, for userId 1 in my dataset, the first value appears at week 36, and the last one appears at week 38, so the value I want is 2. For userId 2 it's 40-36 which is 4.
I would like to store the data like:
|userId|lifespan|
|1|2|
|2|4|
I'm struggling to do this, can someone please help?
The general approach I would take is to melt the table, convert the character column names to numeric, and take the delta for each userId. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
# userId variable value
# 1: 1 36 1
# 2: 2 36 3
# 3: 1 37 0
# 4: 2 37 0
# 5: 1 38 3
# 6: 2 38 0
# 7: 1 39 0
# 8: 2 39 0
# 9: 1 40 0
# 10: 2 40 1
dt[value != 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
# userId lifespan
# 1: 1 2
# 2: 2 4
Here's a dplyr method:
library(dplyr)
library(tidyr)
df %>%
gather(var, value, -userId) %>%
mutate(var = as.numeric(sub("X", "", var))) %>%
group_by(userId) %>%
slice(c(which.max(value!=0), max(which(value!=0)))) %>%
summarize(lifespan = var[2]-var[1])
Result:
# A tibble: 2 x 2
userId lifespan
<int> <dbl>
1 1 2
2 2 4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")
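For a table this small, the same lifespan can be computed without reshaping. A minimal base R sketch, assuming check.names = FALSE so the week columns keep their numeric names; weeks, lifespan and res are my own names:

```r
df <- read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|", check.names = FALSE)

# week numbers taken from the column names (everything except userId)
weeks <- as.numeric(names(df)[-1])

# for each row: last non-zero week minus first non-zero week
lifespan <- apply(df[-1] != 0, 1, function(nz) diff(range(weeks[nz])))

res <- data.frame(userId = df$userId, lifespan = unname(lifespan))
res
```

A user with no non-zero weeks at all would need separate handling, since range() on an empty vector warns and returns infinite values.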
I have a Data set consisting of dates when a person left the network. A person can leave a network multiple times as they may join the network again after leaving it. Following code replicates the scenario.
library(data.table)
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05")))
(ids repeat multiple times in this table as a person can leave a network multiple times given they joined it again)
> Leaving_Date
Id Date
1: 1 2017-01-01
2: 2 2017-02-03
3: 3 2017-01-01
4: 4 2017-03-10
5: 3 2017-02-09
6: 5 2017-02-05
I have another dataset giving the dates whenever a particular person was followed up which can be before or after they left the network. Following code replicates the scenario.
FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
Date =as.Date(c("2016-10-01","2017-02-04",
"2017-01-17","2017-02-23", "2017-03-03",
"2017-02-10","2017-02-11","2017-01-01",
"2017-01-15","2017-01-01")))
> FOLLOWUPs
Id Date
1: 1 2016-10-01
2: 2 2017-02-04
3: 3 2017-01-17
4: 2 2017-02-23
5: 2 2017-03-03
6: 3 2017-02-10
7: 3 2017-02-11
8: 4 2017-01-01
9: 1 2017-01-15
10: 5 2017-01-01
Now I want to look up each case in Leaving_Date, find the dates when they were followed up, and create three columns (SevenDay, FourteenDay, ThirtyDay) indicating with 0s and 1s whether there was a follow-up within each time period. I am using the following code:
SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()
for(i in 1:nrow(Leaving_Date)){
sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]]
if(nrow(sub_data[Date > Leaving_Date[i,Date] &
Date < (Leaving_Date[i,Date]+7)])== 0){
SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0)
}
else{
SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1)
}
if(nrow(sub_data[Date > Leaving_Date[i,Date] &
Date < (Leaving_Date[i,Date]+14)])== 0){
FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0)
}
else{
FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1)
}
if(nrow(sub_data[Date > Leaving_Date[i,Date] &
Date < (Leaving_Date[i,Date]+30)])== 0){
THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0)
}
else{
THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1)
}
}
Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)
Final Data
> Leaving_Date
Id Date SEVENDAY FOURTEENDAY THIRTYDAY
1: 1 2017-01-01 0 0 1
2: 2 2017-02-03 1 1 1
3: 3 2017-01-01 0 0 1
4: 4 2017-03-10 0 0 0
5: 3 2017-02-09 1 1 1
6: 5 2017-02-05 0 0 0
This code is very inefficient: I have to run it for 100k observations and it takes a lot of time. Is there a more efficient way to do this?
Using a non-equi join:
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]
Id Date n
1: 1 2017-01-01 14 days
2: 2 2017-02-03 1 days
3: 3 2017-01-01 16 days
4: 4 2017-03-10 NA days
5: 3 2017-02-09 1 days
6: 5 2017-02-05 NA days
Switching from Date to IDate will probably make this about twice as fast. See ?IDate.
I think it's best to stop here, but n can be compared against 7, 14, 30 if necessary, like
Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]
Id Date n bin
1: 1 2017-01-01 14 days 30
2: 2 2017-02-03 1 days 7
3: 3 2017-01-01 16 days 30
4: 4 2017-03-10 NA days NA
5: 3 2017-02-09 1 days 7
6: 5 2017-02-05 NA days NA
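If the three 0/1 columns from the question are wanted instead of a single bin, n can be compared against each threshold. A hedged sketch that repeats the non-equi join (with Leaving_Date passed directly as i) so it runs on its own; the helper column days is my own, and the strict "less than k days" rule is taken from the question's loop:

```r
library(data.table)

Leaving_Date <- data.table(
  Id = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05"))
)
FOLLOWUPs <- data.table(
  Id = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01"))
)

# days from leaving to the first follow-up strictly after it (NA if none)
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n := FOLLOWUPs[Leaving_Date, on = .(Id, Date > Date),
                              mult = "first", x.Date - i.Date]]

# one 0/1 column per threshold: follow-up exists and came in under k days
Leaving_Date[, days := as.numeric(n)]
Leaving_Date[, c("SEVENDAY", "FOURTEENDAY", "THIRTYDAY") :=
               lapply(c(7, 14, 30), function(k) as.integer(!is.na(days) & days < k))]
Leaving_Date
```

Only the earliest follow-up after each leaving date is needed, because if any follow-up falls inside a window, the earliest one does too.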
Side note: Please don't give tables names like this.
I think this does what you are looking for using dplyr.
It does an 'inner join' by Id - generating all combinations of dates in the two data frames for a given Id - then calculates the date differences, groups by Id, then checks whether there are values falling in the ranges for your three categories.
library(dplyr)
Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date), by = "Id") %>%
mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>%
summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)),
FOURTEENDAY=as.numeric(any(datediff %in% 0:13)),
THIRTYDAY=as.numeric(any(datediff %in% 0:29)))
We can do this as a query instead of a loop. First, I cleaned your data.tables a bit because I was getting confused by the variable names.
To make things easier for the comparison step, we first pre-compute the follow up date limit for the 7, 14 and 30 day thresholds.
library(dplyr)
dt_leaving_neat = Leaving_Date %>%
mutate(.id = 1:n()) %>%
mutate(limit_07 = Date + 7) %>%
mutate(limit_14 = Date + 14) %>%
mutate(limit_30 = Date + 30) %>%
rename(id = .id, id_person = Id, leaving_date = Date)
dt_follow_neat = FOLLOWUPs %>%
select(id_person = Id, followed_up_date = Date)
The actual operation is just a query. It's written out in dplyr for readability, but if speed is a main concern of yours, you could translate it to data.table. I'd recommend running each step in the pipeline to make sure you understand what's going on.
dt_followed_up = dt_leaving_neat %>%
tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
left_join(dt_follow_neat, by = "id_person") %>%
mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
select(id, id_person, leaving_date, follow_up, followed_up) %>%
filter(followed_up == TRUE) %>%
unique() %>%
tidyr::spread(follow_up, followed_up, fill = 0) %>%
select(id, id_person, leaving_date, limit_07, limit_14, limit_30)
The idea is to join the leaving dates to the follow up dates and check whether the follow up date is within the threshold (and also after the leaving date, as presumably you can't follow up before leaving).
Then some final cleaning to return your desired format. You can use select or rename to change the column names back too.
dt_result = dt_leaving_neat %>%
select(id, id_person, leaving_date) %>%
left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))
dt_result[is.na(dt_result)] = 0
Result
> dt_result
id id_person leaving_date limit_07 limit_14 limit_30
1 1 1 2017-01-01 0 0 1
2 2 2 2017-02-03 1 1 1
3 3 3 2017-01-01 0 0 1
4 4 4 2017-03-10 0 0 0
5 5 3 2017-02-09 1 1 1
6 6 5 2017-02-05 0 0 0
And following Andrew's answer, an equivalent data.table solution is:
FOLLOWUPs[Leaving_Date, on = "Id",
          .(Id, follow_date = Date, leaving_date = i.Date)
][, diff := follow_date - leaving_date
][, .(seven    = any(diff %in% 0:6),
      fourteen = any(diff %in% 0:13),
      thirty   = any(diff %in% 0:29)),
  by = .(Id, leaving_date)]