I want to create a table with two columns. The first column represents the working weeks, named time_axis in my example below.
The second column is also a sequence of dates, representing particular events in a year, called bank_holidays. Each of those dates gets the value one to signal its presence.
What I need is a table where the first column, time_axis, remains unchanged and the second column is a vector of ones and zeros: zeros for the weeks that contain none of the dates in bank_holidays, and ones for the weeks that do contain one of those dates. Every week starts with the date given in time_axis.
library(xts)
time_axis <- seq(as.Date("2017-01-21"), length = 10, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18"))
bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
The desired outcome:
df <- data.frame(time_axis = c("2017-01-20", "2017-01-27", "2017-02-03", "2017-02-10", "2017-02-17",
                               "2017-02-24", "2017-03-03", "2017-03-10", "2017-03-17", "2017-03-24"),
                 bank_holidays = c(0, 1, 0, 1, 1, 0, 0, 0, 1, 0))
df
Any idea on how to do this?
Thank you.
Something to bear in mind that is not obvious from the data: the weeks in time_axis start on Saturday. Therefore, 2017-01-21 is not the end of the 3rd week (as it would be if weeks started on Monday) but already belongs to the 4th week.
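For illustration, a small sketch of the mismatch (the Saturday origin 2016-12-31 is only my own reference point, not part of the data above):
strftime(as.Date("2017-01-21"), "%V")                                # "03": ISO week number, Monday-based
as.integer(as.Date("2017-01-21") - as.Date("2016-12-31")) %/% 7 + 1  # 4: counting Saturday-start weeks from 2016-12-31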
Using strftime, "%V" gives ISO 8601 week numbers, which you can match on.
res <- data.frame(time_axis,
                  bank_holidays = +(strftime(time_axis, "%V") %in%
                                      strftime(index(bank_holidays), "%V")))
res
# time_axis bank_holidays
# 1 2017-01-20 0
# 2 2017-01-27 0
# 3 2017-02-03 1
# 4 2017-02-10 0
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 1
# 10 2017-03-24 0
Edit
To use the custom working weeks whose start dates are defined in the time_axis variable, the simplest thing is probably to compare whether the bank_holidays dates are greater than or equal to those start dates. Counting the TRUEs with colSums then gives the index of the week to set to 1.
res <- data.frame(time_axis, bank_holidays=0) ## init. column with `0`
res$bank_holidays[colSums(sapply(index(bank_holidays), `>=`, time_axis))] <- 1 ## set matches to 1
res
# time_axis bank_holidays
# 1 2017-01-21 0
# 2 2017-01-28 1
# 3 2017-02-04 0
# 4 2017-02-11 1
# 5 2017-02-18 1
# 6 2017-02-25 0
# 7 2017-03-04 0
# 8 2017-03-11 0
# 9 2017-03-18 1
# 10 2017-03-25 0
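For anyone unpacking that colSums line, a sketch of the intermediate objects (illustrative only, reusing the variables from the question):
m <- sapply(index(bank_holidays), `>=`, time_axis)  # length(time_axis) x length(bank_holidays) logical matrix:
                                                    # entry [w, h] is TRUE when holiday h falls on or after the start of week w
colSums(m)  # per holiday: how many week starts lie on or before it, i.e. the row index of the week containing it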
This works for me. It is a little longer than the previous answer, but you can see what's happening at each step.
I need both the start date and the end date so that I can fill in the dates in between, so I'm selecting 11 weeks instead of the 10 in your example. I also match against the plain vector of bank holidays instead of the xts object.
library(xts)
library(tidyverse)
time_axis <- seq(as.Date("2017-01-20"), length = 11, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18")) # I'll work with the vector
#bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
df <- tibble() # creating an empty tibble
for (i in 1:(length(time_axis) - 1)) { # running a for loop over the 10 weeks
  df <- seq(as.Date(time_axis[i]), as.Date(time_axis[i + 1]), "days") %>% # filling in the dates between the two week starts
    enframe(name = NULL) %>% # converting it into a data frame (tibble)
    mutate(week = as.Date(time_axis[i])) %>% # new column indicating which week the given date belongs to
    bind_rows(df) # binding the rows to the previous data frame
}
Now I take df and check whether each of the generated dates matches one of the given holidays: 1 if present, otherwise 0.
Then I group_by the week column (the given weeks above) and summarise to get the count.
df %>%
  mutate(bank_holidays_presence = if_else(value %in% bank_holidays, 1, 0)) %>%
  group_by(week) %>%
  summarise(sum = sum(bank_holidays_presence))
# A tibble: 10 x 2
# week sum
# <date> <dbl>
# 1 2017-01-20 0
# 2 2017-01-27 1
# 3 2017-02-03 0
# 4 2017-02-10 1
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 0
#10 2017-03-24 0
The advantage of this method is that even if you have more than one holiday in a particular week, it gives the count rather than mere presence or absence.
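As a hypothetical illustration (the extra date 2017-02-14 is made up for this example), adding a second holiday that falls in the same week would push that week's count to 2:
bank_holidays2 <- c(bank_holidays, as.Date("2017-02-14")) # hypothetical extra holiday in the week starting 2017-02-10
df %>%
  mutate(bank_holidays_presence = if_else(value %in% bank_holidays2, 1, 0)) %>%
  group_by(week) %>%
  summarise(sum = sum(bank_holidays_presence))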
Related
I have a list of companies with a start and end date of an event. I want to plot a figure that displays the date on the x-axis and the count of companies currently undergoing the event on the y-axis. The only way I can think of doing this at the moment is generating a column for every day and giving it a 1/0 for whether or not that day is between the start and end date for every company, and then reshaping it. Is there a more efficient way to produce this?
Here's some example data:
set.seed(123)
df <- data.frame(id = sample(100:500, 100, replace = F))
df$start <- sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100)
df$end <- df$start + sample(1:50, replace = T)
Here's another option, though I doubt it's much more efficient than what you're already doing. It also generates all of the days and then identifies whether or not any particular observation is "active" on each day.
library(tibble) # tibble() is used below; loading the tidyverse also works
outdf <- tibble(
  date = seq(min(df$start), max(df$end), by = 1),
  num = rowSums(outer(date,
                      1:nrow(df),
                      function(x, y) x > df$start[y] & x < df$end[y]))
)
outdf
# # A tibble: 404 x 2
# date num
# <date> <dbl>
# 1 2020-01-05 0
# 2 2020-01-06 1
# 3 2020-01-07 1
# 4 2020-01-08 1
# 5 2020-01-09 1
# 6 2020-01-10 1
# 7 2020-01-11 2
# 8 2020-01-12 2
# 9 2020-01-13 2
# 10 2020-01-14 2
# # … with 394 more rows
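If the end goal is the plot from the question, a minimal sketch of that last step (assuming ggplot2 is available):
library(ggplot2)
ggplot(outdf, aes(x = date, y = num)) +
  geom_line() +
  labs(x = "Date", y = "Companies currently in event")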
I have a data frame of observations with a start and end date for each observation indicating the period it was active.
The duration active varies by observation, and can spread across multiple weeks.
Some observations are still active and do not have an end date.
For a given date range, how can I count the number of observations that were active during a week within that date range, including those still active?
I have a crude method that works, but is pretty slow. It seems like there has to be a more efficient and simpler way to do this.
EDIT: My first approach was similar to Ronak's solution, which is definitely better than mine for smaller data sets, but my real data set has more observations and longer date ranges, so I run into memory constraints.
#I'm primarily using tidyverse/lubridate, but definitely open to other solutions.
library(tidyverse)
library(lubridate)
# sample data frame of observations with start and end dates:
df_obs <- tibble(
observation = c(1:10),
date_start = as_date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04", "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11", "2020-03-13", "2020-02-07")),
date_end = as_date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16", "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))
)
# to account for observations that are still active, NAs are converted to today's date:
df_obs <- mutate(df_obs, date_end = if_else(is.na(date_end), Sys.Date(), date_end))
# create a data frame of weeks by start and end date to count the active observations in a given week
# for this example I'm just using date ranges from the sample data:
df_weeks <-
  seq(min(df_obs$date_start), max(df_obs$date_start), by = 'day') %>%
  enframe(NULL, 'week_start') %>%
  mutate(week_start = as_date(cut(week_start, "week"))) %>%
  mutate(week_end = week_start + 6) %>%
  distinct()
# create a function that filters the observations data frame based on start and end dates:
check_active <- function(d, s, e){
  d %>%
    filter(date_start <= e) %>%
    filter(date_end >= s) %>%
    nrow()
}
# applying that function to each week in the date range data frame gives the expected results:
df_weeks %>%
  rowwise() %>%
  mutate(total_active = check_active(df_obs, week_start, week_end)) %>%
  select(-week_end) %>%
  ungroup()
# A tibble: 12 x 2
week_start total_active
<date> <int>
1 2019-12-30 1
2 2020-01-06 3
3 2020-01-13 3
4 2020-01-20 4
5 2020-01-27 4
6 2020-02-03 6
7 2020-02-10 7
8 2020-02-17 7
9 2020-02-24 6
10 2020-03-02 4
11 2020-03-09 4
12 2020-03-16 5
Here is one way:
library(tidyverse)
df_obs %>%
  # Replace NA with today's date and
  # create a sequence between start and end date
  mutate(date_end = replace(date_end, is.na(date_end), Sys.Date()),
         date = map2(date_start, date_end, seq, "day")) %>%
  # Get data in long format
  unnest(date) %>%
  # Drop start and end date
  select(-date_start, -date_end) %>%
  # Cut data by week
  mutate(date = cut(date, "week")) %>%
  # Get unique rows per observation and date
  distinct(observation, date) %>%
  # Count the number of observations in each week
  count(date)
which returns:
# A tibble: 14 x 2
# value n
# <fct> <int>
# 1 2019-12-30 1
# 2 2020-01-06 3
# 3 2020-01-13 3
# 4 2020-01-20 4
# 5 2020-01-27 4
# 6 2020-02-03 6
# 7 2020-02-10 7
# 8 2020-02-17 7
# 9 2020-02-24 6
#10 2020-03-02 4
#11 2020-03-09 4
#12 2020-03-16 5
#13 2020-03-23 4
#14 2020-03-30 3
I am trying to create a flag for unique people (defined by id) who have a flight lasting over 14 hours and another flight 25 or more days after that 14-hour flight.
To tackle this, I used an if-else statement in which each row's date is subtracted from the maximum date per id, but the flagging only seems to work for cases where the first flight is above 14 hours.
#Setup Data Frame
id <- c(1,1,2,2,3,3,4,4,4,4,5,5)
flght_dur <- c(27,13,13,17,19,12,7,9,27,14,13,45)
flght_dt <- as.Date(c("2016-03-29","2016-09-01","2015-07-23","2016-06-16","2015-11-12","2016-03-25","2015-12-23","2016-05-19","2016-08-18","2016-09-27","2016-08-18","2016-09-27"))
df <- data.frame(id, flght_dur, flght_dt)
df2 <- df %>%
  group_by(id) %>%
  mutate(flag = ifelse(flght_dur >= 14 &&
                         (max(as.Date(flght_dt)) - as.Date(flght_dt)) >= 25, 1, 0))
df2
Any suggestions on next steps would be appreciated.
You are using the scalar 'and' operator && with vectors, which only looks at the first element of each vector. To evaluate all of the conditions and return a single flag per group, use the vectorized & and then reduce the boolean result with any:
df2 <- df %>%
group_by(id) %>%
mutate(flag = +any(flght_dur >= 14 & max(as.Date(flght_dt)) - as.Date(flght_dt) >= 25))
# ^ used + here to convert boolean to 1 and 0 instead of if/else for short
df2
# A tibble: 12 x 4
# Groups: id [5]
# id flght_dur flght_dt flag
# <dbl> <dbl> <date> <int>
# 1 1. 27. 2016-03-29 1
# 2 1. 13. 2016-09-01 1
# 3 2. 13. 2015-07-23 0
# 4 2. 17. 2016-06-16 0
# 5 3. 19. 2015-11-12 1
# 6 3. 12. 2016-03-25 1
# 7 4. 7. 2015-12-23 1
# 8 4. 9. 2016-05-19 1
# 9 4. 27. 2016-08-18 1
#10 4. 14. 2016-09-27 1
#11 5. 13. 2016-08-18 0
#12 5. 45. 2016-09-27 0
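To see the scalar-versus-vector point above in isolation, a tiny sketch (not part of the original answer):
x <- c(FALSE, TRUE)
y <- c(TRUE, TRUE)
x & y       # element-wise: FALSE TRUE
any(x & y)  # reduced to a single TRUE for the whole group
# `x && y` is not vectorized: older R silently used only the first elements, and R >= 4.3.0 raises an error here.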
Try using chaining with data.table as follows:
DF[, longHaul := ifelse(flght_dur > 14, TRUE, FALSE)
   ][, maxFlight_DATE := max(flght_dt), by = "id"
   ][longHaul == TRUE & (maxFlight_DATE - flght_dt > 25),]
This is after converting your data.frame to a data.table with DF = data.table(df).
It gives me the following output, which appears to follow the logic you want.
id flght_dur flght_dt longHaul maxFlight_DATE
1: 1 27 2016-03-29 TRUE 2016-09-01
2: 3 19 2015-11-12 TRUE 2016-03-25
3: 4 27 2016-08-18 TRUE 2016-09-27
You can do this without loops using rollapply (from the zoo package), as below.
library(zoo) # for rollapply
df$sameid <- c(rollapply(df$id, width = 2, by = 1,
                         FUN = function(x) x[1] == x[2], align = "right"), NA)
df$nextdurcondition <- c(diff(df$flght_dt) > 25, NA)
df$flag <- df$sameid & df$nextdurcondition
df
However, for these rolling functions, I personally always use loops.
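For completeness, a rough loop equivalent of the rolling comparison above (a sketch, not the original answer's code):
n <- nrow(df)
sameid <- nextdurcondition <- rep(NA, n)
for (i in seq_len(n - 1)) {
  sameid[i] <- df$id[i] == df$id[i + 1]                               # same person on consecutive rows?
  nextdurcondition[i] <- (df$flght_dt[i + 1] - df$flght_dt[i]) > 25   # next flight more than 25 days later?
}
df$flag <- sameid & nextdurcondition
df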
I have a data set consisting of the dates when a person left the network. A person can leave the network multiple times, as they may join it again after leaving. The following code replicates the scenario.
library(data.table)
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05")))
(Ids repeat in this table because a person can leave the network multiple times, provided they joined it again.)
> Leaving_Date
Id Date
1: 1 2017-01-01
2: 2 2017-02-03
3: 3 2017-01-01
4: 4 2017-03-10
5: 3 2017-02-09
6: 5 2017-02-05
I have another dataset giving the dates on which a particular person was followed up, which can be before or after they left the network. The following code replicates the scenario.
FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
Date =as.Date(c("2016-10-01","2017-02-04",
"2017-01-17","2017-02-23", "2017-03-03",
"2017-02-10","2017-02-11","2017-01-01",
"2017-01-15","2017-01-01")))
> FOLLOWUPs
Id Date
1: 1 2016-10-01
2: 2 2017-02-04
3: 3 2017-01-17
4: 2 2017-02-23
5: 2 2017-03-03
6: 3 2017-02-10
7: 3 2017-02-11
8: 4 2017-01-01
9: 1 2017-01-15
10: 5 2017-01-01
Now I want to look up each case in Leaving_Date, find the dates on which they were followed up, and create three columns (SevenDay, FourteenDay, ThirtyDay) indicating the follow-up period (in case there was any) as 0s and 1s. I am using the following code:
SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()

for (i in 1:nrow(Leaving_Date)) {
  sub_data <- FOLLOWUPs[Id == Leaving_Date[i, 1]]
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 7)]) == 0) {
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP, 0)
  } else {
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP, 1)
  }
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 14)]) == 0) {
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP, 0)
  } else {
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP, 1)
  }
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 30)]) == 0) {
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP, 0)
  } else {
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP, 1)
  }
}

Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)
Final Data
> Leaving_Date
Id Date SEVENDAY FOURTEENDAY THIRTYDAY
1: 1 2017-01-01 0 0 1
2: 2 2017-02-03 1 1 1
3: 3 2017-01-01 0 0 1
4: 4 2017-03-10 0 0 0
5: 3 2017-02-09 1 1 1
6: 5 2017-02-05 0 0 0
This code is very inefficient, as I have to run it for 100k observations and it takes a lot of time. Is there a more efficient way to do this?
Using a non-equi join:
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]
Id Date n
1: 1 2017-01-01 14 days
2: 2 2017-02-03 1 days
3: 3 2017-01-01 16 days
4: 4 2017-03-10 NA days
5: 3 2017-02-09 1 days
6: 5 2017-02-05 NA days
Switching from Date to IDate will probably make this about twice as fast. See ?IDate.
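A possible in-place conversion (a sketch using data.table's as.IDate on the columns above):
Leaving_Date[, Date := as.IDate(Date)]
FOLLOWUPs[, Date := as.IDate(Date)]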
I think it's best to stop here, but n can be compared against 7, 14, 30 if necessary, like this:
Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]
Id Date n bin
1: 1 2017-01-01 14 days 30
2: 2 2017-02-03 1 days 7
3: 3 2017-01-01 16 days 30
4: 4 2017-03-10 NA days NA
5: 3 2017-02-09 1 days 7
6: 5 2017-02-05 NA days NA
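If the original three 0/1 columns are still wanted, they can be derived from n afterwards (a sketch, not part of the original answer; it relies on n being the gap to the first follow-up after leaving):
Leaving_Date[, `:=`(SEVENDAY    = +(!is.na(n) & n < 7),
                    FOURTEENDAY = +(!is.na(n) & n < 14),
                    THIRTYDAY   = +(!is.na(n) & n < 30))]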
Side note: Please don't give tables names like this.
I think this does what you are looking for using dplyr.
It does an 'inner join' by Id - generating all combinations of dates in the two data frames for a given Id - then calculates the date differences, groups by Id, then checks whether there are values falling in the ranges for your three categories.
library(dplyr)
Leaving_Date2 <- Leaving_Date %>%
  inner_join(FOLLOWUPs %>% rename(FU_Date = Date)) %>%
  mutate(datediff = as.numeric(FU_Date - Date)) %>%
  group_by(Id, Date) %>%
  summarise(SEVENDAY = as.numeric(any(datediff %in% 0:6)),
            FOURTEENDAY = as.numeric(any(datediff %in% 0:13)),
            THIRTYDAY = as.numeric(any(datediff %in% 0:29)))
We can do this as a query instead of a loop. First, I cleaned your data.tables a bit because I was getting confused by the variable names.
To make things easier for the comparison step, we first pre-compute the follow up date limit for the 7, 14 and 30 day thresholds.
library(dplyr)
dt_leaving_neat = Leaving_Date %>%
  mutate(.id = 1:n()) %>%
  mutate(limit_07 = Date + 7) %>%
  mutate(limit_14 = Date + 14) %>%
  mutate(limit_30 = Date + 30) %>%
  rename(id = .id, id_person = Id, leaving_date = Date)
dt_follow_neat = FOLLOWUPs %>%
  select(id_person = Id, followed_up_date = Date)
The actual operation is just a query. It's written out in dplyr for readability, but if speed is a main concern of yours, you could translate it to data.table. I'd recommend running each step in the pipeline to make sure you understand what's going on.
dt_followed_up = dt_leaving_neat %>%
  tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
  left_join(dt_follow_neat, by = "id_person") %>%
  mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
  select(id, id_person, leaving_date, follow_up, followed_up) %>%
  filter(followed_up == TRUE) %>%
  unique() %>%
  tidyr::spread(follow_up, followed_up, fill = 0) %>%
  select(id, id_person, leaving_date, limit_07, limit_14, limit_30)
The idea is to join the leaving dates to the follow up dates and check whether the follow up date is within the threshold (and also after the leaving date, as presumably you can't follow up before leaving).
Then some final cleaning to return your desired format. You can use select or rename to change the column names back too.
dt_result = dt_leaving_neat %>%
  select(id, id_person, leaving_date) %>%
  left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))
dt_result[is.na(dt_result)] = 0
Result
> dt_result
id id_person leaving_date limit_07 limit_14 limit_30
1 1 1 2017-01-01 0 0 1
2 2 2 2017-02-03 1 1 1
3 3 3 2017-01-01 0 0 1
4 4 4 2017-03-10 0 0 0
5 5 3 2017-02-09 1 1 1
6 6 5 2017-02-05 0 0 0
And following Andrew's answer, an equivalent one-line data.table solution is:
FOLLOWUPs[Leaving_Date, on = "Id", .(Id, follow_date = Date, leaving_date = i.Date)][, diff := follow_date - leaving_date][, .(seven = any(diff %in% 0:6), fourteen = any(diff %in% 0:13), thirty = any(diff %in% 0:29)), .(Id, leaving_date)]
My question involves how to calculate, in R, the number of days since an event last occurred.
Below is a minimal example of the data:
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","23/05/2001","26/08/2001"), "%d/%m/%Y"),
event=c(0,0,1,0,1,1,0))
date event
1 2000-07-06 0
2 2000-09-15 0
3 2000-10-15 1
4 2001-01-03 0
5 2001-03-17 1
6 2001-05-23 1
7 2001-08-26 0
A binary variable (event) has the value 1 when the event occurred and 0 otherwise. Repeated observations are made at different times (date).
The expected output is as follows, with the days since the last event (tae):
date event tae
1 2000-07-06 0 NA
2 2000-09-15 0 NA
3 2000-10-15 1 0
4 2001-01-03 0 80
5 2001-03-17 1 153
6 2001-05-23 1 67
7 2001-08-26 0 95
I have looked around for answers to similar problems, but they don't address my specific problem. I have tried to implement ideas from a similar post (Calculate elapsed time since last event), and below is the closest I got to a solution:
library(dplyr)
df %>%
  mutate(tmp_a = c(0, diff(date)) * !event,
         tae = cumsum(tmp_a))
This yields the output shown below, which is not quite what was expected:
date event tmp_a tae
1 2000-07-06 0 0 0
2 2000-09-15 0 71 71
3 2000-10-15 1 0 71
4 2001-01-03 0 80 151
5 2001-03-17 1 0 151
6 2001-05-23 1 0 151
7 2001-08-26 0 95 246
Any assistance on how to fine tune this or a different approach would be greatly appreciated.
You could try something like this:
# make an index of the latest events
last_event_index <- cumsum(df$event) + 1
# shift it by one to the right
last_event_index <- c(1, last_event_index[1:length(last_event_index) - 1])
# get the dates of the events and index the vector with the last_event_index,
# added an NA as the first date because there was no event
last_event_date <- c(as.Date(NA), df[which(df$event==1), "date"])[last_event_index]
# substract the event's date with the date of the last event
df$tae <- df$date - last_event_date
df
# date event tae
#1 2000-07-06 0 NA days
#2 2000-09-15 0 NA days
#3 2000-10-15 1 NA days
#4 2001-01-03 0 80 days
#5 2001-03-17 1 153 days
#6 2001-05-23 1 67 days
#7 2001-08-26 0 95 days
It's painful and you lose performance, but you can do it with a for loop:
datas <- read.table(text = "date event
2000-07-06 0
2000-09-15 0
2000-10-15 1
2001-01-03 0
2001-03-17 1
2001-05-23 1
2001-08-26 0", header = TRUE, stringsAsFactors = FALSE)
datas <- transform(datas, date = as.Date(date))
lastEvent <- NA
tae <- rep(NA, length(datas$event))
for (i in 2:length(datas$event)) {
  if (datas$event[i-1] == 1) {
    lastEvent <- datas$date[i-1]
  }
  tae[i] <- datas$date[i] - lastEvent
  # To set the first occurring event to 0 and not NA
  if (datas$event[i] == 1 && sum(datas$event[1:(i-1)] == 1) == 0) {
    tae[i] <- 0
  }
}
cbind(datas, tae)
date event tae
1 2000-07-06 0 NA
2 2000-09-15 0 NA
3 2000-10-15 1 0
4 2001-01-03 0 80
5 2001-03-17 1 153
6 2001-05-23 1 67
7 2001-08-26 0 95
Old question, but I was experimenting with rolling joins and found this interesting.
library(data.table)
setDT(df)
setkey(df, date)
# rolling self-join to attach last event time
df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
# find difference between record and previous event == 1 record
df[, tae := difftime(lastevent, shift(lastevent, 1L, "lag"), unit = "days")]
# difftime for simple case between date and joint on previous event
df[event == 0, tae:= difftime(date, lastevent, unit = "days")]
> df
date lastevent event tae
1: 2000-07-06 <NA> 0 NA days
2: 2000-09-15 <NA> 0 NA days
3: 2000-10-15 2000-10-15 1 NA days
4: 2001-01-03 2000-10-15 0 80 days
5: 2001-03-17 2001-03-17 1 153 days
6: 2001-05-23 2001-05-23 1 67 days
7: 2001-08-26 2001-05-23 0 95 days
I'm way late to the party, but I used tidyr::fill to make this easier. You essentially convert your non-events to missing values, then use fill to fill the NAs in with the last event date, and then subtract that last event date from the current date.
I've tested this with an integer date column, so it might need some tweaking for a Date-type date column (especially the use of NA_integer_; I'm not sure what the underlying type is for Date objects, but I'm guessing NA_real_).
df %>%
  mutate(
    event = as.logical(event),
    last_event = if_else(event, true = date, false = NA_integer_)) %>%
  fill(last_event) %>%
  mutate(event_age = date - last_event)
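For a Date-type date column (as in the sample data above), one possible tweak is to use a typed Date NA instead of NA_integer_ (a sketch, not tested against the original poster's data):
library(dplyr)
library(tidyr)
df %>%
  mutate(
    event = as.logical(event),
    last_event = if_else(event, true = date, false = as.Date(NA))) %>%
  fill(last_event) %>%
  mutate(event_age = as.numeric(date - last_event))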
I had a similar issue and was able to solve it by combining some of the ideas above. The main difference in my case is that customers a through n can each have multiple events (for me, purchases). I wanted the cumulative totals for all of these purchases as well as the date of the last activity. The main way I solved this was to create an index data frame to join to the main data frame, similar to the top-rated answer above. See the reproducible code below.
library(tidyverse)
rm(list=ls())
#generate repeatable code sample dataframe
df <- as.data.frame(sample(rep(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 12), each = 4),36))
df$subtotal <- sample(1:100, 36)
df$cust <- sample(rep(c("a", "b", "c", "d", "e", "f"), each=12), 36)
colnames(df) <- c("dates", "subtotal", "cust")
#add a "key" based on date and event
df$datekey <- paste0(df$dates, df$cust)
#The following 2 lines are specific to my own analysis but added to show depth
df_total_visits <- df %>% select(dates, cust) %>% distinct() %>% group_by(cust) %>% tally(n= "total_visits") %>% mutate(variable = 1)
df_order_bydate <- df %>% select(dates, cust) %>% group_by(dates, cust) %>% tally(n= "day_orders")
df <- left_join(df, df_total_visits)
df <- left_join(df, df_order_bydate) %>% arrange(dates)
# Now we will add the index, the arrange from the previous line is super important if your data is not already ordered by date
cummulative_groupping <- df %>% select(datekey, cust, variable, subtotal) %>% group_by(datekey) %>% mutate(spending = sum(subtotal)) %>% distinct(datekey, .keep_all = T) %>% select(-subtotal)
cummulative_groupping <- cummulative_groupping %>% group_by(cust) %>% mutate(cumulative_visits = cumsum(variable),
cumulative_spend = cumsum(spending))
df <- left_join(df, cummulative_groupping) %>% select(-variable)
#using the cumulative visits as the index, if we add one to this number we can then join it again on our dataframe
last_date_index <- df %>% select(dates, cust, cumulative_visits)
last_date_index$cumulative_visits <- last_date_index$cumulative_visits + 1
colnames(last_date_index) <- c("last_visit_date", "cust", "cumulative_visits")
df <- left_join(df, last_date_index, by = c("cust", "cumulative_visits"))
#the difference between the date and last visit answers the original posters question. NAs will return as NA
df$toa <- df$dates - df$last_visit_date
This answer also works in cases where the same event occurs on the same day (either bad data hygiene or multiple vendors/customers attending that event). Thank you for viewing my answer; this is actually my first post on Stack Overflow.