I have a data set consisting of the dates when a person left a network. A person can leave a network multiple times, as they may join it again after leaving. The following code replicates the scenario.
library(data.table)
Leaving_Date <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05"))
)
(Ids repeat in this table because a person can leave the network multiple times after rejoining.)
> Leaving_Date
Id Date
1: 1 2017-01-01
2: 2 2017-02-03
3: 3 2017-01-01
4: 4 2017-03-10
5: 3 2017-02-09
6: 5 2017-02-05
I have another data set giving the dates when a particular person was followed up, which can be before or after they left the network. The following code replicates the scenario.
FOLLOWUPs <- data.table(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17",
                   "2017-02-23", "2017-03-03", "2017-02-10",
                   "2017-02-11", "2017-01-01", "2017-01-15",
                   "2017-01-01"))
)
> FOLLOWUPs
Id Date
1: 1 2016-10-01
2: 2 2017-02-04
3: 3 2017-01-17
4: 2 2017-02-23
5: 2 2017-03-03
6: 3 2017-02-10
7: 3 2017-02-11
8: 4 2017-01-01
9: 1 2017-01-15
10: 5 2017-01-01
Now I want to look up each case in Leaving_Date, find the dates when they were followed up, and create three columns (SEVENDAY, FOURTEENDAY, THIRTYDAY) of 0s and 1s indicating whether there was any follow-up within each period. I am using the following code:
SEVENDAY_FOLLOWUP  <- vector()
FOURTEEN_FOLLOWUP  <- vector()
THIRTYDAY_FOLLOWUP <- vector()

for (i in 1:nrow(Leaving_Date)) {
  sub_data <- FOLLOWUPs[Id == Leaving_Date[i, Id]]
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 7)]) == 0) {
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP, 0)
  } else {
    SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP, 1)
  }
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 14)]) == 0) {
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP, 0)
  } else {
    FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP, 1)
  }
  if (nrow(sub_data[Date > Leaving_Date[i, Date] &
                    Date < (Leaving_Date[i, Date] + 30)]) == 0) {
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP, 0)
  } else {
    THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP, 1)
  }
}
Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)
Final data:
> Leaving_Date
Id Date SEVENDAY FOURTEENDAY THIRTYDAY
1: 1 2017-01-01 0 0 1
2: 2 2017-02-03 1 1 1
3: 3 2017-01-01 0 0 1
4: 4 2017-03-10 0 0 0
5: 3 2017-02-09 1 1 1
6: 5 2017-02-05 0 0 0
This code is very inefficient: I have to run it for 100k observations and it takes a lot of time. Is there a more efficient way to do this?
Using a non-equi join:
setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]
Id Date n
1: 1 2017-01-01 14 days
2: 2 2017-02-03 1 days
3: 3 2017-01-01 16 days
4: 4 2017-03-10 NA days
5: 3 2017-02-09 1 days
6: 5 2017-02-05 NA days
Switching from Date to IDate will probably make this about twice as fast. See ?IDate.
I think it's best to stop here, but n can be compared against 7, 14, 30 if necessary, like
Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]
Id Date n bin
1: 1 2017-01-01 14 days 30
2: 2 2017-02-03 1 days 7
3: 3 2017-01-01 16 days 30
4: 4 2017-03-10 NA days NA
5: 3 2017-02-09 1 days 7
6: 5 2017-02-05 NA days NA
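If you do need the original three 0/1 columns, here is a minimal sketch deriving them from n as computed above (an NA in n means no follow-up was found):

Leaving_Date[, `:=`(
  SEVENDAY    = +(!is.na(n) & n < 7),
  FOURTEENDAY = +(!is.na(n) & n < 14),
  THIRTYDAY   = +(!is.na(n) & n < 30)
)]  # the unary + turns the logicals into the 0/1 integers the question asked for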
Side note: please don't give tables names like FOLLOWUPs; inconsistent capitalization makes code error-prone.
I think this does what you are looking for, using dplyr.
It does an inner join by Id (generating all combinations of dates in the two data frames for a given Id), then calculates the date differences, groups by Id and leaving date, and checks whether any of the differences fall in the range for each of your three categories.
library(dplyr)
Leaving_Date2 <- Leaving_Date %>%
  inner_join(FOLLOWUPs %>% rename(FU_Date = Date)) %>%
  mutate(datediff = as.numeric(FU_Date - Date)) %>%
  group_by(Id, Date) %>%
  summarise(SEVENDAY    = as.numeric(any(datediff %in% 0:6)),
            FOURTEENDAY = as.numeric(any(datediff %in% 0:13)),
            THIRTYDAY   = as.numeric(any(datediff %in% 0:29)))
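A minor note: because by is omitted, inner_join will print Joining, by = "Id"; you can pass the key explicitly to make the intent clear and silence the message:

Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date = Date), by = "Id")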
We can do this as a query instead of a loop. First, I cleaned your data.tables a bit because I was getting confused by the variable names.
To make things easier for the comparison step, we first pre-compute the follow-up date limit for the 7, 14 and 30 day thresholds.
library(dplyr)
dt_leaving_neat = Leaving_Date %>%
  mutate(.id = 1:n()) %>%
  mutate(limit_07 = Date + 7) %>%
  mutate(limit_14 = Date + 14) %>%
  mutate(limit_30 = Date + 30) %>%
  rename(id = .id, id_person = Id, leaving_date = Date)

dt_follow_neat = FOLLOWUPs %>%
  select(id_person = Id, followed_up_date = Date)
The actual operation is just a query. It's written out in dplyr for readability, but if speed is a main concern of yours, you could translate it to data.table. I'd recommend running each step in the pipeline to make sure you understand what's going on.
dt_followed_up = dt_leaving_neat %>%
  tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
  left_join(dt_follow_neat, by = "id_person") %>%
  mutate(followed_up = (followed_up_date > leaving_date) &
                       (followed_up_date < limit_date)) %>%
  select(id, id_person, leaving_date, follow_up, followed_up) %>%
  filter(followed_up == TRUE) %>%
  unique() %>%
  tidyr::spread(follow_up, followed_up, fill = 0) %>%
  select(id, id_person, leaving_date, limit_07, limit_14, limit_30)
The idea is to join the leaving dates to the follow up dates and check whether the follow up date is within the threshold (and also after the leaving date, as presumably you can't follow up before leaving).
Then some final cleaning to return your desired format. You can use select or rename to change the column names back too.
dt_result = dt_leaving_neat %>%
  select(id, id_person, leaving_date) %>%
  left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))
dt_result[is.na(dt_result)] = 0
Result
> dt_result
id id_person leaving_date limit_07 limit_14 limit_30
1 1 1 2017-01-01 0 0 1
2 2 2 2017-02-03 1 1 1
3 3 3 2017-01-01 0 0 1
4 4 4 2017-03-10 0 0 0
5 5 3 2017-02-09 1 1 1
6 6 5 2017-02-05 0 0 0
And following Andrew's answer, an equivalent data.table solution (one chained expression) is:

FOLLOWUPs[Leaving_Date, on = "Id",
          .(Id, follow_date = Date, leaving_date = i.Date)
  ][, diff := follow_date - leaving_date
  ][, .(seven    = any(diff %in% 0:6),
        fourteen = any(diff %in% 0:13),
        thirty   = any(diff %in% 0:29)),
    .(Id, leaving_date)]
I want to create a table with two columns. The first one represents the working weeks, named time_axis in my example below.
The second column, bank_holidays, is also a sequence of dates, representing particular events in a year. Each of those dates gets a value of one to signal its presence.
What I need is to create a table where the first column, the time axis, remains unchanged and the second column is a vector of ones and zeros: zeros for weeks that contain none of the events in bank_holidays, and ones for weeks that do include one of those dates. Every week starts on the date given in time_axis.
library(xts)
time_axis <- seq(as.Date("2017-01-21"), length = 10, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18"))
bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
The desired outcome:
df <- data.frame(
  time_axis = c("2017-01-20", "2017-01-27", "2017-02-03", "2017-02-10",
                "2017-02-17", "2017-02-24", "2017-03-03", "2017-03-10",
                "2017-03-17", "2017-03-24"),
  bank_holidays = c(0, 1, 0, 1, 1, 0, 0, 0, 1, 0)
)
df
Any idea how to do this?
Thank you.
Something to bear in mind that is not obvious from the data: the weeks in time_axis start on Saturday. Therefore 2017-01-21 is not the end of the 3rd week (as it would be if weeks started on Monday) but already in the 4th week.
Using strftime, "%V" gives the ISO 8601 week number, which you can match on.
res <- data.frame(time_axis,
bank_holidays =+(strftime(time_axis, "%V") %in%
strftime(index(bank_holidays), "%V")))
res
# time_axis bank_holidays
# 1 2017-01-20 0
# 2 2017-01-27 0
# 3 2017-02-03 1
# 4 2017-02-10 0
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 1
# 10 2017-03-24 0
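One caveat: "%V" compares week numbers only, ignoring the year, which is why the 2018-03-18 holiday matches the week of 2017-03-17 above. If you need year-aware matching, a small variant (assuming your platform's strftime supports "%G", the ISO week-based year) is:

res_yr <- data.frame(time_axis,
                     bank_holidays = +(strftime(time_axis, "%G-%V") %in%
                                       strftime(index(bank_holidays), "%G-%V")))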
Edit
To use the custom working weeks whose starts are defined in the time_axis variable, the simplest approach is probably to check whether each bank holiday is greater than or equal to those week-start dates. Counting the TRUEs with colSums then gives, for each holiday, the index of the week to set to 1.
res <- data.frame(time_axis, bank_holidays = 0)  ## initialize the column with 0
## set the matching weeks to 1
res$bank_holidays[colSums(sapply(index(bank_holidays), `>=`, time_axis))] <- 1
res
# time_axis bank_holidays
# 1 2017-01-21 0
# 2 2017-01-28 1
# 3 2017-02-04 0
# 4 2017-02-11 1
# 5 2017-02-18 1
# 6 2017-02-25 0
# 7 2017-03-04 0
# 8 2017-03-11 0
# 9 2017-03-18 1
# 10 2017-03-25 0
This works for me. It's a little longer than the previous answer, but you can see what's happening here.
I need the start date and the end date so I can fill in the dates in between, so I'm selecting 11 weeks instead of the 10 in your example. I also match against the plain vector of bank holidays instead of the xts object.
library(xts)
library(tidyverse)
time_axis <- seq(as.Date("2017-01-20"), length = 11, by = "weeks")
bank_holidays <- as.Date(c("2017-02-01", "2017-02-13", "2017-02-18", "2018-03-18")) # I'll work with the vector
#bank_holidays <- as.xts(rep(1,4), order.by = bank_holidays)
df <- tibble() # I'm creating an empty tibble
for(i in 1:(length(time_axis)-1)) { # running a for loop for 10 weeks
df <- seq(as.Date(time_axis[i]), as.Date(time_axis[i+1]), "days") %>% # filling the dates between the two dates
enframe(name = NULL) %>% #converting it into a data frame (tibble)
mutate(week = as.Date(time_axis[i])) %>% # creating a new column indicating which week the given date belong to
bind_rows(df) # binding the rows to previous dataframe
}
Now I take df and check whether each of the given holidays matches one of our generated dates: 1 if present, 0 otherwise.
Then I group_by the week column (our given weeks above) and summarise to find the sum.
df %>%
mutate(bank_holidays_presence = if_else(value %in% bank_holidays, 1, 0)) %>%
group_by(week) %>%
summarise(sum = bank_holidays_presence %>% sum())
# A tibble: 10 x 2
# week sum
# <date> <dbl>
# 1 2017-01-20 0
# 2 2017-01-27 1
# 3 2017-02-03 0
# 4 2017-02-10 1
# 5 2017-02-17 1
# 6 2017-02-24 0
# 7 2017-03-03 0
# 8 2017-03-10 0
# 9 2017-03-17 0
#10 2017-03-24 0
The advantage of this method is that even if you have more than one holiday in a particular week, it gives the count rather than mere presence or absence.
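If you do want the plain 0/1 indicator from the question rather than the count, one small variation on the last step (a sketch reusing the df built above):

df %>%
  mutate(bank_holidays_presence = if_else(value %in% bank_holidays, 1, 0)) %>%
  group_by(week) %>%
  summarise(bank_holidays = as.integer(sum(bank_holidays_presence) > 0))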
I am trying to create a flag for unique people (defined by id) who have a flight duration over 14 hours and another flight greater than or equal to 25 days after that 14-hour flight.
To tackle this, I used an if-else statement where the row date is subtracted from the max date grouped by id, but the flagging only seems to work for cases where the first flight is above 14 hours.
# Set up data frame
id        <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5)
flght_dur <- c(27, 13, 13, 17, 19, 12, 7, 9, 27, 14, 13, 45)
flght_dt  <- as.Date(c("2016-03-29", "2016-09-01", "2015-07-23", "2016-06-16",
                       "2015-11-12", "2016-03-25", "2015-12-23", "2016-05-19",
                       "2016-08-18", "2016-09-27", "2016-08-18", "2016-09-27"))
df <- data.frame(id, flght_dur, flght_dt)
library(dplyr)

df2 <- df %>%
  group_by(id) %>%
  mutate(flag = ifelse(flght_dur >= 14 &&
                         (max(as.Date(flght_dt)) - as.Date(flght_dt)) >= 25,
                       1, 0))
df2
Any suggestions on next steps would be appreciated.
You are using the scalar 'and' operator && with vectors, which only looks at the first element of each vector. To evaluate the condition across all rows and return one value per group, use & on the vectors and then reduce the logical result with any():
df2 <- df %>%
group_by(id) %>%
mutate(flag = +any(flght_dur >= 14 & max(as.Date(flght_dt)) - as.Date(flght_dt) >= 25))
# ^ used + here to convert boolean to 1 and 0 instead of if/else for short
df2
# A tibble: 12 x 4
# Groups: id [5]
# id flght_dur flght_dt flag
# <dbl> <dbl> <date> <int>
# 1 1. 27. 2016-03-29 1
# 2 1. 13. 2016-09-01 1
# 3 2. 13. 2015-07-23 0
# 4 2. 17. 2016-06-16 0
# 5 3. 19. 2015-11-12 1
# 6 3. 12. 2016-03-25 1
# 7 4. 7. 2015-12-23 1
# 8 4. 9. 2016-05-19 1
# 9 4. 27. 2016-08-18 1
#10 4. 14. 2016-09-27 1
#11 5. 13. 2016-08-18 0
#12 5. 45. 2016-09-27 0
Try using chaining with data.table as follows:
DF[, longHaul := ifelse(flght_dur > 14, TRUE, FALSE)
  ][, maxFlight_DATE := max(flght_dt), by = "id"
  ][longHaul == TRUE & (maxFlight_DATE - flght_dt > 25), ]
This is after converting your data.frame to data.table with DF = data.table(df)
It gives me the following output, which appears to follow the logic you want.
id flght_dur flght_dt longHaul maxFlight_DATE
1: 1 27 2016-03-29 TRUE 2016-09-01
2: 3 19 2015-11-12 TRUE 2016-03-25
3: 4 27 2016-08-18 TRUE 2016-09-27
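If you want the flag attached to every row of the group, as in your expected output, rather than just the filtered long-haul rows, a hedged sketch of the same logic is:

# flag every row of an id whose group contains a >14h flight followed,
# more than 25 days later, by another flight
DF[, flag := +any(flght_dur > 14 & (max(flght_dt) - flght_dt > 25)), by = id]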
You can avoid loops by using rollapply from the zoo package, as below.
library(zoo)  # rollapply() comes from zoo

df$sameid <- c(rollapply(df$id, width = 2, by = 1,
                         FUN = function(x) x[1] == x[2], align = "right"), NA)
df$nextdurcondition <- c(diff(df$flght_dt) > 25, NA)
df$flag <- df$sameid & df$nextdurcondition
df
However, for these rolling functions I personally always use loops.
I have a data table that looks like
userId  36  37  38  39  40
     1   1   0   3   0   0
     2   3   0   0   0   1
Each numbered column (36-40) represents a week number. I want to calculate the number of weeks between the first occurrence of a non-zero value and the last.
For instance, for userId 1 in my dataset, the first non-zero value appears at week 36 and the last appears at week 38, so the value I want is 2. For userId 2 it's 40 - 36, which is 4.
I would like to store the data like:
userId  lifespan
     1         2
     2         4
I'm struggling to do this, can someone please help?
The general method I would take is to melt it, convert the character column names to numeric, and take the delta for each userId. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
# userId variable value
# 1: 1 36 1
# 2: 2 36 3
# 3: 1 37 0
# 4: 2 37 0
# 5: 1 38 3
# 6: 2 38 0
# 7: 1 39 0
# 8: 2 39 0
# 9: 1 40 0
# 10: 2 40 1
dt[!value == 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
# userId lifespan
# 1: 1 2
# 2: 2 4
Here's a dplyr method:
library(dplyr)
library(tidyr)

df %>%
  gather(var, value, -userId) %>%
  mutate(var = as.numeric(sub("X", "", var))) %>%  # read.table prefixes numeric names with "X"
  group_by(userId) %>%
  slice(c(which.max(value != 0), max(which(value != 0)))) %>%  # first and last non-zero rows
  summarize(lifespan = var[2] - var[1])
Result:
# A tibble: 2 x 2
userId lifespan
<int> <dbl>
1 1 2
2 2 4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")
I have found similar problems to this, but not exactly the same. Here is my problem. I have the following data set:
> ds
id begin end
1 1 2017-01-15 2017-01-17
2 1 2017-01-01 2017-01-03
3 2 2017-02-01 2017-02-28
4 4 2017-04-11 2017-05-11
5 3 2017-02-05 2017-02-10
6 4 2017-03-10 2017-03-20
7 1 2017-01-30 2017-02-03
8 3 2017-02-28 2017-03-09
9 4 2017-02-26 2017-03-05
I want to create a column, check, that verifies the following condition for each row against the previous row with the same id value:

ds[i, ]$begin - ds[i-1, ]$end < 30  =>  check = 1   # for each row i

and 0 otherwise. For the first element of a group there is no previous information, so in that case the value of the new check column is always zero as well.
The difference from other questions I have seen (solved via ave or dplyr) is that I need a calculation that involves the previous row, and a different column of it at that.
Here is the code:
Defining the data set
id <- c("1", "1", "2", "4", "3", "4", "1", "3")
begin <- c("20170115", "20170101", "20170201",
"20170411",
"20170205", "20170310",
"20170130", "20170228"
)
end <- c("20170117", "20170103", "20170228",
"20170511",
"20170210", "20170320",
"20170203", "20170309"
)
ds <- data.frame(id = id, begin = as.Date(begin, "%Y%m%d"), end = as.Date(end, "%Y%m%d"))
Sorting the information (we need it for the current solution using a for-loop)
idx = order(rank(ds$id), ds$begin, decreasing = FALSE)
ds <- ds[idx,]
Now use the for loop to assign the control variable check:
ds$check <- NA_integer_
nrep <- -1
for (i in 1:nrow(ds)) {
  rowi <- ds[i, ]
  if (nrep == -1) {  # setting the first element of ds
    end.prev <- rowi$end
    id.prev  <- rowi$id
    ds[i, ]$check <- 0
    nrep <- 1
  } else {
    id.current <- rowi$id
    if (id.prev == id.current) {
      ds[i, ]$check <- ifelse(rowi$begin - end.prev < 30, 1, 0)
    } else {
      ds[i, ]$check <- 0
    }
    end.prev <- rowi$end
    id.prev  <- id.current
  }
}
Finally the expected output:
> ds
id begin end check
2 1 2017-01-01 2017-01-03 0
1 1 2017-01-15 2017-01-17 1
7 1 2017-01-30 2017-02-03 1
3 2 2017-02-01 2017-02-28 0
5 3 2017-02-05 2017-02-10 0
8 3 2017-02-28 2017-03-09 1
6 4 2017-03-10 2017-03-20 0
4 4 2017-04-11 2017-05-11 1
Thanks for any hint.
With dplyr, using lag makes this easy:
ds %>%
group_by(id) %>%
arrange(id, begin) %>%
mutate(check = c(0, as.numeric(begin - lag(end) < 30)[-1]))
Gives:
Source: local data frame [8 x 4]
Groups: id [4]
id begin end check
<fctr> <date> <date> <dbl>
1 1 2017-01-01 2017-01-03 0
2 1 2017-01-15 2017-01-17 1
3 1 2017-01-30 2017-02-03 1
4 2 2017-02-01 2017-02-28 0
5 3 2017-02-05 2017-02-10 0
6 3 2017-02-28 2017-03-09 1
7 4 2017-03-10 2017-03-20 0
8 4 2017-04-11 2017-05-11 1
Explanation:
First we group_by, so calculations are done per id.
We arrange to make sure we start with the earliest date.
Within each group the first check value is 0, and the remaining values are simply whether the current begin minus the previous end is smaller than 30; as.numeric converts from logical to numeric.
Also see vignette('window-functions').
Note: I think the first value per group can simply be left at NA, since it is indeed unknown or undefined, and this will simplify it to:
ds %>%
group_by(id) %>%
mutate(check = as.numeric(begin - lag(end, order_by = begin) < 30))
Use data.table:

setDT(ds)[, New := as.numeric(begin - shift(end, fill = 0) < 30), id]
# shift(end, fill = 0) makes the first row of each id compare against
# 1970-01-01, so its difference is never below 30 and New comes out 0
Use dplyr:

ds %>%
  group_by(id) %>%
  mutate(new = as.numeric(begin - lag(end, default = 0) < 30)) %>%
  arrange(id)
My question involves how to calculate the number of days since an event last occurred, in R.
Below is a minimal example of the data:
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/05/2001", "26/08/2001"), "%d/%m/%Y"),
  event = c(0, 0, 1, 0, 1, 1, 0)
)
date event
1 2000-07-06 0
2 2000-09-15 0
3 2000-10-15 1
4 2001-01-03 0
5 2001-03-17 1
6 2001-05-23 1
7 2001-08-26 0
A binary variable (event) has the value 1 when the event occurred and 0 otherwise. Repeated observations are made at different times (date).
The expected output is as follows, with tae holding the days since the last event:
date event tae
1 2000-07-06 0 NA
2 2000-09-15 0 NA
3 2000-10-15 1 0
4 2001-01-03 0 80
5 2001-03-17 1 153
6 2001-05-23 1 67
7 2001-08-26 0 95
I have looked around for answers to similar problems, but they don't address my specific problem. I have tried to implement ideas from a similar post (Calculate elapsed time since last event), and below is the closest I got to a solution:
library(dplyr)
df %>%
mutate(tmp_a = c(0, diff(date)) * !event,
tae = cumsum(tmp_a))
This yields the output shown below, which is not quite the expected result:
date event tmp_a tae
1 2000-07-06 0 0 0
2 2000-09-15 0 71 71
3 2000-10-15 1 0 71
4 2001-01-03 0 80 151
5 2001-03-17 1 0 151
6 2001-05-23 1 0 151
7 2001-08-26 0 95 246
Any assistance on how to fine tune this or a different approach would be greatly appreciated.
You could try something like this:
# make an index of the latest events
last_event_index <- cumsum(df$event) + 1
# shift it by one to the right
last_event_index <- c(1, last_event_index[1:length(last_event_index) - 1])
# get the dates of the events and index the vector with the last_event_index,
# added an NA as the first date because there was no event
last_event_date <- c(as.Date(NA), df[which(df$event==1), "date"])[last_event_index]
# subtract the date of the last event from the current date
df$tae <- df$date - last_event_date
df
# date event tae
#1 2000-07-06 0 NA days
#2 2000-09-15 0 NA days
#3 2000-10-15 1 NA days
#4 2001-01-03 0 80 days
#5 2001-03-17 1 153 days
#6 2001-05-23 1 67 days
#7 2001-08-26 0 95 days
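If you want the first event itself to show 0 rather than NA, as in the expected output, one small patch afterwards would be:

# the expected output shows 0 (not NA) for the very first event
first_event <- which(df$event == 1)[1]
df$tae[first_event] <- 0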
It's painful and you lose performance, but you can do it with a for loop:
datas <- read.table(text = "date event
2000-07-06 0
2000-09-15 0
2000-10-15 1
2001-01-03 0
2001-03-17 1
2001-05-23 1
2001-08-26 0", header = TRUE, stringsAsFactors = FALSE)
datas <- transform(datas, date = as.Date(date))
lastEvent <- NA
tae <- rep(NA, length(datas$event))
for (i in 2:length(datas$event)) {
  if (datas$event[i-1] == 1) {
    lastEvent <- datas$date[i-1]
  }
  tae[i] <- datas$date[i] - lastEvent

  # To set the first occurring event to 0 and not NA
  if (datas$event[i] == 1 && sum(datas$event[1:(i-1)] == 1) == 0) {
    tae[i] <- 0
  }
}
cbind(datas, tae)
date event tae
1 2000-07-06 0 NA
2 2000-09-15 0 NA
3 2000-10-15 1 0
4 2001-01-03 0 80
5 2001-03-17 1 153
6 2001-05-23 1 67
7 2001-08-26 0 95
Old question, but I was experimenting with rolling joins and found this interesting.
library(data.table)
setDT(df)
setkey(df, date)
# rolling self-join to attach last event time
df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
# find difference between record and previous event == 1 record
df[, tae := difftime(lastevent, shift(lastevent, 1L, "lag"), unit = "days")]
# difftime for simple case between date and joint on previous event
df[event == 0, tae:= difftime(date, lastevent, unit = "days")]
> df
date lastevent event tae
1: 2000-07-06 <NA> 0 NA days
2: 2000-09-15 <NA> 0 NA days
3: 2000-10-15 2000-10-15 1 NA days
4: 2001-01-03 2000-10-15 0 80 days
5: 2001-03-17 2001-03-17 1 153 days
6: 2001-05-23 2001-05-23 1 67 days
7: 2001-08-26 2001-05-23 0 95 days
I'm way late to the party, but I used tidyr::fill to make this easier. You essentially convert your non-events to missing values, then use fill to fill the NAs in with the last event, then subtract the current date from the last event.
I've tested this with an integer date column, so it might need some tweaking for a Date-type date column (especially the use of NA_integer_: I'm not sure what the underlying type is for Date objects; I'm guessing NA_real_).
df %>%
mutate(
event = as.logical(event),
last_event = if_else(event, true = date, false = NA_integer_)) %>%
fill(last_event) %>%
mutate(event_age = date - last_event)
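For the record, Date objects are stored as doubles under the hood, so for the Date column in the question a sketch along these lines should work, using a Date-typed NA so both if_else branches share a class:

library(dplyr)
library(tidyr)

df %>%
  mutate(
    event = as.logical(event),
    # both branches of if_else() must have the same class, hence as.Date(NA)
    last_event = if_else(event, true = date, false = as.Date(NA))) %>%
  fill(last_event) %>%
  mutate(event_age = date - last_event)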
I had a similar issue and was able to solve it by combining some of the ideas above. The main difference in my case was that each customer (a through n) would have different events (for me, purchases). I wanted the cumulative totals for all these purchases as well as the date of the last activity. The main way I solved this was to create an index data frame to join with the main data frame, similar to the top-rated answer above. See the reproducible code below.
library(tidyverse)
rm(list = ls())

# generate a reproducible sample data frame
set.seed(1)  # added so that sample() is actually reproducible
df <- as.data.frame(sample(rep(sample(seq(as.Date('1999/01/01'),
                                          as.Date('2000/01/01'),
                                          by = "day"), 12), each = 4), 36))
df$subtotal <- sample(1:100, 36)
df$cust <- sample(rep(c("a", "b", "c", "d", "e", "f"), each = 12), 36)
colnames(df) <- c("dates", "subtotal", "cust")
#add a "key" based on date and event
df$datekey <- paste0(df$dates, df$cust)
# The following two summaries are specific to my own analysis but are included to show depth
df_total_visits <- df %>%
  select(dates, cust) %>%
  distinct() %>%
  group_by(cust) %>%
  tally(name = "total_visits") %>%
  mutate(variable = 1)
df_order_bydate <- df %>%
  select(dates, cust) %>%
  group_by(dates, cust) %>%
  tally(name = "day_orders")
df <- left_join(df, df_total_visits)
df <- left_join(df, df_order_bydate) %>% arrange(dates)
# Now we will add the index; the arrange() in the previous line is essential
# if your data is not already ordered by date
cummulative_groupping <- df %>%
  select(datekey, cust, variable, subtotal) %>%
  group_by(datekey) %>%
  mutate(spending = sum(subtotal)) %>%
  distinct(datekey, .keep_all = TRUE) %>%
  select(-subtotal)
cummulative_groupping <- cummulative_groupping %>%
  group_by(cust) %>%
  mutate(cumulative_visits = cumsum(variable),
         cumulative_spend  = cumsum(spending))
df <- left_join(df, cummulative_groupping) %>% select(-variable)
# Using the cumulative visits as the index: adding one to this number lets us
# join each visit onto the customer's previous visit
last_date_index <- df %>% select(dates, cust, cumulative_visits)
last_date_index$cumulative_visits <- last_date_index$cumulative_visits + 1
colnames(last_date_index) <- c("last_visit_date", "cust", "cumulative_visits")
df <- left_join(df, last_date_index, by = c("cust", "cumulative_visits"))
# The difference between the date and the last visit answers the original
# poster's question; rows with no previous visit stay NA
df$toa <- df$dates - df$last_visit_date
This answer also works in cases where the same event occurs on the same day (either from bad data hygiene, or when multiple vendors/customers attend the same event). Thank you for viewing my answer. This is actually my first post on Stack Overflow.