dplyr: grouping and summarizing/mutating data with rolling time windows

I have irregular time-series data representing a certain type of transaction for users. Each row is timestamped and represents a transaction at that time. Due to the irregular nature of the data, some users might have 100 rows in a day while others might have 0 or 1 transactions in a day.
The data might look something like this:
data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
           "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
           "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)
id date n_widgets
1 1 2015-01-01 1
2 1 2015-01-01 2
3 1 2015-01-05 3
4 1 2015-01-25 4
5 1 2015-02-15 4
6 2 2015-05-05 5
7 2 2015-01-01 2
8 3 2015-08-01 4
9 4 2015-01-01 5
Often I'd like to know some rolling statistics about users, for example: for a given user on a certain day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days, and so on.
Corresponding to the above example, the data should look like:
id date n_widgets n_trans_30 total_widgets_30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
If the time window is daily, the solution is simple: data %>% group_by(id, date) %>% summarize(...)
Similarly, if the time window is monthly, this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)
However, the challenge I'm having is how to set up a time window for an arbitrary period: 5 days, 10 days, etc.
There's also the RcppRoll package, but both RcppRoll and the rolling functions in zoo seem better suited to regular time series. As far as I can tell, these window functions operate on a fixed number of rows rather than a specified time period -- the key difference being that a given time period can contain a differing number of rows depending on the date and the user.
For example, for user 1 the number of transactions in the 5 days preceding 2015-01-01 might be 100, while the number of transactions in the 5 days preceding 2015-02-01 might be 5. Looking back a fixed number of rows simply will not work.
Additionally, there is another SO thread discussing rolling dates for irregular time-series data (Create new column based on condition that exists within a rolling date), but the accepted solution used data.table, and I'm specifically looking for a dplyr way of achieving this.
I suppose, at the heart of it, this problem can be solved by answering this question: how can I group_by arbitrary time periods in dplyr? Alternatively, if there's a different dplyr way to achieve the above without a complicated group_by, how can I do it?
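For what it's worth, fixed (non-overlapping) n-day bins are straightforward -- a sketch of that near-miss, assuming the example data above is stored in df, though this is not the rolling window I'm after:
library(dplyr)

# fixed 5-day bins: each date falls in exactly one bin, unlike a rolling
# window where every row gets its own trailing window
df %>%
  mutate(date = as.Date(date),
         bin = cut(date, breaks = "5 days")) %>%
  group_by(id, bin) %>%
  summarize(total_widgets = sum(n_widgets))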
EDIT: updated example to make nature of the rolling window more clear.

This can be done in a single SQL statement using the sqldf package:
library(sqldf)

dd <- transform(data, date = as.Date(date))

sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) total_widgets30
       from dd a
       left join dd b on b.date between a.date - 30 and a.date
                     and b.id = a.id
                     and b.rowid <= a.rowid  -- keeps the count running within a day
       group by a.rowid")
giving:
id date n_widgets n_trans30 total_widgets30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 2 2015-05-05 5 1 5
6 2 2015-01-01 2 1 2
7 3 2015-08-01 4 1 4
8 4 2015-01-01 5 1 5

Another approach is to expand your dataset to contain all possible days (using tidyr::complete), then use a rolling function (RcppRoll::roll_sum).
The fact that you have multiple observations per day probably creates an issue here, though...
library(dplyr)
library(tidyr)
library(RcppRoll)

df2 <- df %>%
  mutate(date = as.Date(date))

## create a full dataset with all possible dates (going 30 days back for the first observation)
df_full <- df2 %>%
  complete(id,
           date = seq(from = min(.$date) - 30, to = max(.$date), by = 1),
           fill = list(n_widgets = 0))

## now use the rolling function, and keep only the original rows (right join)
df_roll <- df_full %>%
  group_by(id) %>%
  mutate(n_trans_30 = roll_sum(x = n_widgets != 0, n = 30, fill = 0, align = "right"),
         total_widgets_30 = roll_sum(x = n_widgets, n = 30, fill = 0, align = "right")) %>%
  ungroup() %>%
  right_join(df2, by = c("date", "id", "n_widgets"))
df_roll
The result is the same as yours (somewhat by chance):
id date n_widgets n_trans_30 total_widgets_30
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
But as noted, this will fail for some days, because it counts the last 30 observations, not the last 30 days. So you might want to first summarise the information by day, then apply this -- roughly as sketched below.
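A minimal, untested sketch of that per-day aggregation, reusing df2 from above (one row per id/day, so 30 rows really are 30 days):
df_daily <- df2 %>%
  group_by(id, date) %>%
  summarise(n_trans = n(), n_widgets = sum(n_widgets)) %>%
  ungroup() %>%
  complete(id,
           date = seq(from = min(.$date) - 30, to = max(.$date), by = 1),
           fill = list(n_trans = 0, n_widgets = 0))

df_daily %>%
  group_by(id) %>%
  mutate(n_trans_30 = roll_sum(n_trans, n = 30, fill = 0, align = "right"),
         total_widgets_30 = roll_sum(n_widgets, n = 30, fill = 0, align = "right")) %>%
  ungroup() %>%
  filter(n_trans > 0)  # keep only days that actually had transactions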

EDITED based on comment below.
You can try something like this for a window of up to 5 days:
df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  filter(as.numeric(difftime(Sys.Date(), as.Date(date), units = 'days')) <= 5) %>%
  summarise(n_total_widgets = sum(n_widgets))
In this case, no dates fall within five days of today, so it produces no output.
To get last five days for each ID, you can do something like this:
df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  filter(as.numeric(difftime(max(as.Date(date)), as.Date(date), units = 'days')) <= 5) %>%
  summarise(n_total_widgets = sum(n_widgets))
Resulting output will be:
Source: local data frame [4 x 2]
id n_total_widgets
(dbl) (dbl)
1 1 4
2 2 5
3 3 4
4 4 5

I found a way to do this while working on this question:
library(dplyr)
library(lubridate)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
           "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
           "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)
count_window <- function(df, date2, w, id2) {
  min_date <- date2 - w
  df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
  length(df2$date)
}
v_count_window <- Vectorize(count_window, vectorize.args = c("date2", "id2"))

sum_window <- function(df, date2, w, id2) {
  min_date <- date2 - w
  df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
  sum(df2$n_widgets)
}
v_sum_window <- Vectorize(sum_window, vectorize.args = c("date2", "id2"))
res <- df %>%
  mutate(date = ymd(date)) %>%
  mutate(n_trans = v_count_window(., date, 30, id),
         total_widgets = v_sum_window(., date, 30, id)) %>%
  select(id, date, n_widgets, n_trans, total_widgets)
res
id date n_widgets n_trans total_widgets
1 1 2015-01-01 1 2 3
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
This version is fairly case-specific, but you could make a more general version of the functions -- see the sketch below.
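For example, a sketch of such a generalization (my own untested helper), with the column and the summary function as arguments:
stat_window <- function(df, date2, w, id2, col, fun) {
  # apply `fun` to column `col` over the previous `w` days (inclusive) for one id
  df2 <- df %>% filter(id == id2, date >= date2 - w, date <= date2)
  fun(df2[[col]])
}
v_stat_window <- Vectorize(stat_window, vectorize.args = c("date2", "id2"))

df %>%
  mutate(date = ymd(date)) %>%
  mutate(n_trans = v_stat_window(., date, 30, id, "n_widgets", length),
         total_widgets = v_stat_window(., date, 30, id, "n_widgets", sum))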

For simplicity, I recommend the runner package, which handles sliding-window operations. In the OP's request the window size is k = 30 and the windows depend on a date index, idx = date. You can use the runner function, which applies any R function over each window, or a built-in aggregator such as sum_run:
library(runner)
library(dplyr)

df %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    n_trans30 = runner(n_widgets, k = 30, idx = date, f = function(x) length(x)),
    n_widgets30 = sum_run(n_widgets, k = 30, idx = date)
  )
# id date n_widgets n_trans30 n_widgets30
#<dbl> <date> <dbl> <dbl> <dbl>
# 1 2015-01-01 1 1 1
# 1 2015-01-01 2 2 3
# 1 2015-01-05 3 3 6
# 1 2015-01-25 4 4 10
# 1 2015-02-15 4 2 8
# 2 2015-01-01 2 1 2
# 2 2015-05-05 5 1 5
# 3 2015-08-01 4 1 4
# 4 2015-01-01 5 1 5
Important: idx = date should be in ascending order (hence the arrange above).
For more, see the package documentation and vignettes.
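Since runner accepts any R function, the same pattern extends to other rolling statistics -- for example, a 30-day rolling mean (a sketch under the same assumptions as above):
df %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  # mean n_widgets over the 30 days ending at each row's date
  mutate(mean_widgets30 = runner(n_widgets, k = 30, idx = date, f = mean))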

Related

How can I count how many simultaneously-active sanctions have occurred over the past three years?

I have a system which records sanctions against clients' names.
There should only ever be one active sanction, yet there are some cases where there are multiple active sanctions.
I would like to know how I can count how many people had two or more simultaneously-active sanctions over the past three years (sample data ranges from 2019-2022, so this won't need to be filtered in the solution).
The way I would work this out is to detect those cases where start_date2 occurs before end_date1.
Sample data (note that the end_date values are random, so several may occur before their respective start_date values; this is just sample data, so take it with a pinch of salt):
set.seed(147)
sanc <-
  data.frame(
    client = rep(1:200, each = 5),
    start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
    end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
  )
sanc$start_month_year <- format(sanc$start_date, "%Y-%m")
The algorithm in my mind goes like this (a rough, untested dplyr translation is sketched after this list):
for each client
  check if there was more than one active sanction at any one time
  i.e. look for cases where start_date2/start_date3/start_dateY occurs before end_date1/end_date2/end_dateX
group by month-year (using the start_month_year column)
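Here's that sketch (my own untested attempt; it assumes "overlap" means a sanction starting before the latest end_date seen so far for that client):
library(dplyr)

sanc %>%
  group_by(client) %>%
  arrange(start_date, .by_group = TRUE) %>%
  # flag rows whose start precedes the latest end_date seen so far;
  # the first row per client yields NA and is dropped by filter()
  mutate(overlaps_prev = as.integer(start_date) <
           lag(cummax(as.integer(end_date)))) %>%
  ungroup() %>%
  filter(overlaps_prev) %>%
  count(start_month_year)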
The output I am looking for is a monthly breakdown, indicating how many simultaneous sanctions occurred per month. Something like this:
01-2020: 10
02-2020: 35
03-2020: 29
...
01-2022: 5
I believe that I have covered everything, but am happy to clarify anything where required/requested.
Updated, given clarifications in comment section
If we do this without regard to client, then we have something like this:
sanc %>%
  arrange(start_date) %>%
  mutate(same_as_prev = start_date < lag(end_date) |
           (row_number() == 1 & end_date > lead(start_date))) %>%
  group_by(start_month_year) %>%
  summarize(simActive = sum(same_as_prev))
Output:
# A tibble: 37 x 2
start_month_year simActive
<chr> <int>
1 2019-01 29
2 2019-02 26
3 2019-03 30
4 2019-04 26
5 2019-05 25
6 2019-06 19
7 2019-07 19
8 2019-08 26
9 2019-09 21
10 2019-10 23
# ... with 27 more rows
In your original sample data all the clients had only one row, so I've adjusted it so that each of 200 clients has 5 rows. I then do something rather simple:
sanc %>%
  as_tibble() %>%
  group_by(client, active = cumsum(start_date > lag(end_date) & row_number() > 1)) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  distinct(client, active) %>%
  count(client, name = "simActive")
This returns a list of clients, along with the number of times the client had simultaneous active sanctions.
Output:
# A tibble: 193 x 2
client simActive
<int> <int>
1 1 1
2 2 1
3 3 2
4 4 1
5 5 2
6 6 2
7 7 1
8 8 1
9 9 1
10 10 1
# ... with 183 more rows
So for client 1, there was one time when there were 2 or more simultaneously active sanctions. The data for client 1 (see input below) looks like this; rows 3 and 4 were active at the same time.
client start_date end_date start_month_year
1 1 2019-03-18 2019-09-25 2019-03
2 1 2020-10-19 2019-12-03 2020-10
3 1 2021-03-11 2019-11-26 2021-03
4 1 2020-07-06 2021-09-03 2020-07
5 1 2021-05-11 2019-09-06 2021-05
Input:
set.seed(147)
sanc <-
  data.frame(
    client = rep(1:200, each = 5),
    start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
    end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
  )
sanc$start_month_year <- format(sanc$start_date, "%Y-%m")
Here is another way to do it. It might not be very performant, but the approach should yield correct results; see my inline comments for how it works. Note also that I adjusted your sample data: you sampled random start and end dates without ensuring that start_date < end_date, so I changed the generation so that each start_date precedes its end_date.
set.seed(147)
library(dplyr)
library(lubridate)

sanc <-
  tibble(
    client = sample(1:500, 1000, replace = TRUE),
    start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-06-01"), by = "day"), 1000),
    end_date = round(runif(1000, min = 1, max = 150), 0) + start_date
  )

sanc %>%
  # make each sanction a `lubridate::interval`
  mutate(int = interval(start_date, end_date)) %>%
  # group by month and client
  group_by(month = format(start_date, "%Y-%m"), client) %>%
  # use `lubridate::int_overlaps` to compare all intervals pairwise
  summarise(overlap = list(outer(int, int, int_overlaps))) %>%
  # apply to each row ...
  rowwise() %>%
  # ... keeping only the lower triangle of each matrix, and sum it up
  mutate(overlap = sum(overlap[lower.tri(overlap)])) %>%
  # now group by month
  group_by(month) %>%
  # and sum the overlapping sanction pairs per month
  summarise(overlap = sum(overlap))
#> `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.
#> # A tibble: 42 x 2
#> month overlap
#> <chr> <int>
#> 1 2019-01 0
#> 2 2019-02 0
#> 3 2019-03 0
#> 4 2019-04 0
#> 5 2019-05 0
#> 6 2019-06 1
#> 7 2019-07 1
#> 8 2019-08 2
#> 9 2019-09 1
#> 10 2019-10 3
#> # ... with 32 more rows
Created on 2022-03-09 by the reprex package (v2.0.1)

count row if date falls within date range for all dates in series in R

I have a large data frame (~30,000 rows) where I have two date fields "start_date" and "end_date".
I want to summarise the data such that I have 1 column with all the dates and a second column with a count of all the rows in which that date is between the "start_date" and "end_date".
I can make this work using two for loops, but it is very inefficient: it compares each of about 180 dates to 30,000 rows of date ranges, one at a time.
Below is an example. Say I have the following dataframe.
library(tibble)

df <- tibble(
  start_date = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  end_date = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
)
I want this to output a table/dataframe that looks like this
Date Count
1 2
2 4
3 5
4 6
5 7
6 6
7 5
8 4
9 3
10 2
11 1
Are there tidyverse functions, or anything else, that could do this transformation efficiently?
Here's a base R method:
date <- seq(min(df$start_date), max(df$end_date))
count <- sapply(date, \(x) sum(x >= df$start_date & x <= df$end_date))
data.frame(date, count)
# date count
# 1 1 2
# 2 2 4
# 3 3 5
# 4 4 6
# 5 5 7
# 6 6 6
# 7 7 5
# 8 8 4
# 9 9 3
# 10 10 2
# 11 11 1
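A fully vectorized variant of the same idea trades memory for speed -- it builds two nrow(df) x length(date) logical matrices (a sketch):
# [i, j] is TRUE when row i's range covers date j
covered <- outer(df$start_date, date, "<=") & outer(df$end_date, date, ">=")
data.frame(date, count = colSums(covered))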
Here is a data.table approach using foverlaps. First, create a sequence of the desired dates, from the minimum start_date to the maximum end_date. Then, create a simple data.table with one row per date.
Use foverlaps to get the overlapping join between your starting data.frame and the new table. Finally, count the number of rows after the join for each date.
library(data.table)

setDT(df)
dates <- seq(min(df$start_date), max(df$end_date), by = 1)
dt <- data.table(start_date = dates, end_date = dates, key = c("start_date", "end_date"))
foverlaps(df, dt, which = TRUE)[, .N, by = yid]
Output
yid N
1: 1 2
2: 2 4
3: 3 5
4: 4 6
5: 5 7
6: 6 6
7: 7 5
8: 8 4
9: 9 3
10: 10 2
11: 11 1
In the tidyverse you could adapt it to the following:
library(tidyverse)

data.frame(date = seq(min(df$start_date), max(df$end_date), by = 1)) %>%
  rowwise() %>%
  mutate(count = sum(date >= df$start_date & date <= df$end_date))

How to add a column with the most recent recurring observation within a group, but within a certain time period, in R

If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class and, grouped by 'person_ID', create a binary column that returns 1 if the difference between the next and the current visit_date is less than 90 days, and 0 otherwise. Using this column, we get the corresponding next visit_date where the value is 1:
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  mutate(i1 = replace_na(+(difftime(lead(visit_date), visit_date,
                                    units = 'days') < 90), 0),
         date = case_when(as.logical(i1) ~ lead(visit_date)),
         i1 = NULL) %>%
  ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA

sum of positive events over a 12 month rolling window

I am trying to count the number of positive events over a 12 month rolling window.
I can create 365 rows of missing data per year and use zoo::rollapply to sum the number of events per 365 rows of data, but my data frame is really big and I want to do this for a number of variables, so this takes forever to run.
I can get the correct output with this:
data <- data.frame(
  id = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  date = c("20-01-2011", "20-04-2011", "20-10-2011", "20-02-2012",
           "20-05-2012", "20-01-2013", "20-04-2013", "20-10-2013",
           "20-02-2014", "20-05-2014"),
  event = c(0, 1, 1, 1, 0, 1, 0, 0, 1, 1)
)

library(lubridate)
library(dplyr)
library(tidyr)
library(zoo)

data %>%
  group_by(id) %>%
  mutate(date = dmy(date),
         cumsum = cumsum(event)) %>%
  complete(date = full_seq(date, period = 1), fill = list(event = 0)) %>%
  mutate(event12 = rollapplyr(event, width = 365, FUN = sum, partial = TRUE)) %>%
  drop_na(cumsum)
Which is this:
id date event cumsum event12
<fct> <date> <dbl> <dbl> <dbl>
a 2011-01-20 0 0 0
a 2011-04-20 1 1 1
a 2011-10-20 1 2 2
a 2012-02-20 1 3 3
a 2012-05-20 0 3 2
b 2013-01-20 1 1 1
b 2013-04-20 0 1 1
b 2013-10-20 0 1 1
b 2014-02-20 1 2 1
b 2014-05-20 1 3 2
But I want to see if there's a more efficient way -- specifically, how to make the width in rollapply count dates rather than rows.
This can be done without filling in the missing dates, using a self join in a single SQL statement after converting the dates to Date class:
library(sqldf)

data2 <- transform(data, date = as.Date(date, "%d-%m-%Y"))

sqldf("select a.*, sum(b.event) as event12
       from data2 as a
       left join data2 as b on a.id = b.id and b.date between a.date - 365 and a.date
       group by a.rowid
       order by a.rowid")
giving:
id date event event12
1 a 2011-01-20 0 0
2 a 2011-04-20 1 1
3 a 2011-10-20 1 2
4 a 2012-02-20 1 3
5 a 2012-05-20 0 2
6 b 2013-01-20 1 1
7 b 2013-04-20 0 1
8 b 2013-10-20 0 1
9 b 2014-02-20 1 1
10 b 2014-05-20 1 2
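If you prefer to stay in the tidyverse, the slider package can also index windows by date rather than by row -- a sketch, assuming data2 from above and that slider is installed:
library(slider)
library(dplyr)

data2 %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  # sum of event over the window [date - 365, date], indexed by date
  mutate(event12 = slide_index_sum(event, date, before = 365))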

How to add a rank column to sorted dates in R?

I used this code in R:
df[with(df,order(ID,Date)),]
to order the date values for each distinct ID in my data frame. Now I want to add a rank column (1 to n) next to the ordered date values, where rank = 1 is the oldest date and rank = n is the most recent date for each distinct ID.
I have seen questions about adding a rank column but not when sorting with a date value. How do I add this rank column using my code above? Thanks!
Here's a dplyr approach:
library(dplyr)

# Fake data
set.seed(5)
dat <- data.frame(
  date = sample(seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "1 day"), 12),
  ID = rep(LETTERS[1:3], c(2, 6, 4))
)

dat %>%
  group_by(ID) %>%
  mutate(rank = rank(date)) %>%
  arrange(date)
date ID rank
1 2015-01-07 A 1
2 2015-01-21 A 2
3 2015-01-03 B 1
4 2015-01-08 B 2
5 2015-01-14 B 3
6 2015-01-19 B 4
7 2015-01-20 B 5
8 2015-01-27 B 6
9 2015-01-06 C 1
10 2015-01-10 C 2
11 2015-01-22 C 3
12 2015-01-29 C 4
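One design note: rank() returns fractional ranks when two dates tie within an ID. If you want strict integer ranks 1 to n regardless of ties, row_number() is a drop-in alternative (a sketch):
dat %>%
  group_by(ID) %>%
  mutate(rank = row_number(date)) %>%
  arrange(ID, date)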
