Cumulative sums in R with multiple conditions?

I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.
The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.
So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.
# A tibble: 15 x 3
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index
<date> <int> <int>
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5
The tricky part is that determining whether a new copy is needed depends not only on the number of requests (ID.Index) and the REQUEST.DATE of some previous loan, but also on the preceding accumulating sum.
For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.
The desired output ("Copies") would take each new request (ID.Index), then look back to the relevant REQUEST.DATE based on the number of available copies, and doing that would mean looking at the accumulating sum for the preceding calculation. (Note: The max number of copies would be 10.)
I've provided the desired output for the sample below ("Copies").
# A tibble: 15 x 4
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index Copies
<date> <int> <int> <dbl>
1 2013-07-09 2 1 1
2 2013-08-07 2 2 2
3 2013-08-20 2 3 3
4 2013-09-08 2 4 4
5 2013-09-28 2 5 5
6 2013-12-27 2 6 5
7 2014-02-10 2 7 5
8 2014-03-12 2 8 5
9 2014-03-14 2 9 5
10 2014-08-27 2 10 5
11 2014-04-27 6 1 1
12 2014-08-01 6 2 2
13 2014-11-13 6 3 2
14 2015-02-14 6 4 2
15 2015-05-14 6 5 2
I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.
Thanks a million!
4/19 update: new examples where a new copy may be added after a delay, i.e., not in sequence. I've also added columns showing the days since each previous request, which helps in checking whether a new copy should be added, given how many copies there are.
Sample 2: a new copy should be added with the third request, since it has only been 96 days since the last request (and there is only one copy).
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10680332 2013-10-17 6 1 NA days NA days NA days NA days NA days 1
2 PEN-10835735 2014-04-27 6 2 192 days NA days NA days NA days NA days 1
3 PEN-10873506 2014-08-01 6 3 96 days 288 days NA days NA days NA days 1
4 PEN-10951264 2014-11-13 6 4 104 days 200 days 392 days NA days NA days 1
5 PEN-11029526 2015-02-14 6 5 93 days 197 days 293 days 485 days NA days 1
6 PEN-11106581 2015-05-14 6 6 89 days 182 days 286 days 382 days 574 days 1
Sample 3: a new copy should be added with the last request, since there are two copies and the request two back was only 45 days earlier.
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10999392 2015-01-20 76 1 NA days NA days NA days NA days NA days 1
2 YAL-11004302 2015-01-22 76 2 2 days NA days NA days NA days NA days 2
3 COR-11108471 2015-05-18 76 3 116 days 118 days NA days NA days NA days 2
4 HVD-11136632 2015-07-27 76 4 70 days 186 days 188 days NA days NA days 2
5 MIT-11164843 2015-09-09 76 5 44 days 114 days 230 days 232 days NA days 2
6 HVD-11166239 2015-09-10 76 6 1 days 45 days 115 days 231 days 233 days 2
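For reference, here is a loop-based sketch of the rule as described above, which appears to reproduce all three samples. It assumes requests are sorted by date within each title; copies_needed is a hypothetical helper name, not an existing function.
library(dplyr)

# With k copies available, request i triggers a new copy when it falls
# within 112 days of request i - k (capped at 10 copies). Assumes the
# dates within each title are sorted in ascending order.
copies_needed <- function(dates, window = 112, max_copies = 10) {
  copies <- numeric(length(dates))
  copies[1] <- 1
  for (i in seq_along(dates)[-1]) {
    k <- copies[i - 1]
    add <- as.numeric(dates[i] - dates[i - k]) <= window
    copies[i] <- min(k + add, max_copies)
  }
  copies
}

data %>%
  group_by(Title.ID) %>%
  mutate(Copies = copies_needed(REQUEST.DATE))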

You can use the runner package to apply any R function over a cumulative window.
Here we execute a function f on x = REQUEST.DATE and simply count the observations that fall within min(x) + 112 days.
library(dplyr)
library(runner)

data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = runner(
      x = REQUEST.DATE,
      f = function(x) {
        length(x[x <= (min(x) + 112)])
      }
    )
  )
# # A tibble: 15 x 4
# # Groups: Title.ID [2]
# REQUEST.DATE Title.ID ID.Index Copies
# <date> <int> <int> <int>
# 1 2013-07-09 2 1 1
# 2 2013-08-07 2 2 2
# 3 2013-08-20 2 3 3
# 4 2013-09-08 2 4 4
# 5 2013-09-28 2 5 5
# 6 2013-12-27 2 6 5
# 7 2014-02-10 2 7 5
# 8 2014-03-12 2 8 5
# 9 2014-03-14 2 9 5
# 10 2014-08-27 2 10 5
# 11 2014-04-27 6 1 1
# 12 2014-08-01 6 2 2
# 13 2014-11-13 6 3 2
# 14 2015-02-14 6 4 2
# 15 2015-05-14 6 5 2
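With no k argument, runner() applies f to the growing window x[1:i], so each row's Copies counts all requests so far that fall within 112 days of the group's first request. That matches the first sample, though it may not capture the look-back-by-copies rule in the later samples.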
data
data <- read.table(
  text = "   REQUEST.DATE Title.ID ID.Index
  1    2013-07-09        2        1
  2    2013-08-07        2        2
  3    2013-08-20        2        3
  4    2013-09-08        2        4
  5    2013-09-28        2        5
  6    2013-12-27        2        6
  7    2014-02-10        2        7
  8    2014-03-12        2        8
  9    2014-03-14        2        9
  10   2014-08-27        2       10
  11   2014-04-27        6        1
  12   2014-08-01        6        2
  13   2014-11-13        6        3
  14   2015-02-14        6        4
  15   2015-05-14        6        5",
  header = TRUE
)
data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))

I was able to find a workable solution based on finding, for each title, the maximum number of other requests falling within 112 days of a given request (after creating a return date).
data$RETURN.DATE <- data$REQUEST.DATE + 112

data <- data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = sapply(REQUEST.DATE, function(x)
      sum(REQUEST.DATE <= x & RETURN.DATE >= x))
  )
Then I de-duplicated the list of titles, using the max number for each title, and added it back to the original data.
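A minimal sketch of that de-duplication step, assuming dplyr (the column names here are illustrative):
library(dplyr)

# Keep one row per title with its maximum Copies, then join it back
title_copies <- data %>%
  group_by(Title.ID) %>%
  summarise(Copies.Needed = max(Copies))

data <- data %>%
  left_join(title_copies, by = "Title.ID")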
I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.

Related

adding rows by group to get same number of observations by group

I have what seems like a pretty simple question, but I haven't been able to successfully adapt solutions from similar questions to my situation, including this one: Add row for each group with missing value
I have some data that looks like this:
# A tibble: 265 x 4
anon_ID assistance_date Benefit_1 nth_assistance_interaction
<int> <chr> <chr> <dbl>
1 8 2020-04-10 Medical 5
2 8 2020-04-13 Medical 10
3 8 2020-04-15 Medical 15
4 8 2020-04-21 Medical 20
5 11 2020-06-17 Housing 5
6 11 2020-06-25 Financial 10
7 11 2021-01-27 Financial 15
8 26 2020-05-18 Legal 5
9 26 2021-06-01 Food 10
10 26 2021-08-02 Utilities 15
# ... with 255 more rows
I want to modify it so that each anon_ID has four observations, one for each unique value of nth_assistance_interaction. The values of assistance_date and Benefit_1 should be NA when real values for these variables don't exist.
e.g., for anon_ID = 11, these two variables would have NA values when nth_assistance_interaction = 20.
# A tibble: 265 x 4
anon_ID assistance_date Benefit_1 nth_assistance_interaction
<int> <chr> <chr> <dbl>
1 8 2020-04-10 Medical 5
2 8 2020-04-13 Medical 10
3 8 2020-04-15 Medical 15
4 8 2020-04-21 Medical 20
5 11 2020-06-17 Housing 5
6 11 2020-06-25 Financial 10
7 11 2021-01-27 Financial 15
8 11 NA NA 20
9 26 2020-05-18 Legal 5
10 26 2021-06-01 Food 10
11 26 2021-08-02 Utilities 15
# ... with 255 more rows
This is just one example of what I'm trying to accomplish. It could also be the case that anon_ID = 27 only has one observation for nth_assistance_interaction, and so I would need to add three rows for them.
How can I go about making this happen? Thanks in advance.
We may group by 'anon_ID' and use complete to expand the data
library(dplyr)
library(tidyr)
df1 %>%
  group_by(anon_ID) %>%
  complete(nth_assistance_interaction = c(5, 10, 15, 20)) %>%
  ungroup()
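Note that complete() fills the other columns (assistance_date, Benefit_1) with NA in the newly added rows, which is exactly the desired behavior here; its fill argument can supply different defaults if needed.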

Group records with time interval overlap

I have a data frame (N = 16) that contains ID (character), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
            "2010-03-01","2010-03-15","2010-07-15","2010-09-10",
            "2010-11-01","2010-11-30","2010-12-15","2010-12-31",
            "2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15","2010-02-13","2010-02-28",
          "2010-03-16","2010-03-16","2010-08-14","2010-10-10",
          "2010-12-01","2010-12-30","2010-12-20","2011-02-19",
          "2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for records whose time intervals overlap. In general terms, if record #1 overlaps with record #2, and record #2 overlaps with record #3, then records #1, #2, and #3 all belong to the same group.
Likewise, if record #1 overlaps with records #2 and #3, but record #2 doesn't overlap with record #3, then all three still belong to the same group.
In the example above and for ID=1, the first four records overlap.
Here is the desired final output (included as an image in the original post, not reproduced here).
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)

df %>%
  group_by(ID) %>%
  arrange(w_from) %>%
  mutate(group = 1 + cumsum(
    cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)
  ))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
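The idea behind the trick: within each ID, once rows are sorted by w_from, a new group starts exactly when a record's w_from exceeds the running maximum (cummax) of all earlier w_to values, and cumsum turns those break points into consecutive group numbers.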

Conditional aggregation regarding missing data in R

I am trying to aggregate hourly data to daily data in R. The problem is missing values. I want to apply a threshold on the number of missing values before aggregating: if a given day has more than two missing values, do not compute the daily average and instead fill that day with NA.
My dummy data are hourly values for the first day of 2005.
day hour amount amount2
1 2005-01-01 0 1 1
2 2005-01-01 1 2 NA
3 2005-01-01 2 4 4
4 2005-01-01 3 5 5
5 2005-01-01 4 11 NA
6 2005-01-01 5 4 NA
7 2005-01-01 6 NA NA
8 2005-01-01 7 2 2
9 2005-01-01 8 4 4
10 2005-01-01 9 2 2
11 2005-01-01 10 4 20
12 2005-01-01 11 12 12
13 2005-01-01 12 13 13
14 2005-01-01 13 7 7
15 2005-01-01 14 4 4
16 2005-01-01 15 12 12
17 2005-01-01 16 4 4
18 2005-01-01 17 12 12
19 2005-01-01 18 5 5
20 2005-01-01 19 11 11
21 2005-01-01 20 4 4
22 2005-01-01 21 12 12
23 2005-01-01 22 13 13
24 2005-01-01 23 7 7
What I already got:
agg
day amount amount2
1 2005-01-01 6.9 7.7
What I want to have:
agg
day amount amount2
1 2005-01-01 6.9 NA
Because the number of missing values in the column amount2 is more than two, I want its daily average filled with NA (not 7.7), while it is still calculated for the column amount (6.9).
I have used the function aggregate from the stats package.
library(stats)

amount <- c(1, 2, 4, 5, 11, 4, NA, 2, 4, 2, 4, 12, 13, 7, 4, 12, 4, 12, 5, 11, 4, 12, 13, 7)
amount2 <- c(1, NA, 4, 5, NA, NA, NA, 2, 4, 2, 20, 12, 13, 7, 4, 12, 4, 12, 5, 11, 4, 12, 13, 7)
day <- rep("2005-01-01", 24)
hour <- seq(0, 23)
date <- data.frame(day, hour)
dummy <- cbind(date, amount, amount2)

agg <- aggregate(cbind(amount, amount2) ~ day, dummy, mean)
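A sketch of one possible approach, assuming dplyr (mean_thresh is a hypothetical helper name): compute the daily mean only when a column has at most two missing values that day.
library(dplyr)

# Hypothetical helper: mean that returns NA when more than `max_na`
# values in the group are missing, otherwise the mean of the non-NA values.
mean_thresh <- function(x, max_na = 2) {
  if (sum(is.na(x)) > max_na) NA_real_ else mean(x, na.rm = TRUE)
}

dummy %>%
  group_by(day) %>%
  summarise(across(c(amount, amount2), mean_thresh))
Note that this averages each column over its own non-missing values, whereas the formula interface of aggregate() drops a whole row when any column is NA, which is where the 6.9 above comes from.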

r - dplyr: counting the frequency of unique values in one variable for each unique value of another variable in the same data frame

So here's a sample of some of the rows from my dataframe:
> data[1:25, c("TR_DATE", "TR_TYPE...")]
TR_DATE TR_TYPE...
1 2016-03-01 4
2 2016-03-01 4
3 2016-03-01 5
4 2016-03-01 4
5 2016-03-01 1
6 2016-03-01 7
7 2016-03-01 4
8 2016-03-01 4
9 2016-03-01 24
10 2016-03-01 23
11 2016-03-01 4
12 2016-03-02 4
13 2016-03-02 1
14 2016-03-02 1
15 2016-03-02 4
16 2016-03-02 4
17 2016-03-02 14
18 2016-03-02 4
19 2016-03-02 4
20 2016-03-03 4
21 2016-03-03 1
22 2016-03-03 4
23 2016-03-03 23
24 2016-03-03 1
25 2016-03-03 4
What I'd like to do, exactly, is rearrange the data so that for every unique day I get both the number of unique transaction types and the frequency of each transaction type.
Here's the code that I tried:
data %>%
  group_by(TR_DATE) %>%
  summarise(trancount = n(), trantype = n_distinct(TR_TYPE...))
which gave me part of the result that I wanted:
# A tibble: 68 x 3
TR_DATE trancount trantype
<date> <int> <int>
1 2016-03-01 5816 6
2 2016-03-02 5637 3
3 2016-03-03 4818 3
4 2016-03-04 5070 8
5 2016-03-05 4 2
6 2016-03-08 6707 5
7 2016-03-09 5228 5
8 2016-03-10 4722 6
9 2016-03-11 4469 8
10 2016-03-12 1 1
# ... with 58 more rows
so trantype tells me the number of unique transaction types that happened on a particular day, but I'd like to know the frequency of each of these unique transaction types. What would be the best way to go around doing this?
I tried looking around and found similar questions but was unable to modify the solutions to my requirement.
I'm fairly new to R and would really appreciate some help. Thanks.
You should group by both variables; within each (TR_DATE, TR_TYPE...) group, n() is the frequency of that transaction type (n_distinct(TR_TYPE...) would always be 1 at this grouping level):
data %>%
  group_by(TR_DATE, TR_TYPE...) %>%
  summarise(trancount = n())
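Alternatively, dplyr's count() collapses the group-and-tally into a single call:
library(dplyr)

# One row per (day, type) combination, with its frequency in column n
data %>% count(TR_DATE, TR_TYPE...)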

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/04/2001", "26/05/2001", "01/06/2001",
                   "30/06/2001", "02/07/2001", "15/07/2001", "21/12/2001"),
                 "%d/%m/%Y"),
  event_type = c(0, 4, 1, 2, 4, 1, 0, 2, 3, 3, 4, 3)
)
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers in these two previous posts but have not been able to address my specific problem in R: multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(
  count = ave(df$event_type == df$event_type, df$event_type, FUN = cumsum)
)))
df <- rename(df, c("count" = "last_event_index"))  # this rename() signature is from plyr
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' values after grouping by 'event_type'. Here I am using the data.table approach: convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'event_type', take the diff of 'date'.
library(data.table)
setDT(df)[, days_since_last_event := c(NA, diff(date)), by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or, as @Frank mentioned in the comments, we can use shift (available from v1.9.5 onwards) to get the lag of 'date' (the default is type = "lag") and subtract it from 'date'.
setDT(df)[, days_since_last_event := as.numeric(date - shift(date, type = "lag")),
          by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
do.call(rbind,
        lapply(
          split(df, df$event_type),
          function(d) {
            d$dsle <- c(NA, diff(d$date))
            d
          }
        ))
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>%
  group_by(event_type) %>%
  mutate(days_since_last_event = date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days
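If you prefer plain numbers to the difftime column shown above, wrap the difference in as.numeric(), as the shift() variant does.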
