Aggregate columns into new ones depending on row value in R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I'm searching for a way to aggregate rows based on date and ID, arranging the values depending on one of the columns' values.
Basically I want to convert this:
DATE ID V1 V2 SHIFT
1 2020-05-21 A 21 34 morning
2 2020-05-21 A 6 12 afternoon
3 2020-05-20 A 19 15 morning
4 2020-05-20 A 7 16 afternoon
5 2020-05-20 B 17 20 morning
6 2020-05-20 B 5 11 afternoon
Into this:
DATE ID V1_mor V2_mor V1_aft V2_aft
1 2020-05-21 A 21 34 6 12
2 2020-05-20 A 19 15 7 16
3 2020-05-20 B 17 20 5 11
I've tried the route of making a loop and adding a bunch of "if" statements, but I don't want to fall into bad practice.
The goal is to later add another column with a calculation of the form (v1_mor * v2_aft) / v2_mor.

Your goal can be achieved with pivot_wider() from the tidyr package:
library(tidyverse)
df %>%
pivot_wider(names_from = SHIFT, values_from = c(V1, V2))
This gives
# A tibble: 3 x 6
# Groups: DATE [2]
DATE ID V1_morning V1_afternoon V2_morning V2_afternoon
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2020-05-21 A 21 6 34 12
2 2020-05-20 A 19 7 15 16
3 2020-05-20 B 17 5 20 11
Data
df <- read_table2("DATE ID V1 V2 SHIFT
2020-05-21 A 21 34 morning
2020-05-21 A 6 12 afternoon
2020-05-20 A 19 15 morning
2020-05-20 A 7 16 afternoon
2020-05-20 B 17 20 morning
2020-05-20 B 5 11 afternoon")
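If the shorter _mor/_aft names from the question and the follow-up calculation are wanted, one extra mutate() is enough. A minimal sketch (recoding SHIFT and the column name ratio are my own additions, not part of the original answer):
df %>%
  mutate(SHIFT = recode(SHIFT, morning = "mor", afternoon = "aft")) %>%
  pivot_wider(names_from = SHIFT, values_from = c(V1, V2)) %>%
  mutate(ratio = (V1_mor * V2_aft) / V2_mor)  # the (v1_mor * v2_aft) / v2_mor goal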

Related

Given a series of dates and a birth day, is there a way to obtain the age at every date entry along with a final age using the lubridate package?

I have a database of information pertaining to individuals observed over time. I would like to find a way to obtain the age of these individuals whenever a record was taken. Assuming BIRTH corresponds to an age of 0, I would like to obtain the age, either in days or months, for the visits after. It would also be helpful to obtain a final age (in days or months) for each individual (not included in the code). For example, for ID (A), the final age would be 10 months. I would like to use lubridate, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date <- c("2000-01-01", "2000-01-14", "2000-01-25", "2000-02-12", "2000-02-27", "2000-06-05", "2000-10-30",
          "2001-02-04", "2001-06-15", "2001-12-26", "2002-05-22", "2002-06-04",
          "2000-01-08", "2000-07-11", "2000-08-18", "2000-11-27")
ID <- c("A", "A", "A", "A", "A", "A", "A",
        "B", "B", "B", "B", "B",
        "C", "C", "C", "C")
status <- c("BIRTH", "ETC", "ETC", "ETC", "ETC", "ETC", "ETC",
            "BIRTH", "ETC", "ETC", "ETC", "ETC",
            "BIRTH", "ETC", "ETC", "ETC")
df1 <- data.frame(date, ID, status)
print(df1)
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
date.new <- c("2000-01-01", "2000-01-14", "2000-01-25", "2000-02-12", "2000-02-27", "2000-06-05", "2000-10-30",
              "2001-02-04", "2001-06-15", "2001-12-26", "2002-05-22", "2002-06-04",
              "2000-01-08", "2000-07-11", "2000-08-18", "2000-11-27")
ID.new <- c("A", "A", "A", "A", "A", "A", "A",
            "B", "B", "B", "B", "B",
            "C", "C", "C", "C")
status.new <- c("BIRTH", "ETC", "ETC", "ETC", "ETC", "ETC", "ETC",
                "BIRTH", "ETC", "ETC", "ETC", "ETC",
                "BIRTH", "ETC", "ETC", "ETC")
age <- c(0, 1, 1, 2, 2, 6, 10,
         0, 4, 10, 15, 16,
         0, 6, 7, 10)
df2 <- data.frame(date.new, ID.new, status.new, age)
print(df2)
date.new ID.new status.new age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2002-06-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
  date = c("2000-01-01", "2000-01-14", "2000-01-25", "2000-02-12",
           "2000-02-27", "2000-06-05", "2000-10-30", "2001-02-04",
           "2001-06-15", "2001-12-26", "2002-05-22", "2002-06-04",
           "2000-01-08", "2000-07-11", "2000-08-18", "2000-11-27"),
  ID = c("A", "A", "A", "A", "A", "A", "A",
         "B", "B", "B", "B", "B",
         "C", "C", "C", "C"),
  status = c("BIRTH", "ETC", "ETC", "ETC", "ETC", "ETC", "ETC",
             "BIRTH", "ETC", "ETC", "ETC", "ETC",
             "BIRTH", "ETC", "ETC", "ETC")
)
df %>%
  mutate(date = date_parse(date)) %>%
  group_by(ID) %>%
  mutate(birth_date = date[status == "BIRTH"]) %>%
  ungroup() %>%
  mutate(
    age_clock = date_count_between(birth_date, date, "month"),
    age_lubridate = as.period(date - birth_date) %/% months(1))
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
Consider this simple example: I think most people would agree that a child born on the February date below is "1 month and 1 day" old on the March date. But lubridate shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
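A quick way to see the day-of-month rule is to test the boundary directly; a sketch (the dates are my own, and the #> lines show what clock should return):
# one day before the 28th is reached: not yet a full month
date_count_between(as.Date("2020-02-28"), as.Date("2020-03-27"), "month")
#> [1] 0
# the 28th is reached: exactly one full month
date_count_between(as.Date("2020-02-28"), as.Date("2020-03-28"), "month")
#> [1] 1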
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
library(dplyr)
library(lubridate)
df1 %>%
  mutate(date = as_date(date)) %>%
  group_by(ID) %>%
  mutate(birth_date = date[status == "BIRTH"],
         age = as.period(date - birth_date) %/% months(1)) %>%
  ungroup()
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
Which is your expected output except for some rounding differences. See my comment on your question.

adding rows by group to get same number of observations by group

I have what seems like a pretty simple question, but I haven't been able to successfully adapt solutions to similar questions to my situation, including this one: Add row for each group with missing value
I have some data that looks like this:
# A tibble: 265 x 4
anon_ID assistance_date Benefit_1 nth_assistance_interaction
<int> <chr> <chr> <dbl>
1 8 2020-04-10 Medical 5
2 8 2020-04-13 Medical 10
3 8 2020-04-15 Medical 15
4 8 2020-04-21 Medical 20
5 11 2020-06-17 Housing 5
6 11 2020-06-25 Financial 10
7 11 2021-01-27 Financial 15
8 26 2020-05-18 Legal 5
9 26 2021-06-01 Food 10
10 26 2021-08-02 Utilities 15
# ... with 255 more rows
I want to modify it so that each anon_ID has four observations, one for each unique value of nth_assistance_interaction. The values of assistance_date and Benefit_1 should be NA when real values for these variables don't exist.
e.g., for anon_ID = 11, these two variables would have NA values when nth_assistance_interaction = 20.
# A tibble: 265 x 4
anon_ID assistance_date Benefit_1 nth_assistance_interaction
<int> <chr> <chr> <dbl>
1 8 2020-04-10 Medical 5
2 8 2020-04-13 Medical 10
3 8 2020-04-15 Medical 15
4 8 2020-04-21 Medical 20
5 11 2020-06-17 Housing 5
6 11 2020-06-25 Financial 10
7 11 2021-01-27 Financial 15
8 11 NA NA 20
9 26 2020-05-18 Legal 5
10 26 2021-06-01 Food 10
11 26 2021-08-02 Utilities 15
# ... with 255 more rows
This is just one example of what I'm trying to accomplish. It could also be the case that anon_ID = 27 only has one observation for nth_assistance_interaction, and so I would need to add three rows for them.
How can I go about making this happen? Thanks in advance.
We may group by 'anon_ID' and use complete() to expand the data:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(anon_ID) %>%
  complete(nth_assistance_interaction = c(5, 10, 15, 20)) %>%
  ungroup()
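By default complete() fills the remaining columns (assistance_date, Benefit_1) of the new rows with NA, which is the behaviour asked for. If a constant were wanted instead, complete() also takes a fill argument; a sketch (the "Unknown" label is only an illustration):
df1 %>%
  group_by(anon_ID) %>%
  complete(nth_assistance_interaction = c(5, 10, 15, 20),
           fill = list(Benefit_1 = "Unknown")) %>%
  ungroup()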

Remove row on group depending on multiple criteria in R

I have a dataset with some repeated values in the Date variable, and I would like to filter these rows based on several conditions. As an example, the data frame looks like:
df <- read.table(text =
"Date column_A column_B column_C Column_D
1 2020-01-01 10 15 15 20
2 2020-01-02 10 15 15 20
3 2020-01-03 10 13 15 20
4 2020-01-04 10 15 15 20
5 2020-01-05 NA 14 15 20
6 2020-01-05 7 NA NA 28
7 2020-01-06 10 15 15 20
8 2020-01-07 10 15 15 20
9 2020-01-07 10 NA NA 20
10 2020-01-08 10 15 15 20", header=TRUE)
df$Date <- as.Date(df$Date)
The conditions to filter by, applied ONLY to duplicated rows, should be:
If column_A is NA in one row and numeric in the other, select the numeric row.
If both values are similar (both NA or both numeric), select the row with fewer NAs.
My best approach, after trying several options, is:
df$cnt_na <- apply(df[,2:5], 1, function(x) sum(is.na(x)))
df <- df %>% group_by(Date) %>% slice(which.min(all_of(cnt_na))) %>% select(-cnt_na)
Although in my case, it doesn't handle the first condition. The main problem is that if I filter by !is.na(Date), I also remove other non-duplicated rows.
Thanks in advance
I would sort your table based on your conditions and then pick the first row for every group:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(cnt_na = sum(across(-Date, ~ sum(is.na(.))))) %>%
  arrange(Date, is.na(column_A), cnt_na) %>%
  group_by(Date) %>%
  slice_head() %>%
  ungroup()
which gives
# A tibble: 8 x 6
Date column_A column_B column_C Column_D cnt_na
<date> <int> <int> <int> <int> <int>
1 2020-01-01 10 15 15 20 0
2 2020-01-02 10 15 15 20 0
3 2020-01-03 10 13 15 20 0
4 2020-01-04 10 15 15 20 0
5 2020-01-05 7 NA NA 28 2
6 2020-01-06 10 15 15 20 0
7 2020-01-07 10 15 15 20 0
8 2020-01-08 10 15 15 20 0
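The same sort-then-keep-first idea also works without dplyr; a base R sketch under the same tie-breaking rules (non-NA column_A first, then fewest NAs):
# count NAs per row across the value columns
cnt_na <- rowSums(is.na(df[-1]))
# order by Date, then rows with non-NA column_A, then fewest NAs
df_sorted <- df[order(df$Date, is.na(df$column_A), cnt_na), ]
# keep the first row of each Date
df_sorted[!duplicated(df_sorted$Date), ]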

A running sum for daily data that resets when month turns

I have a 2-column table (tibble), made up of a date object and a numeric variable. There is at most one entry per day, but not every day has an entry (i.e., date is a natural primary key). I am attempting to do a running sum of the numeric column along with the dates, but with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated what I want to get as a result below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" is possibly suited to this, but I don't really understand how to instruct it. I know I could use a join operation plus a group_by using dplyr, but the data set is very large and doing so would be wildly inefficient. I could also manually iterate through the list with a loop, but that also seems inelegant. The last option I can think of is selecting out a unique vector of yearmon objects, cutting the original list into many shorter lists, and running a plain cumsum on each, but that also feels suboptimal. I am sure this is not the first time someone has had to do this, and given how many tools there are in the tidyverse, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods I described above (which would take less time than writing this post) is that this code needs to be very readable by an audience that is less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
          ][, monthly.running.sum := cumsum(score), by = format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract month and year values from the date, group_by those values, and then perform the cumulative sum as follows:
library(lubridate)
library(dplyr)
df %>%
  mutate(Month = month(mdy(Date)),
         Year = year(mdy(Date))) %>%
  group_by(Month, Year) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative is to use the floor_date() function to convert each date to the first day of its month and then calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>%
  mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
  group_by(Floor) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative:
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"), FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects, so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(DF, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

Cumulative sums in R with multiple conditions?

I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.
The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.
So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.
# A tibble: 15 x 3
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index
<date> <int> <int>
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5
The tricky part is that determining whether a new copy is needed is based not only on the number of requests (ID.Index) and the REQUEST.DATE of some previous loan, but also on the preceding accumulating sum.
For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.
The desired output ("Copies") would take each new request (ID.Index) and look back to the relevant REQUEST.DATE based on the number of available copies, which in turn means consulting the accumulating sum from the preceding calculation. (Note: the max number of copies would be 10.)
I've provided the desired output for the sample below ("Copies").
# A tibble: 15 x 4
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index Copies
<date> <int> <int> <dbl>
1 2013-07-09 2 1 1
2 2013-08-07 2 2 2
3 2013-08-20 2 3 3
4 2013-09-08 2 4 4
5 2013-09-28 2 5 5
6 2013-12-27 2 6 5
7 2014-02-10 2 7 5
8 2014-03-12 2 8 5
9 2014-03-14 2 9 5
10 2014-08-27 2 10 5
11 2014-04-27 6 1 1
12 2014-08-01 6 2 2
13 2014-11-13 6 3 2
14 2015-02-14 6 4 2
15 2015-05-14 6 5 2
I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.
Thanks a million!
4/19 update: new examples where a new copy may be added after a delay, i.e., not in sequence. I've also added columns showing the days since each previous request, which helps in checking whether a new copy should be added, based on how many copies there are.
Sample 2: a new copy should be added with the third request, since it has only been 96 days since the last request (and there is only one copy):
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10680332 2013-10-17 6 1 NA days NA days NA days NA days NA days 1
2 PEN-10835735 2014-04-27 6 2 192 days NA days NA days NA days NA days 1
3 PEN-10873506 2014-08-01 6 3 96 days 288 days NA days NA days NA days 1
4 PEN-10951264 2014-11-13 6 4 104 days 200 days 392 days NA days NA days 1
5 PEN-11029526 2015-02-14 6 5 93 days 197 days 293 days 485 days NA days 1
6 PEN-11106581 2015-05-14 6 6 89 days 182 days 286 days 382 days 574 days 1
Sample 3: a new copy should be added with the last request, since there are two copies and the oldest relevant request is only 45 days back:
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10999392 2015-01-20 76 1 NA days NA days NA days NA days NA days 1
2 YAL-11004302 2015-01-22 76 2 2 days NA days NA days NA days NA days 2
3 COR-11108471 2015-05-18 76 3 116 days 118 days NA days NA days NA days 2
4 HVD-11136632 2015-07-27 76 4 70 days 186 days 188 days NA days NA days 2
5 MIT-11164843 2015-09-09 76 5 44 days 114 days 230 days 232 days NA days 2
6 HVD-11166239 2015-09-10 76 6 1 days 45 days 115 days 231 days 233 days 2
You can use the runner package to apply any R function over a cumulative window.
Here we execute a function f on x = REQUEST.DATE and simply count the observations that fall within min(x) + 112 days.
library(dplyr)
library(runner)
data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = runner(
      x = REQUEST.DATE,
      f = function(x) {
        length(x[x <= (min(x) + 112)])
      }
    )
  )
# # A tibble: 15 x 4
# # Groups: Title.ID [2]
# REQUEST.DATE Title.ID ID.Index Copies
# <date> <int> <int> <int>
# 1 2013-07-09 2 1 1
# 2 2013-08-07 2 2 2
# 3 2013-08-20 2 3 3
# 4 2013-09-08 2 4 4
# 5 2013-09-28 2 5 5
# 6 2013-12-27 2 6 5
# 7 2014-02-10 2 7 5
# 8 2014-03-12 2 8 5
# 9 2014-03-14 2 9 5
# 10 2014-08-27 2 10 5
# 11 2014-04-27 6 1 1
# 12 2014-08-01 6 2 2
# 13 2014-11-13 6 3 2
# 14 2015-02-14 6 4 2
# 15 2015-05-14 6 5 2
data
data <- read.table(
text = " REQUEST.DATE Title.ID ID.Index
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5",
header = TRUE)
data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))
I was able to find a workable solution based on finding, for each title, the max number of other requests within 112 days of a given request (after creating a return date).
data$RETURN.DATE <- as.Date(data$REQUEST.DATE + 112)
data <- data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = sapply(REQUEST.DATE, function(x)
      sum(as.Date(REQUEST.DATE) <= as.Date(x) &
          as.Date(RETURN.DATE) >= as.Date(x)))
  )
Then I de-duplicated the list of titles, using the max number for each title, and added it back to the original data.
I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.
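For the record, here is one way the original sequential rule could be modelled; a sketch only, under two stated assumptions: each copy becomes free again 112 days after the request it served, and a copy is added (up to the max of 10) whenever no copy is free on a request's date.
add_copies <- function(dates, window = 112, max_copies = 10) {
  free_on <- dates[1] + window              # date each copy becomes free again
  copies <- integer(length(dates))
  copies[1] <- 1
  for (i in seq_along(dates)[-1]) {
    free <- which(free_on <= dates[i])      # copies available on this date
    if (length(free) > 0) {
      j <- free[which.min(free_on[free])]   # reuse the longest-idle copy
      free_on[j] <- dates[i] + window
    } else if (length(free_on) < max_copies) {
      free_on <- c(free_on, dates[i] + window)  # no copy free: add a new one
    }
    copies[i] <- length(free_on)
  }
  copies
}
data %>%
  group_by(Title.ID) %>%
  mutate(Copies = add_copies(REQUEST.DATE))
On the two update samples this should give a second copy at the third request of Title.ID 6 and a third copy at the last request of Title.ID 76.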
