Add column values considering Dates - r

I would like to create a new data frame from the df data frame given below, with only one row per day. For example, instead of 4 rows for 2021-07-01 there should be only 1, with the values of the numeric columns for that day summed.
df <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-04-03","2021-04-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-04-07","2021-04-09","2021-04-10","2021-04-10"),
Week= c("Thursday","Thursday","Thursday","Thursday","Friday","Friday","Friday","Friday",
"Friday","Friday","Saturday","Saturday","Saturday","Saturday","Saturday","Thursday",
"Thursday","Friday","Friday","Saturday","Saturday"),
DTPE = c("Ho","Ho","Ho","Ho","","","","","","","","","","","","","","","","Ho","Ho"),
D1 = c(8,1,9, 3,5,4,7,6,3,8,2,3,4,6,7,8,4,2,6,2,3), DR01 = c(4,1,4,3,3,4,3,6,3,7,2,3,4,6,7,8,4,2,6,7,3),
DR02 = c(8,1,4,3,3,4,1,6,3,7,2,3,4,6,7,8,4,2,6,2,3), DR03 = c(7,5,4,3,3,4,1,5,3,3,2,3,4,6,7,8,4,2,6,4,3),
DR04= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR05 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
> df
Id date1 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-07-01 Thursday Ho 3 3 3 3 7 3
5 1 2021-04-02 Friday 5 3 3 3 3 3
6 1 2021-04-02 Friday 4 4 4 4 2 2
7 1 2021-04-02 Friday 7 3 1 1 7 1
8 1 2021-04-02 Friday 6 6 6 5 4 5
9 1 2021-04-02 Friday 3 3 3 3 2 3
10 1 2021-04-02 Friday 8 7 7 3 1 7
11 1 2021-04-03 Saturday 2 2 2 2 2 2
12 1 2021-04-03 Saturday 3 3 3 3 3 3
13 1 2021-04-03 Saturday 4 4 4 4 4 4
14 1 2021-04-03 Saturday 6 6 6 6 6 7
15 1 2021-04-03 Saturday 7 7 7 7 7 7
16 1 2021-04-08 Thursday 8 8 8 8 8 8
17 1 2021-04-08 Thursday 4 4 4 4 4 4
18 1 2021-04-07 Friday 2 2 2 2 2 2
19 1 2021-04-09 Friday 6 6 6 6 6 6
20 1 2021-04-10 Saturday Ho 2 7 2 4 4 4
21 1 2021-04-10 Saturday Ho 3 3 3 3 3 3

We can group by 'Id', 'date1' and 'Week', then use summarise with across to sum the numeric columns:
library(dplyr)
df %>%
  group_by(Id, date1, Week) %>%
  # lambda form: passing na.rm through across()'s `...` is deprecated since dplyr 1.1.0
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')

You can perform this using the following code:
library(dplyr)
df %>%
group_by(Id, date1, Week) %>%
select(D1:DR05) %>%
summarise_all(sum)
# A tibble: 7 × 9
# Groups: Id, date1 [7]
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-03 Saturday 22 22 22 22 22 23
3 1 2021-04-07 Friday 2 2 2 2 2 2
4 1 2021-04-08 Thursday 12 12 12 12 12 12
5 1 2021-04-09 Friday 6 6 6 6 6 6
6 1 2021-04-10 Saturday 5 10 5 7 7 7
7 1 2021-07-01 Thursday 21 12 16 19 22 21
You may also want to convert the date1 column to a Date object, for example with lubridate's ymd() inside a mutate().
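As a minimal sketch of that conversion, the snippet below parses the dates with ymd() before grouping. The `toy` data frame and the `daily` name are illustrative stand-ins, not part of the original question; only a single value column is summed to keep the example short.

```r
library(dplyr)
library(lubridate)

# Small illustrative stand-in for the question's df (hypothetical values)
toy <- data.frame(
  Id    = c(1, 1, 1),
  date1 = c("2021-07-01", "2021-07-01", "2021-04-02"),
  D1    = c(8, 1, 5)
)

daily <- toy %>%
  mutate(date1 = ymd(date1)) %>%   # character -> Date
  group_by(Id, date1) %>%
  summarise(D1 = sum(D1), .groups = "drop")

class(daily$date1)
```

With a real Date column, the grouped result sorts chronologically and date arithmetic works directly on it.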

Base R with aggregate:
aggregate(cbind(D1, DR01, DR02, DR03, DR04, DR05) ~ Id+date1+Week, df, sum)
Output:
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-07 Friday 2 2 2 2 2 2
3 1 2021-04-09 Friday 6 6 6 6 6 6
4 1 2021-04-03 Saturday 22 22 22 22 22 23
5 1 2021-04-10 Saturday 5 10 5 7 7 7
6 1 2021-04-08 Thursday 12 12 12 12 12 12
7 1 2021-07-01 Thursday 21 12 16 19 22 21
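In the aggregate output above, the rows come out ordered by Week first and then by date1. If chronological order is preferred, one possible sketch is to reorder the result afterwards. The `toy` data frame and the `res` name are hypothetical, used only for illustration.

```r
# Small illustrative stand-in for the question's df (hypothetical values)
toy <- data.frame(
  date1 = c("2021-04-09", "2021-04-03", "2021-04-03", "2021-04-02"),
  Week  = c("Friday", "Saturday", "Saturday", "Friday"),
  D1    = c(6, 2, 3, 5)
)

res <- aggregate(D1 ~ date1 + Week, toy, sum)

# Reorder chronologically and reset the row names
res <- res[order(as.Date(res$date1)), ]
rownames(res) <- NULL
res
```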

Related

lubridate: Finding weeks within months

I want to find weeks within months (separate numbering of weeks within months) using lubridate R package. My minimum working example is below:
library(tidyverse)
library(lubridate)
dt1 <-
tibble(
Date = seq(from = ymd("2021-01-01"), to = ymd("2021-12-31"), by = '1 day')
, Month = month(Date)
)
dt2 <-
dt1 %>%
group_by(Month) %>%
mutate(Week = week(Date))
dt2 %>%
print(n = 40)
# A tibble: 365 x 3
# Groups: Month [12]
Date Month Week
<date> <dbl> <dbl>
1 2021-01-01 1 1
2 2021-01-02 1 1
3 2021-01-03 1 1
4 2021-01-04 1 1
5 2021-01-05 1 1
6 2021-01-06 1 1
7 2021-01-07 1 1
8 2021-01-08 1 2
9 2021-01-09 1 2
10 2021-01-10 1 2
11 2021-01-11 1 2
12 2021-01-12 1 2
13 2021-01-13 1 2
14 2021-01-14 1 2
15 2021-01-15 1 3
16 2021-01-16 1 3
17 2021-01-17 1 3
18 2021-01-18 1 3
19 2021-01-19 1 3
20 2021-01-20 1 3
21 2021-01-21 1 3
22 2021-01-22 1 4
23 2021-01-23 1 4
24 2021-01-24 1 4
25 2021-01-25 1 4
26 2021-01-26 1 4
27 2021-01-27 1 4
28 2021-01-28 1 4
29 2021-01-29 1 5
30 2021-01-30 1 5
31 2021-01-31 1 5
32 2021-02-01 2 5
33 2021-02-02 2 5
34 2021-02-03 2 5
35 2021-02-04 2 5
36 2021-02-05 2 6
37 2021-02-06 2 6
38 2021-02-07 2 6
39 2021-02-08 2 6
40 2021-02-09 2 6
# ... with 325 more rows
I wonder what I am missing here. For row number 31 in the output (31 2021-01-31 1 5), the value in the Week column should be 1. Any lead on how to get the desired output?
It's not completely clear how you are defining a week. If Week 1 starts on the first day of a month, then you can do:
dt2 <- dt1 %>% mutate(Week = 1L + ((day(Date) - 1L) %/% 7L))
dt2 %>% slice(21:40) %>% print(n = 20L)
# A tibble: 20 × 3
Date Month Week
<date> <dbl> <int>
1 2021-01-21 1 3
2 2021-01-22 1 4
3 2021-01-23 1 4
4 2021-01-24 1 4
5 2021-01-25 1 4
6 2021-01-26 1 4
7 2021-01-27 1 4
8 2021-01-28 1 4
9 2021-01-29 1 5
10 2021-01-30 1 5
11 2021-01-31 1 5
12 2021-02-01 2 1
13 2021-02-02 2 1
14 2021-02-03 2 1
15 2021-02-04 2 1
16 2021-02-05 2 1
17 2021-02-06 2 1
18 2021-02-07 2 1
19 2021-02-08 2 2
20 2021-02-09 2 2
With base R, you could simply do:
dt1$Week <- 1L + ((as.POSIXlt(dt1$Date)$mday - 1L) %/% 7L)
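To see the arithmetic on a few boundary dates, here is a small self-contained sketch of the same formula, under the definition used in this answer (Week 1 starts on the first day of each month). The `dates` and `week_in_month` names are illustrative only.

```r
# A few dates around week and month boundaries
dates <- as.Date(c("2021-01-01", "2021-01-07", "2021-01-08",
                   "2021-01-31", "2021-02-01"))

# Day of month 1-7 -> week 1, 8-14 -> week 2, and so on
week_in_month <- 1L + ((as.POSIXlt(dates)$mday - 1L) %/% 7L)

data.frame(Date = dates, Week = week_in_month)
```

Days 29 to 31 fall into a short fifth week, and the count restarts at 1 on the first day of the next month.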

How to adjust select of a function

Could you help me solve the problem below? As you can see, in the second part of the code I exclude the DR columns whose values are all equal to 0. However, in the third part of the code, I need to select D1 through the last DR column so that the sum can be computed, and this gives an error.
library(dplyr)
df1 <- structure(
list(date1 = c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
"2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28"),
date2 = c("2021-04-02","2021-04-02","2021-04-03","2021-04-08","2021-04-09","2021-04-10","2021-07-01","2021-07-02","2021-07-03","2021-07-03"),
Week= c("Friday","Friday","Saturday","Thursday","Friday","Saturday","Thursday","Friday","Saturday","Saturday"),
D1 = c(2,3,4,4,6,3,4,5,6,2), DR01 = c(4,1,4,3,3,4,3,6,3,2), DR02= c(4,2,6,7,3,2,7,4,4,3),DR03= c(9,5,4,3,3,2,1,5,4,3),
DR04 = c(5,4,3,3,3,6,2,1,9,2),DR05 = c(5,4,5,3,6,2,1,9,3,4),
DR06 = c(2,4,4,3,3,5,6,7,8,3),DR07 = c(2,5,4,4,9,4,7,8,3,3),
DR08 = c(0,0,0,0,1,2,0,0,0,0),DR09 = c(0,0,0,0,0,0,0,0,0,0),DR010 = c(0,0,0,0,0,0,0,0,0,0),DR011 = c(0,4,0,0,0,0,0,0,0,0), DR012 = c(0,0,0,0,0,0,0,0,0,0)),
class = "data.frame", row.names = c(NA, -10L))
df1<-df1 %>%
select(!where(~ is.numeric(.) && all(. == 0)))
df1<-df1 %>%
group_by(date1,date2, Week) %>%
select(D1:DR012) %>%
summarise_all(sum)
We can do the select before the group_by:
library(dplyr)
df1 %>%
select(date1, date2, Week, matches("^D")) %>%
group_by(date1, date2, Week) %>%
summarise(across(everything(), sum), .groups = 'drop')
-output
# A tibble: 8 × 13
date1 date2 Week D1 DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR011
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-06-28 2021-04-02 Friday 5 5 6 14 9 9 6 7 0 4
2 2021-06-28 2021-04-03 Saturday 4 4 6 4 3 5 4 4 0 0
3 2021-06-28 2021-04-08 Thursday 4 3 7 3 3 3 3 4 0 0
4 2021-06-28 2021-04-09 Friday 6 3 3 3 3 6 3 9 1 0
5 2021-06-28 2021-04-10 Saturday 3 4 2 2 6 2 5 4 2 0
6 2021-06-28 2021-07-01 Thursday 4 3 7 1 2 1 6 7 0 0
7 2021-06-28 2021-07-02 Friday 5 6 4 5 1 9 7 8 0 0
8 2021-06-28 2021-07-03 Saturday 8 5 7 7 11 7 11 6 0 0
Since the select was already done, it is not clear why we would have to select again. It is not really needed, because across(everything()) inside summarise covers all columns other than the grouping columns:
df1 %>%
select(!where(~ is.numeric(.) && all(. == 0))) %>%
group_by(across(date1:Week)) %>%
summarise(across(everything(), sum), .groups = 'drop')
# A tibble: 8 × 13
date1 date2 Week D1 DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR011
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-06-28 2021-04-02 Friday 5 5 6 14 9 9 6 7 0 4
2 2021-06-28 2021-04-03 Saturday 4 4 6 4 3 5 4 4 0 0
3 2021-06-28 2021-04-08 Thursday 4 3 7 3 3 3 3 4 0 0
4 2021-06-28 2021-04-09 Friday 6 3 3 3 3 6 3 9 1 0
5 2021-06-28 2021-04-10 Saturday 3 4 2 2 6 2 5 4 2 0
6 2021-06-28 2021-07-01 Thursday 4 3 7 1 2 1 6 7 0 0
7 2021-06-28 2021-07-02 Friday 5 6 4 5 1 9 7 8 0 0
8 2021-06-28 2021-07-03 Saturday 8 5 7 7 11 7 11 6 0 0
We could use summarise with across:
library(dplyr)
df1 %>%
select(!where(~ is.numeric(.) && all(. == 0))) %>%
group_by(date1,date2, Week) %>%
summarise(across(where(is.numeric), sum))
date1 date2 Week D1 DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR011
1 2021-06-28 2021-04-02 Friday 5 5 6 14 9 9 6 7 0 4
2 2021-06-28 2021-04-03 Saturday 4 4 6 4 3 5 4 4 0 0
3 2021-06-28 2021-04-08 Thursday 4 3 7 3 3 3 3 4 0 0
4 2021-06-28 2021-04-09 Friday 6 3 3 3 3 6 3 9 1 0
5 2021-06-28 2021-04-10 Saturday 3 4 2 2 6 2 5 4 2 0
6 2021-06-28 2021-07-01 Thursday 4 3 7 1 2 1 6 7 0 0
7 2021-06-28 2021-07-02 Friday 5 6 4 5 1 9 7 8 0 0
8 2021-06-28 2021-07-03 Saturday 8 5 7 7 11 7 11 6 0 0
DR012 contains only zeros, so the first select removes it. It no longer exists when D1:DR012 is evaluated, which causes the error:
df1 %>%
select(!where(~ is.numeric(.) && all(. == 0))) %>%
names()
[1] "date1" "date2" "Week" "D1" "DR01" "DR02" "DR03" "DR04" "DR05"
[10] "DR06" "DR07" "DR08" "DR011"
Change your code to
df1 %>%
group_by(date1,date2, Week) %>%
select(D1:DR011) %>%
summarise_all(sum)
or
df1 %>%
group_by(date1,date2, Week) %>%
select(starts_with("D")) %>%
summarise_all(sum)
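Another way to avoid hard-coding the name of the last surviving column is the tidyselect helper last_col(). The sketch below is one possible variant, not the method used in the answers above; the `toy` data frame and the `res` name are hypothetical.

```r
library(dplyr)

# Small illustrative stand-in for df1 (hypothetical values)
toy <- data.frame(
  date2 = c("2021-04-02", "2021-04-02", "2021-04-03"),
  D1    = c(2, 3, 4),
  DR01  = c(4, 1, 4),
  DR02  = c(0, 0, 0)   # all zeros, so it is dropped below
)

res <- toy %>%
  select(!where(~ is.numeric(.) && all(. == 0))) %>%  # drop all-zero columns
  select(date2, D1:last_col()) %>%                     # D1 through whatever is last
  group_by(date2) %>%
  summarise(across(everything(), sum), .groups = "drop")

res
```

Because the range ends at last_col(), the code keeps working regardless of which trailing DR columns were removed.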

Delete dates in a database in R

How do I delete the April dates that are in the date2 column? Here is a small example, but I have a much larger database. So, would I be able to do this quickly?
Thanks!
data <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20","2021-06-20",
"2021-06-20","2021-06-20","2021-06-20","2021-06-20"),
date2 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-06-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-05-03","2021-06-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-06-09","2021-05-09","2021-08-10","2021-06-10"),
DR01= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR02 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
We could use the month() function from lubridate and then filter:
library(dplyr)
library(lubridate)
data %>%
filter(month(date2)!=4)
Id date1 date2 DR01 DR02
1 1 2021-06-20 2021-07-01 4 9
2 1 2021-06-20 2021-07-01 5 5
3 1 2021-06-20 2021-07-01 6 4
4 1 2021-06-20 2021-07-01 7 3
5 1 2021-06-20 2021-06-02 7 1
6 1 2021-06-20 2021-05-03 3 3
7 1 2021-06-20 2021-06-03 4 4
8 1 2021-06-20 2021-06-09 2 2
9 1 2021-06-20 2021-05-09 6 6
10 1 2021-06-20 2021-08-10 4 4
11 1 2021-06-20 2021-06-10 3 3
Extract the month part after converting to Date class and use !=
data2 <- subset(data, format(as.Date(date2), '%m') != '04')
-output
data2
Id date1 date2 DR01 DR02
1 1 2021-06-20 2021-07-01 4 9
2 1 2021-06-20 2021-07-01 5 5
3 1 2021-06-20 2021-07-01 6 4
4 1 2021-06-20 2021-07-01 7 3
7 1 2021-06-20 2021-06-02 7 1
12 1 2021-06-20 2021-05-03 3 3
13 1 2021-06-20 2021-06-03 4 4
18 1 2021-06-20 2021-06-09 2 2
19 1 2021-06-20 2021-05-09 6 6
20 1 2021-06-20 2021-08-10 4 4
21 1 2021-06-20 2021-06-10 3 3
Another option without using any dates:
data[!grepl("-04-", data$date2), ]
We interpret date2 as a string and keep every row whose value does not contain "-04-". This returns
Id date1 date2 DR01 DR02
1 1 2021-06-20 2021-07-01 4 9
2 1 2021-06-20 2021-07-01 5 5
3 1 2021-06-20 2021-07-01 6 4
4 1 2021-06-20 2021-07-01 7 3
7 1 2021-06-20 2021-06-02 7 1
12 1 2021-06-20 2021-05-03 3 3
13 1 2021-06-20 2021-06-03 4 4
18 1 2021-06-20 2021-06-09 2 2
19 1 2021-06-20 2021-05-09 6 6
20 1 2021-06-20 2021-08-10 4 4
21 1 2021-06-20 2021-06-10 3 3
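Note that all three approaches above drop April of every year. If only April of one particular year should go (an assumption, since the question does not say so), a sketch along these lines compares the year-month part instead. The `toy` data frame and the choice of 2021 are illustrative.

```r
# Small illustrative stand-in for the question's data (hypothetical values)
toy <- data.frame(
  date2 = c("2021-07-01", "2021-04-02", "2020-04-15", "2021-05-03"),
  DR01  = c(4, 3, 9, 3)
)

# Compare on "YYYY-MM" so that only April 2021 is removed
d    <- as.Date(toy$date2)
keep <- format(d, "%Y-%m") != "2021-04"
res  <- toy[keep, ]
res
```

Here the row dated 2020-04-15 is kept, whereas a plain "-04-" match would have removed it.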

Calculate days from start in R

I'm looking for a way to calculate the number of days a participant (id) spent in a study.
An example data set looks like this:
data <- data.frame(date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
"2020-12-04", "2020-12-05", "2020-12-08",
"2020-11-22", "2020-11-21", "2020-11-24",
"2020-11-25", "2020-11-30", "2020-11-29",
"2021-01-29", "2021-01-20", "2021-01-30",
"2021-02-01", "2021-02-04", "2021-02-04")),
id = rep(1:3, each = 6))
data <- dplyr::arrange(data, id, date)
data
date id
1 2020-11-29 1
2 2020-11-30 1
3 2020-12-02 1
4 2020-12-04 1
5 2020-12-05 1
6 2020-12-08 1
7 2020-11-21 2
8 2020-11-22 2
9 2020-11-24 2
10 2020-11-25 2
11 2020-11-29 2
12 2020-11-30 2
13 2021-01-20 3
14 2021-01-29 3
15 2021-01-30 3
16 2021-02-01 3
17 2021-02-04 3
18 2021-02-04 3
What I'd like to have is a new column days_from_start that takes the first day for every id and sets it to 0, then computes the number of days since that first day for every other row within each id. Something like this:
data$days_from_start <- c(0, 1, 3, 5, 6, 9,
0, 1, 3, 4, 8, 9,
0, 9, 10, 12, 15, 15)
data
date id days_from_start
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
Any ideas?
Thank you
Simply group the data, work out the earliest date for each id, and then calculate the differences.
library(dplyr)
data <- arrange(data, id, date)
data %>%
  group_by(id) %>%
  mutate(
    start_date = min(date),                            # first day per participant
    days_from_start = as.numeric(date - start_date)    # difference in days
  ) %>%
  ungroup() %>%
  select(-start_date)
# A tibble: 18 x 3
date id days_from_start
<date> <int> <dbl>
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
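The same idea can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This is an alternative sketch, not the answer's method; the shortened `data` below uses hypothetical rows taken from the question's first two participants.

```r
# Shortened illustrative data (subset of the question's values)
data <- data.frame(
  date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
                   "2020-11-21", "2020-11-24")),
  id   = c(1, 1, 1, 2, 2)
)

# Work on the numeric day count so ave() returns plain numbers
data$days_from_start <- ave(
  as.numeric(data$date), data$id,
  FUN = function(x) x - min(x)   # days since each id's earliest date
)
data
```

Unlike the grouped mutate, this does not require the data to be sorted or any package to be loaded.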

Sum up ending with the current observation starting based on a criteria

I observe the number of purchases of (in the example below: 4) different customers on (five) different days. Now I want to create a new variable summing up the number of purchases of every single user during the last 20 purchases that have been made in total, across users.
Example data:
> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+ day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+ n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
customer_id day n_purchase
1 1 2016-04-11 5
2 1 2016-04-12 2
3 1 2016-04-13 8
4 1 2016-04-14 0
5 1 2016-04-15 3
6 2 2016-04-11 2
7 2 2016-04-12 0
8 2 2016-04-13 3
9 2 2016-04-14 4
10 2 2016-04-15 0
11 3 2016-04-11 2
12 3 2016-04-12 4
13 3 2016-04-13 5
14 3 2016-04-14 1
15 3 2016-04-15 0
16 4 2016-04-11 2
17 4 2016-04-12 3
18 4 2016-04-13 5
19 4 2016-04-14 0
20 4 2016-04-15 3
I need to know three things to construct my variable:
(1) What's the overall number of purchases on a day across users (day purchases)?
(2) What's the cumulative number of purchases across users starting from the first day (cumsum_day_purchases)?
(3) Counting back from the current observation, on which day did the 20 immediately preceding purchases (across users) start? This is where I have trouble coding such a variable.
> library(dplyr)
> da %>%
+ group_by(day) %>%
+ mutate(day_purchases = sum(n_purchase)) %>%
+ group_by(customer_id) %>%
+ mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups: customer_id [4]
customer_id day n_purchase day_purchases cumsum_day_purchases
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11
2 1 2016-04-12 2 9 20
3 1 2016-04-13 8 21 41
4 1 2016-04-14 0 5 46
5 1 2016-04-15 3 6 52
6 2 2016-04-11 2 11 11
7 2 2016-04-12 0 9 20
8 2 2016-04-13 3 21 41
9 2 2016-04-14 4 5 46
10 2 2016-04-15 0 6 52
11 3 2016-04-11 2 11 11
12 3 2016-04-12 4 9 20
13 3 2016-04-13 5 21 41
14 3 2016-04-14 1 5 46
15 3 2016-04-15 0 6 52
16 4 2016-04-11 2 11 11
17 4 2016-04-12 3 9 20
18 4 2016-04-13 5 21 41
19 4 2016-04-14 0 5 46
20 4 2016-04-15 3 6 52
Below, I compute the variable I want by hand.
For all observations on day 2016-04-12, I compute the cumulative sum
of purchases of a specific customer by adding the number of purchases
of the current day and the preceding day, because all
customers together made 20 purchases in total on the current day and the
preceding day.
For day 2016-04-13, I only use the number of purchases of a user on
this day, because there were 21 (41-20) new purchases on the day itself.
This results in the following output:
> da = da %>% ungroup() %>%
+ mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
customer_id day n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11 5
2 1 2016-04-12 2 9 20 7
3 1 2016-04-13 8 21 41 8
4 1 2016-04-14 0 5 46 0
5 1 2016-04-15 3 6 52 3
6 2 2016-04-11 2 11 11 2
7 2 2016-04-12 0 9 20 2
8 2 2016-04-13 3 21 41 3
9 2 2016-04-14 4 5 46 4
10 2 2016-04-15 0 6 52 4
11 3 2016-04-11 2 11 11 2
12 3 2016-04-12 4 9 20 6
13 3 2016-04-13 5 21 41 5
14 3 2016-04-14 1 5 46 1
15 3 2016-04-15 0 6 52 1
16 4 2016-04-11 2 11 11 2
17 4 2016-04-12 3 9 20 5
18 4 2016-04-13 5 21 41 5
19 4 2016-04-14 0 5 46 0
20 4 2016-04-15 3 6 52 3
We can create a new grouping variable from the consecutive runs of days on which day_purchases is at or above 20 (or below it), and then use cumsum within each group:
library(dplyr)
da %>%
  group_by(day) %>%
  mutate(day_purchases = sum(n_purchase)) %>%
  group_by(customer_id) %>%
  # run id: increases each time day_purchases >= 20 switches between TRUE and FALSE
  mutate(above = with(rle(day_purchases >= 20), rep(seq_along(lengths), lengths))) %>%
  group_by(above, .add = TRUE) %>%
  mutate(cumsum_last_20_purchases = cumsum(n_purchase))
#> # A tibble: 20 x 6
#> # Groups: customer_id, above [12]
#> customer_id day n_purchase day_purchases above cumsum_last_20_purchas…
#> <dbl> <fct> <dbl> <dbl> <int> <dbl>
#> 1 1 2016-04-11 5 11 1 5
#> 2 1 2016-04-12 2 9 1 7
#> 3 1 2016-04-13 8 21 2 8
#> 4 1 2016-04-14 0 5 3 0
#> 5 1 2016-04-15 3 6 3 3
#> 6 2 2016-04-11 2 11 1 2
#> 7 2 2016-04-12 0 9 1 2
#> 8 2 2016-04-13 3 21 2 3
#> 9 2 2016-04-14 4 5 3 4
#> 10 2 2016-04-15 0 6 3 4
#> 11 3 2016-04-11 2 11 1 2
#> 12 3 2016-04-12 4 9 1 6
#> 13 3 2016-04-13 5 21 2 5
#> 14 3 2016-04-14 1 5 3 1
#> 15 3 2016-04-15 0 6 3 1
#> 16 4 2016-04-11 2 11 1 2
#> 17 4 2016-04-12 3 9 1 5
#> 18 4 2016-04-13 5 21 2 5
#> 19 4 2016-04-14 0 5 3 0
#> 20 4 2016-04-15 3 6 3 3
Created on 2020-07-28 by the reprex package (v0.3.0)
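To make the run-length step easier to follow, here is a small base R sketch for a single customer, using the day_purchases and n_purchase values of customer 1 from the output above. The variable names `runs`, `run_id` and `cs` are illustrative.

```r
# Values for customer 1, taken from the table above
day_purchases <- c(11, 9, 21, 5, 6)
n_purchase    <- c(5, 2, 8, 0, 3)

# rle() collapses consecutive equal values into runs:
# FALSE FALSE TRUE FALSE FALSE -> run lengths 2, 1, 2
runs   <- rle(day_purchases >= 20)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Cumulative sum that restarts whenever the run id changes
cs <- ave(n_purchase, run_id, FUN = cumsum)

data.frame(day_purchases, n_purchase, run_id, cs)
```

The resulting `cs` column reproduces the cumsum_last_20_purchases values shown for customer 1.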

Resources