I want to number weeks within months (a separate week count for each month) using the lubridate R package. My minimal working example is below:
library(tidyverse)
library(lubridate)
dt1 <-
tibble(
Date = seq(from = ymd("2021-01-01"), to = ymd("2021-12-31"), by = '1 day')
, Month = month(Date)
)
dt2 <-
dt1 %>%
group_by(Month) %>%
mutate(Week = week(Date))
dt2 %>%
print(n = 40)
# A tibble: 365 x 3
# Groups: Month [12]
Date Month Week
<date> <dbl> <dbl>
1 2021-01-01 1 1
2 2021-01-02 1 1
3 2021-01-03 1 1
4 2021-01-04 1 1
5 2021-01-05 1 1
6 2021-01-06 1 1
7 2021-01-07 1 1
8 2021-01-08 1 2
9 2021-01-09 1 2
10 2021-01-10 1 2
11 2021-01-11 1 2
12 2021-01-12 1 2
13 2021-01-13 1 2
14 2021-01-14 1 2
15 2021-01-15 1 3
16 2021-01-16 1 3
17 2021-01-17 1 3
18 2021-01-18 1 3
19 2021-01-19 1 3
20 2021-01-20 1 3
21 2021-01-21 1 3
22 2021-01-22 1 4
23 2021-01-23 1 4
24 2021-01-24 1 4
25 2021-01-25 1 4
26 2021-01-26 1 4
27 2021-01-27 1 4
28 2021-01-28 1 4
29 2021-01-29 1 5
30 2021-01-30 1 5
31 2021-01-31 1 5
32 2021-02-01 2 5
33 2021-02-02 2 5
34 2021-02-03 2 5
35 2021-02-04 2 5
36 2021-02-05 2 6
37 2021-02-06 2 6
38 2021-02-07 2 6
39 2021-02-08 2 6
40 2021-02-09 2 6
# ... with 325 more rows
I'm wondering what I am missing here. For row number 31 in the output (31 2021-01-31 1 5), the value in the Week column should be 1. Any lead on how to get the desired output?
It's not completely clear how you are defining a week. If Week 1 starts on the first day of a month, then you can do:
dt2 <- dt1 %>% mutate(Week = 1L + ((day(Date) - 1L) %/% 7L))
dt2 %>% slice(21:40) %>% print(n = 20L)
# A tibble: 20 × 3
Date Month Week
<date> <dbl> <int>
1 2021-01-21 1 3
2 2021-01-22 1 4
3 2021-01-23 1 4
4 2021-01-24 1 4
5 2021-01-25 1 4
6 2021-01-26 1 4
7 2021-01-27 1 4
8 2021-01-28 1 4
9 2021-01-29 1 5
10 2021-01-30 1 5
11 2021-01-31 1 5
12 2021-02-01 2 1
13 2021-02-02 2 1
14 2021-02-03 2 1
15 2021-02-04 2 1
16 2021-02-05 2 1
17 2021-02-06 2 1
18 2021-02-07 2 1
19 2021-02-08 2 2
20 2021-02-09 2 2
With base R, you could simply do:
dt1$Week <- 1L + ((as.POSIXlt(dt1$Date)$mday - 1L) %/% 7L)
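If a week should instead follow the calendar (for example, weeks starting on Monday, so that the first week of a month can be shorter than seven days), a base R sketch along these lines might work. The function name week_of_month and its week_start argument are made up for this illustration and are not part of lubridate:

```r
# Hypothetical helper: calendar-aligned week number within a month.
# week_start: 1 = Monday ... 7 = Sunday (a convention assumed for this sketch).
week_of_month <- function(date, week_start = 1L) {
  lt <- as.POSIXlt(date)
  # First day of each date's month
  first <- as.POSIXlt(as.Date(format(date, "%Y-%m-01")))
  # Days between the start of the calendar week and the 1st of the month
  # (POSIXlt wday is 0 = Sunday ... 6 = Saturday)
  offset <- (first$wday - week_start) %% 7L
  1L + (lt$mday - 1L + offset) %/% 7L
}

dates <- as.Date(c("2021-01-01", "2021-01-03", "2021-01-04",
                   "2021-01-31", "2021-02-01", "2021-02-08"))
week_of_month(dates)
# 1 1 2 5 1 2
```

With this definition 2021-01-01 (a Friday) to 2021-01-03 form week 1 of January, and 2021-02-01 (a Monday) starts week 1 of February.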
I would like to create a new data frame from the df data frame entered below, with only one row per day. For example, instead of 4 rows for 2021-07-01 there would be a single row, in which the values of the numeric columns for that day are summed.
df <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-04-03","2021-04-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-04-07","2021-04-09","2021-04-10","2021-04-10"),
Week= c("Thursday","Thursday","Thursday","Thursday","Friday","Friday","Friday","Friday",
"Friday","Friday","Saturday","Saturday","Saturday","Saturday","Saturday","Thursday",
"Thursday","Friday","Friday","Saturday","Saturday"),
DTPE = c("Ho","Ho","Ho","Ho","","","","","","","","","","","","","","","","Ho","Ho"),
D1 = c(8,1,9, 3,5,4,7,6,3,8,2,3,4,6,7,8,4,2,6,2,3), DR01 = c(4,1,4,3,3,4,3,6,3,7,2,3,4,6,7,8,4,2,6,7,3),
DR02 = c(8,1,4,3,3,4,1,6,3,7,2,3,4,6,7,8,4,2,6,2,3), DR03 = c(7,5,4,3,3,4,1,5,3,3,2,3,4,6,7,8,4,2,6,4,3),
DR04= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR05 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
> df
Id date1 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-07-01 Thursday Ho 3 3 3 3 7 3
5 1 2021-04-02 Friday 5 3 3 3 3 3
6 1 2021-04-02 Friday 4 4 4 4 2 2
7 1 2021-04-02 Friday 7 3 1 1 7 1
8 1 2021-04-02 Friday 6 6 6 5 4 5
9 1 2021-04-02 Friday 3 3 3 3 2 3
10 1 2021-04-02 Friday 8 7 7 3 1 7
11 1 2021-04-03 Saturday 2 2 2 2 2 2
12 1 2021-04-03 Saturday 3 3 3 3 3 3
13 1 2021-04-03 Saturday 4 4 4 4 4 4
14 1 2021-04-03 Saturday 6 6 6 6 6 7
15 1 2021-04-03 Saturday 7 7 7 7 7 7
16 1 2021-04-08 Thursday 8 8 8 8 8 8
17 1 2021-04-08 Thursday 4 4 4 4 4 4
18 1 2021-04-07 Friday 2 2 2 2 2 2
19 1 2021-04-09 Friday 6 6 6 6 6 6
20 1 2021-04-10 Saturday Ho 2 7 2 4 4 4
21 1 2021-04-10 Saturday Ho 3 3 3 3 3 3
We can group by 'Id', 'date1' and 'Week', then use across() inside summarise() to sum the remaining numeric columns:
library(dplyr)
df %>% group_by(Id, date1, Week) %>%
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
You can perform this using the following code:
library(dplyr)
df %>%
group_by(Id, date1, Week) %>%
select(D1:DR05) %>%
summarise_all(sum)
# A tibble: 7 × 9
# Groups: Id, date1 [7]
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-03 Saturday 22 22 22 22 22 23
3 1 2021-04-07 Friday 2 2 2 2 2 2
4 1 2021-04-08 Thursday 12 12 12 12 12 12
5 1 2021-04-09 Friday 6 6 6 6 6 6
6 1 2021-04-10 Saturday 5 10 5 7 7 7
7 1 2021-07-01 Thursday 21 12 16 19 22 21
You may also want to convert the date1 column to a Date object, for example with lubridate's ymd() inside a mutate() call.
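A minimal sketch of that conversion, assuming dplyr and lubridate are installed. The toy data frame below is made up for illustration and only mimics the shape of df:

```r
library(dplyr)
library(lubridate)

# Toy data (made up for this sketch) with dates stored as character
toy <- data.frame(
  date1 = c("2021-07-01", "2021-04-02", "2021-04-02"),
  D1    = c(8, 5, 4)
)

toy_sum <- toy %>%
  mutate(date1 = ymd(date1)) %>%            # character -> Date
  group_by(date1) %>%
  summarise(D1 = sum(D1), .groups = "drop")  # one row per day

toy_sum
```

Once date1 is a Date, the summarised rows sort chronologically rather than alphabetically, which matters for formats other than year-month-day.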
Base R with aggregate:
aggregate(cbind(D1, DR01, DR02, DR03, DR04, DR05) ~ Id+date1+Week, df, sum)
Output:
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-07 Friday 2 2 2 2 2 2
3 1 2021-04-09 Friday 6 6 6 6 6 6
4 1 2021-04-03 Saturday 22 22 22 22 22 23
5 1 2021-04-10 Saturday 5 10 5 7 7 7
6 1 2021-04-08 Thursday 12 12 12 12 12 12
7 1 2021-07-01 Thursday 21 12 16 19 22 21
I have data that looks like:
initial<-
tibble(
start_date=
rep(seq.Date(as.Date("2021-06-01"),as.Date("2021-10-01"),by="months"),4),
end_date=
rep(seq.Date(as.Date("2021-07-01"),as.Date("2021-11-01"),by="months"),4),
id=
rep(c(rep(1,5),rep(2,5)),2),
group=
c(rep("a",10),rep("b",10)),
increase=
c(
c(4:8),
c(5:9),
c(6:10),
c(7:11)
),
decrease=
c(
c(1:5),
c(2:6),
c(3:7),
c(4:8)
),
start_count=
c(c(10,13,NA,NA,NA),c(15,18,NA,NA,NA),c(20,23,NA,NA,NA),c(25,28,NA,NA,NA)),
end_count=
c(c(13,16,NA,NA,NA),c(18,21,NA,NA,NA),c(23,26,NA,NA,NA),c(28,31,NA,NA,NA))
)
print(initial)
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 NA NA
4 2021-09-01 2021-10-01 1 a 7 4 NA NA
5 2021-10-01 2021-11-01 1 a 8 5 NA NA
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 NA NA
9 2021-09-01 2021-10-01 2 a 8 5 NA NA
10 2021-10-01 2021-11-01 2 a 9 6 NA NA
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 NA NA
14 2021-09-01 2021-10-01 1 b 9 6 NA NA
15 2021-10-01 2021-11-01 1 b 10 7 NA NA
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 NA NA
19 2021-09-01 2021-10-01 2 b 10 7 NA NA
20 2021-10-01 2021-11-01 2 b 11 8 NA NA
My goal is to fill in each missing start_count with the previous month's end_count, and then compute the corresponding end_count as start_count + increase - decrease, working sequentially within each unique combination of id and group.
The final result should look like:
final<-
tibble(
start_date=
rep(seq.Date(as.Date("2021-06-01"),as.Date("2021-10-01"),by="months"),4),
end_date=
rep(seq.Date(as.Date("2021-07-01"),as.Date("2021-11-01"),by="months"),4),
id=
rep(c(rep(1,5),rep(2,5)),2),
group=
c(rep("a",10),rep("b",10)),
increase=
c(
c(4:8),
c(5:9),
c(6:10),
c(7:11)
),
decrease=
c(
c(1:5),
c(2:6),
c(3:7),
c(4:8)
),
start_count=
c(c(10,13,16,19,22),c(15,18,21,24,27),c(20,23,26,29,32),c(25,28,31,34,37)),
end_count=
c(c(13,16,19,22,25),c(18,21,24,27,30),c(23,26,29,32,35),c(28,31,34,37,40))
)
print(final)
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 16 19
4 2021-09-01 2021-10-01 1 a 7 4 19 22
5 2021-10-01 2021-11-01 1 a 8 5 22 25
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 21 24
9 2021-09-01 2021-10-01 2 a 8 5 24 27
10 2021-10-01 2021-11-01 2 a 9 6 27 30
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 26 29
14 2021-09-01 2021-10-01 1 b 9 6 29 32
15 2021-10-01 2021-11-01 1 b 10 7 32 35
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 31 34
19 2021-09-01 2021-10-01 2 b 10 7 34 37
20 2021-10-01 2021-11-01 2 b 11 8 37 40
Thanks!
Here is a solution. There might be better ones.
library(tidyverse)
initial<-
tibble(
start_date=
rep(seq.Date(as.Date("2021-06-01"),as.Date("2021-10-01"),by="months"),4),
end_date=
rep(seq.Date(as.Date("2021-07-01"),as.Date("2021-11-01"),by="months"),4),
id=
rep(c(rep(1,5),rep(2,5)),2),
group=
c(rep("a",10),rep("b",10)),
increase=
c(
c(4:8),
c(5:9),
c(6:10),
c(7:11)
),
decrease=
c(
c(1:5),
c(2:6),
c(3:7),
c(4:8)
),
start_count=
c(c(10,13,NA,NA,NA),c(15,18,NA,NA,NA),c(20,23,NA,NA,NA),c(25,28,NA,NA,NA)),
end_count=
c(c(13,16,NA,NA,NA),c(18,21,NA,NA,NA),c(23,26,NA,NA,NA),c(28,31,NA,NA,NA))
)
initial %>%
rowid_to_column(var = "obs_id") %>%
mutate(net = increase - decrease) %>%
select(obs_id, id, group, everything()) %>%
arrange(id, group, start_date) %>%
group_by(id, group) %>%
mutate(end_count = cumsum(net) + first(start_count)) %>%
mutate(start_count = end_count - net) %>%
ungroup() %>%
arrange(obs_id) %>%
select(-net)
#> # A tibble: 20 × 9
#> obs_id id group start_date end_date increase decrease start_count
#> <int> <dbl> <chr> <date> <date> <int> <int> <dbl>
#> 1 1 1 a 2021-06-01 2021-07-01 4 1 10
#> 2 2 1 a 2021-07-01 2021-08-01 5 2 13
#> 3 3 1 a 2021-08-01 2021-09-01 6 3 16
#> 4 4 1 a 2021-09-01 2021-10-01 7 4 19
#> 5 5 1 a 2021-10-01 2021-11-01 8 5 22
#> 6 6 2 a 2021-06-01 2021-07-01 5 2 15
#> 7 7 2 a 2021-07-01 2021-08-01 6 3 18
#> 8 8 2 a 2021-08-01 2021-09-01 7 4 21
#> 9 9 2 a 2021-09-01 2021-10-01 8 5 24
#> 10 10 2 a 2021-10-01 2021-11-01 9 6 27
#> 11 11 1 b 2021-06-01 2021-07-01 6 3 20
#> 12 12 1 b 2021-07-01 2021-08-01 7 4 23
#> 13 13 1 b 2021-08-01 2021-09-01 8 5 26
#> 14 14 1 b 2021-09-01 2021-10-01 9 6 29
#> 15 15 1 b 2021-10-01 2021-11-01 10 7 32
#> 16 16 2 b 2021-06-01 2021-07-01 7 4 25
#> 17 17 2 b 2021-07-01 2021-08-01 8 5 28
#> 18 18 2 b 2021-08-01 2021-09-01 9 6 31
#> 19 19 2 b 2021-09-01 2021-10-01 10 7 34
#> 20 20 2 b 2021-10-01 2021-11-01 11 8 37
#> # … with 1 more variable: end_count <dbl>
Created on 2021-08-11 by the reprex package (v2.0.0)
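The cumsum step works because the month-by-month recurrence unrolls into a running total of increase - decrease added to the first start_count. A small base R check of that equivalence, using the values of id 1 in group "a" from the question (the variable names are chosen for this sketch only):

```r
increase <- 4:8
decrease <- 1:5
start0   <- 10                 # first known start_count of the group
net      <- increase - decrease

# Explicit recurrence: each start_count is the previous end_count
n <- length(net)
start_count <- numeric(n)
end_count   <- numeric(n)
for (i in seq_len(n)) {
  start_count[i] <- if (i == 1) start0 else end_count[i - 1]
  end_count[i]   <- start_count[i] + net[i]
}

# Closed form used in the dplyr pipeline
end_closed   <- start0 + cumsum(net)
start_closed <- end_closed - net

end_count
# 13 16 19 22 25
```

Both versions give the same start_count and end_count, so the loop is not needed as long as every group has a known first start_count.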
Here is another approach using accumulate2 from purrr:
library(tidyverse)
initial %>%
group_by(id, group) %>%
mutate(end_count = accumulate2(increase[-1], decrease[-1], ~ ..1 + ..2 - ..3, .init = first(end_count)) %>%
flatten_dbl,
# lag() has to run inside the groups so that each group keeps its own first start_count
start_count = lag(end_count, default = first(start_count))) %>%
ungroup()
Output
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 16 19
4 2021-09-01 2021-10-01 1 a 7 4 19 22
5 2021-10-01 2021-11-01 1 a 8 5 22 25
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 21 24
9 2021-09-01 2021-10-01 2 a 8 5 24 27
10 2021-10-01 2021-11-01 2 a 9 6 27 30
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 26 29
14 2021-09-01 2021-10-01 1 b 9 6 29 32
15 2021-10-01 2021-11-01 1 b 10 7 32 35
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 31 34
19 2021-09-01 2021-10-01 2 b 10 7 34 37
20 2021-10-01 2021-11-01 2 b 11 8 37 40
With tidyverse, try:
library(tidyverse)
initial %>%
arrange(group, id, start_date) %>%
group_by(group, id) %>%
mutate(delta = if_else(is.na(start_count),
true = increase - decrease,
false = 0L),
end_count = if_else(is.na(end_count),
true = cumsum(delta) + last(na.omit(end_count)),
false = end_count ),
start_count = if_else(is.na(start_count),
true = lag(end_count),
false = start_count)) %>%
select(-delta) %>%
ungroup()
# A tibble: 20 x 8
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 16 19
4 2021-09-01 2021-10-01 1 a 7 4 19 22
5 2021-10-01 2021-11-01 1 a 8 5 22 25
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 21 24
9 2021-09-01 2021-10-01 2 a 8 5 24 27
10 2021-10-01 2021-11-01 2 a 9 6 27 30
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 26 29
14 2021-09-01 2021-10-01 1 b 9 6 29 32
15 2021-10-01 2021-11-01 1 b 10 7 32 35
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 31 34
19 2021-09-01 2021-10-01 2 b 10 7 34 37
20 2021-10-01 2021-11-01 2 b 11 8 37 40
I have a dataset with 110 participants who answered the same questionnaire in multiple sessions within three timeframes. The number of sessions per timeframe differs within and between participants.
I need a new variable that assigns consecutive numbers to the sessions a participant X completed the questionnaire within the timeframe Y from 1 to (number of sessions participant X completed the questionnaire within timeframe Y).
Example: I have
participant timeframe date
1 1 2021-04-30 09:12:00
1 1 2021-04-30 10:03:00
1 1 2021-05-02 09:20:00
2 1 2021-04-30 13:00:00
2 1 2021-05-02 12:13:00
1 2 2021-05-05 08:34:00
1 2 2021-05-06 14:15:00
2 2 2021-05-05 07:12:00
2 2 2021-05-05 14:13:00
2 2 2021-05-08 15:22:00
I need:
participant timeframe date session per timeframe
1 1 2021-04-30 09:12:00 1
1 1 2021-04-30 10:03:00 2
1 1 2021-05-02 09:20:00 3
2 1 2021-04-30 13:00:00 1
2 1 2021-05-02 12:13:00 2
1 2 2021-05-05 08:34:00 1
1 2 2021-05-06 14:15:00 2
2 2 2021-05-05 07:12:00 1
2 2 2021-05-05 14:13:00 2
2 2 2021-05-08 15:22:00 3
Hope that somebody can help! Thank you so much in advance.
Here is a tidyverse approach using row_number():
library(dplyr)
library(tibble)
dat <- tribble(~participant, ~timeframe, ~date,
1, 1, "2021-04-30 09:12:00",
1, 1, "2021-04-30 10:03:00",
1, 1, "2021-05-02 09:20:00",
2, 1, "2021-04-30 13:00:00",
2, 1, "2021-05-02 12:13:00",
1, 2, "2021-05-05 08:34:00",
1, 2, "2021-05-06 14:15:00",
2, 2, "2021-05-05 07:12:00",
2, 2, "2021-05-05 14:13:00",
2, 2, "2021-05-08 15:22:00") %>%
mutate(date = as.POSIXct(date))
dat %>%
group_by(participant, timeframe) %>%
mutate(session = row_number())
#> # A tibble: 10 x 4
#> # Groups: participant, timeframe [4]
#> participant timeframe date session
#> <dbl> <dbl> <dttm> <int>
#> 1 1 1 2021-04-30 09:12:00 1
#> 2 1 1 2021-04-30 10:03:00 2
#> 3 1 1 2021-05-02 09:20:00 3
#> 4 2 1 2021-04-30 13:00:00 1
#> 5 2 1 2021-05-02 12:13:00 2
#> 6 1 2 2021-05-05 08:34:00 1
#> 7 1 2 2021-05-06 14:15:00 2
#> 8 2 2 2021-05-05 07:12:00 1
#> 9 2 2 2021-05-05 14:13:00 2
#> 10 2 2 2021-05-08 15:22:00 3
Created on 2021-04-30 by the reprex package (v0.3.0)
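For comparison, a base R sketch of the same idea uses ave() to apply a function within each participant/timeframe group. Here rank() on the timestamps is used instead of the row order, so it does not assume the rows are already sorted by date. The data frame copies the example from the question, and the time zone "UTC" is an assumption made for this sketch:

```r
dat <- data.frame(
  participant = c(1, 1, 1, 2, 2, 1, 1, 2, 2, 2),
  timeframe   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  date = as.POSIXct(c("2021-04-30 09:12:00", "2021-04-30 10:03:00",
                      "2021-05-02 09:20:00", "2021-04-30 13:00:00",
                      "2021-05-02 12:13:00", "2021-05-05 08:34:00",
                      "2021-05-06 14:15:00", "2021-05-05 07:12:00",
                      "2021-05-05 14:13:00", "2021-05-08 15:22:00"),
                    tz = "UTC")
)

# Rank the timestamps within each participant/timeframe combination;
# ties.method = "first" keeps the numbering consecutive if two sessions
# share a timestamp.
dat$session <- ave(
  as.numeric(dat$date), dat$participant, dat$timeframe,
  FUN = function(d) rank(d, ties.method = "first")
)

dat$session
# 1 2 3 1 2 1 2 1 2 3
```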
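Alternatively, use rleid() from data.table (with dplyr loaded for the pipe and verbs). Note that rleid() numbers runs of identical consecutive values, so it matches row_number() only when no two consecutive sessions in a group share the same timestamp: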
library(data.table)
df %>%
group_by(participant, timeframe) %>%
mutate(session_per_timeframe = rleid(date))
# A tibble: 10 x 4
# Groups: participant, timeframe [4]
participant timeframe date session_per_timeframe
<dbl> <dbl> <dttm> <int>
1 1 1 2021-04-30 09:12:00 1
2 1 1 2021-04-30 10:03:00 2
3 1 1 2021-05-02 09:20:00 3
4 2 1 2021-04-30 13:00:00 1
5 2 1 2021-05-02 12:13:00 2
6 1 2 2021-05-05 08:34:00 1
7 1 2 2021-05-06 14:15:00 2
8 2 2 2021-05-05 07:12:00 1
9 2 2 2021-05-05 14:13:00 2
10 2 2 2021-05-08 15:22:00 3
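Another option is to group by runs of consecutive identical participant values. This assumes the rows are ordered so that the participant changes whenever a new participant/timeframe block starts, as in the example: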
data %>% group_by(grp = data.table::rleid(participant)) %>%
mutate(session = row_number())
# A tibble: 10 x 5
# Groups: grp [4]
participant timeframe date grp session
<int> <int> <chr> <int> <int>
1 1 1 2021-04-30 09:12:00 1 1
2 1 1 2021-04-30 10:03:00 1 2
3 1 1 2021-05-02 09:20:00 1 3
4 2 1 2021-04-30 13:00:00 2 1
5 2 1 2021-05-02 12:13:00 2 2
6 1 2 2021-05-05 08:34:00 3 1
7 1 2 2021-05-06 14:15:00 3 2
8 2 2 2021-05-05 07:12:00 4 1
9 2 2 2021-05-05 14:13:00 4 2
10 2 2 2021-05-08 15:22:00 4 3
I observe the number of purchases of (in the example below: 4) different customers on (five) different days. Now I want to create a new variable summing up the number of purchases of every single user during the last 20 purchases that have been made in total, across users.
Example data:
> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+ day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+ n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
customer_id day n_purchase
1 1 2016-04-11 5
2 1 2016-04-12 2
3 1 2016-04-13 8
4 1 2016-04-14 0
5 1 2016-04-15 3
6 2 2016-04-11 2
7 2 2016-04-12 0
8 2 2016-04-13 3
9 2 2016-04-14 4
10 2 2016-04-15 0
11 3 2016-04-11 2
12 3 2016-04-12 4
13 3 2016-04-13 5
14 3 2016-04-14 1
15 3 2016-04-15 0
16 4 2016-04-11 2
17 4 2016-04-12 3
18 4 2016-04-13 5
19 4 2016-04-14 0
20 4 2016-04-15 3
I need to know three things to construct my variable:
(1) What's the overall number of purchases on a day across users (day purchases)?
(2) What's the cumulative number of purchases across users starting from the first day (cumsum_day_purchases)?
(3) Counting back from the current observation, on which day did the 20 immediately preceding purchases (across users) start? This is where I have trouble coding such a variable.
> library(dplyr)
> da %>%
+ group_by(day) %>%
+ mutate(day_purchases = sum(n_purchase)) %>%
+ group_by(customer_id) %>%
+ mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups: customer_id [4]
customer_id day n_purchase day_purchases cumsum_day_purchases
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11
2 1 2016-04-12 2 9 20
3 1 2016-04-13 8 21 41
4 1 2016-04-14 0 5 46
5 1 2016-04-15 3 6 52
6 2 2016-04-11 2 11 11
7 2 2016-04-12 0 9 20
8 2 2016-04-13 3 21 41
9 2 2016-04-14 4 5 46
10 2 2016-04-15 0 6 52
11 3 2016-04-11 2 11 11
12 3 2016-04-12 4 9 20
13 3 2016-04-13 5 21 41
14 3 2016-04-14 1 5 46
15 3 2016-04-15 0 6 52
16 4 2016-04-11 2 11 11
17 4 2016-04-12 3 9 20
18 4 2016-04-13 5 21 41
19 4 2016-04-14 0 5 46
20 4 2016-04-15 3 6 52
I will now compute the variable I want by hand for the following dataset.
For all observations on day 2016-04-12, I compute the cumulative sum
of purchases of a specific customer by adding the number of purchases
of the current day and the preceding day, because in total all
customers together made 20 purchases on the current day and the
preceding day.
For day 2016-04-13, I only use the number of purchases of a user on
this day, because there have been 21 (41-20) new purchases on the day itself.
Resulting in the following output:
> da = da %>% ungroup() %>%
+ mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
customer_id day n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11 5
2 1 2016-04-12 2 9 20 7
3 1 2016-04-13 8 21 41 8
4 1 2016-04-14 0 5 46 0
5 1 2016-04-15 3 6 52 3
6 2 2016-04-11 2 11 11 2
7 2 2016-04-12 0 9 20 2
8 2 2016-04-13 3 21 41 3
9 2 2016-04-14 4 5 46 4
10 2 2016-04-15 0 6 52 4
11 3 2016-04-11 2 11 11 2
12 3 2016-04-12 4 9 20 6
13 3 2016-04-13 5 21 41 5
14 3 2016-04-14 1 5 46 1
15 3 2016-04-15 0 6 52 1
16 4 2016-04-11 2 11 11 2
17 4 2016-04-12 3 9 20 5
18 4 2016-04-13 5 21 41 5
19 4 2016-04-14 0 5 46 0
20 4 2016-04-15 3 6 52 3
We can create a new grouping variable that changes each time day_purchases crosses the threshold of 20 (using rle on day_purchases >= 20), and then take cumsum within each group:
library(dplyr)
da %>%
group_by(day) %>%
mutate(day_purchases = sum(n_purchase)) %>%
group_by(customer_id) %>%
mutate(above = with(rle(day_purchases >= 20), rep(1:length(lengths), lengths))) %>%
group_by(above, .add =TRUE) %>%
mutate(cumsum_last_20_purchases = cumsum(n_purchase))
#> # A tibble: 20 x 6
#> # Groups: customer_id, above [12]
#> customer_id day n_purchase day_purchases above cumsum_last_20_purchas…
#> <dbl> <fct> <dbl> <dbl> <int> <dbl>
#> 1 1 2016-04-11 5 11 1 5
#> 2 1 2016-04-12 2 9 1 7
#> 3 1 2016-04-13 8 21 2 8
#> 4 1 2016-04-14 0 5 3 0
#> 5 1 2016-04-15 3 6 3 3
#> 6 2 2016-04-11 2 11 1 2
#> 7 2 2016-04-12 0 9 1 2
#> 8 2 2016-04-13 3 21 2 3
#> 9 2 2016-04-14 4 5 3 4
#> 10 2 2016-04-15 0 6 3 4
#> 11 3 2016-04-11 2 11 1 2
#> 12 3 2016-04-12 4 9 1 6
#> 13 3 2016-04-13 5 21 2 5
#> 14 3 2016-04-14 1 5 3 1
#> 15 3 2016-04-15 0 6 3 1
#> 16 4 2016-04-11 2 11 1 2
#> 17 4 2016-04-12 3 9 1 5
#> 18 4 2016-04-13 5 21 2 5
#> 19 4 2016-04-14 0 5 3 0
#> 20 4 2016-04-15 3 6 3 3
Created on 2020-07-28 by the reprex package (v0.3.0)
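To see what the rle step does, here is a base R sketch for a single customer, using the values of customer 1 from the example. The variable names runs, above and cumsum_last_20 are chosen for this illustration:

```r
day_purchases <- c(11, 9, 21, 5, 6)   # total purchases per day, all customers
n_purchase    <- c(5, 2, 8, 0, 3)     # purchases of customer 1 per day

# Run-length encode the threshold test, then give each run an id
runs  <- rle(day_purchases >= 20)
above <- rep(seq_along(runs$lengths), runs$lengths)
above
# 1 1 2 3 3

# Cumulative purchases of the customer, restarted at every new run
cumsum_last_20 <- ave(n_purchase, above, FUN = cumsum)
cumsum_last_20
# 5 7 8 0 3
```

This reproduces the hand-computed column for the example. It restarts the sum whenever a day crosses the threshold, so it is not a general rolling window over the last 20 purchases and may need adapting for other data.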