Calculate days from start in R

I'm looking for a way to calculate the number of days a participant (id) has spent in a study.
An example data set looks like this:
data <- data.frame(date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
"2020-12-04", "2020-12-05", "2020-12-08",
"2020-11-22", "2020-11-21", "2020-11-24",
"2020-11-25", "2020-11-30", "2020-11-29",
"2021-01-29", "2021-01-20", "2021-01-30",
"2021-02-01", "2021-02-04", "2021-02-04")),
id = rep(1:3, each = 6))
data <- dplyr::arrange(data, id, date)
data
date id
1 2020-11-29 1
2 2020-11-30 1
3 2020-12-02 1
4 2020-12-04 1
5 2020-12-05 1
6 2020-12-08 1
7 2020-11-21 2
8 2020-11-22 2
9 2020-11-24 2
10 2020-11-25 2
11 2020-11-29 2
12 2020-11-30 2
13 2021-01-20 3
14 2021-01-29 3
15 2021-01-30 3
16 2021-02-01 3
17 2021-02-04 3
18 2021-02-04 3
What I'd like to have is a new column, days_from_start, that takes the first day for every id and sets it to 0, then computes the number of days elapsed since that first day for every other row within each id. Something like this:
data$days_from_start <- c(0, 1, 3, 5, 6, 9,
0, 1, 3, 4, 8, 9,
0, 9, 10, 12, 15, 15)
data
date id days_from_start
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
Any ideas?
Thank you

Group the data by id, find the earliest date for each id, and then calculate each row's difference from it.
library(dplyr)
data <- dplyr::arrange(data, id, date)
data %>%
group_by(id) %>%
mutate(
start_date=min(date),
days_from_start=as.numeric(date-start_date)
) %>%
ungroup() %>%
select(-start_date)
# A tibble: 18 x 3
date id days_from_start
<date> <int> <dbl>
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
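For reference, the same per-id calculation can be sketched in base R without dplyr. This is only an illustration on the question's data, assuming the dates are stored as Date; it applies ave() to the numeric form of the dates and subtracts each id's minimum.

```r
# Rebuild the example data from the question
data <- data.frame(
  date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
                   "2020-12-04", "2020-12-05", "2020-12-08",
                   "2020-11-22", "2020-11-21", "2020-11-24",
                   "2020-11-25", "2020-11-30", "2020-11-29",
                   "2021-01-29", "2021-01-20", "2021-01-30",
                   "2021-02-01", "2021-02-04", "2021-02-04")),
  id = rep(1:3, each = 6)
)
data <- data[order(data$id, data$date), ]

# as.numeric() on a Date gives days since 1970-01-01, so subtracting the
# per-id minimum yields the days elapsed since each participant's first date
data$days_from_start <- ave(as.numeric(data$date), data$id,
                            FUN = function(x) x - min(x))
data$days_from_start
# 0 1 3 5 6 9 0 1 3 4 8 9 0 9 10 12 15 15
```

Sorting is only there to match the printed order; ave() itself does not need the rows to be ordered.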

Related

lubridate: Finding weeks within months

I want to number weeks separately within each month using the lubridate R package. My minimal working example is below:
library(tidyverse)
library(lubridate)
dt1 <-
tibble(
Date = seq(from = ymd("2021-01-01"), to = ymd("2021-12-31"), by = '1 day')
, Month = month(Date)
)
dt2 <-
dt1 %>%
group_by(Month) %>%
mutate(Week = week(Date))
dt2 %>%
print(n = 40)
# A tibble: 365 x 3
# Groups: Month [12]
Date Month Week
<date> <dbl> <dbl>
1 2021-01-01 1 1
2 2021-01-02 1 1
3 2021-01-03 1 1
4 2021-01-04 1 1
5 2021-01-05 1 1
6 2021-01-06 1 1
7 2021-01-07 1 1
8 2021-01-08 1 2
9 2021-01-09 1 2
10 2021-01-10 1 2
11 2021-01-11 1 2
12 2021-01-12 1 2
13 2021-01-13 1 2
14 2021-01-14 1 2
15 2021-01-15 1 3
16 2021-01-16 1 3
17 2021-01-17 1 3
18 2021-01-18 1 3
19 2021-01-19 1 3
20 2021-01-20 1 3
21 2021-01-21 1 3
22 2021-01-22 1 4
23 2021-01-23 1 4
24 2021-01-24 1 4
25 2021-01-25 1 4
26 2021-01-26 1 4
27 2021-01-27 1 4
28 2021-01-28 1 4
29 2021-01-29 1 5
30 2021-01-30 1 5
31 2021-01-31 1 5
32 2021-02-01 2 5
33 2021-02-02 2 5
34 2021-02-03 2 5
35 2021-02-04 2 5
36 2021-02-05 2 6
37 2021-02-06 2 6
38 2021-02-07 2 6
39 2021-02-08 2 6
40 2021-02-09 2 6
# ... with 325 more rows
I'm wondering what I am missing here. For row number 32 in the output (32 2021-02-01 2 5), the value in the Week column should be 1. Any lead on how to get the desired output?
It's not completely clear how you are defining a week. If Week 1 starts on the first day of a month, then you can do:
dt2 <- dt1 %>% mutate(Week = 1L + ((day(Date) - 1L) %/% 7L))
dt2 %>% slice(21:40) %>% print(n = 20L)
# A tibble: 20 × 3
Date Month Week
<date> <dbl> <int>
1 2021-01-21 1 3
2 2021-01-22 1 4
3 2021-01-23 1 4
4 2021-01-24 1 4
5 2021-01-25 1 4
6 2021-01-26 1 4
7 2021-01-27 1 4
8 2021-01-28 1 4
9 2021-01-29 1 5
10 2021-01-30 1 5
11 2021-01-31 1 5
12 2021-02-01 2 1
13 2021-02-02 2 1
14 2021-02-03 2 1
15 2021-02-04 2 1
16 2021-02-05 2 1
17 2021-02-06 2 1
18 2021-02-07 2 1
19 2021-02-08 2 2
20 2021-02-09 2 2
With base R, you could do the same using the day of the month from as.POSIXlt():
Week <- 1L + ((as.POSIXlt(dt1$Date)$mday - 1L) %/% 7L)
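To make the arithmetic concrete, here is a small self-contained sketch of the same formula in base R. The helper name week_of_month is made up for illustration, and it assumes the definition used above: week 1 covers days 1 to 7 of the month, week 2 covers days 8 to 14, and so on.

```r
# Hypothetical helper: week number within the month, where week 1 starts
# on the first day of the month (days 1-7 -> 1, 8-14 -> 2, ...)
week_of_month <- function(d) {
  1L + (as.POSIXlt(d)$mday - 1L) %/% 7L
}

dates <- as.Date(c("2021-01-07", "2021-01-08", "2021-01-31",
                   "2021-02-01", "2021-02-28"))
week_of_month(dates)
# 1 2 5 1 4
```

Under this definition a week never crosses a month boundary, which is why 2021-02-01 restarts at 1 even though it falls in the same calendar week as 2021-01-31.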

Add column values considering Dates

I would like to create a new data frame from the df data frame shown below. My idea is to have only one row per day. For example, instead of 4 rows for 2021-07-01 there will be only 1, with the column values for that day summed.
df <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-04-03","2021-04-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-04-07","2021-04-09","2021-04-10","2021-04-10"),
Week= c("Thursday","Thursday","Thursday","Thursday","Friday","Friday","Friday","Friday",
"Friday","Friday","Saturday","Saturday","Saturday","Saturday","Saturday","Thursday",
"Thursday","Friday","Friday","Saturday","Saturday"),
DTPE = c("Ho","Ho","Ho","Ho","","","","","","","","","","","","","","","","Ho","Ho"),
D1 = c(8,1,9, 3,5,4,7,6,3,8,2,3,4,6,7,8,4,2,6,2,3), DR01 = c(4,1,4,3,3,4,3,6,3,7,2,3,4,6,7,8,4,2,6,7,3),
DR02 = c(8,1,4,3,3,4,1,6,3,7,2,3,4,6,7,8,4,2,6,2,3), DR03 = c(7,5,4,3,3,4,1,5,3,3,2,3,4,6,7,8,4,2,6,4,3),
DR04= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR05 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
> df
Id date1 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-07-01 Thursday Ho 3 3 3 3 7 3
5 1 2021-04-02 Friday 5 3 3 3 3 3
6 1 2021-04-02 Friday 4 4 4 4 2 2
7 1 2021-04-02 Friday 7 3 1 1 7 1
8 1 2021-04-02 Friday 6 6 6 5 4 5
9 1 2021-04-02 Friday 3 3 3 3 2 3
10 1 2021-04-02 Friday 8 7 7 3 1 7
11 1 2021-04-03 Saturday 2 2 2 2 2 2
12 1 2021-04-03 Saturday 3 3 3 3 3 3
13 1 2021-04-03 Saturday 4 4 4 4 4 4
14 1 2021-04-03 Saturday 6 6 6 6 6 7
15 1 2021-04-03 Saturday 7 7 7 7 7 7
16 1 2021-04-08 Thursday 8 8 8 8 8 8
17 1 2021-04-08 Thursday 4 4 4 4 4 4
18 1 2021-04-07 Friday 2 2 2 2 2 2
19 1 2021-04-09 Friday 6 6 6 6 6 6
20 1 2021-04-10 Saturday Ho 2 7 2 4 4 4
21 1 2021-04-10 Saturday Ho 3 3 3 3 3 3
We can group by 'Id', 'date1' and 'Week', then use across() inside summarise() to sum the numeric columns:
library(dplyr)
df %>% group_by(Id, date1, Week) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE), .groups = 'drop')
You can perform this using the following code:
library(dplyr)
df %>%
group_by(Id, date1, Week) %>%
select(D1:DR05) %>%
summarise_all(sum)
# A tibble: 7 × 9
# Groups: Id, date1 [7]
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-03 Saturday 22 22 22 22 22 23
3 1 2021-04-07 Friday 2 2 2 2 2 2
4 1 2021-04-08 Thursday 12 12 12 12 12 12
5 1 2021-04-09 Friday 6 6 6 6 6 6
6 1 2021-04-10 Saturday 5 10 5 7 7 7
7 1 2021-07-01 Thursday 21 12 16 19 22 21
You might also want to convert the date1 column to a Date object, for example with lubridate's ymd() inside a mutate().
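As a rough sketch of that suggestion, the conversion can go in the same pipeline as the summing. The example below uses a made-up three-row subset (small) rather than the full df, and base as.Date() in place of ymd(); for ISO-formatted strings like these the two should give the same result.

```r
library(dplyr)

# Illustrative subset with the same column layout as df
small <- data.frame(
  Id    = c(1, 1, 1),
  date1 = c("2021-04-02", "2021-04-02", "2021-04-03"),
  Week  = c("Friday", "Friday", "Saturday"),
  D1    = c(5, 4, 2),
  DR01  = c(3, 4, 2)
)

daily <- small %>%
  mutate(date1 = as.Date(date1)) %>%   # character -> Date before grouping
  group_by(Id, date1, Week) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")
daily
```

With date1 stored as a Date, the result sorts chronologically and can be used directly in date arithmetic.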
Base R with aggregate:
aggregate(cbind(D1, DR01, DR02, DR03, DR04, DR05) ~ Id+date1+Week, df, sum)
Output:
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-07 Friday 2 2 2 2 2 2
3 1 2021-04-09 Friday 6 6 6 6 6 6
4 1 2021-04-03 Saturday 22 22 22 22 22 23
5 1 2021-04-10 Saturday 5 10 5 7 7 7
6 1 2021-04-08 Thursday 12 12 12 12 12 12
7 1 2021-07-01 Thursday 21 12 16 19 22 21

Use previous data to sequentially perform calculations and populate future values by group in R

I have data that looks like:
initial<-
tibble(
start_date=
rep(seq.Date(as.Date("2021-06-01"),as.Date("2021-10-01"),by="months"),4),
end_date=
rep(seq.Date(as.Date("2021-07-01"),as.Date("2021-11-01"),by="months"),4),
id=
rep(c(rep(1,5),rep(2,5)),2),
group=
c(rep("a",10),rep("b",10)),
increase=
c(
c(4:8),
c(5:9),
c(6:10),
c(7:11)
),
decrease=
c(
c(1:5),
c(2:6),
c(3:7),
c(4:8)
),
start_count=
c(c(10,13,NA,NA,NA),c(15,18,NA,NA,NA),c(20,23,NA,NA,NA),c(25,28,NA,NA,NA)),
end_count=
c(c(13,16,NA,NA,NA),c(18,21,NA,NA,NA),c(23,26,NA,NA,NA),c(28,31,NA,NA,NA))
)
print(initial)
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 NA NA
4 2021-09-01 2021-10-01 1 a 7 4 NA NA
5 2021-10-01 2021-11-01 1 a 8 5 NA NA
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 NA NA
9 2021-09-01 2021-10-01 2 a 8 5 NA NA
10 2021-10-01 2021-11-01 2 a 9 6 NA NA
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 NA NA
14 2021-09-01 2021-10-01 1 b 9 6 NA NA
15 2021-10-01 2021-11-01 1 b 10 7 NA NA
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 NA NA
19 2021-09-01 2021-10-01 2 b 10 7 NA NA
20 2021-10-01 2021-11-01 2 b 11 8 NA NA
My goal is to fill in each missing start_count with the previous month's end_count, and then calculate that row's end_count as start_count + increase - decrease, proceeding sequentially within each unique combination of id and group.
The final result should look like:
final<-
tibble(
start_date=
rep(seq.Date(as.Date("2021-06-01"),as.Date("2021-10-01"),by="months"),4),
end_date=
rep(seq.Date(as.Date("2021-07-01"),as.Date("2021-11-01"),by="months"),4),
id=
rep(c(rep(1,5),rep(2,5)),2),
group=
c(rep("a",10),rep("b",10)),
increase=
c(
c(4:8),
c(5:9),
c(6:10),
c(7:11)
),
decrease=
c(
c(1:5),
c(2:6),
c(3:7),
c(4:8)
),
start_count=
c(c(10,13,16,19,22),c(15,18,21,24,27),c(20,23,26,29,32),c(25,28,31,34,37)),
end_count=
c(c(13,16,19,22,25),c(18,21,24,27,30),c(23,26,29,32,35),c(28,31,34,37,40))
)
print(final)
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 16 19
4 2021-09-01 2021-10-01 1 a 7 4 19 22
5 2021-10-01 2021-11-01 1 a 8 5 22 25
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 21 24
9 2021-09-01 2021-10-01 2 a 8 5 24 27
10 2021-10-01 2021-11-01 2 a 9 6 27 30
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 26 29
14 2021-09-01 2021-10-01 1 b 9 6 29 32
15 2021-10-01 2021-11-01 1 b 10 7 32 35
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 31 34
19 2021-09-01 2021-10-01 2 b 10 7 34 37
20 2021-10-01 2021-11-01 2 b 11 8 37 40
Thanks!
Here is a solution. There might be better ones.
library(tidyverse)
initial<-
tibble(
start_date=
rep(seq.Date(as.Date("2021-06-01"),as.Date("2021-10-01"),by="months"),4),
end_date=
rep(seq.Date(as.Date("2021-07-01"),as.Date("2021-11-01"),by="months"),4),
id=
rep(c(rep(1,5),rep(2,5)),2),
group=
c(rep("a",10),rep("b",10)),
increase=
c(
c(4:8),
c(5:9),
c(6:10),
c(7:11)
),
decrease=
c(
c(1:5),
c(2:6),
c(3:7),
c(4:8)
),
start_count=
c(c(10,13,NA,NA,NA),c(15,18,NA,NA,NA),c(20,23,NA,NA,NA),c(25,28,NA,NA,NA)),
end_count=
c(c(13,16,NA,NA,NA),c(18,21,NA,NA,NA),c(23,26,NA,NA,NA),c(28,31,NA,NA,NA))
)
initial %>%
rowid_to_column(var = "obs_id") %>%
mutate(net = increase - decrease) %>%
select(obs_id, id, group, everything()) %>%
arrange(id, group, start_date) %>%
group_by(id, group) %>%
mutate(end_count = cumsum(net) + first(start_count)) %>%
mutate(start_count = end_count - net) %>%
ungroup() %>%
arrange(obs_id) %>%
select(-net)
#> # A tibble: 20 × 9
#> obs_id id group start_date end_date increase decrease start_count
#> <int> <dbl> <chr> <date> <date> <int> <int> <dbl>
#> 1 1 1 a 2021-06-01 2021-07-01 4 1 10
#> 2 2 1 a 2021-07-01 2021-08-01 5 2 13
#> 3 3 1 a 2021-08-01 2021-09-01 6 3 16
#> 4 4 1 a 2021-09-01 2021-10-01 7 4 19
#> 5 5 1 a 2021-10-01 2021-11-01 8 5 22
#> 6 6 2 a 2021-06-01 2021-07-01 5 2 15
#> 7 7 2 a 2021-07-01 2021-08-01 6 3 18
#> 8 8 2 a 2021-08-01 2021-09-01 7 4 21
#> 9 9 2 a 2021-09-01 2021-10-01 8 5 24
#> 10 10 2 a 2021-10-01 2021-11-01 9 6 27
#> 11 11 1 b 2021-06-01 2021-07-01 6 3 20
#> 12 12 1 b 2021-07-01 2021-08-01 7 4 23
#> 13 13 1 b 2021-08-01 2021-09-01 8 5 26
#> 14 14 1 b 2021-09-01 2021-10-01 9 6 29
#> 15 15 1 b 2021-10-01 2021-11-01 10 7 32
#> 16 16 2 b 2021-06-01 2021-07-01 7 4 25
#> 17 17 2 b 2021-07-01 2021-08-01 8 5 28
#> 18 18 2 b 2021-08-01 2021-09-01 9 6 31
#> 19 19 2 b 2021-09-01 2021-10-01 10 7 34
#> 20 20 2 b 2021-10-01 2021-11-01 11 8 37
#> # … with 1 more variable: end_count <dbl>
Created on 2021-08-11 by the reprex package (v2.0.0)
Here is another approach using accumulate2 from purrr:
library(tidyverse)
initial %>%
group_by(id, group) %>%
# lag() runs while the data are still grouped, so start_count never takes a value from another group
mutate(end_count = accumulate2(increase[-1], decrease[-1], ~ ..1 + ..2 - ..3, .init = first(end_count)) %>%
flatten_dbl,
start_count = lag(end_count, default = first(start_count))) %>%
ungroup()
Output
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 16 19
4 2021-09-01 2021-10-01 1 a 7 4 19 22
5 2021-10-01 2021-11-01 1 a 8 5 22 25
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 21 24
9 2021-09-01 2021-10-01 2 a 8 5 24 27
10 2021-10-01 2021-11-01 2 a 9 6 27 30
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 26 29
14 2021-09-01 2021-10-01 1 b 9 6 29 32
15 2021-10-01 2021-11-01 1 b 10 7 32 35
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 31 34
19 2021-09-01 2021-10-01 2 b 10 7 34 37
20 2021-10-01 2021-11-01 2 b 11 8 37 40
With tidyverse, try:
library(tidyverse)
initial %>%
arrange(group, id, start_date) %>%
group_by(group, id) %>%
mutate(delta = if_else(is.na(start_count),
true = increase - decrease,
false = 0L),
end_count = if_else(is.na(end_count),
true = cumsum(delta) + last(na.omit(end_count)),
false = end_count ),
start_count = if_else(is.na(start_count),
true = lag(end_count),
false = start_count)) %>%
select(-delta) %>%
ungroup()
# A tibble: 20 x 8
start_date end_date id group increase decrease start_count end_count
<date> <date> <dbl> <chr> <int> <int> <dbl> <dbl>
1 2021-06-01 2021-07-01 1 a 4 1 10 13
2 2021-07-01 2021-08-01 1 a 5 2 13 16
3 2021-08-01 2021-09-01 1 a 6 3 16 19
4 2021-09-01 2021-10-01 1 a 7 4 19 22
5 2021-10-01 2021-11-01 1 a 8 5 22 25
6 2021-06-01 2021-07-01 2 a 5 2 15 18
7 2021-07-01 2021-08-01 2 a 6 3 18 21
8 2021-08-01 2021-09-01 2 a 7 4 21 24
9 2021-09-01 2021-10-01 2 a 8 5 24 27
10 2021-10-01 2021-11-01 2 a 9 6 27 30
11 2021-06-01 2021-07-01 1 b 6 3 20 23
12 2021-07-01 2021-08-01 1 b 7 4 23 26
13 2021-08-01 2021-09-01 1 b 8 5 26 29
14 2021-09-01 2021-10-01 1 b 9 6 29 32
15 2021-10-01 2021-11-01 1 b 10 7 32 35
16 2021-06-01 2021-07-01 2 b 7 4 25 28
17 2021-07-01 2021-08-01 2 b 8 5 28 31
18 2021-08-01 2021-09-01 2 b 9 6 31 34
19 2021-09-01 2021-10-01 2 b 10 7 34 37
20 2021-10-01 2021-11-01 2 b 11 8 37 40
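If the vectorised versions are hard to follow, the recurrence from the question (start_count is the previous end_count, end_count is start_count + increase - decrease) can also be written as an explicit loop per group. This is just a sketch in base R on a made-up toy data frame; the helper name fill_counts is hypothetical, and it assumes the first row of every id/group combination already has a known start_count.

```r
# Hypothetical helper: fill missing counts row by row within one group
fill_counts <- function(d) {
  d <- d[order(d$start_date), ]
  for (i in seq_len(nrow(d))) {
    # a missing start_count is carried over from the previous row's end_count
    if (is.na(d$start_count[i])) d$start_count[i] <- d$end_count[i - 1]
    # a missing end_count follows from the (possibly just filled) start_count
    if (is.na(d$end_count[i])) {
      d$end_count[i] <- d$start_count[i] + d$increase[i] - d$decrease[i]
    }
  }
  d
}

# Toy data: two ids in one group, three months each
toy <- data.frame(
  start_date  = rep(as.Date(c("2021-06-01", "2021-07-01", "2021-08-01")), 2),
  id          = rep(c(1, 2), each = 3),
  group       = "a",
  increase    = c(4, 5, 6, 5, 6, 7),
  decrease    = c(1, 2, 3, 2, 3, 4),
  start_count = c(10, NA, NA, 15, NA, NA),
  end_count   = c(13, NA, NA, 18, NA, NA)
)

# Apply the helper to every id/group combination and stack the results
pieces <- split(toy, list(toy$id, toy$group), drop = TRUE)
filled <- do.call(rbind, lapply(pieces, fill_counts))
filled[, c("id", "group", "start_count", "end_count")]
```

The loop is slower than cumsum() or accumulate2() on large data, but it mirrors the sequential description in the question one step at a time.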

How to create a new variable that assigns consecutive numbers for cases with same combination of values on other variables

I have a dataset with 110 participants who answered the same questionnaire in multiple sessions within three timeframes. The number of sessions per timeframe differs within and between participants.
I need a new variable that numbers the sessions in which participant X completed the questionnaire within timeframe Y, running from 1 to the number of sessions that participant completed within that timeframe.
Example: I have
participant timeframe date
1 1 2021-04-30 09:12:00
1 1 2021-04-30 10:03:00
1 1 2021-05-02 09:20:00
2 1 2021-04-30 13:00:00
2 1 2021-05-02 12:13:00
1 2 2021-05-05 08:34:00
1 2 2021-05-06 14:15:00
2 2 2021-05-05 07:12:00
2 2 2021-05-05 14:13:00
2 2 2021-05-08 15:22:00
I need:
participant timeframe date session per timeframe
1 1 2021-04-30 09:12:00 1
1 1 2021-04-30 10:03:00 2
1 1 2021-05-02 09:20:00 3
2 1 2021-04-30 13:00:00 1
2 1 2021-05-02 12:13:00 2
1 2 2021-05-05 08:34:00 1
1 2 2021-05-06 14:15:00 2
2 2 2021-05-05 07:12:00 1
2 2 2021-05-05 14:13:00 2
2 2 2021-05-08 15:22:00 3
Hope that somebody can help! Thank you so much in advance.
Here is a tidyverse approach using row_number():
library(dplyr)
library(tibble)
dat <- tribble(~participant, ~timeframe, ~date,
1, 1, "2021-04-30 09:12:00",
1, 1, "2021-04-30 10:03:00",
1, 1, "2021-05-02 09:20:00",
2, 1, "2021-04-30 13:00:00",
2, 1, "2021-05-02 12:13:00",
1, 2, "2021-05-05 08:34:00",
1, 2, "2021-05-06 14:15:00",
2, 2, "2021-05-05 07:12:00",
2, 2, "2021-05-05 14:13:00",
2, 2, "2021-05-08 15:22:00") %>%
mutate(date = as.POSIXct(date))
dat %>%
group_by(participant, timeframe) %>%
mutate(session = row_number())
#> # A tibble: 10 x 4
#> # Groups: participant, timeframe [4]
#> participant timeframe date session
#> <dbl> <dbl> <dttm> <int>
#> 1 1 1 2021-04-30 09:12:00 1
#> 2 1 1 2021-04-30 10:03:00 2
#> 3 1 1 2021-05-02 09:20:00 3
#> 4 2 1 2021-04-30 13:00:00 1
#> 5 2 1 2021-05-02 12:13:00 2
#> 6 1 2 2021-05-05 08:34:00 1
#> 7 1 2 2021-05-06 14:15:00 2
#> 8 2 2 2021-05-05 07:12:00 1
#> 9 2 2 2021-05-05 14:13:00 2
#> 10 2 2 2021-05-08 15:22:00 3
Created on 2021-04-30 by the reprex package (v0.3.0)
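Alternatively, use rleid() from data.table. Note that this only gives consecutive numbers if no two adjacent rows within a group share the same date:
library(dplyr)
library(data.table)
dat %>%
group_by(participant, timeframe) %>%
mutate(session_per_timeframe = rleid(date))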
# A tibble: 10 x 4
# Groups: participant, timeframe [4]
participant timeframe date session_per_timeframe
<dbl> <dbl> <dttm> <int>
1 1 1 2021-04-30 09:12:00 1
2 1 1 2021-04-30 10:03:00 2
3 1 1 2021-05-02 09:20:00 3
4 2 1 2021-04-30 13:00:00 1
5 2 1 2021-05-02 12:13:00 2
6 1 2 2021-05-05 08:34:00 1
7 1 2 2021-05-06 14:15:00 2
8 2 2 2021-05-05 07:12:00 1
9 2 2 2021-05-05 14:13:00 2
10 2 2 2021-05-08 15:22:00 3
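Another option is to group on runs of participant. This assumes the rows are ordered as in the example, so that each run of a participant corresponds to one timeframe: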
data %>% group_by(grp = data.table::rleid(participant)) %>%
mutate(session = row_number())
# A tibble: 10 x 5
# Groups: grp [4]
participant timeframe date grp session
<int> <int> <chr> <int> <int>
1 1 1 2021-04-30 09:12:00 1 1
2 1 1 2021-04-30 10:03:00 1 2
3 1 1 2021-05-02 09:20:00 1 3
4 2 1 2021-04-30 13:00:00 2 1
5 2 1 2021-05-02 12:13:00 2 2
6 1 2 2021-05-05 08:34:00 3 1
7 1 2 2021-05-06 14:15:00 3 2
8 2 2 2021-05-05 07:12:00 4 1
9 2 2 2021-05-05 14:13:00 4 2
10 2 2 2021-05-08 15:22:00 4 3
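A base R sketch of the same idea, for comparison: ave() splits the row indices by participant and timeframe, and rank() numbers the rows by date within each piece, so the result does not depend on the rows being pre-sorted. The data frame below just rebuilds the example from the question, with the time zone fixed to UTC as an assumption.

```r
dat <- data.frame(
  participant = c(1, 1, 1, 2, 2, 1, 1, 2, 2, 2),
  timeframe   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  date = as.POSIXct(c("2021-04-30 09:12:00", "2021-04-30 10:03:00",
                      "2021-05-02 09:20:00", "2021-04-30 13:00:00",
                      "2021-05-02 12:13:00", "2021-05-05 08:34:00",
                      "2021-05-06 14:15:00", "2021-05-05 07:12:00",
                      "2021-05-05 14:13:00", "2021-05-08 15:22:00"),
                    tz = "UTC")
)

# For each participant/timeframe combination, rank the rows by date;
# ties.method = "first" keeps the numbering consecutive even for equal dates
dat$session <- ave(seq_len(nrow(dat)), dat$participant, dat$timeframe,
                   FUN = function(i) rank(dat$date[i], ties.method = "first"))
dat$session
# 1 2 3 1 2 1 2 1 2 3
```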

Sum up ending with the current observation, starting based on a criterion

I observe the number of purchases made by different customers (4 in the example below) on different days (5 in the example). Now I want to create a new variable that sums each customer's purchases over the window covering the last 20 purchases made in total across all customers.
Example data:
> da <- data.frame(customer_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
+ day = c("2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15","2016-04-11","2016-04-12","2016-04-13","2016-04-14","2016-04-15"),
+ n_purchase = c(5,2,8,0,3,2,0,3,4,0,2,4,5,1,0,2,3,5,0,3))
> da
customer_id day n_purchase
1 1 2016-04-11 5
2 1 2016-04-12 2
3 1 2016-04-13 8
4 1 2016-04-14 0
5 1 2016-04-15 3
6 2 2016-04-11 2
7 2 2016-04-12 0
8 2 2016-04-13 3
9 2 2016-04-14 4
10 2 2016-04-15 0
11 3 2016-04-11 2
12 3 2016-04-12 4
13 3 2016-04-13 5
14 3 2016-04-14 1
15 3 2016-04-15 0
16 4 2016-04-11 2
17 4 2016-04-12 3
18 4 2016-04-13 5
19 4 2016-04-14 0
20 4 2016-04-15 3
I need to know three things to construct my variable:
(1) What's the overall number of purchases on a day across users (day purchases)?
(2) What's the cumulative number of purchases across users starting from the first day (cumsum_day_purchases)?
(3) On which day, counting back from the current observation, did the 20 immediately preceding purchases (across users) start? This is where I have trouble coding such a variable.
> library(dplyr)
> da %>%
+ group_by(day) %>%
+ mutate(day_purchases = sum(n_purchase)) %>%
+ group_by(customer_id) %>%
+ mutate(cumsum_day_purchases = cumsum(day_purchases))
# A tibble: 20 x 5
# Groups: customer_id [4]
customer_id day n_purchase day_purchases cumsum_day_purchases
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11
2 1 2016-04-12 2 9 20
3 1 2016-04-13 8 21 41
4 1 2016-04-14 0 5 46
5 1 2016-04-15 3 6 52
6 2 2016-04-11 2 11 11
7 2 2016-04-12 0 9 20
8 2 2016-04-13 3 21 41
9 2 2016-04-14 4 5 46
10 2 2016-04-15 0 6 52
11 3 2016-04-11 2 11 11
12 3 2016-04-12 4 9 20
13 3 2016-04-13 5 21 41
14 3 2016-04-14 1 5 46
15 3 2016-04-15 0 6 52
16 4 2016-04-11 2 11 11
17 4 2016-04-12 3 9 20
18 4 2016-04-13 5 21 41
19 4 2016-04-14 0 5 46
20 4 2016-04-15 3 6 52
I will now compute the variable I want by hand for the following dataset.
For all observations on day 2016-04-12, I compute the cumulative sum of a specific customer's purchases by adding the number of purchases on the current day and the preceding day, because all customers together made 20 purchases in total over the current day and the preceding day.
For day 2016-04-13, I only use the number of purchases of a user on that day, because there were 21 (41-20) new purchases on the day itself.
This results in the following output:
> da = da %>% ungroup() %>%
+ mutate(cumsum_last_20_purchases = c(5,5+2,8,0,0+3,2,2+0,3,4,4+0,2,2+4,5,1,1+0,2,2+3,5,0,0+3))
> da
# A tibble: 20 x 6
customer_id day n_purchase day_purchases cumsum_day_purchases cumsum_last_20_purchases
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2016-04-11 5 11 11 5
2 1 2016-04-12 2 9 20 7
3 1 2016-04-13 8 21 41 8
4 1 2016-04-14 0 5 46 0
5 1 2016-04-15 3 6 52 3
6 2 2016-04-11 2 11 11 2
7 2 2016-04-12 0 9 20 2
8 2 2016-04-13 3 21 41 3
9 2 2016-04-14 4 5 46 4
10 2 2016-04-15 0 6 52 4
11 3 2016-04-11 2 11 11 2
12 3 2016-04-12 4 9 20 6
13 3 2016-04-13 5 21 41 5
14 3 2016-04-14 1 5 46 1
15 3 2016-04-15 0 6 52 1
16 4 2016-04-11 2 11 11 2
17 4 2016-04-12 3 9 20 5
18 4 2016-04-13 5 21 41 5
19 4 2016-04-14 0 5 46 0
20 4 2016-04-15 3 6 52 3
We can create a new grouping from the runs of consecutive days on which day_purchases is at least 20 (or below 20), and then use cumsum within each run:
library(dplyr)
da %>%
group_by(day) %>%
mutate(day_purchases = sum(n_purchase)) %>%
group_by(customer_id) %>%
mutate(above = with(rle(day_purchases >= 20), rep(1:length(lengths), lengths))) %>%
group_by(above, .add =TRUE) %>%
mutate(cumsum_last_20_purchases = cumsum(n_purchase))
#> # A tibble: 20 x 6
#> # Groups: customer_id, above [12]
#> customer_id day n_purchase day_purchases above cumsum_last_20_purchas…
#> <dbl> <fct> <dbl> <dbl> <int> <dbl>
#> 1 1 2016-04-11 5 11 1 5
#> 2 1 2016-04-12 2 9 1 7
#> 3 1 2016-04-13 8 21 2 8
#> 4 1 2016-04-14 0 5 3 0
#> 5 1 2016-04-15 3 6 3 3
#> 6 2 2016-04-11 2 11 1 2
#> 7 2 2016-04-12 0 9 1 2
#> 8 2 2016-04-13 3 21 2 3
#> 9 2 2016-04-14 4 5 3 4
#> 10 2 2016-04-15 0 6 3 4
#> 11 3 2016-04-11 2 11 1 2
#> 12 3 2016-04-12 4 9 1 6
#> 13 3 2016-04-13 5 21 2 5
#> 14 3 2016-04-14 1 5 3 1
#> 15 3 2016-04-15 0 6 3 1
#> 16 4 2016-04-11 2 11 1 2
#> 17 4 2016-04-12 3 9 1 5
#> 18 4 2016-04-13 5 21 2 5
#> 19 4 2016-04-14 0 5 3 0
#> 20 4 2016-04-15 3 6 3 3
Created on 2020-07-28 by the reprex package (v0.3.0)
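To see what the rle() step does in isolation, here is a minimal sketch for a single customer (customer 1 from the example), using base R only. The vector names are chosen for illustration; the threshold of 20 is the one from the answer above.

```r
# Daily totals across all customers, and customer 1's own purchases
day_purchases <- c(11, 9, 21, 5, 6)
n_purchase    <- c(5, 2, 8, 0, 3)

# rle() finds runs of consecutive equal values of the condition;
# repeating the run index by run length gives one id per row
runs   <- rle(day_purchases >= 20)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 1 2 3 3

# Cumulative sum restarts at the beginning of every run
cumsum_in_run <- ave(n_purchase, run_id, FUN = cumsum)
cumsum_in_run
# 5 7 8 0 3
```

Note that the run id changes whenever the condition flips, so this reproduces the hand-computed column for the example data but is not a general rolling window over the last 20 purchases.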
