Aggregate a tibble based on consecutive values in a boolean column

I've got a fairly straightforward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.
I've got a summary table, df, for an hourly timeseries dataset where each observation belongs to a group.
I want to merge some of those groups, based on a boolean column in the summary table.
The boolean column, merge_with_next indicates whether a given group should be merged with the next group (one row down).
The merging effectively occurs by updating the end value and removing rows:
library(dplyr)
# Demo data
df <- tibble(
  group = 1:12,
  start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
  end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"),
  merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)
df
#> # A tibble: 12 x 4
#> group start end merge_with_next
#> <int> <dttm> <dttm> <lgl>
#> 1 1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE
#> 2 2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE
#> 3 3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE
#> 4 4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE
#> 5 5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE
#> 6 6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE
#> 7 7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE
#> 8 8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE
#> 9 9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE
#> 10 10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE
#> 11 11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE
#> 12 12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE
# Desired result
desired <- tibble(
  group = c(1, 4, 7, 9),
  start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
  end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)
desired
#> # A tibble: 4 x 3
#> group start end
#> <dbl> <chr> <chr>
#> 1 1 2019-01-01 00:00 2019-01-03 23:59
#> 2 4 2019-01-04 00:00 2019-01-06 23:59
#> 3 7 2019-01-07 00:00 2019-01-09 23:59
#> 4 9 2019-01-10 00:00 2019-01-12 23:59
Created on 2019-03-22 by the reprex package (v0.2.1)
I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant; I only care about the start and end columns in the result.

We can use dplyr and start a new group every time the previous row's merge_with_next value is FALSE, then take the first value of start and the last value of end for each group.
library(dplyr)
df %>%
  group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
  summarise(group = first(group),
            start = first(start),
            end = last(end)) %>%
  ungroup() %>%
  select(-temp)
# group start end
# <int> <dttm> <dttm>
#1 1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2 4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3 7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4 10 2019-01-10 00:00:00 2019-01-12 23:59:00
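If you prefer to avoid lag(), an equivalent grouping key can be built by counting the group-closing FALSE values from the bottom up (a sketch, not part of the original answer):
library(dplyr)
# Each FALSE in merge_with_next closes a group, so counting the remaining FALSEs
# from the end gives a group id without lag(); the id runs backwards, so arrange()
# restores chronological order.
df %>%
  group_by(temp = rev(cumsum(rev(!merge_with_next)))) %>%
  summarise(group = first(group),
            start = first(start),
            end = last(end)) %>%
  arrange(start) %>%
  select(-temp)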

Related

How to create groups based on a changing condition using a for loop in R?

I have a data frame and a vector that I want to compare with a column of my data frame to assign groups based on the values that meet the condition. The problem is that these values are dynamic, so I need code that takes into account the different lengths this vector can take.
This is a minimal reproducible example of my data frame
value <- c(rnorm(39, 5, 2))
Date <- seq(as.POSIXct('2021-01-18'), as.POSIXct('2021-10-15'), by = "7 days")
df <- data.frame(Date, value)
This is the vector I have to compare with the Date of the data frame
dates_tour <- as.POSIXct(c('2021-01-18', '2021-05-18', '2021-08-18', '2021-10-15'))
This creates the desired output
df <- df %>% mutate(tour = case_when(Date >= dates_tour[1] & Date <= dates_tour[2] ~ 1,
                                     Date > dates_tour[2] & Date <= dates_tour[3] ~ 2,
                                     Date > dates_tour[3] & Date <= dates_tour[4] ~ 3))
However, I don't want to do it like that, since this project needs to be updated frequently and the dates_tour vector changes in length.
So I would like to take that into account when creating the tour variable.
I tried to do it like this, but it doesn't work:
for (i in 1:length(dates_tour)) {
  df <- df %>% mutate(tour = case_when(Date >= dates_tour[i] & Date <= dates_tour[i + 1] ~ i))
}
You can use cut to bin a vector based on break points:
df %>%
  mutate(
    tour = cut(Date, breaks = dates_tour, labels = seq_along(dates_tour[-1]))
  )
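If a plain integer column is preferred over a factor, findInterval() is a possible alternative (a sketch, not part of the answer above); rightmost.closed = TRUE keeps the final break date inside the last bin:
df %>%
  mutate(
    # findInterval() returns the index of the interval each Date falls into
    tour = findInterval(as.numeric(Date), as.numeric(dates_tour), rightmost.closed = TRUE)
  )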
We may pair each interval's start date (all but the last element of dates_tour) with its end date (all but the first element) in a tibble and then loop over the rows of that tibble
library(dplyr)
library(purrr)
keydat <- tibble(start = dates_tour[-length(dates_tour)],
                 end = dates_tour[-1])
df$tour <- imap(seq_len(nrow(keydat)),
                ~ case_when(df$Date >= keydat$start[.x] &
                            df$Date <= keydat$end[.x] ~ .y)) %>%
  invoke(coalesce, .)
Output:
> df
Date value tour
1 2021-01-18 00:00:00 7.874620 1
2 2021-01-25 00:00:00 9.704973 1
3 2021-02-01 00:00:00 5.898070 1
4 2021-02-08 00:00:00 3.287319 1
5 2021-02-15 00:00:00 5.488132 1
6 2021-02-22 00:00:00 4.425636 1
7 2021-03-01 00:00:00 6.244084 1
8 2021-03-08 00:00:00 5.528364 1
9 2021-03-15 01:00:00 7.954929 1
10 2021-03-22 01:00:00 4.691995 1
11 2021-03-29 01:00:00 5.943415 1
12 2021-04-05 01:00:00 5.316373 1
13 2021-04-12 01:00:00 5.182952 1
14 2021-04-19 01:00:00 3.330700 1
15 2021-04-26 01:00:00 7.461089 1
16 2021-05-03 01:00:00 4.338873 1
17 2021-05-10 01:00:00 5.768665 1
18 2021-05-17 01:00:00 3.574488 1
19 2021-05-24 01:00:00 5.106042 2
20 2021-05-31 01:00:00 2.828844 2
21 2021-06-07 01:00:00 4.616084 2
22 2021-06-14 01:00:00 7.234506 2
23 2021-06-21 01:00:00 4.760413 2
24 2021-06-28 01:00:00 7.020543 2
25 2021-07-05 01:00:00 7.403235 2
26 2021-07-12 01:00:00 6.368435 2
27 2021-07-19 01:00:00 3.527764 2
28 2021-07-26 01:00:00 5.254025 2
29 2021-08-02 01:00:00 5.676425 2
30 2021-08-09 01:00:00 3.783304 2
31 2021-08-16 01:00:00 6.310292 2
32 2021-08-23 01:00:00 2.938218 3
33 2021-08-30 01:00:00 5.101852 3
34 2021-09-06 01:00:00 3.765659 3
35 2021-09-13 01:00:00 5.489846 3
36 2021-09-20 01:00:00 4.174276 3
37 2021-09-27 01:00:00 7.348895 3
38 2021-10-04 01:00:00 5.103772 3
39 2021-10-11 01:00:00 4.941248 3
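A side note (my addition, not from the original answer): invoke() has been superseded in newer purrr releases, so the final coalescing step can equivalently be written with reduce():
# Same result as above, coalescing the per-interval vectors with reduce()
# instead of the superseded invoke().
df$tour <- imap(seq_len(nrow(keydat)),
                ~ case_when(df$Date >= keydat$start[.x] &
                            df$Date <= keydat$end[.x] ~ .y)) %>%
  reduce(coalesce)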

Using two summarise functions in R

library(lubridate)
library(tidyverse)
step_count_raw <- read_csv("data/step-count/step-count.csv",
                           locale = locale(tz = "Australia/Melbourne"))
location <- read_csv("data/step-count/location.csv")
step_count <- step_count_raw %>%
  rename_with(~ c("date_time", "date", "count")) %>%
  left_join(location) %>%
  mutate(location = replace_na(location, "Melbourne"))
step_count
#> # A tibble: 5,448 x 4
#> date_time date count location
#> <dttm> <date> <dbl> <chr>
#> 1 2019-01-01 09:00:00 2019-01-01 764 Melbourne
#> 2 2019-01-01 10:00:00 2019-01-01 913 Melbourne
#> 3 2019-01-02 00:00:00 2019-01-02 9 Melbourne
#> 4 2019-01-02 10:00:00 2019-01-02 2910 Melbourne
#> 5 2019-01-02 11:00:00 2019-01-02 1390 Melbourne
#> 6 2019-01-02 12:00:00 2019-01-02 1020 Melbourne
#> 7 2019-01-02 13:00:00 2019-01-02 472 Melbourne
#> 8 2019-01-02 15:00:00 2019-01-02 1220 Melbourne
#> 9 2019-01-02 16:00:00 2019-01-02 1670 Melbourne
#> 10 2019-01-02 17:00:00 2019-01-02 1390 Melbourne
#> # … with 5,438 more rows
I want to calculate average daily step counts for every location from step_count, and end up with a tibble called city_avg_steps.
expected output
#> # A tibble: 4 x 2
#> location avg_count
#> <chr> <dbl>
#> 1 Austin 7738.
#> 2 Denver 12738.
#> 3 Melbourne 7912.
#> 4 San Francisco 13990.
My code and output
city_avg_steps <- step_count %>% group_by(location) %>% summarise(avg_count = mean(count))
city_avg_steps
# A tibble: 4 x 2
location avg_count
<chr> <dbl>
1 Austin 721.
2 Denver 650.
3 Melbourne 530.
4 San Francisco 654.
I have a clue: calculate the daily totals first, then aggregate that result using two summarise functions, but I'm not sure how to chain them.
As @dash2 explains in the comments, your desired output requires a two-stage aggregation: first aggregate the number of steps per day (adding them together with sum), then aggregate the daily totals into location-level averages with mean.
step_count %>%
  group_by(date, location) %>%
  summarise(sum_steps = sum(count, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(location) %>%
  summarise(avg_count = mean(sum_steps, na.rm = TRUE))
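The ungroup()/group_by() pair can also be dropped, because summarise() peels off the innermost grouping level; a sketch of the same idea:
# After the first summarise(), the result is still grouped by location,
# so the second summarise() directly yields one row per location.
step_count %>%
  group_by(location, date) %>%
  summarise(daily_steps = sum(count, na.rm = TRUE), .groups = "drop_last") %>%
  summarise(avg_count = mean(daily_steps))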

Get next date in group in R

I have data which looks like
library(dplyr)
library(lubridate)
Date_Construct = c("10/03/2018 00:00", "10/03/2018 00:00", "01/01/2016 00:00", "21/03/2015 01:25", "21/03/2015 01:25", "17/04/2016 00:00", "17/04/2016 00:00", "20/02/2012 00:00", "20/02/2020 00:00")
Date_first_use = c("02/08/2018 00:00", "02/08/2018 00:00", "01/04/2016 00:00", "NA", "NA", "NA", "NA", "13/08/2012 00:00", "20/04/2020 00:00")
Date_fail = c("02/08/2019 00:00", "02/08/2019 00:00", "21/06/2018 06:42", "NA", "NA", "17/04/2016 00:00", "17/04/2016 00:00", "13/08/2014 07:45", "NA")
P_ID = c("0001", "0001", "0001", "0001", "0001", "34000", "34000", "34000", "00425")
Comp_date = c("16/05/2019 00:00", "10/04/2018 12:55", "25/06/2017 00:00", "22/04/2015 00:00", "08/05/2015 00:00", "04/05/2017 00:00", "15/07/2016 00:00", "01/03/2014 00:00", "20/03/2020 00:00")
Type = c("a", "a", "b", "c", "c", "b", "b", "a", "c")
dfq <- data.frame(P_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date) %>%
  # parse all date columns up front so the arrange() below sorts chronologically
  mutate(across(contains("Date", ignore.case = TRUE), dmy_hm)) %>%
  arrange(P_ID, desc(Date_Construct)) %>%
  group_by(P_ID, Date_Construct, Type) %>%
  mutate(A_ID = cur_group_id()) %>%
  select(P_ID, A_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date)
View(dfq)
It is a data frame of different items (A_ID) of type a/b/c, created for different clients (P_ID), with date of construction, date of first use and date of failure. Each P_ID may have multiple A_ID, and each A_ID may have multiple Comp_date.
I need to supply a date wherever Date_fail is NA, namely the Date_Construct of the next constructed A_ID for the same P_ID.
i.e. Date_fail for P_ID 0001, A_ID 1 should be 2016-01-01 00:00:00.
For an A_ID with no subsequent A_ID (as is the case for P_ID 00425, A_ID 4), Date_fail should remain NA.
So result should look like:
P_ID A_ID Type Date_Construct Date_first_use Date_fail Comp_date
1 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001 2 b 2016-01-01 00:00:00 2016-04-01 2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001 3 a 2018-03-10 00:00:00 2018-08-02 2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001 3 a 2018-03-10 00:00:00 2018-08-02 2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425 4 c 2020-02-20 00:00:00 2020-04-20 NA 2020-03-20 00:00:00
7 34000 5 a 2012-02-20 00:00:00 2012-08-13 2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2016-07-15 00:00:00
I tried this, which I thought worked, but it just gives me the Date_Construct of the next row in the group, which isn't correct as some A_ID have multiple entries:
dfq %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(Date_fail2 = sort(Date_Construct, decreasing = FALSE)[row_number(Date_Construct) + 1]) %>%
  mutate(Date_fail = if_else(is.na(Date_fail), paste(Date_fail2), paste(Date_fail)))
I'm ideally looking for a dplyr solution as I find them easier to understand and reproduce.
One solution is to nest all the variables that can differ within the same A_ID (in this case, only Comp_date).
library(tidyr)
nested = dfq %>%
  ungroup() %>%
  arrange(P_ID, A_ID) %>%
  nest(extra = Comp_date)
This results in a tibble with one row for each A_ID, where the different Comp_dates are comfortably nested in their own tibbles:
> nested
# A tibble: 6 x 7
# Groups: P_ID, Type, Date_Construct [6]
P_ID A_ID Type Date_Construct Date_first_use Date_fail extra
<fct> <int> <fct> <dttm> <dttm> <dttm> <list>
1 0001 1 c 2015-03-21 01:25:00 NA NA <tibble [2 × 1]>
2 0001 2 b 2016-01-01 00:00:00 2016-04-01 00:00:00 2018-06-21 06:42:00 <tibble [1 × 1]>
3 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 <tibble [2 × 1]>
4 00425 4 c 2020-02-20 00:00:00 2020-04-20 00:00:00 NA <tibble [1 × 1]>
5 34000 5 a 2012-02-20 00:00:00 2012-08-13 00:00:00 2014-08-13 07:45:00 <tibble [1 × 1]>
6 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 <tibble [2 × 1]>
You can now modify this using normal dplyr methods. Your own approach would probably work as well here, but it can be done much more cleanly using coalesce and lead. Don't forget to unnest at the end to get your original structure back:
result = nested %>%
  group_by(P_ID) %>%
  mutate(Date_fail = coalesce(Date_fail, lead(Date_Construct))) %>%
  unnest(extra)
Result:
> result
# A tibble: 9 x 7
# Groups: P_ID [3]
P_ID A_ID Type Date_Construct Date_first_use Date_fail Comp_date
<fct> <int> <fct> <dttm> <dttm> <dttm> <dttm>
1 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001 2 b 2016-01-01 00:00:00 2016-04-01 00:00:00 2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425 4 c 2020-02-20 00:00:00 2020-04-20 00:00:00 NA 2020-03-20 00:00:00
7 34000 5 a 2012-02-20 00:00:00 2012-08-13 00:00:00 2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2016-07-15 00:00:00
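For comparison, here is a sketch of an unnested variant (my addition, not from the answer above): take the distinct construction date per A_ID, look up the next one within each P_ID, and join it back on:
library(dplyr)
# Helper table: one row per A_ID with the construction date of the next A_ID
# in the same P_ID (NA when there is none).
next_start <- dfq %>%
  ungroup() %>%
  distinct(P_ID, A_ID, Date_Construct) %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(next_construct = lead(Date_Construct)) %>%
  ungroup() %>%
  select(A_ID, next_construct)

dfq %>%
  left_join(next_start, by = "A_ID") %>%
  mutate(Date_fail = coalesce(Date_fail, next_construct)) %>%
  select(-next_construct)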

How to deduplicate date sequences across non-consecutive rows in R?

I want to flag the first date in every window of at least 31 days for each id in my data.
Data:
library(tidyverse)
library(lubridate)
library(tibbletime)
D1 <- tibble(id = c(12, 12, 12, 12, 12, 12, 10, 10, 10, 10),
             index_date = c("2019-01-01", "2019-01-07", "2019-01-21", "2019-02-02",
                            "2019-02-09", "2019-03-06", "2019-01-05", "2019-02-01",
                            "2019-02-02", "2019-02-08"))
D1
# A tibble: 10 x 2
id index_date
<dbl> <chr>
1 12 2019-01-01
2 12 2019-01-07
3 12 2019-01-21
4 12 2019-02-02
5 12 2019-02-09
6 12 2019-03-06
7 10 2019-01-05
8 10 2019-02-01
9 10 2019-02-02
10 10 2019-02-08
The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:
temp <- D1 %>%
  mutate(index_date = ymd(index_date)) %>%
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>%
  group_by(id) %>%
  mutate(keyed_to_index_date =
           collapse_index(index_date, period = '31 d', side = "start"),
         keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)
Result:
id index_date keyed_to_index_date keep
<dbl> <date> <date> <lgl>
1 12 2019-01-01 2019-01-01 TRUE
2 12 2019-01-07 2019-01-01 FALSE
3 12 2019-01-21 2019-01-01 FALSE
4 12 2019-02-02 2019-02-02 TRUE
5 12 2019-02-09 2019-02-02 FALSE
6 12 2019-03-06 2019-03-06 TRUE
7 10 2019-01-05 2019-01-05 TRUE
8 10 2019-02-01 2019-02-01 TRUE
9 10 2019-02-02 2019-02-01 FALSE
10 10 2019-02-08 2019-02-01 FALSE
Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?
UPDATE: Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:
D2 <- tibble(id = c("A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C"),
             index_date = c("2019-03-04", "2019-03-05", "2019-03-06",
                            "2019-03-01", "2019-03-02", "2019-03-04", "2019-03-05", "2019-03-06",
                            "2019-03-03", "2019-03-04", "2019-03-05"))
D2
id index_date
<chr> <chr>
1 A 2019-03-04
2 A 2019-03-05
3 A 2019-03-06
4 B 2019-03-01
5 B 2019-03-02
6 B 2019-03-04
7 B 2019-03-05
8 B 2019-03-06
9 C 2019-03-03
10 C 2019-03-04
11 C 2019-03-05
I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are Rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first `index_date` for a particular `id` or the first after a two-day skip.
Code:
t3 <- D2 %>%
  mutate(index_date = ymd(index_date)) %>%
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>%
  group_by(id) %>%
  mutate(keyed_to_index_date =
           collapse_index(index_date,
                          period = '2 d',
                          side = "start",
                          start_date = first(index_date)),
         keep = index_date == keyed_to_index_date) %>%
  arrange(id, index_date)
Result:
> t3
# A time tibble: 11 x 4
# Index: index_date
# Groups: id [3]
id index_date keyed_to_index_date keep
<chr> <date> <date> <lgl>
1 A 2019-03-04 2019-03-04 TRUE
2 A 2019-03-05 2019-03-04 FALSE
3 A 2019-03-06 2019-03-06 TRUE
4 B 2019-03-01 2019-03-01 TRUE
5 B 2019-03-02 2019-03-01 FALSE
6 B 2019-03-04 2019-03-04 TRUE
7 B 2019-03-05 2019-03-05 TRUE
8 B 2019-03-06 2019-03-05 FALSE
9 C 2019-03-03 2019-03-03 TRUE
10 C 2019-03-04 2019-03-03 FALSE
11 C 2019-03-05 2019-03-05 TRUE
Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:
t4 <- D2 %>%
  group_by(id) %>%
  mutate(index_date = ymd(index_date),
         keep = row_number() == 1 |
           accumulate(c(0, diff(index_date)),
                      ~ if_else(.x >= 2, .y, .x + .y)) >= 2)
Result:
> t4
# A tibble: 11 x 3
# Groups: id [3]
id index_date keep
<chr> <date> <lgl>
1 A 2019-03-04 TRUE
2 A 2019-03-05 FALSE
3 A 2019-03-06 TRUE
4 B 2019-03-01 TRUE
5 B 2019-03-02 FALSE
6 B 2019-03-04 TRUE
7 B 2019-03-05 FALSE
8 B 2019-03-06 TRUE
9 C 2019-03-03 TRUE
10 C 2019-03-04 FALSE
11 C 2019-03-05 TRUE
What is wrong with the tibbletime approach in this example?
One option utilizing dplyr, lubridate and purrr could be:
D1 %>%
  group_by(id) %>%
  mutate(index_date = ymd(index_date),
         keep = row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31)
id index_date keep
<dbl> <date> <lgl>
1 12 2019-01-01 TRUE
2 12 2019-01-07 FALSE
3 12 2019-01-21 FALSE
4 12 2019-02-02 TRUE
5 12 2019-02-09 FALSE
6 12 2019-03-06 TRUE
7 10 2019-01-05 TRUE
8 10 2019-02-01 FALSE
9 10 2019-02-02 FALSE
10 10 2019-02-08 TRUE
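If only the flagged rows themselves are needed, the same pipeline can simply end with filter(keep) (a small usage note, not part of the original answer):
# Keep only the first date in each 31-day window per id.
D1 %>%
  group_by(id) %>%
  mutate(index_date = ymd(index_date),
         keep = row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31) %>%
  filter(keep)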
You just need to add the start_date argument to collapse_index:
D1 %>%
  mutate(index_date = ymd(index_date)) %>%
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>%
  group_by(id) %>%
  mutate(keyed_to_index_date =
           collapse_index(index_date, period = '31 d', side = "start", start_date = first(index_date)),
         keep = index_date == keyed_to_index_date) %>%
  arrange(desc(id), index_date)
#> # A time tibble: 10 x 4
#> # Index: index_date
#> # Groups: id [2]
#> id index_date keyed_to_index_date keep
#> <dbl> <date> <date> <lgl>
#> 1 12 2019-01-01 2019-01-01 TRUE
#> 2 12 2019-01-07 2019-01-01 FALSE
#> 3 12 2019-01-21 2019-01-01 FALSE
#> 4 12 2019-02-02 2019-02-02 TRUE
#> 5 12 2019-02-09 2019-02-02 FALSE
#> 6 12 2019-03-06 2019-03-06 TRUE
#> 7 10 2019-01-05 2019-01-05 TRUE
#> 8 10 2019-02-01 2019-01-05 FALSE
#> 9 10 2019-02-02 2019-01-05 FALSE
#> 10 10 2019-02-08 2019-02-08 TRUE
Created on 2020-09-11 by the reprex package (v0.3.0)
You can use accumulate() from purrr.
D1 %>%
  group_by(id) %>%
  mutate(index_date = ymd(index_date),
         keep = index_date == accumulate(index_date, ~ if (.y - .x >= 31) .y else .x))
# id index_date keep
# <dbl> <date> <lgl>
# 1 12 2019-01-01 TRUE
# 2 12 2019-01-07 FALSE
# 3 12 2019-01-21 FALSE
# 4 12 2019-02-02 TRUE
# 5 12 2019-02-09 FALSE
# 6 12 2019-03-06 TRUE
# 7 10 2019-01-05 TRUE
# 8 10 2019-02-01 FALSE
# 9 10 2019-02-02 FALSE
# 10 10 2019-02-08 TRUE
The iteration rule is as follows:
1. 2019-01-07 - 2019-01-01 = 6 < 31, so return 2019-01-01
2. 2019-01-21 - 2019-01-01 = 20 < 31, so return 2019-01-01
3. 2019-02-02 - 2019-01-01 = 32 >= 31, so return 2019-02-02 (*)
4. 2019-02-09 - 2019-02-02 (*) = 7 < 31, so return 2019-02-02
5. etc.
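To make the rule concrete, here is a small standalone illustration for id == 12 only (my addition; the dates are copied from D1):
library(purrr)
library(lubridate)
# Dates for id == 12 from D1.
d <- ymd(c("2019-01-01", "2019-01-07", "2019-01-21",
           "2019-02-02", "2019-02-09", "2019-03-06"))
# accumulate() carries the last kept ("anchor") date forward and only jumps to the
# current date once the gap reaches 31 days; a row is kept when it equals its anchor.
anchor <- accumulate(d, ~ if (.y - .x >= 31) .y else .x)
d == anchor
#> [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE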

Expand rows of data frame date-time column with intervening date-times

I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
                                    "2018-01-13 01:00:00",
                                    "2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the dt column so that every hour between the minimum and maximum date-times is present, looking like:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are most preferred.
@DavidArenburg's comment is the way to go for a vector. However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
                                    "2018-01-13 01:00:00",
                                    "2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, 60 ** 2))
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
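For completeness, a sketch of the plain-vector route mentioned above (the referenced comment itself is not reproduced here, so this is only the usual seq() idiom):
# Every hour from the earliest to the latest timestamp, as a bare vector.
seq(min(dat$dt), max(dat$dt), by = "hour")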
