How to deduplicate date sequences across non-consecutive rows in R?

I want to flag the first date in every window of at least 31 days for each id in my data.
Data:
library(tidyverse)
library(lubridate)
library(tibbletime)
D1 <- tibble(id = c(12,12,12,12,12,12,10,10,10,10),
index_date=c("2019-01-01","2019-01-07","2019-01-21","2019-02-02",
"2019-02-09","2019-03-06","2019-01-05","2019-02-01","2019-02-02","2019-02-08"))
D1
# A tibble: 10 x 2
id index_date
<dbl> <chr>
1 12 2019-01-01
2 12 2019-01-07
3 12 2019-01-21
4 12 2019-02-02
5 12 2019-02-09
6 12 2019-03-06
7 10 2019-01-05
8 10 2019-02-01
9 10 2019-02-02
10 10 2019-02-08
The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:
temp <- D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start"),
keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)
Result:
id index_date keyed_to_index_date keep
<dbl> <date> <date> <lgl>
1 12 2019-01-01 2019-01-01 TRUE
2 12 2019-01-07 2019-01-01 FALSE
3 12 2019-01-21 2019-01-01 FALSE
4 12 2019-02-02 2019-02-02 TRUE
5 12 2019-02-09 2019-02-02 FALSE
6 12 2019-03-06 2019-03-06 TRUE
7 10 2019-01-05 2019-01-05 TRUE
8 10 2019-02-01 2019-02-01 TRUE
9 10 2019-02-02 2019-02-01 FALSE
10 10 2019-02-08 2019-02-01 FALSE
Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?
UPDATE: Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:
D2 <- tibble(id = c("A","A","A","B","B","B","B","B","C","C","C"),
index_date = c("2019-03-04","2019-03-05","2019-03-06",
"2019-03-01","2019-03-02","2019-03-04","2019-03-05","2019-03-06",
"2019-03-03","2019-03-04","2019-03-05"))
D2
id index_date
<chr> <chr>
1 A 2019-03-04
2 A 2019-03-05
3 A 2019-03-06
4 B 2019-03-01
5 B 2019-03-02
6 B 2019-03-04
7 B 2019-03-05
8 B 2019-03-06
9 C 2019-03-03
10 C 2019-03-04
11 C 2019-03-05
I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are Rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first `index_date` for a particular `id` or the first after a two-day skip.
Code:
t3 <- D2 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date,
period = '2 d',
side = "start",
start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(id, index_date)
Result:
> t3
# A time tibble: 11 x 4
# Index: index_date
# Groups: id [3]
id index_date keyed_to_index_date keep
<chr> <date> <date> <lgl>
1 A 2019-03-04 2019-03-04 TRUE
2 A 2019-03-05 2019-03-04 FALSE
3 A 2019-03-06 2019-03-06 TRUE
4 B 2019-03-01 2019-03-01 TRUE
5 B 2019-03-02 2019-03-01 FALSE
6 B 2019-03-04 2019-03-04 TRUE
7 B 2019-03-05 2019-03-05 TRUE
8 B 2019-03-06 2019-03-05 FALSE
9 C 2019-03-03 2019-03-03 TRUE
10 C 2019-03-04 2019-03-03 FALSE
11 C 2019-03-05 2019-03-05 TRUE
Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:
t4 <-
D2 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 |
accumulate(c(0, diff(index_date)), ~ if_else(.x >= 2,
.y,
.x + .y)
) >= 2
)
Result:
> t4
# A tibble: 11 x 3
# Groups: id [3]
id index_date keep
<chr> <date> <lgl>
1 A 2019-03-04 TRUE
2 A 2019-03-05 FALSE
3 A 2019-03-06 TRUE
4 B 2019-03-01 TRUE
5 B 2019-03-02 FALSE
6 B 2019-03-04 TRUE
7 B 2019-03-05 FALSE
8 B 2019-03-06 TRUE
9 C 2019-03-03 TRUE
10 C 2019-03-04 FALSE
11 C 2019-03-05 TRUE
What is wrong with the tibbletime approach in this example?

One option utilizing dplyr, lubridate and purrr could be:
D1 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31)
id index_date keep
<dbl> <date> <lgl>
1 12 2019-01-01 TRUE
2 12 2019-01-07 FALSE
3 12 2019-01-21 FALSE
4 12 2019-02-02 TRUE
5 12 2019-02-09 FALSE
6 12 2019-03-06 TRUE
7 10 2019-01-05 TRUE
8 10 2019-02-01 FALSE
9 10 2019-02-02 FALSE
10 10 2019-02-08 TRUE
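To see why this works, here is a minimal trace (an illustration, not part of the original answer) of the running-gap logic for id 10; the gaps vector is what c(0, diff(index_date)) produces for that group:
library(dplyr)
library(purrr)
gaps <- c(0, 27, 1, 6)
# the accumulator restarts at the new gap once 31 days have been reached,
# otherwise it keeps adding gaps to the running total
accumulate(gaps, ~ if_else(.x >= 31, .y, .x + .y))
#> [1]  0 27 28 34
Only the final total reaches 31, so row 10 is kept; row 7 is kept separately by the row_number() == 1 condition.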

You just need to add the start_date argument to collapse_index:
D1 %>%
mutate(index_date = ymd(index_date)) %>%
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keyed_to_index_date =
collapse_index(index_date, period = '31 d', side = "start", start_date = first(index_date)),
keep = index_date == keyed_to_index_date) %>%
arrange(desc(id), index_date)
#> # A time tibble: 10 x 4
#> # Index: index_date
#> # Groups: id [2]
#> id index_date keyed_to_index_date keep
#> <dbl> <date> <date> <lgl>
#> 1 12 2019-01-01 2019-01-01 TRUE
#> 2 12 2019-01-07 2019-01-01 FALSE
#> 3 12 2019-01-21 2019-01-01 FALSE
#> 4 12 2019-02-02 2019-02-02 TRUE
#> 5 12 2019-02-09 2019-02-02 FALSE
#> 6 12 2019-03-06 2019-03-06 TRUE
#> 7 10 2019-01-05 2019-01-05 TRUE
#> 8 10 2019-02-01 2019-01-05 FALSE
#> 9 10 2019-02-02 2019-01-05 FALSE
#> 10 10 2019-02-08 2019-02-08 TRUE
Created on 2020-09-11 by the reprex package (v0.3.0)
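As a sanity check (a sketch added for illustration, assuming the packages loaded above), you can carry the original row numbers of D1 through the pipeline and confirm that exactly rows 1, 4, 6, 7, and 10 end up flagged:
library(dplyr)
library(lubridate)
library(tibbletime)
checked <- D1 %>%
mutate(index_date = ymd(index_date), row = row_number()) %>%  # remember the original row
arrange(id, index_date) %>%
as_tbl_time(index_date) %>%
group_by(id) %>%
mutate(keep = index_date ==
collapse_index(index_date, period = '31 d', side = "start",
start_date = first(index_date))) %>%
ungroup()
sort(checked$row[checked$keep])
#> [1]  1  4  6  7 10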

You can use accumulate() from purrr.
D1 %>%
group_by(id) %>%
mutate(index_date = ymd(index_date),
keep = index_date == accumulate(index_date, ~ if(.y - .x >= 31) .y else .x))
# id index_date keep
# <dbl> <date> <lgl>
# 1 12 2019-01-01 TRUE
# 2 12 2019-01-07 FALSE
# 3 12 2019-01-21 FALSE
# 4 12 2019-02-02 TRUE
# 5 12 2019-02-09 FALSE
# 6 12 2019-03-06 TRUE
# 7 10 2019-01-05 TRUE
# 8 10 2019-02-01 FALSE
# 9 10 2019-02-02 FALSE
# 10 10 2019-02-08 TRUE
The iteration rule is the following:
1. 2019-01-07 - 2019-01-01 = 6 < 31 then return 2019-01-01
2. 2019-01-21 - 2019-01-01 = 20 < 31 then return 2019-01-01
3. 2019-02-02 - 2019-01-01 = 32 >= 31 then return (2019-02-02)*
4. 2019-02-09 - (2019-02-02)* = 7 < 31 then return 2019-02-02
5. etc.
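The same accumulation can be written in base R with Reduce(..., accumulate = TRUE); here is a sketch for the id 10 dates only, working on the numeric day counts to avoid Reduce() dropping the Date class when it simplifies its result:
library(lubridate)
d <- ymd(c("2019-01-05", "2019-02-01", "2019-02-02", "2019-02-08"))
n <- as.numeric(d)  # days since 1970-01-01
# carry the last flagged day forward; move it only after a gap of 31+ days
anchor <- Reduce(function(x, y) if (y - x >= 31) y else x, n, accumulate = TRUE)
n == anchor
#> [1]  TRUE FALSE FALSE  TRUE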

Related

Filter on sequential condition till another condition is met

How do I create a filter to meet two conditions:
Remove all rows where the value drops by more than 80% compared with the day before.
Keep removing the rows that follow the drop until the value rises above 50 again.
data <- tibble(date = seq.Date(as.Date("2021-01-01"),as.Date("2021-01-01")+14,1),
value = c(89,86,87,76,10,90,92,83,12,15,23,51,32, 88, 92)
) %>%
mutate(diff = (value-lag(value, default = first(value)))/lag(value, default = first(value)))
The final output should be:
date value diff
<date> <dbl> <dbl>
1 2021-01-01 89 0
2 2021-01-02 86 -0.0337
3 2021-01-03 87 0.0116
4 2021-01-04 76 -0.126
5 2021-01-06 90 8
6 2021-01-07 92 0.0222
7 2021-01-08 83 -0.0978
8 2021-01-12 51 1.22
9 2021-01-13 32 -0.373
10 2021-01-14 88 1.75
11 2021-01-15 92 0.0455
Here is an approach using data.table::rleid:
library(dplyr)
data <- tibble(
date = seq.Date(as.Date("2021-01-01"), as.Date("2021-01-01") + 14, 1),
value = c(43,47,87,76,10,90,92,83,12,15,23,51,32, 88, 92)
)
data %>%
mutate(diff = (value - lag(value, default = first(value))) /
lag(value, default = first(value))) %>%
mutate(to_remove = diff <= -.8) %>%
# assign a run id (rleid) so that the rows after each removed row form a new group
mutate(group_remove = data.table::rleid(to_remove)) %>%
# within each subsequent group, keep rows flagged (to_continue_remove = TRUE)
# as long as no row in the group has reached a value of 50 yet
group_by(group_remove) %>%
mutate(to_continue_remove = (group_remove > 1) & !to_remove &
cumsum(value >= 50) == 0) %>%
# drop the rows that met the first condition (a drop of more than 80%, i.e. diff <= -0.8)
filter(!to_remove) %>%
# then drop the rows that follow a removal and have not yet risen back to 50
filter(!to_continue_remove)
#> # A tibble: 11 x 6
#> # Groups: group_remove [3]
#> date value diff to_remove group_remove to_continue_remove
#> <date> <dbl> <dbl> <lgl> <int> <lgl>
#> 1 2021-01-01 43 0 FALSE 1 FALSE
#> 2 2021-01-02 47 0.0930 FALSE 1 FALSE
#> 3 2021-01-03 87 0.851 FALSE 1 FALSE
#> 4 2021-01-04 76 -0.126 FALSE 1 FALSE
#> 5 2021-01-06 90 8 FALSE 3 FALSE
#> 6 2021-01-07 92 0.0222 FALSE 3 FALSE
#> 7 2021-01-08 83 -0.0978 FALSE 3 FALSE
#> 8 2021-01-12 51 1.22 FALSE 5 FALSE
#> 9 2021-01-13 32 -0.373 FALSE 5 FALSE
#> 10 2021-01-14 88 1.75 FALSE 5 FALSE
#> 11 2021-01-15 92 0.0455 FALSE 5 FALSE
Created on 2021-05-10 by the reprex package (v2.0.0)
Updated: adjusted the solution so it does not remove rows from the first group when their starting values are below 50.
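For reference, a minimal illustration of what data.table::rleid() does: it assigns a new id each time the value changes, so consecutive runs of identical values share an id:
library(data.table)
rleid(c(FALSE, FALSE, TRUE, FALSE, FALSE))
#> [1] 1 1 2 3 3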

Determine the number of processes running each day and the average days since those processes commenced, in R

I have a large dataset of processes (their IDs), start dates and corresponding end dates.
What I want is divided into two parts. Firstly, how many processes are running each day. Secondly, the mean number of days the running processes have been running.
The sample dataset is:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
df
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
arrange(dates, event) %>%
mutate(processes = ifelse(event == 'Start', 1, -1),
processes = cumsum(processes)) %>%
select(-Process, -event) %>%
complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output is a mean-days column like the one in the screenshot referenced in the original post (not reproduced here).
A tidyverse approach would be preferred, please.
Here is one approach:
library(tidyverse)
df %>%
#Convert to date
mutate(across(c(Start, End), lubridate::dmy),
#Create a sequence of dates from start to end
Dates = map2(Start, End, seq, by = 'day')) %>%
#Get data in long format
unnest(Dates) %>%
#Remove columns
select(-Start, -End) %>%
#For each process
group_by(Process) %>%
#Count number of days spent on it
mutate(days_spent = row_number() - 1) %>%
#For each date
group_by(Dates) %>%
#Count number of process running and average days
summarise(process = n(),
mean_days = mean(days_spent))
This returns:
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
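To verify the mean_days convention (days_spent is 0 on a process's first day), take 2020-01-08 as an example: P001 is 7 days in, P002 is 6 days in, and P004 starts that day, which matches the 4.33 above:
mean(c(7, 6, 0))
#> [1] 4.333333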

Select rows based on multiple conditions from two independent databases

I have two independent datasets; one contains the event date. Each ID has only one "Eventdate". As follows:
data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In the other dataset, one ID can have multiple event names (Eventcode) and corresponding event dates (Eventdate). As follows:
data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
The two datasets are linked by ID, but the IDs in the two datasets do not all overlap.
I would like to select cases in data2 with these conditions:
Match by ID
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!
You can try this approach:
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = {
inds <- Eventdate.y >= Eventdate.x
val <- sum(inds, na.rm = TRUE)
if(val == 1) Eventcode[inds]
else if(val > 1) sample(Eventcode[inds], 1)
else NA_real_
})
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
The complicated logic for Eventcode is there for randomness; if you are OK selecting the 1st value, as for Eventdate, you can simplify it to:
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
Does this work:
library(dplyr)
data1 %>%
rename(Eventdate_dat1 = Eventdate) %>%
left_join(data2, by = 'ID') %>%
group_by(ID) %>%
filter(Eventdate >= Eventdate_dat1) %>%
mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate), TRUE ~ Eventdate),
Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode), TRUE ~ Eventcode)) %>%
distinct() %>%
right_join(data1, by = 'ID') %>%
select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups: ID [6]
ID Eventdate Eventdate.data2 Eventcode
<dbl> <chr> <chr> <dbl>
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01 NA NA
5 5 2019-05-01 NA NA
6 6 2019-06-01 NA NA
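A third sketch, using dplyr's slice_min() (available from dplyr 1.0.0; note this keeps the first row on ties deterministically rather than sampling, so it relaxes the randomness requirement):
library(dplyr)
data2 %>%
inner_join(data1, by = 'ID', suffix = c('', '.data1')) %>%
filter(Eventdate >= Eventdate.data1) %>%  # ISO date strings compare correctly as text
group_by(ID) %>%
slice_min(Eventdate, with_ties = FALSE) %>%  # earliest qualifying event per ID
ungroup() %>%
select(ID, Eventdate.data2 = Eventdate, Eventcode) %>%
right_join(data1, by = 'ID') %>%  # bring back the IDs with no qualifying event
select(ID, Eventdate, Eventdate.data2, Eventcode) %>%
arrange(ID)
#> ID Eventdate  Eventdate.data2 Eventcode
#> 1  2019-01-01 2019-01-01            201
#> 2  2019-02-01 2019-02-11            201
#> 3  2019-03-01 2019-03-01            205
#> 4  2019-04-01 NA                     NA
#> 5  2019-05-01 NA                     NA
#> 6  2019-06-01 NA                     NA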

How to show missing dates when applying a rolling function

Suppose I have a data df of some insurance policies.
library(tidyverse)
library(lubridate)
#Example data
d <- as.Date("2020-01-01", format = "%Y-%m-%d")
set.seed(50)
df <- data.frame(id = 1:10,
activation_dt = round(runif(10)*100,0) +d,
expiry_dt = d+round(runif(10)*100,0)+c(rep(180,5), rep(240,5)))
> df
id activation_dt expiry_dt
1 1 2020-03-12 2020-08-07
2 2 2020-02-14 2020-07-26
3 3 2020-01-21 2020-09-01
4 4 2020-03-18 2020-07-07
5 5 2020-02-21 2020-07-27
6 6 2020-01-05 2020-11-04
7 7 2020-03-11 2020-11-20
8 8 2020-03-06 2020-10-03
9 9 2020-01-05 2020-09-04
10 10 2020-01-12 2020-09-14
I want to see how many policies were active during each month. That I have done by the following method.
# Getting required result
df %>% arrange(activation_dt) %>%
pivot_longer(cols = c(activation_dt, expiry_dt),
names_to = "event",
values_to = "event_date") %>%
mutate(dummy = ifelse(event == "activation_dt", 1, -1)) %>%
mutate(dummy2 = floor_date(event_date, "month")) %>%
arrange(dummy2) %>% group_by(dummy2) %>%
summarise(dummy=sum(dummy)) %>%
mutate(dummy = cumsum(dummy)) %>%
select(dummy2, dummy)
# A tibble: 8 x 2
dummy2 dummy
<date> <dbl>
1 2020-01-01 4
2 2020-02-01 6
3 2020-03-01 10
4 2020-07-01 7
5 2020-08-01 6
6 2020-09-01 3
7 2020-10-01 2
8 2020-11-01 0
Now I am having a problem: how do I deal with the missing months, e.g. April 2020 to June 2020?
A data.table solution:
generate the months sequence
use non-equi joins to find the policies active in each month and count them
library(lubridate)
library(data.table)
setDT(df)
months <- seq(lubridate::floor_date(min(df$activation_dt), 'month'),
lubridate::floor_date(max(df$expiry_dt), 'month'), by = 'month')
months <- data.table(months)
df[,c("activation_dt_month","expiry_dt_month"):=.(lubridate::floor_date(activation_dt,'month'),
lubridate::floor_date(expiry_dt,'month'))]
df[months, .(months),on = .(activation_dt_month<=months,expiry_dt_month>=months)][,.(nb=.N),by=months]
months nb
1: 2020-01-01 4
2: 2020-02-01 6
3: 2020-03-01 10
4: 2020-04-01 10
5: 2020-05-01 10
6: 2020-06-01 10
7: 2020-07-01 10
8: 2020-08-01 7
9: 2020-09-01 6
10: 2020-10-01 3
11: 2020-11-01 2
Here is an alternative tidyverse/lubridate solution in case you are interested. The data.table version will be faster, but this should give you the correct results with gaps in months.
First, use map2 to create a sequence of months between activation and expiry for each row of data. This allows you to group by month/year and count the number of active policies for each month.
library(tidyverse)
library(lubridate)
df %>%
mutate(month = map2(floor_date(activation_dt, "month"),
floor_date(expiry_dt, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2020-01 4
2 2020-02 6
3 2020-03 10
4 2020-04 10
5 2020-05 10
6 2020-06 10
7 2020-07 10
8 2020-08 7
9 2020-09 6
10 2020-10 3
11 2020-11 2
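If you would rather patch the question's own cumulative-sum approach, a sketch (matching the question's own monthly counts, which differ slightly from the data.table answer's convention for the expiry month) is to summarise per month and let tidyr's complete() insert the missing months before carrying the running count forward:
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(c(activation_dt, expiry_dt), names_to = 'event', values_to = 'event_date') %>%
mutate(event_date = if_else(event == 'expiry_dt', event_date + 1, event_date),
delta = if_else(event == 'activation_dt', 1, -1),
month = floor_date(event_date, 'month')) %>%
count(month, wt = delta, name = 'delta') %>%
mutate(active = cumsum(delta)) %>%
# insert the event-free months, then carry the last count forward
complete(month = seq.Date(min(month), max(month), by = 'month')) %>%
fill(active) %>%
select(month, active)
This reproduces the table from the question, with April to June 2020 filled in at 10.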

Aggregate a tibble based on consecutive values in a boolean column

I've got a fairly straightforward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.
I've got a summary table, df, for an hourly time-series dataset where each observation belongs to a group.
I want to merge some of those groups, based on a boolean column in the summary table.
The boolean column, merge_with_next, indicates whether a given group should be merged with the next group (one row down).
The merging effectively occurs by updating the end value and removing rows:
library(dplyr)
# Demo data
df <- tibble(
group = 1:12,
start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"),
merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)
df
#> # A tibble: 12 x 4
#> group start end merge_with_next
#> <int> <dttm> <dttm> <lgl>
#> 1 1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE
#> 2 2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE
#> 3 3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE
#> 4 4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE
#> 5 5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE
#> 6 6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE
#> 7 7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE
#> 8 8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE
#> 9 9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE
#> 10 10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE
#> 11 11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE
#> 12 12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE
# Desired result
desired <- tibble(
group = c(1, 4, 7, 9),
start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)
desired
#> # A tibble: 4 x 3
#> group start end
#> <dbl> <chr> <chr>
#> 1 1 2019-01-01 00:00 2019-01-03 23:59
#> 2 4 2019-01-04 00:00 2019-01-06 23:59
#> 3 7 2019-01-07 00:00 2019-01-09 23:59
#> 4 9 2019-01-10 00:00 2019-01-12 23:59
Created on 2019-03-22 by the reprex package (v0.2.1)
I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant, I only care about the start and end columns from the result.
We can use dplyr and start a new group each time a FALSE in the merge_with_next column ends a run, then select the first value from start and the last value from end for each group.
library(dplyr)
df %>%
group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
summarise(group = first(group),
start = first(start),
end = last(end)) %>%
ungroup() %>%
select(-temp)
# group start end
# <int> <dttm> <dttm>
#1 1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2 4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3 7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4 10 2019-01-10 00:00:00 2019-01-12 23:59:00
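To see how the grouping key works: !lag(merge_with_next, default = TRUE) is TRUE exactly on the row following a FALSE, so the cumulative sum starts a new group right after each run ends. A minimal trace on the demo flags:
library(dplyr)
flags <- rep(c(TRUE, TRUE, FALSE), 2)
cumsum(!lag(flags, default = TRUE))
#> [1] 0 0 0 1 1 1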
