Find overlapping intervals in groups and retain largest non-overlapping periods - r

Issue
I have a grouped dataframe with overlapping intervals (date as ymd). I want to retain only the largest non-overlapping intervals in each group.
Example data
# Packages
library(tidyverse)
library(lubridate)
# Example data
df <- tibble(
group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
start = as_date(
c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30","2019-11-20")),
end = as_date(
c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10","2019-12-31"))) %>%
mutate(intval = interval(start, end),
intval_length = intval / days(1))
df
#> # A tibble: 9 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-01-10 2019-02-07 2019-01-10 UTC--2019-02-07 UTC 28
#> 2 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 3 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 4 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 5 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 6 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 7 3 2019-10-01 2019-10-07 2019-10-01 UTC--2019-10-07 UTC 6
#> 8 3 2019-11-30 2019-12-10 2019-11-30 UTC--2019-12-10 UTC 10
#> 9 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
# Goal
# Row: 1 and 2; 6 to 9 have overlaps; Keep rows with largest intervals (in days)
df1 <- df[-c(1, 7, 8),]
df1
#> # A tibble: 6 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 2 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 3 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 4 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 5 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 6 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
Current approach
I found a related question in another thread (see: Find dates within a period interval by group). However, the respective solution identifies all overlapping rows by group. In this way, I can't identify the largest non-overlapping intervals.
df$overlap <- unlist(tapply(df$intval, #loop through intervals
df$group, #grouped by id
function(x) rowSums(outer(x,x,int_overlaps)) > 1))
As an example, consider group 3 in my example data. Here row 6/7 and 8/9 overlap. With row 6 and 9 being the largest non-overlapping periods, I would like to remove row 7 and 8.
I would greatly appreciate it if someone could point me to a solution.

Having searched for related problems on Stack Overflow, I found that the approaches in "Collapse and merge overlapping time intervals" and "How to flatten / merge overlapping time periods" could be adapted to my issue.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
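To make the indx step above easier to follow, here is a minimal base-R sketch that rebuilds the cluster index by hand for group 3 only. The variable names (running_end, new_cluster, keep) are illustrative and not part of the original answer, and the tie handling differs slightly: this version keeps every tied row, whereas the dplyr pipeline keeps one.

```r
# Group 3 from the example data, already sorted by start date
start <- as.Date(c("2019-10-01", "2019-10-01", "2019-11-20", "2019-11-30"))
end   <- as.Date(c("2019-11-06", "2019-10-07", "2019-12-31", "2019-12-10"))

# Latest end date seen so far (kept numeric: days since 1970-01-01)
running_end <- cummax(as.numeric(end))

# A new cluster begins when the next start lies after every earlier end
new_cluster <- as.numeric(start)[-1] > running_end[-length(running_end)]
indx <- c(0, cumsum(new_cluster))
indx
#> [1] 0 0 1 1

# Keep the longest interval in each cluster (ties would keep every tied row)
len  <- as.numeric(end - start)
keep <- len == ave(len, indx, FUN = max)
keep
#> [1]  TRUE FALSE  TRUE FALSE
```

The two TRUE positions correspond to rows 6 and 9 of the original data, which matches the desired output for group 3.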

Related

if_else with sequence of conditions

I have the following data:
library(tidyverse)
library(lubridate)
df <- tibble(date = as_date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15", "2020-09-23", "2020-11-25", "2021-03-03")))
# A tibble: 7 x 1
date
<date>
1 2019-11-20
2 2019-11-27
3 2020-04-01
4 2020-04-15
5 2020-09-23
6 2020-11-25
7 2021-03-03
I also have an ordered comparison vector of dates:
comparison <- seq(as_date("2019-12-01"), today(), by = "months") - 1
I now want to compare my dates in df to those comparison dates and do something like:
if date in df is < comparison[1], then assign a 1
if date in df is < comparison[2], then assign a 2
and so on.
I know I could do it with a case_when, e.g.
df %>%
mutate(new_var = case_when(date < comparison[1] ~ 1,
date < comparison[2] ~ 2))
(of course filling this up with all comparisons).
However, this would require writing out all sequential conditions manually, and I'm wondering whether I could automate it. I thought about creating a match lookup first (i.e. take the comparison vector, then add the respective new_var number (i.e. 1, 2, and so on)) and then match it against my data, but I only know how to do that for exact matches and don't know how I can add the "smaller than" condition.
Expected result:
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
You can use findInterval as follows:
df %>% mutate(new_var = findInterval(date, comparison) + 1)
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
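As a rough illustration of why this works: findInterval() returns, for each date, how many comparison dates are less than or equal to it, so adding 1 gives the index of the first comparison date that is strictly greater. The sketch below is base R only and uses a fixed-length comparison vector (length.out = 18 is an assumption made here so the result does not depend on today()); the last date is an extra boundary case that is not in the question.

```r
dates <- as.Date(c("2019-11-20", "2020-04-01", "2021-03-03", "2019-11-30"))

# Fixed-length stand-in for the question's comparison vector (month ends)
comparison <- seq(as.Date("2019-12-01"), by = "month", length.out = 18) - 1

# Number of comparison dates that are <= each date
n_passed <- findInterval(dates, comparison)

# Index of the first comparison date that is strictly greater
new_var <- n_passed + 1
new_var
#> [1]  1  6 17  2
```

The last value shows the boundary behaviour: 2019-11-30 equals comparison[1], so it is not "< comparison[1]" and gets a 2, which is consistent with the strict comparisons in the case_when version.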

R filtering by month - Harvard Data science

Questions:
Load the brexit_polls data frame from dslabs:
How many polls had a start date (startdate) in April (month number 4)?
The startdate column in the brexit_polls data set contains dates from multiple years, but I want to filter only for the month of April.
I have tried using a regex: april <- brexit_polls %>% regex(startdate,"....-04-..")
I also tried using the tibbletime package, but it wouldn't load in my R session. Any suggestions?
dat <- data.frame(startdate = seq(as.Date("2021-01-01"), len=30, by="week"))
head(dat)
# startdate
# 1 2021-01-01
# 2 2021-01-08
# 3 2021-01-15
# 4 2021-01-22
# 5 2021-01-29
# 6 2021-02-05
library(dplyr)
dat %>%
filter("04" == format(startdate, format="%m"))
# startdate
# 1 2021-04-02
# 2 2021-04-09
# 3 2021-04-16
# 4 2021-04-23
# 5 2021-04-30
dat %>%
group_by(month = format(startdate, format="%m")) %>%
tally()
# # A tibble: 7 x 2
# month n
# <chr> <int>
# 1 01 5
# 2 02 4
# 3 03 4
# 4 04 5
# 5 05 4
# 6 06 4
# 7 07 4
dat %>%
group_by(month = format(startdate, format="%m")) %>%
tally() %>%
filter(month == "04")
# # A tibble: 1 x 2
# month n
# <chr> <int>
# 1 04 5
I inferred dplyr, but this works in base as well:
subset(dat, format(startdate, format="%m") == "04")
# startdate
# 14 2021-04-02
# 15 2021-04-09
# 16 2021-04-16
# 17 2021-04-23
# 18 2021-04-30
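If only the count is needed, a logical vector can be summed directly. The sketch below reuses the answer's dat and also shows a regex variant; as far as I can tell, the attempt in the question failed because regex() only builds a pattern modifier and does no matching itself, so the pattern has to go through a matcher such as grepl() applied to the dates as text.

```r
dat <- data.frame(startdate = seq(as.Date("2021-01-01"), length.out = 30, by = "week"))

# Compare the month component directly and count the matches
is_april <- format(dat$startdate, "%m") == "04"
n_april <- sum(is_april)
n_april
#> [1] 5

# The regex idea from the question also works once the dates are text
n_april_regex <- sum(grepl("^[0-9]{4}-04-", as.character(dat$startdate)))
n_april_regex
#> [1] 5
```

For the actual exercise, the same expressions should apply with brexit_polls$startdate in place of dat$startdate, assuming that column is stored as a Date.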

How to convert a single date column into three individual columns (y, m, d)?

I have a large dataset with thousands of dates in the ymd format. I want to convert this column so that there are three individual columns for year, month, and day. There are literally thousands of dates, so I am trying to do this with a single piece of code for the entire dataset.
You can use the year(), month(), and day() extractors in lubridate for this. Here's an example:
library('dplyr')
library('tibble')
library('lubridate')
## create some data
df <- tibble(date = seq(ymd(20190101), ymd(20191231), by = '7 days'))
which yields
> df
# A tibble: 53 x 1
date
<date>
1 2019-01-01
2 2019-01-08
3 2019-01-15
4 2019-01-22
5 2019-01-29
6 2019-02-05
7 2019-02-12
8 2019-02-19
9 2019-02-26
10 2019-03-05
# … with 43 more rows
Then mutate df using the relevant extractor function:
df <- mutate(df,
year = year(date),
month = month(date),
day = day(date))
This results in:
> df
# A tibble: 53 x 4
date year month day
<date> <dbl> <dbl> <int>
1 2019-01-01 2019 1 1
2 2019-01-08 2019 1 8
3 2019-01-15 2019 1 15
4 2019-01-22 2019 1 22
5 2019-01-29 2019 1 29
6 2019-02-05 2019 2 5
7 2019-02-12 2019 2 12
8 2019-02-19 2019 2 19
9 2019-02-26 2019 2 26
10 2019-03-05 2019 3 5
# … with 43 more rows
If you only want the new three columns, use transmute() instead of mutate().
Using lubridate but without having to specify a separator:
library(tidyverse)
df <- tibble(d = c('2019/3/18','2018/10/29'))
df %>%
mutate(
date = lubridate::ymd(d),
year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date)
)
Note that you can change the first entry from ymd to fit other formats.
A slightly different tidyverse solution that requires less code could be:
Code
tibble(date = "2018-05-01") %>%
mutate_at(vars(date), lst(year, month, day))
Result
# A tibble: 1 x 4
date year month day
<chr> <dbl> <dbl> <int>
1 2018-05-01 2018 5 1
#Data
d = data.frame(date = c("2019-01-01", "2019-02-01", "2012/03/04"))
library(lubridate)
cbind(d,
read.table(header = FALSE,
sep = "-",
text = as.character(ymd(d$date))))
# date V1 V2 V3
#1 2019-01-01 2019 1 1
#2 2019-02-01 2019 2 1
#3 2012/03/04 2012 3 4
OR
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
mutate(date2 = as.character(ymd(date))) %>%
separate(date2, c("year", "month", "day"), "-")
# date year month day
#1 2019-01-01 2019 01 01
#2 2019-02-01 2019 02 01
#3 2012/03/04 2012 03 04
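For completeness, the same split can be sketched in base R without lubridate by formatting each component and converting it to an integer. This assumes the column is already a Date (or has been parsed with as.Date()); the name parts is just for illustration.

```r
d <- as.Date(c("2019-01-01", "2019-02-01", "2012-03-04"))

# format() extracts each component as text; as.integer() drops leading zeros
parts <- data.frame(
  date  = d,
  year  = as.integer(format(d, "%Y")),
  month = as.integer(format(d, "%m")),
  day   = as.integer(format(d, "%d"))
)
parts
```

Unlike the separate() approach, this yields integer columns rather than zero-padded strings.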

How to sum time intervals by days: How many intervals include a given day?

I have a very large data set of time intervals (start and end date values), and need to calculate for every single day within the entire range of those dates how many of the intervals include the date.
In essence, I want to know how many people are in jail any given day. I have when they entered, and when they left. I need to be able to determine how many were in jail on every single day over a period of many years.
Example data:
require(tidyverse)
require(lubridate)
x <- tribble(~start, ~end,
today()-5, today()-3,
today()-4, today()-2,
today()-3, today()-1)
x <- x %>% mutate(dtint = interval(start, end))
x
#> # A tibble: 3 x 3
#> start end dtint
#> <date> <date> <Interval>
#> 1 2019-10-13 2019-10-15 2019-10-13 UTC--2019-10-15 UTC
#> 2 2019-10-14 2019-10-16 2019-10-14 UTC--2019-10-16 UTC
#> 3 2019-10-15 2019-10-17 2019-10-15 UTC--2019-10-17 UTC
mydays <- seq(min(x$start), max(x$end), by = "day") %>% enframe(name = NULL, value = "eachday")
mydays
#> # A tibble: 5 x 1
#> eachday
#> <date>
#> 1 2019-10-13
#> 2 2019-10-14
#> 3 2019-10-15
#> 4 2019-10-16
#> 5 2019-10-17
#Expected result:
mydays %>% add_column(expected_result = c(1, 2, 3, 2, 1))
#> # A tibble: 5 x 2
#> eachday expected_result
#> <date> <dbl>
#> 1 2019-10-13 1
#> 2 2019-10-14 2
#> 3 2019-10-15 3
#> 4 2019-10-16 2
#> 5 2019-10-17 1
Created on 2019-10-18 by the reprex package (v0.3.0)
I will also need the ability to do this on grouped tibbles so I can calculate things like totals by demographic properties.
Is there an efficient way to do this in tidyverse/lubridate?
Here's an approach of how you could track totals using the cumulative count across each grouped demographic.
# Example data
library(tidyverse)
set.seed(42)
x <- tibble(demographic = sample(LETTERS[1:3], 100, replace = T),
start = as.Date("2019-01-01") + runif(100, 0, 30),
end = start + runif(100, 1, 50))
## A tibble: 6 x 3
# demographic start end
# <chr> <date> <date>
#1 C 2019-01-19 2019-03-05
#2 C 2019-01-07 2019-02-02
#3 A 2019-01-07 2019-02-19
#4 C 2019-01-12 2019-02-04
#5 B 2019-01-29 2019-02-07
#6 B 2019-01-29 2019-02-21
First we bring the data into long format. Then we count each start date as +1 and each end date as -1. Finally, we group by demographic and take the cumulative sum of those changes.
x %>%
pivot_longer(-demographic, names_to = "col", values_to = "date") %>%
mutate(change = if_else(col == "start", 1, -1)) %>%
arrange(demographic, date) %>%
group_by(demographic) %>%
mutate(count = cumsum(change)) %>%
ungroup() %>%
ggplot(aes(date, count, color = demographic)) +
geom_step()
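The same +1/-1 idea can be sketched in base R on the three intervals from the question (with fixed dates instead of today()). One assumption differs from the plot above: here the end date counts as a day in jail, so the -1 is applied on the day after each end date. The helper names (i_in, i_out, change) are illustrative only.

```r
start <- as.Date(c("2019-10-13", "2019-10-14", "2019-10-15"))
end   <- as.Date(c("2019-10-15", "2019-10-16", "2019-10-17"))
days  <- seq(min(start), max(end), by = "day")

# Position of each entry day, and of the day after each exit, on the day grid
first <- as.numeric(min(start))
i_in  <- as.integer(as.numeric(start) - first) + 1L
i_out <- as.integer(as.numeric(end) - first) + 2L

# Net change per day (+1 per entry, -1 per exit), then a running total
nbins  <- length(days) + 1L
change <- tabulate(i_in, nbins) - tabulate(i_out, nbins)
count  <- cumsum(change)[seq_along(days)]
data.frame(eachday = days, count = count)

# Cross-check against a direct per-day comparison
brute <- sapply(seq_along(days), function(i) sum(start <= days[i] & end >= days[i]))
all(count == brute)
#> [1] TRUE
```

Because this needs only one pass over the intervals plus a cumulative sum, it should scale better than testing every interval against every day, which matters for a very large data set.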
One option involving dplyr, tidyr and lubridate could be:
x %>%
mutate(eachday = list(seq.Date(min(start), max(end), by = "1 day"))) %>%
unnest(eachday) %>%
group_by(eachday) %>%
summarise(overlap = sum(int_overlaps(dtint, interval(eachday, eachday))))
eachday overlap
<date> <int>
1 2019-10-13 1
2 2019-10-14 2
3 2019-10-15 3
4 2019-10-16 2
5 2019-10-17 1
First, it creates a list of dates between the first and last date in the data and unnests it. Then it groups by date and sums the overlaps between each dtint interval and the single-day interval for that date.
Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT), create the sequence column from the min of 'start' to the max of 'end', then join it with the original dataset using a non-equi join and get the count (.N):
library(data.table)
setDT(x)[x[, .(eachday = seq(min(start), max(end), by = '1 day'))],
.(eachday, overlap = .N), on = .(start <= eachday,
end >= eachday ), by = .EACHI][, .(eachday, overlap)]
# eachday overlap
#1: 2019-10-13 1
#2: 2019-10-14 2
#3: 2019-10-15 3
#4: 2019-10-16 2
#5: 2019-10-17 1

Expand start and end dates to unbalanced monthly panel with dplyr

I have start and end dates for events that I want to expand into a monthly panel, and I wanted to know if there was any tool in dplyr for solving this problem. The following code does what I want to do with ddply(). It first creates an example tibble data.frame (called "wide") where "id" represents an individual and "HomeNum" is an event for that individual. The next line creates a "date" variable that is a monthly series from "StartDate" to "FinishDate" within each "id" by "HomeNum" group.
library(plyr)
library(dplyr)
library(tibble)
wide =
tibble(
id = c(1, 1, 2, 2, 2),
HomeNum = c(0,1,0,1,2),
StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
)
panel =
ddply(wide,
~id+HomeNum,
transform,
date = seq.Date(StartDate, FinishDate, by = "month")
)
I assume that dplyr, as "the next iteration of plyr", must have some way to implement a similar solution (and output a tibble), but the following did not work:
panel =
wide %>%
group_by(id, HomeNum) %>%
mutate(date = seq.Date(StartDate, FinishDate, by = "month"))
and returned
Error in mutate_impl(.data, dots) :
Column `date` must be length 1 (the group size), not 2
Frankly, I am surprised that the ddply() solution works and does not throw a similar error.
My implementation with ddply() is similar to answers to this question.
You can coerce the elements of date to lists and unnest.
library(tidyverse)
wide %>%
group_by(id, HomeNum) %>%
mutate(date = list(seq.Date(StartDate, FinishDate, by = "month"))) %>%
unnest(date)
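To show what the list-column and unnest steps do under the hood, here is a rough base-R equivalent: build one monthly sequence per row, repeat each row once per month, and attach the flattened dates. It uses a plain data.frame instead of a tibble, and the names months and panel are chosen here for illustration.

```r
wide <- data.frame(
  id = c(1, 1, 2, 2, 2),
  HomeNum = c(0, 1, 0, 1, 2),
  StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
  FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
)

# One monthly sequence per row, held in a plain list (the "list column")
months <- lapply(seq_len(nrow(wide)), function(i) {
  seq(wide$StartDate[i], wide$FinishDate[i], by = "month")
})

# Repeat each original row once per month, then attach the dates ("unnest")
panel <- wide[rep(seq_len(nrow(wide)), lengths(months)), ]
panel$date <- do.call(c, months)
rownames(panel) <- NULL
nrow(panel)
#> [1] 50
```

This also explains the original error: mutate() expects one value per row, while seq.Date() returns a vector of varying length for each row, so the sequence has to be wrapped in a list first.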
Using unnest on a list of dates was an issue in previous versions of tidyr. I got this same error and found a workaround, but then no longer needed the workaround once I updated to tidyr 0.8.1. It's an issue that's documented in a few issues on GitHub—#407 and #450 were ones I looked at.
If you have a version that can't unnest dates, you can build on @hpesoj626's answer by converting the dates to strings, unnesting, then converting the strings back to dates.
library(tidyverse)
wide <- tibble(
id = c(1, 1, 2, 2, 2),
HomeNum = c(0,1,0,1,2),
StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
)
# with previous versions of tidyr
wide %>%
group_by(id, HomeNum) %>%
mutate(date = list(seq.Date(StartDate, FinishDate, by = "month") %>% as.character())) %>%
tidyr::unnest() %>%
mutate(date = as.Date(date))
#> # A tibble: 50 x 5
#> # Groups: id, HomeNum [5]
#> id HomeNum StartDate FinishDate date
#> <dbl> <dbl> <date> <date> <date>
#> 1 1 0 2001-01-01 2001-02-01 2001-01-01
#> 2 1 0 2001-01-01 2001-02-01 2001-02-01
#> 3 1 1 2001-03-01 2002-05-01 2001-03-01
#> 4 1 1 2001-03-01 2002-05-01 2001-04-01
#> 5 1 1 2001-03-01 2002-05-01 2001-05-01
#> 6 1 1 2001-03-01 2002-05-01 2001-06-01
#> 7 1 1 2001-03-01 2002-05-01 2001-07-01
#> 8 1 1 2001-03-01 2002-05-01 2001-08-01
#> 9 1 1 2001-03-01 2002-05-01 2001-09-01
#> 10 1 1 2001-03-01 2002-05-01 2001-10-01
#> # ... with 40 more rows
Otherwise, a solution like the one they posted should work:
# with tidyr 0.8.1
wide %>%
group_by(id, HomeNum) %>%
mutate(date = list(seq.Date(StartDate, FinishDate, by = "month"))) %>%
tidyr::unnest()
#> # A tibble: 50 x 5
#> # Groups: id, HomeNum [5]
#> id HomeNum StartDate FinishDate date
#> <dbl> <dbl> <date> <date> <date>
#> 1 1 0 2001-01-01 2001-02-01 2001-01-01
#> 2 1 0 2001-01-01 2001-02-01 2001-02-01
#> 3 1 1 2001-03-01 2002-05-01 2001-03-01
#> 4 1 1 2001-03-01 2002-05-01 2001-04-01
#> 5 1 1 2001-03-01 2002-05-01 2001-05-01
#> 6 1 1 2001-03-01 2002-05-01 2001-06-01
#> 7 1 1 2001-03-01 2002-05-01 2001-07-01
#> 8 1 1 2001-03-01 2002-05-01 2001-08-01
#> 9 1 1 2001-03-01 2002-05-01 2001-09-01
#> 10 1 1 2001-03-01 2002-05-01 2001-10-01
#> # ... with 40 more rows
Another option is to gather the data into a long format, where observations have a type column showing whether it's the start or finish date. Then use complete to fill in missing dates between each group's minimum and maximum dates. Gathering keeps the type column, which gets filled in as NA for the dates that are added. You could then drop the type column if it's no longer useful.
wide %>%
gather(key = type, value = date, StartDate, FinishDate) %>%
group_by(id, HomeNum) %>%
complete(date = seq.Date(min(date), max(date), by = "month"))
#> # A tibble: 50 x 4
#> # Groups: id, HomeNum [5]
#> id HomeNum date type
#> <dbl> <dbl> <date> <chr>
#> 1 1 0 2001-01-01 StartDate
#> 2 1 0 2001-02-01 FinishDate
#> 3 1 1 2001-03-01 StartDate
#> 4 1 1 2001-04-01 <NA>
#> 5 1 1 2001-05-01 <NA>
#> 6 1 1 2001-06-01 <NA>
#> 7 1 1 2001-07-01 <NA>
#> 8 1 1 2001-08-01 <NA>
#> 9 1 1 2001-09-01 <NA>
#> 10 1 1 2001-10-01 <NA>
#> # ... with 40 more rows
Created on 2018-05-22 by the reprex package (v0.2.0).

Resources