I have a large data set of time periods, defined by a 'start' and an 'end' column. Some of the periods overlap.
I would like to combine (flatten / merge / collapse) all overlapping time periods to have one 'start' value and one 'end' value.
Some example data:
ID start end
1 A 2013-01-01 2013-01-05
2 A 2013-01-01 2013-01-05
3 A 2013-01-02 2013-01-03
4 A 2013-01-04 2013-01-06
5 A 2013-01-07 2013-01-09
6 A 2013-01-08 2013-01-11
7 A 2013-01-12 2013-01-15
Desired result:
ID start end
1 A 2013-01-01 2013-01-06
2 A 2013-01-07 2013-01-11
3 A 2013-01-12 2013-01-15
What I have tried:
require(dplyr)
data <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "A"),
start = structure(c(1356998400, 1356998400, 1357084800, 1357257600,
1357516800, 1357603200, 1357948800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end = structure(c(1357344000, 1357344000, 1357171200,
1357430400, 1357689600, 1357862400, 1358208000), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("ID", "start", "end"), row.names = c(NA,
-7L), class = "data.frame")
remove.overlaps <- function(data) {
  data2 <- data
  for (i in 1:length(unique(data$start))) {
    x3 <- filter(data2, start >= data$start[i] & start <= data$end[i])
    x4 <- x3[1, ]
    x4$end <- max(x3$end)
    data2 <- filter(data2, start < data$start[i] | start > data$end[i])
    data2 <- rbind(data2, x4)
  }
  data2 <- na.omit(data2)
}
data <- remove.overlaps(data)
Here's a possible solution. The basic idea is to compare each start date with the maximum end date seen "until now" (using cummax) and to create an index that separates the data into groups:
data %>%
arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
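A small optional step (my addition, not part of the answer): if you don't want the helper indx column in the final result, append
%>% ungroup() %>% select(-indx)
to the pipeline above.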
@David Arenburg's answer is great, but I ran into an issue where an earlier interval ended after a later interval; using last() in the summarise call then resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end):
data %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = min(start), end = max(end))
Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.
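Putting both suggestions together, a sketch of the full pipeline (nothing here beyond the dplyr verbs already shown above):
data %>%
  arrange(ID, start) %>%   # sort by the grouping variable(s) and start
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                            cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))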
For the sake of completeness, the IRanges package on Bioconductor has some neat functions for dealing with date or datetime ranges. One of them is reduce(), which merges overlapping or adjacent ranges.
However, there is a drawback: IRanges works on integer ranges (hence the name), so the convenience of its functions comes at the expense of converting Date or POSIXct objects back and forth.
Also, it seems that dplyr doesn't play well with IRanges (at least judging from my limited experience with dplyr), so I use data.table:
library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)
setDT(data)[, {
ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
.(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
ID start end
<fctr> <POSc> <POSc>
1: A 2013-01-01 2013-01-06
2: A 2013-01-07 2013-01-11
3: A 2013-01-12 2013-01-15
A code variant is
setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
, lapply(.SD, as_datetime), .SDcols = -"width"],
by = ID]
In both variants, as_datetime() from the lubridate package is used, which spares us from specifying the origin when converting numbers to POSIXct objects.
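A quick illustration of that point (a sketch; the numeric value is one of the end timestamps from the question data):
x <- 1357344000  # seconds since the Unix epoch
as.POSIXct(x, origin = "1970-01-01", tz = "UTC")  # base R needs an explicit origin (in older R versions)
#> [1] "2013-01-05 UTC"
lubridate::as_datetime(x)  # origin defaults to the Unix epoch, result is UTC
#> [1] "2013-01-05 UTC"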
It would be interesting to see a benchmark comparison of the IRanges approaches vs. David's answer.
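Here is a rough sketch (untested, function names made up) of how such a benchmark could be set up with the microbenchmark package, wrapping both approaches as functions:
library(microbenchmark)
dplyr_way <- function(df) {
  df %>%
    arrange(ID, start) %>%
    group_by(ID) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
    group_by(ID, indx) %>%
    summarise(start = min(start), end = max(end))
}
iranges_way <- function(df) {
  dt <- as.data.table(df)  # copy, so the input is not modified by reference
  dt[, {
    ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
    .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
  }, by = ID]
}
microbenchmark(dplyr_way(data), iranges_way(data), times = 100L)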
It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRanges method because the package is not yet available for R 3.5.1.)
Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2018-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.
library(data.table)
start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")
# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]
# set keys (also orders)
setkey(d2, ID, start_col, end_col)
# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col,
END_DT = end_col,
indx = c(0, cumsum(as.numeric(shift(start_col, type = "lead")) > cummax(as.numeric(end_col)))[-.N])), # shift() is data.table's lead()
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
]
I think that you can solve this problem pretty nicely with dplyr and the ivs package, which is designed for working with interval vectors, exactly like what you have here. It is inspired by IRanges, but is more suitable for use in the tidyverse and is completely generic so it can handle date intervals automatically (no need to convert to numeric and back).
The key is to combine the start/end boundaries into a single interval vector column, and then use iv_groups(). This merges all of the overlapping intervals in the interval vector and returns the intervals that remain after the overlaps have been merged.
It seems like you want to do this by ID, so I've also grouped by ID.
library(ivs)
library(dplyr)
data <- tribble(
~ID, ~start, ~end,
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-01", "2013-01-05",
"A", "2013-01-02", "2013-01-03",
"A", "2013-01-04", "2013-01-06",
"A", "2013-01-07", "2013-01-09",
"A", "2013-01-08", "2013-01-11",
"A", "2013-01-12", "2013-01-15"
) %>%
mutate(
start = as.Date(start),
end = as.Date(end)
)
data
#> # A tibble: 7 × 3
#> ID start end
#> <chr> <date> <date>
#> 1 A 2013-01-01 2013-01-05
#> 2 A 2013-01-01 2013-01-05
#> 3 A 2013-01-02 2013-01-03
#> 4 A 2013-01-04 2013-01-06
#> 5 A 2013-01-07 2013-01-09
#> 6 A 2013-01-08 2013-01-11
#> 7 A 2013-01-12 2013-01-15
# Combine `start` and `end` into a single interval vector column
data <- data %>%
mutate(interval = iv(start, end), .keep = "unused")
# Note that this is a half-open interval!
data
#> # A tibble: 7 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-05)
#> 2 A [2013-01-01, 2013-01-05)
#> 3 A [2013-01-02, 2013-01-03)
#> 4 A [2013-01-04, 2013-01-06)
#> 5 A [2013-01-07, 2013-01-09)
#> 6 A [2013-01-08, 2013-01-11)
#> 7 A [2013-01-12, 2013-01-15)
# It seems like you'd want to group by ID, so let's do that.
# Then we use `iv_groups()` which merges all overlapping intervals and returns
# the intervals that remain after all the overlaps have been merged
data %>%
group_by(ID) %>%
summarise(interval = iv_groups(interval), .groups = "drop")
#> # A tibble: 3 × 2
#> ID interval
#> <chr> <iv<date>>
#> 1 A [2013-01-01, 2013-01-06)
#> 2 A [2013-01-07, 2013-01-11)
#> 3 A [2013-01-12, 2013-01-15)
Created on 2022-04-05 by the reprex package (v2.0.1)
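A small addendum (a sketch, not part of the reprex above): if you need plain start/end columns back to match the desired output, the iv_start() and iv_end() accessors unpack the interval column again; since the intervals were built directly from your end values, iv_end() lines up with them.
data %>%
  group_by(ID) %>%
  summarise(interval = iv_groups(interval), .groups = "drop") %>%
  mutate(start = iv_start(interval), end = iv_end(interval), .keep = "unused")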
Related
I'm new to R and I'm facing a problem: I have a date vector and a dataframe containing sales values and coverage start and end dates.
I need to defer the sale value at each analysis date. For the first analysis period I can write code that gives me the desired answer, but in my real data I am working with 200K+ rows and 50+ analysis periods.
I'm not able to build a loop, or find an alternative function in R, that creates the variables Aux[i] and test[i] for each date present in the vec_date vector.
The following is an example of code that works for the first analysis period.
library(tidyverse)
library(lubridate)
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500))
date <- ymd("2021-12-31")
vec_date <- date %m+% months(seq(0, 12, by = 6))
df_new <- df |>
mutate(duration = as.numeric(DateFin - DateIn),
Pr_day = Premium/duration,
Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
test1 = if_else(Aux1 == "Y" & DateFin > vec_date[1], as.numeric(DateFin - vec_date[1])*Pr_day,
if_else(DateIn > vec_date[1], Premium, 0)))
Does anyone have any idea how I could build this loop, or is there an R function/package that allows me to perform this interaction between my df dataframe and the vec_date vector?
Edit: an outline of the format you would need as a result would be:
df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500),
Aux1 = c("Y", "Y", "N"),
test1 = c(5421.429, 4849.246, 24500.000),
Aux2 = c("N", "Y", "Y"),
test2 = c(0.0000, 301.5075, 20125.0000),
Aux3 = c("N", "N", "Y"),
test3 = c(0, 0, 4025))
Here, Aux1 and test1 are the results for vec_date[1], Aux2/test2 for vec_date[2], and Aux3/test3 for vec_date[3]. It is important for me to keep the resulting variables in the same dataframe because further analysis will be done on it.
As @Jon Spring suggests in the comments, probably the preferred approach here would be to extend your data frame so that each row is repeated for each of your analysis dates (tidyr::complete() can do this; the code below uses tidyr::crossing() to the same effect). Then you can stick to vectorized calculations and get the analysis date as a column in the resulting data, too.
Below is how to do just that with the example data you provided. I took the liberty of renaming some columns and of simplifying the control-flow based calculation according to my understanding of the problem, based on what you shared.
First, the example data slightly reframed:
library(tidyverse)
library(lubridate)
policies <- tibble(
policy_id = seq_len(3),
start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
premium = c(11000, 5000, 24500)
)
policies
#> # A tibble: 3 x 4
#> policy_id start end premium
#> <int> <date> <date> <dbl>
#> 1 1 2021-10-21 2022-03-10 11000
#> 2 2 2021-12-25 2022-07-12 5000
#> 3 3 2022-05-11 2023-02-15 24500
Then, finding remaining prorated premiums for policies at given dates:
start_date <- ymd("2021-12-31")
dates <- start_date %m+% months(seq(0, 12, by = 6))
policies %>%
mutate(
days = as.numeric(end - start),
daily_premium = premium / days
) %>%
crossing(date = dates) %>%
mutate(
days_left = pmax(0, end - pmax(start, date)),
premium_left = days_left * daily_premium
) %>%
select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#> policy_id date days_left premium_left
#> <int> <date> <dbl> <dbl>
#> 1 1 2021-12-31 69 5421.
#> 2 1 2022-06-30 0 0
#> 3 1 2022-12-31 0 0
#> 4 2 2021-12-31 193 4849.
#> 5 2 2022-06-30 12 302.
#> 6 2 2022-12-31 0 0
#> 7 3 2021-12-31 280 24500
#> 8 3 2022-06-30 230 20125
#> 9 3 2022-12-31 46 4025
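If you really need the wide test1/test2/test3 layout from your edit, one option (a sketch, with premiums_left as a made-up name for the result of the pipeline above) is to pivot it back out with tidyr, one column per analysis date:
premiums_left %>%
  pivot_wider(
    id_cols = policy_id,
    names_from = date,
    names_prefix = "premium_left_",
    values_from = premium_left
  )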
I'm trying to calculate the number of days that a patient spent during a given state in R.
An image of example data is included below. I only have columns 1 to 3, and I want to get the answer in column 5. I am thinking that if I can create a date column (column 4) holding the first recorded date for each state, I can then subtract it from column 2 to get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second set of 1's together with the first set of 1's (and likewise for the 2's), which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct, group_by ID and the rleid of STATE so that the first set of 1's is handled separately from the second set, then use difftime to calculate the difference between OBS_DTM and the minimum value in the group, in days.
If your data is called data:
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
The result of the pipeline above:
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
I have a dataset that I'd now like to split at 12:00 pm (midday), i.e. if an interval runs from 08:00-13:00 it becomes 08:00-12:00 and 12:00-13:00 across two rows. The duration and cumulative sum variables would need to change accordingly, but the other variables should stay as in the original (unchanged).
This should be applicable across different id variables.
id = unchanged from row 1, just repeated
start = changed in both rows
end = changed in both rows
day = unchanged from row 1, just repeated
duration = changed in both rows
cumulative time = changed in both rows
ORIGINAL DATAFILE
#Current dataframe
id<-c("m1","m1")
x<-c("2020-01-03 10:00:00","2020-01-03 19:20:00")
start<-strptime(x,"%Y-%m-%d %H:%M:%S")
y<-c("2020-01-03 16:00:00","2020-01-03 20:50:00")
end<-strptime(y,"%Y-%m-%d %H:%M:%S")
day<-c(1,1)
mydf<-data.frame(id,start,end,day)
# calculate duration and time
mydf$duration<-as.numeric(difftime(mydf$end,mydf$start,units = "hours"))
mydf$time<-c(cumsum(mydf$duration))
REQUIRED DATAFILE
#Required dataframe
id2<-c("m1","m1","m1")
x2<-c("2020-01-03 10:00:00","2020-01-03 12:00:00","2020-01-03 19:20:00")
start2<-strptime(x2,"%Y-%m-%d %H:%M:%S")
y2<-c("2020-01-03 12:00:00","2020-01-03 16:00:00","2020-01-03 20:50:00")
end2<-strptime(y2,"%Y-%m-%d %H:%M:%S")
day2<-c(1,1,1)
mydf2<-data.frame(id2,start2,end2,day2)
# calculate duration and time
mydf2$duration<-c(2,4,1.5)
mydf2$time<-c(2,6,7.5)
Good question. Each line implicitly contains either one or two intervals, so you should be able to just define those interval(s) on each line and then pivot to long, but you can't pivot with interval values (yet?). So here's my approach, which computes up to two shift start times for each line and then infers each shift's end from the start of the next shift after pivoting. Comments inline.
library(lubridate, warn.conflicts = FALSE)
library(tidyverse)
library(magrittr, warn.conflicts = FALSE)
library(hablar, warn.conflicts = FALSE)
(mydf <- tibble(
id = "m1",
start = as_datetime(c("2020-01-03 10:00:00", "2020-01-03 19:20:00")),
end = as_datetime(c("2020-01-03 16:00:00", "2020-01-03 20:50:00")),
day = 1
))
#> # A tibble: 2 x 4
#> id start end day
#> <chr> <dttm> <dttm> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 16:00:00 1
#> 2 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1
(mydf2 <-
mydf %>%
# Assume the relevant noontime cutoff is on the same day as the start
mutate(midday =
start %>% as_date() %>%
add(12 %>% hours()) %>%
fit_to_timeline() %>%
# No relevant midday if the shift doesn't include noon
na_if(not(. %within% interval(start, end)))) %>%
# Make an original row ID since there doesn't seem to be one, and we will need
# to build intervals within the data stemming from each original row
rownames_to_column("orig_shift") %>%
pivot_longer(cols = c(start, midday, end),
# The timestamps we have here will be treated as start times
values_to = "start",
# Drop rows that would exist due to irrelevant middays
values_drop_na = TRUE) %>%
select(-name) %>%
# Infer shift end times as the start of the next shift, within lines defined
# by the original shifts
group_by(orig_shift) %>%
arrange(start) %>%
mutate(end = lead(start)) %>%
ungroup() %>%
# Drop lines that represent the end of the last shift and not a full one
drop_na() %>%
# Compute those durations and times (should times really be globally
# cumulative? Also, your specified mydf2 seems to have an incorrect first time
# value)
mutate(duration = start %--% end %>% as.numeric("hours"),
time = cumsum(duration)) %>%
select(id, start, end, day, duration, time))
#> # A tibble: 3 x 6
#> id start end day duration time
#> <chr> <dttm> <dttm> <dbl> <dbl> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2 2
#> 2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4 6
#> 3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
Created on 2019-10-23 by the reprex package (v0.3.0)
Here is my solution for a more general case where you have many observations with different dates. The logic is the following.
First, I create a data frame with 12:00 pm (midday) splitters.
Next, I identify the rows which should be split by joining that data frame to the initial one, and save them in a separate data frame.
Next, I duplicate those rows to create split_rows.
Finally, from the original dataset I delete the rows which were split and join in the corrected, doubled rows.
library(dplyr)
split_time_data =
tibble(split_time = as.POSIXct(seq(0, 365*60*60*24, 60*60*24),
origin="2020-01-01 17:00:00")) %>%
mutate(key = TRUE) # I use 17:00 to make it 12:00 EST; adjust for your purposes
data_to_split =
mydf %>%
mutate(key = TRUE) %>%
left_join(split_time_data) %>%
filter(between(split_time, start, end)) %>%
select(-key)
library(lubridate)
split_rows =
data_to_split %>%
rbind(data_to_split) %>%
arrange(start) %>%
group_by(start) %>%
mutate(row_number = row_number() ) %>%
ungroup() %>%
mutate(start = if_else(row_number == 1, start, split_time ),
end = if_else(row_number == 1, split_time, end )) %>%
select(-row_number, -split_time) %>%
mutate(duration = hour(end) - hour(start) )
mydf %>%
anti_join(data_to_split) %>%
full_join(split_rows) %>%
arrange(start) %>%
mutate(time = cumsum(duration) )
The output
id start end day duration time
1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2.0 2.0
2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4.0 6.0
3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
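One small caveat (my suggestion, not part of the answer above): hour(end) - hour(start) only counts whole hours, which works here because the splits fall exactly on the hour. For shifts that start or end off the hour, the difftime form used in the question is safer, e.g.
split_rows <- split_rows %>%
  mutate(duration = as.numeric(difftime(end, start, units = "hours")))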
I would like to remove products from a dataframe that have overlapping start and end dates to avoid duplicates in a subsequent step.
Example data:
library(dplyr)
d <-
bind_rows(
data.frame(product = 1,
start_date = as.Date("2016-01-01"),
end_date = as.Date("2016-01-10"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-02"),
end_date = as.Date("2016-01-04"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-05"),
end_date = as.Date("2016-06-09"),
stringsAsFactors = FALSE),
data.frame(product = 2,
start_date = as.Date("2016-01-03"),
end_date = as.Date("2016-01-07"),
stringsAsFactors = FALSE)
)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-02 2016-01-04
3 1 2016-01-05 2016-06-09
4 2 2016-01-03 2016-01-07
From this example I would like to remove rows 2 and 3 because of the overlaps.
I've used the lag function to remove overlaps that are next to each other:
d_cleaned <-
d %>%
arrange(product, start_date, end_date) %>%
mutate(overlapping = product == lag(product) & start_date <= lag(end_date) & end_date >= lag(start_date)) %>% # define overlaps
mutate(overlapping = ifelse(is.na(overlapping), FALSE, overlapping)) %>% # dont delete the first row
filter(overlapping == FALSE) %>% # remove overlaps
select(-overlapping)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-05 2016-06-09
3 2 2016-01-03 2016-01-07
As can be seen above, this step removes overlaps between consecutive rows, but not all of them.
I can solve this with a loop, but I was hoping that someone might be able to suggest a non-looping solution as the dataframe is quite large and each step takes a while.
Using non-equi joins from the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
setDT(d) # convert 'd' to a data.table by reference
idx = d[d, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE]
d[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
# product start_date end_date
# 1: 1 2016-01-01 2016-01-10
# 2: 1 2016-06-10 2016-06-12
# 3: 2 2016-01-03 2016-01-07
For each row in d (the one inside the square bracket), we find any kind of overlap with d (on the outside), i.e., a self-join, based on the condition provided to the on argument, and we extract the index of the first overlap (because which=TRUE and mult="first").
If and only if a row's first overlap is with itself, we keep that row. We discard all other intervals.
To install devel version, see installation instructions here.
Here's a benchmark on slightly more rows (the data is not by any means large):
set.seed(1L)
require(data.table) # v1.9.7+
dates = as.Date(sample(16000:17000, 1e5, TRUE), origin="1970-01-01")
dt = data.table(product=sample(100, 1e5, TRUE),
start_date = sample(dates, 1e5, TRUE),
end_date = sample(dates, 1e5, TRUE))
dt[, `:=`(start_date = pmin(start_date, end_date),
end_date = pmax(start_date, end_date))]
system.time({
idx = dt[dt, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE, verbose=TRUE]
ans = dt[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
})
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0.041 secs
# Recomputing forder with non-equi ids ... done in 0.005 secs
# Found 178 non-equi group(s) ...
# Starting bmerge ...done in 2.359 secs
# user system elapsed
# 2.402 0.011 2.421
head(ans)
# product start_date end_date
# 1: 71 2015-12-04 2016-03-22
# 2: 71 2014-04-12 2015-05-01
# 3: 32 2013-11-23 2015-03-18
# 4: 56 2014-07-29 2015-12-26
# 5: 88 2015-03-08 2015-03-21
# 6: 69 2014-10-31 2015-07-05
nrow(ans)
# [1] 186
I believe the following will work
d <- cbind(ID=1:nrow(d),d)
d_cleaned <- d[rep(1:nrow(d), times=nrow(d)),] %>% ## 1
setNames(.,paste0(names(.),"_other")) %>% ## 2
bind_cols(d[rep(1:nrow(d), each=nrow(d)),], .) %>% ## 3
arrange(product,start_date,end_date) %>% ## 4
filter(product == product_other) %>% ## 5
mutate(overlapping = ID_other < ID &
start_date <= end_date_other &
end_date >= start_date_other) %>% ## 6
group_by(ID) %>%
filter(all(overlapping==FALSE)) %>% ## 7
ungroup() %>%
select(product,start_date,end_date) %>%
distinct()
print(d_cleaned)
### A tibble: 2 x 3
## product start_date end_date
## <dbl> <date> <date>
##1 1 2016-01-01 2016-01-10
##2 2 2016-01-03 2016-01-07
First, add a column of IDs that identifies the rows of the original data frame, so that we can group_by it later and determine whether each row overlaps with any other row. The key is to be able to consider all distinct pairs of rows with the same product when testing for overlap. The code above does this by expanding the data as in an outer join. Specifically:
Replicate d nrow(d) times
Change the names of the columns by appending _other to them so that they can be referenced separately from the original column names in the overlap test
Replicate each row of d nrow(d) times and append the result from (2) as new columns
The result of (3) has rows that enumerate all pairs of rows from the original data frame. Then:
Sort them as you did.
Consider only pairs where the product matches; doing this first minimizes unneeded comparisons later.
Do the overlap test. The comparison is only made against earlier rows of the original data frame (ID_other < ID), which has the effect of considering all lags while avoiding the trivial case of a row overlapping with itself.
Group by ID (each row of the original data frame) and keep those rows for which all overlapping values are FALSE.
At this point, the result contains only the non-overlapping rows in the original data frame. However, there are all those extra columns, and there are duplicates where multiple rows overlap with a row. The rest of the code cleans that up.
I have tested it with the following data (augmented yours to add a few more test conditions, but far from exhaustive):
d <- structure(list(product = c(1, 1, 1, 1, 1, 2, 2), start_date = structure(c(16801,
16802, 16805, 16811, 16962, 16803, 16806), class = "Date"), end_date = structure(c(16810,
16804, 16961, 16961, 16964, 16807, 16810), class = "Date")), .Names = c("product",
"start_date", "end_date"), row.names = c(NA, -7L), class = "data.frame")
and got the following results:
# A tibble: 3 x 3
product start_date end_date
<dbl> <date> <date>
1 1 2016-01-01 2016-01-10
2 1 2016-06-10 2016-06-12
3 2 2016-01-03 2016-01-07
Hope this helps.
I have a series of admissions for patients (dataframe 'admissions' below) and a series of events (2nd dataframe called 'events').
I am interested in whether events occurred within 5 days after an admission. Obviously matches have to be made within patient ID ('id').
In real life, the admissions data frame contains >500k admissions on 100k pts. One patient might have multiple admissions, and multiple events. Not all patients will have an event.
admissions <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), date = structure(c(16436,
16443, 16574, 16468, 16481, 16494), class = "Date")), .Names = c("id",
"date"), row.names = c(NA, 6L), class = "data.frame")
> admissions
id date
1 1 2015-01-01
2 1 2015-01-08
3 1 2015-05-19
4 2 2015-02-02
5 2 2015-02-15
6 2 2015-02-28
events <- structure(list(id = c(1L, 1L, 2L), date = structure(c(16453,
16578, 16467), class = "Date")), .Names = c("id", "date"), row.names = 7:9, class = "data.frame")
> events
id date
7 1 2015-01-18
8 1 2015-05-23
9 2 2015-02-01
I guess I just need the minimum difference in days (only positive values considered) for each event relative to the admissions, matched within patients.
Event 1 (id ==1): +10 days (10 days after 08/01/2015)
Event 2 (id ==1): +4 days
Event 3 (id ==2): -1 days
I can then select those events that fall within my window (which will probably be 5 days).
My guess would be that lapply() is involved, but for some reason the apply family is not very natural to me (yet!).
Using dplyr:
library(dplyr)
mutate(events, event_id=row_number()) %>% # Add event id
right_join(admissions, by="id") %>% # Join with admissions
rename(adm_date = date.y, ev_date = date.x) %>% # Clean names
mutate(diff = ev_date - adm_date) %>% # Compute diffrence
filter(diff >= 0) %>% # Filter
group_by(event_id) %>%
arrange(diff) %>% # Sort ascending by diff by event_id
summarise_each(funs(first), ev_date, adm_date, diff) # Get nearest
Source: local data frame [2 x 4]
event_id ev_date adm_date diff
1 1 2015-01-18 2015-01-08 10 days
2 2 2015-05-23 2015-05-19 4 days
Using data.table rolling join:
keycols <- c("id", "date")
admissions_dt <- admissions %>% mutate(adm_date = date) %>% as.data.table()
setkeyv(admissions_dt, keycols)
events_dt <- mutate(events, event_id=row_number()) %>% as.data.table()
setkeyv(events_dt, keycols)
admissions_dt[events_dt, roll=10][order(event_id)]
id date adm_date event_id
1: 1 2015-01-18 2015-01-08 1
2: 1 2015-05-23 2015-05-19 2
3: 2 2015-02-01 <NA> 3
Using data.table 1.9.5 for its on= feature.
For each row in events, find the index of the admissions row with the closest date that is <= the event's date.
idx = setDT(admissions)[events, which=TRUE, roll=TRUE, on=c("id", "date")]
idx
# [1] 2 3 NA
If you already know you only want a 5-day window, you can use roll=5 instead of roll=TRUE; roll=<positive number> performs a LOCF rolling join limited to that gap.
The indices correspond to matching rows in admission for each row of event. So we can now extract the date as follows:
setDT(events)[, adm_date := admissions$date[idx]]
# id date adm_date
# 1: 1 2015-01-18 2015-01-08
# 2: 1 2015-05-23 2015-05-19
# 3: 2 2015-02-01 <NA>
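A possible last step (a sketch): compute the gap in days and keep only events that fall within the 5-day window after an admission.
events[, days_after_adm := as.numeric(date - adm_date)]
events[!is.na(adm_date) & days_after_adm <= 5]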