I have 2 dataframes (DFs) that each contain identifiers and date ranges. In both DFs there can be numerous date ranges associated with each ID.
What I want to do is select the rows from the first DF (DF.A) for which there is an overlapping interval of any length in the second DF (DF.B).
df.A <- data.frame("ID" = c(1,1,1,2,3,3),
"Start.A" = c("2019-01-01", "2019-03-15", "2019-06-10", "2017-01-01", "2015-05-10", "2015-05-15"),
"End.A" = c("2019-01-31", "2019-04-15", "2019-07-09", "2017-01-31", "2015-06-10", "2015-06-02"))
df.B <- data.frame("ID" = c(1,1,1,3,3),
"Start.B" = c("2019-01-01", "2019-02-01", "2019-03-01", "2015-06-01", "2015-07-01"),
"End.B" = c("2019-01-31", "2019-02-28", "2019-03-31", "2015-06-30", "2015-07-31"))
Dataframe A:
ID Start.A End.A
1 2019-01-01 2019-01-31
1 2019-03-15 2019-04-15
1 2019-06-10 2019-07-09
2 2017-01-01 2017-01-31
3 2015-05-10 2015-06-10
3 2015-05-15 2015-06-02
Dataframe B:
ID Start.B End.B
1 2019-01-01 2019-01-31
1 2019-02-01 2019-02-28
1 2019-03-01 2019-03-31
3 2015-06-01 2015-06-30
3 2015-07-01 2015-07-31
What I would like as my output is:
ID Start.A End.A
1 2019-01-01 2019-01-31
1 2019-03-15 2019-04-15
3 2015-05-10 2015-06-10
3 2015-05-15 2015-06-02
I think I would be able to do this without a problem if I had a one-to-one match but, as I mentioned, in both DFs there are numerous observations for each ID. I've tried my hand at applying lubridate's interval, but I'm struggling with how to look for overlaps while dealing with the added complexity of having to look up all corresponding IDs in DF.B for a potential match.
This is a very large dataset (>5 million observations in DF.A and >2 million in DF.B) so speed is crucial. Any recommendations to transform the data to make this operation as fast as possible would also be appreciated.
If helpful: for a given ID, DF.A can have observations that overlap with other observations in DF.A (e.g. ID 3 in the toy example above). In contrast, the DF.B intervals never overlap one another.
How about this?
library(data.table)
df.A <- data.table("ID" = c(1,1,1,2,3,3),
"Start.A" = c("2019-01-01", "2019-03-15", "2019-06-10", "2017-01-01", "2015-05-10", "2015-05-15"),
"End.A" = c("2019-01-31", "2019-04-15", "2019-07-09", "2017-01-31", "2015-06-10", "2015-06-02"))
df.B <- data.table("ID" = c(1,1,1,3,3),
"Start.B" = c("2019-01-01", "2019-02-01", "2019-03-01", "2015-06-01", "2015-07-01"),
"End.B" = c("2019-01-31", "2019-02-28", "2019-03-31", "2015-06-30", "2015-07-31"))
And
DF = merge(df.A, df.B, by = 'ID', allow.cartesian = TRUE)
DF$SEQ_DATE.A = apply(DF[,c('Start.A','End.A'), with=F],1, function(x){paste(x,collapse = ',')})
DF$SEQ_DATE.A = unlist(lapply(strsplit(DF$SEQ_DATE.A,','),function(x){
out = seq(as.Date(x[1]),as.Date(x[2]),by = 'day')
out = paste(out, collapse = '|')
return(out)
}
))
DF$SEQ_DATE.B = apply(DF[,c('Start.B','End.B'), with=F],1, function(x){paste(x,collapse = ',')})
DF$SEQ_DATE.B = unlist(lapply(strsplit(DF$SEQ_DATE.B,','),function(x){
out = seq(as.Date(x[1]),as.Date(x[2]),by = 'day')
out = paste(out, collapse = '|')
return(out)
}
))
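# SEQ_DATE.A joins every day in A's range with '|', so it acts as a regex
# alternation: grepl() is TRUE when any day of A appears in B's day sequence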
DF$Result= apply(DF[,c('SEQ_DATE.A','SEQ_DATE.B'), with = F], 1, function(x){grepl(x[1],x[2])})
And the result is shown below:
> DF[,-c('SEQ_DATE.A','SEQ_DATE.B'), with =F][Result == 'TRUE']
ID Start.A End.A Start.B End.B Result
1: 1 2019-01-01 2019-01-31 2019-01-01 2019-01-31 TRUE
2: 1 2019-03-15 2019-04-15 2019-03-01 2019-03-31 TRUE
3: 3 2015-05-10 2015-06-10 2015-06-01 2015-06-30 TRUE
4: 3 2015-05-15 2015-06-02 2015-06-01 2015-06-30 TRUE
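A note on scale: with >5 million rows in DF.A, expanding every interval into a '|'-delimited string of days will be slow and memory-hungry. A minimal sketch of an interval-join alternative using data.table's foverlaps() (the as.Date conversions and the final unique() are my assumptions about the setup, not part of the answer above):
library(data.table)
## convert the range columns to Date (assumed; foverlaps needs comparable types)
df.A[, c("Start.A", "End.A") := lapply(.SD, as.Date), .SDcols = c("Start.A", "End.A")]
df.B[, c("Start.B", "End.B") := lapply(.SD, as.Date), .SDcols = c("Start.B", "End.B")]
## foverlaps() requires the lookup table to be keyed by ID plus its interval columns
setkey(df.B, ID, Start.B, End.B)
## keep each df.A row that overlaps at least one df.B interval for the same ID;
## unique() because a row in df.A can overlap several df.B intervals
ov <- foverlaps(df.A, df.B, by.x = c("ID", "Start.A", "End.A"),
                type = "any", nomatch = 0L)
unique(ov[, .(ID, Start.A, End.A)])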
I have a cohort of data with multiple person visits and want to group visits with a common ID based on person # and the time of the visit. The condition is: if an admission starts within 24 hours of the previous discharge, then I want those visits to have the same ID.
Sample of what the data looks like:
dat <- data.frame(
  Person_ID = c(1,1,1,2,3,3,3,4,4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"),
                               format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"),
                                   format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start:
library(dplyr)
library(lubridate)
dat1 <-
  dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID <- as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
This is almost right, except it does not give the correct ID for visit 7 (person #3), since there are three visits, the second visit is entirely within the first, and the third starts within 24 hours of the first but not the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(dplyr)
library(tidyr)
library(lubridate)
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 0) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
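As an aside, gather() and spread() are superseded in tidyr 1.0+. A condensed sketch of the same logic with pivot_longer()/pivot_wider() (my translation, not tested beyond the sample data) might look like:
library(dplyr)
library(tidyr)
library(lubridate)
dat1 <- dat %>%
  pivot_longer(c(Admit_Date_Time, Discharge_Date_Time),
               names_to = "event", values_to = "time") %>%
  # a discharge keeps the visit "open" for 24 more hours
  mutate(time_adj = if_else(event == "Discharge_Date_Time", time + ddays(1), time)) %>%
  arrange(Person_ID, time_adj) %>%
  group_by(Person_ID) %>%
  mutate(admissions_count = cumsum(if_else(event == "Admit_Date_Time", 1, -1))) %>%
  ungroup() %>%
  mutate(Hosp_ID = cumsum(Person_ID != lag(Person_ID, default = 0) |
                            lag(admissions_count, default = 1) == 0)) %>%
  select(-time_adj, -admissions_count) %>%
  pivot_wider(names_from = event, values_from = time)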
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dat[, discharge_24h := Discharge_Date_Time %m+% days(1)]
#now create id
setkey( dat, Person_ID, Admit_Date_Time, discharge_24h ) #key on the person too, so visits only chain within the same Person_ID
#create data-table with overlap-join, create groups based on overlapping ranges
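#mult = "first" maps every visit to the earliest visit it overlaps, and
#.GRP then numbers those first-match groups consecutively to give Hosp_ID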
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
I'm trying to use R to measure how many days supply of a prescription an individual already has on-hand when they make a refill, taking into account all previous prescriptions. For example, if I had this table...
member rx_id fill_date to_date days_supply
1 A 1 2018-10-01 2018-10-02 2
2 B 1 2016-11-07 2016-11-10 4
3 B 2 2016-11-07 2016-12-04 28
4 B 3 2016-11-08 2016-11-09 2
5 B 4 2016-11-10 2016-12-03 24
I'd expect the following output
member rx_id fill_date to_date days_supply_on_hand
1 A 1 2018-10-01 2018-10-02 0
2 B 1 2016-11-07 2016-11-10 0
3 B 2 2016-11-07 2016-12-04 4
4 B 3 2016-11-08 2016-11-09 30
5 B 4 2016-11-10 2016-12-03 26
For member B, when the second script is filled on the same day as the first script, the individual already has 4 days' worth of RX on hand. When the third script is filled, the individual has 3 days left from the first script and 27 left from the second (30 total). When the fourth script is filled, the third script is depleted, but there is 1 day left from the first script and 25 from the second script (26 total).
I know how to do rolling totals in both dplyr and data.table, but I can't figure out how to take into account variable levels of depletion based on previous records on an individual by individual basis. Below is code to remake the original table, thanks in advance for any suggestions!
structure(list(member = structure(c(1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"),
                                  class = "factor"),
               rx_id = c(1, 1, 2, 3, 4),
               fill_date = structure(c(17805, 17112, 17112, 17113, 17115), class = "Date"),
               to_date = structure(c(17806, 17115, 17139, 17114, 17138), class = "Date"),
               days_supply = c(2, 4, 28, 2, 24)),
          .Names = c("member", "rx_id", "fill_date", "to_date", "days_supply"),
          row.names = c(NA, -5L), class = "data.frame")
library(data.table)
dt = as.data.table(your_df) # or setDT to convert in place
# merge on relevant days, then compute sum of supply - days elapsed
dt[dt, on = .(member, fill_date <= fill_date, to_date >= fill_date, rx_id < rx_id), by = .EACHI,
sum(days_supply, na.rm = T) - sum(i.fill_date - x.fill_date, na.rm = T)]
# member fill_date to_date rx_id V1
#1: A 2018-10-01 2018-10-01 1 0 days
#2: B 2016-11-07 2016-11-07 1 0 days
#3: B 2016-11-07 2016-11-07 2 4 days
#4: B 2016-11-08 2016-11-08 3 30 days
#5: B 2016-11-10 2016-11-10 4 26 days
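For clarity: the non-equi self-join matches each fill (the dt inside the brackets) with all of that member's earlier scripts (rx_id < rx_id) that were dispensed on or before the fill date and whose to_date has not yet passed; by = .EACHI then computes, per fill, the total supply from those scripts minus the days already elapsed since each was filled.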
Using a simple loop
dt$days_supply_on_hand <- 0
for (a in unique(dt$member)) {
I <- which(.subset2(dt,1) == a)
flDate <- as.integer(.subset2(dt,3)[I])
toDate <- as.integer(.subset2(dt,4)[I])
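# supply on hand at fill k: days from this fill date through each earlier
# script's to_date (inclusive), summed over the member's earlier scripts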
V <- vapply(seq_along(I), function (k) sum(toDate[1:(k-1)] - flDate[k] + 1), numeric(1))
dt$days_supply_on_hand[I] <- c(0,V[-1])
}
dt
member rx_id fill_date to_date days_supply days_supply_on_hand
1 A 1 2018-10-01 2018-10-02 2 0
2 B 1 2016-11-07 2016-11-10 4 0
3 B 2 2016-11-07 2016-12-04 28 4
4 B 3 2016-11-08 2016-11-09 2 30
5 B 4 2016-11-10 2016-12-03 24 26
where dt is the data frame provided above. (Note that .subset2 and as.integer are used for efficiency; they can be replaced with more readable equivalents.)
First question, so please let me know if I'm missing anything in providing thorough information.
Background: I have two tables. One is a table of tech tickets and when they were opened and when they were solved (closed). I want to create a timeline which counts how many tickets were open on each day.
Here is what I have done so far:
# load in data
tickets <- read.csv("tickets.csv",header=TRUE)
#packages
library(tidyr)
library(dplyr)
library(lubridate)
tickets <- tbl_df(tickets)
tickets
## A tibble: 10 × 3
#ID Date.Time.Opened Date.Time.Closed
#<int> <fctr> <fctr>
#1 1 1/19/17 11:51 1/30/17 14:44
#2 2 1/22/16 12:27 1/30/17 13:36
#3 3 1/20/17 17:07 1/27/17 7:24
#4 4 1/20/17 18:23 1/27/17 7:24
#5 5 1/20/17 8:54 1/26/17 12:09
#6 6 1/24/17 18:54 1/26/17 12:09
#7 7 1/25/17 11:33 1/26/17 12:08
#8 8 1/23/17 11:22 1/25/17 16:31
#9 9 1/20/17 16:48 1/25/17 15:06
#10 10 1/9/17 8:57 1/25/17 13:46
#dates are currently factors; change to dates.
tickets2 <-
tickets %>%
mutate(Date.Time.Opened = mdy_hm(Date.Time.Opened)) %>%
mutate(Date.Time.Closed = mdy_hm(Date.Time.Closed))
head(tickets2)
# A tibble: 6 × 3
#ID Date.Time.Opened Date.Time.Closed
#<int> <dttm> <dttm>
#1 1 2017-01-19 11:51:00 2017-01-30 14:44:00
#2 2 2016-01-22 12:27:00 2017-01-30 13:36:00
#3 3 2017-01-20 17:07:00 2017-01-27 07:24:00
#4 4 2017-01-20 18:23:00 2017-01-27 07:24:00
#5 5 2017-01-20 08:54:00 2017-01-26 12:09:00
#6 6 2017-01-24 18:54:00 2017-01-26 12:09:00
The "timeline" is just one column that has each day of the year, the structure of the timeline:
#read in timeline
timeline <- read.csv("timeline.csv",header=TRUE)
timeline <- tbl_df(timeline)
timeline
#change date from factor to date
timeline <- mutate(timeline,tDates = mdy(tDates))
timeline
# A tibble: 10 × 1
#tDates
#<date>
#1 2017-01-20
#2 2017-01-21
#3 2017-01-22
#4 2017-01-23
#5 2017-01-24
#6 2017-01-25
#7 2017-01-26
#8 2017-01-27
#9 2017-01-28
#10 2017-01-29
Here is what I would LIKE to get out at the end:
##----
##----DESIRED OUTCOME: ------
##----
#tDates ticketsOpen
#1 1/20/17 2
#2 1/21/17 6
#3 1/22/17 6
#4 1/23/17 7
#5 1/24/17 8
#6 1/25/17 9
#7 1/26/17 7
#8 1/27/17 4
#9 1/28/17 2
#10 1/29/17 2
#=======================
Here is what I have written after setting up the data:
# write a function which takes a date, searches the tickets table and
# returns the number of tickets that are open
nOpenTickets <- function(x){
nrow(filter(tickets,
x > mdy_hm(Date.Time.Opened) &
x < mdy_hm(Date.Time.Closed)))
}
#Add a column to the timeline with the number returned by the function
#(the number of open tickets on that date)
timeline <- mutate(timeline,ticketsOpen = nOpenTickets(tDates))
timeline
# my results:
## A tibble: 10 × 2
#tDates ticketsOpen
#<date> <int>
#1 2017-01-20 0
#2 2017-01-21 0
#3 2017-01-22 0
#4 2017-01-23 0
#5 2017-01-24 0
#6 2017-01-25 0
#7 2017-01-26 0
#8 2017-01-27 0
#9 2017-01-28 0
#10 2017-01-29 0
Note: if I don't format my tickets, then it all ends up with "7"...I don't know why, either.
Unfortunately, either I get formatting warnings relating to the dates, or it doesn't run at all, but I don't ever get different calculated values for every row.
I am not a professional programmer, so I may easily be missing some things. I have looked at former questions, but they are either not quite the same, not in R, or perhaps over my head: "dplyr: grouping and summarizing/mutating data with rolling time windows" and "For each row of one table, count entries in another table pointing to each of those rows in Oracle".
Below is the dump of the tickets and timeline objects for easy loading. I am thinking there are some issues between mixing tables, dplyr use, and date formatting that are tripping me up. In my head the logic is so clear, but I'm just not coding it right. :-)
I am wondering...should I not use dplyr and lubridate for this?
Thank you for any advice!
Tickets:
> dput(tickets)
structure(list(ID = 1:10, Date.Time.Opened = structure(c(1L,
6L, 3L, 4L, 5L, 8L, 9L, 7L, 2L, 10L), .Label = c("1/19/17 11:51",
"1/20/17 16:48", "1/20/17 17:07", "1/20/17 18:23", "1/20/17 8:54",
"1/22/16 12:27", "1/23/17 11:22", "1/24/17 18:54", "1/25/17 11:33",
"1/9/17 8:57"), class = "factor"), Date.Time.Closed = structure(c(8L,
7L, 6L, 6L, 5L, 5L, 4L, 3L, 2L, 1L), .Label = c("1/25/17 13:46",
"1/25/17 15:06", "1/25/17 16:31", "1/26/17 12:08", "1/26/17 12:09",
"1/27/17 7:24", "1/30/17 13:36", "1/30/17 14:44"), class = "factor")),
.Names = c("ID",
"Date.Time.Opened", "Date.Time.Closed"), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
Timeline:
> dput(timeline)
structure(list(tDates = structure(c(17186, 17187, 17188, 17189,
17190, 17191, 17192, 17193, 17194, 17195), class = "Date")), .Names =
"tDates", class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
=============================================
UPDATE
=============================================
Here's what ended up working (Thanks, alistaire!)
#packages
library(lubridate)
library(tidyverse)
# load in data
tickets <- read.csv("tickets.csv",header=TRUE)
timeline <- read.csv("timeline.csv",header=TRUE)
#change from factor to date
timeline <- mutate(timeline,tDates = mdy(tDates))
# create new df that shows how many are open each day
tickets2 <-
  tickets %>%
  mutate_at(-1, mdy_hm) %>%
  mutate(int = interval(Date.Time.Opened, Date.Time.Closed))
timeline %>%
  rowwise() %>%
  mutate(n = sum(tDates %within% tickets2$int))
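For reference, the original attempt likely failed because nOpenTickets() is not vectorized: mutate() passes the whole tDates vector into a single filter() call, where it is compared elementwise against the ticket columns instead of one date at a time. A sketch of the same count without rowwise(), iterating explicitly (this reuses tickets2 from above and mirrors the strict inequalities of the original function; the boundary handling is an assumption):
timeline$ticketsOpen <- sapply(seq_len(nrow(timeline)), function(i) {
  # align classes before comparing: as.POSIXct(Date) is midnight UTC,
  # which matches mdy_hm()'s default timezone
  d <- as.POSIXct(timeline$tDates[i])
  sum(tickets2$Date.Time.Opened < d & tickets2$Date.Time.Closed > d)
})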
I have data that looks like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 718184 1/2/2014 1/8/2014
1 885236 1/15/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589963 3/18/2015 3/22/2015
2 589999 2/27/2015 5/9/2015
2 594226 4/11/2015 4/17/2015
2 689959 5/10/2015 6/10/2015
3 656696 5/1/2016 5/5/2016
3 669625 5/6/2016 5/22/2016
4 777777 2/21/2015 3/4/2015
4 778952 2/1/2015 2/28/2015
4 778965 3/1/2015 3/22/2015
I am working on two different problems with this. The first one was answered in a previous post about how to roll dates up (Date roll-up in R), and the second is that I have intervals within intervals and I am trying to get rid of them. So the final outcome should look like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589999 2/27/2015 5/9/2015
3 656696 5/1/2016 5/22/2016
4 778952 2/1/2015 3/22/2015
Now I know I will have to create the extended intervals via the date rollup first, but then how do I get rid of these sub-intervals (a term I am making up for intervals within intervals)? I am also looking for a solution that is efficient since I actually have 75,000 records to go through (i.e. I am trying to avoid iterative solutions).
Using non-equi joins from the current development version of data.table, v1.9.7,
require(data.table) # v1.9.7+
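# assumed setup, not shown in the answer: the question's table as a data.table
# with real Date columns (`dta` is the dput() structure given in the last answer below)
dt <- as.data.table(dta)
dt[, `:=`(Date1 = as.Date(Date1, format = "%m/%d/%Y"),
          Date2 = as.Date(Date2, format = "%m/%d/%Y"))]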
dt[dt, .(CLM_ID = CLM_ID[.N==1L]), on=.(ID, Date1<=Date1, Date2>=Date2), by=.EACHI]
# ID Date1 Date2 CLM_ID
# 1: 1 2014-01-01 2014-01-17 718182
# 2: 1 2014-03-20 2014-03-21 885362
# 3: 2 2015-02-27 2015-05-09 589999
# 4: 2 2015-05-10 2015-06-10 689959
# 5: 3 2016-05-01 2016-05-05 656696
# 6: 3 2016-05-06 2016-05-22 669625
# 7: 4 2015-02-21 2015-03-04 777777
# 8: 4 2015-02-01 2015-02-28 778952
# 9: 4 2015-03-01 2015-03-22 778965
What this does is, for each row in dt (the one inside the square brackets), look up which rows match in dt (on the outside) based on the condition provided to the on argument.
The matching row indices are returned iff the only match is a self-match (since the condition includes equality as well). This is done by CLM_ID[.N == 1L], where .N holds the number of observations for each group.
"I am also looking for a solution that is efficient ... (i.e. I am trying to avoid iterative solutions)."
"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in." - Isaac Asimov
Below is a super fast base R iterative solution. It returns the correct results for very large data frames virtually instantly. (It also "rolls up" the data, so there is no need to carry out two algorithms.)
MakeDFSubInt <- function(df, includeCost = FALSE, printTime = FALSE) {
## Sorting the data frame to allow for fast
## creation of the "Contained" logical vector below
tempDF <- df[order(df$ID, df$Date1, df$Date2), ]
UniIDs <- unique(tempDF$ID)
Len <- length(UniIDs)
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID) != 0)  ## != 0 guards against ID values that skip numbers
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
## Converting dates to integers so that comparison
## will be faster. Internally dates are stored as
## integers, so this isn't a problem
dte1 <- as.integer(tempDF$Date1)
dte2 <- as.integer(tempDF$Date2)
## Building logical vector in order to quickly create sub-intervals
Contained <- rep(FALSE, dfLen)
BegTime <- Sys.time() ## Included to measure time of for loop execution
for (j in 1:Len) {
if (s[j] == e[j]) next  ## single-row ID group: nothing above to compare against
Compare <- ifelse(dte2[s[j]] >= (dte1[s[j]+1L]+1L), max(dte2[s[j]], dte2[s[j]+1L]), dte2[s[j]+1L])
for (x in (s[j]+1L):e[j]) {
if (!Contained[x-1L]) {
Contained[x] <- dte2[x-1L] >= (dte1[x]-1L)
} else {
Contained[x] <- Compare >= (dte1[x]-1L)
}
## could use ifelse, but this construct is faster
if (Contained[x]) {
Compare <- max(Compare, dte2[x])
} else {
Compare <- dte2[x]
}
}
}
EndTime <- Sys.time()
TotTime <- EndTime - BegTime
if (printTime) {print(paste(c("for loop execution time was: ", format(TotTime)), collapse = ""))}
## identify sub-intervals
nGrps <- which(!Contained)
## Create New fields for our new DF
ID <- tempDF$ID[nGrps]
CLM_ID <- tempDF$CLM_ID[nGrps]
Date1 <- tempDF$Date1[nGrps]
nGrps <- c(nGrps, dfLen+1L)
## as.Date is converting numbers to dates.
## N.B. This only works if origin is supplied
Date2 <- as.Date(vapply(1L:(length(nGrps) - 1L), function(x) {
max(dte2[nGrps[x]:(nGrps[x+1L]-1L)])}, 1L), origin = "1970-01-01")
## in a related question the OP had, "Cost" was
## included to show how the algorithm would handle
## generic summary information
if (includeCost) {
myCost <- tempDF$Cost
Cost <- vapply(1L:(length(nGrps) - 1L), function(x) sum(myCost[nGrps[x]:(nGrps[x+1L]-1L)]), 100.01)
NewDf <- data.frame(ID,CLM_ID,Date1,Date2,Cost)
} else {
NewDf <- data.frame(ID,CLM_ID,Date1,Date2)
}
NewDf
}
For the example given in the question, we have:
ID <- c(rep(1,4),rep(2,4),rep(3,2),rep(4,3))
CLM_ID <- c(718182, 718184, 885236, 885362, 589963, 589999, 594226, 689959, 656696, 669625, 777777, 778952, 778965)
Date1 <- c("1/1/2014","1/2/2014","1/15/2014","3/20/2014","3/18/2015","2/27/2015","4/11/2015","5/10/2015","5/1/2016","5/6/2016","2/21/2015","2/1/2015","3/1/2015")
Date2 <- c("1/17/2014","1/8/2014","1/17/2014","3/21/2014","3/22/2015","5/9/2015","4/17/2015","6/10/2015","5/5/2016","5/22/2016","3/4/2015","2/28/2015","3/22/2015")
myDF <- data.frame(ID, CLM_ID, Date1, Date2)
myDF$Date1 <- as.Date(myDF$Date1, format = "%m/%d/%Y")
myDF$Date2 <- as.Date(myDF$Date2, format = "%m/%d/%Y")
MakeDFSubInt(myDF)
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17
2 1 885362 2014-03-20 2014-03-21
3 2 589999 2015-02-27 2015-06-10
4 3 656696 2016-05-01 2016-05-22
5 4 778952 2015-02-01 2015-03-22
From a similar question the OP posted, we can add a Cost field, to show how we would proceed with calculations for this setup.
set.seed(7777)
myDF$Cost <- round(rnorm(13, 450, sd = 100),2)
MakeDFSubInt(myDF, includeCost = TRUE)
ID CLM_ID Date1 Date2 Cost
1 1 718182 2014-01-01 2014-01-17 1164.66
2 1 885362 2014-03-20 2014-03-21 568.16
3 2 589999 2015-02-27 2015-06-10 2019.16
4 3 656696 2016-05-01 2016-05-22 990.14
5 4 778952 2015-02-01 2015-03-22 1578.68
This algorithm scales very well. For data frames the size the OP is looking for, returning the requested DF returns almost instantaneously and for very large data frames, it returns in just seconds.
First we build a function that will generate a random data frame with n rows.
MakeRandomDF <- function(n) {
set.seed(109)
CLM_Size <- ifelse(n < 10^6, 10^6, 10^(ceiling(log10(n))))
numYears <- trunc((6/425000)*n + 5)
StrtYear <- ifelse(numYears > 16, 2000, 2016 - numYears)
numYears <- ifelse(numYears > 16, 16, numYears)
IDs <- sort(sample(trunc(n/100), n, replace = TRUE))
CLM_IDs <- sample(CLM_Size, n)
StrtDate <- as.Date(paste(c(as.character(StrtYear),"-01-01"), collapse = ""))
myPossibleDates <- StrtDate+(0:(numYears*365)) ## "numYears" years of data
Date1 <- sample(myPossibleDates, n, replace = TRUE)
Date2 <- Date1 + sample(1:100, n, replace = TRUE)
Cost <- round(rnorm(n, 850, 100), 2)
tempDF <- data.frame(ID = IDs, CLM_ID = CLM_IDs, Date1, Date2, Cost) ## name columns explicitly so df$ID matches exactly
tempDF$Date1 <- as.Date(tempDF$Date1, format = "%m/%d/%Y")
tempDF$Date2 <- as.Date(tempDF$Date2, format = "%m/%d/%Y")
tempDF
}
For moderate size DFs (i.e. 75,000 rows)
TestDF <- MakeRandomDF(75000)
system.time(test1 <- MakeDFSubInt(TestDF, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.06500006 secs"
user system elapsed
0.14 0.00 0.14
nrow(test1)
[1] 7618
head(test1)
ID CLM_ID Date1 Date2 Cost
1 1 116944 2010-01-29 2010-01-30 799.90 ## The range of dates for
2 1 515993 2010-02-15 2011-10-12 20836.83 ## each row are disjoint
3 1 408037 2011-12-13 2013-07-21 28149.26 ## as requested by the OP
4 1 20591 2013-07-25 2014-03-11 10449.51
5 1 338609 2014-04-24 2014-07-31 4219.48
6 1 628983 2014-08-03 2014-09-11 2170.93
For very large DFs (i.e. > 500,000 rows)
TestDF2 <- MakeRandomDF(500000)
system.time(test2 <- MakeDFSubInt(TestDF2, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.3679998 secs"
user system elapsed
1.19 0.03 1.21
nrow(test2)
[1] 154839
head(test2)
ID CLM_ID Date1 Date2 Cost
1 1 71251 2004-04-19 2004-06-29 2715.69 ## The range of dates for
2 1 601676 2004-07-05 2004-09-23 2675.04 ## each row are disjoint
3 1 794409 2004-12-28 2005-04-05 1760.63 ## as requested by the OP
4 1 424671 2005-06-03 2005-08-20 1973.67
5 1 390353 2005-09-16 2005-11-06 785.81
6 1 496611 2005-11-21 2005-11-24 904.09
TestDF3 <- MakeRandomDF(1000000)
system.time(test3 <- MakeDFSubInt(TestDF3, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.6930001 secs"
user system elapsed
2.68 0.08 2.79 ## 1 million rows in under 3 seconds!!!
nrow(test3)
[1] 413668
Explanation
The main part of the algorithm is generating the Contained logical vector that is used to determine the sub-intervals of continuous dates. Generation of this vector relies on the fact that the data frame is sorted, first by ID, second by Date1, and finally by Date2. We begin by locating the starting and ending rows of each group of IDs. For example, with the example provided by the OP we have:
myDF
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17 ## <- 1 s[1]
2 1 718184 2014-01-02 2014-01-08
3 1 885236 2014-01-15 2014-01-17
4 1 885362 2014-03-20 2014-03-21 ## <- 4 e[1]
5 2 589963 2015-03-18 2015-03-22 ## <- 5 s[2]
6 2 589999 2015-02-27 2015-05-09
7 2 594226 2015-04-11 2015-04-17
8 2 689959 2015-05-10 2015-06-10 ## <- 8 e[2]
9 3 656696 2016-05-01 2016-05-05 ## <- 9 s[3]
10 3 669625 2016-05-06 2016-05-22 ## <- 10 e[3]
11 4 777777 2015-02-21 2015-03-04 ## <- 11 s[4]
12 4 778952 2015-02-01 2015-02-28
13 4 778965 2015-03-01 2015-03-22 ## <- 13 e[4]
Below is the code that generates s and e.
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID) != 0)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
s
1 5 9 11
e
4 8 10 13
Now, we loop over each group and begin populating the logical vector Contained. If the date range for a particular row overlaps (or is a continuation of) the date range above it, we set that particular index of Contained to TRUE. This is why the first row in each group is set to FALSE, since there is nothing above to compare it to. As we do this, we update the largest date to compare against moving forward, hence the Compare variable. It should be noted that it isn't necessarily true that Date2[n] < Date2[n+1L]; this is why Compare <- max(Compare, dte2[x]) for a succession of TRUEs. The result for our example is given below.
ID CLM_ID Date1 Date2 Contained
1 1 718182 2014-01-01 2014-01-17 FALSE
2 1 718184 2014-01-02 2014-01-08 TRUE ## These two rows are contained
3 1 885236 2014-01-15 2014-01-17 TRUE ## in the date range 1/1 - 1/17
4 1 885362 2014-03-20 2014-03-21 FALSE ## This row isn't
6 2 589999 2015-02-27 2015-05-09 FALSE
5 2 589963 2015-03-18 2015-03-22 TRUE
7 2 594226 2015-04-11 2015-04-17 TRUE
8 2 689959 2015-05-10 2015-06-10 TRUE ## N.B. 5/10 is a continuance of 5/09
9 3 656696 2016-05-01 2016-05-05 FALSE
10 3 669625 2016-05-06 2016-05-22 TRUE
12 4 778952 2015-02-01 2015-02-28 FALSE
11 4 777777 2015-02-21 2015-03-04 TRUE
13 4 778965 2015-03-01 2015-03-22 TRUE
Now we can easily identify the "starting" rows by identifying all rows with a corresponding FALSE. After this, finding summary information is a breeze by simply calculating whatever you are interested in (e.g. max(Date2), sum(Cost)) over each succession of TRUEs and Voila!!
Here is a not-so-pretty solution comparing each row with the dates of all other rows. I corrected the one year 3015 to 2015. The results are different from what you are expecting, though. Either I misunderstood your question, or you misread the data.
Data:
dta <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
CLM_ID = c(718182L, 718184L, 885236L, 885362L, 589963L, 589999L, 594226L, 689959L, 656696L, 669625L, 777777L, 778952L, 778965L),
Date1 = structure(c(1L, 3L, 2L, 9L, 8L, 6L, 10L, 12L, 11L, 13L, 5L, 4L, 7L), .Label = c("1/1/2014", "1/15/2014", "1/2/2014", "2/1/2015", "2/21/2015", "2/27/2015", "3/1/2015", "3/18/2015", "3/20/2014", "4/11/2015", "5/1/2016", "5/10/2015", "5/6/2016"), class = "factor"),
Date2 = structure(c(1L, 2L, 1L, 4L, 5L, 10L, 7L, 11L, 9L, 8L, 6L, 3L, 5L), .Label = c("1/17/2014", "1/8/2014", "2/28/2015", "3/21/2014", "3/22/2015", "3/4/2015", "4/17/2015", "5/22/2016", "5/5/2016", "5/9/2015", "6/10/2015"), class = "factor")),
.Names = c("ID", "CLM_ID", "Date1", "Date2"), class = "data.frame",
row.names = c(NA, -13L))
Code:
dta$Date1 <- as.Date(dta$Date1, format = "%m/%d/%Y")
dta$Date2 <- as.Date(dta$Date2, format = "%m/%d/%Y")
# Boolean vector to memorize results
keep <- logical(length = nrow(dta))
for(i in 1:nrow(dta)) {
match <- dta[dta$ID == dta$ID[i] & dta$Date1 <= dta$Date1[i] & dta$Date2 >= dta$Date2[i], ]  # containment only counts within the same ID
if(nrow(match) == 1) keep[i] <- TRUE
}
# Result
dta[keep, ]
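Note that this scans the entire table once per row, so it is quadratic in the number of rows: fine for a few thousand records, but the non-equi join above should scale much better to the 75,000-row case.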