Consider the following problem: I have two dataframes cases and events. For every case there can be several events (1:n). events$caseId (foreign key) refers to cases$id (primary key):
cases events
------ ------
id id
date caseId
var1 date
var2 var1
... var2
...
Now, I'd like to create a new column in cases giving the number of events for every case (i.e. where events$caseId equals cases$id)
# Sample data
cases<-data.frame(id=1:5, date=c("2017-01-02","2016-02-03","2015-02-12","2016-01-03","2016-08-09"), var1=sample(c("A", "B"), 5, replace=T))
events<-data.frame(id=1:10, date=c("2017-01-01","2016-12-12","2016-07-04","2017-04-03","2015-02-13","2015-01-01","2013-05-07","2015-12-25","2016-05-04","2016-10-11"), caseId=c(1,1,1,1,3,3,3,4,5,5))
# Calculate the number of events for every caseId
library(tidyverse)
events %>%
count(caseId) %>%
right_join(cases, by = c("caseId" = "id"))
The problem I have is that I want to count only those events that happened after the date specified in the case-dataframe (i.e. events$date > cases$date).
I would appreciate any help.
You could do a non-equi join with the data.table package (the date-columns need to be converted to date-format first, see below under Data-heading):
library(data.table)
setDT(cases)
setDT(events)
cases[events, on = .(id = caseId, date < date), events := .N, by = .EACHI][]
The result:
id date var1 events
1: 1 2017-01-02 B 1
2: 2 2016-02-03 A NA
3: 3 2015-02-12 A 1
4: 4 2016-01-03 B NA
5: 5 2016-08-09 B 1
A variant of this:
cases[, events := events[cases, on = .(caseId = id, date > date), .N, by = .EACHI]$N][]
The result of that:
id date var1 events
1: 1 2017-01-02 A 1
2: 2 2016-02-03 A 0
3: 3 2015-02-12 A 1
4: 4 2016-01-03 A 0
5: 5 2016-08-09 A 1
Data:
cases <- data.frame(id=1:5, date=c("2017-01-02","2016-02-03","2015-02-12","2016-01-03","2016-08-09"),
var1=sample(c("A", "B"), 5, replace=T))
events <- data.frame(id=1:10,
date=c("2017-01-01","2016-12-12","2016-07-04","2017-04-03","2015-02-13","2015-01-01","2013-05-07","2015-12-25","2016-05-04","2016-10-11"),
caseId=c(1,1,1,1,3,3,3,4,5,5))
cases$date <- as.Date(cases$date)
events$date <- as.Date(events$date)
Related
Background
I've got an R dataframe d:
d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
event = c(1,1,0,0,1,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
stringsAsFactors=FALSE)
As you can see, it's got 3 distinct people in the ID column, and they've either had or not had an event, along with a date their event status was recorded (event_date).
The Problem
I'd like to create a new variable / column, event_within_interval, which assigns 1 to all the cells of a given ID if that ID has 2 or more event=1 within 180 days of their first event=1.
Let me explain further: both ID=a and ID=c have 2 or more events each, but only ID=c has their second event within 180 days of their first (so here, the 4/7/2013 - 3/14/2013 = 24 days for ID=c).
The problem is that I'm not sure how to tell R this idea of "if the second happens within 180 days of the first event=1".
What I'd like
Here's what I'm looking for:
want <- data.frame(ID = c("a","a","b","b","c","c","c"),
event = c(1,1,1,0,0,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
event_within_interval = c(0,0,0,0,1,1,1),
stringsAsFactors=FALSE)
What I've tried
I've only got the beginnings of an attempt thus far:
d <- d %>%
mutate(event_within_interval = ID %in% if_else(d$event == 1, 1, 0))
But this doesn't give me what I'd like, as you can tell if you run the code.
I've set the thing up as an if_else, but I'm not sure where to go from here.
UPDATE: I've edited both reproducible examples (what I've got and what I want) to emphasize the fact that the desired date interval needs to be between the first event and the second event, not the first event and the last event. (A couple of users submitted examples using last, which worked for the previous iteration of the reproducible example but wouldn't have worked on the real dataset.)
What about by packages lubridate and data.table?
library(data.table)
library(lubridate)
d <- data.frame(ID = c("a","a","b","b", "c","c"),
event = c(1,1,0,0,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07")),
stringsAsFactors=FALSE)
d <- data.table(d)
d <- d[, event_within_interval := 0]
timeInterval <- interval(start = "2013-03-14", end = "2013-04-07")
d <- d[event == 1 & event_date %within% timeInterval, event_within_interval := 1]
d
# ID event event_date event_within_interval
# 1: a 1 2011-01-01 0
# 2: a 1 2012-08-21 0
# 3: b 0 2011-12-23 0
# 4: b 0 2011-12-31 0
# 5: c 1 2013-03-14 1
# 6: c 1 2013-04-07 1
This is good fun.
Scenario 1
My approach would be to
group events by ID
Apply first condition check on two the span of days between current date and initial date
check if the sum of events is bigger or equal two: sum(event) >= 2
only if the two conditions are met I would return one for the event
For readability, I've returned values of conditions in the data as test_* variables.
d %>%
group_by(ID) %>%
mutate(test_interval = event_date - min(event_date) < 180,
test_sum_events = sum(event) >= 2,
event_within_interval = if_else(test_interval & test_sum_events,
1, 0)) %>%
ungroup()
Scenario 2
In this scenario, the data is sorted by event_date within ID and the difference between the first event and second event has to be under 180 days. Rest of events is ignored.
d %>%
group_by(ID) %>%
arrange(event_date) %>%
mutate(
# Check the difference between first event: min(event_date) and
# second event: event_date[2]
test_interval_first_two = event_date[2] - min(event_date) <= 180,
test_sum_events = sum(event) >= 2,
event_within_interval = if_else(
test_interval_first_two & test_sum_events, 1, 0)
) %>%
ungroup()
You can first group_by the ID column, so that we can calculate days within the same ID. Then in the condition in the if_else statement, use condition with sum() > 1 AND day difference <= 180.
Here I assume there's only two "events" or rows per ID.
library(dplyr)
d %>%
group_by(ID) %>%
mutate(event_within_interval = if_else(sum(event) > 1 & last(event_date) - first(event_date) <= 180, 1L, 0L))
# A tibble: 6 x 4
# Groups: ID [3]
ID event event_date event_within_interval
<chr> <dbl> <date> <int>
1 a 1 2011-01-01 0
2 a 1 2012-08-21 0
3 b 0 2011-12-23 0
4 b 0 2011-12-31 0
5 c 1 2013-03-14 1
6 c 1 2013-04-07 1
Here is how we could do it. In this example with an additional column interval to see the interval and then use an ifelse statement.
library(dpylr)
d %>%
group_by(ID) %>%
mutate(interval = last(event_date)- first(event_date),
event_within_interval = ifelse(event == 1 &
interval < 180, 1, 0))
ID event event_date interval event_within_interval
<chr> <dbl> <date> <drtn> <dbl>
1 a 1 2011-01-01 598 days 0
2 a 1 2012-08-21 598 days 0
3 b 0 2011-12-23 8 days 0
4 b 0 2011-12-31 8 days 0
5 c 1 2013-03-14 24 days 1
6 c 1 2013-04-07 24 days 1
I'm trying to wrap my head around how to use data.table::foverlaps() to generate new data tables. In one application, I would like to use foverlaps to identify gaps and then use this information to truncate my original data table.
Suppose that I have a dataset (df1) of 2 employees (id) at a company with date ranges (start_date and end_date) for the periods in which they work on different projects
(proj_id; either "A", "B" or "C").
library(data.table)
library(lubridate)
df1<-data.table(id = rep(1:2,each=3),
start_date = ymd(c("1998-04-03","1999-03-08","2000-08-13",
"2005-03-03","2007-10-12","2014-02-23")),
end_date = ymd(c("1999-03-07","2000-08-12","2021-04-23",
"2007-09-05","2014-02-22","2019-05-04")),
proj_id = c("A","B","A","B","C","A"))
> df1
id start_date end_date proj_id
1: 1 1998-04-03 1999-03-07 A
2: 1 1999-03-08 2000-08-12 B
3: 1 2000-08-13 2021-04-23 A
4: 2 2005-03-03 2007-09-05 B
5: 2 2007-10-12 2014-02-22 C
6: 2 2014-02-23 2019-05-04 A
Now I have another dataset (df2) that specifies the time that I want to truncate from df1.
df2 <- data.table(id = 1:2,
start_date = ymd("1998-07-20", "2006-06-12"),
end_date = ymd("1998-08-15", "2016-04-08"))
> df2
id start_date end_date
1: 1 1998-07-20 1998-08-15
2: 2 2006-06-12 2016-04-08
I can then use data.table::foverlaps() to identify the overlapping episodes:
> setkey(df1,id,start_date,end_date)
> foverlaps(df2, df1, type="any",
+ by.x=c("id","start_date","end_date"))
id start_date end_date proj_id i.start_date i.end_date
1: 1 1998-04-03 1999-03-07 A 1998-07-20 1998-08-15
2: 2 2005-03-03 2007-09-05 B 2006-06-12 2016-04-08
3: 2 2007-10-12 2014-02-22 C 2006-06-12 2016-04-08
4: 2 2014-02-23 2019-05-04 A 2006-06-12 2016-04-08
I would now like to use this data to generate a new version of df1, where I generate new episodes by truncating the gaps identified above. My desired DT is therefore:
id start_date end_date proj_id
1: 1 1998-04-03 1998-07-19 A
2: 1 1998-08-16 1999-03-07 A
3: 1 1999-03-08 2000-08-12 B
4: 1 2000-08-13 2021-04-23 A
5: 2 2005-03-03 2006-06-11 B
6: 2 2016-04-09 2019-05-04 A
```
There may be alternatives that work better, but this could work based on your foverlaps result.
Assume you created another data.table called df3 with your foverlaps result:
df3 <- foverlaps(df2, df1, type = "any", by.x = c("id", "start_date", "end_date"))
Then you could iterate through each row, and add 0, 1, or 2 date ranges depending on overlap (truncate at end, or beginning, or entire range is blocked out).
dt <- data.table(start_date = Date(), end_date = Date(), id = numeric(), proj_id = numeric())
for (i in seq_len(nrow(df3))) {
if (df3$start_date[i] < df3$i.start_date[i]) {
dt <- rbind(dt, data.table(start_date = df3$start_date[i], end_date = df3$i.start_date[i] - 1, id = df3$id[i], proj_id = df3$proj_id[i]))
}
if (df3$end_date[i] > df3$i.end_date[i]) {
dt <- rbind(dt, data.table(start_date = df3$i.end_date[i] + 1, end_date = df3$end_date[i], id = df3$id[i], proj_id = df3$proj_id[i]))
}
}
Finally, you can remove the foverlaps results from your initial df1 since new ranges have been determine for those (using fsetdiff). Then, you can add the new ranges back.
rbind(fsetdiff(df1, df3[,1:4]), dt)[order(id, start_date)]
Output
id start_date end_date proj_id
1: 1 1998-04-03 1998-07-19 A
2: 1 1998-08-16 1999-03-07 A
3: 1 1999-03-08 2000-08-12 B
4: 1 2000-08-13 2021-04-23 A
5: 2 2005-03-03 2006-06-11 B
6: 2 2016-04-09 2019-05-04 A
The dataframe I am working with has two columns: 1) person ID and 2) date. I am trying to assign numeric day values of date for each person.
For instance, person 1 has date from 2016-01-01 (baseline) to 2016-01-05 (last date for person 1). I want to create a day column that would translate this to 1, 2, 3, 4, 5. If person 2 has date from 2016-01-13 to 2016-01-16, the day column for person 2 would be 1, 2, 3, 4.
df <- for(i in length(unique(per1$date))){df$day[per1$date[1] + i] <- i+1}
This is basically what I am trying to do, but I get an error message saying:
"replacement has 17119 rows, data has 1670"
Please let me know how I can write the code for this. Thank you.
you can use this
library(data.table)
## Create Data
df <- data.table(personID = c(1,1,1,2,2,2,2),
Date = c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-13", "2016-01-14", "2016-01-15", "2016-01-16"))
## Order the data according to date, per user
df <- df[order(Date), .SD, by = personID]
## Rank the date, within each personID group
df <- df[, Day:= 1:.N, .(personID)]
df
personID Date Day
1: 1 2016-01-01 1
2: 1 2016-01-02 2
3: 1 2016-01-03 3
4: 2 2016-01-13 1
5: 2 2016-01-14 2
6: 2 2016-01-15 3
7: 2 2016-01-16 4
I have this data basically, but larger:
I want to count a number of distinct combinations of (customer_id, account_id) - that is, distinct or unique values based on two columns, but for each start_date. I can't find the solution anywhere. The result should be another column added to my data.table that should look like this:
That is, for each start_date, it calculates number of distinct values based on both customer_id and account_id.
For example, for start_date equal to 2.2.2018, I have distinct combinations in (customer_id,account_id) being (4,22) (5,38) and (6,13), so I want count to be equal to 3 because I have 3 distinct combinations. I also need the solution to work with character values in customer_id and account_id columns.
Code to replicate the data:
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.table(customer_id,account_id,start_date)
Another dplyr option:
library(dplyr)
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-
02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
group_by(start_date)%>%
mutate(distinct_values = n_distinct(customer_id, account_id)) %>%
ungroup()
dplyr option
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-
02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
group_by(start_date, customer_id, account_id) %>%
summarise(Total = 1) %>%
group_by(start_date) %>%
summarise(Count =n())
Here is a data.table option
data[, N := uniqueN(paste0(customer_id, account_id, "_")), by = start_date]
# customer_id account_id start_date N
# 1: 1 11 2017-01-01 4
# 2: 1 11 2017-01-01 4
# 3: 1 11 2017-01-01 4
# 4: 2 11 2017-01-01 4
# 5: 3 55 2017-01-01 4
# 6: 3 88 2017-01-01 4
# 7: 4 22 2018-02-02 3
# 8: 5 38 2018-02-02 3
# 9: 5 38 2018-02-02 3
#10: 6 13 2018-02-02 3
Or
data[, N := uniqueN(.SD, by = c("customer_id", "account_id")), by = start_date]
I would like to remove products from a dataframe that have overlapping start and end dates to avoid duplicates in a subsequent step.
Example data:
library(dplyr)
d <-
bind_rows(
data.frame(product = 1,
start_date = as.Date("2016-01-01"),
end_date = as.Date("2016-01-10"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-02"),
end_date = as.Date("2016-01-04"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-05"),
end_date = as.Date("2016-06-09"),
stringsAsFactors = FALSE),
data.frame(product = 2,
start_date = as.Date("2016-01-03"),
end_date = as.Date("2016-01-07"),
stringsAsFactors = FALSE)
)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-02 2016-01-04
3 1 2016-01-05 2016-06-09
4 2 2016-01-03 2016-01-07
From this example I would like to remove rows 2 and 3 because of the overlaps.
I've used the lag function to remove overlaps that are next to each other:
d_cleaned <-
d %>%
arrange(product, start_date, end_date) %>%
mutate(overlapping = product == lag(product) & start_date <= lag(end_date) & end_date >= lag(start_date)) %>% # define overlaps
mutate(overlapping = ifelse(is.na(overlapping), FALSE, overlapping)) %>% # dont delete the first row
filter(overlapping == FALSE) %>% # remove overlaps
select(-overlapping)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-05 2016-06-09
3 2 2016-01-03 2016-01-07
As can be seen above this step removes overlaps on consecutive rows but not all.
I can solve this with a loop, but I was hoping that someone might be able to suggest a non-looping solution as the dataframe is quite large and each step takes a while.
Using non-equi joins from the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
setDT(d) # convert 'd' to a data.table by reference
idx = d[d, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE]
d[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
# product start_date end_date
# 1: 1 2016-01-01 2016-01-10
# 2: 1 2016-06-10 2016-06-12
# 3: 2 2016-01-03 2016-01-07
For each row in d (the one inside the square bracket), we find any kind of overlap with d (on the outside), i.e., a self-join, based on the condition provided to the on argument, and we extract the index of the first overlap (because which=TRUE and mult="first").
If and only if the first overlap is with itself, we return them. We discard all other intervals.
To install devel version, see installation instructions here.
Here's a benchmark on slightly more rows (the data is not by any means large):
set.seed(1L)
require(data.table) # v1.9.7+
dates = as.Date(sample(16000:17000, 1e5, TRUE), origin="1970-01-01")
dt = data.table(product=sample(100, 1e5, TRUE),
start_date = sample(dates, 1e5, TRUE),
end_date = sample(dates, 1e5, TRUE))
dt[, `:=`(start_date = pmin(start_date, end_date),
end_date = pmax(start_date, end_date))]
system.time({
idx = dt[dt, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE, verbose=TRUE]
ans = dt[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
})
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0.041 secs
# Recomputing forder with non-equi ids ... done in 0.005 secs
# Found 178 non-equi group(s) ...
# Starting bmerge ...done in 2.359 secs
# user system elapsed
# 2.402 0.011 2.421
head(ans)
# product start_date end_date
# 1: 71 2015-12-04 2016-03-22
# 2: 71 2014-04-12 2015-05-01
# 3: 32 2013-11-23 2015-03-18
# 4: 56 2014-07-29 2015-12-26
# 5: 88 2015-03-08 2015-03-21
# 6: 69 2014-10-31 2015-07-05
nrow(ans)
# [1] 186
I believe the following will work
d <- cbind(ID=1:nrow(d),d)
d_cleaned <- d[rep(1:nrow(d), times=nrow(d)),] %>% ## 1
setNames(.,paste0(names(.),"_other")) %>% ## 2
bind_cols(d[rep(1:nrow(d), each=nrow(d)),], .) %>% ## 3
arrange(product,start_date,end_date) %>% ## 4
filter(product == product_other) %>% ## 5
mutate(overlapping = ID_other < ID &
start_date <= end_date_other &
end_date >= start_date_other) %>% ## 6
group_by(ID) %>%
filter(all(overlapping==FALSE)) %>% ## 7
ungroup() %>%
select(product,start_date,end_date) %>%
distinct())
print(d_cleaned)
### A tibble: 2 x 3
## product start_date end_date
## <dbl> <date> <date>
##1 1 2016-01-01 2016-01-10
##2 2 2016-01-03 2016-01-07
First, add a column of IDs that identifies the rows of the data frame to group_by later to determine if there is overlap with any other row. The key is to be able to consider all distinct pairs of rows with the same product in testing for overlap. The above code does this by expanding the data as in an outer-join. Specifically,
Replicate d nrow(d) times
Change the names of the columns by appending _other to them so that they can be referenced separately from the original column names in the overlap test
Replicate each row of d nrow(d) times and append the result from (2) as new columns
The result of (3) have rows that enumerates all pairs of rows from the original data frame. Then:
Sort them as you did.
Consider only pairs where the product matches. Do this first to minimize not needed comparisons later
Do the overlap test. Here comparison is only made with respect to the previous rows in the original data frame. This has the effect of considering all lags and preserving the row itself (i.e., all rows overlap with itself)
Group by the ID (each row in original data frame) and keep those for which all overlapping is FALSE
At this point, the result contains only the non-overlapping rows in the original data frame. However, there are all those extra columns, and there are duplicates where multiple rows overlap with a row. The rest of the code cleans that up.
I have tested it with the following data (augmented yours to add a few more test conditions, but far from exhaustive):
d <- structure(list(product = c(1, 1, 1, 1, 1, 2, 2), start_date = structure(c(16801,
16802, 16805, 16811, 16962, 16803, 16806), class = "Date"), end_date = structure(c(16810,
16804, 16961, 16961, 16964, 16807, 16810), class = "Date")), .Names = c("product",
"start_date", "end_date"), row.names = c(NA, -7L), class = "data.frame")
and got the following results:
# A tibble: 3 x 3
product start_date end_date
<dbl> <date> <date>
1 1 2016-01-01 2016-01-10
2 1 2016-06-10 2016-06-12
3 2 2016-01-03 2016-01-07
Hope this helps.