I have two data tables with millions of rows where pairs of IDs have partially overlapping date ranges. Please see a very short example below:
library(data.table)
dt1 <- data.table(ID=720,
startdate=as.IDate("2000-01-01"),
enddate=as.IDate("2017-10-09"))
dt2 <- data.table(ID=720,
startdate=as.IDate("2000-06-08"),
enddate=as.IDate("2020-04-12"))
I would like to find the overlapping period of time between the two datasets. I am attempting to do so using foverlaps:
setkey(dt1, ID, startdate, enddate)
setkey(dt2, ID, startdate, enddate)
foverlaps(dt1, dt2, by.x=c("ID", "startdate", "enddate"),
by.y=c("ID", "startdate", "enddate"), type='within', nomatch = 0L)
Empty data.table (0 rows and 5 cols): ID,startdate,enddate,i.startdate,i.enddate
The code above returns an empty data table, because the date range in dt1 is not completely within the date range in dt2.
However, I was expecting a data table with whatever date range is common for the two datasets, which would be:
ID startdate enddate
1: 720 2000-06-08 2017-10-09
Is there any way to achieve that using foverlaps? If not, is there any alternative that would work just as fast for millions of rows?
I think you first need to change type = 'within' to type = 'any', because within means the date range in dt1 must sit entirely within the date range in dt2.
After that, you need to work out the overlapping date range yourself (which is pretty straightforward), as foverlaps just does the join.
library(data.table)
dt1 <- data.table(ID=720,
startdate=as.IDate("2000-01-01"),
enddate=as.IDate("2017-10-09"))
dt2 <- data.table(ID=720,
startdate=as.IDate("2000-06-08"),
enddate=as.IDate("2020-04-12"))
setkey(dt1, ID, startdate, enddate)
setkey(dt2, ID, startdate, enddate)
result <- foverlaps(dt1, dt2, by.x=c("ID", "startdate", "enddate"),
by.y=c("ID", "startdate", "enddate"), type='any', nomatch = 0L)
result
#> ID startdate enddate i.startdate i.enddate
#> 1: 720 2000-06-08 2020-04-12 2000-01-01 2017-10-09
result[,`:=`(overlapping_start=fifelse(i.startdate>=startdate,i.startdate,startdate),
overlapping_end = fifelse(i.enddate<=enddate,i.enddate,enddate))]
result[,.(ID,overlapping_start,overlapping_end)]
#> ID overlapping_start overlapping_end
#> 1: 720 2000-06-08 2017-10-09
Created on 2020-04-19 by the reprex package (v0.3.0)
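For what it's worth, a minimal alternative sketch (my own addition, assuming data.table >= 1.9.8 for non-equi joins and the x./i. prefixes): the common window can also be computed in a single non-equi join with pmax()/pmin(), without setting keys:
library(data.table)
# same toy data as above
dt1 <- data.table(ID = 720, startdate = as.IDate("2000-01-01"), enddate = as.IDate("2017-10-09"))
dt2 <- data.table(ID = 720, startdate = as.IDate("2000-06-08"), enddate = as.IDate("2020-04-12"))
# keep ID pairs whose ranges overlap at all, then take the later start and the earlier end
dt1[dt2,
    on = .(ID, startdate <= enddate, enddate >= startdate),  # 'any overlap' condition
    nomatch = 0L,
    .(ID,
      overlap_start = pmax(x.startdate, i.startdate),
      overlap_end   = pmin(x.enddate, i.enddate))]
#>     ID overlap_start overlap_end
#> 1: 720    2000-06-08  2017-10-09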
I have two data frames and I'm trying to left or right join them based on two columns, ID and Datetime. How do I allow the Datetime from one data frame to match the other even if it's within a 10-20 second difference?
df1:
ID    Datetime
123   2021-04-02 09:50:11
456   2021-04-02 09:50:15
df2:
ID    Datetime
123   2021-04-02 09:50:31
456   2021-04-02 09:50:23
If the times in df2 are within a 10-20 second difference, return all the columns plus the Datetime column from df2 into a new data frame, df3, for all IDs and yyyy-mm-dd HH:MM values that match in both data frames. So if the difference in seconds on df2 is between 10 and 20, pick it and do the join; if it's not within the 10-20 second range, skip it. Can someone please help?
Your sample data is very minimal, and I'm not sure how you wanted to implement the 10-20 seconds, so I assumed everything within -20 to +20 seconds should be matched. This can easily be adjusted in the filtering part: ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20).
Here is a data.table approach
library(data.table)
# Sample data
DT1 <- fread("ID Datetime
123 2021-04-02T09:50:11
456 2021-04-02T09:50:15")
DT2 <- fread("ID Datetime
123 2021-04-02T09:50:31
456 2021-04-02T09:50:23")
# Set datetimes to posix
DT1[, Datetime := as.POSIXct(Datetime)]
DT2[, Datetime := as.POSIXct(Datetime)]
# possible rowwise approach
DT1[, rowid := .I]
setkey(DT1, rowid)
DT1[DT1, Datetime2 := DT2[ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20),
lapply(.SD, paste0, collapse = ";"), .SDcols = c("Datetime")],
by = .EACHI][, rowid := NULL][]
# ID Datetime Datetime2
# 1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
# 2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23
If I understand correctly, the OP wants to retrieve those rows of df2 (including all columns) which have a matching ID in df1 and where the time difference between the Datetime stamps of df1 and df2 is less than or equal to a given value.
So, for the given sample data:
If the allowed time difference is at most 20 seconds, both rows of df2 are returned.
If the allowed time difference is at most 10 seconds, only the second row of df2 with ID == 456 is returned.
If the allowed time difference is at most 5 seconds, an empty dataset is returned because none of df2's rows fulfils the conditions.
One possible approach is to use a non-equi join which is available with data.table:
library(data.table)
timediff <- 10 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[df1, on = c("ID", "from <= Datetime", "to >= Datetime"),
nomatch = NULL, .SD][
, c("from", "to") := NULL][]
df3
ID Datetime
1: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 20
the result is
df3
ID Datetime
1: 123 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 5
df3 becomes an empty data.table.
EDIT: Show Datetime from df1 and df2
By request of the OP, here is a version which returns the Datetime columns from both df1 and df2, renamed Datetime1 and Datetime2, respectively:
library(data.table)
timediff <- 20 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[setDT(df1), on = c("ID", "from <= Datetime", "to >= Datetime"),
nomatch = NULL, .(ID, Datetime1 = i.Datetime, Datetime2 = x.Datetime)]
df3
ID Datetime1 Datetime2
1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23
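For completeness, a sketch of the same tolerance join done with foverlaps(), treating each df1 timestamp as a zero-width interval (the data and the symmetric 20-second window follow the example above; the helper columns start/end and the variable tol are my own):
library(data.table)
tol <- 20  # allowed difference in seconds (assumed symmetric window)
DT1 <- data.table(ID = c(123L, 456L),
                  Datetime = as.POSIXct(c("2021-04-02 09:50:11", "2021-04-02 09:50:15")))
DT2 <- data.table(ID = c(123L, 456L),
                  Datetime = as.POSIXct(c("2021-04-02 09:50:31", "2021-04-02 09:50:23")))
# foverlaps needs explicit interval columns in both tables
DT1[, `:=`(start = Datetime, end = Datetime)]              # zero-width interval
DT2[, `:=`(start = Datetime - tol, end = Datetime + tol)]  # tolerance window
setkey(DT2, ID, start, end)
res <- foverlaps(DT1, DT2, by.x = c("ID", "start", "end"), type = "within", nomatch = 0L)
res[, .(ID, Datetime1 = i.Datetime, Datetime2 = Datetime)]
#     ID           Datetime1           Datetime2
# 1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
# 2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23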
I currently solve this with a workaround, but I would like to know if there is a more efficient way.
See below for exemplary data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(Date_2, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling. I want to get the most recent preceding data from dt (aka lookup table) into dt_fill. I do this with an easy 1-line rolling join like this.
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end up with a bunch of rows where the newly added columns are filled with NAs, which I would like to get rid of. I do this with a semi-join. I found the outdated solutions I came across quite hard to understand and settled on the batchtools::sjoin() function, which is essentially also a one-liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of getting a clean result from a rolling join than doing the rolling join first and then a semi-join with the original dataset? It is also not very fast for very long data sets. Thanks!
Essentially, there are two approaches that I find to be viable solutions.
Solution 1
First, proposed by lil_barnacle, is an elegant one-liner that reads as follows:
# Rolling join with nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Adding the nomatch argument and setting it to 0 (nomatch = 0) is equivalent to doing the rolling join first and the semi-join thereafter.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
Solution 2
Second, the solution that I came up with was to 'align' both data sets before the rolling join by filtering on the 'joined' variable, like so:
# Aligning data sets by filtering accd. to joined 'variable'
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
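As a quick sanity check (a sketch only, assuming dt and dt_fill as built in the question with the corrected data.frame(Date_2, random, id_2), and before the filtering above), the two approaches can be compared with fsetequal():
# Solution 1: nomatch = 0 directly in the rolling join
res1 <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch = 0]
# Solution 2: pre-filter dt_fill to ids present in dt, then roll
dt_fill2 <- dt_fill[id_2 %in% dt[, unique(id)]]
res2 <- dt[dt_fill2, on = .(id = id_2, Date = Date_2), roll = TRUE]
fsetequal(res1, res2)
# [1] TRUE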
When doing a non-equi inner join, does the order of X[Y] vs Y[X] matter? I am under the impression that it should not.
library(data.table) #data.table_1.12.2
dt1 <- data.table(ID=LETTERS[1:4], TIME=2L:5L)
cols1 <- names(dt1)
dt2 <- data.table(ID=c("A", "B"), START=c(1L, 20L), END=c(3L, 30L))
cols2 <- names(dt2)
> dt1
ID TIME
1: A 2
2: B 3
3: C 4
4: D 5
> dt2
ID START END
1: A 1 3
2: B 20 30
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END. Desired output:
ID TIME
1: A 2
Since I wanted rows from dt1, I started by using dt1 as i in data.table[, but I either get columns from dt2 or encounter errors:
#no error but using x. values
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L]
#error for the rest
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, mget(paste0("i.", cols1))]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .SD]
dt2[dt1, on=.(ID, START<TIME, END>TIME), nomatch=0L, .(START)]
Error message:
Error in [.data.table(dt2, dt1, on = .(ID, START < TIME, END > TIME), : column(s) not found: START
So I had to use dt2 as the i as a workaround:
#need to type out all the columns:
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L, .(ID, TIME=x.TIME)]
#using setNames
dt1[dt2, on=.(ID, TIME>START, TIME<END), nomatch=0L,
setNames(mget(paste0("x.", cols1)), cols1)]
Or is this a simple case of my misunderstanding?
References:
Confusion arose while answering: r compare two data.tables by row
https://github.com/Rdatatable/data.table/issues/1700
https://github.com/Rdatatable/data.table/issues/1807
https://github.com/Rdatatable/data.table/pull/2706
https://github.com/Rdatatable/data.table/pull/3093
I am trying to filter for rows in dt1 such that 1) ID matches and 2) dt1$TIME lies between dt2$START and dt2$END.
That sounds like a semi join: Perform a semi-join with data.table
dt1[
dt1[dt2, on=.(ID, TIME >= START, TIME <= END), nomatch=0, which=TRUE]
]
# ID TIME
# 1: A 2
If it's possible that multiple rows of dt2 will match rows of dt1, then the "which" output can be wrapped in unique() as in the linked answer.
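For illustration, a small sketch with a made-up second window for ID A that also covers TIME = 2, so that row 1 of dt1 matches twice:
dt1 <- data.table(ID = LETTERS[1:4], TIME = 2L:5L)
dt2 <- data.table(ID = c("A", "A", "B"), START = c(1L, 0L, 20L), END = c(3L, 10L, 30L))
idx <- dt1[dt2, on = .(ID, TIME >= START, TIME <= END), nomatch = 0, which = TRUE]
idx
# [1] 1 1
dt1[unique(idx)]
#    ID TIME
# 1:  A    2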
There are a couple of linked feature requests for a more convenient way to do this: https://github.com/Rdatatable/data.table/issues/2158
I am trying to achieve the 'Final.Data' output shown below.
We start with the Reference data and I want to add the 'Add.Data' but join on the 'Person' and return the most recent result prior to the reference (date).
I am looking for dplyr, data.table or sql solutions in r.
I then want to be able to reproduce this for thousands of entries, so I am looking for a reasonably efficient solution.
library(tibble)
Reference.Data <- tibble(Person = "John",
Date = "2019-07-10")
Add.Data <- tibble(Person = "John",
Order.Date = c("2019-07-09","2019-07-08") ,
Order = 1:2)
Final.Data <- tibble(Person = "John",
Date = "2019-07-10",
Order.Date = "2019-07-09",
Order = 1)
A rolling join to the nearest preceding date should work pretty fast:
#data preparation:
# convert to data.tables, set dates as 'real' dates
DT1 <- setDT(Reference.Data)[, Date := as.IDate( Date )]
DT2 <- setDT(Add.Data)[, Order.Date := as.IDate( Order.Date )]
#set keys (this also orders the dates, convenient for the join later)
setkey(DT1, Person, Date)
setkey(DT2, Person, Order.Date)
#perform rolling update join on DT1
DT1[ DT2, `:=`( Order.date = i.Order.Date, Order = i.Order), roll = -Inf][]
# Person Date Order.date Order
# 1: John 2019-07-10 2019-07-09 1
An approach using data.table non-equi join and update by reference directly on Reference.Data:
library(data.table)
setDT(Add.Data)
setDT(Reference.Data)
setorder(Add.Data, Person, Order.Date)
Reference.Data[, (names(Add.Data)) :=
Add.Data[.SD, on=.(Person, Order.Date<Date), mult="last",
mget(paste0("x.", names(Add.Data)))]
]
output:
Person Date Order.Date Order
1: John 2019-07-10 2019-07-09 1
Another data.table solution:
setDT(Add.Data)[, Order.Date := as.Date(Order.Date)]
setDT(Reference.Data)[, Date := as.Date(Date)]
Reference.Data[, c("Order.Date", "Order") := Add.Data[.SD,
on = .(Person, Order.Date = Date),
roll = TRUE,
.(x.Order.Date, x.Order)]]
Reference.Data
# Person Date Order.Date Order
# 1: John 2019-07-10 2019-07-09 1
We can do an inner_join, then group by 'Person' and slice the row with the max 'Order.Date':
library(tidyverse)
inner_join(Add.Data, Reference.Data) %>%
group_by(Person) %>%
slice(which.max(as.Date(Order.Date)))
# A tibble: 1 x 4
# Groups: Person [1]
# Person Order.Date Order Date
# <chr> <chr> <int> <chr>
#1 John 2019-07-09 1 2019-07-10
Or using data.table:
library(data.table)
setDT(Add.Data)[as.data.table(Reference.Data), on = .(Person)][,
.SD[which.max(as.Date(Order.Date))], by = Person]
Left join the Reference.Data to the Add.Data joining on Person and on Order.Date being at or before Date. Group that by the original Reference.Data rows and take the maximum Order.Date from those. The way it works is that the Add.Data row that is used for each row of Reference.Data will be the one with the maximum Order.Date so the correct Order will be shown.
Note that the dot is an SQL operator and order is an SQL keyword, so we must surround names containing a dot, as well as the name order (regardless of case), with square brackets.
library(sqldf)
sqldf("select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
from [Reference.Data] as r
left join [Add.Data] as a on r.Person = a.Person and a.[Order.Date] <= r.Date
group by r.rowid")
giving:
Person Date Order.Date Order
1 John 2019-07-10 2019-07-09 1
I haven't checked how fast this is (adding indexes could speed it up if need be), but with only a few thousand rows efficiency is likely not as important as readability.
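For reference, a sketch of what adding an index could look like, following the create-index pattern from the sqldf FAQ (the index name idx_add is made up; the main. prefix refers to the already-loaded, indexed copy of the table so it is not re-read):
library(sqldf)
sqldf(c(
  "create index idx_add on [Add.Data](Person, [Order.Date])",
  "select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
     from [Reference.Data] as r
     left join main.[Add.Data] as a
       on r.Person = a.Person and a.[Order.Date] <= r.Date
    group by r.rowid"
))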
Suppose I've several intervals which are subset of real line as follows:
I_1 = [0, 1]
I_2 = [1.5, 2]
I_3 = [5, 9]
I_4 = [13, 16]
Now given a real number x = 6.4, say, I'd like to find which interval contains the number x. I would like to know the algorithm to find this interval, and/or how to do this in R.
Thanks in advance.
Update using non-equi joins:
This is much simpler and straightforward using the new non-equi joins feature in the current development version of data.table, v1.9.7:
require(data.table) # v1.9.7+
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[.(x=4.5), on=.(start<=x, end>=x), which=TRUE]
# [1] 7
No need to set keys or create indices.
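Applied to the four intervals from the question (a small sketch using the same idiom), x = 6.4 is found in the third interval:
library(data.table)
DT <- data.table(start = c(0, 1.5, 5, 13), end = c(1, 2, 9, 16))
DT[.(x = 6.4), on = .(start <= x, end >= x), which = TRUE]
# [1] 3
That is, I_3 = [5, 9] contains 6.4.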
Old solution using foverlaps:
One way would be to use interval/overlap joins using the data.table package:
require(data.table) ## 1.9.4+
DT1 = data.table(start=c(0,1.5,5,13), end=c(1,2,9,16))
DT2 = data.table(start=6.4, end=6.4)
setkey(DT1)
foverlaps(DT2, DT1, which=TRUE, type="within")
# xid yid
# 1: 1 3
This efficiently checks whether each interval in DT2 lies completely within any interval in DT1. In your case DT2 is a point, not an interval. If it did not fall within any interval in DT1, it would return NA.
Have a look at ?foverlaps to check out the other arguments you can use. For example, the mult= argument controls whether you want to return all the matching rows or just the first or last, etc.
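For instance, two small sketches using the DT1 and DT2 defined just above: a point that falls in no interval returns NA for yid (the default nomatch), and mult = "first" returns only the first match per row of x:
# point outside every interval: yid is NA
foverlaps(data.table(start = 3, end = 3), DT1, type = "within", which = TRUE)
#    xid yid
# 1:   1  NA
# keep only the first matching interval per row of DT2
foverlaps(DT2, DT1, type = "within", mult = "first", which = TRUE)
# [1] 3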
Since setkey sorts the data, you'll have to add a separate id column to keep track of the original row numbers, as follows:
DT1 = data.table(start=c(0,1.5,5,1,2,3,4,5), end=c(1,2,9,2,3,4,5,6))
DT1[, id := .I] # .I is a special variable. See ?data.table
setkey(DT1, start, end)
DT2 = data.table(start=4.5 ,end=4.5)
olaps = foverlaps(DT2, DT1, type="within", which=TRUE)
olaps[, yid := DT1$id[yid]]
# xid yid
# 1: 1 7