Join two data frames on two columns [one of them a datetime column] in R

I have two data frames and I'm trying to left or right join them on two columns: ID and Datetime. How do I allow the Datetime from the second df to match the first df even if it is within a 10-20 second difference range?
df1:
ID    Datetime
123   2021-04-02 09:50:11
456   2021-04-02 09:50:15
df2:
ID    Datetime
123   2021-04-02 09:50:31
456   2021-04-02 09:50:23
If the times in df2 are within a 10-20 second difference, return all the columns plus the Datetime column from df2 into a new df3, for every row where the ID and the yyyy-mm-dd HH:MM part of the timestamp match in both dfs. So if the difference in the :SS part is between 10 and 20 seconds, pick that row of df2 and join it; if it is not within the 10-20 second range, skip it. Can someone please help?
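For reference, the question's sample data can be reconstructed as R data frames like this (a minimal sketch; the IDs and timestamps are taken from the tables above, everything else is plain base R):
# minimal reconstruction of the sample data shown above
df1 <- data.frame(ID = c(123, 456),
                  Datetime = as.POSIXct(c("2021-04-02 09:50:11", "2021-04-02 09:50:15")))
df2 <- data.frame(ID = c(123, 456),
                  Datetime = as.POSIXct(c("2021-04-02 09:50:31", "2021-04-02 09:50:23")))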

Your sample data is very minimalistic, and I'm not sure how you wanted to implement the 10-20 secs; I assumed everything within -20 to +20 seconds should be matched. This can easily be adjusted in the filtering part ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20).
Here is a data.table approach
library(data.table)
# Sample data
DT1 <- fread("ID Datetime
123 2021-04-02T09:50:11
456 2021-04-02T09:50:15")
DT2 <- fread("ID Datetime
123 2021-04-02T09:50:31
456 2021-04-02T09:50:23")
# Set datetimes to posix
DT1[, Datetime := as.POSIXct(Datetime)]
DT2[, Datetime := as.POSIXct(Datetime)]
# possible rowwise approach
DT1[, rowid := .I]
setkey(DT1, rowid)
DT1[DT1, Datetime2 := DT2[ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20),
                          lapply(.SD, paste0, collapse = ";"), .SDcols = c("Datetime")],
    by = .EACHI][, rowid := NULL][]
# ID Datetime Datetime2
# 1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
# 2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23

If I understand correctly, the OP wants to retrieve those rows of df2 (including all columns) which have a matching ID in df1 and where the time difference between the Datetime stamps of df1 and df2 is less than or equal to a given value.
So, for the given sample data:
If the allowed time difference is at most 20 seconds, both rows of df2 are returned.
If the allowed time difference is at most 10 seconds, only the second row of df2 (ID == 456) is returned.
If the allowed time difference is at most 5 seconds, an empty dataset is returned because none of df2's rows fulfils the conditions.
(A quick check of these differences is sketched right below.)
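As a quick plausibility check, the per-ID time differences of the sample rows can be computed directly (a small sketch, assuming df1 and df2 hold the sample data with Datetime already converted to POSIXct and the rows aligned by ID as shown):
# absolute time differences between the corresponding rows of df1 and df2
abs(difftime(df2$Datetime, df1$Datetime, units = "secs"))
# Time differences in secs
# [1] 20  8
# so: a 20 s limit matches both rows, 10 s matches only ID 456, 5 s matches none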
One possible approach is to use a non-equi join which is available with data.table:
library(data.table)
timediff <- 10 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[df1, on = c("ID", "from <= Datetime", "to >= Datetime"),
           nomatch = NULL, .SD][
  , c("from", "to") := NULL][]
df3
ID Datetime
1: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 20
the result is
df3
ID Datetime
1: 123 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 5
df3 becomes an empty data.table.
EDIT: Show Datetime from df1 and df2
By request of the OP, here is a version which returns both Datetime columns from df1 and df2, renamed as Datetime1 and Datetime2, respectively:
library(data.table)
timediff <- 20 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[setDT(df1), on = c("ID", "from <= Datetime", "to >= Datetime"),
nomatch = NULL, .(ID, Datetime1 = i.Datetime, Datetime2 = x.Datetime)]
df3
ID Datetime1 Datetime2
1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23

Related

Sum variable between dates in R?

How do you sum a value that occurs between two dates in R?
For example, I have two data tables, df1 has start and end dates, df2 has values corresponding to certain dates between the start and end dates in df1. I would like to sum the values in df2 between each Start and End date in df1 and record that information in df1.
df1 <- data.frame(Start = c('1/1/20', '5/1/20', '10/1/20', '2/2/21', '3/20/21'),
                  End = c('1/7/20', '5/7/20', '10/7/20', '2/7/21', '3/30/21'))
df2 <- data.frame(Date = c('1/1/20', '1/3/20', '5/1/20', '5/2/20', '6/2/20',
                           '6/4/20', '10/1/20', '2/2/21', '3/20/21'),
                  value = c('1', '2', '5', '15', '20', '2', '3', '78', '100'))
I have tried following the example at the link below, which shows how to count between two dates in R, but I am struggling to apply it to sum: Sum/count between two dates in R
Thank you!
We can use a non-equi join in data.table after converting the date columns to Date class
library(data.table)
library(lubridate)
setDT(df1)[df2, value := sum(value),
on = .(Start <= Date, End >= Date), by = .EACHI]
-output
df1
# Start End value
#1: 2020-01-01 2020-01-07 2
#2: 2020-05-01 2020-05-07 15
#3: 2020-10-01 2020-10-07 3
#4: 2021-02-02 2021-02-07 78
#5: 2021-03-20 2021-03-30 100
data
df1[] <- lapply(df1, mdy)
df2$Date <- mdy(df2$Date)
df2$value <- as.numeric(df2$value)

R - unexpected result using foverlaps with dates

I have two data tables with millions of rows containing pairs of IDs with partially overlapping date ranges. Please see a very short example below:
library(data.table)
dt1 <- data.table(ID=720,
startdate=as.IDate("2000-01-01"),
enddate=as.IDate("2017-10-09"))
dt2 <- data.table(ID=720,
startdate=as.IDate("2000-06-08"),
enddate=as.IDate("2020-04-12"))
I would like to find the overlapping period of time between the two datasets. I am attempting to do so using foverlaps:
setkey(dt1, ID, startdate, enddate)
setkey(dt2, ID, startdate, enddate)
foverlaps(dt1, dt2, by.x=c("ID", "startdate", "enddate"),
by.y=c("ID", "startdate", "enddate"), type='within', nomatch = 0L)
Empty data.table (0 rows and 5 cols): ID,startdate,enddate,i.startdate,i.enddate
The code above returns an empty data table, because the date range in dt1 is not completely within the date range in dt2.
However, I was expecting a data table with whatever date range is common for the two datasets, which would be:
ID startdate enddate
1: 720 2000-06-08 2017-10-09
Is there any way to achieve that using foverlaps? If not, is there any alternative that would work just as fast for millions of rows?
I think you first need to change type='within' to type='any', as 'within' means the date range in dt1 must sit entirely within the one in dt2.
After that, you may need to find the overlapping date range yourself (which is pretty straightforward), as foverlaps just does the join.
library(data.table)
dt1 <- data.table(ID=720,
startdate=as.IDate("2000-01-01"),
enddate=as.IDate("2017-10-09"))
dt2 <- data.table(ID=720,
startdate=as.IDate("2000-06-08"),
enddate=as.IDate("2020-04-12"))
setkey(dt1, ID, startdate, enddate)
setkey(dt2, ID, startdate, enddate)
result <- foverlaps(dt1, dt2, by.x=c("ID", "startdate", "enddate"),
by.y=c("ID", "startdate", "enddate"), type='any', nomatch = 0L)
result
#> ID startdate enddate i.startdate i.enddate
#> 1: 720 2000-06-08 2020-04-12 2000-01-01 2017-10-09
result[,`:=`(overlapping_start=fifelse(i.startdate>=startdate,i.startdate,startdate),
overlapping_end = fifelse(i.enddate<=enddate,i.enddate,enddate))]
result[,.(ID,overlapping_start,overlapping_end)]
#> ID overlapping_start overlapping_end
#> 1: 720 2000-06-08 2017-10-09
Created on 2020-04-19 by the reprex package (v0.3.0)
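As a side note on the "any alternative" part of the question, the overlap can also be computed in a single step with a data.table non-equi join instead of foverlaps (a sketch only, assuming dt1 and dt2 as defined above; the column names overlap_start/overlap_end are made up here):
# non-equi join: keep pairs whose ranges overlap, then clip to the common range
dt2[dt1, on = .(ID, startdate <= enddate, enddate >= startdate), nomatch = 0L,
    .(ID,
      overlap_start = fifelse(x.startdate >= i.startdate, x.startdate, i.startdate),
      overlap_end   = fifelse(x.enddate   <= i.enddate,   x.enddate,   i.enddate))]
# for the sample data this should give:
#    ID overlap_start overlap_end
# 1: 720    2000-06-08  2017-10-09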

subsetting ID and dates in data.table R

I have a large data.table similar to the following example (the real one has 70 columns and millions of rows):
a <- seq(as.IDate("2011-12-30"), as.IDate("2014-01-04"), by="days")
data <- data.table(ID = 1:length(a), date1 = a)
I want to extract all the rows of data whose ID appears in IDs; IDs contains the ID of each individual and the date range that I need to extract for that individual. An individual can have multiple rows.
a <- seq(as.IDate("2011-12-30"), as.IDate("2014-01-04"), by="week")
b <- seq(as.IDate("2012-01-01"), as.IDate("2014-01-06"), by="week")
IDs <- data.table(ID = 1:length(a), date1 = a, date2 = b)
Currently, my solution is not very fast; what would be better?
A <- list()
for (i in 1:dim(IDs)[1]) {
  A[[i]] <- data[ID == IDs[i, ID] & (date1 %between% IDs[i, .(date1, date2)]), ]
}
I think you are looking for a non-equi inner join:
IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, .(ID, date1=i.date1)]
Or, alternatively,
data[IDs, on=.(ID, date1>=date1, date1<=date2), nomatch=0L, .(ID, date1=x.date1)]
Or viewing it as a non-equi semi-join:
data[IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, which=TRUE]]
output:
ID date1
1: 1 2011-12-30

dplyr into data.table: filter > group by > count

I usually work with dplyr but am facing a rather large data set, and my approach is very slow. I basically need to filter a df, group it by dates, and count the occurrences within.
Sample data (I have already turned everything into a data.table):
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)
df2 <- data.table(id = sample(1:10, 5, replace = T),
start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
end = df$startmonth + sample(10:90,5, replace = T)
)
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow
res %>% filter(start <=endmonth & end>= startmonth) %>%
group_by(startmonth,endmonth) %>%
summarise(countmonth=n())
My data.table knowledge is limited, but I guess we would setkey() on the date columns and use something like res[, `:=`(COUNT = .N, IDX = 1:.N), by = .(startmonth, endmonth)] to get the counts by group; I'm just not sure how the filter fits in there.
Appreciate your help!
You could do the counting inside the join:
df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]
start end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3
or add it as a new column in df:
df[, n :=
df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]
How it works: the syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up values in x. The symbol .EACHI indicates that the aggregation (j=.N) will be done for each row of i.
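To make by=.EACHI concrete, here is a tiny self-contained sketch on hypothetical toy data (not the question's):
library(data.table)
x <- data.table(id = c(1, 1, 2), v = 1:3)
i <- data.table(id = 1:2)
x[i, on = .(id), .N, by = .EACHI]   # one count per row of i
#    id N
# 1:  1 2
# 2:  2 1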

Join two data frames together and use most recent result as rows added

I am trying to achieve the 'Final.Data' output shown below.
We start with the Reference.Data and I want to add the Add.Data, joining on 'Person' and returning the most recent result prior to the reference Date.
I am looking for dplyr, data.table or SQL solutions in R.
I then want to be able to reproduce this for thousands of entries, so I am looking for a reasonably efficient solution.
library(tibble)
Reference.Data <- tibble(Person = "John",
Date = "2019-07-10")
Add.Data <- tibble(Person = "John",
Order.Date = c("2019-07-09","2019-07-08") ,
Order = 1:2)
Final.Data <- tibble(Person = "John",
Date = "2019-07-10",
Order.Date = "2019-07-09",
Order = 1)
A rolling join to the nearest preceding date should work pretty fast.
#data preparation:
# convert to data.tables, set dates as 'real' dates
DT1 <- setDT(Reference.Data)[, Date := as.IDate( Date )]
DT2 <- setDT(Add.Data)[, Order.Date := as.IDate( Order.Date )]
#set keys (this also orders the dates, convenient for the join later)
setkey(DT1, Person, Date)
setkey(DT2, Person, Order.Date)
#perform rolling update join on DT1
DT1[ DT2, `:=`( Order.date = i.Order.Date, Order = i.Order), roll = -Inf][]
# Person Date Order.date Order
# 1: John 2019-07-10 2019-07-09 1
An approach using data.table non-equi join and update by reference directly on Reference.Data:
library(data.table)
setDT(Add.Data)
setDT(Reference.Data)
setorder(Add.Data, Person, Order.Date)
Reference.Data[, (names(Add.Data)) :=
  Add.Data[.SD, on = .(Person, Order.Date < Date), mult = "last",
           mget(paste0("x.", names(Add.Data)))]
]
output:
Person Date Order.Date Order
1: John 2019-07-10 2019-07-09 1
Another data.table solution:
setDT(Add.Data)[, Order.Date := as.Date(Order.Date)]
setDT(Reference.Data)[, Date := as.Date(Date)]
Reference.Data[, c("Order.Date", "Order") := Add.Data[.SD,
on = .(Person, Order.Date = Date),
roll = TRUE,
.(x.Order.Date, x.Order)]]
Reference.Data
# Person Date Order.Date Order
# 1: John 2019-07-10 2019-07-09 1
We can do an inner_join and then group by 'Person' and slice the row with the max 'Order.Date':
library(tidyverse)
inner_join(Add.Data, Reference.Data) %>%
  group_by(Person) %>%
  slice(which.max(as.Date(Order.Date)))
# A tibble: 1 x 4
# Groups: Person [1]
# Person Order.Date Order Date
# <chr> <chr> <int> <chr>
#1 John 2019-07-09 1 2019-07-10
Or using data.table:
library(data.table)
setDT(Add.Data)[as.data.table(Reference.Data), on = .(Person)][,
.SD[which.max(as.Date(Order.Date))], by = Person]
Left join the Reference.Data to the Add.Data joining on Person and on Order.Date being at or before Date. Group that by the original Reference.Data rows and take the maximum Order.Date from those. The way it works is that the Add.Data row that is used for each row of Reference.Data will be the one with the maximum Order.Date so the correct Order will be shown.
Note that the dot is an SQL operator and ORDER is an SQL keyword, so we must surround names containing a dot, or the name Order (regardless of case), with square brackets.
library(sqldf)
sqldf("select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
from [Reference.Data] as r
left join [Add.Data] as a on r.Person = a.Person and a.[Order.Date] <= r.Date
group by r.rowid")
giving:
Person Date Order.Date Order
1 John 2019-07-10 2019-07-09 1
I haven't checked how fast this is (adding indexes could speed it up if need be), but with only a few thousand rows efficiency is not likely to be as important as readability.
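If speed did become a concern, one option (a sketch only, not tested against this exact data) is to let sqldf create an index first: sqldf accepts a vector of SQL statements run in one session, and the select can then reference the indexed copy via the main. prefix. The dot-free names Ref and Add below are hypothetical copies made only to simplify the quoting:
library(sqldf)
Ref <- Reference.Data   # copies with dot-free names, purely for simpler SQL
Add <- Add.Data
sqldf(c("create index idx on Add(Person, [Order.Date])",
        "select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
           from Ref as r
           left join main.Add as a on r.Person = a.Person and a.[Order.Date] <= r.Date
          group by r.rowid"))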
