Subsetting ID and dates in data.table R

I have a large data.table similar to the following example (the real one has 70 columns and millions of rows):
library(data.table)
a <- seq(as.IDate("2011-12-30"), as.IDate("2014-01-04"), by="days")
data <- data.table(ID = 1:length(a), date1 = a)
I want to extract all the rows of data that match the table IDs below: it contains the ID of each individual and the dates that I need to extract for that individual. An individual can have multiple rows.
a <- seq(as.IDate("2011-12-30"), as.IDate("2014-01-04"), by="week")
b <- seq(as.IDate("2012-01-01"), as.IDate("2014-01-06"), by="week")
IDs <- data.table(ID = 1:length(a), date1 = a, date2 = b)
Currently my solution is not very fast; what would be better?
A <- list()
for (i in 1:nrow(IDs)) {
  A[[i]] <- data[ID == IDs[i, ID] & (date1 %between% IDs[i, .(date1, date2)]), ]
}

I think you are looking for a non-equi inner join:
IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, .(ID, date1=i.date1)]
Or, equivalently, with the roles of the two tables swapped:
data[IDs, on=.(ID, date1>=date1, date1<=date2), nomatch=0L, .(ID, date1=x.date1)]
Or viewing it as a non-equi semi-join:
data[IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, which=TRUE]]
output:
ID date1
1: 1 2011-12-30
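As a quick sanity check (a sketch, assuming the data and IDs tables built above), the two join directions return the same rows:
res1 <- IDs[data, on=.(ID, date1<=date1, date2>=date1), nomatch=0L, .(ID, date1=i.date1)]
res2 <- data[IDs, on=.(ID, date1>=date1, date1<=date2), nomatch=0L, .(ID, date1=x.date1)]
fsetequal(res1, res2)
# expected: TRUE, since both express the same non-equi inner join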

Related

Join two dataframes on two columns [one a datetime column] in R

I have two dfs and I'm trying to left- or right-join them based on two columns, ID and Datetime. How do I allow the Datetime from one df to match the other df even if it is within a 10-20 second difference range?
df1:
ID   Datetime
123  2021-04-02 09:50:11
456  2021-04-02 09:50:15
df2:
ID   Datetime
123  2021-04-02 09:50:31
456  2021-04-02 09:50:23
If the times on df2 are within a 10-20 second difference, return all the columns plus the Datetime column from df2 into a new df3, for all rows where the ID and the yyyy-mm-dd HH:MM part match in both dfs. So if the change in seconds on df2 is between 10 and 20, pick the row and join; if it is not within the 10-20 second range, skip it. Someone, please help?
Your sample data is very minimalistic, and it is not clear exactly how you wanted to implement the 10-20 seconds; I assumed everything within -20 to +20 seconds should be matched. This can easily be adjusted in the filtering part: ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20).
Here is a data.table approach:
library(data.table)
# Sample data
DT1 <- fread("ID Datetime
123 2021-04-02T09:50:11
456 2021-04-02T09:50:15")
DT2 <- fread("ID Datetime
123 2021-04-02T09:50:31
456 2021-04-02T09:50:23")
# Set datetimes to POSIXct
DT1[, Datetime := as.POSIXct(Datetime)]
DT2[, Datetime := as.POSIXct(Datetime)]
# possible rowwise approach
DT1[, rowid := .I]
setkey(DT1, rowid)
DT1[DT1, Datetime2 := DT2[ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20),
                          lapply(.SD, paste0, collapse = ";"), .SDcols = c("Datetime")],
    by = .EACHI][, rowid := NULL][]
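# note: if several DT2 rows fall inside the +/- 20s window, they are collapsed
# into a single ";"-separated string by paste0(..., collapse = ";")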
# ID Datetime Datetime2
# 1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
# 2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23
If I understand correctly, the OP wants to retrieve those rows of df2 (including all columns) which have a matching ID in df1 and where the time difference between the Datetime stamps of df1 and df2 is less than or equal to a given value.
So, for the given sample data:
- If the allowed time difference is 20 seconds at most, both rows of df2 are returned.
- If the allowed time difference is 10 seconds at most, only the second row of df2 with ID == 456 is returned.
- If the allowed time difference is 5 seconds at most, an empty dataset is returned because none of df2's rows fulfills the conditions.
One possible approach is to use a non-equi join which is available with data.table:
library(data.table)
timediff <- 10 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[df1, on = c("ID", "from <= Datetime", "to >= Datetime"),
           nomatch = NULL, .SD][, c("from", "to") := NULL][]
df3
ID Datetime
1: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 20
the result is
df3
ID Datetime
1: 123 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 5
df3 becomes an empty data.table.
EDIT: Show Datetime from df1 and df2
By request of the OP, here is a version which returns both Datetime columns from df1 and df2, renamed as Datetime1 and Datetime2, respectively (in the join below, the i. prefix picks columns from df1, the table being joined in, and the x. prefix picks columns from df2):
library(data.table)
timediff <- 20 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[setDT(df1), on = c("ID", "from <= Datetime", "to >= Datetime"),
           nomatch = NULL, .(ID, Datetime1 = i.Datetime, Datetime2 = x.Datetime)]
df3
ID Datetime1 Datetime2
1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23

R - Most efficient way to remove all non-matched rows in a data.table rolling join (instead of 2-step procedure with semi join)

Currently I solve this with a workaround, but I would like to know if there is a more efficient way.
See below for exemplary data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(Date_2, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling. I want to get the most recent preceding data from dt (the lookup table) into dt_fill. I do this with an easy one-line rolling join like this:
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end up with a bunch of rows filled with NAs for the newly added columns, which I would like to get rid of. I do this with a semi-join. I found the older solutions quite hard to understand and settled for the batchtools::sjoin() function, which is essentially also a one-liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of getting a clean result from a rolling join than doing the rolling join first and then a semi-join with the original dataset? The two-step procedure is also not very fast for very long datasets. Thanks!
Essentially, I found two approaches that are both viable solutions.
Solution 1
First, an elegant one-liner proposed by lil_barnacle, which reads as follows:
# Rolling join with nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Adding the nomatch argument and setting it to 0 (nomatch = 0) is equivalent to doing the rolling join first and the semi-join thereafter.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
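The equivalence of the two routes can be checked directly (a quick sketch, assuming dt and dt_fill as built above; the semi-join is written in plain data.table instead of batchtools::sjoin()):
res1 <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch = 0]
res2 <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE][id %in% dt$id]
fsetequal(res1, res2)  # expected: TRUE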
Solution 2
Second, the solution that I came up with was to 'align' both data sets before the rolling join by filtering on the join variable, like so:
# Aligning data sets by filtering according to the joined 'variable'
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]

consecutively subtracting columns in data.table

Suppose I have the following data.table:
player_id prestige_score_0 prestige_score_1 prestige_score_2 prestige_score_3 prestige_score_4
1: 100284 0.0001774623 2.519792e-03 5.870781e-03 7.430179e-03 7.937716e-03
2: 103819 0.0001774623 1.426482e-03 3.904329e-03 5.526974e-03 6.373850e-03
3: 100656 0.0001774623 2.142518e-03 4.221423e-03 5.822705e-03 6.533448e-03
4: 104745 0.0001774623 1.084913e-03 3.061197e-03 4.383649e-03 5.091851e-03
5: 104925 0.0001774623 1.488457e-03 2.926728e-03 4.360301e-03 5.068171e-03
And I want to find the difference between the values in consecutive columns, starting from column prestige_score_0.
One step of this would look like: df[, prestige_score_0] - df[, prestige_score_1]
How can I do it in data.table (and save these differences as a data.table, keeping player_id as well)?
This is how you can do this in a tidy way:
# make it tidy
df2 <- melt(df,
            id.vars = "player_id",
            variable.name = "column_name",
            value.name = "prestige_score")
# extract numbers from column names
df2[, score_number := as.numeric(gsub("prestige_score_", "", column_name))]
# compute differences by player
df2[, diff := prestige_score - shift(prestige_score, n = 1L, type = "lead"),
    by = player_id]
# if necessary, reshape back to original format
dcast(df2, player_id ~ score_number, value.var = c("prestige_score", "diff"))
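# result: one row per player_id with prestige_score_0..4 and diff_0..4 columns;
# the last diff is NA because there is no following score to subtract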
You can subtract a shifted version of the whole dt from itself:
dt <- data.table(id = c("A", "B"), matrix(rexp(10, rate = .1), ncol = 5))
dt_shift <- data.table(id = dt[, id], dt[, 2:(ncol(dt) - 1)] - dt[, 3:ncol(dt)])
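# each resulting column i holds V_i - V_{i+1}; the names come from the first operand (V1..V4)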
You could use a for loop:
for (i in 1:(ncol(df) - 2)) {
  df[[paste0("diff_", i - 1, "_", i)]] <- df[[paste0("prestige_score_", i - 1)]] -
    df[[paste0("prestige_score_", i)]]
}
This might not be the most efficient if you have a lot of columns though.
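If there are many columns, a more data.table-idiomatic sketch uses set() for fast in-place assignment (assuming df is a data.table with the prestige_score_<n> columns above):
cols <- grep("^prestige_score_", names(df), value = TRUE)
for (i in seq_along(cols)[-1]) {
  # diff_<k-1>_<k> = prestige_score_<k-1> - prestige_score_<k>
  set(df, j = paste0("diff_", i - 2, "_", i - 1),
      value = df[[cols[i - 1]]] - df[[cols[i]]])
}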

Subset by group with data.table compared to aggregate a data.table

This is a follow up question to Subset by group with data.table using the same data.table:
library(data.table)
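# 'baseball' is assumed to be the baseball dataset from the plyr package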
bdt <- as.data.table(baseball)
# Aggregating and loosing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]
Why do dt1 and dt2 differ in number of rows?
Isn't dt2 supposed to give the same result, just without losing the respective information in the other columns?
As #Frank pointed out:
bdt[ , .(max_g = max(g)), by = id] provides you with the maximum value, while
bdt[bdt[ , .I[g == max(g)], by = id]$V1] identifies all rows that have this maximum.
See What is the difference between arg max and max? for a mathematical explanation and try this slim version in R:
library(data.table)
bdt <- as.data.table(baseball)
dt <- bdt[id == "woodge01"][order(-g)]
dt[ , .(max = max(g)), by = id]
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]

Assign a value based on closest neighbour from other data frame

With generic data:
set.seed(456)
a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)
c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
  d[i] <- 0.5 * (10 * c[i])^2 + 5
}
df2 <- data.frame(c,d)
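As an aside, the loop filling d can be vectorized to the same result:
d <- 0.5 * (10 * c)^2 + 5
df2 <- data.frame(c, d)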
For each df1$b we want to find the nearest df2$d.
Then we create a new variable df1$XYZ that takes the df2$c value of the nearest df2$d.
This question has guided me towards the data.table library, but I am not sure whether dplyr with group_by could also be used.
Here was my data.table attempt:
library(data.table)
dt1 <- data.table( df1 , key = "b" )
dt2 <- data.table( df2 , key = "d" )
dt2[ dt1 , list( d ) , roll = "nearest" ]
Here's one way with data.table:
require(data.table)
setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]
You need to get df2$c corresponding to the nearest value in df2$d for every df1$b. So, we need to join as df2[df1], which results in nrow(df1) rows. That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"].
It returns the result you require. All we need to do is to add this back to df1 with the name XYZ. We do that using :=.
The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data tables):
We need to get some value(s) for each row of df1. That means i = df1 in the x[i] syntax.
df2[df1]
We need to join df2$d with df1$b. Using on= that'd be:
df2[df1, on=c(d="b")]
We need just the c column. Use j to select just that column.
df2[df1, c, on=c(d="b")]
We don't need an equi-join but a roll-to-nearest join.
df2[df1, c, on=c(d="b"), roll="nearest"]
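As a quick check (a sketch using the generic data above), the rolled value for a given row should match a manual nearest-neighbour lookup:
df1$XYZ[1]
df2$c[which.min(abs(df2$d - df1$b[1]))]
# both lines should print the same value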
Hope this helps.
