data.table join based on switched string combinations - r

I have df1, which I would like to merge with df2 based on a common field id.
id is always of the form 21_2342_A_C (i.e. num_num_char_char). I want to merge df2 into df1 even when the last two fields (sep = "_") of id are switched.
So, if id in df1 is 21_2342_A_C, then I want it to match if the entry in df2 is either 21_2342_A_C or 21_2342_C_A.
Is this possible using data.table? I've developed a cumbersome way involving creating two different columns and doing two different joins, but I was hoping there'd be a more elegant solution. I'll also happily take a non data.table solution.
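As a baseline, a minimal sketch of such a two-column / two-join workaround could look like this (the id_swapped helper column is illustrative, not from the original post):
library(data.table)
df1 <- data.table(V1 = "hello", id = "21_2342_A_C")
df2 <- data.table(V1 = c("world1", "world2"), id = c("21_2342_A_C", "21_2342_C_A"))
# build the "switched" version of id in df2: 21_2342_A_C -> 21_2342_C_A
df2[, id_swapped := sapply(strsplit(id, "_"), function(p)
  paste(c(p[1:2], p[4], p[3]), collapse = "_"))]
# join once on the original id, once on the switched id, then stack the hits
hit1 <- df2[df1, on = .(id), nomatch = 0]
hit2 <- df2[df1, on = .(id_swapped = id), nomatch = 0]
res <- rbind(hit1, hit2)
The answers below avoid the second join by canonicalising the id instead.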

This approach also creates two additional columns, but needs only one merge:
library(data.table)
library(stringr)
library(purrr)

dt <- data.table(
  id = c("21_2342_A_C", "21_2342_C_A", "21_2342_A_B")
)
# 1. extract the number part and the character part of id
# 2. sort the character part
# 3. self-merge rows whose number and character parts are the same
# 4. remove merges onto itself and duplicated merges (if row i is merged
#    to row j, then row j is also merged to row i)
dt[, row_id := seq_len(.N)]
# data.table::transpose is qualified explicitly because purrr also exports transpose
dt[, c("id1", "id2") := data.table::transpose(str_extract_all(id, "([0-9]{2}_[0-9]{4})|([A-Z]_[A-Z])"))]
dt[, id2 := map_chr(str_split(id2, "_"), ~ str_c(sort(.x), collapse = ""))]
res <- dt[dt, on = .(id1, id2)][row_id < i.row_id]
res[, c("row_id", "id1", "id2", "i.row_id") := NULL]
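With the three example ids above, only the switched pair survives the filter, so res should reduce to a single row:
res
#             id        i.id
# 1: 21_2342_A_C 21_2342_C_A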

I also could not figure out how to do it without an intermediate id.
Here is my take:
df1 <- data.table(V1 = "hello", id = "21_2342_A_C")
df2 <- data.table(V1 = c("world1", "world2"), id = c("21_2342_A_C", "21_2342_C_A"))
sort_id <- function(x) {
  x <- unlist(tstrsplit(x, "_"))
  paste0(c(x[1:2], sort(x[3:4])), collapse = "_")
}
df1[, id2 := sort_id(id), by = id]
df2[, id2 := sort_id(id), by = id]
merge(df1, df2, by = "id2")

R - Most efficient way to remove all non-matched rows in a data.table rolling join (instead of 2-step procedure with semi join)

Currently I solve this with a workaround, but I would like to know if there is a more efficient way. See below for example data:
library(data.table)
library(anytime)
library(tidyverse)
library(dplyr)
library(batchtools)
# Lookup table
Date <- c("1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31",
"1991-03-31", "1991-06-30", "1991-09-30", "1991-12-31")
period <- c(1:8)
metric_1 <- rep(c(2000, 3500, 4000, 100000), 2)
metric_2 <- rep(c(200, 350, 400, 10000), 2)
id <- 22
dt <- setDT(data.frame(Date, period, id, metric_1, metric_2))
# Fill and match table 2
Date_2 <- c("1990-08-30", "1990-02-28", "1991-07-31", "1991-09-30", "1991-10-31")
random <- c(10:14)
id_2 <- c(22,33,57,73,999)
dt_fill <- setDT(data.frame(Date_2, random, id_2))
# Convert date columns to type date
dt[ , Date := anydate(Date)]
dt_fill[ , Date_2 := anydate(Date_2)]
Now for the data wrangling. I want to get the most recent preceding data from dt (the lookup table) into dt_fill. I do this with an easy one-line rolling join:
# Rolling join
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# if not all id_2 present in id column in table 1, we get rows with NA
# I want to only retain the rows with id's that were originally in the lookup table
Then I end up with a bunch of rows where the newly added columns are filled with NAs, which I would like to get rid of. I do this with a semi-join. I found the older solutions quite hard to understand and settled on the batchtools::sjoin() function, which is essentially also a one-liner.
dt_final <- sjoin(dt_res, dt, by = "id")
Is there a more efficient way of getting a clean result from a rolling join than doing the rolling join first and a semi-join with the original dataset afterwards? It is also not very fast for very long data sets. Thanks!
Essentially, there are two viable approaches.
Solution 1
The first, proposed by lil_barnacle, is an elegant one-liner that reads as follows:
# Rolling join with nomatch argument set to 0
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE, nomatch=0]
Original approach
Setting the nomatch argument to 0 (nomatch = 0) is equivalent to doing the rolling join first and the semi-join thereafter.
# Rolling join without specified nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]
# Semi-join required
dt_final <- sjoin(dt_res, dt, by = "id")
Solution 2
Second, the solution I came up with was to 'align' both data sets before the rolling join by filtering on the join variable, like so:
# Align data sets by filtering on the join variable
dt_fill <- dt_fill[id_2 %in% dt[ , unique(id)]]
# Rolling join without need to specify nomatch argument
dt_res <- dt[dt_fill, on = .(id = id_2, Date = Date_2), roll = TRUE]

Replace values in dataframe based on indexes in second dataframe - R

I have the following task: replace the values of variable V1 in dataframe A with the values of the same variable in dataframe B. Next I simulate the dataframes:
set.seed(123)
A <- data.frame(id1 = sample(1:10, 10), id2 = sample(1:10, 10), V1 = rnorm(10), V2 = rnorm(10))
# create dataframe B
B <- A[sample(1:10, 5), 1:3]
# change the values to be updated in df A
B$V1 <- rnorm(5)
# create a row which is not in A, to make it more interesting
B <- rbind(B, c(11, 12, rnorm(1)))
Now I provide a non-optimal solution which I wish to make cleaner:
library(dplyr)
temp <- left_join(A, B, by = c("id1", "id2"))
temp[!is.na(temp$V1.y), "V1.x"] <- temp[!is.na(temp$V1.y), "V1.y"]
A <- temp[, setdiff(colnames(temp), "V1.y")]
colnames(A)[colnames(A) %in% "V1.x"] <- "V1"
It would be desirable to avoid creating temporary objects and to modify df A directly. The solution should also scale to replacing values in more than one column of A. I am thinking of something like
A[expression1, desired_cols] <- B[expression2, desired_cols]
where expression1 and expression2 are intended to match indexes in both data frames and desired_cols are the names of the columns to be replaced.
We can use a join from data.table and update the columns of 'A' with the corresponding i.-prefixed columns of the second dataset ('B'):
library(data.table)
setDT(A)[B, V1 := i.V1, on = .(id1, id2)]
If we are replacing multiple columns, make note of the columns to replace (this assumes 'B' also contains those columns; in the simulated data above, 'B' carries only V1):
nm1 <- names(A)[3:4]
nm2 <- paste0("i.", nm1)
setDT(A)[B, (nm1) := mget(nm2), on = .(id1, id2)]
Or, if we use left_join, then coalesce would be better:
library(dplyr)
left_join(A, B, by = c('id1', 'id2')) %>%
  transmute(id1, id2, V1 = coalesce(V1.y, V1.x), V2)
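If several columns need this treatment on the dplyr side, a sketch with across() could look like the following (assuming, as in the data.table variant, that B actually carries every column you want to pull; with the simulated B only V1 qualifies):
library(dplyr)
cols <- c("V1")  # columns to update; extend this if B has more of them
left_join(A, B, by = c("id1", "id2"), suffix = c("", ".y")) %>%
  mutate(across(all_of(cols),
                ~ coalesce(get(paste0(cur_column(), ".y")), .x))) %>%
  select(-ends_with(".y"))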

Keep columns in join data.table

I do not get why, in this join, I cannot retrieve the column sum_item of my DT2:
DT <- data.table(ID=c(1:4),OBS_VALUE=10:13)
DT2 <- data.table(ID=c(1:4),sum_item=c(10,11.5,12.5,18))
setkey(DT,ID)
setkey(DT2,ID)
S_toset_sum <- DT[DT2, diff := abs(OBS_VALUE - sum_item)][diff < 3]
In the output I would still like to have sum_item, as I want to keep this column instead of the OBS_VALUE column.
You have to specify the columns you wish to keep, as well as the key you wish to join on.
S_toset_sum <- DT[DT2, on = 'ID', .(ID, OBS_VALUE, sum_item, diff = abs(OBS_VALUE-sum_item))][diff<3]
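With the example data only ID 4 is dropped (diff of 5), so the result should look like:
#    ID OBS_VALUE sum_item diff
# 1:  1        10     10.0  0.0
# 2:  2        11     11.5  0.5
# 3:  3        12     12.5  0.5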

Assign a value based on closest neighbour from other data frame

With generic data:
set.seed(456)
a <- sample(0:1, 50, replace = TRUE)
b <- rnorm(50, 15, 5)
df1 <- data.frame(a, b)
c <- seq(0.01, 0.99, 0.01)
d <- rep(NA, 99)
for (i in 1:99) {
  d[i] <- 0.5 * (10 * c[i])^2 + 5
}
df2 <- data.frame(c, d)
For each df1$b we want to find the nearest df2$d.
Then we create a new variable df1$XYZ that takes the df2$c value of the nearest df2$d.
This question has guided me towards the data.table library, but I am not sure whether dplyr and group_by can also be used.
Here was my data.table attempt:
library(data.table)
dt1 <- data.table(df1, key = "b")
dt2 <- data.table(df2, key = "d")
dt2[dt1, list(c), roll = "nearest"]
Here's one way with data.table:
require(data.table)
setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]
You need the df2$c corresponding to the nearest value in df2$d for every df1$b. So we need to join as df2[df1], which results in nrow(df1) rows. That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"].
It returns the result you require. All we need to do is add this back to df1 with the name XYZ. We do that using :=.
The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data tables):
We need to get some value(s) for each row of df1. That means i = df1 in the x[i] syntax.
df2[df1]
We need to join df2$d with df1$b. Using on= that'd be:
df2[df1, on=c(d="b")]
We need just the c column. Use j to select just that column.
df2[df1, c, on=c(d="b")]
We don't need an equi-join but a roll-to-nearest join.
df2[df1, c, on=c(d="b"), roll="nearest"]
Hope this helps.
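As a footnote to the dplyr/group_by part of the question: rolling joins are a data.table feature, but a plain base-R nearest-neighbour lookup can stand in as a sketch (it is quadratic in the table sizes, so only sensible for small tables like these):
# for each df1$b, take the df2$c at the index of the nearest df2$d
df1$XYZ <- df2$c[sapply(df1$b, function(x) which.min(abs(df2$d - x)))]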

data.table join (multiple) selected columns with new names

I'd like to join two tables that have some identical columns (names and values) and others that differ. I'm only interested in joining those that differ, and I would like to give them new names in the result. The way I currently do it seems verbose and hard to handle for the real tables I have with 100+ columns, i.e. I would like to determine the columns to be joined in advance rather than inside the join statement. Reproducible example:
# create table 1
DT1 = data.table(id = 1:5, x=letters[1:5], a=11:15, b=21:25)
# create table 2 with changed values for a, b via pre-determined cols
DT2 = copy(DT1)
cols <- c("a", "b")
DT2[, (cols) := lapply(.SD, function(x) x*2), .SDcols = cols]
# both of these work but are verbose for many columns
DT1[DT2, c("a_new", "b_new") := list(i.a, i.b), on=c(id="id")]
DT1[DT2, `:=` (a_new=i.a, b_new=i.b), on = c(id="id")]
I was thinking about something like this (doesn't work):
cols_new <- c("a_new", "b_new")
cols <- c("a", "b")
DT1[DT2, cols_new := i.cols, on=c(id="id")]
Updated answer based on Arun's recommendation:
cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
You could also generate cols_old with:
paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
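A quick check with the example tables: after the update join, the *_new columns should carry DT2's doubled values.
DT1[, .(id, a, a_new, b, b_new)]
#    id  a a_new  b b_new
# 1:  1 11    22 21    42
# 2:  2 12    24 22    44
# 3:  3 13    26 23    46
# 4:  4 14    28 24    48
# 5:  5 15    30 25    50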