Get indices of matches with a column in a second data.table - r

I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values which will match each other and some that won't. Some values in both data sets might be duplicated.
I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and get the intersect of index matches for 'firstName' and 'lastName'. I will then use the intersect of the indices to fetch the case ID (cid) from the first data set and append it to the second data set.
Because there might be more than one match per element, I will store the matches as list columns within my data.table. I cannot use the base::match function because it only returns the first match for each element, but I do need the answer to be vectorised in the same way as match.
I've tried different combinations of which(d1$x %in% y) but this does not work either because it matches for all of y at once instead of one element at a time. I am using data.table because for my real-world use case, the data set to match on could be hundreds of thousands of records, so speed is important.
I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.
Here is some example data:
# Load library
library(data.table)
# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4),
                  firstName = c("Maria", "Jim", "Jack", "Anne"),
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
                  result = c("pos", "neg", "neg", "pos"))
My desired output would look like this:
# Output:
> dt2
lid firstName lastName result fn_match ln_match casematch caseid
1: 1 Maria Antonis pos NA NA NA <NA>
2: 2 Jim Dutto neg 1,4 4 4 c4
3: 3 Jack Blogs neg NA NA NA <NA>
4: 4 Anne Mcfee pos 3,5 3 3 c3

A possible solution:
dt1[, id := seq_along(cid)]
dt1[dt2, .(lid, id, firstName = i.firstName), on = .(firstName)][
  , .(casematch = .(id)), by = .(lid, firstName)]
lid firstName casematch
<num> <char> <list>
1: 1 Maria NA
2: 2 Jim 1,4
3: 3 Jack NA
4: 4 Anne 3,5

We could use
library(data.table)
dt1[dt2, .(casematch = toString(cid), lid), on = .(firstName), by = .EACHI]
-output
firstName casematch lid
<char> <char> <num>
1: Maria NA 1
2: Jim c1, c4 2
3: Jack NA 3
4: Anne c3, c5 4
Or with the row index (na_if() here is dplyr::na_if(); unmatched rows produce an .I of 0, which is converted to NA)
dt1[dt2, .(casematch = na_if(toString(.I), "0"), lid), on = .(firstName), by = .EACHI]
firstName casematch lid
<char> <char> <num>
1: Maria <NA> 1
2: Jim 1, 4 2
3: Jack <NA> 3
4: Anne 3, 5 4

Using .EACHI and adding the resulting list column by reference.
dt2[, res := dt1[, i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
# lid firstName res
# 1: 1 Maria NA
# 2: 2 Jim 1,4
# 3: 3 Jack NA
# 4: 4 Anne 3,5
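To get all the way to the question's desired output from here, the same .EACHI idiom can build both list columns, which are then intersected row by row. This is a sketch; the Map(intersect, ...) and first-element lookup steps are assumptions added here, not part of the answer above.

```r
library(data.table)

dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
dt2 <- data.table(lid = 1:4,
                  firstName = c("Maria", "Jim", "Jack", "Anne"),
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"))

# List columns of matching dt1 row indices, one per join column
dt2[, fn_match := dt1[, i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
dt2[, ln_match := dt1[.SD, on = .(lastName), .(.(i)), by = .EACHI]$V1]

# Row-wise intersection, then look up cid for the (first) common index
dt2[, casematch := Map(intersect, fn_match, ln_match)]
dt2[, caseid := dt1$cid[vapply(casematch, function(x) x[1], integer(1))]]
```

This reproduces the caseid column from the desired output (NA, c4, NA, c3).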

Another data.table option
> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
firstName cid lid
1: Maria <NA> 1
2: Jim c1, c4 2
3: Jack <NA> 3
4: Anne c3, c5 4

In my real life scenario, I need to retrieve the indices for matches on more than one column. I found a way to do this in one step by combining some of the other solutions and figured it would be useful to also share this and the explanation of how it works below.
The code below adds a new column caseid to dt2, which gets its values from the column cid in dt1 for the row indices that matched on both firstName and lastName.
Putting dt1 inside the square brackets and specifying on = .(...) is equivalent to merging dt1 with dt2 on firstName and lastName, but instead of merging all columns from both datasets, one new column called caseid is created.
The lower case i. prefix to cid indicates that cid is a column from the data.table supplied inside the square brackets (dt1).
The upper case .I inside the square brackets after i.cid will retrieve the row indices of dt1 that match dt2 on firstName and lastName.
# Get case IDs from dt1 for matches of firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]
# Output:
> dt2
lid firstName lastName result caseid
1: 1 Maria Antonis pos <NA>
2: 2 Jim Dutto neg c4
3: 3 Jack Blogs neg <NA>
4: 4 Anne Mcfee pos c3
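A related shortcut, in case only the matching row numbers (rather than cid) are needed: a join with which = TRUE returns, for each row of dt2, the index of the matching row in dt1. This is a sketch assuming at most one match per row on the two-column key.

```r
library(data.table)

dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
dt2 <- data.table(firstName = c("Maria", "Jim", "Jack", "Anne"),
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"))

# For each row of dt2, the index of the matching dt1 row (NA if none)
idx <- dt1[dt2, on = .(firstName, lastName), which = TRUE]
idx
# [1] NA  4 NA  3

dt2[, caseid := dt1$cid[idx]]
```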

Related

Using data.table to match multiple patterns against multiple strings in R

library(data.table)
dat1 <- data.table(id1 = c(1, 1, 2),
                   pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.table(id2 = c(1174, 1231),
                   description = c("apple is sweet", "apple is a computer"),
                   memo = c("bananas, sweet yes", "bananas, sweetyes"))
> dat1
id1 pattern
1: 1 apple
2: 1 applejack
3: 2 bananas, sweet
> dat2
id2 description memo
1: 1174 apple is sweet bananas, sweet yes
2: 1231 apple is a computer bananas, sweetyes
I have two data.tables, dat1 and dat2. I want to search for each pattern in dat1 against the description and memo columns in dat2 and store the corresponding id2s.
The final output table should look something like this:
id1 pattern description_match memo_match
1: 1 apple 1174,1231 <NA>
2: 1 applejack <NA> <NA>
3: 2 bananas, sweet <NA> 1174
The regular expression I want to use is \\b[pattern]\\b. Below is my attempt:
dat1[, description_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$description), .(id2 = paste(id2, collapse = ","))]]
dat1[, memo_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$memo), .(id2 = paste(id2, collapse = ","))]]
However, both give me the warning that grepl can only use the first element of pattern.
We group by row sequence and create the match columns from 'dat2' by pasting together the 'id2' values extracted with the logical output of grepl
library(data.table)
dat1[, c("description_match", "memo_match") := {
  pat <- sprintf('\\b(%s)\\b', paste(pattern, collapse = "|"))
  .(toString(dat2$id2[grepl(pat, dat2$description)]),
    toString(dat2$id2[grepl(pat, dat2$memo)]))
}, seq_along(id1)]
dplyr::na_if(dat1, "")
id1 pattern description_match memo_match
<num> <char> <char> <char>
1: 1 apple 1174, 1231 <NA>
2: 1 applejack <NA> <NA>
3: 2 bananas, sweet <NA> 1174
According to ?sprintf
The string fmt contains normal characters, which are passed through to the output string, and also conversion specifications which operate on the arguments provided through .... The allowed conversion specifications start with a % and end with one of the letters in the set aAdifeEgGosxX%
s - Character string. Character NAs are converted to "NA".
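A quick illustration of how sprintf assembles the word-boundary alternation regex used above (the two-element pattern vector here is a hypothetical input; in the grouped call each group holds a single pattern):

```r
# Build the \b(...)\b alternation from a vector of patterns
pats <- c("apple", "applejack")
rx <- sprintf("\\b(%s)\\b", paste(pats, collapse = "|"))
rx
# [1] "\\b(apple|applejack)\\b"

# \b makes "apple" match as a whole word only
grepl(rx, c("apple is sweet", "pineapple is sweet"))
# [1]  TRUE FALSE
```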
Or use a for loop
for (nm in names(dat2)[-1])
  dat1[, paste0(nm, "_match") :=
         toString(dat2$id2[grepl(paste0("\\b", pattern, "\\b"),
                                 dat2[[nm]])]), seq_along(id1)][]

Updating by reference a data.table using another, but only empty values

I am trying to update a dt1 data.table using another dt2 data.table, joined on a common id, for a specific column. That column should be updated only if its value in dt1 is empty (NA).
For instance, this updates everything:
dt1 <- data.table(
  "id" = c(1, 2, 3, 4),
  "animal" = c("cat", "cat", "dog", NA)
)
dt2 <- data.table(
  "id" = c(3, 4),
  "animal" = c("human being", "duck")
)
dt1 <- dt1[dt2, on = c("id"), "animal" := .(i.animal)]
Which gives for dt1
id animal
1: 1 cat
2: 2 cat
3: 3 human being
4: 4 duck
While I need
id animal
1: 1 cat
2: 2 cat
3: 3 dog
4: 4 duck
We could use fcoalesce
library(data.table)
dt1[dt2, animal := fcoalesce(animal, i.animal), on = .(id)]
-output
> dt1
id animal
<num> <char>
1: 1 cat
2: 2 cat
3: 3 dog
4: 4 duck
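An equivalent sketch with an explicit is.na() test via data.table's fifelse(), in case you prefer to spell the condition out:

```r
library(data.table)

dt1 <- data.table(id = c(1, 2, 3, 4),
                  animal = c("cat", "cat", "dog", NA))
dt2 <- data.table(id = c(3, 4),
                  animal = c("human being", "duck"))

# Update join: keep dt1's value where it is not NA, otherwise take dt2's
dt1[dt2, on = .(id), animal := fifelse(is.na(animal), i.animal, animal)]
dt1
#    id animal
# 1:  1    cat
# 2:  2    cat
# 3:  3    dog
# 4:  4   duck
```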

Find Groups by multiple cascadingly related conditions

Problem
I have some data. I would like to flag the same instance (e.g. a person, company, machine, whatever) in my data with a unique ID. The data actually has some IDs, but they are either not always present or one instance has different IDs.
What I try to achieve is to use these IDs along with individual information to find the same instance and assign a unique ID to it.
I found a solution, but it is highly inefficient. I would appreciate both tips to improve the performance of my code or - probably more promising - another approach.
Code
Example Data
dt1 <- data.table(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"))
dt1
#> id1 id2 surname firstname
#> 1: 1 A Smith John
#> 2: 1 B Smith John
#> 3: 2 A Smith Joe
#> 4: 3 C Smith Joe
#> 5: 4 D Smith Jack
Current Solution
find_grp <- function(dt,
by) {
# keep necessary variables only
dtx <- copy(dt)[, .SD, .SDcols = c(unique(unlist(by)))]
# unique data.table to improve performance
dtx <- unique(dtx)
# assign a row id column
dtx[, ID := .I]
# for every row and every by group, find all rows that match each row
# on at least one condition
res <- lapply(X = dtx$ID,
FUN = function(i){
unique(unlist(lapply(X = by,
FUN = function(by_sub) {
merge(dtx[ID == i, ..by_sub],
dtx,
by = by_sub,
all = FALSE)$ID
}
)))
})
res
print("merge done")
# keep all unique matching rows
l <- unique(res)
# combine matching rows together, if there is at least one overlap between
# two groups.
# repeat until all row-groups are completely disjoint from one another
repeat{
l1 <- l
iterator <- seq_len(length(l1))
for (i in iterator) {
for (ii in iterator[-i]) {
# is there any overlap between both row-groups
if (length(intersect(l1[[i]], l1[[ii]])) > 0) {
l1[[i]] <- sort(union(l1[[i]], l1[[ii]]))
}
}
}
if (isTRUE(all.equal(l1, l))) {
break
} else {
l <- unique(l1)
}
}
print("repeat done")
# use result to assign a groupId to the helper data.table
Map(l,
seq_along(l),
f = function(ll, grp) dtx[ID %in% ll, ID_GRP := grp])
# remove helper Id
dtx[, ID := NULL]
# assign the groupId to the original data.table
dt_out <- copy(dt)[dtx,
on = unique(unlist(by)),
ID_GRP := ID_GRP]
return(dt_out[])
}
Result
find_grp(dt1, by = list("id1",
"id2"
, c("surname", "firstname"))
)
#> [1] "merge done"
#> [1] "repeat done"
#> id1 id2 surname firstname ID_GRP
#> 1: 1 A Smith John 1
#> 2: 1 B Smith John 1
#> 3: 2 A Smith Joe 1
#> 4: 3 C Smith Joe 1
#> 5: 4 D Smith Jack 2
As you can see, ID_GRP is identified because:
- the first two rows share id1;
- since id2 for id1 = 1 contains A, row 3 with id2 = A belongs to the same group;
- all Joe Smiths belong to the same group as well, because that is the name in row 3;
- and so on and so forth.
Only row 5 is completely unrelated.
{data.table} solutions are preferred
This might help you. I'm not sure if I've completely understood your question. I've written a function, gen_grp(), that takes a data table d and a vector of variables v. It steps through each unique id1 and replaces id1 values if matches of certain types are found.
gen_grp <- function(d, v) {
  for (id in unique(d$id1)) {
    d[id2 %in% d[id1 == id, id2], id1 := id]
    k <- unique(d[id1 == id, ..v])[, t := id]
    d <- k[d, on = v][!is.na(t), id1 := t][, t := NULL]
  }
  d[, grp := rleid(id1)]
  return(d[])
}
Usage:
gen_grp(dt1,c("surname","firstname"))
Output:
surname firstname id1 id2 grp
<char> <char> <num> <char> <int>
1: Smith John 1 A 1
2: Smith John 1 B 1
3: Smith Joe 1 A 1
4: Smith Joe 1 C 1
5: Smith Jack 4 D 2
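As an alternative to the repeat loop in find_grp(), the transitive merging can be sketched with a simple union-find over row indices (base R only; this version has no path compression, so a dedicated disjoint-set implementation or a graph library's connected components would scale better):

```r
library(data.table)

dt1 <- data.table(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"))

# Union-find: rows sharing a value in any key column end up in one set
parent <- seq_len(nrow(dt1))
find <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

keys <- list(dt1$id1,
             dt1$id2,
             paste(dt1$surname, dt1$firstname))
for (k in keys) {
  for (grp in split(seq_along(k), k)) {  # rows sharing this key value
    for (i in grp[-1]) {
      ri <- find(grp[1])
      rj <- find(i)
      if (ri != rj) parent[rj] <- ri     # merge the two sets
    }
  }
}

# Number the disjoint sets in order of first appearance
roots <- vapply(seq_len(nrow(dt1)), find, integer(1))
dt1[, ID_GRP := as.integer(factor(roots, levels = unique(roots)))]
```

On the example data this yields ID_GRP = 1, 1, 1, 1, 2, matching find_grp().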

dplyr - Join tables based on a date difference

It's late and I can't figure this out. I'm using lubridate and dplyr.
My data is as follows:
table1 = 1 observation per subject, with a date
table2 = 1 or more observations per subject, with associated dates
When I left join, I actually add observations, because multiple records in table 2 match the key. How can I make this a conditional join so that only the one matching record from table 2 whose date is closest to the date in table 1 is joined?
Sorry if this was verbose.
Use the data.table package to join, with roll = "nearest" to get the nearest match.
library(data.table)
dt1 <- data.table(id = 1:10, date = 1:10)
dt2 <- data.table(date = 6:15, letter = letters[1:10])
dt1[, letter := dt2[dt1, letter, on = "date", roll = "nearest"] ][]
# id date letter
# 1: 1 1 a
# 2: 2 2 a
# 3: 3 3 a
# 4: 4 4 a
# 5: 5 5 a
# 6: 6 6 a
# 7: 7 7 b
# 8: 8 8 c
# 9: 9 9 d
# 10: 10 10 e
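If you want to stay within dplyr, a sketch of the same idea: join all matches, then keep one row per subject by the smallest absolute date difference. The table and column names here are assumptions, since the question shows no data.

```r
library(dplyr)

# Hypothetical stand-ins for the question's tables
table1 <- data.frame(id = 1:3, date = c(5, 10, 20))
table2 <- data.frame(id = c(1, 1, 2, 3, 3),
                     date2 = c(1, 6, 12, 15, 30),
                     value = c("a", "b", "c", "d", "e"))

# Join all matches, then keep only the row per subject whose
# date2 is closest to date
res <- table1 %>%
  left_join(table2, by = "id") %>%
  group_by(id) %>%
  slice_min(abs(date - date2), n = 1, with_ties = FALSE) %>%
  ungroup()
res
```

slice_min(..., with_ties = FALSE) guarantees exactly one row per subject even when two dates are equally close.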

Assign category based on magnitude of value

I have a data.table like this
dt1 = data.table(id = c(001, 001, 002, 002, 003, 003),
                 score = c(4, 6, 3, 7, 2, 8))
where each individual has 2 scores on the variable "score".
I would like to assign each individual to a category in the variable outcome based on their score.
For their lower score, they get an "A", for their higher, they get a "B". So the final table looks like this
dt2 = data.table(id = c(001, 001, 002, 002, 003, 003),
                 score = c(4, 6, 3, 7, 2, 8),
                 category = c('A', 'B', 'A', 'B', 'A', 'B'))
Since the values in column "score" are random, the category should be assigned based on the magnitude of the numbers assigned to each person. Any help is much appreciated.
We can order by 'score' in i, group by 'id', and assign the 'category' values 'A' and 'B'
library(data.table)
dt1[order(score), category := c('A', 'B') , by = id]
dt1
# id score category
#1: 001 4 A
#2: 001 6 B
#3: 002 3 A
#4: 002 7 B
#5: 003 2 A
#6: 003 8 B
Or another option is to convert a logical vector to a numeric index and replace the values based on that
dt1[, category := c('A', 'B')[(score != min(score)) + 1] ,by = id]
data
dt1 <- data.table(id = c('001', '001', '002', '002', '003', '003'),
                  score = c(4, 6, 3, 7, 2, 8))
We can use ifelse:
library(data.table)
dt1[, category := ifelse(score == min(score), 'A', 'B'), by = id]
Result:
id score category
1: 1 4 A
2: 1 6 B
3: 2 3 A
4: 2 7 B
5: 3 2 A
6: 3 8 B
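If each id could have more than two scores, a generalisation (a sketch, not part of the answers above) is to rank the scores within each id with data.table's frank() and index into LETTERS:

```r
library(data.table)

dt1 <- data.table(id = c("001", "001", "001", "002", "002"),
                  score = c(4, 6, 5, 3, 7))

# frank() ranks within each group; ties.method = "first" guarantees a
# unique rank even for equal scores, and LETTERS maps rank -> category
dt1[, category := LETTERS[frank(score, ties.method = "first")], by = id]
dt1
#        id score category
# 1:    001     4        A
# 2:    001     6        C
# 3:    001     5        B
# 4:    002     3        A
# 5:    002     7        B
```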
