Problem
I have some data in which I would like to flag each instance (e.g. a person, company, machine, whatever) with a unique ID. The data already contains some IDs, but they are either not always present, or a single instance carries several different IDs.
What I try to achieve is to use these IDs, along with individual information, to find the same instance and assign one unique ID to it.
I found a solution, but it is highly inefficient. I would appreciate either tips to improve the performance of my code or, probably more promising, another approach.
Code
Example Data
dt1 <- data.table(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"))
dt1
#> id1 id2 surname firstname
#> 1: 1 A Smith John
#> 2: 1 B Smith John
#> 3: 2 A Smith Joe
#> 4: 3 C Smith Joe
#> 5: 4 D Smith Jack
Current Solution
find_grp <- function(dt, by) {
  # keep necessary variables only
  dtx <- copy(dt)[, .SD, .SDcols = c(unique(unlist(by)))]
  # de-duplicate the helper data.table to improve performance
  dtx <- unique(dtx)
  # assign a row id column
  dtx[, ID := .I]
  # for every row and every by group, find all rows that match that row
  # on at least one condition
  res <- lapply(X = dtx$ID,
                FUN = function(i) {
                  unique(unlist(lapply(X = by,
                                       FUN = function(by_sub) {
                                         merge(dtx[ID == i, ..by_sub],
                                               dtx,
                                               by = by_sub,
                                               all = FALSE)$ID
                                       })))
                })
  print("merge done")
  # keep all unique matching rows
  l <- unique(res)
  # combine matching rows together, if there is at least one overlap between
  # two groups.
  # repeat until all row-groups are completely disjoint from one another
  repeat {
    l1 <- l
    iterator <- seq_len(length(l1))
    for (i in iterator) {
      for (ii in iterator[-i]) {
        # is there any overlap between both row-groups?
        if (length(intersect(l1[[i]], l1[[ii]])) > 0) {
          l1[[i]] <- sort(union(l1[[i]], l1[[ii]]))
        }
      }
    }
    if (isTRUE(all.equal(l1, l))) {
      break
    } else {
      l <- unique(l1)
    }
  }
  print("repeat done")
  # use the result to assign a group id to the helper data.table
  Map(l,
      seq_along(l),
      f = function(ll, grp) dtx[ID %in% ll, ID_GRP := grp])
  # remove the helper id
  dtx[, ID := NULL]
  # assign the group id to the original data.table
  dt_out <- copy(dt)[dtx,
                     on = unique(unlist(by)),
                     ID_GRP := ID_GRP]
  return(dt_out[])
}
Result
find_grp(dt1, by = list("id1",
                        "id2",
                        c("surname", "firstname")))
#> [1] "merge done"
#> [1] "repeat done"
#> id1 id2 surname firstname ID_GRP
#> 1: 1 A Smith John 1
#> 2: 1 B Smith John 1
#> 3: 2 A Smith Joe 1
#> 4: 3 C Smith Joe 1
#> 5: 4 D Smith Jack 2
As you can see, ID_GRP is identified because:
- the first two rows share id1;
- since id2 for id1 = 1 contains A, row 3 with id2 = A belongs to the same group;
- finally, all Joe Smiths belong to the same group as well, because that is the name in row 3;
- and so on and so forth;
- only row 5 is completely unrelated.
{data.table} solutions are preferred
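The repeat loop above is effectively computing connected components of an overlap graph, which a union-find (disjoint-set) pass can do in near-linear time instead of repeated pairwise intersections. Below is a minimal base-R sketch; the helper name union_groups() is mine, not part of the code above. It takes a list like res (one integer vector of matching row ids per row) and returns one compact group id per row, so it could replace the whole repeat block.

```r
# Merge overlapping integer sets via union-find with path halving.
# l: list of integer vectors; indices appearing together in any one
#    vector must end up in the same group.
union_groups <- function(l) {
  n <- max(unlist(l))
  parent <- seq_len(n)
  find <- function(i) {
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]  # path halving: point i at its grandparent
      i <- parent[i]
    }
    i
  }
  for (grp in l) {
    r <- find(grp[[1]])
    for (j in grp[-1]) {
      rj <- find(j)
      if (rj != r) parent[rj] <- r     # union the two trees
    }
  }
  roots <- vapply(seq_len(n), find, integer(1L))
  match(roots, unique(roots))          # compact group ids 1, 2, ...
}

# rows 1-3 overlap transitively, row 4 stands alone:
union_groups(list(c(1L, 2L), c(2L, 3L), 4L))  # -> 1 1 1 2
```

Inside find_grp() this would replace the repeat block with something like dtx[, ID_GRP := union_groups(res)] (untested against the full function, but the merging logic is the same).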
This might help you. I'm not sure if I've completely understood your question. I've written a function gen_grp() that takes a data.table d and a vector of variables v. It steps through each unique id1 and replaces id1 values if matches of certain types are found.
gen_grp <- function(d, v) {
  for (id in unique(d$id1)) {
    d[id2 %in% d[id1 == id, id2], id1 := id]
    k <- unique(d[id1 == id, ..v])[, t := id]
    d <- k[d, on = v][!is.na(t), id1 := t][, t := NULL]
  }
  d[, grp := rleid(id1)]
  return(d[])
}
Usage:
gen_grp(dt1,c("surname","firstname"))
Output:
surname firstname id1 id2 grp
<char> <char> <num> <char> <int>
1: Smith John 1 A 1
2: Smith John 1 B 1
3: Smith Joe 1 A 1
4: Smith Joe 1 C 1
5: Smith Jack 4 D 2
Related
Let's take the below df
df <- data.table(id = c(1, 2, 2, 3),
                 datee = as.Date(c('2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03')))
df
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 2 2022-01-02
4: 3 2022-01-03
and I wanted to keep only the non-duplicated rows
df[!duplicated(id, datee)]
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 3 2022-01-03
which worked.
However, with the below df_1
df_1 <- data.table(a = c(1, 1, 2),
                   b = c(1, 1, 3))
df_1
a b
1: 1 1
2: 1 1
3: 2 3
using the same method does not rid the duplicated rows
df_1[!duplicated(a, b)]
a b
1: 1 1
2: 1 1
3: 2 3
What am I doing wrong?
Let's dive in to why your df_1[!duplicated(a, b)] doesn't work.
duplicated uses S3 method dispatch.
library(data.table)
.S3methods("duplicated")
# [1] duplicated.array duplicated.data.frame
# [3] duplicated.data.table* duplicated.default
# [5] duplicated.matrix duplicated.numeric_version
# [7] duplicated.POSIXlt duplicated.warnings
# see '?methods' for accessing help and source code
Looking at those, we aren't using duplicated.data.table since we're calling it with individual vectors (it has no idea it is being called from within a data.table context), so it makes sense to look into duplicated.default.
> debugonce(duplicated.default)
> df_1[!duplicated(a, b)]
debugging in: duplicated.default(a, b)
debug: .Internal(duplicated(x, incomparables, fromLast, if (is.factor(x)) min(length(x),
nlevels(x) + 1L) else nmax))
Browse[2]> match.call() # ~ "how this function was called"
duplicated.default(x = a, incomparables = b)
Confirming with ?duplicated:
x: a vector or a data frame or an array or 'NULL'.
incomparables: a vector of values that cannot be compared. 'FALSE' is
a special value, meaning that all values can be compared, and
may be the only value accepted for methods other than the
default. It will be coerced internally to the same type as
'x'.
From this we can see that a is being used for deduplication, and b is used as "incomparables". Because b contains the value 1, which appears (duplicated) in a, rows where a == 1 are never tested for duplication.
To confirm, if we change b such that it does not share (duplicated) values with a, we see that the deduplication of a works as intended (though it is silently ignoring b's dupes due to the argument problem):
df_1 <- data.table(a = c(1,1,2) , b = c(2,2,4))
df_1[!duplicated(a, b)] # accidentally correct, `b` is not used
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_1, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
df_2 <- data.table(a = c(1,1,2) , b = c(2,3,4))
df_2[!duplicated(a, b)] # wrong, `b` is not considered
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_2, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 1 3
# 3: 2 4
(Note that unique above is actually data.table:::unique.data.table, another S3 method dispatch provided by the data.table package.)
debug and debugonce are your friends :-)
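To actually deduplicate on multiple columns through duplicated(), pass the whole table (so the data.frame/data.table method dispatches) rather than bare vectors. A quick sketch:

```r
library(data.table)

df_1 <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))

# data.table method: choose the columns explicitly via `by`
res1 <- df_1[!duplicated(df_1, by = c("a", "b"))]

# equivalent here: dedupe on all columns at once
res2 <- df_1[!duplicated(df_1)]

identical(res1, res2)  # TRUE; both keep rows (1,1) and (2,3)
```

Either form removes the duplicated second row, which the bare-vector call df_1[!duplicated(a, b)] silently failed to do.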
I am trying to update a data.table dt1 from another data.table dt2, joined on a common id, for one specific column. That column should be updated only if its value in dt1 is empty (NA).
For instance, this updates everything :
dt1 <- data.table(
  "id" = c(1, 2, 3, 4),
  "animal" = c("cat", "cat", "dog", NA)
)
dt2 <- data.table(
  "id" = c(3, 4),
  "animal" = c("human being", "duck")
)
dt1 <- dt1[dt2, on = c("id"), "animal" := .(i.animal)]
Which gives for dt1
id animal
1: 1 cat
2: 2 cat
3: 3 human being
4: 4 duck
While I need
id animal
1: 1 cat
2: 2 cat
3: 3 dog
4: 4 duck
We could use fcoalesce
library(data.table)
dt1[dt2, animal := fcoalesce(animal, i.animal), on = .(id)]
-output
> dt1
id animal
<num> <char>
1: 1 cat
2: 2 cat
3: 3 dog
4: 4 duck
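An equivalent update that spells the "only when NA" condition out explicitly uses fifelse() instead of fcoalesce(); same result, just a different way to express it:

```r
library(data.table)

dt1 <- data.table(id = c(1, 2, 3, 4),
                  animal = c("cat", "cat", "dog", NA))
dt2 <- data.table(id = c(3, 4),
                  animal = c("human being", "duck"))

# overwrite animal only where it is NA; keep existing values otherwise
dt1[dt2, on = .(id), animal := fifelse(is.na(animal), i.animal, animal)]
# id 3 keeps "dog", id 4 gets "duck"
```

fcoalesce(animal, i.animal) and fifelse(is.na(animal), i.animal, animal) are interchangeable here; fcoalesce is the shorter idiom when the rule is simply "first non-NA wins".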
I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values which will match each other and some that won't. Some values in both data sets might be duplicated.
I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and get the intersect of index matches for 'firstName' and 'lastName'. I will then use the intersect of the indices to fetch the case ID (cid) from the first data set and append it to the second data set.
Because there might be more than one match per element, I will store them as lists within my data.table. I cannot use base::match function because it will only return the first match for each element, but I do need the answer to be vectorised in just the same way as the match function.
I've tried different combinations of which(d1$x %in% y) but this does not work either because it matches for all of y at once instead of one element at a time. I am using data.table because for my real-world use case, the data set to match on could be hundreds of thousands of records, so speed is important.
I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.
Here is some example data:
# Load library
library(data.table)
# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4),
                  firstName = c("Maria", "Jim", "Jack", "Anne"),
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
                  result = c("pos", "neg", "neg", "pos"))
My desired output would look like this:
# Output:
> dt2
lid firstName lastName result fn_match ln_match casematch caseid
1: 1 Maria Antonis pos NA NA NA <NA>
2: 2 Jim Dutto neg 1,4 4 4 c4
3: 3 Jack Blogs neg NA NA NA <NA>
4: 4 Anne Mcfee pos 3,5 3 3 c3
A possible solution:
dt1[, id := seq_along(cid)]
dt1[dt2, .(lid, id, firstName = i.firstName), on = .(firstName)][
  , .(casematch = .(id)), by = .(lid, firstName)]
lid firstName casematch
<num> <char> <list>
1: 1 Maria NA
2: 2 Jim 1,4
3: 3 Jack NA
4: 4 Anne 3,5
We could use
library(data.table)
dt1[dt2, .(casematch = toString(cid), lid),on = .(firstName), by = .EACHI]
-output
firstName casematch lid
<char> <char> <num>
1: Maria NA 1
2: Jim c1, c4 2
3: Jack NA 3
4: Anne c3, c5 4
Or with a row index (note that na_if() comes from dplyr):
dt1[dt2, .(casematch = na_if(toString(.I), 0), lid), on = .(firstName), by = .EACHI]
firstName casematch lid
<char> <char> <num>
1: Maria <NA> 1
2: Jim 1, 4 2
3: Jack <NA> 3
4: Anne 3, 5 4
Using .EACHI and adding the resulting list column by reference.
dt2[ , res := dt1[ , i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
# lid firstName res
# 1: 1 Maria NA
# 2: 2 Jim 1,4
# 3: 3 Jack NA
# 4: 4 Anne 3,5
Another data.table option
> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
firstName cid lid
1: Maria <NA> 1
2: Jim c1, c4 2
3: Jack <NA> 3
4: Anne c3, c5 4
In my real life scenario, I need to retrieve the indices for matches on more than one column. I found a way to do this in one step by combining some of the other solutions and figured it would be useful to also share this and the explanation of how it works below.
The code below adds a new column caseid to dt2, which gets its values from the column cid in dt1 for the row indices that matched on both firstName and lastName.
Putting dt1 inside the square brackets and specifying on = .(...) is equivalent to merging dt1 with dt2 on firstName and lastName, but instead of merging all columns from both datasets, one new column called caseid is created.
The lower case i. prefix on cid indicates that cid is a column from the data.table supplied inside the square brackets (here dt1).
The upper case .I inside the square brackets after i.cid will retrieve the row indices of dt1 that match dt2 on firstName and lastName.
# Get case IDs from dt1 for matches of firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]
# Output:
> dt2
lid firstName lastName result caseid
1: 1 Maria Antonis pos <NA>
2: 2 Jim Dutto neg c4
3: 3 Jack Blogs neg <NA>
4: 4 Anne Mcfee pos c3
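For completeness, the stepwise workflow the question describes (per-column index lists, then their intersection) can also be written out directly. It is slower than the join-based answers above, but mirrors the description one-to-one; the column names fn_match, ln_match and casematch are taken from the desired output, and non-matches come back as empty vectors rather than NA:

```r
library(data.table)

dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"),
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"),
                  lastName  = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))
dt2 <- data.table(lid = 1:4,
                  firstName = c("Maria", "Jim", "Jack", "Anne"),
                  lastName  = c("Antonis", "Dutto", "Blogs", "Mcfee"),
                  result = c("pos", "neg", "neg", "pos"))

# list columns holding ALL matching dt1 row indices per element
dt2[, fn_match := lapply(firstName, function(x) which(dt1$firstName == x))]
dt2[, ln_match := lapply(lastName,  function(x) which(dt1$lastName  == x))]
# a case matches only if the row indices agree on both columns
dt2[, casematch := Map(intersect, fn_match, ln_match)]
# fetch the case id for the surviving index (NA when nothing matched)
dt2[, caseid := vapply(casematch,
                       function(i) if (length(i)) dt1$cid[i[1]] else NA_character_,
                       character(1))]
dt2[, .(lid, caseid)]  # caseid: NA, c4, NA, c3
```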
I don't know how to say it clearly, which is maybe why I did not find the answer, but I want to edit the values of two different columns at the same time, while they are also the identifying columns.
For example this is the data :
> data = data.frame(name1 = c("John", "Jake", "John", "Paul"),
                    name2 = c("Paul", "Paul", "John", "John"),
                    value1 = c(0, 0, 1, 0),
                    value2 = c(1, 0, 1, 0))
> data
name1 name2 value1 value2
1 John Paul 0 1
2 Jake Paul 0 0
3 John John 1 1
4 Paul John 0 0
I would like to edit the values of the first row so that it becomes Jake & John instead of John & Paul, and I would like to combine these two lines of code to do it in one step:
data$name1[(data$name1 == "John" & data$name2 == "Paul")] <- "Jake"
data$name2[(data$name1 == "John" & data$name2 == "Paul")] <- "John"
Should be a simple trick, but I don't have it!
Also, I need to do this on larger datasets where each modification can appear on multiple rows, and I can't know in advance which rows will be modified.
How about this?
data[data$name1 == "John" & data$name2 == "Paul", c("name1", "name2")] <- list("Jake", "John")
data
#   name1 name2 value1 value2
# 1  Jake  John      0      1
# 2  Jake  Paul      0      0
# 3  John  John      1      1
# 4  Paul  John      0      0
library(tidyverse)
data %>%
  mutate(
    # evaluate the test once, BEFORE either column changes; a plain
    # case_when on name2 would otherwise see the already-updated name1
    # (mutate evaluates its expressions sequentially) and never fire
    cond = name1 == "John" & name2 == "Paul",
    name1 = if_else(cond, "Jake", name1),
    name2 = if_else(cond, "John", name2)
  ) %>%
  select(-cond)
I have a data frame as below. I want to split the last column into two. The split needs to happen at the first ':' only; the rest of the colons don't matter.
In the new data frame there will be 4 columns: the 3rd column will be (a, b, d), while the 4th column will be (1, 2:3, 3:4:4).
Any suggestions? The 4th line of my code doesn't work. I am okay with a completely new solution or with corrections to that line.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
as.data.frame(do.call(rbind, strsplit(df,":")))
--------------------update1
The solutions below work well, but I need a modified solution: I just realized that some of the cells in column 3 won't contain ':'. In that case I want the text in that cell to appear only in the first column after the split.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b", "d: 3:4:4"))
You could use cSplit. On your updated data frame,
library(splitstackshape)
cSplit(df, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b NA
# 3: Jolie Hope 1 d 3:4:4
And on your original data frame,
df1 <- data.frame(employee, salary,
originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
cSplit(df1, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b 2:3
# 3: Jolie Hope 1 d 3:4:4
Note: I'm using splitstackshape version 1.4.2. I believe the sep argument has been changed from version 1.4.0
You could use extract from tidyr to split the originalColumn in to two columns. In the below code, I am creating 3 columns and removing one of the unwanted columns from the result.
library(tidyr)
pat <- "([^ :])( ?:|: ?|)(.*)"
extract(df, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Using the updated df, (for better identification - df1)
extract(df1, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b
#3 Jolie Hope 1 d 3:4:4
Or without creating a new column in df
extract(df, originalColumn, c("Col1", "Col2"), "(.)[ :](.*)") %>%
mutate(Col2= gsub("^\\:", "", Col2))
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Based on the pattern in df, the code below also works. Here, the regex used to extract the first column is (.): a dot matches a single character, and the one at the beginning of the string, captured by the parentheses, is extracted as Col1. Then .{2} discards the two characters that follow, and the rest, captured by (.*), forms Col2.
extract(df, originalColumn, c("Col1", "Col2"), "(.).{2}(.*)")
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
or using strsplit
as.data.frame(do.call(rbind, strsplit(as.character(df$originalColumn), " :|: ")))
# V1 V2
#1 a 1
#2 b 2:3
#3 d 3:4:4
For df1, here is a solution using strsplit
lst <- strsplit(as.character(df1$originalColumn), " :|: ")
as.data.frame(do.call(rbind,lapply(lst,
`length<-`, max(sapply(lst, length)))) )
# V1 V2
#1 a 1
#2 b <NA>
#3 d 3:4:4
You were close, here's a solution (note that str_split_fixed() already returns a matrix, so no do.call(rbind, ...) is needed; wrapping it in one would error):
library(stringr)
df[, c('Col1', 'Col2')] <- str_split_fixed(df$originalColumn, ":", n = 2)
df$originalColumn <- NULL
employee salary Col1 Col2
1 John Doe 3 a 1
2 Peter Gynn 2 b 2:3
3 Jolie Hope 1 d 3:4:4
Notes:
stringr::str_split_fixed() is better than base::strsplit() because you don't have to call as.character() first; it also has the n = 2 argument you want, limiting the split to the first ':'.
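A base-R sketch that also handles the update1 case (cells with no ':' at all) by splitting on the first colon only: regexpr() finds the position of the first match, and rows without one keep their full text in Col1.

```r
df <- data.frame(employee = c('John Doe', 'Peter Gynn', 'Jolie Hope'),
                 salary = c(3, 2, 1),
                 originalColumn = c("a :1", "b", "d: 3:4:4"))

pos <- regexpr(":", df$originalColumn, fixed = TRUE)  # first ':' position, -1 if none
df$Col1 <- trimws(ifelse(pos > 0, substr(df$originalColumn, 1, pos - 1), df$originalColumn))
df$Col2 <- ifelse(pos > 0, trimws(substring(df$originalColumn, pos + 1)), NA)
# Col1: a, b, d   Col2: 1, NA, 3:4:4
```

Unlike the fixed-pattern splits above, this does not depend on the colon being surrounded by a particular spacing, since trimws() cleans up either side afterwards.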