create id variable from table of duplicates - r

I have a dataframe where each row has a unique identifier, but some rows are actually duplicates.
fdf <- data.frame(name = c("fred", "ferd", "frad", 'eric', "eirc", "george"),
id = 1:6)
fdf
#> name id
#> 1 fred 1
#> 2 ferd 2
#> 3 frad 3
#> 4 eric 4
#> 5 eirc 5
#> 6 george 6
I have determined which rows are duplicated, and this information is stored in a second dataframe as pairs of the unique ids. So the key tells me row 1 is the same individual as rows 2 and 3, etc.
key <- data.frame(id1 = c(1,1,2,4), id2 = c(2,3,3,5))
key
#> id1 id2
#> 1 1 2
#> 2 1 3
#> 3 2 3
#> 4 4 5
I'm struggling to think up a straightforward way to use the key to create an id variable in my original dataframe. Desired output would be:
fdf$realid <- c(1,1,1,2,2,3)
fdf
#> name id realid
#> 1 fred 1 1
#> 2 ferd 2 1
#> 3 frad 3 1
#> 4 eric 4 2
#> 5 eirc 5 2
#> 6 george 6 3
Edit for clarity
Keys here are the set of true connections between rows in the data.frame fdf. Thus you can imagine starting with the set of all feasible connections:
# id1 id2
# 1 2
# 1 3
# 1 4
# ...
# 6 4
# 6 5
determining which are true connections (based on the other variables in each observation).
# id1 id2 match
# 1 2 match
# 1 3 no match
# 1 4 match
# ...
# 6 4 no match
# 6 5 no match
and sub-setting to the cases that are matches.

The easiest way would be to recreate the key data frame to the following format (i.e. which id belongs to which realid)
key <- data.frame(id = c(1, 2, 3, 4, 5, 6),
realid = c(1, 1, 1, 2, 2, 3))
Then it is just a matter of merging fdf and key together with merge
fdf <- merge(fdf, key, by = "id")
fdf
id name realid
1 1 fred 1
2 2 ferd 1
3 3 frad 1
4 4 eric 2
5 5 eirc 2
6 6 george 3

I didn't find a straightforward way, but the following seems to work well.
First you check which IDs are together in a group, by checking whether there's 'overlap', i.e. whether the intersection between two rows in key is non-empty:
check_overlap <- function(pair1, pair2){
newset <- intersect(pair1, pair2)
length(newset) != 0
}
Then we can apply this function to the rows in key against the other rows. If a row has been matched already, it is automatically removed from key, like this:
check_overlaps <- function(key){
cont <- data.frame()
i <- 1
while(nrow(key) > 0){
ids <- apply(key, 1, check_overlap, key[1, ])
vals <- unique(unlist(key[ids, ]))
key <- key[!ids, ]
cont <- rbind(cont, cbind(vals, rep(i, length(vals))))
i <- i+1
}
return(cont)
}
new_ids <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
The problem with merging fdf and new_ids, however, is that some old IDs may not occur in key, but they should be mapped to a new ID according to the new order. You can manipulate key a bit a priori and do:
for(val in unique(fdf$id)){
if(!(val %in% unlist(key))){
key <- rbind(key, c(val, val))
}
}
new_ids2 <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 3
Which is easy to merge with fdf like:
merge(fdf, new_ids2, by.x = "id", by.y = "vals")
# id name V2
# 1 1 fred 1
# 2 2 ferd 1
# 3 3 frad 1
# 4 4 eric 2
# 5 5 eirc 2
# 6 6 george 3

If I understand your question correctly it can be solved by creating groups of matching ids and creating a new (real) id out of these groups:
# determine the groups of ids
id_groups <- list()
i = 1
for (id in unique(key$id1)) {
if (!(id %in% unlist(id_groups))) {
id_groups[[i]] <- c(id, key$id2[key$id1 == id])
i = i + 1
}
}
# add ids without match
id_groups <- c(id_groups, setdiff(fdf$id, unlist(id_groups)))
# for every id in fdf, set real_id to index in id_groups to which id belongs
fdf$real_id <- sapply(fdf$id, function(id) {
which(sapply(id_groups, function(group) id %in% group))
})
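One caveat with the loop above: it only collects ids that are directly paired with a given id1, so a chain such as 1-2 and 2-3 (without an explicit 1-3 row) would leave id 3 in its own group. A minimal base-R sketch that follows such chains, assuming every id in key also appears in fdf$id, is shown below. The helper name component_ids is hypothetical and not part of any package.

```r
# Hypothetical helper: label connected components of an edge list (base R only).
# Assumes every value in `from` and `to` also occurs in `ids`.
component_ids <- function(ids, from, to) {
  parent <- seq_along(ids)              # each id starts as its own root
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  a <- match(from, ids)
  b <- match(to, ids)
  for (k in seq_along(a)) {
    ra <- find_root(a[k])
    rb <- find_root(b[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)  # smaller index becomes the root
  }
  roots <- vapply(seq_along(ids), find_root, integer(1))
  match(roots, unique(roots))           # renumber roots as 1, 2, 3, ...
}

fdf <- data.frame(name = c("fred", "ferd", "frad", "eric", "eirc", "george"),
                  id = 1:6)
key <- data.frame(id1 = c(1, 1, 2, 4), id2 = c(2, 3, 3, 5))
fdf$realid <- component_ids(fdf$id, key$id1, key$id2)
chain <- component_ids(1:4, c(1, 2), c(2, 3))  # 1-2 and 2-3 only
```

Here fdf$realid reproduces the desired grouping from the question, and chain puts ids 1, 2 and 3 in one group even though no 1-3 pair is listed.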

Related

R: Missing data on table, complete it by referencing partial matches to a "Reference" table

I have two tables; "Reference" and "TableA".
I am looking through TableA, which is an incomplete table, and would like to turn it into a "complete" table by referencing the "Reference" table, filling in missing values, and/or adding rows where multiple matches are found.
Reproducible example of "Reference" and "TableA" are below:
A <- c(1,1,1,2,4,4,5,5,7,6,2,1)
B <- c(1,2,2,2,4,4,9,5,8,6,2,9)
C <- c(1,1,3,3,4,5,5,5,7,6,3,3)
D <- c(1,2,1,1,2,1,2,1,2,2,2,1)
Reference <- data.frame(A,B,C,D)
A <- c(NA,1,5,2,4,1)
B <- c(NA,2,NA,2,NA,1)
C <- c(3,NA,5,NA,NA,1)
D <- c(1,1,2,2,1,1)
TableA <- data.frame(A,B,C,D)
I have attempted to resolve this by doing the following:
for (i in 1:dim(TableA)[1])
{
tmp<-TableA[i,]
repet<-ifelse(is.na(TableA$D[i]), Reference, 1 )
for (j in 1:repet) {
tmp$D<-ifelse(repet>1, Reference$D[j,], tmp$D)
collector<-rbind(collector, tmp)
}
}
collector
However, this solution will return the entirety of Reference$D, but I would only like to return those records from Reference$D whose columns A,B,C match (or partially match) what is on TableA.
For example, in Row 1 of TableA, I would like to replace Row 1 with the Reference table's rows 3,4, and 12.
Expected output below.
Note that the Reference table combination 1,2,3,1 appears twice on the expected output as it is a match for both rows 1 & 2 of TableA.
A B C D
1 2 3 1
2 2 3 1
1 9 3 1
1 2 3 1
5 9 5 2
2 2 3 2
4 4 5 1
1 1 1 1
I'll first create an extra column "string" in both TableA and Reference, with NA replaced with a dot . in TableA, which would be used in regex matching.
Then find out which string in TableA appeared in Reference, and store them in a matrix.
Finally, replicate the lgl_matrix row number by the number of matches, and use those row numbers as index in Reference.
library(tidyverse)
TableA <- TableA %>%
mutate(across(A:D, ~ replace_na(as.character(.x), "."))) %>%
rowwise() %>%
mutate(string = paste0(c_across(A:D), collapse = ""))
Reference <- Reference %>%
rowwise() %>%
mutate(string = paste0(c_across(A:D), collapse = ""))
lgl_matrix <- sapply(TableA$string, grepl, x = Reference$string)
Reference[rep(1:nrow(lgl_matrix), rowSums(lgl_matrix)), -5]
# A tibble: 8 x 4
# Rowwise:
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 3 1
3 1 2 3 1
4 2 2 3 1
5 4 4 5 1
6 5 9 5 2
7 2 2 3 2
8 1 9 3 1
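The same regex idea can be sketched in base R while keeping the matches grouped by TableA row, which is the order used in the question's expected output. This is only a sketch and it assumes every value is a single character wide, so that "." stands in for exactly one missing value. The names ref_string, patterns and idx are illustrative.

```r
Reference <- data.frame(A = c(1,1,1,2,4,4,5,5,7,6,2,1),
                        B = c(1,2,2,2,4,4,9,5,8,6,2,9),
                        C = c(1,1,3,3,4,5,5,5,7,6,3,3),
                        D = c(1,2,1,1,2,1,2,1,2,2,2,1))
TableA <- data.frame(A = c(NA,1,5,2,4,1),
                     B = c(NA,2,NA,2,NA,1),
                     C = c(3,NA,5,NA,NA,1),
                     D = c(1,1,2,2,1,1))

# One string per Reference row, e.g. "1231"
ref_string <- do.call(paste0, Reference)

# One regex per TableA row, with NA replaced by the wildcard "."
patterns <- apply(TableA, 1, function(r) {
  paste0(ifelse(is.na(r), ".", r), collapse = "")
})

# For each TableA row, the Reference rows whose full string matches
idx <- unlist(lapply(patterns, function(p) {
  which(grepl(paste0("^", p, "$"), ref_string))
}), use.names = FALSE)

completed <- Reference[idx, ]
rownames(completed) <- NULL
completed
```

With multi-digit values the strings would need a separator between columns, and the wildcard would have to change accordingly.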

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
DebtorId = c(1,1,1,5,5,5,6,7),
DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
a <- newdt[i]$DebtorId
b <- min(which(groups[,-1] == a, arr.ind=TRUE)[,1])
newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are 2 problems in my approach:
From the output, you can see that it fails to recognize that case 5 belongs to group 1;
The final loop is agonizingly slow, which would render it useless for my use case of 1M rows in my original data, and going the traditional := way does not work with which()
I'm not sure whether my approach could be optimized, or there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract cluster membership. stack() is just an easy way to convert a named vector to data frame.
library(igraph)
g <- graph_from_data_frame(example)
df_membership <- components(g)$membership
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
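To get from that stacked result to the layout asked for in the question, the columns can be renamed and the rows sorted. The sketch below skips the igraph step and hard-codes a named vector copied from the output shown above, so the membership object here is an assumption standing in for the real result.

```r
# Stand-in for the igraph membership vector, copied from the printed output above
membership <- c(`1` = 1, `5` = 1, `6` = 2, `7` = 2, `2` = 1, `3` = 1, `4` = 1)

out <- stack(membership)
out <- data.frame(DebtorId = as.numeric(as.character(out$ind)),  # factor -> number
                  group = out$values)
out <- out[order(out$DebtorId), ]
rownames(out) <- NULL
out
```

The conversion through as.character() matters because ind is a factor, and calling as.numeric() on it directly would return the level codes rather than the ids.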

Extract data based on another list

I am trying to extract rows of a dataset based on a list of time points nested within individuals. I have repeated time points (therefore exactly the same variable values) but I still want to keep the duplicated rows. How to achieve that in base R?
Here is the original dataset:
xx <- data.frame(id=rep(1:3, each=3), time=1:3, y=rep(1:3, each=3))
Here is the list of time points, one vector per id:
lst <- list(`1` = c(1, 1, 2), `2` = c(1, 3, 3), `3` = c(2, 2, 3))
Desirable outcome:
id time y
1 1 1
1 1 1 #this is the duplicated row
1 2 1
2 1 2
2 3 2
2 3 2 #this is the duplicated row
3 2 3
3 2 3 #this is the duplicated row
3 3 3
The code do.call(rbind, Map(function(p, q) subset(xx, id == q & time %in% p), lst, names(lst))) did not work for me because subset removes duplicated rows
The issue is that %in% doesn't iterate over the non-unique values repeatedly. To do so, we need to also iterate (lapply) over p internally. I'll wrap your inner subset in another do.call(rbind, lapply(p, ...)) to get what you expect:
do.call(rbind, Map(function(p, q) {
do.call(rbind, lapply(p, function(p0) subset(xx, id == q & time %in% p0)))
}, lst, names(lst)))
# id time y
# 1.1 1 1 1
# 1.2 1 1 1
# 1.21 1 2 1
# 2.4 2 1 2
# 2.6 2 3 2
# 2.61 2 3 2
# 3.8 3 2 3
# 3.81 3 2 3
# 3.9 3 3 3
(Row names are a distraction here ...)
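A small illustration of why %in% drops the repeats: it returns one logical value per row of the data, so repeating a value on the right-hand side changes nothing. Indexing with match() instead returns one row per requested time point. This is a sketch for a single id, and the names sub1, in_p and picked are illustrative.

```r
xx <- data.frame(id = rep(1:3, each = 3), time = 1:3, y = rep(1:3, each = 3))
p <- c(1, 1, 2)                  # time points wanted for id 1, with a repeat

sub1 <- xx[xx$id == 1, ]
in_p <- sub1$time %in% p         # one TRUE/FALSE per row of sub1
picked <- sub1[match(p, sub1$time), ]   # one row per element of p
picked
```

in_p has length 3 (one entry per row of sub1) no matter how often a value repeats in p, while picked has one row for each element of p, so time 1 appears twice.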
An alternative would be to convert your lst into a frame of id and time, and then left-join on it:
frm <- do.call(rbind, Map(function(x, nm) data.frame(id = nm, time = x), lst, names(lst)))
frm
# id time
# 1.1 1 1
# 1.2 1 1
# 1.3 1 2
# 2.1 2 1
# 2.2 2 3
# 2.3 2 3
# 3.1 3 2
# 3.2 3 2
# 3.3 3 3
merge(frm, xx, by = c("id", "time"), all.x = TRUE)
# id time y
# 1 1 1 1
# 2 1 1 1
# 3 1 2 1
# 4 2 1 2
# 5 2 3 2
# 6 2 3 2
# 7 3 2 3
# 8 3 2 3
# 9 3 3 3
Two good resources for learning about merges/joins:
How to join (merge) data frames (inner, outer, left, right)
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?

Finding group with distinct (non-overlapping) elements

I have a simple dataframe with group IDs and elements of each group, like this:
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
Each ID may have a different number of elements. Now I want to find all IDs that have distinct elements with other IDs. In this example, ID1 and ID3 will be selected because they have distinct elements (3,5,7 vs 2,4,6). I also want to copy these unique IDs and their elements into a new dataframe, similar to the original.
How would I do that in R? My skills with R are quite limited.
Thank you very much!
Bests,
Seems like a good question for igraph cliques with one edge to another clique, but I can't seem to wrap my head around how to use it.
Anyway, here is an option applying join to identify IDs with same Values and then anti-join to remove those IDs using data.table:
library(data.table)
DT <- as.data.table(x)
for (i in DT[, unique(ID)]) {
dupeID <- DT[DT[ID==i], on=.(Values), .(ID=unique(x.ID[x.ID!=i.ID]))]
DT <- DT[!dupeID , on=.(ID)]
}
output:
ID Values
1: 1 3
2: 1 5
3: 1 7
4: 3 2
5: 3 4
6: 3 6
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
gps = split(x, x$ID)
nGroups = length(gps)
k = 1
results = data.frame(ID = NULL, Values = NULL)
for(i in 1:(nGroups - 1)){
j = i + 1
while(j <= nGroups){
if(length(intersect(gps[[i]]$Values, gps[[j]]$Values)) == 0){
print(c(i,j))
results = rbind(results, gps[[i]], gps[[j]])
}
j = j + 1
}
}
results
> results
ID Values
1 1 3
2 1 5
3 1 7
7 3 2
8 3 4
9 3 6
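The pairwise check in the loop above can also be sketched with combn(), which enumerates every pair of IDs once. This is an illustrative variant rather than a replacement for the answers here, and it keeps any ID that is disjoint from at least one other ID. The names vals, pairs, disjoint, keep and selected are made up for the example.

```r
x <- data.frame(ID = c(1,1,1,2,2,2,3,3,3), Values = c(3,5,7,2,4,5,2,4,6))

vals <- split(x$Values, x$ID)        # one vector of Values per ID
pairs <- combn(names(vals), 2)       # each column is one pair of IDs

# TRUE for pairs that share no Values
disjoint <- apply(pairs, 2, function(p) {
  length(intersect(vals[[p[1]]], vals[[p[2]]])) == 0
})

keep <- unique(as.vector(pairs[, disjoint]))
selected <- x[x$ID %in% keep, ]
selected
```

For the sample data only the pair of IDs 1 and 3 is disjoint, so selected contains the rows for those two IDs.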
You can try the following code, where y is a list of data frames, each combining the first remaining group with every other group whose Values do not overlap with it:
xs <- split(x,x$ID)
id <- names(xs)
y <- list()
ids <- seq_along(xs)
repeat {
if (length(ids)==0) break;
y[[length(y)+1]] <- xs[[ids[1]]]
p <- ids[[1]]
qs <- p
for (q in ids[-1]) {
if (length(intersect(xs[[p]]$Values,xs[[q]]$Values))==0) {
y[[length(y)]] <- rbind(y[[length(y)]],xs[[q]])
qs <- c(qs,q)
}
}
ids <- setdiff(ids,qs)
}
Example
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3,4,4),
"Values" = c(3,5,7,2,4,5,2,4,6,1,3))
> x
ID Values
1 1 3
2 1 5
3 1 7
4 2 2
5 2 4
6 2 5
7 3 2
8 3 4
9 3 6
10 4 1
11 4 3
then you will get
> y
[[1]]
ID Values
1 1 3
2 1 5
3 1 7
7 3 2
8 3 4
9 3 6
[[2]]
ID Values
4 2 2
5 2 4
6 2 5
10 4 1
11 4 3

Remove observations for which there is not a duplicate

I would like to break a dataset into two frames - one for which the original dataset has duplicate observations based on a condition and one for which the original dataset does not have duplicate observations based on a condition. In the following example, I would like to break the frame into one for which there is only one coder for an observation and one for which there are two coders:
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
From this, I would like to produce the following:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
6 3 A 2
You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2
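If both frames are wanted in one step, the same count can feed split(). The sketch below uses tapply() in place of aggregate(), and the labels "frame1" and "frame2" are simply taken from the question.

```r
frame <- data.frame(id = c(1,1,1,2,2,3),
                    coder = c("A", "A", "B", "A", "B", "A"),
                    y = c(4,5,4,1,1,2))

# Number of distinct coders per id (named by id)
n_coders <- tapply(frame$coder, frame$id, function(x) length(unique(x)))

# TRUE for rows whose id was seen by two or more coders
multi <- frame$id %in% names(n_coders)[n_coders >= 2]

parts <- split(frame, ifelse(multi, "frame1", "frame2"))
parts$frame1
parts$frame2
```

parts is a named list, so parts$frame1 holds the ids with several coders and parts$frame2 the rest.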
You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
If we negate that table, the result is a logical matrix that is TRUE wherever a coder has no observations for an id:
!table(frame$coder, frame$id)
Taking the colSums of the above counts the missing coders per id, which results in
# 1 2 3
# 0 0 1
Negating again gives TRUE for the ids that were seen by every coder; this is indx.
From this you can subset frame by matching frame$id against the names of indx.
