Finding groups with distinct (non-overlapping) elements - R

I have a simple dataframe with group IDs and elements of each group, like this:
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
Each ID may have a different number of elements. Now I want to find the IDs whose sets of elements do not overlap. In this example, ID 1 and ID 3 would be selected because they have no elements in common (3,5,7 vs. 2,4,6). I also want to copy these IDs and their elements into a new dataframe with the same structure as the original.
How would I do that in R? My skills in R are quite limited.
Thank you very much!
Best,

Seems like a good question for igraph cliques (with one edge to another clique), but I can't seem to wrap my head around how to use it.
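For what it's worth, here is a sketch of that clique idea (my own assumption of how it could look, not part of the answer below): make each ID a node, connect two IDs when their Values do not overlap, and read the maximal cliques off as sets of mutually distinct IDs:
library(igraph)
vals <- split(x$Values, x$ID)
# edge between two IDs exactly when their Values are disjoint
compat <- outer(seq_along(vals), seq_along(vals), Vectorize(function(i, j)
  i != j && length(intersect(vals[[i]], vals[[j]])) == 0))
g <- graph_from_adjacency_matrix(1 * compat, mode = "undirected")
# each maximal clique is a set of pairwise-distinct IDs, here {1, 3} and {2}
lapply(max_cliques(g), function(v) names(vals)[as.integer(v)])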
Anyway, here is an option in data.table: a join to identify IDs that share Values, followed by an anti-join to remove those IDs:
library(data.table)
DT <- as.data.table(x)
for (i in DT[, unique(ID)]) {
  # join on Values to find the other IDs sharing at least one Value with ID i
  dupeID <- DT[DT[ID == i], on = .(Values), .(ID = unique(x.ID[x.ID != i.ID]))]
  # anti-join: drop those overlapping IDs from DT
  DT <- DT[!dupeID, on = .(ID)]
}
output:
ID Values
1: 1 3
2: 1 5
3: 1 7
4: 3 2
5: 3 4
6: 3 6

x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
gps <- split(x, x$ID)
nGroups <- length(gps)
results <- data.frame(ID = NULL, Values = NULL)
# compare every pair of groups; keep both when they share no Values
for (i in 1:(nGroups - 1)) {
  for (j in (i + 1):nGroups) {
    if (length(intersect(gps[[i]]$Values, gps[[j]]$Values)) == 0) {
      print(c(i, j))
      results <- rbind(results, gps[[i]], gps[[j]])
    }
  }
}
> results
ID Values
1 1 3
2 1 5
3 1 7
7 3 2
8 3 4
9 3 6
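One caveat (my note, not from the original answer): if an ID happens to be disjoint from several other IDs, its rows are appended to results once per matching pair. A final deduplication collapses them:
results <- unique(results)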

You can try the following code, where y is the resulting list of data frames (each element collecting the IDs whose Values are mutually exclusive):
xs <- split(x, x$ID)
y <- list()
ids <- seq_along(xs)
repeat {
  if (length(ids) == 0) break
  # start a new output data frame with the first remaining group
  y[[length(y) + 1]] <- xs[[ids[1]]]
  p <- ids[1]
  qs <- p
  # append every later group whose Values do not overlap with group p
  for (q in ids[-1]) {
    if (length(intersect(xs[[p]]$Values, xs[[q]]$Values)) == 0) {
      y[[length(y)]] <- rbind(y[[length(y)]], xs[[q]])
      qs <- c(qs, q)
    }
  }
  ids <- setdiff(ids, qs)
}
Example
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3,4,4),
"Values" = c(3,5,7,2,4,5,2,4,6,1,3))
> x
ID Values
1 1 3
2 1 5
3 1 7
4 2 2
5 2 4
6 2 5
7 3 2
8 3 4
9 3 6
10 4 1
11 4 3
then you will get
> y
[[1]]
ID Values
1 1 3
2 1 5
3 1 7
7 3 2
8 3 4
9 3 6
[[2]]
ID Values
4 2 2
5 2 4
6 2 5
10 4 1
11 4 3

Related

Dividing numbers into different groups according to an adjacency relationship

I have a dataframe that stores adjacency relations. I want to divide numbers into different groups according to this dataframe. The dataframe is as follows:
df = data.frame(from=c(1,1,2,2,2,3,3,3,4,4,4,5,5), to=c(1,3,2,3,4,1,2,3,2,4,5,4,5))
df
from to
1 1 1
2 1 3
3 2 2
4 2 3
5 2 4
6 3 1
7 3 2
8 3 3
9 4 2
10 4 4
11 4 5
12 5 4
13 5 5
In the above dataframe, number 1 has links with numbers 1 and 3, and number 2 has links with numbers 2, 3, and 4, so number 1 cannot be in the same group as number 3, and number 2 cannot be in the same group as numbers 3 and 4. In the end, the groups can be c(1, 2, 5) and c(3, 4).
How can I program this?
First replace the values of to with NA when from and to are equal.
df2 <- transform(df, to = replace(to, from == to, NA))
Then fold the rows together with Reduce(), binding each row onto the accumulated result only if its from value has not already appeared in the to column of the rows kept so far.
Reduce(function(x, y) {
  if (y$from %in% x$to) x else rbind(x, y)
}, split(df2, 1:nrow(df2)))
# from to
# 1 1 NA
# 2 1 3
# 3 2 NA
# 4 2 3
# 5 2 4
# 12 5 4
# 13 5 NA
Finally, you can extract the unique elements of each column to get the two groups.
The overall pipeline should be
df |>
  transform(to = replace(to, from == to, NA)) |>
  (\(dat) split(dat, 1:nrow(dat)))() |>
  Reduce(f = \(x, y) if (y$from %in% x$to) x else rbind(x, y))
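To spell out that final extraction (a small addition of mine, assuming the pipeline's result is assigned to res):
# res holds the result of the pipeline above
group1 <- unique(res$from)        # 1 2 5
group2 <- unique(na.omit(res$to)) # 3 4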
Darren Tsai's answer solves this problem, but with some flaws. The following is a very clumsy alternative:
df = data.frame(from=c(1,1,2,2,2,3,3,3,4,4,4,5,5), to=c(1,3,2,3,4,1,2,3,2,4,5,4,5))
df.list = lapply(split(df, df$from), function(x) {
  x$to
})
group.idx = rep(1, length(unique(df$from)))
# walk each number's adjacency list, bumping a linked number into the
# next group whenever it currently shares a group with the focal number
for (i in seq_along(df.list)) {
  df.vec <- df.list[[i]]
  curr.group = group.idx[i]
  remain.vec = setdiff(df.vec, i)
  for (j in remain.vec) {
    if (group.idx[j] == curr.group) {
      group.idx[j] = curr.group + 1
    }
  }
}
group.idx
[1] 1 1 2 2 1
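For comparison (my addition, not part of the answer above), this is a graph-coloring problem, so igraph can also tackle it directly, assuming your igraph version provides greedy_vertex_coloring():
library(igraph)
conf <- subset(df, from != to)  # keep only real conflict edges
g <- simplify(graph_from_data_frame(conf, directed = FALSE))
greedy_vertex_coloring(g)  # here: 1 1 2 2 1 for numbers 1..5, i.e. groups c(1,2,5) and c(3,4)
# note: greedy coloring is a heuristic and is not guaranteed to use the fewest groups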

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
  DebtorId = c(1,1,1,5,5,5,6,7),
  DupDebtorId = c(2,3,4,1,2,3,7,6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] #get unique pairs of DebtorID and DupDebtorID
unique_pairs[, group := .GRP, by=.(DebtorId)] #assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') #format data to wide for each group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[,-1] == a, arr.ind = TRUE)[,1])
  newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are two problems with my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which would render it useless for my real use case of 1M rows, and going the traditional := way does not work with which().
I'm not sure whether my approach can be optimized, or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, you can build a graph from the data frame and then extract cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(df)               # graph_from_data_frame() in newer igraph
df_membership <- clusters(g)$membership # components() in newer igraph
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
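If you need exactly the DebtorId/group layout from the question, a short follow-up (my addition) tidies the stacked result:
res <- stack(df_membership)
names(res) <- c("group", "DebtorId")
res$DebtorId <- as.numeric(as.character(res$DebtorId))
res[order(res$DebtorId), c("DebtorId", "group")]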

create id variable from table of duplicates

I have a dataframe where each row has a unique identifier, but some rows are actually duplicates.
fdf <- data.frame(name = c("fred", "ferd", "frad", "eric", "eirc", "george"),
                  id = 1:6)
fdf
#> name id
#> 1 fred 1
#> 2 ferd 2
#> 3 frad 3
#> 4 eric 4
#> 5 eirc 5
#> 6 george 6
I have determined which rows are duplicates, and this information is stored in a second dataframe as pairs of the unique ids. So the key tells me row 1 is the same individual as rows 2 and 3, etc.
key <- data.frame(id1 = c(1,1,2,4), id2 = c(2,3,3,5))
key
#> id1 id2
#> 1 1 2
#> 2 1 3
#> 3 2 3
#> 4 4 5
I'm struggling to think up a straightforward way to use the key to create an id variable in my original dataframe. Desired output would be:
fdf$realid <- c(1,1,1,2,2,3)
fdf
#> name id realid
#> 1 fred 1 1
#> 2 ferd 2 1
#> 3 frad 3 1
#> 4 eric 4 2
#> 5 eirc 5 2
#> 6 george 6 3
Edit for clarity
Keys here are the set of true connections between rows in the data.frame fdf. Thus you can imagine starting with the set of all feasible connections:
# id1 id2
# 1 2
# 1 3
# 1 4
# ...
# 6 4
# 6 5
determining which are true connections (based on the other variables in each observation).
# id1 id2 match
# 1 2 match
# 1 3 no match
# 1 4 match
# ...
# 6 4 no match
# 6 5 no match
and subsetting to the cases that are matches.
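For concreteness (a small sketch of mine, not part of the original edit), the full set of feasible connections can be generated with combn(); it produces each unordered pair once, whereas the listing above shows both directions:
feasible <- as.data.frame(t(combn(fdf$id, 2)))
names(feasible) <- c("id1", "id2")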
The easiest way would be to recreate the key data frame in the following format (i.e., which id belongs to which realid):
key <- data.frame(id     = c(1, 2, 3, 4, 5, 6),
                  realid = c(1, 1, 1, 2, 2, 3))
Then it is just a matter of merging fdf and key together with merge:
fdf <- merge(fdf, key, by = "id")
fdf
id name realid
1 1 fred 1
2 2 ferd 1
3 3 frad 1
4 4 eric 2
5 5 eirc 2
6 6 george 3
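If building the id-to-realid table by hand is impractical, one alternative (my addition, using igraph rather than the approaches below) is to treat the original key of id1/id2 pairs as edges and let connected components assign realid:
library(igraph)
# key here is the original pairs table from the question
g <- graph_from_data_frame(key, directed = FALSE,
                           vertices = data.frame(name = fdf$id))
fdf$realid <- components(g)$membership[as.character(fdf$id)]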
I didn't find a 'straightforward' way, but the following seems to work well.
First you check which IDs belong together in a group, by testing whether there's 'overlap', i.e. whether the intersection between two rows in key is non-empty:
check_overlap <- function(pair1, pair2) {
  newset <- intersect(pair1, pair2)
  length(newset) != 0
}
Then we can apply this function to the rows in key against the other rows. If a row has been matched already, it is automatically removed from key, like this:
check_overlaps <- function(key) {
  cont <- data.frame()
  i <- 1
  while (nrow(key) > 0) {
    ids <- apply(key, 1, check_overlap, key[1, ])
    vals <- unique(unlist(key[ids, ]))
    key <- key[!ids, ]
    cont <- rbind(cont, cbind(vals, rep(i, length(vals))))
    i <- i + 1
  }
  return(cont)
}
new_ids <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
The problem with merging fdf and new_ids, however, is that some old IDs may not occur in key, even though they should be mapped to a new ID according to the new order. You can adjust key beforehand to cover them:
for (val in unique(fdf$id)) {
  if (!(val %in% unlist(key))) {
    key <- rbind(key, c(val, val))
  }
}
new_ids2 <- check_overlaps(key)
# vals V2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 3
Which is easy to merge with fdf like:
merge(fdf, new_ids2, by.x = "id", by.y = "vals")
id name V2
# 1 1 fred 1
# 2 2 ferd 1
# 3 3 frad 1
# 4 4 eric 2
# 5 5 eirc 2
# 6 6 george 3
If I understand your question correctly, it can be solved by creating groups of matching ids and creating a new (real) id out of these groups:
# determine the groups of ids
id_groups <- list()
i <- 1
for (id in unique(key$id1)) {
  if (!(id %in% unlist(id_groups))) {
    id_groups[[i]] <- c(id, key$id2[key$id1 == id])
    i <- i + 1
  }
}
# add ids without a match
id_groups <- c(id_groups, setdiff(fdf$id, unlist(id_groups)))
# for every id in fdf, set real_id to the index of the id_groups element it belongs to
fdf$real_id <- sapply(fdf$id, function(id) {
  which(sapply(id_groups, function(group) id %in% group))
})

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. The same goes for grp2: all 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[newcols] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
  node[match(df[[i]], values)]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
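A more compact variant of the same idea (my sketch, not part of the answer above) avoids the explicit node table by prefixing the two columns into one shared node namespace:
library(igraph)
edges <- cbind(paste0("g1_", df$grp1), paste0("g2_", df$grp2))
g <- graph_from_edgelist(edges, directed = FALSE)
# component membership of each row's grp1 node is the combined group
df$combinedGrp <- components(g)$membership[paste0("g1_", df$grp1)]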
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than edges to define the graph. An advantage of this approach is that the same code immediately handles more than two grouping columns (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so for very long data it could be slow; as is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are:
1. compare rows based on groups and define these rows as linked (i.e., create a graph);
2. determine the connected components of the graph defined by the links in 1.
You could do step 2 a few ways. Below I show a brute-force way where you 2a) collapse links until reaching a stable link structure using matrix multiplication, and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. Construct an adjacency matrix (matrix of pairwise links) between rows: if two rows are in the same group, the matrix entry is 1, otherwise it's 0. First, make a helper function that determines whether two rows are linked:
linked_rows <- function(data) {
  ## helper function
  ## returns a _function_ to compare two rows of data
  ## based on group membership.
  ## Use Vectorize so it works even on vectors of indices
  Vectorize(function(i, j) {
    ## numeric: 1 = i and j have overlapping group membership
    common <- vapply(names(data), function(name)
      data[i, name] == data[j, name],
      FUN.VALUE = FALSE)
    as.numeric(any(common))
  })
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
  A <- A %*% A
  A[A > 0] <- 1
  A
}
Repeat this until the links are stable:
oldA <- 0
while (any(oldA != A)) {
  oldA <- A
  A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  oldA <- 0
  while (any(oldA != A)) {
    oldA <- A
    A <- lump_links(A)
  }
  df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
  df
}
This works for the original df and also for the structure in @rawr's answer:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer clearer:
lump2 <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
  df$combinedGrp <- cluster_A$membership
  df
}
Hope this solution helps you a bit. Assumption: df is ordered by grp1.
## split the dataset using the values of grp1
split_df <- split.default(df$grp2, df$grp1)
parent <- vector('integer', length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1, length(split_df) - 1)) {
  for (j in seq(i + 1, length(split_df))) {
    inter <- intersect(split_df[[i]], split_df[[j]])
    if (length(inter) > 0) {
      parent[j] <- i
    }
  }
}
ans <- vector('list', length(split_df))
index <- which(parent == 0)
## index contains the indices of groups that share no grp2 values with an earlier group
for (i in seq_along(index)) {
  ans[[index[i]]] <- rep(i, length(split_df[[index[i]]]))
}
## the remaining groups inherit the label of their parent group
rest_index <- seq(1, length(split_df))[-index]
for (i in rest_index) {
  val <- ans[[parent[i]]][1]
  ans[[i]] <- rep(val, length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different igraph approach, because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
library(dplyr) # needed for group_by() and summarize() below
# Use example data
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc <- nc[-sample(1:nrow(nc), nrow(nc) * .75), ] # drop some polygons
# Find intersections
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for (i in 1:nrow(nc)) {
  for (j in 1:length(gr)) {
    if (i %in% gr[[j]]) {
      nc[i, 'group'] <- j
    }
  }
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
  summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])
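As a side note (my observation, not in the original answer), the double loop can be replaced by a single vectorized assignment, since components() returns one membership id per node and the nodes are in the same order as the rows of nc:
nc$group <- clu$membership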

Count and label observations per participant using loop

I have repeated-measures data.
I need to create a loop that will incrementally count each observation, within a participant, and label it.
I am new to writing loops. My logic was: for each item in the list of unique ids, count each row in that subset, and apply some function to that row.
Could someone point out what I am doing wrong?
data$Ob <- 0
for (i in unique(data$id)) {
  count <- 1
  for (u in data[data$id == i, ]) {
    data[data$id == u, ]$Ob <- count
    count <- count + 1
    print(count)
  }
}
Thanks!
Justin
You can also use ave:
set.seed(1)
data <- data.frame(id = sample(4, 10, TRUE))
data$Ob = ave(data$id, data$id, FUN=seq_along)
data
id Ob
1 2 1
2 2 2
3 3 1
4 4 1
5 1 1
6 4 2
7 4 3
8 3 2
9 3 3
10 1 2
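For completeness (my addition, assuming these packages are available), the same labelling is a one-liner in data.table or dplyr:
library(data.table)
data$Ob <- rowid(data$id)  # within-group row counter

library(dplyr)
data <- data |> group_by(id) |> mutate(Ob = row_number()) |> ungroup()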
# Generate some dummy data
data <- data.frame(Ob = 0, id = sample(4, 20, TRUE))
# Go through every id value
for (i in unique(data$id)) {
  # Label observations
  data$Ob[data$id == i] = 1:sum(data$id == i)
}
Be aware though that for loops are notoriously slow in R. In this simple case they work fine, but should you have millions and millions of rows in your data frame you'd better do something purely vectorized.
But you don't need a loop...
data <- data.frame(id = sample(4, 10, TRUE))
## id
## 1 3
## 2 4
## 3 1
## 4 3
## 5 3
## 6 4
## 7 2
## 8 1
## 9 1
## 10 4
data$Ob[order(data$id)] <- sequence(table(data$id))
## id Ob
## 1 3 1
## 2 4 1
## 3 1 1
## 4 3 2
## 5 3 3
## 6 4 2
## 7 2 1
## 8 1 2
## 9 1 3
## 10 4 3
(works also with character or factor IDs)
(isn't R just cool!?)
