From a set of pairs, find all subsets such that no pair in the subset shares any element with a pair not in the subset (R)

I have a set of pairs. Each pair is represented as [i, 1:2]; that is, the ith pair consists of the numbers in the first and second columns of the ith row.
I need to sort these pairs into distinct groups, such that no element of any pair in the jth group appears in any group other than j. For example:
EXAMPLE 1: DATA
> col1 <- c(3, 4, 6, 7, 10, 8)
> col2 <- c(6, 7, 3, 4, 3, 1)
>
> dat <- cbind(col1, col2)
> rownames(dat) <- 1:nrow(dat)
>
> dat
  col1 col2
1    3    6
2    4    7
3    6    3
4    7    4
5   10    3
6    8    1
For all pairs, it doesn't matter whether a number is in column 1 or column 2; the pairs should be sorted into groups such that every number appearing in a group's pairs appears in only that group. The solved example would look like this:
  col1 col2 groups
1    3    6      1
2    4    7      2
3    6    3      1
4    7    4      2
5   10    3      1
6    8    1      3
Rows 1, 3, and 5 are grouped together because rows 1 and 3 contain the same numbers, and row 5 shares the number 3, so it must be grouped with them. Rows 2 and 4 share the same distinct numbers, so they are grouped together, and row 6 contains numbers that appear nowhere else, so it stands alone.
If we change the data slightly, note the following.
EXAMPLE 2: NEW DATA
Note what happens when we add a row that shares an element with row 6 and row 5.
  col1 col2 groups
1    3    6      1
2    4    7      2
3    6    3      1
4    7    4      2
5   10    3      1
6    8    1      1
7    1   10      1
The 10 in the 7th row connects it to the first group because it shares an element with the 5th row. It also shares an element with the 6th row (the number 1), so the 6th row now belongs to group 1 as well.
PROBLEM
Is there a simple way to form the groups? A vector operation? A sorting algorithm? It gets very nasty very quickly if you try to do it with a loop, since each subsequent row can change the membership of earlier rows, as demonstrated in the example.

Drawing on the older answer at "identify groups of linked episodes which chain together", which assigns a group to each individual value, you could try this to assign a group to each linked pair:
library(igraph)
g <- graph_from_data_frame(dat)
links <- data.frame(col1 = V(g)$name, group = components(g)$membership)
merge(dat, links, by = "col1", all.x = TRUE, sort = FALSE)
#   col1 col2 group
# 1    3    6     1
# 2    4    7     2
# 3    6    3     1
# 4    7    4     2
# 5   10    3     1
# 6    8    1     3
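A small variant of the same idea, sketched here, skips the merge (and so keeps the original row order) by indexing the membership vector directly with each row's first element:

library(igraph)

g <- graph_from_data_frame(as.data.frame(dat))
comp <- components(g)$membership                      # named by element value
cbind(dat, group = comp[as.character(dat[, "col1"])])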

Your elements can be regarded as vertices in an undirected graph and your pairs as edges. Assuming you want groups of minimal size (if you don't, then e.g. the entire set of pairs could be labelled "Group 1"), the groups you're looking for are the connected components of this graph. They can all be found in linear time with a depth-first or breadth-first search.
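For illustration, here is a minimal base-R sketch of that search (essentially a breadth-first flood fill over the pairs; the function name and structure are just one possible way to write it):

# Assign a group (connected component) to each pair without igraph.
# `pairs` is a two-column matrix; returns one group number per row.
pair_groups <- function(pairs) {
  groups <- rep(NA_integer_, nrow(pairs))
  g <- 0L
  while (anyNA(groups)) {
    g <- g + 1L
    seed <- which(is.na(groups))[1]                  # start a new component from an unassigned pair
    frontier <- unique(c(pairs[seed, ]))
    repeat {
      hit <- is.na(groups) & (pairs[, 1] %in% frontier | pairs[, 2] %in% frontier)
      if (!any(hit)) break
      groups[hit] <- g
      frontier <- unique(c(frontier, pairs[hit, ]))  # grow the component's element set
    }
  }
  groups
}

cbind(dat, groups = pair_groups(dat))                # reproduces 1, 2, 1, 2, 1, 3 for Example 1

Each pass of the repeat loop absorbs every unassigned pair that touches the current component, so rows added later can never end up in a different group than rows they are linked to.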

Related

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <- tibble::tibble(
  StudentID = 1:10,
  Condition = sample(conditions, size = 10, replace = TRUE),
  XP = stats::rpois(n = 10, lambda = 15)
)
This creates the following tibble.
StudentID  Condition  XP
1          2           8
2          3          11
3          3          16
4          3          12
5          1          22
6          3          16
7          1          18
8          3           8
9          2          14
10         1          17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe; in other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID  Condition  XP  DyadID
1          2           8  4
2          3          11  1
3          3          16  2
4          3          12  1
5          1          22  3
6          3          16  NA
7          1          18  3
8          3           8  2
9          2          14  4
10         1          17  NA
Note that two students did not receive a pairing, because there was an odd number in condition 1 and condition 3. If there is an odd number, the DyadID can be NA.
Thank you for your help with this!
Using match to get a unique id according to Condition and sample for randomness.
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition, sample(unique(Condition))))
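If the goal is the paired output shown above (two students per DyadID within each Condition, NA for an unpaired leftover), a rough tidyverse sketch along these lines might work; the intermediate column names (ord, pair, key) are purely illustrative:

library(dplyr)

set.seed(111)
df_paired <- df_sim %>%
  group_by(Condition) %>%
  mutate(ord  = sample(n()),                      # shuffle students within each condition
         pair = ifelse(ord > 2 * floor(n() / 2),  # the odd student out gets no partner
                       NA, ceiling(ord / 2))) %>%
  ungroup() %>%
  mutate(key    = ifelse(is.na(pair), NA, paste(Condition, pair)),
         DyadID = as.integer(factor(key))) %>%
  select(-ord, -pair, -key)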

How to create a dataframe by sampling 1 case (row) from each group in R

I would like to randomly select 1 case (so 1 row from a dataframe) from each group in R, but I cannot work out how to do it.
My data is structured in long format: 400 cases (rows) clustered within 250 groups (some groups contain only a single case, others 2, 3, 4, 5, or even 6). So what I would like to end up with is a dataframe containing 250 rows (each row representing 1 randomly selected case from each of the 250 groups).
I have the idea that I should use the sample function for this, but I could not work out how to do it. Does anyone have any ideas?
Suppose your data frame X indicates group membership with a variable named "Group," as in this synthetic example:
G <- 8
set.seed(17)
X <- data.frame(Group = sort(sample.int(G, G, replace = TRUE)),
                Case = 1:G)
Here is a printout of X:
  Group Case
1     2    1
2     2    2
3     2    3
4     4    4
5     4    5
6     5    6
7     7    7
8     8    8
Pick up the first instance of each value of "Group" using the duplicated function after randomly permuting the rows of X:
Y <- X[sample.int(nrow(X)), ]
Y[!duplicated(Y$Group), ]
  Group Case
8     8    8
1     2    1
4     4    4
6     5    6
7     7    7
A comparison to X indicates random cases in each group were selected. Repeat these last two steps to confirm this if you like.
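If you prefer grouped verbs, an equivalent sketch with dplyr (assuming dplyr 1.0 or later for slice_sample) would be:

library(dplyr)

# One randomly chosen row per Group
X %>%
  group_by(Group) %>%
  slice_sample(n = 1) %>%
  ungroup()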

Removing rows from a dataset based on conditional statement across factors

I am struggling to figure out how to remove rows from a large dataset based on conditions across multiple factors. Here is some example data to illustrate the problem, using a smaller data frame:
Code <- c("A", "B", "C", "D", "C", "D", "A", "A")
Value <- c(1, 2, 3, 4, 1, 2, 3, 4)
data <- data.frame(cbind(Code, Value))
data$Value <- as.numeric(data$Value)
data
  Code Value
1    A     1
2    B     2
3    C     3
4    D     4
5    C     1
6    D     2
7    A     3
8    A     4
I want to remove the rows where Code is A and Value is less than 2. I understand how to select the rows where Code is A and Value < 2, but I can't figure out how to remove those rows without also removing the rows where Code is A and Value is greater than 2, while keeping the rows of the other codes whose values are less than 2.
# Easy to select values of A less than 2
data2 <- subset(data, Code == "A" & Value < 2)
data2
  Code Value
1    A     1

# But I want to remove values of A less than 2 without also removing values of A that are greater than 2:
data1 <- subset(data, Code != "A" & Value > 2)
data1
  Code Value
3    C     3
4    D     4

# Just using Value > 2 does not allow me to include values that are less than 2 for the other Codes (B, C, D):
data2 <- subset(data, Value > 2)
data2
  Code Value
3    C     3
4    D     4
7    A     3
8    A     4
My ideal dataset would look like this:
data
  Code Value
2    B     2
3    C     3
4    D     4
5    C     1
6    D     2
7    A     3
8    A     4
I have tried different iterations of filter(), subset(), and select() but I can't figure out the correct conditional statement that allows me to remove the desired combination of levels of multiple factors. Any suggestions would be greatly appreciated.
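A minimal sketch of that logic: negate the combined condition so only the rows where Code is A and Value is less than 2 drop out (equivalently, keep rows where Code is not A or Value is at least 2):

# Keep everything except rows where Code is "A" AND Value < 2
data_clean <- subset(data, !(Code == "A" & Value < 2))
data_clean

# The same with dplyr
library(dplyr)
data %>% filter(!(Code == "A" & Value < 2))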

More efficient way to fuzzy match in R?

I am currently working on a data frame with 2 million lines (records). I want to identify potential duplicate records for follow-up. Someone else has written a long piece of code for me that works, but currently it takes overnight to run.
It uses the stringdist package. From what I understand, stringdist works by comparing one row against all other rows in the data frame. So a data frame with 5 rows would require 20 computations:
i.e.
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 1
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 1
row 3 compared to row 2
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 1
row 4 compared to row 2
row 4 compared to row 3
row 4 compared to row 5
row 5 compared to row 1
row 5 compared to row 2
row 5 compared to row 3
row 5 compared to row 4
An increase in the size of the data frame increases the time needed roughly quadratically, since every row is compared against every other row. With my rather large data frame, obviously it takes a while.
My proposed solution is this: after comparing each row with all of the other rows in the data frame, is there a way to omit those rows from future computations? For example, in the example above, row 1 compared to row 2 would be the same as row 2 compared to row 1. Could we remove one of these calculations?
So, using the example data frame above, the only computations should be:
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 5
This is the section of the function that looks for these duplicates across various columns; any ideas on how I can amend this?
lastName      <- stringdist(DataND$SURNAME[rownumber],    DataND$SURNAME,    method = 'lv')
firstName     <- stringdist(DataND$GIVEN.NAME[rownumber], DataND$GIVEN.NAME, method = 'lv')
birthDate     <- stringdist(DataND$DOB[rownumber],        DataND$DOB,        method = 'lv')
streetAddress <- stringdist(DataND$ADDR.1[rownumber],     DataND$ADDR.1,     method = 'lv')
suburb        <- stringdist(DataND$LOCALITY[rownumber],   DataND$LOCALITY,   method = 'lv')
H 1's idea is great. Another option would be the fuzzyjoin package.
library(fuzzyjoin)
library(dplyr)

df <- tibble(id = seq(1, 10),
             words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))

stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5,
                     method = "lv", distance_col = "distance") %>%
  filter(distance != 0)
# A tibble: 90 x 5
    id.x words.x  id.y words.y distance
   <int> <chr>   <int> <chr>      <dbl>
 1     1 JUQYR       2 HQMFD          5
 2     1 JUQYR       3 WHQOM          4
 3     1 JUQYR       4 OUWJV          4
 4     1 JUQYR       5 JURGD          3
 5     1 JUQYR       6 ZMLAQ          5
 6     1 JUQYR       7 RWLVU          5
 7     1 JUQYR       8 AYNLE          5
 8     1 JUQYR       9 AUPVJ          4
 9     1 JUQYR      10 JDFEY          4
10     2 HQMFD       1 JUQYR          5
# ... with 80 more rows
In the end you have it all set up and can keep or discard rows by distance. It took 11 seconds for 100,000 records. Trying with stringdistmatrix(), however, I got the error:
Error: cannot allocate vector of size 37.3 Gb
lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method='lv')
If I understand this line correctly, it compares one surname (the one at rownumber) with all surnames. So each time you change rownumber, all of the comparisons are made again, even the ones already done previously.
To prevent this, try:
lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME[rownumber:nrows], method='lv')
where nrows is the number of rows in the data frame.
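As a rough illustration of that idea (column names follow the question; the loop is only a sketch), each row is compared only with the rows after it, which roughly halves the work:

library(stringdist)

n <- nrow(DataND)
for (rownumber in seq_len(n - 1)) {
  rest <- (rownumber + 1):n                      # rows not yet compared with this one
  lastName  <- stringdist(DataND$SURNAME[rownumber],    DataND$SURNAME[rest],    method = 'lv')
  firstName <- stringdist(DataND$GIVEN.NAME[rownumber], DataND$GIVEN.NAME[rest], method = 'lv')
  birthDate <- stringdist(DataND$DOB[rownumber],        DataND$DOB[rest],        method = 'lv')
  # ... flag the rows in `rest` whose combined distances are small enough to be potential duplicates
}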

sequential counting with input from more than one variable in r

I want to create a column with sequential values, derived from two other columns in the df. The value should count up whenever either Team changes (between 1 and 2) or Event equals x. Any help would be appreciated! See the example below:
   Team Event Value
1     1     a     1
2     1     a     1
3     2     a     2
4     2     x     3
5     2     a     3
6     1     a     4
7     1     x     5
8     1     a     5
9     2     x     6
10    2     a     6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.
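A quick check, assuming df holds the Team and Event columns from the example:

# Rebuild the example data and apply the one-liner
df <- data.frame(Team  = c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
                 Event = c("a", "a", "a", "x", "a", "a", "x", "a", "x", "a"))
df$Value <- cumsum(df$Event == "x" | c(1, diff(df$Team)) != 0)
df                                               # Value is 1 1 2 3 3 4 5 5 6 6, as in the example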
