More efficient way to fuzzy match in R? - r

I am currently working on a data frame with 2 million lines (records). I am wanting to identify potential duplicate records for followup. Someone else has written for me a long code that works, but currently it is taking me overnight to run.
It uses the stringdist package. From what I understand, stringdist works by comparing one row, against all other rows in the data frame. So, a data frame with 5 rows would require 20 computations:
i.e.
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 1
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 1
row 3 compared to row 2
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 1
row 4 compared to row 2
row 4 compared to row 3
row 4 compared to row 5
row 5 compared to row 1
row 5 compared to row 2
row 5 compared to row 3
row 5 compared to row 4
An increase in the size of data frame would exponentially increase the time needed to complete the function. With my rather large data frame, obviously it takes a while.
My proposed solution is this: after comparing each row with all of the other rows in the data frame, is there a way to omit those rows from future computations? For example, in the example above, row 1 compared to row 2 would be the same as row 2 compared to row 1. Could we remove one of these calculations?
So, using the example data frame above, the only computations should be:
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 5
This is the section in a function in the code that looks for these duplicates in various columns - any ideas on how I can amend this?
lastName <- stringdist(DataND$SURNAME[rownumber],DataND$SURNAME, method='lv')
firstName <- stringdist(DataND$GIVEN.NAME[rownumber],DataND$GIVEN.NAME, method='lv')
birthDate <- stringdist(DataND$DOB[rownumber],DataND$DOB, method='lv')
streetAddress<-stringdist(DataND$ADDR.1[rownumber],DataND$ADDR.1, method='lv')
suburb <- stringdist(DataND$LOCALITY[rownumber],DataND$LOCALITY, method='lv')

H 1's idea is great. Another option would be the fuzzyjoin-package.
library(fuzzyjoin)
library(dplyr)
df <- tibble(id = seq(1,10),
words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))
stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5, method = "lv", distance_col = "distance") %>%
filter(distance != 0)
# A tibble: 90 x 5
id.x words.x id.y words.y distance
<int> <chr> <int> <chr> <dbl>
1 1 JUQYR 2 HQMFD 5
2 1 JUQYR 3 WHQOM 4
3 1 JUQYR 4 OUWJV 4
4 1 JUQYR 5 JURGD 3
5 1 JUQYR 6 ZMLAQ 5
6 1 JUQYR 7 RWLVU 5
7 1 JUQYR 8 AYNLE 5
8 1 JUQYR 9 AUPVJ 4
9 1 JUQYR 10 JDFEY 4
10 2 HQMFD 1 JUQYR 5
# ... with 80 more rows
Here you have it all set up in the end, you can pick and dismiss rows by distance. It took 11 seconds for 100.000 records. Trying with stringdistmatrix() however I got the error:
Error: cannot allocate vector of size 37.3 Gb

lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method='lv')
If i understand this line, it compar one surname (according the value of rownumber) with aller surnames. So when you change rownumber, all comparisons are made, even the ones already done precedently.
To prevent this, try:
lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME[rownumber:nrows], method='lv')
where nrows is the number of rows

Related

if i want to sort a column by size in rstudio, how do i make sure that the associated values of the rows sort with the column?

I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values of one person. now i need to sort one column by size but I want the remaining columns to sort with the column, so that one column is sorted by increasing values and the other columns contain the values of the right persons. ( So that one row still contains data from one and the same person)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
these are the column names of my data.frame and I wanna sort it by the column called "avg"
First of all, please always provide us with a reproducible example such as below. The sorting of a data frame by default sorts all columns.
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1

For each row return the multiple column indexs for specific number

Hi suppose I have a matrix with 0 an 1 only and I want to find out where 1 locates in each row. And for each row, there are multiple 1 exist.
For example I have
set.seed(444)
m3 <- matrix(round(runif(8*8)), 8,8)
For the first row I have column 2,3,8 are 1 and I want a code could report either column name or column index. Meanwhile, it is worth to point out that each the number of 1 in each row could be different.
Can anyone provide some suggestions? I appreciate it so much.
We can use which with arr.ind which returns the row/column index as a matrix
out <- which(m3 ==1, arr.ind = TRUE)
out[,2][order(out[,1])]
[1] 2 3 8 3 5 3 4 8 7 4 6 7 1 3 4 6 1 4 5 6 7 2 4 7 8
To get the column name, use the same index (if the matrix have any column names- here there are not column names attribute)
colnames(m3)[out[,2][order(out[,1])]]

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This goes onto multiple for multiple numbers. I want to take this dataframe and make a new dataframe that has two of the rows together, but it would be nested (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4) where each combination of each year is together within types and numbers.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then remove rows that have different Year. I added an index to prevent redundant matches (1 with 2 and 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
out <- out[which(out$index.x>out$index.y & out$Year.x!=out$Year.y),]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in c(1:(nrow(df)-1))){
for (j in c((i+1):nrow(df))){
if(df[i,"Year"]!=df[j,"Year"] & df[i,"Number"]==df[j,"Number"] & df[i,"Type"]==df[j,"Type"]){
out2 <- rbind(out2,cbind(df[i,],df[j,]))
}
}
}

sequential counting with input from more than one variable in r

I want to create a column with sequential values but it gets its value from input from two other columns in the df. I want the value to sequentially count if either Team changes (between 1 and 2) or Event = x. Any help would be appreciated! See example below:
Team Event Value
1 1 a 1
2 1 a 1
3 2 a 2
4 2 x 3
5 2 a 3
6 1 a 4
7 1 x 5
8 1 a 5
9 2 x 6
10 2 a 6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.

From a set of pairs, find all subsets s.t. no pair in the subset shares any element with a pair not in the subset

I have a set of pairs. Each pair is represented as [i,1:2]. That is, the ith pair are the numbers in the first and second column in the ith row.
I need to sort these pairs into distinct groups, s.t. there is no element in any pair in the jth group that is in any group not j. For example:
EXAMPLE 1: DATA
> col1 <- c(3, 4, 6, 7, 10, 8)
> col2 <- c(6, 7, 3, 4, 3, 1)
>
> dat <- cbind(col1, col2)
> rownames(dat) <- 1:nrow(dat)
>
> dat
col1 col2
1 3 6
2 4 7
3 6 3
4 7 4
5 10 3
6 8 1
For all pairs, it doesn't matter if the number is in column 1 or column 2, the pairs should be sorted into groups s.t. every number in every pair in every group exists only in one group. So the solved example would look like this.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 3
Rows 1, 3, and 5 are grouped together because 1 and 3 contain the same numbers and 5 shares the number 3, so it must be grouped with them. 2 and 4 share the same distinct numbers so they are grouped together and 6 has unique numbers so it is left alone.
If we change the data slightly, note the following.
EXAMPLE 2: NEW DATA
Note what happens when we add a row that shares an element with row 6 and row 5.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 1
7 1 10 1
The 10 in the 7th row connects it to the first group because it shares an elements with the 5th row. It also shares an element with the 6th row (the number 1), so the 6th row would instead be in group 1.
PROBLEM
Is there a simple way to form the groups? A vector operation? A sorting algorithm? It gets very nasty very quickly if you try to do it with a loop, since each subsequent row can change the membership of earlier rows, as demonstrated in the example.
To draw on the old answer at: identify groups of linked episodes which chain together , which assigns a group to each individual value, you could try this to assign a group to each linked pair:
library(igraph)
g <- graph_from_data_frame(dat)
links <- data.frame(col1=V(g)$name,group=components(g)$membership)
merge(dat,links,by="col1",all.x=TRUE,sort=FALSE)
# col1 col2 group
#1 3 6 1
#2 4 7 2
#3 6 3 1
#4 7 4 2
#5 10 3 1
#6 8 1 3
Your elements can be regarded as vertices in an undirected graph, and your pairs can be regarded as edges, and then (assuming that you want to find groups of minimal size -- if you don't, then e.g. the entire set of pairs could be labelled "Group 1") the groups you're looking for are the connected components in this graph. They can all be found in linear time with a depth-first or breadth-first search.

Resources