Fuzzy matching in R - about 1 million rows

I have a list of about one million individuals, each identified by name and surname. Individuals may appear in the list more than once. I would like to group observations by individual and count how many times each appears - normally this is straightforward with dplyr::group_by.
However, there are spelling mistakes. To solve the issue, I thought of computing a measure of string distance within this list. I would then go ahead and assume that, if the string distance is below a certain threshold, the records identify the same individual.
All the methods I have tried so far are either too time-consuming or plain infeasible RAM-wise.
This is my attempt using dplyr and RecordLinkage:
library(dplyr)
library(RecordLinkage)

# Compare every name against every other name (this is the part that explodes in size)
list_matrix <- expand.grid(x = individual_list, pattern = individual_list,
                           stringsAsFactors = FALSE)
# The same is achieved using stringdistmatrix (stringdist package)
result <- list_matrix %>%
  group_by(x) %>%
  mutate(similarity = levenshteinSim(x, pattern)) %>%
  summarise(match = similarity[which.max(similarity)],
            matched_to = pattern[which.max(similarity)])  # index by similarity, not by match
This method works well with small data sets. Intuitively, I am comparing every element with every other element. However, the resulting matrix has dimension (number of rows) x (number of rows), which in my case is a million by a million - about 10^12 cells, far too large to handle in memory.
I also gave other functions a shot: adist, pmatch, agrep(l). The same logic applies. I think the problem here is conceptual. Any ideas?

Related

More efficient way to get measurements from different items in R

I'm currently computing a set of accuracy measurements for 80k different items. The measurements have to be calculated independently for each item, but this is currently taking too long, so I want to find a faster way to do it.
Here's my code in R with its comments:
work_file: contains 4 variables: item_id, Dates, demand and forecast
my code:
library(dplyr)
# accuracy() is assumed to come from the forecast package

output <- 0
uniques <- unique(work_file$item_id)
for (i in uniques) {
  # filter every unique item
  temporal <- work_file %>% filter(item_id == i)
  # calculate the accuracy measure for each item
  x <- temporal$demand
  x1 <- temporal$forecast
  item_error <- c(i, accuracy(x1, x))
  output <- rbind(output, item_error)
}
For ~80k unique items this takes hours.
Any suggestions?
R is a vectorized language, so the loop can usually be avoided. Also, binding within a loop is especially slow, since the output data structure is copied and recreated on each iteration.
Provided the accuracy() function can accept vector inputs, this should work (without sample data to test, there is always some doubt):
answer <- work_file %>%
  group_by(item_id) %>%
  summarize(accuracy(forecast, demand))
Here dplyr's group_by function collects the different item_ids and then passes those vectors to the accuracy function inside summarize.
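If accuracy() turns out not to behave well inside summarize(), a fallback that still avoids the growing rbind described above is to collect one result per item in a list and bind a single time at the end; a minimal sketch, reusing the objects from the question:

# One row of measures per item; bind once at the end instead of
# growing `output` inside the loop.
results <- lapply(unique(work_file$item_id), function(i) {
  temporal <- work_file[work_file$item_id == i, ]
  c(item_id = i, accuracy(temporal$forecast, temporal$demand))
})
output <- do.call(rbind, results)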
Consider using data.table methods, which would be efficient:
library(data.table)
setDT(work_file)[, .(acc = accuracy(forecast, demand)), item_id]

Is there an efficient strategy for doing a fuzzy join on customer data to identify a single customer ID in R?

I wish to perform a "fuzzy-deduplication" on my customer data to derive a unique ID per customer where in the original data multiple IDs may have been entered for the same customer.
I have a data frame in R which contains a list of customers. Each customer has an ID, First Name, Last Name, Email and Phone Number.
Many customers have been entered multiple times for different purchases. Sometimes the old record (same ID) is reused, other times a new record and new ID are issued.
What is an appropriate strategy for deduplicating this data when I cannot rely on an exact match across all fields - e.g. the first name may be misspelled or only an initial given, while the other fields match?
I currently do a dplyr left join on First Name concatenated to Surname, then use Phone and Email as a validation check, but this may miss some records. Loosening the match rules (e.g. matching on Surname alone) results in too large a data frame.
(No code at the moment - this is more a request for general coding strategy and approach.
Are there any packages that handle these sorts of matches efficiently?)
Before getting to the process of finding duplicates it is important to get/gather good data to begin with.
You have mentioned first name, last name, email and phone number. First names are good, since they usually don’t change unlike email addresses and phone numbers. Last names can change through marriage/divorce. Therefore, it is always good to have other time-invariant variables such as “date of birth” or “place of birth”.
Even with good data, there will always be a challenge matching first, last names and date of birth in a large customer database.
As you point out in your comments, a string distance matrix of 100,000 plus customers takes time and causes memory problems.
One workaround here is to sort the data and break it into pieces. Create a string distance matrix on each small piece, get some likely matches and piece everything back together. There are different approaches for how to do that; I will just show how it works in principle, and maybe you can expand on it.
I downloaded some fake data of 1,000 records. Unfortunately, it does not contain duplicates, but for showing the basic workflow it does not need real duplicates.
The approach takes the following steps:
Create a name field based on last and first name.
Arrange it in ascending order (A-Z).
Break it down into groups of 50 customers (this is for my example data with 1,000 rows, actually running groups of 500 should be no problem in terms of speed and memory).
Create a nested tibble to work on with purrr::map.
Apply a customized stringdistmatrix function that works in the dplyr pipe and gives likely matches between names of customers as output.
Unnest the single results to get a complete list of potential matches.
The idea behind breaking down the data is that you do not need a string distance matrix of all 100,000 customers. Most of the names are so different that you do not even need to calculate a string distance. Sorting the names and working on small subsets is like narrowing down the search.
Of course this is just one way to break down the data. It is incomplete, since it misses, for example, all customers with a typo in the first letter of the last name. However, you can replicate this approach for other variables such as date of birth, number of characters in a name etc. Ideally you do different break downs and piece everything together in the end.
I downloaded some fake data via www.mockaroo.com. I tried to put it here with dput, but it was too long. So I just show you the head() of my data, and you can create your own fake data or use real customer data.
One note regarding my customized version of stringdistmatrix, which I named str_dist_mtx: when working with real data you should adjust the size of the group (in the example it is rather small, n = 50). And you should adjust the string distance threshold string_dist below which you want to consider two differing names as potential matches. I took 6 to at least get some results, but I am not working with data that has real duplicates. So in a real application I would choose 1 or 2 to cover the most basic typos.
# the head() of my data
test_data <- structure(list(first_name = c("Gabriel", "Roscoe", "Will", "Francyne",
"Giorgi", "Dulcinea"), last_name = c("Jeandeau", "Chmiel", "Tuckwell",
"Vaggers", "Fairnie", "Tommis"), date_of_birth = structure(c(9161,
4150, 2557, 9437, -884, -4489), class = "Date")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Below is the code I used.
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(stringdist)
# customized stringdistmatrix function
str_dist_mtx <- function(df, x, string_dist, n) {
  temp_mtx = stringdistmatrix(df[[x]], df[[x]])
  # all pairs within the group, with their string distance
  temp_tbl = tibble(name1 = rep(df[[x]], each = n),
                    name2 = rep(df[[x]], times = n),
                    str_dist = as.vector(temp_mtx)) %>%
    filter(str_dist > 0 & str_dist < string_dist)  # drop self-matches and distant pairs
  # remove duplicated unordered pairs (A-B vs B-A)
  temp_tbl[!duplicated(data.frame(t(apply(temp_tbl, 1, sort)))), ]
}
# dplyr pipe doing the job
test_data2 <- test_data %>%
  mutate(name = paste0(last_name, first_name)) %>%
  arrange(name) %>%
  mutate(slice_id = row_number(),
         slice_id = cut_width(slice_id, 50, center = 25)) %>%
  nest(-slice_id) %>%
  mutate(str_mtx = map(data,
                       ~ str_dist_mtx(., "name", string_dist = 6, n = 50))) %>%
  select(str_mtx) %>%
  unnest()
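To piece different break-downs together, as suggested above, the same pipeline can be re-run with another blocking key and the candidate pairs combined; a rough sketch (the date_of_birth ordering and the bind_rows/distinct step are my additions, not part of the original answer):

# Second pass: same pipeline, but blocks are formed after sorting by date of birth.
test_data3 <- test_data %>%
  mutate(name = paste0(last_name, first_name)) %>%
  arrange(date_of_birth) %>%
  mutate(slice_id = row_number(),
         slice_id = cut_width(slice_id, 50, center = 25)) %>%
  nest(-slice_id) %>%
  mutate(str_mtx = map(data,
                       ~ str_dist_mtx(., "name", string_dist = 6, n = 50))) %>%
  select(str_mtx) %>%
  unnest()

# Pool the candidate pairs from both passes and drop duplicates.
candidate_pairs <- bind_rows(test_data2, test_data3) %>%
  distinct(name1, name2, .keep_all = TRUE)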

Counting number of rows where a value occurs at least once within many columns

I updated the question with pseudocode to better explain what I would like to do.
I have a data.frame named df_sel, with 5064 rows and 215 columns.
Some of the columns (~80) contain integers that uniquely identify a specific trait (medications). These columns are named "meds_0_1", "meds_0_2", "meds_0_3" etc., as well as "meds_1_1", "meds_1_2", "meds_1_3". Each column may or may not contain any of the integer values I am looking for.
Some of the specific integer values to look for can be grouped under one type of medication, even though they are coded for specific brand names.
metformin = 1140884600 # not grouped
sulfonylurea = c(1140874718, 1140874724, 1140874726) # grouped
If it were possible to look up a group of medications, supplied as a vector as above, that would be helpful.
I would like to do this:
IF [a specific row]
CONTAINS [the single integer value of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_METFORMIN = 1 ELSE A_NEW_VARIABLE_METFORMIN = 0
and correspondingly
IF [a specific row]
CONTAINS [any of multiple integer values of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_SULFONYLUREA = 1 ELSE A_NEW_VARIABLE_SULFONYLUREA = 0
I have managed to create a vector based on column names:
column_names <- names(df_sel) %>% str_subset('^meds_0')
But I haven't gotten any further despite some suggestions below.
I hope you understand better what I am trying to do.
As for the selection of the columns, you could do this by first extracting the names in the way you are doing with a regex, and then using select:
library(stringr)

column_names <- names(df_sel) %>%
  str_subset('^meds_0')

relevant_df <- df_sel %>%
  select(column_names)
I didn't quite get the structure of your variables (whether they are integers, logicals, etc.), so I'm not sure how to continue, but it would probably involve something like summing across all the columns and keeping the rows whose sum is not 0, like:
meds_taken <- rowSums(relevant_df)

# add_column() comes from the tibble package
df_sel_med_count <- df_sel %>%
  add_column(meds_taken)
At this point you should have your initial df with the relevant data in one column, and you can summarize by subject, medication or whatever in any way you want.
If this is not enough, please edit your question providing a relevant sample of your data (you can do this with the dput function) and I'll edit this answer to add more detail.
First, I would like to start off by recommending Bioconductor for R libraries, as it sounds like you may be studying biological data. Now to your question.
Although the tidyverse is the most widely accepted and 'easy' method, I would recommend using lapply in this instance, as it is extremely fast. From a programming standpoint your code becomes a simple boolean test, as you stated, but I think we can go a little further. Using the built-in mtcars data:
data(mtcars)
head(mtcars, 6)

target <- 6

# TRUEs and FALSEs for each row and column
rows <- lapply(mtcars, function(x) x %in% target)

# number of TRUEs for each column, and which columns have more than 0 TRUEs
column_sums <- unlist(lapply(rows, function(x) sum(x, na.rm = TRUE)))
which(column_sums > 0)
This will work with other data types with a few tweaks here and there.
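To tie this back to the question's pseudocode, here is a hedged sketch of the row-wise flags (the flag column names are made up for illustration; the metformin/sulfonylurea codes and the meds_0 column selection come from the question):

library(stringr)

metformin    <- 1140884600
sulfonylurea <- c(1140874718, 1140874724, 1140874726)

meds_0_cols <- str_subset(names(df_sel), "^meds_0")

# TRUE/FALSE per cell, then 1 if any meds_0 column in a row contains a target code
df_sel$metformin_flag <- as.integer(
  rowSums(sapply(df_sel[meds_0_cols], `%in%`, table = metformin)) > 0)
df_sel$sulfonylurea_flag <- as.integer(
  rowSums(sapply(df_sel[meds_0_cols], `%in%`, table = sulfonylurea)) > 0)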

How to filter a column based on a condition from another column in R

I have a huge data table with millions of rows and dozens of columns, so performance is a crucial issue for me. The data describes visits to a content site. I want to compute the ContentID of the earliest hit (i.e. minimum hit time) of each visit. What I did is:
dt[, .(FirstContentOfVisit = ContentID[ContentID != ""][which.min(HitTime)]),
   by = VisitId, .SDcols = c("ContentID", "HitTime")]
The problem is that I don't know whether which.min computes the minimum over the whole HitTime vector (which I don't want!) or only over the filtered HitTime vector (the one corresponding to the non-empty ContentIDs).
In addition, after I compute it, how can I get the minimal HitTime of the ContentIDs that are different from the first (i.e. the earliest hit time of the non-first content ID)?
When I tried to do both things with user-defined functions (first sort the sub data table, then extract the desired value) it took ages (and actually never finished), although I have a very strong (virtual) machine with 180 GB RAM. So I'm looking for an inline solution.
dplyr makes this much easier. You didn't share a sample of your data, but I assume the variables of interest look something like this.
web <- tibble(
  HitTime = sample(seq(as.Date('2010/01/01'), as.Date('2017/02/23'), by = "day"), 1000),
  ContentID = 1:1000,
  SessionID = sample(1:100, 1000, replace = TRUE)
)
Then you can just use group_by and summarise to find the earliest value of HitTime for each SessionID.
web %>%
  group_by(SessionID) %>%
  summarise(HitTime = min(HitTime))
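For reference, a hedged data.table sketch of the expression from the question that makes the filtering explicit, so which.min() only ever sees the HitTime values of rows with a non-empty ContentID (column names as in the question):

library(data.table)

# Filter in i first; the j expression then operates on the already-subset rows,
# so which.min(HitTime) indexes the same rows as ContentID.
first_content <- dt[ContentID != "",
                    .(FirstContentOfVisit = ContentID[which.min(HitTime)]),
                    by = VisitId]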

Endless function/loop in R: Data Management

I am trying to restructure an enormous data frame (about 12,000 cases): in the old data frame one person is one row and has about 250 columns (e.g. Person 1, testA1, testA2, testB, ...). I want all the results of each test item (there are 10 repetitions of each of the 24 items A-Y) in one column per item, so that each person ends up with 24 columns and 10 rows. There is also a fixed part before the items A-Y start (personal information like age, gender etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  out <- data.frame(t(rep(0, (firstcol - 1) + numcol)))
  names(out) <- names(data)[1:(firstcol + numcol - 1)]
  for (i in 1:nrow(data)) {
    # the fixed personal-information part of this person's row
    fixdata <- data[i, 1:(firstcol - 1)]
    for (j in seq(firstcol, (firstcol - 1) + numcol * numsets, by = numcol)) {
      flexdata <- data[i, j:(j + numcol - 1)]
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(data)[1:(firstcol + numcol - 1)]
      out <- rbind(out, tmp)
    }
  }
  out <- out[2:nrow(out), ]
  return(out)
}
Thanks in advance!
Idea why: you rbind to out in each iteration. This takes longer with each iteration as out grows, so you should expect worse-than-linear growth in run time as the data set gets bigger.
So, as Andrie says, you can look at melt.
Or you can do it with core R: stack.
Then you need to cbind the fixed part yourself to the result (you need to repeat the fixed columns with each = n.var.cols).
A third alternative would be array2df from package arrayhelpers.
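As an illustration of the melt route, a rough sketch (the person_id column and the A1...Y10 naming pattern are assumptions about the data, not something stated in the question):

library(reshape2)

# Melt the repeated test columns into long format, split the column name into
# item letter and set number, then cast back to one column per item.
measure_cols <- grep("^[A-Z][0-9]+$", names(daten), value = TRUE)  # assumed naming

long <- melt(daten, id.vars = "person_id", measure.vars = measure_cols,
             variable.name = "test", value.name = "result")
long$item <- sub("[0-9]+$", "", long$test)               # item letter A ... Y
long$set  <- as.integer(sub("^[A-Z]+", "", long$test))   # set number 1 ... 10
wide <- dcast(long, person_id + set ~ item, value.var = "result")
# the remaining fixed columns can be merged back on person_id afterwards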
I agree with the others - look into reshape2 and the plyr package - but I just want to add a little in another direction. In particular melt, cast and dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As <- grep("^testA", names(yourdf))
# returns a vector with the column positions of all testA1 through testA10
Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there is obviously none left for the person. Sure, you can identify persons by an ID (which you could, for example, map to an aesthetic when plotting), but depending on what you want to do you might prefer to store them in a list, so you end up with a list of persons with one data.frame per person. I am not sure what you are trying to do, but I hope this helps anyway.
Maybe you're not getting on with plyr or the other reshaping functions. How about something more direct and low level? If you currently have one row per person that goes A1, A2, A3 ... A10, B1-B10, etc., then extract that block from your data frame (I'm guessing columns 11-250), reshape that section into the form you want, and put the pieces back together.
yDat <- data[, 11:250]
yDF <- lapply(1:nrow(data), function(i) matrix(unlist(yDat[i, ]), ncol = 24))
yDF <- do.call(rbind, yDF)   # combine the list of matrices returned above into one
yDF <- data.frame(yDF)       # get it back into a data.frame
names(yDF) <- LETTERS[1:24]  # might as well name the columns
That's the fastest way to get the bulk of your data into the shape you want. All the lapply call did was give each person's row the dimensions you wanted and return the result as a list, which the subsequent lines then massage into a data.frame. But now it doesn't have any of the ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times. Or you can use the convenience function merge to help with that: make a column that already exists in your first 10 columns (the ID) a column of the new data.frame, and then just merge them.
yInfo <- data[, 1:10]
ID <- yInfo$ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
And now you're done... mostly. You might want to make an extra column that numbers the new rows:
newDat$condNum <- rep(1:10, nrow(newDat)/10)
This will be very fast-running code. Your data.frame really isn't that big at all, and much of the above will execute in a couple of seconds.
This is how you should be thinking about data in R. Not that there aren't convenience functions to handle the bulk of this, but you should be writing code that avoids looping as much as possible. Technically, what happened above has only one loop, the lapply used right at the start, and it has very little inside it (loops should be compact when you use them). You were writing scalar code, which is very, very slow in R... even before abusing memory and growing data while doing it. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems here.
(read this to better understand your problems in this code... you've made most of the big errors in there)
