Fuzzy address matching in R

Yeah, it's been asked before, but I can't find a thread that provides a simple, clean answer to this question.
I have example data below - I have two columns, col1 is the current address, col2 is an address I am told is 'better' than the current address. I need to see how much 'better' the second column is over the first. Most of the time, the second is better b/c it contains secondary information that the first is lacking, such as apartment number.
test <- as.data.frame(matrix(c(
"742 Evergreen Terrace" , "742 Evergreen Terrace Apt 3" ,
"31 Spooner Street #42" , "31 Spooner Street",
"129 W 81st Street" , "129 W 81st Street Apt 5A" ,
"245 E 73rd Street", "245 E 73rd Street Apt 6") , ncol=2, byrow=TRUE,
dimnames=list(NULL, c("old_addr" , "new_addr"))) ,stringsAsFactors=FALSE)
There is an answer I found here that gets close to what I would like:
Fuzzy match row in one column with same row in next column
I need to create a third column that is a simple 1/0 variable that == 1 if it's an approximate match, and 0 if not. I need to be able to specify the threshold for approximate matching.
For my first example - 742 Evergreen Terrace vs 742 Evergreen Terrace Apt 3, the length differs by six. I need to be able to specify a length difference of six, or eight, or whatever.
I looked at agrep, but I need to compare two columns of data within the same row, and it does not allow for that. I have also tried lapply, but its results make me think it is cycling through all data in the entire column, and I need row-by-row comparisons. I also do not understand max.distance: with the code below and a max of 1 (which I take to mean that one unit of edit or change is allowed), it should be failing to match, but it only does in one case.
agrep(test$old_addr, test$new_addr, max.distance = 0.1, ignore.case = TRUE)
test$fuzz_match <- lapply(test$old_addr , agrep , x =
test$new_addr , max.distance = 1 , ignore.case = TRUE)
Any help is appreciated, thank you!

You can calculate the Levenshtein distance between each pair. Then you need to decide how large the distance must be for the two not to count as the same address.
test$lev_dist <- mapply(adist, test$old_addr, test$new_addr)
test$same_addr <- test$lev_dist < 5
test
# old_addr new_addr lev_dist same_addr
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3 6 FALSE
# 2 31 Spooner Street #42 31 Spooner Street 4 TRUE
# 3 129 W 81st Street 129 W 81st Street Apt 5A 7 FALSE
# 4 245 E 73rd Street 245 E 73rd Street Apt 6 6 FALSE
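Since the question asks for a 1/0 column with an adjustable threshold, the distance calculation above could be wrapped in a small helper. This is only a sketch: the name fuzzy_flag and its max_dist argument are made up for illustration, and it assumes a pair counts as a match when the edit distance is at most max_dist.

```r
# Example data from the question
test <- data.frame(
  old_addr = c("742 Evergreen Terrace", "31 Spooner Street #42",
               "129 W 81st Street", "245 E 73rd Street"),
  new_addr = c("742 Evergreen Terrace Apt 3", "31 Spooner Street",
               "129 W 81st Street Apt 5A", "245 E 73rd Street Apt 6"),
  stringsAsFactors = FALSE
)

# fuzzy_flag() is a hypothetical helper, not part of any package.
# It compares a[i] with b[i] only (row by row) and returns 1 when the
# Levenshtein distance is no larger than max_dist, otherwise 0.
fuzzy_flag <- function(a, b, max_dist = 6, ignore_case = TRUE) {
  d <- mapply(function(x, y) adist(x, y, ignore.case = ignore_case)[1, 1],
              a, b, USE.NAMES = FALSE)
  as.integer(d <= max_dist)
}

test$fuzz_match <- fuzzy_flag(test$old_addr, test$new_addr, max_dist = 6)
test$fuzz_match
# [1] 1 1 0 1
```

Changing max_dist to 8 (or whatever you settle on) moves the cut-off without touching the rest of the code.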
You can use agrep() together with mapply() in a similar manner.
test$agrep_match <- mapply(agrep, test$old_addr, test$new_addr)
test$agrep_match <- lengths(test$agrep_match) == 1
test
# old_addr new_addr agrep_match
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3 TRUE
# 2 31 Spooner Street #42 31 Spooner Street FALSE
# 3 129 W 81st Street 129 W 81st Street Apt 5A TRUE
# 4 245 E 73rd Street 245 E 73rd Street Apt 6 TRUE
agrep() is also based on Levenshtein distance, but has a bunch of different options for adjusting the threshold, as I'm sure you've found.
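One thing that may explain the results above: agrep() looks for the pattern (its first argument) as an approximate substring of x, so the direction matters. A shorter old address is found inside a longer new address at no cost, while a longer pattern has to be edited down to fit. Here is a row-wise sketch with agrepl(), which returns TRUE/FALSE directly. The value max.distance = 2 is just an example (an integer is read as the number of allowed edits, a fraction below 1 as a share of the pattern length).

```r
# Example data from the question
old_addr <- c("742 Evergreen Terrace", "31 Spooner Street #42",
              "129 W 81st Street", "245 E 73rd Street")
new_addr <- c("742 Evergreen Terrace Apt 3", "31 Spooner Street",
              "129 W 81st Street Apt 5A", "245 E 73rd Street Apt 6")

# Compare old_addr[i] (pattern) against new_addr[i] (text) only
agrep_match <- mapply(
  function(p, x) agrepl(p, x, max.distance = 2, ignore.case = TRUE),
  old_addr, new_addr, USE.NAMES = FALSE
)

as.integer(agrep_match)
# [1] 1 0 1 1
```

Row 2 fails because the old address carries " #42", which would need four edits to disappear.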
There are other difference measures than Levenshtein that might be better suited for this application. Package stringdist has a number of other string distance metrics available.

Related

How to modify a variable iteratively using data.table?

I'm hoping someone can help me figure out how to modify one variable multiple times in data.table, or find a similar approach that would work for big data.
I have a dataset with strings (addresses to be exact, but the exact contents aren't important), such as:
library(data.table)
library(stringr)
# example addresses although you can imagine other types of strings here
addr <- data.table(street = c('1 main street',
'99 madison avenue',
'340 circle court'))
I have another dataset with a column of patterns that I want to search for in these strings (i.e. in the addr dataset) and replace with strings kept in a second column of that dataset. For example:
# example of patterns to search for and what I want to replace them with
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
abbrev = c('st', 'ave', 'cir', 'ct'))
The actual datasets are much larger: millions of addresses and 300+ abbreviations I want to check each address for.
It'd be fairly simple to do this in a loop, but because of the size, I'd like to use data.table and probably an apply function to make this process more efficient.
I'm struggling to figure out how to write this exactly. I want something like the following:
# duplicate addresses so we can compare to changes
addr[, orig.street := street]
# function to substitute abbreviations we want
standardize <- function(word, shorter) {
addr[, street := str_replace_all(street,
paste0(" ", word),
paste0(" ", shorter))]
}
# now run function for all abbreviations we want
addr[, street := mapply(FUN = standardize,
word = abbrev$full,
shorter = abbrev$abbrev,
SIMPLIFY = FALSE, USE.NAMES = FALSE)]
When I run this in Rstudio, this is returning the error, "Supplied 4 items to be assigned to 3 items of column 'street'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."
However it actually does give me what I want, despite the error:
# it breaks but I do get the desired outcome:
street orig.street
1: 1 main st 1 main street
2: 99 madison ave 99 madison avenue
3: 340 cir ct 340 circle court
I feel like there must be a solution I'm missing, but I haven't figured it out. Any help would be greatly appreciated.
You could use stri_replace_all_fixed along with its argument vectorize_all = FALSE from library(stringi):
library(data.table)
library(stringi)
addr <- data.table(orig_street = c('1 main street',
'99 madison avenue',
'340 circle court'))
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
abbrev = c('st', 'ave', 'cir', 'ct'))
addr[, street := stri_replace_all_fixed(orig_street, abbrev$full, abbrev$abbrev, vectorize_all = FALSE)]
> addr
orig_street street
1: 1 main street 1 main st
2: 99 madison avenue 99 madison ave
3: 340 circle court 340 cir ct
Please also see this related answer and note that library(stringr) imports library(stringi).
An alternative is a Reduce method:
addr[, street2 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\b"), abbrev$abbrev[i], txt),
seq_len(nrow(abbrev)), init = street)][]
# street street2
# <char> <char>
# 1: 1 main street 1 main st
# 2: 99 madison avenue 99 madison ave
# 3: 340 circle court 340 cir ct
Note:
I explicitly add word-boundaries (\\b) to the gsub regex so that we don't inadvertently replace a portion of a word. I think we need this instead of fixed=TRUE because gsub("court", "ct", "courteous", fixed = TRUE) returns "cteous".
If we tried an apply family (on abbrev), then we would see the updated value for each of the patterns, but not know (without extra work) which one had the change; further, if it's possible (in general, perhaps not here) for more than one abbreviation pattern to be useful, then we need to apply each pattern/replacement on the results of the previous replacement, which *apply cannot do (as easily).
Unfortunately, Reduce does not easily iterate over rows of a frame, so we iterate over row indices (seq_len(nrow(abbrev))).
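For readers less used to Reduce, the same sequential idea can be written as a plain loop over the abbreviations. Each pass is still vectorised over all addresses, so the loop only runs once per abbreviation (300+ times in the real data), not once per address. This is a base R sketch on plain vectors. The "12 courteous lane" entry is made up to show the word-boundary effect, and perl = TRUE is my own choice here.

```r
street <- c("1 main street", "99 madison avenue",
            "340 circle court", "12 courteous lane")  # last entry is hypothetical
full   <- c("street", "avenue", "circle", "court")
short  <- c("st", "ave", "cir", "ct")

out <- street
for (i in seq_along(full)) {
  # \\b keeps the pattern from matching inside longer words
  out <- gsub(paste0("\\b", full[i], "\\b"), short[i], out, perl = TRUE)
}

out
# [1] "1 main st"         "99 madison ave"    "340 cir ct"        "12 courteous lane"
```

Inside a data.table, the result could be assigned back with `:=` in one step once the loop has finished.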
However, I can't help but feel that the last row should really be "340 circle ct". In which case, if we assume that the abbrev is at the end of the string, we can use that instead:
addr[, street3 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\s*$"), abbrev$abbrev[i], txt),
seq_len(nrow(abbrev)), init = street)][]
# street street2 street3
# <char> <char> <char>
# 1: 1 main street 1 main st 1 main st
# 2: 99 madison avenue 99 madison ave 99 madison ave
# 3: 340 circle court 340 cir ct 340 circle ct

How to create a more concise table with these 2 variables? (R programming)

I am using the dataset nba_ht_wt which can be imported via text(readr) by the url http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv . The question I am trying to tackle is: "What percentage of players have a BMI over 25, which is considered 'overweight'?"
I already created a new variable in the table called highbmi, which corresponds to bmi > 25. This is my code, but the table is hard to read. How could I get a more concise and easier-to-read table?
nba_ht_wt = nba_ht_wt %>% mutate(highbmi = bmi>25)
tab = table(nba_ht_wt$highbmi, nba_ht_wt$Player)
100*prop.table(tab,1)
I am using R programming.
There is no variable called bmi in the data provided so I will take a guess it is calculated via formula Weight/Height^2, where height is in meters.
data <- read.csv("http://users.stat.ufl.edu/~winner/data/nba_ht_wt.csv")
head(data)
Player Pos Height Weight Age
1 Nate Robinson G 69 180 29
2 Isaiah Thomas G 69 185 24
3 Phil Pressey G 71 175 22
4 Shane Larkin G 71 176 20
5 Ty Lawson G 71 195 25
6 John Lucas III G 71 157 30
I am no expert, but it looks to me like Height and Weight have their names swapped for some reason.
So I will make this adjustment to calculate bmi:
data$bmi <- data$Height/(data$Weight/100)**2
And now we can answer "What percentage of players have a BMI over 25, which is considered 'overweight'?" with a simple line of code:
mean(data$bmi > 25)
Multiply this number by 100 to get the answer as a percentage. So the answer will be 1.782178%
Assuming the formula: weight (lb) / [height (in)]^2 * 703 (source: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html), you could do:
library(data.table)
nba_ht_wt <- fread("http://users.stat.ufl.edu/%7Ewinner/data/nba_ht_wt.csv")
nba_ht_wt[, highbmi:=(Weight / Height**2 * 703)>25][,
.(`% of Players`=round(.N/dim(nba_ht_wt)[1]*100,2)), by="highbmi"][]
#> highbmi % of Players
#> 1: TRUE 45.35
#> 2: FALSE 54.65
... or plug in the formula into the previous response for a base R solution.
This simple formula might not be really appropriate for basketball players, obviously.
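For the base R route mentioned above, a minimal sketch could look like this. It assumes, as in the CDC formula, that Height is in inches and Weight in pounds, and it only uses the six rows printed by head() earlier, so the 50% it produces is not the figure for the full dataset.

```r
# The six rows shown by head(data) above
nba_head <- data.frame(
  Player = c("Nate Robinson", "Isaiah Thomas", "Phil Pressey",
             "Shane Larkin", "Ty Lawson", "John Lucas III"),
  Height = c(69, 69, 71, 71, 71, 71),        # assumed inches
  Weight = c(180, 185, 175, 176, 195, 157),  # assumed pounds
  stringsAsFactors = FALSE
)

# BMI = weight (lb) / height (in)^2 * 703
nba_head$bmi     <- 703 * nba_head$Weight / nba_head$Height^2
nba_head$highbmi <- nba_head$bmi > 25

# One-way table of the flag only, as percentages
round(100 * prop.table(table(highbmi = nba_head$highbmi)), 2)

# Or the single number asked for
100 * mean(nba_head$highbmi)
# [1] 50
```

The table in the question was hard to read because it crossed highbmi with Player, which gives one column per player. Tabulating highbmi on its own gives just two cells.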

Splitting complex string between symbols R

I have a dataset full of IDs and qualification strings. My issue with this is twofold:
How to deal with splits between different symbols and,
how to iterate output down a dataframe whilst retaining an ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
ID Qualification Subject Grade
1 1 LE: Science 45 Distinctions
2 2 A: Chemistry A
3 2 A: Biology A
4 2 A: Mathematics A
5 3 A: Biology A
6 3 A: Chemistry A
7 3 A: Mathematics A
8 3 WB: Welsh Baccalaureate Advanced Diploma Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
From my perspective the problem looks complex, and I wonder whether continuing to fudge it in Excel is ultimately the way to go for data structured like this. I would love to hear otherwise if there are any recommendations or directions.
A solution using data.table and stringr. The use of data.table is just for my personal convenience, you could use data.frame with do.call(rbind,.) instead of rbindlist()
library(stringr)
qual <- str_extract_all(s$Qualstring,"[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring,"(?<=\\:)[\\w ]+")
grade <- str_extract_all(s$Qualstring,"(?<=\\= )[A-z0-9]+")
library(data.table)
df <- lapply(seq(s$ID),function(i){
N = length(qual[[i]])
data.table(ID = rep(s[i,"ID"],N),
Qualification = qual[[i]],
Subject = subject[[i]],
Grade = grade[[i]]
)
}) %>% rbindlist()
ID Qualification Subject Grade
1: 1 LE Science 45
2: 2 A Chemistry A
3: 2 A Biology A
4: 2 A Mathematics A
5: 3 A Biology A
6: 3 A Chemistry A
7: 3 A Mathematics A
8: 3 B Baccalaureate Advanced Diploma Pass
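In short, I use a positive lookbehind (?<=) and a positive lookahead (?=). [A-Z]+ matches a run of uppercase letters, [\\w ]+ a run of word characters and spaces, and [A-z0-9]+ a run of letters (upper and lower case) and digits. Note that the range A-z also covers the few punctuation characters that sit between Z and a in ASCII, so [A-Za-z0-9]+ would be stricter, and that the grade pattern stops at the first space, which is why "45 Distinctions" comes out as "45". str_extract_all returns a list with all the matches for each element of the character vector.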
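If the full grade text is needed (for example "45 Distinctions" rather than "45"), one possible base R variation is to first split each string into records and then cut each record at ":" and " = ". This is only a sketch and assumes every record has the shape CODE:Subject = Grade, with the code in uppercase letters; the names records and out are mine.

```r
ID <- c(1, 2, 3)
Qualstring <- c("LE:Science = 45 Distinctions",
                "A:Chemistry = A A:Biology = A A:Mathematics = A",
                "A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass")
s <- data.frame(ID, Qualstring, stringsAsFactors = FALSE)

# Split at every space that is followed by "CODE:" (uppercase letters, then a colon)
records <- strsplit(s$Qualstring, " (?=[A-Z]+:)", perl = TRUE)

# One row per record, repeating the ID as often as needed
out <- data.frame(ID = rep(s$ID, lengths(records)),
                  record = unlist(records),
                  stringsAsFactors = FALSE)

out$Qualification <- sub(":.*$", "", out$record)                 # text before ":"
out$Subject       <- sub("^[^:]*:(.*) = .*$", "\\1", out$record) # between ":" and " = "
out$Grade         <- sub("^.* = ", "", out$record)               # text after " = "
out$record        <- NULL

out
```

Because the ID is repeated with rep() before the pieces are extracted, it stays attached to each record without an explicit loop.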

Repeatable way to remove a row from a data frame that ends in a certain character [duplicate]

So, I have a small data frame and like my title says, I would like to remove all rows that end in a certain letter, "n".
Here is the code that will give you the data I am working with:
url = "http://www.basketball-reference.com/leagues/NBA_1980.html"
library(XML)
x1 = readHTMLTable(url)
east.1980 = x1[["E_standings"]]
west.1980 = x1[["W_standings"]]
east.1980 = east.1980[c(1,2)]
west.1980 = west.1980[c(1,2)]
names(east.1980) = c("Team", "W")
names(west.1980) = c("Team", "W")
wins.1980 = rbind(east.1980, west.1980)
wins.1980$Team = gsub("\\b\\d+\\b", "", wins.1980$Team)
wins.1980$Team = gsub(" +"," ",gsub("^ +","",gsub("[^a-zA-Z0-9 ]","",wins.1980$Team)))
View(wins.1980)
Here is an example of how the data frame will look:
Team W
1 Atlantic Division
2 Boston Celtics 61
3 Philadelphia 76ers 59
4 Washington Bullets 39
5 New York Knicks 39
6 New Jersey Nets 34
7 Central Division
8 Atlanta Hawks 50
9 Houston Rockets 41
10 San Antonio Spurs 41
11 Indiana Pacers 37
12 Cleveland Cavaliers 37
13 Detroit Pistons 16
14 Midwest Division
15 Milwaukee Bucks 49
16 Kansas City Kings 47
17 Denver Nuggets 30
So basically, I want to remove the division rows "Atlantic Division, Central Division, etc...". It just so happens that all of these strings end with "n", so I am trying to write a for loop to remove all of the rows where the wins.1980$Team string ends with "n".
I want to be able to repeat the process over 30+ years of the data so being repeatable is a must.
Here are the two for loops I have tried so far:
for (i in 1:nrow(wins.1980)) {
if ((str_sub(wins.1980$Team[i], -1)) == "n") {
eval(parse(text=paste0("wins.","1980","[-", i, ",]")))
}
}
for (i in 1:nrow(wins.1980)) {
if ((str_sub(wins.1980$Team[i], -1)) == "n") {
wins.1980[-i,]
}
}
I have used a for loop with if ((str_sub(myData$Column[i], -1)) == "letter") to do something if the last character was equal to "letter" so I am pretty sure that part of the loop works.
Since there are only 6 divisions in the NBA, I would be okay with something that was repeatable and said if (wins.1980$Team == "Atlantic Division" | "Midwest Division" | etc...) then remove that row, however I do not feel like the problem in my loop is selecting the right rows, just removing them.
I do not get any errors when I run each of the above loops, it runs, but I think it just does not save what it does or something.
Pulling from my example data frame above, I would like to result to look like:
Team W
2 Boston Celtics 61
3 Philadelphia 76ers 59
4 Washington Bullets 39
5 New York Knicks 39
6 New Jersey Nets 34
8 Atlanta Hawks 50
9 Houston Rockets 41
10 San Antonio Spurs 41
11 Indiana Pacers 37
12 Cleveland Cavaliers 37
13 Detroit Pistons 16
15 Milwaukee Bucks 49
16 Kansas City Kings 47
17 Denver Nuggets 30
And again, I would like to be able to repeat this over many more data frames. Any ideas?
I am pretty new to R so I might be oblivious to simpler solutions and simplicity would be much appreciated! Thanks in advance!
Here is an easier way:
wins.1980[grep("Division$", wins.1980$Team, invert = TRUE), ]
grep("Division$"... matches anything that ends in "Division" in the Team column (this is probably safer than choosing anything that ends in n, but you could do that with the same technique), and invert = TRUE inverts these matches so you get everything that doesn't end in "Division". Using this to subset gets you all the rows where Team doesn't end in "Division".
You could make this a function to apply to many data frames:
no_div <- function(x) {
x[grep("Division$", x$Team, invert = TRUE), ]
}
Assuming you want to subset them all based on the Team column; if you're using different columns you'd have to modify the function to take an additional argument. Then call it on your data with no_div(wins.1980).
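To repeat this over many seasons, one option is to keep the data frames in a list and apply the function to each element. Note that subsetting returns a new data frame, so the result has to be assigned. The loops in the question compute wins.1980[-i, ] but never store it, which is why nothing seemed to change. The sketch below uses two small frames built from the rows shown in the question; the list name standings and the NA win counts for the division rows are my own stand-ins.

```r
no_div <- function(x) {
  x[grep("Division$", x$Team, invert = TRUE), ]
}

# Two small example frames taken from the table in the question
standings <- list(
  atlantic = data.frame(Team = c("Atlantic Division", "Boston Celtics", "Philadelphia 76ers"),
                        W = c(NA, 61, 59), stringsAsFactors = FALSE),
  central  = data.frame(Team = c("Central Division", "Atlanta Hawks", "Houston Rockets"),
                        W = c(NA, 50, 41), stringsAsFactors = FALSE)
)

# Apply the same clean-up to every data frame and keep the results
cleaned <- lapply(standings, no_div)

cleaned$atlantic
```

With 30+ seasons, the list could be named by year so that cleaned[["1980"]] gives the cleaned 1980 table.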
You can use grepl like so,
df <- data.frame(Team=c("Boston Celtics","Atlantic Division",
"Central Division","Atlanta Hawks"),
W=sample(10:20, 4))
df <- df[!grepl("n$", df$Team),]
Where "n$" is a regular expression meaning 'string ends with n'
You should be able to use substr and subset to do this.
First find the rows which end in Division
matches <- substr(wins.1980$Team, nchar(wins.1980$Team) - 7, nchar(wins.1980$Team)) %in% c("Division")
Then subset the dataframe based on this
wins.1980 <- subset(wins.1980, !matches)
Edit: better example here - https://stackoverflow.com/a/13012423/1502898
If you like the syntax of the dplyr and magrittr packages:
library(dplyr) ; library(magrittr)
wins.1980 %<>% filter(!grepl("Division", Team))

Why is sapply() taking my matrix and turning it into a list that I cannot factor?

I am using the sapply() function to create a new column of data. First, from my raw data of observations, every patient receives a number between 1-999, each number has a unique description, but they all fall into 1 out of 27 categories. My problem is that the 27 categories are not given in the raw data, so I have to look them up in a dictionary which has the categories that match the numbers 1-999.
Here is the raw data from a data set titled inova9:
ID AgeGroup Race SexCode Org_DRGCode
9 9 75-84 White F 435
10 10 75-84 White F 441
11 11 45-54 White F 301
40 40 14-17 White F 775
70 70 75-84 White F 853
120 120 55-64 White M 395
Here is part of my dictionary:
MSDRG_num MS.DRG_Descriptions_
1 1 Heart transplant or implant of heart assist system w MCC
2 2 Heart transplant or implant of heart assist system w/o MCC
3 3 ECMO or trach w MV 96+ hrs or PDX exc face, mouth & neck w maj O.R.
4 4 Trach w MV 96+ hrs or PDX exc face, mouth & neck w/o maj O.R.
5 5 Liver transplant w MCC or intestinal transplant
6 6 Liver transplant w/o MCC
New_CI_Category
1 Organ Transplant
2 Organ Transplant
3 General/Other Surgery
4 General/Other Surgery
5 Organ Transplant
6 Organ Transplant
here are the 27 categories:
> levels(DRG$New_CI_Category)
[1] "Bariatric Surgery" "Behavioral"
[3] "Cardiovasc Medicine" "CV Surg - Open Heart"
[5] "General/Other Surgery" "GYN Med/Surg"
[7] "Hem/Onc Medicine" "Interventional Cardiology - EP"
[9] "Interventional Cardiology - PCI" "Medicine"
[11] "Neonates" "Neurology"
[13] "Neurosurgery - Brain" "Neurosurgery - Other"
[15] "Normal Newborns" "OB Deliveries"
[17] "OB Other" "Organ Transplant"
[19] "Ortho Medicine" "Ortho Surg - Other"
[21] "Ortho Surgery - Joints" "Rehab"
[23] "Spine" "Thoracic Surgery"
[25] "Unspecified" "Urology Surgery"
[27] "Vascular Procedure - Surgery or IR"
So, I need to match up inova9$Org_DRGCode with MSDRG_num from my dictionary, then pull the corresponding category from DRG$New_CI_Category.
I implemented the following:
ServiceLine1 = matrix(nrow=length(inova9$Org_DRGCode),ncol=1)
ServiceLine1 = sapply(1:length(inova9$Org_DRGCode),function(i)as.character(DRG$New_CI_Category[DRG$MSDRG_num==inova9$Org_DRGCode[i]]))
Svc = as.factor(ServiceLine1)
inova9 = data.frame(inova9,Svc)
As you can see, I created a column and now I can merge it with my original data, one-to-one.
I have four data sets like this, but it only works for two. The other two I receive this error:
> Svc = as.factor(ServiceLine2)
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
And my data looks like this:
[[1]]
[1] "Neurology"
[[2]]
[1] "Medicine"
[[3]]
[1] "GYN Med/Surg"
[[4]]
[1] "Vascular Procedure - Surgery or IR"
[[5]]
[1] "Neurology"
[[6]]
[1] "Medicine"
How did sapply() turn my matrix into a list and how do I stop it from happening?
You might save yourself a headache by converting your data to data.table, setting a key, then simply joining.
library(data.table)
DT.inova <- as.data.table(inova9)
DT.dict <- as.data.table(DRG)
## Set the key to what you want to join on
setkey(DT.inova, Org_DRGCode)
setkey(DT.dict, MSDRG_num)
## Assign the column from DT.dict into DT.inova, joining on the keys
DT.inova[DT.dict, New_CI_Category := i.New_CI_Category]
Make sure the keys are of the same type, meaning that they are both factor or both character, etc.
This happens because sapply is a wrapper for lapply that tries to be smart about its return structure. When, for whatever reason, it can't figure it out, it will always fall back to a list because that is what lapply returns.
Now, I'm not entirely sure why that's happening here. Just reading your code, I would also expect sapply to return a vector and not a list. One possibility is that, for some value of i, the expression as.character(DRG$New_CI_Category[DRG$MSDRG_num==inova9$Org_DRGCode[i]]) does not have length one: it has length zero if the code is missing from the dictionary, and length greater than one if the code appears more than once. You can check this with any(lengths(ServiceLine1) != 1).
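A tiny reproduction of that fallback, using a made-up dictionary (dict, lookup and the codes are hypothetical and only there to show the behaviour):

```r
# Toy dictionary: code 2 appears twice, code 999 is absent
dict <- data.frame(num = c(1, 2, 2),
                   cat = c("A", "B", "C"),
                   stringsAsFactors = FALSE)

lookup <- function(code) dict$cat[dict$num == code]

ok  <- sapply(c(1, 1), lookup)       # every result has length 1
bad <- sapply(c(1, 2, 999), lookup)  # results have lengths 1, 2 and 0

is.list(ok)
# [1] FALSE
is.list(bad)
# [1] TRUE
lengths(bad)
# [1] 1 2 0
```

So if two of the four data sets contain a code that is missing from, or duplicated in, the dictionary, that alone would be enough to produce the list.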
In any case, the function unlist will compress a list down to a vector, so you can do as.factor(unlist(ServiceLine1)).
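One caveat with unlist: zero-length elements are dropped, so the resulting vector can end up shorter than the data. A lookup with match() may avoid the issue, since it returns exactly one position per row (NA when a code is not in the dictionary, the first hit when a code is duplicated). The sketch below uses the column names from the question with a small stand-in dictionary; code 999 is included on purpose as an unmatched value.

```r
# Stand-in for the dictionary (first rows from the question)
DRG <- data.frame(
  MSDRG_num = c(1, 2, 3),
  New_CI_Category = c("Organ Transplant", "Organ Transplant", "General/Other Surgery"),
  stringsAsFactors = FALSE
)

# Stand-in for the raw data; 999 has no dictionary entry here
inova9 <- data.frame(ID = 1:4, Org_DRGCode = c(3, 1, 999, 2))

# Position of each code in the dictionary, one per row of inova9
idx <- match(inova9$Org_DRGCode, DRG$MSDRG_num)

# Pull the category by position; unmatched codes become NA
inova9$Svc <- factor(DRG$New_CI_Category[idx])

inova9
```

Rows with NA in Svc then point directly at the codes that need attention in the dictionary.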
