R find and replace partial string based on lookup table

I've found variants on this issue but can't get the suggested solutions working in my situation. I'm pretty new to R with no other coding experience so it may be I'm just missing something basic. Thanks for any help!!
I have a data table with a column of names of organisations, call it Orgs$OrgName. Sometimes there are misspellings of words within the strings that make up the organisation names. I have a look-up table (imported from csv) with common misspellings in one column (spelling$misspelt) and their corrections in another column (spelling$correct).
I want to find any parts of OrgName strings which match spelling$misspelt and replace just those parts with spelling$correct.
I have tried various solutions based on mgsub, stri_replace_all_fixed, and str_replace_all (replacement of words in strings has been my main reference). But nothing has worked, and all the examples appear to be based on manually created vectors using vect1 <- c("item1", "item2", "item3") rather than on a lookup table.
Example of my data:
OrgName
1: WAIROA DISTRICT COUNCIL
2: MANUTAI MARAE COMMITTEE
3: C S AUTOTECH LTD
4: NEW ZEALAND INSTITUTE OF SPORT
5: BRAUHAUS FRINGS
6: CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION
The lookup table:
mispelt correct
1 ABANDONNED ABANDONED
2 ABERATION ABERRATION
3 ABILITYES ABILITIES
4 ABILTIES ABILITIES
5 ABILTY ABILITY
6 ABONDON ABANDON
(There are no misspellings in the first few lines of org names, but there are 57,000+ more rows in the dataset.)
UPDATE: Here's what I have tried based on the update to the second response (trying that first as it's simpler). It hasn't worked, but hopefully someone can see where it's gone wrong?
library("stringi")
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)
write.csv(spelling$mispelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)
for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  #print(stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE))
  row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b",
                                replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}
head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,]
##should return no rows if spelling correction worked
head(chckspellchk)
#OrgNameClean.OrgNameClean
#1: SMART ENVIORNMENTAL LTD
UPDATE 2: more information - there are spaces in the spelling lookup if that makes a difference:
> head(spelling[mispelt %like% " ",,])
mispelt correctspelling
1: COCA COLA COCA
2: TORTISE TORTOISE
> head(spelling[correctspelling %like% " "])
mispelt correctspelling
1: ABOUTA ABOUT A
2: ABOUTIT ABOUT IT
3: ABOUTTHE ABOUT THE
4: ALOT A LOT
5: ANYOTHER ANY OTHER
6: ASFAR AS FAR

We can use stringi's stri_replace_all_*() functions to do multiple replacements on a whole string.
library("stringi")
string <- "WAIROA ABANDONNED COUNCIL','C S AUTOTECH LTD', 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES"
mistake <- c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW')
corrected <- c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD')
stri_replace_all_fixed(string, mistake, corrected, vectorize_all=FALSE)
stri_replace_all_regex(string, "\\b" %s+% mistake %s+% "\\b", corrected, vectorize_all=FALSE)
Output:
[1] "WAIROA ABANDONED COUNCIL','C S AUTOTECH SGM', 'OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"
Some notes:
stri_replace_all_fixed replaces all occurrences of fixed (literal) patterns.
stri_replace_all_regex uses a regular expression pattern instead. This allows us to specify word boundaries: \b to avoid substring matches (an alternative to \bword\b is (?<=\W)word(?=\W)).
vectorize_all is set to FALSE, otherwise each replacement is applied to a new copy of the original sentence. See the stringi documentation for details.
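A minimal sketch of that difference (not part of the original answer; string2 is just an illustrative value):
string2 <- "ABANDONNED ABERATION"
stri_replace_all_fixed(string2, c("ABANDONNED", "ABERATION"), c("ABANDONED", "ABERRATION"))
# [1] "ABANDONED ABERATION"  "ABANDONNED ABERRATION"   (one copy of the input per pattern)
stri_replace_all_fixed(string2, c("ABANDONNED", "ABERATION"), c("ABANDONED", "ABERRATION"), vectorize_all=FALSE)
# [1] "ABANDONED ABERRATION"   (all replacements applied to the same string)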
Full sample:
library("stringi")
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL',
' SMART ENVIORNMENTAL LTD',
'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
stringsAsFactors = FALSE)
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)
  #row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}
PS: I've made a separate CSV with a single headerless column for each character vector. But there are many other ways to read a CSV with R and convert the columns to a char vector.
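A simpler route, assuming the lookup is already loaded into a data frame called spelling as in the question, is to skip the CSV round trip and pull the columns straight into character vectors. A minimal sketch of that idea:
patterns     <- as.character(spelling$mispelt)
replacements <- as.character(spelling$correctspelling)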
PS2: If you want substring matches (e.g. matching ENVIORNMENT inside ENVIORNMENTAL), do not use stri_replace_all_regex() along with word boundaries \b. Working through a regex tutorial is a good way to buff up your regex skills.
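A hedged illustration of that PS2 point, reusing the ENVIORNMENT example from the question:
stri_replace_all_fixed("SMART ENVIORNMENTAL LTD", "ENVIORNMENT", "ENVIRONMENT", vectorize_all=FALSE)
# [1] "SMART ENVIRONMENTAL LTD"    (substring match: corrected)
stri_replace_all_regex("SMART ENVIORNMENTAL LTD", "\\bENVIORNMENT\\b", "ENVIRONMENT", vectorize_all=FALSE)
# [1] "SMART ENVIORNMENTAL LTD"    (\b blocks the substring match: unchanged)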

This answer is potentially too complicated for a new programmer, and I may be writing this more like Python than R (I'm getting a bit rusty on the latter)* but I have a proposed solution for your problem, which isn't trivial by the way. The issue I foresee with the other solutions you looked at is that they each address only one small part of the larger puzzle, which is that you need to be able to check every word inside every string against your lookup table. The simplest way I see to do this is to write a number of small functions to do what you need and then use R's family of apply functions to loop through entries and use the functions.
The only other tricky thing here is using an R environment as your lookup table. For whatever reason in R people don't seem to talk much about or really use hash tables (the real name for a lookup table) but they are very common in other languages. Luckily R's environments are actually just an implementation of a C hash table, which is good because hashes are very fast and allow you to directly map one value to another. (More on this here, if interested.)
*I welcome comments or edits from others that would make my answer simpler or more R-idiomatic
# Some example data - note stringsAsFactors=FALSE is critical for this to work
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL',
'C S AUTOTECH LTD',
'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
stringsAsFactors = FALSE)
spelling_df <- data.frame("Mistake" = c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW'),
"Correct"= c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD'),
stringsAsFactors = FALSE)
# Function to convert your data frame to a hash table
create_hash <- function(in_df){
  hash_table <- new.env(hash=TRUE)
  for(i in seq(nrow(in_df))) {
    hash_table[[in_df[i, 1]]] <- in_df[i, 2]
  }
  return(hash_table)
}
# Make the hash table out of your data frame
spelling_hash <- create_hash(spelling_df)
# Try it out:
print(spelling_hash[['ABANDONNED']]) # prints ABANDONED
# Now make a function to apply the lookup - and ensure
# if the string is not in the lookup table, you return the
# original string instead (instead of NULL)
apply_hash <- function(in_string, hash_table=spelling_hash){
  x = hash_table[[in_string]]
  if(!is.null(x)){
    return(x)
  }
  else{
    return(in_string)
  }
}
# Finally make a function to break the full company name apart,
# apply the lookup on each word, and then paste it back together
correct_spelling <- function(bad_string) {
  split_string <- strsplit(as.character(bad_string), " ")
  new_split <- lapply(split_string[[1]], apply_hash)
  return(paste(new_split, collapse=' '))
}
# Make a new field that applies the spelling correction
Orgs$Corrected <- sapply(Orgs$OrgName, correct_spelling)

I came across a similar issue and might have a tidyverse-style solution.
stringr::str_replace_all should let us do multiple replacements using a named vector.
With the lookup data frame of misspelled and corrected values we could turn that into a named vector. Then we could use that named vector as a lookup in str_replace_all.
Here is an example using some of the misspelled and corrected values provided previously.
library(tidyverse)
# load data frame of misspelled and corrected values
foo <- read_csv("mispelt, correct
ABANDONNED, ABANDONED
ABERATION, ABERRATION
ABILITYES, ABILITIES
ABILTIES, ABILITIES
ABILTY, ABILITY
ABONDON, ABANDON
COCA COLA, COCA
TORTISE, TORTOISE
ABOUTA, ABOUT A
ABOUTIT, ABOUT IT
ABOUTTHE, ABOUT THE
ALOT, A LOT
ANYOTHER, ANY OTHER
ASFAR, AS FAR",
col_types = "c")
# str_replace_all requires a named vector of replacements
# the value of the vector is the correction,
# while the name of each value is the search string to replace
lookup <- foo$correct
names(lookup) <- foo$mispelt
# data frame to test our lookup named vector
tbl <- tibble(old = foo$mispelt)
# mutating to a new column to show replacement works,
# but we could just overwrite the old column as well using mutate
mutate(tbl, new = str_replace_all(old, lookup))
I did not deal with upper or lower case considerations as I'm just demonstrating the named vector usage in str_replace_all and the examples were all upper case. However, regular expressions and/or the regex function could probably help with that if necessary.
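As a hedged sketch of one such option (an assumption on my part, not tested against the full dataset): stringr patterns go through ICU's regex engine, so prefixing each misspelling with the inline flag (?i) should make the lookup case-insensitive.
# hypothetical case-insensitive variant of the lookup vector
lookup_ci <- foo$correct
names(lookup_ci) <- paste0("(?i)", foo$mispelt)
str_replace_all("abandonned abilityes", lookup_ci)
# expected: "ABANDONED ABILITIES"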
Session info:
|package |loadedversion |
|:---------|:-------------|
|dplyr |1.0.7 |
|readr |2.0.0 |
|stringr |1.4.0 |
|tibble |3.1.3 |
|tidyverse |1.3.1 |

Related

Joining data sets in R where the unique ids have spelling mistakes

Hi, I am trying to join two large datasets (>10,000 entries each). To do this I have created a 'unique ID', a combination of full name and date of birth, which is present in both. However, the datasets have spelling mistakes/different characters in the IDs, so when using left join many won't match. I don't have access to fuzzyjoin/match so can't use those to partially match them. Someone has suggested using adist(). How am I able to use this to match and merge the datasets, or to flag ones which are close to matching? As simple as possible please, I have only been using R for a few weeks!
Examples of code would be amazing
You could just rename them to names that are spelled correctly:
df$correct_spelling <- df$incorrect_spelling
This may be a bit of a manual solution, but perhaps a base-R approach would be to look through unique values of the join fields, correct any that are misspelled using the grep() function, and create a crosswalk to merge into the dataframes with misspelled unique IDs. Here's a trivial example of what I mean:
Let's say we have a dataframe of scientists and their year of birth, and then we have a second dataframe with the scientists' names and their field of study, but the "names" column is riddled with spelling errors. Here is the code to make the example dataframes:
##Fake Data##
Names<-c("Jill", "Maria", "Carlos", "DeAndre") #Names
BirthYears<-c(1974, 1980, 1991, 1985) # Birthyears
Field<-c("Astronomy", "Geology", "Medicine", "Ecology") # Fields of science
Mispelled<-c("Deandre", "Marai", "Jil", "Clarlos")# Names misspelled
##Creating Dataframes##
DF<-data.frame(Names=Names, Birth=BirthYears) # Dataframe with correct spellings
DF2<-data.frame(Names=Mispelled, Field=Field) # Dataframe with incorrect spellings we want to fix and merge
What we can do is find all the unique values of the correctly spelled and the incorrectly spelled versions of the scientists' names, then match them up using the regular expression function grep().
Mispelled2<-unique(DF2$Names) # Get unique values of names from misspelled dataframe
Correct<-unique(DF$Names) # Get unique values of names from correctly spelled dataframe
fix <- NULL # blank vector to save results from loop
for(i in 1:length(Mispelled2)){ # Looping through all unique misspelled values
  # Creating a regular expression string to find patterns similar to the correct name
  ptn <- paste("^", substring(Mispelled2[i], 1, 1), "+", sep="")
  fix[i] <- grep(ptn, Correct, value=TRUE) # Finding and saving replacement name values
} # End loop
You'll have to come up with the regular expressions necessary for your situation; here is a link to a cheatsheet on how to build regular expressions:
https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
Now we can make a dataframe crosswalking the misspelled names with the correct spellings, i.e., row 1 would have "Deandre" and "DeAndre", row 2 would have "Jil" and "Jill".
CWX<-data.frame(Name_wrong=Mispelled2, Name_correct=fix)
Finally we merge the crosswalk to the dataframe with the incorrect spellings, and then merge the resultant dataframe to the dataframe with the correct spellings
Mispelled3<-merge(DF2, CWX, by.x="Names", by.y="Name_wrong")
Joined_DF<-merge(DF, Mispelled3[,-1], by.x="Names", by.y="Name_correct")
Here is what I was able to come up with for your question about matching in multiple ways. It's a bit clunky, but it works with the example data below. The trick is making the call to agrep() sensitive enough not to match names that partially match but are truly different, yet flexible enough to allow for partial matches and misspellings:
Example1<-"deborahoziajames04/14/2000"
Example2<-"Somepersonnotdeborah04/15/2002"
Example3<-"AnotherpersonnamedJames01/23/1995"
Misspelled1<-"oziajames04/14/2000"
Misspelled2<-"deborahozia04/14/2000"
Misspelled3<-"deborahoziajames10/14/1990"
Misspelled4<-"personnamedJames"
String<-c(Example1, Example2, Example3)
Misspelled<-c(Misspelled1, Misspelled2, Misspelled3, Misspelled4)
Spell_Correct <- function(String, Misspelled){
  out <- NULL
  for(i in 1:length(Misspelled)){
    ptn_front <- paste('^', Misspelled[i], "\\B", sep="")
    ptn_mid <- paste('\\B', Misspelled[i], "\\B", sep="")
    ptn_end <- paste('\\B', Misspelled[i], "$", sep="")
    ptn <- c(ptn_front, ptn_mid, ptn_end)
    Nchar_M <- nchar(Misspelled[i])
    Nchar_S <- nchar(String)
    out_front <- agrep(pattern=ptn[1], x=String, value=TRUE, max.distance=0.3,
                       ignore.case=TRUE, costs=c(0.6, 0.6, 0.6))
    out_mid <- agrep(pattern=ptn[2], x=String, value=TRUE, max.distance=0.3,
                     ignore.case=TRUE, costs=c(0.6, 0.6, 0.6))
    out_end <- agrep(pattern=ptn[3], x=String, value=TRUE, max.distance=0.3,
                     ignore.case=TRUE, costs=c(0.6, 0.6, 0.6))
    out_test <- list(out_front, out_mid, out_end)
    # Keep whichever of the three calls returned exactly one match
    for (j in 1:length(out_test)){
      if(length(out_test[[j]]) == 1)
        use_me <- out_test[[j]]
    }
    out[i] <- use_me
  }
  return(unlist(out))
}
Spell_Correct(String, Misspelled)
Basically this just repeating the previous answer multiple times by using the loop and tweaking the regular expression to try a beginning, middle, and end call to agrep(). Depending on how bad the misspellings are, you may need to play around with the max.distance and cost arguments. Good Luck.
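As a rough illustration of that tuning (hypothetical values, not from the original post), a stricter max.distance rejects a near-miss that a looser one accepts:
agrep("lasy", "The lazy dog", max.distance = 0)  # integer(0): no exact substring match
agrep("lasy", "The lazy dog", max.distance = 1)  # 1: one edit is allowed, so "lazy" matches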
Take Care,
-Sean

R partial string matching ignore spaces omni-directional

I am having issue with partial string matching. I have pairs of people, and I need to compare their names. To do this I have run a charmatch both directions on the two last names, to see if name1 is part of name2, and vice versa. I have a small dataset below to demonstrate the question. I use charmatch below; I have used pmatch as well and it returns the same result.
When the charmatch documentation says it "seeks matches for the elements of its first argument among those of its second"... I take that to mean it will treat each group of characters in element1 as a pattern and see if the same group exists in element2. But that's obviously not what's happening; it looks like it's direction specific.
So...is it direction specific? And if so...what else can I use to do what I am describing? My example names are a pun (intended); what I actually run into is lots of last names where the husband has his name and the wife has hers + the husband's. I need to be able to see if the husband's last name exists within the wife's last name.
I know it can be done with regular expressions but I am not familiar with them, probably should be, but am not, so I'd prefer an answer that does not use regex.
eg_data <- data.frame(name1 = c('Jimmy Conway', 'Jimmy'),
name2 = c('Conway','Jimmy Conway'))
eg_data$share_name1 <- mapply(charmatch, eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(charmatch, eg_data$name2, eg_data$name1)
eg_data$share_name <- 0
eg_data$share_name[(eg_data$share_name1==1 | eg_data$share_name2==1)] <- 1
Same two lines, only str_detect, not charmatch.
library(stringr)
eg_data$share_name1 <- mapply(str_detect, eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(str_detect, eg_data$name2, eg_data$name1)
OR even
eg_data$share_name1 <- ifelse(mapply(str_detect,eg_data$name1, eg_data$name2)==TRUE,1,0)
eg_data$share_name2 <- ifelse(mapply(str_detect,eg_data$name2, eg_data$name1)==TRUE,1,0)
Thanks for anyone who looked. I hope this helps others.
This could be useful
> with(eg_data, intersect(name1, name2))
[1] "Jimmy Conway"

r replace text within a string by lookup table

I have already tried to find a solution on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite new at programming so please be patient :D...
I have a (in reality much larger) text string which looks like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I have looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R to look at what's inside the brackets, look up that same value in the lookup table, and give back what's in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use a regex look around to match only numbers that are inside a square bracket
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test [3]."
Or without regex lookaround, by pasting the square brackets onto each column of 'lookup':
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a 2 column table) into a data frame. If your source information be a flat text file, then you can easily use read.csv to obtain a data frame. In the example below, I hard code a data frame with just two entries. Then, I iterate over it and make replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
  string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test 5 test. Test 3."
Note: As #Frank wisely pointed out, my solution would fail if your number markers (e.g. [438]) happen to have replacements which are numbers also appearing as other markers. That is, if replacing a key with a value results in yet another key, there could be problems. If this be a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
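A minimal sketch of that bracket-removal idea (my own assumption about how to apply the note, not code from the answer above): replace the whole [key] marker, brackets included, so a replacement value can never be re-matched as another key.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
  # the brackets are consumed by the match, so the result contains bare numbers only
  string <- gsub(paste0("\\[", df$keys[i], "\\]"), df$values[i], string)
}
string
# [1] "Test test 5 test. Test 3."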
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])

Extracting the top match from string comparison in R

I am currently using the 'agrep' function with 'lapply' in a data.table code to link entries from a user-provided VIN# list to a DMV VIN# database. Please see the following two links for all data/code so far:
Accelerate performance and speed of string match in R
Imperfect string match using data.table in R
Is there a way to extract the "best" match from my list that is being generated by:
dt <- dt[lapply(car.vins, function(x) agrep(x,vin.vins, max.distance=c(cost=2, all=2), value=T)), list(NumTimesFound=.N), vin.names]
because as of now, the 'agrep' function gives me multiple matches, even with a lot of modification of the cost, all, substitution, etc. variables.
I have also tried using the 'adist' function instead of 'agrep' but because 'adist' does not have an option for value=TRUE like 'agrep', it throws the same
Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins, :
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
Character columns must join to factor or character columns.
that I was receiving with the 'agrep' before.
Is there perhaps some other package I could use?
Thanks!
Tom, this isn't strictly a data.table problem. Also, it's hard to know exactly what you want without having the data you are using. I tried to figure out what you want, and I came up with this solution:
vin.match <- vapply(car.vins, function(x) which.min(adist(x, vin.vins)), integer(1L))
data.frame(car.vins, vin.vins=vin.vins[vin.match], vin.names=vin.names[vin.match])
# car.vins vin.vins vin.names
# 1 abcdekl abcdef NAME1
# 2 abcdeF abcdef NAME1
# 3 laskdjg laskdjf NAME2
# 4 blerghk blerghk NAME3
And here is the data:
vin.vins <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", 1:length(vin.vins))
car.vins <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")
This will find the closest match for every value in car.vins in vin.vins, as per adist. I'm not sure data.table is needed for this particular step. If you provide your actual data (or a representative sample), then I can provide a more targeted answer.

Counting words within factors

I have millions of Keywords in a column labeled Keyword.text. Each factor or Keyword can contain multiple words (or shall we say tokens). Here is an example with 4 keywords:
Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9
I'd like to count the number of tokens in each Keyword, so as to obtain:
Keyword.length
5
4
1
4
I installed the Tau package but I haven't gotten very far...
textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
returns an error I don't understand. Maybe it's due to having factors; it worked fine when practicing with a string.
I know how to do it in Excel, but it doesn't work for the last line. If A2 has the keywords then =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do.
Edit: For a dataframe and the total number of keywords, just use strsplit. There's no need to use textcnt if you're not interested in the counts per keyword. That's where I got you wrong:
tt <- data.frame(
  a=rnorm(3),
  b=rnorm(3),
  c=c("the quick fox lazy","rbrown+fr even","what what goes & around"),
  stringsAsFactors=F
)
sapply(tt$c, function(n){
  length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})
To read the data, take also a look at ?readLines and/or ?scan. This preserves the string format and allows you to process the file line by line (or row per row). If you use a file connection, you can even load the file in parts, which helps you when you hit memory limits.
A simple example using readLines:
con <- textConnection("
The lazy fog+fog fog
never ended for fog jumping over the
fog whatever . $ plus.
")
# For a real file you would use con <- file("myfile.txt")
Text <- readLines(con)
sapply(Text, textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
On a sidenote, using the option Dirk mentioned (stringsAsFactors=F) won't slow down performance compared to the usual read.table command. On the contrary, actually. You should use the sapply as mentioned above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=F option and drop the as.character()).
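For concreteness, a small sketch of that substitution (assuming Mydf and the tau package are loaded as in the question):
sapply(as.character(Mydf$Keyword.text), textcnt,
       split = "[[:space:][:punct:]]+", method = "string", n = 1L)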
Please show the error.
Also try:
require(tau)
textcnt(as.character(Mydf$Keyword.text), split, ...)
... to force character mode.
Or load your data with stringsAsFactors=FALSE -- the same question has come up here before.
What about a nice little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo=F){
  ifelse(pseudo,
         pattern <- "\\S+",
         pattern <- "[[:alpha:]]+"
  )
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
