finding potential duplications (spelling errors) in r [duplicate] - r

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data the names might be written down wrong. I am looking for a way to check in a vector of strings if two of them are similair.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?

This can be done based on eg the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers of these questions:
agrep: only return best match(es)
In R, how do I replace a string that contains a certain pattern with another string?
Fast Levenshtein distance in R?
But most often agrep will do what you want :
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4

Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
lapply(pres, agrep, pres, value = TRUE)
[[1]]
[1] " Obama, B." "Obama, B.H."
[[2]]
[1] "Bush, G.W."
[[3]]
[1] " Obama, B." "Obama, B.H."
[[4]]
[1] "Clinton, W.J."

Add another duplicate to show it works with more than one duplicate.
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")
adist shows the string distance between 2 character vectors
adist(" Obama, B.", pres)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 9 3 10 7
For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:
d <- adist(" Obama, B.", pres)
pres[min(d[d>0])]
# [1] "Obama, B.H."
To obtain unique names, taking into account spelling errors and inconsistencies, you can compare each string to all previous ones. Then if there is a similar one, remove it. I created a keepunique() function that performs this. keepunique() is then applied to all elements of the vector successively with Reduce().
keepunique <- function(previousones, x){
if(any(adist(x, previousones)<5)){
x <- NULL
}
return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B." "Bush, G.W." "Clinton, W.J."

Related

How to find word index or position in a given string using r programming

How to find index or position of a word in a given string, below code says the starting position of word and length. After finding the position of the word, I want to extract preceding and succeeding words in my project.
library(stringr)
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique")
word_pos <- regexpr('termination', Output_text)
Output:
[1] 45
attr(,"match.length")
[1] 11
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
45 - It is counting each and every character and displaying starting position of "termination"
11- is length
Here, "termination", is at 7th position, how to find it using r programming
Appreciate your help.
Here it is:
library(stringr)
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique")
words <- unlist(str_split(Output_text, " "))
which(words == "termination")
[1] 7
Edit:
For multiple occurrences of the word in text and generating next and previous keywords:
# Adding a few random "termination" words to the string:
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was termination somewhat unique termination")
words <- unlist(str_split(Output_text, " "))
t1 <- which(words == "termination")
next_keyword <- words[t1+1]
previous_keywords <- words[t1-1]
> next_keyword
[1] "disputes" "somewhat" NA
> previous_keywords
[1] "contract" "was" "unique"
You can do this without worrying about character indices using regular expressions without any external package.
# replace whole string by the words preceding and following 'termination'
(words <- sub("[\\S\\s]+ (\\S+) termination (\\S+) [\\S\\s]+", "\\1 \\2", Output_text, perl = T))
# [1] "contract disputes"
# Split the resulting string into two individual strings
(words <- unlist(strsplit(words, " ")))
# [1] "contract" "disputes"
The easiest way is just the match termination and the surrounding words in str_extract and then str_remove termination.
str_remove(str_extract(Output_text,"\\w+ termination \\w+"),"termination ")
[1] "contract disputes"

R - Extract information from string following a general format

This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank Code Name Club Class Time Points
9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167
10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
The order of the elements will always be the same
The Name data will consist of two or three words
The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub)
The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass)
The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?
Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s)
gsub("\\*", "", trimws(str_match(s, paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(", paste(sClub, collapse = "|"),")\\s*(", paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)"))[, -1]))
sapply(list(s1, s2), myfun)
# [,1] [,2]
# [1,] "9" "10"
# [2,] "9875" "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ" "ΔΕΖ"
# [5,] "Π/Π Β" "N"
# [6,] "00:54:05" "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167" "0"
The way it works is just taking into account all your additional information and constructing a long regex. It finishes with erasing * and removing leading/trailing whitespace.

Create a new vector with text from strings in an old vector in R

Working with a data frame in R studio. One column, PODMap, has sentences such as "At my property there is a house at 38.1234, 123.1234 and also I have a car". I want to create new columns, one for the latitude and one for the longitude.
Fvalue is the data frame. So far I have
matches <- regmatches(fvalue[,"PODMap"], regexpr("..\\.....", fvalue[,"PODMap"], perl = TRUE))
Since the only periods in the text are in longitude and latitude, this returns the first lat or long listed in each string (still working on finding a regex to grab the longitude from after the latitude but that's a different question). The problem is, for instance, if my vector is c("test 38.1111", "x", "test 38.2222") then it returns (38.1111. 38.2222) which has the right values, but the vector won't be the right length for my data frame and won't match. I need it to return a blank or a 0 or NA for each string that doesn't have the value matching the regular expression, so that it can be put into the data frame as a column. If I'm going about this entirely wrong let me know about that too.
You can use regexecwhich returns a list of the same length so you don't loose the non-match spaces
PODMap<-c("At my property there is a house at 38.1234, 123.1234 and also I have a",
"Test TEst TEST Tes T 12.1234, 123.4567 test Tes",
"NO LONG HEre Here No Lat either",
"At my property there is a house at 12.1234, 423.1234 and also I have ")
Index<-c(1:4)
fvalue<-data.frame(Index,PODMap)
matches <- regmatches(fvalue[,"PODMap"], regexec("..\\.....", fvalue[,"PODMap"], perl
= TRUE))
> matches
[[1]]
[1] "38.1234"
[[2]]
[1] "12.1234"
[[3]]
character(0)
[[4]]
[1] "12.1234"
Using the package stringr, we can get both the long and lat.
library(stringr)
matches<-str_match_all(fvalue[,"PODMap"], ".\\d\\d\\.\\d\\d\\d\\d")
> matches
[[1]]
[,1]
[1,] " 38.1234"
[2,] "123.1234"
[[2]]
[,1]
[1,] " 12.1234"
[2,] "123.4567"
[[3]]
[,1]
[[4]]
[,1]
[1,] " 12.1234"
[2,] "423.1234"
The \\d checks for any digit 1:9, so that will keep out any words, and we use str_match_all to get all the matches from the string, as regmatches will only take the first match. str_match_all will set a value to NULL instead of character(0) though, which should not be a problem.
Check out this regex demo

Extracting multiple substrings that come after certain characters in a string using stringi in R

I have a large dataframe in R that has a column that looks like this where each sentence is a row
data <- data.frame(
datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
stringsAsFactors=FALSE)
I want to extract all the words that come after "wiki/" and put them in another column
So for the first row it should come out with "political_philosophy self-governance"
The second row should look like "hierarchy free_association_(communism_and_anarchism)"
The third row should be "state_(polity)"
And the fourth row should be "anti-statism"
I definitely want to use stringi because it's a huge dataframe. Thanks in advance for any help.
I've tried
stri_extract_all_fixed(data$datalist, "wiki")[[1]]
but that just extracts the word wiki
You can do this with a regex. By using stri_match_ instead of stri_extract_ we can use parentheses to make matching groups that let us extract only part of the regex match. In the result below, you can see that each row of df gives a list item containing a data frame with the whole match in the first column and each matching group in the following columns:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
match
[[1]]
[,1] [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance" "self-governance"
[[2]]
[,1] [,2]
[1,] "wiki/stateless_society" "stateless_society"
[2,] "wiki/hierarchy" "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"
[[3]]
[,1] [,2]
[1,] "wiki/state_(polity)" "state_(polity)"
[[4]]
[,1] [,2]
[1,] "wiki/anti-statism" "anti-statism"
You can then use apply functions to make the data into any form you want:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
You can use a lookbehind in the regex.
library(dplyr)
library(stringi)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")
df <- data.frame(text, stringsAsFactors = FALSE)
df %>%
mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
You may use
> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
See the online R code demo.
Details
wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token that matches any char, other than a line break char, 1+ occurrences, that does not start a wiki/+a non-whitespace char sequence.
If you need to get rid of redundant whitespace inside the result you may use another call to gsub:
> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"

Extract only the characters that are between opening and ending parantheses in the start and end of a string in R

I have many strings that all have the following format:
mystrings <- c(
"(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
"(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
"(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
"(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)
I need to capture the strings that are inside parentheses both at the start and the end of the original mystrings.
Therefore, variable start will store the starting characters for each of the above strings with the same index. The result will be this:
start[1]
ABFUHIASH
start[2]
SECONDSTR
start[3]
JOWERIC
start[4]
CAPTURETHIS
And similarly, the ending for each string in mystrings will be saved into end:
end[1]
ENDING
end[2]
RANDOMENDING
end[3]
GETTHIS
end[4]
IJFAI
Parentheses themselves should NOT be captured.
Is there a way/function to do this quickly in R?
I have tried stringr::word and stringi::stri_extract, but I am getting very strange results.
We can use the stringr library for this. For example
library(stringr)
mm <- str_match(mystrings, "^\\(([^)]+)\\).*\\(([^)]+)\\)$")
mm
The match finds the stuff between the parenthesis at the beginning and end of the string in capture groups so they can be easily extracted.
It returns a character matrix, and you seem to just want the 2nd and 3rd column. mm[,2:3]
[,1] [,2]
[1,] "ABFUHIASH" "ENDING"
[2,] "SECONDSTR" "RANDOMENDING"
[3,] "JOWERIC" "GETTHIS"
[4,] "CAPTURETHIS" "IJFAI"
Something like this might work for you:
> regmatches(mystrings,gregexpr("\\(.+?\\)",mystrings))
[[1]]
[1] "(ABFUHIASH)" "(ENDING)"
[[2]]
[1] "(SECONDSTR)" "(RANDOMENDING)"
[[3]]
[1] "(JOWERIC)" "(GETTHIS)"
[[4]]
[1] "(CAPTURETHIS)" "(IJFAI)"
E.g., to extract endings you could:
lapply(x,tail,1)

Resources