String match with R: Finding the best possible match

I have two vectors of words.
Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')
I need to make the best possible match between the Lexicon and Corpus.
I tried many methods. This is one of them.
library(stringr)
match<- paste(Lexicon,collapse= '|^') # I use the stemming method (snowball), so the words in Lexicon are root of words
test<- str_extract_all(Corpus, match, simplify= T)
test
[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"
But, the match should be:
[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"
Instead, each word is matched with the first entry, in alphabetical order, of my Lexicon. By the way, these vectors are a sample of a bigger list that I have.
I didn't try regex() because I'm not sure how it works. Perhaps the solution lies that way.
Could you help me to solve this problem? Thank you for your help.

You can just use the match() function.
Index <- match(Corpus, Lexicon)
Index
[1] 2 3 4 6
Lexicon[Index]
[1] "animalada" "fe" "fernandez" "ladrillo"

You can order Lexicon by the number of characters the patterns have, in decreasing order, so the best match comes first:
match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^')
test<- str_extract_all(Corpus, match, simplify= T)
test
# [,1]
#[1,] "animalada"
#[2,] "fe"
#[3,] "fernandez"
#[4,] "ladrillo"
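A regex-free variation on the same idea, sketched under the assumption that "best match" means the longest Lexicon entry that is a prefix of the word. The helper name best_prefix is made up for this illustration; startsWith() is base R (3.3.0 or later).

```r
Corpus  <- c("animalada", "fe", "fernandez", "ladrillo")
Lexicon <- c("animal", "animalada", "fe", "fernandez", "ladr", "ladrillo")

# best_prefix() is a hypothetical helper, not part of any package.
# It returns the longest lexicon entry that `word` starts with, or NA.
best_prefix <- function(word, lexicon) {
  hits <- lexicon[startsWith(word, lexicon)]   # entries that are prefixes of word
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]                 # keep the longest prefix
}

result <- vapply(Corpus, best_prefix, character(1),
                 lexicon = Lexicon, USE.NAMES = FALSE)
print(result)
```

Because no pattern is built, Lexicon entries containing regex metacharacters need no escaping, and every entry is checked at the start of the word only.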

I tried both methods, and the right one was the one suggested by #Psidorm.
If I use the function match(), it will find the match in any part of the word, not necessarily at the beginning. For instance:
Corpus<- c('tambien')
Lexicon<- c('bien')
match(Corpus,Lexicon)
The result is 'tambien', but this is not correct.
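For comparison, a quick check that can be run in a plain R session: base R's match() tests whole strings for equality, so it returns NA for this pair, whereas hits in the middle of a word are what an unanchored regular expression produces. The variable names below are only for illustration.

```r
Corpus  <- c("tambien")
Lexicon <- c("bien")

exact      <- match(Corpus, Lexicon)   # whole-string equality, no partial matching
unanchored <- grepl("bien", Corpus)    # pattern may match anywhere in the word
anchored   <- grepl("^bien", Corpus)   # pattern must match at the start of the word

print(list(exact = exact, unanchored = unanchored, anchored = anchored))
```

If partial hits show up, the likely source is a pattern alternative without the leading ^ rather than match() itself.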
Again, thank you both for your help!!

Related

R - Extract information from string following a general format

This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank Code Name Club Class Time Points
9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167
10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
The order of the elements will always be the same
The Name data will consist of two or three words
The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub)
The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass)
The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?
Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s)
gsub("\\*", "", trimws(str_match(s, paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(", paste(sClub, collapse = "|"),")\\s*(", paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)"))[, -1]))
sapply(list(s1, s2), myfun)
# [,1] [,2]
# [1,] "9" "10"
# [2,] "9875" "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ" "ΔΕΖ"
# [5,] "Π/Π Β" "N"
# [6,] "00:54:05" "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167" "0"
It works by folding all of your additional information into one long regex. Finally, trimws() removes leading and trailing whitespace and gsub() erases the * characters.
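The same pieces of information can also be used without one long regex. The sketch below is a different technique from the answer above: it splits the row into tokens, finds the club by lookup, and treats everything between the code and the club as the name. The club and class values and the sample rows are ASCII stand-ins invented for this illustration (so the example does not depend on the locale), and parse_row is a hypothetical helper name.

```r
# Hypothetical ASCII stand-ins for the real club/class lists and rows.
sClub  <- c("AB/G", "DEZ")
sClass <- c("P/P B", "N")
s1 <- " 9 9875 Georgiou Angelos Dimitris AB/G P/P B 00:54:05 167***\r"
s2 <- " 10 8954F Smith John DEZ N DID NOT START 0\r"

parse_row <- function(s, clubs, classes) {
  tok <- strsplit(trimws(s), " +")[[1]]        # trimws() also drops the trailing \r
  n <- length(tok)
  club_at <- which(tok %in% clubs)[1]          # first token that is a known club
  rest <- paste(tok[(club_at + 1):(n - 1)], collapse = " ")  # class followed by time
  cls <- classes[startsWith(rest, classes)]    # classes may contain spaces
  cls <- cls[which.max(nchar(cls))]            # prefer the longest matching class
  c(Rank   = tok[1],
    Code   = tok[2],
    Name   = paste(tok[3:(club_at - 1)], collapse = " "),  # two or three words
    Club   = tok[club_at],
    Class  = cls,
    Time   = trimws(substring(rest, nchar(cls) + 1)),
    Points = gsub("\\*", "", tok[n]))
}

rows <- t(sapply(list(s1, s2), parse_row, clubs = sClub, classes = sClass))
print(rows)
```

This assumes that no word of a name equals a club name and that each row has a club and a class from the lists; rows violating that would need extra checks.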

Extract only the characters that are between opening and ending parentheses in the start and end of a string in R

I have many strings that all have the following format:
mystrings <- c(
"(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
"(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
"(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
"(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)
I need to capture the strings that are inside parentheses both at the start and the end of the original mystrings.
Therefore, variable start will store the starting characters for each of the above strings with the same index. The result will be this:
start[1]
ABFUHIASH
start[2]
SECONDSTR
start[3]
JOWERIC
start[4]
CAPTURETHIS
And similarly, the ending for each string in mystrings will be saved into end:
end[1]
ENDING
end[2]
RANDOMENDING
end[3]
GETTHIS
end[4]
IJFAI
Parentheses themselves should NOT be captured.
Is there a way/function to do this quickly in R?
I have tried stringr::word and stringi::stri_extract, but I am getting very strange results.
We can use the stringr library for this. For example
library(stringr)
mm <- str_match(mystrings, "^\\(([^)]+)\\).*\\(([^)]+)\\)$")
mm
The pattern captures the text between the parentheses at the beginning and at the end of the string in two capture groups, so both can be extracted easily.
str_match() returns a character matrix, and you seem to want just the 2nd and 3rd columns:
mm[,2:3]
[,1] [,2]
[1,] "ABFUHIASH" "ENDING"
[2,] "SECONDSTR" "RANDOMENDING"
[3,] "JOWERIC" "GETTHIS"
[4,] "CAPTURETHIS" "IJFAI"
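The two vectors named in the question can be taken straight from the columns of mm (mm[, 2] and mm[, 3]). The same capture-group idea also works in base R with sub(); the sketch below assumes every string really has the (X)...(Y) shape, since sub() returns its input unchanged when the pattern does not match.

```r
mystrings <- c(
  "(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
  "(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
  "(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
  "(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)

# Keep only the first parenthesised group at the very start of each string.
start <- sub("^\\(([^)]+)\\).*$", "\\1", mystrings)
# Keep only the last parenthesised group at the very end of each string.
end <- sub("^.*\\(([^)]+)\\)$", "\\1", mystrings)

print(start)
print(end)
```

Note that start and end are also the names of functions in the stats package; using them as variable names, as the question does, is harmless here.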
Something like this might work for you:
> regmatches(mystrings,gregexpr("\\(.+?\\)",mystrings))
[[1]]
[1] "(ABFUHIASH)" "(ENDING)"
[[2]]
[1] "(SECONDSTR)" "(RANDOMENDING)"
[[3]]
[1] "(JOWERIC)" "(GETTHIS)"
[[4]]
[1] "(CAPTURETHIS)" "(IJFAI)"
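E.g., to extract the endings you could store that result as x and take the last element of each list entry:
x <- regmatches(mystrings, gregexpr("\\(.+?\\)", mystrings))
lapply(x, tail, 1)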

R programming : select element from split string based on value in another column

I have a data frame having one column of words, with syllables separated by hyphens. I want to extract the nth syllable, where n is given in another column. Like this:
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)
mydf <- data.frame(word, whichSyl)
mydf$word <- as.character(mydf$word)
desired: a vector containing
ma
cheese
ta
If this were, say, awk, I would just do
'{split($1,a,"-"); print a[$2]}'
The words don't always have the same number of syllables.
It seems likely that there is a straightforward way to do this, but I'm not seeing it. Thanks
You can use mapply and strsplit to get,
mapply('[', strsplit(mydf$word, '-'), whichSyl)
#[1] "ma" "cheese" "ta"
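A small sketch of the same call storing the result as a new column. The column name syl is made up for this example; it also shows that an index past the last syllable yields NA rather than an error.

```r
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)
mydf <- data.frame(word, whichSyl, stringsAsFactors = FALSE)

# Split each word on the literal hyphen, then pick the requested piece per row.
pieces <- strsplit(mydf$word, "-", fixed = TRUE)
mydf$syl <- mapply(`[`, pieces, mydf$whichSyl)   # `syl` is a hypothetical column name
print(mydf)

# Asking for a syllable that does not exist gives NA.
too_far <- mapply(`[`, strsplit("ta-co", "-", fixed = TRUE), 5)
print(too_far)
```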
Here I wrote a function that does one row at a time, and then uses lapply() to iterate over all rows and do.call(rbind()) to bind all of those responses together.
getSyl <- function(i){
strsplit(mydf$word[i], '-')[[1]][mydf$whichSyl[i]]
}
do.call(rbind, lapply(1:nrow(mydf), getSyl))
[,1]
[1,] "ma"
[2,] "cheese"
[3,] "ta"
We can use read.table and row/column indexing
read.table(text=mydf$word, sep="-", header=FALSE,
fill=TRUE)[cbind(1:nrow(mydf), mydf$whichSyl)]
#[1] "ma" "cheese" "ta"

finding potential duplications (spelling errors) in r [duplicate]

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check in a vector of strings if two of them are similar.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?
This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of it in different packages. Some solutions and packages can be found in the answers to these questions:
agrep: only return best match(es)
In R, how do I replace a string that contains a certain pattern with another string?
Fast Levenshtein distance in R?
But most often agrep will do what you want :
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4
Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
lapply(pres, agrep, pres, value = TRUE)
[[1]]
[1] " Obama, B." "Obama, B.H."
[[2]]
[1] "Bush, G.W."
[[3]]
[1] " Obama, B." "Obama, B.H."
[[4]]
[1] "Clinton, W.J."
Add another duplicate to show it works with more than one duplicate.
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")
adist shows the string distance between 2 character vectors
adist(" Obama, B.", pres)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 9 3 10 7
For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:
d <- adist(" Obama, B.", pres)
pres[which(d == min(d[d > 0]))]
# [1] "Obama, B.H."
To obtain unique names, taking into account spelling errors and inconsistencies, you can compare each string to all previous ones. Then if there is a similar one, remove it. I created a keepunique() function that performs this. keepunique() is then applied to all elements of the vector successively with Reduce().
keepunique <- function(previousones, x){
if(any(adist(x, previousones)<5)){
x <- NULL
}
return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B." "Bush, G.W." "Clinton, W.J."
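To review the candidate duplicates before dropping anything, the full pairwise distance matrix from adist() can be listed as pairs. This is only a sketch: the cutoff of 3 edits and the names d, near and pairs are assumptions chosen for this example, not values from the answers above.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.", "Bush, G.")

d <- adist(pres)                                  # all pairwise edit distances
# Keep each unordered pair once (upper triangle) when it is within the cutoff.
near <- which(d > 0 & d <= 3 & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(a = pres[near[, 1]],
                    b = pres[near[, 2]],
                    dist = d[near],
                    stringsAsFactors = FALSE)
print(pairs)
```

A fixed cutoff treats short and long names alike, so for real data it may be worth scaling the threshold by string length.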

Extracting Nouns and Verbs from Text

I was wondering whether it is possible to extract nouns and verbs separately with the R package openNLP.
I use the tagPOS function, which tags the sentence, but what should I do if I want to extract verbs and nouns separately?
An example (this extracts words tagged as /VBx, where x is any single character):
library("openNLP")
acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter."
acqTag <- tagPOS(acq)
sapply(strsplit(acqTag,"[[:punct:]]*/VB.?"),function(x) sub("(^.*\\s)(\\w+$)", "\\2", x))
[,1]
[1,] "said"
[2,] "sold"
[3,] "engaged"
[4,] "said"
[5,] "is"
[6,] "did"
[7,] " not/RB explain./NN Reuter./."
Ok, my regular expression needs some improvement in order to get rid of the last line in the result.
EDIT
An alternative could be to ignore rows containing a space character
sapply(strsplit(acqTag,"[[:punct:]]*/VB.?"),function(x) {res = sub("(^.*\\s)(\\w+$)", "\\2", x); res[!grepl("\\s",res)]} )
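Another way to avoid the stray last row is to split the tagged text into word/TAG tokens first and filter on the tag. The sketch below does not call openNLP; it uses a short hand-written string in the word/TAG format seen in the output above, and both that string and its tags are assumptions made for illustration.

```r
# Hand-written stand-in for the output of a POS tagger (word/TAG pairs).
tagged <- "The/DT company/NN said/VBD the/DT sale/NN is/VBZ subject/JJ to/TO adjustments/NNS"

tokens <- strsplit(tagged, " +")[[1]]
word <- sub("/[^/]*$", "", tokens)   # everything before the last slash
tag  <- sub("^.*/", "", tokens)      # everything after the last slash

verbs <- word[startsWith(tag, "VB")] # VB, VBD, VBZ, ...
nouns <- word[startsWith(tag, "NN")] # NN, NNS, NNP, ...
print(verbs)
print(nouns)
```

Splitting on the last slash keeps words that themselves contain a slash intact, and filtering by tag prefix returns nouns and verbs as two separate vectors.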
