String matching for names with jumbled up words in R

I am trying to match names in R such as VISHWANATHAN KRISHNA GURUVAYUR and GURUVAYUR KRISHNA VISHWANATHAN. After removing spaces, Levenshtein similarity gives only a 21% match.
I want to know if there is some string matching algorithm that could tag these two names as similar...
library(RecordLinkage)
levenshteinSim("GURUVAYURKRISHNAVISHWANATHAN","VISHWANATHANKRISHNAGURUVAYUR")
#[1] 0.2142857

Try the Jaro-Winkler algorithm, also from the RecordLinkage package.
In your case,
jarowinkler("GURUVAYURKRISHNAVISHWANATHAN","VISHWANATHANKRISHNAGURUVAYUR")
yields:
0.7063492
Jaro-Winkler similarity is always between 0 and 1, so 0.71 is a clear improvement over the 0.21 that Levenshtein gave.
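Not part of the original answer, but worth noting: since the two names contain exactly the same words in a different order, sorting the words before comparing gives a perfect score. A minimal sketch:
# sort the words of each name alphabetically, then compare as before
sort_words <- function(s) paste(sort(strsplit(s, " ")[[1]]), collapse = " ")
levenshteinSim(sort_words("VISHWANATHAN KRISHNA GURUVAYUR"),
               sort_words("GURUVAYUR KRISHNA VISHWANATHAN"))
# [1] 1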

Related

Correcting mis-typed data in R

I want to correct wrongly entered data in R. For example, if I have a vector
V=c('PO','PO','P0')
I want R to recognize that the 0 in the last entry should be an O and to change it. Is there any way to do that? I have been trying to use correctTypos in the deducorrect package in R, but I am having some problems with the editset: I cannot seem to specify that all the entries have to be letters. Any help greatly appreciated.
Another example would be
V2=c('PL','P1','PL','XX')
That 1 should be an L.
The Jaro-Winkler distance was developed to find data-entry issues, but on entries only two characters long it will struggle, as a single error scores higher than you would want it to. You could combine it with other distance measures available in the stringdist package, but in this case that might be too complicated.
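A hedged sketch of that idea, for completeness: snap each entry to the nearest code in a dictionary of known valid values with stringdist::amatch(). The dictionary valid below is invented for illustration; with two-character codes, several valid entries can be equally close, which is exactly the difficulty noted above.
library(stringdist)
valid <- c("PO", "XX")   # assumed dictionary of valid codes
V <- c("PO", "PO", "P0")
idx <- amatch(V, valid, method = "jw", maxDist = 0.4)
ifelse(is.na(idx), V, valid[idx])   # keep the original when nothing is close
# [1] "PO" "PO" "PO"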
Given your examples, you might want to use the base function chartr and set up a replacement of digits with letters.
chartr("01","OL", V2)
[1] "PL" "PL" "PL" "XX"
chartr("01","OL", V)
[1] "PO" "PO" "PO"
This will always replace a 1 with an L and a 0 (zero) with an O. You can add 5 for S, and so on, but if there are other combinations it might get complicated.
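As a small illustration (the extra 5-to-S substitution and the test vector are assumptions, not from the question):
chartr("015", "OLS", c("PO", "P0", "P1", "5X"))
# [1] "PO" "PO" "PL" "SX"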
Also note that the next iteration of the deducorrect package is the deductive package.

How to output a string of a specific length using agrep in R

I have a load of DNA sequences.
I want to match a part of the sequence and return the match up to a specific length.
The dataframe df has the columns V1 and V2:
>chr1:61695-62229 aattccaagagtattattgcaccaaaaggcatggacttaaaattcttgatacatgatttcaaaatattttctttaaggtttgaatcagtctatattccctccagcagcgtataaaagtgccaatttctctgatccttagccagtttgggtaataataattgtaaaacttttttttctttttttttgagacagagtctccctctgtcgccaggctgaagtgcagtggcgcaatctcggctcactgcaacctccgcctcccggggtcaagctattctcctgcctcagcctcccaagtagctgggactacaggcatgcaccaccatgcccagctaatttttgttatttttagtagagatggagtttccccatgttggacaggatggtctcgatctcttgacctcgtgatccaccctcctcggcctcccaaagtgctgggataacaggcgtgaacaaccatgcccggcctgtaaaactttttcctaatttaacagaaaaataatagtattatattttatcatatttctttgatttcta
>chr1:101718-102194 taaaaataaatgtattaagtatgaacaacaaaaaagctagtaaaggttgaacaacaactatccttaggaaagtggaaataatgtattaataaatatgaaagcaggctagccacggtgactcacatctgtaatcccagcactttgggaggctgaggcaggcagatcacctgaggtcaggagttccagaccagcctggccaacatggtgaaatcttgtctctcctacaaatacaaaaactagccaggcttggttgtgcactcctgtaattcgagctacttgggaggctgaggcaggagaatctcttgaacctgagaggcagaggttgcagtgagccaagatcatgccactgcactccagctggggcaacagagtgacactccatctcaaaataaataaataagaaagcagaaactaataaactagaaaacagaaacatagaactaatttataaatcaaagcactatgccttgaaaaga
I have used agrep to get the match.
RepeatAlusSequencesdfMatch <- RepeatAlusSequencesdf[agrep("aacctcaaagactggcctca", RepeatAlusSequencesdf[,2],ignore.case = TRUE, max.distance = 0.3), ]
but I also want to return a stretch of 146 characters from the end of the match. At the moment it is giving me the entire sequence, which I can't use.
I don't think you can achieve what you are trying to do here with agrep. If the DNA sequence you are trying to fuzzy-match has a predictable number and positions of nucleotide insertions/deletions/substitutions, then just use one (or several) regular expressions with capture groups to extract what you need.
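A hedged alternative sketch, not part of the original answer: base R's aregexec() returns the position and length of a fuzzy match, so you can cut out a fixed-length window after the match yourself. The pattern and the 146-character window come from the question; everything else is assumed.
seqs <- RepeatAlusSequencesdf[, 2]   # the sequence column
m <- aregexec("aacctcaaagactggcctca", seqs,
              max.distance = 0.3, ignore.case = TRUE)
# end position of each match (assumes every sequence matched; -1 means no match)
ends <- sapply(m, function(p) p[1] + attr(p, "match.length")[1] - 1)
substr(seqs, ends + 1, ends + 146)   # the 146 characters after each match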
If the differences are not predictable and you truly need a fuzzy match, you could use a brute-force algorithm like the following:
1. Break up each of your DNA strings at random positions so as to produce different 146-nucleotide-long sequences.
2. Using a "threshold" max.distance, run agrep on the resulting sequences and select those that match.
3. From the above set, select the best matches by running agrep with successively smaller max.distance values until you get the one(s) you are interested in.
The more distinct 146-nucleotide sequences you select, the more accurate the result will be. If you want to find the best match, use an exhaustive search: walk each DNA sequence from the beginning, break it into 146-nucleotide pieces and run the above algorithm, then drop one nucleotide from the beginning, select the next partitions, and so on until the end; a rough sketch follows.
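A sketch of the exhaustive windowed variant (the helper name best_windows, the window width, and the max.distance value are assumptions for illustration):
# slide a 146-nt window across one sequence and keep the windows that
# fuzzy-match the pattern at the given max.distance
best_windows <- function(dna, pattern, width = 146, max.distance = 0.3) {
  starts <- seq_len(max(nchar(dna) - width + 1, 1))
  windows <- substring(dna, starts, starts + width - 1)
  windows[agrep(pattern, windows, ignore.case = TRUE,
                max.distance = max.distance)]
}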
Hope this helps or gives better ideas.

How can I compare two strings to find the number of characters that match in R, using substitution distance?

In R, I have two character vectors, a and b.
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
I want a function that counts the character mismatches between each element of a and the corresponding element of b. Using the example above, such a function should return c(2,3,1). There is no need to align the strings.
I need to compare each pair of strings character-by-character and count matches and/or mismatches in each pair. Does any such function exist in R?
Or, to ask the question in another way, is there a function to give me the edit distance between two strings, where the only allowed operation is substitution (ignore insertions or deletions)?
Using some mapply fun:
# split each string into characters and count positionwise mismatches
# (this assumes each pair of strings has equal length)
mapply(function(x,y) sum(x!=y), strsplit(a,""), strsplit(b,""))
#[1] 2 3 1
Another option is to use adist, which computes the approximate string distance between character vectors. Note that adist allows insertions and deletions too; it agrees with the pure substitution count for these vectors, but in general it can report less (e.g. adist("abc", "bca") is 2, while the positionwise mismatch count is 3):
mapply(adist, a, b)
#    abcdefg  hijklmnop qrstuvwxyz
#          2          3          1
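If you truly want substitutions only, the stringdist package (not mentioned in the answers above, so treat this as an addition) has a "hamming" method that forbids insertions and deletions and returns Inf for strings of unequal length:
library(stringdist)
stringdist(a, b, method = "hamming")
# [1] 2 3 1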

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. "Barack Obama" is much more probable than "Obama Barack"; however, the problem I am trying to solve is more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to approach your question is to shape the data into a 2x2 contingency table:
            obama   not obama
barack        A         B
not barack    C         D
and score all occurring bigrams against the matrix. That way you can, for instance, use a simple chi-squared test.
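For instance (the counts below are invented purely for illustration):
# 2x2 co-occurrence table: rows = barack / not barack,
# columns = obama / not obama
tab <- matrix(c(40, 5, 10, 945), nrow = 2,
              dimnames = list(c("barack", "not barack"),
                              c("obama", "not obama")))
chisq.test(tab)   # a large statistic suggests the words co-occur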
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of the vector is 1 if word[i] is either the jth word or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
if (text[j] == word[i] OR text[j+1] == word[i])
X[i][j] = 1;
else
X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by their lengths (each length is the square root of the number of 1-entries in the vector, which is roughly the number of appearances of the word, or up to twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a] and word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of nearness; for example, we could have chosen to use 3-word (or 4-, 5-, etc.) blocks instead. One can also add weights, and probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all. An R sketch of the basic scheme is below.
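Here is a hedged R sketch of the basic adjacency scheme (the function names are mine, not from the answer):
# entry j is 1 when `word` occupies the jth or (j+1)th slot of `text`
adjacency_vec <- function(text, word) {
  n <- length(text)
  as.integer(text[-n] == word | text[-1] == word)
}
# dot product divided by the product of vector lengths, as described above
cor_words <- function(text, a, b) {
  xa <- adjacency_vec(text, a)
  xb <- adjacency_vec(text, b)
  sum(xa * xb) / (sqrt(sum(xa)) * sqrt(sum(xb)))
}
text <- c("barack", "obama", "is", "the", "democratic", "candidate")
cor_words(text, "barack", "obama")
# [1] 0.7071068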
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
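A quick base-R sketch of tallying adjacent pairs (mine, not part of the answer):
s <- "Barack Obama is the Democratic candidate for President"
words <- tolower(strsplit(s, " ")[[1]])
# pair each word with its successor, then tabulate
bigrams <- paste(head(words, -1), tail(words, -1))
table(bigrams)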

R - Longest common substring

Does anyone know of an R package that solves the longest common substring problem? I am looking for something fast that could work on vectors.
Check out the "Rlibstree" package on the Omegahat GitHub.
It uses http://www.icir.org/christian/libstree/.
You should look at the LCS function of the qualV package. It is implemented in C and therefore quite efficient.
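If you prefer to avoid dependencies, here is a plain base-R sketch (mine, not from any of the answers) of the classic dynamic-programming solution to the longest common substring problem; it is O(nm) per pair, so it will be slow compared to the packages above:
lcsubstr <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  m <- matrix(0L, length(x) + 1, length(y) + 1)  # lengths of common suffixes
  best <- 0L; end <- 0L
  for (i in seq_along(x)) for (j in seq_along(y)) {
    if (x[i] == y[j]) {
      m[i + 1, j + 1] <- m[i, j] + 1L
      if (m[i + 1, j + 1] > best) {
        best <- m[i + 1, j + 1]
        end <- i
      }
    }
  }
  substr(a, end - best + 1, end)
}
lcsubstr("GURUVAYURKRISHNA", "KRISHNAGURUVAYUR")
# [1] "GURUVAYUR"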
The question here is not totally clear on the intended application of the solution to the longest common substring problem. A common application that I encounter is matching between names in different datasets. The stringdist package has a useful function amatch() which I find suitable for this task.
In brief, amatch() takes as input two vectors: the first, x, is the vector of strings that you want to find matches for (this can also be a single string); the second, table, is the vector of strings you want to compare against, choosing the match with the longest common substring. amatch() then returns a vector whose length equals that of x; each element of this result is the index in table of the best match.
Details: amatch() takes a method argument, which you specify as lcs if you want matching on the longest common substring. There are many other options for different string-matching techniques (e.g. Levenshtein distance). There is also a mandatory maxDist argument: if all strings in table are at a greater "distance" than maxDist from a given string in x, then amatch() returns NA for that element of its output. "Distance" is defined differently depending on the string-matching algorithm you choose; for lcs, it (more or less) means how many unmatched characters there are. See the documentation for details.
Parallelization: another nice feature of amatch() is that it will automatically parallelize the operation for you, making reasonable guesses about system resources to use. If you want more control over this, you can adjust the nthread argument.
Example application:
library(stringdist)
Names1 = c(
"SILVER EAGLE REFINING, INC. (SW)",
"ANTELOPE REFINING",
"ANTELOPE REFINING (DOUGLAS FACILITY)"
)
Names2 = c(
"Mobile Concrete, Inc.",
"Antelope Refining, LLC. ",
"Silver Eagle Refining Inc."
)
Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)
Match_Idx
# [1] 3 2 2
Matches = data.frame(Names1 = tolower(Names1), Match = tolower(Names2)[Match_Idx])
Matches
#                                 Names1                       Match
# 1     silver eagle refining, inc. (sw)  silver eagle refining inc.
# 2                    antelope refining     antelope refining, llc.
# 3 antelope refining (douglas facility)     antelope refining, llc.
### Compare Matches:
Matches$Distance = stringdist(Matches$Names1, Matches$Match, method = 'lcs')
Also, unlike functions such as LCS from qualV, this will not consider "subsequence" matches that skip intermediate characters in order to form a match. For instance, see this:
Names1 = c(
"hello"
)
Names2 = c(
"hel123l5678o",
"hell"
)
Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)
Matches = data.frame(Names1, Match = Names2[Match_Idx])
Matches
#   Names1 Match
# 1  hello  hell
I don't know R, but I used to implement Hirschberg's algorithm, which is fast and doesn't consume much space. As I remember, it is only two or three short, recursively called functions.
Here is a link:
http://wordaligned.org/articles/longest-common-subsequence
So don't hesitate to implement it in R; it is worth the effort, since it is a very interesting algorithm. (Note that Hirschberg's algorithm solves the longest common subsequence problem, a relative of the substring problem asked about here.)
