How to normalise Levenshtein distance for maximum alignment length rather than for string length?

Problem:
A few R packages feature Levenshtein distance implementations for computing the similarity of two strings, e.g. http://finzi.psych.upenn.edu/R/library/RecordLinkage/html/strcmp.html.
The distances computed can easily be normalised for string length, e.g. by dividing the Levenshtein distance by the length of the longest string involved or by dividing it by the mean of the lengths of the two strings.
For some applications in linguistics (e.g. dialectometry and receptive multilingualism research), however, it is recommended that the raw Levenshtein distance be normalised for the length of the longest least-cost alignment (Heeringa, 2004: 130-132).
This tends to produce distance measures that make more sense from a perceptual-linguistic point of view.
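For concreteness, the conventional string-length normalisation described above is easy to do with base R's adist(); here is a minimal sketch (the function name normLevString is just an illustrative choice):
normLevString <- function(a, b) {
  # raw Levenshtein distance divided by the length of the longer string
  adist(a, b) / outer(nchar(a), nchar(b), pmax)
}
normLevString("tsYklUs", "sYkEl")  # 4 / 7 = 0.5714286 for the example below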
Example:
The German string "tsYklUs" (Zyklus = cycle) can be converted into its Swedish cognate "sYkEl" (cykel = (bi)cycle) in a 7-slot alignment with two insertions (I) and two substitutions (S) for a total transformation cost of 4.
Normalised Levenshtein distance: 4/7
(A)
t--s--Y--k--l--U--s
---s--Y--k--E--l---
===================
I-----------S--S--I = 4
It is also possible to convert the strings in an 8-slot alignment with 3 insertions (I) and 1 deletion (D), also for a total alignment cost of 4.
Normalised Levenshtein distance: 4/8
(B)
t--s--Y--k-----l--U--s
---s--Y--k--E--l------
======================
I-----------D-----I--I = 4
The latter alignment makes more sense linguistically, because it aligns the [l]-phonemes with each other rather than with the [E] and [U] vowels.
Question:
Does anyone know of any R function that would allow me to normalise Levenshtein distances for the longest least-cost alignment rather than for string length proper?
Thanks for your input!
Reference:
W.J. Heeringa (2004), Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, University of Groningen. http://www.let.rug.nl/~heeringa/dialectology/thesis/
Edit - Solution: I think I figured out a solution. The adist function can return the alignment and seems to default to the longest least-cost alignment. To take up the example above, here is the alignment associated with converting sykel to tsyklus:
> attr(adist("sykel", "tsyklus", counts = TRUE), "trafos")
[,1]
[1,] "IMMMDMII"
To compute length-normalised distances as recommended by Heeringa (2004), we can write a modest function:
normLev.fnc <- function(a, b) {
  # raw Levenshtein distance divided by the length of the least-cost
  # alignment, which adist() returns in the "trafos" attribute
  drop(adist(a, b) / nchar(attr(adist(a, b, counts = TRUE), "trafos")))
}
For the example above, this returns
> normLev.fnc("sykel", "tsyklus")
[1] 0.5
This function also returns the correct normalised distances for Heeringa's (2004: 131) examples:
> normLev.fnc("bine", "bEi")
[1] 0.6
> normLev.fnc("kaninçen", "konEin")
[1] 0.5555556
> normLev.fnc("kenEeri", "kenArje")
[1] 0.5
To compare several pairs of strings:
> L1 <- c("bine", "kaninçen", "kenEeri")
> L2 <- c("bEi", "konEin", "kenArje")
> diag(normLev.fnc(L1, L2))
[1] 0.6000000 0.5555556 0.5000000
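As an aside, for long vectors it can be wasteful to build the whole pairwise matrix only to keep its diagonal; an element-wise sketch with mapply() gives the same numbers:
# element-wise: compares L1[i] with L2[i] instead of all pairs
mapply(normLev.fnc, L1, L2)
# returns 0.6000000 0.5555556 0.5000000 (named by the elements of L1)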

In case any linguists stumble upon this post, I'd like to point out that the algorithms provided by the RecordLinkage package are not necessarily optimal for comparing non-ASCII strings, e.g.:
> levenshteinSim("väg", "way")
[1] -0.3333333
> levenshteinDist("väg", "way")
[1] 4
> levenshteinDist("väg", "wäy")
[1] 2
> levenshteinDist("väg", "wüy")
[1] 3

Related

Best similarity distance metric for two strings

I have a bunch of company names to match, for example, I want to match this string: A&A PRECISION
with A&A PRECISION ENGINEERING
However, almost every similarity measure I have tried (Hamming distance, Levenshtein distance, Restricted Damerau-Levenshtein distance, Full Damerau-Levenshtein distance, Longest Common Substring distance, Q-gram distance, cosine distance, Jaccard distance, Jaro, and Jaro-Winkler distance)
matches B&B PRECISION instead.
Any idea which metric would place more emphasis on the exactness and order of the matched substrings and care less about overall string length? I suspect it is the difference in string length that makes these metrics choose wrongly.
If you really want to "...place more emphasis on the exactness and order of the matched substrings...", then this approach could work, as it tests whether one string is a substring of another:
library(data.table)
x <- c("A&A PRECISION", "A&A PRECISION ENGINEERING", "B&B PRECISION")
y <- x
We want to expand the grid. For that I'd use the CJ function in data.table. Then, we will check each pair and see if x is a substring of y (this doesn't work the other way round):
CJ(x, y)[, similarity := apply(.SD, 1, function(x) x[2] %like% x[1]), .SDcols = c("x", "y")][x != y, ]
x y similarity
1: A&A PRECISION A&A PRECISION ENGINEERING TRUE
2: A&A PRECISION B&B PRECISION FALSE
3: A&A PRECISION ENGINEERING A&A PRECISION FALSE
4: A&A PRECISION ENGINEERING B&B PRECISION FALSE
5: B&B PRECISION A&A PRECISION FALSE
6: B&B PRECISION A&A PRECISION ENGINEERING FALSE
Please keep in mind that you'll need to make sure that the strings are as neat as possible for this to work, and even then it might fail.
There are some things I'd check to clean your strings:
Remove multiple spaces
Remove spaces at the beginning / end of the string
Ensure the same encoding
Ensure the same case
You can achieve that with the stringi package.
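A minimal cleaning sketch along those lines (the helper name clean_name and the Latin-ASCII transliteration rule are just illustrative choices):
library(stringi)

clean_name <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")    # harmonise encoding by stripping diacritics
  x <- stri_replace_all_regex(x, "\\s+", " ")  # collapse multiple spaces
  x <- stri_trim_both(x)                       # drop leading/trailing spaces
  stri_trans_toupper(x)                        # force a single case
}

clean_name("  a&a   Precision ")
# [1] "A&A PRECISION"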

How to calculate a sum over a term including a rising factorial?

I am new to programming and R and would like to compute the following sum: sum_{i=0}^{k} (-1)^i * choose(k, i) * poch(-i*sigma, n), where poch(x, n) = x*(x+1)*...*(x+n-1) is the rising factorial (Pochhammer symbol).
I used pochMpfr from the Rmpfr package for the rising factorial and a for loop in order to compute the sum.
B <- rep(1, k + 1)
for (i in 0:k) {
  B[i + 1] <- (-1)^i * choose(k, i) * pochMpfr((-i) * sigma, n)
}
sum(B)
Doing so, I get the results as a list (each element an mpfr object) and thus cannot compute the sum.
Is there a way to get the results directly as a matrix, or to convert the list to a vector containing only the relevant elements?
The solution is probably quite easy, but I haven't found it while looking through the forums.
There is no need to use a for loop; this should work:
library(Rmpfr)
# You do not define these in your question,
# so I just take some arbitrary values
k <- 10
n <- 3
sigma <- 0.3
i <- 0:k
B <- (-1)^i *choose(k,i)*pochMpfr((-i)*sigma, n)
sum(B)
## 1 'mpfr' number of precision 159 bits
## [1] 6.2977401071861993597462780570563107354142915151e-14
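If a plain numeric vector is needed afterwards (e.g. for functions that do not understand mpfr objects), the result can, as far as I know, be converted back with Rmpfr's asNumeric(), at the cost of dropping the extra precision:
B.num <- asNumeric(B)  # back to ordinary doubles; extra precision is lost
sum(B.num)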

How many unique powers x^y are there for x in 1:1000 and y in 1:1000, using R?

Using R, for integers x and y ∈ [1, 1000], calculate how many unique powers x^y exist.
This is what I have right now; I just don't know how to eliminate the duplicate numbers:
x <- 1:1000
y <- 1:1000
for (i in x) {
  for (j in y) {
    print(i^j)
  }
}
A combinatorial approach to this could split the numbers from 1-1000 into equivalence classes where each number in the class is the power of some other number. For instance, we would split the numbers 1-10 into (1), (2, 4, 8), (3, 9), (5), (6), (7), (10). None of the powers of values between equivalence classes will coincide, so we can just handle each equivalence class separately.
num.unique.comb <- function(limit) {
  # Count the number of powers in each equivalence class (labeled by its lowest value)
  num.powers <- rep(0, limit)
  # Handle 1 as a special case
  num.powers[1] <- 1
  # Beyond sqrt(limit), every unhandled number is in its own equivalence class
  handled <- c(TRUE, rep(FALSE, limit - 1))
  for (base in 2:ceiling(sqrt(limit))) {
    if (!handled[base]) {
      # Mark all the values in 1:limit that are powers of base as handled
      num.handle <- floor(log(limit, base))
      handled[base^(1:num.handle)] <- TRUE
      # Count the distinct exponent products, i.e. the distinct powers this class covers
      num.powers[base] <- length(unique(as.vector(outer(1:num.handle, 1:limit))))
    }
  }
  num.powers[!handled] <- limit
  # Handle sums too big for standard numeric types
  library(gmp)
  print(sum(as.bigz(num.powers)))
}
num.unique.comb(10)
# [1] 76
num.unique.comb(1000)
# [1] 978318
One nice property of this combinatorial approach is that it's very fast compared to a brute-force approach. For instance, it takes less than 0.1 seconds to compute with limit set to 1000. This allows us to compute the result for much larger values:
# ~0.15 seconds
num.unique.comb(10000)
# [1] 99357483
# ~4 seconds
num.unique.comb(100000)
# [1] 9981335940
# ~220 seconds
num.unique.comb(1000000)
# [1] 999439867182
This is a pretty neat result -- in under 4 minutes we can compute the number of unique values within 1 trillion numbers, where each number can have up to 6 million digits!
Update: Based on this combinatorial code I've updated the OEIS entry for this sequence to include terms up to 10,000.
A brute-force approach would be to just compute all the powers and count the number of unique values:
num.unique.bf <- function(limit) {
  length(unique(as.vector(sapply(1:limit, function(x) x^(1:limit)))))
}
num.unique.bf(10)
# [1] 76
A problem with this brute-force analysis is that you are dealing with large numbers that will create numerical issues. For instance:
1000^1000
# [1] Inf
As a result we get an inaccurate value:
# Wrong due to numerical issues!
num.unique.bf(1000)
# [1] 119117
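The underlying limit is the largest finite double, roughly 1.8e308; anything bigger overflows to Inf, so distinct huge powers become indistinguishable:
.Machine$double.xmax
# [1] 1.797693e+308
10^309
# [1] Inf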
However, a package like gmp enables us to compute numbers even as large as 1000^1000. My computer has trouble storing all 1 million numbers in memory at once, so I'll write them to a file (about 1.2 GB for n=1000 on my computer) and then count the number of unique values in that file:
library(gmp)
num.unique.bf2 <- function(limit) {
  sink("foo.txt")
  for (x in 1:limit) {
    vals <- as.bigz(x)^(1:limit)
    for (idx in 1:limit) {
      cat(paste0(as.character(vals[idx]), "\n"))
    }
  }
  sink()
  as.numeric(system("sort foo.txt | uniq | wc -l", intern = TRUE))
}
num.unique.bf2(10)
# [1] 76
num.unique.bf2(1000)
# [1] 978318
A quick check against the OEIS entry for this sequence (which lists the first 1000 values) shows that this is correct. This approach is rather slow (roughly 40 minutes on my computer); combinatorial approaches should be significantly faster.

How to work with large numbers in R?

I would like to change the precision of a calculation in R. For example, I would like to calculate x^6 with x = c(-2.5e+59, -5.6e+60). To calculate it I would need to change the precision R works with, otherwise the result is Inf, and I don't know how to do that.
As Livius points out in his comment, this is an issue with R (and, in fact, most programming languages) and with how floating-point numbers are represented in binary.
To work with extremely large/small floating point numbers, you can use the Rmpfr library:
install.packages("Rmpfr")
library("Rmpfr")
x <- c(-2.5e+59, -5.6e+60)
y <- mpfr(x, 6) # the second number is how many precision **bits** you want - NB: not decimal places!
y^6
# 2 'mpfr' numbers of precision 6 bits
# [1] 2.50e356 3.14e364
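As a rough rule of thumb (an assumption about how much precision you actually need, not a requirement of the package), one decimal digit costs about log2(10) ≈ 3.3 bits, so asking for, say, 20 significant digits would look like this:
library(Rmpfr)
x <- c(-2.5e+59, -5.6e+60)
y20 <- mpfr(x, 70)  # ~70 bits, roughly 21 significant decimal digits
y20^6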
To work with numbers that are even larger than R can handle (e.g. exp(1800)) you can use the "Brobdingnag" package:
install.packages("Brobdingnag")
library("Brobdingnag")
## An example of a single number too large for R:
10^1000.7
# [1] Inf
## Now using the Brobdingnag package:
10^as.brob(1000.7)
# [1] +exp(2304.2)

Why can cosine similarity between two vectors be negative?

I have two vectors with 11 dimensions.
a <- c(-0.012813841, -0.024518383, -0.002765056, 0.079496744, 0.063928973,
0.476156960, 0.122111977, 0.322930189, 0.400701256, 0.454048860,
0.525526219)
b <- c(0.64175768, 0.54625694, 0.40728261, 0.24819750, 0.09406221,
0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
-0.07215885)
cosine_sim <- cosine(a,b)
which returns:
-0.05397935
I used cosine() from the lsa package.
For some values I am getting a negative cosine_sim like the one above. I am not sure how the similarity can be negative; it should be between 0 and 1.
Can anyone explain what is going on here?
The nice thing about R is that you can often dig into the functions and see for yourself what is going on. If you type cosine (without any parentheses, arguments, etc.) then R prints out the body of the function. Poking through it (which takes some practice), you can see that there is a bunch of machinery for computing the pairwise similarities of the columns of a matrix (i.e., the bit wrapped in the if (is.matrix(x) && is.null(y)) condition), but the key line of the function is
crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))
Let's pull this out and apply it to your example:
> crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
[,1]
[1,] -0.05397935
> crossprod(a)
[,1]
[1,] 1
> crossprod(b)
[,1]
[1,] 1
So, you're using vectors that are already normalized, so you just have crossprod to look at. In your case this is equivalent to
> sum(a*b)
[1] -0.05397935
(for real matrix operations, crossprod is much more efficient than constructing the equivalent operation by hand).
As @Jack Maney's answer says, the dot product of two vectors (which is |a| * |b| * cos(a, b), where |.| is the vector norm, not R's length()) can be negative ...
For what it's worth, I suspect that the cosine function in lsa might be more easily/efficiently implemented for matrix arguments as as.dist(crossprod(x)) ...
edit: in comments on a now-deleted answer below, I suggested that the square of the cosine-distance measure might be appropriate if one wants a similarity measure on [0,1] -- this would be analogous to using the coefficient of determination (r^2) rather than the correlation coefficient (r) -- but that it might also be worth going back and thinking more carefully about the purpose/meaning of the similarity measures to be used ...
The cosine function returns
crossprod(a, b)/sqrt(crossprod(a) * crossprod(b))
In this case, both the terms in the denominator are 1, but crossprod(a, b) is -0.05.
The cosine function can take on negative values.
While the cosine of two vectors can take any value between -1 and +1, cosine similarity (in document retrieval) is usually taken to lie in the [0, 1] interval. The reason is simple: in the word-by-document matrix there are no negative values, so the maximum angle between two vectors is 90 degrees, for which the cosine is 0.
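A tiny numeric illustration of that point: with non-negative entries (as in a term-document matrix) the dot product, and hence the cosine, cannot be negative:
p <- c(1, 0, 2)   # term counts for "document" 1
q <- c(0, 3, 1)   # term counts for "document" 2
sum(p * q) / sqrt(sum(p^2) * sum(q^2))
# [1] 0.2828427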
