Infer optimum threshold values for normalized Levenshtein distance and trigram similarity functions

Records from two datasets are compared for fuzzy-match string similarity using a normalized Levenshtein distance function and a trigram similarity function. Four different similarity metrics are calculated:
LevCmpSimilarity - normalized Levenshtein similarity for the compared composite (concatenated) fields,
LevWghSimilarity - normalized Levenshtein similarity summed over all individual fields being compared,
TrgWgh and TrgCmp - the same as the two above, but with the trigram similarity function instead of Levenshtein.
Below are histograms of all four metrics, showing absolute and cumulative frequencies.
[Histograms: absolute frequencies]
[Histograms: cumulative frequencies]
My question is: could these frequency distribution patterns be used for automatic, unsupervised determination of optimum threshold values for record-matching acceptance/rejection? If the answer is yes, can you suggest a direction?
Basically, can the frequency patterns of Levenshtein distance and trigram similarity values alone be used to infer optimum threshold values for fuzzy-match record linkage?
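One way to make the idea concrete (my own sketch, not part of the question): if the score distribution is bimodal, with non-matches clustered at low similarity and true matches at high similarity, the valley between the two modes is a natural unsupervised threshold candidate. Assuming simulated similarity scores in [0, 1]:
# find candidate thresholds as local minima of a kernel density estimate
scores <- c(rbeta(5000, 2, 8), rbeta(500, 9, 2))    # simulated similarity scores
d <- density(scores, from = 0, to = 1)
valleys <- d$x[which(diff(sign(diff(d$y))) == 2) + 1]
valleys                                              # candidate cut-off values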

Related

Computing Jaccard index of similarity on rasters

I want to compute the Jaccard index of similarity based on continuous quantities. I found that the vegan package can compute the Jaccard index for the continuous case, derived from the Bray-Curtis measure, through the function vegdist. I was able to do it by choosing a number of sites randomly, computing the Jaccard index between all pairs of the chosen sites, and then taking the average. This procedure takes a lot of time, especially since I have many scenarios to treat. I wonder if there is a way to do it using the rasters directly (using all the non-NA pixels), without using binary maps, in a feasible amount of time.
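A minimal sketch (my own, not from the question) of one possible shortcut: compare all non-NA pixels of two aligned rasters in a single vegdist call, which derives the quantitative Jaccard index from Bray-Curtis. The toy rasters here are placeholders.
library(raster)
library(vegan)
r1 <- raster(matrix(runif(100), 10, 10))             # replace with real rasters
r2 <- raster(matrix(runif(100), 10, 10))
v1 <- getValues(r1)
v2 <- getValues(r2)
keep <- !is.na(v1) & !is.na(v2)                      # pixels valid in both layers
vegdist(rbind(v1[keep], v2[keep]), method = "jaccard")   # quantitative Jaccard dissimilarity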

Deciding the threshold value in string similarity

When calculating the similarity between two strings, what is the best method of deciding the threshold value?
For example, for the Jaccard coefficient, how can we decide its best threshold value?
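For concreteness, a small sketch (my own, not from the question) of the quantity being thresholded, here the Jaccard coefficient between the character-trigram sets of two strings:
trigrams <- function(s) substring(tolower(s), 1:(nchar(s) - 2), 3:nchar(s))
jaccard <- function(a, b) {
  A <- unique(trigrams(a)); B <- unique(trigrams(b))
  length(intersect(A, B)) / length(union(A, B))
}
jaccard("levenshtein", "levenstein")   # a value in [0, 1] to compare against a threshold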

Substitution matrix based on spatial autocorrelation transformation

I would like to measure the Hamming sequence similarity in which the substitution costs are not based on the substitution rates in the observed sequences but on the spatial autocorrelation of the different states within the study area (the states are thus not related to DNA but to something else).
I divided my study area into grid cells of equal size (e.g. 1000 m) and measured how often the same "state" is observed in a neighboring cell (rook case). Consequently, the weight matrix indicates that moving from state A to A (staying within the same state) has a much higher probability than going from A to B, B to C, or A to C. This already indicates that the states have a high spatial autocorrelation.
The problem is that if you want to measure sequence similarity, the substitution matrix should be 0 on the diagonal. Therefore I was wondering whether there is some kind of transformation to go from an "autocorrelation matrix" to a substitution matrix with 0 values along the diagonal. By means of this we would like to account for spatial autocorrelation in the study area in our sequence similarity measure. For my analysis I am using the package TraMineR.
Example matrix in R for sequences consisting of four states (A, B, C, D):
Sequence example: AAAAAABBBBCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDAAAAAAAAA
Autocorrelation matrix:
A = c(17.50,3.00,1.00,0.05)
B = c(3.00,10.00,2.00,1.00)
C = c(1.00,2.00,30.00,3.00)
D = c(0.05,1.00,3.00,20.00)
subm = rbind(A,B,C,D)
colnames(subm) = c("A","B","C","D")
How can I transform this matrix into a substitution matrix?
First, TraMineR computes the Hamming distance, i.e., a dissimilarity, not a similarity.
The simple Hamming distance is just the count of mismatches between two sequences. For example, the Hamming distance between AABBCC and ABBBAC is 2, and between AAAAAA and AAAAAA it is 0 since there are no mismatches.
The generalized Hamming distance allows weighting the mismatches (not the matches!) with substitution costs. For example, if the substitution cost between A and B is 1.5 and that between A and C is 2, then the distance between the first two sequences would be the weighted sum of mismatches, i.e., 3.5. It would still be zero between a sequence and itself.
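An illustrative sketch (my own, not from the answer) of both computations for the toy sequences above; the cost values are the hypothetical ones from the example:
s1 <- strsplit("AABBCC", "")[[1]]
s2 <- strsplit("ABBBAC", "")[[1]]
sum(s1 != s2)                          # plain Hamming distance: 2
cost <- matrix(0, 3, 3, dimnames = list(c("A","B","C"), c("A","B","C")))
cost["A","B"] <- cost["B","A"] <- 1.5  # hypothetical substitution costs
cost["A","C"] <- cost["C","A"] <- 2
sum(cost[cbind(s1, s2)])               # generalized Hamming distance: 3.5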
From what I understand, the matrix you show is not the matrix of substitution costs. It is the matrix of what you call 'spatial autocorrelations', and you are looking for a way to turn this information into substitution costs.
The idea is to assign a high substitution cost (mismatch weight) when the autocorrelation (a rate in your case) is low, i.e., when there is a low probability of finding, say, state B in the neighborhood of state A, and to assign a low substitution cost when the probability is high. Since your probability matrix is symmetric, a simple solution is to use $1 - p(A|B)$ for all off-diagonal terms and to leave 0 on the diagonal for the reason explained above.
sm <- 1 - subm/100   # turn the rates (treated here as percentages) into costs in [0, 1]
diag(sm) <- 0        # matches must cost nothing
sm
For non-symmetric probabilities, you could use a formula similar to the one used for deriving costs from transition rates, i.e., $2 - p(A|B) - p(B|A)$.
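As a sketch (my own) of that asymmetric variant, again treating the matrix entries as percentages:
p <- subm / 100
sm2 <- 2 - p - t(p)   # cost is high when both directed probabilities are low
diag(sm2) <- 0
sm2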

How does the command dist(x, method="binary") calculate the distance matrix?

I have been trying to figure that out, but without much success. I am working with a table of binary data (0s and 1s). I managed to compute a distance matrix from my data using the R function dist(x, method="binary"), but I am not quite sure how exactly this function computes the distance matrix. Is it using the Jaccard coefficient J = M11/(M10 + M01 + M11)?
This is easily found in the help page ?dist:
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
[...]
binary: (aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
This is equivalent to the Jaccard distance as described in Wikipedia:
An alternate interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union.
In your notation, it is 1 - J = (M01 + M10)/(M01 + M10 + M11).
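A quick check of this equivalence (the example vectors are my own):
x <- rbind(c(1, 1, 0, 1, 0),
           c(1, 0, 0, 1, 1))
dist(x, method = "binary")                 # 0.5
M11 <- sum(x[1, ] == 1 & x[2, ] == 1)
M10 <- sum(x[1, ] == 1 & x[2, ] == 0)
M01 <- sum(x[1, ] == 0 & x[2, ] == 1)
(M01 + M10) / (M01 + M10 + M11)            # also 0.5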

Similarity of sets (considering also the set sizes)

I know that you can use the Jaccard index / distance to measure the similarity / distance of two sets. However, I am looking for some way to scale the raw Jaccard values with respect to the sizes of the sets. For example, I want two large sets with a significant overlap to get a higher similarity than two small sets.
Of course, I could simply divide the value of the Jaccard distance by the size of the union of both sets, but is there a standard scaling scheme for that purpose?
