Deciding the threshold value in string similarity - information-retrieval

When calculating the similarity between two strings, what is the best method of deciding the threshold value? For example, with the Jaccard coefficient, how can we decide its best threshold value?

Related

How to find the optimal feature count with the mRMRe package?

I am trying to use the mRMRe package in R to do feature selection on a gene expression dataset. I have RNA-seq data containing over 10K genes, and I would like to find the optimal features to fit the classification model. I am wondering how to find the optimal feature count. Here is my code:
# Rank features with an mRMR ensemble; the target is column 1 of Xdata
mrEnsemble <- mRMR.ensemble(data = Xdata, target_indices = c(1),
                            feature_count = 100, solution_count = 1)
# Map the selected feature indices back to the gene names
mrEnsemble_genes <- as.data.frame(apply(solutions(mrEnsemble)[[1]], 2,
                                        function(x, y) y[x],
                                        y = featureNames(Xdata)))
View(mrEnsemble_genes)
I just set feature_count = 100, but I am wondering how to find the optimal feature count for classification without fixing the number in advance.
The result after extracting mrEnsemble_genes will be a list of genes like:
gene05
gene08
gene45
gene67
Are they ranked by a score calculated from mutual information? I mean, the first-ranked gene gains the highest MI, so it may be a good gene for classifying the class of a sample, i.e. cancer vs. normal, right? Thank you.
As far as I understand, the MRMR method simply ranks the N features you input according to their MRMR score. It is then up to you to decide which features to keep and discard.
According to the mRMRe package documentation, the MRMR score is computed as follows:
For each target, the score of a feature is defined as the
mutual information between the target and this feature minus the average mutual information of
previously selected features and this feature.
So in other words,
Relevancy = Mutual information with the target
Redundancy = Average mutual information with the previous predictors
MRMR Score = Relevancy - Redundancy.
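To make the scoring concrete, here is a minimal greedy selection sketch, assuming the mutual-information values have already been computed; mi_target and mi_feat are hypothetical inputs for illustration, not part of the mRMRe API:
# mi_target[i] = MI between feature i and the target
# mi_feat[i, j] = MI between features i and j
select_mrmr <- function(mi_target, mi_feat, n_select = 5) {
  selected <- integer(0)
  candidates <- seq_along(mi_target)
  for (step in seq_len(n_select)) {
    redundancy <- if (length(selected) == 0) {
      0
    } else {
      rowMeans(mi_feat[candidates, selected, drop = FALSE])
    }
    score <- mi_target[candidates] - redundancy  # relevancy - redundancy
    best <- candidates[which.max(score)]
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected  # feature indices in MRMR rank order
}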
The way I interpret this, the scores themselves don't offer a clear-cut answer for keeping or discarding features. Higher scores are better, but a zero or negative score does not mean the feature has no effect on the target: it could have some mutual information with the target, but higher average mutual information with the other predictors, leading to a negative MRMR score. Finding the exact optimal feature set requires experimentation (see the sketch after the retrieval commands below).
To retrieve the indexes of the features (in the original data), ranked from highest to lowest MRMR score, use:
solutions(mrEnsemble)
To retrieve the actual MRMR scores, use:
scores(mrEnsemble)
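The package itself will not pick the count for you. A common approach is to treat the feature count as a hyperparameter: take the mRMR ranking, evaluate nested subsets (top 10, top 20, ...) with cross-validated classification accuracy, and keep the count that performs best. A minimal sketch, assuming ranked holds the feature indices in mRMR order (e.g. solutions(mrEnsemble)[[1]][, 1]), X is the feature matrix, and y holds the class labels (e.g. cancer vs. normal); the k-NN classifier is just a placeholder for whatever model you actually use:
library(class)  # for knn()

# Cross-validated accuracy using the top n_feat features of the mRMR ranking
cv_accuracy <- function(n_feat, ranked, feats, labels, folds = 5) {
  idx <- ranked[seq_len(n_feat)]
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(feats)))
  mean(sapply(seq_len(folds), function(f) {
    train <- fold_id != f
    pred <- knn(feats[train, idx, drop = FALSE],
                feats[!train, idx, drop = FALSE],
                cl = labels[train], k = 5)
    mean(pred == labels[!train])
  }))
}

counts <- c(10, 20, 50, 100, 200)
acc <- sapply(counts, cv_accuracy, ranked = ranked, feats = X, labels = y)
best_count <- counts[which.max(acc)]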

Text similarity as probability (between 0 and 1)

I have been trying to compute text similarity such that it would be between 0 and 1, interpreted as a probability. The two texts are encoded as two vectors whose components are numbers in [-1, 1]. Given two such vectors, it seems plausible to use cosine similarity, but the output of cosine is between -1 and 1. So I'm wondering if there is a method that either 1) gives a similarity in [0, 1] directly, or 2) maps the cosine similarity onto [0, 1]. Any ideas?
P.S. Having worked a lot with cosine similarity, I have seen some people suggest converting the cosine distance to a probability, and others suggest mapping every value in [-1, 0] to 0 while keeping values in [0, 1] as they are. Honestly, neither method makes sense to me; I think they both distort the concept of similarity. So I'm wondering if there is an elegant method out there that serves this purpose.
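For reference, two standard order-preserving transforms, sketched below; note that neither produces a calibrated probability (for that you would need labeled pairs and something like logistic calibration):
# Two standard ways to map a cosine similarity s in [-1, 1] onto [0, 1]:

# 1) Linear rescaling: treats s = -1 as maximally dissimilar.
to_unit_linear <- function(s) (s + 1) / 2

# 2) Angular similarity: 1 minus the normalized angle between the vectors;
#    the clamping guards against floating-point values slightly outside [-1, 1].
to_unit_angular <- function(s) 1 - acos(pmin(pmax(s, -1), 1)) / pi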

Optimum number of permutations to use for estimating set similarity using min hash

Let's say I have to estimate the Jaccard similarity between documents A and B, and I use k random permutations of the union of these sets/documents to determine the documents' signatures.
How should I set my k value? Setting it to a really high value would increase computation time significantly, so what is the smallest value of k that can give me a good Jaccard index estimate?
Given an error tolerance e > 0 and a failure probability delta, how can I determine the minimum value of k such that the true Jaccard index lies between (1-e)J' and (1+e)J' with probability at least 1-delta, where J' is the estimate?
I believe this can be derived using a Chernoff bound, but I'm unable to figure out how to go about it. Any help would be appreciated. Thanks in advance!
If J' denotes the estimate and J the true Jaccard similarity, then k·J' follows a binomial distribution with parameters k and J. As a consequence, the variance of the estimate is Var(J') = J(1-J)/k <= 1/(4k), and therefore its standard deviation is bounded by stdev(J') <= 1/(2·sqrt(k)).
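One way to turn this into a concrete k, using Hoeffding's inequality for an additive error guarantee (a common simplification; the multiplicative Chernoff version follows the same pattern but with a bound that depends on J): since P(|J' - J| >= e) <= 2·exp(-2·k·e^2), requiring the right-hand side to be at most delta gives k >= ln(2/delta) / (2·e^2). As a sketch:
# Minimum number of permutations k so that |J' - J| <= e with probability
# at least 1 - delta, from Hoeffding: P(|J' - J| >= e) <= 2 * exp(-2 * k * e^2)
min_permutations <- function(e, delta) {
  ceiling(log(2 / delta) / (2 * e^2))
}
min_permutations(0.05, 0.05)  # 738 permutations for a +/- 0.05 estimate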

Infer optimum threshold values for normalized levenshtein distance and trigram similarity functions

Records from two datasets are compared for fuzzy-match string similarity, using a normalized Levenshtein distance function and a trigram similarity function. Four different similarity metrics are calculated:
LevCmpSimilarity - normalized Levenshtein similarity for the compared composite (concatenated) fields,
LevWghSimilarity - normalized Levenshtein similarity as a summary over all individual fields being compared,
TrgWgh and TrgCmp - the same as the Levenshtein metrics, but with the trigram similarity function instead of Levenshtein.
Below are histograms for all four metrics, showing absolute and cumulative frequencies.
[histogram: absolute frequencies]
[histogram: cumulative frequencies]
My question is: could this frequency distribution pattern be used for automatic, unsupervised determination of optimum threshold values for record matching acceptance/rejection? If yes, can you suggest a direction?
Basically, could the frequency patterns of the Levenshtein distance and trigram similarity values alone be used for inferring optimum threshold values for fuzzy-match record linkage?
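One common unsupervised heuristic, sketched below under the assumption that the score distribution is bimodal (a mass of non-matches at low similarity and a mass of matches at high similarity): estimate the density of the scores and place the threshold at the deepest valley between the two modes. Here sims is a hypothetical numeric vector of similarity scores in [0, 1]:
# Pick a threshold at the valley of an assumed-bimodal score distribution.
d <- density(sims, from = 0, to = 1)
# Local minima of the estimated density: sign of the slope flips from - to +
valleys <- which(diff(sign(diff(d$y))) == 2) + 1
# Candidate threshold: the valley with the lowest density
threshold <- d$x[valleys[which.min(d$y[valleys])]]
If the histograms show no clear valley, this heuristic degrades, and a labeled sample of pairs is usually needed instead.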

Discrete Math: Given a set of integers, permute, calculate the expected number of integers that remain in the same position

We are given the set of integers from 0 to n, which is then randomly permuted. The goal is to calculate the expected number of integers that remain in the same position in both lists. I have tried to set up an indicator variable for each integer and map it between the two lists, but I don't really know how to go from there.
The random variable X, representing the number of your integers which remain in the same position after randomisation, has expected value 1.
My reasoning is:
Each integer can move to any position in the list after randomisation, with equal probability. So whether a given integer remains in place is a Bernoulli trial with success probability 1/(n+1): there are n+1 possible positions it could end up in, and only 1 of them keeps it in place.
This gives n+1 indicator variables, all with the same probability. Note that they are not independent (for example, if n of the integers stay in place, the last one must stay in place too), so X is not exactly binomially distributed. Linearity of expectation, however, does not require independence: the expectation of a sum is always the sum of the expectations.
Therefore the expected number of integers remaining in place is (n+1) · (1/(n+1)) = 1.
For more on linearity of expectation and the fixed points of a random permutation, see Wikipedia.
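A quick simulation confirms the result; the choice n = 9 is arbitrary:
# Average number of fixed points of a random permutation of 0..n is 1,
# regardless of n.
n <- 9
mean(replicate(1e5, sum(sample(0:n) == 0:n)))  # ~ 1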
