Text similarity as probability (between 0 and 1) - similarity

I have been trying to compute text similarity such that it'd be between 0 and 1, seen as a probability. The two text are encoded in two vectors, that are a bunch of numbers between [-1, 1]. So as two vectors are given, it seems plausible to use cosine similarity to obtain the vector similarity, but the output value of cosine is in between -1 and 1. So, I'm wondering if there's a method that either: 1) gives similarity between [0,1], or 2) transfer the cosine similarity to [0,1] distribution. Any ideas?
P.S. as I was so much working with cosine similarity, I saw that some suggest transferring the cosine distance to probability, or some suggested that every value between [-1, 0] should be mapped to 0, while keeping values between [0,1] as they are. Honestly, none of the methods makes sense to me, and I think they both mis-change the concept of similarity. So I'm wondering if any elegant method is out there to serve this functionality.

Related

Why are distances in text2vec's RWMD module between 1 and -1?

From what I understand, the dist2 RWMD feature of the great text2vec package calculates distances between matrixes as cosine distances. Wouldn't that mean 1 - (cosine similarity)? If cosine similarity runs between 0 and 1, then shouldn't that result in values between 0 and 1, too? I am not sure how to interpret negative distances in this case, and how are they different from positive distances. Thanks!
The cosine between two vectors is the dot product divided by the product of the norms. Since the dot product can be negative, cosine is between 1 and -1.

Normalize the similarity between word vectors and document vectors?

Cosine similarity is broadly used for measuring the similarity between two vectors, where two could be word vectors or document vectors.
Others, like manhattan, euclidean, minkowski, etc, are also popular.
Cosine similarity gives the number between 0 and 1 so it SEEMS like it is a percentage of the similarity between two vectors. Euclidean gives some number in large variation.
.
When cosine similarity between two vectors gives 0.78xxx, people including me probably expect that "these two vectors are 78 % similar!", which it is not actual "similarity degree" of two vectors.
.
Unlike cosine similarity, minkowski, manhattan, canberra, etc even give some large number not ranged in 0 to 1.
For word1:word2 example
0.78 (cosine, gives between 0 to 1)
9.54 (Euclidean, gives the actual distance between two vectors)
158.417 (Canberra)
.
I expect that there might be some normalization methods broadly used to represent the actual "similarity degree" between two vectors. Please provide if you know some. If there are articles or papers, it would be much better.
For word1:word2 example
0.848 (cosine, transformed as normalized number)
0.758 (Euclidean, normalized between 0 to 1)
0.798 (Canberra, normalized between 0 to 1)
I do not expect you to mention about the softmax number because I read an article that the softmax number itself should not be considered as the actual percentage.
You would have to rigorously define what you mean by "actual 'similarity degree'" for any answer to be possible.
Each of these measures can be useful. Each could be scaled to a value from 0.0 to 1.0, if you needed things in that range. But that wouldn't necessarily make any of them a "percentage similarity", because "percentage similarity" isn't a concept with a rigorous meaning.

substitution matrix based on spatial autocorrelation transformation

I would like to measure the hamming sequence similarity in which the substitution costs are not based on the substitution rates in the observed sequences but based on the spatial autocorrelation within the study area of the different states (states are thus not related to DNA but something else).
I divided my study area in grid cells of equal size (e.g. 1000m) and measured how often the same "state" is observed in a neighboring cell (Rook-case). Consequently the weight matrix indicates that from state A to A (to move within the same states) has a much higher probability than to go from A to B or B to C or A to C. This already indicates that states have a high spatial autocorrelation.
The problem is, if you want to measure sequence similarity the substitution matrix should be 0 at the diagonal. Therefore I was wondering whether there is a kind of transformation to go from an "autocorrelation matrix" to a substitution matrix, with 0 values along the diagonal. By means of this we would like to account for spatial autocorrelation in the study area in our sequence similarity measure. To do my analysis I am using the package TraMineR.
Example matrix in R for sequences consisting out of four states (A,B,C,D):
Sequence example: AAAAAABBBBCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDAAAAAAAAA
Autocorrelation matrix:
A = c(17.50,3.00,1.00,0.05)
B = c(3.00,10.00,2.00,1.00)
C = c(1.00,2.00,30.00,3.00)
D = c(0.05,1.00,3.00,20.00)
subm = rbind(A,B,C,D)
colnames(subm) = c("A","B","C","D")
how to transform this matrix to a substitution matrix?
First, TraMineR computes the Hamming distance, i.e., a dissimilarity, not a similarity.
The simple Hamming distance is just the count of mismatches between two sequences. For example, the Hamming distance between AABBCC and ABBBAC is 2, and between AAAAAA and AAAAAA it is 0 since there are no mismatches.
Generalized Hamming allows to weighting mismatches (not matches!) with substitution costs. For example if the substitution cost between A and B is 1.5, and is 2 between B and C, then the distance would be the weighted sum of mismatches, i.e., 3.5 between the first two sequences. It would still be zero between one sequence and itself.
From what I understand, the shown matrix is not the matrix of substitution costs. It is the matrix of what you call 'spatial autocorrelations', and you look for how you can turn this information into substitutions costs.
The idea is to assign high substitution cost (mismatch weight) when the autocorrelation (a rate in your case) is low, i.e., when there is a low probability to find say state B in the neighborhood of state A, and to assign a low substitution cost when the probability is high. Since your probability matrix is symmetric, a simple solution is to use $1 - p(A|B)$ for all off diagonal terms, and leave 0 on the diagonal for the reason explained above.
sm <- 1 - subm/100
diag(sm) <- 0
sm
For non symmetric probabilities, you could use a similar formula to the one used for deriving the costs from transition rates, i.e., $2 - p(A|B) - p(B|A)$.

How the command dist(x,method="binary") calculates the distance matrix?

I have a been trying to figure that out but without much success. I am working with a table with binary data (0s and 1s). I managed to estimate a distance matrix from my data using the R function dist(x,method="binary"), but I am not quite sure how exactly this function estimates the distance matrix. Is it using the Jaccard coefficient J=(M11)/(M10+M01+M11)?
This is easily found in the help page ?dist:
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
[...]
binary: (aka asymmetric binary): The vectors are regarded as binary
bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The
distance is the proportion of bits in which only one is on amongst
those in which at least one is on.
This is equivalent to the Jaccard distance as described in Wikipedia:
An alternate interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union.
In your notation, it is 1 - J = (M01 + M10)/(M01 + M10 + M11).

how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements?
The user wants to impose a unique, non-trivial, upper/lower bound on the correlation between every pair of variable in a var/covar matrix.
For example: I want a variance matrix in which all variables have 0.9 > |rho(x_i,x_j)| > 0.6, rho(x_i,x_j) being the correlation between variables x_i and x_j.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, (in MATLAB) here is a simple positive definite 3x3 matrix. As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the statistics toolbox (mvnrnd should do the trick.) As easy is to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme as above to generate a random sample. Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired. Now, find a zero mean random perturbation to your sampled data that would move the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem, with quadratic constraints, to find the smallest perturbation to a matrix X, such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of the positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix) I think one of the most affordable approaches could be using the Sylvester criterion.
You can start with a trivial 1x1 random matrix with positive determinant and expand it in one row and column step by step while ensuring that the new matrix has also a positive determinant (how to achieve that is up to you ^_^).
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample size issue b/c with p>2 you do not even observe convergence to the right range for the correlations, as sample size growths: i tried the technique you suggest before posting here, it obviously is flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option, for p large (say larger than 10) this option is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, i understand the constraints, but i'm not sure about the way you define the objective function; by using the "smallest perturbation" off some initial matrix, you will always end up getting the same (solution) matrix: all the off diagonal entries will be exactly equal to either one of the two bounds (e.g. not pseudo random); plus it is kind of an overkill isn't it ?
Come on people, there must be something simpler

Resources