Why are distances in text2vec's RWMD module between 1 and -1?

From what I understand, the dist2 RWMD feature of the great text2vec package calculates distances between matrices as cosine distances. Wouldn't that mean 1 - (cosine similarity)? If cosine similarity runs between 0 and 1, shouldn't the result also lie between 0 and 1? I am not sure how to interpret negative distances in this case, or how they differ from positive distances. Thanks!

The cosine between two vectors is the dot product divided by the product of their norms: $\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$. Since the dot product can be negative, the cosine ranges between -1 and 1, not between 0 and 1.
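For intuition, here is a quick base-R illustration (a sketch only; the helper cos_sim is mine, not the text2vec code): the cosine can be negative, so the corresponding distance 1 - cosine ranges over [0, 2] rather than [0, 1].

cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
cos_sim(c(1, 2), c(2, 1))       # 0.8: vectors point in similar directions
cos_sim(c(1, 0), c(-1, 0))      # -1: vectors point in opposite directions
1 - cos_sim(c(1, 0), c(-1, 0))  # cosine distance of 2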

Related

Generating two vectors with a given angle between them

I am trying to generate two vectors with a given cosine similarity. The input would be the degree of cosine similarity (or the angle, since one determines the other) and the number of dimensions D, and the output would be two vectors of D dimensions with that given similarity between them. I know how to use the cosine similarity function to calculate the similarity, but I'm lost when trying to go the other way around.
Is there such a procedure or algorithm, and what is it called?
For a given starting vector u and cosine similarity c:
1. Generate two points in n-D space; call them a and b.
2. Project b onto the plane orthogonal to u and containing a.
3. Subtract a from the result to obtain a vector orthogonal to u; call it h.
4. Use [u, h] as a basis and basic trigonometry to generate the desired vector v.
The above method is dimension-agnostic, as it only uses dot products. The resulting vectors {v} are of unit length and uniformly distributed around u.
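Here is a minimal R sketch of that recipe (the function name is mine; it takes the shortcut of obtaining the orthogonal vector h from a single random point by Gram-Schmidt, which is equivalent to the a/b construction above):

rand_vector_with_cosine <- function(u, c) {
  u <- u / sqrt(sum(u^2))            # unit vector along u
  r <- rnorm(length(u))              # a random point
  h <- r - sum(r * u) * u            # component of r orthogonal to u
  h <- h / sqrt(sum(h^2))            # unit vector orthogonal to u
  c * u + sqrt(1 - c^2) * h          # basic trigonometry in the (u, h) plane
}

u <- rnorm(5)
v <- rand_vector_with_cosine(u, 0.8)
sum(u * v) / sqrt(sum(u^2) * sum(v^2))   # ~0.8, as requested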

Text similarity as probability (between 0 and 1)

I have been trying to compute text similarity such that it lies between 0 and 1, interpreted as a probability. The two texts are encoded as two vectors whose entries are numbers in [-1, 1]. Given two such vectors, it seems plausible to use cosine similarity to measure their similarity, but the output of the cosine lies between -1 and 1. So, I'm wondering if there's a method that either: 1) gives a similarity in [0, 1] directly, or 2) transforms the cosine similarity onto [0, 1]. Any ideas?
P.S. Since I was working so much with cosine similarity, I saw that some suggest converting the cosine distance to a probability, while others suggest mapping every value in [-1, 0] to 0 and keeping values in [0, 1] as they are. Honestly, neither method makes sense to me, and I think they both distort the concept of similarity. So I'm wondering if there is an elegant method out there to serve this purpose.
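For what it's worth, both suggestions mentioned in the P.S. are one-liners in R; the linear rescaling (cos + 1)/2 is one common reading of the first one. This is just an illustration of those suggestions, not an endorsement of either:

cos_sim <- c(-0.9, -0.2, 0.3, 0.8)
(cos_sim + 1) / 2    # linear rescaling of [-1, 1] onto [0, 1]
pmax(cos_sim, 0)     # map [-1, 0] to 0, keep [0, 1] as is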

Coefficient of Euclidean Distance

I have been trying to calculate the correlation coefficient (say r) and the Euclidean distance (say d) between two random variables X and Y. It is known that -1 <= r <= 1, whereas d >= 0. To compare these similarity metrics (mostly for visualization purposes), I first want to calculate a coefficient for d, so that it lies between 0 and 1 or between -1 and 1 like r. One way to scale d is to divide it by its maximum, i.e. d* = d/max(d). However, max(d) is not a global value: when someone uses different data points for X and Y, the result is no longer comparable to the first one. Therefore, I'm asking this community to suggest a better way of scaling the Euclidean distance so that it ranges in [0, 1] or [-1, 1].
I appreciate your cooperation in advance.
Alemu
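To make the issue concrete, here is a small R sketch of two data-independent alternatives (neither comes from the post itself; the last lines use the identity $d^2 = 2(n-1)(1-r)$, which holds for standardized variables):

set.seed(1)
x <- rnorm(100); y <- 0.6 * x + rnorm(100)
d <- sqrt(sum((scale(x) - scale(y))^2))  # Euclidean distance of standardized X, Y
d / (1 + d)                        # squashes [0, Inf) into [0, 1), no max(d) needed
1 - d^2 / (2 * (length(x) - 1))    # recovers r from d for standardized variables
cor(x, y)                          # same value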

substitution matrix based on spatial autocorrelation transformation

I would like to measure Hamming sequence similarity in which the substitution costs are not based on the substitution rates in the observed sequences, but on the spatial autocorrelation of the different states within the study area (the states are thus not related to DNA but to something else).
I divided my study area into grid cells of equal size (e.g. 1000 m) and measured how often the same "state" is observed in a neighboring cell (rook case). The resulting weight matrix indicates that staying within the same state (A to A) has a much higher probability than moving from A to B, from B to C, or from A to C. This already indicates that the states have high spatial autocorrelation.
The problem is that, to measure sequence similarity, the substitution matrix should be 0 on the diagonal. Therefore I was wondering whether there is a kind of transformation to go from an "autocorrelation matrix" to a substitution matrix with 0 values along the diagonal. In this way we would like to account for spatial autocorrelation in the study area in our sequence similarity measure. For the analysis I am using the TraMineR package.
Example matrix in R for sequences consisting out of four states (A,B,C,D):
Sequence example: AAAAAABBBBCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDAAAAAAAAA
Autocorrelation matrix:
A <- c(17.50, 3.00, 1.00, 0.05)
B <- c(3.00, 10.00, 2.00, 1.00)
C <- c(1.00, 2.00, 30.00, 3.00)
D <- c(0.05, 1.00, 3.00, 20.00)
subm <- rbind(A, B, C, D)               # rows and columns ordered A, B, C, D
colnames(subm) <- c("A", "B", "C", "D")
How can I transform this matrix into a substitution matrix?
First, TraMineR computes the Hamming distance, i.e., a dissimilarity, not a similarity.
The simple Hamming distance is just the count of mismatches between two sequences. For example, the Hamming distance between AABBCC and ABBBAC is 2, and between AAAAAA and AAAAAA it is 0 since there are no mismatches.
The generalized Hamming distance allows weighting the mismatches (not the matches!) with substitution costs. For example, if the substitution cost between A and B is 1.5, and is 2 between A and C, then the distance between the first two sequences above would be the weighted sum of mismatches, i.e., 1.5 + 2 = 3.5. It would still be zero between a sequence and itself.
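Both computations are easy to check in base R (the cost values are those from the example; the lookup via a dimnamed matrix is just one convenient way to do it):

s1 <- strsplit("AABBCC", "")[[1]]
s2 <- strsplit("ABBBAC", "")[[1]]
sum(s1 != s2)                      # simple Hamming distance: 2 mismatches
sts <- c("A", "B", "C")
costs <- matrix(0, 3, 3, dimnames = list(sts, sts))
costs["A", "B"] <- costs["B", "A"] <- 1.5
costs["A", "C"] <- costs["C", "A"] <- 2
mism <- which(s1 != s2)
sum(costs[cbind(s1[mism], s2[mism])])  # generalized Hamming: 1.5 + 2 = 3.5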
From what I understand, the matrix shown is not the matrix of substitution costs. It is the matrix of what you call 'spatial autocorrelations', and you are looking for a way to turn this information into substitution costs.
The idea is to assign a high substitution cost (mismatch weight) when the autocorrelation (a rate in your case) is low, i.e., when there is a low probability of finding, say, state B in the neighborhood of state A, and a low substitution cost when the probability is high. Since your probability matrix is symmetric, a simple solution is to use $1 - p(A|B)$ for all off-diagonal terms, and leave 0 on the diagonal for the reason explained above.
sm <- 1 - subm/100   # rates rescaled to proportions; high autocorrelation => low cost
diag(sm) <- 0        # a match (a state substituted for itself) costs nothing
sm
For non-symmetric probabilities, you could use a formula similar to the one used for deriving costs from transition rates, i.e., $2 - p(A|B) - p(B|A)$.
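A sketch of that variant, continuing with the subm matrix from the question (again rescaling the rates to proportions by dividing by 100, as above):

p <- subm / 100
sm2 <- 2 - p - t(p)   # reduces to 2 * (1 - p) when p is symmetric
diag(sm2) <- 0        # matches still cost nothing
sm2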

How the command dist(x,method="binary") calculates the distance matrix?

I have been trying to figure that out, but without much success. I am working with a table of binary data (0s and 1s). I managed to estimate a distance matrix from my data using the R function dist(x, method="binary"), but I am not quite sure how exactly this function estimates the distance matrix. Is it using the Jaccard coefficient $J = M_{11}/(M_{10} + M_{01} + M_{11})$?
This is easily found in the help page ?dist:
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
[...]
binary (aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
This is equivalent to the Jaccard distance as described in Wikipedia:
An alternate interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union.
In your notation, it is $1 - J = (M_{01} + M_{10})/(M_{01} + M_{10} + M_{11})$.
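A quick numerical check in R that the two descriptions agree:

x <- rbind(c(1, 0, 1, 1, 0),
           c(1, 1, 0, 1, 0))
dist(x, method = "binary")               # 0.5
M11 <- sum(x[1, ] == 1 & x[2, ] == 1)    # both on: 2
M10 <- sum(x[1, ] == 1 & x[2, ] == 0)    # only the first on: 1
M01 <- sum(x[1, ] == 0 & x[2, ] == 1)    # only the second on: 1
(M10 + M01) / (M10 + M01 + M11)          # 0.5, the Jaccard distance 1 - J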
