Comparing context vectors

I'm using vectors to represent context around words and I need to compare contexts with each other. The following is a simplified version of my problem:
Let's say I have a vector a=[1,1,15,2,0]. Then I have vectors b=[0,0,15,0,0] and c=[1,1,11,0,1]. When comparing them to a by cosine similarity, b is closest to a. However, since the vectors represent context, c makes more sense in my case: b is just a context that happens to have one word in common with the original, with the same score.
How could I return c as the most similar? Another similarity measure? Or maybe my reasoning is flawed somewhere?
As I've said, this is a simplification of my problem. I am already normalizing the vectors and for scoring context words I'm using log-likelihood.
Thanks!

Use Jaccard similarity. In the Python demo below, keep in mind that the functions cosine and jaccard return distance, which is the "inverse" of similarity, and read the comments:
# Input all the data
In [19]: from scipy.spatial.distance import cosine, jaccard
In [24]: a
Out[24]: array([ 1, 1, 15, 2, 0])
In [25]: b
Out[25]: array([ 0, 0, 15, 0, 0])
In [26]: c
Out[26]: array([ 1, 1, 11, 0, 1])
# Calculate cosine distance. I've scaled it by a factor of 100 for legibility
In [20]: 100*cosine(a,b)
Out[20]: 1.3072457560346473
In [21]: 100*cosine(c,a)
Out[21]: 1.3267032349480568
# Note c is slightly "further away" from a than b.
# Now let's see what Mr Jaccard has to say
In [28]: jaccard(a,b)
Out[28]: 0.75
In [29]: jaccard(a,c)
Out[29]: 0.59999999999999998
# Behold the desired effect- c is now considerably closer to a than b
# Sanity check- the distance between a and a is 0
In [30]: jaccard(a,a)
Out[30]: 0.0
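For reference, here is the same comparison as a self-contained script (a minimal sketch; the array values are simply those from the question, and numpy/scipy are assumed to be installed):
import numpy as np
from scipy.spatial.distance import cosine, jaccard

# The three context vectors from the question.
a = np.array([1, 1, 15, 2, 0])
b = np.array([0, 0, 15, 0, 0])
c = np.array([1, 1, 11, 0, 1])

# Cosine distance: b ends up slightly closer to a than c does.
print(100 * cosine(a, b), 100 * cosine(a, c))   # ~1.307 vs ~1.327

# Jaccard distance: c is now considerably closer to a than b.
print(jaccard(a, b), jaccard(a, c))             # 0.75 vs 0.6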
PS Many more similarity measures exist, and each is appropriate under different circumstances. Do you have a good reason to believe c should be more similar to a than b? What is your task? If you want to read more about the subject, I highly recommend this PhD thesis. Warning: 200 pages long.

Related

Text similarity as probability (between 0 and 1)

I have been trying to compute text similarity so that it lies between 0 and 1 and can be read as a probability. The two texts are encoded as two vectors whose entries are numbers in [-1, 1]. Given two such vectors, it seems plausible to use cosine similarity to measure vector similarity, but the output of cosine lies between -1 and 1. So I'm wondering if there's a method that either: 1) gives a similarity in [0, 1], or 2) maps the cosine similarity onto [0, 1]. Any ideas?
P.S. Since I have been working so much with cosine similarity, I have seen some people suggest converting the cosine distance to a probability, and others suggest mapping every value in [-1, 0] to 0 while keeping values in [0, 1] as they are. Honestly, neither method makes sense to me; I think both distort the concept of similarity. So I'm wondering whether a more elegant method exists for this.
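For concreteness only (not an endorsement), here is a small sketch of two such mappings: the common linear rescaling (s + 1) / 2, and the clipping rule mentioned above. The function names are hypothetical.
import numpy as np

def rescale_linear(cos_sim):
    # Map [-1, 1] linearly onto [0, 1]: -1 -> 0, 0 -> 0.5, 1 -> 1.
    return (cos_sim + 1.0) / 2.0

def clip_negative(cos_sim):
    # Map everything in [-1, 0] to 0; keep values in [0, 1] unchanged.
    return np.maximum(cos_sim, 0.0)

sims = np.array([-0.8, -0.1, 0.0, 0.4, 0.9])
print(rescale_linear(sims))   # [0.1  0.45 0.5  0.7  0.95]
print(clip_negative(sims))    # [0.   0.   0.   0.4  0.9 ]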

R: using Solve in matlib to solve system of equations where # unknowns > # eqns

library(matlib)
A <- matrix(c(1, 1/7, 1, 2/7, 1, 3/7, 1, 4/7, 1, 5/7, 1, 6/7), 2, 6)
b <- c(1,0.4)
showEqn(A, b)
I am trying to use the matlib function to help me find values of $x_1, ..., x_6$ that satisfy the above equations. I know there are many different combinations of $x_1, ..., x_6$ such that the above 2 equations can be satisfied. How can I find those values of $x_1, ..., x_6$? I tried using Solve(A, b, fractions = TRUE) but that does not give me a set of possible solutions.
If the number of unknowns is greater than the number of equations, there are infinitely many solutions. Indeed, if the difference is 1 the solution set is a line, if the difference is 2 it is a plane, and so on.
In general, if the number of variables is N and the number of equations is M (and we can assume the equations to be linearly independent), the solution will be a shape of dimension N-M embedded in a space of dimension N.
So what you obtain by solving a system where N is greater than M is an equation describing the relation between some of the x_i, e.g. the equation of a line. R is mainly designed for numeric computation rather than symbolic math, so it is probably not the best-suited software for this purpose. However, if you want to stay with R, there are some libraries that offer this possibility; see this post for example.
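If a purely numeric description of the solution set is enough, here is a minimal sketch in Python/NumPy (an illustration outside the matlib workflow above): a particular solution plus a basis of the null space of A parameterizes every solution.
import numpy as np
from scipy.linalg import null_space

# The system from the question: 2 equations, 6 unknowns.
A = np.column_stack([[1, k / 7] for k in range(1, 7)])   # 2 x 6
b = np.array([1.0, 0.4])

# One particular solution (the minimum-norm one) ...
x_part, *_ = np.linalg.lstsq(A, b, rcond=None)

# ... plus an orthonormal basis of the null space of A. Every solution is
# x = x_part + N @ t for an arbitrary vector t of length 4 (= 6 - 2).
N = null_space(A)

t = np.ones(N.shape[1])        # any choice of t gives a valid solution
x = x_part + N @ t
print(np.allclose(A @ x, b))   # True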

Cylindrical Clustering in R - clustering timestamp with other data

I'm learning R and I have to cluster numeric data with a timestamp field.
One of the parameters is a time, and since the data is strictly day-night dependent, I want to take into account the "spherical" nature of this data.
As far as I saw from the manual, libraries such as skmeans cannot handle "cylindrical" data but only "spherical" data (i.e. where all the components are in polar coordinates).
My idea for a suitable solution is the following: I can decompose the HOUR column (0-24) into two different columns X, Y and express the time in polar coordinates, such that x^2 + y^2 = 1.
In this way a k-means with Euclidean distance should have no problem interpreting the data.
Am I right?
Here is such a mapping of h to m where h is the time in hours (and fraction of an hour). Then we try kmeans and at least in this test it seems to work:
h <- c(22, 23, 0, 1, 2, 10, 11, 12)
ha <- 2*pi*h/24
m <- cbind(x = sin(ha), y = cos(ha))
kmeans(m, 2)$cluster # compute cluster assignments via kmeans
## [1] 2 2 2 2 2 1 1 1
k-means should use squared Euclidean distance.
But indeed: projecting your data into a meaningful Euclidean space is an easy way to avoid this kind of problem.
However, be aware that the mean will no longer lie on the cylinder. In many cases you can simply rescale the mean back onto the desired cylinder, but if it happens to be 0, no meaningful rescaling is possible.
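As a minimal sketch of that rescaling step (in Python rather than R, assuming the hour has already been mapped to (sin, cos) as in the first answer):
import numpy as np

# Hypothetical cluster of night-time hours straddling midnight (22:00-02:00).
hours = np.array([22.0, 23.0, 0.0, 1.0, 2.0])
angles = 2 * np.pi * hours / 24.0
points = np.column_stack([np.sin(angles), np.cos(angles)])  # map onto the unit circle

centroid = points.mean(axis=0)   # plain k-means-style centroid; lies inside the circle
norm = np.linalg.norm(centroid)
if norm > 0:                     # if the centroid is (0, 0), no meaningful rescaling exists
    on_circle = centroid / norm  # project it back onto the unit circle
    mean_angle = np.arctan2(on_circle[0], on_circle[1])   # matches the (sin, cos) ordering
    mean_hour = (24.0 * mean_angle / (2 * np.pi)) % 24.0
    print(mean_hour)             # ~0 or ~24, i.e. midnight, not the misleading arithmetic mean of 9.6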
The other option is kernel k-means. As your desired distance is Euclidean after a data transformation, you can also "kernelize" this transformation, and use kernel k-means. But it may actually be faster to transform your data in your particular case. It will likely only pay off when using much more complex transformations (say, to an infinite dimensional vector space).

Clustering unstructured text based on similarity and calculating optimum number of clusters

I am a data mining beginner and am trying to first formulate an approach to a clustering problem I am solving.
Suppose we have x writers, each with a particular style (use of unique words etc.). They each write multiple short texts, let's say a haiku. We collect many hundreds of these haikus from the authors and try to understand from the haikus, using context analysis, how many authors we had in the first place (we somehow lost records of how many authors there were, after a great war!)
Let's assume I create a hash table of words for each of these haikus. Then I could write a distance function that looks at the repetition of similar words between each pair of vectors. This could allow me to implement some sort of k-means clustering.
My problem now is to measure, probabilistically, the number of clusters, i.e. the number of authors, that would give me the optimum fit.
Something like:
number of authors | probability
1 | 0.05
2 | 0.1
3 | 0.2
4 | 0.4
5 | 0.1
6 | 0.05
7 | 0.03
8 | 0.01
The only constraint here is that as the number of authors (or clusters) goes to infinity, the sum of the probabilities should converge to 1, I think.
Does anyone have any thoughts or suggestions on how to implement this second part?
Let's formulate an approach using Bayesian statistics.
Pick a prior P(K) on the number of authors, K. For example, you might say K ~ Geometric(p) with support {1, 2, ... } where E[K] = 1 / p is the number of authors you expect there to be prior to seeing any writings.
Pick a likelihood function L(D|K) that assigns a likelihood to the writing data D given a fixed number of authors K. For example, you might say L(D|K) is the total amount of error in a k-component GMM found by expectation-maximization. To be really thorough, you could learn L(D|K) from data: the internet is full of haikus with known authors.
Find the value of K that maximizes the posterior probability P(K|D) - your best guess at the number of authors. Note that since P(K|D) = P(D|K)P(K)/P(D), P(D) is constant, and L(D|K) is proportional to P(D|K), you have:
argmax { P(K|D) : K = 1, 2, ... } = argmax { L(D|K)P(K) : K = 1, 2, ... }
With respect to your question, the first column in your table corresponds to K and the second column corresponds to a normalized P(K|D); that is, it is proportional to L(D|K)P(K).
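Here is a minimal sketch of this recipe (hypothetical function and parameter names; it assumes scikit-learn and scipy are available, uses a GaussianMixture log-likelihood for L(D|K) and a Geometric(p) prior for P(K), and truncates the support of K at k_max). Raw likelihoods tend to favour larger K and the truncated posterior often concentrates sharply, so in practice a penalized likelihood may work better:
import numpy as np
from scipy.stats import geom
from sklearn.mixture import GaussianMixture

def author_count_posterior(X, k_max=10, p=0.25, seed=0):
    """Return normalized posterior weights over K = 1..k_max authors.

    X is an (n_haikus, n_features) numeric matrix, e.g. word-count vectors;
    p is the parameter of the Geometric(p) prior on K (so E[K] = 1/p)."""
    log_scores = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=seed)
        gmm.fit(X)
        log_lik = gmm.score(X) * X.shape[0]     # total log-likelihood of D given K = k
        log_prior = geom.logpmf(k, p)           # log P(K = k) under the geometric prior
        log_scores.append(log_lik + log_prior)  # log of L(D|K) * P(K)
    log_scores = np.array(log_scores)
    post = np.exp(log_scores - log_scores.max())   # subtract the max for numerical stability
    return post / post.sum()                       # normalize over the truncated support

# Toy usage: three "authors" with different word-usage profiles.
rng = np.random.default_rng(0)
X = np.vstack([rng.poisson(lam, size=(50, 20)) for lam in (1.0, 3.0, 6.0)]).astype(float)
for k, pr in enumerate(author_count_posterior(X, k_max=6), start=1):
    print(k, round(float(pr), 3))   # one (K, posterior weight) row per candidate author count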

Fast Fourier Transform using a Vandermonde Matrix - Evaluation of Coefficients?

Say I'm trying to evaluate the polynomial:
x^2 + 1
Using the Fast Fourier Transform method for evaluating coefficients. Now I can change this into matrix/vector form, using the coefficients as inputs for the Fast Fourier Transform:
so:
x^2 + 1 = <1, 0, 1, 0>
This is done by reading off the coefficient values, e.g. the constant term gives 1, the x^1 term gives 0, the x^2 term gives 1, and so on.
Now we get to the bit where I'm totally confused. I'm meant to use the Vandermonde matrix (see the Wikipedia article on Vandermonde matrices) to evaluate these values into FFT form, using the matrix:
1  1  1  1
1  i -1 -i
1 -1  1 -i
1 -i  1  i
The output of
fft(1,0,1,0)
is
(2,0,2,0)
Now that's the step I don't quite understand: how did we use that matrix to get (2,0,2,0)?
First, your Vandermonde matrix is incorrect. The (4,3) entry should be -1, not 1, since the fourth row should be (-i)^0, (-i)^1, (-i)^2, (-i)^3. Note in particular that
(-i) * (-i) = (-1)^2 * i^2 = i^2 = -1.
With this correction, the result follows from multiplying the Vandermonde matrix by the column vector (1,0,1,0).
Maybe you could explain what your overall goal is here. I have never heard of FFTs being used to evaluate polynomials. They are used to multiply polynomials, or to convolve signals (an equivalent task), but I wouldn't bother unless the polynomials/signals have a large number of terms. x^2 + 1 isn't large. 16 terms is not large, and even 64 or 256 terms is probably better done by straightforward O(N^2) techniques.
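As a quick illustration of that use, here is a minimal NumPy sketch: FFT-based multiplication of two small polynomials matches the direct O(N^2) convolution.
import numpy as np

# p(x) = x^2 + 1 and q(x) = 3 + 2x, coefficients in ascending order.
p = np.array([1.0, 0.0, 1.0])
q = np.array([3.0, 2.0])

direct = np.convolve(p, q)        # straightforward O(N^2) product: 3 + 2x + 3x^2 + 2x^3

n = len(p) + len(q) - 1           # length of the product polynomial
fft_based = np.fft.ifft(np.fft.fft(p, n) * np.fft.fft(q, n)).real

print(direct)                     # [3. 2. 3. 2.]
print(np.round(fft_based, 10))    # the same, up to floating-point error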
Discrete Fourier Transforms use the matrix M_ij = ω^(ij), where ω is a primitive Nth complex root of 1 and the column/row numbering goes from 0 to N-1.
Fast Fourier Transforms never use this matrix directly, they are heavily optimized to use a divide-and-conquer technique (Cooley-Tukey algorithm) to calculate the end result through stages of 2x2 DFTs in series and parallel.
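To make the matrix form concrete, here is a small NumPy check (a sketch using NumPy's sign convention ω = exp(-2πi/N); the matrix in the question uses ω = i, the conjugate convention, which gives the same result here because the input vector is real and symmetric):
import numpy as np

# Evaluate p(x) = x^2 + 1 at the 4th roots of unity via the DFT (Vandermonde) matrix
# and compare with numpy's FFT. Coefficients in ascending order: 1 + 0x + 1x^2 + 0x^3.
coeffs = np.array([1, 0, 1, 0], dtype=complex)
N = len(coeffs)

omega = np.exp(-2j * np.pi / N)
V = np.array([[omega ** (j * k) for k in range(N)] for j in range(N)])  # V[j, k] = omega^(j*k)

print(V @ coeffs)          # ~[2, 0, 2, 0], up to floating-point error
print(np.fft.fft(coeffs))  # [2.+0.j  0.+0.j  2.+0.j  0.+0.j]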
If you write your vector as [0,1,0,1] instead of [1,0,1,0], I think you will see that if you multiply that by the matrix you gave, you'll get [0,2,0,2]. (Although you have an error; it should be
1  1  1  1
1  i -1 -i
1 -1  1 -1
1 -i -1  i
) There must be some convention in the program you are using which reverses the order of the vector's coefficients.
