Calculating the cosine similarity - information-retrieval

I'm trying to figure out how to calculate the cosine similarity of these two vectors:
A:(1,1,0,0,0,0,0,0,0)
B:(1,0,0,1,0,0,0,0,1)
From what I understand, I need to multiply A by B and then divide it by the length of A*B.
The first part I understand, but how do I know what the length is?
A is a document with 11 words
B is a query with 7 words
Does the length refer to the number of words? Or do I have to 'normalize' the vectors? I'm unsure because, from what I understood, the cosine already normalizes the vector.
Any help or hints would be appreciated.

Vector operations are defined on elements of a vector space (the dimensionality of which needs to be predefined).
So, even though we loosely use the terminology sparse vector, it's nothing but an efficient way to represent a dense vector...
For your example involving text vectors - both a document and a query are nothing but dense vectors (the dimensionality is the size of the entire vocabulary).
The first step is to convert a document and a query to a dense vector representation.
If D=<the cat sat on the mat> and Q=<cat on mat>,
then the vocabulary (set of unique words) is {cat, mat, on, sat, the}
Our toy vector space is hence of dimensionality 5. Each vector is hence represented by 5 numbers, where the value represents presence/absence or count of the corresponding term.
vec(D) = (1, 1, 1, 1, 2) - because cat occurs once, the occurs twice in D and so on.
Similarly, vec(Q) = (1, 1, 1, 0, 0) - note the 0's corresponding to the terms that are not present, e.g. the term the.
Cosine similarity is the normalized inner product.
The numerator is simply \sum_i a_i * b_i, which for this example is
1*1 + 1*1 + 1*1 + 1*0 + 2*0 = 3
What about the lengths? For finding the length (specifically L2 norm) of a vector simply compute its inner product with itself and then take a square root.
Len(D) = sqrt(1*1 + 1*1 + 1*1 + 1*1 + 2*2) = sqrt(8) = 2sqrt(2)
Len(Q) = sqrt(1*1 + 1*1 + 1*1 + 0*0 + 0*0) = sqrt(3)
Hence cosine-sim = 3/(2sqrt(2)*sqrt(3)) = 0.6124
Finally, for your example,
A:(1,1,0,0,0,0,0,0,0) and B:(1,0,0,1,0,0,0,0,1),
A.B = 1
Len(A) = sqrt(A.A) = sqrt(2)
Len(B) = sqrt(B.B) = sqrt(3)
cosine-sim = 1/sqrt(6) = 0.4082
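The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name cosine_similarity is made up for this example, and it assumes both vectors are already dense count vectors over the same vocabulary.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Normalized inner product of two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))   # numerator: sum_i a_i * b_i
    len_a = sqrt(sum(x * x for x in a))      # L2 norm of a
    len_b = sqrt(sum(y * y for y in b))      # L2 norm of b
    if len_a == 0 or len_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (len_a * len_b)

A = (1, 1, 0, 0, 0, 0, 0, 0, 0)
B = (1, 0, 0, 1, 0, 0, 0, 0, 1)
print(round(cosine_similarity(A, B), 4))
```

Note that the number of words in the document or query never enters the computation; only the vector components do.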

Related

How can you map a set of numbers, full of "holes" into a smaller one without "holes"

Can anyone figure out a function that can perform a mapping from a finite set of N numbers X = {x0, x1, x2, ..., xN} where each x can be valued 0 to 999999999 and N < 999999999, to a set Y = {0, 1, 2, 3, ..., N}.
In my case, I have about 24000000 elements in the first set, whose values fall in the same range as X. These elements come in contiguous blocks (for example 53000 to 1234500, then 8000000 to 9000000, and so on), and I have to remap them to 0 to 2400000. I don't need to preserve order.
I need a (possibly simple and rapid) math function, or a bitwise transformation, not something like put it ordered into an array and then binary search for their position.
Many thanks to anyone who can figure out a way to solve this!
Luca
If you don't want to keep a few gigabytes of direct mapping in memory, then an augmented segment tree is a reasonable approach. The tree should contain the intervals and the shift of every interval (the total size of all intervals to its left). Of course, finding the appropriate interval (and shift) with this method is close to a binary search.
For example, you get X=8000015. Find the interval for this value - it is 8000000 to 9000000. The shift of this interval is 1181501 (1234500 - 53000 + 1, the size of the interval before it). So X maps to
X => 1181501 + 8000015 - 8000000 = 1181516
For sparse elements, add a counting stage: find the rank R of every number M and put the (key=M, value=R) pair in a hash table.
X = (3, 19, 20, 101)
table: [(3:0), (19:1), (20:2), (101:3)]
Note that you have to balance speed against space - for long filled intervals it is better to store only the interval ends.
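A rough sketch of the interval-plus-shift idea follows. It swaps the segment tree for a sorted list of interval starts searched with the standard bisect module, which is simpler but, as noted above, is still a binary search. The names build_index and rank are invented for this example, and the intervals are assumed to be inclusive and non-overlapping.

```python
from bisect import bisect_right

def build_index(intervals):
    """Precompute, for each inclusive (lo, hi) interval, how many elements precede it."""
    starts, ends, shifts = [], [], []
    total = 0
    for lo, hi in sorted(intervals):
        starts.append(lo)
        ends.append(hi)
        shifts.append(total)      # shift = total size of all intervals to the left
        total += hi - lo + 1
    return starts, ends, shifts

def rank(x, index):
    """Map x to its position in 0..N-1, or raise KeyError if x is in a hole."""
    starts, ends, shifts = index
    i = bisect_right(starts, x) - 1   # last interval whose start is <= x
    if i < 0 or x > ends[i]:
        raise KeyError(x)
    return shifts[i] + (x - starts[i])

index = build_index([(53000, 1234500), (8000000, 9000000)])
print(rank(53000, index), rank(8000015, index))
```

Isolated values that do not belong to a long run could be kept in a plain dict next to this index, as in the hash-table suggestion.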

Check if 4 points in space are corner points of a rectangle

I have 4 points in space A(x,y,z), B(x,y,z), C(x,y,z) and D(x,y,z). How can I check if these points are the corner points of a rectangle?
You must first determine whether or not the points are all coplanar, since a rectangle is a 2D geometric object, but your points are in 3-space. You can determine they are coplanar by comparing cross products as in:
V1 = (B-A)×(B-C)
V2 = (C-A)×(C-D)
This will give you two vectors which, if A, B, C, and D are coplanar are linearly dependent. By considering what Wolfram has to say on vector dependence, we can test the vectors for linear dependence by using
C = (V1∙V1)(V2∙V2) - (V1∙V2)(V2∙V1)
If C is 0 then the vectors V1 and V2 are linearly dependent and all the points are coplanar.
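The coplanarity test can be sketched as follows. This is only an illustration: the helper names and the absolute tolerance tol are assumptions, and a tolerance that suits your data depends on the scale of the coordinates. The determinant is also 0 when three of the points are collinear, because one of the cross products is then the zero vector.

```python
def sub(p, q):
    """Component-wise difference p - q of two 3D points."""
    return tuple(a - b for a, b in zip(p, q))

def cross(u, v):
    """Cross product of two 3D vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def are_coplanar(A, B, C, D, tol=1e-9):
    """True if the two plane normals V1 and V2 are linearly dependent."""
    v1 = cross(sub(B, A), sub(B, C))
    v2 = cross(sub(C, A), sub(C, D))
    gram = dot(v1, v1) * dot(v2, v2) - dot(v1, v2) * dot(v2, v1)
    return abs(gram) <= tol

print(are_coplanar((0, 0, 0), (2, 0, 0), (2, 1, 0), (0, 1, 0)))
print(are_coplanar((0, 0, 0), (2, 0, 0), (2, 1, 0), (0, 1, 1)))
```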
Next compute the distances between each pair of points. There should be a total of 6 such distances.
D1 = |A-B|
D2 = |A-C|
D3 = |A-D|
D4 = |B-C|
D5 = |B-D|
D6 = |C-D|
Assuming none of these distances are 0, these points form a rectangle if and only if the vertices are coplanar (already verified) and these lengths can be grouped into three pairs where the elements of each pair have the same length. If the figure is a square, two of the pairs will have the same length and will be shorter than the remaining pair.
Update: Reading this again, I realize that the above could define a parallelogram, so an additional check is required: the square of the longest distance must equal the sum of the squares of the two shorter distances. Only then will the parallelogram also be a rectangle.
Keep in mind all of this is assuming infinite precision and within a strictly mathematical construct. If you're planning to code this up, you will need to account for rounding and accept a degree of imprecision that's not really a player when speaking in purely mathematical terms.
Check if V1=B-A and V2=D-A are orthogonal using the dot product. Then check if
C-A == V1+V2
within numerical tolerances. If both are true, the points are coplanar and form a rectangle.
Here is a function that checks whether the 4 points form a rectangle. The corners must be given in order around the shape (A, B, C, D).
from math import sqrt

def Verify(A, B, C, D, epsilon=0.0001):
    # Verify A-B = D-C, i.e. opposite sides are equal and parallel (parallelogram)
    zero = sqrt( (A[0]-B[0]+C[0]-D[0])**2 + (A[1]-B[1]+C[1]-D[1])**2 + (A[2]-B[2]+C[2]-D[2])**2 )
    if zero > epsilon:
        raise ValueError("Points do not form a parallelogram; C is at %g distance from where it should be" % zero)
    # Verify (D-A).(B-A) = 0, i.e. the two edges meeting at A are perpendicular
    zero = (D[0]-A[0])*(B[0]-A[0]) + (D[1]-A[1])*(B[1]-A[1]) + (D[2]-A[2])*(B[2]-A[2])
    if abs(zero) > epsilon:
        raise ValueError("Corner A is not a right angle; edge vector dot product is %g" % zero)
    else:
        print('rectangle')

# x1 ... z4 are placeholders; substitute your own coordinates
A = [x1,y1,z1]
print(A)
B = [x2,y2,z2]
C = [x3,y3,z3]
D = [x4,y4,z4]
Verify(A, B, C, D, epsilon=0.0001)

How to tell the length a unit vector needs to be in order to pass another

Let's say I have two items: a unit direction vector, and another arbitrary vector.
What I want to get is the length to make the unit vector so that it covers the "distance" or magnitude of the other vector. So the new vector "contains" the other vector but maintains its direction.
Do you see what I'm saying?
If I understand you correctly, you want a vector v = (An) whose component along b equals |b|:
(An).b/|b| = |b|
Here A is just a number, n is the unit vector and b is the arbitrary vector.
What this means is you want a vector with length A, but if you were to rotate the world so that b was on the x axis, the x component of (An) would be |b| (the magnitude of b).
Therefore, in components:
A(n1b1 + n2b2 + n3b3)/sqrt(b1^2 + b2^2 + b3^2) = sqrt(b1^2 + b2^2 + b3^2)
where n1 means the 1st (x) component of the vector n.
Therefore just re-arrange:
A = (b1^2 + b2^2 + b3^2)/(n1b1 + n2b2 + n3b3)
A = |b|^2/(n.b)
So the vector that you are looking for is:
v = A*n = n * |b|^2/(n.b)
I believe that's what you want. Note that this only works when n.b is not 0, i.e. when n is not perpendicular to b.
Edit: I broke that into components when I REALLY didn't need to. Components are useful if you don't understand what all the terms mean though. But here it is in just vector maths:
(An).b/|b| = A(n.b)/|b| = |b|
A = |b|^2/(n.b)
Therefore v = An = n * |b|^2/(n.b)
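As a quick numeric check, the sketch below takes the geometric condition literally: the component of A*n along the direction of b must equal |b|, that is A*(n.b)/|b| = |b|, which gives A = |b|^2/(n.b). The function name scale_to_cover and the tolerance are assumptions made for this example.

```python
from math import sqrt

def scale_to_cover(n, b, tol=1e-12):
    """Return (A, v) with v = A*n such that the projection of v onto b has length |b|."""
    n_dot_b = sum(x * y for x, y in zip(n, b))
    if abs(n_dot_b) <= tol:
        raise ValueError("n is perpendicular to b, so no finite length works")
    b_squared = sum(y * y for y in b)      # |b|^2
    A = b_squared / n_dot_b
    return A, tuple(A * x for x in n)

n = (1.0, 0.0, 0.0)
b = (1.0, 1.0, 0.0)
A, v = scale_to_cover(n, b)
# Component of v along b, to compare against |b|
projection = sum(x * y for x, y in zip(v, b)) / sqrt(sum(y * y for y in b))
print(A, v, projection)
```

If n points away from b (n.b < 0), A comes out negative, meaning the vector would have to be flipped to reach b.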

How to find 10 values, exponentially distributed, which sum to a value, x

I have a value, for example 2.8. I want to find 10 numbers which are on an exponential curve, which sum to this value.
That is, I want to end up with 10 numbers which sum to 2.8, and which, when plotted, look like the curve below (exponential decay). These 10 numbers should be equally spaced along the curve - that is, the 'x-step' between the values should be constant.
This value of 2.8 will be entered by the user, and therefore the way I calculate this needs to be some kind of algorithm that I can program (hence asking this on SO not Math.SE).
I have no idea where to start with this at all - any ideas?
You want to have 10 x values equally distributed, i.e. x_k = a + k * b. They shall fulfill sum(exp(-x_k)) = v with v being your target value (the 2.8). This means exp(-a) * sum(exp(-b)^k) = v.
Obviously, there is a solution for each choice of b if v is positive. Set b to an arbitrary value, and calculate a from it.
E.g. for v = 2.8 and b = 0.1, you get a = -log(v / sum(exp(-b)^k)) = -log(2.8/sum(0.90484^k)) = -log(2.8/6.6425) = -log(0.421526) = 0.86387.
So for this example, the x values would be 0.86387, 0.96387, ..., 1.76387 and the y values 0.421526, 0.381412, 0.345116, 0.312274, 0.282557, 0.255668, 0.231338, 0.209324, 0.189404, 0.171380.
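The recipe above (pick b, then solve for a) might look like this in Python. It is a minimal sketch; the function name decay_values is made up here, and the step b is treated as a free parameter chosen by the caller.

```python
from math import exp, log

def decay_values(v, b, n=10):
    """Return n equally spaced xs and ys = exp(-x) whose ys sum to v (v must be > 0)."""
    if v <= 0:
        raise ValueError("target sum must be positive")
    series = sum(exp(-b) ** k for k in range(n))   # sum of exp(-b)^k for k = 0..n-1
    a = -log(v / series)                           # from exp(-a) * series = v
    xs = [a + k * b for k in range(n)]
    ys = [exp(-x) for x in xs]
    return xs, ys

xs, ys = decay_values(2.8, 0.1)
print([round(y, 6) for y in ys], round(sum(ys), 6))
```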
Update:
As it has been clarified that the curve can be scaled arbitrarily and the xs are preferred to be 1, 2, 3 ... 9, this is much simpler.
Assuming the curve function is r*exp(-x), the values would be r*exp(-1) ... r*exp(-9). Their sum is r*sum(exp(-x)) = r*0.58190489. So to reach a certain value (2.8) you just have to adjust r accordingly:
r = 2.8/sum(exp(-x)) = 4.81178294
And you get the values (nine of them for x = 1 ... 9; extend the range of x if you need a tenth): 1.770156, 0.651204, 0.239565, 0.088131, 0.032422, 0.011927, 0.004388, 0.001614, 0.000594.
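The scaling step can be sketched like this. The function name scaled_decay is an assumption for illustration; it accepts any list of x positions, so the same code works whether you sample at 1..9 or at some other equally spaced set.

```python
from math import exp

def scaled_decay(total, xs):
    """Scale r*exp(-x) so that the values at the given xs sum to total."""
    xs = list(xs)
    unscaled_sum = sum(exp(-x) for x in xs)
    r = total / unscaled_sum
    return r, [r * exp(-x) for x in xs]

r, values = scaled_decay(2.8, range(1, 10))
print(round(r, 6), [round(y, 6) for y in values])
```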
If I understand your question correctly then you want to find x which solves the equation
It can be solved as
(just sum numbers as geometric progression)
The equation under RootOf will always have 1 real root different from 1 for 2.8 or any other positive number. You can solve it using some root-finding algorithm (1 is always a root but it does not solve the original task). For the constant a you can choose any number you like.
After computing the x you can easily calculate 10 numbers as .
I'm going to generalize and assume you want N numbers summing to V.
Since your numbers are equally spaced on an exponential you can write your sum as
a + a*x + a*x^2 + ... + a*x^(N-1) = V
Where the first point has value a, and the second a*x etc.
You can take out a factor of a and get:
a ( 1 + x + x^2 + ... + x^(N-1) ) = V
If we're free to pick x then we can solve for a easily
a = V / ( 1 + x + x^2 + .. x^(N-1) )
= V*(x-1)/(x^N-1)   (for x != 1; if x = 1, a = V/N)
Substituting that back into
a, a*x, a*x^2, ..., a*x^(N-1)
gives the required sequence
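A short sketch of this construction, using the closed form of the geometric series, 1 + x + ... + x^(N-1) = (x^N - 1)/(x - 1). The function name geometric_values is hypothetical, and the ratio x is a free choice (0 < x < 1 gives a decaying curve).

```python
def geometric_values(total, ratio, n=10):
    """Return n values a, a*x, ..., a*x^(n-1) that sum to total."""
    if ratio == 1:
        first = total / n                              # all values equal
    else:
        first = total * (ratio - 1) / (ratio ** n - 1) # a = V*(x-1)/(x^N-1)
    return [first * ratio ** k for k in range(n)]

values = geometric_values(2.8, 0.5)
print([round(y, 6) for y in values], round(sum(values), 6))
```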

Bisector of two vectors in 2D (may be collinear)

How can I find a bisector b = (bx, by) of two vectors in general? (We consider two non-zero vectors u = (ux, uy), v = (vx, vy), which may be collinear.)
For non-collinear vector we can write:
bx = ux/|u| + vx / |v|
by = uy/|u| + vy / |v|
But for collinear vectors
bx = by = 0.
Example:
u = (0 , 1)
v = (0, -1)
b = (0, 0)
A general and uniform approach is to get the angle of both vectors
theta_u = math.atan2(uy, ux)
theta_v = math.atan2(vy, vx)
and to create a new vector with the average angle:
middle_theta = (theta_u+theta_v)/2
(bx, by) = (cos(middle_theta), sin(middle_theta))
This way, you avoid the pitfall that you observed with opposite vectors.
PS: Note that there is an ambiguity in what the "bisector" vector is: there are generally two bisector vectors (typically one for the smaller angle and one for the larger angle). If you want the bisector vector inside the smaller angle, then your original formula is quite good; you may handle separately the special case that you observed for instance by taking a vector orthogonal to any of the two input vectors (-uy/|u|, ux/|u|) if your formula yields the null vector.
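The angle-averaging approach could be written as below. This is a sketch with an invented function name; note that math.atan2 takes the y component first. Because the angles are averaged directly, the result may be the bisector of the larger angle when the two vectors straddle the negative x axis, which is the ambiguity mentioned in the PS.

```python
from math import atan2, cos, sin

def bisector_by_angle(u, v):
    """Unit vector at the average polar angle of the 2D vectors u and v."""
    theta_u = atan2(u[1], u[0])   # atan2(y, x)
    theta_v = atan2(v[1], v[0])
    middle_theta = (theta_u + theta_v) / 2
    return (cos(middle_theta), sin(middle_theta))

print(bisector_by_angle((0, 1), (0, -1)))
print(bisector_by_angle((1, 0), (0, 1)))
```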
To find the unit bisector vector of u and v:
if u/|u| + v/|v| != 0:
first calculate the unit vectors of u and v
then use the parallelogram rule to get the bisector (just add them)
since they both have length 1, their sum lies along the bisector
then normalize the sum to get the unit bisector.
else (if u/|u| + v/|v| == 0):
(if you use the method above, you get an indeterminate form: 0*infinity = ?)
if you want the bisector of (u0v) and u/|u| = (cos(t), sin(t)),
take b = (cos(t+Pi/2), sin(t+Pi/2)) = (-sin(t), cos(t)) as the bisector
therefore if u/|u| = (a1, a2), choose b = (-a2, a1)
Example:
u=(0,1)
v=(0,-1)
the bisector of (u0v):
b=(-1,0)
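The two cases above can be combined into one small function. This is a sketch under assumptions: the name unit_bisector and the tolerance tol (used to decide when the sum of the unit vectors counts as zero) are choices made for this example.

```python
from math import hypot

def unit_bisector(u, v, tol=1e-12):
    """Unit bisector of 2D vectors u and v; falls back to a perpendicular of u if they are opposite."""
    len_u, len_v = hypot(*u), hypot(*v)
    if len_u == 0 or len_v == 0:
        raise ValueError("both vectors must be non-zero")
    a1, a2 = u[0] / len_u, u[1] / len_u            # unit vector of u
    sx, sy = a1 + v[0] / len_v, a2 + v[1] / len_v  # sum of the two unit vectors
    len_s = hypot(sx, sy)
    if len_s <= tol:
        return (-a2, a1)                           # opposite vectors: rotate u by 90 degrees
    return (sx / len_s, sy / len_s)

print(unit_bisector((0, 1), (0, -1)))
print(unit_bisector((1, 0), (0, 1)))
```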

Resources