How to use Euclidean Distance to calculate similarity values for the three pairs of documents - information-retrieval

I am trying to work on this question however I am not sure how to use the Euclidean equation to find the solution.
Question:
Following are keywords, frequencies, and token counts from 3 other
documents.
Doc 4 – tablet: 7; memory: 5; apps: 8; sluggish: 5
Doc 5 – memory: 4; performance: 6; playbook: 8; apps: 6
Doc 6 –tablet: 6; performance: 3; playbook: 7; sluggish: 3
Token counts: Doc 4: 55 Doc 5: 60 Doc 6: 65
(i) Use Euclidean Distance to calculate similarity values for the three pairs of
documents (4,5), (4,6), (5,6) with relative frequency values. State the
distance for each pair to 4 decimal places (4 d.p.).
I have tried to use the Euclidean Distance formula with the given pairs of documents to find the distance for each pair.
This is the equation that I have tried to use:
dist((x, y), (a, b)) = √((x - a)² + (y - b)²)
According to the solutions this is what the answer should be:
Euclidean D4,D5 = 0.2343 to 4.d.p
Euclidean D5,D6 = 0.1693 to 4.d.p
Euclidean D4,D6 = 0.2153 to 4.d.p
Any help would be appreciated.

First you should build your document-term matrix from the term frequencies. The term frequency of a term is the number of times that term occurs in a document divided by the number of tokens in that document. That gives the table below:
term         Doc 4   Doc 5   Doc 6
tablet       7/55    0       6/65
memory       5/55    4/60    0
apps         8/55    6/60    0
sluggish     5/55    0       3/65
performance  0       6/60    3/65
playbook     0       8/60    7/65
As you mentioned the distance formula yourself, I will just calculate the distance between documents 4 and 5 as an example.
d(Document4,Document5) = [(7/55-0)^2 + (5/55-4/60)^2 + (8/55-6/60)^2 + (5/55-0)^2 + (0-6/60)^2 + (0-8/60)^2]^(1/2) = 0.23428614982 which is rounded to 0.2343.
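The same calculation for all three pairs can be sketched in Python. This is only an illustration: the names `docs`, `tokens`, `relative_frequencies` and `euclidean` are made up for the sketch, and a keyword that does not occur in a document is assumed to have frequency 0.

```python
from itertools import combinations
from math import sqrt

# Keyword counts and token counts as given in the question.
docs = {
    4: {"tablet": 7, "memory": 5, "apps": 8, "sluggish": 5},
    5: {"memory": 4, "performance": 6, "playbook": 8, "apps": 6},
    6: {"tablet": 6, "performance": 3, "playbook": 7, "sluggish": 3},
}
tokens = {4: 55, 5: 60, 6: 65}

def relative_frequencies(doc_id):
    """Divide each keyword count by the document's token count."""
    return {term: count / tokens[doc_id] for term, count in docs[doc_id].items()}

def euclidean(u, v):
    """Euclidean distance over the union of keywords (missing keyword = 0)."""
    terms = set(u) | set(v)
    return sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

for a, b in combinations(sorted(docs), 2):
    d = euclidean(relative_frequencies(a), relative_frequencies(b))
    print(f"Euclidean D{a},D{b} = {d:.4f}")
```

Rounded to 4 d.p., the three distances should agree with the solution values quoted in the question.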

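The Euclidean distance is a popular distance measure and the formula is as follows:
Suppose you have 2 points (a1,b1) and (a2,b2), then the Euclidean distance between these points is given as: SquareRoot( (a2-a1)^2 + (b2-b1)^2 ).
In your case each document is a point with one coordinate per keyword, in the same keyword order for both documents (here tablet, memory, apps, sluggish, performance, playbook), with 0 for a keyword the document does not contain:
Doc 4 - (7,5,8,5,0,0)
Doc 5 - (0,4,6,0,6,8)
So the formula to apply would be,
SquareRoot( (a2-a1)^2 + (b2-b1)^2 + (c2-c1)^2 + (d2-d1)^2 + (e2-e1)^2 + (f2-f1)^2 ).
Since the question asks for relative frequency values, divide each count by its document's token count before applying the formula.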

Wikipedia
The Euclidean distance between points p and q is the length of the line segment connecting them (pq).
In Cartesian coordinates, if p = (p1, p2,..., pn) and q = (q1, q2,..., qn) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p is given by the Pythagorean formula:
d(p, q) = d(q, p) = [(p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2]^(1/2)
Let's put the given documents on a common set of keywords like this.
Doc 4 – tablet: 7, memory: 5, apps: 8, sluggish: 5, playbook: 0, performance: 0
Doc 5 – tablet: 0, memory: 4, apps: 6, sluggish: 0, playbook: 8, performance: 6
Doc 6 – tablet: 6, memory: 0, apps: 0, sluggish: 3, playbook: 7, performance: 3
Then, according to the above formula applied to the raw counts,
D(Doc4, Doc5) = [(7-0)^2 + (5-4)^2 + (8-6)^2 + (5-0)^2 + (0-8)^2 + (0-6)^2]^(1/2) = [49+1+4+25+64+36]^(1/2) = 179^(1/2) ~= 13.38
You can calculate the other two pairs as I've done. To get the relative-frequency distances the question asks for, divide each count by its document's token count first.
If needed, let me know and I'll add a sample snippet to calculate this programmatically.

Related

Concatenation of binary representation of first n positive integers in O(logn) time complexity

I came across this question in a coding competition. Given a number n, concatenate the binary representation of first n positive integers and return the decimal value of the resultant number formed. Since the answer can be large return answer modulo 10^9+7.
N can be as large as 10^9.
Eg:- n=4. Number formed=11011100(1=1,10=2,11=3,100=4). Decimal value of 11011100=220.
I found a stack overflow answer to this question but the problem is that it only contains a O(n) solution.
Link:- concatenate binary of first N integers and return decimal value
Since n can be up to 10^9 we need to come up with solution that is better than O(n).
Here's some Python code that provides a fast solution; it uses the same ideas as in Abhinav Mathur's post. It requires Python >= 3.8, but it doesn't use anything particularly fancy from Python, and could easily be translated into another language. You'd need to write algorithms for modular exponentiation and modular inverse if they're not already available in the target language.
First, for testing purposes, let's define the slow and obvious version:
# Modulus that results are reduced by.
M = 10 ** 9 + 7
def slow_binary_concat(n):
    """
    Concatenate binary representations of 1 through n (inclusive).
    Reinterpret the resulting binary string as an integer.
    """
    # k = 0 only contributes a leading "0", which does not change the value.
    concatenation = "".join(format(k, "b") for k in range(n + 1))
    return int(concatenation, 2) % M
Checking that we get the expected result:
>>> slow_binary_concat(4)
220
>>> slow_binary_concat(10)
462911642
Now we'll write a faster version. First, we split the range [1, n) into subintervals such that within each subinterval, all numbers have the same length in binary. For example, the range [1, 10) would be split into four subintervals: [1, 2), [2, 4), [4, 8) and [8, 10). Here's a function to do that splitting:
def split_by_bit_length(n):
    """
    Split the numbers in [1, n) by bit-length.
    Produces triples (a, b, 2**k). Each triple represents a subinterval
    [a, b) of [1, n), with a < b, all of whose elements have bit-length k.
    """
    a = 1
    while n > a:
        b = 2 * a
        yield (a, min(n, b), b)
        a = b
Example output:
>>> list(split_by_bit_length(10))
[(1, 2, 2), (2, 4, 4), (4, 8, 8), (8, 10, 16)]
Now for each subinterval, the value of the concatenation of all numbers in that subinterval is represented by a fairly simple mathematical sum, which can be computed in exact form. Here's a function to compute that sum modulo M:
def subinterval_concat(a, b, l):
    """
    Concatenation of values in [a, b), all of which have the same bit-length k.
    l is 2**k.
    Equivalently, sum(i * l**(b - 1 - i) for i in range(a, b)) modulo M.
    """
    n = b - a
    inv = pow(l - 1, -1, M)
    q = (pow(l, n, M) - 1) * inv
    return (a * q + (q - n) * inv) % M
I won't go into the evaluation of the sum here: it's a bit off-topic for this site, and it's hard to express without a good way to render formulas. If you want the details, that's a topic for https://math.stackexchange.com, or a page of fairly simple algebra.
Finally, we want to put all the intervals together. Here's a function to do that.
def fast_binary_concat(n):
    """
    Fast version of slow_binary_concat.
    """
    acc = 0
    for a, b, l in split_by_bit_length(n + 1):
        acc = (acc * pow(l, b - a, M) + subinterval_concat(a, b, l)) % M
    return acc
A comparison with the slow version shows that we get the same results:
>>> fast_binary_concat(4)
220
>>> fast_binary_concat(10)
462911642
But the fast version can easily be evaluated for much larger inputs, where using the slow version would be infeasible:
>>> fast_binary_concat(10**9)
827129560
>>> fast_binary_concat(10**18)
945204784
You just have to note a simple pattern. Taking up your example for n=4, let's gradually build the solution starting from n=1.
1 -> 1 #1
2 -> 2^2(1) + 2 #6
3 -> 2^2[2^2(1)+2] + 3 #27
4 -> 2^3{2^2[2^2(1)+2]+3} + 4 #220
If you expand the coefficients of each term for n=4, you'll get the coefficients as:
1 -> (2^3)*(2^2)*(2^2)
2 -> (2^3)*(2^2)
3 -> (2^3)
4 -> (2^0)
Let N be the total number of bits in the string representation of our required number, and D(x) be the number of bits in x. The coefficients can then be written as
1 -> 2^(N-D(1))
2 -> 2^(N-D(1)-D(2))
3 -> 2^(N-D(1)-D(2)-D(3))
... and so on
Since the value of D(x) will be the same for all x in the range [2^t, 2^(t+1)-1] for some given t, you can break the problem into such ranges and solve for each range using mathematics (not iteration). Since the number of such ranges is about log2(n), this should work in the given time limit.
As an example, the various ranges become:
1. 1 (D(x) = 1)
2. 2-3 (D(x) = 2)
3. 4-7 (D(x) = 3)
4. 8-15 (D(x) = 4)
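The step-by-step pattern above (multiply by 2^D(k), then add k) can be written down directly as a check. This is only an O(n) reference sketch, not the range-based solution; the function name `concat_by_recurrence` is made up here, and D(k) is taken to be Python's `int.bit_length()`.

```python
M = 10**9 + 7

def concat_by_recurrence(n):
    """O(n) reference: value(k) = value(k - 1) * 2**D(k) + k, reduced modulo M."""
    acc = 0
    for k in range(1, n + 1):
        # Shift the running value left by the bit-length of k, then append k.
        acc = (acc * pow(2, k.bit_length(), M) + k) % M
    return acc

print([concat_by_recurrence(n) for n in range(1, 5)])  # [1, 6, 27, 220]
```

It is useful mainly for testing a range-based implementation on small inputs.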

How to allocate fixed resources to vertices of a graph so that low-degree vertices get more?

I have a simple problem to solve, but I'm stuck and I need your help. The problem is as follows:
I have three nodes A,B,C with degree 11, 6, 1 and I have 20 resources to allocate to each node based on their degree. I know when I want the large degree nodes to get more resources the formula is:
just sum all degrees (11+6+1)=18 and distribute the resources as follows:
A=(20/18)*11 and similar for the other nodes
What if I want the low degree node to get more resources than the large one? I mean Node C of degree 1 to get more resources than nodes A and B.
I tried this:
A = (20/18)* (11)^r
Such that when r=1, larger degree will have more resources. But the reverse is not true, for example when r=-1, the total resources won't sum to 20. What to do?
Is there a way in which for the same formula, I can plug in different values of r and give results for both low and large degree?
Your help will be much appreciated, Thank you in advance
When there are at least 2 nodes, one possibility is the following formula, with r=0 giving more resources to the highest degrees and r=1 giving more resources to the lowest degrees:
(degree_sum*r - nodes[n]) * resources / ((num_nodes*r - 1) * degree_sum)
Simplified, for r=1, the formula gives every node a share in proportion to "the sum of all degrees minus the degree of the current node" (with nodes[n] the degree of the current node):
(degree_sum - nodes[n]) * resources / ((num_nodes - 1) * degree_sum)
So, this would give:
A: (18 - 11) * 20 / (2 * 18)
B: (18 - 6) * 20 / (2 * 18)
C: (18 - 1) * 20 / (2 * 18)
Here is a demo program in Python to demonstrate the idea:
nodes = {'A': 11, 'B': 6, 'C': 1}
resources = 20
degree_sum = sum(nodes[n] for n in nodes)
num_nodes = len(nodes)
print("sum of all degrees:", degree_sum, "- number of nodes", num_nodes)
distribution_highest_more = {n: nodes[n] * resources / degree_sum for n in nodes}
print("higher degrees get more:", distribution_highest_more)
print(" total:", sum(distribution_highest_more[n] for n in distribution_highest_more))
distribution_lowest_more = {n: (degree_sum - nodes[n]) * resources / ((num_nodes - 1) * degree_sum) for n in nodes}
print("lower degrees get more:", distribution_lowest_more)
print(" total:", sum(distribution_lowest_more[n] for n in distribution_lowest_more))
for r in range(2):
    distribution = {n: (degree_sum*r - nodes[n]) * resources / ((num_nodes*r - 1) * degree_sum) for n in nodes}
    print("distribution for r =", r, ":")
    print(" ", distribution)
    print(" total:", sum(distribution[n] for n in distribution))
Which prints:
sum of all degrees: 18 - number of nodes 3
higher degrees get more: {'A': 12.222222222222221, 'B': 6.666666666666667, 'C': 1.1111111111111112}
total: 20.0
lower degrees get more: {'A': 3.888888888888889, 'B': 6.666666666666667, 'C': 9.444444444444445}
total: 20.0
distribution for r = 0 :
{'A': 12.222222222222221, 'B': 6.666666666666667, 'C': 1.1111111111111112}
total: 20.0
distribution for r = 1 :
{'A': 3.888888888888889, 'B': 6.666666666666667, 'C': 9.444444444444445}
total: 20.0
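The question also asks for a single formula with an exponent r. One way to keep the total at 20 for any r is to normalise by the sum of degree^r instead of the sum of degrees. The sketch below is an alternative to the formula above, not part of it; the function name `allocate` is made up, and all degrees are assumed to be positive so that negative r is defined.

```python
def allocate(degrees, resources, r):
    """Share `resources` in proportion to degree**r, normalised to sum to `resources`."""
    weights = {node: degree ** r for node, degree in degrees.items()}
    total_weight = sum(weights.values())
    return {node: resources * w / total_weight for node, w in weights.items()}

degrees = {'A': 11, 'B': 6, 'C': 1}
for r in (1, 0, -1):
    shares = allocate(degrees, 20, r)
    print("r =", r, shares, "total:", sum(shares.values()))
```

With r = 1 this reduces to the proportional split from the question, r = 0 gives every node the same share, and r < 0 favours the low-degree nodes.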

Is there a function f(n) that returns the n:th combination in an ordered list of combinations without repetition?

Combinations without repetitions look like this, when the number of elements to choose from (n) is 5 and elements chosen (r) is 3:
0 1 2
0 1 3
0 1 4
0 2 3
0 2 4
0 3 4
1 2 3
1 2 4
1 3 4
2 3 4
As n and r grow, the number of combinations gets large pretty quickly. For (n,r) = (200,4) the number of combinations is 64684950.
It is easy to iterate the list with r nested for-loops, where the initial value of each loop variable is greater than the current value of the loop variable it is nested in, as in this .NET Fiddle example:
https://dotnetfiddle.net/wHWK5o
What I would like is a function that calculates only one combination based on its index. Something like this:
tuple combination(i,n,r) {
    return [combination with index i, when the number of elements to choose from is n and elements chosen is r]
}
Does anyone know if this is doable?
You would first need to impose some sort of ordering on the set of all combinations available for a given n and r, such that a linear index makes sense. I suggest we agree to keep our combinations in increasing order (or, at least, the indices of the individual elements), as in your example. How then can we go from a linear index to a combination?
Let us first build some intuition for the problem. Suppose we have n = 5 (e.g. the set {0, 1, 2, 3, 4}) and r = 3. How many unique combinations are there in this case? The answer is of course 5-choose-3, which evaluates to 10. Since we will sort our combinations in increasing order, consider for a minute how many combinations remain once we have exhausted all those starting with 0. This must be 4-choose-3, or 4 in total. In such a case, if we are looking for the combination at index 7 initially, this implies we must subtract 10 - 4 = 6 and search for the combination at index 1 in the set {1, 2, 3, 4}. This process continues until we find a new index that is smaller than this offset.
Once this process concludes, we know the first digit. Then we only need to determine the remaining r - 1 digits! The algorithm thus takes shape as follows (in Python, but this should not be too difficult to translate),
from math import factorial
def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))
def combination_at_idx(idx, elems, r):
    if len(elems) == r:
        # We are looking for r elements in a list of size r - thus, we need
        # each element.
        return elems
    if len(elems) == 0 or len(elems) < r:
        return []
    combinations = choose(len(elems), r)  # total number of combinations
    remains = choose(len(elems) - 1, r)  # combinations after selection
    offset = combinations - remains
    if idx >= offset:  # combination does not start with first element
        return combination_at_idx(idx - offset, elems[1:], r)
    # We now know the first element of the combination, but *not* yet the next
    # r - 1 elements. These need to be computed as well, again recursively.
    return [elems[0]] + combination_at_idx(idx, elems[1:], r - 1)
Test-driving this with your initial input,
N = 5
R = 3
for idx in range(choose(N, R)):
    print(idx, combination_at_idx(idx, list(range(N)), R))
I find,
0 [0, 1, 2]
1 [0, 1, 3]
2 [0, 1, 4]
3 [0, 2, 3]
4 [0, 2, 4]
5 [0, 3, 4]
6 [1, 2, 3]
7 [1, 2, 4]
8 [1, 3, 4]
9 [2, 3, 4]
Where the linear index is zero-based.
Start with the first element of the result. The value of that element depends on the number of combinations you can get with smaller elements. For each such smaller first element, the number of combinations with first element k is n − k − 1 choose r − 1, with potentially some off-by-one corrections. So you would sum over a bunch of binomial coefficients. Wolfram Alpha can help you compute such a sum, but the result still has a binomial coefficient in it. Solving for the largest k such that the sum doesn't exceed your given index i is a computation you can't do with something as simple as e.g. a square root. You need a loop to test possible values, e.g. like this:
def first_naive(i, n, r):
    """Find first element and index of first combination with that first element.
    Returns a tuple of value and index.
    Example: first_naive(8, 5, 3) returns (1, 6) because the combination with
    index 8 is [1, 3, 4] so it starts with 1, and because the first combination
    that starts with 1 is [1, 2, 3] which has index 6.
    """
    s1 = 0
    for k in range(n):
        s2 = s1 + choose(n - k - 1, r - 1)
        if i < s2:
            return k, s1
        s1 = s2
You can reduce the O(n) loop iterations to O(log n) steps using bisection, which is particularly relevant for large n. In that case I find it easier to think about numbering items from the end of your list. In the case of n = 5 and r = 3 you get choose(2, 2)=1 combinations starting with 2, choose(3,2)=3 combinations starting with 1 and choose(4,2)=6 combinations starting with 0. So in the general choose(n,r) binomial coefficient you increase the n with each step, and keep the r. Taking into account that sum(choose(k,r) for k in range(r,n+1)) can be simplified to choose(n+1,r+1), you can eventually come up with bisection conditions like the following:
def first_bisect(i, n, r):
    nCr = choose(n, r)
    k1 = r - 1
    s1 = nCr
    k2 = n
    s2 = 0
    while k2 - k1 > 1:
        k3 = (k1 + k2) // 2
        s3 = nCr - choose(k3, r)
        if s3 <= i:
            k2, s2 = k3, s3
        else:
            k1, s1 = k3, s3
    return n - k2, s2
Once you know the first element to be k, you also know the index of the first combination with that same first element (also returned from my function above). You can use the difference between that first index and your actual index as input to a recursive call. The recursive call would be for r − 1 elements chosen from n − k − 1. And you'd add k + 1 to each element from the recursive call, since the top level returns values starting at 0 while the next element has to be greater than k in order to avoid duplication.
def combination(i, n, r):
    """Compute combination with a given index.
    Equivalent to list(itertools.combinations(range(n), r))[i].
    Each combination is represented as a tuple of ascending elements, and
    combinations are ordered lexicographically.
    Args:
        i: zero-based index of the combination
        n: number of possible values, will be taken from range(n)
        r: number of elements in result list
    """
    if r == 0:
        return []
    k, ik = first_bisect(i, n, r)
    return tuple([k] + [j + k + 1 for j in combination(i - ik, n - k - 1, r - 1)])
I've got a complete working example, including an implementation of choose, more detailed doc strings and tests for some basic assumptions.
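For a quick cross-check against `itertools.combinations`, here is a small self-contained sketch of the same idea using the linear scan rather than the bisection. The name `unrank_combination` is made up for this sketch, and it relies on `math.comb` (Python >= 3.8), which returns 0 when k > n.

```python
from itertools import combinations
from math import comb

def unrank_combination(i, n, r):
    """Return the i-th (zero-based) r-combination of range(n) in lexicographic order."""
    result = []
    start = 0
    for remaining in range(r, 0, -1):
        # Try each candidate for the next element in increasing order.
        for k in range(start, n):
            count = comb(n - k - 1, remaining - 1)  # combinations continuing with k
            if i < count:
                result.append(k)
                start = k + 1
                break
            i -= count  # skip all combinations that continue with k
        else:
            raise IndexError("combination index out of range")
    return tuple(result)

# Compare against the full enumeration for the example from the question.
expected = list(combinations(range(5), 3))
assert [unrank_combination(i, 5, 3) for i in range(comb(5, 3))] == expected
print(unrank_combination(8, 5, 3))  # (1, 3, 4)
```

The scan costs O(n) binomial evaluations per call, so the bisection above is still the better choice for large n.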

Find row of pyramid based on index?

Given a pyramid like:
0
1 2
3 4 5
6 7 8 9
...
and given the index of the pyramid i where i represents the ith number of the pyramid, is there a way to find the index of the row to which the ith element belongs? (e.g. if i = 6,7,8,9, it is in the 3rd row, starting from row 0)
There's a connection between the row numbers and the triangular numbers. The nth triangular number, denoted Tn, is given by Tn = n(n+1)/2. The first few triangular numbers are 0, 1, 3, 6, 10, 15, etc., and if you'll notice, row n starts at the nth triangular number (the fact that they come from this triangle is where the name comes from.)
So really, the goal here is to determine the largest n such that Tn ≤ i. Without doing any clever math, you could solve this in time O(√i) by just computing T0, T1, T2, etc. until you find something bigger than i. Even better, you could binary search for it in time O(log i) by computing T1, T2, T4, T8, etc. until you overshoot, then binary searching on the range you found.
Alternatively, we could try to solve for this directly. Suppose we want to find the choice of n such that
n(n + 1) / 2 = i
Expanding, we get
n² / 2 + n / 2 = i.
Equivalently,
n² / 2 + n / 2 - i = 0,
or, more easily:
n² + n - 2i = 0.
Now we use the quadratic formula:
n = (-1 ± √(1 + 8i)) / 2
The negative root we can ignore, so the value of n we want is
n = (-1 + √(1 + 8i)) / 2.
This number won't necessarily be an integer, so to find the row you want, we just round down:
row = ⌊(-1 + √(1 + 8i)) / 2⌋.
In code:
int row = int((-1 + sqrt(1 + 8 * i)) / 2);
Let's confirm that this works by testing it out a bit. Where does 9 go? Well, we have
(-1 + √(1 + 72)) / 2 = (-1 + √73) / 2 = 3.77
Rounding down, we see it goes in row 3 - which is correct!
Trying another one, where does 55 go? Well,
(-1 + √(1 + 440)) / 2 = (√441 - 1) / 2 = 10
So it should go in row 10. The tenth triangular number is T10 = 55, so in fact, 55 starts off that row. Looks like it works!
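As a sketch of the same closed form in Python, integer square roots avoid any floating-point rounding for large i. The function name `pyramid_row` is made up here, and `math.isqrt` requires Python >= 3.8.

```python
from math import isqrt

def pyramid_row(i):
    """Zero-based row of the pyramid that contains index i (i >= 0)."""
    # floor((-1 + sqrt(1 + 8i)) / 2), computed with exact integer arithmetic.
    return (isqrt(8 * i + 1) - 1) // 2

print([pyramid_row(i) for i in range(10)])  # [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(pyramid_row(55))  # 10
```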
I get row = math.floor(√(2i + 0.25) - 0.5), where i is your number.
This is essentially the same as the answer above, but I rewrote n² + n as (n + 0.5)² - 0.25.
I think the ith element belongs to the nth row, where n is the number satisfying n(n+1)/2 <= i < (n+1)(n+2)/2.
For example, if i = 6, then n = 3 because 3*4/2 = 6 <= 6 < 10 = 4*5/2,
and if i = 8, then n = 3 because 6 <= 8 < 10.

Minimum number of element required to make a sequence that sums to a particular number

Suppose there is a number s = 12, and I want to make a sequence whose elements satisfy a1 + a2 + ... + an = 12.
The criteria are as follows:
n must be minimal.
a1 and an must be 1.
ai can differ from a(i-1) only by 1, 0 or -1.
For s = 12 the result is 6.
So how do I find the minimum value of n?
Algorithm for finding n from a given s:
1. Find q = FLOOR( SQRT(s-1) )
2. Find r = q^2 + q
3. If s <= r then n = 2q, else n = 2q + 1
Example: s = 12
q = FLOOR( SQRT(12-1) ) = FLOOR( SQRT(11) ) = 3
r = 3^2 + 3 = 12
12 <= 12, therefore n = 2*3 = 6
Example: s = 160
q = FLOOR( SQRT(160-1) ) = FLOOR( SQRT(159) ) = 12
r = 12^2 + 12 = 156
160 > 156, therefore n = 2*12 + 1 = 25
and a 25-number sequence for 160 is
1,2,3,4,5,6,7,8,9,10,10,10,10,10,10,10,9,8,7,6,5,4,3,2,1
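The three steps translate almost directly into Python. This is a minimal sketch assuming s is an integer >= 1; the function name `min_length` is made up, and `math.isqrt` (Python >= 3.8) is used for the floored square root.

```python
from math import isqrt

def min_length(s):
    """Minimum sequence length n for a target sum s >= 1."""
    q = isqrt(s - 1)  # step 1: q = floor(sqrt(s - 1))
    r = q * q + q     # step 2: r = q^2 + q
    # step 3: an even length 2q suffices if s <= r, otherwise 2q + 1 is needed
    return 2 * q if s <= r else 2 * q + 1

print(min_length(12), min_length(160))  # 6 25
```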
Here's a way to visualize the solution.
First, draw the smallest triangle (rows containing successive odd numbers of stars) that has at least s stars. In this case, we draw a 16-star triangle.
   *
  ***
 *****
*******
Then we have to remove 16 - 12 = 4 stars. We do this diagonally starting from the top.
   1
  **2
 ****3
******4
The result is:
  **
 ****
******
Finally, add up the column heights to get the final answer:
1, 2, 3, 3, 2, 1.
There are two cases: n odd and n even. When n is even, the sequence with the largest sum is:
1, 2, 3, ..., n/2, n/2, n/2-1, n/2-2, ..., 1
and when n is odd it is:
1, 2, 3, ..., (n+1)/2, (n+1)/2-1, (n+1)/2-2, ..., 1
The maximum possible sum for any given series of length n is:
n is even => (n^2+2n)/4
n is odd => (n+1)^2/4
These two results are arrived at easily enough by looking at the simple arithmetic sum of the series: in the case of n even it is twice the sum of the series 1...n/2. In the case of n odd it is twice the sum of the series 1...(n-1)/2 plus (n+1)/2 (the middle element).
Clearly you can generate any positive number between n and this max.
So the problem then becomes finding the smallest n with a max greater than or equal to your target.
Algorithmically I'd go for:
Find (sqrt(4*s)-1) and round up to the next odd number. Call this M. This is an easy value to work out and represents the lowest odd n that will work.
Check M-1 to see if its max sum is at least s. If so, then your n is M-1. Otherwise your n is M.
Thanks to everyone who answered. I derived a simpler solution. The algorithm looks like this.
First find the maximum sum that can be made using n elements:
if n=1 -> 1 sum=1;
if n=2 -> 1,1 sum=2;
if n=3 -> 1,2,1 sum=4;
if n=4 -> 1,2,2,1 sum=6;
if n=5 -> 1,2,3,2,1 sum=9;
if n=6 -> 1,2,3,3,2,1 sum=12;
So from observation it is clear that any number s with 9 < s <= 12 can be
made using 6 elements; similarly, any number s with
6 < s <= 9 can be made using 5 elements.
So it requires only a binary search to find the number of
elements that make a particular number.
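A sketch of that binary search in Python follows. The closed form in `max_sum` is an assumption read off the table above (1, 2, 4, 6, 9, 12, ...), and both function names are made up for the illustration.

```python
def max_sum(n):
    """Largest sum reachable with n elements: 1, 2, 4, 6, 9, 12, ..."""
    half = n // 2
    # Even n: 1..half twice. Odd n: 1..half twice plus the middle element half + 1.
    return half * (half + 1) if n % 2 == 0 else (half + 1) ** 2

def min_length_bsearch(s):
    """Smallest n with max_sum(n) >= s, for an integer s >= 1."""
    lo, hi = 1, 1
    while max_sum(hi) < s:  # grow the upper bound until it is large enough
        hi *= 2
    while lo < hi:          # standard lower-bound binary search
        mid = (lo + hi) // 2
        if max_sum(mid) >= s:
            hi = mid
        else:
            lo = mid + 1
    return lo

print([min_length_bsearch(s) for s in (1, 9, 10, 12)])  # [1, 5, 6, 6]
```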
