How can I find a string similarity degree?

I am working on a keyword extraction system. After preprocessing, the system extracts candidate keywords by matching the text against the same patterns.
Now I want to know how to find the similarity between the senses of the extracted candidate keywords.
For example, consider the following matrix:
     k1   k2   k3
k1   1    ?1   ?2
k2   ?1   1    ?3
k3   ?2   ?3   1
How can I find the values marked (?), where
(?1) refers to the sense similarity degree between (k1) and (k2)
(?2) refers to the sense similarity degree between (k1) and (k3)
(?3) refers to the sense similarity degree between (k2) and (k3)
Note: a keyword can consist of one word or more.

You might want to check out WordNet::Similarity - it provides measures of similarity between senses of words as found in WordNet
http://wn-similarity.sourceforge.net
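Whatever similarity measure you end up with, filling the matrix from the question only requires evaluating it once per unordered pair, since the matrix is symmetric with 1 on the diagonal. A minimal sketch, assuming a pluggable function sim(a, b): the names similarity_matrix and token_jaccard are hypothetical, and token_jaccard is only a word-overlap stand-in, not a sense similarity; a WordNet-based measure would be passed in its place.

```python
def similarity_matrix(keywords, sim):
    """Build a symmetric matrix of pairwise similarities with 1.0 on the diagonal."""
    n = len(keywords)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each unordered pair is evaluated once and mirrored.
            matrix[i][j] = matrix[j][i] = sim(keywords[i], keywords[j])
    return matrix


def token_jaccard(a, b):
    """Stand-in measure only: word overlap between two (multi-word) keywords."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


keywords = ["machine learning", "deep learning", "data base"]
matrix = similarity_matrix(keywords, token_jaccard)
```

Because the measure is passed in as a function, multi-word keywords are handled by whatever that function does with them.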

Related

Design an algorithm that minimises the load on the most heavily loaded server

While reading the book by Aziz & Prakash (2021), I got a bit stuck on problem 3.7 and its associated solution, which I am trying to implement.
The problem says :
You have n users with unique hashes h1 through hn and
m servers, numbered 1 to m. User i has Bi bytes to store. You need to
find numbers K1 through Km such that all users with hashes between
Kj and Kj+1 get assigned to server j. Design an algorithm to find the
numbers K1 through Km that minimizes the load on the most heavily
loaded server.
The solution says:
Let L(a,b) be the maximum load on a server when
users with hash h1 through ha are assigned to servers S1 through Sb in
an optimal way so that the max load is minimised. We observe the
following recurrence:
L(a, b) = \min_{1 <= x <= a} \max( L(x, b-1), \sum_{i=x+1}^{a} B_i )
In other words, we find the right value of x such that if we pack the
first x users in b - 1 servers and the remaining users in the last server, the max
load on a given server is minimized.
Using this relationship, we can tabulate the values of L till we get
L(n,m). While computing L(a,b) when the values of L are tabulated
for all lower values of a and b we need to find the right value of x to
minimize the load. As we increase x, L(x,b-1) in the above expression increases and the sum term decreases. We can do binary search for x to find x that minimises their max.
I know that we can probably use some sort of dynamic programming, but how could we possibly implement this idea in code?
The dynamic programming algorithm is defined fairly well given that formula: Implementing a top-down DP algorithm just needs you to loop from x = 1 to a and record which one minimizes that max(L(x,b-1), sum(B_i)) expression.
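As a rough illustration of that top-down DP, here is a minimal sketch. It assumes the users are already ordered by hash, that loads is a non-empty list of non-negative byte counts, and that a server may stay empty; the function name min_max_load_dp is hypothetical. It does the plain loop over x rather than the binary search over x mentioned in the book's solution.

```python
from functools import lru_cache
from itertools import accumulate


def min_max_load_dp(loads, m):
    """Smallest possible maximum server load when splitting loads into m contiguous groups."""
    # prefix[k] is the total load of the first k users.
    prefix = [0] + list(accumulate(loads))

    @lru_cache(maxsize=None)
    def L(a, b):
        # With a single server, it takes all of the first a users.
        if b == 1:
            return prefix[a]
        # x users go to the first b - 1 servers, users x+1..a go to server b.
        return min(
            max(L(x, b - 1), prefix[a] - prefix[x])
            for x in range(1, a + 1)
        )

    return L(len(loads), m)


result_two = min_max_load_dp([7, 2, 5, 10, 8], 2)
result_three = min_max_load_dp([7, 2, 5, 10, 8], 3)
```

This runs in O(n^2 * m) time, which is fine for small inputs but is the main reason to prefer the approach described next.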
There is, however, a simpler (and faster) greedy/binary search algorithm for this problem that you should consider, which goes like this:
1. Compute prefix sums for B.
2. Find the minimum value of L such that we can partition B into m contiguous subarrays whose maximum sum is equal to L.
   - We know 1 <= L <= sum(B). So, perform a binary search to find L, with a helper function canSplit(v) that tests whether we can split B into such subarrays of sum <= v.
   - canSplit(v) works greedily: Remove as many elements from the start of B as possible so that our sum does not exceed v. Repeat this a total of m times; return True if we've used all of B.
   - You can use the prefix sums to run canSplit in O(m log n) time, with an additional inner binary search.
3. Given L, use the same strategy as the canSplit function to determine the m-1 partition points; find the m partition boundaries from there.
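The steps above could be sketched as follows. This is only an illustration under some assumptions: loads is a non-empty list of non-negative integers ordered by user hash, and the names min_max_load and greedy_cuts are hypothetical (greedy_cuts plays the role of canSplit but also returns the cut positions). The search starts at max(loads) rather than 1, since no smaller limit can be feasible.

```python
from bisect import bisect_right
from itertools import accumulate


def min_max_load(loads, m):
    """Return (L, cuts): the minimal maximum load and the end index of each group."""
    prefix = [0] + list(accumulate(loads))
    n = len(loads)

    def greedy_cuts(limit):
        # Greedily give each server as many users as fit under the limit.
        cuts, start = [], 0
        for _ in range(m):
            # Inner binary search: furthest end with prefix[end] - prefix[start] <= limit.
            end = bisect_right(prefix, prefix[start] + limit) - 1
            cuts.append(end)
            start = end
            if start == n:
                return cuts  # every user has been placed
        return None  # ran out of servers

    lo, hi = max(loads), prefix[-1]
    while lo < hi:
        mid = (lo + hi) // 2
        if greedy_cuts(mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo, greedy_cuts(lo)


best_load, cuts = min_max_load([7, 2, 5, 10, 8], 2)
```

Here cuts holds exclusive end indices into loads, so the boundaries K1 through Km can be read off from the hashes at those positions. If fewer than m servers are needed, cuts is shorter than m.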

How does one approach this challenge asked in an Amazon Interview?

I am struggling to optimise this past Amazon interview question involving a DAG.
This is what I tried (the code is long, so I would rather explain it):
Since the graph is a DAG and the relation is transitive, a simple traversal from every node should be enough.
So for every node I traverse, by transitivity, through all the possibilities to reach the end vertices, and then compare the vertices along the way to get the most noisy person.
During that traversal I have in fact found a most noisy person (maybe the only one) for every vertex on the traversal, so I memoize all of this in a mapping and mark those vertices as visited.
So I am maintaining an adjacency list for the graph, a visited/not-visited mapping, and a mapping for the output (the most noisy person for every vertex).
This way, by the time I get a query I do not have to recompute anything (in the case of duplicate queries).
The above code works, but since I cannot test it with test cases it may or may not pass the time limit. Is there a faster solution (maybe using DP)? I feel I am not exploiting the transitive and anti-symmetric conditions enough.
Obviously I am not checking the cases where a person is less wealthy than the current person. But, for instance, if I have pairs like (1,2), (1,3), (1,4), etc. and maybe (2,6), (2,7), (7,8), etc., then to find a person more wealthy than 1 I have to traverse every neighbor of 1 and then, I guess, the neighbors of every neighbor as well. This is done only once, as I store the results.
Edit (added question text):
Rounaq is graduating this year. And he is going to be rich. Very rich. So rich that he has decided to have
a structured way to measure his richness. Hence he goes around town asking people about their wealth,
and notes down that information.
Rounaq notes down the pair (Xi; Yi) if person Xi has more wealth than person Yi. He also notes down
the degree of quietness, Ki, of each person. Rounaq believes that noisy persons are a nuisance. Hence, for
each of his friends Ai, he wants to determine the most noisy (least quiet) person among those who have
wealth more than Ai.
Note that "has more wealth than" is a transitive and anti-symmetric relation. Hence if a has more wealth
than b, and b has more wealth than c then a has more wealth than c. Moreover, if a has more wealth than
b, then b cannot have more wealth than a.
Your task in this problem is to help Rounaq determine the most noisy person among the people having
more wealth for each of his friends ai, given the information Rounaq has collected from the town.
Input
First line contains T: The number of test cases
Each Test case has the following format:
N
K1 K2 K3 K4 . . . KN
M
X1 Y1
X2 Y2
. . .
. . .
XM YM
Q
A1
A2
. . .
. . .
AQ
N: The number of people in town
M: Number of pairs for which Rounaq has been able to obtain the wealth
information
Q: Number of Rounaq’s Friends
Ki: Degree of quietness of the person i
Xi; Yi: The pairs Rounaq has noted down (Pair of distinct values)
Ai: Rounaq’s ith friend
For each of Rounaq’s friends print a single integer - the degree of quietness of the most noisy person as required or -1 if there is no wealthier person for that friend.
Perform a topological sort on the pairs X, Y. Then iterate from the most wealthy down to the least wealthy, and store the most noisy person seen so far:
less wealthy -> most wealthy
<- person with lowest K so far <-
Then for each query, binary search the first person with greater wealth than the friend. The value we stored is the most noisy person with greater wealth than the friend.
UPDATE
It seems that we cannot rely on the data allowing for a complete topological sort. In this case, traverse sections of the graph that lead from known greatest to least wealth, storing for each person visited the most noisy person seen so far. The example you provided might look something like:
    3 - 5
   /    |
  1 - 2 |
     /  |
    4 --
Traversals:
1 <- 3 <- 5
1 <- 2
4 <- 2
4 <- 5
(Input)
2 1
2 4
3 1
5 3
5 4
8 2 16 26 16
(Queries and solution)
3 4 3 5 5
16 2 16 -1 -1
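One way to realise the traversal idea is a memoised depth-first search that, for each person, follows the "richer than" edges upwards and keeps the lowest quietness seen. This is a minimal sketch rather than a tuned solution: the function and variable names are hypothetical, people are assumed to be numbered 1..N, and the recursion could hit Python's recursion limit on very long chains (an iterative version or an explicit topological order would avoid that).

```python
from math import inf


def most_noisy_richer(n, quietness, pairs, queries):
    """For each queried person, the lowest quietness among all wealthier people, or -1."""
    # richer[y] lists everyone directly noted as wealthier than y.
    richer = [[] for _ in range(n + 1)]
    for x, y in pairs:
        richer[y].append(x)

    best = {}  # memo: person -> lowest quietness among all wealthier people

    def lowest_above(p):
        if p not in best:
            result = inf
            for r in richer[p]:
                # r itself is wealthier, and by transitivity so is everyone above r.
                result = min(result, quietness[r - 1], lowest_above(r))
            best[p] = result
        return best[p]

    return [-1 if lowest_above(a) == inf else lowest_above(a) for a in queries]


answers = most_noisy_richer(
    5,
    [8, 2, 16, 26, 16],
    [(2, 1), (2, 4), (3, 1), (5, 3), (5, 4)],
    [3, 4, 3, 5, 5],
)
```

Each person and each pair is processed once, so preprocessing is O(N + M) and every query afterwards is a dictionary lookup.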

Similarity between bags of words

I have three bags of words:
BoW1 = [word11, word12, word13]
BoW2 = [word21, word22, word23]
BoW3 = [word31, word32, word33]
BoW1 contains synonym words, and BoW2 also contains synonym words. Both BoW1 and BoW2 are fixed. BoW3 contains the words of a document, so it is a multiset.
I want to search BoW3 to see if it contains any word of BoW1 or BoW2. Then I would like to calculate the similarity between BoW1 + BoW2 (taken together) and BoW3. I am not interested in calculating the similarity between BoW1 and BoW2; for the calculation I can assume that they are one bag. However, in my case, BoW1 contains more significant words than BoW2.
What do you think is the best and most accurate way to calculate such a similarity? I thought of using term frequency as in the information retrieval field. However, I am not sure whether repetition is important in my case.
You probably want the cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity). Compute the dot product between each pair of normalised bag-of-words vectors. If you're using Python, your code will look something like:
from math import sqrt

# Make sure each BoW is a map from word -> frequency
BoW1 = {"word11": 1, "word12": 5, "word13": 3}
BoW2 = ...
BoW3 = ...
# Normalise each vector to unit (Euclidean) length, so the dot product below is the cosine
BoW1_norm = sqrt(sum(freq ** 2 for freq in BoW1.values()))
BoW1 = {word: freq / BoW1_norm for word, freq in BoW1.items()}
BoW2_norm = ...
...
# Compute the dot product over the words the two bags share
similarity = 0
for word in set(BoW1.keys()).intersection(BoW2.keys()):
    similarity += BoW1[word] * BoW2[word]
... # continue for each pair you want to work out the similarities
Of course, organise the code better than this ^ (write functions for all the things you need to do multiple times, etc) but this should give you the rough idea.
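For the specific setup in the question (BoW1 and BoW2 merged into one query, with BoW1 counting for more), a self-contained version might look like the sketch below. The example words and the weights 2.0 and 1.0 are purely illustrative assumptions, and cosine_similarity is a hypothetical helper name.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word -> weight vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


bow1 = ["fast", "quick", "rapid"]      # the more significant synonym set
bow2 = ["car", "auto", "vehicle"]      # the less significant synonym set
document = ["the", "quick", "car", "is", "a", "quick", "vehicle"]

# Merge BoW1 and BoW2 into one query vector, weighting BoW1 higher (assumed weights).
query = Counter()
for word in bow1:
    query[word] += 2.0
for word in bow2:
    query[word] += 1.0

# The document is a multiset, so Counter gives its term frequencies directly.
score = cosine_similarity(query, Counter(document))
```

If repetition should not matter, replacing Counter(document) with Counter(set(document)) turns the document side into a binary presence vector.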

Using TF-IDF for relative frequency, cosine similarity

I'm trying to use TF-IDF for relative frequency to calculate cosine distance. I've selected 10 words from one document, say File 1, and selected another 10 files from my folder, using the 10 words and their frequencies to check which of the 10 files are similar to File 1. Say the total number of files in the folder is 46. I know that DF is the number of documents the word appears in, IDF is log(total number of files (46) / DF), and TF-IDF is the product of TF (the frequency of the word in one document) and IDF.
QUESTION:
Assuming what I said above is 100% correct, after getting the TF-IDF for all 10 words in one document, say File 2, do I add the TF-IDF values of the 10 words together to get the TF-IDF for File 2?
What is the cosine distance?
Could anyone help with an example?
The problem is you are confused between cosine similarity and tf-idf. While the former is a measure of similarity between two vectors (in this case documents), the latter simply is a technique of setting the components for the vectors to be eventually used in the former.
Regarding your question in particular, it is rather inconvenient to select 10 terms from each document. I'd suggest working with all terms instead. Let V be the total number of terms (the cardinality of the union of the term sets of all documents in the collection). You can then represent each document as a vector of V dimensions. The ith component of a particular document D can be set to the tf-idf weight of the corresponding term t_i, i.e. D_i = tf(t_i,D)*idf(t_i)
Once you represent every document in your collection in this way, you can then compute the inter-document similarities in the following way.
cosine-sim(D, D') = (1/(|D|*|D'|)) * \sum_{i=1}^{V} D_i * D'_i
= (1/(|D|*|D'|)) * \sum_{i=1}^{V} tf(t_i,D)*idf(t_i)*tf(t_i,D')*idf(t_i)
Note that the contributing terms in this summation are only those ones which occur in both documents. If a term t occurs in D but not in D' then tf(t,D')=0 which thus contributes 0 to the sum.
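As a small illustration of the procedure above, the sketch below builds sparse tf-idf vectors for a toy collection and compares them. It assumes raw counts for tf and log(N / df) for idf, as in the question; the function names and the three toy documents are made up for the example.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """Turn tokenised documents into sparse term -> tf-idf weight dictionaries."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts once per term
    idf = {term: log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


def cosine_sim(u, v):
    """Dot product over shared terms, divided by the product of the vector lengths."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # avoids a division by zero for all-zero vectors
    return dot / (norm_u * norm_v)


docs = [
    "apple banana apple".split(),
    "apple banana banana".split(),
    "cherry date".split(),
]
vectors = tfidf_vectors(docs)
sim_first_second = cosine_sim(vectors[0], vectors[1])
sim_first_third = cosine_sim(vectors[0], vectors[2])
```

Only terms present in both dictionaries contribute to the dot product, which is the sparse equivalent of summing over all V dimensions.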

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90."
Now I'm wondering....aren't they 2 different things?
Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.
I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
If d2 and q are tf-idf vectors, then
\cos\theta = (d_2 \cdot q) / (\|d_2\| \|q\|)
where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
(Formula taken from the Wikipedia, hence the d2.)
TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).
To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to document similarity in general.
You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
TF-IDF adjusts the raw term frequency by taking into account how frequently each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (formula from Wikipedia):
idf(t, D) = \log( N / |{d \in D : t \in d}| )
where N is the total number of documents in the corpus D.
Think of the 'log' as a minor nuance that helps things work out in the long run -- it grows when its argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents), if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1).
Say you have 100 recipes, and all but one require eggs. Now you have three more documents that all contain the word "egg": once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents. Let's stick with 99).
The TF-IDF of 'egg' is:
1 * log (100/99) = 0.01 # document 1
2 * log (100/99) = 0.02 # document 2
1 * log (100/99) = 0.01 # document 3
These are all pretty small numbers; in contrast, let's look at another word that only occurs in 9 of your 100 recipe corpus: 'arugula'. It occurs once in the first doc, twice in the second, and does not occur in the third document.
The TF-IDF for 'arugula' is:
1 * log (100/9) = 2.41 # document 1
2 * log (100/9) = 4.82 # document 2
0 * log (100/9) = 0 # document 3
'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').
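The arithmetic in this example is easy to reproduce. The sketch below assumes the natural logarithm and a document frequency counted over the original 100 recipes; it uses the term frequencies that appear in the computation lines above (1, 2, 1 for 'egg' and 1, 2, 0 for 'arugula'), and tf_idf is just a hypothetical helper name.

```python
from math import log


def tf_idf(tf, n_docs, df):
    """Raw term frequency scaled by the log inverse document frequency."""
    return tf * log(n_docs / df)


# 'egg' appears in 99 of the 100 recipes, 'arugula' in only 9 of them.
egg = [tf_idf(tf, 100, 99) for tf in (1, 2, 1)]
arugula = [tf_idf(tf, 100, 9) for tf in (1, 2, 0)]

for doc_number, (e, a) in enumerate(zip(egg, arugula), start=1):
    print(f"document {doc_number}: egg={e:.2f} arugula={a:.2f}")
```

The gap of roughly two orders of magnitude between the two terms is what lets the rare term dominate the document vectors.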
The complete mathematical procedure for cosine similarity is explained in these tutorials
part-I
part-II
part-III
Suppose you want to calculate the cosine similarity between two documents: the first step is to calculate the tf-idf vectors of the two documents, and the next is to find the dot product of these two vectors (divided by the product of their lengths). Those tutorials will help you :)
tf-idf weighting fails in some cases and produces NaN errors in code during the computation. It's very important to read this:
http://www.p-value.info/2013/02/when-tfidf-and-cosine-similarity-fail.html
Tf-idf is just used to build the vectors from the documents. It is based on tf (term frequency), which counts how many times the term occurs in the document, and idf (inverse document frequency), which measures how rare the term is across the whole collection.
Then you can find the cosine similarity between the documents.
TF-IDF gives a term frequency-inverse document frequency matrix, and computing the cosine similarity against that document matrix returns similar listings.

Resources