Similarity between bags of words - similarity

I have three bags of words:
BoW1 = [word11, word12, word13]
BoW2 = [word21, word22, word23]
BoW3 = [word31, word32, word33]
BoW1 contains synonym words, BoW2 also contain synonym words. Both BoW1 and BoW are fixed. BoW3 contains words of a document, so it is multiset.
I want to search BoW3 to see if it contains any word of BoW1 and BoW2. Then, I would like to calculate the similarity between Bow1 + BoW2 and BoW3. So, together BoW1 and BoW2. I am not interested in calculating the similarity between BoW1 and BoW2, in calculating I can assume that they are one. However, for my case, BoW1 contains significant words than BoW2.
What do you think is the best and accurate way to calculate such similarity. I though to use term frequency as in Information retrieval filed. However, I am not sure if repetition is important in my case.

You are probably wanting the cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity). Compute the dot product between each bag of words vector. If you're using Python, your code will look something like:
# Make sure each BoW is a map from word -> frequency
BoW1 = {word11: 1, word12: 5, word13: 3}
BoW2 = ...
BoW3 = ...
# Normalise the frequencies
BoW1_total = sum([freq for freq in BoW1.values()])
BoW1 = {word : freq / BoW1_total for word, freq in BoW1.items()}
BoW2_total = ...
...
# Compute the dot product
similarity = 0
for word in set(BoW1.keys()).intersection(BoW2.keys()):
similarity += BoW1[word] * BoW2[word]
... # continue for each pair you want to work out the similarities
Of course, organise the code better than this ^ (write functions for all the things you need to do multiple times, etc) but this should give you the rough idea.

Related

Mathematical function for string similarity score

I'm working on a string similarity algorithm, and was thinking on how to give a score between 0 and 1 when comparing two strings. The two variables for this function are the Levenshtein distance D: (added, removed and changed characters) and the maximum length of the two strings L (but you could also take the average).
My initial algorithm was just 1-D/L but this gave too high scores for short strings, e.g. 'tree' and 'bee' would get a score of 0.5, and too low scores for longer strings which have more in common even if half of the characters is different.
Now I'm looking for a mathematical function that can output a better score. I wasn't able to come up with one, so I sketched this height map of a 3D plot (L is x and D = y).
Does anyone know how to convert such a graph to an equation, if I would be better off to just create a lookup table or if there is an existing solution?

USING TFIDF FOR RELATIVE FREQUENCY, COSINE SIMILARITY

I'm trying to use TFIDF for relative frequency to calculate cosine distance. I've selected 10 words from one document say: File 1 and selected another 10 files from my folder, using the 10 words and their frequency to check which of the 10 files are similar to File 1. Say Total number of files in folder are 46.i know that DF(is the no of documents the word appears in) IDF(is log(total no of files(46)/DF) and TFIDF(is the product of TF(frequency of the word in one doc) and IDF)
QUESTION:
Assuming what i said above is 100% correct, after getting the TFIDF for all 10 words in one document say: File 2, Do i add all the TFIDF for each of the 10 words together to get the TFIDF for File 2?
What is the cosine distance?
Could anyone help with an example?
The problem is you are confused between cosine similarity and tf-idf. While the former is a measure of similarity between two vectors (in this case documents), the latter simply is a technique of setting the components for the vectors to be eventually used in the former.
Particular to your question, it is rather inconvenient to select 10 terms from each document. I'd rather suggest to work with all terms. Let V be the total number of terms (the cardinality of the set of union over all documents in the collection). You can the represent each document as a vector of V dimensions. The ith component of a particular document D can be set to the tf-idf weight corresponding to that term (say t), i.e. D_i = tf(t,D)*idf(t)
Once you represent every document in your collection in this way, you can then compute the inter-document similarities in the following way.
cosine-sim(D, D') = (1/|D_1|*|D'|) * \sum_{i=1}^{V} D_i * D'_i
= (1/|D_1|*|D'|) * \sum_{i=1}^{V} tf(t,D)*idf(t)*tf(t,D')*idf(t)
Note that the contributing terms in this summation are only those ones which occur in both documents. If a term t occurs in D but not in D' then tf(t,D')=0 which thus contributes 0 to the sum.

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I try to solve is much more fundamental and does not depend on the ordering of the words
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well a simple way to solve your question is by shaping the data in a 2x2 matrix
obama | not obama
barack A B
not barack C D
and score all occuring bi-grams in the matrix. That way you can for instance use simple chi squared.
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the ith entry of the vector is 1 if the word is either the ith word or the (i+1)th word, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
if (text[j] == word[i] OR text[j+1] == word[i])
X[i][j] = 1;
else
X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the lenghts (the length is the square root of the number of appearances of the word, well maybe twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a], word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'? In this case, there are 7 pairs that include 'Barack' but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.

Calculating Cosine Similarity of two Vectors of Different Size

I have 2 questions,
I've made a vector from a document by finding out how many times each word appeared in a document. Is this the right way of making the vector? Or do I have to do something else also?
Using the above method I've created vectors of 16 documents, which are of different sizes. Now i want to apply cosine similarity to find out how similar each document is. The problem I'm having is getting the dot product of two vectors because they are of different sizes. How would i do this?
Sounds reasonable, as long as it means you have a list/map/dict/hash of (word, count) pairs as your vector representation.
You should pretend that you have zero values for the words that do not occur in some vector, without storing these zeros anywhere. Then, you can use the following algorithm to compute the dot product of these vectors (pseudocode):
algorithm dot_product(a : WordVector, b : WordVector):
dot = 0
for word, x in a do
y = lookup(word, b)
dot += x * y
return dot
The lookup part can be anything, but for speed, I'd use hashtables as the vector representation (e.g. Python's dict).

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90."
Now I'm wondering....aren't they 2 different things?
Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.
I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
If d2 and q are tf-idf vectors, then
where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
(Formula taken from the Wikipedia, hence the d2.)
TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).
To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the general concept of document similarity in general.
You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
TF-IDF adjusts the raw term frequency by taking into account how frequent each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (image from Wikipedia):
Think of the 'log' as a minor nuance that helps things work out in the long run -- it grows when it's argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents), if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1).
Say you have 100 recipes, and all but one requires eggs, now you have three more documents that all contain the word "egg", once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents. Let's stick with 99).
The TF-IDF of 'egg' is:
1 * log (100/99) = 0.01 # document 1
2 * log (100/99) = 0.02 # document 2
1 * log (100/99) = 0.01 # document 3
These are all pretty small numbers; in contrast, let's look at another word that only occurs in 9 of your 100 recipe corpus: 'arugula'. It occurs twice in the first doc, three times in the second, and does not occur in the third document.
The TF-IDF for 'arugula' is:
1 * log (100/9) = 2.40 # document 1
2 * log (100/9) = 4.81 # document 2
0 * log (100/9) = 0 # document 3
'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').
The complete mathematical procedure for cosine similarity is explained in these tutorials
part-I
part-II
part-III
Suppose if you want to calculate cosine similarity between two documents, first step will be to calculate the tf-idf vectors of the two documents. and then find the dot product of these two vectors. Those tutorials will help you :)
tf/idf weighting has some cases where they fail and generate NaN error in code while computing. It's very important to read this:
http://www.p-value.info/2013/02/when-tfidf-and-cosine-similarity-fail.html
Tf-idf is just used to find the vectors from the documents based on tf - Term Frequency - which is used to find how many times the term occurs in the document and inverse document frequency - which gives the measure of how many times the term appears in the whole collection.
Then you can find the cosine similarity between the documents.
TFIDF is inverse documet frequency matrix and finding cosine similarity against document matrix returns similar listings

Resources