Cosine similarity and tf-idf - information-retrieval

I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90."
Now I'm wondering....aren't they 2 different things?
Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.
I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?

Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
If d2 and q are tf-idf vectors, then
where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
(Formula taken from the Wikipedia, hence the d2.)

TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).
To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the general concept of document similarity in general.
You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
TF-IDF adjusts the raw term frequency by taking into account how frequent each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (image from Wikipedia):
Think of the 'log' as a minor nuance that helps things work out in the long run -- it grows when it's argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents), if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1).
Say you have 100 recipes, and all but one requires eggs, now you have three more documents that all contain the word "egg", once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents. Let's stick with 99).
The TF-IDF of 'egg' is:
1 * log (100/99) = 0.01 # document 1
2 * log (100/99) = 0.02 # document 2
1 * log (100/99) = 0.01 # document 3
These are all pretty small numbers; in contrast, let's look at another word that only occurs in 9 of your 100 recipe corpus: 'arugula'. It occurs twice in the first doc, three times in the second, and does not occur in the third document.
The TF-IDF for 'arugula' is:
1 * log (100/9) = 2.40 # document 1
2 * log (100/9) = 4.81 # document 2
0 * log (100/9) = 0 # document 3
'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').

The complete mathematical procedure for cosine similarity is explained in these tutorials
part-I
part-II
part-III
Suppose if you want to calculate cosine similarity between two documents, first step will be to calculate the tf-idf vectors of the two documents. and then find the dot product of these two vectors. Those tutorials will help you :)

tf/idf weighting has some cases where they fail and generate NaN error in code while computing. It's very important to read this:
http://www.p-value.info/2013/02/when-tfidf-and-cosine-similarity-fail.html

Tf-idf is just used to find the vectors from the documents based on tf - Term Frequency - which is used to find how many times the term occurs in the document and inverse document frequency - which gives the measure of how many times the term appears in the whole collection.
Then you can find the cosine similarity between the documents.

TFIDF is inverse documet frequency matrix and finding cosine similarity against document matrix returns similar listings

Related

r + tfidf and inverse document frequency

I was hoping that someone can explain a specific part of an academic paper and assist in writing R code for that section:
Name of Paper
Large-scale Analysis of Counseling Conversations:
An Application of Natural Language Processing to Mental Health (https://cs.stanford.edu/people/jure/pubs/counseling-tacl16.pdf)
On page 5, we have the following snippet:
"
...build a TF-IDF vector of word occurrences
to represent the language of counselors within this
subset. We use the global inverse document (i.e.,
conversation) frequencies instead of the ones from
each subset to make the vectors directly comparable
and control for different counselors having different numbers of conversations by weighting conversations so all counselors have equal contributions.
"
What does the paper mean by "global inverse document frequency"?
How can I code this in R with the different subsets (positive and negative counsellors for example)
Here is my sample code:
corp_pos_1 <- Corpus(VectorSource(positive_chats$Text1))
#corp_pos_1 <- tm_map(corp_pos_1, removeWords, stopwords("english"))
tdm_pos_1 <- DocumentTermMatrix(corp_pos_1,control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
ui = unique(tdm_pos_1 $i)
tdm_pos_1 = tdm_pos_1 [ui,]
cosine_tdm_pos_1 <- crossprod_simple_triplet_matrix(tdm_pos_1)/(sqrt(col_sums(tdm_pos_1^2) %*% t(col_sums(tdm_pos_1^2))))
In the code 'pos' stands for positive, and 'neg' would stand for negative.
The number at the end of the variable end shows the part of the chunk being calculated.
Now I have them chunked in 5 different parts trying to follow this paper. But how would I be able to calculate "global inverse document frequency"?
I think I have found this stackoverflow question from before but I am still not understanding the paper + what I need to do in R.
R: weighted inverse document frequency (tfidf) similarity between strings
TF/IDF is a well-known measure in information retrieval. For more information on it, and formulae that describe how to calculate it, see the Wikipedia page.
In short, you want to have words that are specific to texts; words that occur in all texts do not add any distinctive information. So, the inverse document frequency is the number of all documents divided by the number of documents that contain a given word. For common words such as the or of, the IDF would be 1.0, as we would assume they occurred in all texts. For that reason they are often excluded as stop words. IDF can also be scaled, eg by taking the logarithm.
If I understand your application correctly, you would take a term and divide the total number of documents by the number of negative documents that contain the term.

In tf-idf why do we normalize by document frequency and not average term frequency across all documents in the corpus?

Average term frequency would be the average frequency that term appears in other documents. Intuitively I want to compare how frequently it appears in this document relative to the other documents in the corpus.
An example:
d1 has the word "set" 100 times, d2 has the word "set" 1 time, d3 has the word "set" 1 time, d4-N does not have the word set
d1 has the word "theory" 100 times, d2 has the word "theory" 100 times, d3 has the word "theory" 100 times, d4-N does not have the word set
Document 1 has the same tf-idf for the word "set" and the word "theory" even though the word set is more important to d1 than theory.
Using average term frequency would distinguish these two examples. Is tf-iatf (inverse average term frequency) a valid approach? To me it would give me more important keywords, rather than just "rare" and "unique" keywords. If idf is "an estimate of how rare that word is" wouldn't iatf be a better estimate? It seems only marginally harder to implement (especially if the data is pre-processed).
I am thinking of running an experiment and manually analyzing the highest ranked keywords with each measure, but wanted to pass it by some other eyes first.
A follow-up question:
Why is tf-idf used so frequently as opposed to alternative methods like this which MAY be more accurate? (If this is a valid approach that is).
Update:
Ran an experiment where I manually analyzed the scores and corresponding top words for a few dozen documents, and it seems like iatf and inverse collection frequency (the standard approach to what I described) have super similar results.
Tf-idf is not meant to compare the importance of a word in a document across two corpora.
It is rather meant to distinguish the importance of a word within a document in relation to the distribution of the same term in the other documents of the same collection (not across collections).
A standard approach that you can apply for your case is: collection frequency, cf(t), instead of document frequency, df(t).
cf(t) measures how many times does a term t occurs in the corpus.
cf(t) divided by the total collection size would give you the probability
of sampling t from the collection.
And then you can compute a linear combination of tf(t,d) and cf(t) values, which gives you the probability of sampling a term t either from a document or from the collection.
P(t,d) = \lambda P(t|d) + (1-\lambda) P(t|Collection)
This is known by the name of Jelinek Mercer smoothed Language Model.
For your example (letting \lambda=0.5):
Corpus 1: P("set",d1) = 0.5*100/100 + 0.5*100/102
Corpus 2: P("set",d1) = 0.5*100/100 + 0.5*100/300
Clearly, P("set",d1) for corpus 2 is less (almost one-third) of that in corpus 1.

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

The formula for IDF is log( N / df t ) instead of just N / df t.
Where N = total documents in collection, and df t = document frequency of term t.
Log is said to be used because it “dampens” the effect of IDF. What does this mean?
Also, why do we use log frequency weighing for term frequency as seen here:
Debasis's answer is correct. I am not sure why he got downvoted.
Here is the intuition:
If term frequency for the word 'computer' in doc1 is 10 and in doc2 it's 20, we can say that doc2 is more relevant than doc1 for the word 'computer.
However, if the term frequency of the same word, 'computer', for doc1 is 1 million and doc2 is 2 millions, at this point, there is no much difference in terms of relevancy anymore because they both contain a very high count for term 'computer'.
Just like Debasis's answer, adding log is to dampen the importance of term that has a high frequency, e.g. Using log base 2, the count of 1 million will be reduced to 19.9!
We also add 1 to the log(tf) because when tf is equal to 1, the log(1) is zero. By adding one, we distinguish between tf=0 and tf=1.
Hope this helps!
It is not necessarily the case that more the occurrence of a term in a document more is the relevance... the contribution of term frequency to document relevance is essentially a sub-linear function... hence the log to approximate this sub-linear function...
the same is applicable for idf as well... a linear idf function may be boosting too much the document scores with high idf terms (which could be rare terms due to spelling mistakes)... a sublinear function performs much better...
I'll try to put my answer more in a practical aspect. Lets take two words - "The" and "Serendipity".
So here the first word "the", if our corpus is of 1000 documents will occur in almost every document but "serendipity" is a rare word and might occur is less documents, for instance we take as it has occurred in only one document.
So, when calculating the IDF of both -
IDF
Log(IDF)
The = 1000/1000 = 0
0
Serendipity = 1000/1 =1000
~6.9
Now we see if we had a TF of range around 0-20 then if our IDF was not a log(IDF) then definitely it would have dominated the TF but if taken as log(IDF) then it would have a equal effect on the result as TF has.
you can think like we are getting information content of word in entire corpus i.e information content = -log(p) = -log(n_i/N) = log(N/n_i).

USING TFIDF FOR RELATIVE FREQUENCY, COSINE SIMILARITY

I'm trying to use TFIDF for relative frequency to calculate cosine distance. I've selected 10 words from one document say: File 1 and selected another 10 files from my folder, using the 10 words and their frequency to check which of the 10 files are similar to File 1. Say Total number of files in folder are 46.i know that DF(is the no of documents the word appears in) IDF(is log(total no of files(46)/DF) and TFIDF(is the product of TF(frequency of the word in one doc) and IDF)
QUESTION:
Assuming what i said above is 100% correct, after getting the TFIDF for all 10 words in one document say: File 2, Do i add all the TFIDF for each of the 10 words together to get the TFIDF for File 2?
What is the cosine distance?
Could anyone help with an example?
The problem is you are confused between cosine similarity and tf-idf. While the former is a measure of similarity between two vectors (in this case documents), the latter simply is a technique of setting the components for the vectors to be eventually used in the former.
Particular to your question, it is rather inconvenient to select 10 terms from each document. I'd rather suggest to work with all terms. Let V be the total number of terms (the cardinality of the set of union over all documents in the collection). You can the represent each document as a vector of V dimensions. The ith component of a particular document D can be set to the tf-idf weight corresponding to that term (say t), i.e. D_i = tf(t,D)*idf(t)
Once you represent every document in your collection in this way, you can then compute the inter-document similarities in the following way.
cosine-sim(D, D') = (1/|D_1|*|D'|) * \sum_{i=1}^{V} D_i * D'_i
= (1/|D_1|*|D'|) * \sum_{i=1}^{V} tf(t,D)*idf(t)*tf(t,D')*idf(t)
Note that the contributing terms in this summation are only those ones which occur in both documents. If a term t occurs in D but not in D' then tf(t,D')=0 which thus contributes 0 to the sum.

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I try to solve is much more fundamental and does not depend on the ordering of the words
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well a simple way to solve your question is by shaping the data in a 2x2 matrix
obama | not obama
barack A B
not barack C D
and score all occuring bi-grams in the matrix. That way you can for instance use simple chi squared.
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the ith entry of the vector is 1 if the word is either the ith word or the (i+1)th word, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
if (text[j] == word[i] OR text[j+1] == word[i])
X[i][j] = 1;
else
X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the lenghts (the length is the square root of the number of appearances of the word, well maybe twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a], word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'? In this case, there are 7 pairs that include 'Barack' but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.

Resources