How can IDF be different for several documents? - information-retrieval

I am using LETOR to build an information retrieval system. It uses TF and IDF.
I am sure TF is query-dependent, and I would have thought IDF should be too, but the documentation says:
"Note that IDF is document independent, and so all the documents under a query have the same IDF values."
But that does not make sense to me, because IDF is part of the feature list. How is IDF calculated for each document?

IDF is term-specific: the IDF of any given term is document independent, whereas TF is document-specific.
To put it differently:
Let's say we have 3 documents.
doc id 1
"The quick brown fox jumps over the lazy dog"
doc id 2
"The Sly Fox Pub Annapolis is located on church circle"
doc id 3
"Located on Church Circle, in the heart of the Historic District"
Now if IDF is (number of documents) / (number of documents containing term t),
then the IDF of the term "fox" is 3/2 regardless of what the query is or which document is being scored (in practice the log of this ratio is usually taken, but that does not change the point). So IDF is a function of t alone.
TF, on the other hand, is a function of both t and d. So the TF of 'the' in doc id 1 is 2.
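A minimal sketch of this in Python, using the raw ratio from above rather than the usual log-scaled IDF, with the three toy documents from this answer:

from collections import Counter

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "The Sly Fox Pub Annapolis is located on church circle",
    3: "Located on Church Circle, in the heart of the Historic District",
}
# lowercase and strip trailing punctuation so 'Fox' and 'Circle,' match 'fox' and 'circle'
tokens = {d: [w.strip(",.").lower() for w in text.split()] for d, text in docs.items()}

def tf(term, doc_id):
    # term frequency: depends on the term AND the document
    return Counter(tokens[doc_id])[term]

def idf(term):
    # inverse document frequency: depends on the term only
    df = sum(1 for words in tokens.values() if term in words)
    return len(docs) / df if df else 0.0

print(tf("the", 1))   # 2
print(idf("fox"))     # 1.5, no matter which document is being scored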

To add to what jshen said:
IDF is a measure of how common (or rare) a particular word or n-gram is in the corpus that you are searching: it estimates how rare the word is, and thus its likely importance. So if a query contains an uncommon word, documents containing that rare word should be judged more important.

Related

Why is the support count of the candidate 3-itemset "bread milk diaper" given as 3? (Apriori algorithm)

Why is the support count of the candidate 3-itemset "bread milk diaper" given as 3, although it appears in only 2 transactions? Please check the Apriori algorithm / association mining chapter in the data mining textbook by Pang-Ning Tan, Vipin Kumar and Steinbach, and go through the image for more clarity on the question.
That slide does not contain the original transactions.
Try scanning the original database. You cannot get the exact counts just from the previous itemsets; you can only get upper bounds.
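Counting support by a direct scan is simple; here is a minimal sketch where the transaction database is a stand-in resembling the book's running example (substitute the actual transactions from the earlier slide):

# stand-in transaction database; replace with the one from the textbook slide
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support_count(itemset, transactions):
    # number of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"bread", "milk", "diaper"}, transactions))  # 2 in this stand-in database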

How to use the WmdSimilarity function provided in gensim with word embeddings that are in numpy.ndarray format

Using a Word2vec (skip-gram) model in TensorFlow, I wrote code to obtain word embeddings from my document set.
The final embeddings are in numpy.ndarray format.
Now, to find similar documents, I need to use the WMD (Word Mover's Distance) algorithm.
(I don't have much knowledge of gensim.)
gensim.similarities.WmdSimilarity() seems to require the embeddings to be a KeyedVectors object.
What can I do to implement WMD in my code? I have a tight deadline and can't spend much time writing WMD from scratch.
If you're looking for similarity between 2 words, use
my_gensim_word2vec_model.most_similar('king')
my_gensim_word2vec_model is the gensim model, of course, not your own TensorFlow model.
If you want the most similar to a bunch of words:
my_gensim_word2vec_model.most_similar(positive=['king', 'queen', 'rabbit'])
Check the gensim docs
If you're looking for similarity between sentences or documents, you're better off using doc2vec, which gives a vector for every vocabulary word and for each document.
Or take the average of all the word vectors in the sentence/document to get a vector for that document, then compute the cosine similarity between the averages of the two sentences being compared.
For example:
Similarity("Hello World", "Hi there") = CosineSimilarity(vec1, vec2)
"Hello World" -> (Vec("Hello") + Vec("World"))/2 -> vec1
"Hi there" -> (Vec("Hi") + Vec("there"))/2 -> vec2
(Your question is unclear: what exactly is your document set, and what is your task?)
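As for the original numpy.ndarray question: in recent gensim versions (4.x, as far as I know) you can wrap your own vectors in a KeyedVectors object and call wmdistance on it; WMD also needs an extra solver package (pyemd or POT) installed. A hedged sketch, where vocab and embeddings are assumed to come from your TensorFlow training:

import numpy as np
from gensim.models import KeyedVectors

# vocab: list of words, embeddings: numpy.ndarray of shape (len(vocab), dim)
# -- both assumed to come from your own TensorFlow skip-gram training
kv = KeyedVectors(vector_size=embeddings.shape[1])
kv.add_vectors(vocab, embeddings)

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()
print(kv.wmdistance(doc1, doc2))  # lower distance = more similar documents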
Hope this helps.

How to determine the upper bound of c when estimating Jaccard similarity between documents?

Let's say I have a million documents that I preprocessed (computed MinHash signatures for) in O(D*sqrt(D)) time, where D is the number of documents. Given a query document, I have to return, in O(sqrt(D)) time, the first of the million preprocessed documents whose Jaccard similarity with the query is greater than or equal to, say, 0.8.
If no document is similar enough to the query document to reach that score, I have to return a document with similarity at least c * 0.8 (where c < 1) with probability at least 1 - 1/e^2. How can I find the maximum value of c for this MinHash scheme?
Your orders of complexity/time don't sound right. Calculating the minhashes (signature) for a document should be roughly O(n), where n is the number of features (e.g., words, or shingles).
Finding all similar documents to a given document (with estimated similarity above a given threshold) should be roughly O(log(n)), where n is the number of candidate documents.
A document with (estimated) minimum .8 jaccard similarity will have at least 80% of its minhashes matching the given document. You haven't defined c and e for us, so I can't tell what your minimum threshold is -- I'll leave that to you -- but you can easily achieve this efficiently in a single pass:
Work through all your base document's hashes one by one. For each hash, look in your hash dictionary for all the other docs that share that hash, and keep a tally for each document found of how many hashes it shares. As soon as one of these tallies reaches 80% of the total number of hashes, you have found the winning document and can stop. If none of the tallies ever reaches the 0.8 threshold, continue to the end, then choose the document with the highest tally and decide whether that passes your minimum threshold.
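A minimal sketch of that single pass, assuming hash_index maps each minhash value to the set of document ids containing it (the names here are hypothetical):

from collections import Counter

def find_similar(query_hashes, hash_index, threshold=0.8):
    # query_hashes: the minhash signature of the query document
    # hash_index: dict mapping a minhash value -> set of doc ids sharing it
    needed = threshold * len(query_hashes)
    tally = Counter()
    for h in query_hashes:
        for doc_id in hash_index.get(h, ()):
            tally[doc_id] += 1
            if tally[doc_id] >= needed:
                # early exit: estimated similarity already >= threshold
                return doc_id, tally[doc_id] / len(query_hashes)
    if not tally:
        return None, 0.0
    best, count = tally.most_common(1)[0]
    # caller decides whether this fallback clears the relaxed c * threshold
    return best, count / len(query_hashes)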

Semantic matching of strings - using word2vec or S-Match?

I have the problem of classifying the relation between two strings as 'more general', 'less general', 'same meaning', 'opposite meaning', etc.
The strings can be from any domain. Assume that the strings can be from people's emails.
To give an example,
String 1 = "movies"
String 2 = "Inception"
Here I should know that Inception is less general than movies (a sort of is-a relationship)
String 1 = "Inception"
String 2 = "Christopher Nolan"
Here I should know that Inception is less general than Christopher Nolan
String 1 = "service tax"
String 2 = "service tax 2015"
At a glance it appears to me that S-Match will do the job, but I am not sure whether S-Match can be made to work on knowledge bases other than WordNet or GeoWordNet (as mentioned on their page).
If I use word2vec or dl4j, I guess it can give me similarity scores, but does it also support telling whether one string is more general or less general than the other?
I do see that word2vec can be trained on a large corpus such as Wikipedia.
Can someone shed light on the way forward?
Machine learning methods such as word2vec and dl4j model words based on the distributional hypothesis: they train models of words and phrases based on their contexts. There is no ontological aspect in these word models. At best, a model trained with these tools can tell you whether two words can appear in similar contexts; that is how their similarity measure works.
The Mikolov papers (a, b and c), which suggest that these models can learn "linguistic regularities", do not include any ontological analysis; they only suggest that the models can predict "similarity between members of the word pairs". That kind of prediction does not help with your task. These models cannot even distinguish similarity from relatedness (e.g., see the SimLex test set).
I would say that you need an ontological database to solve your problem. More specifically, regarding String 1 and String 2 in your examples:
String 1 = "a"
String 2 = "b"
You are trying to check entailment relations in sentences:
(1) "c is b"
(2) "c is a"
(3) "c is related to a".
where (1) entails (2), or (1) entails (3).
In your first two examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need syntactic parsing before the difference between the two phrases can be understood. For example, consider these phrases:
"men"
"all men"
"tall men"
"men in black"
"men in general"
Handling these requires some logical understanding. However, based on the economy of language you can assume that adding more words to a phrase usually makes it less general: longer phrases tend to be less general than shorter ones. This is not a precise tool, but it can help with phrases that do not contain special words such as "all", "general" or "every".
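For the is-a cases, a semantic knowledge base such as WordNet can already answer some instances. A minimal sketch using NLTK's WordNet interface (it only works for terms WordNet covers, so named entities like "Inception" or "Christopher Nolan" would need a different resource such as DBpedia or Wikidata):

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def is_less_general(term, more_general_term):
    # True if some sense of `term` has a sense of `more_general_term` among its hypernyms
    general_synsets = set(wn.synsets(more_general_term))
    for synset in wn.synsets(term):
        hypernyms = set(synset.closure(lambda s: s.hypernyms()))
        if hypernyms & general_synsets:
            return True
    return False

print(is_less_general("dog", "animal"))   # True: dog is-a animal
print(is_less_general("animal", "dog"))   # False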

How to select stop words using tf-idf? (non-English corpus)

I have managed to compute the tf-idf values for a given corpus. How can I find the stop words and the best words for each document? I understand that a low tf-idf for a given word and document means that the word is not a good one for selecting that document.
Stop words are words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents each term appears in and filter out those that appear in more than 50% of them, or the 500 most frequent, or some other threshold that you will have to tune.
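A minimal sketch of that document-frequency filter (the tokenized corpus and the 50% cutoff are placeholders you would tune):

from collections import Counter

def stop_words_by_df(tokenized_docs, max_doc_ratio=0.5):
    # document frequency: in how many documents does each term occur at least once
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    n_docs = len(tokenized_docs)
    return {term for term, count in df.items() if count / n_docs > max_doc_ratio}

# tokenized_docs is assumed to be a list of token lists, e.g. after simple whitespace splitting
# stop_words = stop_words_by_df(tokenized_docs, max_doc_ratio=0.5)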
The best (as in most representative) terms in a document are those with the highest tf-idf, because those terms are common in the document while being rare in the collection.
As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop words) get very low tf-idf anyway. However, they still affect some computations, and that can be wrong if you assume they are pure noise (which might not be true depending on the task). In addition, including them makes your algorithm slightly slower.
Edit:
As @FelipeHammel says, you can directly use the IDF (remember to invert the order) as a measure that is (inversely) proportional to the df. For ranking purposes this is completely equivalent, and therefore also for selecting the top k terms. However, it cannot be used directly to select based on ratios (e.g., words that appear in more than 50% of the documents), although a simple threshold fixes that (i.e., selecting terms with an IDF lower than a specific value). In general, a fixed number of terms is used.
I hope this helps.
From "Introduction to Information Retrieval" book:
tf-idf assigns to term t a weight in document d that is
highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
So words with the lowest tf-idf can be considered stop words.
