Information retrieval system - information-retrieval

What is the difference between TF-IDF ranking of documents and the Binary Independence Model of ranking? I'm unable to differentiate them.
I think a practical implementation of the Binary Independence Model leads to TF-IDF. Please correct me if I'm wrong.

The main difference is that in the Binary Independence Model there is no notion of how important a word is; all words are treated the same. Weighting words with TF-IDF gives higher scores to words that occur frequently within one document but appear in few documents overall.

You are correct. The Binary Independence Model assumes that documents are binary vectors: only the presence or absence of terms in a document is recorded. In the Vector Space Model, by contrast, documents are represented by a vector of term weights, and TF-IDF is just one way to define those weights.
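To make the contrast concrete, here is a minimal sketch using scikit-learn; the toy corpus and the vectorizer settings are purely illustrative assumptions, not anything from the question.

    # Minimal sketch: binary term presence vs. TF-IDF term weights for the same corpus.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = [
        "information retrieval ranks documents",
        "retrieval retrieval retrieval of information",
        "binary models record only term presence",
    ]

    # Binary Independence Model style: 1 if the term occurs in the document, else 0.
    binary_vec = CountVectorizer(binary=True)
    print(binary_vec.fit_transform(docs).toarray())

    # Vector Space Model with TF-IDF weights: terms frequent in a document but rare
    # across the collection get the largest weights.
    tfidf_vec = TfidfVectorizer()
    print(tfidf_vec.fit_transform(docs).toarray().round(2))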

Related

Is the information captured by Doc2Vec a subset of the information captured by BERT?

Both Doc2Vec and BERT are NLP models used to create vectors for text. The original BERT model produces vectors of size 768, while the original Doc2Vec model used vectors of size 300. Would it be reasonable to assume that all the information captured by D2V is a subset of the information captured by BERT?
I ask, because I want to think about how to compare differences in representations for a set of sentences between models. I am thinking I could project the BERT vectors into a D2V subspace and compare those vectors to the D2V vectors for the same sentence, but this relies on the assumption that the subspace I'm projecting the BERT vectors into is actually comparable (i.e., the same type of information) to the D2V space.
The objective functions, while different, are quite similar. The Cloze task for BERT and the next-word prediction for D2V are both trying to create associations between a word and its surrounding words. BERT can look bidirectionally, while D2V only looks at a window and moves from the left to the right of a sentence. The same objective function doesn't necessarily mean that they're capturing the same information, but it seems as though the way D2V does it (the covariates it uses) is a subset of the covariates used by BERT.
Interested to hear other people's thoughts.
I'll assume by Doc2Vec you mean the "Paragraph Vector" algorithm, which is often called Doc2Vec (including in libraries like Python Gensim).
That Doc2Vec is closely related to word2vec: it's essentially word2vec with a synthetic floating pseudoword vector over the entire text. It models texts via a shallow network that can't really consider word-order, or the composite-meaning of word runs, except in a very general 'nearness' sense.
So, a Doc2Vec model will not generate realistic/grammatical completions/summaries from vectors (except perhaps in very-limited single-word tests).
What info Doc2Vec most captures can be somewhat influenced by parameter choices, especially choice-of-mode and window (in modes where that matters, like when co-training word-vectors).
BERT is a far deeper model with more internal layers and a larger default dimensionality of text-representations. Its training mechanisms give it the potential to differentiate between significant word-orderings – and thus be sensitive to grammar and composite phrases beyond what Doc2Vec can learn. It can generate plausible multi-word completions/summarizations.
You could certainly train a 768-dimension Doc2Vec model on the same texts as a BERT model and compare the results. The resulting summary text-vectors from the two models would likely perform quite differently on key tasks. If you need to detect subtle shifts in meaning in short texts – things like the reversal of meaning from the insertion of a single 'not' – I'd expect the BERT model to dominate (if sufficiently trained). On broader tasks less sensitive to grammar, like topic classification, the Doc2Vec model might be competitive, or (given its simplicity) attractive in its ability to achieve certain targets with far less data or quicker training.
So, it'd be improper to assume that what Doc2Vec captures is a proper subset of what BERT does.
You could try learning a mapping from one model to the other (possibly including dimensionality-reduction), as there are surely many consistent correlations between the trained coordinate-spaces. But the act of creating such a mapping requires starting assumptions that certain vectors "should" line-up, or be in similar configurations.
If trying to understand what's unique/valuable across the two options, it's likely better to compare how the models rank a text's neighbors – do certain kinds of similarities dominate in one or the other? Or, try both as inputs to downstream classification/info-retrieval tasks, and see where they each shine.
(With sufficient data & training time, I'd expect BERT, as the more sophisticated model, to usually provide better results – especially if it's also allotted a larger representation. But for some tasks, and with limited data/compute/time resources, Doc2Vec might shine.)
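If you want to try the neighbor-ranking comparison suggested above, here is a rough sketch using Gensim's Doc2Vec and the sentence-transformers package as a stand-in for a BERT-style encoder. The corpus, the pretrained model name, and the hyperparameters are all placeholder assumptions, not recommendations.

    # Rough sketch: compare how Doc2Vec and a BERT-style encoder rank a text's neighbors.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

    texts = [
        "the bank approved the loan",
        "the bank did not approve the loan",
        "the river bank was muddy",
        "loan applications were approved quickly",
    ]

    # Doc2Vec: train on the same texts (tiny corpus here, purely illustrative).
    tagged = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]
    d2v = Doc2Vec(tagged, vector_size=100, window=3, min_count=1, epochs=50)
    d2v_vecs = np.array([d2v.infer_vector(t.split()) for t in texts])

    # Pretrained BERT-style sentence encoder (the model name is an assumption).
    bert = SentenceTransformer("all-MiniLM-L6-v2")
    bert_vecs = bert.encode(texts)

    def neighbor_ranking(vecs, query_idx):
        sims = cosine_similarity(vecs[query_idx:query_idx + 1], vecs)[0]
        return np.argsort(-sims)  # indices ordered from most to least similar

    # Compare the orderings each model produces for the same query text.
    print("Doc2Vec order:", neighbor_ranking(d2v_vecs, 0))
    print("BERT order:   ", neighbor_ranking(bert_vecs, 0))

Looking at where the two orderings agree and disagree (for example, whether the 'not' sentence is ranked as the nearest neighbor) is one simple way to probe what each representation is sensitive to.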

Discrepancy Between Two Methods of Finding Information Entropy

So I learned about the concept of information entropy from Khan Academy, where it was phrased as the "average number of yes-or-no questions needed per symbol". They also gave an alternative form using logarithms.
So let's say we have a symbol generator that produces A,B, and C.
P(A)=1/2, P(B)=1/3, and P(C)=1/6
According to their method, I would get a chart like this:
First method:

    Symbol   Probability   Questions needed
    A        1/2           1
    B        1/3           2
    C        1/6           2

Then I would multiply each symbol's probability of occurring by the number of questions it needs, giving
(1/2)*1 + (1/3)*2 + (1/6)*2 = 1.5 bits
but their other method gives
-(1/2)*log2(1/2) - (1/3)*log2(1/3) - (1/6)*log2(1/6) = 1.459... bits
The difference is small, but still significant. I've tried this with different combinations and probabilities and got similar results. Is there something I'm missing? Am I using either method wrong, or is one of them more conditional?
Your second calculation is correct.
The problem with your decision-tree approach is that the tree cannot match the entropy exactly (and indeed, no binary decision tree could for those probabilities). Your "is it B?" decision node represents less than one bit of information, since once you reach it you already know the symbol is probably B. So your decision tree represents a possible encoding of the symbols that is expected to consume 1.5 bits on average, but those 1.5 bits carry slightly less than 1.5 bits of information.
In order for a binary tree to match the entropy exactly, each yes/no question has to split the remaining probability exactly in half, which requires every symbol's probability to be a power of 1/2 (1/2, 1/4, 1/8, ...). That is not possible here because of the 1/3 and 1/6.
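If it helps, here is a quick numeric check of both quantities in plain Python; the symbol-to-question mapping simply mirrors the decision tree described in the question.

    # Expected questions for the decision tree vs. Shannon entropy
    # for P(A)=1/2, P(B)=1/3, P(C)=1/6.
    from math import log2

    probs = {"A": 1/2, "B": 1/3, "C": 1/6}
    questions = {"A": 1, "B": 2, "C": 2}   # "is it A?", then "is it B?"

    expected_questions = sum(p * questions[s] for s, p in probs.items())
    entropy = -sum(p * log2(p) for p in probs.values())

    print(expected_questions)  # 1.5 expected questions (encoded bits) per symbol
    print(entropy)             # ~1.459 bits: the entropy, the lower bound for any code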

Cosine similarity and PageRank

Let's say I have a search engine that uses cosine similarity for retrieving pages,
but without the idf part, only the tf.
If I add PageRank to the cosine formula,
is it possible that the score will change from one corpus to another?
Example -
Corpus A - Doc A, Doc B ---> There is a link between A and B.
Corpus B - Doc B ---> There is a link between A and B.
Will the score of page B be different for the two corpora?
Thanks.
Your question is not completely clear, but let me try to address your concern.
When you say you are only using tf, not tf-idf, in the cosine similarity calculation, I take it you are representing web pages by their constituent terms' frequencies. You then ask whether, if you add the PageRank value of each web page to the similarity score computed with cosine similarity, the formula is going to change. There is a little ambiguity here: the formula itself will never change, but the combined scores may vary from corpus to corpus. Also, the example you provided is not clear.
So, here is the thing you probably want to know:
If you compute cosine similarity or PageRank for different corpora, the scores may vary with the corpus distribution. PageRank is computed over a collection of web pages that is treated as the entire web, so if you consider two different corpora, the PageRank scores of the same web pages can differ!
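To make the combination concrete, here is a rough sketch of one common way to mix a tf-only cosine score with PageRank, a weighted linear combination. The toy corpus, the link structure, and the mixing weight alpha are all illustrative assumptions, not part of the original question; it uses scikit-learn and networkx.

    # Sketch: combining a tf-only cosine score with PageRank via a weighted sum.
    import numpy as np
    import networkx as nx
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["page about cats", "page about dogs and cats", "page about dogs"]
    links = [(0, 1), (1, 2), (2, 0)]  # toy link structure between the pages

    # tf-only document vectors (no idf), plus a query vector in the same space.
    vec = CountVectorizer()
    doc_tf = vec.fit_transform(docs)
    query_tf = vec.transform(["cats"])
    cos_scores = cosine_similarity(query_tf, doc_tf)[0]

    # PageRank depends only on the link graph, i.e. on the corpus, not on the query.
    pr = nx.pagerank(nx.DiGraph(links))
    pr_scores = np.array([pr[i] for i in range(len(docs))])

    alpha = 0.7  # assumed mixing weight
    combined = alpha * cos_scores + (1 - alpha) * pr_scores
    print(combined)

Because the PageRank term is computed from the whole link graph, the combined score of the same page will generally differ between two different corpora even if its cosine score stays the same.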

word vector and paragraph vector query

I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I am tagging multiple documents with the same label (topic), and I am training a doc2vec model on my corpus with dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion, which does make a lot of sense.
For example, getting document labels similar to a word:
doc2vec_model.docvecs.most_similar(positive=[doc2vec_model["management"]], topn=50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find words similar to a document label, or document labels similar to a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of a word's high/low frequency on the final word2vec model. If wordA and wordB have similar contexts in the documents of a particular doc label, but wordA has a much higher frequency than wordB, would wordB have a higher similarity score with the corresponding doc label or not? I am training multiple word2vec models by sampling the corpus in a temporal fashion, and I want to test the hypothesis that as a word gets more and more frequent (assuming its context stays relatively similar), its similarity score with the document label also increases. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.
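For what it's worth, here is a minimal sketch of the setup described above (DBOW with dbow_words=1), using Gensim 4.x attribute names (model.wv and model.dv, which replace the older model[...] and model.docvecs access shown in the question). The tiny corpus, the tags, and the hyperparameters are placeholders.

    # Sketch of the DBOW + interleaved word-training setup discussed above.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument("management reviewed the quarterly budget".split(), ["finance"]),
        TaggedDocument("the team shipped the new release".split(), ["engineering"]),
    ]

    model = Doc2Vec(
        corpus,
        dm=0,            # DBOW mode
        dbow_words=1,    # also train word vectors with skip-gram, sharing the space
        vector_size=200,
        window=5,
        min_count=1,
        epochs=40,
    )

    # Because word vectors and doctag vectors are trained against the same
    # surrounding-words prediction task, they can be compared directly:
    word_vec = model.wv["management"]
    print(model.dv.most_similar([word_vec], topn=2))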

How to compute document similarity against a document collection?

What approaches could be used to combine pairwise document similarity scores into an overall similarity score for a particular document against a document collection?
How to compute document similarity against a document collection? - ResearchGate. Available from: https://www.researchgate.net/post/How_to_compute_document_similarity_against_a_document_collection [accessed Aug 22, 2016].
One way of approaching this is the way a Naive Bayes text classifier works. By "concatenating" all of the documents in your collection into one large pseudo-document, you can assess the similarity of a particular document against that collection's pseudo-document. This is how the majority of spam filters work: they compare the text of a document ("cheap pharmaceuticals") against the text seen in your spam documents and see whether it is more like them than like the documents you tend to read.
This "pseudo-document" approach is probably the most efficient way to compute such a similarity, since you only need to do your similarity calculation once per document after you pre-compute a representation for the collection.
If you truly have a document similarity matrix and want to use document-pair similarities rather than creating a pseudo-document, you're almost performing clustering. (I say this because how to combine document-pair similarities is exactly what the different linkage methods in hierarchical clustering address.)
One way to do this is to look at the average similarity: for a document, sum the similarity scores between that document and all the others and divide by their number. This gives you a sense of how close that document sits to the rest of the collection in your similarity space. An outlier would have a low average similarity (a high average distance), since most documents are farther from it than they are from a document in the center of a cluster.
Without more information about your similarity measure or what problem you're trying to solve, I'm not sure I can give better advice.
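To make both options above concrete, here is a small sketch under assumed choices (TF-IDF vectors, cosine similarity, a toy corpus) of (1) the pseudo-document comparison and (2) the average of pairwise similarities. None of the specifics are from the original question.

    # (1) Compare a document to a concatenated pseudo-document for the collection.
    # (2) Average its pairwise similarities against every document in the collection.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    collection = [
        "cheap pharmaceuticals online no prescription",
        "discount pharmaceuticals shipped overnight",
        "meeting notes for the quarterly review",
    ]
    candidate = "buy cheap pharmaceuticals today"

    vec = TfidfVectorizer()
    vec.fit(collection + [candidate])

    # (1) Pseudo-document: concatenate the collection and compare once.
    pseudo_doc = " ".join(collection)
    score_pseudo = cosine_similarity(vec.transform([candidate]),
                                     vec.transform([pseudo_doc]))[0, 0]

    # (2) Average pairwise similarity against every document in the collection.
    pairwise = cosine_similarity(vec.transform([candidate]), vec.transform(collection))[0]
    score_avg = pairwise.mean()

    print(score_pseudo, score_avg)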
