Let's say I have a search engine that uses cosine similarity for retrieving pages, but without the idf part, only the tf.
If I add PageRank to the cosine formula, is it possible that the formula will change from one corpus to another?
Example -
Corpus A - Doc A, Doc B ---> there is a link between A and B.
Corpus B - Doc B ---> there is a link between A and B.
Will the score of page B be different for the two corpora?
Thanks.
Your question is not completely clear, but let me try to address your concern.
When you say you are using only tf, not tf-idf, in the cosine similarity calculation, I take it you are representing web pages by their constituent term frequencies. You then ask whether the formula changes if you add the PageRank value of each web page to the cosine similarity score. There is a small ambiguity here: the formula itself will never change, but the combined scores may vary from corpus to corpus. The example you provided is also not entirely clear.
So, here is probably what you want to know:
Whether you compute cosine similarity or PageRank, the scores may vary with the corpus distribution. PageRank is computed over a collection of web pages that is treated as the entire web, so if you consider two different corpora, the PageRank score of the same web page can differ!
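To make this concrete, here is a minimal sketch (my own illustration, not from your question) of one common way to combine a tf-only cosine score with PageRank: a linear mix. The two toy documents, the link list, and the mixing weight alpha are all made up. The cosine formula stays the same, but the PageRank term depends on the link graph of the corpus, so the combined score of the same page can differ between corpora.

# Hypothetical combination: alpha * cosine(tf) + (1 - alpha) * PageRank
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["page A talks about cats", "page B talks about cats and dogs"]  # toy corpus
links = [("A", "B")]                        # made-up link structure; varies per corpus

vectorizer = CountVectorizer()              # tf only, no idf
tf = vectorizer.fit_transform(docs)
query = vectorizer.transform(["cats dogs"])

cos = cosine_similarity(query, tf).ravel()  # cosine part: same formula everywhere
pr = nx.pagerank(nx.DiGraph(links))         # PageRank part: depends on the corpus graph
pagerank = np.array([pr["A"], pr["B"]])

alpha = 0.7                                 # arbitrary mixing weight
print(alpha * cos + (1 - alpha) * pagerank)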
What is the difference between TF-IDF ranking of documents and the Binary Independence Model of ranking? I'm unable to differentiate them.
I think a practical implementation of the Binary Independence Model leads to TF-IDF. Please correct me if I'm wrong.
The main difference is that the Binary Independence Model has no notion of how important a word is; all words are treated the same. Weighting words with TF-IDF, on the other hand, gives higher scores to words that occur often in one document and have a low document frequency.
You are correct. The Binary Independence Model assumes that documents are binary vectors; that is, only the presence or absence of terms in a document is recorded. The Vector Space Model, on the other hand, represents documents as vectors of term weights, and TF-IDF is just one way to define those weights.
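To illustrate the difference concretely, here is a small sketch (my own, not from the answers above) showing the same two toy documents represented first as binary incidence vectors and then with TF-IDF weights, using scikit-learn:

# Binary incidence (Binary Independence Model style) vs. TF-IDF weighting
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cheap cheap pharmaceuticals online",
        "online lecture notes on retrieval"]

binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(binary)           # only presence/absence: the repeated "cheap" counts once
print(tfidf.round(2))   # "cheap" gets a high weight: frequent in doc 1, rare in the collection

In the binary representation both documents are just 0/1 vectors, while TF-IDF gives "cheap" a larger weight because it occurs twice in the first document and nowhere else.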
I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated cosine similarity among each author's own texts. For example, author x has 100 texts, so I end up with a 100 x 100 similarity matrix; author y has 50 texts, so I end up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I average the columns (or rows) and then average the resulting vector of means, I arrive at a single number, so I can compare the two means of means, but I am not sure whether this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then some summary of the distribution of within-author document similarities is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution with a mean. To capture the spread, I would also report the standard deviation of these similarities.
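If it helps, here is a small numpy sketch of that summary (the similarity matrix is randomly generated as a stand-in for your quanteda output; in R you would do the equivalent with the upper triangle of the matrix):

# Summarise a within-author similarity matrix by its off-diagonal mean and sd
import numpy as np

sim = np.random.uniform(0.2, 0.9, size=(100, 100))  # placeholder for author x's matrix
sim = (sim + sim.T) / 2                              # symmetrise
np.fill_diagonal(sim, 1.0)

vals = sim[np.triu_indices_from(sim, k=1)]           # drop the 1.0 self-similarities
print("mean:", vals.mean(), "sd:", vals.std())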
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for exactly the sort of text-reuse analysis you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between "kitten" and "sitting" is 3, but this says nothing substantive about their semantic relationship or about one being a "re-use" of the other. One could argue that an LD computed over words might indicate re-use, but that is not how most plagiarism-detection tools (e.g. http://turnitin.com) work.
I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I tag multiple documents with the same label (topic), and I train a doc2vec model on my corpus with dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion, which does make a lot of sense.
For example, getting document labels similar to a word:
doc2vec_model.docvecs.most_similar(positive=[doc2vec_model["management"]], topn=50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find words similar to a document label, or document labels similar to a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of a word's high or low frequency on the final word2vec model. If wordA and wordB have similar contexts within a particular document label (set of documents), but wordA has a much higher frequency than wordB, will wordB have a higher similarity score with the corresponding doc label or not? I am training multiple word2vec models by sampling the corpus in a temporal fashion, and I want to test the hypothesis that, as a word becomes more and more frequent (assuming its context stays relatively similar), its similarity score with a document label also increases. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.
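For reference, here is a minimal sketch of the setup described in the question, written against the gensim 4.x API (the older docvecs / model[...] access in the snippet above corresponds to model.dv / model.wv there); the two-document corpus and its labels are placeholders:

# DBOW with interleaved skip-gram word training (dm=0, dbow_words=1)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=text.lower().split(), tags=[label])
          for text, label in [("management of large teams", "mgmt"),
                              ("deep learning for text", "ml")]]

model = Doc2Vec(vector_size=200, dm=0, dbow_words=1, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# word vectors and doctag vectors share one space, so this comparison is meaningful
print(model.dv.most_similar([model.wv["management"]], topn=2))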
What approaches could be used to combine pairwise document similarity scores into an overall similarity score for a given document against a document collection?
One way of approaching this is the way a Naive Bayes text classifier works. By "concatenating" all of the documents in your collection into one large pseudo-document, you can assess the similarity of a particular document against that "collection" document. This is how the majority of spam filters work: they compare the text of a document ("cheap pharmaceuticals") against the text seen in your spam documents and check whether it is more like them than the documents you normally read.
This "pseudo-document" approach is probably the most efficient way to compute such a similarity, since you only need to do your similarity calculation once per document after you pre-compute a representation for the collection.
If you truly have a document similarity matrix and want to use document-pair similarities rather than creating a pseudo-document, you are essentially performing clustering. (I say this because how to combine between-document similarities is exactly what the different linkage methods in clustering address.)
One way to do this is to look at the average similarity: for a document, sum the similarity scores between that document and all other documents and divide by their number. This gives you a sense of the average distance between that document and the others in your similarity space. An outlier would have a greater average distance (lower average similarity), since most documents are farther from it than they are from a document in the center of a cluster.
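As a small worked example (with a made-up 4 x 4 similarity matrix), the per-document average similarity can be computed like this, and the document with the lowest average is the likely outlier:

# Average similarity of each document to all others, excluding the 1.0 self-similarity
import numpy as np

sim = np.array([[1.0, 0.8, 0.7, 0.1],
                [0.8, 1.0, 0.6, 0.2],
                [0.7, 0.6, 1.0, 0.1],
                [0.1, 0.2, 0.1, 1.0]])

n = sim.shape[0]
avg = (sim.sum(axis=1) - 1.0) / (n - 1)
print(avg)   # the last document has the lowest average similarity -> likely outlier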
Without more information about your similarity measure or what problem you're trying to solve, I'm not sure I can give better advice.
I want to do a project on document summarization.
Can anyone please explain the algorithm for document summarization using a graph-based approach?
Also, could someone provide links to a few good research papers?
Take a look at TextRank and LexRank.
LexRank is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.
In both algorithms, a graph is built with sentences as vertices and a measure of sentence similarity as edge weights, and the sentences are ranked by applying PageRank to this graph. A summary is then formed by combining the top-ranking sentences, using a threshold or length cutoff to limit its size.
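If it helps to see the moving parts, here is a rough, simplified TextRank-style sketch (my own, not the reference implementation): sentences become graph nodes, TF-IDF cosine similarities become edge weights, and PageRank ranks the sentences, with a length cutoff of two sentences for the summary.

# Simplified TextRank-style extractive summary
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["PageRank ranks the nodes of a graph by their importance.",
             "TextRank applies PageRank to a graph of sentences.",
             "A summary keeps the top-ranked sentences.",
             "A completely unrelated sentence about cooking."]

sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
graph = nx.from_numpy_array(sim)              # weighted graph over sentence indices
scores = nx.pagerank(graph, weight="weight")

top = sorted(scores, key=scores.get, reverse=True)[:2]   # length cutoff
print(" ".join(sentences[i] for i in sorted(top)))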
https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank