What could be the approaches to combine the pairwise document similarity scores to get the overall similarity score of a certain document against a document collection?
How to compute document similarity against a document collection? - ResearchGate. Available from: https://www.researchgate.net/post/How_to_compute_document_similarity_against_a_document_collection [accessed Aug 22, 2016].
One way of approaching this is the way a Naive Bayes text classifier works. By "concatenating" all of the documents in your collection into one large pseudo-document, you can assess the similarity of a particular document against the collection as a whole. This is how the majority of spam filters work: they compare the text of an incoming document ("cheap pharmaceuticals") against the text seen in your spam documents and check whether it is more like them than like the documents you tend to read.
This "pseudo-document" approach is probably the most efficient way to compute such a similarity, since you only need to do your similarity calculation once per document after you pre-compute a representation for the collection.
If you truly have a document similarity matrix and want to use document-pair similarities rather than creating a pseudo-document, you're almost performing clustering. (I say this because how to combine pairwise document similarities is exactly what the different linkage methods in hierarchical clustering address.)
One way to do this might be to look at the average similarity. For a document, you sum the similarity scores between that document and all other documents, and divide by the number of documents. This gives you a sense of how close that document sits to the others in your similarity space. An outlier would have a lower average similarity (a higher average distance), since most documents are farther from it than from a document in the center of a cluster.
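A sketch of that averaging, assuming sim is a precomputed cosine similarity matrix (the toy data is made up for illustration):

    set.seed(42)
    x   <- matrix(runif(4 * 20), nrow = 4)   # 4 toy documents, 20 term counts each
    sim <- tcrossprod(x) / sqrt(outer(rowSums(x^2), rowSums(x^2)))   # cosine matrix

    # Average similarity of each document to all the others
    diag(sim) <- NA
    rowMeans(sim, na.rm = TRUE)   # low values flag outlying documents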
Without more information about your similarity measure or what problem you're trying to solve, I'm not sure I can give better advice.
Related
Using the tm package in R, how can I score a document in terms of its uniqueness? I want to somehow separate documents with very unique words from documents that contain often-used words.
I know how to find the frequently used words and least used words with e.g. findFreqTerms, but how do I score a document with regard to its uniqueness?
I am struggling to come up with a good solution.
A good starting point for assessing which words are used in only some documents is the so-called tf-idf weighting (see the tidytext package vignette). This assigns a score to each (word, document) combination, so once you have it calculated you can summarize along the 'document' margin, maybe literally just colMeans, to get a sense of how many relatively unique terms each document uses.
To separate documents, a weighting scheme like tf-idf may be better than just finding the rarest overall tokens: a rare word used once in most documents is treated quite differently from a word used several times in just a few documents.
The R packages tm, tidytext, and quanteda all have functions to calculate this.
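A base-R sketch of such a uniqueness score (the toy documents are assumptions; in practice tm::weightTfIdf or tidytext::bind_tf_idf would handle the weighting step):

    docs <- c("the cat sat on the mat",
              "the dog sat on the log",
              "quantum chromodynamics of hadrons")

    # Document-term matrix of raw counts
    tokens <- strsplit(tolower(docs), "\\s+")
    vocab  <- sort(unique(unlist(tokens)))
    dtm <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

    # tf-idf: term frequency times log(N / document frequency)
    tf    <- dtm / rowSums(dtm)
    idf   <- log(nrow(dtm) / colSums(dtm > 0))
    tfidf <- sweep(tf, 2, idf, `*`)

    # Uniqueness score: mean tf-idf over the terms each document actually uses
    rowSums(tfidf) / rowSums(dtm > 0)   # the third document scores highest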
Let's say I have a search engine that uses cosine similarity for retrieving pages,
but without the idf part, only the tf.
If I add PageRank to the cosine formula,
is it possible that the scores will change from one corpus to another?
Example -
Corpus A - Doc A, Doc B ---> There is a line between A and B.
Corpus A - Doc B ---> There is a line between A and B.
Will the score of page B be different for the two corpora?
Thanks.
Your question is not completely clear. But let me try to address your concern.
When you say you are only using tf, not tf-idf, in the cosine similarity calculation, I take it you are representing web pages by their constituent terms' frequencies. You then ask whether, if you add the PageRank value of each web page to the cosine similarity score, the formula is going to change. There is a little ambiguity here: the formula itself will never change, but yes, the combined scores may vary from corpus to corpus. Also, the example you provided is not clear.
So, here is what you probably want to know:
If you compute cosine similarity or PageRank for different corpora, the scores may vary with the corpus distribution. PageRank is computed over a collection of web pages that is treated as the entire web, so if you consider two different corpora, the PageRank scores of the same web pages can vary!
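A minimal sketch of the combination under discussion (the blend weight alpha, the toy vectors, and the PageRank value are all assumptions for illustration; a real PageRank comes from the link graph of the whole corpus):

    cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    query <- c(cheap = 1, pills = 1, buy = 0)   # tf-only query vector
    page  <- c(cheap = 2, pills = 1, buy = 3)   # tf-only page vector
    pagerank <- 0.004                           # corpus-dependent: recomputed per corpus

    alpha <- 0.7                                # assumed blend weight
    score <- alpha * cosine(query, page) + (1 - alpha) * pagerank
    score   # the formula is fixed; only the pagerank term changes across corpora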
What is the difference between TF-IDF ranking of documents and the Binary Independence Model of ranking? I'm unable to differentiate them.
I think a practical implementation of the Binary Independence Model leads to TF-IDF. Please correct me if I'm wrong.
The main difference is that in the Binary Independence Model there is no notion of how important a word is; all words are treated the same. Weighting words with TF-IDF, by contrast, gives higher scores to words that occur frequently within a document but appear in few documents overall.
You are correct. The Binary Independence Model's assumption is that documents are binary vectors; that is, only the presence or absence of terms in documents is recorded. On the other hand, in the Vector Space Model documents are represented by a vector of term weights, and TF-IDF is just one way to compute those term weights.
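A small sketch contrasting the two representations on a made-up document-term matrix:

    dtm <- rbind(d1 = c(apple = 3, banana = 0, cherry = 1),
                 d2 = c(apple = 1, banana = 2, cherry = 0))

    # Binary Independence Model view: only presence/absence is recorded
    binary <- (dtm > 0) * 1

    # Vector Space Model with tf-idf term weights
    tf    <- dtm / rowSums(dtm)
    idf   <- log(nrow(dtm) / colSums(dtm > 0))
    tfidf <- sweep(tf, 2, idf, `*`)

    binary   # apple and cherry look identical in d1
    tfidf    # apple gets weight 0: it appears in every document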
I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated the cosine similarity of each author's texts with one another. For example, author x has 100 texts, so I end up with a 100 x 100 similarity matrix; author y has 50 texts, so I end up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I take the average of the columns or rows and then average the resulting vector of means, I arrive at a single number per author, so I can compare these two means of means, but I am not sure whether this procedure is right. I hope I have made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then some summary of the distribution of similarities across document pairs, within author, is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution with a mean. To capture the spread, I would also report the standard deviation of these similarities.
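A sketch of that summary, assuming sim is one author's square cosine similarity matrix (the toy matrix below is random stand-in data, not real texts):

    # Summarise one author's within-author similarity matrix
    summarise_author <- function(sim) {
      vals <- sim[upper.tri(sim)]          # unique document pairs; drops the diagonal 1s
      c(mean = mean(vals), sd = sd(vals))
    }

    set.seed(1)
    x     <- matrix(runif(100 * 50), nrow = 100)   # 100 toy 'texts' by author x
    sim_x <- tcrossprod(x) / sqrt(outer(rowSums(x^2), rowSums(x^2)))
    summarise_author(sim_x)   # compare these summaries (and densities) across authors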
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for the sort of text-reuse analysis you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between "kitten" and "sitting" is 3, but this means absolutely nothing in substantive terms about their semantic relationship, or about one being an example of "re-use" of the other. An argument could be made that LD computed over words might show re-use, but that is not how most plagiarism detectors, e.g. http://turnitin.com, implement detection.
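For reference, base R's adist() reproduces the Wikipedia figure directly:

    # adist() computes the Levenshtein distance between strings
    adist("kitten", "sitting")   # 3 edits (k->s, e->i, insert g), yet no semantic link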
I want to do a project on document summarization.
Can anyone please explain the algorithm for document summarization using a graph-based approach?
Also, can someone provide links to a few good research papers?
Take a look at TextRank and LexRank.
LexRank is an algorithm essentially identical to TextRank, and both use a graph-based approach for document summarization. The two methods were developed by different groups at about the same time; LexRank simply focused on summarization, but either could just as easily be used for keyphrase extraction or any other NLP ranking task.
In both algorithms, a graph is built with sentences as vertices and edges weighted by pairwise sentence similarity; the sentences are then ranked by applying PageRank to this graph. A summary is formed by combining the top-ranked sentences, using a threshold or length cutoff to limit the size of the summary.
https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank
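A minimal TextRank-style sketch using igraph's page_rank (the toy sentences and the word-overlap weighting are illustrative assumptions; see the Wikipedia article above for the original formulations):

    library(igraph)

    sentences <- c("the cat sat on the mat",
                   "the cat chased the dog",
                   "the dog sat on the log",
                   "quantum field theory is hard")
    tokens <- strsplit(tolower(sentences), "\\s+")

    # TextRank-style edge weight: word overlap normalised by log sentence lengths
    overlap <- function(a, b) length(intersect(a, b)) / (log(length(a)) + log(length(b)))
    n <- length(sentences)
    w <- outer(seq_len(n), seq_len(n),
               Vectorize(function(i, j) if (i == j) 0 else overlap(tokens[[i]], tokens[[j]])))

    # Rank sentences with PageRank on the weighted sentence graph
    g <- graph_from_adjacency_matrix(w, mode = "undirected", weighted = TRUE)
    ranks <- page_rank(g)$vector

    # The summary is the top-ranked sentences, up to a length cutoff
    sentences[order(ranks, decreasing = TRUE)][1:2]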