I'm wondering what the ideal/best/optimal size of documents is when the goal is to identify topics within the documents using Structural Topic Modeling (STM).
I have a body of documents of different lengths, originating from 30 "authors", organized in 26 chapters and acquired at two different points in time. So let's say the whole data set consists of 1560 documents of different lengths. Now I want to (1) identify the topics in these documents and (2) check whether the topics differ between the two points in time.
While there is some research on topic modeling for short texts, I could not find any information on the "optimal" size of documents. So I could:
build a corpus from all of the 1560 documents, or
merge the chapters for each author and point in time, leaving a total of 52 documents.
Is there any advice or evidence on which solution leads to better topic modeling results using STM?
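For concreteness, here is a minimal sketch of how the two corpus constructions could be set up with the quanteda and stm packages; the data frame texts_df and the docvars author, chapter and time are assumptions about how my metadata is stored, and K = 20 is just an illustrative choice:

library(quanteda)
library(stm)
# Option 1: keep all 1560 chapter-level documents
corp <- corpus(texts_df, text_field = "text")   # texts_df assumed to hold text, author, chapter, time
d1 <- dfm(tokens(corp))
# Option 2: merge chapters within each author and time point (52 documents)
d2 <- dfm_group(d1, groups = interaction(docvars(d1, "author"), docvars(d1, "time")))
# STM with time as a prevalence covariate, to test whether topics differ between the two points in time
fit <- stm(d1, K = 20, prevalence = ~ time, data = docvars(d1))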
Related
We are working on a survey where we have a few open-ended answers apart from the numeric/categorical responses. Until now we have categorized these texts into 10-15 buckets manually so that the marketing team can act on them. For example, if the respondent is asked what other features he wants in a particular tablet which he is using, we will group his/her responses into buckets like 'Better security features', 'Better support', etc.
Instead of doing it manually, I am automating this by building individual logistic regression/CART/random forest models for each bucket. For example, for bucket one I use the code
# One binary logistic regression per bucket, using all term columns of the document-term data frame as predictors
model1 <- glm(Better.support ~ ., data = verbatimSparse, family = binomial)
# Predicted probability that each response belongs to this bucket
verbatim$predict1 <- predict(model1, type = "response")
I am building 12 other models like this, and each response will be grouped into the bucket where the predicted probability is highest. This is somewhat serving my purpose, but the accuracy is only around 80%. Is there any other method to better classify the text?
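For reference, here is a minimal sketch of the assignment step just described; models (a list of the fitted per-bucket glm objects) and bucket_names (a character vector of bucket labels) are placeholder names for illustration:

# Collect one column of predicted probabilities per bucket
pred <- sapply(models, predict, newdata = verbatimSparse, type = "response")
colnames(pred) <- bucket_names
# Assign each response to the bucket with the highest predicted probability
verbatim$bucket <- bucket_names[max.col(pred)]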
I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated cosine similarities among each author's own texts. For example, author x has 100 texts, so I end up with a 100 x 100 similarity matrix. Author y has 50 texts, so I end up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I take the average of the columns or rows and then average that vector of means again, I arrive at a single number, so I can compare these two means of means, but I am not sure if this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then the distribution of pairwise document similarities within each author is probably your best basis for comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution with a mean. To capture the spread, I would also report the standard deviation of these similarities.
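A minimal sketch of what I mean, assuming quanteda (with textstat_simil(), which lives in quanteda.textstats in recent versions) and a character vector author_texts holding one author's documents:

library(quanteda)
library(quanteda.textstats)
d <- dfm(tokens(corpus(author_texts)))            # document-feature matrix for one author
sim <- as.matrix(textstat_simil(d, method = "cosine"))
pairs <- sim[lower.tri(sim)]                      # unique document pairs, diagonal excluded
mean(pairs); sd(pairs)                            # per-author summary: average similarity and its spread
plot(density(pairs))                              # the full distribution of within-author similarities

Repeating this per author gives one distribution (or one mean/sd pair) per author, which can then be compared directly.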
I'd be cautious about calling within-author cosine similarity "self-plagiarism". Cosine similarity computes a measure of distance between vector representations of bags of words, and is not a method for identifying "plagiarism". In addition, the term "plagiarism" carries very pejorative connotations: it means the dishonest representation of someone else's ideas as your own. (I don't believe the term "self-plagiarism" makes sense at all, but I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for the sort of text-reuse analysis you are looking for.
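As a rough sketch of the kind of workflow it supports (author_texts is an assumed character vector of documents, and the word 5-gram tokenizer is just one reasonable choice):

library(textreuse)
corp <- TextReuseCorpus(text = author_texts, tokenizer = tokenize_ngrams, n = 5)
# Jaccard similarity on word 5-grams; high values flag likely reuse between pairs of documents
reuse <- pairwise_compare(corp, jaccard_similarity)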
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between "kitten" and "sitting" is 3, but this says nothing in substantive terms about their semantic relationship or about one being a "re-use" of the other. An argument could be made that LD computed over words might show re-use, but that is not how most plagiarism-detection systems (e.g. http://turnitin.com) work.
I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I am tagging multiple documents with the same label (topic), and I am training a doc2vec model on my corpus with dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion, which does make a lot of sense.
For example, getting document labels similar to a word:
# Find the 50 document tags whose vectors are most similar to the word vector for "management"
doc2vec_model.docvecs.most_similar(positive=[doc2vec_model["management"]], topn=50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find similar words for a document label or similar document labels for a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of a word's frequency on the final word2vec model. If wordA and wordB have similar contexts in a particular doc label (set of documents) but wordA has a much higher frequency than wordB, would wordB have a higher similarity score with the corresponding doc label or not? I am training multiple word2vec models by sampling the corpus in a temporal fashion, and I want to test the hypothesis that, as a word becomes more and more frequent (assuming its context stays relatively similar), its similarity score with a document label also increases. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are used interchangeably for the same surrounding-words prediction task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.
What approaches could be used to combine pairwise document similarity scores into an overall similarity score for a given document against a document collection?
One way of approaching this is the way a naive Bayes text classifier works. By "concatenating" all of the documents in your collection into one large pseudo-document, you can assess the similarity of a particular document against that collection-level document. This is how the majority of spam filters work: they compare the text of a document ("cheap pharmaceuticals") against the text seen in your spam documents and check whether it is more like them than like the documents you tend to read.
This "pseudo-document" approach is probably the most efficient way to compute such a similarity, since you only need to do your similarity calculation once per document after you pre-compute a representation for the collection.
If you truly have a document similarity matrix and want to use document-pair similarities rather than creating a pseudo-document, you are almost performing clustering. (I say this because how to combine document-pair similarities into a group-level similarity is exactly what the different linkage methods in hierarchical clustering address.)
One way to do this might be to look at the average similarity. For a document, you sum the similarity scores between that document and all other documents and divide by the number of other documents. This gives you a sense of the average distance between that document and the others in your similarity space. An outlier would have a lower average similarity (a higher average distance), since most documents are farther from it than from a document in the center of a cluster.
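As a small sketch, assuming sim is the n-by-n document similarity matrix you already have:

diag(sim) <- NA                         # ignore each document's similarity to itself
avg_sim <- rowMeans(sim, na.rm = TRUE)  # average similarity of each document to all the others
sort(avg_sim)                           # low values flag outliers, high values central documents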
Without more information about your similarity measure or what problem you're trying to solve, I'm not sure I can give better advice.
I have 10000 tweets on 5 topics. Assume I know the ground truth (the actual topic of each tweet) and I group the tweets into 5 documents, where each document contains the tweets for a particular topic. Then I apply LDA to the 5 documents with the number of topics set to 5. In this case I get good topic words.
Now, if I don't know the ground truth of the tweets, how do I construct input documents in such a way that LDA will still give me good topic words describing the 5 topics?
What if I create input documents by randomly selecting a sample of tweets? What if this ends up with similar topic mixtures across the input documents? Should LDA still find good topic words, as in the case described in the first paragraph?
If I understand correctly, your problem is about topic modeling on short texts (Tweets). One approach is to combine Tweets into long pseudo-documents before training LDA. Another is to assume that there is only one topic per document/Tweet.
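A minimal sketch of the pooling approach in R (tweets and pool_id are assumed vectors, where pool_id could be a hashtag, author, or time window; quanteda plus topicmodels is just one way to run LDA):

library(quanteda)
library(topicmodels)
pseudo_docs <- tapply(tweets, pool_id, paste, collapse = " ")   # pool tweets into longer pseudo-documents
d <- dfm(tokens(corpus(as.character(pseudo_docs))))
lda <- LDA(convert(d, to = "topicmodels"), k = 5)               # LDA with 5 topics
terms(lda, 10)                                                  # top 10 words per topic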
In the case that you don't know the ground truth labels of Tweets, you might want to try the one-topic-per-document topic model (i.e. mixture-of-unigrams). The model details are described in:
Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
You can find my Java implementations of this model and of LDA at http://jldadmm.sourceforge.net/. Assuming that you know the ground-truth labels, you can also use my implementation to compare these topic models on a document clustering task.
If you'd like to evaluate topic coherence (i.e. evaluate how good the topic words are), I would suggest having a look at the Palmetto toolkit (https://github.com/AKSW/Palmetto), which implements topic coherence calculations.