How to apply topic modeling? - information-retrieval

I have 10000 tweets for 5 topics. Assume I know the ground truth (the actual topic of each tweet) and I group the tweets into 5 documents, where each document contains the tweets for a particular topic. Then I apply LDA to the 5 documents with the number of topics set to 5. In that case I get good topic words.
Now, if I don't know the ground truth of the tweets, how do I construct input documents so that LDA will still give me good topic words describing the 5 topics?
What if I create input documents by randomly sampling tweets? What if that ends up giving the input documents similar topic mixtures? Would LDA still find topic words as good as in the first case?

If I understand correctly, your problem is about topic modeling on short texts (Tweets). One approach is to combine Tweets into long pseudo-documents before training LDA. Another one is to assume that there is only one topic per document/Tweet.
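As a rough illustration of the pooling approach, here is a minimal sketch assuming the tm and topicmodels R packages. The objects tweets (a character vector of raw tweets) and pool_id (a grouping key such as hashtag, author or time window, a common pooling heuristic when labels are unknown) are assumptions for the example:

library(tm)
library(topicmodels)

# tweets:  character vector of raw tweets (assumed)
# pool_id: grouping key used to pool tweets into longer pseudo-documents (assumed)
pseudo_docs <- tapply(tweets, pool_id, paste, collapse = " ")

corpus <- VCorpus(VectorSource(as.character(pseudo_docs)))
dtm <- DocumentTermMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         stopwords = TRUE))

# Fit LDA on the pooled pseudo-documents and inspect the topic words
lda <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 1))
terms(lda, 10)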
In the case that you don't know the ground truth labels of Tweets, you might want to try the one-topic-per-document topic model (i.e. mixture-of-unigrams). The model details are described in:
Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
You can find my Java implementations of this model and of LDA at http://jldadmm.sourceforge.net/. Assuming that you know the ground-truth labels, you can also use my implementation to compare these topic models on a document clustering task.
If you'd like to evaluate topic coherence (i.e. evaluate how good the topic words are), I would suggest having a look at the Palmetto toolkit (https://github.com/AKSW/Palmetto), which implements the topic coherence calculations.

Related

Optimal document size for topic modeling using STM

I'm wondering what the ideal/best/optimal size of documents are when the goal is to identify topics within the documents using Structural Topic Modeling.
I have a body of documents of different lengths, originating from 30 "authors", organized in 26 chapters and acquired at two different points in time. So let's say the whole data set consists of 1560 documents of different lengths. Now I want to (1) identify the topics in these documents and (2) check whether the topics differ between the two points in time.
While there is some research on topic modeling for short texts, I could not find any information on the "optimal" size of documents. So I could:
build a corpus from all of the 1560 documents, or
merge the chapters for each author and point in time, leaving a total of 52 documents.
Is there any advice or evidence on which of these approaches leads to better topic modeling results using STM?
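For concreteness, a rough sketch of fitting an STM on the 52 merged documents with a time covariate, assuming the stm package and a data frame meta with hypothetical columns text, author and time (one row per merged document); K = 20 is an arbitrary placeholder:

library(stm)

# meta: data frame with columns text, author, time (assumed)
processed <- textProcessor(meta$text, metadata = meta)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

fit <- stm(documents = out$documents, vocab = out$vocab, K = 20,
           prevalence = ~ time, data = out$meta, init.type = "Spectral")

labelTopics(fit, n = 10)                      # (1) inspect the topic words
eff <- estimateEffect(1:20 ~ time, fit, metadata = out$meta)
summary(eff)                                  # (2) do topic proportions differ between the two time points?

The same pipeline runs unchanged on the 1560-document corpus, so comparing the resulting topics and their coherence is one practical way to decide between the two constructions.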

Dynamic topic models/topic over time in R [closed]

I have a database of newspaper articles about water policy from 1998 to 2008. I would like to see how the newspaper coverage changes over this period. My question is: should I use a Dynamic Topic Model or the Topic-over-Time (ToT) model for this task? Would they be significantly better than a traditional LDA model (in which I fit the topic model on the entire text corpus and plot the trend of each topic based on how each document is tagged)? If so, is there a package I could use for the DTM/ToT model in R?
So it depends on what your research question is.
A dynamic topic model allows the words that are most strongly associated with a given topic to vary over time. The paper that introduces the model gives a great example of this using journal entries [1]. If you are interested in whether the characteristics of individual topics vary over time, then this is the correct approach.
I have not dealt with the ToT model before, but it appears similar to a structural topic model whose time covariates are continuous. This means that the topics are fixed, but their relative prevalence and correlations can vary. If you group your articles into, say, months, then a structural or ToT model can show you whether certain topics become more or less prevalent over time.
So in sum, do you want the variation to be within topics or between topics? Do you want to study how the articles vary in the topics they speak on, or do you want to study how these articles construct certain topics?
In terms of R, you'll run into some problems. The stm package can handle an STM with discrete time periods, but there is no pre-packaged implementation of a ToT model that I am aware of. For a DTM, I know there is a C++ implementation that was released with the introductory paper, and I have a Python version which I can find for you.
Note: I would never recommend using plain LDA for text documents. I would always take a correlated topic model as a baseline and build from there.
Edit: to explain more about the stm package.
This package is an implementation of the structural topic model [2]. The STM is an extension to the correlated topic model [3] but permits the inclusion of covariates at the document level. You can then explore the relationship between topic prevalence and these covariates. If you include a covariate for date, then you can explore how individual topics become more or less important over time, relative to others. The package itself is excellent, fast and intuitive, and includes functions to choose the most appropriate number of topics etc.
[1] Blei, David M., and John D. Lafferty. "Dynamic topic models." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[2] Roberts, Margaret E., et al. "Structural Topic Models for Open‐Ended Survey Responses." American Journal of Political Science 58.4 (2014): 1064-1082.
[3] Lafferty, John D., and David M. Blei. "Correlated topic models." Advances in neural information processing systems. 2006.
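To make the stm workflow described above concrete, here is a minimal sketch assuming a data frame meta with the article text and a numeric date column (the column names and K = 30 are illustrative, not prescriptive):

library(stm)

processed <- textProcessor(meta$text, metadata = meta)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Let topic prevalence vary smoothly with publication date (b-spline via stm::s)
fit <- stm(out$documents, out$vocab, K = 30,
           prevalence = ~ s(date), data = out$meta, init.type = "Spectral")

# How does the prevalence of, say, topic 1 change over 1998-2008?
eff <- estimateEffect(c(1) ~ s(date), fit, metadata = out$meta)
plot(eff, covariate = "date", method = "continuous")

# searchK() is the helper mentioned above for choosing the number of topics
# kresult <- searchK(out$documents, out$vocab, K = c(10, 20, 30), data = out$meta)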

How to train a Word2Vec model properly for a special purpose

My question concerns how to properly train a Word2Vec model for a unique and very specific use case.
I am working on identifying noun-adjective relationships within the word embeddings.
(E.g. we have 'nice car' in a sentence of the data-set. Given the word embeddings of the corpus and the nouns and adjectives all labeled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
Of course I am not trying to connect only that pair of words; the technique should work for all such relationships. I am taking a supervised approach at the moment, and will then work towards designing an unsupervised method.
Now that you understand what I am trying to do, I will explain the problem. I obviously know that word2vec needs to be trained on large amounts of data, to learn the proper embeddings as accurately as possible, but I am afraid to give it more data than the data-set with labelled sentences (500-700).
I am afraid that if I give it more data to train on (e.g. the latest Wikipedia dump), it will learn better vectors, but the extra data will also influence the positioning of my words, so the word relationships would be biased by the extra training data. (E.g. what if 'nice Apple' also appears in the extra training data? Then the positioning of the word 'nice' could be compromised.)
Hopefully this makes sense and I am not making bad assumptions; I am just caught in the dilemma of having bad vectors because of too little training data, or having good vectors whose positioning in the embedding space is compromised.
What would be the proper way to train the model: on as much data as possible (billions of words), or just on the labelled data-set (500-700 sentences)?
Thank you kindly for your time, and let me know if anything that I explained does not make sense.
As always in similar situations it is best to check...
I wonder whether you have tested the difference between training on the labelled dataset and training on the Wikipedia dataset. Do the issues you are afraid of actually show up?
I would just run an experiment and check if the vectors in both cases are indeed different (statistically speaking).
I suspect that you may introduce some noise with a larger corpus, but more data may be beneficial with respect to vocabulary coverage (a larger corpus is more universal). It all depends on your expected use case. It is likely to be a trade-off between high precision with very low recall and so-so precision with relatively good recall.
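If it helps, a minimal sketch of such an experiment, assuming the word2vec R package and two hypothetical input files; the point is the comparison, not these particular hyperparameters:

library(word2vec)

small_txt <- readLines("labelled_sentences.txt")   # the ~500-700 labelled sentences (assumed file)
large_txt <- readLines("wikipedia_sample.txt")     # larger background corpus (assumed file)

m_small <- word2vec(x = small_txt, type = "skip-gram", dim = 50, iter = 20, min_count = 2)
m_large <- word2vec(x = large_txt, type = "skip-gram", dim = 50, iter = 5,  min_count = 5)

# Compare the nearest neighbours of a word of interest in the two spaces;
# large shifts here would indicate the "compromised positioning" worry is real
predict(m_small, newdata = "nice", type = "nearest", top_n = 10)
predict(m_large, newdata = "nice", type = "nearest", top_n = 10)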

Predicting next word with text2vec in R

I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple n-gram model with Kneser-Ney smoothing. It predicts the next word by finding the n-gram with the maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower-order n-grams, which can be advantageous when higher-order n-grams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it fails in cases where the n-gram cannot capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n < 5).
I am looking into more advanced methods and I found the text2vec package, which allows words to be mapped into a vector space where words with similar meaning are represented by similar (close) vectors. I have a feeling that this representation can be helpful for next-word prediction, but I cannot figure out how exactly to define the training task. My question is whether text2vec is the right tool for next-word prediction and, if so, what prediction algorithm is suitable for this task?
You can try char-rnn or word-rnn (google a little bit).
For a character-level R/mxnet implementation, take a look at the mxnet examples. It is probably possible to extend that code to a word-level model using text2vec GloVe embeddings.
If you have any success, let us know (I mean the text2vec and/or mxnet developers). It would be a very interesting case for the R community. I wanted to run such a model/experiment myself, but still haven't found time for it.
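To get started on the embedding half of that idea, here is a sketch of training GloVe vectors with text2vec which could then feed a word-level RNN; texts is an assumed character vector of training sentences, and note that the GlobalVectors constructor arguments have changed across text2vec releases (rank was called word_vectors_size in older versions):

library(text2vec)

tokens <- word_tokenizer(tolower(texts))            # texts: assumed character vector
it <- itoken(tokens, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)

tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
word_vectors <- wv_main + t(glove$components)       # final embeddings, one row per word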
There is an implemented solution available as a complete example using word embeddings. The paper by Makarenkov et al. (2017), Language Models with Pre-Trained (GloVe) Word Embeddings, presents a step-by-step implementation of training a language model using a Recurrent Neural Network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide the instructions to run the code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name, tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The Python code is here: https://github.com/vicmak/ProofSeer.
I also found that @Dmitriy Selivanov recently published a nice and friendly tutorial using his text2vec package, which can be useful for addressing the problem from the R perspective. (It would be great if he could comment further.)
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/
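As an illustration of the RNNLM idea in R, here is a heavily simplified sketch assuming the keras package and that x_train / y_train have already been prepared (x_train an integer matrix of token indices with seq_len columns, y_train the index of the following word); vocab_size and all hyperparameters are placeholders:

library(keras)

vocab_size <- 10000   # assumed vocabulary size
seq_len    <- 10      # assumed context length

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size + 1, output_dim = 100,
                  input_length = seq_len) %>%       # could be initialised from pre-trained vectors
  layer_lstm(units = 128) %>%
  layer_dense(units = vocab_size + 1, activation = "softmax")

model %>% compile(loss = "sparse_categorical_crossentropy",
                  optimizer = "adam", metrics = "accuracy")

model %>% fit(x_train, y_train, batch_size = 64, epochs = 10, validation_split = 0.1)

# model %>% predict(x_new) then gives a probability distribution over the next word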

Manually Specifying a Topic Model in R

I have a corpus of text with each line in the csv file uniquely specifying a "topic" I am interested in. If I were to run a topic model on this corpus using an LDA or Gibbs method from either the topicmodels or lda package, as expected I would get multiple topics per "document" (a line of text in my CSV which I have a priori defined to be my unique topic of interest). I get that this is a result of the topic model's algorithm and the bag-of-words assumption.
What I am curious about, however, is this:
1) Is there a pre-fab'd package in R that is designed to let the user specify the topics using the empirical word distribution? That is, I don't want the topics to be estimated; I want to tell R what the topics are. I suppose I could run a topic model with the correct number of topics, use the structure of the resulting object and then overwrite its contents. I was just hoping there was an easier or more obvious way that I'm just not seeing at this point.
Thoughts?
Edit (added):
I just thought about the alpha and beta parameters, which control the document-topic and topic-term distributions within the LDA modeling algorithm. What settings could I use to force the model to find only one topic per document? Or is there a setting that would allow that to occur?
If these seem like silly questions I understand - I'm quite new to this particular field and I am finding it fascinating.
What are you trying to accomplish with this approach? If you want to tell R what the topics are so it can predict the topics in other lines or documents, then RTextTools may be a helpful package.
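On the edit about alpha: a minimal sketch, assuming the topicmodels package and a document-term matrix dtm you have already built, of pushing Gibbs-sampled LDA towards near-single-topic documents by choosing a small alpha (LDA never strictly enforces one topic per document, but a small alpha concentrates each document's mass on few topics):

library(topicmodels)

k <- 5   # the number of topics you expect (assumed)
lda_sparse <- LDA(dtm, k = k, method = "Gibbs",
                  control = list(alpha = 0.01,            # small alpha -> sparse document-topic mixtures
                                 seed = 123, burnin = 1000, iter = 2000))

round(posterior(lda_sparse)$topics, 2)   # per-document topic proportions; mostly one dominant topic
terms(lda_sparse, 10)                    # top 10 terms per topic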
