Integration of pre-trained word vectors in topic modeling in R - r

I am relatively new to the domain of topic modeling so I hope this isn't a stupid question.
I have a text corpus of 7k documents which are mostly relatively short (just a few words). As standard LDA produces only moderately good results, I want to include word vectors that are pre-trained on a large external corpus (like these: https://nlp.stanford.edu/projects/glove/).
However, I haven't found anything that clearly explains how I should proceed (I found some information about an implementation in Python, but I need a solution for R).
After downloading the pre-trained word vectors, how do I integrate them into the LDA modeling process for my own corpus?
Thanks a lot in advance!

The package text2vec has an implementation of GloVe.
Package: https://cran.r-project.org/web/packages/text2vec/index.html
Vignette on GloVe: https://cran.r-project.org/web/packages/text2vec/vignettes/glove.html
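As a first step after downloading, the pre-trained file has to be read and matched against the words of your own corpus. The sketch below is in Python rather than R and only illustrates the file format: the GloVe downloads are plain text with one word per line followed by its vector components, separated by spaces. The function name load_glove, the tiny in-memory file and the vocabulary are made up for illustration; how the vectors are then combined with LDA is not covered here.

```python
import io

def load_glove(handle, vocabulary):
    """Read GloVe text format (one 'word v1 v2 ... vd' line per word),
    keeping only the words that occur in our own corpus."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in vocabulary:
            vectors[word] = [float(v) for v in values]
    return vectors

# Tiny made-up stand-in for a downloaded file (hypothetical 3-dimensional vectors)
fake_file = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.2 0.1 0.4\n"
    "car 0.9 0.8 0.1\n"
)

# Hypothetical vocabulary of the user's own corpus
corpus_vocabulary = {"cat", "dog", "tree"}

embeddings = load_glove(fake_file, corpus_vocabulary)

# Corpus words without a pre-trained vector need separate handling
missing = corpus_vocabulary - set(embeddings)
```

With short documents it is worth checking how many corpus words end up in missing, since words without a pre-trained vector contribute nothing from the external corpus.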

Related

Details behind "augment" when applied to topic modeling

I have a question about the "augment" function from Silge and Robinson's textbook "Text Mining with R: A Tidy Approach". Having run an LDA on a corpus, I am applying "augment" to assign a topic to each word.
I get the results, but I am not sure what takes place "under the hood" of "augment", i.e. how the topic for each word is determined within the Bayesian framework. Is it just based on the conditional probability formula, estimated after the LDA is fit, using p(topic|word) = p(word|topic) * p(topic) / p(word)?
I would appreciate it if someone could provide statistical details on how "augment" does this. Could you also please provide references to papers where this is documented?
The tidytext package is open source and on GitHub so you can dig into the code for augment() for yourself. I'd suggest looking at
augment() for LDA from the topicmodels package
augment() for the structural topic model from the stm package
To learn more about these approaches, there is an excellent paper/vignette on the structural topic model, and I like the Wikipedia article for LDA.
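For reference, the formula from the question can be worked through on made-up numbers. This is only an illustration of Bayes' rule in Python; the probabilities are hypothetical and it is not a statement of what augment() actually computes, which is best checked in the source code mentioned above.

```python
# Hypothetical numbers for two topics and one word; not taken from a fitted model.
p_topic = {"topic_1": 0.6, "topic_2": 0.4}               # p(topic)
p_word_given_topic = {"topic_1": 0.01, "topic_2": 0.05}  # p(word | topic)

# Marginal: p(word) = sum over topics of p(word | topic) * p(topic)
p_word = sum(p_word_given_topic[t] * p_topic[t] for t in p_topic)

# Bayes' rule: p(topic | word) = p(word | topic) * p(topic) / p(word)
p_topic_given_word = {
    t: p_word_given_topic[t] * p_topic[t] / p_word for t in p_topic
}

# The topic with the highest posterior probability for this word
best_topic = max(p_topic_given_word, key=p_topic_given_word.get)
```

Note that this assigns the same topic to a word wherever it occurs, whereas LDA inference can assign the same word to different topics in different documents.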

How does gensim's word2vec differ from TensorFlow's vector representation?

I am fairly new to the world of NLP embeddings. I have used gensim's word2vec model and TensorFlow's vector representation.
My question is this: for training, gensim's word2vec model takes tokenized sentences, while TensorFlow takes one long list of words. How does this difference affect training? Is there any impact on quality?
Also, how does TensorFlow then meet the needs of skip-gram, given that the data is now a list of words and no longer sentences?
I am referring to the TensorFlow tutorial at https://www.tensorflow.org/tutorials/word2vec
Pardon me if my understanding of this domain is wrong; I would appreciate having it corrected.
Thank you for your guidance and help.
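To make the difference concrete, here is a minimal Python sketch of skip-gram pair generation, applied once per sentence and once to the same words as one flat list. The function skipgram_pairs and the toy sentences are hypothetical and are not taken from gensim or from the TensorFlow tutorial; the sketch only shows which (center, context) pairs each input format would produce with a fixed window.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Toy corpus (made up)
sentences = [["the", "cat", "sat"], ["dogs", "bark"]]

# Sentence-wise input: windows never cross a sentence boundary
per_sentence = [p for s in sentences for p in skipgram_pairs(s)]

# Flat input: the same words as one long list
flat = [w for s in sentences for w in s]
flat_pairs = skipgram_pairs(flat)

# Pairs that only exist because a window crossed a sentence boundary
extra = set(flat_pairs) - set(per_sentence)
```

In this sketch the only difference is the handful of pairs in extra that straddle a boundary. With long sentences and a small window these make up a small share of all training pairs.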

How to do Language Modeling using HTK

I am confused about how to use HTK for language modeling.
I followed the tutorial example from the Voxforge site
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
After training and testing I got around 78% accuracy. I did this for my native language. Now I have to use HTK for language modeling.
Is there any tutorial available for doing the same? Please help me.
Thanks
If I understand your question correctly, you are trying to change from a "grammar" to an "n-gram language model" approach. These two methods are alternative ways of specifying what combinations of words are permissible in the responses that a recognizer will return. Having followed the Voxforge process you will probably have a grammar in place.
A language model comes from the analysis of a corpus of text which defines the probabilities of words appearing together. The text corpus used can be very specialized. There are a number of analysis tools such as SRILM (http://www.speech.sri.com/projects/srilm/) and MITLM (https://github.com/mitlm/mitlm) which will read a corpus and produce a model.
Since you are using words from your native language you will need a unique corpus of text to analyze. One way to get a test corpus would be to artificially generate a number of sentences from your existing grammar and use that as the corpus. Then with the new language model in place, you just point the recognizer at it instead of the grammar and hope for the best.
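The idea of generating a corpus from an existing grammar and deriving word probabilities from it can be sketched in a few lines of Python. The toy grammar below is hypothetical, and the unsmoothed bigram estimate is only a stand-in for what tools such as SRILM or MITLM do (they also apply smoothing and write their own model file formats).

```python
import itertools
from collections import Counter

# Hypothetical toy grammar: each slot lists the words allowed at that position.
grammar = [["call", "dial"], ["home", "office"]]

# Enumerate every sentence the grammar allows and use that as the corpus.
corpus = [list(words) for words in itertools.product(*grammar)]

# Count bigrams, with sentence boundary markers, and how often each history occurs.
bigram_counts = Counter()
history_counts = Counter()
for sentence in corpus:
    padded = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

# Unsmoothed maximum-likelihood estimate of p(word | previous word)
bigram_prob = {
    bigram: count / history_counts[bigram[0]]
    for bigram, count in bigram_counts.items()
}
```

A corpus generated this way can only reproduce what the grammar already allows, so the resulting model is a starting point until real text in the target language is available.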

Predicting next word with text2vec in R

I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple n-gram model with Kneser-Ney smoothing. It predicts the next word by finding the n-gram with maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower-order n-grams, which can be advantageous in cases where higher-order n-grams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it fails in cases where the n-gram cannot capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of the weather is not captured in the last n-gram (assuming n<5).
I am looking into more advanced methods and I found the text2vec package, which maps words into a vector space where words with similar meaning are represented by similar (close) vectors. I have a feeling that this representation can be helpful for next-word prediction, but I cannot figure out how exactly to define the training task. My question is whether text2vec is the right tool to use for next-word prediction and, if so, what prediction algorithm is suitable for this task.
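The limitation described above can be reproduced with a very small Python sketch. It uses a plain trigram count model without Kneser-Ney smoothing, and the training sentences and the helper names train and predict are made up for illustration; the point is only that both example sentences end in the same two-word context.

```python
from collections import Counter, defaultdict

def train(sentences, n=3):
    """Count which word follows each (n-1)-word context."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model

def predict(model, text, n=3):
    """Return the most frequent continuation of the last n-1 words, if any."""
    context = tuple(text.lower().split()[-(n - 1):])
    if context not in model:
        return None
    return model[context].most_common(1)[0][0]

# Made-up training data
training = [
    "it is warm and sunny outside let's go to the beach",
    "it is warm outside let's go to the beach",
    "it is cold and raining outside let's go to the cinema",
]
model = train(training)

# Both prompts end in the context ("to", "the"), so the weather words are ignored
warm = predict(model, "It is warm and sunny outside, let's go to the")
cold = predict(model, "It is cold and raining outside, let's go to the")
```

Both calls return the same word because the model only ever sees the last two tokens. Smoothing changes the probability estimates but not the amount of context available.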
You can try char-rnn or word-rnn (a quick search will turn them up).
For a character-level model implemented in R/mxnet, take a look at the mxnet examples. It is probably possible to extend this code to a word-level model using text2vec GloVe embeddings.
If you have any success, let us know (I mean the text2vec and/or mxnet developers). It would be a very interesting case for the R community. I wanted to try such a model/experiment myself, but still haven't had time for it.
There is one implemented solution available as a complete example using word embeddings. The paper by Makarenkov et al. (2017), Language Models with Pre-Trained (GloVe) Word Embeddings, presents a step-by-step implementation of training a language model using a recurrent neural network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide instructions for running the code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name, tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The code in Python is here https://github.com/vicmak/ProofSeer.
I also found that @Dmitriy Selivanov recently published a nice and friendly tutorial using his text2vec package, which can be useful for addressing the problem from the R perspective. (It would be great if he could comment further.)
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/

Can I perform Generalized Iterative Scaling in R?

I'm looking to port our home-grown platform of various machine learning algorithms from C# to a more robust data mining platform such as R. While it's obvious R is great at many types of data mining tasks, it is not clear to me if it can be used for text classification.
Specifically, we extract a list of bigrams from the text and then classify it into one of 15 different categories, e.g.:
Bigram list: jewelry, books, watches, shoes, department store
-> Category: Shopping
We'd want to both train the models in R as well as hook up to a database to perform this on a larger scale.
Can it be done in R?
Hmm, I am only just starting to look into machine learning myself, but I have a suggestion: have you considered Weka? It offers a range of algorithms, and there is some documentation. Plus, there is an R package, RWeka, that makes use of the Weka jars.
EDIT:
There is also a nice, comprehensive read by Witten et al., Data Mining, that contains an extensive description of Weka among other interesting things. Look into the API opportunities.
