I stumbled upon the text2vec package, which implements word embeddings in R, and I have been experimenting with it successfully. However, I have been trying to apply word vectors to each document exactly as is done in H2O (Python) here: https://github.com/h2oai/h2o-tutorials/blob/master/h2o-world-2017/nlp/AmazonReviews.ipynb
In line 21 of this tutorial, the word vectors are averaged and then used as features in a model.
I believe the question is not so much about the code; it is about how we can take the word vectors and assign them to each document so that they can be fed in as features. I am simply following the tutorial here: http://text2vec.org/glove.html
Related
I'm trying to convert textual data into vectors using GloVe in R. My plan was to average the word vectors of a sentence, but I can't seem to get to the word vectorization stage. I've downloaded the glove.6b.50.txt file and its parent zip file from https://nlp.stanford.edu/projects/glove/, and I have visited text2vec's website and tried running through their example where they load Wikipedia data. But I don't think it's what I'm looking for (or perhaps I am not understanding it). I'm trying to load the pretrained embeddings into a model so that, given a sentence (say 'I love lamp'), I can iterate through it and turn each word into a vector with a function like vectorize(word), then average those vectors (turning unknown words into zeros). How do I load the pretrained embeddings into a GloVe model as my corpus (and is that even what I need to do to accomplish my goal)?
I eventually figured it out. The embeddings matrix is all I needed. It already has the words in their vocab as rownames, so I use those to determine the vector of each word.
Now I need to figure out how to update those vectors!
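The lookup-and-average approach described above can be sketched in Python. This is only an illustration under stated assumptions: `load_glove` and `average_vector` are hypothetical helper names, the embeddings are assumed to be in GloVe's plain-text format (a word followed by its vector components on each line), and a tiny in-memory stand-in replaces the real downloaded file.

```python
import io

def load_glove(handle):
    """Parse GloVe text format: each line is a word followed by its vector components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def average_vector(sentence, vectors, dim):
    """Average the word vectors of a sentence; unknown words count as zero vectors."""
    tokens = sentence.lower().split()
    total = [0.0] * dim
    for tok in tokens:
        vec = vectors.get(tok, [0.0] * dim)
        for i, v in enumerate(vec):
            total[i] += v
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [t / n for t in total]

# Toy 3-dimensional stand-in for the downloaded embeddings file (values are made up).
toy_file = io.StringIO("i 1.0 0.0 2.0\nlove 3.0 2.0 0.0\n")
emb = load_glove(toy_file)
doc_vec = average_vector("I love lamp", emb, dim=3)
print(doc_vec)
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle. Each document then yields one fixed-length vector that can be used as a feature row.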
I am confused about how to use HTK for language modeling.
I followed the tutorial example from the Voxforge site:
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
After training and testing I got around 78% accuracy. I did this for my native language. Now I have to use HTK for language modeling.
Is there any tutorial available for doing this? Please help me.
Thanks
speech_tri
If I understand your question correctly, you are trying to change from a "grammar" to an "n-gram language model" approach. These two methods are alternative ways of specifying what combinations of words are permissible in the responses that a recognizer will return. Having followed the Voxforge process you will probably have a grammar in place.
A language model comes from the analysis of a corpus of text which defines the probabilities of words appearing together. The text corpus used can be very specialized. There are a number of analysis tools such as SRILM (http://www.speech.sri.com/projects/srilm/) and MITLM (https://github.com/mitlm/mitlm) which will read a corpus and produce a model.
Since you are using words from your native language you will need a unique corpus of text to analyze. One way to get a test corpus would be to artificially generate a number of sentences from your existing grammar and use that as the corpus. Then with the new language model in place, you just point the recognizer at it instead of the grammar and hope for the best.
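To make "the probabilities of words appearing together" concrete, here is a minimal Python sketch of an unsmoothed bigram model estimated from a handful of sentences. It is not what SRILM or MITLM do internally (those toolkits add smoothing and back-off); `train_bigram_model` and the toy corpus are hypothetical and stand in for sentences generated from your grammar.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Return P(next | prev) estimated by relative frequency (no smoothing)."""
    unigrams = Counter()  # how often each word occurs as a left context
    bigrams = Counter()   # how often each (prev, next) pair occurs
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, nxt)] += 1
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

# Hypothetical sentences, e.g. generated from an existing grammar.
corpus = ["call john", "call mary", "dial john"]
model = train_bigram_model(corpus)
print(model[("call", "john")])
```

Word pairs that never occur in the corpus get no entry at all here, which is exactly the gap that the smoothing in real language-modeling toolkits is meant to fill.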
I am playing a bit with text classification and SVM.
My understanding is that the typical way to pick the features for the training matrix is to use a "bag of words": we end up with a matrix with one column for each distinct word in our documents, where the value in each column is the number of occurrences of that word in a given document (each document being represented by a single row).
That all works fine, and I can train my algorithm and so on, but sometimes I get an error like
Error during wrapup: test data does not match model !
After digging into it a bit, I found the answer in the question "Error in predict.svm: test data does not match model", which essentially says that if your model has features A, B and C, then the new data to be classified should contain columns A, B and C. With text this is a bit tricky, because my new documents might contain words that the classifier never saw in the training set.
More specifically, I am using the RTextTools library, which uses the SparseM and tm libraries internally; the object used to train the SVM is of type "matrix.csr".
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
UPDATE The solution suggested by @lejlot is very simple to achieve in RTextTools by using the optional originalMatrix parameter of the create_matrix function. Essentially, originalMatrix should be the SAME matrix that was created with create_matrix for TRAINING. So after you have trained your models, also keep the original document matrix, and when classifying new examples, pass that object when creating the new matrix for your prediction set.
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
Yes, and it is a very simple one. Before any training or classification you create a preprocessing object, which maps text to your vector representation. In particular, it stores the whole vocabulary used for training. Later on you reuse the same preprocessing object on test documents and simply ignore words that are outside the stored vocabulary (OOV words, as they are often referred to in the literature).
There are also plenty of more "heuristic" approaches where, instead of discarding OOV words, you try to map them to existing words (although this is less theoretically justified). In that case you create an intermediate representation that becomes your new "preprocessing" object and can handle OOV words (for example, through some Levenshtein distance mapping).
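A minimal Python sketch of such a preprocessing object, assuming whitespace tokenization and raw counts. The class name `BagOfWords` and its `fit`/`transform` methods are hypothetical and are not the RTextTools API; the point is only that the columns are fixed at training time and OOV words are dropped afterwards.

```python
class BagOfWords:
    """Stores the training vocabulary and maps any text onto those fixed columns."""

    def fit(self, documents):
        # The vocabulary (and therefore the column order) is decided once, on training data.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.index = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            row = [0] * len(self.index)
            for w in doc.lower().split():
                col = self.index.get(w)
                if col is not None:  # OOV words are silently ignored
                    row[col] += 1
            rows.append(row)
        return rows

train_docs = ["good movie", "bad movie"]
vectorizer = BagOfWords().fit(train_docs)
train_matrix = vectorizer.transform(train_docs)
test_matrix = vectorizer.transform(["good good film"])  # "film" was never seen
print(test_matrix)
```

Because the test matrix always has the same columns as the training matrix, the classifier never sees a feature mismatch, whatever words the new documents contain.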
I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple n-gram model with Kneser-Ney smoothing. It predicts the next word by finding the n-gram with maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower-order n-grams, which can be advantageous in cases where higher-order n-grams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it fails in cases where the n-gram cannot capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n<5).
I am looking into more advanced methods and I found the text2vec package, which maps words into a vector space where words with similar meaning are represented by similar (close) vectors. I have a feeling that this representation can be helpful for next word prediction, but I cannot figure out how exactly to define the training task. My question is whether text2vec is the right tool for next word prediction and, if so, what prediction algorithm is suitable for this task?
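To illustrate the limitation, here is a small Python sketch showing that both sentences give an n-gram model exactly the same conditioning context for small n. The helper `ngram_context` is hypothetical and uses a naive whitespace tokenizer.

```python
def ngram_context(sentence, n):
    """Return the last n-1 tokens, i.e. everything an n-gram model conditions on."""
    tokens = sentence.lower().replace(",", "").split()
    return tuple(tokens[-(n - 1):]) if n > 1 else ()

warm = "It is warm and sunny outside, let's go to the"
cold = "It is cold and raining outside, let's go to the"

# For small n the two contexts are identical, so the predictions must be identical too.
for n in (2, 3, 4):
    print(n, ngram_context(warm, n) == ngram_context(cold, n))
```

Only when the context window reaches back to "sunny" or "raining" do the two inputs differ, and n-grams that long are rarely seen often enough in training data to give reliable counts.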
You can try char-rnn or word-rnn (google them a little bit).
For a character-level model implemented in R/mxnet, take a look at the mxnet examples. It is probably possible to extend this code to a word-level model using text2vec GloVe embeddings.
If you have any success, let us know (I mean the text2vec and/or mxnet developers). It would be a very interesting case for the R community. I wanted to try such a model/experiment myself, but still haven't had time for it.
There is one implemented solution, available as a complete example, that uses word embeddings. The paper by Makarenkov et al. (2017), Language Models with Pre-Trained (GloVe) Word Embeddings, presents a step-by-step implementation of training a language model using a Recurrent Neural Network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide instructions to run the code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name, tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The Python code is here: https://github.com/vicmak/ProofSeer.
I also found that @Dmitriy Selivanov recently published a nice and friendly tutorial using his text2vec package, which can be useful for addressing the problem from the R perspective. (It would be great if he could comment further.)
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/
I have a corpus of text in which each line of the CSV file uniquely specifies a "topic" I am interested in. If I were to run a topic model on this corpus using an LDA or Gibbs method from either the topicmodels package or lda, I would, as expected, get multiple topics per "document" (a line of text in my CSV which I have defined a priori to be my unique topic of interest). I get that this is a result of the topic model's algorithm and the bag-of-words assumption.
What I am curious about, however, is this:
1) Is there a pre-fab'd package in R that is designed for the user to specify the topics using the empirical word distribution? That is, I don't want the topics to be estimated; I want to tell R what the topics are. I suppose I could run a topic model with the correct number of Topics, use that structure of the object and then overwrite its contents. I was just hoping there was an easier or more obvious way that I'm just not seeing at this point.
Thoughts?
edit: added -
I just thought about the alpha and beta parameters having control over the topic/term distributions within the LDA modeling algorithm. What settings might I be able to use that would force the model to only find 1 topic per document? Or is there a setting which would allow for that to occur?
If these seem like silly questions I understand - I'm quite new to this particular field and I am finding it fascinating.
What are you trying to accomplish with this approach? If you want to tell R what the topics are so it can predict the topics in other lines or documents, then RTextTools may be a helpful package.