I have a corpus of N documents classified as spam / no-spam. I am following the standard procedure to pre-process the data in R (code here). The pre-processing ends with a DocumentTermMatrix using tf-idf weights.
Now I want to classify new documents with my model.
How can I calculate the corresponding DocumentVector (using the tf of the document and the idfs of the corpus) for a single new document? I would like to avoid recalculating the DocumentTermMatrix for the whole corpus.
I had a similar problem in the past, and this functionality is not included in the tm package. Ingo Feinerer suggested building a function to get the DocumentVector. The function needs the previously built tdm or dtm of the corpus plus the new document. First pre-process the new document in the same way you did for the corpus and create a list with its words and tf values. Then merge it against the terms of the tdm/dtm (e.g. tdm$dimnames$Terms) so that the new document ends up with the same Terms as your corpus, filled with the tf values of the document (a simple merge). Then weight the tf by the idfs of the corpus in the standard way:
cs <- slam::row_sums(tdm > 0)     # number of documents containing each term
lnrs <- log2(nDocs(tdm) / cs)     # idf of the corpus
tf * lnrs                         # tf-idf weighted DocumentVector
Finish by returning this DocumentVector.
You can then use the vector as a data.frame when predicting with the SVM directly.
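Putting it together, a minimal sketch of such a function (the names document_vector and new_doc_tf are placeholders; new_doc_tf is assumed to be a named vector of raw term frequencies of the already pre-processed new document, and tdm the TermDocumentMatrix of the training corpus):

library(tm)
library(slam)

document_vector <- function(new_doc_tf, tdm) {
  terms <- Terms(tdm)
  # align the new document with the corpus vocabulary (the "simple merge" above)
  tf <- new_doc_tf[terms]
  tf[is.na(tf)] <- 0
  names(tf) <- terms
  # idf of the corpus, following the weighting used above
  cs <- slam::row_sums(tdm > 0)
  lnrs <- log2(nDocs(tdm) / cs)
  tf * lnrs
}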
I don't know which SVM library you use, but it seems that your SVM model is stored in Corpus.svm -- correct?
For prediction of a new document you can follow the procedure described in task 2 at http://planatscher.net/svmtut/svmtut.html. If you use some other library the procedure will be similar; there is also a practical example with the iris dataset. The only difference is that your new document has to be processed in the same way as the training examples (i.e. remove stopwords, apply tf-idf, ...).
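For instance, if the model was trained with e1071::svm, prediction would look roughly like this (a sketch only; document_vector is the hypothetical helper from the previous answer, and the column names of the new row must match the training data):

library(e1071)
new_vec <- document_vector(new_doc_tf, tdm)   # tf-idf vector aligned with the training terms
new_row <- as.data.frame(t(new_vec))          # one row with the same columns as the training data
predict(Corpus.svm, newdata = new_row)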
I have a question about a DocumentTermMatrix. I would like to use the LDAvis package in R. To visualize my results of the LDA algorithm I need to calculate the number of tokens of every document. I don't have the text corpus for the considered DTM. Does anyone know how I can calculate the number of tokens for every document? An output listing each document name with its number of tokens would be the perfect solution.
Kind Regards,
Tom
You can use slam::row_sums. This calculates the row sums of a document-term matrix without first transforming the DTM into a dense matrix. The function comes from the slam package, which is installed alongside the tm package.
count_tokens <- slam::row_sums(dtm_goes_here)
# if you want a list
count_tokens_list <- as.list(slam::row_sums(dtm_goes_here))
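The result is already named by the document IDs, so (assuming your matrix is called dtm_goes_here) you can also pair names and counts directly, for example:

count_tokens <- slam::row_sums(dtm_goes_here)
# names are the document IDs, values are the token counts per document
data.frame(document = names(count_tokens), tokens = unname(count_tokens))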
I would like to train the word2vec model on my own corpus using the rword2vec package in R.
The word2vec function that is used to train the model requires a train_file. The package's documentation in R simply notes that this is the training text data, but doesn't specify how it can be created.
The training data used in the example on GitHub can be downloaded here:
http://mattmahoney.net/dc/text8.zip. I can't figure out what type of file it is.
I've looked through the README file on the rword2vec GitHub page and checked out the official word2vec page on Google Code.
My corpus is a .csv file with about 68,000 documents. File size is roughly 300MB. I realize that training the model on a corpus of this size might take a long time (or be infeasible), but I'm willing to train it on a subset of the corpus. I just don't know how to create the train_file required by the function.
After you unzip text8, you can open it with a text editor. You'll see that it is one long document. You will need to decide how many of your 68,000 documents you want to use for training and whether you want to concatenate them together or keep them as separate documents. See https://datascience.stackexchange.com/questions/11077/using-several-documents-with-word2vec
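If you decide to concatenate them, a rough sketch of producing a plain-text train_file from your CSV could look like this (the column name text and the file names are assumptions; the word2vec() call mirrors the rword2vec README example):

docs <- read.csv("corpus.csv", stringsAsFactors = FALSE)
# crude clean-up: lower-case and strip non-alphanumeric characters, similar to text8
clean <- tolower(gsub("[^[:alnum:] ]", " ", docs$text))
writeLines(paste(clean, collapse = " "), "train_corpus.txt")

library(rword2vec)
model <- word2vec(train_file = "train_corpus.txt", output_file = "vec.bin", binary = 1)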
I'm trying to convert textual data into vectors using GloVe in R. My plan was to average the word vectors of a sentence, but I can't seem to get to the word-vectorization stage. I've downloaded the glove.6B.50d.txt file and its parent zip file from https://nlp.stanford.edu/projects/glove/, and I have visited text2vec's website and tried running through their example where they load Wikipedia data, but I don't think it's what I'm looking for (or perhaps I am not understanding it). I'm trying to load the pretrained embeddings into a model so that if I have a sentence (say 'I love lamp') I can iterate through that sentence and turn each word into a vector that I can then average (turning unknown words into zeros) with a function like vectorize(word). How do I load the pretrained embeddings into a GloVe model as my corpus (and is that even what I need to do to accomplish my goal)?
I eventually figured it out. The embeddings matrix is all I needed. It already has the words in their vocab as rownames, so I use those to determine the vector of each word.
Now I need to figure out how to update those vectors!
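A minimal sketch of that approach (assuming the 50-dimensional glove.6B.50d.txt file from the Stanford download is in the working directory; vectorize and sentence_vector are just illustrative names):

lines <- readLines("glove.6B.50d.txt")
parts <- strsplit(lines, " ", fixed = TRUE)
embeddings <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(50)))
rownames(embeddings) <- vapply(parts, `[`, character(1), 1)

vectorize <- function(word) {
  # unknown words become zero vectors
  if (word %in% rownames(embeddings)) embeddings[word, ] else numeric(50)
}

sentence_vector <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  colMeans(do.call(rbind, lapply(words, vectorize)))
}

sentence_vector("i love lamp")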
How do I create a DocumentTermMatrix directly from a list/vector of terms?
I'd like to calculate LDA for my corpus using bigrams instead of words. Thus I do the following:
Convert each document to words via txt.to.words
Create bigrams using stylo package via make.ngrams(res, ngram.size = 2)
Remove bigrams where at least one word is from my stoplist
But here is the problem: LDA wants a DocumentTermMatrix as a parameter. How do I create one with my bigrams instead of raw text?
There is an example in tm's FAQ that explains how to use bigrams instead of single tokens in a term-document matrix, but it produces a TermDocumentMatrix, which LDA won't accept.
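One way around this, sketched below, is to build the DocumentTermMatrix directly with the FAQ's bigram tokenizer, or to transpose the TermDocumentMatrix with tm's t() method (corp is a placeholder for your tm corpus):

library(tm)
library(NLP)

# bigram tokenizer from the tm FAQ
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm <- DocumentTermMatrix(corp, control = list(tokenize = BigramTokenizer))
# alternatively: dtm <- t(TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer)))

# lda <- topicmodels::LDA(dtm, k = 10)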
I am playing a bit with text classification and SVM.
My understanding is that the typical way to pick the features for the training matrix is essentially a "bag of words": we end up with a matrix with as many columns as there are distinct words in our documents, and the value of each column is the number of occurrences of that word per document (each document being represented by a single row).
So that all works fine, I can train my algorithm and so on, but sometimes I get an error like
Error during wrapup: test data does not match model !
By digging into it a bit, I found the answer in this question, Error in predict.svm: test data does not match model, which essentially says that if your model has features A, B and C, then your new data to be classified should contain columns A, B and C. Of course with text this is a bit tricky: my new documents to classify might contain words that have never been seen by the classifier in the training set.
More specifically, I am using the RTextTools library, which uses the SparseM and tm libraries internally; the object used to train the SVM is of type "matrix.csr".
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
UPDATE The solution suggested by @lejlot is very simple to achieve in RTextTools by making use of the originalMatrix optional parameter of the create_matrix function. Essentially, originalMatrix should be the SAME matrix that you created with create_matrix when TRAINING the data. So after you have trained your data and have your models, keep the original document matrix as well; when classifying new examples, pass that object when creating the matrix for your prediction set.
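In code, that update amounts to something like this (a sketch; train_texts and new_texts are placeholders, and the weighting/stopword options just need to match whatever you used for training):

library(RTextTools)

train_matrix <- create_matrix(train_texts, language = "english",
                              removeStopwords = TRUE, weighting = tm::weightTfIdf)
# ... create_container(), train_model(), etc., keeping train_matrix around ...

pred_matrix <- create_matrix(new_texts, originalMatrix = train_matrix,
                             language = "english",
                             removeStopwords = TRUE, weighting = tm::weightTfIdf)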
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
Yes, and it is a very trivial one. Before applying any training or classification you create a preprocessing object, which is supposed to map text to your vector representation; in particular, it stores the whole vocabulary used for training. Later on you reuse the same preprocessing object on the test documents, and you simply ignore words from outside the vocabulary stored before (OOV words, as they are often referred to in the literature).
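With tm this can be done through the dictionary control option, roughly as follows (train_corpus and test_corpus are placeholders):

library(tm)

train_dtm <- DocumentTermMatrix(train_corpus)
vocab <- Terms(train_dtm)                 # the stored vocabulary (the "preprocessing object")

# test documents are tabulated against the same vocabulary; OOV words are dropped
test_dtm <- DocumentTermMatrix(test_corpus, control = list(dictionary = vocab))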
Obviously there are plenty of other, more "heuristic" approaches, where instead of discarding OOV words you try to map them to existing ones (although this is less theoretically justified). In that case you would create an intermediate representation, a new "preprocessing" object that can handle OOV words (through some Levenshtein-distance mapping, etc.).