Count the number of tokens in a DocumentTermMatrix - r

I have a question about a DocumentTermMatrix. I would like to use the LDAvis package in R. To visualize my results of the LDA algorithm I need to calculate the number of tokens in every document. I don't have the text corpus for the DTM in question. Does anyone know how I can calculate the number of tokens for every document? A list with each document name and its number of tokens would be the perfect solution.
Kind Regards,
Tom

You can use slam::row_sums. This calculates the row sums of a document-term matrix without first converting the DTM into a dense matrix. The function comes from the slam package, which is installed when you install the tm package.
count_tokens <- slam::row_sums(dtm_goes_here)  # named vector: one token count per document
# if you want a list
count_tokens_list <- as.list(slam::row_sums(dtm_goes_here))
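If the goal is to feed these counts into LDAvis, a rough sketch might look like the following; it assumes a model fitted with topicmodels::LDA and stored in lda_model (that object name, and the use of topicmodels, are assumptions, not part of the original question):

library(LDAvis)
library(topicmodels)

post <- posterior(lda_model)                         # assumed: a fitted topicmodels::LDA model
json <- createJSON(
  phi            = post$terms,                       # topic-term probabilities
  theta          = post$topics,                      # document-topic probabilities
  doc.length     = slam::row_sums(dtm_goes_here),    # tokens per document, as above
  vocab          = colnames(post$terms),             # terms, in the same order as phi
  term.frequency = slam::col_sums(dtm_goes_here)     # corpus-wide term counts
)
serVis(json)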

Related

Tokenize Text and Analyze with Dictionary in Quanteda

I am trying to do a text analysis using the quanteda package in R and have been successful in getting the desired output without doing anything to my texts. However, I am interested in removing stopwords and other common phrases and rerunning the analysis (from what I am learning in other sources, this process is called "tokenizing"(?)). (The instructions are from https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/)
I was able to process the text using those instructions and the quanteda package. However, I am interested in applying a dictionary to analyze the text. How can I do that? Since it is hard to attach all my documents here, any hints or examples that I can apply would be helpful and greatly appreciated.
Thank you!
I have used the tidytext library with great success and then merged by word to get the score or sentiment. Merge by word:
library(tidytext)
get_sentiments("afinn")
get_sentiments("bing")
you can save it as a table
table <- get_sentiments("afinn")
total <- merge(dataframeA, dataframeB, by = "word")  # e.g. your tokenized words merged with the lexicon
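For context, a short sketch of that merge-by-word idea with tidytext (the data frame my_texts with columns doc and text is an assumption, and recent tidytext versions name the AFINN score column value):

library(dplyr)
library(tidytext)

tokens <- my_texts %>%
  unnest_tokens(word, text) %>%              # one row per token
  anti_join(stop_words, by = "word")         # drop the stopwords shipped with tidytext

afinn <- get_sentiments("afinn")             # columns: word, value

scores <- tokens %>%
  inner_join(afinn, by = "word") %>%         # the "merge by word" step
  group_by(doc) %>%
  summarise(sentiment = sum(value))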

Using GloVe's pretrained glove.6B.50.txt as a basis for word embeddings in R

I'm trying to convert textual data into vectors using GloVe in R. My plan was to average the word vectors of a sentence, but I can't seem to get to the word vectorization stage. I've downloaded the glove.6B.50.txt file and its parent zip file from https://nlp.stanford.edu/projects/glove/, and I have visited text2vec's website and tried running through their example where they load Wikipedia data. But I don't think it's what I'm looking for (or perhaps I am not understanding it). I'm trying to load the pretrained embeddings into a model so that if I have a sentence (say 'I love lamp') I can iterate through that sentence and turn each word into a vector that I can then average (turning unknown words into zeros) with a function like vectorize(word). How do I load the pretrained embeddings into a GloVe model as my corpus (and is that even what I need to do to accomplish my goal)?
I eventually figured it out. The embeddings matrix is all I needed. It already has the vocabulary words as row names, so I use those to look up the vector for each word.
Now I need to figure out how to update those vectors!
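For anyone landing here, a minimal sketch of that idea; the file path and the assumption that each line is a word followed by its 50 numbers are mine, not the original poster's:

# read the pretrained vectors into a matrix with the vocabulary as row names
glove <- read.table("glove.6B.50d.txt", sep = " ", quote = "", comment.char = "",
                    row.names = 1)
emb <- as.matrix(glove)

# look up a single word, returning zeros for out-of-vocabulary words
vectorize <- function(word) {
  if (word %in% rownames(emb)) emb[word, ] else rep(0, ncol(emb))
}

# average the word vectors of a sentence
sentence_vector <- function(sentence) {
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  colMeans(do.call(rbind, lapply(words, vectorize)))
}

sentence_vector("I love lamp")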

How do I create DocumentTermMatrix directly from list/vector of terms?

How do I create a DocumentTermMatrix directly from a list/vector of terms?
I'd like to calculate LDA for my corpus using bigrams instead of words. Thus I do the following:
Convert each document to words via txt.to.words
Create bigrams with the stylo package via make.ngrams(res, ngram.size = 2)
Remove bigrams where at least one word is from my stoplist
But here is the problem: LDA wants a DocumentTermMatrix as a parameter. How do I create one from my bigrams instead of raw text?
There is an example in tm's FAQ that explains how to use bigrams instead of single tokens in a term-document matrix, but it produces a TermDocumentMatrix that LDA won't accept.
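Two hedged sketches, neither from the thread itself: a TermDocumentMatrix produced by tm's bigram-tokenizer FAQ example can simply be transposed, or a DocumentTermMatrix can be built directly from per-document bigram vectors (the named list bigram_list is an assumed object):

library(tm)
library(slam)

# option 1: transpose an existing TermDocumentMatrix
dtm <- t(tdm)                                  # tm's t() method yields a DocumentTermMatrix

# option 2: count bigrams per document and coerce to a DocumentTermMatrix
vocab  <- sort(unique(unlist(bigram_list)))
counts <- t(sapply(bigram_list, function(terms) table(factor(terms, levels = vocab))))
dtm2   <- as.DocumentTermMatrix(as.simple_triplet_matrix(counts), weighting = weightTf)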

Misspelling-aware stemming with R Text Analysis

I am new to the tm package in R. I am trying to perform a word frequency analysis, but I know that there are several spelling issues within my source file, and I was wondering how I can fix these spelling errors before performing the word frequency analysis.
I have already read another post (Stemming with R Text Analysis), but I have a question about the solution proposed there: is it possible to use a dictionary (a data frame, for example) to make several/all the replacements in my corpus before creating the TermDocumentMatrix and then running the word frequency analysis?
I have a data frame with the dictionary, and it has the following structure:
sept -> september
sep -> september
acct -> account
serv -> service
servic -> service
adj -> adjustment
ajuste -> adjustment
I know I could develop a function to perform transformations on my corpus, but I really do not know how to automate this task and loop over each record in my data frame.
Any help would be greatly appreciated.
For the basic automatic construction of a stemmer from a standard English dictionary, Tyler Rinker's answer already shows what you want.
All you need to add is code for synthesizing likely misspellings, or matching (common) misspellings in your corpus using a word-distance metric like Levenshtein distance (see adist) to find the closest match in the dictionary.
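A hedged sketch of both ideas, assuming a tm VCorpus called corpus and a data frame replacements with columns wrong and right (column names chosen to match the mapping listed in the question):

library(tm)

# apply every wrong -> right mapping to each document before building the TermDocumentMatrix
replace_from_dict <- content_transformer(function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    x <- gsub(paste0("\\b", dict$wrong[i], "\\b"), dict$right[i], x)
  }
  x
})
corpus <- tm_map(corpus, replace_from_dict, replacements)

# for misspellings not in the data frame: snap a term to its closest dictionary entry
closest_match <- function(term, vocab) {
  vocab[which.min(adist(term, vocab))]          # adist() computes Levenshtein distance (base R)
}
closest_match("septmber", c("september", "account", "service", "adjustment"))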

Classification/Prediction in R

I have a corpus of N documents classified as spam / no-spam. I am following the standard procedure to pre-process the data in R (code here). The pre-processing ends with a DocumentTermMatrix weighted with tf-idf.
Now I want to classify new documents with my model.
How can I calculate the corresponding DocumentVector (using the tf of the document and the idfs of the corpus) for a single new document? I would like to avoid recalculating the DocumentTermMatrix for the whole corpus.
I had a similar problem in the past, and this functionality is not included in the tm package. Ingo Feinerer suggested building a function to get the document vector. The function needs the previously built tdm/dtm of the corpus and the new document. First pre-process the new document in the same way as the corpus and create a list of its words with their tf values. Then merge those words with the terms of the tdm/dtm (e.g. tdm$dimnames$Terms), so that the new document is transformed to have the same terms as your corpus, with the tf values of the document (a simple merge). Then weight the tf by the idfs of the corpus in the standard way:
cs <- slam::row_sums(tdm > 0)      # number of corpus documents containing each term
lnrs <- log2(nDocs(tdm) / cs)      # idf of each corpus term
doc_vector <- tf * lnrs            # tf-idf weights for the new document
and finish by returning your document vector.
You can then use that vector as a data.frame when predicting with the SVM directly.
I don't know what svm library you use, but it seems that your SVM model is stored in Corpus.svm -- correct?
For prediction of a new document you can follow the procedure described at http://planatscher.net/svmtut/svmtut.html (task 2). If you use some other library, the procedure will be similar. There is also a practical example with the iris dataset. The only difference is that your new document has to be processed in the same way as the training examples (i.e. remove stopwords, tf-idf, ...).
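A minimal hedged sketch of that prediction step, assuming the model was fitted with e1071::svm on the rows of the tf-idf DocumentTermMatrix and stored in Corpus.svm (only the object name from this thread is reused; everything else is an assumption):

library(e1071)

# one-row data frame whose columns are the corpus terms, in the same order used for training
newdata <- as.data.frame(t(doc_vector))          # doc_vector: the tf*idf vector built above
predict(Corpus.svm, newdata)                     # returns the predicted class (spam / no-spam)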
