Tokenize Text and Analyze with Dictionary in Quanteda

I am trying to do a text analysis using the quanteda package in R, and I have been able to get the desired output without doing anything to my texts. However, I am interested in removing stopwords and other common words and rerunning the analysis (from what I am learning from other sources, this preprocessing step is part of what is called "tokenizing"?). I followed the instructions at https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/.
I was able to process the text using those instructions and the quanteda package. However, I am now interested in applying a dictionary to analyze the text. How can I do that? Since it is hard to attach all my documents here, any hints or examples that I can apply would be helpful and greatly appreciated.
Thank you!

I have used the tidytext library with great success: get a sentiment lexicon, then merge it with your tokenized text by word to get the score or sentiment.
library(tidytext)
get_sentiments("afinn")   # numeric sentiment scores per word
get_sentiments("bing")    # positive/negative labels per word
You can save the lexicon as a table:
afinn <- get_sentiments("afinn")
total <- merge(dataFrameA, dataFrameB, by = "ID")   # generic merge syntax; here you would merge your tokens with the lexicon by "word"
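For the quanteda side of the question, here is a minimal sketch of tokenizing, removing stopwords, and applying a dictionary; the character vector my_texts and the dictionary keys/patterns below are made up purely for illustration:
library(quanteda)

corp <- corpus(my_texts)                          # my_texts: a character vector, one element per document
toks <- tokens(corp, remove_punct = TRUE)         # tokenize
toks <- tokens_remove(toks, stopwords("english")) # drop common stopwords

# Illustrative dictionary: each key maps to glob patterns to match
dict <- dictionary(list(
  economy  = c("econom*", "market*", "trade"),
  politics = c("politic*", "government", "elect*")
))

# Replace tokens with their dictionary keys and count matches per document
dict_counts <- dfm(tokens_lookup(toks, dictionary = dict))
dict_counts
The same dictionary object also works on an existing dfm via dfm_lookup(my_dfm, dictionary = dict).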

Related

R - how to create DocumentTermMatrix for Korean words

I hope the text mining gurus out there, including the non-Koreans, can help me with my very specific question.
I'm currently trying to create a Document Term Matrix (DTM) from a free-text variable that contains a mix of English and Korean words.
First of all, I used the cld3::detect_language function to remove the observations with non-Korean text from the data.
Second of all, I used the KoNLP package to extract only the nouns from the filtered (Korean-only) data.
Third of all, I know that with the tm package I can create a DTM rather easily.
The issue is that when I use the tm package to create the DTM, it does not work only on the extracted nouns. This is not a problem with English words, but Korean words are a different story. For example, with KoNLP I can extract the noun "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc., whereas the tm package treats all of these forms as separate terms when creating the DTM.
Is there any way I can create a DTM based on the nouns extracted with the KoNLP package?
I realize that if you don't read Korean, my question may be hard to follow. I'm hoping someone can give me a direction here.
Much appreciated in advance.
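One way to do this, sketched under the assumption that the raw free-text entries live in a character vector called docs and that KoNLP is set up and working: run the noun extraction per document, paste each document's nouns back into a single whitespace-separated string, and build the DTM from those strings, so tm only ever sees the extracted nouns.
library(cld3)
library(KoNLP)   # assumes a working rJava/KoNLP installation
library(tm)

useSejongDic()                                   # load a Korean dictionary for KoNLP

lang <- cld3::detect_language(docs)              # docs: assumed character vector of raw entries
kor  <- docs[!is.na(lang) & lang == "ko"]        # keep Korean entries only

nouns     <- lapply(kor, extractNoun)            # noun extraction per document
noun_text <- vapply(nouns, paste, character(1), collapse = " ")

corp <- VCorpus(VectorSource(noun_text))
dtm  <- DocumentTermMatrix(
  corp,
  control = list(wordLengths = c(1, Inf))        # keep short nouns; tm drops terms under 3 characters by default
)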

Count the number of tokens in a DocumentTermMatrix

I have a question about a DocumentTermMatrix. I would like to use the LDAvis package in R, and to visualize the results of the LDA algorithm I need the number of tokens in every document. I don't have the text corpus behind the DTM in question. Does anyone know how I can calculate the number of tokens for every document? Output as a list with each document name and its token count would be the perfect solution.
Kind Regards,
Tom
You can use slam::row_sums. This calculates the row sums of a document-term matrix without first converting the DTM into a dense matrix. The function comes from the slam package, which is installed alongside the tm package.
count_tokens <- slam::row_sums(dtm_goes_here)
# if you want a list
count_tokens_list <- as.list(slam::row_sums(dtm_goes_here))
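For completeness, here is a sketch of where those per-document counts end up in LDAvis, assuming the topic model was fitted with the topicmodels package and is stored in an object called lda_fit:
library(LDAvis)
library(topicmodels)
library(slam)

post <- posterior(lda_fit)                         # lda_fit: assumed LDA model fitted on dtm_goes_here

json <- createJSON(
  phi            = post$terms,                     # topic-term probabilities
  theta          = post$topics,                    # document-topic probabilities
  doc.length     = slam::row_sums(dtm_goes_here),  # tokens per document, as above
  vocab          = colnames(dtm_goes_here),        # terms in DTM column order
  term.frequency = slam::col_sums(dtm_goes_here)   # corpus-wide term counts
)
serVis(json)   # opens the interactive visualization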

R tm package stemDocument: what dictionary does it use?

My version of R is 3.4.1, platform x86_64-w64-mingw32/x64.
I am using R to find the most popular words in a document.
I would like to stem the words and then complete them. This means I need to use the SAME dictionary for both the stemming and the completion. I am confused by the tm package I am using.
Q1) The stemDocument function seems to work fine without a dictionary defined explicitly. However, I would like to define one, or at least get hold of the one it uses if it is built into R. Can I download it anywhere? Apparently I cannot do this:
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
Q2) I would like to use the SAME dictionary to complete the words, and if a word isn't in the dictionary, keep the original. I can't get this to work, so I just need to know what format the dictionary should be in, because it is currently giving me NA for all the answers. Mine has two columns, stem and word. This is just an example I found online:
dict.data = fread("Z:/Learning/lemmatization-en.txt")
I'm expecting the code to be something like:
dfCorpus <- stemCompletion_modified(dfCorpus, dictionary = "dict.data", type = "prevalent")
Thanks.
Edit: I see that I am trying to solve my problem with a hammer. Just because the documentation says to do it one way, I was trying to get it to work that way. What I actually need now is just a lookup between all English words and their base form, not their stem. I know I'm not allowed to ask for that here, but I'm sure I will find it. Have a good weekend.
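On Q1: tm's stemDocument does not use a dictionary at all; it calls SnowballC::wordStem, an algorithmic Porter/Snowball stemmer, so there is nothing to download. On Q2: stemCompletion expects its dictionary argument to be a plain character vector of complete words (or a Corpus), not a two-column table, and it should be passed as an object rather than a quoted name. A small sketch with made-up words:
library(tm)

# Illustrative dictionary: a character vector of complete (unstemmed) words,
# e.g. the vocabulary of your corpus before stemming
dict_words <- c("account", "adjustment", "service", "september")

stems <- stemDocument(c("accounts", "adjustments", "services"), language = "english")
stems   # "account" "adjust" "servic"

completed <- stemCompletion(stems, dictionary = dict_words, type = "prevalent")

# Keep the original stem whenever no completion is found
no_match <- is.na(completed) | completed == ""
completed[no_match] <- stems[no_match]
completed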

How to do Language Modeling using HTK

I am confused about how to use HTK for language modeling.
I followed the tutorial example from the Voxforge site
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
After training and testing I got around 78% accuracy. I did this for my native language. Now I have to use HTK for language modeling.
Is there any tutorial available for doing the same? Please help me.
Thanks
If I understand your question correctly, you are trying to change from a "grammar" to an "n-gram language model" approach. These two methods are alternative ways of specifying what combinations of words are permissible in the responses that a recognizer will return. Having followed the Voxforge process you will probably have a grammar in place.
A language model comes from the analysis of a corpus of text which defines the probabilities of words appearing together. The text corpus used can be very specialized. There are a number of analysis tools such as SRILM (http://www.speech.sri.com/projects/srilm/) and MITLM (https://github.com/mitlm/mitlm) which will read a corpus and produce a model.
Since you are using words from your native language you will need a unique corpus of text to analyze. One way to get a test corpus would be to artificially generate a number of sentences from your existing grammar and use that as the corpus. Then with the new language model in place, you just point the recognizer at it instead of the grammar and hope for the best.

Misspelling-aware stemming with R Text Analysis

I am new to the tm package in R. I am trying to perform a word frequency analysis, but I know there are several spelling issues within my source file, and I was wondering how I can fix these spelling errors before running the word frequency analysis.
I have already read another post (Stemming with R Text Analysis), but I have a question about the solution proposed there: is it possible to use a dictionary (a data frame, for example) to make several or all of the replacements in my corpus before creating the TermDocumentMatrix and running the word frequency analysis?
I have a data frame with the dictionary, and it has the following structure:
sept -> september
sep -> september
acct -> account
serv -> service
servic -> service
adj -> adjustment
ajuste -> adjustment
I know I could write a function to perform transformations on my corpus, but I really do not know how to automate this task and loop over each record in my data frame.
Any help would be greatly appreciated.
For the basic automatic construction of a stemmer from a standard English dictionary, Tyler Rinker's answer already shows what you want.
All you need to add is code for synthesizing likely misspellings, or for matching (common) misspellings in your corpus against the dictionary using a word-distance metric like Levenshtein distance (see ?adist) to find the closest match.
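A minimal sketch of the table-driven replacement part, assuming a lookup table with the structure shown in the question (the column names, toy texts, and the replace_from_table helper are made up for illustration); the edit-distance idea from the answer above is sketched at the end:
library(tm)

# Lookup table mirroring the structure above: one row per misspelling/abbreviation
fixes <- data.frame(
  wrong = c("sept", "sep", "acct", "serv", "servic", "adj", "ajuste"),
  right = c("september", "september", "account", "service", "service", "adjustment", "adjustment"),
  stringsAsFactors = FALSE
)

# Apply every replacement in the table, matching whole words only
replace_from_table <- content_transformer(function(text, table) {
  for (i in seq_len(nrow(table))) {
    text <- gsub(paste0("\\b", table$wrong[i], "\\b"), table$right[i], text)
  }
  text
})

corp <- VCorpus(VectorSource(c("sept acct adj pending", "servic request in sep")))  # toy corpus
corp <- tm_map(corp, replace_from_table, fixes)

tdm <- TermDocumentMatrix(corp)
inspect(tdm)   # terms are now "september", "account", "adjustment", "service", ...

# For misspellings not in the table, pick the closest dictionary word by edit distance
closest <- function(word, dict) dict[which.min(adist(word, dict))]
closest("acount", c("account", "adjustment", "service", "september"))   # "account"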
