R tm package stemDocument: what dictionary does it use?

My version of R is 3.4.1, platform x86_64-w64-mingw32/x64.
I am using R to find the most popular words in a document.
I would like to stem the words and then complete them, which means I need to use the same dictionary for both the stemming and the completion. I am confused by the tm package I am using.
Q1) The stemDocument function seems to work fine without a dictionary defined explicitly. However, I would like to define one, or at least get hold of the one it uses if it is built into R. Can I download it anywhere? Apparently I cannot. This is the call I am currently using:
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
Q2) I would like to use the same dictionary to complete the words, and if they aren't in the dictionary, keep the original. I can't get this to work, so I just need to know what format the dictionary should be in, because it is currently giving me NA for all the answers. It has two columns, stem and word. This is just an example I found online:
dict.data = fread("Z:/Learning/lemmatization-en.txt")
I'm expecting the code to be something like:
dfCorpus <- stemCompletion_modified(dfCorpus, dictionary = dict.data, type = "prevalent")
Thanks.
Edit: I see that I am trying to solve my problem with a hammer. Just because the documentation says to do it one way, I was trying to get it to work that way. What I actually need now is just a lookup between all English words and their base form, not their stem. I know I'm not allowed to ask for that here, but I'm sure I will find it. Have a good weekend.
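For what it's worth, tm's stemDocument wraps the Snowball (Porter) algorithm from the SnowballC package, so there is no lookup dictionary behind it that you could download; stemCompletion, by contrast, takes a plain character vector of words or a Corpus as its dictionary, not a two-column table. A minimal sketch along those lines, assuming dfCorpus is your tm corpus (the fallback for unmatched stems is my own addition):
library(tm)
library(SnowballC)
dict_corpus <- dfCorpus                                   # keep the unstemmed corpus as the completion dictionary
dfCorpus <- tm_map(dfCorpus, stemDocument, language = "english")
complete_text <- function(txt, dict) {
  words <- unlist(strsplit(txt, "\\s+"))
  done  <- stemCompletion(words, dictionary = dict, type = "prevalent")
  miss  <- is.na(done) | done == ""
  done[miss] <- words[miss]                               # keep the stem when no completion is found
  paste(done, collapse = " ")
}
dfCorpus <- tm_map(dfCorpus, content_transformer(complete_text), dict = dict_corpus)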

Related

R - how to create DocumentTermMatrix for Korean words

I hope the text mining gurus out there, including non-Koreans, can help me with my very specific question.
I'm currently trying to create a document-term matrix (DTM) from a free-text variable that contains a mix of English and Korean words.
First of all, I used the cld3::detect_language function to remove the non-Korean observations from the data.
Second of all, I used the KoNLP package to extract nouns only from the filtered (Korean-only) data.
Third of all, I know that by using the tm package I can create a DTM rather easily.
The issue is that when I use the tm package to create the DTM, it does not let me keep only the nouns. This is not an issue if you're dealing with English words, but Korean is a different story. For example, with KoNLP I can extract the noun "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc., but the tm package does not recognize this and treats all of these terms separately when creating the DTM.
Is there any way I can create a DTM based on nouns that were extracted from KoNLP package?
I've noticed that if you're not Korean, you may have difficulty understanding my question. I'm hoping someone can give me a direction here.
Much appreciated in advance.
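For what it's worth, one route I would sketch (untested against your data; texts stands in for your filtered Korean character vector) is to run KoNLP's extractNoun yourself, paste the extracted nouns back into one string per document, and only then hand the result to tm, so the DTM is built from nouns alone:
library(KoNLP)
library(tm)
nouns_only <- vapply(texts, function(doc) {
  paste(extractNoun(doc), collapse = " ")                 # keep only the nouns KoNLP finds
}, character(1))
noun_corpus <- Corpus(VectorSource(nouns_only))
# lower the minimum word length, since Korean nouns are often only 1-2 characters
dtm <- DocumentTermMatrix(noun_corpus, control = list(wordLengths = c(1, Inf)))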

Italian Stemmer alternative to Snowball

I'm trying to analyze Italian texts in R.
As you do in any textual analysis, I have eliminated all the punctuation, special characters and Italian stopwords.
But I have a problem with stemming: there is only one Italian stemmer (Snowball), and it is not very precise.
To do the stemming I used the tm library, in particular the stemDocument function, and I also tried the SnowballC library; both lead to the same result.
stemDocument(content(myCorpus[[1]]),language = "italian")
The problem is that the resulting stems are not very precise. Are there other, more precise Italian stemmers?
Or is there a way to extend the stemming already present in the tm library by adding new terms?
Another alternative you can check out is the package from this person; he has it for many different languages. Here is the link for Italian.
Whether it will help your case or not is another debate, but dictionary-based stemming can also be implemented via the corpus package. A sample example (for an English use case; tweak it for Italian) is given in their documentation if you move down to the Dictionary Stemmer section.
Alternatively, in a similar vein, you can also consider the stemmers or lemmatizers (if you haven't considered lemmatizers, they are worth considering) from Python libraries such as NLTK or spaCy and check whether you get better results. After all, they are just files containing mappings of root words to child words. Download them, fine-tune the file to your requirements, and use the mappings as you see fit by passing them through a custom-made function, as sketched below.
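To make the dictionary idea concrete, here is a rough sketch of a lookup-based stemmer that can be handed to the corpus package; the file name lemmas-it.txt and its two-column layout are placeholders for whatever mapping you download (corpus also provides a new_stemmer() helper for building one from a term/stem table, if you prefer):
library(corpus)
dict <- read.delim("lemmas-it.txt", header = FALSE,
                   col.names = c("term", "stem"), stringsAsFactors = FALSE)
lookup <- setNames(dict$stem, dict$term)
my_stemmer <- function(x) ifelse(x %in% names(lookup), unname(lookup[x]), x)   # keep unknown words unchanged
text_tokens("analizzare i testi italiani", stemmer = my_stemmer)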

Tokenize Text and Analyze with Dictionary in Quanteda

I am trying to do a text analysis using the quanteda package in R and have been successful in getting the desired output without doing anything to my texts. However, I am interested in removing stopwords and other common phrases and rerunning the analysis (from what I am learning from other sources, this process is called "tokenizing"(?)). (The instructions are from https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/)
I was able to process the text using those instructions and the quanteda package. However, I am interested in applying a dictionary when analyzing the text. How can I do that? Since it is hard to attach all my documents here, any hints or examples that I can apply would be helpful and greatly appreciated.
Thank you!
I have used this library with great success and then merged by word to get the score or sentiment:
library(tidytext)
get_sentiments("afinn")
get_sentiments("bing")
You can save it as a table and merge it with your tokenized data:
afinn <- get_sentiments("afinn")
total <- merge(dataframeA, dataframeB, by = "ID")  # here, merge your one-word-per-row data with the lexicon by = "word"
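If you would rather stay inside quanteda itself, it has its own dictionary() plus tokens_lookup()/dfm_lookup() machinery; a minimal sketch (the corpus object my_corpus and the dictionary categories below are made up) would be:
library(quanteda)
toks <- tokens(my_corpus, remove_punct = TRUE)            # my_corpus: your quanteda corpus
toks <- tokens_remove(toks, stopwords("english"))         # drop stopwords
dict <- dictionary(list(economy = c("econom*", "market*"),
                        health  = c("hospital*", "doctor*")))
dfm(tokens_lookup(toks, dictionary = dict))               # counts per dictionary category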

Type/Token Ratio in R

I'm working with a new corpus and want to get the type/token ratio. Does anyone know of a standard way to do this? I've been trawling around the internet and didn't find anything relevant. Even the tm package doesn't seem to have an easy way to do this.
Just as a reference, I have the following code to tokenize the corpus:
fwseven <- scan(what="c", sep="\n", file="<my file>") #read file
fwseven <- tolower(fwseven) #make lowercase
fwords <- unlist(strsplit(fwseven, "[^a-z]+")) #delete numbers from tokenizing
fwords.clean <- fwords[fwords != ""] #delete empty strings
tokens<-length(fwords.clean) #get number of tokens
I thought there would be an easier way than just making an empty vector and a for loop to run through each individual word of the corpus, but perhaps there's not. If that's the case, I've got the following code, but I'm running into problems with the if statement.
typelist <- vector()           # empty vector
for (i in tokens) {            # for loop running through the list of string tokens
  if (i in typelist)
    typelist <- typelist       # if i is in typelist, do nothing
  else {
    typelist <- typelist + i   # if i isn't in typelist, add it to typelist
  }
}
It's been a long time since I've used R; how would I change the if statement for it to check if the string i is already contained in list typelist?
The tm package is probably the simplest way to do this.
Example data (since I don't have your text.fwt file...):
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Use the tm package to create a corpus and document term matrix from these texts:
library(tm)
# create a corpus
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))
# process to remove stopwords, punctuation, etc.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
# create a document term matrix
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
Find the number of tokens (total number of words in corpus)
n_tokens <- sum(as.matrix(corpus2a.dtm))
Find the number of types (number of unique words in corpus)
n_types <- length(corpus2a.dtm$dimnames$Terms)
So now we can easily find the type/token ratio:
n_types / n_tokens
[1] 0.6170213
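As an aside, the membership test the question's loop was missing is %in%, and base R can get the same ratio without tm at all; a quick sketch using the fwords.clean and tokens objects from the question:
typelist <- character(0)
for (w in fwords.clean) {
  if (!(w %in% typelist)) typelist <- c(typelist, w)      # only add words not seen before
}
length(typelist) / tokens
# or, skipping the loop entirely:
length(unique(fwords.clean)) / tokens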

R equivalent to matrix row insertion in Matlab

In Matlab, without any coding, I can create a matrix, open up its spreadsheet, and copy multiple columns of values from Excel and paste them into the spreadsheet. I can then right click this matrix and plot it instantly.
I've tried googling how to do the equivalent in R, and everything seems to involve creating a function that iterates over each value with a for loop. This seems a bit cumbersome; is there an equivalently simple way to do this in RStudio?
Thanks.
You can certainly get similar functionality by using R's integration with the clipboard. In particular, the standard R functions that support clipboard operations include the connection functions (base package), such as file(), url(), pipe() and others; the clipboard text-transfer functions (utils package), such as readClipboard() and writeClipboard(); and the data-import functions (base package) that take a connection argument, such as scan() or read.table().
This functionality differs from platform to platform. In particular, on Windows you use the connection name "clipboard", while on Mac (OS X) you can use pipe("pbpaste") (see this StackOverflow discussion for more details and alternative methods). The Kmisc package appears to offer a platform-independent approach to this functionality; however, I haven't used it so far, so I can't really confirm that it works as expected. See this discussion for details.
The following code is the simplest example of how you would use the above-mentioned functionality:
read.table("clipboard", sep = "\t", header = TRUE)
An explanation and further examples are available in this blog post. As far as plotting the imported data goes, RStudio not only allows you to use standard R approaches, but also adds an element of interactivity via its bundled manipulate package. See this post for more details and examples.
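Putting it together, a minimal sketch of the copy-from-Excel-and-plot workflow (the object name dat and the assumption that your copied block has a header row are mine):
dat <- read.table("clipboard", sep = "\t", header = TRUE)        # Windows, after copying cells in Excel
# dat <- read.table(pipe("pbpaste"), sep = "\t", header = TRUE)  # macOS equivalent
matplot(dat, type = "l")                                         # quick base-R plot of every column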
