Text Categorization In R for single paragraph - r

I have been searching for a library or function in R that categorizes a single paragraph of text without any training involved. I need to categorize contact center call data, one call at a time. Each call should be categorized according to the terms used by the agent or caller. The terms may not be consecutive, so a bigram approach does not fit.
For example, the following sample text should be categorized as something like "Router Internet issues":
"Hello thank you for calling XYZ solutions. This is Mark. How can I help you?
Hi, I have been facing issues in connecting to internet. There seems to be some issue with my router. "
I have tried the OpenNLP and RTextTools libraries in R, but could not figure out how to process a single paragraph. Does anyone have any ideas? Any help is appreciated.
Edited
As I am a beginner in R, I would much appreciate a thorough solution if possible.

It looks like you are trying to extract keywords from a document and use them as tags/labels. You may want to look at the R package RKEA: http://www.nzdl.org/Kea/Download/Kea-5.0-Readme.txt
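If a fixed set of categories is acceptable, a plain keyword lookup in base R may already be enough for a single paragraph. This is only a minimal sketch and is not what RKEA does: the `categories` dictionary, its keywords, and the `categorize()` helper are all hypothetical and would need to be adapted to the real call data.

```r
# Hypothetical category dictionary: the categories and keywords are assumptions
categories <- list(
  Router   = c("router", "modem"),
  Internet = c("internet", "wifi", "connection"),
  Billing  = c("bill", "invoice", "payment")
)

# Hypothetical helper: returns every category with at least one keyword in the text
categorize <- function(text, dict = categories) {
  # Lowercase and split into single words, so term order and adjacency do not matter
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  hits <- vapply(dict, function(keys) any(keys %in% words), logical(1))
  names(dict)[hits]
}

call_text <- "Hello thank you for calling XYZ solutions. This is Mark. How can I help you? Hi, I have been facing issues in connecting to internet. There seems to be some issue with my router."
print(categorize(call_text))
```

Because matching is done on whole words, "connecting" does not match "connection"; stemming or regular expressions would be needed to catch such variants.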

Related

Repeatable Macros in R?

Is there any way in R to write a macro like one would in SAS? That is, I want to write a macro with some input variable (corresponding to a row in a dataset) so I can quickly make a plot of certain characteristics from said row. Any information regarding a package/method to do so would be greatly appreciated.
R will generate some very, very, very basic code for you. If you have RStudio installed, you can click File > Import Dataset > From ... point to your file and click 'Open'. R will automatically create the code to do the import. Again, this is very basic. You really need to know how to code to do anything useful.
You get out of it what you put into it, so spend some time learning this stuff, and inevitably you'll learn a ton. I've found that it's very helpful to read through people's questions that are posted here, and try to solve the problem yourself. You'll learn a lot that way and you'll see what the current trends are. Reading books is great, of course, but sometimes I feel like some authors are too academic, and in the real world, sometimes it's done differently than what you see in textbooks.
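The closest R equivalent of a SAS macro is usually an ordinary function. As a rough sketch of what the question describes, assuming the data set is a data frame and "characteristics" means its numeric columns, a helper such as the hypothetical `plot_row()` below takes a row number and draws a bar plot of that row:

```r
# Hypothetical helper (not from any package): bar plot of the numeric values in one row
plot_row <- function(data, row, cols = names(data)[sapply(data, is.numeric)]) {
  # Pull the chosen row as a named numeric vector
  values <- unlist(data[row, cols, drop = FALSE])
  barplot(values, main = paste("Row", row), las = 2)
  invisible(values)  # return the plotted values without printing them
}

# mtcars ships with R; plot the first car's measurements
plot_row(mtcars, 1)
```

Calling `plot_row(mtcars, 5)` or `plot_row(mtcars, 1, cols = c("mpg", "hp"))` reuses the same code for another row or a subset of columns.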

Using R's exams package for assignments: Is it possible to add question hints?

The exams package is a really fantastic tool for generating exams from R.
I am interested in the possibilities of using it for (programming) assignments. The main difference from an exam is that besides solutions I'd also like hints to be included in the PDF / HTML output file.
Typically I put the hints for (sub)-questions in a separate section at the end of the PDF assignment (using a separate LaTeX section), but this requires manual labour. These are for students to consult if they need help getting started on any particular exercise, and it avoids having them look at the solutions directly for hints on how to start.
An assignment might look like:
Question 1
Question 2 ...
Question 10
Hints to all questions
I'd be open to changing the exact format as long as it is possible to look up hints without looking up the answer, and it remains optional to read the hints.
So in fact I am looking for an intermediate "hints" section between the "question" and "solution" sections, which is present for some questions but not for all.
My questions: Is this already possible? If not, how could this be implemented using the exams package?
R/exams does not have dedicated/native support for this kind of assignment so it isn't available out of the box. So if you want to get this kind of processing you have to ensure it yourself using LaTeX for PDF or CSS for HTML.
In LaTeX I think it should be possible to do what you want using the newfloat and endfloat packages in the LaTeX template that you pass to exams2pdf(). Any LaTeX template needs to provide {question} and {solution} environments, e.g., the plain.tex template shipped with the package has
\newenvironment{question}{\item \textbf{Problem}\newline}{}
\newenvironment{solution}{\textbf{Solution}\newline}{}
with the exercises embedded as
\begin{enumerate}
%% \exinput{exercises}
\end{enumerate}
Now instead of the \newenvironment{solution}... you could use
\usepackage{newfloat,endfloat}
\DeclareFloatingEnvironment{hint}
\DeclareDelayedFloat{hint}{Hint}
\DeclareFloatingEnvironment{solution}
\DeclareDelayedFloat{solution}{Solution}
This defines two new floating environments {hint} and {solution}, which are then declared as delayed floats. You would then need to customize these environments regarding the text displayed within the questions at the beginning and the listing at the end. I'm not sure if this can get you exactly what you want, but hopefully it is a useful place to start from.

Hierarchical bar charts in r

I'm working on a visualization project in R and thought of creating bar charts for hierarchical data (states with constituencies in it, each constituency possessing a numeric value).
I came across this web page (https://observablehq.com/@d3/hierarchical-bar-chart), which implements exactly this using the "d3" library, but in JavaScript.
Is there any similar library in R to do this?
I had a similar question before, but I couldn't find any package or code online that does the same thing in R. The closest thing I found was the "r2d3" package, which allows you to run JavaScript in R.
By leveraging the "r2d3" package, I was able to replicate it and put it into a function called "hbar". You can source it directly from my GitHub with the code below:
source("https://raw.githubusercontent.com/JohnnyPeng123/hierarchical_bar_chart-/master/hbar.r")
For more details and a use case, please follow the link below:
https://github.com/JohnnyPeng123/hierarchical_bar_chart-
Let me know if you need any help in terms of using this function.
You can use ggplot.
If you need more help, please share a reproducible example.

Type/Token Ratio in R

I'm working with a new corpus and want to get the type/token ratio. Does anyone know of a standard way to do this? I've been trawling around the internet and didn't find anything relevant. Even the tm package doesn't seem to have an easy way to do this.
Just as a reference, I have the following code to tokenize the corpus:
fwseven <- scan(what="c", sep="\n", file="<my file>") #read file
fwseven <- tolower(fwseven) #make lowercase
fwords <- unlist(strsplit(fwseven, "[^a-z]+")) #delete numbers from tokenizing
fwords.clean <- fwords[fwords != ""] #delete empty strings
tokens<-length(fwords.clean) #get number of tokens
I thought there would be an easier way than just making an empty vector and a for loop to run through each individual word of the corpus, but perhaps there's not. If that's the case, I've got the following code, but I'm running into problems with the if statement.
typelist <- vector() #empty vector
for (i in tokens) { #for loop running through list of strings tokens
if (i in typelist)
typelist <- typelist #if i is in typelist do nothing
else {
typelist <- typelist + i #if i isn't in typelist, add it to typelist
}
}
It's been a long time since I've used R; how would I change the if statement for it to check if the string i is already contained in list typelist?
The tm package is probably the simplest way to do this.
Example data (since I don't have your text.fwt file...):
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. 
Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Use the tm package to create a corpus and document term matrix from these texts:
library(tm)
# create a corpus
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))
# process to remove stopwords, punctuation, etc.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
# create a document term matrix
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
Find the number of tokens (total number of words in corpus)
n_tokens <- sum(as.matrix(corpus2a.dtm))
Find the number of types (number of unique words in corpus)
n_types <- length(corpus2a.dtm$dimnames$Terms)
So now we can easily find the type/token ratio:
n_types / n_tokens
[1] 0.6170213
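For the original question about the `if` statement, base R is also enough. Note that the loop in the question iterates over `tokens`, which is a single number (the count), rather than over the words themselves. The sketch below uses a small made-up input in place of the corpus file; `unique()` gives the types directly, and the loop version shows how `%in%` and `c()` would replace `in` and `+`.

```r
# Toy input standing in for the lines read from the corpus file
text <- c("The cat sat on the mat.", "The dog sat too.")

words  <- unlist(strsplit(tolower(text), "[^a-z]+"))  # split on anything that is not a letter
words  <- words[words != ""]                          # drop empty strings
tokens <- length(words)                               # number of tokens
types  <- length(unique(words))                       # number of distinct words (types)
ttr    <- types / tokens                              # type/token ratio
print(ttr)

# Loop version: iterate over the words, test membership with %in%, append with c()
typelist <- character(0)
for (w in words) {
  if (!(w %in% typelist)) typelist <- c(typelist, w)
}
```

Unlike the tm example, this version keeps stopwords and does not filter by word length, so the two approaches will generally give different ratios for the same text.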

How to import fundamentals for stocks by using Quandl package

I'm trying to import fundamentals for stocks using the Quandl package. I pulled some data with this line of code:
mydata <- Quandl("WIKI/AAPL", collapse="quarterly")
However, I don't understand how I can get fundamentals data and not just the price data.
Any idea how to do that? Also, is there a way to pull all fundamentals with the same line?
On this site you find all the fundamentals:
http://www.quandl.com/c/stocks/aapl. Just click on the figure (quarterly/annually) and you will be directed to a page with the data. There is an R button on the right. Clicking it will give you the R code.
Example:
Gross-Profit: http://www.quandl.com/SEC/AAPL_GROSSPROFIT_Q-APPLE-INC-AAPL-Quarterly-Gross-Profit
Quandl("SEC/AAPL_GROSSPROFIT_Q",
trim_start="2009-06-27",
trim_end="2014-03-29")
But from my understanding, Quandl does not have all the fundamental details that Bloomberg, Reuters, or CapitalIQ have.
Getting comparable data on different companies can require different SEC codes, so Quandl has recently provided a harmonized set of tags which can be used for some of the more common data types. In addition to Floo0's references, you might want to check out http://www.quandl.com/help/api-for-stock-data#SEC-Harmonized-Data
I got this line of code from the maintainer of the package. There was a typo in the documentation, so this is the right syntax:
Quandl("RAYMOND/MSFT_COST_OF_REVENUE_TOTAL_Q", trim_start="2009-06-27", trim_end="2014-03-29")
