Type/Token Ratio in R

I'm working with a new corpus and want to get the type/token ratio. Does anyone know of a standard way to do this? I've been trawling around the internet and didn't find anything relevant. Even the tm package doesn't seem to have an easy way to do this.
Just as a reference, I have the following code to tokenize the corpus:
fwseven <- scan(what="c", sep="\n", file="<my file>") #read file
fwseven <- tolower(fwseven) #make lowercase
fwords <- unlist(strsplit(fwseven, "[^a-z]+")) #delete numbers from tokenizing
fwords.clean <- fwords[fwords != ""] #delete empty strings
tokens <- length(fwords.clean) #get number of tokens
I thought there would be an easier way than just making an empty vector and a for loop to run through each individual word of the corpus, but perhaps there's not. If that's the case, I've got the following code, but I'm running into problems with the if statement.
typelist <- vector()        #empty vector
for (i in tokens) {         #for loop running through list of strings tokens
  if (i in typelist)
    typelist <- typelist    #if i is in typelist do nothing
  else {
    typelist <- typelist + i #if i isn't in typelist, add it to typelist
  }
}
It's been a long time since I've used R; how would I change the if statement so that it checks whether the string i is already contained in the vector typelist?

The tm package is probably the simplest way to do this.
Example data (since I don't have your text.fwt file...):
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Use the tm package to create a corpus and document term matrix from these texts:
library(tm)
# create a corpus
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))
# process to remove stopwords, punctuation, etc.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
# create a document term matrix
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
Find the number of tokens (total number of words in corpus)
n_tokens <- sum(as.matrix(corpus2a.dtm))
Find the number of types (number of unique words in corpus)
n_types <- length(corpus2a.dtm$dimnames$Terms)
So now we can easily find the type/token ratio:
n_types / n_tokens
[1] 0.6170213
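As an aside, for a single text you don't need tm at all; base R's unique() gives the types directly, and the membership test the question's if statement was reaching for is the %in% operator. A minimal sketch, assuming fwords.clean is the cleaned token vector from the question:
# types = unique words, tokens = all words
n_types  <- length(unique(fwords.clean))
n_tokens <- length(fwords.clean)
n_types / n_tokens
# and the membership test from the question's loop:
# if (i %in% typelist) ... else typelist <- c(typelist, i)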

Related

Repeatable Macros in R?

Is there any way in R to write a macro like one would in SAS? That is, I want to write a macro with some input variable (corresponding to a row in a dataset) so I can quickly make a plot of certain characteristics from said row. Any information regarding a package/method to do so would be greatly appreciated.
R will generate some very, very, very basic code for you. If you have RStudio installed, you can click File > Import Dataset > From ... point to your file and click 'Open'. R will automatically create the code to do the import. Again, this is very basic. You really need to know how to code to do anything useful.
You get out of it what you put into it, so spend some time learning this stuff, and inevitably you'll learn a ton. I've found that it's very helpful to read through people's questions that are posted here, and try to solve the problem yourself. You'll learn a lot that way and you'll see what the current trends are. Reading books is great, of course, but sometimes I feel like some authors are too academic, and in the real world, sometimes it's done differently than what you see in textbooks.
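For what it's worth, the closest R analogue to a SAS macro is an ordinary function. A minimal sketch of the plotting use case from the question (the function name and plot details here are hypothetical, not from any package):
# plot the numeric values of one row of a data frame against column position
plot_row <- function(dat, i) {
  row_vals <- as.numeric(dat[i, ])
  plot(seq_along(row_vals), row_vals, type = "b",
       xlab = "column", ylab = "value", main = paste("Row", i))
}
plot_row(mtcars, 3) # example call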

How to go about data preparation for topic modeling in R (topicmodels, lda, tm)? [closed]

I have a corpus (622 docs) of lengthy txt files (ca. 20,000-30,000 words per file) that I'm trying to explore in R. I have done some basic text mining using the tm package and would like to delve into topic modeling now. However, being very new to this, I'm struggling already with some basics of data preparation. A sample of the files I'm currently working with is available here: http://s000.tinyupload.com/?file_id=46554569218218543610
I'm assuming that just feeding these lengthy documents into a topic modeling tool is pointless. So I would like to break them up into paragraphs (or alternatively sets of perhaps 300-500 words, seeing as there are a lot of redundant paragraph breaks and OCR errors in my data). Would you do this within the VCorpus, or should I actually divide my source files (e.g. with a shell script)? Any suggestions or experiences?
The text comes from OCR'ed magazine articles, so if I split my docs up, I'm thinking I should add a metadata tag to these paragraphs that tells me which issue it was from originally (basically just the original file name), correct? Is there a way to do this easily?
Generally speaking, can anyone recommend a good hands-on introduction to topic modeling in R? A tutorial that takes me by the hand like a third-grader would be great, actually. I am using the documentation of 'topicmodels' and 'lda', but the learning curve is rather steep for a novice.
edit: Just to be clear, I have already read a lot of the popular introductions to topic modeling (e.g. Scott Weingart and the MALLET tutorials for historians). I was thinking of something specific to the processes in R.
Hope that these questions aren't entirely redundant. Thanks for taking the time to read!
There's no code in your question, so it's not really suitable for this site. That said, here are some comments that might be useful. If you supply code you'll get more specific and useful answers.
Yes. Breaking the text into chunks is common and advisable. Exact sizes are a matter of taste; it is often done within R, though I've done it before making the corpus. You might also subset only nouns, as @holzben suggests. Here's some code for cutting a corpus into chunks:
corpus_chunk <- function(x, corpus, n) {
  # convert corpus to list of character vectors
  message("converting corpus to list of vectors...")
  listofwords <- vector("list", length(corpus))
  for (i in seq_along(corpus)) {
    listofwords[[i]] <- corpus[[i]]
  }
  message("done")
  # divide each vector into chunks of n words
  # from http://stackoverflow.com/q/16232467/1036500
  f <- function(x) {
    y <- unlist(strsplit(x, " "))
    ly <- length(y)
    split(y, gl(ly %/% n + 1, n, ly))
  }
  message("splitting documents into chunks...")
  listofnwords1 <- sapply(listofwords, f)
  listofnwords2 <- unlist(listofnwords1, recursive = FALSE)
  message("done")
  # append IDs to list items so we can get bibliographic data for each chunk
  lengths <- sapply(seq_along(listofwords), function(i) length(listofnwords1[[i]]))
  names(listofnwords2) <- unlist(lapply(seq_along(lengths), function(i) rep(x$bibliodata$x[i], lengths[i])))
  names(listofnwords2) <- paste0(names(listofnwords2), "_", unlist(lapply(lengths, seq_len)))
  return(listofnwords2)
}
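Hypothetical usage, assuming my_corpus is a tm corpus and my_metadata carries the bibliodata data frame the function expects; the chunks can then be wrapped back into a corpus:
chunks <- corpus_chunk(x = my_metadata, corpus = my_corpus, n = 500)
chunk_corpus <- Corpus(VectorSource(sapply(chunks, paste, collapse = " ")))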
Yes, you might make a start with some code and then come back with a more specific question. That's how you'll get the most out of this site.
For a basic introduction to text mining and topic modelling, see Matthew Jockers' book Text Analysis with R for Students of Literature.
If you're already a little familiar with MALLET, then try rmallet for topic modelling. There are lots of code snippets on the web that use this; here's one of mine.
I recently had a similar project; usually at least some of these steps are done (a minimal code sketch follows this list):
Stop-word removal: you can easily do this via removeWords(your corpus, stopwords("english")) from the tm package. Further, you can construct your own stop-word list and remove it via the same function.
Numbers and punctuation are usually removed as well (see the tm package).
Also very common are stemming (see Wikipedia for an explanation) and removing sparse terms; this helps to reduce the dimension of your term-document matrix with only little loss of information (both in the tm and RWeka packages).
Some people also like to work only with nouns/proper nouns or noun phrases. See here for an overview; some word lists and part-of-speech dictionaries can be found on Kevin's Word List Page.
Regarding splitting into paragraphs: this should be possible with the NgramTokenizer from the RWeka package; see the tm package FAQ.
A nice article about pre-processing in general can be found here, or a more scientific one here.
Regarding metadata management, see the tm package vignette.
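A minimal sketch of the steps above, assuming corpus is a tm Corpus (stemming needs the SnowballC package; the 0.99 sparsity cutoff is an arbitrary example value):
library(tm)
library(SnowballC) # required by stemDocument
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english")) # stop-word removal
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)                      # stemming
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99) # drop very sparse terms (zero in 99%+ of documents)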
One more example of R + topic models can be found in Ponweiser 2012.
I learned that text mining is a bit different: things which improved results in one case do not work in another case. It takes a lot of testing to find which parameters and which pre-processing steps improve your results... so have fun!

How can I measure my usage of R? [closed]

I'm writing an annual report for uni, in which I would like to detail how my usage of R has increased over the past year. I'm looking for metrics that I can use to describe my usage of R. Some possible metrics to describe usage:
number of lines of code in history
number of errors
hours spent using program
number of times a particular function has been called
number of plots made
So my question is: can I extract any of the above from R, or can I extract any other metrics which would demonstrate my usage of R?
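As a rough starting point, at least the first of these can be pulled out of R itself. A minimal sketch, assuming an interactive session where command history is being recorded (availability varies by platform and front end):
# write the current session history to a temporary file and count its lines
hist_file <- tempfile()
savehistory(hist_file)
length(readLines(hist_file))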
First, I'm not sure that this question is at all suited to Stack Overflow. Second, I think that the metrics you've identified are not really suitable. Let's look at the ones you've shortlisted so far:
Number of lines of code in history
You make a lot of tweaks to your code. They accumulate in your history. Your history now has a lot of lines of code. Does that reflect positively on your usage of R? Or suppose you like to write code like the following in R:
temp <- 0
for (i in 1:10) {
  temp <- temp + i
}
print(temp)
while a person familiar with R would just write sum(1:10). One line versus five. Can we really say that number of lines is a good metric?
Number of errors
Maybe there is some merit to this. But are you going to classify errors in some way? Is a missing or misplaced bracket forgivable? What about times when no error or warning is issued but R behaves in a way that you might not have expected, thus leading to unexpected results (for example, assuming that numeric(0) and factor(0) would behave the same way). See here for some R gotchas, several of which won't provide any indication of an error, but would certainly lead to erroneous analysis. How would they be analyzed with this metric?
Number of hours spent using the program
Again, debatable. How do you measure the number of hours? Time spent coding? Time the computer spends processing your code? Time it took you to figure out how to program your problem?
Number of times a particular function has been called
I don't understand this metric at all. Do more obscure functions get a higher weight (for example, if you are one of those who use vapply while the rest of the schmucks use sapply, do you get bonus points for using vapply because it can be safer (and sometimes faster) to use?)
Number of plots made
Sorry, but again, I don't understand this metric at all. First of all, not all plots are created equally! There are several in the data visualization field who feel that a lot of software ruined data visualization because some software (a very popular spreadsheet program, in particular) made it so easy for people to quickly make gaudy plots. With R, they are less gaudy by default, but that in itself doesn't make it good. So, if you're just measuring the number of plots churned out without some other criteria for quality assessment, then I'm not sure how this metric is useful.
And, from your comment to your question:
Actually...stack overflow reputation points might be as good as anything!
Eh... The only time I really use R is to answer questions on Stack Overflow (unfortunately true). At the same time, almost all my reputation points here are from the questions I've answered in the R tag. Sure, there are some users here that I would really trust, but sometimes, I don't even trust myself, so I don't know if that's a good indicator of your usage of R.
Lots of users have also complained that Stack Overflow voting is totally wacky, so I'm not sure that you really can use "reputation" as a valid measure of skill. For example, there's an ongoing discussion among regular users here that answers to "easy" questions get voted up very quickly (because they are easy to verify, often without even running the code) while answers to "complicated" questions don't yield votes proportional to the effort taken to answer the question. Case in point: Why the heck do I have a "Guru" badge for an answer that is essentially a reordered version of data already easily available with two minutes on Google. I'm not particularly proud of that answer, and it certainly doesn't say anything about my "usage" of R.
Now, to make this so that it might qualify as an answer and not just an extended comment on your question itself, the biggest thing that I would consider valid, but not sure how to measure it, would be something like how active you are in the R community. There are many ways to get involved with R, from writing or contributing to packages, filing bug reports, conducting workshops to help others make the switch to R, and so on.
I'm not suggesting that you need to write a book, as several others here have done, or to become a legendary package developer with a cult of underscore followers, but you can take small steps. For instance, although I'm a writing teacher, I have held workshops for students and written a few "getting started tips" just to introduce them to using R, so they can consider adding it to their toolkit. Many other users here regularly blog about their experiences working with R and, again, as this is part of a community, they learn a lot in the process.
Finally, a couple of more ideas:
@PaulHiemstra suggested in his comment that you could "mention the percentage of your programming work you do in R." I would extend that concept as follows: (1) try to measure how much of your work overall is done in R and tools complementary to R (obvious ones like Sweave/knitr/LaTeX come to mind), and (2) try to measure how much of an impact using R has had on improving your overall skills (with the logic being that good programming is often accompanied by logical thought, careful organization, good documentation, and so on).
Related to the previous point, try to see how your usage of R has changed with time. Has your behavior changed from manually redoing the same steps to writing functions yet? Have you then gone back and adapted those functions so that, instead of solving a specific problem you had at a given point in time, they can be used more generally by a larger audience? These are pretty significant changes, particularly if you had started from scratch with the language, and they can be a bit more meaningful than the ideas you presented in your question.
So, to summarize, a lot of the somewhat easily quantifiable things that you've identified in your question will probably lead to very meaningless analysis. I feel that the qualitative inputs you make would be much more valuable.
Another metric: take an old and complex piece of code (if you have one) and redo it from scratch. Use the difference in computation time as the metric.

R tm package used for predictive analytics. How one classifies a new document?

This is a general question about the procedures for text mining. Suppose one has a corpus of documents classified as Spam/No_Spam. As a standard procedure, one pre-processes the data, removing punctuation, stop words, etc. After converting it into a DocumentTermMatrix, one can build models to predict Spam/No_Spam.
Here is my problem. Now I want to use the model built for new documents as they arrive. To check a single document, would I have to build a DocumentTermVector so it can be used to predict Spam/No_Spam? In the documentation of tm I found how one converts the full corpus into a matrix using, for example, tf-idf weights. How can I then convert a single vector using the idf from the corpus? Do I have to change my corpus and build a new DocumentTermMatrix every time?
I processed my corpus, converted it into a matrix and then split it into training and testing sets. But here the test set was built in the same line as the document matrix of the full set. I can check precision etc., but I don't know what the best procedure is for classifying new text.
Ben, imagine I have a preprocessed DocumentTermMatrix and I convert it into a data.frame:
dtm <- DocumentTermMatrix(CorpusProc, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE, wordLengths = c(3, Inf), bounds = list(global = c(4, Inf))))
dtmDataFrame <- as.data.frame(as.matrix(dtm)) # as.matrix(), not inspect(), which only prints a preview
Added a factor variable and built a model:
Corpus.svm <- svm(Risk_Category ~ ., data = dtmDataFrame)
Now imagine I give you a new document d (which was not in your corpus before) and you want to know the model's prediction, Spam/No_Spam. How do you do that?
OK, let's create an example based on the code used here.
(examp1 through examp5 are the same five texts defined in the first answer at the top of this page.)
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4)))
Note that I took out example 5.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
dtmDataFrame <- as.data.frame(as.matrix(corpus2a.dtm)) # as.matrix(), not inspect(), which only prints a preview
I added a factor variable Spam_Classification with two levels, Spam/No_Spam:
dtmFinal <- cbind(dtmDataFrame, Spam_Classification)
Then I build an SVM model:
Corpus.svm <- svm(Spam_Classification ~ ., data = dtmFinal)
Now imagine I have example 5 as a new document (an email). How do I generate a Spam/No_Spam prediction for it?
Thanks for this interesting question. I have been thinking it over for some time. To keep things short, the quintessence of my findings: for weighting methods other than tf, there is no way around laborious work or recalculating the whole DTM (and probably rerunning your svm).
Only for tf weighting could I find an easy process for classifying new content. You have to transform the new document (of course) to a DTM. During the transformation you have to add a dictionary containing all the terms you used to train your classifier on the old corpus. Then you can use predict() as usual. For the tf part, here is a very minimal sample and a method for classifying a new document:
### I) Data
texts <- c("foo bar spam",
           "bar baz ham",
           "baz qux spam",
           "qux quux ham")
categories <- c("Spam", "Ham", "Spam", "Ham")
new <- "quux quuux ham"

### II) Building model on existing documents "texts"
library(tm)    # text mining package for R
library(e1071) # package with various machine-learning libraries

## creating DTM for texts
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)))

## making DTM a data.frame and adding variable categories
## (as.matrix(), not inspect(): inspect() only prints a preview)
df <- data.frame(categories, as.data.frame(as.matrix(dtm)))
model <- svm(categories ~ ., data = df)

### III) Predicting class of new
## creating dtm for new
## without the dictionary argument predict won't work
dtm_n <- DocumentTermMatrix(Corpus(VectorSource(new)),
                            control = list(dictionary = names(df)))

## creating data.frame for new
df_n <- as.data.frame(as.matrix(dtm_n))
predict(model, df_n)
## > 1
## > Ham
## > Levels: Ham Spam
I have the same problem, and I think that the RTextTools package can help you.
Look at create_matrix:
...
originalMatrix - The original DocumentTermMatrix used to train the models. If supplied, will adjust the new matrix to work with saved models.
...
So in code:
train.data <- loadDataTable() # load data from DB - 3 columns (info, subject, category)
train.matrix <- create_matrix(train.data[, c("subject", "info")], language = "english", removeNumbers = TRUE, stemWords = FALSE, weighting = weightTfIdf)
train.container <- create_container(train.matrix, train.data$category, trainSize = 1:nrow(train.data), virgin = FALSE)
model <- train_model(train.container, algorithm = "SVM")
# save model & matrix
predict.text <- function(info, subject, train.matrix, model)
{
  predict.matrix <- create_matrix(cbind(subject = subject, info = info),
                                  originalMatrix = train.matrix,
                                  language = "english", removeNumbers = TRUE,
                                  stemWords = FALSE, weighting = weightTfIdf)
  # testSize = 1 - we have only one row!
  predict.container <- create_container(predict.matrix, NULL, testSize = 1, virgin = FALSE)
  return(classify_model(predict.container, model))
}
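Hypothetical usage, assuming the trained model and train.matrix above are in scope:
predict.text("body of the new message", "subject of the new message", train.matrix, model)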
It's not clear what your question is or what kind of answer you're looking for.
Assuming you're asking 'how can I get a 'DocumentTermVector' to pass to other functions?', here's one method.
Some reproducible data:
(examp1 through examp5 are again the same five texts defined in the first answer at the top of this page.)
Create a corpus from these texts:
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))
Process text:
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
Convert the processed corpus to a document-term matrix:
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
inspect(corpus2a.dtm)
A document-term matrix (5 documents, 273 terms)
Non-/sparse entries: 314/1051
Sparsity : 77%
Maximal term length: 10
Weighting : term frequency (tf)
Terms
Docs able actually addition allows answer answering answers archives are arsenal avoid background based
1 0 0 2 0 0 0 0 0 1 0 1 0 0
2 1 1 0 0 0 0 0 0 0 0 0 0 0
3 0 1 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 1 0
5 2 1 0 0 8 2 3 1 0 1 0 0 1
This is the key line that gets you the "DocumentTermVector" that you refer to:
# access vector of first document in the dtm
as.matrix(corpus2a.dtm)[1,]
able actually addition allows answer answering answers archives are
0 0 2 0 0 0 0 0 1
arsenal avoid background based basic before better beware bit
0 1 0 0 0 0 0 0 0
board book bother bug changed chat check
In fact it is a named number, that should be useful for passing to other functions, etc. which seems like what you want to do:
str(as.matrix(corpus2a.dtm)[1,])
Named num [1:273] 0 0 2 0 0 0 0 0 1 0 ...
If you just want a numeric vector, try as.numeric(as.matrix(corpus2a.dtm)[1,])
Is that what you want to do?

Rules-of-thumb doc for mathematical programming in R?

Does there exist a simple, cheatsheet-like document which compiles the best practices for mathematical computing in R? Does anyone have a short list of their best-practices? E.g., it would include items like:
For large numerical vectors x, instead of computing x^2, one should compute x*x. This speeds up calculations.
To solve a system $Ax = b$, never compute $A^{-1}$ and left-multiply $b$; faster and more numerically stable approaches exist (e.g., Gaussian elimination, as used by solve(A, b)).
I did find a nice numerical analysis cheatsheet here. But I'm looking for something quicker, dirtier, and more specific to R.
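For example, a quick (machine-dependent) timing sketch of the two rules above:
x <- rnorm(1e7)
system.time(x^2)   # power operator
system.time(x * x) # elementwise multiply; often slightly faster
A <- matrix(rnorm(500 * 500), 500)
b <- rnorm(500)
system.time(solve(A) %*% b) # explicit inverse: slower and less accurate
system.time(solve(A, b))    # solve the system directly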
@Dirk Eddelbuettel has posted a bunch of stuff on "high performance computing with R". He's also a regular, so he will probably come along and grab some well-deserved reputation points. While you are waiting you can read some of his stuff here:
http://dirk.eddelbuettel.com/papers/ismNov2009introHPCwithR.pdf
There is an archive of the r-devel mailing list where discussions about numerical analysis issues relating to R performance occur. I will often put its URL in the Google advanced search page domain slot when I want to see what might have been said in the past: https://stat.ethz.ch/pipermail/r-devel/
