exams2blackboard: no randomization or shuffling

I created a set of num and schoice questions. I double-checked them with exams2html() which gives me the randomization and shuffling of answers I want.
However, when I upload the zip file to the system, I notice that there is no randomization: the code below creates 5 identical questions for both the num and schoice questions.
Could you tell me what I am doing wrong?
Thank you

It turns out that the way I set knitr options with knitr::opts_chunk$set() in a few questions was the problem. Specifically, knitr::opts_chunk$set(cache = TRUE) was stopping randomization at the exam level.
I removed all `knitr::opts_chunk$set()` calls from all files and moved the options that I need into the relevant code chunks. Everything worked beautifully.
Thank you so much Achim for finding my mistake!
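For illustration, moving the options into the chunk headers looks like this (the chunk name and options are hypothetical examples):

```{r data generation, echo = FALSE, results = "hide"}
## options formerly set globally via knitr::opts_chunk$set() now live in
## the chunk header; cache = TRUE is dropped entirely so that every exam
## replication draws fresh random numbers
x <- sample(1:100, 1)
```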

Related

How does exshuffle work in cloze questions (package `exams`)?

I would like to understand how exshuffle works in cloze questions. Does it work with several schoice elements that have different numbers of possible answers?
I expect to have the possible answers shuffled when presenting the exercises in Moodle. With the observed behavior, the order is always the same.
The exshuffle option in the meta-information of R/exams questions also works for schoice or mchoice elements within cloze questions. Notes:
Bug fix: Prior to R/exams version 2.4-0, specifying exshuffle in a cloze question without an answer list in the solution part of the exercise led to an error. This prompted this StackOverflow question but has been fixed since.
Numeric values for exshuffle (i.e., sub-sampling a larger number of answer alternatives) also work.
Instead of setting exshuffle to TRUE, which does the shuffling on the R side, it is also possible to do the shuffling in Moodle (as pointed out by @JPMD) by selecting a cloze_mchoice_display that includes shuffling, e.g., MULTICHOICE_S or MULTICHOICE_VS.
Currently, only a single exshuffle value can be set for the entire question. Thus, if there are multiple schoice and/or mchoice elements, this single value is applied to all of them.
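For reference, exshuffle is set in the exercise's meta-information block; a sketch for a hypothetical cloze exercise with one schoice and one num element (all values are illustrative):

```
Meta-information
================
extype: cloze
exclozetype: schoice|num
exsolution: 0100|1.23
exshuffle: TRUE
```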
Or, for future reference, you can use "MULTICHOICE_VS" as in:
exams2moodle(questions,
  name = "exameXPTO",
  cloze = list(cloze_mchoice_display = "MULTICHOICE_VS"),
  envir = .GlobalEnv)
:-)
```
## put the correct answer among the distractors and shuffle them all
options[[1]] <- sample(c(correct_answer,
  possible_answers[!possible_answers %in% correct_answer]))
## logical vector marking which shuffled option is the correct one
solutions[[1]] <- options[[1]] == correct_answer
```
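A self-contained sketch of this shuffle-and-track idiom (the answer values are made up for illustration):

```r
## candidate answers and the correct one (illustrative values)
possible_answers <- c("mean", "median", "mode", "range")
correct_answer <- "median"

set.seed(42)  # reproducible shuffle
## shuffle the correct answer together with the distractors
options <- sample(c(correct_answer,
                    setdiff(possible_answers, correct_answer)))
## logical vector marking the position of the correct option
solutions <- options == correct_answer

sum(solutions)  # always exactly 1, wherever the correct answer landed
```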

Using R's exams package for assignments: Is it possible to add question hints?

The exams package is a really fantastic tool for generating exams from R.
I am interested in the possibilities of using it for (programming) assignments. The main difference from an exam is that besides solutions I'd also like hints to be included in the PDF / HTML output file.
Typically I put the hints for (sub)-questions in a separate section at the end of the PDF assignment (using a separate Latex section), but this requires manual labour. These are for students to consult if they need help getting started on any particular exercise, and it avoids having them look at the solutions directly for hints on how to start.
An assignment might look like:
Question 1
Question 2 ...
Question 10
Hints to all questions
I'd be open to changing the exact format as long it is possible to look up hints without looking up the answer, and it remains optional to read the hints.
So in fact I am looking for an intermediate "hints" section between the "question" and "solution" sections, which is present for some questions but not for all.
My questions: Is this already possible? If not, how could this be implemented using the exams package?
R/exams does not have dedicated support for this kind of assignment, so it isn't available out of the box. If you want this kind of processing, you have to set it up yourself, using LaTeX for PDF output or CSS for HTML.
In LaTeX I think it should be possible to do what you want using the newfloat and endfloat packages in the LaTeX template that you pass to exams2pdf(). Any LaTeX template needs to provide {question} and {solution} environments, e.g., the plain.tex template shipped with the package has
\newenvironment{question}{\item \textbf{Problem}\newline}{}
\newenvironment{solution}{\textbf{Solution}\newline}{}
with the exercises embedded as
\begin{enumerate}
%% \exinput{exercises}
\end{enumerate}
Now instead of the \newenvironment{solution}... you could use
\usepackage{newfloat,endfloat}
\DeclareFloatingEnvironment{hint}
\DeclareDelayedFloat{hint}{Hint}
\DeclareFloatingEnvironment{solution}
\DeclareDelayedFloat{solution}{Solution}
This defines two new floating environments, {hint} and {solution}, which are then declared as delayed floats. You would then need to customize these environments regarding the text displayed within the questions at the beginning and in the listing at the end. I'm not sure whether this gets you exactly what you want, but hopefully it is a useful place to start from.
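Putting the pieces together, a minimal (untested) variant of a plain.tex-style template might start like this; everything beyond the two delayed-float declarations is kept from the original template structure:

```
\documentclass{article}
\usepackage{newfloat,endfloat}

%% hints and solutions become floats that endfloat collects at the end
\DeclareFloatingEnvironment{hint}
\DeclareDelayedFloat{hint}{Hint}
\DeclareFloatingEnvironment{solution}
\DeclareDelayedFloat{solution}{Solution}

%% questions remain inline
\newenvironment{question}{\item \textbf{Problem}\newline}{}

\begin{document}
\begin{enumerate}
%% \exinput{exercises}
\end{enumerate}
%% the delayed {hint} and {solution} floats are printed here
\end{document}
```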

How can I generate exams that contain randomly-generated single-choice answers using R/exams package?

I am interested in using R/exams package in order to generate tests composed of 'single-choice' questions. The three most important things that I am looking for are:
- being able to randomly select one (or more) out of a set of exercises for each participant
- being able to randomly shuffle answer alternatives
- being able to randomly select numbers, text blocks, and graphics using the R programming language
I have followed the basic R/exams tutorials and was able to generate their demo exams, but I was not yet able to find a full tutorial on how to achieve these goals. I am a beginner R programmer and I would, therefore, need a step-by-step tutorial.
If there are any suggestions of such tutorials here I would really appreciate any help.
Thank you
All the things you are looking for can be accomplished with R/exams. There is no single step-by-step tutorial that illustrates everything, but there are quite a few bits and pieces that should get you started.
Do you want to generate written single-choice exams, or do you want to conduct your tests in a learning management system like Moodle? If you're looking for written exams, then exams2nops() is the most complete solution; see:
http://www.R-exams.org/tutorials/exams2nops/
For setting up single-choice exercises based on numeric questions, a step-by-step tutorial is: http://www.R-exams.org/tutorials/static_num_schoice/
If you prefer an arithmetic illustration rather than one from economics, there is:
http://www.R-exams.org/general/user2019/
For selecting one out of a set of exercises for each participant, you need to define an exam with a list of exercises, e.g.,
exm <- list(
  c("a.Rmd", "b.Rmd", "c.Rmd"),
  c("d.Rmd", "e.Rmd")
)
When using exams2xyz(exm), you will then get an exam with two exercises: the first is a random sample from a-c and the second a random sample from d-e.
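The selection that exams2xyz() performs per exam replication can be illustrated in plain base R (the file names are just the placeholders from above; nothing is read from disk):

```r
exm <- list(
  c("a.Rmd", "b.Rmd", "c.Rmd"),
  c("d.Rmd", "e.Rmd")
)

set.seed(123)
## draw one exercise per list element, as each exam replication does
drawn <- vapply(exm, sample, character(1), size = 1)
drawn  # two file names, one from each group
```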
I suggest you try to get started with these, keeping it simple in the beginning. I.e., instead of accomplishing all tasks immediately, try to take them one by one.

Type/Token Ratio in R

I'm working with a new corpus and want to get the type/token ratio. Does anyone know of a standard way to do this? I've been trawling around the internet and didn't find anything relevant. Even the tm package doesn't seem to have an easy way to do this.
Just as a reference, I have the following code to tokenize the corpus:
fwseven <- scan(what = "c", sep = "\n", file = "<my file>") # read file
fwseven <- tolower(fwseven) # make lowercase
fwords <- unlist(strsplit(fwseven, "[^a-z]+")) # split on non-letters, dropping numbers and punctuation
fwords.clean <- fwords[fwords != ""] # delete empty strings
tokens <- length(fwords.clean) # number of tokens
I thought there would be an easier way than just making an empty vector and a for loop to run through each individual word of the corpus, but perhaps there's not. If that's the case, I've got the following code, but I'm running into problems with the if statement.
typelist <- vector() # empty vector
for (i in tokens) {  # for loop running through the list of token strings
  if (i in typelist)
    typelist <- typelist # if i is in typelist, do nothing
  else {
    typelist <- typelist + i # if i isn't in typelist, add it to typelist
  }
}
It's been a long time since I've used R; how would I change the if statement for it to check if the string i is already contained in list typelist?
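For reference, the membership test can be written with %in% and c() (a minimal sketch with made-up tokens; in practice, length(unique(...)) does the whole job in one call):

```r
fwords.clean <- c("the", "cat", "sat", "on", "the", "mat")  # toy tokens

## corrected loop: %in% tests membership, c() appends to the vector
typelist <- character(0)
for (w in fwords.clean) {
  if (!(w %in% typelist)) {
    typelist <- c(typelist, w)
  }
}

## the idiomatic one-liner gives the same set of types
identical(sort(typelist), sort(unique(fwords.clean)))  # TRUE

length(typelist) / length(fwords.clean)  # type/token ratio: 5/6
```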
The tm package is probably the simplest way to do this.
Example data (since I don't have your text.fwt file...):
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. 
Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Use the tm package to create a corpus and document term matrix from these texts:
library(tm)
# create a corpus
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))
# process to remove stopwords, punctuation, etc.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
# create a document term matrix
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
Find the number of tokens (total number of words in corpus)
n_tokens <- sum(as.matrix(corpus2a.dtm))
Find the number of types (number of unique words in corpus)
n_types <- length(corpus2a.dtm$dimnames$Terms)
So now we can easily find the type/token ratio:
n_types / n_tokens
[1] 0.6170213
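If you only need the ratio and not the rest of the tm pipeline, a base-R cross-check is straightforward (toy text made up for illustration):

```r
txt <- "the quick brown fox jumps over the lazy dog the end"
toks <- unlist(strsplit(tolower(txt), "[^a-z]+"))
toks <- toks[toks != ""]  # drop empty strings left by the split

n_tokens <- length(toks)          # 11 tokens
n_types  <- length(unique(toks))  # 9 types ("the" occurs three times)
n_types / n_tokens
```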

R tm package used for predictive analytics: how does one classify a new document?

This is a general question about procedures in text mining. Suppose one has a corpus of documents classified as Spam/No_Spam. As a standard procedure, one pre-processes the data, removing punctuation, stop words, etc. After converting it into a DocumentTermMatrix, one can build some models to predict Spam/No_Spam.
Here is my problem. Now I want to use the model for new documents as they arrive. In order to check a single document, I would have to build a DocumentTerm*Vector* so it can be used to predict Spam/No_Spam. In the documentation of tm I found how one converts the full corpus into a matrix using, for example, tf-idf weights. How can I then convert a single vector using the idf from the corpus? Do I have to change my corpus and build a new DocumentTermMatrix every time?
I processed my corpus, converted it into a matrix, and then split it into training and testing sets. But here the test set was built in the same step as the document matrix of the full set. I can check precision etc., but I do not know what the best procedure is for classifying new text.
Ben, imagine I have a preprocessed DocumentTermMatrix and convert it into a data.frame:
dtm <- DocumentTermMatrix(CorpusProc,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
    stopwords = TRUE, wordLengths = c(3, Inf),
    bounds = list(global = c(4, Inf))))
dtmDataFrame <- as.data.frame(as.matrix(dtm))
Then I add a factor variable and build a model:
Corpus.svm <- svm(Risk_Category ~ ., data = dtmDataFrame)
Now imagine I give you a new document d (which was not in your corpus before) and you want to know the model's prediction, Spam/No_Spam. How do you do that?
OK, let's create an example based on the code used here.
(examp1 through examp5 are defined exactly as in the type/token ratio question above.)
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4)))
Note that I took out examp5.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
dtmDataFrame <- as.data.frame(as.matrix(corpus2a.dtm))
Then I add a factor variable Spam_Classification with the two levels spam/No_Spam:
dtmFinal <- cbind(dtmDataFrame, Spam_Classification)
and build an SVM model:
Corpus.svm <- svm(Spam_Classification ~ ., data = dtmFinal)
Now imagine I have examp5 as a new document (an email). How do I generate a Spam/No_Spam value?
Thanks for this interesting question. I have been thinking it over for some time. To keep things short, the quintessence of my findings: for weighting methods other than tf, there is no way around laborious work or recalculating the whole DTM (and probably rerunning your svm).
Only for tf weighting could I find an easy process for classifying new content. You have to transform the new document to a DTM. During the transformation you have to supply a dictionary containing all the terms you used to train your classifier on the old corpus. Then you can use predict() as usual. Here is a very minimal sample for the tf case and a method for classifying a new document:
### I) Data
texts <- c("foo bar spam",
           "bar baz ham",
           "baz qux spam",
           "qux quux ham")
categories <- c("Spam", "Ham", "Spam", "Ham")
new <- "quux quuux ham"

### II) Building a model on the existing documents "texts"
library(tm)    # text mining package for R
library(e1071) # package with various machine-learning functions
## creating the DTM for texts
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)))
## making the DTM a data.frame and adding the variable categories
df <- data.frame(categories, as.data.frame(as.matrix(dtm)))
model <- svm(categories ~ ., data = df)

### III) Predicting the class of new
## creating the DTM for new; without the dictionary argument,
## predict() won't work because the columns wouldn't match
dtm_n <- DocumentTermMatrix(Corpus(VectorSource(new)),
                            control = list(dictionary = Terms(dtm)))
## creating the data.frame for new
df_n <- as.data.frame(as.matrix(dtm_n))
predict(model, df_n)
## > 1
## > Ham
## > Levels: Ham Spam
I have the same problem, and I think that the RTextTools package can help you.
Look at create_matrix:
...
originalMatrix - The original DocumentTermMatrix used to train the models. If supplied, will
adjust the new matrix to work with saved models.
...
So in code:
train.data <- loadDataTable() # load data from DB - 3 columns (info, subject, category)
train.matrix <- create_matrix(train.data[, c("subject", "info")], language = "english",
  removeNumbers = TRUE, stemWords = FALSE, weighting = weightTfIdf)
train.container <- create_container(train.matrix, train.data$category,
  trainSize = 1:nrow(train.data), virgin = FALSE)
model <- train_model(train.container, algorithm = "SVM")
# save model & matrix
predict.text <- function(info, subject, train.matrix, model)
{
  predict.matrix <- create_matrix(cbind(subject = subject, info = info),
    originalMatrix = train.matrix, language = "english",
    removeNumbers = TRUE, stemWords = FALSE, weighting = weightTfIdf)
  # testSize = 1 - we have only one row!
  predict.container <- create_container(predict.matrix, NULL, testSize = 1, virgin = FALSE)
  return(classify_model(predict.container, model))
}
It's not clear what your question is or what kind of answer you're looking for.
Assuming you're asking "how can I get a DocumentTermVector to pass to other functions?", here's one method.
Some reproducible data:
(examp1 through examp5 are defined exactly as in the type/token ratio question above.)
Create a corpus from these texts:
library(tm)  # provides Corpus, VectorSource, tm_map, tm_reduce, DocumentTermMatrix
corpus2 <- Corpus(VectorSource(c(examp1, examp2, examp3, examp4, examp5)))
Process the text:
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
Convert the processed corpus to a document-term matrix:
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(3,10)))
inspect(corpus2a.dtm)
A document-term matrix (5 documents, 273 terms)
Non-/sparse entries: 314/1051
Sparsity : 77%
Maximal term length: 10
Weighting : term frequency (tf)
Terms
Docs able actually addition allows answer answering answers archives are arsenal avoid background based
1 0 0 2 0 0 0 0 0 1 0 1 0 0
2 1 1 0 0 0 0 0 0 0 0 0 0 0
3 0 1 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 1 0
5 2 1 0 0 8 2 3 1 0 1 0 0 1
This is the key line that gets you the "DocumentTerm*Vector*" that you refer to:
# access vector of first document in the dtm
as.matrix(corpus2a.dtm)[1,]
able actually addition allows answer answering answers archives are
0 0 2 0 0 0 0 0 1
arsenal avoid background based basic before better beware bit
0 1 0 0 0 0 0 0 0
board book bother bug changed chat check
...
In fact it is a named numeric vector, which should be convenient for passing to other functions, etc., which seems to be what you want to do:
str(as.matrix(corpus2a.dtm)[1,])
Named num [1:273] 0 0 2 0 0 0 0 0 1 0 ...
If you just want a plain numeric vector (without the term names), try as.numeric(as.matrix(corpus2a.dtm)[1,])
Is that what you want to do?
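To illustrate what you can do with such a named vector, here is a minimal base-R sketch. The values are copied from the dtm output above for the first document (the vector name doc1 is just illustrative); the point is that the names survive subsetting and sorting, while as.numeric() strips them:

```r
# A named numeric vector, like the one returned by as.matrix(corpus2a.dtm)[1, ]
doc1 <- c(able = 0, actually = 0, addition = 2, allows = 0, are = 1)

# Names survive sorting, so the most frequent terms stay labelled
top <- sort(doc1, decreasing = TRUE)
names(top)[1]        # "addition"

# as.numeric() drops the names, leaving a plain numeric vector
as.numeric(doc1)     # 0 0 2 0 1
```

This is handy for things like listing a document's top terms without losing track of which count belongs to which word.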