I have a document that contains special characters such as !, #, $, % and more, along with text. The following code is used to obtain the most frequent terms list. But when it is run, the special characters are missing from the frequent terms list, i.e. if "#StackOverFlow" is a word present 100 times in the document, I get it as "StackOverFlow" without the # in the frequent terms list. Here is my code:
review_text <- paste(rome_1$text, collapse=" ")
#The special characters are present within review_text
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency)
Where exactly have I gone wrong here?
As you can see in the DocumentTermMatrix documentation:
This is different for a SimpleCorpus. In this case all options are
processed in a fixed order in one pass to improve performance. It
always uses the Boost Tokenizer (via Rcpp) and takes no custom
functions as option arguments.
It seems that SimpleCorpus objects (created by the Corpus function) use a pre-defined Boost tokenizer which automatically splits words, removing punctuation (including #).
You could use VCorpus instead, and remove only the punctuation characters you want, e.g.:
library(tm)
review_text <-
"I love #StackOverflow. #Stackoverflow is great, but Stackoverflow exceptions are not!"
review_source <- VectorSource(review_text)
corpus <- VCorpus(review_source) # N.B. use VCorpus here !!
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
patternRemover <- content_transformer(function(x,patternToRemove) gsub(patternToRemove,'',x))
corpus <- tm_map(corpus, patternRemover, '\\!|\\.|\\,|\\;|\\?') # remove only !.,;?
dtm <- DocumentTermMatrix(corpus,control=list(tokenize='words'))
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
Result:
> frequency
#stackoverflow     exceptions          great           love  stackoverflow
             2              1              1              1              1
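For comparison, here is a minimal sketch (assuming the same review_text and library(tm) as above) of what the original pipeline produces; since Corpus creates a SimpleCorpus for a VectorSource, the Boost tokenizer should strip the leading #:
corpus_simple <- Corpus(VectorSource(review_text)) # SimpleCorpus: punctuation-splitting tokenizer
dtm_simple <- DocumentTermMatrix(corpus_simple)
sort(colSums(as.matrix(dtm_simple)), decreasing = TRUE) # expect 'stackoverflow' without the '#'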
Related
Using the R {tm} package, I create a corpus, as usual:
mycorpus <- Corpus(DirSource(folder,pattern="txt"))
Please note I am not using an encoding variable. summary(mycorpus) shows the document names listed. However, after a series of tm_map transforms:
content_transformer(tolower), removeWords with stopwords("SMART"), and stripWhitespace,
ending with mycorpus <- tm_map(mycorpus, PlainTextDocument) and mydtm <- DocumentTermMatrix(mycorpus, control = list(...)),
I get the following from inspect(mydtm[1:10, intersect(colnames(mydtm), 'toyota')]) when trying to view my variable of choice:
Terms
Docs toyota
character(0) 0
character(0) 0
character(0) 0
character(0) 0
character(0) 1
character(0) 0
character(0) 0
character(0) 0
character(0) 1
character(0) 0
The file names (doc ids) have disappeared. Any idea what could be causing this? More importantly, how do I reinstate the document names? Many thanks.
The code below will work for a single file. You could likely use something like list.files to read all the files in the directory.
First, I would wrap the cleaning functions in a custom function. Note that the order matters, and that you have to use content_transformer if the function is not from tm.
clean.corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, custom.stopwords) # custom.stopwords is defined below
  return(corpus)
}
Then concatenate the English stopwords with your custom words. This vector is passed to removeWords in the last step of the custom function above.
custom.stopwords <- c(stopwords('english'), 'lol', 'smh')
doc<-read.csv('coffee.csv', header=TRUE)
The CSV is a data frame with one column of tweet text and another column with an ID for each tweet. The file from my workshop is here.
The CSV file is now in memory, so the next step is to read it in tabular fashion with a specific mapping when making the corpus. Here the content is in a column called "text" and the unique ID is in a column named "id".
custom.reader <- readTabular(mapping=list(content="text", id="id"))
corpus <- VCorpus(DataframeSource(doc), readerControl=list(reader=custom.reader))
corpus<-clean.corpus(corpus)
The corpus creation uses the readerControl, and once that is done you can apply the pre-processing steps. Without the reader control, the package assigns character(0) as the document name.
The corpus content of document 1 can be accessed with:
corpus[[1]][1]
You can review the corpus metadata for the first document with this code:
corpus[[1]][2]
So I think you need to use readTabular and readerControl in your corpus construction, no matter the source.
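As a quick check (a sketch, assuming the pipeline above), the document IDs should now survive into the document-term matrix:
dtm <- DocumentTermMatrix(corpus)
inspect(dtm[1:3, 1:5]) # Docs should show the tweet ids, not character(0)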
I was having the same problem and I realized that it was due to tolower. Unlike removeNumbers, removePunctuation, removeWords, stemDocument, and stripWhitespace, tolower is not a transformation defined in the tm package. To get a list of the transformations defined in the tm package that can be applied directly to a corpus, type:
getTransformations()
[1] "removeNumbers"     "removePunctuation" "removeWords"       "stemDocument"      "stripWhitespace"
Thus, in order to use tolower, you must first wrap it in content_transformer so that it handles corpus objects properly.
docs <- tm_map(docs,content_transformer(tolower))
The above line of code should stop the files from being renamed to character(0).
The same trick can be applied to make any R function work with corpora. For example, for gsub the following syntax applies:
docs <- tm_map(docs, content_transformer(gsub), pattern = "internt", replacement = "internet")
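Building on this, you can also define a reusable transformer once and pass the pattern as an argument; note that toSpace is a hypothetical helper name, not part of tm:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) # replace matches with a space
docs <- tm_map(docs, toSpace, "/|@|\\|")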
I am new to R. I have a CSV file that includes 15000 rows of text, each row belonging to one person. I want to do Latent Dirichlet Allocation on it. But first, I need to create a term-document matrix. However, I don't know how to make R treat each row as a document. Here is what I've done, but it doesn't look correct:
text <- read.csv("text.csv", stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)
The current dtm doesn't look like it has all the terms from all the documents as columns. It feels like each document only contains its own words.
I really appreciate your help.
I have two columns: the first one is 'class' (5 categories) and the second one is 'Text'. I have managed to load the text column as a vector: corpus = Corpus(VectorSource(data$Text))
I ultimately want to reduce the text in each row to the unique terms which correlate with the class.
input=read.csv("input.csv",stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(DataframeSource(input))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
dtm = DocumentTermMatrix(corpus,control=list(weighting=weightTfIdf, minWordLength=2))
When I view the corpus it seems to be ignoring the first column, the 'class' column. I'm looking for code to find which words are highly correlated with the different class categories, i.e. correlate with class 1 but not the other classes.
Thank you
You have a typo in:
corpus = Corpus(DataframeSrouce(input))
change it to:
corpus = Corpus(DataframeSource(input))
I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:
corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))
And then performing LDA:
LDA(dtm, 30)
This final call to LDA() returns the error
"Each row of the input matrix needs to contain at least one non-zero entry".
I assume this means that there is at least one document that has no terms in it after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?
I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.
"Each row of the input matrix needs to contain at least one non-zero entry"
The error means that the sparse matrix contains a row without any entries (words). One idea is to compute the sum of words in each row:
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm.new <- dtm[rowTotals> 0, ] #remove all docs without words
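With the empty rows dropped, the LDA call from the question should now go through; a minimal follow-up, assuming the dtm.new from above:
library(topicmodels)
lda <- LDA(dtm.new, k = 30) # every row now has at least one non-zero entry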
agstudy's answer works great, but using it on a slow computer proved mildly problematic.
library(tictoc) # provides tic()/toc() timers
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total > 0, ]
toc()
4.859 sec elapsed
(this was done with a 4000x15000 dtm)
The bottleneck appears to be applying sum() to a sparse matrix.
A document-term matrix created by the tm package contains the components i and j, which are indices for where entries are in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed
ui contains all the non-zero row indices, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document-term matrices, but may become significant with larger ones.
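An equivalent way to get the row totals without densifying the matrix, if you prefer not to touch dtm$i directly, is slam::row_sums, which operates on the sparse triplet representation (tm depends on the slam package):
library(slam)
dtm.new <- dtm[row_sums(dtm) > 0, ] # row sums computed on the sparse form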
This is just to elaborate on the answer given by agstudy.
Instead of removing the empty rows from the dtm matrix, we can identify the documents in our corpus that have zero length and remove them directly from the corpus, before building a second dtm containing only non-empty documents.
This is useful to keep a 1:1 correspondence between the dtm and the corpus.
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
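Then rebuild the DTM from the filtered corpus, so that every row again corresponds to a document in the corpus:
dtm <- DocumentTermMatrix(corpus) # second pass: only non-empty documents remain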
Just remove the sparse terms from the DTM and all will work well.
dtm <- removeSparseTerms(DocumentTermMatrix(crude), sparse = 0.99)
Just a small addendum to the answer of Dario Lacan:
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
will collect the record ids rather than the row numbers. Try this:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # returns "127", not "1"
If you construct your own corpus with consecutive numbering, some documents may be removed during data cleaning and the numbering will be broken too. So it's better to use the id directly:
corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalently: FUN = function(doc) !(meta(doc)$id %in% empty.rows)
)
I had a column in a data frame lt$title which contained strings. I had no "empty" rows in this column, but still got the error:
Error in LDA(dtm, k = 20, control = list(seed = 813)) : Each row of
the input matrix needs to contain at least one non-zero entry
Some of the solutions above did not work for me, since I needed to join the vector of predicted topics to my original data frame, so removing rows from the document-term matrix was not an option.
The problem was that some (very short) strings in lt$title contained special characters which could not be processed by Corpus() and/or DocumentTermMatrix().
My solution was to remove "short" strings (one or two words max.), which do not carry much information anyway.
# Clean up text data: drop very short titles (fewer than 10 characters)
lt$test <- nchar(lt$title)
lt <- lt[lt$test >= 10, ]
lt$test <- NULL
# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm = DocumentTermMatrix(corpus)
tm = LDA(dtm, k = 20, control = list(seed = 813))
# Add "topics" to original DF
lt$topic = topics(tm)
I am analyzing the Reuters-21578 corpus, all the Reuters news articles from 1987, using the "tm" package. After importing the XML files into an R data file, I clean the text: convert to plain text, convert to lower case, remove stop words, etc. (as seen below). Then I try to convert the corpus to a document-term matrix but receive an error message:
Error in UseMethod("Content", x) :
no applicable method for 'Content' applied to an object of class "character"
All pre-processing steps work correctly up until the document-term matrix step.
I created a non-random subset of the corpus (with 4000 documents) and the document-term matrix command works fine on that.
My code is below. Thanks for the help.
##Import
file <- "reut-full.xml"
reuters <- Corpus(ReutersSource(file), readerControl = list(reader = readReut21578XML))
## Convert to Plain Text Documents
reuters <- tm_map(reuters, as.PlainTextDocument)
## Convert to Lower Case
reuters <- tm_map(reuters, tolower)
## Remove Stopwords
reuters <- tm_map(reuters, removeWords, stopwords("english"))
## Remove Punctuations
reuters <- tm_map(reuters, removePunctuation)
## Stemming
reuters <- tm_map(reuters, stemDocument)
## Remove Numbers
reuters <- tm_map(reuters, removeNumbers)
## Eliminating Extra White Spaces
reuters <- tm_map(reuters, stripWhitespace)
## create a term document matrix
dtm <- DocumentTermMatrix(reuters)
Error in UseMethod("Content", x) :
no applicable method for 'Content' applied to an object of class "character"