R, tm: error "transformation drops documents"

I want to create a network based on the weight of keywords from text. I got an error when running the code related to tm_map:
library(tm)
library(NLP)
library(openNLP)
text = c('.......')
corp <- Corpus(VectorSource(text))
corp <- tm_map(corp, stripWhitespace)
Warning message:
In tm_map.SimpleCorpus(corp, stripWhitespace) :
transformation drops documents
corp <- tm_map(corp, tolower)
Warning message:
In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents
The code was working 2 months ago. Now I'm trying it on new data and it no longer works. Can anyone show me where I went wrong? Thank you.
I even tried the command below, but it doesn't work either.
corp <- tm_map(corp, content_transformer(stripWhitespace))

The code should still work. You get a warning, not an error. This warning only appears when the corpus is built from a VectorSource with Corpus instead of VCorpus.
The reason is a check in the underlying code that compares the number of names of the corpus content with the length of the corpus content. When the text is read as a vector there are no document names, so the two numbers differ and the warning pops up. It is only a warning; no documents have been dropped.
See the difference between the two examples:
library(tm)
text <- c("this is my text with some other text and some more")
# warning based on Corpus and VectorSource
text_corpus <- Corpus(VectorSource(text))
# warning appears when running the following line
tm_map(text_corpus, content_transformer(tolower))
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 1
Warning message:
In tm_map.SimpleCorpus(text_corpus, content_transformer(tolower)) :
transformation drops documents
# Using VCorpus
text_corpus <- VCorpus(VectorSource(text))
# warning doesn't appear
tm_map(text_corpus, content_transformer(tolower))
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
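To confirm that the warning is harmless, one option is to compare the number of documents before and after the transformation. This is only a minimal sketch: the sample text and the names n_before and n_after are made up for illustration, and suppressWarnings is used so the script runs quietly whether or not the installed tm version emits the warning.

```r
library(tm)

# Two made-up documents, read as a plain character vector
text <- c("First  DOCUMENT with   extra spaces", "Second document")
corp <- Corpus(VectorSource(text))

# Number of documents before any transformation
n_before <- length(corp)

# Apply the same transformations as in the question, silencing the warning
corp <- suppressWarnings(tm_map(corp, stripWhitespace))
corp <- suppressWarnings(tm_map(corp, content_transformer(tolower)))

# Number of documents afterwards: nothing should have been dropped
n_after <- length(corp)
stopifnot(n_before == n_after)
```

If the stopifnot call passes, the corpus still holds every document and the warning can be ignored, or avoided altogether by switching to VCorpus.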

Related

R Function delete text till a certain keyword in a list

I am trying to manipulate text in R. I am loading Word documents and want to preprocess them so that all text up to a certain point is deleted.
library(readtext)
#List all documents
file_list = list.files()
#Read Texts and write them to a data table
data = readtext(file_list)
# Create a corpus
library(tm)
corp = VCorpus(VectorSource(data$text))
#Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp= tm_map(corp, removePunctuation)
What I am trying to do now is, for each text in the corpus, delete everything before a certain keyword (here "Disclosure") and everything after the word "Conclusion".
There are many ways to do what you want, but without knowing more about your case or your example it is difficult to come up with the right solution.
If you are SURE that there will only be one instance of Disclosure and one instance of Conclusion you can use the following. Also, be warned, this assumes that each document is a single content vector and will not work otherwise. It will be relatively slow, but for a few small to medium sized documents it will work fine.
All I did was write some functions that apply regex to content in a corpus. You could also do this with an apply statement instead of a tm_map.
# Sample texts standing in for the loaded documents
data = c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
         "My second fake Disclosure This is just a sentence Conclusion Don't consider it a file.")
# Create a corpus
library(tm)
library(stringr)
corp = VCorpus(VectorSource(data))
# Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, removePunctuation)
# Drop everything before the keyword "Disclosure" (the keyword itself is kept)
remove_before_Disclosure <- function(doc.in){
  doc.in$content <- str_remove(doc.in$content, ".+(?=Disclosure)")
  return(doc.in)
}
corp2 <- tm_map(corp, remove_before_Disclosure)
# Drop everything after the keyword "Conclusion" (the keyword itself is kept)
remove_after_Conclusion <- function(doc.in){
  doc.in$content <- str_remove(doc.in$content, "(?<=Conclusion).+")
  return(doc.in)
}
corp2 <- tm_map(corp2, remove_after_Conclusion)
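The apply-style alternative mentioned above could look roughly like the sketch below, which works on a plain character vector instead of a corpus. The helper name trim_doc and the sample text are hypothetical, and the same caveats hold: one "Disclosure" and one "Conclusion" per document, and no line breaks inside the text, because "." does not match a newline by default.

```r
library(stringr)

# Made-up sample documents
docs <- c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
          "My second fake Disclosure This is just a sentence Conclusion Don't consider it a file.")

# Hypothetical helper: keep only the span from "Disclosure" to "Conclusion"
trim_doc <- function(x) {
  x <- str_remove(x, "^.*?(?=Disclosure)")   # drop everything before the keyword
  str_remove(x, "(?<=Conclusion).*$")        # drop everything after the keyword
}

trimmed <- vapply(docs, trim_doc, character(1), USE.NAMES = FALSE)
print(trimmed)
# Both elements become "Disclosure This is just a sentence Conclusion"
```

The trimmed vector can then be passed to VCorpus(VectorSource(trimmed)) if the remaining steps need a corpus.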

removing stop words from corpus in R is too slow

I have loaded my corpus, which includes 16 text files, but it took about 2 hours to remove stop words from it.
The total size of the corpus is 31 MB.
Do you know how I can fix this problem?
multidocMBTI <- Corpus(DirSource("F:/my master course/Principle of analytics/DATA03"))
multidocMBTI <- tm_map(multidocMBTI, removeWords, stopwords("english"))

How do I keep special characters while searching for term frequencies?

I have a document that contains special characters along with text, such as !, #, #, $, % and more. The following code is used to obtain the list of most frequent terms. But when it runs, the special characters are missing from that list, i.e. if "#StackOverFlow" is present 100 times in the document, I get it as "StackOverFlow" without the # in the frequent terms list. Here is my code:
review_text <- paste(rome_1$text, collapse=" ")
#The special characters are present within review_text
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency)
Where exactly have I gone wrong here?
As you can see in the DocumentTermMatrix documentation :
This is different for a SimpleCorpus. In this case all options are
processed in a fixed order in one pass to improve performance. It
always uses the Boost Tokenizer (via Rcpp) and takes no custom
functions as option arguments.
It seems that SimpleCorpus objects (created by the Corpus function) use a predefined Boost tokenizer which automatically splits words and removes punctuation (including #).
You could use VCorpus instead, and remove only the punctuation characters you want, e.g.:
library(tm)
review_text <-
"I love #StackOverflow. #Stackoverflow is great, but Stackoverflow exceptions are not!"
review_source <- VectorSource(review_text)
corpus <- VCorpus(review_source) # N.B. use VCorpus here !!
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
patternRemover <- content_transformer(function(x,patternToRemove) gsub(patternToRemove,'',x))
corpus <- tm_map(corpus, patternRemover, '\\!|\\.|\\,|\\;|\\?') # remove only !.,;?
dtm <- DocumentTermMatrix(corpus,control=list(tokenize='words'))
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
Result:
> frequency
#stackoverflow     exceptions          great           love  stackoverflow
             2              1              1              1              1

R tm document names missing

Using the R {tm} package, I create a corpus as usual:
mycorpus <- Corpus(DirSource(folder,pattern="txt"))
Please note I am not using an encoding variable. summary(mycorpus) shows the document names listed. However, after a series of tm_map transforms:
(content_transformer(tolower),content_transformer(removeWords), stopwords("SMART"),stripWhitespace)
ending with mycorpus <- tm_map(mycorpus, PlainTextDocument) and mydtm <- DocumentTermMatrix(mycorpus, control = list(...)),
I get a wrong result when I run inspect(mydtm[1:10, intersect(colnames(dtm), 'toyota')]) to look at my variable of choice:
Terms
Docs toyota
character(0) 0
character(0) 0
character(0) 0
character(0) 0
character(0) 1
character(0) 0
character(0) 0
character(0) 0
character(0) 1
character(0) 0
The file names (doc ids) have disappeared. Any idea what could be causing this? More importantly, how do I reinstate the document names? Many thanks.
The code below will work for a single file. You could likely use something like list.files to read all files in the directory.
First, I would wrap the cleaning functions in a custom function. Note that the order matters, and you have to use content_transformer if the function is not from tm.
clean.corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, custom.stopwords)
  return(corpus)
}
Then concatenate the English stop words with custom words. This vector is used in the last step of the custom function above.
custom.stopwords <- c(stopwords('english'), 'lol', 'smh')
doc<-read.csv('coffee.csv', header=TRUE)
The CSV is read into a data frame with one column holding the tweet text and another column holding an ID for each tweet. The file, from my workshop, is here.
The CSV file is now in memory, so the next step is to read it in tabular fashion with a specific mapping when making a corpus. Here the content is in a column called "text" and the unique ID is in a column named "id".
custom.reader <- readTabular(mapping=list(content="text", id="id"))
corpus <- VCorpus(DataframeSource(doc), readerControl=list(reader=custom.reader))
corpus<-clean.corpus(corpus)
The corpus creation uses the readerControl; once that is done you can apply the pre-processing steps. Without the reader control, the package assigns character(0) as the document name.
The content of document 1 can be accessed like this:
corpus[[1]][1]
You can review the metadata for the first document with this code:
corpus[[1]][2]
So I think you need to use readTabular and readerControl in your corpus construction, no matter the source.
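A caveat on the answer above: to my knowledge, newer tm releases (0.7-2 and later, if I recall correctly) no longer provide readTabular, and DataframeSource instead expects a data frame whose first two columns are named doc_id and text. Under that assumption, a minimal sketch with made-up tweets (the data frame and its IDs are hypothetical, not the workshop file) would be:

```r
library(tm)

# Made-up data standing in for coffee.csv; assumes a tm version in which
# DataframeSource requires the columns "doc_id" and "text"
doc <- data.frame(doc_id = c("tweet_1", "tweet_2"),
                  text   = c("I love coffee lol", "Coffee is great smh"),
                  stringsAsFactors = FALSE)

corpus <- VCorpus(DataframeSource(doc))

# content_transformer keeps each element a text document with its metadata
corpus <- tm_map(corpus, content_transformer(tolower))

dtm <- DocumentTermMatrix(corpus)

# The document IDs should come from the doc_id column
print(Docs(dtm))
```

On an older tm release where readTabular still exists, the answer's original code should apply unchanged.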
I was having the same problem and I realized that it was due to tolower. tolower, unlike removeNumbers, removePunctuation, removeWords, stemDocument and stripWhitespace, is not a transformation defined in the tm package. To get a list of the transformations defined in the tm package that can be applied directly to a corpus, type:
getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
Thus, in order to use tolower, you must first wrap it in a transformation so that it handles corpus objects properly.
docs <- tm_map(docs,content_transformer(tolower))
The above line of code should stop the files from being renamed to character(0).
The same trick can be applied to make any R function work with corpora. For example, for gsub the following syntax applies:
docs <- tm_map(docs, content_transformer(gsub), pattern = "internt", replacement = "internet")
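As a quick check of the point above, the following sketch applies both wrapped functions to a small VCorpus and verifies that the documents keep their class and IDs. The two sample sentences are made up, and the helper get_ids is hypothetical.

```r
library(tm)

# Made-up sample documents containing the misspelling "internt"
docs <- VCorpus(VectorSource(c("The INTERNT is great", "Slow internt access")))

# Hypothetical helper: collect the ID stored in each document's metadata
get_ids <- function(corpus) {
  vapply(content(corpus), function(d) as.character(meta(d, "id")), character(1))
}

ids_before <- get_ids(docs)

# Wrap base R functions so they modify only the content of each document
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "internt", replacement = "internet")

ids_after <- get_ids(docs)

# Documents are still PlainTextDocument objects with their original IDs
stopifnot(inherits(docs[[1]], "PlainTextDocument"))
stopifnot(identical(ids_before, ids_after))
print(content(docs[[1]]))
```

Because content_transformer only replaces the content slot of each document, the metadata that DocumentTermMatrix uses for row names is left in place.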

tm Package error: Error defining Document Term Matrix

I am analyzing the Reuters 21578 corpus, all the Reuters news articles from 1987, using the "tm" package. After importing the XML files into an R data file, I clean the text (convert to plain text, convert to lower case, remove stop words etc., as seen below). Then I try to convert the corpus to a document term matrix but receive an error message:
Error in UseMethod("Content", x) :
no applicable method for 'Content' applied to an object of class "character"
All pre-processing steps work correctly up until document term matrix.
I created a non-random subset of the corpus (with 4000 documents) and the document term matrix command works fine on that.
My code is below. Thanks for the help.
##Import
file <- "reut-full.xml"
reuters <- Corpus(ReutersSource(file), readerControl = list(reader = readReut21578XML))
## Convert to Plain Text Documents
reuters <- tm_map(reuters, as.PlainTextDocument)
## Convert to Lower Case
reuters <- tm_map(reuters, tolower)
## Remove Stopwords
reuters <- tm_map(reuters, removeWords, stopwords("english"))
## Remove Punctuations
reuters <- tm_map(reuters, removePunctuation)
## Stemming
reuters <- tm_map(reuters, stemDocument)
## Remove Numbers
reuters <- tm_map(reuters, removeNumbers)
## Eliminating Extra White Spaces
reuters <- tm_map(reuters, stripWhitespace)
## create a term document matrix
dtm <- DocumentTermMatrix(reuters)
Error in UseMethod("Content", x) :
no applicable method for 'Content' applied to an object of class "character"

Resources