removeWords not working in R [duplicate]

This question already has answers here: R tm removeWords function not removing words (2 answers). Closed 7 years ago.
I am trying to build a wordcloud of the jeopardy dataset found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
My code is as follows:
library(tm)
library(SnowballC)
library(wordcloud)
jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)
jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
The words 'the' and 'this' are still appearing in the wordcloud. Why is this happening and how can I fix it?

The problem lies in the fact that you did not perform a lower-case step. A lot of questions start with "The". The stopwords are all lower case, e.g. "the" and "this". Since "The" != "the", "The" is not removed from the corpus.
If you use the code below, which lower-cases the text before removing stopwords, it should work correctly:
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

The construction of the argument does not seem right; it should be:
tm_map(jeopCorpus, removeWords, c(stopwords("english"),"the","this"))
But as noted, "the" and "this" are already included in stopwords("english"), so simply
tm_map(jeopCorpus, removeWords, stopwords("english"))
should work.
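As a quick demonstration that removeWords matches case-sensitively (a minimal sketch on a made-up sentence):
library(tm)
x <- "The cat saw the dog"
removeWords(x, stopwords("english"))
# [1] "The cat saw  dog"   (lower-case "the" is removed, capitalized "The" survives)
removeWords(tolower(x), stopwords("english"))
# [1] " cat saw  dog"      (after lower-casing, both are removed)
The leftover double spaces are harmless here and are cleaned up by stripWhitespace().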

Related

After removing stopwords, my output is not saved when I futher clean up my tweets in R

I am doing sentiment analysis. I have two documents in my corpus directory: one is of positive tweets and the other is of negative tweets. But in the comparison wordcloud I see words that are stopwords, which means it is not removing stopwords("english").
I created custom stopwords, but failed to retain that output too. After that, I searched and found a stopwords.txt file on GitHub, downloaded it, and used it to remove the stopwords. To read this file I had to load it into a table and convert it to a vector, which I then combined with the stopwords from the tm library.
The output was as expected, but when I then tried to remove the punctuation and inspected the corpus, the output reflected only removePunctuation and did not retain the stopword-removal output.
Then I tried removeNumbers and inspected the corpus: it retained the removePunctuation output but, again, not the stopword removal.
So, what is the problem here? What am I missing?
[This is the code]
[Screenshot 1: the output after removing the stopwords from the tweets using R (https://i.stack.imgur.com/RMbvD.png)]
[Screenshot 2: the output after applying other cleaning like removePunctuation, removeNumbers, stripWhitespace, stemDocument, but not retaining the removed-stopwords output (https://i.stack.imgur.com/18H3P.png)]
[Screenshot 3: https://i.stack.imgur.com/SxaJE.png]
This is the code that I have used. I put the two text files in the directory and converted them into a corpus.
library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)
##removing stopwords##
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords,
stopwords("english"))
inspect(clean_tweets_corpus)
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
class(stop)
stop
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
class(stop_vec)
stop_vec
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords,
c(stopwords("english"), stop_vec))
inspect(clean_tweets_corpus)
## remove to have single characters ##
remove_multiplechar<-function(x) gsub("\\b[A-z]\\b{1}"," ",x)
clean_tweets_corpus<-tm_map(tweets_corpus,
content_transformer(remove_multiplechar))
inspect(clean_tweets_corpus)
clean_tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(tweets_corpus,removeNumbers)
clean_tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(tweets_corpus, stemDocument)
inspect(clean_tweets_corpus)
str(clean_tweets_corpus)
Here is the corrected code, replacing "tweets_corpus" with "clean_tweets_corpus" in every call to tm_map except the first one, and wrapping tolower in content_transformer() so each document keeps its tm class:
library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
##removing stopwords##
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeWords,
c(stopwords("english"), stop_vec))
## remove to have single characters ##
remove_multiplechar<-function(x) gsub("\\b[A-z]\\b{1}"," ",x)
clean_tweets_corpus<-tm_map(clean_tweets_corpus,
content_transformer(remove_multiplechar))
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeNumbers)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stemDocument)
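Since each call now feeds on the result of the previous step, inspecting the corpus at the end should show all of the cleaning applied cumulatively:
inspect(clean_tweets_corpus)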

Wordcloud of a column in R based on another column

I'm working on wordclouds in R, and so far I'm successful with just the basic stuff; however, what I want to do is show a word cloud for a specific location. E.g., if I have text like:
TEXT LOCATION
True or false? link(#Addition, #Classification) NewYork,USA
Gene deFuser: detecting gene fusion events from protein sequences #bmc #bioinformatics Norwich,UK
Biologists do have a sense of humor, especially computational bio people France
Semantic Inference using #Chemogenomics Data for Drug Discovery London,UK
Here is the basic wordcloud code I'm using:
library(tm)
library(SnowballC)
library(wordcloud)
DATA<-c('True or false? link(#Addition, #Classification) ','Gene deFuser: detecting gene fusion events from protein sequences #bmc #bioinformatics',' Biologists do have a sense of humor, especially computational bio people','Semantic Inference using #Chemogenomics Data for Drug Discovery')
Location<-c('NewYork,USA','Norwich,UK',' France','London,UK')
jeopQ<-data.frame(DATA,Location)
jeopCorpus <- Corpus(VectorSource(jeopQ$DATA))
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removeNumbers)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
myDTM = TermDocumentMatrix(jeopCorpus, control = list(minWordLength = 1))
m = as.matrix(myDTM)
v = sort(rowSums(m), decreasing = TRUE)
set.seed(4363)
wordcloud(names(v), v, max.words = 100, min.freq = 3, scale = c(4, 0.1), random.order = FALSE, rot.per = .5, vfont = c("sans serif", "plain"), colors = palette())
I want something like a separate word cloud for locations containing "USA", another for locations containing "UK", and a separate wordcloud for France. Is this possible?
Yes. First strip each location down to the part after the last comma (the country), then loop over the unique values:
jeopQ <- data.frame(DATA, Location)
# Clean Location: keep only the text after the last comma, e.g. "NewYork,USA" -> "USA"
jeopQ$Location <- sub('.*,\\s*', '', jeopQ$Location)
# Loop
for(i in unique(jeopQ$Location)){
jeopCorpus <- Corpus(VectorSource(jeopQ$DATA[jeopQ$Location==i]))
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removeNumbers)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
myDTM = TermDocumentMatrix(jeopCorpus, control = list(minWordLength = 1))
m = as.matrix(myDTM)
v = sort(rowSums(m), decreasing = TRUE)
set.seed(4363)
wordcloud(names(v), v, max.words = 100, min.freq = 3, scale = c(4, 0.1), random.order = FALSE, rot.per = .5, vfont = c("sans serif", "plain"), colors = palette())
}
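If you also want each cloud saved to its own file rather than only drawn on screen, you can wrap the wordcloud() call in a per-location graphics device (a sketch; the file-name pattern here is made up):
# inside the loop body, in place of the plain wordcloud() call:
png(paste0("wordcloud_", trimws(i), ".png"), width = 640, height = 640)
wordcloud(names(v), v, max.words = 100, min.freq = 3, scale = c(4, 0.1), random.order = FALSE)
dev.off()  # close the device so the file is written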

Text mining pdf files/issues with word frequencies

I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words turn out to be phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but I get these random Greek letters with others. Is this a problem with character encoding? (BTW, all the documents are in English.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
content(pdf)[1:4]
}
docs<- Corpus(URISource(uri, mode = ""),
readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
# docs <- tm_map(docs, PlainTextDocument)  # not needed once tolower is wrapped in content_transformer()
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
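Before plotting, it is worth sanity-checking the counts by peeking at the most frequent terms:
head(sort(freq, decreasing = TRUE), 10)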
We can visualize the result of the most frequently used words in your pdf file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words = 80, random.order = FALSE, scale = c(3, 0.5), colors = brewer.pal(8, "Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
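To see what the stemmer does on its own, you can call SnowballC's wordStem() directly on a few made-up examples:
library(SnowballC)
wordStem(c("running", "runs", "ran"), language = "english")
# [1] "run" "run" "ran"    (irregular forms are not unified)
wordStem(c("university", "universal"), language = "english")
# [1] "univers" "univers"  (unrelated words can collapse to the same stem)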

R - Text Mining - Importing a Corpus and keeping the file names in document term matrix

Up until recently (about a month ago), the code shown below allowed me to import a series of .txt documents stored in a local folder into R, create a corpus, pre-process it, and finally convert it into a document-term matrix. The issue I am having is that the document names are not being imported; instead, each document is listed as 'character(0)'.
One of my aims is to conduct topic modelling on the corpus and so it is important that I can relate the document names to the topics that the model produces.
Does anyone have any suggestions as to what has changed? Or how I can fix this?
library("tm")
library("SnowballC")
setwd("C:/Users/Documents/Dataset/")
corpus <-Corpus(DirSource("blog"))
#pre_processing
myStopwords <- c(stopwords("english"))
your_corpus <- tm_map(corpus, tolower)
your_corpus <- tm_map(your_corpus, removeNumbers)
your_corpus <- tm_map(your_corpus, removeWords, myStopwords)
your_corpus <- tm_map(your_corpus, stripWhitespace)
your_corpus <- tm_map(your_corpus, removePunctuation)
your_corpus <- tm_map(your_corpus, stemDocument)
your_corpus <- tm_map(your_corpus, PlainTextDocument)
#creating a document term matrix
myDtm <- DocumentTermMatrix(your_corpus, control=list(wordLengths=c(3,Inf)))
dim(myDtm)
inspect(myDtm)
Here's a debugging session to identify and correct the loss of the file names. The tolower line was modified, and the PlainTextDocument line was commented out, since these lines remove the file information. Also, if you check ds$reader, you can see that the baseline reader already creates a plain text document.
library("tm")
library("SnowballC")
# corpus <-Corpus(DirSource("blog"))
sf<-system.file("texts", "txt", package = "tm")
ds <-DirSource(sf)
your_corpus <-Corpus(ds)
# Check status with the following line
meta(your_corpus[[1]])
#pre_processing
myStopwords <- c(stopwords("english"))
# your_corpus <- tm_map(your_corpus, tolower)
your_corpus <- tm_map(your_corpus, content_transformer(tolower))
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removeNumbers)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removeWords, myStopwords)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, stripWhitespace)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removePunctuation)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, stemDocument)
meta(your_corpus[[1]])
#your_corpus <- tm_map(your_corpus, PlainTextDocument)
#meta(your_corpus[[1]])
#creating a document term matrix
myDtm <- DocumentTermMatrix(your_corpus, control=list(wordLengths=c(3,Inf)))
dim(myDtm)
inspect(myDtm)
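To confirm the file names survived all the way into the matrix, check its document dimension names:
Docs(myDtm)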
Here's an approach using qdap where I make a function to read in a directory of files and convert them to a data.frame:
library(qdap)
sf <- system.file("texts", "txt", package = "tm")
read_in <- function(sf) {
list2df(setNames(lapply(file.path(sf, dir(sf)), function(x) {
clean(unbag(readLines(x)))}), dir(sf)), "text", "source")[, 2:1]
}
mydtm <- with(read_in(sf), as.dtm(text, source, stem=TRUE,
stopwords=tm::stopwords("english")))
mydtm <- Filter(mydtm, min=3)
inspect(mydtm)

Why are some Cyrillic letters missing in the wordcloud?

I have a large corpus of Russian text. When I build a wordcloud, I see that some characters, like 'ч', are not rendered. The code looks like this:
library(tm)
library(wordcloud)
dat <- read.csv("news.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(dat$Article),
readerControl = list(reader=readPlain,language="ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud.png", width=640,height=640)
wordcloud(d$word,d$freq, scale=c(8,.2), min.freq=5, max.words=200,
random.order=FALSE, rot.per=0, colors=pal2)
dev.off()
EDIT
Oh, I did it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[From the OP's own edit, repeated here so as to complete the question-answer pair.]
You need to add the following line along with the other tm_map() calls:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
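Note that with tm 0.6 and later, custom transformations like iconv should be wrapped in content_transformer() so the documents keep their tm class:
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, 'cp1251', 'UTF-8')))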
