Removing words containing a certain substring - R

So I'm making a function that takes in a word corpus and spits out a cleaned product:
corpus_creater <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  corpus
}
This works great for the most part. However, when I look at the resulting word cloud, one thing stands out:
the word cloud includes random words that have the term "http" in them.
I figure I can fix this by adding a line to the function that removes any word containing the substring "http", but I can't for the life of me work out how to do that, and all the existing answers I've found deal with replacing a substring or removing only the substring itself.
What I want to do is:
if a substring is a part of the word, then remove that entire word.
The code I use to generate the word cloud from the corpus:
color_scheme <- brewer.pal(9,"YlGnBu")
color_scheme <- color_scheme[-(1:4)]
set.seed(103)
wordcloud(words = manu_corpus_final, max.words = 200, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = color_scheme)

If you are getting a corpus directly as input, you could extract the content of the corpus using sapply and then remove from the corpus any document that contains the required string.
You could integrate it into your function in the following way:
corpus_creater <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  # Added: drop every document whose content matches "http"
  corpus <- corpus[-grep("http", sapply(corpus, `[`, 1))]
  corpus
}
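If the aim is to drop just the offending words rather than whole documents, a regex-based transformer is another option. Below is a minimal sketch (the helper name removeWordsContaining is my own, not part of tm): it deletes every whitespace-delimited token containing the pattern, then collapses the gaps left behind.

# Sketch: remove any whole word that contains the given substring
removeWordsContaining <- content_transformer(function(x, pattern) {
  gsub(paste0("\\S*", pattern, "\\S*"), "", x)
})

corpus <- tm_map(corpus, removeWordsContaining, "http")
corpus <- tm_map(corpus, stripWhitespace)  # tidy up the extra spaces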

Related

How do I retain my unique identifiers when cleaning and stemming with the tm package in R?

# To prepare for DataframeSource, you must rename the columns to doc_id and text.
textdataframe <- textdataframe %>% rename(doc_id = orig_id, text = orig.narr)
corpus <- Corpus(DataframeSource(textdataframe))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, tolower)
corpus[[1]][1]
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
corpus[[1]][1]
# Remove stopwords
corpus <- tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus[[1]][1]
# Stemming
corpus <- tm_map(corpus, stemDocument)
corpus[[1]][1]
What ends up happening is I lose the unique ids I assigned when setting up the DataframeSource. I would like to keep those ids attached as I continue cleaning and stemming.
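No answer is reproduced here, but a minimal sketch of a pipeline that keeps the doc_id values assigned by DataframeSource is shown below. The idea (my own suggestion, not from the thread) is to skip the PlainTextDocument step and wrap non-tm functions such as tolower in content_transformer(), since replacing documents with bare character vectors is what drops their metadata.

library(tm)
library(dplyr)

# DataframeSource requires the columns to be named doc_id and text
textdataframe <- textdataframe %>% rename(doc_id = orig_id, text = orig.narr)
corpus <- VCorpus(DataframeSource(textdataframe))

# content_transformer() keeps each document (and its id) intact
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus <- tm_map(corpus, stemDocument)

meta(corpus[[1]], "id")   # the original doc_id should still be attached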

Getting Error for DocumentTermMatrix in R

My code so far is below:
corpus <- VCorpus(VectorSource(final_data$comment))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, 'brw')
corpus <- tm_map(corpus, removeWords, 'cid')
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, trimws)
dtm <- DocumentTermMatrix(corpus)
I am getting the following error on the last command (DocumentTermMatrix):
'no applicable method for 'meta' applied to an object of class
"character"'
Can you please let me know how to fix it?
The line tm_map(corpus, trimws) is causing the issue: it returns character strings instead of documents, which corrupts the corpus. If you want to use a function inside tm_map that is not part of the tm package, you need to wrap it in content_transformer().
If you change that line to the one below, it should work.
corpus <- tm_map(corpus, content_transformer(trimws))
dtm <- DocumentTermMatrix(corpus)

R Find tags in text automatically

I'm not really into text mining, so I'm asking my question here.
I have some texts and I would like to analyse which topics (tags) the texts are about.
So I asked myself what the best way to do that is.
First of all, I prepared the texts and removed stopwords with the tm package:
library(tm)
sample2 = c('This text is about the wheather. Today the wheater is really bad.', 'The dog is barking very loud. That is annoying.')
myStopwords <- c(stopwords("english"), "today")
df <- do.call("rbind", lapply(sample2, as.data.frame))
colnames(df) = "texts"
corpus <- Corpus(VectorSource(df$texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, function(x) removeWords(x, myStopwords))
corpus <- tm_map(corpus, stemDocument, language = c("english"))
Now I created a TermDocumentMatrix
td.mat <- TermDocumentMatrix(corpus, control=list(minWordLength = 1))
and now I would like to find frequent nouns (no verbs or adjectives). Text 1 should have "text" and "wheater" as tags and Text 2 should have "dog".
Can anybody tell me how to do that?
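No answer is reproduced here, but one common route (my own suggestion, not from the thread) is part-of-speech tagging, for example with the udpipe package, and then keeping only tokens tagged as nouns. A minimal sketch:

library(udpipe)

# Download and load an English model (the model file lands in the working directory)
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

anno <- as.data.frame(udpipe_annotate(ud_model, x = sample2,
                                      doc_id = paste0("text", seq_along(sample2))))

# Keep only nouns and count them per document; the most frequent ones can serve as tags
nouns <- subset(anno, upos == "NOUN")
table(nouns$doc_id, tolower(nouns$lemma))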

plot word cloud errors

I was trying to plot a word cloud in R. My code is below. The word cloud is plotted, but all the words have the same font size: it does not show the higher-frequency words first or in bigger font sizes. Can anybody point out what mistakes I made? Thank you!
install.packages('tm')
library(tm)
reviews <- read.csv('keywords_2.csv',stringsAsFactors = FALSE)
review_text <- paste(reviews$customer_search_term, collapse = " ")
review_source<-VectorSource(review_text)
corpus<-Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <-DocumentTermMatrix(corpus)
dtm2<-as.matrix(dtm)
install.packages('wordcloud')
library(wordcloud)
frequency<- colSums(dtm2)
str(frequency)
words<-names(frequency)
wordcloud(words,scale=c(1,.0005), min.freq = 3,random.order = FALSE, rot.per = .15)
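The likely culprit (my own reading; no answer is reproduced here) is that the frequency vector is never passed to wordcloud(), so every word falls back to a count of one and they all render at the same size; the very narrow scale range also flattens any difference. A sketch of a fix:

wordcloud(words = words, freq = frequency,
          scale = c(4, 0.5), min.freq = 3,
          random.order = FALSE, rot.per = 0.15)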

Why are some Cyrillic letters missing in the wordcloud?

I have a large corpus of Russian text. When I build a word cloud, I see that some characters, like 'ч', are not rendered. The code looks like this:
dat <- read.csv("news.csv",sep=";",header=TRUE,stringsAsFactors=FALSE)
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud.png", width=640,height=640)
wordcloud(d$word, d$freq, scale = c(8, .2), min.freq = 5, max.words = 200,
          random.order = FALSE, rot.per = 0, colors = pal2)
dev.off()
EDIT
Oh, I did it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[From the OP's own edit, repeated here to complete the question and answer.]
You need to add the following line along with the other tm_map() calls:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
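For placement, here is a sketch of the start of the pipeline, assuming the source text is cp1251-encoded; on newer tm versions, wrapping iconv in content_transformer() keeps the documents from being turned into bare character vectors:

corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
# Re-encode to UTF-8 before any other transformation
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, 'cp1251', 'UTF-8')))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))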
