Plot word cloud errors - R

I was trying to plot a word cloud in R. My code is below. The word cloud is plotted, but all the words have exactly the same font size: the higher-frequency words are not listed first or shown in larger type. Can anybody point out what mistake I made? Thank you!
install.packages('tm')
library(tm)
reviews <- read.csv('keywords_2.csv',stringsAsFactors = FALSE)
review_text <- paste(reviews$customer_search_term, collapse = " ")
review_source<-VectorSource(review_text)
corpus<-Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <-DocumentTermMatrix(corpus)
dtm2<-as.matrix(dtm)
install.packages('wordcloud')
library(wordcloud)
frequency<- colSums(dtm2)
str(frequency)
words<-names(frequency)
wordcloud(words,scale=c(1,.0005), min.freq = 3,random.order = FALSE, rot.per = .15)
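A likely cause (I can't run it against your data, so treat this as an assumption): you compute frequency but never pass it to wordcloud(), so the function has no frequencies to size the words by, and scale = c(1, .0005) leaves almost no range between the largest and smallest sizes anyway. Supplying frequency as the second argument with a wider scale should bring back the size differences:
# pass the frequencies so more common terms are drawn larger,
# and widen the scale range c(largest, smallest)
wordcloud(words, frequency,
          scale = c(4, 0.5), min.freq = 3,
          random.order = FALSE, rot.per = .15)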

Related

Translate a wordcloud in R

I generated a wordcloud with the most frequent words uttered by a child acquiring Portuguese. I used the R package wordcloud. I would like to generate a translated version (English) of the wordcloud I created. Is that possible? Here is my code:
library(tm)
library(wordcloud)
library(readr)   # read_file() comes from readr
mct <- read_file('mc.txt')
vs <- VectorSource(mct)
corpus <- Corpus(vs)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c("pra", "né", "porque", "assim", "opa", stopwords('portuguese')))
corpus <- tm_map(corpus, stripWhitespace)
wordcloud(corpus,
          min.freq = 1,
          max.words = 60,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
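wordcloud has no built-in translation, but the cloud only depends on term frequencies, so one option is to compute the frequencies yourself, translate the terms (by hand or with an external service), and plot the translated labels with the original counts. A sketch; the lookup table below is a made-up example that you would replace with your own translations:
# term frequencies from the cleaned corpus
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# hypothetical Portuguese -> English lookup; fill in your own pairs
lookup <- c("bola" = "ball", "cachorro" = "dog", "mais" = "more")

translated <- names(freq)
hit <- translated %in% names(lookup)
translated[hit] <- lookup[translated[hit]]

wordcloud(translated, freq,
          min.freq = 1, max.words = 60,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))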

Removing words containing a certain substring

So I'm making a function that receives a word corpus and spits out a cleaned product:
corpus_creater <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
}
This works great for the most part; however, when I look at the resulting word cloud, one thing stands out:
the word cloud includes random words that have the term "html" in them.
I figure I can fix this by adding a line to the function that removes any word containing the substring "http", but I can't for the life of me work out how to do that, and all the existing answers I've found deal with replacing a substring or removing only the substring itself.
What I want to do is:
if a substring is part of a word, remove that entire word.
The code I use to generate the word cloud from the corpus:
color_scheme <- brewer.pal(9,"YlGnBu")
color_scheme <- color_scheme[-(1:4)]
set.seed(103)
wordcloud(words = manu_corpus_final, max.words = 200, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = color_scheme)
If you are getting the corpus directly as input, you can extract the content of the documents with sapply and then remove from the corpus any document that contains the required string.
You could integrate it into your function in the following way:
corpus_creater <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  # added: drop the documents whose content contains "http"
  corpus <- corpus[-grep("http", sapply(corpus, `[`, 1))]
}
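Two caveats on that added line. If no document contains the string, grep() returns an empty index and corpus[-integer(0)] drops every document, so it is safer to store the index and guard with if (length(idx)). And since the question asks to remove the offending words rather than whole documents, a content transformer that deletes any whitespace-delimited token containing the substring may be closer to what you want (a sketch; removeTokensWith is just a name I made up):
# drop every whitespace-delimited token that contains the pattern
removeTokensWith <- content_transformer(function(x, pattern) {
  gsub(paste0("\\S*", pattern, "\\S*"), "", x)
})
corpus <- tm_map(corpus, removeTokensWith, "http")
corpus <- tm_map(corpus, stripWhitespace)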

I am doing text mining. If I have a dendrogram of some documents and cut it at a given level, how can I get all the terms in each cluster at that cut?

I have a code like this:
nf <- read.csv("test2.csv")  # test2.csv contains 79 rows (document names) and one column holding each document's text
corpus <- Corpus(VectorSource(nf$segment))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, function(x) removeWords(x,"shall"))
corpus <- tm_map(corpus, function(x) removeWords(x,"will"))
corpus <- tm_map(corpus, function(x) removeWords(x,"can"))
corpus <- tm_map(corpus, function(x) removeWords(x,"could"))
corpus <- tm_map(corpus, stemDocument, language = "english")
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
ft <- hclust(dist.mat, method="ward.D2")
plot(ft)
[plot: my cluster dendrogram]
I now have a cluster dendrogram of the documents. If I cut it at height = 50, how can I get the terms that belong to each cluster at that level?
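One way to do this (a sketch, not run against your data) is to cut the tree with cutree() at the same height and then, for each resulting cluster, collect the terms that occur in its documents from the term-document matrix you already built:
groups <- cutree(ft, h = 50)   # cluster membership for every document
terms_by_cluster <- lapply(split(names(groups), groups), function(docs) {
  sub <- td.mat[, docs, drop = FALSE]
  rownames(sub)[rowSums(sub) > 0]   # terms that appear somewhere in this cluster
})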

R Find tags in text automatically

I'm not really into text mining, so I'm asking my question here.
I have some texts and I would like to analyse which topics (tags) each text is about.
So I asked myself what the best way to do that is.
First of all, I prepared the texts and removed stopwords with the tm package:
library(tm)
sample2 = c('This text is about the wheather. Today the wheater is really bad.', 'The dog is barking very loud. That is annoying.')
myStopwords <- c(stopwords("english"), "today")
df <- do.call("rbind", lapply(sample2, as.data.frame))
colnames(df) = "texts"
corpus <- Corpus(VectorSource(df$texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, function(x) removeWords(x, myStopwords))
corpus <- tm_map(corpus, stemDocument, language = c("english"))
Now I created a TermDocumentMatrix
td.mat <- TermDocumentMatrix(corpus, control=list(minWordLength = 1))
and now I would like to find frequent nouns (no verbs or adjectives). Text 1 should have "text" and "wheater" as tags and Text 2 should have "dog".
Can anybody tell me how to do that?
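A term-document matrix alone cannot tell nouns from verbs or adjectives; you need part-of-speech tagging. One option (not part of tm; this assumes you are willing to add the udpipe package) is to annotate the raw texts and keep only the tokens tagged as nouns:
library(udpipe)

# download and load an English model (only needed once)
model <- udpipe_download_model(language = "english")
model <- udpipe_load_model(model$file_model)

# annotate the raw texts, keep only nouns, and count them per text
anno  <- as.data.frame(udpipe_annotate(model, x = sample2))
nouns <- subset(anno, upos == "NOUN")
table(nouns$doc_id, nouns$lemma)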

Why are some Cyrillic letters missing in the wordcloud?

I have a large corpus of Russian text. When I build a wordcloud, I see some characters like 'ч' are not rendered. The code looks like this:
dat <- read.csv("news.csv",sep=";",header=TRUE,stringsAsFactors=FALSE)
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud.png", width=640,height=640)
wordcloud(d$word, d$freq, scale = c(8, .2), min.freq = 5, max.words = 200,
          random.order = FALSE, rot.per = 0, colors = pal2)
dev.off()
EDIT
Oh, I figured it out myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[from the OP's own edit, repeated here to complete the question-answer pair]
You need to add the following line along with the other tm_map() calls:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
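The underlying problem is that the CSV is in the Windows Cyrillic encoding (cp1251), so converting the corpus to UTF-8 fixes the rendering. An alternative (untested on your file) is to declare the encoding when reading the CSV, so no later conversion is needed:
# read the file as cp1251; R converts it on the way in
dat <- read.csv("news.csv", sep = ";", header = TRUE,
                stringsAsFactors = FALSE, fileEncoding = "cp1251")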
