Why are some Cyrillic letters missing in a wordcloud? - r

I have a large corpus of Russian text. When I build a wordcloud, I see some characters like 'ч' are not rendered. The code looks like this:
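library(tm)         # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)  # wordcloud; also loads RColorBrewer for brewer.pal
dat <- read.csv("news.csv",sep=";",header=TRUE,stringsAsFactors=FALSE)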
corpus <- Corpus(VectorSource(dat$Article),
readerControl = list(reader=readPlain,language="ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords,
stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud.png", width=640,height=640)
wordcloud(d$word,d$freq, scale=c(8,.2), min.freq=5, max.words=200,
random.order=FALSE, rot.per=0, colors=pal2)
dev.off()
EDIT
Oh, I did it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')

[From the OP's own edit, repeated here to complete the question-and-answer pair.]
You need to add the following line alongside the other tm_map() calls:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
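To see what this conversion does, here is a minimal base-R sketch. It assumes the input file was saved as Windows-1251 (cp1251), which is what the fix above implies; the single byte used as input and the variable names are illustrative only. With newer versions of tm, a plain function such as iconv may need to be wrapped in content_transformer() before it is passed to tm_map().

```r
# In cp1251 the Cyrillic small letter che is stored as the single byte 0xF7.
raw_byte <- rawToChar(as.raw(0xF7))

# Re-encode from cp1251 to UTF-8, which is what the tm_map() call does per document.
converted <- iconv(raw_byte, from = "cp1251", to = "UTF-8")

# The converted string should hold the Unicode code point U+0447 (che).
code_point <- utf8ToInt(converted)
print(code_point == 0x0447)
```

If the text is read as cp1251 but never re-encoded, bytes like this one have no valid UTF-8 interpretation, which is consistent with individual letters going missing in the plot.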

Related

Translate a wordcloud in R

I generated a wordcloud with the most frequent words uttered by a child acquiring Portuguese. I used the R package wordcloud. I would like to generate a translated version (English) of the wordcloud I created. Is that possible? Here is my code:
library(tm)
library(wordcloud)
library(readr)  # provides read_file()
mct <- read_file('mc.txt')
vs <- VectorSource(mct)
corpus <- Corpus(vs)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c("pra", "né", "porque", "assim", "opa", stopwords('portuguese')))
corpus <- tm_map(corpus, stripWhitespace)
wordcloud(corpus,
min.freq = 1,
max.words = 60,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(8, "Dark2"))
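One possible approach, sketched under assumptions: translate the words after counting them, then pass the translated words and their frequencies to wordcloud() explicitly. The frequencies and the lookup table below are hypothetical stand-ins; in practice the counts would come from the corpus and the dictionary would have to be built by hand or with a translation tool.

```r
# Hypothetical word frequencies, standing in for counts taken from the corpus.
freq_pt <- c(bola = 12, casa = 7, menino = 5, garoto = 4)

# Hypothetical hand-made Portuguese-to-English lookup table.
dictionary <- c(bola = "ball", casa = "house", menino = "boy", garoto = "boy")

# Translate each word; words missing from the dictionary keep their original form.
translated <- unname(dictionary[names(freq_pt)])
missing_entry <- is.na(translated)
translated[missing_entry] <- names(freq_pt)[missing_entry]

# Sum the counts of words that share a translation.
freq_en <- sapply(split(unname(freq_pt), translated), sum)
freq_en <- sort(freq_en, decreasing = TRUE)

# Plot only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq_en), freq_en, min.freq = 1,
                       max.words = 60, random.order = FALSE, rot.per = 0.35)
}
```

Summing after translation matters because two source words can map to the same English word, as "menino" and "garoto" do here.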

Removing words containing a certain substring

I'm writing a function that takes a word corpus and returns a cleaned version:
corpus_creater <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
}
This works great for the most part, however when I look at the resulting word cloud I generated I notice one thing that stands out:
the word cloud includes random words that have the term "html" in them.
I figure I can fix this by simply adding a line to the function that removes any word containing the substring "http", but I can't for the life of me work out how to do that. All the existing answers I've found deal with replacing a substring, or with removing only that substring.
What I want to do is:
if a substring is a part of the word, then remove that entire word.
The code I use to generate the word cloud from the corpus:
color_scheme <- brewer.pal(9,"YlGnBu")
color_scheme <- color_scheme[-(1:4)]
set.seed(103)
wordcloud(words = manu_corpus_final, max.words=200, random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors=color_scheme)
If you receive the corpus directly as input, you could extract its content with sapply and then remove from the corpus every document that contains the string. Note that this drops whole documents, not individual words.
You could integrate it into your function as follows:
corpus_creater <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
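# Added line: drop every document whose text contains "http".
# grepl() is used rather than -grep() so that nothing is dropped when there is no match.
corpus <- corpus[!grepl("http", sapply(corpus, `[`, 1))]
corpus
}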
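If the goal is to drop only the offending words rather than whole documents, a regular expression can do it. The following is a minimal base-R sketch; remove_words_containing is a hypothetical helper name, and it assumes words are separated by whitespace.

```r
# Hypothetical helper: drop every whitespace-delimited word that contains `substring`.
# `substring` is interpreted as a regular expression, so special characters need escaping.
remove_words_containing <- function(text, substring) {
  pattern <- paste0("\\S*", substring, "\\S*")
  # Replace each matching word with a space, then collapse repeated whitespace.
  cleaned <- gsub(pattern, " ", text, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

cleaned <- remove_words_containing("visit httpexamplecom for more info", "http")
print(cleaned)
```

Inside corpus_creater() this could probably be applied with tm_map(corpus, content_transformer(remove_words_containing), "http"), placed before the stemDocument step; that placement is a suggestion and not something the answer above states.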

Text mining: if I have a dendrogram of some documents and cut it at one level, how can I get all the terms at that level of the cut?

I have a code like this:
nf <- read.csv("test2.csv") # test2.csv contains 79 rows (one per document name) and one column holding the document text
corpus <- Corpus(VectorSource(nf$segment))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, function(x) removeWords(x,"shall"))
corpus <- tm_map(corpus, function(x) removeWords(x,"will"))
corpus <- tm_map(corpus, function(x) removeWords(x,"can"))
corpus <- tm_map(corpus, function(x) removeWords(x,"could"))
corpus <- tm_map(corpus, stemDocument, language = "english")
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
ft <- hclust(dist.mat, method="ward.D2")
plot(ft)
[Image: my dendrogram]
I have a cluster dendrogram of the documents. If I cut it at height = 50, how can I get the terms at that level?
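One way to approach this, sketched under assumptions: cutree() from base R's stats package assigns each document to a cluster at a given height, and the terms of each cluster can then be read off the term-document matrix. The toy matrix, the term names, and the cut height of 5 are made up for illustration; with the real data the call would be cutree(ft, h = 50).

```r
# Toy term-document matrix: rows are terms, columns are documents (illustrative data).
td.mat <- matrix(c(5, 4, 0, 0,
                   3, 5, 0, 0,
                   0, 0, 6, 4,
                   0, 1, 5, 6),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("contract", "party", "engine", "fuel"),
                                 paste0("doc", 1:4)))

# Cluster the documents exactly as in the question.
dist.mat <- dist(t(td.mat))
ft <- hclust(dist.mat, method = "ward.D2")

# Cut the tree at a chosen height; the result maps each document to a cluster id.
groups <- cutree(ft, h = 5)

# For each cluster, keep the terms that occur in at least one of its documents.
terms_by_cluster <- lapply(split(names(groups), groups), function(docs) {
  counts <- rowSums(td.mat[, docs, drop = FALSE])
  names(counts[counts > 0])
})
print(terms_by_cluster)
```

Sorting counts in decreasing order inside the function would give the most frequent terms per cluster instead of the full list.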

R: Obtaining Single Term Frequencies instead of Bigrams

Here is the code I use to create bi-grams with frequency list:
library(tm)
library(RWeka)
#data <- myData[,2]
tdm.generate <- function(string, ng){
# tutorial on rweka - http://tm.r-forge.r-project.org/faq.html
corpus <- Corpus(VectorSource(string)) # create corpus for TM processing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# corpus <- tm_map(corpus, removeWords, stopwords("english"))
options(mc.cores=1) # http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng)) # create n-grams
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)) # create tdm from n-grams
tdm
}
source("GenerateTDM.R") # generatetdm function in appendix
tdm <- tdm.generate("The book The book The greatest The book",2)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
topwords <- as.numeric(topwords)
hist(topwords, breaks = 10)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
head(sort(topwords, decreasing = TRUE))
The result for the above code is:
the book greatest
4 3 1
Instead, I'm looking for the result where bi-grams are shown like:
"the book" "book the"
3 2
What needs to be changed in the above code to get the output as above?
You need to use VCorpus instead of Corpus. I ran into the same issue; the original answer links to more details.
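To check the counts the question expects independently of tm and RWeka, here is a small base-R sketch that builds the bigrams by hand. It is not the tokenizer used above, only a cross-check on the same example string.

```r
# In tdm.generate(), the answer's change amounts to replacing Corpus(...) with:
#   corpus <- VCorpus(VectorSource(string))

# Cross-check in base R: split into lower-case tokens and pair each with its successor.
tokens <- strsplit(tolower("The book The book The greatest The book"), "\\s+")[[1]]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Count each bigram and list the most frequent first.
counts <- sort(table(bigrams), decreasing = TRUE)
print(counts)
```

The two most frequent bigrams are "the book" (3) and "book the" (2), matching the output the question asks for.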

plot word cloud errors

I was trying to plot a word cloud in R. My code is below. The word cloud is plotted, but all the words have the same font size. It does not place the higher-frequency words in the center or show them in a bigger font. Can anybody point out what mistake I made? Thank you!
install.packages('tm')
library(tm)
reviews <- read.csv('keywords_2.csv',stringsAsFactors = FALSE)
review_text <- paste(reviews$customer_search_term, collapse = " ")
review_source<-VectorSource(review_text)
corpus<-Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <-DocumentTermMatrix(corpus)
dtm2<-as.matrix(dtm)
install.packages('wordcloud')
library(wordcloud)
frequency<- colSums(dtm2)
str(frequency)
words<-names(frequency)
wordcloud(words,scale=c(1,.0005), min.freq = 3,random.order = FALSE, rot.per = .15)
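One thing that stands out in the call above is that it passes only words, so the frequency vector computed with colSums is never used. The sketch below passes the frequencies explicitly as the second argument of wordcloud(). The search terms are hypothetical stand-ins for the CSV column, the counting is done in base R for brevity, and the scale values are the package defaults rather than a recommendation.

```r
# Hypothetical search terms standing in for reviews$customer_search_term.
terms <- c("red shoes", "blue shoes", "red hat", "shoes")

# Count how often each word occurs.
tokens <- unlist(strsplit(tolower(terms), "\\s+"))
frequency <- sort(table(tokens), decreasing = TRUE)
words <- names(frequency)
freq <- as.integer(frequency)

# Pass both words and freq so that font size reflects frequency.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words, freq, scale = c(4, 0.5), min.freq = 1,
                       random.order = FALSE, rot.per = 0.15)
}
```

The very small scale = c(1, .0005) in the original call may also flatten the visible size differences, so it is worth trying a wider range.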
