Translate a wordcloud in R

I generated a wordcloud with the most frequent words uttered by a child acquiring Portuguese. I used the R package wordcloud. I would like to generate a translated version (English) of the wordcloud I created. Is that possible? Here is my code:
library(tm)
library(wordcloud)
library(readr)  # read_file() comes from readr

mct <- read_file('mc.txt')
vs <- VectorSource(mct)
corpus <- Corpus(vs)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c("pra", "né", "porque", "assim", "opa", stopwords('portuguese')))
corpus <- tm_map(corpus, stripWhitespace)
wordcloud(corpus,
          min.freq = 1,
          max.words = 60,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
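wordcloud itself cannot translate; one approach is to compute the term frequencies from the cleaned corpus, map each Portuguese term to English through a hand-made lookup table, and redraw the cloud with the translated words and the original frequencies. A minimal sketch, assuming the corpus built above; the `pt_en` entries are illustrative placeholders, not real data from the child's speech:

```r
library(tm)
library(wordcloud)

# Term frequencies from the cleaned corpus
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Hand-made Portuguese -> English lookup (placeholder entries;
# fill this in for your own vocabulary)
pt_en <- c(agua = "water", bola = "ball", nao = "no")

# Keep any term that has no translation as-is
words_en <- ifelse(names(freq) %in% names(pt_en),
                   pt_en[names(freq)], names(freq))

wordcloud(words_en, freq,
          min.freq     = 1,
          max.words    = 60,
          random.order = FALSE,
          rot.per      = 0.35,
          colors       = brewer.pal(8, "Dark2"))
```

If two Portuguese terms map to the same English word it will appear twice; aggregate `freq` by `words_en` first if that matters for your data.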

Getting Error for DocumentTermMatrix in R

My code so far is as below:
corpus <- VCorpus(VectorSource(final_data$comment))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, 'brw')
corpus <- tm_map(corpus, removeWords, 'cid')
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, trimws)
dtm <- DocumentTermMatrix(corpus)
I am getting the following error on the last command (DocumentTermMatrix):
'no applicable method for 'meta' applied to an object of class
"character"'
Can you please let me know how to fix it?
The line tm_map(corpus, trimws) is causing the issue. Passing a plain function to tm_map returns a character string instead of a document, which corrupts the corpus. If you want to use a function inside tm_map that is not part of the tm package, you need to wrap it in content_transformer.
If you change your last line of code to the one below it should work.
corpus <- tm_map(corpus, content_transformer(trimws))
dtm <- DocumentTermMatrix(corpus)
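A minimal sketch illustrating the failure mode described above, assuming a current tm version and a tiny throwaway corpus:

```r
library(tm)

docs <- VCorpus(VectorSource(c("  hello world  ", " second doc ")))

# Applying a plain base-R function replaces each document with the
# bare character string it returns, so meta() and
# DocumentTermMatrix() fail afterwards:
broken <- tm_map(docs, trimws)
class(broken[[1]])

# Wrapping the function in content_transformer() updates only the
# content and keeps the document objects intact:
fixed <- tm_map(docs, content_transformer(trimws))
class(fixed[[1]])
dtm <- DocumentTermMatrix(fixed)  # now works
```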

How to remove non-UTF-8 characters from text

I need help removing non-UTF-8 characters from my word cloud. So far this is my code. I've tried gsub and removeWords, but the characters are still in my word cloud and I do not know how to get rid of them. Any help would be appreciated. Thank you for your time.
txt <- readLines("11-0.txt")
corpus = VCorpus(VectorSource(txt))
gsub("’","‘","",txt)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","â€","project"))
tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)
wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
Edit: Here is my iconv version
txt <- readLines("11-0.txt")
Encoding(txt) <- "latin1"
iconv(txt, "latin1", "ASCII", sub="")
corpus = VCorpus(VectorSource(txt))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, removeWords, c("gutenberg","gutenbergtm","project"))
tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing = TRUE)
d = data.frame(word=names(v),freq=v)
wordcloud(d$word,d$freq,max.words = 20, random.order=FALSE, rot.per=0.2, colors=brewer.pal(8, "Dark2"))
title(main="Alice in Wonderland word cloud",font.main=1,cex.main =1.5)
The signature of gsub is:
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
It is not clear what you intended with
gsub("’","‘","",txt)
but that call is not doing what you want: with four positional arguments, "’" is the pattern, "‘" the replacement, "" the text to search, and txt ends up bound to ignore.case. On top of that, the result is never assigned back to txt.
See here for a previous SO question on gsub and non-ascii symbols.
Edit:
Suggested solution using iconv:
Removing all non-ASCII characters:
txt <- "’xxx‘"
iconv(txt, "latin1", "ASCII", sub="")
Returns:
[1] "xxx"

Text mining: if I have a dendrogram of some documents and cut it at one level, how can I get all the terms in that level of the cut?

I have a code like this:
nf <- read.csv("test2.csv")  # test2.csv contains 79 rows (document names) and one column of text per document
corpus <- Corpus(VectorSource(nf$segment))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, c(stopwords("english"), "shall", "will", "can", "could")))
corpus <- tm_map(corpus, stemDocument, language = "english")
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
ft <- hclust(dist.mat, method="ward.D2")
plot(ft)
[dendrogram plot]
I have a cluster dendrogram of the documents. If I cut it at height = 50, how can I get the terms at that level?
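The thread leaves this unanswered. A hedged sketch using stats::cutree(), assuming the td.mat and ft objects from the code above: cut the document dendrogram at the chosen height, then, for each resulting cluster, keep the terms that occur in at least one of its documents.

```r
# Cut the document dendrogram at height 50; groups is a named
# vector mapping each document to its cluster id
groups <- cutree(ft, h = 50)

# For each cluster, collect the terms present in its documents
terms_by_cluster <- lapply(split(names(groups), groups), function(doc_ids) {
  sub <- td.mat[, doc_ids, drop = FALSE]
  rownames(sub)[rowSums(sub) > 0]
})
terms_by_cluster
```

Use k = instead of h = in cutree() if you want a fixed number of clusters rather than a cut height.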

plot word cloud errors

I was trying to plot a word cloud in R. My code is below. The word cloud is plotted, but all words have the same font size. It does not show the higher-frequency words in bigger font sizes. Can anybody point out what mistakes I made? Thank you!
install.packages('tm')
library(tm)
reviews <- read.csv('keywords_2.csv',stringsAsFactors = FALSE)
review_text <- paste(reviews$customer_search_term, collapse = " ")
review_source<-VectorSource(review_text)
corpus<-Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <-DocumentTermMatrix(corpus)
dtm2<-as.matrix(dtm)
install.packages('wordcloud')
library(wordcloud)
frequency<- colSums(dtm2)
str(frequency)
words<-names(frequency)
wordcloud(words,scale=c(1,.0005), min.freq = 3,random.order = FALSE, rot.per = .15)
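A likely cause: wordcloud() is called with only the term names and never the frequency vector, so every word is counted once and drawn at the same size; the scale = c(1, .0005) range also leaves almost no room for size differences. A sketch of a fixed call, assuming the dtm2 matrix from the code above:

```r
library(wordcloud)

# Column sums of the document-term matrix give per-term counts
frequency <- sort(colSums(dtm2), decreasing = TRUE)
words <- names(frequency)

# Pass the frequencies as the second argument so sizes reflect
# counts; scale sets the size range from most to least frequent
wordcloud(words, frequency,
          scale        = c(4, 0.5),
          min.freq     = 3,
          random.order = FALSE,
          rot.per      = 0.15)
```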

Why some cyrillic letters are missing in wordcloud?

I have a large corpus of Russian text. When I build a wordcloud, I see some characters like 'ч' are not rendered. The code looks like this:
dat <- read.csv("news.csv",sep=";",header=TRUE,stringsAsFactors=FALSE)
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud.png", width=640,height=640)
wordcloud(d$word, d$freq, scale=c(8,.2), min.freq=5, max.words=200,
          random.order=FALSE, rot.per=0, colors=pal2)
dev.off()
EDIT
Oh, I did it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[From the OP's own edit, repeated here as an answer to complete the question-and-answer.]
You need to add this line along with the other tm_map() calls:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
