Getting Error for DocumentTermMatrix in R

My code is as below:
library(tm)

corpus <- VCorpus(VectorSource(final_data$comment))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, 'brw')
corpus <- tm_map(corpus, removeWords, 'cid')
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, trimws)
dtm <- DocumentTermMatrix(corpus)
I am getting the following error on the last command (DocumentTermMatrix):
'no applicable method for 'meta' applied to an object of class "character"'
Can you please let me know how to fix it?

The line tm_map(corpus, trimws) is causing the issue. Applying trimws directly replaces each document with a plain character string, which corrupts the corpus. If you want to use a function inside tm_map that is not part of the tm package, you need to wrap it in content_transformer.
If you change that tm_map call to the one below, it should work.
corpus <- tm_map(corpus, content_transformer(trimws))
dtm <- DocumentTermMatrix(corpus)
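A quick way to sanity-check the corpus before calling DocumentTermMatrix (a diagnostic suggestion, not part of the original answer) is to look at the class of one document:
class(corpus[[1]])  # should include "PlainTextDocument", not "character"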

Related

How do I retain my unique identifiers when cleaning and stemming with the tm package in R?

library(tm)
library(dplyr)

# To prepare for DataframeSource, the columns must be named doc_id and text.
textdataframe <- textdataframe %>% rename(doc_id = orig_id, text = orig.narr)
corpus <- Corpus(DataframeSource(textdataframe))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, tolower)
corpus[[1]][1]
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
corpus[[1]][1]
# remove stopwords
corpus <- tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus[[1]][1]
# stemming
corpus <- tm_map(corpus, stemDocument)
corpus[[1]][1]
What ends up happening is I lose the unique IDs I assigned when setting up the DataframeSource. I would like to keep them attached to the documents as I go along with cleaning and stemming.
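A likely culprit, going by the first answer on this page: passing tolower (and PlainTextDocument) straight to tm_map strips the document metadata that holds the IDs. A minimal sketch of the same pipeline with the transformation wrapped in content_transformer, assuming the textdataframe from the question; VCorpus is used explicitly so each document keeps its metadata:
corpus <- VCorpus(DataframeSource(textdataframe))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
meta(corpus[[1]], "id")  # the original doc_id should still be attached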

Removing words containing a certain substring

So I'm making a function that receives a word corpus and then spits out a cleaned product:
corpus_creater <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  corpus
}
This works well for the most part, but when I look at the resulting word cloud, one thing stands out:
the word cloud includes random words that have the term "html" in them.
I figure I can fix this by simply adding a line to the function that removes any word containing the substring "http", but I can't for the life of me work out how to do that, and all the existing answers I've found deal with replacing a substring or removing only the substring itself.
What I want to do is:
if a substring is a part of the word, then remove that entire word.
Word Cloud code I use to generate the word cloud from the corpus:
library(wordcloud)
library(RColorBrewer)

color_scheme <- brewer.pal(9, "YlGnBu")
color_scheme <- color_scheme[-(1:4)]  # drop the lightest shades
set.seed(103)
wordcloud(words = manu_corpus_final, max.words = 200, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = color_scheme)
If your function receives the corpus directly, you can extract each document's content with sapply and then drop the documents that contain the unwanted string.
You could integrate it into your function in the following way:
corpus_creater <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  # Added: drop every document whose content contains "http"
  hits <- grep("http", sapply(corpus, content))
  if (length(hits) > 0) corpus <- corpus[-hits]
  corpus
}
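Note the above drops whole documents. If instead you want to remove only the words containing the substring, as the question actually asks, a minimal sketch using gsub inside content_transformer (the regex is an assumption: \S* on each side deletes the whole whitespace-delimited token containing the match):
remove_words_with <- content_transformer(function(x, pattern) {
  gsub(paste0("\\S*", pattern, "\\S*"), "", x)  # delete whole tokens containing the pattern
})
corpus <- tm_map(corpus, remove_words_with, "http")
corpus <- tm_map(corpus, stripWhitespace)  # tidy the gaps left behind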

I am doing text mining. If I have a dendrogram of some documents and cut it at one level, how can I get all the terms at that level of the cut?

I have code like this:
library(tm)

nf <- read.csv("test2.csv")  # test2.csv contains 79 rows (document names) and one column of text per document
corpus <- Corpus(VectorSource(nf$segment))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, function(x) removeWords(x, c("shall", "will", "can", "could")))
corpus <- tm_map(corpus, stemDocument, language = "english")
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(td.mat))  # distances between documents
ft <- hclust(dist.mat, method = "ward.D2")
plot(ft)
(plot of the resulting cluster dendrogram)
I have a cluster dendrogram of the documents. If I cut it at height = 50, how can I get the terms that appear in each cluster at that level?
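A minimal sketch of one approach, assuming the ft and td.mat objects from the code above: cutree() assigns each document to a cluster at the chosen height, and the term-document matrix then yields the terms occurring in each cluster.
groups <- cutree(ft, h = 50)  # cluster id per document
terms_by_cluster <- lapply(split(seq_along(groups), groups), function(idx) {
  sub <- td.mat[, idx, drop = FALSE]  # term counts for this cluster's documents
  rownames(sub)[rowSums(sub) > 0]     # terms that actually occur in the cluster
})
terms_by_cluster[["1"]]  # terms appearing in the first cluster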

Does R work for multilingual data?

We have built machine learning models, such as a classification algorithm with factor features and topic modelling, on English text data.
The script we prepared is below.
complete <- subset(complete,select=c(Group,Type,Text,Target))
data <- complete$Text
corpus <- Corpus(VectorSource(data))  # build the corpus from the text column
corpus <- tm_map(corpus, content_transformer(tolower))
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "-")
corpus <- tm_map(corpus, toSpace, ":")
corpus <- tm_map(corpus, toSpace, ";")
corpus <- tm_map(corpus, toSpace, "#")
corpus <- tm_map(corpus, toSpace, "\\(" )
corpus <- tm_map(corpus, toSpace, ")")
corpus <- tm_map(corpus, toSpace, ",")
corpus <- tm_map(corpus, toSpace, "_")
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus,stemDocument)
tdm <- DocumentTermMatrix(corpus)
train1 <- as.matrix(tdm)
complete1 <- subset(complete,select=c(Group,Type,Target))
complete1 <- Filter(function(x)(length(unique(x))>1), complete1)
train <- cbind(complete1, train1)
train$Text <- NULL
train$Target <- as.factor(train$Target)
############################################################################################
# Model Run
############################################################################################
library(e1071)  # for svm()
fit <- svm(Target ~ ., data = train)
termlist <- list(dictionary = Terms(tdm))
retval <- list(model = fit, termlist = termlist, complete = complete)
saveRDS(retval, "./modelTarget.rds")
Now we expect data in other languages: Chinese, Korean, Japanese, French, Portuguese, and Spanish.
I wanted to check whether R supports these kinds of data, especially for text cleaning.
Please advise.
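One concrete obstacle in the script above: the removeSpecialChars regex keeps only ASCII letters and digits, so it would strip Chinese, Korean, or accented characters entirely. A hedged sketch of a Unicode-aware variant, using the PCRE letter and digit classes via perl = TRUE:
# Keep any Unicode letter or digit rather than ASCII only
removeSpecialChars <- function(x) gsub("[^\\p{L}\\p{N} ]", "", x, perl = TRUE)
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))
# tm ships stopword lists for several of these languages,
# e.g. stopwords("french"), stopwords("portuguese"), stopwords("spanish");
# CJK languages generally need a dedicated tokenizer rather than whitespace splitting.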

Why are some Cyrillic letters missing in the wordcloud?

I have a large corpus of Russian text. When I build a wordcloud, I see that some characters, like 'ч', are not rendered. The code looks like this:
library(tm)
library(wordcloud)
library(RColorBrewer)

dat <- read.csv("news.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)  # total frequency per term
d <- data.frame(word = names(v), freq = v)
pal2 <- brewer.pal(8, "Dark2")
png("wordcloud.png", width = 640, height = 640)
wordcloud(d$word, d$freq, scale = c(8, .2), min.freq = 5, max.words = 200,
          random.order = FALSE, rot.per = 0, colors = pal2)
dev.off()
EDIT
Oh, I solved it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[from the OP's own edit, repeated here to complete the question and answer]
You need to add the following line along with the other tm_map() calls. (On newer versions of tm, wrap iconv in content_transformer, as in the first answer on this page, so the documents are not reduced to plain character vectors.)
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, 'cp1251', 'UTF-8')))
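If you are not sure the source encoding really is cp1251, a quick check on the raw column before converting the whole corpus (the column name follows the question's code; this is a diagnostic suggestion, not part of the original answer):
Encoding(dat$Article[1])  # "unknown" usually indicates a native, non-UTF-8 encoding
substr(iconv(dat$Article[1], 'cp1251', 'UTF-8'), 1, 80)  # preview: does the text now read correctly?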
