New error being thrown with tm package and dtm/wordclouds

Using R 3.2.5 with the following packages loaded:
'SnowballC', 'tm', 'NLP', 'RWeka', 'RTextTools', 'wordcloud', 'fpc'
carmenCorpus <- Corpus(VectorSource(feedback$Description))
carmenCorpus <- tm_map(carmenCorpus, PlainTextDocument)
carmenCorpus <- tm_map(carmenCorpus, removePunctuation)
carmenCorpus <- tm_map(carmenCorpus, removeWords, stopwords('english'))
carmenCorpus <- tm_map(carmenCorpus, stemDocument)
When I try to create a wordcloud, I get the following error. This is new: when I ran the same code several months ago there was no issue:
wordcloud(carmenCorpus, max.words = 100, random.order = FALSE)
# Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus), :
# 'i, j' invalid
Please advise on this issue.

Rather than handing the corpus straight to wordcloud and hoping it churns out a wordcloud, do the conversion yourself.
Convert the corpus into a TermDocumentMatrix and then sum up the word frequencies:
# convert to TDM
tdm <- TermDocumentMatrix(carmenCorpus, control = list(stemming = TRUE))
# calculate word frequencies
freqs = sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
# plot wordcloud
wordcloud(names(freqs), freqs,
          max.words = 100,
          random.order = FALSE
          # any other params you want to pass into wordcloud
)
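To see what the rowSums step computes, here is a minimal base R sketch that builds a tiny term-by-document count matrix by hand and sums each row. It uses no tm functions; the `docs` vector, the `tdm_toy` name, and the whitespace tokenization are illustrative assumptions, not what tm does internally.

```r
# Toy documents (hypothetical sample data)
docs <- c("good service good price", "bad service", "good price")

# Split on whitespace; tm's real tokenizer is more elaborate
tokens <- strsplit(docs, "\\s+")

# Count every term in every document: rows are terms, columns are documents
terms <- sort(unique(unlist(tokens)))
tdm_toy <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
rownames(tdm_toy) <- terms
colnames(tdm_toy) <- paste0("doc", seq_along(docs))

# Summing across documents gives the overall frequency of each term
freqs <- sort(rowSums(tdm_toy), decreasing = TRUE)
print(freqs)
```

The resulting named vector has the same shape as the `freqs` object passed to wordcloud above: names are the terms, values are their total counts.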

Related

I can't create a word cloud with this algorithm

I'm trying to run this code, which I saw in class, to create a wordcloud, but I can't work out how to fix the error that appears when I try to plot it. The error is: Error in wordcloud(dfCorpus, max.words = 100, random.order = FALSE) :
could not find function "wordcloud". Can anyone help?
install.packages("tm", "SnowballC", "worldcloud", "RColorBrewer")
library(tm, SnowballC,worldcloud,RColorBrewer)
df <- read.csv('C:/r_fundamentos/parte2/questoes.csv', sep = ",")
head(df)
dfCorpus <-Corpus(VectorSource(df$Question))
class(dfCorpus)
dfCorpus <- tm_map(dfCorpus, PlainTextDocument)
dfCorpus <- tm_map(dfCorpus,removePunctuation)
dfCorpus <- tm_map(dfCorpus, removeWords, stopwords("english"))
dfCorpus <- tm_map(dfCorpus, stemDocument)
dfCorpus <- tm_map(dfCorpus, removeWords, c("the", "this", c(stopwords("english"))))
wordcloud(dfCorpus, max.words = 100, random.order = FALSE)
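I think not all of the packages are being loaded: library() attaches only one package per call, and the package name is misspelled ("worldcloud" instead of "wordcloud"). install.packages() likewise expects the package names as a single character vector, e.g. install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer")). Try instead:
library(tm) ; library(SnowballC) ; library(wordcloud) ; library(RColorBrewer)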
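If there are many packages to attach, one call per package can be wrapped in a loop. The sketch below is only an illustration: `load_packages` is a hypothetical helper name, and it is demonstrated with packages that ship with R so that it runs anywhere.

```r
# Hypothetical helper: attach each package in turn, stopping early
# with a clear message if one of them is not installed
load_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package not installed: ", pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated with base packages; for the question above this would be
# c("tm", "SnowballC", "wordcloud", "RColorBrewer")
load_packages(c("stats", "utils"))
```

A misspelled name such as "worldcloud" would then fail immediately with "Package not installed", instead of surfacing later as a missing function.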

TermDocumentMatrix not responding to Tokenizer

I am very new to R and I am trying to build an n-gram word cloud. However, my results always show unigrams instead of n-grams. I have searched the web for days and tried different methods, with the same result every time. Also, for some reason, I don't have the NGramTokenizer function that I see everyone using, so I am using another tokenizer function that I found, shown here. I hope someone can help me out. Thanks in advance!
library(dplyr)
library(ggplot2)
library(tidytext)
library(wordcloud)
library(tm)
library(RTextTools)
library(readxl)
library(qdap)
library(RWeka)
library(tau)
library(quanteda)
rm(list = ls())
#setwd("C:\\RStatistics\\Data\\")
#allverbatims <-read_excel("RS_Verbatims2018.xlsx") #reads excel files
#selgroup <- subset(allverbatims, FastNPS=="Detractors")
#selcolumns <- selgroup[ ,3:8]
#sample data
selcolumns <- c("this is a test","my test is not working","sample data here")
Comments <- Corpus(VectorSource(selcolumns))
CommentClean <- tm_map(Comments, removePunctuation)
CommentClean <- tm_map(CommentClean, content_transformer(tolower))
CommentClean <- tm_map(CommentClean,removeNumbers)
CommentClean <- tm_map(CommentClean, stripWhitespace)
CommentClean <- tm_map(CommentClean,removeWords,c(stopwords('english')))
#create manual tokenizer using tau textcnt since NGramTokenizer is not available
tokenize_ngrams <- function(x, n=2) return(rownames(as.data.frame(unclass(textcnt(x,method="string", n=n)))))
#test tokenizer
head(tokenize_ngrams(CommentClean))
td_mat <- TermDocumentMatrix(CommentClean, control = list(tokenize = tokenize_ngrams))
inspect(td_mat) #should be bigrams but the result is 1 gram
matrix <- as.matrix(td_mat)
sorted <- sort(rowSums(matrix),decreasing = TRUE)
data_text <- data.frame(word = names(sorted),freq = sorted)
set.seed(1234)
wordcloud(word = data_text$word, freq = data_text$freq, min = 5, max.words = 100, random.order = FALSE, rot.per = 0.1, colors = rainbow(30))

Error in .jcall()

I am running the following code and receiving this error:
Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,
: java.lang.NullPointerException
setwd("C:\\Users\\jbarr\\Desktop\\test")
library (tm); library (wordcloud);library (RWeka); library (tau);library(xlsx);
Comment <- read.csv("testfile.csv",stringsAsFactors=FALSE)
str(Comment)
review_source <- VectorSource(Comment)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords,stopwords(kind = "english"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c("member", "advise", "inform", "informed", "caller", "call","provided", "advised"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
wordfreq <- colSums(dtm2)
wordfreq <- sort(wordfreq, decreasing=TRUE)
head(wordfreq, n=100)
wfreq <- head(wordfreq, 500)
set.seed(142)
words <- names(wfreq)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(words[1:100], wordfreq[1:100], rot.per=0.35, scale=c(2.7, .4), colors=dark2, random.order=FALSE)
write.xlsx(wfreq, "C:\\Users\\jbarr\\Desktop\\test")
The interesting part is that I have run this code on multiple files, and only specific ones produce the error.
Sanmeet is right: it's a problem with NAs in your data frame.
Just before the line review_source <- VectorSource(Comment)
insert the line below:
Comment[is.na(Comment)] <- "NULLVALUEENTERED"
This changes all of your NA values to the phrase NULLVALUEENTERED (feel free to change that). No more NAs, and the code should run fine.
You are getting the error in the tokenizer because of NA values in Comment.
Comment <- read.csv("testfile.csv", stringsAsFactors = FALSE)
str(Comment)
nrow(Comment)
# read.csv returns a data frame, so drop whole rows (note the comma)
Comment <- Comment[complete.cases(Comment), , drop = FALSE]
nrow(Comment)
Or, if Comment is a plain character vector rather than a data frame, you can use is.na as below:
Comment <- Comment[!is.na(Comment)]
Now apply the preprocessing steps, create the corpus, etc.
Hope this helps.
A suggestion: I get this error when reading an Excel (.xlsx) file using:
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1", startRow = 1, endRow = 0)
It appears that the value for endRow should be NULL or a valid row number. But
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1")
works fine. So you might want to check your argument values and that each argument is matched to the parameter you intended.
It seems there are NAs in your data frame. Use is.na() to find them and remove those rows, then run the code again. It should work.
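The NA check suggested in these answers can be sketched in base R as follows. The small `Comment` data frame here is a hypothetical stand-in for the CSV from the question, and the `text` column name is an assumption.

```r
# Hypothetical stand-in for the data read from testfile.csv
Comment <- data.frame(text = c("call dropped", NA, "billing question"),
                      stringsAsFactors = FALSE)

# Count missing values per column to confirm NAs are present
na_counts <- colSums(is.na(Comment))
print(na_counts)

# Keep only the rows that have no NA in any column
Comment_clean <- Comment[complete.cases(Comment), , drop = FALSE]
print(nrow(Comment_clean))
```

Running the count first makes it easy to tell which files are affected, which matches the observation that only some files trigger the error.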

How to remove error in term-document matrix in R?

I am trying to create a term-document matrix in R from a corpus of files. But when I run the code I get this error, followed by 2 warnings:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j' invalid
Calls: DocumentTermMatrix ... TermDocumentMatrix.VCorpus -> simple_triplet_matrix -> .Call
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
My code is given below:
library(tm)
library(RWeka)
library(tmcn.word2vec)
#Reading data
data <- read.csv("Train.csv", header=T)
#text <- data$EventDescription
#Pre-processing
corpus <- Corpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
#dataframe <- data.frame(text=unlist(sapply(corpus,'[',"content")))
#Reading dictionary file
dict <- scan("dictionary.txt", what='character',sep='\n')
#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))
tdm_doc <- DocumentTermMatrix(corpus,control=list(stopwords = dict, tokenize=BigramTokenizer))
tdm_dic <- DocumentTermMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))
As suggested in other answers on SO, I have tried installing the SnowballC package and the other listed ideas, but I still get the same error. Can anyone help me with this? Thanks in advance.
I had the same problem when building my DocumentTermMatrix, and I solved it by removing the following command:
corpus <- tm_map(corpus, PlainTextDocument)
I had a similar error when cleaning a corpus. To fix it, I added the following after the offending line of code. Some of the tm_map functions do not return a corpus...
corpus <- Corpus(VectorSource(corpus))
For me the problem arose after stem completion. I would suggest trying to make a tdm after each tm_map call. That will tell you which cleaning step is causing the problem.
Best of luck!
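The "check after each step" idea can be sketched generically in base R. Everything named here (`first_failing_step`, `steps`, `check`) is hypothetical; with tm, each step would be a function such as function(x) tm_map(x, removePunctuation) and the check would be a call to DocumentTermMatrix.

```r
# Apply each cleaning step in order and run `check` after every step.
# Returns the name of the first step after which `check` raises an error,
# or NA if the check passes after every step.
first_failing_step <- function(x, steps, check) {
  for (name in names(steps)) {
    x <- steps[[name]](x)
    ok <- tryCatch({ check(x); TRUE }, error = function(e) FALSE)
    if (!ok) return(name)
  }
  NA_character_
}

# Toy demonstration on a character vector (hypothetical steps and check)
steps <- list(
  lower   = tolower,
  to_list = as.list,          # changes the object type, like a bad tm_map step
  trim    = function(x) x
)
check <- function(x) stopifnot(is.character(x))

culprit <- first_failing_step(c("Some Text", "More Text"), steps, check)
print(culprit)
```

The function stops at the first step that breaks the check, so later steps are not blamed for damage done earlier.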

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create a matrix. The error is:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
For example, here is code from Jon Starkweather's text mining example. Apologies in advance for such long code, but it does make a reproducible example. Please note that the error comes at the end, in the TermDocumentMatrix call.
#Read in data
policy.HTML.page <- readLines("http://policy.unt.edu/policy/3-5")
#Obtain text and remove mark-up
policy.HTML.page[186:202]
id.1 <- 3 + which(policy.HTML.page == " TOTAL UNIVERSITY </div>")
id.2 <- id.1 + 5
text.data <- policy.HTML.page[id.1:id.2]
td.1 <- gsub(pattern = "<p>", replacement = "", x = text.data,
ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
td.2 <- gsub(pattern = "</p>", replacement = "", x = td.1, ignore.case = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
text.d <- td.2; rm(text.data, td.1, td.2)
#Create corpus and clean
library(tm)
library(SnowballC)
txt <- VectorSource(text.d); rm(text.d)
txt.corpus <- Corpus(txt)
txt.corpus <- tm_map(txt.corpus, tolower)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace); #inspect(docs[1])
txt.corpus <- tm_map(txt.corpus, stemDocument)
# NOTE ERROR WHEN CREATING TDM
tdm <- TermDocumentMatrix(txt.corpus)
The link provided by jazzurro points to the solution. The following line of code
txt.corpus <- tm_map(txt.corpus, tolower)
must be changed to
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
There are 2 reasons for this issue in tm v0.6.
1. If you are doing term-level transformations like tolower etc., tm_map returns a character vector instead of a PlainTextDocument.
   Solution: Call tolower through content_transformer, or call tm_map(corpus, PlainTextDocument) immediately after tolower.
2. If the SnowballC package is not installed and you are trying to stem the documents, this can also occur.
   Solution: install.packages('SnowballC')
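The first reason can be illustrated without tm. The sketch below uses a toy document object and a hypothetical `toy_content_transformer` written in the spirit of content_transformer; it is not tm's actual class or implementation, only a way to show why wrapping the function preserves the document.

```r
# Toy document: a list with content and metadata (a stand-in, not tm's class)
doc <- structure(list(content = "Hello World", meta = list(id = 1)),
                 class = "toy_doc")

# Applying tolower directly returns a character vector, not a document
bare <- tolower(doc)

# Hypothetical wrapper: transform only the content, return the document
toy_content_transformer <- function(FUN) {
  function(x, ...) {
    x$content <- FUN(x$content, ...)
    x
  }
}

wrapped <- toy_content_transformer(tolower)(doc)
print(is.character(bare))
print(class(wrapped))
print(wrapped$content)
```

Later steps that expect a document with metadata can still work on `wrapped`, whereas `bare` no longer carries that structure.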
There is no need to apply content_transformer.
Create the corpus in this way:
trainData_corpus <- Corpus(VectorSource(trainData$Comments))
Try it.
