I'm creating a correlated topic model from public review data and getting a rather odd error.
When I call terms(ctm1, 5) on my CTM, I get back the names of the documents rather than the top 5 terms for each topic.
In more detail, I ran:
library(topicmodels)
library(data.table)
library(tm)
a <-Corpus(DirSource("~/text", encoding="UTF-8"), readerControl =
list(language="lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
adtm <-TermDocumentMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
ctm1 <- CTM(adtm, 30, method = "VEM", control = NULL, model = NULL)
terms(ctm1, 5)
which returned
terms(ctm1)
Topic 1 "cmnt656661.txt"
(etc.)
We cannot know for sure because you did not provide data, but it is likely that you did not import the files correctly. See ?DirSource (my emphasis):
directory : A character vector of full path names; the default
corresponds to the working directory getwd().
In your case, it seems like you should do something like this:
a <- Corpus(DirSource(list.files("~/text", full.names = TRUE)))
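One further thing worth ruling out, offered as a guess rather than a diagnosis: the code in the question passes a TermDocumentMatrix to CTM(), whereas topicmodels expects documents in rows (a DocumentTermMatrix). With the matrix the other way round, the columns, which the model treats as its vocabulary, are the file names. The base-R sketch below uses a made-up count matrix (the terms and file names are invented) to show the orientation issue without fitting a model.

```r
# Toy term-document counts: rows are terms, columns are documents.
# All names and counts here are invented for illustration.
tdm <- matrix(
  c(2L, 0L, 1L,
    1L, 3L, 0L),
  nrow = 3,
  dimnames = list(Terms = c("good", "bad", "price"),
                  Docs  = c("cmnt1.txt", "cmnt2.txt"))
)

# A topic model reads the columns as its vocabulary.
vocab_if_tdm <- colnames(tdm)   # the file names
dtm <- t(tdm)                   # documents in rows, terms in columns
vocab_if_dtm <- colnames(dtm)   # the actual terms

print(vocab_if_tdm)
print(vocab_if_dtm)
```

In the question's pipeline, the corresponding change would be to build the matrix with DocumentTermMatrix(a) before calling CTM().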
I am trying out some analytics in R. I have downloaded 10 TED talk transcripts and saved them as text files. I am struggling with removeWords and stopwords:
source("Project_Functions.R")
getwd()
# ====
# Load the PDF data
# pdf.loc <- file.path("data") # folder "PDF Files" with PDFs
# myFiles <- normalizePath(list.files(path = pdf.loc, pattern = "pdf", full.names = TRUE)) # Get the path (chr-vector) of PDF file names
# # Extract content from PDF files
# Docs.corpus <- Corpus(URISource(myFiles), readerControl = list(reader = readPDF(engine = "xpdf")))
# ====
# Load TED Talks Data
myFiles <- normalizePath(list.files(pattern = "txt", full.names = TRUE))
Docs.corpus <- Corpus(URISource(myFiles), readerControl=list(reader=readPlain))
length(Docs.corpus)
#Docs.corpus <-tm_map(Docs.corpus, tolower)
Docs.corpus <-tm_map(Docs.corpus, removeWords, stopwords("english"))
Docs.corpus <-tm_map(Docs.corpus, removePunctuation)
Docs.corpus <-tm_map(Docs.corpus, removeNumbers)
Docs.corpus <-tm_map(Docs.corpus, stripWhitespace)
However, when I run:
dtm <-DocumentTermMatrix(Docs.corpus)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
It still shows some words that I don't want, such as "and" and "just".
I have tried researching this and using [[:punct:]] and other methods, but they don't work.
Please help, thank you.
I found out why: the order of the tm_map calls matters a lot. For example, when I ran tolower and then removeNumbers on the next line, the tolower step somehow no longer seemed to take effect. I fixed it as follows; it might not be the most efficient way, but it works:
Docs.corpus.temp <-tm_map(Docs.corpus, removePunctuation)
Docs.corpus.temp1 <-tm_map(Docs.corpus.temp, removeNumbers)
Docs.corpus.temp2 <-tm_map(Docs.corpus.temp1, tolower)
Docs.corpus.temp3 <-tm_map(Docs.corpus.temp2,PlainTextDocument)
Docs.corpus.temp4 <-tm_map(Docs.corpus.temp3, stripWhitespace)
Docs.corpus.temp5 <-tm_map(Docs.corpus.temp4, removeWords, stopwords("english"))
#frequency
dtm <-DocumentTermMatrix(Docs.corpus.temp5)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
ord<- order(freq)
freq
That fixes my problem; now all the tm_map preprocessing code works.
If anyone has a better idea, please let me know. Thank you!
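A likely explanation for why the order matters, sketched here under assumptions: tm's removeWords matches case-sensitively and stopwords("english") is all lower-case, so stop words removed before lower-casing leave capitalised forms such as "And" behind, and any later lower-casing turns them back into "and". The helper drop_words below is a hypothetical base-R stand-in that approximates this behaviour; it is not the tm implementation, and the sample sentence is invented.

```r
# Hypothetical stand-in for a stop-word remover: deletes whole-word,
# case-sensitive matches. It only approximates tm's removeWords.
drop_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

stops <- c("and", "just")   # lower-case, like stopwords("english")
text  <- "And then we just talked. Just like that"

# Stop words removed first: "And" / "Just" survive the removal,
# and lower-casing afterwards turns them back into stop words.
wrong_order <- tolower(drop_words(text, stops))

# Lower-case first, then remove: every variant is caught.
right_order <- drop_words(tolower(text), stops)

print(wrong_order)
print(right_order)
```

This is consistent with the working version above, where tolower runs before removeWords.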
I need to train a model that performs multilabel, multiclass categorization on text data.
Currently I'm using the mlr package in R, but I could not proceed because of an error I get before training a model.
More specifically, I'm stuck here:
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
and I get this error:
Error in makeMultilabelTask(id = "classif", data = termsDf, target = target) :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
I followed this example:
Multi-label text classification using mlr package in R
Here is the complete code I'm using so far:
tm <- read.csv("translate_text_V02.csv", header = TRUE,
stringsAsFactors = FALSE, na.strings = c("", "NA"))
process <- tm[, c("label", "text")]
process <- na.omit(process)
docs <- Corpus(VectorSource(process$text))
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, mystopwords)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, removeWords, stopwords("german"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument, language = "english")
return(corpus)
}
clean_corp <- clean_corpus(docs)
terms <-DocumentTermMatrix(clean_corp)
m <- as.matrix(terms)
m <- cbind(m,process$label)
termsDf <- as.data.frame(m)
target <- unique(termsDf[,2628]) %>% as.character() %>% sort()
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
I created the data frame from the document-term matrix and added the label class, but I'm stuck afterwards. How can I proceed with the machine-learning part?
Questions:
How should I proceed after creating the DocumentTermMatrix?
How can I apply the random forest algorithm to this particular dataset?
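The assertion error points at the column names: terms taken straight from a document-term matrix can be reserved words or contain characters that are not allowed in R variable names. A minimal base-R sketch of one way to prepare the data frame follows, assuming toy data (the terms, labels, and the label_ prefix are invented). It also builds one logical column per label, which is the target format mlr's multilabel task expects; it does not cover model training.

```r
# Toy document-term counts with column names that are not valid R names.
# Terms and labels are invented for illustration.
m <- matrix(c(1, 0, 2,
              0, 1, 1),
            nrow = 3,
            dimnames = list(NULL, c("for", "e-mail")))
labels <- c("billing", "support", "billing")

# Build the data frame from the numeric matrix first, so the counts stay
# numeric (cbind()-ing a character label onto the matrix turns it into text).
termsDf <- as.data.frame(m)
names(termsDf) <- make.names(names(termsDf), unique = TRUE)

# One logical column per label; the prefix avoids clashes with term columns.
label_values <- sort(unique(labels))
target <- make.names(paste0("label_", label_values))
for (i in seq_along(target)) {
  termsDf[[target[i]]] <- labels == label_values[i]
}

str(termsDf)
```

With the data in this shape, target is a character vector naming the logical columns, which is what the target argument of makeMultilabelTask refers to.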
I am running the following code and receiving this error:
Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,
: java.lang.NullPointerException
setwd("C:\\Users\\jbarr\\Desktop\\test")
library (tm); library (wordcloud);library (RWeka); library (tau);library(xlsx);
Comment <- read.csv("testfile.csv",stringsAsFactors=FALSE)
str(Comment)
review_source <- VectorSource(Comment)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords,stopwords(kind = "english"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c("member", "advise", "inform", "informed", "caller", "call","provided", "advised"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
wordfreq <- colSums(dtm2)
wordfreq <- sort(wordfreq, decreasing=TRUE)
head(wordfreq, n=100)
wfreq <- head(wordfreq, 500)
set.seed(142)
words <- names(wfreq)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(words[1:100], wordfreq[1:100], rot.per=0.35, scale=c(2.7, .4), colors=dark2, random.order=FALSE)
write.xlsx(wfreq, "C:\\Users\\jbarr\\Desktop\\test")
The interesting part is that I have run this code on multiple files, and only specific ones produce the error.
Sanmeet is right: it's a problem with NAs in your data frame.
Just before your line review_source <- VectorSource(Comment),
insert the line below:
Comment[is.na(Comment)] <- "NULLVALUEENTERED"
This will change all of your NA values to the phrase NULLVALUEENTERED (feel free to change that). No more NAs, and the code should run fine.
You are getting the tokenizer error because of NA values in Comment, which read.csv returns as a data frame.
Comment <- read.csv("testfile.csv", stringsAsFactors = FALSE)
str(Comment)
nrow(Comment)
# Keep only the rows that contain no NA
Comment <- Comment[complete.cases(Comment), , drop = FALSE]
nrow(Comment)
Or you can use is.na, as below:
Comment <- Comment[rowSums(is.na(Comment)) == 0, , drop = FALSE]
Now apply the preprocessing steps, create the corpus, etc.
Hope this helps.
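As a quick check of the NA-removal step, here is a small base-R sketch on a toy data frame. The column name text and its values are invented stand-ins for whatever testfile.csv contains.

```r
# Toy stand-in for read.csv("testfile.csv", stringsAsFactors = FALSE);
# the column name and values are invented.
Comment <- data.frame(text = c("member called", NA, "advised caller"),
                      stringsAsFactors = FALSE)

n_before <- nrow(Comment)
Comment  <- Comment[complete.cases(Comment), , drop = FALSE]
n_after  <- nrow(Comment)

# Pull the text out as a plain character vector for the corpus step.
comment_text <- Comment$text

print(c(before = n_before, after = n_after))
```

Passing the character vector (here comment_text) to VectorSource, rather than the whole data frame, may also be closer to what was intended, since each element then becomes one document.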
A suggestion: I get this error when reading an Excel (.xlsx) file using:
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1", startRow = 1, endRow = 0)
It appears that the value for endRow should be NULL or a valid row number. But
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1")
works fine. So you might want to check your argument values and that each argument is matched to the intended parameter.
It seems there are NAs in your data frame. Use is.na() to find them and remove those rows, then try running the code again. It should work.
I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words come out as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a character-encoding problem? (By the way, all the documents are in English.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}
docs<- Corpus(URISource(uri, mode = ""),
readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
We can visualize the result of the most frequently used words in your pdf file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, as well as "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create a matrix. The error is:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
For example, here is code from Jon Starkweather's text mining example. Apologies in advance for such long code, but this does produce a reproducible example. Please note that the error comes at the end with the {tdm} function.
#Read in data
policy.HTML.page <- readLines("http://policy.unt.edu/policy/3-5")
#Obtain text and remove mark-up
policy.HTML.page[186:202]
id.1 <- 3 + which(policy.HTML.page == " TOTAL UNIVERSITY </div>")
id.2 <- id.1 + 5
text.data <- policy.HTML.page[id.1:id.2]
td.1 <- gsub(pattern = "<p>", replacement = "", x = text.data,
ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
td.2 <- gsub(pattern = "</p>", replacement = "", x = td.1, ignore.case = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
text.d <- td.2; rm(text.data, td.1, td.2)
#Create corpus and clean
library(tm)
library(SnowballC)
txt <- VectorSource(text.d); rm(text.d)
txt.corpus <- Corpus(txt)
txt.corpus <- tm_map(txt.corpus, tolower)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace); #inspect(docs[1])
txt.corpus <- tm_map(txt.corpus, stemDocument)
# NOTE ERROR WHEN CREATING TDM
tdm <- TermDocumentMatrix(txt.corpus)
The link provided by jazzurro points to the solution. The following line of code
txt.corpus <- tm_map(txt.corpus, tolower)
must be changed to
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
There are two causes of this issue in tm v0.6.
1. If you apply term-level transformations such as tolower, tm_map returns a character vector instead of a PlainTextDocument.
Solution: call tolower through content_transformer, or call tm_map(corpus, PlainTextDocument) immediately after tolower.
2. This can also occur if you try to stem the documents while the SnowballC package is not installed.
Solution: install.packages('SnowballC')
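To illustrate the first fix, here is a minimal sketch assuming tm 0.6 or later is installed. The two sample sentences are invented; the point is only that wrapping tolower in content_transformer keeps each document a PlainTextDocument, so building the matrix works afterwards.

```r
library(tm)

# Two invented sample documents.
txt.corpus <- VCorpus(VectorSource(c("The Quick Brown Fox",
                                     "Jumps over 2 lazy dogs")))

# Wrap base functions such as tolower so that only the content is changed
# and the document object (with its metadata) is kept.
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
txt.corpus <- tm_map(txt.corpus, removeNumbers)

# Each element is still a PlainTextDocument, so this no longer fails.
tdm <- TermDocumentMatrix(txt.corpus)
print(dim(tdm))
```

Transformations that ship with tm, such as removeNumbers or removePunctuation, can be passed to tm_map directly; the wrapper is needed for plain character functions like tolower.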
There is no need to apply content_transformer.
Create the corpus this way:
trainData_corpus <- Corpus(VectorSource(trainData$Comments))
Try it.