I just tried out this quite interesting YouTube R tutorial about building a text mining machine: http://www.youtube.com/watch?v=j1V2McKbkLo
So far, my complete code is:
# Tutorial: http://www.youtube.com/watch?v=j1V2McKbkLo
# init
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# set options
options(stringsAsFactors = FALSE)
# set parameters
candidates <- c("Obama", "Romney")
pathname <- "C:/Users/***" # actual path redacted for anonymity
# clean text
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
# build TDM
generateTDM <- function(cand, path){
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm = lapply(candidates, generateTDM, path = pathname)
When I try to run this, I constantly get the following error message:
tdm = lapply(candidates, generateTDM, path = pathname)
Error in DirSource(directory = s.dir, encoding = "ANSI") :
empty directory
and I just can't figure out where the error is. I tried several ways of writing the directory path, but none of them works. I am unsure whether the problem is that RStudio cannot access locally saved data or whether it lies in the code itself. I would be absolutely happy if anybody could help me or give any hints.
Thank you!
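On Windows, path components are conventionally separated by \ (not /), and in R strings you need to type "\\" to get a single \. Thus, you can (hopefully) solve your problem by defining pathname as follows:
pathname <- "C:\\Users\\***"
(of course, writing the correct path instead of the stars). Note that R also accepts forward slashes on Windows, so if the error persists, check that the folder exists and actually contains files.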
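Before building the corpus, it may help to check what R actually sees at that location. The following is a minimal sketch in base R; it uses a temporary folder as a stand-in for the real (redacted) speeches folder, and the helper name check_speech_dir is hypothetical.

```r
# Stand-in for the real folder: a temporary directory with one candidate subfolder.
pathname <- file.path(tempdir(), "speeches")
dir.create(file.path(pathname, "Obama"), recursive = TRUE, showWarnings = FALSE)
writeLines("Sample speech text.", file.path(pathname, "Obama", "speech1.txt"))

# Hypothetical helper: report whether a candidate folder exists and what it contains.
check_speech_dir <- function(cand, path) {
  s.dir <- file.path(path, cand)
  list(dir = s.dir,
       exists = dir.exists(s.dir),
       files = list.files(s.dir))
}

# "Romney" has deliberately not been created, to show what a missing folder looks like.
info <- lapply(c("Obama", "Romney"), check_speech_dir, path = pathname)
str(info)
```

A folder that is reported as missing, or that lists no files, gives DirSource nothing to read, which would be consistent with the "empty directory" message.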
I am trying out some text analytics in R. I have downloaded 10 TED talks and saved them as text files, but I am struggling to remove stop words with removeWords and stopwords:
source("Project_Functions.R")
getwd()
# ====
# Load the PDF data
# pdf.loc <- file.path("data") # folder "PDF Files" with PDFs
# myFiles <- normalizePath(list.files(path = pdf.loc, pattern = "pdf", full.names = TRUE)) # Get the path (chr-vector) of PDF file names
# # Extract content from PDF files
# Docs.corpus <- Corpus(URISource(myFiles), readerControl = list(reader = readPDF(engine = "xpdf")))
# ====
# Load TED Talks Data
myFiles <- normalizePath(list.files(pattern = "txt", full.names = TRUE))
Docs.corpus <- Corpus(URISource(myFiles), readerControl=list(reader=readPlain))
length(Docs.corpus)
#Docs.corpus <-tm_map(Docs.corpus, tolower)
Docs.corpus <-tm_map(Docs.corpus, removeWords, stopwords("english"))
Docs.corpus <-tm_map(Docs.corpus, removePunctuation)
Docs.corpus <-tm_map(Docs.corpus, removeNumbers)
Docs.corpus <-tm_map(Docs.corpus, stripWhitespace)
However, when I run:
dtm <-DocumentTermMatrix(Docs.corpus)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
The result still contains words that I don't want, such as "and" and "just".
I have tried researching this and using [[:punct:]] and other methods, but they don't work.
Please help, thank you.
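I found out why: the order of the tm_map calls matters a lot. stopwords("english") contains only lowercase words, so removeWords has to run after tolower; otherwise capitalized forms such as "And" survive. In my original code the tolower step was commented out and removeWords ran first. I also call PlainTextDocument right after tolower so that the later steps still receive proper documents. It might not be the most efficient way, but it works: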
Docs.corpus.temp <-tm_map(Docs.corpus, removePunctuation)
Docs.corpus.temp1 <-tm_map(Docs.corpus.temp, removeNumbers)
Docs.corpus.temp2 <-tm_map(Docs.corpus.temp1, tolower)
Docs.corpus.temp3 <-tm_map(Docs.corpus.temp2,PlainTextDocument)
Docs.corpus.temp4 <-tm_map(Docs.corpus.temp3, stripWhitespace)
Docs.corpus.temp5 <-tm_map(Docs.corpus.temp4, removeWords, stopwords("english"))
#frequency
dtm <-DocumentTermMatrix(Docs.corpus.temp5)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
ord<- order(freq)
freq
That fixes my problem; now all the tm_map preprocessing steps work.
If anyone has a better idea, please let me know. Thank you!
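As a hedged sketch of the same idea in a more compact form: with tm 0.6 or later, wrapping tolower in content_transformer keeps the documents intact, and running it before removeWords lets the lowercase stop word list match. The three sample sentences and the extra stop word "just" are assumptions for illustration, not part of the original data.

```r
library(tm)

# Toy documents standing in for the TED talk transcripts (assumed data).
talks <- c("And THE idea is just 1 great idea.",
           "Just think: the idea and the talk!",
           "The talk, and 2 more ideas.")
docs <- VCorpus(VectorSource(talks))

docs <- tm_map(docs, content_transformer(tolower))  # lowercase first
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# Extend the built-in list with any extra words to drop ("just" is an assumed addition).
docs <- tm_map(docs, removeWords, c(stopwords("english"), "just"))
docs <- tm_map(docs, stripWhitespace)

dtm <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq
```

Reassigning to the same variable at each step avoids the chain of temporary corpora, and any further unwanted words can be appended to the vector passed to removeWords.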
I'm creating a correlated topic model from public review data and getting a rather odd error.
When I call terms(ctm1, 5) on my CTM, I get back the names of the documents rather than the top 5 terms for each topic.
In more detail I ran,
library(topicmodels)
library(data.table)
library(tm)
a <-Corpus(DirSource("~/text", encoding="UTF-8"), readerControl =
list(language="lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
adtm <-TermDocumentMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
ctm1 <- CTM(adtm, 30, method = "VEM", control = NULL, model = NULL)
terms(ctm1, 5)
which returned
terms(ctm1)
Topic 1 "cmnt656661.txt"
(etc.)
We cannot know for sure because you did not provide data; but it is likely that you did not import the files correctly. See ?DirSource (my emphasis):
directory : A character vector of full path names; the default
corresponds to the working directory getwd().
In your case, it seems like you should do something like this:
a <- Corpus(DirSource(list.files("~/text", full.names = TRUE)))
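Another thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the topicmodels functions such as CTM expect a DocumentTermMatrix (documents in rows), while the code in the question passes a TermDocumentMatrix (terms in rows). If the matrix is transposed, document names would end up where the terms are expected. The sketch below uses only tm and made-up review texts to show the two orientations.

```r
library(tm)

# Made-up review texts standing in for the real files.
reviews <- c("good product fast delivery",
             "bad product slow delivery",
             "good service fast reply")
a <- VCorpus(VectorSource(reviews))

tdm <- TermDocumentMatrix(a)   # rows = terms, columns = documents
dtm <- DocumentTermMatrix(a)   # rows = documents, columns = terms

dim(tdm)
dim(dtm)
head(Terms(dtm))
# A topic model call would then take the document-term form, e.g. CTM(dtm, k = 30).
```

If this is the cause, building adtm with DocumentTermMatrix(a) instead of TermDocumentMatrix(a) should make terms(ctm1, 5) return words rather than file names.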
I am new to the tm package in R. I am trying to create a document-term matrix with the tm_map function, but apparently the function passed to tm_map(Corpus, function, lazy=TRUE) is not applied to the corpus. Concretely, the documents are not converted to lower case. RStudio does not show any errors or warnings.
Did I mess up anything here? Could this be some encoding issue?
library(tm)
setwd("...")
filenames <- list.files(getwd(), pattern="*.txt")
files <- lapply(filenames, readLines)
docs <- Corpus(VectorSource(files))
writeLines(as.character(docs[[30]]))
docs <- tm_map(docs, function(x) iconv(enc2utf8(x$content), sub = ""), lazy=TRUE)
#to lower case
docs <- tm_map(docs, content_transformer(tolower), lazy=TRUE)
writeLines(as.character(docs[[30]]))
Thank you for any advice!
This is a simple fix. Move your code for converting to lower case before iconv(...).
This works:
library(tm)
setwd("")
# Read in Files
filenames <- list.files(getwd(), pattern="*.txt")
files <- lapply(filenames, readLines)
docs <- Corpus(VectorSource(files))
writeLines(as.character(docs[[30]]))
# Lower Case
docs <- tm_map(docs, content_transformer(tolower), lazy=TRUE)
# Convert
docs <- tm_map(docs, function(x) iconv(enc2utf8(x$content), sub = ""))
writeLines(as.character(docs[[30]]))
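A possible alternative, sketched under the assumption of tm 0.6 or later: wrap both steps in content_transformer so that each call returns a document rather than a bare character vector, which should make the order of the two steps less critical. The sample strings and the helper name to_utf8 are illustrative only.

```r
library(tm)

# Toy documents standing in for the text files read with readLines.
docs <- VCorpus(VectorSource(c("Hello World", "SOME Mixed Case TEXT")))

# Hypothetical helper: re-encode to UTF-8, dropping bytes that cannot be converted.
to_utf8 <- content_transformer(function(x) iconv(x, from = "", to = "UTF-8", sub = ""))

docs <- tm_map(docs, to_utf8)
docs <- tm_map(docs, content_transformer(tolower))

writeLines(as.character(docs[[1]]))
```

Leaving out lazy=TRUE means each transformation is applied immediately, which makes it easier to inspect the intermediate results.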
I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create a matrix. The error is:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
For example, here is code from Jon Starkweather's text mining example. Apologies in advance for such long code, but this does produce a reproducible example. Please note that the error comes at the end with the {tdm} function.
#Read in data
policy.HTML.page <- readLines("http://policy.unt.edu/policy/3-5")
#Obtain text and remove mark-up
policy.HTML.page[186:202]
id.1 <- 3 + which(policy.HTML.page == " TOTAL UNIVERSITY </div>")
id.2 <- id.1 + 5
text.data <- policy.HTML.page[id.1:id.2]
td.1 <- gsub(pattern = "<p>", replacement = "", x = text.data,
ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
td.2 <- gsub(pattern = "</p>", replacement = "", x = td.1, ignore.case = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
text.d <- td.2; rm(text.data, td.1, td.2)
#Create corpus and clean
library(tm)
library(SnowballC)
txt <- VectorSource(text.d); rm(text.d)
txt.corpus <- Corpus(txt)
txt.corpus <- tm_map(txt.corpus, tolower)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace); #inspect(docs[1])
txt.corpus <- tm_map(txt.corpus, stemDocument)
# NOTE ERROR WHEN CREATING TDM
tdm <- TermDocumentMatrix(txt.corpus)
The link provided by jazzurro points to the solution. The following line of code
txt.corpus <- tm_map(txt.corpus, tolower)
must be changed to
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
There are two possible reasons for this issue in tm v0.6:
1. If you apply a term-level transformation such as tolower directly, tm_map returns a character vector instead of a PlainTextDocument.
Solution: call tolower through content_transformer, or call tm_map(corpus, PlainTextDocument) immediately after tolower.
2. The error can also occur if you try to stem the documents while the SnowballC package is not installed.
Solution: install.packages('SnowballC')
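The first point can be checked with a small sketch (toy sentences, tm 0.6 or later assumed) that applies tolower through content_transformer and then confirms that the elements are still PlainTextDocument objects before the matrix is built.

```r
library(tm)

# Toy sentences standing in for the scraped policy text.
txt.corpus <- VCorpus(VectorSource(c("The University POLICY applies to all staff.",
                                     "Policy updates are published every year.")))

txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace)

# Each element should still be a document, not a bare character vector.
class(txt.corpus[[1]])

tdm <- TermDocumentMatrix(txt.corpus)
inspect(tdm)
```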
There is no need to apply content_transformer.
Create the corpus this way:
trainData_corpus <- Corpus(VectorSource(trainData$Comments))
Try it.
I have been following the document classification tutorial on YouTube using R, and it's really interesting. However, when I run the first part of the script, I keep getting this error:
Error in FUN(c("obama", "romney")[[1L]], ...) : could not find function "corpus"
I really don't know why that is, and I am hoping someone can help me figure it out.
This is the script:
#init
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# set options
options(stringsAsFactors = FALSE)
#set parameters
candidates <- c("obama","romney")
pathname <- "C:\\Users\\admin\\Documents\\speeches"
#clean text
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
#Build TDM
generateTDM <- function(cand, path){
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
Your path name should be:
pathname <- "C:/Users/admin/Documents/speeches"
Note: the path uses forward slashes.
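Separately from the path, the reported message says that R cannot find a function called corpus. R is case sensitive, and the tm constructor is spelled Corpus (likewise stopwords, all lowercase). The following is a hedged sketch of the tutorial functions with those spellings, run against a temporary folder with made-up files instead of the real speeches. VCorpus, content_transformer and the omitted encoding argument are my substitutions, not part of the tutorial.

```r
library(tm)

candidates <- c("obama", "romney")

# Temporary stand-in for the speeches folder, with two made-up files per candidate.
pathname <- file.path(tempdir(), "speeches_demo")
for (cand in candidates) {
  dir.create(file.path(pathname, cand), recursive = TRUE, showWarnings = FALSE)
  writeLines("Jobs and the economy matter to every family.",
             file.path(pathname, cand, "speech1.txt"))
  writeLines("We will create jobs and grow the economy.",
             file.path(pathname, cand, "speech2.txt"))
}

cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))  # lowercase "w"
  corpus.tmp
}

generateTDM <- function(cand, path) {
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- VCorpus(DirSource(directory = s.dir))  # capital "C"; encoding left at its default
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  list(name = cand, tdm = s.tdm)
}

tdm <- lapply(candidates, generateTDM, path = pathname)
str(tdm, max.level = 2)
```

Both forward slashes and doubled backslashes are accepted in R paths on Windows, so either spelling of pathname should work once the function names are corrected.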