I have been following a document classification tutorial on YouTube using R, and it's really interesting, but when I try to run the first part of the script I keep getting this error: Error in FUN(c("obama", "romney")[[1L]], ...) : could not find function "corpus". I really don't know why that is, and I am hoping someone can help me figure it out.
This is the script:
#init
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# set options
options(stringsAsFactors = FALSE)
#set parameters
candidates <- c("obama","romney")
pathname <- "C:\\Users\\admin\\Documents\\speeches"
#clean text
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
#Build TDM
generateTDM <- function(cand, path){
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
Your path name should be
pathname <- "C:/Users/admin/Documents/speeches"
Note the forward slashes in the pathname.
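That said, the error message itself (could not find function "corpus") points at the s.cor line rather than the path: R is case-sensitive, and the constructor in the tm package is Corpus, not corpus. A minimal fix for that one line, keeping everything else the same:
s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))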
I am trying to mess around with some R analytics. I have downloaded 10 TED Talks and saved them as text files. I am struggling with using removeWords and stopwords.
source("Project_Functions.R")
getwd()
# ====
# Load the PDF data
# pdf.loc <- file.path("data") # folder "PDF Files" with PDFs
# myFiles <- normalizePath(list.files(path = pdf.loc, pattern = "pdf", full.names = TRUE)) # Get the path (chr-vector) of PDF file names
# # Extract content from PDF files
# Docs.corpus <- Corpus(URISource(myFiles), readerControl = list(reader = readPDF(engine = "xpdf")))
# ====
# Load TED Talks Data
myFiles <- normalizePath(list.files(pattern = "txt", full.names = TRUE))
Docs.corpus <- Corpus(URISource(myFiles), readerControl = list(reader = readPlain))
length(Docs.corpus)
#Docs.corpus <- tm_map(Docs.corpus, tolower)
Docs.corpus <- tm_map(Docs.corpus, removeWords, stopwords("english"))
Docs.corpus <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus <- tm_map(Docs.corpus, removeNumbers)
Docs.corpus <- tm_map(Docs.corpus, stripWhitespace)
However, when I run:
dtm <- DocumentTermMatrix(Docs.corpus)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
It still shows some words that I don't want, like "and", "just", etc.
I have tried researching and using [[:punct:]] and other methods, but they don't work.
Please help, thank you
I found out why: the order of the tm_map calls matters a lot. For example, if you run tolower and then run removeNumbers on the next line, it somehow no longer applies tolower but switches to removeNumbers. I fixed it; it might not be the most efficient way, but it works.
Docs.corpus.temp <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus.temp1 <- tm_map(Docs.corpus.temp, removeNumbers)
Docs.corpus.temp2 <- tm_map(Docs.corpus.temp1, tolower)
Docs.corpus.temp3 <- tm_map(Docs.corpus.temp2, PlainTextDocument)
Docs.corpus.temp4 <- tm_map(Docs.corpus.temp3, stripWhitespace)
Docs.corpus.temp5 <- tm_map(Docs.corpus.temp4, removeWords, stopwords("english"))
#frequency
dtm <- DocumentTermMatrix(Docs.corpus.temp5)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
ord <- order(freq)
freq
That fixed my problem; now all the tm_map preprocessing code works.
If anyone has a better idea, please let me know, thank you!
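For what it's worth, a likely explanation (an assumption on my part, based on tm 0.6+ behaviour) is that tolower is a base R function rather than a tm transformation, so it returns bare character vectors and quietly breaks the corpus for the following steps; wrapping it in content_transformer() keeps the documents intact and avoids the PlainTextDocument workaround. A sketch along those lines, using the same Docs.corpus as above:
Docs.corpus.tmp <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus.tmp <- tm_map(Docs.corpus.tmp, removeNumbers)
Docs.corpus.tmp <- tm_map(Docs.corpus.tmp, content_transformer(tolower))  # lowercase while keeping PlainTextDocument objects
Docs.corpus.tmp <- tm_map(Docs.corpus.tmp, removeWords, stopwords("english"))
Docs.corpus.tmp <- tm_map(Docs.corpus.tmp, stripWhitespace)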
I'm creating a correlated topic model from public review data and getting a rather odd error.
When I call terms(ctm1, 5) on my CTM, I get back the names of the documents rather than the top 5 terms for each topic.
In more detail I ran,
library(topicmodels)
library(data.table)
library(tm)
a <- Corpus(DirSource("~/text", encoding = "UTF-8"), readerControl = list(language = "lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
adtm <- TermDocumentMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
ctm1 <- CTM(adtm, 30, method = "VEM", control = NULL, model = NULL)
terms(ctm1, 5)
which returned
terms(ctm1)
Topic 1 "cmnt656661.txt"
(etc.)
We cannot know for sure because you did not provide data; but it is likely that you did not import the files correctly. See ?DirSource (my emphasis):
directory : A character vector of full path names; the default
corresponds to the working directory getwd().
In your case, it seems like you should do something like this:
a <- Corpus(DirSource(list.files("~/text", full.names = TRUE)))
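One more thing worth checking, although it is a guess without your data: CTM in topicmodels expects documents as rows, but the code above builds a TermDocumentMatrix, whose rows are terms. With the matrix transposed like that, the model treats documents as if they were terms, which would explain terms(ctm1, 5) returning file names. A sketch of that fix:
adtm <- DocumentTermMatrix(a)  # documents as rows, terms as columns
adtm <- removeSparseTerms(adtm, 0.75)
ctm1 <- CTM(adtm, 30, method = "VEM", control = NULL, model = NULL)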
I have created a corpus and processed it using the tm package; a snippet is below.
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, stemDocument)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  return(corpus.tmp)
}
myCorpus <- Corpus(VectorSource(Data$body), readerControl = list(reader = readPlain))
cln.corpus <- cleanCorpus(myCorpus)
Now I am using the MPQA lexicon to get the total number of positive words and negative words in each document of the corpus.
So I have the lists:
pos.words <- lexicon$word[lexicon$Polarity=="positive"]
neg.words <- lexicon$word[lexicon$Polarity=="negative"]
How should I go about comparing the content of each document with the positive and negative lists to get the counts of both per document?
I checked other posts on tm dictionaries, but it looks like that feature has been withdrawn.
For example
library(tm)
data("crude")
myCorpus <- crude[1:2]
pos.words <- c("advantag", "easy", "cut")
neg.words <- c("problem", "weak", "uncertain")
weightSenti <- structure(function (m) {
  m$v <- rep(1, length(m$v))
  neg <- rownames(m) %in% neg.words
  m$v[neg[m$i]] <- -1  # entries whose term (row) is a negative word count as -1
  attr(m, "weighting") <- c("binarySenti", "binSenti")
  m
}, class = c("WeightFunction", "function"), name = "binarySenti", acronym = "binSenti")
tdm <- TermDocumentMatrix(myCorpus, control = list(weighting = weightSenti, dictionary = c(pos.words, neg.words)))
colSums(as.matrix(tdm))
# 127 144
#   2  -2
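If the custom weighting feels heavy, a plainer sketch (assuming the same cln.corpus, pos.words, and neg.words as above) is to build an ordinary term-document matrix and sum the rows that fall into each list:
tdm <- TermDocumentMatrix(cln.corpus)
m <- as.matrix(tdm)
pos.counts <- colSums(m[rownames(m) %in% pos.words, , drop = FALSE])  # positive hits per document
neg.counts <- colSums(m[rownames(m) %in% neg.words, , drop = FALSE])  # negative hits per document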
I just tried out this quite interesting YouTube R tutorial about building a text mining machine: http://www.youtube.com/watch?v=j1V2McKbkLo
So far, the whole code I have is:
# Tutorial: http://www.youtube.com/watch?v=j1V2McKbkLo
# init
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# set options
options(stringsAsFactors = FALSE)
# set parameters
candidates <- c("Obama", "Romney")
pathname <- "C:/Users/***" # here I pointed out the name for reasons of anonymity
# clean text
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
# build TDM
generateTDM <- function(cand, path){
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm = lapply(candidates, generateTDM, path = pathname)
When I try to run this, I constantly get the following error message:
tdm = lapply(candidates, generateTDM, path = pathname)
Error in DirSource(directory = s.dir, encoding = "ANSI") :
empty directory
and I just can't figure out where the error is. I have tried several ways of writing the directory path, but none works. I am unsure whether the error is in RStudio not being able to access locally saved data or in the overall code, and I would be absolutely happy if anybody could help me or give any hints.
Thank you!
On Windows you can also separate path components with \ instead of /, and in R strings you need to type "\\" to get a single \. Thus, you can (hopefully) solve your problem by defining pathname as follows:
pathname <- "C:\\Users\\***"
(of course writing the correct path instead of the stars).
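Independent of the slashes, the "empty directory" message means DirSource found no files where it looked, so it is worth verifying that each candidate subfolder exists and is non-empty. A quick check, assuming the pathname and candidates defined above:
sprintf("%s/%s", pathname, candidates)              # the folders generateTDM will read
dir.exists(sprintf("%s/%s", pathname, candidates))  # should print TRUE TRUE
list.files(sprintf("%s/%s", pathname, candidates))  # should list the speech files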
I am doing a term paper in Text mining using R. Our task is to guess the tone of an article (positive/negative). The articles are stored in respective folders. I need to create a classification system which will learn through training samples.
I reused the code from http://www.youtube.com/watch?v=j1V2McKbkLo
The entire code except the last line executed successfully. Here is the code:
tone<- c("Positive", "Negative")
folderpath <- "C:/Users/Tanmay/Desktop/R practice/Week8"
options(stringsAsFactors = FALSE)
corpus<-Corpus(DirSource(folderpath))
corpuscopy<-corpus
summary(corpus)
inspect(corpus)
#Clean data
CleanCorpus <- function(corpus){
  corpustemp <- tm_map(corpus, removeNumbers)
  corpustemp <- tm_map(corpustemp, removePunctuation)
  corpustemp <- tm_map(corpustemp, tolower)
  corpustemp <- tm_map(corpustemp, removeWords, stopwords("english"))
  corpustemp <- tm_map(corpustemp, stemDocument, language = "english")
  corpustemp <- tm_map(corpustemp, stripWhitespace)
  return(corpustemp)
}
#Document term matrix
generateTDM <- function(tone, path) {
  corpusdir <- sprintf("%s/%s", path, tone)
  corpus <- Corpus(DirSource(directory = corpusdir, encoding = "ANSI"))
  corpustemp <- CleanCorpus(corpus)
  corpusclean <- DocumentTermMatrix(corpustemp)
  corpusclean <- removeSparseTerms(corpusclean, 0.7)
  result <- list(Tone = tone, tdm = corpusclean)
}
tdm <- lapply(tone,generateTDM,path=folderpath)
#Attach tone
ToneBindTotdm <- function(tdm){
  temp.mat <- data.matrix(tdm[["tdm"]])
  temp.df <- as.data.frame(temp.mat)
  temp.df <- cbind(temp.df, rep(tdm[["Tone"]]), nrow(temp.df))
  colnames(temp.df)[ncol(temp.df)] <- "PredictTone"
  return(temp.df)
}
Tonetdm <- lapply(tdm, ToneBindTotdm)
#Stack
Stacktdm <- do.call(rbind.fill, Tonetdm)
Stacktdm[is.na(Stacktdm)] <- 0
#Holdout
trainid <- sample(nrow(Stacktdm), ceiling(nrow(Stacktdm) * 0.7))
testid <- (1:nrow(Stacktdm))[-trainid]
#knn
tdmone <- Stacktdm[, "PredictTone"]
tdmone.nl <- Stacktdm[, !colnames(Stacktdm) %in% "PredictTone"]
knnPredict <- knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], k = 5)
When I tried to execute this, I got an error on the last line (knn):
Error in knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
  NAs introduced by coercion
2: In knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
  NAs introduced by coercion
Could anyone please help me out? Also, if there is a simpler and better way to classify, please point me to it. Thanks, and sorry for the long post.
I was stuck on the same issue, but I modified the code my way to remove all the NA values. You can check my code and compare it with yours to see what the problem might be.
#init
libs <- c("tm" , "plyr" , "class")
lapply(libs,require, character.only=TRUE)
#set options
options(stringsAsFactors = FALSE)
#set parameters
candidates <- c("user1" , "user2" ,"test")
pathname <- "C:/Users/prabhjot.rai/Documents/Project_r/textMining"
#clean text
cleanCorpus <- function(corpus)
{
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, PlainTextDocument)
}
#build TDM
generateTDM <- function(cand, path)
{
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
test <- t(data.matrix(tdm[[1]]$tdm))
rownames(test) <- c(1:nrow(test))
#attach name and convert to dataframe
makeMatrix <- function(thisTDM){
  test <- t(data.matrix(thisTDM$tdm))
  rownames(test) <- c(1:nrow(test))
  test <- as.data.frame(test, stringsAsFactors = F, na.rm = T)
  test$candidateName <- thisTDM$name
  test <- as.data.frame(test, stringsAsFactors = F, na.rm = T)
}
candTDM <- lapply(tdm, makeMatrix)
# stack all the speeches together
tdm.stack <- do.call(rbind.fill, candTDM)
tdm.stack[is.na(tdm.stack)] <- as.numeric(0)
#testing and training sets
train <- tdm.stack[ tdm.stack$candidateName!= 'test' , ]
train <- train[, names(train) != 'candidateName']
test <- tdm.stack[ tdm.stack$candidateName == 'test' , ]
test <- test[, names(test) != 'candidateName']
classes <- tdm.stack [ tdm.stack$candidateName != 'test' , 'candidateName']
classes <- as.factor(classes)
myknn <- knn(train=train, test = test , cl = classes , k=1)
myknn
Keep a testing file in the test folder next to the user1 and user2 folders to check the output of this algorithm. Keep the value of k close to the square root of the number of speeches, preferably an odd number; a sketch of that follows below. And ignore the redundancy of the testing and training set assignment: it was not working in one line on my machine, so I did it in two lines.
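A small sketch of that k rule of thumb (my own addition, not from the tutorial): take the square root of the number of training rows and round it down to an odd number before calling knn:
n <- nrow(train)
k <- max(1, 2 * floor(sqrt(n) / 2) + 1)  # odd value near sqrt(n)
myknn <- knn(train = train, test = test, cl = classes, k = k)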