I am doing a term paper on text mining using R. Our task is to guess the tone of an article (positive/negative). The articles are stored in their respective folders, and I need to create a classification system that will learn from training samples.
I reused the code from http://www.youtube.com/watch?v=j1V2McKbkLo
The entire code except the last line executed successfully. Here is the code:
tone<- c("Positive", "Negative")
folderpath <- "C:/Users/Tanmay/Desktop/R practice/Week8"
options(stringAsFactors = FALSE)
corpus<-Corpus(DirSource(folderpath))
corpuscopy<-corpus
summary(corpus)
inspect(corpus)
#Clean data
CleanCorpus <- function(corpus){
  corpustemp <- tm_map(corpus, removeNumbers)
  corpustemp <- tm_map(corpus, removePunctuation)
  corpustemp <- tm_map(corpus, tolower)
  corpustemp <- tm_map(corpus, removeWords, stopwords("english"))
  corpustemp <- tm_map(corpus, stemDocument, language = "english")
  corpustemp <- tm_map(corpus, stripWhitespace)
  return(corpustemp)
}
#Document term matrix
generateTDM <- function(tone, path) {
  corpusdir <- sprintf("%s/%s", path, tone)
  corpus <- Corpus(DirSource(directory = corpusdir, encoding = "ANSI"))
  corpustemp <- CleanCorpus(corpus)
  corpusclean <- DocumentTermMatrix(corpustemp)
  corpusclean <- removeSparseTerms(corpusclean, 0.7)
  result <- list(Tone = tone, tdm = corpusclean)
}
tdm <- lapply(tone,generateTDM,path=folderpath)
#Attach tone
ToneBindTotdm <- function(tdm){
  temp.mat <- data.matrix(tdm[["tdm"]])
  temp.df <- as.data.frame(temp.mat)
  temp.df <- cbind(temp.df, rep(tdm[["Tone"]]), nrow(temp.df))
  colnames(temp.df)[ncol(temp.df)] <- "PredictTone"
  return(temp.df)
}
Tonetdm <- lapply(tdm,ToneBindTotdm)
#Stack
Stacktdm <- do.call(rbind.fill,Tonetdm)
Stacktdm[is.na(Stacktdm)] <- 0
#Holdout
trainid <- sample(nrow(Stacktdm),ceiling(nrow(Stacktdm) * 0.7))
testid <- (1:nrow(Stacktdm)) [- trainid]
#knn
tdmone <- Stacktdm[,"PredictTone"]
tdmone.nl <- Stacktdm[, !colnames(Stacktdm) %in% "PredictTone"]
knnPredict <- knn(tdmone.nl[trainid,],tdmone.nl[testid,],tdmone[trainid],k=5)
When I tried to execute this, I got an error on the last line (knn):
Error in knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
  NAs introduced by coercion
2: In knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
  NAs introduced by coercion
Could anyone please help me out? Also, if there are simpler and better ways to classify, please point me to them. Thanks, and sorry for the long post.
I was stuck on the same issue, but I modified the code my own way to remove all the NA values. You can check my code below and compare it with yours to see what might be the problem. One difference worth noting: each tm_map call in my cleanCorpus feeds the previous result (corpus.tmp) forward, whereas your CleanCorpus applies every transformation to the original corpus, so only the last one survives.
#init
libs <- c("tm" , "plyr" , "class")
lapply(libs,require, character.only=TRUE)
#set options
options(stringsAsFactors = FALSE)
#set parameters
candidates <- c("user1" , "user2" ,"test")
pathname <- "C:/Users/prabhjot.rai/Documents/Project_r/textMining"
#clean text
cleanCorpus <- function(corpus)
{
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, PlainTextDocument)
}
#build TDM
generateTDM <- function(cand, path)
{
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
# quick exploratory check of the first TDM (not needed for the steps below)
test <- t(data.matrix(tdm[[1]]$tdm))
rownames(test) <- c(1:nrow(test))
#attach name and convert to dataframe
makeMatrix <- function(thisTDM){
  test <- t(data.matrix(thisTDM$tdm))
  rownames(test) <- c(1:nrow(test))
  test <- as.data.frame(test, stringsAsFactors = FALSE)
  test$candidateName <- thisTDM$name
  test
}
candTDM <- lapply(tdm, makeMatrix)
# stack all the speeches together
tdm.stack <- do.call(rbind.fill, candTDM)
tdm.stack[is.na(tdm.stack)] <- as.numeric(0)
#testing and training sets
train <- tdm.stack[ tdm.stack$candidateName!= 'test' , ]
train <- train[, names(train) != 'candidateName']
test <- tdm.stack[ tdm.stack$candidateName == 'test' , ]
test <- test[, names(test) != 'candidateName']
classes <- tdm.stack [ tdm.stack$candidateName != 'test' , 'candidateName']
classes <- as.factor(classes)
myknn <- knn(train=train, test = test , cl = classes , k=1)
myknn
Keep a test file in the test folder, next to the user1 and user2 folders, to check the output of this algorithm. Keep the value of k close to the square root of the number of speeches, preferably an odd number. And ignore the redundancy in the testing/training set assignment: it did not work in one line on my machine, so I did it in two.
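For example, a minimal sketch of that rule of thumb (my own illustration, not part of the original script):
k <- ceiling(sqrt(nrow(train)))   # k near the square root of the number of training speeches
if (k %% 2 == 0) k <- k + 1       # nudge to an odd number so a two-class vote cannot tie
myknn <- knn(train = train, test = test, cl = classes, k = k)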
I'm working on a document classification model using the code provided by Tim DAuria (https://www.youtube.com/watch?v=j1V2McKbkLo), but I cannot figure out how to actually see the detailed classification of the 'Test' data.
I am using the model to help classify contracts by type and want to see the specific classification assigned to each contract. For example, the model assigns 15 contracts to the "x" type of contract: how can I view those 15 file names? The code below works great for the classification piece; I'm just posting it for reference.
Please help! I'm really new at this, and I'm sure I'm missing something obvious, but I could not find anything on the web.
Classification code below:
#init
libs <- c("tm", "plyr","class")
lapply(libs, require, character.only = TRUE)
#set options
options(stringsAsFactors = FALSE)
#set parameters
contract <- c("build construction", "other")
pathname <- "..Desktop/ML/ContractReview"
#clean text
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, stemDocument)
  return(corpus.tmp)
}
#build TDM
generateTDM <- function(contract, path) {
  c.dir <- sprintf("%s/%s", path, contract)
  c.cor <- VCorpus(DirSource(directory = c.dir), readerControl = list(reader = readPlain))
  c.cor.cl <- cleanCorpus(c.cor)
  c.tdm <- TermDocumentMatrix(c.cor.cl)
  c.tdm <- removeSparseTerms(c.tdm, .07)
  result <- list(name = contract, tdm = c.tdm)
}
tdm <- lapply(contract, generateTDM, path = pathname)
# attach name
bindcontractToTDM <- function(tdm) {
  c.mat <- t(data.matrix(tdm[["tdm"]]))
  c.df <- as.data.frame(c.mat, stringsAsFactors = FALSE)
  c.df <- cbind(c.df, rep(tdm[["name"]], nrow(c.df)))
  colnames(c.df)[ncol(c.df)] <- "targetcontract"
  return(c.df)
}
contractTDM <- lapply(tdm, bindcontractToTDM)
#stack if you have more than one dataframe
tdm.stack <- do.call(rbind.fill, contractTDM)
tdm.stack[is.na(tdm.stack)] <-0
#hold-out
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack)* 0.7))
test.idx <- (1:nrow(tdm.stack))[- train.idx]
#model - knn
tdm.contract <-tdm.stack[, "targetcontract"]
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetcontract"]
knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.contract[train.idx])
#accuracy
conf.mat <- table("predictions"= knn.pred, Actual = tdm.contract[test.idx])
(accuracy <- sum(diag(conf.mat)) / length(test.idx)*100)
The answer to your question is hidden in the knn.pred object, which stores the predicted labels for the test cases.
Since no input data is provided (I don't think the author of the video provided it?), I am not sure about the details of the contract classes in your example. However, the output of the knn function from the class package is a factor with the labels that the algorithm predicted for the test documents (you will notice that length(knn.pred) is roughly equal to 0.3 * nrow(tdm.stack)).
If you would like to view/store the predicted label and the actual label for each entry, you can create a suitable data frame:
label_df <- data.frame(label_pred = knn.pred, label_actual = tdm.contract[test.idx])
Alternatively, you can also include the remaining columns of tdm.stack.nl (if you would like to re-examine the tdm.stack information in that context):
label_df <- data.frame(label_pred = knn.pred, label_actual = tdm.contract[test.idx], tdm.stack.nl[test.idx, ])
You can then filter either of these data frames to see how your entries of interest have been labelled.
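If you also want the file names behind those predictions (your 15 contracts), note that rbind.fill drops row names, so you need to capture them before stacking. A minimal sketch, assuming the per-class data frames built by bindcontractToTDM still carry the original file names as row names:
doc.names <- unlist(lapply(contractTDM, rownames))   # file names, in stacking order
label_df <- data.frame(doc = doc.names[test.idx],
                       label_pred = knn.pred,
                       label_actual = tdm.contract[test.idx])
subset(label_df, label_pred == "build construction")$doc   # e.g. files predicted as that type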
Alternatively, you can run the k-nearest-neighbors algorithm using a function from a different package, in which case the output might differ.
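For instance, a sketch with the caret package (my suggestion, not something used in your post): caret::knn3 returns per-class probabilities rather than a bare factor of labels.
library(caret)
fit <- knn3(as.matrix(tdm.stack.nl[train.idx, ]),
            factor(tdm.contract[train.idx]), k = 5)
head(predict(fit, as.matrix(tdm.stack.nl[test.idx, ])))   # per-class probability matrix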
I seem to run into a problem whenever I try to inspect my frequent words and associations.
When I make the tdm, I get this info:
[screenshot: TermDocumentMatrix summary]
I can see I have plenty of terms to use, in plenty of documents.
However, when I try to inspect the content of tdm, I get this:
[screenshot: inspecting the TDM]
How come the tdm is suddenly empty?
I hope someone can help.
tweets <- userTimeline("RDataMining", n = 1000)
(n.tweet <- length(tweets))
tweets[1:3]
#convert tweets to a data frame
tweets.df <- twListToDF(tweets)
dim(tweets.df)
##Text cleaning
library(tm)
#build a corpus and specify the source to be a character vector
myCorpus <- Corpus(VectorSource(tweets.df$text))
#convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus,content_transformer(removeURL))
#remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeNumPunct))
#remove stopwords + 2
myStopwords <- c(stopwords('english'),"available","via")
#remove "r" and "big" from stopwords
myStopwords <- setdiff(myStopwords, c("r","big"))
#remove stopwords from corpus
myCorpus <- tm_map(myCorpus,removeWords,myStopwords)
#remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
#keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy <- myCorpus
#stem words
library(SnowballC)
myCorpus <- tm_map(myCorpus,stemDocument)
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), ""))
  # because stemCompletion completes an empty string to a word in the dictionary,
  # remove empty strings to avoid this
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = "")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
#count freq of "mining"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<mining")})
sum(unlist(miningCases))
#count freq of "miner"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<miner")})
sum(unlist(miningCases))
#count freq of "r"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<r")})
sum(unlist(miningCases))
#replace "miner" with "mining"
myCorpus <- tm_map(myCorpus,content_transformer(gsub),
pattern = "miner", replacement = "mining")
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE,stopwords = TRUE))
tdm
##Freq words and associations
idx <- which(dimnames(tdm)$Terms == "r")
inspect(tdm[idx + (0:5), 101:110])
#inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq,term.freq >= 15)
df <- data.frame(term = names(term.freq), freq = term.freq)
I've been using the following Twitter query to test your code:
tweets = searchTwitter("r data mining", n=10)
and I think the problem is in your function stemCompletion2, which should look something like this:
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  print("before:")
  print(x)
  # because stemCompletion completes an empty string to a word in the dictionary,
  # remove empty strings to avoid this
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  print("after:")
  print(x)
  x <- paste(x, collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
The modifications are as follows: before, you had
x <- unlist(strsplit(as.character(x), ""))
which was creating a list of all the individual characters in each of the documents; I've modified it to
x <- unlist(strsplit(as.character(x), " "))
to create a list of words. Similarly, when recomposing your documents, you were doing
x <- paste(x, sep = "", collapse = "")
which was creating the long strings you mention in your post; I've modified it to
x <- paste(x, collapse = " ")
to recompose the words.
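A quick standalone way to see the difference between the two strsplit calls:
strsplit("data mining", "")[[1]]    # "d" "a" "t" "a" " " "m" ... : individual characters
strsplit("data mining", " ")[[1]]   # "data" "mining"             : words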
One example of the completions would be for my data:
[1] "before:"
[1] "rt" "ebookdealalert" "r" "datamin" "project" "learn" "data" "mine"
[9] "realworld" "project" "book" "solv" "predict" "model"
[1] "after:"
rt ebookdealalert r datamin project learn data mine
"rt" "ebookdealalerts" "r" "datamining" "projects" "learn" "data" ""
realworld project book solv predict model
"realworld" "projects" "book" "solve" "predictive" "modeling"
After that step, you should be able to work with the TermDocumentMatrix as expected.
Hope it helps.
I have been trying to follow this example by Norbert Ryciak, whom I haven't been able to get in touch with.
Since the article was written in 2014, some things in R have changed, so I have been able to update parts of the code, but I got stuck in the last part.
Here is my working code so far:
library(tm)
library(stringi)
library(proxy)
wiki <- "https://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
"Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
"Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))
for (i in 1:length(titles)) {
  articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
}
docs <- Corpus(VectorSource(articles))
docs[[1]]
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, content_transformer(tolower))
docs8[[1]]
docsTDM <- TermDocumentMatrix(docs8)
docsTDM2 <- as.matrix(docsTDM)
docsdissim <- dist(docsTDM2, method = "cosine")
But I haven't been able to get past this part:
docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
docsdissim2
h <- hclust(docsdissim, method = "ward.D")
plot(h, labels = titles, sub = "")
I tried to run the "hclust" directly, and then I was able to Plot, but nothing readable came out of it.
This are the errors Im getting:
rownames(docsdissim2) <- titles
Error in `rownames<-`(`*tmp*`, value = c("Integral", "Riemann_integral", :
length of 'dimnames' [1] not equal to array extent
Another:
plot(h, labels = titles, sub = "")
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
invalid dendrogram input
Could anyone give me a hand to finish this example?
Best regards,
I was able to solve this problem thanks to Norbert Ryciak (the author of the tutorial).
Since he used an older version of "tm" (which was probably the latest at the time), it was not compatible with the one I used.
The solution was to replace "docsTDM <- TermDocumentMatrix(docs8)" with "docsTDM <- DocumentTermMatrix(docs8)".
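The reason is that dist() measures distances between rows: in a TermDocumentMatrix the rows are terms, so the dissimilarity matrix was built over terms and its dimensions no longer matched the 11 titles. Equivalently, if you prefer to keep the TermDocumentMatrix, you can transpose it (a sketch):
docsTDM2 <- t(as.matrix(docsTDM))                # now one row per document
docsdissim <- dist(docsTDM2, method = "cosine")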
So the final code:
library(tm)
library(stringi)
library(proxy)
wiki <- "https://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
"Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
"Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))
for (i in 1:length(titles)) {
  articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
}
docs <- Corpus(VectorSource(articles))
docs[[1]]
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, content_transformer(tolower))
docs8[[1]]
docsTDM <- DocumentTermMatrix(docs8)
docsTDM2 <- as.matrix(docsTDM)
docsdissim <- dist(docsTDM2, method = "cosine")
docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
docsdissim2
h <- hclust(docsdissim, method = "ward")
plot(h, labels = titles, sub = "")
I have been trying to follow the document classification tutorial on YouTube using R, and it's really interesting, but when I try to run the first part of the script I keep getting this error: Error in FUN(c("obama", "romney")[[1L]], ...) : could not find function "corpus". I really don't know why that is, and I'm hoping someone can help me figure it out.
This is the script:
#init
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# set options
options(stringAsFactors = FALSE)
#set parameters
candidates <- c("obama","romney")
pathname <- "C:\\Users\\admin\\Documents\\speeches"
#clean text
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus, removeWords, stopWords("english"))
  return(corpus.tmp)
}
#Build TDM
generateTDM <- function(cand, path){
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
Your path name should be
pathname <- "C:/Users/admin/Documents/speeches"
Note: there are forward slashes in the pathname.
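For reference, a small illustration of why forward slashes are the safer choice on Windows:
pathname <- "C:/Users/admin/Documents/speeches"     # forward slashes work on Windows
# pathname <- "C:\Users\admin\Documents\speeches"   # fails: '\U' used without hex digits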
I'm using a support vector machine for my document classification task! It classifies all the articles in my training set, but fails to classify the ones in my test set!
trainDTM is the document-term matrix of my training set; testDTM is the one for the test set.
Here's my (not so beautiful) code:
# create data.frame with labelled sentences
labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T))
# create training set and test set
traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")])
testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")])
# Vector, Source Transformation
trainvector <- as.vector(traindata$"ARTICLE")
testvector <- as.vector(testdata$"ARTICLE")
trainsource <- VectorSource(trainvector)
testsource <- VectorSource(testvector)
# CREATE CORPUS FOR DATA
traincorpus <- Corpus(trainsource)
testcorpus <- Corpus(testsource)
# my own stopwords
sw <- c("i", "me", "my")
## CLEAN TEXT
# FUNCTION FOR CLEANING
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, sw)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  corpus.tmp <- tm_map(corpus.tmp, stemDocument, language = "en")
  return(corpus.tmp)
}
# CLEAN CORP WITH ABOVE FUNCTION
traincorpus.cln <- cleanCorpus(traincorpus)
testcorpus.cln <- cleanCorpus(testcorpus)
## CREATE N-GRAM DOCUMENT TERM MATRIX
# CREATE N-GRAM TOKENIZER (note: with min = 1 and max = 1 this actually produces unigrams, despite the name)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# CREATE DTM
trainmatrix.cln.bi <- DocumentTermMatrix(traincorpus.cln, control = list(tokenize = BigramTokenizer))
testmatrix.cln.bi <- DocumentTermMatrix(testcorpus.cln, control = list(tokenize = BigramTokenizer))
# REMOVE SPARSE TERMS
trainDTM <- removeSparseTerms(trainmatrix.cln.bi, 0.98)
testDTM <- removeSparseTerms(testmatrix.cln.bi, 0.98)
# train the model
SVM <- svm(as.matrix(trainDTM), as.factor(traindata$CLASS))
# get classifications for training-set
results.train <- predict(SVM, as.matrix(trainDTM)) # works fine!
# get classifications for test-set
results <- predict(SVM,as.matrix(testDTM))
Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", :
  length of 'center' must equal the number of columns of 'x'
I don't understand this error. And what is 'center'?
Thank you!!
Train and test data must be in the same feature space; building two separate DTMs that way can't work.
A solution using RTextTools:
DocTermMatrix <- create_matrix(labeled, language="english", removeNumbers=TRUE, stemWords=TRUE, ...)
container <- create_container(DocTermMatrix, labels, trainSize=1:700, testSize=701:1000, virgin=FALSE)
models <- train_models(container, "SVM")
results <- classify_models(container, models)
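To inspect the predicted labels afterwards, something like this should work (hedged: the exact column names of the classify_models output can vary by RTextTools version):
analytics <- create_analytics(container, results)
summary(analytics)
head(results$SVM_LABEL)   # predicted labels for documents 701:1000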
Or, to answer your question (with e1071), you can specify the vocabulary ('features') when building the projection (DocumentTermMatrix):
DocTermMatrixTrain <- DocumentTermMatrix(Corpus(VectorSource(trainDoc)))
Features <- DocTermMatrixTrain$dimnames$Terms
DocTermMatrixTest <- DocumentTermMatrix(Corpus(VectorSource(testDoc)),
                                        control = list(dictionary = Features))
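With a shared dictionary the two matrices have identical columns, so the model trained on one projects cleanly onto the other. A sketch of the remaining steps, where trainLabels stands in for your as.factor(traindata$CLASS):
library(e1071)
SVM <- svm(as.matrix(DocTermMatrixTrain), trainLabels)
results <- predict(SVM, as.matrix(DocTermMatrixTest))   # same columns, so no 'center' mismatch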