I am working with Bengali text. I have a data set of around 9,800 tweets and Facebook statuses (Bengali text mixed with English), each labeled with positive or negative sentiment.
I tried the Naïve Bayes algorithm for text classification, along with other machine learning algorithms.
I ran into challenges with:
dealing with languages other than English (like Bengali)
cleaning the data
creating a document term matrix for Bengali text
inconsistent results (though it does produce results)
So, using my data set (Bengali text), how can I properly run the Naïve Bayes algorithm and other machine learning algorithms (decision tree, support vector machine)?
NB: I am sharing sample data:
https://1drv.ms/x/s!Al917DZ-85m3ghcLoFHX4rWUTFOS
library(tm)
library(e1071)
library(MLmetrics)
rawData <- bntextt  # bntextt: the imported sample data
colnames(rawData) <- c("type", "text")
rawData$text <- iconv(rawData$text, to = "UTF-8")
rawData$type <- factor(rawData$type)
sms_train_raw <- rawData[1:9800, ]$type
sms_test_raw <- rawData[9801:9883, ]$type
sms_corpus <- Corpus(VectorSource(rawData$text))
corpus <- tm_map(sms_corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords())  # note: stopwords() defaults to English
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
sms_corpus_train<-corpus[1:9800]
sms_corpus_test<-corpus[9801:9883]
sms_dtm <- DocumentTermMatrix(corpus)
sms_dtm_train <- sms_dtm[1:9800,]
sms_dtm_test <- sms_dtm[9801:9883,]
# build the dictionary from the training data only, and reuse it for the
# test set so both matrices share the same feature space
five_times_words <- findFreqTerms(sms_dtm_train, 5)
sms_dtm_train <- DocumentTermMatrix(sms_corpus_train, control = list(dictionary = five_times_words))
sms_dtm_test <- DocumentTermMatrix(sms_corpus_test, control = list(dictionary = five_times_words))
# convert term counts to presence/absence factors for naiveBayes
convert_count <- function(x) {
y <- ifelse(x > 0, 1,0)
y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
sms_train <- apply(sms_dtm_train, 2, convert_count)
sms_test <- apply(sms_dtm_test, 2, convert_count)
sms_classifier <- naiveBayes(sms_train, sms_train_raw)
class(sms_classifier)
sms_test_pred <- predict(sms_classifier, newdata=sms_test,type="class")
table(sms_test_raw, sms_test_pred)
Accuracy(sms_test_pred, sms_test_raw)
F1_Score(sms_test_pred, sms_test_raw)
[Screenshots attached: raw data; document term matrix]
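A note on the cleaning steps above: stopwords() defaults to English, so it removes nothing from the Bengali portions, and removePunctuation strips only ASCII punctuation by default, so characters such as the Bengali danda (।) survive. A minimal Unicode-aware sketch; the ucp flag needs a reasonably recent tm version, and bengali_stopwords is a hypothetical vector you would supply yourself:
corpus <- tm_map(sms_corpus, content_transformer(tolower))  # affects only Latin-script text
corpus <- tm_map(corpus, removeNumbers)
# ucp = TRUE switches removePunctuation to Unicode character classes,
# so the Bengali danda (U+0964) and similar marks are removed too
corpus <- tm_map(corpus, content_transformer(function(x) removePunctuation(x, ucp = TRUE)))
corpus <- tm_map(corpus, removeWords, bengali_stopwords)  # hypothetical Bengali stopword list
corpus <- tm_map(corpus, stripWhitespace)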
I am working on a document classification model using the code provided by Tim D'Auria (https://www.youtube.com/watch?v=j1V2McKbkLo), but I cannot figure out how to see the detailed classification results on the 'Test' data.
I am using the model to help classify contracts by type and want to see the specific classification assigned to the different contracts. For example, the model assigns 15 contracts as type "x". How can I view those 15 file names? The code below works great for the classification piece; I am just posting it for reference.
Please help! I am really new at this, and I'm sure I'm missing something obvious, but I could not find anything on the web.
Classification code below:
#init
libs <- c("tm", "plyr","class")
lapply(libs, require, character.only = TRUE)
#set options
options(stringsAsFactors = FALSE)
#set parameters
contract <- c("build construction", "other")
pathname <- "..Desktop/ML/ContractReview"
#clean text
cleanCorpus <- function(corpus) {
corpus.tmp <- tm_map(corpus, removePunctuation)
corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
corpus.tmp <- tm_map(corpus.tmp, stemDocument)
return(corpus.tmp)
}
#build TDM
generateTDM <- function(contract, path) {
c.dir <- sprintf("%s/%s", path, contract)
c.cor <- VCorpus(DirSource(directory = c.dir), readerControl = list(reader=readPlain))
c.cor.cl <- cleanCorpus(c.cor)
c.tdm <- TermDocumentMatrix(c.cor.cl)
c.tdm <- removeSparseTerms(c.tdm, .07)
result <- list(name = contract, tdm = c.tdm)
return(result)
}
tdm <- lapply(contract, generateTDM, path = pathname)
# attach name
bindcontractToTDM <- function(tdm) {
c.mat <- t(data.matrix(tdm[["tdm"]]))
c.df <- as.data.frame(c.mat, stringsAsFactors = FALSE)
c.df <- cbind(c.df, rep(tdm[["name"]], nrow(c.df)))
colnames(c.df) [ncol(c.df)] <- "targetcontract"
return(c.df)
}
contractTDM <- lapply(tdm, bindcontractToTDM)
#stack if you have more than one dataframe
tdm.stack <- do.call(rbind.fill, contractTDM)
tdm.stack[is.na(tdm.stack)] <- 0
#hold-out
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack)* 0.7))
test.idx <- (1:nrow(tdm.stack))[- train.idx]
#model - knn
tdm.contract <- tdm.stack[, "targetcontract"]
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetcontract"]
knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.contract[train.idx])
#accuracy
conf.mat <- table("predictions"= knn.pred, Actual = tdm.contract[test.idx])
(accuracy <- sum(diag(conf.mat)) / length(test.idx)*100)
The answer to your question is hidden in the knn.pred object, which stores the predicted labels for the test cases.
Since no input data is provided (I don't think the author of the video provided any?), I am not sure about the details of the contract classes in your example. However, the output of the knn function from the class package you are using is a factor with the labels the algorithm predicted for the test documents (you will notice that
length(knn.pred)
is roughly 0.3 * nrow(tdm.stack)).
If you would like to view/store the predicted label and the actual label for each entry, then you can create a suitable data frame:
label_df <- data.frame(label_pred = knn.pred, label_actual = tdm.contract[test.idx])
Alternatively, you can include the remaining columns of tdm.stack (if you would like to re-examine the term counts in that context):
label_df <- data.frame(label_pred = knn.pred, label_actual = tdm.contract[test.idx], tdm.stack.nl[test.idx, ])
You can then filter either of these data frames to see how your entries of interest have been labelled.
Alternatively, you can run the k-nearest-neighbours algorithm using a function from a different package, in which case the output format might differ.
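As for the original question about viewing the 15 file names: the term-document matrix built from DirSource carries the file names as document names, but plyr::rbind.fill drops row names when stacking. One workaround, sketched under that assumption, is to capture the names as an explicit column before stacking (doc_name is a hypothetical column name):
bindcontractToTDM <- function(tdm) {
  c.mat <- t(data.matrix(tdm[["tdm"]]))
  c.df <- as.data.frame(c.mat, stringsAsFactors = FALSE)
  c.df$targetcontract <- tdm[["name"]]
  c.df$doc_name <- rownames(c.df)  # file name carried over from DirSource
  return(c.df)
}
# rebuild tdm.stack as before, then keep both metadata columns out of the features:
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% c("targetcontract", "doc_name")]
# after predicting, list e.g. the files classified as "build construction":
label_df <- data.frame(file = tdm.stack$doc_name[test.idx],
                       label_pred = knn.pred,
                       label_actual = tdm.contract[test.idx])
subset(label_df, label_pred == "build construction")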
I need to train a model which would perform multilabel multiclass categorization on text data.
Currently, I'm using the mlr package in R, but I cannot proceed because of an error I get before training the model.
More specifically, I'm stuck at this line:
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
and, got this error
Error in makeMultilabelTask(id = "classif", data = termsDf, target = target) :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
I followed this example:
Multi-label text classification using mlr package in R
Here is the complete code snippet I'm using so far:
library(tm)        # corpus handling
library(mlr)       # machine learning framework
library(magrittr)  # %>% pipe
tm <- read.csv("translate_text_V02.csv", header = TRUE,
               stringsAsFactors = FALSE, na.strings = c("", "NA"))
process <- tm[, c("label", "text")]
process <- na.omit(process)
docs <- Corpus(VectorSource(process$text))
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, mystopwords)  # mystopwords: my own stopword vector
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, removeWords, stopwords("german"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument, language = "english")
return(corpus)
}
clean_corp <- clean_corpus(docs)
terms <-DocumentTermMatrix(clean_corp)
m <- as.matrix(terms)
m <- cbind(m, process$label)  # note: this coerces the count matrix to character
termsDf <- as.data.frame(m)
target <- unique(termsDf[, 2628]) %>% as.character() %>% sort()  # column 2628 holds the label
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
I created the data frame from the document term matrix together with the label column, but I'm stuck on how to proceed with the machine learning part.
Questions:
How can I proceed further after creating the DocumentTermMatrix?
How can I apply the random-forest algorithm to this particular dataset?
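On the error itself: the assertion fails because terms coming out of the DTM (numbers, punctuation-like tokens) are not syntactically valid R column names. A minimal sketch of one way forward, assuming each document carries a single label stored in the last column (makeMultilabelTask expects one logical column per label):
# sanitize the term columns so they satisfy R's naming conventions
colnames(termsDf) <- make.names(colnames(termsDf), unique = TRUE)
# the label was bound in as the last column above
labelCol <- colnames(termsDf)[ncol(termsDf)]
# the earlier cbind() coerced the counts to character; convert them back
termCols <- setdiff(colnames(termsDf), labelCol)
termsDf[termCols] <- lapply(termsDf[termCols], as.numeric)
# one logical indicator column per label, as makeMultilabelTask expects
labs <- sort(unique(as.character(termsDf[[labelCol]])))
for (lab in labs) {
  termsDf[[make.names(lab)]] <- termsDf[[labelCol]] == lab
}
termsDf[[labelCol]] <- NULL
classify.task <- makeMultilabelTask(id = "classif", data = termsDf,
                                    target = make.names(labs))
For the random-forest question, mlr can wrap an ordinary classifier such as "classif.randomForest" with makeMultilabelBinaryRelevanceWrapper, or use a native multilabel learner such as "multilabel.randomForestSRC", if installed.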
I am a newbie in R programming; can you please help me solve the error in the code given below?
#import library
library(NLP)
library(tm)
library(wordcloud)
library(SnowballC)
library(slam)
library(lsa)
#import text files for preprocessing
text_key <- Corpus(DirSource("V:\\Work\\R\\Input\\answer_key"))
##pre-processing is started
#do pre-processing of text data of answer
text_key <- tm_map(text_key, stripWhitespace)
text_key <- tm_map(text_key,content_transformer(tolower))
text_key <- tm_map(text_key,removeWords, stopwords("english"))
text_key <- tm_map(text_key, removeNumbers)
text_key <- tm_map(text_key, removePunctuation)
#do stemming of documents
text_key <- tm_map(text_key,stemDocument,language="english")
text_key <- tm_map(text_key, stripWhitespace) # *Stripping whitespace
text_key <- tm_map(text_key, PlainTextDocument)
#generate document term matrix
dtm_key <- DocumentTermMatrix(text_key)
#print the output of documenttermmatrix
dtm_key
#inspect the elements in the document term matrix
inspect(dtm_key)
#generate wordcloud for the document term matrix for answer-sheets
m <- as.matrix(dtm_key)
am <- lw_logtf(am) * gw_idf(am)
space <- lsa(am, dims = dimcalc_raw())
I have to project my matrix into a semantic space using LSA, but I get an SVD 'subscript out of bounds' error.
Here is the silly mistake that had to be changed:
#generate wordcloud for the document term matrix for answer-sheets
m <- as.matrix(dtm_key)
am <- lw_logtf(am) * gw_idf(am)
I replaced the undefined variable am with the m variable:
am <- lw_logtf(m) * gw_idf(m)
With this change, the error is solved.
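As a small follow-up, once lsa() succeeds the returned space can be used directly; a sketch assuming document-to-document similarity is the goal (space$dk holds the document vectors, and cosine() comes from the same lsa package):
# rows of space$dk are documents; cosine() compares matrix columns
doc_space <- t(space$dk)
doc_sims <- cosine(doc_space)  # document-by-document cosine similarities
doc_sims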
I am trying to create a dendrogram in R based on an Excel sheet, for use in text mining. I have one large column, each cell containing a string of text. I want the smallest branches of the dendrogram to represent individual cells, yet when I run my script I instead get a dendrogram of every word in the entire Excel file. How do I fix this?
library(tm)
library(stringi)
library(proxy)
Data <- read.csv(file.choose(),header=TRUE)
docs <- Corpus(VectorSource(Data))
docs[[1]]
docs1 <- tm_map(docs, PlainTextDocument)
docs2 <- tm_map(docs1, stripWhitespace)
docs3 <- tm_map(docs2, removeWords, stopwords("english"))
docs4 <- tm_map(docs3, removePunctuation)
docs5 <- tm_map(docs4, content_transformer(tolower))
docs5[[1]]
TermMatrix <- TermDocumentMatrix(docs5)
docsdissim <- dist(as.matrix(TermMatrix), method = "euclidean")
docsdissim2 <- as.matrix(docsdissim)
docsdissim2
h <- hclust(docsdissim, method = "ward.D2")
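Two things in the script make the leaves words rather than cells: the corpus is built from the whole data frame instead of the text column, and TermDocumentMatrix puts terms in rows, so dist() measures distances between words. A sketch of the document-level variant; Data$text as the column name is an assumption about your sheet:
docs <- Corpus(VectorSource(Data$text))  # one document per cell of the column
# ...run the same tm_map cleaning steps as above to get docs5, then:
dtm <- DocumentTermMatrix(docs5)  # rows = documents (cells), not terms
docsdissim <- dist(as.matrix(dtm), method = "euclidean")
h <- hclust(docsdissim, method = "ward.D2")
plot(h)  # each leaf is now one cell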
I'm using a support vector machine for my document classification task. It classifies all the articles in my training set, but fails to classify the ones in my test set!
trainDTM is the document term matrix of my training set; testDTM is the one for the test set.
Here's my (not so beautiful) code:
# load required packages
library(xlsx)   # read.xlsx
library(tm)     # corpus handling
library(RWeka)  # NGramTokenizer
library(e1071)  # svm
# create data.frame with labelled sentences
labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header = TRUE))
# create training set and test set
traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")])
testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")])
# Vector, Source Transformation
trainvector <- as.vector(traindata$ARTICLE)
testvector <- as.vector(testdata$ARTICLE)
trainsource <- VectorSource(trainvector)
testsource <- VectorSource(testvector)
# CREATE CORPUS FOR DATA
traincorpus <- Corpus(trainsource)
testcorpus <- Corpus(testsource)
# my own stopwords
sw <- c("i", "me", "my")
## CLEAN TEXT
# FUNCTION FOR CLEANING
cleanCorpus <- function(corpus){
corpus.tmp <- tm_map(corpus, removePunctuation)
corpus.tmp <- tm_map(corpus.tmp,stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp <- tm_map(corpus.tmp, removeWords, sw)
corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
corpus.tmp <- tm_map(corpus.tmp, stemDocument, language="en")
return(corpus.tmp)}
# CLEAN CORP WITH ABOVE FUNCTION
traincorpus.cln <- cleanCorpus(traincorpus)
testcorpus.cln <- cleanCorpus(testcorpus)
## CREATE N-GRAM DOCUMENT TERM MATRIX
# CREATE TOKENIZER (note: min = max = 1 yields unigrams, despite the name)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# CREATE DTM
trainmatrix.cln.bi <- DocumentTermMatrix(traincorpus.cln, control = list(tokenize = BigramTokenizer))
testmatrix.cln.bi <- DocumentTermMatrix(testcorpus.cln, control = list(tokenize = BigramTokenizer))
# REMOVE SPARSE TERMS
trainDTM <- removeSparseTerms(trainmatrix.cln.bi, 0.98)
testDTM <- removeSparseTerms(testmatrix.cln.bi, 0.98)
# train the model
SVM <- svm(as.matrix(trainDTM), as.factor(traindata$CLASS))
# get classifications for training-set
results.train <- predict(SVM, as.matrix(trainDTM)) # works fine!
# get classifications for test-set
results <- predict(SVM,as.matrix(testDTM))
Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", :
length of 'center' must equal the number of columns of 'x'
I don't understand this error, and what is 'center'?
thank you!!
Train and test data must live in the same feature space; building two separate DTMs that way can't work.
A solution using RTextTools:
DocTermMatrix <- create_matrix(labeled$ARTICLE, language = "english", removeNumbers = TRUE, stemWords = TRUE, ...)
container <- create_container(DocTermMatrix, labeled$CLASS, trainSize = 1:700, testSize = 701:1000, virgin = FALSE)
models <- train_models(container, "SVM")
results <- classify_models(container, models)
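classify_models() returns the predicted label (and a probability) for each test document; if useful, the same package can summarise them against the held-out labels:
analytics <- create_analytics(container, results)
summary(analytics)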
Or, to answer your question (with e1071), you can specify the vocabulary ('features') when building the projection (DocumentTermMatrix):
DocTermMatrixTrain <- DocumentTermMatrix(Corpus(VectorSource(trainDoc)))
Features <- DocTermMatrixTrain$dimnames$Terms
DocTermMatrixTest <- DocumentTermMatrix(Corpus(VectorSource(testDoc)), control = list(dictionary = Features))
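With the shared dictionary the two matrices have identical columns, so the e1071 workflow from the question should then go through; a brief sketch under that assumption:
SVM <- svm(as.matrix(DocTermMatrixTrain), as.factor(traindata$CLASS))
results <- predict(SVM, as.matrix(DocTermMatrixTest))  # columns now match, so scaling by 'center' works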