Similarity Matrix from DocumentTermMatrix in R

I need to create a similarity matrix from a document term matrix in order to perform maximum capturing clustering on documents. So far I have only found solutions for distance matrices. I tried the dist method, but it gives me the wrong output. Is there a way to create similarity matrices in R? I used the tm package for the following code, but I am not restricted to it; if there is another good package, let me know. The code so far:
install.packages("tm")
install.packages("rJava")
install.packages("Snowball")
install.packages("RWeka")
install.packages("RWekajars")
install.packages("XML")
install.packages("openNLP")
install.packages("openNLPmodels.en")
Sys.setenv(NOAWT=TRUE)
library(XML)
library(rJava)
library(Snowball)
library(RWeka)
library(tm)
library(openNLP)
library(openNLPmodels.en)
sample = c(
"cc ee aa",
"dd bb ee",
"bb cc ee dd",
"cc ee dd aa",
"bb ee",
"cc dd aa",
"bb cc aa",
"bb cc",
"cc ee dd"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tmTagPOS)
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
# need to create a similarity matrix here; dist() below yields a distance matrix, not a similarity matrix
dist(dtm, method = "manhattan", diag = FALSE, upper = FALSE)
The output for the given sample should match this definition of the similarity matrix:
if (i < j)
    a[i][j] = sim[i][j]
else
    a[i][j] = 0
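A minimal sketch of one possible approach, assuming cosine similarity is an acceptable choice for sim[i][j]: compute it directly from the DTM in base R, then zero out the diagonal and lower triangle to match the definition above. The proxy package's simil() with a cosine measure is a packaged alternative.
m <- as.matrix(dtm)                         # documents as rows, terms as columns
norms <- sqrt(rowSums(m^2))                 # Euclidean norm of each document vector
sim <- tcrossprod(m) / outer(norms, norms)  # cosine similarity between document pairs
sim[lower.tri(sim, diag = TRUE)] <- 0       # keep only a[i][j] for i < j
round(sim, 2)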

Related

I am doing text mining. If I have a dendrogram of some documents and cut it at one level, how can I get all the terms at that level of the cut?

I have code like this:
nf <- read.csv("test2.csv")  # test2.csv contains 79 rows (document names) and one column of text per document
corpus <- Corpus(VectorSource(nf$segment))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, function(x) removeWords(x,"shall"))
corpus <- tm_map(corpus, function(x) removeWords(x,"will"))
corpus <- tm_map(corpus, function(x) removeWords(x,"can"))
corpus <- tm_map(corpus, function(x) removeWords(x,"could"))
corpus <- tm_map(corpus, stemDocument, language = "english")
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(td.mat))  # transpose so documents are the rows being clustered
ft <- hclust(dist.mat, method="ward.D2")
plot(ft)
The plot gives my dendrogram. I have a cluster dendrogram of the documents. If I cut it at height = 50, how can I get the terms at that level?
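A minimal sketch, assuming the ft and td.mat objects from the code above: cutree() assigns each document to a cluster at the chosen height, and the terms of a cluster are the rows of td.mat with nonzero counts in that cluster's documents.
groups <- cutree(ft, h = 50)  # cluster membership of each document at height 50
terms.by.cluster <- lapply(sort(unique(groups)), function(g) {
  docs <- which(groups == g)
  rownames(td.mat)[rowSums(td.mat[, docs, drop = FALSE]) > 0]
})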

Does R work with multilingual data?

We have prepared machine learning algorithms, such as a classification algorithm with factor features, and topic modelling on text data, where the text is in English.
The script prepared so far is below.
complete <- subset(complete,select=c(Group,Type,Text,Target))
data <- complete$Text
corpus <- Corpus(VectorSource(data))  # build the corpus before transforming it
corpus <- tm_map(corpus, content_transformer(tolower))
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "-")
corpus <- tm_map(corpus, toSpace, ":")
corpus <- tm_map(corpus, toSpace, ";")
corpus <- tm_map(corpus, toSpace, "#")
corpus <- tm_map(corpus, toSpace, "\\(" )
corpus <- tm_map(corpus, toSpace, ")")
corpus <- tm_map(corpus, toSpace, ",")
corpus <- tm_map(corpus, toSpace, "_")
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus,stemDocument)
tdm <- DocumentTermMatrix(corpus)
train1 <- as.matrix(tdm)
complete1 <- subset(complete,select=c(Group,Type,Target))
complete1 <- Filter(function(x)(length(unique(x))>1), complete1)
train <- cbind(complete1, train1)
train$Text <- NULL
train$Target <- as.factor(train$Target)
############################################################################################
# Model Run
############################################################################################
fit <-svm(Target ~ ., data = train)
termlist <- list(dictionary = Terms(tdm))
retval <- list(model = fit, termlist = termlist, complete = complete)
saveRDS(retval, "./modelTarget.rds")
Now we will be expecting data in other languages: Chinese, Korean, Japanese, French, Portuguese, and Spanish.
I wanted to check whether R supports these kinds of data, especially for text cleaning.
Please advise.
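A partial answer as a sketch: tm ships Snowball stopword lists for the European languages listed, so the same pipeline largely carries over for French, Portuguese, and Spanish, provided everything stays in UTF-8. Note that the removeSpecialChars line above strips everything outside a-zA-Z0-9 and would therefore delete all accented and CJK characters. Chinese, Japanese, and Korean have no space-delimited words, so they additionally need a language-aware tokenizer before the DocumentTermMatrix step.
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
corpus <- tm_map(corpus, removeWords, stopwords("portuguese"))
# keep the text in UTF-8 so non-ASCII characters survive the cleaning steps
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = "UTF-8")))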

Text Categorization using the mlr package in R

I need to train a model which would perform multilabel multiclass categorization on text data.
Currently, I'm using the mlr package in R. Unfortunately, I could not proceed because of an error I got before training the model.
More specifically, I'm stuck at this line:
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
and got this error:
Error in makeMultilabelTask(id = "classif", data = termsDf, target = target) :
  Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
I used this example: Multi-label text classification using mlr package in R
Here is the complete code snippet I'm using so far:
tm <- read.csv("translate_text_V02.csv", header = TRUE,
               stringsAsFactors = FALSE, na.strings = c("", "NA"))
process <- tm[, c("label", "text")]
process <- na.omit(process)
docs <- Corpus(VectorSource(process$text))
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, mystopwords)
  corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
  corpus <- tm_map(corpus, removeWords, stopwords("german"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument, language = "english")
  return(corpus)
}
clean_corp <- clean_corpus(docs)
terms <-DocumentTermMatrix(clean_corp)
m <- as.matrix(terms)
m <- cbind(m,process$label)
termsDf <- as.data.frame(m)
target <- unique(termsDf[,2628]) %>% as.character() %>% sort()
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
I created the data frame from the DocumentTermMatrix together with the label column, but I'm stuck on how to proceed with the machine learning part afterwards.
Questions:
How can I proceed further after creating the DocumentTermMatrix?
How do I apply the random forest algorithm to this particular dataset?
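A minimal sketch addressing the naming error only (an assumption, not verified against the original data): the assertion complains about column names, and base R's make.names() sanitizes them before the task is built. Note that makeMultilabelTask additionally expects one logical column per label, which the snippet above does not yet create, and random forest can only be applied through mlr's multilabel learners once the task itself is valid.
colnames(termsDf) <- make.names(colnames(termsDf), unique = TRUE)  # legal, unique column names
classify.task <- makeMultilabelTask(id = "classif", data = termsDf, target = target)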

R: Obtaining Single Term Frequencies instead of Bigrams

Here is the code I use to create bi-grams with a frequency list:
library(tm)
library(RWeka)
#data <- myData[,2]
tdm.generate <- function(string, ng){
  # tutorial on rweka - http://tm.r-forge.r-project.org/faq.html
  corpus <- Corpus(VectorSource(string)) # create corpus for TM processing
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  # corpus <- tm_map(corpus, removeWords, stopwords("english"))
  options(mc.cores=1) # http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng)) # create n-grams
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)) # create tdm from n-grams
  tdm
}
source("GenerateTDM.R") # generatetdm function in appendix
tdm <- tdm.generate("The book The book The greatest The book",2)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
topwords <- as.numeric(topwords)
hist(topwords, breaks = 10)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
head(sort(topwords, decreasing = TRUE))
The result for the above code is:
     the     book greatest
       4        3        1
Instead, I'm looking for a result where the bi-grams are shown, like:
"the book" "book the"
         3          2
What needs to be changed in the above code to get the output as above?
You need to use VCorpus instead of Corpus. I was having the same issue; you can check more details here.
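A sketch of that fix applied to the original function; reportedly Corpus() can return a SimpleCorpus, which ignores the custom tokenizer passed to TermDocumentMatrix, whereas VCorpus does not:
tdm.generate <- function(string, ng){
  corpus <- VCorpus(VectorSource(string))  # VCorpus instead of Corpus
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng))
  TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
}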

Why are some Cyrillic letters missing in the wordcloud?

I have a large corpus of Russian text. When I build a wordcloud, I see that some characters, like 'ч', are not rendered. The code looks like this:
library(tm)
library(wordcloud)
library(RColorBrewer)
dat <- read.csv("news.csv", sep=";", header=TRUE, stringsAsFactors=FALSE)
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader=readPlain, language="ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords,
                 stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud.png", width=640,height=640)
wordcloud(d$word,d$freq, scale=c(8,.2), min.freq=5, max.words=200,
random.order=FALSE, rot.per=0, colors=pal2)
dev.off()
EDIT
Oh, I did it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[from the OP's own edit, repeated here to complete the question and answer]
You need to add the following line along with the other tm_map() calls; presumably the source CSV is in cp1251, and converting to UTF-8 lets wordcloud render all the Cyrillic glyphs:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
