Cannot form clusters with text mining in R

I need to cluster texts (text mining of Russian texts).
Here is the code:
mydat <- read.csv("C:/Users/Admin/Downloads/kr_csv.csv", sep = ";", dec = ",")
View(mydat)
library("tm")
library("SnowballC")
library("textcat")
corpus <- Corpus(VectorSource(mydat))
dtm <- DocumentTermMatrix(corpus,
                          control = list(stemming = TRUE, stopwords = FALSE,
                                         minWordLength = 3, removeNumbers = TRUE,
                                         removePunctuation = TRUE,
                                         # stopwords = c(stopwords('SMART'))
                                         weighting = function(x) weightTf(x)))
m <- as.matrix(dtm)
norm_eucl <- function(m) m / apply(m, 1, function(x) sum(x^2)^0.5)
m_norm <- norm_eucl(m)
res <- kmeans(m_norm, 3, 100)
I chose a 3-cluster solution and got this error:
Error in sample.int(m, k) :
  cannot take a sample larger than the population when 'replace = FALSE'
This would mean that I cannot ask for more than one cluster, but that cannot be right. How can I extract multiple clusters?
> dim(m_norm)
[1]    1 4298
head(m_norm, 20) prints a single row whose 4298 column names are Russian terms and bigrams ("артикул madc", "аскуэ зао хакель", "бирка кабельная", "блок клапанный метран", ...); I copied only part of the output. Note the dimensions: the whole table was read in as a single document.

I found that the error was in the preprocessing.
Here is the working code:
library(tm)
library(SnowballC)  # needed for stemDocument

mydat <- read.csv("C:/Users/Admin/Downloads/kr_csv.csv", sep = ";", dec = ",")

tw.corpus <- Corpus(VectorSource(mydat$name))  # one document per row, not one per column
tw.corpus <- tm_map(tw.corpus, stripWhitespace)
tw.corpus <- tm_map(tw.corpus, removePunctuation)
tw.corpus <- tm_map(tw.corpus, removeNumbers)
tw.corpus <- tm_map(tw.corpus, removeWords, stopwords("russian"))
tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
tw.corpus <- tm_map(tw.corpus, stemDocument, language = "russian")  # use the Russian stemmer

doc.m      <- DocumentTermMatrix(tw.corpus)
dtm_tfxidf <- weightTfIdf(doc.m)
m          <- as.matrix(dtm_tfxidf)
rownames(m) <- 1:nrow(m)

# normalise each row to unit Euclidean length
norm_eucl <- function(m) m / apply(m, 1, function(x) sum(x^2)^0.5)
m_norm <- norm_eucl(m)
> dim(m_norm)
[1] 399 860
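With 399 separate documents, the 3-cluster k-means now has enough rows to sample centres from. A minimal sketch of the clustering step that was failing before (the keep filter is my addition: any document whose tf-idf weights are all zero becomes NaN after the Euclidean normalisation and would break kmeans):
# keep only documents with at least one non-zero tf-idf weight
keep   <- rowSums(m) > 0
m_norm <- m_norm[keep, ]

res <- kmeans(m_norm, centers = 3, iter.max = 100)
table(res$cluster)                        # documents per cluster
head(mydat$name[keep][res$cluster == 1])  # peek at the texts in cluster 1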

Related

Use mapply to replace patterns in a vector with replacements in a vector in tm

Hi there: I'm using the tm package for some text analysis and I need to substitute each term in a vector of patterns with its paired replacement from a vector of replacements. The pattern/replacement dictionary looks like this:
#pattern -replacement dictionary
df<-data.frame(replace=c('crude', 'oil', 'price'), with=c('xcrude', 'xoil', 'xprice'))
#load tm
library(tm)
#load crude
data('crude')
I tried this and received an error:
tm_map(crude, mapply, gsub, df$replace, df$with)
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
Based on this answer you could use stringi and wrap it in content_transformer() to preserve the corpus structure:
library(stringi)
corp <- tm_map(crude, content_transformer(
  function(x) {
    stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
  })
)
Or multigsub from qdap:
library(qdap)
corp <- tm_map(crude, content_transformer(
  function(x) {
    multigsub(df$replace, df$with, x, fixed = FALSE)
  })
)
Which gives:
> corp[[1]][1]
"Diamond Shamrock Corp said that\neffective today it had cut its
contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to
16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices
and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
Diamond is the latest in a line of U.S. xoil companies that\nhave
cut its contract, or posted, xprices over the last two
days\nciting weak xoil markets.\n Reuter"
You can then apply other tm functions on the resulting corpus:
> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity : 91%
#Maximal term length: 17
#Weighting : term frequency (tf)
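As a quick sanity check (a sketch; it assumes the replacements above ran), the substituted tokens should now show up as terms in the matrix:
dtm <- DocumentTermMatrix(corp)
inspect(dtm[1:5, c("xcrude", "xoil", "xprice")])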

DocumentTermMatrix error on Corpus argument

I have the following code:
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)
news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
When I run the DocumentTermMatrix() method, it gives me this error:
Error: inherits(doc, "TextDocument") is not TRUE
Why do I get this error? Are my rows not text documents?
Here is the output upon inspecting corpus_clean:
[[153]]
[1] obama holds technical school model us
[[154]]
[1] oil boom produces jobs bonanza archaeologists
[[155]]
[1] islamic terrorist group expands territory captures tikrit
[[156]]
[1] republicans democrats feel eric cantors loss
[[157]]
[1] tea party candidates try build cantor loss
[[158]]
[1] vehicles materials stored delaware bridges
[[159]]
[1] hill testimony hagel defends bergdahl trade
[[160]]
[1] tweet selfpropagates tweetdeck
[[161]]
[1] blackwater guards face trial iraq shootings
[[162]]
[1] calif man among soldiers killed afghanistan
[[163]]
[1] stocks fall back world bank cuts growth outlook
[[164]]
[1] jabhat alnusra longer useful turkey
[[165]]
[1] catholic bishops keep focus abortion marriage
[[166]]
[1] barbra streisand visits hill heart disease
[[167]]
[1] rand paul cantors loss reason stop talking immigration
[[168]]
[1] israeli airstrike kills northern gaza
Edit: Here is my data:
type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods
Which I retrieve like:
news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)
Edit: Here is the traceback():
> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)
When I evaluate inherits(corpus_clean, "TextDocument") it is FALSE.
It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They instead return characters, and DocumentTermMatrix doesn't know how to handle a corpus of characters.
So you could change to
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
Or you can run
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
after all of your non-standard transformations (those not in getTransformations()) are done, and just before you create the DocumentTermMatrix. That should make sure all of your data is stored as PlainTextDocuments, and should make DocumentTermMatrix happy.
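Putting the two fixes together, the asker's pipeline would look like this (a sketch: both non-standard transformations, tolower and trim, are wrapped in content_transformer(); everything else is unchanged):
library(tm)
trim <- function(x) gsub("^\\s+|\\s+$", "", x)  # strip leading/trailing whitespace

news_corpus  <- Corpus(VectorSource(news_raw$text))
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))
news_dtm     <- DocumentTermMatrix(corpus_clean)  # no longer errors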
I found a way to solve this problem in an article about tm.
An example in which the error occurs follows below:
getwd()
require(tm)
files  <- DirSource(directory = "texts/", encoding = "latin1")  # import files
corpus <- VCorpus(x = files)  # load files, create corpus
summary(corpus)               # get a summary
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
matrix_terms <- DocumentTermMatrix(corpus)
Warning messages:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
This error occurs because DocumentTermMatrix needs documents of a text-document class, but the previous transformations turn the contents of your corpus into plain character vectors, a class the function does not accept. However, if you add content_transformer inside the tm_map command, you may not need even one more command before using TermDocumentMatrix to keep going.
The code below changes the class (see the second-to-last line) and avoids the error:
getwd()
require(tm)
files  <- DirSource(directory = "texts/", encoding = "latin1")
corpus <- VCorpus(x = files)  # load files, create corpus
summary(corpus)               # get a summary
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
corpus <- Corpus(VectorSource(corpus))  # change class
matrix_term <- DocumentTermMatrix(corpus)
Change this:
corpus_clean <- tm_map(news_corpus, tolower)
To this:
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
This should work.
Alternatively, you can roll back to tm 0.5-10, where the old behaviour still works:
remove.packages("tm")
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip", repos = NULL)
library(tm)

R tm package: utf-8 text

I would like to create a wordcloud for non-English text in UTF-8 (actually, it's in the Kazakh language).
The text is displayed absolutely right by the inspect function of the tm package.
However, when I look at word frequencies, everything is displayed incorrectly: the text is shown with coded characters instead of words (the plain Cyrillic characters are displayed correctly). Consequently, the wordcloud becomes a complete mess.
Is it possible to assign an encoding to the tm functions somehow? I tried, but the text on its own is fine; the problem only appears when using the tm package.
Let a sample text be:
Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді.
Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді.
Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық.
Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы.
Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.
My simple code (based on onertipaday.blogspot.com tutorials) is this:
require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
1 2
44 4
findFreqTerms(ap.tdm, lowfreq=2)
[1] "<U+04D9>лем" "арман" "еді"
[4] "м<U+04D9><U+04A3>гілік"
Those words should be: "әлем", "арман", "еді", "мәңгілік". They are displayed correctly in the inspect(ap.corpus) output.
I'd highly appreciate any help! :)
The problem comes from the default tokenizer. By default tm uses scan_tokenizer, which loses the encoding (maybe you should contact the maintainer to add an encoding argument):
scan_tokenizer
function (x)
{
    scan(text = x, what = "character", quote = "", quiet = TRUE)
}
One solution is to provide your own tokenizer when creating the term matrix. I am using strsplit:
scanner <- function(x) strsplit(x," ")
ap.tdm <- TermDocumentMatrix(ap.corpus,control=list(tokenize=scanner))
Then you get correctly encoded results:
findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
Actually, I disagree with agstudy's answer. It does not seem to be a tokenizer problem. I'm using version 0.6.0 of the tm package and your code works just fine for me, except that I had to explicitly set the encoding of your text data to UTF-8 using:
Encoding(text) <- "UTF-8"
Below is the complete piece of reproducible code. Just make sure you save it in a file with UTF-8 encoding, and use source() to run it; do not use source.with.encoding(), as it will throw an error.
text <- "Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді."
Encoding(text)
# [1] "unknown"
Encoding(text) <- "UTF-8"
Encoding(text)
# [1] "UTF-8"
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
content(ap.corpus[[1]])
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
print(table(ap.d$freq))
# 1 2 3
# 62 5 1
print(findFreqTerms(ap.tdm, lowfreq=2))
# [1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
It worked for me, hope it does for you too.
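For completeness, since the original goal was a wordcloud: once ap.d is built from the correctly encoded matrix, the cloud itself is a single call (a sketch; the Dark2 palette is just one choice):
library(wordcloud)
library(RColorBrewer)  # for brewer.pal; any colour vector works
wordcloud(words = ap.d$word, freq = ap.d$freq, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))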

R tm package and cyrillic text

I am trying to do some text mining with Russian text using the tm package and have some issues.
Preprocessing speed heavily depends on encoding:
library(tm)
rus_txt<-paste(readLines('http://lib.ru/LITRA/PUSHKIN/dubrowskij.txt',encoding='cp1251'), collapse=' ')
object.size(rus_txt)
eng_txt<-paste(readLines('http://www.gutenberg.org/cache/epub/1112/pg1112.txt',encoding='UTF-8'), collapse=' ')
object.size(eng_txt)
# text sizes nearly identical
rus_txt_utf8<-iconv(rus_txt, to='UTF-8')
system.time(rus_txt_lower<-tolower(rus_txt_utf8))
#3.17 0.00 3.19
system.time(rus_txt_lower<-tolower(eng_txt))
#0.03 0.00 0.03
system.time(rus_txt_lower<-tolower(rus_txt))
#0.07 0.00 0.08
The native encoding is 40 times faster! And on large corpora the difference was up to 500 times!
Let's try to tokenize some text (this function is used in TermDocumentMatrix):
some_text<-"Несколько лет тому назад в одном из своих поместий жил старинный
русской барин, Кирила Петрович Троекуров. Его богатство, знатный род и связи
давали ему большой вес в губерниях, где находилось его имение. Соседи рады
были угождать малейшим его прихотям; губернские чиновники трепетали при его
имени; Кирила Петрович принимал знаки подобострастия как надлежащую дань;
дом его всегда был полон гостями, готовыми тешить его барскую праздность,
разделяя шумные, а иногда и буйные его увеселения. Никто не дерзал
отказываться от его приглашения, или в известные дни не являться с должным
почтением в село Покровское."
scan_tokenizer(some_text)
#[1] "Несколько" "лет" "тому" "назад" "в" "одном" "из" "своих"
# [9] "поместий" "жил" "старинный" "русской" "барин," "Кирила" "Петрович" "Троекуров."
#[17] "Его" "богатство," "знатный" "род" "и" "св"
Oops... it seems the R core function scan() treats the Russian lowercase letter 'я' as EOF. I tried different encodings, but I have no answer for how to fix this.
OK, let's try to remove punctuation:
removePunctuation("жил старинный русской барин, Кирила Петрович Троекуров")
#"жил старинный русской барин Кирила Петрови Троекуров"
Hmm... where is the letter 'ч'? OK, with UTF-8 encoding this works fine, but it took some time to find that out.
I also had a performance issue with the removeWords() function, but I can't reproduce it.
The main question is: how do I read and tokenize texts containing the letter 'я'?
My locale:
Sys.getlocale()
#[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
1) Question: how to read and tokenize texts with the letter 'я'? Answer: try writing your own tokenizer and using it. For example:
my_tokenizer <- function(x)
{
  strsplit(iconv(x, to = 'UTF-8'), split = '([[:space:]]|[[:punct:]])+', perl = FALSE)[[1]]
}
TDM <- TermDocumentMatrix(corpus,
                          control = list(tokenize = my_tokenizer,
                                         weighting = weightTf,
                                         wordLengths = c(3, 10)))
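A quick sanity check (a sketch of the expected behaviour): applied to the fragment that removePunctuation mangled above, this tokenizer keeps 'я' and 'ч' intact:
my_tokenizer("жил старинный русской барин, Кирила Петрович Троекуров")
# should return: "жил" "старинный" "русской" "барин" "Кирила" "Петрович" "Троекуров"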
2) Performance heavily depends on... the performance of the tolower function. Maybe this is a bug, I don't know, but every time you call it you have to first convert your text into the native encoding using enc2native (only needed, of course, if your text language is not English):
doc.corpus <- Corpus(VectorSource(enc2native(textVector)))
Moreover, after all the text preprocessing on your corpus you have to convert it again, because TermDocumentMatrix and many other functions in the tm package internally use tolower:
tm_map(doc.corpus, enc2native)
So your full flow will look something like this:
createCorp <- function(textVector)
{
  doc.corpus <- Corpus(VectorSource(enc2native(textVector)))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("russian"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "russian")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  return(tm_map(doc.corpus, enc2native))
}
my_tokenizer <- function(x)
{
  strsplit(iconv(x, to = 'UTF-8'), split = '([[:space:]]|[[:punct:]])+', perl = FALSE)[[1]]
}
TDM <- TermDocumentMatrix(corpus,
                          control = list(tokenize = my_tokenizer,
                                         weighting = weightTf,
                                         wordLengths = c(3, 10)))
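As a quick check of the claim in point 2 on the question's own data (a sketch; rus_txt is the Pushkin text loaded above):
rus_native <- enc2native(iconv(rus_txt, to = 'UTF-8'))
system.time(tolower(rus_native))
# should be close to the 0.07 s native timing above, not the 3.17 s UTF-8 timing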

R topic modeling: lda model labeling function

I used LDA to build a topic model for two text documents, say A and B. Document A is highly related to, say, computer science, and document B is highly related to, say, geo-science. Then I trained an LDA using this command:
library(tm)
library(topicmodels)

text <- c(A, B)                  # introduced above
r <- Corpus(VectorSource(text))  # create corpus object
r <- tm_map(r, tolower)          # convert all text to lower case
r <- tm_map(r, removePunctuation)
r <- tm_map(r, removeNumbers)
r <- tm_map(r, removeWords, stopwords("english"))
r.dtm <- TermDocumentMatrix(r, control = list(minWordLength = 3))
my_lda <- LDA(r.dtm, 2)
Now I want to use my_lda to predict the topic of a new document, say C, and see whether it is related to computer science or geo-science. I know that if I use this code for prediction:
x <- C  # a new document (a long string) introduced above for prediction
rp <- Corpus(VectorSource(x)) # create corpus object
rp <- tm_map(rp, tolower) # convert all text to lower case
rp <- tm_map(rp, removePunctuation)
rp <- tm_map(rp, removeNumbers)
rp <- tm_map(rp, removeWords, stopwords("english"))
rp.dtm <- TermDocumentMatrix(rp, control = list(minWordLength = 3))
test.topics <- posterior(my_lda,rp.dtm)
it will give me a label 1 or 2, but I have no idea what 1 or 2 represent... How can I tell whether a label means computer-science-related or geo-science-related?
You can extract the most likely terms from your LDA topic model and replace those black-box numeric names with however many of them you would like. Your example isn't reproducible, but here is an example illustrating how you can do this:
> library(topicmodels)
> data(AssociatedPress)
>
> train <- AssociatedPress[1:100]
> test <- AssociatedPress[101:150]
>
> train.lda <- LDA(train,2)
>
> #returns those black box names
> test.topics <- posterior(train.lda,test)$topics
> head(test.topics)
1 2
[1,] 0.57245696 0.427543038
[2,] 0.56281568 0.437184320
[3,] 0.99486888 0.005131122
[4,] 0.45298547 0.547014530
[5,] 0.72006712 0.279932882
[6,] 0.03164725 0.968352746
> #extract top 5 terms for each topic and assign as variable names
> colnames(test.topics) <- apply(terms(train.lda,5),2,paste,collapse=",")
> head(test.topics)
percent,year,i,new,last new,people,i,soviet,states
[1,] 0.57245696 0.427543038
[2,] 0.56281568 0.437184320
[3,] 0.99486888 0.005131122
[4,] 0.45298547 0.547014530
[5,] 0.72006712 0.279932882
[6,] 0.03164725 0.968352746
> #round to one topic if you'd prefer
> test.topics <- apply(test.topics,1,function(x) colnames(test.topics)[which.max(x)])
> head(test.topics)
[1] "percent,year,i,new,last" "percent,year,i,new,last" "percent,year,i,new,last"
[4] "new,people,i,soviet,states" "percent,year,i,new,last" "new,people,i,soviet,states"
