I have been trying to follow this example by Norbert Ryciak, whom I haven't been able to get in touch with.
Since the article was written in 2014, some things in R have changed, so I have been able to update parts of the code, but I got stuck on the last part.
Here is my working code so far:
library(tm)
library(stringi)
library(proxy)
wiki <- "https://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
"Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
"Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))
for (i in 1:length(titles)) {
articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
}
docs <- Corpus(VectorSource(articles))
docs[[1]]
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, content_transformer(tolower))
docs8[[1]]
docsTDM <- TermDocumentMatrix(docs8)
docsTDM2 <- as.matrix(docsTDM)
docsdissim <- dist(docsTDM2, method = "cosine")
But I haven't been able to get past this part:
docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
docsdissim2
h <- hclust(docsdissim, method = "ward.D")
plot(h, labels = titles, sub = "")
I tried to run hclust directly, and then I was able to plot, but nothing readable came out of it.
These are the errors I'm getting:
rownames(docsdissim2) <- titles
Error in `rownames<-`(`*tmp*`, value = c("Integral", "Riemann_integral", :
length of 'dimnames' [1] not equal to array extent
Another:
plot(h, labels = titles, sub = "")
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
invalid dendrogram input
Is there anyone who could give me a hand finishing this example?
Best regards,
I was able to solve this problem thanks to Norbert Ryciak (the author of the tutorial). Since he used an older version of "tm" (probably the latest at the time), it was not compatible with the one I used.
The solution was to replace "docsTDM <- TermDocumentMatrix(docs8)" with "docsTDM <- DocumentTermMatrix(docs8)". The reason: dist() computes distances between the rows of its input, and a TermDocumentMatrix has one row per term, so docsdissim had thousands of rows rather than one per article. That is why the 11 titles could not be assigned as dimnames and why hclust received an invalid input; a DocumentTermMatrix puts the documents in the rows.
So the final code:
library(tm)
library(stringi)
library(proxy)
wiki <- "https://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
"Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
"Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))
for (i in 1:length(titles)) {
articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
}
docs <- Corpus(VectorSource(articles))
docs[[1]]
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, content_transformer(tolower))
docs8[[1]]
docsTDM <- DocumentTermMatrix(docs8)
docsTDM2 <- as.matrix(docsTDM)
docsdissim <- dist(docsTDM2, method = "cosine")
docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
docsdissim2
h <- hclust(docsdissim, method = "ward.D")  # "ward" was renamed to "ward.D" in R >= 3.1
plot(h, labels = titles, sub = "")
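As a quick sanity check (not part of the original tutorial), the dissimilarity matrix should now be 11 x 11, one row and column per article, and cutting the tree should separate the mathematics articles from the people:
# Documents, not terms, must be the rows of the distance input
dim(as.matrix(docsdissim))   # expect 11 x 11 for the 11 titles
groups <- cutree(h, k = 2)   # e.g. mathematics articles vs. writers/painters
table(groups)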
I have gone through many answers here and tried all the suggestions given on Stack Overflow, but nothing seems to work for me. Is there a required order of transformations before creating a document-term matrix with the tm package in R?
email_corpus <- VCorpus(VectorSource(df2$final_text))
email_corpus_clean <- tm_map(email_corpus,content_transformer(tolower))
#remove special characters
for(j in seq(email_corpus_clean)) {
email_corpus_clean[[j]] <- gsub("\n", " ", email_corpus_clean[[j]])
email_corpus_clean[[j]] <- gsub("\r", " ", email_corpus_clean[[j]])
email_corpus_clean[[j]] <- gsub(">>", " ", email_corpus_clean[[j]])
}
email_corpus_clean <- tm_map(email_corpus_clean,removeNumbers)
myStopWords<- c("said","from","what")
email_corpus_clean <- tm_map(email_corpus_clean, removeWords, c(stopwords("english"), myStopWords))
email_corpus_clean <- tm_map(email_corpus_clean, removePunctuation)
email_corpus_clean <- tm_map(email_corpus_clean, stemDocument)
email_corpus_clean <- tm_map(email_corpus_clean,stripWhitespace)
# This is the line of code where I get the error
email_dtm <- DocumentTermMatrix(email_corpus_clean) #creating document term matrix
# this is the error
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
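For what it's worth, this error usually means that the direct gsub() assignments inside the loop replaced the documents with bare character vectors, so DocumentTermMatrix can no longer call meta() on them. A minimal sketch of the usual fix, assuming tm >= 0.6 (the toSpace helper is illustrative, not from the original code):
library(tm)
# Wrapping raw string functions in content_transformer() keeps each
# document a TextDocument instead of degrading it to a character vector
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
email_corpus_clean <- tm_map(email_corpus, content_transformer(tolower))
email_corpus_clean <- tm_map(email_corpus_clean, toSpace, "\n")
email_corpus_clean <- tm_map(email_corpus_clean, toSpace, "\r")
email_corpus_clean <- tm_map(email_corpus_clean, toSpace, ">>")
# ...then removeNumbers, removeWords, etc. as above, and finally:
email_dtm <- DocumentTermMatrix(email_corpus_clean)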
I know this has been asked multiple times. For example:
Finding 2 & 3 word Phrases Using R TM Package
However, I don't know why none of these solutions work with my data. The result is always unigrams, no matter which n I choose for the ngrams (2, 3, or 4).
Does anybody know the reason why? I suspect the encoding is the reason.
Edit: a small part of the data:
comments <- c("Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into problem_70918\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into tm-247\n",
"Merge branch 'php5.3-upgrade-sprint6-7' of git.internal.net:/git/pn-project/LegacyCodebase into release2012.08\n",
"Merge remote-tracking branch 'dmann1/p71148-s3-callplan_mapping' into lcst-operational-changes\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into TASK-360148\n",
"Merge remote-tracking branch 'grockett/rpr-pre' into rpr-lite\n"
)
cleanCorpus <- function(vector){
corpus <- Corpus(VectorSource(vector), readerControl = list(language = "en_US"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, tolower)
#corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
#corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
return(corpus)
}
# this function is provided by a team member (in the link I posted above)
test <- function(keywords_doc){
BigramTokenizer <- function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# creating of document matrix
keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
# remove sparse terms
keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.99)
# Frequency of the words appearing
keyword.freq <- rowSums(as.matrix(keywords_naremoval))
subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20)
frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
# Sorting of the words
frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]
# Printing of the words
# wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
return(frequentKeywordDF)
}
corpus <- cleanCorpus(comments)
t <- test(corpus)
> head(t)
term freq
added added 6
html html 6
tracking tracking 6
common common 4
emails emails 4
template template 4
Thanks,
I haven't found the reason either, but if you are only interested in the counts, regardless of which documents the bigrams occurred in, you could get them alternatively via this pipeline:
library(tm)
library(dplyr)
library(quanteda)
# ..construct the corpus as in your post ...
corpus %>%
  unlist() %>%
  tokens() %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  unlist() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  setNames("bigram") %>%
  group_by(bigram) %>%      # group_by_(".") is deprecated in current dplyr
  summarize(cnt = n()) %>%
  arrange(desc(cnt))
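For what it's worth, a likely reason for the unigrams (an assumption, not confirmed by the original answer): since tm 0.7, Corpus(VectorSource(...)) returns a SimpleCorpus, and TermDocumentMatrix on a SimpleCorpus silently ignores a custom tokenize control. A sketch of the usual workaround, reusing the comments vector and the tokenizer from the question, is to build a VCorpus instead:
library(tm)   # attaches NLP, which provides ngrams() and words()
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# A VCorpus (unlike the default SimpleCorpus) honours the custom tokenizer
corpus <- VCorpus(VectorSource(comments))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm)   # terms should now be bigrams such as "merge branch"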
I seem to run into a problem whenever I try to inspect my freq. words and associations.
When I make the tdm I get this info:
[screenshot: summary of the TermDocumentMatrix, showing a large number of terms and documents]
I can see I have plenty of terms to use, in plenty of documents.
However!
When I try to inspect the content of "tdm", I get this info:
[screenshot: inspect(tdm) output, showing an empty matrix]
How come the tdm is suddenly empty?
Hope someone can help.
tweets <- userTimeline("RDataMining", n = 1000)
(n.tweet <- length(tweets))
tweets[1:3]
#convert tweets to a data frame
tweets.df <- twListToDF(tweets)
dim(tweets.df)
##Text cleaning
library(tm)
#build a corpus and specify the source to be a character vector
myCorpus <- Corpus(VectorSource(tweets.df$text))
#convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove URLs
removeURL <- function(x) gsub ("http[^[:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeURL))
#remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeNumPunct))
#remove stopwords + 2
myStopwords <- c(stopwords('english'),"available","via")
#remove "r" and "big" from stopwords
myStopwords <- setdiff(myStopwords, c("r","big"))
#remove stopwords from corpus
myCorpus <- tm_map(myCorpus,removeWords,myStopwords)
#remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
#keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy <- myCorpus
#stem words
library(SnowballC)
myCorpus <- tm_map(myCorpus,stemDocument)
stemCompletion2 <- function(x,dictionary) {
x <- unlist(strsplit(as.character(x),""))
#because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this
x <- x[x !=""]
x <- stemCompletion(x, dictionary = dictionary)
x <- paste (x,sep = "",collapse = "")
PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
#count freq of "mining"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<mining")})
sum(unlist(miningCases))
#count freq of "miner"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<miner")})
sum(unlist(miningCases))
#count freq of "r"
miningCases <- lapply(myCorpusCopy,
function(x) {grep(as.character(x),pattern = "\\<r")})
sum(unlist(miningCases))
#replace "miner" with "mining"
myCorpus <- tm_map(myCorpus,content_transformer(gsub),
pattern = "miner", replacement = "mining")
tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE,stopwords = TRUE))
tdm
##Freq words and associations
idx <- which(dimnames(tdm)$Terms == "r")
inspect(tdm[idx + (0:5), 101:110])
#inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq,term.freq >= 15)
df <- data.frame(term = names(term.freq), freq = term.freq)
I've been using the following Twitter query to test your code:
tweets = searchTwitter("r data mining", n=10)
and I think the problem is with your function stemCompletion2, which should look something like this:
stemCompletion2 <- function(x,dictionary) {
x <- unlist(strsplit(as.character(x)," "))
print("before:")
print(x)
#because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this
x <- x[x !=""]
x <- stemCompletion(x, dictionary = dictionary)
print("after:")
print(x)
x <- paste(x, collapse = " ")
PlainTextDocument(stripWhitespace(x))
}
The modifications are as follows: before you had
x <- unlist(strsplit(as.character(x),""))
which was creating a list of all the characters in each of the documents, and I've modified it to
x <- unlist(strsplit(as.character(x)," "))
to create a list of words. Similarly, when recomposing your documents, you were doing
x <- paste (x,sep = "",collapse = "")
which was creating the long strings you mention in your post, and I've modified it to:
x <- paste(x, collapse = " ")
to recompose the words.
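For what it's worth, the difference between the two paste() calls in isolation (a standalone illustration, not from the original data):
x <- c("data", "mining")
paste(x, sep = "", collapse = "")   # "datamining"  -- one long string
paste(x, collapse = " ")            # "data mining" -- words recomposed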
One example of the completions, for my data, would be:
[1] "before:"
[1] "rt" "ebookdealalert" "r" "datamin" "project" "learn" "data" "mine"
[9] "realworld" "project" "book" "solv" "predict" "model"
[1] "after:"
rt ebookdealalert r datamin project learn data mine
"rt" "ebookdealalerts" "r" "datamining" "projects" "learn" "data" ""
realworld project book solv predict model
"realworld" "projects" "book" "solve" "predictive" "modeling"
After that step, you may be able to work with TermDocumentMatrix as expected.
Hope it helps.
I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create a matrix. The error is:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
For example, here is code from Jon Starkweather's text mining example. Apologies in advance for such long code, but it does produce a reproducible example. Please note that the error comes at the end, in the TermDocumentMatrix() call.
#Read in data
policy.HTML.page <- readLines("http://policy.unt.edu/policy/3-5")
#Obtain text and remove mark-up
policy.HTML.page[186:202]
id.1 <- 3 + which(policy.HTML.page == " TOTAL UNIVERSITY </div>")
id.2 <- id.1 + 5
text.data <- policy.HTML.page[id.1:id.2]
td.1 <- gsub(pattern = "<p>", replacement = "", x = text.data,
ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
td.2 <- gsub(pattern = "</p>", replacement = "", x = td.1, ignore.case = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
text.d <- td.2; rm(text.data, td.1, td.2)
#Create corpus and clean
library(tm)
library(SnowballC)
txt <- VectorSource(text.d); rm(text.d)
txt.corpus <- Corpus(txt)
txt.corpus <- tm_map(txt.corpus, tolower)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace); #inspect(docs[1])
txt.corpus <- tm_map(txt.corpus, stemDocument)
# NOTE ERROR WHEN CREATING TDM
tdm <- TermDocumentMatrix(txt.corpus)
The link provided by jazzurro points to the solution. The following line of code
txt.corpus <- tm_map(txt.corpus, tolower)
must be changed to
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
There are two reasons for this issue in tm v0.6:
1. If you are doing term-level transformations like tolower, tm_map returns a character vector instead of a PlainTextDocument.
Solution: call tolower through content_transformer, or call tm_map(corpus, PlainTextDocument) immediately after tolower (see the sketch below).
2. If the SnowballC package is not installed and you are trying to stem the documents, this can also occur.
Solution: install.packages('SnowballC')
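A minimal sketch of both solutions, assuming tm v0.6 or later and the txt.corpus from the question:
# Reason 1: keep documents as PlainTextDocument after term-level transformations
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
# ...or, equivalently, coerce back right after a raw transformation:
# txt.corpus <- tm_map(txt.corpus, tolower)
# txt.corpus <- tm_map(txt.corpus, PlainTextDocument)
# Reason 2: stemDocument needs the SnowballC package
# install.packages("SnowballC")
library(SnowballC)
txt.corpus <- tm_map(txt.corpus, stemDocument)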
There is no need to apply content_transformer. Create the corpus in this way:
trainData_corpus <- Corpus(VectorSource(trainData$Comments))
Try it.
I am doing a term paper on text mining using R. Our task is to guess the tone of an article (positive/negative). The articles are stored in respective folders. I need to create a classification system that will learn from training samples.
I reused the code from http://www.youtube.com/watch?v=j1V2McKbkLo
The entire code except the last line executed successfully. Following is the code:
tone<- c("Positive", "Negative")
folderpath <- "C:/Users/Tanmay/Desktop/R practice/Week8"
options(stringsAsFactors = FALSE)
corpus<-Corpus(DirSource(folderpath))
corpuscopy<-corpus
summary(corpus)
inspect(corpus)
#Clean data
CleanCorpus <- function(corpus){
corpustemp <- tm_map(corpus, removeNumbers)
corpustemp <- tm_map(corpus, removePunctuation)
corpustemp <- tm_map(corpus, tolower)
corpustemp <- tm_map(corpus, removeWords, stopwords("english"))
corpustemp <- tm_map(corpus, stemDocument,language="english")
corpustemp <- tm_map(corpus, stripWhitespace)
return(corpustemp )
}
#Document term matrix
generateTDM <- function(tone,path) {
corpusdir <- sprintf("%s/%s",path,tone)
corpus<- Corpus(DirSource( directory=corpusdir ,encoding = "ANSI"))
corpustemp <- CleanCorpus(corpus)
corpusclean <- DocumentTermMatrix(corpustemp)
corpusclean <- removeSparseTerms(corpusclean , 0.7)
result <- list(Tone = tone, tdm = corpusclean)
}
tdm <- lapply(tone,generateTDM,path=folderpath)
#Attach tone
ToneBindTotdm <- function(tdm){
temp.mat <- data.matrix(tdm[["tdm"]])
temp.df <- as.data.frame(temp.mat)
temp.df <- cbind(temp.df,rep(tdm[["Tone"]]),nrow(temp.df))
colnames(temp.df)[ncol(temp.df)] <- "PredictTone"
return(temp.df)
}
Tonetdm <- lapply(tdm,ToneBindTotdm)
#Stack
Stacktdm <- do.call(rbind.fill,Tonetdm)
Stacktdm[is.na(Stacktdm)] <- 0
#Holdout
trainid <- sample(nrow(Stacktdm),ceiling(nrow(Stacktdm) * 0.7))
testid <- (1:nrow(Stacktdm)) [- trainid]
#knn
tdmone <- Stacktdm[,"PredictTone"]
tdmone.nl <- Stacktdm[, !colnames(Stacktdm) %in% "PredictTone"]
knnPredict <- knn(tdmone.nl[trainid,],tdmone.nl[testid,],tdmone[trainid],k=5)
When I tried to execute this, I got an error on the last line (knn):
Error in knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
NAs introduced by coercion
2: In knn(tdmone.nl[trainid, ], tdmone.nl[testid, ], tdmone[trainid], :
NAs introduced by coercion
Could anyone please help me out? Also, if there are other, simpler or better ways to classify, please point me to them. Thanks, and sorry for the long post.
I was stuck on the same issue, but I modified it my way to remove all the NA values. You can check my code and compare it with yours to see what the problem might be.
#init
libs <- c("tm" , "plyr" , "class")
lapply(libs,require, character.only=TRUE)
#set options
options(stringsAsFactors = FALSE)
#set parameters
candidates <- c("user1" , "user2" ,"test")
pathname <- "C:/Users/prabhjot.rai/Documents/Project_r/textMining"
#clean text
cleanCorpus <- function(corpus)
{
corpus.tmp <- tm_map(corpus, removePunctuation)
corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
corpus.tmp <- tm_map(corpus.tmp, PlainTextDocument)
}
#build TDM
generateTDM <- function(cand,path)
{
s.dir <- sprintf("%s/%s", path, cand)
s.cor <- Corpus(DirSource(directory = s.dir))
s.cor.cl <- cleanCorpus(s.cor)
s.tdm <- TermDocumentMatrix(s.cor.cl)
s.tdm <- removeSparseTerms(s.tdm, 0.7)
result <- list(name = cand , tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
test <- t(data.matrix(tdm[[1]]$tdm))
rownames(test) <- c(1:nrow(test))
#attach name and convert to dataframe
makeMatrix <- function(thisTDM){
test <- t(data.matrix(thisTDM$tdm))
rownames(test) <- c(1:nrow(test))
test <- as.data.frame(test, stringsAsFactors = F , na.rm = T)
test$candidateName <- thisTDM$name
test <- as.data.frame(test, stringsAsFactors = F , na.rm = T)
}
candTDM <- lapply(tdm, makeMatrix)
# stack all the speeches together
tdm.stack <- do.call(rbind.fill, candTDM)
tdm.stack[is.na(tdm.stack)] <- as.numeric(0)
#testing and training sets
train <- tdm.stack[ tdm.stack$candidateName!= 'test' , ]
train <- train[, names(train) != 'candidateName']
test <- tdm.stack[ tdm.stack$candidateName == 'test' , ]
test <- test[, names(test) != 'candidateName']
classes <- tdm.stack [ tdm.stack$candidateName != 'test' , 'candidateName']
classes <- as.factor(classes)
myknn <- knn(train=train, test = test , cl = classes , k=1)
myknn
Keep a testing file in the test folder, next to the user1 and user2 folders, to check the output of this algorithm. Keep the value of k near the square root of the number of speeches, preferably an odd number. And ignore the redundancy of the testing and training set assignment; it was not working in one line on my machine, so I did it in two.
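For instance, the suggested k could be computed rather than hard-coded (a sketch reusing train, test, and classes from above):
# Assumption: pick k near sqrt(number of training documents),
# rounded to an odd number to avoid tied votes
k.odd <- max(1, 2 * floor(sqrt(nrow(train)) / 2) + 1)
myknn <- knn(train = train, test = test, cl = classes, k = k.odd)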