I have used removePunctuation() from the "tm" package in R when building a term-document matrix. For some reason I am still left with strange characters in my plot of letters versus their proportion in a corpus I've analyzed.
Below is the code I used to clean the corpus:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/|#|\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
words <- dtm %>% as.matrix %>% colnames %>% (function(x) x[nchar(x) < 20])
library(dplyr)
library(stringr)
library(qdap)      # provides dist_tab()
library(ggplot2)
words %>%
  str_split("") %>%
  sapply(function(x) x[-1]) %>%
  unlist %>%
  dist_tab %>%
  mutate(Letter = factor(toupper(interval),
                         levels = toupper(interval[order(freq)]))) %>%
  ggplot(aes(Letter, weight = percent)) +
  geom_bar() +
  coord_flip() +
  ylab("Proportion") +
  scale_y_continuous(breaks = seq(0, 12, 2),
                     labels = function(x) paste0(x, "%"),
                     expand = c(0, 0), limits = c(0, 12))
I'm left with the following plot:
I'm trying to figure out what went wrong here.
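One hypothesis worth checking (an assumption on my part, not something the plot confirms): removePunctuation() in tm only strips ASCII punctuation by default, so curly quotes and other non-ASCII glyphs can survive into the term list. A minimal sketch that rules this out by keeping only tokens made of plain letters, reusing the words vector built above:
library(stringr)
# keep only tokens that consist entirely of ASCII lowercase letters
words <- str_subset(words, "^[a-z]+$")
# newer versions of tm can also strip Unicode punctuation directly:
# docs <- tm_map(docs, removePunctuation, ucp = TRUE)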
I have been trying to run the TermDocumentMatrix function on my corpus of texts, but R and RStudio gave me an error:
Error in tdm(txt, isTRUE(control$removePunctuation),
isTRUE(control$removeNumbers), : function 'Rcpp_precious_remove'
not provided by package 'Rcpp'
After trying to solve this with update.packages('Rcpp') and library(Rcpp), both R and RStudio stopped responding, so I have to end the R session and cannot proceed to build the TermDocumentMatrix and then the wordcloud.
I also googled and searched this problem here and read through several debugging threads, including "TermDocumentMatrix errors in R" and "R-Project no applicable method for 'meta' applied to an object of class "character"", but I am still stuck on this code and cannot proceed to produce a visual representation via wordcloud and hist for the most frequent words in my corpus.
I really appreciate any kind of help in this regard.
Here is my entire code:
#load tm package
library(tm)
#loading required package: NLP
#Create Corpus
docs <- Corpus(DirSource('C:/Users/x/Desktop/TextMiningR/Mix22'))
inspect(docs)
#start pre-processing
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ",x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, "'")
docs <- tm_map(docs, toSpace, " -")
docs <- tm_map(docs, toSpace, "'")
#remove punctuation
docs <- tm_map(docs, removePunctuation)
#transfer to lowercase
docs <- tm_map(docs, content_transformer(tolower))
#strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords from standard stopword list
docs <- tm_map(docs, removeWords, stopwords("english"))
#strip whitespace
docs <- tm_map(docs, stripWhitespace)
#inspect output
inspect(docs)
library(Rcpp)
#create document-term matrix: the following line of code stops executing:
dtm <- DocumentTermMatrix(docs)
#or
dtm <- as.matrix(TermDocumentMatrix(docs))
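A commonly suggested fix for the 'Rcpp_precious_remove' error (assuming, as is typical for this message, that some package was compiled against a newer Rcpp than the one installed) is to reinstall Rcpp rather than update it, then restart R before loading tm again. A sketch:
install.packages("Rcpp")
# restart the R session here (in RStudio: Session > Restart R)
library(tm)
dtm <- DocumentTermMatrix(docs)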
I've got a problem trying to use findAssocs() after converting a file from PDF with pdf_text() from the pdftools package.
I've more or less pinpointed the issue: because I can't use readLines(), the Corpus creates a separate document for each page in the PDF. So when I get to findAssocs(), it returns 1s because the terms appear on both pages.
Is there a workaround? For reference: code down below.
Thanks in advance :).
text <- pdf_text(file.choose())
docs <- Corpus(VectorSource(text))
inspect(docs)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
as.data.frame(findAssocs(dtm, terms = input$v, corlimit = 0.3))
If you want to combine all the pages you loaded with pdf_text into one field, you can use paste(unlist(text), collapse =" ") before you transform the text into a corpus.
# my test pdf consists of 20 pages.
text <- pdf_text(file.choose())
summary(text)
Length Class Mode
20 character character
# collapse the text into one field
text <- paste(unlist(text), collapse = " ")
summary(text)
Length Class Mode
1 character character
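From there the corpus construction from the question applies unchanged, now yielding a single combined document instead of one document per page:
docs <- Corpus(VectorSource(text))
inspect(docs)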
I am trying to add code so that when I count keyword occurrences, I can see both single words and phrases in the variable "d". The counts should also not be duplicated (if a term is counted among the single words, it should not appear in the phrases again). I am using the packages "tm", "NLP", "RColorBrewer", "wordcloud" and "SnowballC".
text <- readLines(file.choose())
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("pictwittercom", "twittertrendingtopics", "wwwtrendinaliacom", "singaporetodayhtml", "aldubpanibagonglihim", "http", "https", "bitly", "dlvr", "sgt", "trndnl", "niall", "wwwswarmappcom", "kak", "mtbnn", "vmas", "lang", "youtubecom", "untuk", "dan", "bagus", "sakit", "membantu", "kahit", "lahat", "mga", "pag", "tao", "kung", "akan", "penyakit"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The issue now is that when I print d to the console, only single words appear. I want to see both phrases and single words, with no duplicated counts. Please advise, as I have been looking through Stack Overflow for hours and still can't find a good solution.
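Not from the original post, but one way to sketch this with the packages already listed: hand TermDocumentMatrix a tokenizer that emits bigrams alongside single words, using ngrams() from NLP, so phrases land in d too:
library(NLP)
# emit unigrams plus bigrams for each document
UniBigramTokenizer <- function(x) {
  tokens <- words(x)
  c(tokens, unlist(lapply(ngrams(tokens, 2), paste, collapse = " "),
                   use.names = FALSE))
}
dtm <- TermDocumentMatrix(docs, control = list(tokenize = UniBigramTokenizer))
Note that this counts a word both on its own and inside its bigrams; removing that double counting, as the question asks, still needs a filtering step afterwards.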
I am trying to create a corpus, but I want it to combine 2 consecutive words per document; I don't want a corpus of single words.
I am using the script below. Is there a way to create the corpus "docs" so that it includes combined pairs of consecutive words in each document? Please advise.
library(plyr)
library(tm)
library(e1071)
setwd("C:/Assignment/Assignment-Group-Prediction/IPM")
training<- read.csv("Data.csv",header=T,na.strings=c(""))
Res_Desc_Train <- subset(training,select=c("Group","Description"))
##Step 1 : Create Document Matrix
docs <- Corpus(VectorSource(Res_Desc_Train$Description))
docs <- tm_map(docs, content_transformer(tolower))
#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, ";")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\(" )
docs <- tm_map(docs, toSpace, ")")
docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, "_")
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
The FAQ of the tm package answers your question directly:
Can I use bigrams instead of single tokens in a term-document matrix?
Yes. Package NLP provides functionality to compute n-grams which can be used to construct a corresponding tokenizer. E.g.:
library("tm")
data("crude")
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(removeSparseTerms(tdm[, 1:10], 0.7))
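The same tokenizer should plug straight into the corpus built in the question (assuming the docs object from the script above):
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))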
I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents I get high-frequency words such as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but I get these random Greek letters with others. Is this a problem with character encoding? (By the way, all the documents are in English.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
content(pdf)[1:4]
}
docs<- Corpus(URISource(uri, mode = ""),
readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
We can visualize the most frequently used words in your PDF file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
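If the leftover stem pairs bother you, tm also provides stemCompletion(), which maps stems back to the most frequent matching word from a reference dictionary. A sketch, where docs_raw is a hypothetical copy of the corpus saved before stemDocument was applied:
# docs_raw: hypothetical unstemmed copy of the corpus, kept before stemming
stemCompletion(c("issu", "method"), dictionary = docs_raw)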