Collapse "docs" for textmining in R - r

I've got a problem using findAssocs after converting a file from PDF with pdf_text from the pdftools package.
I've more or less pinpointed the issue: because I can't use readLines, the Corpus creates a separate document for each page of the PDF. So when I get to findAssocs, it returns 1's because the terms appear on both pages.
Is there a workaround? Code is below for reference.
Thanks in advance :).
library(pdftools)
library(tm)
text <- pdf_text(file.choose())
docs <- Corpus(VectorSource(text))
inspect(docs)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
as.data.frame(findAssocs(dtm, terms = input$v, corlimit = 0.3))

If you want to combine all the pages you loaded with pdf_text into one field, you can use paste(unlist(text), collapse =" ") before you transform the text into a corpus.
# my test pdf consists of 20 pages
text <- pdf_text(file.choose())
summary(text)
#    Length     Class      Mode
#        20 character character

# collapse the text into one field
text <- paste(unlist(text), collapse = " ")
summary(text)
#    Length     Class      Mode
#         1 character character
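From there you can rebuild the corpus from the single collapsed field and rerun the cleaning steps from the question. A minimal sketch, assuming the same transformations as above:
# one-document corpus built from the collapsed text
docs <- Corpus(VectorSource(text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)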

Related

How can I do a frequency count for words and phrases in the same TermDocumentMatrix using the packages tm, SnowballC, RColorBrewer, wordcloud and NLP?

I am trying to add code so that when I count keyword occurrences, I can see both single words and phrases in the variable "d". The counts should also not be duplicated (if a term is counted among the single words, it should not appear again among the phrases). I am using the packages "tm", "NLP", "RColorBrewer", "wordcloud" and "SnowballC".
library(tm)
library(wordcloud)
library(RColorBrewer)
text <- readLines(file.choose())
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("pictwittercom", "twittertrendingtopics", "wwwtrendinaliacom", "singaporetodayhtml", "aldubpanibagonglihim", "http", "https", "bitly", "pictwittercom", "dlvr", "sgt", "trndnl", "niall", "wwwswarmappcom", "kak", "mtbnn", "vmas", "lang", "youtubecom", "untuk", "dan", "bagus", "sakit", "membantu", "kahit", "lahat", "mga", "pag", "tao", "kung", "akan", "penyakit"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
The issue now is that when I print d to the console, only single words appear. I want to see both phrases and single words, and the counts should not be duplicated. Please advise, as I have been looking through Stack Overflow for hours and still can't find a good solution.
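One possible direction, sketched here as an assumption rather than an answer from the original thread, is to build the term-document matrix with a custom tokenizer that emits both single words and two-word phrases, in the spirit of the bigram tokenizer from the tm FAQ quoted further down. Note that this alone does not prevent a word that also occurs inside a phrase from being counted in both places; deduplication would need extra handling.
library(NLP)
# hypothetical tokenizer returning both unigrams and bigrams
UniBigramTokenizer <- function(x) {
  unigrams <- words(x)
  bigrams <- unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
                    use.names = FALSE)
  c(unigrams, bigrams)
}
dtm <- TermDocumentMatrix(docs, control = list(tokenize = UniBigramTokenizer))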

RStudio - How to create a WordCloud with Spanish characters like: à, è, ì, ò, ù, ñ

I'm importing a txt file with words in Spanish, because I want to create a word cloud...
The problem is that the words show up without accent marks inside my word cloud...
There are words like "México" that are displayed as "mc3a9xico"???
library(tm)
library(wordcloud)
library(RColorBrewer)
text <- readLines(file.choose())
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)
#Generate WordCloud
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
The problem was that I hadn't set my system locale. After several attempts to change it to Spanish, I kept getting this error: "OS reports request to set locale to "sp_MX.UTF-8" cannot be honored". So I ended up using this:
Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8")
And after that everything worked.
Thanks to @hrbrmstr, who pointed me to the actual problem :)
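If the garbling happens at import time rather than in the plotting, it can also help to declare the encoding explicitly when reading the file. A small sketch, offered as an assumption rather than part of the original answer:
# read the file as UTF-8 so accented characters such as "é" survive the import
text <- readLines(file.choose(), encoding = "UTF-8")
docs <- Corpus(VectorSource(text))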

Create a corpus by combining words in R

I am trying to create a corpus, but I want to combine 2 consecutive words in each document; I don't want a corpus of single words.
I am using the script below. Is there a way to create the corpus "docs" so that it includes combined pairs of 2 consecutive words from each document? Please advise.
library(plyr)
library(tm)
library(e1071)
setwd("C:/Assignment/Assignment-Group-Prediction/IPM")
training<- read.csv("Data.csv",header=T,na.strings=c(""))
Res_Desc_Train <- subset(training,select=c("Group","Description"))
##Step 1 : Create Document Matrix
docs <- Corpus(VectorSource(Res_Desc_Train$Description))
docs <-tm_map(docs,content_transformer(tolower))
#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, ";")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\(" )
docs <- tm_map(docs, toSpace, ")")
docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, "_")
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
The FAQ of the tm package answers your question directly:
Can I use bigrams instead of single tokens in a term-document matrix?
Yes. Package NLP provides functionality to compute n-grams which can be used to construct a corresponding tokenizer. E.g.:
library("tm")
data("crude")
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(removeSparseTerms(tdm[, 1:10], 0.7))

Text mining pdf files/issues with word frequencies

I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words turn out to be phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but I get these random Greek letters with others. Is this a problem with character encoding? (By the way, all the documents are in English.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}
docs <- Corpus(URISource(uri, mode = ""),
               readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
We can visualize the result of the most frequently used words in your pdf file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
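If the leftover stems bother you, one option, sketched here as an assumption rather than part of the original answer, is tm's stemCompletion(), which maps stems back to a complete form drawn from a dictionary of unstemmed words. The docs_unstemmed object below is a hypothetical copy of the cleaned corpus saved before stemDocument() was applied:
# hypothetical: a copy of the cleaned corpus kept before stemming
# docs_unstemmed <- docs
# dictionary of full word forms observed in the unstemmed corpus
dict <- Terms(TermDocumentMatrix(docs_unstemmed))
# map a few stems back to their most frequent full form
stemCompletion(c("issu", "method"), dictionary = dict, type = "prevalent")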

Still have punctuation issues after removePunctuation function

I have used removePunctuation from the "tm" package in R on a term-document matrix. For some reason I am still left with strange characters in my plot of the letters versus their proportion in a corpus I've analyzed.
Below is the code I used to clean the corpus:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/|#|\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
library(dplyr)
library(stringr)
library(ggplot2)  # for ggplot()
library(qdap)     # for dist_tab()
words <- dtm %>% as.matrix %>% colnames %>% (function(x) x[nchar(x) < 20])
words %>%
  str_split("") %>%
  sapply(function(x) x[-1]) %>%
  unlist %>%
  dist_tab %>%
  mutate(Letter = factor(toupper(interval), levels = toupper(interval[order(freq)]))) %>%
  ggplot(aes(Letter, weight = percent)) +
  geom_bar() +
  coord_flip() +
  ylab("Proportion") +
  scale_y_continuous(breaks = seq(0, 12, 2), label = function(x) paste0(x, "%"),
                     expand = c(0, 0), limits = c(0, 12))
I'm left with the following plot:
I'm trying to figure out what went wrong here.
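One common culprit, offered here as an assumption rather than a confirmed answer from the thread, is Unicode punctuation such as curly quotes and em dashes that the default removePunctuation() pattern does not catch. A small sketch that strips everything except letters, digits and whitespace before building the document-term matrix:
# drop any character that is not alphanumeric or whitespace, which also
# catches curly quotes and other Unicode punctuation
removeSpecial <- content_transformer(function(x) gsub("[^[:alnum:][:space:]]", " ", x))
docs <- tm_map(docs, removeSpecial)
docs <- tm_map(docs, stripWhitespace)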
