I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words come out as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a character-encoding problem? (All the documents are in English.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}
docs <- Corpus(URISource(uri, mode = ""),
               readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
We can visualize the most frequently used words in your PDF file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, and likewise "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
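If you want to see where the stemming falls short, SnowballC exposes the stemmer directly through wordStem(), so you can check a few words by hand. A quick sketch with words of my own choosing, not taken from the article:
library(SnowballC)
# irregular forms such as "ran" are not reduced to "run", which is one reason
# related words can end up as separate terms in the matrix
wordStem(c("issue", "issues", "method", "methods", "ran", "running"),
         language = "english")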
Related
I am trying to mess around with some R analytics. I have downloaded 10 TED Talks files and saved them as text. I am struggling with using removeWords with stopwords.
source("Project_Functions.R")
getwd()
# ====
# Load the PDF data
# pdf.loc <- file.path("data") # folder "PDF Files" with PDFs
# myFiles <- normalizePath(list.files(path = pdf.loc, pattern = "pdf", full.names = TRUE)) # Get the path (chr-vector) of PDF file names
# # Extract content from PDF files
# Docs.corpus <- Corpus(URISource(myFiles), readerControl = list(reader = readPDF(engine = "xpdf")))
# ====
# Load TED Talks Data
myFiles <- normalizePath(list.files(pattern = "txt", full.names = TRUE))
Docs.corpus <- Corpus(URISource(myFiles), readerControl=list(reader=readPlain))
length(Docs.corpus)
#Docs.corpus <-tm_map(Docs.corpus, tolower)
Docs.corpus <- tm_map(Docs.corpus, removeWords, stopwords("english"))
Docs.corpus <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus <- tm_map(Docs.corpus, removeNumbers)
Docs.corpus <- tm_map(Docs.corpus, stripWhitespace)
However, when I run:
dtm <- DocumentTermMatrix(Docs.corpus)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
It still shows some words that I don't want, like "and", "just", etc.
I have tried researching and using [[:punct:]] and other methods, but they don't work.
Please help, thank you.
I found out why: the order of the tm_map calls matters a lot. The list from stopwords("english") is all lower case, so removeWords has to run after tolower, otherwise capitalized words like "And" at the start of a sentence are never matched. Reordering the steps fixed it. It might not be the most efficient arrangement, but it works (a small illustration of the case-sensitivity point is sketched after this answer):
Docs.corpus.temp  <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus.temp1 <- tm_map(Docs.corpus.temp, removeNumbers)
Docs.corpus.temp2 <- tm_map(Docs.corpus.temp1, tolower)
Docs.corpus.temp3 <- tm_map(Docs.corpus.temp2, PlainTextDocument)
Docs.corpus.temp4 <- tm_map(Docs.corpus.temp3, stripWhitespace)
Docs.corpus.temp5 <- tm_map(Docs.corpus.temp4, removeWords, stopwords("english"))
# frequency
dtm <- DocumentTermMatrix(Docs.corpus.temp5)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
ord <- order(freq)
freq
That fixed my problem; now all the tm_map preprocessing code works.
If anyone has a better idea, please let me know, thank you!
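To see why the order matters, here is a minimal sketch (a toy two-document corpus of my own, not the TED data): stopwords("english") is an all-lowercase list, so removeWords only catches the capitalized forms once tolower has already run.
library(tm)
toy <- VCorpus(VectorSource(c("And The Test", "and the test")))
# stopwords first, lower-casing second: "And"/"The" in doc 1 are never matched
bad  <- tm_map(tm_map(toy, removeWords, stopwords("english")),
               content_transformer(tolower))
# lower-casing first, stopwords second: the stopwords are removed as expected
good <- tm_map(tm_map(toy, content_transformer(tolower)),
               removeWords, stopwords("english"))
content(bad[[1]])   # "and the test"
content(good[[1]])  # "  test"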
I've got a problem trying to use findAssocs after converting a file from PDF with the pdf_text function from the pdftools package.
I've sort of pinpointed the issue: because I can't use readLines, the Corpus creates a separate document for each page of the PDF. So when I get to findAssocs, it returns 1s because the terms appear on both pages.
Is there a workaround? For reference, the code is below.
Thanks in advance :).
text <- pdf_text(file.choose())
docs <- Corpus(VectorSource(text))
inspect(docs)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
as.data.frame(findAssocs(dtm, terms = input$v, corlimit = 0.3))
If you want to combine all the pages you loaded with pdf_text into one field, you can use paste(unlist(text), collapse =" ") before you transform the text into a corpus.
# my test pdf consists of 20 pages
text <- pdf_text(file.choose())
summary(text)
#    Length     Class      Mode
#        20 character character
# collapse the text into one field
text <- paste(unlist(text), collapse = " ")
summary(text)
#    Length     Class      Mode
#         1 character character
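A further thought (my own suggestion, not part of the answer above): findAssocs() correlates term counts across documents, so with only one collapsed document or two pages the correlations are not informative. Splitting the pages into smaller chunks, such as paragraphs, before building the corpus gives it more documents to correlate over.
library(pdftools)
library(tm)
pages <- pdf_text(file.choose())
# one corpus document per paragraph instead of per page; adjust the split
# pattern to however paragraphs are separated in your PDF
paras <- unlist(strsplit(pages, "\n\n+"))
docs  <- Corpus(VectorSource(paras))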
I am trying to add code so that when I count keyword occurrences, I can see both single words and phrases in the variable "d". The counts should also not be duplicated (if a word is counted among the single words, it should not appear again in the phrases). I am using the packages "tm", "NLP", "RColorBrewer", "wordcloud" and "SnowballC".
text <- readLines(file.choose())
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("pictwittercom", "twittertrendingtopics", "wwwtrendinaliacom", "singaporetodayhtml", "aldubpanibagonglihim", "http", "https", "bitly", "pictwittercom", "dlvr", "sgt", "trndnl", "niall", "wwwswarmappcom", "kak", "mtbnn", "vmas", "lang", "youtubecom", "untuk", "dan", "bagus", "sakit", "membantu", "kahit", "lahat", "mga", "pag", "tao", "kung", "akan", "penyakit"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The issue now is that when I print d to the console, only single words appear. I want to see both phrases and single words, and the counts should not be duplicated. Please advise, as I have been looking through Stack Overflow for hours and still can't find a good solution.
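For the phrases, one possibility (a sketch of my own, not an answer that was given in this thread) is to build a second term matrix with a two-word tokenizer based on NLP::ngrams and append its counts to d. A VCorpus is used here because, as far as I know, the SimpleCorpus returned by Corpus(VectorSource(...)) ignores custom tokenizers.
library(tm)
library(NLP)
# tokenizer that emits two-word phrases
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "), use.names = FALSE)
# rebuild from the raw text and repeat whatever cleaning you need (only two steps shown)
vdocs <- VCorpus(VectorSource(text))
vdocs <- tm_map(vdocs, content_transformer(tolower))
vdocs <- tm_map(vdocs, removeWords, stopwords("english"))
tdm2 <- TermDocumentMatrix(vdocs, control = list(tokenize = BigramTokenizer))
v2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
d2 <- rbind(d, data.frame(word = names(v2), freq = v2))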
I'm importing a txt file with words in Spanish, because I want to create a word cloud.
The problem is that I get these words without accent marks inside my word cloud.
There are words like "México" that are displayed as "mc3a9xico".
text <- readLines(file.choose())
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)
#Generate WordCloud
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The problem was that I hadn't set my system locale. After trying several times to change it to Spanish, I was getting this error: "OS reports request to set locale to "sp_MX.UTF-8" cannot be honored". So I ended up using this:
Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8")
And after that everything was working.
Thanks to @hrbrmstr, who pointed me to the actual problem :)
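A related precaution (my own addition, not part of the fix above): declare the file encoding explicitly when reading, so the accented characters arrive as proper UTF-8 before any tm transformation runs.
# read the file as UTF-8 explicitly before building the corpus
text <- readLines(file.choose(), encoding = "UTF-8")
docs <- Corpus(VectorSource(text))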
I have used removePunctuation from the "tm" package in R on a term-document matrix. For some reason I am still left with strange characters in my plot of the letters versus their proportion in a corpus I've analyzed.
Below is the code I used to clean the corpus:
# toSpace as defined in the earlier snippets; needed again here
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/|#|\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
library(dplyr)
library(stringr)
library(ggplot2)
library(qdap)  # provides dist_tab()
# keep only reasonably short terms
words <- dtm %>% as.matrix() %>% colnames() %>% (function(x) x[nchar(x) < 20])
words %>% str_split("") %>% sapply(function(x) x[-1]) %>% unlist() %>% dist_tab() %>%
  mutate(Letter = factor(toupper(interval), levels = toupper(interval[order(freq)]))) %>%
  ggplot(aes(Letter, weight = percent)) + geom_bar() + coord_flip() + ylab("Proportion") +
  scale_y_continuous(breaks = seq(0, 12, 2), labels = function(x) paste0(x, "%"),
                     expand = c(0, 0), limits = c(0, 12))
I'm left with the following plot:
I'm trying to figure out what went wrong here.
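In case it helps, a guess at a workaround (my own sketch, since no answer was recorded here): the stray symbols are usually non-ASCII characters such as curly quotes or ligatures that the [[:punct:]] class used by removePunctuation does not catch; recent tm versions also offer removePunctuation(ucp = TRUE) for Unicode punctuation, if that argument is available in yours. Otherwise, everything outside a-z can be dropped before the per-letter split:
# transliterate what iconv can (e.g. accented letters) and drop the rest,
# then keep only the letters a-z before splitting into characters
words_clean <- iconv(words, from = "", to = "ASCII//TRANSLIT", sub = "")
words_clean <- gsub("[^a-z]", "", tolower(words_clean))
words_clean <- words_clean[nchar(words_clean) > 0]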