I am trying to create a corpus, but I want to combine 2 consecutive words in each document; I don't want a corpus of single words.
I am using the script below. Is there a way to create the corpus "docs" so that each document is made up of combined 2-consecutive-word terms? Please advise.
library(plyr)
library(tm)
library(e1071)
setwd("C:/Assignment/Assignment-Group-Prediction/IPM")
training <- read.csv("Data.csv", header = TRUE, na.strings = c(""))
Res_Desc_Train <- subset(training, select = c("Group", "Description"))
## Step 1: Create the corpus from the description field
docs <- Corpus(VectorSource(Res_Desc_Train$Description))
docs <- tm_map(docs, content_transformer(tolower))
# replace potentially problematic symbols with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, ";")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\(")
docs <- tm_map(docs, toSpace, ")")
docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, "_")
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
The FAQ of the tm package answers your question directly:
Can I use bigrams instead of single tokens in a term-document matrix?
Yes. Package NLP provides functionality to compute n-grams which can be used to construct a corresponding tokenizer. E.g.:
library("tm")
data("crude")
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(removeSparseTerms(tdm[, 1:10], 0.7))
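Applied to the corpus built in the question, the same tokenizer produces a matrix whose terms are pairs of consecutive words. A minimal sketch, assuming docs is the cleaned corpus from your script; note that with current tm the default Corpus() returns a SimpleCorpus, which ignores custom tokenizers, so build the corpus with VCorpus(VectorSource(...)) if the bigrams do not appear:
library(NLP)  # supplies ngrams() and words(); attached automatically with tm
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm_bigrams <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
inspect(dtm_bigrams)  # each term is now a pair of consecutive words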
I am running the following code, but I am getting this error:
Error: cannot coerce type 'closure' to vector of type 'character'
#Load required packages
library(ggplot2)
library(tm)
library(wordcloud)
library(syuzhet)
#get the data from whatsapp chat
texts <- readLines("abc.txt")
#let us create the corpus
docs <- Corpus(VectorSource(text))
#clean our chat data
trans <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, trans, "/")
docs <- tm_map(docs, trans, "#")
docs <- tm_map(docs, trans, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
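The likely cause: the chat file is read into texts, but the corpus is built from text, which this script never defines. R then finds the base graphics function text(), which is a closure, and VectorSource() cannot coerce a function to a character vector. Making the names match should clear the error:
texts <- readLines("abc.txt")
docs <- Corpus(VectorSource(texts))  # use the object that was actually read in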
I've got a problem trying to use findAssocs() after converting a file from PDF with pdf_text() from the pdftools package.
I've partly pinpointed the issue: because I can't use readLines(), the corpus gets a separate document for each page of the PDF. So when I get to findAssocs(), it returns 1s because the terms appear on both pages.
Is there a workaround? For reference, the code is below.
Thanks in advance :).
text <- pdf_text(file.choose())
docs <- Corpus(VectorSource(text))
inspect(docs)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
as.data.frame(findAssocs(dtm, terms = input$v, corlimit = 0.3))
If you want to combine all the pages you loaded with pdf_text into one field, you can use paste(unlist(text), collapse = " ") before you transform the text into a corpus.
# my test pdf consists of 20 pages
text <- pdf_text(file.choose())
summary(text)
#    Length     Class      Mode
#        20 character character

# collapse the pages into one field
text <- paste(unlist(text), collapse = " ")
summary(text)
#    Length     Class      Mode
#         1 character character
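From there the corpus contains a single document, so pdf_text()'s pages are no longer treated as separate documents (continuing the snippet above):
docs <- Corpus(VectorSource(text))
inspect(docs)  # one document instead of one per page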
I am trying to add code so that when I count keyword occurrences, I can see both single words and phrases in the variable "d". The counts should also not be duplicated (if a word is counted as a single word, it should not appear again inside the phrases). I am using the packages "tm", "NLP", "RColorBrewer", "wordcloud" and "SnowballC".
text <- readLines(file.choose())
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "#")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("pictwittercom", "twittertrendingtopics", "wwwtrendinaliacom", "singaporetodayhtml", "aldubpanibagonglihim", "http", "https", "bitly", "pictwittercom", "dlvr", "sgt", "trndnl", "niall", "wwwswarmappcom", "kak", "mtbnn", "vmas", "lang", "youtubecom", "untuk", "dan", "bagus", "sakit", "membantu", "kahit", "lahat", "mga", "pag", "tao", "kung", "akan", "penyakit"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The issue now is that when I print d to the console, only single words appear. I want to see both phrases and single words, without duplicated counts. Please advise, as I have been looking through Stack Overflow for hours and still can't find a good solution.
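The n-gram tokenizer from the tm FAQ (quoted earlier on this page) extends to this case: passing 1:2 to ngrams() emits single words and two-word phrases together. A sketch, assuming docs is the cleaned corpus from the question, built as a VCorpus (with current tm, use VCorpus(VectorSource(text)); the default SimpleCorpus ignores custom tokenizers). Note that it does not deduplicate: words inside a counted phrase are still counted as single words too.
library(NLP)  # ngrams() and words()
UniBigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)
tdm <- TermDocumentMatrix(docs, control = list(tokenize = UniBigramTokenizer))
v <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)  # d now mixes single words and phrases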
I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents I get high-frequency words such as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but I get these random Greek letters with others. Is this a problem with character encoding? (All the documents are in English, by the way.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}
docs <- Corpus(URISource(uri, mode = ""),
               readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
We can visualize the most frequently used words in the PDF with a word cloud:
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
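If the leftover stems bother you, tm also offers stemCompletion(), which maps stems back to full words found in a dictionary. A small illustration with a hypothetical hand-picked dictionary (in practice you would use a copy of the corpus saved before stemming):
library(tm)
library(SnowballC)
dict <- c("issue", "issues", "method", "methods")  # hypothetical dictionary
stems <- stemDocument(c("issues", "methods"))      # "issu", "method"
stemCompletion(stems, dictionary = dict, type = "prevalent")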
I have used removePunctuation() from the "tm" package in R before building a term-document matrix. For some reason I am still left with strange characters in my plot of the letters versus their proportion in a corpus I've analyzed.
Below is the code I used to clean the corpus:
library(tm)
library(dplyr)
library(stringr)
library(qdap)      # provides dist_tab()
library(ggplot2)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/|#|\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
words <- dtm %>% as.matrix %>% colnames %>% (function(x) x[nchar(x) < 20])
words %>% str_split("") %>% sapply(function(x) x[-1]) %>% unlist %>% dist_tab %>%
  mutate(Letter = factor(toupper(interval), levels = toupper(interval[order(freq)]))) %>%
  ggplot(aes(Letter, weight = percent)) + geom_bar() + coord_flip() +
  ylab("Proportion") +
  scale_y_continuous(breaks = seq(0, 12, 2), label = function(x) paste0(x, "%"),
                     expand = c(0, 0), limits = c(0, 12))
The resulting plot of letter proportions still contains these strange characters, and I'm trying to figure out what went wrong here.
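One plausible explanation, offered as a guess: removePunctuation() defaults to the ASCII [:punct:] class, so Unicode punctuation such as curly quotes or en dashes survives the cleaning and its characters end up counted as "letters". The function has a ucp argument that switches to Unicode character properties. A sketch to check for and remove the strays:
library(tm)
terms <- Terms(dtm)
# any characters besides a-z that survived the cleaning
unique(unlist(strsplit(gsub("[a-z]", "", terms), "")))
# if Unicode punctuation shows up, strip it with ucp = TRUE
docs <- tm_map(docs, content_transformer(function(x) removePunctuation(x, ucp = TRUE)))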