I am trying to mess around with some R text analytics. I have downloaded 10 TED talk transcripts and saved them as text files, and I am struggling with removeWords and stopwords.
source("Project_Functions.R")
getwd()
# ====
# Load the PDF data
# pdf.loc <- file.path("data") # folder "PDF Files" with PDFs
# myFiles <- normalizePath(list.files(path = pdf.loc, pattern = "pdf", full.names = TRUE)) # Get the path (chr-vector) of PDF file names
# # Extract content from PDF files
# Docs.corpus <- Corpus(URISource(myFiles), readerControl = list(reader = readPDF(engine = "xpdf")))
# ====
# Load TED Talks Data
myFiles <- normalizePath(list.files(pattern = "txt", full.names = TRUE))
Docs.corpus <- Corpus(URISource(myFiles), readerControl=list(reader=readPlain))
length(Docs.corpus)
#Docs.corpus <- tm_map(Docs.corpus, tolower)
Docs.corpus <- tm_map(Docs.corpus, removeWords, stopwords("english"))
Docs.corpus <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus <- tm_map(Docs.corpus, removeNumbers)
Docs.corpus <- tm_map(Docs.corpus, stripWhitespace)
However, when I run:
dtm <- DocumentTermMatrix(Docs.corpus)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
It still shows some words that I don't want, like "and", "just", etc. I have tried researching and using [[:punct:]] and other methods, but they don't work.
Please help, thank you
I found out why: the order of the tm_map calls matters a lot. For example, if you run tolower and then run removeNumbers on the next line, the tolower step somehow no longer takes effect; only removeNumbers does. I fixed it by reordering the steps. It might not be the most effective way, but it works:
Docs.corpus.temp <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus.temp1 <- tm_map(Docs.corpus.temp, removeNumbers)
Docs.corpus.temp2 <- tm_map(Docs.corpus.temp1, tolower)
Docs.corpus.temp3 <- tm_map(Docs.corpus.temp2, PlainTextDocument)
Docs.corpus.temp4 <- tm_map(Docs.corpus.temp3, stripWhitespace)
Docs.corpus.temp5 <- tm_map(Docs.corpus.temp4, removeWords, stopwords("english"))
#frequency
dtm <- DocumentTermMatrix(Docs.corpus.temp5)
dtm$dimnames$Terms
freq <- colSums(as.matrix(dtm))
freq <- subset(freq, freq > 10)
ord <- order(freq)
freq
That fixes my problem; now all the tm_map preprocessing code works.
If anyone has a better idea, please let me know, thank you!
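Update: the likely explanation is that removeWords is case-sensitive, so with tolower commented out, capitalized stopwords such as "And" survive removeWords and are only lowercased later, when DocumentTermMatrix builds its terms. Assuming that, a more compact sketch of the same pipeline wraps tolower in content_transformer (so the documents keep their class and no PlainTextDocument repair step is needed) and lowercases before removing stopwords:
Docs.corpus <- tm_map(Docs.corpus, removePunctuation)
Docs.corpus <- tm_map(Docs.corpus, removeNumbers)
Docs.corpus <- tm_map(Docs.corpus, content_transformer(tolower)) # lowercase first...
Docs.corpus <- tm_map(Docs.corpus, removeWords, stopwords("english")) # ...so "And"/"Just" are matched
Docs.corpus <- tm_map(Docs.corpus, stripWhitespace)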
Related
I have multiple *.txt files that contain titles and texts that I want to process in R. The program below reads all the *.txt files but only displays results for the final file, skipping the texts read before it. My program is below; it uses a for loop, and I want to see all the texts.
library(here)
library(glue)
library(tm)
library(SnowballC)
library(tidyverse)
library(tidytext)
all_texts <- list.files(path = "./KCI/", pattern = "^abstract", full.names = TRUE)
for (i in seq_along(all_texts))
{
data <- read_tsv(all_texts[i], show_col_types = FALSE)
corpus <- Corpus(VectorSource(data[i]))
corpus[i] <- tm_map(corpus[i], tolower)
corpus[i] <- tm_map(corpus[i], removePunctuation)
corpus[i] <- tm_map(corpus[i], removeNumbers)
corpus[i] <- tm_map(corpus[i], stripWhitespace)
corpus[i] <- tm_map(corpus[i], removeWords, c(stopwords("english"), mystopwords))
corpus[i] <- tm_map(corpus[i], stemDocument)
dtm <- DocumentTermMatrix(corpus[i])
}
This program displays only the final document and skips the previous ones. I want the other documents to be displayed as well, not just the last one.
<Title> <Year> <Text>
How is it? 1998 I am wondering if it could end like that. Therefore the deal is too good to be true
This would be a lot easier if you had provided some data.
library(tm)
library(SnowballC)
##
# two documents based on your example (t1 & t2 are identical here).
#
t1 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
t2 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
data <- list(t1, t2) # list of documents
dtm.list <- lapply(data, function(x) {
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
DocumentTermMatrix(corpus)
})
lapply(dtm.list, inspect)
Note I left out mystopwords because you did not provide any.
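If you do have custom stopwords, you can simply concatenate them onto the built-in list; a quick sketch with hypothetical words:
mystopwords <- c("wondering", "therefore")  # hypothetical custom stopwords
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), mystopwords))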
In your case you could put the read_tsv(...) back into the function and use lapply(...) on the list of file names. Something like:
dtm.list <- lapply(all_texts, function(x) {
data <- read_tsv(x)
corpus <- Corpus(VectorSource(data))
...
})
Where ... are the lines of code in my example above.
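Spelled out in full, that might look like this (a sketch, assuming all_texts holds the file names and readr is loaded for read_tsv):
library(readr)
dtm.list <- lapply(all_texts, function(x) {
  data <- read_tsv(x, show_col_types = FALSE)
  corpus <- Corpus(VectorSource(data))
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stemDocument)
  DocumentTermMatrix(corpus)
})
lapply(dtm.list, inspect)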
If your ultimate goal is to analyze word frequency, you might be better off using ?termFreq.
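termFreq takes a single document plus a control list of the same cleaning options and returns a named term-frequency vector; a sketch on made-up text:
doc <- PlainTextDocument("I am wondering if it could end like that.")
termFreq(doc, control = list(removePunctuation = TRUE,
                             stopwords = TRUE,
                             stemming = TRUE))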
I am doing sentiment analysis. I have two documents in my corpus directory: one of positive tweets and the other of negative tweets. But in the comparison word cloud I see words that are stopwords, which means stopwords("english") is not removing them.
I created custom stopwords but failed to retain that output too. I then searched and found a stopwords.txt file of stopwords, which I downloaded from GitHub and used to remove the stopwords. To use this file, I read it into a table, converted it to a vector, and combined it with the stopwords of the tm library.
The output was as expected, but when I then removed punctuation and inspected the corpus, the output reflected only removePunctuation and did not retain the stopword removal. Then I tried removeNumbers and inspected the corpus: it did not retain the stopword removal either, only the removePunctuation output. So what is the problem here? What am I missing?
Screenshots: [1] the output after removing the stopwords from the tweets using R; [2] the output after applying the other cleaning steps (removePunctuation, removeNumbers, stripWhitespace, stemDocument), which does not retain the removed-stopwords output; [3] the code.
[1]: https://i.stack.imgur.com/RMbvD.png
[2]: https://i.stack.imgur.com/18H3P.png
[3]: https://i.stack.imgur.com/SxaJE.png
This is the code that I have used. I put the two text files in the directory and converted them into the corpus.
library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)
##removing stopwords##
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("english"))
inspect(clean_tweets_corpus)
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
class(stop)
stop
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
class(stop_vec)
stop_vec
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords, c(stopwords("english"), stop_vec))
inspect(clean_tweets_corpus)
## remove to have single characters ##
remove_multiplechar <- function(x) gsub("\\b[A-z]\\b{1}", " ", x)
clean_tweets_corpus <- tm_map(tweets_corpus, content_transformer(remove_multiplechar))
inspect(clean_tweets_corpus)
clean_tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(tweets_corpus,removeNumbers)
clean_tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(tweets_corpus, stemDocument)
inspect(clean_tweets_corpus)
str(clean_tweets_corpus)
Here is the corrected code, replacing "tweets_corpus" with "clean_tweets_corpus" in all calls to tm_map except the first one:
library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)
##removing stopwords##
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeWords, c(stopwords("english"), stop_vec))
## remove to have single characters ##
remove_multiplechar <- function(x) gsub("\\b[A-z]\\b{1}", " ", x)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, content_transformer(remove_multiplechar))
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeNumbers)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stemDocument)
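A quick check (a sketch) that the fix took: inspect a document at the end of the pipeline; it should now reflect every transformation, not just the last one.
# the first document should show lowercasing, stopword removal, and all later steps
inspect(clean_tweets_corpus[1])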
I have a problem with the word stemming completion of my created corpus using the tm package.
Here are the most important lines of my code:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z" and "0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now, to complete the word stemming, I apply stemCompletion from the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
However, this is where my corpus gets destroyed and messed up, and stemCompletion does not work properly. Peculiarly, R does not indicate an error; the code runs, but the result is terrible.
Does anybody know a solution for this? BTW, my "comments_final" data frame consists of YouTube comments, which I downloaded using the tubeR package.
Thank you so much for your help in advance; I really need this for my master's thesis.
stemCompletion does seem to work in a somewhat odd way, so I came up with my own stem-completion function and applied it to the corpus. In your case, try this:
stemCompletion2 <- function(x, dictionary) {
  # split the document into individual words
  x <- unlist(strsplit(as.character(x), " "))
  # Oddly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid the issue.
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
corpus <- lapply(corpus, stemCompletion2, corpus_copy)
corpus <- as.VCorpus(corpus)
Hope this helps!
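As a quick illustration on made-up words (not your data):
dict <- Corpus(VectorSource("wondering deals running"))  # dictionary of the original words
stemmed <- paste(stemDocument(c("wondering", "deals", "running")), collapse = " ")  # "wonder deal run"
content(stemCompletion2(stemmed, dict))  # completes back to "wondering deals running"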
I am new to supervised methods. Here is my way of normalizing my data:
corpuscleaned1 <- tm_map(AI_corpus, removePunctuation) ## Remove punctuation.
corpuscleaned2 <- tm_map(corpuscleaned1, stripWhitespace) ## Strip whitespace.
corpuscleaned3 <- tm_map(corpuscleaned2, removeNumbers) ## Remove numbers.
corpuscleaned4 <- tm_map(corpuscleaned3, stemDocument, language = "english") ## Stem words.
corpuscleaned5 <- tm_map(corpuscleaned4, removeWords, stopwords("en")) ## Remove stopwords.
head(AI_corpus[[1]]$content) ## Examine original text.
head(corpuscleaned5[[1]]$content) ## Examine cleaned text.
(AI_corpus is my corpus of Amnesty International reports, 1993-2013.)
I'm creating a correlated topic model from public review data and getting a rather odd error.
When I call terms(ctm1, 5) on my CTM, I get back the names of the documents rather than the top 5 terms for each topic.
In more detail, I ran:
library(topicmodels)
library(data.table)
library(tm)
a <- Corpus(DirSource("~/text", encoding = "UTF-8"), readerControl = list(language = "lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
adtm <- TermDocumentMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
ctm1 <- CTM(adtm, 30, method = "VEM", control = NULL, model = NULL)
terms(ctm1, 5)
which returned
terms(ctm1)
Topic 1 "cmnt656661.txt"
(etc.)
We cannot know for sure because you did not provide data, but it is likely that you did not import the files correctly. See ?DirSource (my emphasis):
directory: A character vector of full path names; the default corresponds to the working directory getwd().
In your case, it seems like you should do something like this:
a <- Corpus(DirSource(list.files("~/text", full.names = TRUE)))
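A quick sanity check (a sketch) after importing: the corpus length should equal the number of files, and inspecting a document should print review text rather than a file name.
length(a)      # should equal the number of files in ~/text
inspect(a[1])  # should show document content, not "cmnt656661.txt"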
I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words come out as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but I get these random Greek letters with others. Is this a problem with character encoding? (BTW, all the documents are in English.) Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
content(pdf)[1:4]
}
docs <- Corpus(URISource(uri, mode = ""),
readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
We can visualize the most frequently used words in your PDF file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words = 80, random.order = FALSE, scale = c(3, 0.5), colors = brewer.pal(8, "Dark2"))
Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.