Remove stopwords based on a long list in R

I have a data frame with 60,000 rows/phrases which I would like to use as stopwords and remove from my text.
I use the tm package, and I run this line after reading the CSV file with the list of stopwords:
corpus <- tm_map(corpus, removeWords, df$mylistofstopwords)
but I receive this error:
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
PCRE pattern compilation error
'regular expression is too large'
at ''
Is there a problem because the list is too big? Is there anything I can do to fix it?

You could probably resolve your issue by splitting the stopword list into multiple parts, something like the following:
chunk <- 1000
i <- 0
n <- length(df$mylistofstopwords)
while (i != n) {
  i2 <- min(i + chunk, n)
  corpus <- tm_map(corpus, removeWords, df$mylistofstopwords[(i+1):i2])
  i <- i2
}
Or, you could just use a package that can handle long stopword lists. corpus is one such package; quanteda is another. Here's how to get a document-by-term matrix with the corpus package:
library(corpus)
x <- term_matrix(corpus, drop = df$mylistofstopwords)
Here, the input argument corpus can be a tm corpus.
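And here is a minimal sketch of the same idea with quanteda (txt is a placeholder for your character vector of documents; adjust to your data):
library(quanteda)
toks <- tokens(txt)                               # tokenize the raw text
toks <- tokens_remove(toks, df$mylistofstopwords) # drop the long stopword list
dtm <- dfm(toks)                                  # document-feature matrix
quanteda matches the stopwords token by token rather than compiling one huge regular expression, so the length of the list should not hit the PCRE limit.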

Related

stemCompletion Error: Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression, reason 'Missing ')''

I'm fairly new to text analytics in R and I am trying to use stemCompletion.
Here's what I did at first:
#Clean Corpus
# 1. Stripping any extra white space:
corpus <- tm_map(corpus, stripWhitespace)
# 2. Transforming everything to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# 3. Removing numbers
corpus <- tm_map(corpus, removeNumbers)
# 4. Removing punctuation
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_contractions=FALSE)
# 5. Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# 6. Stem words
corpusStem <- tm_map(corpus, stemDocument, language="english")
I then ran this line for stemCompletion and it didn't actually do anything:
corpusStem <- tm_map(corpusStem, stemCompletion, dictionary=corpus, type="shortest")
I read up on stemCompletion and learned that it needs to be done on each individual word.
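For example, on individual (stemmed) words it works like this (a tiny sketch using words from my example data below):
library(tm)
stemCompletion(c("busi", "challeng"),
               dictionary = c("business", "charge", "challenge"),
               type = "shortest")
# should complete "busi" to "business" and "challeng" to "challenge"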
I saw this code on another Stack Overflow thread (48022087):
stemCompletion_mod <- function(x, dict = dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x), " ")),
                                                          dictionary = dict, type = "shortest"),
                                          sep = "", collapse = " ")))
}
I edited the above with my corpus names, but when I ran stemCompletion_mod, I got an error:
stemCompletion_mod(corpusStem,corpus)
Error in grep(sprintf("^%s", w), dictionary, value = TRUE) :
invalid regular expression, reason 'Missing ')''
What is causing this error?
(I also posted on the original thread where I found that code, but it's quite old, so I'm seeing if anyone else has some insight here!)
Thanks so much!
Here is the data that threw the error (a dput of the data frame read from my CSV):
structure(list(Type = c("Example 1", "Example 2"), Comment = c("This is an example for a corpus. Words like business and charge are not stemming correctly.",
"Here is another example. Challenge and always also need to have stemCompletion."
)), class = "data.frame", row.names = c(NA, -2L))
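A likely culprit (my guess; the original thread does not confirm this): stemCompletion_mod is written to operate on a single document, so calling it on the whole corpus at once sends the entire VCorpus through as.character(), which yields deparsed text full of parentheses, and those unbalanced parentheses break the regular expression that grep() builds inside stemCompletion. A sketch of applying it per document instead (mirroring the approach used in the stemCompletion answer further below):
corpusStem <- lapply(corpusStem, stemCompletion_mod, dict = corpus)
corpusStem <- as.VCorpus(corpusStem)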

Remove languages other than English from corpus or data frame in R

I am currently looking to perform some text mining on 25,000 YouTube comments, which I gathered using the tuber package. I am very new to coding, and with all the different information out there this can be a bit overwhelming at times. So I have already cleaned the corpus that I created:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Add extra stopwords
myStopwords <- c(stopwords('english'),"im", "just", "one","youre",
"hes","shes","its","were","theyre","ive","youve","weve","theyve","id")
# Remove stopwords from corpus
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z""0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
  gsub(pattern = '[^a-zA-Z0-9\\s]+',
       x = s,
       replacement = " ",
       ignore.case = TRUE,
       perl = TRUE)}))
# Replace word elongations using the textclean package by Tyler Rinker.
library(textclean)
corpus <- tm_map(corpus, replace_word_elongation)
# Creating data frame from corpus
corpus_asdataframe <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
# Due to pre-processing some rows are empty. Therefore, the empty rows should be removed.
# Remove empty rows and NAs from the data frame
corpus_asdataframe <- corpus_asdataframe[!apply(is.na(corpus_asdataframe) | corpus_asdataframe == "", 1, all), ]
corpus_asdataframe <- as.data.frame(corpus_asdataframe)
# Create a corpus from the cleaned data frame
corpus <- Corpus(VectorSource(corpus_asdataframe$corpus_asdataframe))
So now the issue is that there are a lot of Spanish and German comments in my corpus which I would like to exclude. I thought that maybe it is possible to download an English dictionary and use an inner join to detect English words and remove all other languages. However, I am very new to coding (I am studying Business Administration and have never had to do anything with computer science), so my skills are not sufficient for applying this idea to my corpus (or data frame). I really hope to find a little help here. That would be very much appreciated! Thank you and best regards from Germany!
dftest <- data.frame(
  id = 1:3,
  text = c(
    "Holla this is a spanish word",
    "English online here",
    "Bonjour, comment ça va?"
  )
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
CREDIT: Ken Benoit at: Find in a dfm non-english tokens and remove them
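Applied to your own data, a minimal sketch would be (assuming, as in your code above, that the raw comments are in comments_final$textOriginal):
library(cld3)
comments_en <- subset(comments_final, detect_language(textOriginal) == "en")
corpus <- Corpus(VectorSource(comments_en$textOriginal))
Language detection works best on the raw, uncleaned comments, so it is worth filtering by language before the other pre-processing steps rather than after.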

Issue with stemCompletion of Corpus for text mining in R (tm package)

I have a problem with the word stemming completion of my created corpus using the tm package.
Here are the most important lines of my code:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z" and "0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
  gsub(pattern = '[^a-zA-Z0-9\\s]+',
       x = s,
       replacement = " ",
       ignore.case = TRUE,
       perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now to complete the word stemming I apply stemCompletion of the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
However, this is where my corpus gets destroyed and messed up, and stemCompletion does not work properly. Peculiarly, R does not indicate an error; the code runs, but the result is terrible.
Does anybody know a solution for this? BTW, my comments_final data frame consists of YouTube comments, which I downloaded using the tuber package.
Thank you so much for your help in advance; I really need this for my master's thesis.
It does seem to work in a bit of a weird way, so I came up with my own stemCompletion function and applied it to the corpus. In your case, try this:
stemCompletion2 <- function(x, dictionary) {
  # split each word and store it
  x <- unlist(strsplit(as.character(x), " "))
  # Oddly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid this issue.
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
corpus <- lapply(corpus, stemCompletion2, corpus_copy)
corpus <- as.VCorpus(corpus)
Hope this helps!
I am new to supervised methods. Here is how I normalize my data:
corpuscleaned1 <- tm_map(AI_corpus, removePunctuation) ## Remove punctuation.
corpuscleaned2 <- tm_map(corpuscleaned1, stripWhitespace) ## Remove whitespace.
corpuscleaned3 <- tm_map(corpuscleaned2, removeNumbers) ## Remove numbers.
corpuscleaned4 <- tm_map(corpuscleaned3, stemDocument, language = "english") ## Stem words.
corpuscleaned5 <- tm_map(corpuscleaned4, removeWords, stopwords("en")) ## Remove stopwords.
head(AI_corpus[[1]]$content) ## Examine the original text.
head(corpuscleaned5[[1]]$content) ## Examine the cleaned text.
(AI_corpus is my corpus of Amnesty International reports, 1993-2013.)

How to remove error in term-document matrix in R?

I am trying to create a term-document matrix in R from a corpus of files, but on running the code I get this error, followed by two warnings:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j' invalid
Calls: DocumentTermMatrix ... TermDocumentMatrix.VCorpus -> simple_triplet_matrix -> .Call
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
My code is given below:
library(tm)
library(RWeka)
library(tmcn.word2vec)
#Reading data
data <- read.csv("Train.csv", header=T)
#text <- data$EventDescription
#Pre-processing
corpus <- Corpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
#dataframe <- data.frame(text=unlist(sapply(corpus,'[',"content")))
#Reading dictionary file
dict <- scan("dictionary.txt", what='character',sep='\n')
#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))
tdm_doc <- DocumentTermMatrix(corpus,control=list(stopwords = dict, tokenize=BigramTokenizer))
tdm_dic <- DocumentTermMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))
As suggested in other answers on SO, I have tried installing the SnowballC package and the other listed ideas, but I am still getting the same error. Can anyone help me in this regard? Thanks in advance.
I had the same problem getting my DocumentTermMatrix, and I solved it by removing the following command:
corpus <- tm_map(corpus, PlainTextDocument)
I had a similar error when cleaning a corpus. Adding the following after the offending line of code fixed it for me; some of the tm_map functions do not return a corpus...
corpus <- Corpus(VectorSource(corpus))
For me the problem arose after stem completion. I would suggest building a TDM after each tm_map call; that will tell you which cleaning step is causing the problem.
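Something along these lines (a sketch; add a check like this after each of your own cleaning steps):
corpus <- tm_map(corpus, removePunctuation)
TermDocumentMatrix(corpus)   # still works?
corpus <- tm_map(corpus, tolower)
TermDocumentMatrix(corpus)   # if this call fails, the previous step is the culprit
In pipelines like the one above, plain tolower is a frequent offender; wrapping it as content_transformer(tolower) keeps the result a valid corpus.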
Best of luck!

Using RemoveWords in tm_map in R on words loaded from a file

I have seen several questions about using the removeWords function with tm_map in R's tm package in order to remove either stopwords() or hard-coded words from a corpus. However, I am trying to remove words stored in a file (currently a CSV, but I don't care which type). Using the code below, I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
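One thing that stands out (a guess, since I cannot see mystopwords.csv): removeWords expects a character vector with one word per element, but paste(my_stop_words, collapse = " ") collapses the whole file into a single long string, so the generated pattern can only match that entire phrase, never the individual words. A sketch of passing the words as a vector instead (assuming the stopwords sit in the first column of the CSV):
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors = FALSE, header = TRUE)
corpus <- tm_map(corpus, removeWords, my_stop_words[[1]])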