I have a corpus in R built with the tm package, and I am applying the removeWords function to remove stopwords:
tm_map(abs, removeWords, stopwords("english"))
Is there a way to add my own custom stop words to this list?
stopwords() just returns a character vector of words, so simply combine your own with it:
tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words"))
Save your custom stop words in a CSV file (e.g. word.csv), one word per line.
library(tm)
# read the custom words (no header, one word per line in column V1)
my_stopwords <- read.csv("word.csv", header = FALSE)
my_stopwords <- as.character(my_stopwords$V1)
# append tm's built-in english list to the custom words
my_stopwords <- c(my_stopwords, stopwords("english"))
Then you can apply the combined list to your text (here text is assumed to be a character vector of documents).
text <- VectorSource(text)
text <- VCorpus(text)
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, my_stopwords)
text <- tm_map(text, stripWhitespace)
text[[1]]$content
You can create a vector of your custom stopwords and use a statement like this:
tm_map(abs, removeWords, c(stopwords("english"), myStopWords))
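For example (the words in myStopWords are just placeholders):
# build the custom vector first, then pass it together with the built-in list
myStopWords <- c("however", "via", "etc")
abs <- tm_map(abs, removeWords, c(stopwords("english"), myStopWords))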
You could also use the textProcessor function from the stm package. It works quite well:
textProcessor(documents,
removestopwords = TRUE, customstopwords = NULL)
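For instance, a minimal sketch (documents is assumed to be a character vector of raw texts, and the custom terms are placeholders):
library(stm)
# stopword removal plus a few extra custom terms
processed <- textProcessor(documents,
                           removestopwords = TRUE,
                           customstopwords = c("lol", "smh"))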
It is possible to add your own stopwords to the default list that comes with the tm installation. The tm package ships with many data files, including stopword lists for many languages. You can add, delete, or update entries in the english.dat file under the stopwords directory.
The easiest way to find the stopwords directory is to search your system for a directory named "stopwords" through your file browser; there you should find english.dat along with many other language files. Open english.dat from RStudio, which lets you edit the file - add your own words or drop existing ones as needed.
The process is the same if you want to edit the stopwords for any other language.
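You can also locate that directory from within R (a sketch; the exact layout may vary by tm version):
# path to the stopword files bundled with tm
system.file("stopwords", package = "tm")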
I am using the stop_words data frame that ships with the tidytext package instead of the tm library. I just decided to put my solution here in case anyone needs it.
library(tidytext)
# Create a vector of custom stopwords that should be added
word <- c("quick", "recovery")
lexicon <- rep("custom", times = length(word))
# Create a data frame from the two vectors above (columns: word, lexicon)
mystopwords <- data.frame(word, lexicon)
# Add the data frame to the stop_words df that comes with tidytext
stop_words <- dplyr::bind_rows(stop_words, mystopwords)
View(stop_words)
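The extended table can then be used the usual tidytext way, for example (tokens is assumed to be a one-word-per-row data frame produced by unnest_tokens):
tidy_text <- dplyr::anti_join(tokens, stop_words, by = "word")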
I am trying to manipulate text in R. I am loading Word documents and want to preprocess them in such a way that all text up to a certain point is deleted.
library(readtext)
#List all documents
file_list = list.files()
#Read Texts and write them to a data table
data = readtext(file_list)
# Create a corpus
library(tm)
corp = VCorpus(VectorSource(data$text))
#Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp= tm_map(corp, removePunctuation)
Now what I am trying to do is to delete, for each document in the corpus, all text before a certain keyword (here "Disclosure") and everything after the word "Conclusion".
There are many ways to do what you want, but without knowing more about your case or your example it is difficult to come up with the right solution.
If you are SURE that there will only be one instance of Disclosure and one of Conclusion, you can use the following. Be warned: this assumes each document is a single content vector and will not work otherwise. It will be relatively slow, but for a few small to medium-sized documents it will work fine.
All I did was write some functions that apply a regex to the content of each document in the corpus. You could also do this with an apply statement instead of tm_map (see the sketch after the code).
#Read Texts and write them to a data table
data = c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
"My second fake Disclosure This is just a sentence Conclusion Don't consider it a file.")
# Create a corpus
library(tm)
library(stringr)
corp = VCorpus(VectorSource(data))
#Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp= tm_map(corp, removePunctuation)
remove_before_Disclosure <- function(doc.in){
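# lookahead: drop everything before "Disclosure" while keeping the keyword itself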
doc.in$content <- str_remove(doc.in$content,".+(?=Disclosure)")
return(doc.in)
}
corp2 <- tm_map(corp,remove_before_Disclosure)
remove_after_Conclusion <- function(doc.in){
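# lookbehind: drop everything after "Conclusion" while keeping the keyword itself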
doc.in$content <- str_remove(doc.in$content,"(?<=Conclusion).+")
return(doc.in)
}
corp2 <- tm_map(corp2,remove_after_Conclusion)
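If you prefer the apply route, a sketch (lapply returns a plain list, so convert back to a corpus afterwards):
corp2 <- as.VCorpus(lapply(corp, remove_before_Disclosure))
corp2 <- as.VCorpus(lapply(corp2, remove_after_Conclusion))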
I am trying to do some text mining, using the tm package, on reviews that Italian users of a certain website wrote there. I scraped the texts, stored them in a corpus, and did some cleaning, but when I try to stem the words by removing their common endings, I have trouble specifying Italian instead of the default language, i.e. English.
reviews_corpus <- tm_map(reviews_corpus, removeNumbers)
reviews_corpus <- tm_map(reviews_corpus, removePunctuation)
reviews_corpus <- tm_map(reviews_corpus, stripWhitespace)
reviews_corpus <- tm_map(reviews_corpus, content_transformer(tolower))
reviews_corpus <- tm_map(reviews_corpus, removeWords, stopwords("italian"))
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))
The first five lines work fine, but for the last one R gives me:
Error in UseMethod("stemDocument", x) :
no applicable method for 'stemDocument' applied to an object of class "c('VCorpus', 'Corpus')"
So, my problem is: how can I use stemDocument on a corpus while specifying the language I want?
There is a bug in stemDocument: if you use any language other than English, it reverts back to English. (The error message itself appears because stemDocument(reviews_corpus, ...) is evaluated directly on the corpus instead of being passed to tm_map; the usual form would be tm_map(reviews_corpus, stemDocument, language = "italian").) But there is a way around the language bug: directly call the word stemmer that stemDocument points to.
Instead of
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))
use
reviews_corpus <- tm_map(reviews_corpus, function(x) SnowballC::wordStem(x, language = "italian"))
But my advice, if you are working with a non-English language, is to use the quanteda package.
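For example, a minimal sketch with quanteda (the sample sentence is made up):
library(quanteda)
toks <- tokens("Questi prodotti sono davvero bellissimi", remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("italian"))
# stem with the Italian Snowball stemmer
tokens_wordstem(toks, language = "italian")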
I'm using the R tm package for text analysis on a Facebook group, and I find that the removeWords function isn't working for me. I tried to combine the French stopwords with my own, but they still appear. So I created a file named "french.txt" with my own list, read in as follows:
nom_fichier <- "Analyse textuelle/french.txt"
my_stop_words <- readLines(nom_fichier, encoding="UTF-8")
Here is the data for text mining:
text <- readLines(groupe_fb_ief, encoding="UTF-8")
docs <- Corpus(VectorSource(text))
inspect(docs)
Here are the tm_map commands:
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, my_stop_words)
Applying all that, it's still not working and I don't understand why. I even tried changing the order of the commands, with no result.
Do you have any idea? Is it possible to change the French stopwords within R? Where is this list located?
Thanks!!
Rather than use removeWords, I typically use an anti_join() to remove all stop words.
library(dplyr)
library(tidytext)
# put the raw lines in a data frame, then split them into one word per row
text_df <- tibble(doc = seq_along(text), text = text)
text_df <- unnest_tokens(text_df, output = word, input = text, token = "words")
# the stop word list must also be a data frame with a "word" column
stop_df <- tibble(word = my_stop_words)
# anti_join keeps only the rows whose word is NOT in the stop word list
text_df <- anti_join(text_df, stop_df, by = "word")
That is, assuming the column that contains your tokens is called "word". Hope this helps.
I'm quite new to R and currently working on a project for my studies (readability vs. performance of annual reports). I've literally screened hundreds of posts but could not find a proper solution, so I'm stuck and need your help.
My goal is to preprocess roughly 1000 text documents with tm and export the edited texts from the VCorpus into a folder, keeping the original file names.
So far I have managed to import and do (some) text mining:
### folder of txt files
dest <- ("C:\\Xpdf_pdftotext\\TestCorpus")
### create a Corpus in R
docs <- VCorpus(DirSource(dest))
### do some text mining of the txt-documents
for (j in seq(docs)) {
  docs[[j]] <- gsub("\\d", "", docs[[j]])   # strip digits
  docs[[j]] <- gsub("\\b[A-Za-z]\\b{3}", "", docs[[j]])
  docs[[j]] <- gsub("\\t", "", docs[[j]])   # strip tabs
}
Now I want to export each file in the corpus with its original file name. Assigning a new name by hand works for one file:
writeLines(as.character(docs[1]), con="text1.txt")
I've found the command for the meta ID in a post, but I don't know how to include it in my code:
docs[[1]]$meta$id
How can I efficiently export the edited text files from the VCorpus, including their original file names?
Thanks for helping
Actually it is very simple.
If you have a corpus loaded as you did, you can write the whole corpus to disk in one command using writeCorpus. The meta tag id needs to be filled in, but in your case that was already done by the way you loaded the data (DirSource uses the file names as ids).
If we take the crude dataset as an example, the ids are already included:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
# bit of textcleaning
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
#write to disk in subfolder data
writeCorpus(crude, path = "data/")
# check files
dir("data/")
[1] "127.txt" "144.txt" "191.txt" "194.txt" "211.txt" "236.txt" "237.txt" "242.txt" "246.txt" "248.txt" "273.txt" "349.txt" "352.txt"
[14] "353.txt" "368.txt" "489.txt" "502.txt" "543.txt" "704.txt" "708.txt"
The files from the crude dataset are written to disk with the ids as file names.
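If you want different names, writeCorpus also accepts a filenames argument; for example (the "clean_" prefix is made up):
# names(crude) returns the document ids, i.e. the original file names
writeCorpus(crude, path = "data/", filenames = paste0("clean_", names(crude), ".txt"))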
Using the R tm package, I create a corpus as usual:
mycorpus <- Corpus(DirSource(folder,pattern="txt"))
Please note I am not using an encoding variable. summary(mycorpus) shows the document names listed. However, after a series of tm_map transforms:
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("SMART"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
ending with mycorpus <- tm_map(mycorpus, PlainTextDocument) and mydtm <- DocumentTermMatrix(mycorpus, control = list(...)),
I get the following output from inspect(mydtm[1:10, intersect(colnames(mydtm), 'toyota')]) for my term of choice:
Terms
Docs toyota
character(0) 0
character(0) 0
character(0) 0
character(0) 0
character(0) 1
character(0) 0
character(0) 0
character(0) 0
character(0) 1
character(0) 0
The file names (doc ids) have disappeared. Any idea what could be causing this? More importantly, how do I reinstate the document names? Many thanks.
The code below will work for a single file. You could likely use something like list.files to read all files in the directory.
First, I would wrap the cleaning functions in a custom function. Note that the order matters and that you have to use content_transformer if a function is not from tm.
clean.corpus<-function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, custom.stopwords)
return(corpus)
}
Then concatenate the English stopwords with your custom words. This vector is passed in the last step of the custom function above.
custom.stopwords <- c(stopwords('english'), 'lol', 'smh')
doc<-read.csv('coffee.csv', header=TRUE)
The CSV is a table with one column of tweet text and another column with an ID for each tweet.
The CSV file is now in memory, so the next step is to read it in a tabular fashion with a specific mapping when making the corpus. Here the content is in a column called "text" and the unique ID is in a column called "id".
custom.reader <- readTabular(mapping=list(content="text", id="id"))
corpus <- VCorpus(DataframeSource(doc), readerControl=list(reader=custom.reader))
corpus<-clean.corpus(corpus)
The corpus creation uses the readerControl, and once that is done you can apply the pre-processing steps. Without the reader control, the package assigns character(0) as the document name.
The content of document 1 in the corpus can be accessed with
corpus[[1]][1]
You can review the corpus metadata for the first document with this code:
corpus[[1]][2]
So I think you need to use readTabular and readerControl in your corpus construction, no matter the source.
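Note that readTabular was removed in newer versions of tm (0.7 and later); there DataframeSource itself expects columns named doc_id and text and keeps the ids automatically. A minimal sketch (the sample tweets are made up):
# newer tm: doc_id/text columns replace the readTabular mapping
doc <- data.frame(doc_id = c("t1", "t2"),
                  text = c("first tweet", "second tweet"),
                  stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(doc))
meta(corpus[[1]], "id")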
I was having the same problem and I realized that it was due to tolower. tolower, unlike removeNumbers, removePunctuation, removeWords, stemDocument, and stripWhitespace, is not a transformation defined in the tm package. To get a list of the transformations defined in the tm package that can be applied directly to a corpus, type:
getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
Thus, in order to use tolower it must first be wrapped in content_transformer so that it handles corpus objects properly:
docs <- tm_map(docs,content_transformer(tolower))
The above line of code should stop the files from being renamed to character(0).
The same trick can be applied to any R function to make it work with corpora. For example, for gsub the following syntax applies:
docs <- tm_map(docs, content_transformer(gsub), pattern = "internt", replacement = "internet")