Wordcloud in R including characters “ and ”, how do I delete them? - r

I'm trying to make a word cloud out of a Sherlock Holmes story, and the problem is that the top "words" are ” and “.
I can't delete them the way I delete other words, with the tm_map function and removeWords. What I've tried is this:
docs <- tm_map(docs, removeWords, c('“'))

You can use functions like removePunctuation from the tm package.
library(tm)
library(janeaustenr)
# With Punctuation
data("prideprejudice")
prideprejudice[30]
# Punctuation Removed
prideprejudice <- removePunctuation(prideprejudice)
prideprejudice[30]
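For the specific curly quotes in the question, one option is a content_transformer that substitutes them away before building the word cloud. This is a minimal sketch, assuming docs is the tm corpus from the question:
library(tm)
# substitute away the Unicode left/right double quotes (U+201C, U+201D)
strip_curly_quotes <- content_transformer(function(x) gsub("[\u201C\u201D]", "", x))
docs <- tm_map(docs, strip_curly_quotes)
Recent versions of tm also accept removePunctuation(x, ucp = TRUE), which treats Unicode punctuation (including these quotes) as punctuation.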
You can also use the tidytext package. The unnest_tokens function will automatically strip punctuation. You probably also want to get rid of stop words, which you can do with something like this:
library(tm)
library(tidytext)
library(janeaustenr)
library(dplyr)
data("prideprejudice")
data(stop_words)
prideprej_tibble <- tibble(text=prideprejudice)
prideprej_words <- prideprej_tibble %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
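From there, getting back to the original word cloud question is mostly a count() plus a call to the wordcloud package (a sketch; the wordcloud package is the only extra assumption here):
library(wordcloud)
word_counts <- prideprej_words %>%
  count(word, sort = TRUE)
wordcloud(words = word_counts$word, freq = word_counts$n, max.words = 100)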
See here for more.

Related

tm_map: Can I use the removeWords function with my own stopwords saved in a txt file?

I'm using the R tm package for text analysis on a Facebook group and find that the removeWords function isn't working for me. I tried to combine the French stopwords with my own, but they are still appearing. So I created a file named "french.txt" with my own list and read it in with the following commands:
nom_fichier <- "Analyse textuelle/french.txt"
my_stop_words <- readLines(nom_fichier, encoding="UTF-8")
Here is the data for text mining:
text <- readLines(groupe_fb_ief, encoding="UTF-8")
docs <- Corpus(VectorSource(text))
inspect(docs)
Here are the tm_map commands:
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, my_stop_words)
Applying that, it's still not working and I don't understand why. I even tried to change the order of the commands, with no result.
Do you have any idea? Is it possible to change the French stopwords within R? Where is this list located?
Thanks!!
Rather than use removeWords, I typically use an anti_join() to remove all stop words.
library(tidytext)
library(dplyr)
# my_stop_words is a character vector, so wrap it in a tibble before tokenizing
my_stop_words <- tibble(text = my_stop_words) %>%
  unnest_tokens(output = word, input = text, token = "words")
# anti_join
anti_join(docs, my_stop_words, by = "word")
That is, if the column that contains your corpus tokens is called "word" (i.e. docs is already a tidy one-word-per-row data frame rather than a tm corpus). Hope this helps.
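If you prefer to stay with tm and removeWords, a frequent reason custom stopwords "don't work" is case mismatch or stray whitespace in the list read with readLines. A minimal sketch under that assumption, reusing docs and my_stop_words from the question:
library(tm)
# remove stray whitespace / carriage returns left over from the file
my_stop_words <- trimws(my_stop_words)
# lower-case the corpus via content_transformer, then remove the lower-cased list
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, tolower(my_stop_words))
docs <- tm_map(docs, stripWhitespace)
Note that removeWords matches whole words only, so multi-word expressions in the list will not be removed this way.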

creating corpus from multiple txt files

I have multiple txt files and I want to end up with tidy data. To do that, I first create a corpus (I am not sure whether this is the right way to do it). I wrote the following code to get the corpus data:
folder<-"C:\\Users\\user\\Desktop\\text analysis\\doc"
list.files(path=folder)
filelist<- list.files(path=folder, pattern="*.txt")
paste(folder, "\\", filelist)
filelist<-paste(folder, "\\", filelist, sep="")
typeof(filelist)
a<- lapply(filelist,FUN=readLines)
corpus <- lapply(a ,FUN=paste, collapse=" ")
When I check class(corpus), it returns "list". From that point, how can I create tidy data?
Looking at your other question as well, you need to read up on text-mining and how to read in files. Your result now is a list object. In itself not a bad object, but for your purposes not correct. Instead of lapply, use sapply in your last line, like this:
corpus <- sapply(a , FUN = paste, collapse = " ")
This will return a character vector. Next you need to turn this into a data.frame. I added the filelist to the data.frame to keep track of which text belongs to which document.
my_data <- data.frame(files = filelist, text = corpus, stringsAsFactors = FALSE)
and then use tidytext to continue:
library(tidytext)
tidy_text <- unnest_tokens(my_data, words, text)
Using the tm and tidytext packages
If you would use the tm package, you could read everything in like this:
library(tm)
folder <- getwd() # <-- here goes your folder
corpus <- VCorpus(DirSource(directory = folder,
                            pattern = "*.txt"))
which you could turn into tidytext like this:
library(tidytext)
tidy_corpus <- tidy(corpus)
tidy_text <- unnest_tokens(tidy_corpus, words, text)
If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.
To find all the text files within a working directory, you can use list.files with an argument:
all_txts <- list.files(pattern = ".txt$")
The all_txts object will then be a character vector that contains all your filenames.
Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.
library(tidyverse)
library(tidytext)
map_df(all_txts, ~ data_frame(txt = read_file(.x)) %>%
  mutate(filename = basename(.x)) %>%
  unnest_tokens(word, txt))
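If you save that result (say, as all_words; the name is just for illustration), a per-file word frequency table is then one count() away:
all_words %>%
  count(filename, word, sort = TRUE)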

How to apply a custom function to a quanteda corpus

I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation, there is a philosophy about applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So I have a csv file with 2 columns: the first column is the misspelt term and the second column is the correct version of that term.
Using tm package previously I did this:
# Write a custom function to pass to tm_map
# "Spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc)
  stri_replace_all_regex(str = x,
                         pattern = paste0("\\b", lut[,1], "\\b"),
                         replacement = lut[,2],
                         vectorize_all = FALSE))
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivalent way to apply this custom function to my quanteda corpus?
Impossible to know if that will work from your example, which leaves some parts out, but generally:
If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.
So in your case, assuming that mycorpus is a tm corpus, you could do this:
library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
  stringi::stri_replace_all_regex(str = x,
                                  pattern = paste0("\\b", lut[,1], "\\b"),
                                  replacement = lut[,2],
                                  vectorize_all = FALSE)
}
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
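If you are on a newer quanteda (3.x), where texts() has been deprecated, one alternative is to fix the character vector first and then rebuild the corpus. This is a sketch under that assumption, reusing the names above:
library(quanteda)
txt <- as.character(myquantedacorpus)               # extract the texts
txt <- stringi_spelling_update2(txt, spellingdoc)   # apply the lookup-table fix
myquantedacorpus <- corpus(txt)                     # rebuild the corpus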
I think I found an indirect answer over here.
texts(myCorpus) <- myFunction(myCorpus)

Text Analysis in R - word frequency

I only have R available at work, and I have done this before in Python. I need to get a count of each set of incidents in a CSV file. In Python I did a sentiment analysis where I provided a dictionary of phrases, Python searched for them, and it produced a table with the count for each phrase. I am researching how to do this in R and have only found ways to do a general word count using a predetermined frequency.
Please let me know if anyone has any resource links on how to perform this in R. Thank you :)
Here's a place to start: http://tidytextmining.com
library(tidytext)
# text_df and original_books below are example data frames from the
# tidytextmining.com walkthrough linked above
text_df %>%
  unnest_tokens(word, text)
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books
tidy_books %>%
  count(word, sort = TRUE)
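Since the original goal is counting a fixed set of terms rather than all words, you can also join the tidy tokens against your own dictionary. A minimal sketch, where my_dictionary and its terms are purely illustrative:
library(dplyr)
my_dictionary <- tibble(word = c("delay", "outage"))   # illustrative terms only
tidy_books %>%
  inner_join(my_dictionary, by = "word") %>%
  count(word, sort = TRUE)
If the dictionary contains multi-word phrases, one option is to tokenize into n-grams first, e.g. unnest_tokens(original_books, phrase, text, token = "ngrams", n = 2), before joining.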
The package tidytext is a good solution. Another option is to use the text mining package tm:
library(tm)
df <- read.csv(myfile)
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
# corpus <- tm_map(corpus, stemDocument, language = "english")
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
tdmatrix <- as.matrix(tdm)
wordfreq <- sort(rowSums(tdmatrix), decreasing = TRUE)
The code example cleans up the text by removing stop words, numbers, and punctuation. The resulting wordfreq vector is ready for use with the wordcloud package if you're interested.
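For example, a two-line sketch of that last step (only the wordcloud package is assumed):
library(wordcloud)
wordcloud(words = names(wordfreq), freq = wordfreq, max.words = 100)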

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords
tm_map(abs, removeWords, stopwords("english"))
Is there a way to add my own custom stop words to this list?
stopwords() just provides you with a vector of words; simply combine your own with it:
tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words"))
Save your custom stop words in a csv file (ex: word.csv).
library(tm)
stopwords <- read.csv("word.csv", header = FALSE)
stopwords <- as.character(stopwords$V1)
stopwords <- c(stopwords, stopwords())
Then you can apply custom words to your text file.
text <- VectorSource(text)
text <- VCorpus(text)
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, stopwords)
text <- tm_map(text, stripWhitespace)
text[[1]]$content
You can create a vector of your custom stopwords & use the statement like this:
tm_map(abs, removeWords, c(stopwords("english"), myStopWords))
You could also use the textProcessor() function from the stm package. It works quite well:
library(stm)
textProcessor(documents,
              removestopwords = TRUE, customstopwords = NULL)
It is also possible to add your own stopwords to the default list that ships with the tm installation. The tm package comes with many data files, including stopword lists for many languages, and you can add, delete, or update the english.dat file in the stopwords directory.
The easiest way to find the stopwords directory is to search for a "stopwords" folder on your system through your file browser; there you should find english.dat along with many other language files. Open english.dat from RStudio, which lets you edit the file - you can add your own words or drop existing ones as needed.
It is the same process if you want to edit the stopwords for any other language.
I am using the stop_words data frame from the tidytext library instead of the tm library. I just decided to put my solution here in case anyone needs it.
library(dplyr)
library(tidytext)   # provides the stop_words data frame
data(stop_words)
# Create a list of custom stopwords that should be added
word <- c("quick", "recovery")
lexicon <- rep("custom", times = length(word))
# Create a dataframe from the two vectors above
mystopwords <- data.frame(word, lexicon)
names(mystopwords) <- c("word", "lexicon")
# Add the dataframe to the stop_words df that ships with tidytext
stop_words <- dplyr::bind_rows(stop_words, mystopwords)
View(stop_words)
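The extended stop_words can then be used exactly like the built-in one, e.g. with an anti_join() on a tidy one-word-per-row data frame (tidy_words is a hypothetical name here):
tidy_words %>%
  anti_join(stop_words, by = "word")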
