I am using the tm package in R to stem words, but I am still getting different words with the same root in the document-term matrix (dtm). For example, I am getting "certif" and "certifi" as different words, "categor" and "categori" as different words, "cathet" and "catheter" as different words, "character" and "characteristi" as different words, and so on. Isn't stemDocument supposed to take the endings off and count them as one word? How can I fix this? This is the code I used:
docs <- Corpus(VectorSource(df$Long_Descriptor))
docs <- tm_map(docs, removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower), lazy = TRUE) %>%
tm_map(removeWords, stopwords("english"), lazy = TRUE) %>%
tm_map(stemDocument, language = c("english"), lazy = TRUE)
dtm <- DocumentTermMatrix(docs)
I'm working with quanteda.corpora's SOTU corpus and need to subset it to look at roughly the last century of SOTU speeches. I'm coming from tm though, so I'm not super familiar with managing dfm objects.
I've learned how to preprocess the corpus when in dfm format, but I'm not certain what to do next. This is what I have right now. To my understanding, this code ought to subset my corpus to include only documents that were delivered after 1913.
dfmat_sotu <- dfm(data_corpus_sotu, tolower = TRUE, remove = stopwords("english"), remove_numbers = TRUE, remove_punct = TRUE)
dfmat_sotu <- dfm_wordstem(dfmat_sotu, language = quanteda_options("language_stemmer"))
dfmat_sotu <- dfm_subset(dfmat_sotu, Date > 1913-12-02)
wf_sotu <- textmodel_wordfish(dfmat_sotu)
The issue is that when I run this code as well as wordfish, it becomes clear that I haven't subset the corpus as intended -- it seems to only include speeches from 1978 and later. What do I need to do differently?
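One likely cause, offered as a sketch rather than a confirmed diagnosis: R parses the unquoted `1913-12-02` as arithmetic, so the filter compares the `Date` docvar against a plain number instead of a date. The base-R snippet below illustrates this with made-up dates (the `speech_dates` vector is hypothetical and not taken from the SOTU corpus).

```r
# Hypothetical stand-in for the Date docvar of the corpus
speech_dates <- as.Date(c("1900-12-03", "1950-01-04", "1990-01-31"))

# Unquoted, the "date" is just subtraction: 1913 - 12 - 2
cutoff_number <- 1913-12-02
print(cutoff_number)

# Comparing Dates with a bare number compares days since 1970-01-01
keep_wrong <- speech_dates > cutoff_number

# Quoting the date and converting it gives a real date comparison
keep_right <- speech_dates > as.Date("1913-12-02")
print(data.frame(speech_dates, keep_wrong, keep_right))
```

Assuming `Date` is stored as a Date docvar, the corresponding change in the question's code would be `dfm_subset(dfmat_sotu, Date > as.Date("1913-12-02"))`.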
I'm using the awesome quanteda package to convert my dfm to the topicmodels format. However, in the process I'm losing my docvars, which I need to identify which topics are most prevalent in my documents. This is especially a problem because the topicmodels package (like STM) only keeps documents with non-zero counts, so the number of documents in the original dfm differs from the number in the model output. Is there any way for me to correctly identify the documents in question?
I checked your outcome. Because of your select statement you have no features left in dfm_speeches. Convert that to the "dtm" format as used by the topicmodels and you indeed get a document term matrix that has no documents and no terms.
But if your selection with dfm_select results in a dfm with features and you then convert it into a dtm format you will see docvars appearing.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))
dfmlda <- convert(dfm_speeches, to = "topicmodels")
This will then work further with topicmodels. I will admit that if you convert to a dtm for tm and you have no features, you will still see the documents appearing in the dtm. I'm not sure whether there is an unintended side effect in the conversion to topicmodels when there are no features.
I don't think the problem is described clearly, but I believe I understand what it is.
The document-feature matrix passed to a topic model cannot contain empty documents, so the model returns a named vector of topics without them. You can still work with this if you match the topics to the document names:
# mx is a quanteda dfm
# topic is a named vector for topics from LDA
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
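To make the matching step concrete, here is a minimal base-R sketch with made-up names (`doc_names` and the values in `topic` are hypothetical, standing in for `docnames(mx)` and the LDA output). It only shows how `match()` lines the topics up with the original documents and leaves `NA` for documents the model dropped.

```r
# Hypothetical document names from the original dfm
doc_names <- c("doc1", "doc2", "doc3", "doc4")

# Hypothetical topic assignments; "doc3" was empty and dropped by the model
topic <- c(doc1 = 2L, doc2 = 1L, doc4 = 2L)

# match() finds each document's position in names(topic); missing ones give NA
aligned <- topic[match(doc_names, names(topic))]
names(aligned) <- doc_names
print(aligned)
```

The `NA` entries mark the documents that had no features left, which is why the original dfm and the model output have different lengths.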
Sorry, here's an example.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>%
dfm_trim(min_termfreq = 4, max_docfreq = 10)
dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))
dfmlda <- convert(dfm_speeches, to = "topicmodels")
As you can see, the dfmlda object is empty because I modified my dfm by removing specific words.
I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation there is a philosophy about applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So, I have a csv file with 2 columns, the first column is the misspelt term and the second column is the correct version of that term.
Using tm package previously I did this:
# Write a custom function to pass to tm_map
# "spellingdoc" is the 2-column csv, read in as a data frame
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) stri_replace_all_regex(str = x, pattern = paste0("\\b", lut[,1], "\\b"), replacement = lut[,2], vectorize_all = FALSE))
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivalent way to apply this custom function to my quanteda corpus?
Impossible to know if that will work from your example, which leaves some parts out, but generally:
If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.
So in your case, assuming that mycorpus is a tm corpus, you could do this:
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
stringi::stri_replace_all_regex(str = x,
pattern = paste0("\\b", lut[,1], "\\b"),
replacement = lut[,2],
vectorize_all = FALSE)
}
# convert the tm corpus to a quanteda corpus, then replace its texts
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
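As a rough check of what the replacement call does on its own, here is a small sketch that runs it on a plain character vector. The lookup table and the example texts are invented for illustration; in the real script they would be the spelling csv and the texts of the corpus.

```r
library(stringi)

# Hypothetical lookup table: column 1 = misspelling, column 2 = correction
spellingdoc <- data.frame(
  wrong = c("teh", "recieve"),
  right = c("the", "receive"),
  stringsAsFactors = FALSE
)

# Hypothetical texts standing in for the texts of the corpus
txts <- c("I did not recieve teh parcel", "teh other one is fine")

# With vectorize_all = FALSE every pattern is applied, in order, to every text
fixed <- stri_replace_all_regex(
  str = txts,
  pattern = paste0("\\b", spellingdoc[, 1], "\\b"),
  replacement = spellingdoc[, 2],
  vectorize_all = FALSE
)
print(fixed)
```

The `\\b` word boundaries keep the patterns from matching inside longer words.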
I think I found an indirect answer over here.
texts(myCorpus) <- myFunction(myCorpus)
I only have R available at work, and I have done this before in Python. I need to get a count of each set of incidents in a CSV file. For a sentiment analysis in Python, I supplied a dictionary of phrases, and Python searched the text and produced a table with the count for each phrase. I am researching how to do this in R and have only found ways to do a general word count using a predetermined frequency.
Please let me know if anyone has any resource links on how to perform this in R. Thank you :)
Here's a place to start: http://tidytextmining.com
text_df %>%
unnest_tokens(word, text)
tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books %>%
count(word, sort = TRUE)
The package tidytext is a good solution. Another option is to use the text mining package tm:
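library(tm)
# "text" is assumed to be a character vector with one document per element
corpus <- VCorpus(VectorSource(text))
corpus<-tm_map(corpus, content_transformer(tolower))
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, removeWords, stopwords('english'))
#corpus<-tm_map(corpus, stemDocument, language = "english")
corpus<-tm_map(corpus, removePunctuation)
# build the term-document matrix (terms in rows) and sum each term's counts
tdmatrix <- as.matrix(TermDocumentMatrix(corpus))
wordfreq<-sort(rowSums(tdmatrix), decreasing = TRUE)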
The code example cleans up the text by removing stop words, numbers, and punctuation. The final result, wordfreq, is ready for use with the wordcloud package if you are interested.
I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.
I don't understand the documentation on how to load a text file and to create the necessary objects to start using features such as....
stemDocument(x, language = map_IETF(Language(x)))
So assume that this is my doc "this is a test for R load"
How do I load the data for text processing and to create the object x?
Like #richiemorrisroe I found this poorly documented. Here's how I get my text in to use with the tm package and make the document term matrix:
library(tm) #load text mining library
setwd('F:/My Documents/My texts') #sets R's working directory to near where my files are
a <-Corpus(DirSource("/My Documents/My texts"), readerControl = list(language="lat")) #specifies the exact folder where my text file(s) is for analysis with tm.
summary(a) #check what went in
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, content_transformer(tolower)) # tm 0.6 and later need tolower wrapped in content_transformer
a <- tm_map(a, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
a <- tm_map(a, stemDocument, language = "english")
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
In this case you don't need to specify the exact file name. So long as it's the only one in the directory referred to in line 3, it will be used by the tm functions. I do it this way because I have not had any success in specifying the file name in line 3.
If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.
Can't you just use the function readPlain from the same library? Or you could just use the more common scan function.
mydoc.txt <-scan("./mydoc.txt", what = "character")
I actually found this quite tricky to begin with, so here's a more comprehensive explanation.
First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents) is to create a directory source that will read in all of your files.
source <- DirSource("yourdirectoryname/") #input path for documents
YourCorpus <- Corpus(source, readerControl=list(reader=readPlain)) #load in documents
You can then apply the stemDocument function to your corpus using tm_map. HTH.
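A minimal sketch of that last step, assuming the tm and SnowballC packages are installed. It uses two made-up in-memory documents through VectorSource instead of a directory, so it can be run without any files on disk.

```r
library(tm)  # stemDocument also needs the SnowballC package installed

# Hypothetical in-memory documents instead of a directory of files
docs <- c("this is a test for R load", "loading and testing documents")
corp <- VCorpus(VectorSource(docs))

# tm_map applies a transformation to every document in the corpus
corp <- tm_map(corp, stemDocument, language = "english")

# Inspect the stemmed text of each document
print(sapply(corp, as.character))
```

With a directory source, the same `tm_map` call would be applied to the corpus built from `DirSource`.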
I believe what you wanted to do was read an individual file into a corpus and then treat the different rows in the text file as different observations.
See if this gives you what you want:
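text <- read.delim("this is a test for R load.txt", sep = "\t", header = FALSE, stringsAsFactors = FALSE) # "\t" is the tab separator; header = FALSE keeps the first line as data
text_corpus <- Corpus(VectorSource(text[[1]]), readerControl = list(language = "en")) # text[[1]] makes each row its own document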
This is assuming that the file "this is a test for R load.txt" has only one column which has the text data.
Here the "text_corpus" is the object that you are looking for.
Hope this helps.
Here's my solution for a text file with one line per observation. The latest tm vignette (Feb 2017) gives more detail.
text <- read.delim(textFileName, header = FALSE, sep = "\n", stringsAsFactors = FALSE)
colnames(text) <- c("MyCol")
docs <- text$MyCol
a <- VCorpus(VectorSource(docs))
The following assumes you have a directory of text files from which you want to create a bag of words.
The only change that needs to be made is replace
path = "C:\\windows\\path\\to\\text\\files\\"
with your directory path.
library(tidyverse) # map_dfr, tibble, read_file, mutate, %>%
library(tidytext) # unnest_tokens, stop_words
# create a character vector listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\", # path can be relative or absolute
pattern = "\\.txt$", # this pattern only selects files ending with .txt
full.names = TRUE) # gives the file path as well as name
# create a data frame with one word per line
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>% # read in each file in list
mutate(filename = basename(.x)) %>% # add the file name as a new column
unnest_tokens(word, txt)) # split each word out as a separate row
# count the total # of rows/words in your corpus
my_corpus %>%
summarize(number_rows = n())
# group and count by "filename" field and sort descending
my_corpus %>%
group_by(filename) %>%
summarize(number_rows = n()) %>%
arrange(desc(number_rows))
# remove stop words (stop_words is the lexicon that ships with tidytext)
my_corpus2 <- my_corpus %>%
anti_join(stop_words, by = "word")
# repeat the count after stop words are removed
my_corpus2 %>%
group_by(filename) %>%
summarize(number_rows = n()) %>%
arrange(desc(number_rows))