Text-mining with the tm-package - word stemming - r

I am doing some text mining in R with the tm-package. Everything works very smooth. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words, which have the same stem, but it is important that they are not "thrown together" (as those words mean different things).
For an example see the 4 texts below. Here you cannnot use "lecturer" or "lecture" ("association" and "associate") interchangeable. However, this is what is done in step 4.
Is there any elegant solution how to implement this for some cases/words manually (e.g. that "lecturer" and "lecture" are kept as two different things)?
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))
# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus
# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
inspect(corpus.temp)
# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)

I'm not 100% sure what you're after and don't totally get how tm_map works. If I understand then the following works. As I understand you want to supply a list of words that should not be stemmed. I'm using the qdap package mostly because I'm lazy and it has a function mgsub I like.
Note that I got frustrated with using mgsub and tm_map as it kept throwing an error so I just used lapply instead.
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list to retain and indentifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically it works by:
subbing out a unique identifier key for the supplied "NO STEM" words (the mgsub)
then you stem (using stemDocument)
next you reverse it and sub the identifier keys with the "NO STEM" words (the mgsub)
last complete the Stem (stemCompletion)
Here's the output:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer

You can also use the following package for steeming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.
You just need to use the function wordStem, passing the vector of words to be stemmed and also the language you are dealing with. To know the exactly language string you need to use, you can refer to the method getStemLanguages, which will return all possible options for it.
Kind Regards

Related

Keep only sentences in corpus that contain specific key words (in R)

I have a corpus with .txt documents. From these .txt documents, I do not need all sentences, but I only want to keep certain sentences that contain specific key words. From there on, I will perform similarity measures etc.
So, here is an example.
From the data_corpus_inaugural data set of the quanteda package, I only want to keep the sentences in my corpus that contain the words "future" and/or "children".
I load my packages and create the corpus:
library(quanteda)
library(stringr)
## corpus with data_corpus_inaugural of the quanteda package
corpus <- corpus(data_corpus_inaugural)
summary(corpus)
Then I want to keep only those sentences that contain my key words
## keep only those sentences of a document that contain words future or/and
children
First, let's see which documents contain these key words
## extract all matches of future or children
str_extract_all(corpus, pattern = "future|children")
So far, I only found out how to exclude the sentences that contain my key words, which is the opposite of what I want to do.
## excluded sentences that contains future or children or both (?)
corpustrim <- corpus_trimsentences(corpus, exclude_pattern =
"future|children")
summary(corpustrim)
The above command excludes sentences containing my key words.
My idea here with the corpus_trimsentences function is to exclude all sentences BUT those containing "future" and/or "children".
I tried with regular expression. However, I did not manage to do it. It does not return what I want.
I looked into the corpus_reshape and corpus_subset functions of the quanteda package but I can't figure out how to use them for my purpose.
You are correct that it's corpus_reshape() and corpus_subset() that you want here. Here's how to use them.
First, reshape the corpus to sentences.
library("quanteda")
data_corpus_inauguralsents <-
corpus_reshape(data_corpus_inaugural, to = "sentences")
data_corpus_inauguralsents
The use stringr to create a logical (Boolean) that indicates the presence or absence of the pattern, equal in length to the new sentence corpus.
containstarget <-
stringr::str_detect(texts(data_corpus_inauguralsents), "future|children")
summary(containstarget)
## Mode FALSE TRUE
## logical 4879 137
Then use corpus_subset() to keep only those with the pattern:
data_corpus_inauguralsentssub <-
corpus_subset(data_corpus_inauguralsents, containstarget)
tail(texts(data_corpus_inauguralsentssub), 2)
## 2017-Trump.30
## "But for too many of our citizens, a different reality exists: mothers and children trapped in poverty in our inner cities; rusted-out factories scattered like tombstones across the landscape of our nation; an education system, flush with cash, but which leaves our young and beautiful students deprived of all knowledge; and the crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential."
## 2017-Trump.41
## "And now we are looking only to the future."
Finally, if you want to put these selected sentences back into their original document containers, but without the sentences that did not contain the target words, then reshape again:
# reshape back to documents that contain only sentences with the target terms
corpus_reshape(data_corpus_inauguralsentssub, to = "documents")
## Corpus consisting of 49 documents and 3 docvars.
You need to use the tokens function.
library(quanteda)
corpus <- corpus(data_corpus_inaugural)
# tokens to keep
tok_to_keep <- tokens_select(tokens(corpus, what = "sentence"), pattern = "future|children", valuetype = "regex", selection = "keep")
This returns a list of all the speeches and sentences where the key words are present. Next you can unlist the list of tok_to_keep or do whatever you need to it to get what you want.

Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the Quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the same word as different topics.
Only problem is that the stemming algorithm leaves behind some words that aren't really words. "Happiness" stems to "happi," "arrange" stems to "arrang," and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.
By reading through some previous threads here on StackOverflow, I came across a function, stemCompletion(), from the TM package, that does this, at least approximately. It seems to work reasonably well.
But when I apply it to the terms vector within a document text matrix, stemCompletion() always replaces the names of the character vector, not the characters themselves. Here's a reproducible example:
# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)
# Get first 200 words of Mansfield Park
words <- head(mansfieldpark, 200)
# Build a corpus from words
corpus <- quanteda::corpus(words)
# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")
# Create a document text matrix and do topic modeling
dtm <- corpus %>%
quanteda::dfm(remove_punct = TRUE,
remove = STOPWORDS) %>%
quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
quanteda::convert("topicmodels")
# Word stems are now stored in dtm$dimnames$Terms
# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)
# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)
# Apply tm::stemCompletion to Terms
unstemmed_terms <-
tm::stemCompletion(dtm$dimnames$Terms,
dictionary = words, # or corpus
type = "shortest")
# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)
tail(unstemmed_terms, 20)
I'm looking for a way to get the results returned by stemCompletion() into a character vector, and not into the names attribute of a character vector. Any insights into this issue are much appreciated.
The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.
tail(words)
# [1] "most liberal-minded sister and aunt in the world."
# [2] ""
# [3] "When the subject was brought forward again, her views were more fully"
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall"
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"
But this can easily be tokenised using quanteda's tokens(), and converting that to a character vector.
unstemmed_terms <-
tm::stemCompletion(dtm$dimnames$Terms,
dictionary = as.character(tokens(words, remove_punct = TRUE)),
type = "shortest")
tail(unstemmed_terms, 20)
# arrang chariti perhap parsonag convers happi
# "arranging" NA "perhaps" NA "conversation" "happily"
# belief most liberal-mind aunt again view
# "belief" "most" "liberal-minded" "aunt" "again" "views"
# explain calm inquiri where come heard
# "explained" "calm" NA NA "come" "heard"
# surpris total
# "surprise" "totally"

Subsetting a corpus based on content of textfile

I'm using R and the tm package to do some text analysis.
I'm trying to build a subset of a corpus based on whether a certain expression is found within the content of the individual text files.
I create a corpus with 20 textfiles (thank you lukeA for this example):
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
I now would like to select only those textfiles that contain the string "price reduction" to create a subset-corpus.
Inspecting the first textfile of the document, I know that there is at least one textfile containing that string:
writeLines(as.character(corp[1]))
How would I best go about doing this?
Here's a simpler way using the quanteda package, and one more consistent with the way that reuses existing methods already defined for other R objects. quanteda has a subset method for corpus objects that works just like the subset method for a data.frame, but selects on logical vectors including document variables defined in the corpus. Below, I have extracted the texts from the corpus using the texts() method for corpus objects, and used that in a grep() to search for your pair of words.
require(tm)
data(crude)
require(quanteda)
# corpus constructor recognises tm Corpus objects
(qcorpus <- corpus(crude))
## Corpus consisting of 20 documents.
# use subset method
(qcorpussub <- corpus_subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus))))
## Corpus consisting of 1 document.
# see the context
## kwic(qcorpus, "price reduction")
## contextPre keyword contextPost
## [127, 45:46] copany said." The [ price reduction ] today was made in the
Note: I spaced your regex with "\s+" since you could have some variation of spaces, tabs, or newlines instead of just a single space.
Here's one way using tm_filter:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
( corp_sub <- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE))) )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 1
cat(content(corp_sub[[1]]))
# Diamond Shamrock Corp said that
# effective today it had cut its contract prices for crude oil by
# 1.50 dlrs a barrel.
# The reduction brings its posted price for West Texas
# Intermediate to 16.00 dlrs a barrel, the copany said.
# "The price reduction today was made in the light of falling # <=====
# oil product prices and a weak crude oil market," a company
# spokeswoman said.
# Diamond is the latest in a line of U.S. oil companies that
# have cut its contract, or posted, prices over the last two days
# citing weak oil markets.
# Reuter
How did I get there? By looking into the packages' vignette, searching for subset, and then looking at the examples for tm_filter (help: ?tm_filter), which is mentioned there. It might also be worth looking at ?grep to inspect the options for pattern matching.
#lukeA's solution works. I want to give another solution I prefer.
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))
for(i in 1:length(corp))
corp[[i]]$meta["mySubset"] <- corpTF[i]
idx <- meta(corp, tag ="mySubset") == 'TRUE'
filtered <- corp[idx]
cat(content(filtered[[1]]))
Advantage of this solution by using meta tags, we can see all corpus elements with a selection tag mySubset, value 'TRUE' for our selected ones, and value 'FALSE' for otherwise.

R: Finding frequency per term -- Warning Message

I have attempting to find the frequency per term in Martin Luther King's "I Have a Dream" speech. I have converted all uppercase letters to lowercase and I have removed all stop words. I have the text in a .txt file so I cannot display it on here. The code that reads in the file is below:
speech <- readLines(speech.txt)
Then I performed the conversion to lowercase and removal of stop words successfully and called it:
clean.speech
Now I am having some issues with finding the frequency per term. I have created a corpus, inspected my corpus, and created a TermDocumentMatrix as follows:
myCorpus <- Corpus(VectorSource(clean.speech))
inspect(myCorpus)
TDM <- TermDocumentMatrix(myCorpus)
Everything is fine up to this point. However, I then wrote the following code and got the warning message of:
m < as.matrix(TDM)
Warning Message:
"In m < as.matrix(TDM): longer object length is not a multiple of shorter object length
I know this is a very common warning message, so I Googled it first, but I could not find anything pertaining to frequency of terms. I proceeded to run the following text, to see if it would run with a warning message, but it did not.
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word=names(v), freq=v)
head(d, 15)
My goal is just to find the frequency of terms. I sincerely apologize for asking this question because I know this question gets asked a lot. I just do not understand what to change about my code. Thank you everyone I appreciate it!
If your goal is just to find the frequency of the terms, then try this.
First, I get the "I Have a Dream" speech into a character vector:
# get the text of the speech from an HTML source, and extract the text
library(XML)
doc.html <- htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', useInternal = TRUE)
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
doc.text = paste(doc.text, collapse = ' ')
Then, I create the document-term matrix in quanteda, removing stop words (and adding "will" since quanteda's built-in list of english stop words does not include this term). From there the topfeatures() gives you the most frequent terms and their counts.
library(quanteda)
# create a document-feature matrix
IHADdfm <- dfm(doc.text, ignoredFeatures = c("will", stopwords("english")), verbose = FALSE)
# 12 most frequent features
topfeatures(IHADdfm, 12)
## freedom one ring dream let day negro today able every together years
## 13 12 12 11 10 9 8 7 7 7 6 5
# a word cloud, if you wish
plot(IHADdfm, random.order = FALSE)
just call findFreqTerms(), e.g. as tm::findFreqTerms(TDM, lowfreq=2, highfreq = 5) .
(The tm:: is optional - just saying that it is a built-in funciton of the tm package)

Snowball Stemmer only stems last word

I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.
library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/diretory")
corp <- Corpus(DirSource(path),
readerControl = list(reader = readPlain, language = "en_US",
load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem
I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:
> vec<-c("running runner runs","happyness happies")
> stemDocument(vec)
[1] "running runner run" "happyness happi"
> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
[1] "run" "runner" "run" "happy" "happi" <-
> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
run runner run
[[2]]
happy happi
> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" , load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1.txt`
running runner runs
$`2.txt`
happyness happies
load required libraries
library(tm)
library(Snowball)
create vector
vec<-c("running runner runs","happyness happies")
create corpus from vector
vec<-Corpus(VectorSource(vec))
very important thing is to check class of our corpus and preserve it as we want a standard corpus that R functions understand
class(vec[[1]])
vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs
this will probably tell you Plain text document
So now we modify our faulty stemDocument function. first we convert our plain text to character and then we split out text, apply stemDocument which works fine now and paste it back together. most importantly we reconvert output to PlainTextDocument given by tm package.
stemDocumentfix <- function(x)
{
PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}
now we can use standard tm_map on our corpus
vec1 = tm_map(vec, stemDocumentfix)
result is
vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run
most important thing you need remember is to presever class of documents in corpus always.
i hope this is a simplified solution to your problem using function from within the 2 libraries loaded.
The problem I see is that wordStem takes in a vector of words but Corpus plainTextReader assumes that in the documents that it reads, each word is on its own line. In other words, this would confuse plainTextReader as you will end up with 3 "words" in your document
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes
Instead the document should be
From
ancient
grudge
break
to
new
mutiny
where
civil
...etc...
Note also that punctuation also confuses wordStem so you would have to take them out as well.
Another way to do this without modifying your actual documents is defining a function that would do the separation and remove non-alphanumerics that appear before or after a word. Here is a simple one:
wordStem2 <- function(x) {
mywords <- unlist(strsplit(x, " "))
mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl=T)
mycleanwords <- mycleanwords[mycleanwords != ""]
wordStem(mycleanwords)
}
corpA <- tm_map(mycorpus, wordStem2);
corpB <- Corpus(VectorSource(corpA));
Now just use corpB as your usual Corpus.

Resources