Subsetting a corpus based on content of textfile

I'm using R and the tm package to do some text analysis.
I'm trying to build a subset of a corpus based on whether a certain expression is found within the content of the individual text files.
I create a corpus with 20 textfiles (thank you lukeA for this example):
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
I now would like to select only those textfiles that contain the string "price reduction" to create a subset-corpus.
Inspecting the first textfile of the corpus, I know that at least one textfile contains that string:
writeLines(as.character(corp[[1]]))
How would I best go about doing this?

Here's a simpler way using the quanteda package, one that is more consistent with methods already defined for other R objects. quanteda has a corpus_subset() function for corpus objects that works just like subset() for a data.frame, but selects on logical vectors, including document variables defined in the corpus. Below, I have extracted the texts from the corpus using the texts() method for corpus objects, and used that in grepl() to search for your pair of words. (In recent quanteda releases, as.character() on a corpus replaces texts().)
require(tm)
data(crude)
require(quanteda)
# corpus constructor recognises tm Corpus objects
(qcorpus <- corpus(crude))
## Corpus consisting of 20 documents.
# use subset method
(qcorpussub <- corpus_subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus))))
## Corpus consisting of 1 document.
# see the context
kwic(qcorpus, "price reduction")
## contextPre keyword contextPost
## [127, 45:46] copany said." The [ price reduction ] today was made in the
Note: I separated the two words in the regex with "\\s+" (the pattern \s+ written as an R string), because the text could contain any mix of spaces, tabs, or newlines between them rather than a single space.
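To see what the whitespace pattern buys you, here is a small base-R sketch. The three documents are made up for illustration and are not taken from the crude corpus; the second one splits the phrase across a line break.

```r
# Three toy documents (hypothetical, for illustration only)
docs <- c(doc1 = "The price reduction today was made",
          doc2 = "a further price\nreduction is expected",
          doc3 = "crude oil prices rose")

# Literal match: only a single space between the two words is accepted
hits_fixed <- grepl("price reduction", docs, fixed = TRUE)

# Regex match: any run of whitespace (space, tab, newline) is accepted
hits_regex <- grepl("price\\s+reduction", docs)

# Names of the matching documents under each approach
names(docs)[hits_fixed]
names(docs)[hits_regex]
```

The literal match misses doc2, while the regex match picks it up. Either logical vector can be passed to a subsetting function in the same way as in the answer above.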

Here's one way using tm_filter:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
( corp_sub <- tm_filter(corp, function(x) any(grepl("price reduction", content(x), fixed=TRUE))) )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 1
cat(content(corp_sub[[1]]))
# Diamond Shamrock Corp said that
# effective today it had cut its contract prices for crude oil by
# 1.50 dlrs a barrel.
# The reduction brings its posted price for West Texas
# Intermediate to 16.00 dlrs a barrel, the copany said.
# "The price reduction today was made in the light of falling # <=====
# oil product prices and a weak crude oil market," a company
# spokeswoman said.
# Diamond is the latest in a line of U.S. oil companies that
# have cut its contract, or posted, prices over the last two days
# citing weak oil markets.
# Reuter
How did I get there? By looking into the package's vignette, searching for "subset", and then looking at the examples for tm_filter (help: ?tm_filter), which is mentioned there. It is also worth looking at ?grep to see the options for pattern matching.

lukeA's solution works, but I want to offer another solution that I prefer.
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))
for(i in 1:length(corp))
corp[[i]]$meta["mySubset"] <- corpTF[i]
idx <- meta(corp, tag ="mySubset") == 'TRUE'
filtered <- corp[idx]
cat(content(filtered[[1]]))
The advantage of using meta tags is that every corpus element carries a selection tag, mySubset, with the value TRUE for the selected documents and FALSE for all others.

Related

Create a Document Frequency Matrix in R

I am attempting to create a document frequency matrix in R.
I currently have a dataframe (df_2), which is made up of 2 columns:
doc_num: which details which document each term is coming from
text_token: which contains each tokenized word relating to each document.
The df's dimensions are 79,447 * 2.
However, there are only 400 actual documents in the 79,447 rows.
I have been trying to create this dfm using the tm package.
I have tried creating a corpus (with VectorSource) and then coercing it into a dfm using the appropriately named dfm() command.
However, this indicates that "dfm() only works on character, corpus, dfm, tokens objects."
I understand my data isn't currently in the correct format for the dfm command to work.
My issue is that I don't know how to get from my current point to a matrix as appears below.
Example of what I would like the matrix to look like when complete:
Where 2 is the number of times cat appears in doc_2.
Any help on this would be greatly appreciated.
Yours respectfully.

It will be useful for you and others if all pertinent details accompany your code, such as the fact that dfm() comes from the quanteda package.
If the underlying text is set up correctly, dfm() will directly give you what you are looking for; that is precisely what it is designed for.
Here is a simulation:
library(tm)
library(quanteda)
# install.packages("readtext")
library(readtext)
doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"
# write each document to its own file in a "docs" folder under the working directory
dir.create("docs", showWarnings = FALSE)
writeLines(doc1, file.path("docs", "doc1.txt"))
writeLines(doc2, file.path("docs", "doc2.txt"))
writeLines(doc3, file.path("docs", "doc3.txt"))
writeLines(doc4, file.path("docs", "doc4.txt"))
# check where the folder was created
getwd()
# read every file in the folder; one row per document
txt <- readtext(file.path("docs", "*"))
txt
corp <- corpus(txt)
# tokenise first, then build the document-feature matrix
x <- dfm(tokens(corp))
View(x)
If the issue is one of formatting/cleaning your data so that you can run dfm(), then you need to post a new question that provides the necessary details on your data.
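If the data is already tokenised, with one row per token as the question describes, a plain cross-tabulation may be enough to get the counts. The sketch below uses only base R. The small df_2 is a made-up stand-in for the asker's data frame; only the column names doc_num and text_token come from the question.

```r
# Hypothetical stand-in for df_2: one row per (document, token) occurrence
df_2 <- data.frame(
  doc_num    = c("doc_1", "doc_1", "doc_2", "doc_2", "doc_2"),
  text_token = c("dog", "cat", "cat", "cat", "fish"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documents (rows) against tokens (columns)
freq <- table(doc = df_2$doc_num, term = df_2$text_token)
freq

# Number of times "cat" appears in doc_2
freq["doc_2", "cat"]
```

Note that table() builds a dense matrix, so with 400 documents and a large vocabulary a sparse representation such as a quanteda dfm is likely to be more economical.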

Sentiment Analysis in R using TDM/DTM

I am trying to run a sentiment analysis in R on my DTM (document-term matrix) or TDM (term-document matrix). I could not find a similar topic in this forum or on Google. So far I have created a corpus and generated a DTM/TDM from it. My next step is the sentiment analysis, which I need later for stock prediction via SVM. My current code is:
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.data.frame(as.matrix(dtm))
tdm <- TermDocumentMatrix(docs)
tdm <- removeSparseTerms(tdm, 0.99)
tdm <- as.data.frame(as.matrix(tdm))
I read that this is possible with the tidytext package and its get_sentiments() function, but I could not apply that to a DTM/TDM. How can I run a sentiment analysis on my cleaned, filtered words, which are already stemmed, tokenized, etc.? I have seen many people run sentiment analysis on a whole sentence, but I would like to apply it to single words to see whether they are positive or negative, what their score is, and so on. Many thanks in advance!

SentimentAnalysis has good integration with tm.
library(tm)
library(SentimentAnalysis)
documents <- c("Wow, I really like the new light sabers!",
               "That book was excellent.",
               "R is a fantastic language.",
               "The service in this restaurant was miserable.",
               "This is neither positive or negative.",
               "The waiter forget about my dessert -- what poor service!")
vc <- VCorpus(VectorSource(documents))
dtm <- DocumentTermMatrix(vc)
analyzeSentiment(dtm,
  rules=list(
    "SentimentLM"=list(
      ruleSentiment, loadDictionaryLM()
    ),
    "SentimentQDAP"=list(
      ruleSentiment, loadDictionaryQDAP()
    )
  )
)
# SentimentLM SentimentQDAP
# 1 0.000 0.1428571
# 2 0.000 0.0000000
# 3 0.000 0.0000000
# 4 0.000 0.0000000
# 5 0.000 0.0000000
# 6 -0.125 -0.2500000

To use tidytext on a DTM to get sentiments, first convert the DTM to tidy format and then do an inner join between the tidy data and a dictionary of polarised words. I will use the same documents as above. Some documents in the example above are positive but were given a neutral score.
Let's see how tidytext performs:
library(tidytext)
library(tm)
library(dplyr)
library(tidyr)
documents <- c("Wow I really like the new light sabers",
"That book was excellent",
"R is a fantastic language",
"The service in this restaurant was miserable",
"This is neither positive or negative",
"The waiter forget about my dessert -- what poor service")
# create tidy format
vectors <- as.character(documents)
v_source <- VectorSource(vectors)
corpuss <- VCorpus(v_source)
dtm <- DocumentTermMatrix(corpuss)
as_tidy <- tidy(dtm)
# use the bing lexicon; other lexicons (nrc, afinn) work as well
bing <- get_sentiments("bing")
as_bing_words <- inner_join(as_tidy,bing,by = c("term"="word"))
# check positive and negative words
as_bing_words
# add a numeric document index
index <- as_bing_words %>% mutate(doc = as.numeric(document))
# count by index and sentiment
index <- index %>% count(sentiment, doc)
# spread into positives and negatives
index <- index %>% spread(sentiment, n, fill = 0)
# add polarity score
index <- index %>% mutate(polarity = positive - negative)
index
Docs 4 and 6 are negative, doc 5 is neutral, and the rest are positive, which is actually the case.
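The join-then-score idea does not depend on tidytext itself. As a rough sketch in base R, assuming a tiny hand-made lexicon (hypothetical, not the bing lexicon) and made-up term counts:

```r
# Long-format term counts, one row per (document, term) pair (toy data)
tidy_counts <- data.frame(
  document = c(1, 1, 2, 2, 3),
  term     = c("excellent", "book", "miserable", "poor", "table"),
  count    = c(1, 1, 1, 2, 1)
)

# Tiny hand-made lexicon (hypothetical stand-in for a real sentiment lexicon)
lexicon <- data.frame(
  word      = c("excellent", "miserable", "poor"),
  sentiment = c("positive", "negative", "negative")
)

# Inner join: terms that are not in the lexicon drop out
scored <- merge(tidy_counts, lexicon, by.x = "term", by.y = "word")

# Signed count per row, then summed per document
scored$signed <- ifelse(scored$sentiment == "positive",
                        scored$count, -scored$count)
polarity <- tapply(scored$signed, scored$document, sum)
polarity
```

Two things to keep in mind: this sketch weights each term by its count, whereas the pipeline above counts matched rows, and a document with no lexicon words (document 3 here) disappears from the result rather than scoring zero.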

Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the same word as different topics.
The only problem is that the stemming algorithm leaves behind some words that aren't really words. "Happiness" stems to "happi," "arrange" stems to "arrang," and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.
By reading through some previous threads here on StackOverflow, I came across a function, stemCompletion(), from the tm package, that does this, at least approximately. It seems to work reasonably well.
But when I apply it to the terms vector within a document-term matrix, stemCompletion() always replaces the names of the character vector, not the characters themselves. Here's a reproducible example:
# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)
# Get first 200 words of Mansfield Park
words <- head(mansfieldpark, 200)
# Build a corpus from words
corpus <- quanteda::corpus(words)
# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")
# Create a document text matrix and do topic modeling
dtm <- corpus %>%
quanteda::dfm(remove_punct = TRUE,
remove = STOPWORDS) %>%
quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
quanteda::convert("topicmodels")
# Word stems are now stored in dtm$dimnames$Terms
# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)
# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)
# Apply tm::stemCompletion to Terms
unstemmed_terms <-
tm::stemCompletion(dtm$dimnames$Terms,
dictionary = words, # or corpus
type = "shortest")
# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)
tail(unstemmed_terms, 20)
I'm looking for a way to get the results returned by stemCompletion() into a character vector, and not into the names attribute of a character vector. Any insights into this issue are much appreciated.

The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.
tail(words)
# [1] "most liberal-minded sister and aunt in the world."
# [2] ""
# [3] "When the subject was brought forward again, her views were more fully"
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall"
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"
But the lines can easily be tokenised with quanteda's tokens() and then converted to a character vector.
unstemmed_terms <-
tm::stemCompletion(dtm$dimnames$Terms,
dictionary = as.character(tokens(words, remove_punct = TRUE)),
type = "shortest")
tail(unstemmed_terms, 20)
# arrang chariti perhap parsonag convers happi
# "arranging" NA "perhaps" NA "conversation" "happily"
# belief most liberal-mind aunt again view
# "belief" "most" "liberal-minded" "aunt" "again" "views"
# explain calm inquiri where come heard
# "explained" "calm" NA NA "come" "heard"
# surpris total
# "surprise" "totally"
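The result is still a named character vector (stems as names, completions as values), with NA where no completion was found. If you want a bare character vector, one option is sketched below in base R. The input here is a hand-typed toy vector shaped like the output above, not an actual stemCompletion() result.

```r
# Toy vector shaped like the stemCompletion() output:
# names are the stems, values are the completions (NA = none found)
completed <- c(arrang = "arranging", chariti = NA, happi = "happily")

# Fall back to the stem itself wherever no completion was found
filled <- ifelse(is.na(completed), names(completed), completed)

# Drop the names to get a plain character vector
terms <- unname(filled)
terms
```

Whether falling back to the stem is appropriate depends on how the terms are used afterwards; leaving the NA values in place is equally valid.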

Text-mining with the tm-package - word stemming

I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Some words have the same stem, but it is important that they are not "thrown together", as they mean different things.
For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is what happens in step 4.
Is there an elegant way to handle this manually for some cases/words (e.g. so that "lecturer" and "lecture" are kept as two different things)?
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))
# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus
# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
inspect(corpus.temp)
# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)

I'm not 100% sure what you're after, and I don't totally get how tm_map works, but if I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.
Note that I got frustrated with combining mgsub and tm_map, as it kept throwing an error, so I just used lapply instead.
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: words to retain and their identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically it works by:
subbing in a unique identifier key for each supplied "NO STEM" word (the mgsub)
then stemming (using stemDocument)
next reversing the substitution, replacing the identifier keys with the "NO STEM" words (the mgsub)
finally completing the stems (stemCompletion)
Here's the output:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer
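A related way to get the same effect, sketched here rather than taken from the answer, is to skip stemming for the protected words instead of swapping in placeholder keys. In the sketch below, toy_stem is a hypothetical suffix stripper standing in for a real stemmer (it is not the Porter algorithm), and stem_except is an illustrative helper name.

```r
# Toy suffix stripper standing in for a real stemmer
# (hypothetical; NOT the Porter algorithm used by stemDocument)
toy_stem <- function(w) sub("(er|e)$", "", w)

# Stem every word except those listed in `retain`
stem_except <- function(words, retain) {
  ifelse(words %in% retain, words, toy_stem(words))
}

words  <- c("popular", "lecturer", "memorial", "lecture", "teacher")
result <- stem_except(words, retain = c("lecturer", "lecture"))
result
```

Without the retain list, both "lecturer" and "lecture" would collapse to the same stem under this toy stemmer. This version works on a vector of single words, so text would need to be split into words first and pasted back together afterwards.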

You can also use the SnowballC package for stemming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.
You just need to call the function wordStem, passing the vector of words to be stemmed and the language you are dealing with. To find the exact language string to use, call getStemLanguages, which returns all the available options.
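A minimal usage sketch, assuming the SnowballC package is installed; the example words are arbitrary:

```r
library(SnowballC)

# List the language strings that wordStem() accepts
langs <- getStemLanguages()
langs

# Stem a vector of single words; each element should be one word
stems <- wordStem(c("running", "runner", "runs"), language = "english")
stems
```

Because wordStem expects one word per element, a whole sentence passed as a single string will not be stemmed word by word.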
Kind Regards

Snowball Stemmer only stems last word

I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.
library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/diretory")
corp <- Corpus(DirSource(path),
readerControl = list(reader = readPlain, language = "en_US",
load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem
I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:
> vec<-c("running runner runs","happyness happies")
> stemDocument(vec)
[1] "running runner run" "happyness happi"
> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
[1] "run" "runner" "run" "happy" "happi" <-
> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
run runner run
[[2]]
happy happi
> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" , load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1.txt`
running runner runs
$`2.txt`
happyness happies

Load the required libraries:
library(tm)
library(Snowball)
Create a vector:
vec<-c("running runner runs","happyness happies")
Create a corpus from the vector:
vec<-Corpus(VectorSource(vec))
It is very important to check the class of the documents in the corpus and to preserve it, because we want a standard corpus that the tm functions understand:
class(vec[[1]])
vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs
This will probably report a PlainTextDocument.
So now we work around the faulty stemDocument behaviour. First we convert the plain text document to character, then we split the text into words, apply stemDocument (which now works fine on single words), and paste the result back together. Most importantly, we convert the output back to the PlainTextDocument class provided by the tm package.
stemDocumentfix <- function(x)
{
PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}
Now we can use the standard tm_map on our corpus:
vec1 = tm_map(vec, stemDocumentfix)
The result is:
vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run
The most important thing to remember is to always preserve the class of the documents in the corpus.
I hope this is a simple solution to your problem, using functions from the two libraries loaded.

The problem I see is that wordStem takes a vector of words, but Corpus plainTextReader assumes that each word in the documents it reads is on its own line. In other words, the following would confuse plainTextReader, as you would end up with 3 "words" in your document:
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes
Instead, the document should be:
From
ancient
grudge
break
to
new
mutiny
where
civil
...etc...
Note that punctuation also confuses wordStem, so you would have to strip it out as well.
Another way to do this, without modifying your actual documents, is to define a function that does the separation and removes non-alphanumerics that appear before or after a word. Here is a simple one:
wordStem2 <- function(x) {
  mywords <- unlist(strsplit(x, " "))
  mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl = TRUE)
  mycleanwords <- mycleanwords[mycleanwords != ""]
  wordStem(mycleanwords)
}
corpA <- tm_map(mycorpus, wordStem2)
corpB <- Corpus(VectorSource(corpA))
Now just use corpB as your usual Corpus.
