I am trying to apply a sentiment analysis in R with the help of my DTM (document term matrix) or TDM (term document matrix). I could not find any similar topic in the forum and on google. Thus, I created a corpus and from that corpus I generated a dtm/tdm in R. My next step would be to apply the sentiment analysis which I need later for stock prediction via SVM. My give code is that:
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.data.frame(as.matrix(dtm))
tdm <- TermDocumentMatrix(docs)
tdm <- removeSparseTerms(tdm, 0.99)
tdm <- as.data.frame(as.matrix(tdm))
I read that it is possible through the tidytext package with the help of the get_sentiments() function. But it was not possible to apply that with a DTM/TDM. How can I run a sentiment analysis for my cleaned filter words which are already stemmed, tokenized etc.? I saw that a lot of people did the sentiment analysis for a hole sentence, but I would like to apply it for my single words in order to see if they are positive, negative, score etc. Many thanks in advance!
SentimentAnalysis has good integration with tm.
library(tm)
library(SentimentAnalysis)
documents <- c("Wow, I really like the new light sabers!",
"That book was excellent.",
"R is a fantastic language.",
"The service in this restaurant was miserable.",
"This is neither positive or negative.",
"The waiter forget about my dessert -- what poor service!")
vc <- VCorpus(VectorSource(documents))
dtm <- DocumentTermMatrix(vc)
analyzeSentiment(dtm,
rules=list(
"SentimentLM"=list(
ruleSentiment, loadDictionaryLM()
),
"SentimentQDAP"=list(
ruleSentiment, loadDictionaryQDAP()
)
)
)
# SentimentLM SentimentQDAP
# 1 0.000 0.1428571
# 2 0.000 0.0000000
# 3 0.000 0.0000000
# 4 0.000 0.0000000
# 5 0.000 0.0000000
# 6 -0.125 -0.2500000
To use tidytext on dtm to get sentiments convert dtm to tidy format first and then do inner join between tidy data and dictionary of polarised words.I will use the same document as used above. Some doc in above example are positive but given neutrel score.
let's see how tidytext performs
library(tidytext)
library(tm)
library(dplyr)
library(tidyr)
documents <- c("Wow I really like the new light sabers",
"That book was excellent",
"R is a fantastic language",
"The service in this restaurant was miserable",
"This is neither positive or negative",
"The waiter forget about my dessert -- what poor service")
# create tidy format
vectors <- as.character(documents)
v_source <- VectorSource(vectors)
corpuss <- VCorpus(v_source)
dtm <- DocumentTermMatrix(corpuss)
as_tidy <- tidy(dtm)
# Using bing lexicon: you can use other as well(nrc/afinn)
bing <- get_sentiments("bing")
as_bing_words <- inner_join(as_tidy,bing,by = c("term"="word"))
# check positive and negative words
as_bing_words
# set index for documents number
index <- as_bing_words%>%mutate(doc=as.numeric(document))
# count by index and sentiment
index <- index %>% count(sentiment,doc)
# spread into positives and negavtives
index <- index %>% spread(sentiment,n,fill=0)
# add polarity scorer
index <- index %>% mutate(polarity = positive-negative)
index
Doc 4 and 6 are negative,5 neutrel and rest positive which is actually the case
Related
I'm trying to run Naive Bayes algorithm with 2 csv files one for training and the other for testing, the first file contain only one row which mean only one answer for training.
The first Data set for example:-
Answer
yes beacause of the average number of laureates per prize increased a prize may not be shared among more than three people
Evaluation
correct
Second Data set for example:-
Answer
1 yes, They can share the prize. If 2 people won the prize in the same fileds they shared it
2 No, beacause a prize increased substantially during The 20Th century
3 it may not be shared among more than three people, because the average number of laureates per prize increased substantially during the 20th century
4 No, Because the average number of laureates per prize increased substantially during the 20th century, if it shared the miximum is three people
5 yes because The Nobel prize is widely regarded as the most prestigious award available
Evaluation
correct
incorrect
correct
incorrect
correct
and here's the code
#importing the csv data
train_ds <- read.csv2("I:/Datasets/traindataset.csv", stringsAsFactors = FALSE)
test_ds <- read.csv2("I:/Datasets/dataset5.csv", stringsAsFactors = FALSE)
#sms_raw
#The type element is currently a character vector. Since this is a categorical variable, it would be better to convert it into a factor.
#Both numeric and character variables can be made into factors, but a factor's levels will always be character values.
train_ds$Evaluation <- factor(train_ds$Evaluation)
train_ds$Evaluation
class(train_ds$Evaluation)
#Examining this with the str() and table()
str(train_ds$Evaluation)
#see how many are ham and spam
table(train_ds$Evaluation)
#install tm package
#install.packages("tm")
library(tm)
#create a source object from existing sms_raw$text vector, which can be supplied to VCorpus()
sms_corpus <- VCorpus(VectorSource(train_ds$Answer))# create the corpus - VCorpus(vecSource)
sms_corpus
#clean up our cropus using a series of transformations and save the result in a new object called corpus_clean
#standardize the messages to use only lowercase characters
# tm_map for an interface to apply transformations to corpora.
sms_corpus_clean <- tm_map(sms_corpus,
content_transformer(tolower))
sms_corpus_clean
#to check whether the command worked a advertised
as.character(sms_corpus[[1]]) ## Here will preent the first line in the dataset as it is
as.character(sms_corpus_clean[[1]]) ## Each upercase characters convert to lowercase character of the first line
#continue our cleanup be removing numbers from SMS messages
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
## as.character(sms_corpus_clean[[3]]) ## This is an example to remove numbers from text
sms_corpus_clean
#remove any words thats in the stop word list
sms_corpus_clean <- tm_map(sms_corpus_clean,
removeWords, stopwords())
sms_corpus_clean
#eliminate any punctuation from the text messages
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean
#create a custom function that replaces rather than removes punctuation characters
replacePunctuation <- function(x){
gsub("[[:punct:]]+", " ", x) ## gsub() function replaces all matches of a string
}
replacePunctuation
#install the package for the stemming functionality
#install.packages("SnowballC") ## Snowball stemmers packages are stemming algorithm for collapsing words to a common root to aid comparison of vocabulary.
library(SnowballC)
wordStem(c("learn", "learned", "learning", "learns")) ## This function computes the stems of each of the given words in the vector. This reduces a word to its base component
#apply the wordStem() function to an entire corpus of text documents
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
##as.character(sms_corpus_clean[[5038]])
sms_corpus_clean
#the final step in our text clean up process is to remove the additional white spaces
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
sms_corpus_clean
#sms message before and after clean up
as.character(sms_corpus[1:3])
as.character(sms_corpus_clean[1:3])
#creating DTM sparse matrix
#Constructs or coerces to a term-document matrix or a document-term matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
sms_dtm
#create a DTM directly from the raw, unprocessed SMS corpus
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(tolower = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
removePunctuation = TRUE,
steming = TRUE
)
)
sms_dtm2
#comparing sms_dtm and sms_dtm2
sms_dtm
sms_dtm2
str(sms_dtm)
#creating training and test datasets
sms_dtm_train <- sms_dtm[1:1, ]
sms_dtm_train
sms_dtm_test<- sms_dtm[1:1, ]
sms_dtm_test
#creating the labels
sms_train_labels <- train_ds[1:1, ]$Evaluation
sms_train_labels
sms_test_labels <- test_ds[1:39, ]$Evaluation
sms_test_labels
########################################################3
#to confirm the subset are representative of the complete set of SMS data
#lets compare the proportion of spam in the training and test data frames:
prop.table(table(sms_train_labels))
#install word cloud package
#install.packages("wordcloud")
library(wordcloud)
library(RColorBrewer)
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
#lets use R's subset() function to take a subset of the sms_raw data by the SMS type.
spam <- subset(test_ds, Evaluation == "incorrect")
spam
#Next, we will do the same thing for the ham subset
ham <- subset(test_ds, Evaluation == "correct")
ham
#we will use the max.words parameter to look at the 40 most common words in each of the two sets
wordcloud(spam$Answer, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$Answer, max.words = 40, scale = c(3, 0.5))
#display the words appearing at least five times in the sms_dtm_train matrix:
findFreqTerms(sms_dtm_train, 3)
#the result of the function is a character vector, so let's save our frequent words for later on:
sms_freq_words <- findFreqTerms(sms_dtm_train, 3)
sms_freq_words
#We now need to filter our DTM to include only the terms appearing in a specified vector
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_train
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
sms_dtm_freq_test
#To measure the number of times a word appears in a message, We need to change this to a categorical variable
#that simply indicates yes or no depending on whether the word appears at all
#The following defines convert_counts() function to convert counts to a Yes/No strings
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
#now we need to apply convert_counts() to each of the columns in oursparse matrix
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_train
sms_test <- apply(sms_dtm_test, MARGIN = 2, convert_counts)
sms_test
#install packages
#install.packages("e1071")
library(e1071)
#To build our model on the sms_train, we will use the following command
sms_classifier <- naiveBayes(train_ds, sms_train_labels)
#The predict() function is usedto make the predictions. We will store these in a vector named sms_test_pred. We will simply Supply the
#function with the names of our classifier and test data set
sms_test_pred <- predict(sms_classifier, test_ds)
sms_test_pred
#compare prediction with true values
table(sms_test_labels, sms_test_pred)
when I'm trying to run this code I get 2 or 3 errors
Error in apply(log(sapply(seq_along(attribs), function(v) { : dim(X)
must have a positive length
Error in table(sms_test_labels, sms_test_pred) : all arguments must have the same length
I turned about 50,000 rows of varchar data into a corpus, and then proceeded to clean said corpus using the TM package, getting ride of stopwords, punctuation, and numbers.
I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms return the common words, and the number of times it shows up in the data.
However, I want to use a function that says search for "word" and return how many times "word" appears in the TermDocumentMatrix.
Is there a function in TM that achieves this? Do I have to change my data to a data.frame and use a different package & function?
Since you have not given a reproducible example, I will give one using the crude dataset available in the tm package.
You can do it in (at least) 2 different ways. But anything that turns a sparse matrix into a dense matrix can use a lot of memory. So I will give you 2 options. The first one is more memory friendly as it makes use of the sparse tdm matrix. The second one, first transforms the tdm into a dense matrix before creating a frequency vector.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)
# Making use of the fact that a tdm or dtm is a simple_triplet_matrix from slam
my_func <- function(data, word){
slam::row_sums(data[data$dimnames$Terms == word, ])
}
my_func(tdm, "crude")
crude
21
my_func(tdm, "oil")
oil
85
# turn tdm into dense matrix and create frequency vector.
freq <- rowSums(as.matrix(tdm))
freq["crude"]
crude
21
freq["oil"]
oil
85
edit:
As requested in comment:
# all words starting with cru. Adjust regex to find what you need.
freq[grep("^cru", names(freq))]
crucial crude
2 21
# separate words
freq[c("crude", "oil")]
crude oil
21 85
My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the Quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the same word as different topics.
Only problem is that the stemming algorithm leaves behind some words that aren't really words. "Happiness" stems to "happi," "arrange" stems to "arrang," and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.
By reading through some previous threads here on StackOverflow, I came across a function, stemCompletion(), from the TM package, that does this, at least approximately. It seems to work reasonably well.
But when I apply it to the terms vector within a document text matrix, stemCompletion() always replaces the names of the character vector, not the characters themselves. Here's a reproducible example:
# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)
# Get first 200 words of Mansfield Park
words <- head(mansfieldpark, 200)
# Build a corpus from words
corpus <- quanteda::corpus(words)
# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")
# Create a document text matrix and do topic modeling
dtm <- corpus %>%
quanteda::dfm(remove_punct = TRUE,
remove = STOPWORDS) %>%
quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
quanteda::convert("topicmodels")
# Word stems are now stored in dtm$dimnames$Terms
# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)
# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)
# Apply tm::stemCompletion to Terms
unstemmed_terms <-
tm::stemCompletion(dtm$dimnames$Terms,
dictionary = words, # or corpus
type = "shortest")
# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)
tail(unstemmed_terms, 20)
I'm looking for a way to get the results returned by stemCompletion() into a character vector, and not into the names attribute of a character vector. Any insights into this issue are much appreciated.
The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.
tail(words)
# [1] "most liberal-minded sister and aunt in the world."
# [2] ""
# [3] "When the subject was brought forward again, her views were more fully"
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall"
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"
But this can easily be tokenised using quanteda's tokens(), and converting that to a character vector.
unstemmed_terms <-
tm::stemCompletion(dtm$dimnames$Terms,
dictionary = as.character(tokens(words, remove_punct = TRUE)),
type = "shortest")
tail(unstemmed_terms, 20)
# arrang chariti perhap parsonag convers happi
# "arranging" NA "perhaps" NA "conversation" "happily"
# belief most liberal-mind aunt again view
# "belief" "most" "liberal-minded" "aunt" "again" "views"
# explain calm inquiri where come heard
# "explained" "calm" NA NA "come" "heard"
# surpris total
# "surprise" "totally"
I'm using R and the tm package to do some text analysis.
I'm trying to build a subset of a corpus based on whether a certain expression is found within the content of the individual text files.
I create a corpus with 20 textfiles (thank you lukeA for this example):
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
I now would like to select only those textfiles that contain the string "price reduction" to create a subset-corpus.
Inspecting the first textfile of the document, I know that there is at least one textfile containing that string:
writeLines(as.character(corp[1]))
How would I best go about doing this?
Here's a simpler way using the quanteda package, and one more consistent with the way that reuses existing methods already defined for other R objects. quanteda has a subset method for corpus objects that works just like the subset method for a data.frame, but selects on logical vectors including document variables defined in the corpus. Below, I have extracted the texts from the corpus using the texts() method for corpus objects, and used that in a grep() to search for your pair of words.
require(tm)
data(crude)
require(quanteda)
# corpus constructor recognises tm Corpus objects
(qcorpus <- corpus(crude))
## Corpus consisting of 20 documents.
# use subset method
(qcorpussub <- corpus_subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus))))
## Corpus consisting of 1 document.
# see the context
## kwic(qcorpus, "price reduction")
## contextPre keyword contextPost
## [127, 45:46] copany said." The [ price reduction ] today was made in the
Note: I spaced your regex with "\s+" since you could have some variation of spaces, tabs, or newlines instead of just a single space.
Here's one way using tm_filter:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
( corp_sub <- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE))) )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 1
cat(content(corp_sub[[1]]))
# Diamond Shamrock Corp said that
# effective today it had cut its contract prices for crude oil by
# 1.50 dlrs a barrel.
# The reduction brings its posted price for West Texas
# Intermediate to 16.00 dlrs a barrel, the copany said.
# "The price reduction today was made in the light of falling # <=====
# oil product prices and a weak crude oil market," a company
# spokeswoman said.
# Diamond is the latest in a line of U.S. oil companies that
# have cut its contract, or posted, prices over the last two days
# citing weak oil markets.
# Reuter
How did I get there? By looking into the packages' vignette, searching for subset, and then looking at the examples for tm_filter (help: ?tm_filter), which is mentioned there. It might also be worth looking at ?grep to inspect the options for pattern matching.
#lukeA's solution works. I want to give another solution I prefer.
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))
for(i in 1:length(corp))
corp[[i]]$meta["mySubset"] <- corpTF[i]
idx <- meta(corp, tag ="mySubset") == 'TRUE'
filtered <- corp[idx]
cat(content(filtered[[1]]))
Advantage of this solution by using meta tags, we can see all corpus elements with a selection tag mySubset, value 'TRUE' for our selected ones, and value 'FALSE' for otherwise.
I am doing some text mining in R with the tm-package. Everything works very smooth. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words, which have the same stem, but it is important that they are not "thrown together" (as those words mean different things).
For an example see the 4 texts below. Here you cannnot use "lecturer" or "lecture" ("association" and "associate") interchangeable. However, this is what is done in step 4.
Is there any elegant solution how to implement this for some cases/words manually (e.g. that "lecturer" and "lecture" are kept as two different things)?
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))
# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus
# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
inspect(corpus.temp)
# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
I'm not 100% sure what you're after and don't totally get how tm_map works. If I understand then the following works. As I understand you want to supply a list of words that should not be stemmed. I'm using the qdap package mostly because I'm lazy and it has a function mgsub I like.
Note that I got frustrated with using mgsub and tm_map as it kept throwing an error so I just used lapply instead.
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list to retain and indentifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically it works by:
subbing out a unique identifier key for the supplied "NO STEM" words (the mgsub)
then you stem (using stemDocument)
next you reverse it and sub the identifier keys with the "NO STEM" words (the mgsub)
last complete the Stem (stemCompletion)
Here's the output:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer
You can also use the following package for steeming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.
You just need to use the function wordStem, passing the vector of words to be stemmed and also the language you are dealing with. To know the exactly language string you need to use, you can refer to the method getStemLanguages, which will return all possible options for it.
Kind Regards