I have been attempting to find the frequency of each term in Martin Luther King's "I Have a Dream" speech. I have converted all uppercase letters to lowercase and removed all stop words. I have the text in a .txt file, so I cannot display it here. The code that reads in the file is below:
speech <- readLines("speech.txt")
Then I performed the conversion to lowercase and the removal of stop words successfully, and called the result:
clean.speech
Now I am having some issues with finding the frequency per term. I have created a corpus, inspected my corpus, and created a TermDocumentMatrix as follows:
myCorpus <- Corpus(VectorSource(clean.speech))
inspect(myCorpus)
TDM <- TermDocumentMatrix(myCorpus)
Everything is fine up to this point. However, I then wrote the following code and got the warning message of:
m < as.matrix(TDM)
Warning message:
In m < as.matrix(TDM) : longer object length is not a multiple of shorter object length
I know this is a very common warning message, so I Googled it first, but I could not find anything pertaining to term frequencies. I proceeded to run the following code, to see if it would still work despite the warning, but it did not.
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word=names(v), freq=v)
head(d, 15)
My goal is just to find the frequency of terms. I sincerely apologize for asking this question, because I know it gets asked a lot; I just do not understand what to change about my code. Thank you, everyone; I appreciate it!
If your goal is just to find the frequency of the terms, then try this.
First, I get the "I Have a Dream" speech into a character vector:
# get the text of the speech from an HTML source, and extract the text
library(XML)
doc.html <- htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', useInternal = TRUE)
doc.text <- unlist(xpathApply(doc.html, '//p', xmlValue))
doc.text <- paste(doc.text, collapse = ' ')
Then, I create the document-feature matrix in quanteda, removing stop words (and adding "will", since quanteda's built-in list of English stop words does not include this term). From there, topfeatures() gives you the most frequent terms and their counts.
library(quanteda)
# create a document-feature matrix
IHADdfm <- dfm(doc.text, ignoredFeatures = c("will", stopwords("english")), verbose = FALSE)
# 12 most frequent features
topfeatures(IHADdfm, 12)
##  freedom      one     ring    dream      let      day    negro    today     able    every together    years
##       13       12       12       11       10        9        8        7        7        7        6        5
# a word cloud, if you wish
plot(IHADdfm, random.order = FALSE)
Just call findFreqTerms(), e.g. tm::findFreqTerms(TDM, lowfreq = 2, highfreq = 5).
(The tm:: prefix is optional; it just indicates that findFreqTerms() is a built-in function of the tm package.)
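For completeness, the warning in the question comes from m < as.matrix(TDM): the < is the comparison operator, where the assignment operator <- was almost certainly intended. With that one character fixed, the rowSums() approach from the question works as-is:
m <- as.matrix(TDM)                        # assignment, not comparison
v <- sort(rowSums(m), decreasing = TRUE)   # total count of each term
d <- data.frame(word = names(v), freq = v)
head(d, 15)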
Related
I am attempting to create a document frequency matrix in R.
I currently have a dataframe (df_2), which is made up of 2 columns:
doc_num: which details which document each term is coming from
text_token: which contains each tokenized word relating to each document.
The df's dimensions are 79,447 * 2.
However, there are only 400 actual documents in the 79,447 rows.
I have been trying to create this dfm using the tm package.
I have tried creating a corpus (VectorSource) and then attempting to coerce that into a dfm using the appropriately named dfm() command.
However, this indicates that "dfm() only works on character, corpus, dfm, tokens objects."
I understand my data isn't currently in the correct format for the dfm command to work.
My issue is that I don't know how to get from my current point to a matrix as appears below.
Example of what I would like the matrix to look like when complete:
Where 2 is the number of times cat appears in doc_2.
Any help on this would be greatly appreciated.
Yours respectfully.
It will be useful for you and others if all pertinent details are made available with your code, such as the fact that dfm() comes from the quanteda package.
If the underlying text is set up correctly, dfm() will directly give you what you are looking for; that is precisely what it is designed for.
Here is a simulation:
library(tm)
library(quanteda)
# install.packages("readtext")
library(readtext)
doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"
write.table(doc1, "doc1.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc2, "doc2.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc3, "doc3.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc4, "doc4.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
# save above into your WD
getwd()
txt <- readtext(paste0("Your WD/docs", "/*"))
txt
corp <- corpus(txt)
x <- dfm(corp)
View(x)
If the issue is one of formatting/cleaning your data so that you can run dfm(), then you need to post a new question that provides the necessary details on your data.
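That said, here is a minimal sketch of the step the question actually asks about, assuming df_2 has the two columns described (doc_num and text_token): paste each document's tokens back into one string per document, then build the dfm from that.
library(quanteda)
# collapse the token rows back into one string per document
docs <- tapply(df_2$text_token, df_2$doc_num, paste, collapse = " ")
corp <- corpus(as.character(docs), docnames = names(docs))
x <- dfm(tokens(corp))   # one row per document, one column per term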
My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the same word as different topics.
The only problem is that the stemming algorithm leaves behind some words that aren't really words. "Happiness" stems to "happi", "arrange" stems to "arrang", and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.
By reading through some previous threads here on StackOverflow, I came across a function, stemCompletion(), from the tm package, that does this, at least approximately. It seems to work reasonably well.
But when I apply it to the terms vector within a document-term matrix, stemCompletion() always replaces the names of the character vector, not the characters themselves. Here's a reproducible example:
# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)
# Get the first 200 lines of Mansfield Park
words <- head(mansfieldpark, 200)
# Build a corpus from words
corpus <- quanteda::corpus(words)
# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")
# Create a document-term matrix and prepare it for topic modeling
dtm <- corpus %>%
  quanteda::dfm(remove_punct = TRUE,
                remove = STOPWORDS) %>%
  quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
  quanteda::convert("topicmodels")
# Word stems are now stored in dtm$dimnames$Terms
# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)
# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)
# Apply tm::stemCompletion to Terms
unstemmed_terms <-
  tm::stemCompletion(dtm$dimnames$Terms,
                     dictionary = words, # or corpus
                     type = "shortest")
# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)
tail(unstemmed_terms, 20)
I'm looking for a way to get the results returned by stemCompletion() into a character vector, and not into the names attribute of a character vector. Any insights into this issue are much appreciated.
The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.
tail(words)
# [1] "most liberal-minded sister and aunt in the world."
# [2] ""
# [3] "When the subject was brought forward again, her views were more fully"
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall"
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"
But the lines can easily be tokenised using quanteda's tokens(), with the result converted to a character vector.
unstemmed_terms <-
  tm::stemCompletion(dtm$dimnames$Terms,
                     dictionary = as.character(tokens(words, remove_punct = TRUE)),
                     type = "shortest")
tail(unstemmed_terms, 20)
# arrang chariti perhap parsonag convers happi
# "arranging" NA "perhaps" NA "conversation" "happily"
# belief most liberal-mind aunt again view
# "belief" "most" "liberal-minded" "aunt" "again" "views"
# explain calm inquiri where come heard
# "explained" "calm" NA NA "come" "heard"
# surpris total
# "surprise" "totally"
I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one Excel file, and I'm only interested in text mining one column of it. That column has about 1200 rows. I want to create a row (sentence)-term matrix, i.e. a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run PCA on later.
A dtm in my case is not helpful: since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents if that makes sense. From there, I want a matrix which the frequency of words in each sentence.
Thank you!
When using text2vec, you just need to feed the content of your column as a character vector into the tokenizer function; see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values, as PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or otherwise apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")
# Generate a document-term matrix with text2vec
tokens = docs %>%
  word_tokenizer()
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
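Since the question asks for a matrix of 1's and 0's, one simple option on top of the dtm built above is to threshold the counts; alternatively, text2vec's TfIdf model gives the weighted version recommended above. A sketch of both:
# binarize: any positive count becomes 1 (the matrix stays sparse)
dtm_binary <- 1 * (dtm > 0)
# or apply tf-idf weighting instead, as suggested for PCA
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)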
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
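A tiny worked example of that second form, using made-up sentences (hypothetical data, just to show the shape of the result):
library(corpus)
sentences <- c("the cat sat", "the dog sat", "the cat ran")
x <- term_matrix(sentences)   # one row per sentence, one column per term
x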
Can't add comments so here's a suggestion:
# Read data from file using fread (from the data.table package, for .csv)
library(data.table)
library(stringr)
dat <- fread(filename, <add parameters as needed - col.names, nrows etc>)
counts <- sapply(row_start:row_end, function(z) str_count(dat[z, .(selected_col_name)], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters; you can use tolower to achieve that. Hope this is helpful!
I am a beginner in R and text mining. I have already performed the LDA, and now I want to visualise my results with the LDAvis package. I have followed every step from the GitHub example (https://ldavis.cpsievert.me/reviews/reviews.html), starting from the 'Visualizing' chapter. However, I either get error notifications or empty pages.
I have tried the following:
RedditResults <- list(phi = phi,
                      theta = theta,
                      doc.length = doc.length,
                      vocab = vocab,
                      term.frequency = term.frequency)
json <- createJSON(phi = RedditResults$phi,
                   theta = RedditResults$theta,
                   doc.length = RedditResults$doc.length,
                   vocab = RedditResults$vocab,
                   term.frequency = RedditResults$term.frequency)
serVis(json, out.dir = "vis", open.browser = FALSE)
However, this gives me an error display saying:
Error in cat(list(...), file, sep, fill, labels, append) :
argument 1 (type 'closure') cannot be handled by 'cat'
I reasoned this might have happened because the 'json' object is of class 'function' rather than a character string, which is what I read the object has to be for serVis to work. Therefore I tried to convert it before using serVis, by means of
RedditResults <- sapply(RedditResults, toJSON)
Resulting in the following error:
Error in run(timeoutMs) :
Evaluation error: argument must be a character vector of length 1.
I feel like I'm making a very obvious mistake somewhere, but after days of trial and error I haven't been able to spot what I should do differently.
The weirdest thing to me is that sometimes it does work, but then when I try to open the html file I only see a blank page. I have tried opening it in multiple browsers as well as opening up those browsers to display local files. I have also tried opening it using the servr package, but this gives me the same result, which is either an error notification (character vector length is not equal to 1) or an empty page.
Hope anyone can spot what I'm doing wrong. Thanks!
EDIT: objects/code underlying the code above:
Convenient to know:
I cleaned the data in corpus form (reddit_data_textcleaned) before converting it to my document-term matrix (tdm3).
After converting it to tdm3, I eliminated any 'empty' documents by excluding those with fewer than 2 words. Thus, 'reddit_data_textcleaned' contains more documents than are relevant, and 'tdm3' contains the data I want to work with.
'fit3' is the fitted model resulting from doing LDA on tdm3
'DTM' is the term-document matrix with exactly the same data as tdm3, but with transposed rows/columns.
I am aware that it makes very little sense to call your term-document matrix 'DTM' whilst naming your document-term matrix 'tdm3', given the abbreviations. Sorry about that.
phi <- as.matrix(posterior(fit3)$terms)
theta <- as.matrix(posterior(fit3)$topics)
dp <- dim(phi) # should be K x W
dt <- dim(theta) # should be D x K
D <- length(as.matrix(tdm3[, 1])) # number of documents (2812)
doc.length <- colSums(as.matrix(DTM)) #number of tokens in each document
N <- sum(doc.length) # total number of tokens in the data (54,136)
vocab <- colnames(phi)# all terms in the vocab
W <- length(vocab) # number of terms in the vocab (6470)
temp_frequency <- inspect(tdm3)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))
rm(temp_frequency)
term.frequency <- freq_matrix$Freq
doc.list <- as.list(reddit_data_textcleaned, "[[:space:]]+")
get.terms <- function(x) {
  index <- match(x, vocab)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(doc.list, get.terms)
I presume something goes wrong in the creation of the 'get.terms' and 'documents' objects, as I don't exactly know what happens there. I used these methods based on answers to similar questions I read on this platform. Also, the 'doc.list' object still contains the empty documents I removed from the data after converting 'reddit_data_textcleaned' to 'tdm3'. However, the code above doesn't work with a document-term matrix object, so that's why I used 'reddit_data_textcleaned' instead of 'tdm3'. I figured I would fix that issue later.
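One likely culprit, for what it's worth: as.list() does not split text on a pattern (its second argument here is silently ignored), whereas the LDAvis walkthrough linked above builds doc.list by splitting each document on whitespace. A sketch closer to that example, assuming reddit_data_textcleaned can be coerced to a character vector:
# split each document into tokens on whitespace, as in the LDAvis example
doc.list <- strsplit(as.character(reddit_data_textcleaned), "[[:space:]]+")
documents <- lapply(doc.list, get.terms)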
I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Obviously, there are some words which have the same stem, but it is important that they are not "thrown together" (as those words mean different things).
For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is what is done in step 4.
Is there any elegant solution for handling some cases/words manually (e.g. so that "lecturer" and "lecture" are kept as two different things)?
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))
# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus
# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
inspect(corpus.temp)
# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
I'm not 100% sure what you're after and don't totally get how tm_map works, but if I understand correctly, the following works. As I understand it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.
Note that I got frustrated with using mgsub and tm_map together, as they kept throwing an error, so I just used lapply instead.
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list of words to retain, plus identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
Basically it works by:
subbing out a unique identifier key for each of the supplied "NO STEM" words (with mgsub)
then stemming (using stemDocument)
next reversing the substitution, replacing the identifier keys with the "NO STEM" words (with mgsub again)
and last, completing the stems (stemCompletion)
Here's the output:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer
You can also use the SnowballC package for stemming words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf.
You just need to use the function wordStem, passing the vector of words to be stemmed and also the language you are dealing with. To know the exact language string to use, you can refer to the function getStemLanguages, which will return all possible options.
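For example, a minimal call looks like this (a sketch; the exact stems depend on the Snowball rules for the chosen language):
library(SnowballC)
getStemLanguages()   # returns the supported language strings
wordStem(c("association", "associate", "lecture", "lecturer"),
         language = "english")
# pairs like these can collapse to the same stem, which is exactly the
# collision discussed in the question above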
Kind Regards