Create a Document Frequency Matrix in R

I am attempting to create a document frequency matrix in R.
I currently have a dataframe (df_2) made up of 2 columns:
doc_num: which document each term comes from
text_token: each tokenized word belonging to that document.
The data frame's dimensions are 79,447 x 2.
However, these 79,447 rows cover only 400 actual documents.
I have been trying to create this dfm using the tm package.
I have tried creating a corpus (VectorSource) and then attempting to coerce that into a dfm using the appropriately named dfm() command.
However, this returns "dfm() only works on character, corpus, dfm, tokens objects."
I understand my data isn't currently in the correct format for the dfm command to work.
My issue is that I don't know how to get from my current point to a matrix like the one below.
Example of what I would like the matrix to look like when complete:
Where 2 is the number of times cat appears in doc_2.
Any help on this would be greatly appreciated.
Yours sincerely.

It will be useful for you and others if all pertinent details are made available with your code - such as the fact that dfm() comes from the quanteda package.
If the underlying text is set up correctly, dfm() will directly give you what you are looking for - that is precisely what it is designed for.
Here is a simulation:
library(tm)
library(quanteda)
# install.packages("readtext")
library(readtext)
doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"
write.table(doc1,"doc1.txt",sep="\t",row.names=FALSE, col.names = F)
write.table(doc2,"doc2.txt",sep="\t",row.names=FALSE, col.names = F)
write.table(doc3,"doc3.txt",sep="\t",row.names=FALSE, col.names = F)
write.table(doc4,"doc4.txt",sep="\t",row.names=FALSE, col.names = F)
# the four .txt files above are saved into your working directory
getwd()
# point readtext at the folder holding the .txt files (replace "Your WD/docs")
txt <- readtext(paste0("Your WD/docs", "/*"))
txt
corp <- corpus(txt)
x <- dfm(corp)
View(x)
If the issue is one of formatting/cleaning your data so that you can run dfm(), then you need to post a new question that provides the necessary details on your data.
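That said, for the two-column layout described in the question (doc_num, text_token), one possible route is to rebuild a tokens object per document and then tabulate it. A minimal sketch, assuming df_2 is the data frame described above and text_token is a character column:
library(quanteda)
# split the token column into one character vector per document,
# convert that list into a quanteda tokens object, then tabulate it
toks <- as.tokens(split(df_2$text_token, df_2$doc_num))
mat <- dfm(toks)   # documents in rows, terms in columns, counts in the cells
mat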

Related

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using a single Excel file, and I'm only interested in text mining one column of it. That column has about 1,200 rows. I want to create a row (sentence)-term matrix, i.e. a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because, since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix giving the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function - see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
"the coffee is cold",
"the coffee is hot",
"the coffee is warm",
"the coffee is hot",
"the coffee is perfect")
# Generate a document-term matrix with text2vec
tokens = word_tokenizer(docs)
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
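As noted above, the raw counts are usually transformed before a PCA-style analysis; text2vec ships a TfIdf model for this, and the 0/1 indicator matrix asked for in the question can be obtained by thresholding the counts. A small sketch continuing from the dtm created above:
# tf-idf weighting of the dtm (continuous values, better suited to PCA)
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
# alternatively, a plain 0/1 indicator matrix (dense, which is fine for ~1,200 sentences)
m <- as.matrix(dtm)
m[m > 0] <- 1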
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
# Read data from file using fread (for .csv, from the data.table package)
library(data.table)
library(stringr)
dat <- fread(filename, <add parameters as needed - col.names, nrows etc>)
# count occurrences of "the" in the chosen column for the selected rows
counts <- sapply(row_start:row_end, function(z) str_count(dat[[selected_col_name]][z], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters - you can use tolower to achieve that. Hope this is helpful!

stri_replace_all_fixed slow on big data set - is there an alternative?

I'm trying to stem ~4,000 documents in R using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in Danish and therefore the Porter stemmer algorithm is not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: look at each word in each document -> if the word matches a word in the voc table, replace it with the corresponding tran word.
## Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep = ";")
# Using the 'stringi' library for the stemming (and 'tm' for Corpus/tm_map)
library(tm)
library(stringi)
# Split the voc table and put the Word and Stem columns into separate corpora
word <- Corpus(VectorSource(voc))[1]
tran <- Corpus(VectorSource(voc))[2]
# Using stri_replace_all_fixed to stem words
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
Structure of "voc" data frame:
Word Stem
1 abandonnere abandonner
2 abandonnerede abandonner
3 abandonnerende abandonner
...
313273 åsyns åsyn
To make dictionary matching fast, you need to implement a clever data structure such as a prefix tree (trie); 300,000 search-and-replace operations simply do not scale.
I don't think this will be efficient in pure R; you would need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you when trying to do this in pure R.
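If staying in R, one workaround (a plain hash-style lookup rather than the prefix tree suggested above) is to tokenize each document once and replace words through a named-vector lookup, which R resolves via hashed string matching in C instead of 300k separate replace calls. A rough sketch, assuming the documents are available as a plain character vector (docs_text is a hypothetical name) and voc has the Word/Stem columns shown above; note that it rebuilds each document as space-separated tokens:
library(stringi)
# named vector: names are inflected forms, values are stems
stem_map <- setNames(as.character(voc$Stem), as.character(voc$Word))
stem_doc <- function(txt) {
  # split into word tokens, dropping punctuation/whitespace tokens
  words <- unlist(stri_split_boundaries(txt, type = "word", skip_word_none = TRUE))
  hits  <- stem_map[stri_trans_tolower(words)]
  words[!is.na(hits)] <- hits[!is.na(hits)]   # replace only words found in the dictionary
  stri_c(words, collapse = " ")
}
docs_stemmed <- vapply(docs_text, stem_doc, character(1), USE.NAMES = FALSE)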

Subsetting a corpus based on content of textfile

I'm using R and the tm package to do some text analysis.
I'm trying to build a subset of a corpus based on whether a certain expression is found within the content of the individual text files.
I create a corpus with 20 textfiles (thank you lukeA for this example):
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
I now would like to select only those text files that contain the string "price reduction", to create a subset corpus.
Inspecting the first text file of the corpus, I know that there is at least one file containing that string:
writeLines(as.character(corp[[1]]))
How would I best go about doing this?
Here's a simpler way using the quanteda package, and one more consistent with reusing existing methods already defined for other R objects. quanteda has a subset method for corpus objects that works just like the subset method for a data.frame, but selects on logical vectors, including document variables defined in the corpus. Below, I have extracted the texts from the corpus using the texts() method for corpus objects, and used that in grepl() to search for your pair of words.
require(tm)
data(crude)
require(quanteda)
# corpus constructor recognises tm Corpus objects
(qcorpus <- corpus(crude))
## Corpus consisting of 20 documents.
# use subset method
(qcorpussub <- corpus_subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus))))
## Corpus consisting of 1 document.
# see the context
## kwic(qcorpus, "price reduction")
## contextPre keyword contextPost
## [127, 45:46] copany said." The [ price reduction ] today was made in the
Note: I spaced your regex with "\s+" since you could have some variation of spaces, tabs, or newlines instead of just a single space.
Here's one way using tm_filter:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
( corp_sub <- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE))) )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 1
cat(content(corp_sub[[1]]))
# Diamond Shamrock Corp said that
# effective today it had cut its contract prices for crude oil by
# 1.50 dlrs a barrel.
# The reduction brings its posted price for West Texas
# Intermediate to 16.00 dlrs a barrel, the copany said.
# "The price reduction today was made in the light of falling # <=====
# oil product prices and a weak crude oil market," a company
# spokeswoman said.
# Diamond is the latest in a line of U.S. oil companies that
# have cut its contract, or posted, prices over the last two days
# citing weak oil markets.
# Reuter
How did I get there? By looking into the package's vignette, searching for subset, and then looking at the examples for tm_filter (help: ?tm_filter), which is mentioned there. It might also be worth looking at ?grep to inspect the options for pattern matching.
@lukeA's solution works. I want to give another solution that I prefer.
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))
for (i in 1:length(corp))
  corp[[i]]$meta["mySubset"] <- corpTF[i]
idx <- meta(corp, tag = "mySubset") == 'TRUE'
filtered <- corp[idx]
cat(content(filtered[[1]]))
The advantage of this solution is that, by using meta tags, we can see all corpus elements with the selection tag mySubset: value 'TRUE' for the selected documents and 'FALSE' otherwise.

R: Finding frequency per term -- Warning Message

I have been attempting to find the frequency per term in Martin Luther King's "I Have a Dream" speech. I have converted all uppercase letters to lowercase and I have removed all stop words. I have the text in a .txt file so I cannot display it here. The code that reads in the file is below:
speech <- readLines("speech.txt")
Then I performed the conversion to lowercase and removal of stop words successfully and called it:
clean.speech
Now I am having some issues with finding the frequency per term. I have created a corpus, inspected my corpus, and created a TermDocumentMatrix as follows:
myCorpus <- Corpus(VectorSource(clean.speech))
inspect(myCorpus)
TDM <- TermDocumentMatrix(myCorpus)
Everything is fine up to this point. However, I then wrote the following code and got the warning message shown below:
m < as.matrix(TDM)
Warning message:
"In m < as.matrix(TDM): longer object length is not a multiple of shorter object length"
I know this is a very common warning message, so I Googled it first, but I could not find anything pertaining to frequency of terms. I proceeded to run the following code, to see whether it would still work after the warning, but it did not.
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word=names(v), freq=v)
head(d, 15)
My goal is just to find the frequency of terms. I sincerely apologize for asking this question because I know it gets asked a lot; I just do not understand what to change about my code. Thank you, everyone - I appreciate it!
If your goal is just to find the frequency of the terms, then try this.
First, I get the "I Have a Dream" speech into a character vector:
# get the text of the speech from an HTML source, and extract the text
library(XML)
doc.html <- htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', useInternal = TRUE)
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
doc.text = paste(doc.text, collapse = ' ')
Then, I create the document-term matrix in quanteda, removing stop words (and adding "will", since quanteda's built-in list of English stop words does not include this term). From there, topfeatures() gives you the most frequent terms and their counts.
library(quanteda)
# create a document-feature matrix
IHADdfm <- dfm(doc.text, ignoredFeatures = c("will", stopwords("english")), verbose = FALSE)
# 12 most frequent features
topfeatures(IHADdfm, 12)
## freedom one ring dream let day negro today able every together years
## 13 12 12 11 10 9 8 7 7 7 6 5
# a word cloud, if you wish
plot(IHADdfm, random.order = FALSE)
Just call findFreqTerms(), e.g. tm::findFreqTerms(TDM, lowfreq = 2, highfreq = 5).
(The tm:: prefix is optional - it just indicates that this is a built-in function of the tm package.)
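For completeness, the matrix route the question was attempting also works once the assignment uses <- rather than < (the < comparison of two objects of different lengths is what triggered the recycling warning). A short sketch, assuming TDM is the TermDocumentMatrix built in the question:
m <- as.matrix(TDM)                        # note <- ; m < as.matrix(TDM) compares instead of assigning
v <- sort(rowSums(m), decreasing = TRUE)   # total frequency of each term across documents
d <- data.frame(word = names(v), freq = v)
head(d, 15)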

Text Retrieval using R

I have been using R's text mining package and it's really a great tool. I have not found retrieval support, or maybe there is functionality I am missing.
How can a simple VSM model be implemented using R's text mining package?
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)
sentences <- NULL
for (i in 1:10) sentences <- c(sentences, fortune(i)$quote)
d <- data.frame(textCol = sentences)
ds <- DataframeSource(d)
dsc <- Corpus(ds)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
dictQry <- Dictionary(dtmNewQry)
# Below does a naive similarity (number of features in common)
apply(dtm, 1, function(x, y = dictQry) { length(intersect(names(x)[x != 0], y)) })
Assuming VSM = Vector Space Model, you can go about a simple retrieval system in the following manner (a minimal sketch follows this list):
1. Create a document-term matrix of your collection/corpus.
2. Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions; RSiteSearch should help in finding them.
3. Convert your query to a document-term matrix (which will have 1 row and is mapped using the same dictionary as used in the first step).
4. Compute the similarity between the query and the matrix from the first step.
5. Rank the results and choose the top n.
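A rough illustration of steps 1-5 using tm and base R, with cosine as the similarity measure and made-up toy documents (all object names here are illustrative):
library(tm)
docs <- VCorpus(VectorSource(c("the cat sat on the mat",
                               "the dog chased the cat",
                               "dogs and cats are pets")))
dtm <- DocumentTermMatrix(docs, control = list(weighting = weightTf, stopwords = TRUE))
# map the query onto the same vocabulary as the collection (step 3)
query <- VCorpus(VectorSource("cat and dog"))
qtm <- DocumentTermMatrix(query, control = list(weighting = weightTf, stopwords = TRUE,
                                                dictionary = Terms(dtm)))
# cosine similarity between the query vector and every document (steps 4 and 5)
m <- as.matrix(dtm)
q <- as.matrix(qtm)[1, Terms(dtm)]
cosine <- as.vector(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
head(sort(setNames(cosine, rownames(m)), decreasing = TRUE), 3)   # top-ranked documents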
A non-R method is to use a GIN index on a text column (rows are documents) of a table in PostgreSQL. Using tsvector-based querying, you can have a very fast retrieval system.
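A hedged sketch of that idea driven from R via DBI/RPostgres; the docs table, its body column, and the connection details are placeholders, not an existing schema:
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")   # placeholder connection
# GIN index over the full-text representation of the column
dbExecute(con, "CREATE INDEX IF NOT EXISTS docs_fts ON docs USING gin (to_tsvector('english', body))")
# rank documents matching the query 'price & reduction'
hits <- dbGetQuery(con, "
  SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
  FROM docs, to_tsquery('english', 'price & reduction') AS query
  WHERE to_tsvector('english', body) @@ query
  ORDER BY rank DESC
  LIMIT 10")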
