R: tm package, aggregate / join docs

I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to find one.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains a number of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following code.
library(tm) # version 0.6; you seem to be using an older version
corpus <- Corpus(DirSource("/directory-with-texts"),
                 readerControl = list(language = "lat"))
metadata <- data.frame(DocID, Author)
# A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}
a_documenttermmatrix_by_DocId <- DocumentTermMatrix(corpus)
How would you build a matrix that shows term frequencies for each author, aggregating multiple documents per author rather than keeping one row per document? It would be useful to do this at this stage rather than in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!

A DocumentTermMatrix is really just a matrix with fancy dressing (a simple triplet matrix from the slam package) that contains term frequencies for each term and document. Aggregating data from multiple documents by author just means summing the rows for that author's documents. Consider formatting the matrix as a standard R matrix and using standard subsetting / aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m are the document ids
# The colnames of m are the individual terms
m <- as.matrix(dtm)
# Group the rows (documents) by Author and sum the term frequencies
# within each group; the result is a list with one element per author.
author.list <- by(m, metadata$Author, colSums)
# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)
# Add column names (terms) and row names (authors)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)
# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
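Not part of the original answer, but since the matrix is backed by slam, a hedged alternative (assuming metadata$Author is aligned with the rows of dtm) is to aggregate the sparse representation directly and avoid the dense conversion:
library(slam)  # the sparse backend behind DocumentTermMatrix objects
# Sum the rows (documents) of the sparse DTM grouped by author,
# without converting to a dense matrix first.
author.sparse <- rollup(dtm, MARGIN = 1L, INDEX = metadata$Author, FUN = sum)
# The result is a sparse simple triplet matrix with one row per author;
# as.matrix(author.sparse) gives a dense view if it is small enough.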

I have a very crude workaround for this if the corpus text is available in a table. It does not help much with a large corpus already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector) {
  author_vector <- factor(author_vector)
  temp <- data.frame(Author = levels(author_vector))
  # Concatenate all texts belonging to the same author into one document
  for (i in 1:length(temp$Author)) {
    temp$Content[i] <- paste(as.character(text_vector[author_vector ==
      levels(author_vector)[i]]), collapse = " ")
  }
  # Map the Author column to the document id and the Content column to the text
  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(temp), readerControl = list(reader = myReader))
  # assigned to the global environment (crude, as noted)
  custom_matrix <<- DocumentTermMatrix(mycorpus,
                                       control = list(removePunctuation = TRUE))
}
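A hypothetical usage example (the data frame texts_df and its columns are my own illustration, not from the original post):
# one row per document, with the author and the raw text
texts_df <- data.frame(Author = c("Author1", "Author2", "Author2"),
                       Text   = c("first text", "second text", "third text"),
                       stringsAsFactors = FALSE)
custom_term_matrix(texts_df$Author, texts_df$Text)
# the function assigns its result globally, so inspect it afterwards
inspect(custom_matrix)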
There probably is a function internal to tm that I haven't been able to find, so I will be grateful for any help!

Related

R - Setting a Frequency to a Document Term Matrix

I am looking for assistance with my R code when exporting a DocumentTermMatrix. The file size is too large to export, so I was curious whether there is a way to set a frequency threshold on the DTM? For example, only return values in the DTM that have been used 5 or more times.
dtm <- DocumentTermMatrix(alltextclean)
write.csv(as.matrix(dtm), "dtm.csv")
The above produces too large a file; can I add a frequency threshold to this? I also tried the below, but I am left with a list of terms without their counts (which would also be useful).
termsonly <- findFreqTerms(dtm, 5)
write.csv(termsonly, "termsonly.csv")
Adding a frequency to the above would also be helpful.
Thanks for the help!
I guess you are looking for the total occurrence of each term, across all documents. Using an example dataset:
library(tm)
data(crude)
If your matrix is not so huge, you can do:
dtm = DocumentTermMatrix(crude)
Freq = colSums(as.matrix(dtm))
Otherwise, let's say we take terms with at least 5 occurrences:
termsonly <- findFreqTerms(dtm, 5)
Freq = colSums(as.matrix(dtm[,termsonly]))
Or you cast it into a sparseMatrix and sum the columns:
library(Matrix)
Freq = colSums(sparseMatrix(i=dtm$i,j=dtm$j,x=dtm$v,dimnames=dtm$dimnames))
You can also check this post if you like a tidy solution.
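Not part of the original answer, but to tie this back to the export problem in the question: reusing the Freq vector from either snippet above, the filtered counts can be written straight to CSV, which keeps the file small:
# write term/frequency pairs for the terms that passed the threshold
write.csv(data.frame(term = names(Freq), freq = Freq),
          "term_frequencies.csv", row.names = FALSE)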

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one Excel file, and I'm only interested in text mining one column of it. That column has about 1200 rows. I want to create a row (sentence)-term matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because I'm only using one file, so the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix with the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function; see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
"the coffee is cold",
"the coffee is hot",
"the coffee is warm",
"the coffee is hot",
"the coffee is perfect")
# Generate document term matrix with text2vec
tokens = docs %>%
  word_tokenizer()
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
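Two hedged follow-ups, not in the original answer: since the question asks for a 0/1 matrix, the dtm can be binarized with the Matrix package, and the tf-idf weighting suggested above is available in text2vec via its TfIdf model:
library(Matrix)
# 0/1 sentence-term matrix: 1 if the term occurs in the sentence, 0 otherwise
dtm_binary <- (dtm > 0) * 1
# alternatively, tf-idf weighting as suggested above
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)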
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
# Read data from file using fread (from data.table; str_count is from stringr)
library(data.table)
library(stringr)
dat <- fread(filename, <add parameters as needed - col.names, nrows etc.>)
counts <- sapply(row_start:row_end, function(z) str_count(dat[z, .(selected_col_name)], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters; you can use tolower for that. Hope this is helpful!

How to add metadata to tm Corpus object with tm_map

I have been reading different questions/answers (especially here and here) without managing to apply any to my situation.
I have an 11,390-row matrix with the attributes id, author and text, such as:
library(tm)
m <- cbind(c("01","02","03","04","05","06"),
           c("Author1","Author2","Author2","Author3","Author3","Author4"),
           c("Text1","Text2","Text3","Text4","Text5","Text6"))
I want to create a tm corpus out of it. I can quickly create my corpus with
tm_corpus <- Corpus(VectorSource(m[,3]))
which, for my 11,390-row matrix, finishes in
user system elapsed
2.383 0.175 2.557
But then when I try to add metadata to the corpus with
meta(tm_corpus, type="local", tag="Author") <- m[,2]
the execution time goes beyond 15 minutes and counting (at which point I stopped execution).
According to the discussion here, chances are that processing the corpus with tm_map would decrease the time significantly; something like
tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])
Still I am not sure how to do this. Probably it is going to be something like
addMeta <- function(text, vector) {
  meta(text, tag = "Author") = vector[??]
  text
}
For one thing, how do I pass to tm_map a vector of values to be assigned to each text of the corpus? Should I call the function from within a loop? Should I enclose the tm_map call within vapply?
Have you already tried the excellent readTabular?
## your sample data
matrix <- cbind(c("01","02","03","04","05","06"),
                c("Author1","Author2","Author2","Author3","Author3","Author4"),
                c("Text1","Text2","Text3","Text4","Text5","Text6"))
## simple transformations
matrix <- as.data.frame(matrix)
names(matrix) <- c("id", "author", "content")
Now your ex-matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular wants you to define a reader, which itself takes a mapping. In your mapping, "content" points to the text data and the other names - well - to the metadata.
## define myReader, which will be used in creation of Corpus
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))
Now the creation of the corpus is the same as always, apart from small changes:
## create the corpus
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus,
readerControl = list(reader=myReader))
Now have a look at the content and metadata of the items:
lapply(tm_corpus, as.character)
lapply(tm_corpus, meta)
## output just as expected.
This should be fast, as it is part of the package and extremely adaptable. In my own project I am using this on a data.table with some 20 variables, and it works like a charm.
However, I cannot provide a benchmark against the answer you have already accepted; I simply expect this to be faster and more efficient.
Yes tm_map is faster and it is the way to go. You should use it here with a global counter.
auths <- paste0('Author', seq(11390))
i <- 0
tm_corpus <- tm_map(tm_corpus, function(x) {
  i <<- i + 1
  meta(x, "Author") <- m[i, 2]
  x
})
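A quick sanity check after the loop, assuming a VCorpus and the matrix m from the question:
meta(tm_corpus[[1]], "Author")  # should return m[1, 2], i.e. "Author1"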
Since readTabular from the tm package has been deprecated, the solution might now look like this:
matrix <- cbind(c("Author1","Author2","Author2","Author3","Author3","Author4"),
                c("Text1","Text2","Text3","Text4","Text5","Text6"))
matrix <- as.data.frame(matrix)
names(matrix) <- c("doc_id", "text")
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus)
inspect(tm_corpus)
meta(tm_corpus)
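A hedged extension, not in the original answer: according to the DataframeSource documentation in recent tm versions, any columns besides doc_id and text are treated as document-level metadata, so the author can be carried along (df below is my own illustrative data frame):
df <- data.frame(doc_id = c("01", "02", "03"),
                 text   = c("Text1", "Text2", "Text3"),
                 author = c("Author1", "Author2", "Author2"),
                 stringsAsFactors = FALSE)
tm_corpus <- Corpus(DataframeSource(df))
meta(tm_corpus)            # data frame including the author column
meta(tm_corpus, "author")  # just the author values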

How to recreate same DocumentTermMatrix with new (test) data

Suppose I have text based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand.
I used the tm package in R to process the text column in the training data set. After removing the white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document term matrix of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count greater than 50.
Following this, I train a, say, GLMNET model using the DTM and the dependent variable (which was present in the training data). Everything runs smoothly up to this point.
However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?
Specifically, what I am trying to find out is how to create the exact same DTM on new data.
If the new data set does not contain any of the words from the original training data, then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.
Any ideas/thoughts?
tm has so many pitfalls... See the much more efficient text2vec package and its vectorization vignette, which fully answers this question.
For tm, here is probably one more simple way to reconstruct the DTM for the second corpus:
crude2.dtm <- DocumentTermMatrix(crude2,
    control = list(dictionary = Terms(crude1.dtm), wordLengths = c(3, 10)))
If I understand correctly, you have made a dtm, and you want to make a new dtm from new documents that has the same columns (i.e. terms) as the first dtm. If that's the case, then it should be a matter of subsetting the second dtm by the terms in the first, perhaps something like this:
First set up some reproducible data...
This is your training data...
library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- Corpus(VectorSource(crude[1:10]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))
And this is your testing data...
corpus2 <- Corpus(VectorSource(crude[15:20]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))
Here is the bit that does what you want:
Now we keep only the terms in the testing data that are present in the training data...
# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing
# subset testing data by colnames (i.e. terms) of training data
xx <- data.frame(crude2.dtm.mat[,intersect(colnames(crude2.dtm.mat),
colnames(crude1.dtm.mat))])
Finally add to the testing data all the empty columns for terms in the training data that are not in the testing data...
# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
colClasses = "integer")
# add in columns of NAs for terms absent in the
# testing data but present in the training data,
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)
So zz is a data frame of the testing documents, but with the same structure as the training documents (i.e. same columns, though many of them contain NA, as SchaunW notes).
Is that along the lines of what you want?
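One small follow-up, not part of the original answer: if the downstream model (GLMNET was mentioned in the question) expects zero counts rather than NA, and the columns in the training order, you could finish with:
# replace NAs (terms absent from the testing documents) with zero counts
zz[is.na(zz)] <- 0
# reorder columns to match the training matrix (yy already has that order)
zz <- zz[, colnames(yy)]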

Text Retrieval using R

I have been using R's text mining package and it's really a great tool. I have not found retrieval support, or maybe there is functionality I am missing.
How can a simple VSM model be implemented using R's text mining package?
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)
sentences <- NULL
for (i in 1:10) sentences <- c(sentences, fortune(i)$quote)
d <- data.frame(textCol = sentences)
ds <- DataframeSource(d)
dsc <- Corpus(ds)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)  # Dictionary() comes from older tm versions; newer ones provide Terms(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))
dtmNewQry <- DocumentTermMatrix(newQryC,
    control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
dictQry <- Dictionary(dtmNewQry)
# Below does a naive similarity (number of features in common)
apply(dtm, 1, function(x, y = dictQry) length(intersect(names(x)[x != 0], y)))
Assuming VSM = Vector Space Model, you can go about a simple retrieval system in the following manner:
Create a Document Term Matrix of your collection/corpus
Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions. RSiteSearch should help in finding them.
Convert your query to a Document Term Matrix (which will have 1 row and is mapped using the same dictionary as used for the first step)
Compute similarity with the query and the matrix from the first step.
Rank the results and choose the top n (a concrete sketch follows below).
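To make steps 3-5 concrete, here is a minimal sketch (my own illustration, not from the original answers) that reuses dtm and dtmNewQry from the code block above and ranks documents by cosine similarity, aligning columns by term name:
q <- as.matrix(dtmNewQry)[1, ]   # query term counts
docs <- as.matrix(dtm)           # document term counts
common <- intersect(colnames(docs), names(q))
# cosine similarity between the query and every document
# (documents sharing no terms with the query come out as NaN and can be dropped)
scores <- apply(docs[, common, drop = FALSE], 1, function(d)
  sum(d * q[common]) / (sqrt(sum(d^2)) * sqrt(sum(q[common]^2))))
# rank the documents and keep the top 5
head(sort(scores, decreasing = TRUE), 5)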
A non-R method is to use a GIN index on a text search column (rows are documents) of a table in PostgreSQL. Using the tsvector querying methods, you can build a very fast retrieval system.
