How to recreate same DocumentTermMatrix with new (test) data - r

Suppose I have text-based training data and testing data. To be more specific, I have two data sets - training and testing - and both have one column that contains the text of interest for the job at hand.
I used the tm package in R to process the text column in the training data set. After removing white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document-term matrix of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count greater than 50.
Following this, I train, say, a GLMNET model using the DTM and the dependent variable (which is present in the training data). Everything runs smoothly up to this point.
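For concreteness, the training step looks roughly like this (a minimal sketch; dtm_train, y, and the binomial family are illustrative names, not my exact code):
library(glmnet)
# dtm_train: the pruned document-term matrix; y: the dependent variable
x_train <- as.matrix(dtm_train)
fit <- cv.glmnet(x_train, y, family = "binomial")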
However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?
Specifically, what I am trying to find out is how to create the exact same DTM on new data.
If the new data set does not contain any of the words present in the original training data, then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.
Any ideas/thoughts?

tm has so many pitfalls... See the much more efficient text2vec package and its vectorization vignette, which fully answers the question.
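For reference, a minimal text2vec sketch of the train/score workflow (train_txt and test_txt are assumed to be character vectors holding your text column):
library(text2vec)
it_train <- itoken(train_txt, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)
# Re-using the same vectorizer guarantees an identical column structure for new data
it_test <- itoken(test_txt, preprocessor = tolower,
                  tokenizer = word_tokenizer, progressbar = FALSE)
dtm_test <- create_dtm(it_test, vectorizer)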
For tm, here is probably one more simple way to reconstruct the DTM for a second corpus:
crude2.dtm <- DocumentTermMatrix(crude2,
  control = list(dictionary = Terms(crude1.dtm), wordLengths = c(3, 10)))
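A quick check that the two matrices now share the same term set:
# Should be TRUE: the second DTM now has exactly the training terms as columns
setequal(Terms(crude1.dtm), Terms(crude2.dtm))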

If I understand correctly, you have made a dtm, and you want to make a new dtm from new documents that has the same columns (i.e. terms) as the first dtm. If that's the case, then it should be a matter of sub-setting the second dtm by the terms in the first, perhaps something like this:
First set up some reproducible data...
This is your training data...
library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- Corpus(VectorSource(crude[1:10]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))
And this is your testing data...
corpus2 <- Corpus(VectorSource(crude[15:20]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))
Here is the bit that does what you want:
Now we keep only the terms in the testing data that are present in the training data...
# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing
# subset testing data by colnames (i.e. terms) of training data
xx <- data.frame(crude2.dtm.mat[, intersect(colnames(crude2.dtm.mat),
                                            colnames(crude1.dtm.mat))])
Finally add to the testing data all the empty columns for terms in the training data that are not in the testing data...
# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
                 colClasses = "integer")
# add in columns of NAs for terms absent in the testing data
# but present in the training data,
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)
So zz is a data frame of the testing documents, but it has the same structure as the training documents (i.e. the same columns, though many of them contain NA, as SchaunW notes).
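If the downstream model expects counts rather than NA (glmnet, for instance, will not accept missing values), a small follow-up to zero them out:
# Replace the NAs introduced by rbind.fill with zero counts
zz[is.na(zz)] <- 0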
Is that along the lines of what you want?

Related

SMS Naive bayes with correct answer for training, correct and incorrect for testing

I'm trying to run the Naive Bayes algorithm with two CSV files, one for training and the other for testing. The first file contains only one row, which means only one answer for training.
The first data set, for example:
Answer
yes beacause of the average number of laureates per prize increased a prize may not be shared among more than three people
Evaluation
correct
The second data set, for example:
Answer
1 yes, They can share the prize. If 2 people won the prize in the same fileds they shared it
2 No, beacause a prize increased substantially during The 20Th century
3 it may not be shared among more than three people, because the average number of laureates per prize increased substantially during the 20th century
4 No, Because the average number of laureates per prize increased substantially during the 20th century, if it shared the miximum is three people
5 yes because The Nobel prize is widely regarded as the most prestigious award available
Evaluation
correct
incorrect
correct
incorrect
correct
and here's the code:
#importing the csv data
train_ds <- read.csv2("I:/Datasets/traindataset.csv", stringsAsFactors = FALSE)
test_ds <- read.csv2("I:/Datasets/dataset5.csv", stringsAsFactors = FALSE)
#sms_raw
#The type element is currently a character vector. Since this is a categorical variable, it would be better to convert it into a factor.
#Both numeric and character variables can be made into factors, but a factor's levels will always be character values.
train_ds$Evaluation <- factor(train_ds$Evaluation)
train_ds$Evaluation
class(train_ds$Evaluation)
#Examining this with the str() and table()
str(train_ds$Evaluation)
#see how many are correct and incorrect
table(train_ds$Evaluation)
#install tm package
#install.packages("tm")
library(tm)
#create a source object from the existing train_ds$Answer vector, which can be supplied to VCorpus()
sms_corpus <- VCorpus(VectorSource(train_ds$Answer))# create the corpus - VCorpus(vecSource)
sms_corpus
#clean up our corpus using a series of transformations and save the result in a new object called corpus_clean
#standardize the messages to use only lowercase characters
# tm_map for an interface to apply transformations to corpora.
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_corpus_clean
#to check whether the command worked as advertised
as.character(sms_corpus[[1]]) ## This prints the first line in the dataset as it is
as.character(sms_corpus_clean[[1]]) ## Each uppercase character of the first line is converted to lowercase
#continue our cleanup by removing numbers from SMS messages
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
## as.character(sms_corpus_clean[[3]]) ## This is an example to remove numbers from text
sms_corpus_clean
#remove any words that are in the stop word list
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
sms_corpus_clean
#eliminate any punctuation from the text messages
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean
#create a custom function that replaces rather than removes punctuation characters
replacePunctuation <- function(x){
  gsub("[[:punct:]]+", " ", x) ## gsub() replaces all matches of the pattern
}
replacePunctuation
#install the package for the stemming functionality
#install.packages("SnowballC") ## Snowball stemmers packages are stemming algorithm for collapsing words to a common root to aid comparison of vocabulary.
library(SnowballC)
wordStem(c("learn", "learned", "learning", "learns")) ## This function computes the stems of each of the given words in the vector. This reduces a word to its base component
#apply the wordStem() function to an entire corpus of text documents
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
##as.character(sms_corpus_clean[[5038]])
sms_corpus_clean
#the final step in our text clean up process is to remove the additional white spaces
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
sms_corpus_clean
#sms message before and after clean up
as.character(sms_corpus[1:3])
as.character(sms_corpus_clean[1:3])
#creating DTM sparse matrix
#Constructs or coerces to a term-document matrix or a document-term matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
sms_dtm
#create a DTM directly from the raw, unprocessed SMS corpus
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(tolower = TRUE,
                                                          removeNumbers = TRUE,
                                                          stopwords = TRUE,
                                                          removePunctuation = TRUE,
                                                          stemming = TRUE))
sms_dtm2
#comparing sms_dtm and sms_dtm2
sms_dtm
sms_dtm2
str(sms_dtm)
#creating training and test datasets
sms_dtm_train <- sms_dtm[1:1, ]
sms_dtm_train
sms_dtm_test<- sms_dtm[1:1, ]
sms_dtm_test
#creating the labels
sms_train_labels <- train_ds[1:1, ]$Evaluation
sms_train_labels
sms_test_labels <- test_ds[1:39, ]$Evaluation
sms_test_labels
########################################################
#to confirm the subsets are representative of the complete set of data
#let's compare the class proportions in the training and test data frames:
prop.table(table(sms_train_labels))
#install word cloud package
#install.packages("wordcloud")
library(wordcloud)
library(RColorBrewer)
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
#let's use R's subset() function to take a subset of the test data by the Evaluation type.
spam <- subset(test_ds, Evaluation == "incorrect")
spam
#Next, we will do the same thing for the ham subset
ham <- subset(test_ds, Evaluation == "correct")
ham
#we will use the max.words parameter to look at the 40 most common words in each of the two sets
wordcloud(spam$Answer, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$Answer, max.words = 40, scale = c(3, 0.5))
#display the words appearing at least three times in the sms_dtm_train matrix:
findFreqTerms(sms_dtm_train, 3)
#the result of the function is a character vector, so let's save our frequent words for later on:
sms_freq_words <- findFreqTerms(sms_dtm_train, 3)
sms_freq_words
#We now need to filter our DTM to include only the terms appearing in a specified vector
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_train
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
sms_dtm_freq_test
#The DTM counts the number of times a word appears in a message. We need to change this to a categorical variable
#that simply indicates yes or no depending on whether the word appears at all
#The following defines convert_counts() function to convert counts to a Yes/No strings
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}
#now we need to apply convert_counts() to each of the columns in our sparse matrix
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_train
sms_test <- apply(sms_dtm_test, MARGIN = 2, convert_counts)
sms_test
#install packages
#install.packages("e1071")
library(e1071)
#To build our model on the sms_train, we will use the following command
sms_classifier <- naiveBayes(train_ds, sms_train_labels)
#The predict() function is used to make the predictions. We will store these in a vector named sms_test_pred. We will simply supply the
#function with the names of our classifier and test data set
sms_test_pred <- predict(sms_classifier, test_ds)
sms_test_pred
#compare prediction with true values
table(sms_test_labels, sms_test_pred)
When I try to run this code I get two or three errors:
Error in apply(log(sapply(seq_along(attribs), function(v) { :
  dim(X) must have a positive length
Error in table(sms_test_labels, sms_test_pred) :
  all arguments must have the same length

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one Excel file, and I'm only interested in text mining one column of it. That column has about 1200 rows. I want to create a row (sentence)-term matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because, since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix with the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function - see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA (see the tf-idf sketch after the code below), or apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")
#Generate document term matrix with text2vec
tokens = docs %>%
  word_tokenizer()
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
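As mentioned above regarding PCA, a hedged sketch of tf-idf weighting with text2vec before any decomposition (re-using the dtm created above):
# Re-weight raw counts with tf-idf to get continuous values
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)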
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
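Since the goal is a 1/0 matrix for a later PCA, one simple (dense) way to binarise the result, assuming x fits comfortably in memory:
# Convert counts to presence/absence indicators
x01 <- 1 * (as.matrix(x) > 0)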
Can't add comments so here's a suggestion:
# Read data from file using fread (for .csv, from the data.table package)
library(data.table)
library(stringr)  # for str_count()
dat <- fread(filename, <add parameters as needed - col.names, nrow etc>)
counts <- sapply(row_start:row_end,
                 function(z) str_count(dat[z, .(selected_col_name)], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters - you can use tolower to achieve that. Hope this is helpful!

R: tm package, aggregate / join docs

I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm, that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to uncover this.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains an amount of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following function.
library(tm) # version 0.6, you seem to be using an older version
corpus <- Corpus(DirSource("/directory-with-texts"),
                 readerControl = list(language = "lat"))
metadata <- data.frame(DocID, Author)
#A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}
a_documenttermmatrix_by_DocId <-DocumentTermMatrix(corpus)
How would you build a matrix that shows frequencies for each author possibly aggregating multiple documents instead of documents? It would be useful to do this just at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a Simple Triplet Matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just adding up the columns for the author. Consider formatting the matrix as a standard R matrix and use standard subsetting / aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m become the document Id's
# The colnames of m become the individual terms
m <- as.matrix(dtm)
# Transpose matrix to use the "by" operator.
# Rows become individual terms
# Columns become document ids
# Group columns by Author
# Aggregate column sums (word frequencies) for each author, resulting in a list.
author.list <- by(t(m), metadata$Author, colSums)
# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = T)
# Add column names (term) and row names (author)
colnames(author.dtm) <- rownames(m)
rownames(author.dtm) <- names(author.list)
# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
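For very large matrices it may be preferable to stay sparse; a hedged alternative is slam::rollup, which aggregates the simple triplet matrix directly (assuming metadata$Author is aligned with the rows of dtm):
library(slam)
# Sum the rows of the sparse DTM within each author group
author.dtm.sparse <- rollup(dtm, 1L, metadata$Author, sum)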
I have a very crude workaround for this if the corpus text can be found in a table. It does not help a lot with a large corpus already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector)
{
  author_vector <- factor(author_vector)
  temp <- data.frame(Author = levels(author_vector))
  for (i in 1:length(temp$Author)){
    temp$Content[i] <- paste(c(as.character(text_vector[author_vector ==
      levels(author_vector)[i]])), sep = " ", collapse = "")
  }
  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  # build the corpus from the aggregated data frame created above
  mycorpus <- Corpus(DataframeSource(temp), readerControl = list(reader = myReader))
  custom_matrix <<- DocumentTermMatrix(mycorpus, control =
    list(removePunctuation = TRUE))
}
There probably is a function internal to tm, that I haven't been able to find, so I will be grateful for any help!

Plotting term document matrix from clipboard

I want to plot a term document matrix, but am having trouble generating a corpus. I want to be able to generate a corpus by selecting text and copying it to the clipboard. For example, I want a plotted TDM from 150 paragraphs of Lorem Ipsum data.
This part here is just for drawing in word data from lipsum.com
library("tm")
#generate a corpus from clipboard
clipboard2 <- read.table("clipboard",sep="\r")
The next part would (if it worked) split clipboard2 into a bunch of documents from which to get correlations. I think there's an easier solution here than creating documents that are then re-read back in just for the corpus's sake.
#how many docs to print out for correlations sake
for (i in 1:10) {
  start <- floor(1 + (i-1) * nrow(clipboard2) / 10)
  end <- i * nrow(clipboard2) / 10
  write.table(clipboard2[start:end, 1],
              paste0("C:/Users/me/Documents/", i, ".txt", collapse=""), sep="\t")
}
Pulling the corpus of documents into a variable. Everything from this point on works fine if I manually split the lipsum.com data into a few documents in some directory.
#Corpus collection
feedback <- Corpus(DirSource("C:/Users/me/Documents/"))
Removing words and whitespace, though there might be some redundancy here. Then creating the TDM.
#Cleanup
feedback <- tm_map(feedback, stripWhitespace)
feedback <- tm_map(feedback, tolower)
feedback <- tm_map(feedback, removeWords, stopwords("english"))
#TDM creation (redundant?)
tdm <- TermDocumentMatrix(feedback, control = list(removePunctuation = TRUE,
                                                   removeNumbers = TRUE,
                                                   stopwords = TRUE))
And finally, plotting the TDM. No issues here.
#plotting TDM
plot(tdm,
     terms = findFreqTerms(tdm, lowfreq = 70),
     corThreshold = 0.6)
It's somewhat unclear to me which part you are asking about, but as far as reading in the clipboard directly into a corpus, you could use
dd <- read.table("clipboard", sep="\r", stringsAsFactors=F)
feedback <- Corpus(VectorSource(dd$V1))
That will create a new document for each paragraph. But the idea is that you can use a character vector as a source so you can collapse/merge elements in the vector first to create more complex documents.
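For example, a small sketch that merges every ten consecutive paragraphs into one larger document before building the corpus (the group size of ten is arbitrary):
# Collapse consecutive paragraphs into larger documents
groups <- ceiling(seq_along(dd$V1) / 10)
merged <- tapply(dd$V1, groups, paste, collapse = " ")
feedback <- Corpus(VectorSource(merged))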

Remove empty documents from DocumentTermMatrix in R topicmodels?

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:
corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))
And then performing LDA:
LDA(dtm, 30)
This final call to LDA() returns the error
"Each row of the input matrix needs to contain at least one non-zero entry".
I assume this means that there is at least one document that has no terms in it after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?
I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.
"Each row of the input matrix needs to contain at least one non-zero entry"
The error means that the sparse matrix contains a row without entries (words). One idea is to compute the sum of words by row:
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm.new <- dtm[rowTotals> 0, ] #remove all docs without words
agstudy's answer works great, but using it on a slow computer proved mildly problematic.
library(tictoc)  # for tic()/toc() timing
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total > 0, ]
toc()
4.859 sec elapsed
(this was done with a 4000x15000 dtm)
The bottleneck appears to be applying sum() to a sparse matrix.
A document-term-matrix created by the tm package contains the names i and j, which are indices for where entries are in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed
ui contains all the non-zero indices, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document term matrices, but may become significant with larger matrices.
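Another option that avoids densifying the matrix is slam::row_sums, which works directly on the simple triplet matrix used by tm (a hedged alternative, not benchmarked here):
library(slam)
# Row sums computed on the sparse representation
dtm.new <- dtm[row_sums(dtm) > 0, ]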
This is just to elaborate on the answer given by agstudy.
Instead of removing the empty rows from the dtm matrix, we can identify the documents in our corpus that have zero length and remove them directly from the corpus, before building a second dtm with only non-empty documents.
This is useful to keep a 1:1 correspondence between the dtm and the corpus.
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
Just remove the sparse terms from the DTM and all will work well:
dtm <- removeSparseTerms(dtm, sparse = 0.99)  # sparsity threshold is illustrative
Just a small addendum to the answer of Dario Lacan:
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
will collect the records' ids rather than their order numbers. Try this:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # return "127", not "1"
If you construct your own corpus with consecutive numbering, some documents may be removed during data cleaning and the numbering will be broken. So it's better to use the id directly:
corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalently: !( meta(doc)$id %in% empty.rows )
)
I had a column in a data frame lt$title which contained strings. I had no "empty" rows in this column, but still got the error:
Error in LDA(dtm, k = 20, control = list(seed = 813)) :
  Each row of the input matrix needs to contain at least one non-zero entry
Some of the solutions above did not work for me, since I needed to join the vector of predicted topics to my original data frame, so removing rows from the document term matrix was not an option.
The problem was that some (very short) strings in lt$title contained special characters which could not be processed by Corpus() and/or DocumentTermMatrix().
My solution was to remove "short" strings (one or two words max.) which do not carry much information anyway.
# Clean up text data
lt$test=nchar(lt$title)
lt = lt[!lt$test<10,]
lt$test<-NULL
# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm = DocumentTermMatrix(corpus)
tm = LDA(dtm, k = 20, control = list(seed = 813))
# Add "topics" to original DF
lt$topic = topics(tm)
