Plotting term document matrix from clipboard - r

I want to plot a term-document matrix, but am having trouble generating a corpus. I want to be able to generate a corpus by selecting text and copying it to the clipboard. For example, I want a plotted TDM from 150 paragraphs of Lorem Ipsum data.
This first part just reads in word data copied from lipsum.com:
library("tm")
#generate a corpus from clipboard
clipboard2 <- read.table("clipboard",sep="\r")
The next part would (if it worked) split clipboard2 into a number of documents from which to compute correlations. I suspect there is an easier solution than writing out documents that are then read back in just to build a corpus.
# how many docs to write out, for correlation's sake
for (i in 1:10) {
  start <- floor(1 + (i - 1) * nrow(clipboard2) / 10)
  end <- i * nrow(clipboard2) / 10
  write.table(clipboard2[start:end, 1],
              paste0("C:/Users/me/Documents/", i, ".txt"),
              sep = "\t")
}
Pulling the corpus of documents into a variable. Everything from this point on works fine if I manually split the lipsum.com data into a few documents in some directory.
#Corpus collection
feedback <- Corpus(DirSource("C:/Users/me/Documents/"))
Removing words and whitespace, though there might be some redundancy here. Then creating the TDM.
#Cleanup
feedback <- tm_map(feedback, stripWhitespace)
feedback <- tm_map(feedback, tolower)
feedback <- tm_map(feedback, removeWords, stopwords("english"))
# TDM creation (redundant?)
tdm <- TermDocumentMatrix(feedback,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE))
And finally, plotting the TDM. No issues here.

# plotting TDM
plot(tdm,
     terms = findFreqTerms(tdm, lowfreq = 70),
     corThreshold = 0.6)

It's somewhat unclear to me which part you are asking about, but as far as reading the clipboard directly into a corpus goes, you could use
dd <- read.table("clipboard", sep = "\r", stringsAsFactors = FALSE)
feedback <- Corpus(VectorSource(dd$V1))
That will create a new document for each paragraph. But the idea is that you can use any character vector as a source, so you can collapse/merge elements of the vector first to create more complex documents.
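For instance, a minimal sketch (the group size of 15 is arbitrary, purely for illustration) that merges every 15 clipboard paragraphs into one larger document before building the corpus:

# merge every 15 paragraphs into a single document (group size is illustrative)
groups <- ceiling(seq_len(nrow(dd)) / 15)
docs <- tapply(dd$V1, groups, paste, collapse = " ")
feedback <- Corpus(VectorSource(docs))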

Related

Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character" [duplicate]

I am trying to run this code (Ubuntu 12.04, R 3.1.1)
# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)
# Place Enron email snippets into a single vector.
text <- c(
  "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
  "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
  "you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it",
  "This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . .",
  "Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.",
  "Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.",
  "We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.",
  "The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.",
  "In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
df <- data.frame(text, view, stringsAsFactors = FALSE)
# Prepare mini-Enron corpus
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus # check corpus
# Mini-Enron corpus with 9 text documents
# Compute a term-document matrix that contains the occurrence of terms in each email
# Compute the distance between pairs of documents
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
dist.mat # check distance matrix
# Scale the multidimensional semantic space (MDS) onto two dimensions
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
ggplot(points, aes(x = x, y = y)) +
  geom_point(data = points, aes(x = x, y = y, color = df$view)) +
  geom_text(data = points, aes(x = x, y = y - 0.2, label = row.names(df)))
However, when I run it I get this error (in the td.mat <- as.matrix(TermDocumentMatrix(corpus)) line):
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
I am not sure what to look at; all packages are loaded.
The latest version of tm (0.60) made it so you can't use functions with tm_map that operate on simple character values any more. So the problem is your tolower step, since that isn't a "canonical" transformation (see getTransformations()). Just replace it with
corpus <- tm_map(corpus, content_transformer(tolower))
The content_transformer function wrapper will convert everything to the correct data type within the corpus. You can use content_transformer with any function that is intended to manipulate character vectors so that it will work in a tm_map pipeline.
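For example, a small sketch wrapping a hypothetical regex cleanup (the URL-stripping function is illustrative, not part of tm):

# wrap any character-in/character-out function for use with tm_map
removeURLs <- content_transformer(function(x) gsub("http\\S+", "", x, perl = TRUE))
corpus <- tm_map(corpus, removeURLs)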
This is a little old, but just for purposes of later google searches: there's an alternative solution. After corpus <- tm_map(corpus, tolower) you can use corpus <- tm_map(corpus, PlainTextDocument) which beats it right back into the correct data type.
I had the same issue, and finally came to a solution:
It seems that the meta information within the corpus object gets corrupted after applying transformations on it.
What I did was simply recreate the corpus at the very end of the process, once it was completely ready. To overcome other issues along the way, I also wrote a loop to copy the text back into my data frame:
a <- list()
for (i in seq_along(corpus)) {
  a[i] <- gettext(corpus[[i]][[1]]) # Do not use $content here!
}
df$text <- unlist(a)
corpus <- Corpus(VectorSource(df$text)) # This action restores the corpus.
The order of operations on text matters. You should remove stop words before removing punctuation.
I use the following to prepare text. My text is contained in cleanData$LikeMost.
Sometimes, depending on the source, you need the following first:
cleanData$LikeMost <- iconv(cleanData$LikeMost, to = "utf-8")
Some stop words are important, so you can create a revised set.
#create revised stopwords list
newWords <- stopwords("english")
keep <- c("no", "more", "not", "can't", "cannot", "isn't", "aren't", "wasn't",
          "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't")
newWords <- newWords[!newWords %in% keep]
Then, you can run your tm functions:
like <- Corpus(VectorSource(cleanData$LikeMost))
like <- tm_map(like, PlainTextDocument)
like <- tm_map(like, removeWords, newWords)
like <- tm_map(like, removePunctuation)
like <- tm_map(like, removeNumbers)
like <- tm_map(like, stripWhitespace)
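From there, building the term-document matrix proceeds as usual; a minimal sketch:

# build the TDM from the cleaned corpus
like.tdm <- TermDocumentMatrix(like)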

R: tm package, aggregate / join docs

I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm, that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to uncover this.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains an amount of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following code.
library(tm) # version 0.6, you seem to be using an older version
corpus <- Corpus(DirSource("/directory-with-texts"),
                 readerControl = list(language = "lat"))
metadata <- data.frame(DocID, Author)

# A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}

a_documenttermmatrix_by_DocId <- DocumentTermMatrix(corpus)
How would you build a matrix that shows term frequencies for each author, aggregating multiple documents per author, rather than for each document? It would be useful to do this at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a simple triplet matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just adding up the rows for each author's documents. Consider formatting the matrix as a standard R matrix and using standard subsetting/aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m are the document ids
# The colnames of m are the individual terms
m <- as.matrix(dtm)
# Group rows (documents) by Author and sum each column (term frequency)
# within a group, resulting in a list of one named vector per author.
author.list <- by(m, metadata$Author, colSums)
# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)
# Add column names (term) and row names (author)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)
# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
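If densifying the matrix is too expensive, the slam package (which tm builds its matrices on) offers rollup() to aggregate the sparse triplet matrix directly; a sketch, assuming metadata$Author is aligned with the rows of dtm:

library(slam)
# sum document rows grouped by author; the result stays sparse
author.dtm.sparse <- rollup(dtm, 1L, INDEX = metadata$Author, FUN = sum)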
I have a very crude workaround for this if the corpus text can be found in a table. It does not help much with a large corpus already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector)
{
  author_vector <- factor(author_vector)
  temp <- data.frame(Author = levels(author_vector))
  # Concatenate all texts that belong to the same author
  for (i in 1:length(temp$Author)) {
    temp$Content[i] <- paste(as.character(text_vector[author_vector ==
                             levels(author_vector)[i]]), collapse = " ")
  }
  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(temp),
                     readerControl = list(reader = myReader))
  custom_matrix <<- DocumentTermMatrix(mycorpus,
                                       control = list(removePunctuation = TRUE))
}
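A hypothetical call (the text vector name here is assumed, not from the original) would look like:

# text_vector must be aligned one-to-one with the author vector;
# the result lands in the global variable custom_matrix
custom_term_matrix(metadata$Author, text_vector)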
There probably is a function internal to tm that I haven't been able to find, so I will be grateful for any help!

In R tm package, build corpus FROM Document-Term-Matrix

It's straightforward to build a document-term matrix from a corpus with the tm package.
I'd like to build a corpus from a document-term-matrix.
Let M be the number of documents in a document set.
Let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M x V matrix.
I also have a vocabulary vector of length V, containing the words represented by the indices in the document-term matrix.
From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab manually - i.e. there never was a tm "corpus" object representing my dataset, so I can't use the function,
tm_map(corpus, stemDocument, language="english")
I've been trying to build a workaround where I stem the vocabulary and only keep unique words, but then it gets somewhat complicated trying to maintain the correspondence between the dtm and the vocabulary vector.
Ideally, the end result would be that my vocabulary vector is stemmed and only contains unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of some other way to do that, I would appreciate that as well.
My troubles would be fixed if I could simply build a tm "corpus" from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to make those conversions).
Let me know if I can clarify the problem any further.
Here's one approach, providing my own minimal reproducible example (as a new user you may not be aware that this is your responsibility) using data from the tm package:
## Minimal Reproducible Example
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting = function(x)
                                           weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))
## Convert dtm to a list of text
dtm2list <- apply(dtm, 1, function(x) {
  paste(rep(names(x), x), collapse = " ")
})
## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)
## Stemming
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
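And converting back, which the asker says they already know how to do, is then just a matter of rebuilding the matrix; a minimal sketch:

## Rebuild the document-term matrix from the stemmed corpus
dtm.stemmed <- DocumentTermMatrix(myCorp)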

How to recreate same DocumentTermMatrix with new (test) data

Suppose I have text based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand.
I used tm package in R to process the text column in the training data set. After removing the white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document term matrix of 1 grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count of greater than 50.
Following this, I train a, say, GLMNET model using the DTM and the dependent variable (which was present in the training data). Everything runs smoothly up to this point.
However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?
Specifically, what I am trying to find out is: how do I create the exact same DTM for new data?
If the new data set does not have any of the similar words as the original training data then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.
Any ideas/thoughts?
tm has so many pitfalls... See the much more efficient text2vec package and its vectorization vignette, which fully answers this question.
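For a flavor of that workflow, here is a sketch (the train and test data frames with a text column are assumed, not from the original): the vectorizer is fitted once on the training vocabulary and then reused, so the test DTM gets exactly the same columns by construction.

library(text2vec)
# fit the vocabulary on the training text only
it_train <- itoken(train$text, preprocessor = tolower, tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(create_vocabulary(it_train))
dtm_train <- create_dtm(it_train, vectorizer)
# reuse the same vectorizer on the test text
it_test <- itoken(test$text, preprocessor = tolower, tokenizer = word_tokenizer)
dtm_test <- create_dtm(it_test, vectorizer)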
For tm, here is one more simple way to reconstruct the DTM for a second corpus:
crude2.dtm <- DocumentTermMatrix(crude2,
                                 control = list(dictionary = Terms(crude1.dtm),
                                                wordLengths = c(3, 10)))
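A quick sanity check, as a sketch (crude1.dtm and crude2.dtm as built in the answer below): the reconstructed DTM should contain no terms outside the training dictionary.

# every term in the second DTM must come from the training dictionary
stopifnot(length(setdiff(Terms(crude2.dtm), Terms(crude1.dtm))) == 0)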
If I understand correctly, you have made a dtm, and you want to make a new dtm from new documents that has the same columns (i.e. terms) as the first dtm. If that's the case, then it should be a matter of subsetting the second dtm by the terms in the first, perhaps something like this:
First set up some reproducible data...
This is your training data...
library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- Corpus(VectorSource(crude[1:10]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))
And this is your testing data...
corpus2 <- Corpus(VectorSource(crude[15:20]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))
Here is the bit that does what you want:
Now we keep only the terms in the testing data that are present in the training data...
# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing
# subset testing data by colnames (i.e. terms) of training data
xx <- data.frame(crude2.dtm.mat[, intersect(colnames(crude2.dtm.mat),
                                            colnames(crude1.dtm.mat))])
Finally add to the testing data all the empty columns for terms in the training data that are not in the testing data...
# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
                 colClasses = "integer")
# add columns of NAs for terms absent in the testing data
# but present in the training data,
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)
So zz is a data frame of the testing documents with the same structure as the training documents (i.e. the same columns, though many of them contain NA, as SchaunW notes).
Is that along the lines of what you want?
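One small follow-up: most models expect zeros rather than NAs for absent terms, so a quick fill (assuming zz from above) may be useful:

# replace the NA placeholders with explicit zero counts
zz[is.na(zz)] <- 0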

Remove empty documents from DocumentTermMatrix in R topicmodels?

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:
corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))
And then performing LDA:
LDA(dtm, 30)
This final call to LDA() returns the error
"Each row of the input matrix needs to contain at least one non-zero entry".
I assume this means that there is at least one document that has no terms in it after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?
I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.
"Each row of the input matrix needs to contain at least one non-zero entry"
The error means that the sparse matrix contains a row without entries (words). One idea is to compute the sum of words in each row:
rowTotals <- apply(dtm, 1, sum) # Find the sum of words in each Document
dtm.new <- dtm[rowTotals > 0, ] # remove all docs without words
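With the empty documents gone, the original call should run; a minimal sketch:

library(topicmodels)
lda <- LDA(dtm.new, k = 30)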
agstudy's answer works great, but using it on a slow computer proved mildly problematic.
library(tictoc)
tic()
row_total <- apply(dtm, 1, sum)
dtm.new <- dtm[row_total > 0, ]
toc()
4.859 sec elapsed
(this was done with a 4000x15000 dtm)
The bottleneck appears to be applying sum() to a sparse matrix. A document-term matrix created by the tm package contains the names i and j, which are indices for where entries are in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed
ui contains all the non-empty row indices, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document-term matrices, but may become significant with larger ones.
This is just to elaborate on the answer given by agstudy.
Instead of removing the empty rows from the dtm matrix, we can identify the documents in our corpus that have zero length and remove them directly from the corpus, before building a second dtm from only the non-empty documents.
This is useful to keep a 1:1 correspondence between the dtm and the corpus.
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
Just remove the sparse terms from the DTM and all will work well.
dtm <- removeSparseTerms(dtm, sparse = 0.99) # e.g. drop terms absent from at least 99% of documents
Just a small addendum to the answer of Dario Lacan:
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
will collect document ids rather than row positions. Try this:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # returns "127", not "1"
If you construct your own corpus with consecutive numbering, some documents may be removed during data cleaning, and the numbering will also be broken. So it's better to use the id directly:
corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalently: !(meta(doc)$id %in% empty.rows)
)
I had a column in a data frame lt$title which contained strings. I had no "empty" rows in this column, but still got the error:
Error in LDA(dtm, k = 20, control = list(seed = 813)) : Each row of
the input matrix needs to contain at least one non-zero entry
Some of the solutions above did not work for me, since I needed to join the vector of predicted topics to my original data frame. So removing zero-entry rows from the document-term matrix was not an option.
The problem was, that some (very short) strings in lt$title contained special characters which could not be processed by Corpus() and/or DocumentTermMatrix().
My solution was to remove "short" strings (one or two words max.) which do not carry much information anyway.
# Clean up text data: drop very short titles (fewer than 10 characters)
lt$test <- nchar(lt$title)
lt <- lt[!(lt$test < 10), ]
lt$test <- NULL

# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm <- DocumentTermMatrix(corpus)
tm <- LDA(dtm, k = 20, control = list(seed = 813))

# Add "topics" to original DF
lt$topic <- topics(tm)
