So I have to find the most frequent word and its value from a DTM.
library('tm')
library("SnowballC")
my.text.location "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers)
apapers <- tm_map(apapers, removePunctuation)
apapers <- tm_map(apapers, stemDocument, language ="en")
This cleans the corpus; the code below creates the DTM and finds the frequent terms.
ptm.tf <- DocumentTermMatrix(apapers)
dim(ptm.tf)
findFreqTerms(ptm.tf)
Is there a way to get the frequent word and the frequency value together?
findFreqTerms is nothing more than applying row sums to a sparse matrix; internally it uses slam's row_sums. To keep the counts together with the words we can use the same functions. The slam package is installed when you install tm, so its functions are available once you load slam or call them via slam::. Using the slam functions is preferable because they work directly on sparse matrices; base rowSums would first transform the sparse matrix into a dense one, which is slower and uses a lot more memory.
# your code.....
ptm.tf <- DocumentTermMatrix(apapers)
# using col_sums since it is a document term matrix. If it is a term document matrix use row_sums
frequency <- slam::col_sums(ptm.tf)
# Filtering like findFreqTerms. Find words that occur 10 times or more.
frequency <- frequency[frequency >= 10]
# turn into data.frame if needed:
frequency_df <- data.frame(words = names(frequency), freq = frequency, row.names = NULL)
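If you only need the single most frequent term together with its count (a minimal sketch building on the frequency vector above):
# sort the named frequency vector and take the top entry
most_frequent <- sort(frequency, decreasing = TRUE)[1]
most_frequent   # the term is the name, the count is the value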
If you don't mind using another package, this should work (instead of creating a DTM object):
library('tm')
library("SnowballC")
my.text.location "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers)
apapers <- tm_map(apapers, removePunctuation)
apapers <- tm_map(apapers, stemDocument, language ="en")
# new lines here
library(qdap)
freq_terms(apapers)
Created on 2018-09-28 by the reprex package (v0.2.0).
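If you only want the top few terms, freq_terms also takes a top argument (a hedged note; check ?freq_terms for the exact arguments and defaults):
freq_terms(apapers, top = 10)   # ten most frequent terms with their counts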
I turned about 50,000 rows of varchar data into a corpus, and then proceeded to clean said corpus using the tm package, getting rid of stopwords, punctuation, and numbers.
I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most common words and the number of times each appears in the data.
However, I want to use a function that says search for "word" and return how many times "word" appears in the TermDocumentMatrix.
Is there a function in TM that achieves this? Do I have to change my data to a data.frame and use a different package & function?
Since you have not given a reproducible example, I will give one using the crude dataset available in the tm package.
You can do it in (at least) two different ways, but anything that turns a sparse matrix into a dense matrix can use a lot of memory, so I will give you two options. The first is more memory friendly because it works on the sparse tdm; the second first transforms the tdm into a dense matrix before creating a frequency vector.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)
# Making use of the fact that a tdm or dtm is a simple_triplet_matrix from slam
my_func <- function(data, word){
slam::row_sums(data[data$dimnames$Terms == word, ])
}
my_func(tdm, "crude")
crude
21
my_func(tdm, "oil")
oil
85
# turn tdm into dense matrix and create frequency vector.
freq <- rowSums(as.matrix(tdm))
freq["crude"]
crude
21
freq["oil"]
oil
85
edit:
As requested in comment:
# all words starting with cru. Adjust regex to find what you need.
freq[grep("^cru", names(freq))]
crucial crude
2 21
# separate words
freq[c("crude", "oil")]
crude oil
21 85
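A hedged variation of my_func that looks up several words at once while staying on the sparse matrix (my_func_multi is just a name made up for this sketch):
my_func_multi <- function(data, words){
  # logical index over the terms keeps the matrix sparse
  slam::row_sums(data[data$dimnames$Terms %in% words, ])
}
my_func_multi(tdm, c("crude", "oil"))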
I have a document that contains special characters along with text, such as !, #, #, $, % and more. The following code is used to obtain the most frequent terms list. But when it is run, the special characters are missing from the frequent terms list, i.e. if "#StackOverFlow" is a word present 100 times in the document, I get it as "StackOverFlow" without the # in the frequent terms list. Here is my code:
review_text <- paste(rome_1$text, collapse=" ")
#The special characters are present within review_text
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency)
Where exactly have I gone wrong here?
As you can see in the DocumentTermMatrix documentation :
This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) and takes no custom functions as option arguments.
It seems that SimpleCorpus objects (created by the Corpus function) use a pre-defined Boost tokenizer which automatically splits words, removing punctuation (including #).
You could use VCorpus instead, and remove only the punctuation characters you want, e.g.:
library(tm)
review_text <-
"I love #StackOverflow. #Stackoverflow is great, but Stackoverflow exceptions are not!"
review_source <- VectorSource(review_text)
corpus <- VCorpus(review_source) # N.B. use VCorpus here !!
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
patternRemover <- content_transformer(function(x,patternToRemove) gsub(patternToRemove,'',x))
corpus <- tm_map(corpus, patternRemover, '\\!|\\.|\\,|\\;|\\?') # remove only !.,;?
dtm <- DocumentTermMatrix(corpus,control=list(tokenize='words'))
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
Result :
> frequency
#stackoverflow exceptions great love stackoverflow
2 1 1 1 1
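If you want to drop every punctuation character except the "#", a Perl look-ahead in gsub can do it; this is a sketch rather than part of the original answer, and removeAllButHash is a hypothetical name:
# strip all punctuation except "#"
removeAllButHash <- content_transformer(function(x) gsub("(?!#)[[:punct:]]", "", x, perl = TRUE))
corpus <- tm_map(corpus, removeAllButHash)   # keeps "#stackoverflow", removes !.,;? and the rest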
I need some help speeding up the calculation of cosine similarity scores between vectors in a term-document matrix. I have a matrix of strings and I need to get the word similarity scores between the strings in each row of the matrix.
I am using the 'tm' package to create a term document matrix for each row of a data frame of text strings and the lsa package to get the cosine similarity score between the two vectors of words in the strings. I'm also using apply() to run the function below on an entire data frame:
similarity_score <- function (x) {
x <- VectorSource(x)
x <- Corpus(x)
x <- tm_map(x, tolower)
x <- tm_map(x, removePunctuation)
x <- tm_map(x, removeNumbers)
x <- tm_map(x, removeWords, stopwords("english"))
x <- tm_map(x, stemDocument)
x <- tm_map(x, stripWhitespace)
x <- tm_map(x, PlainTextDocument)
x <- TermDocumentMatrix(x)
x <- as.matrix(x)
return(as.numeric(cosine(x[,1], x[,2])))
}
apply_similarity <- function(x) {
return(as.data.frame(apply(x , 1, similarity_score)))
}
list_data_frames <- list(df_1, df_2, df_3,...)
output <- as.data.frame(lapply(list_data_frames, apply_similarity))
It gives me the values I need but I need to do this on a massive dataset and it is extremely slow. Running it on 1% of the dataset took 3 hours on my local machine. I need to do this on around 40 different data frames so I am using lapply on a list of data frames and applying that function to each data frame.
1) Is there a better way to do this that is faster? Maybe with another package or more efficient code? Am I using apply and lapply wrong in my code?
2) Can I parallelize this code and run it on multiple processors?
I tried using the snowfall package and the sfLapply and sfApply functions, but when the cluster is created with snowfall it doesn't load packages and cannot find the functions from the 'tm' package. If I end up doing this on Amazon's cloud, is there a way to have R use more than one processor and run functions within packages like 'tm' on multiple cores?
Have you tried the parallel package? There is a good guide on gforge here. Essentially, you just have to start up a cluster and load the libraries into the nodes with clusterEvalQ, and you should be good to go. I'm trying it out now. Then again, this answer comes more than a year after the question, so you've probably found a good solution to this by now. Do share with me if you've come across something.
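For what it is worth, here is a minimal sketch with the parallel package, assuming the similarity_score function and list_data_frames from the question:
library(parallel)
cl <- makeCluster(detectCores() - 1)             # start a local cluster
clusterEvalQ(cl, { library(tm); library(lsa) })  # load the packages on every node
clusterExport(cl, "similarity_score")            # ship the function to the nodes
# run one data frame per node; this inlines apply_similarity from the question
output <- as.data.frame(parLapply(cl, list_data_frames, function(x) {
  as.data.frame(apply(x, 1, similarity_score))
}))
stopCluster(cl)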
It's straightforward to build a document-term matrix from a corpus with the tm package.
I'd like to build a corpus from a document-term-matrix.
Let M be the number of documents in a document set.
Let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix.
I also have a vocabulary vector, of length V. In the vocabulary vector are the words represented by indices in the document-term-matrix.
From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab manually - i.e. there never was a tm "corpus" object representing my dataset, so I can't use the function,
tm_map(corpus, stemDocument, language="english")
I've been trying to build a workaround where I stem the vocabulary and only keep unique words, but then it gets somewhat complicated trying to maintain the correspondence between the dtm and the vocabulary vector.
Ideally, the end result would be that my vocabulary vector is stemmed and only contains unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of some other way to do that, I would appreciate that as well.
My troubles would be fixed if I could simply build a tm "corpus" from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to make those conversions).
Let me know if I can clarify the problem any further.
Here's one approach, providing my own minimal reproducible example (as a new user you may not be aware that this is your responsibility) using the crude data from the tm package:
## Minimal Reproducible Example
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude,
control = list(weighting =
function(x)
weightTfIdf(x, normalize = FALSE),
stopwords = TRUE))
## Convert the dtm to a list of text strings, one per document
dtm2list <- apply(dtm, 1, function(x) {
paste(rep(names(x), x), collapse=" ")
})
## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)
## Stemming
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
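From there the stemmed corpus can be turned back into a document-term matrix (a one-line sketch, since the conversions themselves were already known to the asker):
dtm.stemmed <- DocumentTermMatrix(myCorp)
inspect(dtm.stemmed)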
I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:
corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))
And then performing LDA:
LDA(dtm, 30)
This final call to LDA() returns the error
"Each row of the input matrix needs to contain at least one non-zero entry".
I assume this means that there is at least one document that has no terms in it after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?
I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.
"Each row of the input matrix needs to contain at least one non-zero entry"
The error means that the sparse matrix contains a row without entries (words). One idea is to compute the sum of words in each row:
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm.new <- dtm[rowTotals> 0, ] #remove all docs without words
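As a hedged addendum, the same filter can be computed on the sparse matrix with slam (installed alongside tm), which avoids the dense conversion:
rowTotals <- slam::row_sums(dtm)   # sums the rows without densifying the matrix
dtm.new   <- dtm[rowTotals > 0, ]  # remove all docs without words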
agstudy's answer works great, but using it on a slow computer proved mildly problematic.
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total>0,]
toc()
4.859 sec elapsed
(this was done with a 4000x15000 dtm)
The bottleneck appears to be applying sum() to a sparse matrix.
A document-term matrix created by the tm package contains the names i and j, which are indices for where entries are in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed
ui contains all the non-zero indices, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document term matrices, but may become significant with larger matrices.
This is just to elaborate on the answer given by agstudy.
Instead of removing the empty rows from the dtm matrix, we can identify the documents in our corpus that have zero length and remove them directly from the corpus, before building a second dtm from only the non-empty documents.
This is useful to keep a 1:1 correspondence between the dtm and the corpus.
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
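The filtered corpus can then be fed back into DocumentTermMatrix so the new dtm and the corpus stay in a 1:1 correspondence (a small sketch completing the step described above):
dtm.new <- DocumentTermMatrix(corpus)   # rebuild the dtm from the non-empty documents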
Just remove the sparse terms from the DTM and all will work well.
dtm <- removeSparseTerms(DocumentTermMatrix(crude), sparse = 0.99) # 0.99 is only an example sparsity threshold
Just a small addendum to the answer of Dario Lacan:
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
will collect the document ids rather than their positional numbers. Try this:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # return "127", not "1"
If you construct your own corpus with consecutive numbering, some documents may be removed during data cleaning and the numbering will no longer line up. So it's better to use the ids directly:
corpus <- tm_filter(
    corpus,
    FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
    # equivalently: !(meta(doc)$id %in% empty.rows)
)
I had a column in a data frame lt$title which contained strings. I had no "empty" rows in this column, but still got the error:
Error in LDA(dtm, k = 20, control = list(seed = 813)) : Each row of
the input matrix needs to contain at least one non-zero entry
Some of the solutions above did not work for me, since I needed to join the vector of predicted topics to my original data frame, so dropping the empty documents from the document-term matrix was not an option.
The problem was, that some (very short) strings in lt$title contained special characters which could not be processed by Corpus() and/or DocumentTermMatrix().
My solution was to remove "short" strings (one or two words max.) which do not carry much information anyway.
# Clean up text data
lt$test <- nchar(lt$title)
lt <- lt[lt$test >= 10, ]   # keep only titles with at least 10 characters
lt$test <- NULL
# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm = DocumentTermMatrix(corpus)
tm = LDA(dtm, k = 20, control = list(seed = 813))
# Add "topics" to original DF
lt$topic = topics(tm)