How to read and write a TermDocumentMatrix in R?

I made a wordcloud from a CSV file in R, using the TermDocumentMatrix function from the tm package. Here is my code:
library(tm)
library(KoNLP)   # provides extractNoun
# useSejongDic()  # KoNLP dictionary, if needed

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"

# extract nouns (the slow step)
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE)

# create corpus
myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stopwords (myStopwords is defined elsewhere)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# create term-document matrix
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))
m <- as.matrix(TDM)
This process takes too much time; I think extractNoun accounts for most of it. To make the code more time-efficient, I want to save the resulting TDM as a file. When I read that saved file back in, can I still use m <- as.matrix(saved TDM file) as before? Or is there a better alternative?
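For the save/reload part of the question, a minimal sketch (the file name is a placeholder) using base R's saveRDS/readRDS: the reloaded object is the same TermDocumentMatrix, so as.matrix() works on it exactly as before, and the expensive extractNoun step only has to run once.
# after building the TDM once, persist it to disk
saveRDS(TDM, file = "tdm.rds")

# in a later session, reload it and use it as before
TDM <- readRDS("tdm.rds")
m <- as.matrix(TDM)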

I'm not an expert, but I've used NLP occasionally.
I use parSapply from the parallel package. Here is the documentation: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
parallel ships with base R, and this is a trivial usage example:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
base <- 2                      # define before exporting it to the workers
clusterExport(cl, "base")
parSapply(cl, as.character(2:4),
          function(exponent) {
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })
stopCluster(cl)
So, parallelize nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F) and it will be faster :)
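For example, a rough sketch of how that call might be parallelized (assuming KoNLP is installed and csvData is already loaded; each worker needs the package and dictionary loaded before it can call extractNoun):
library(parallel)
cl <- makeCluster(detectCores() - 1)
# load KoNLP and its dictionary on every worker
clusterEvalQ(cl, { library(KoNLP); useSejongDic() })
nouns <- parSapply(cl, csvData$content, extractNoun, USE.NAMES = FALSE)
stopCluster(cl)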

I noticed that you call several tm commands, which can also easily be parallelized. For the tm package this functionality was updated in March 2017, a month after your question.
In the new-features section of the release notes for tm version 0.7 (2017-03-02) it is indicated:
tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).
To set up parallelization for the tm commands the following has worked for me:
library(parallel)
cores <- detectCores()
cl <- makeCluster(cores) # use cores-1 if you want to do anything else on the PC.
tm_parLapply_engine(cl)
## insert your commands for create corpus,
## tm_map and TermDocumentMatrix commands here
tm_parLapply_engine(NULL)
stopCluster(cl)
If you have a function that you apply through a tm_map content transformer, you will need to use clusterExport to pass that function to the parallelized environment before the tm_map(MyCorpus, content_transformer(clean)) command, e.g. passing my clean function to the environment:
clusterExport(cl, "clean")
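Putting the pieces together, a rough end-to-end sketch (MyCorpus is assumed to already exist, and clean here is just a hypothetical placeholder for your own cleaning function):
library(tm)
library(parallel)

# hypothetical cleaning function applied via content_transformer
clean <- function(x) gsub("[^[:alnum:] ]", " ", x)

cl <- makeCluster(detectCores() - 1)
tm_parLapply_engine(cl)
clusterExport(cl, "clean")

MyCorpus <- tm_map(MyCorpus, content_transformer(clean))
TDM <- TermDocumentMatrix(MyCorpus)

tm_parLapply_engine(NULL)
stopCluster(cl)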
One last comment: keep an eye on your memory usage. If your computer starts paging memory out to disk, the CPU is no longer the bottleneck and all the parallelization won't make a difference.

Related

Find Frequent Word and its Value in Document Term Frequency

So I have to find the most frequent word and its value from a DTM.
library('tm')
library("SnowballC")
my.text.location <- "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers)
apapers <- tm_map(apapers, removePunctuation)
apapers <- tm_map(apapers, stemDocument, language ="en")
That part cleans the corpus; the part below creates the DTM and finds the frequent terms.
ptm.tf <- DocumentTermMatrix(apapers)
dim(ptm.tf)
findFreqTerms(ptm.tf)
Is there a way to get the frequent word and the frequency value together?
findFreqTerms is nothing more than taking row sums of a sparse matrix. The function uses slam's row_sums. To keep the counts together with the words, we can use the same functions. The slam package is installed along with tm, so its functions are available once you load slam or call them via slam::. Using the slam functions is preferable, as they work directly on sparse matrices; base rowSums would first turn the sparse matrix into a dense one, which is slower and uses a lot more memory.
# your code.....
ptm.tf <- DocumentTermMatrix(apapers)
# using col_sums since it is a document term matrix. If it is a term document matrix use row_sums
frequency <- slam::col_sums(ptm.tf)
# Filtering like findFreqTerms. Find words that occur 10 times or more.
frequency <- frequency[frequency >= 10]
# turn into data.frame if needed:
frequency_df <- data.frame(words = names(frequency), freq = frequency, row.names = NULL)
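To get the single most frequent word and its value directly, you can index or sort the named vector built above, for example:
# the single most frequent term and its count
frequency <- slam::col_sums(ptm.tf)
frequency[which.max(frequency)]
# or the top 10 terms, sorted by frequency
head(sort(frequency, decreasing = TRUE), 10)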
If you don't mind using another package, this should work (instead of creating a DTM object):
library('tm')
library("SnowballC")
my.text.location <- "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers)
apapers <- tm_map(apapers, removePunctuation)
apapers <- tm_map(apapers, stemDocument, language ="en")
# new lines here
library(qdap)
freq_terms(apapers)
Created on 2018-09-28 by the reprex package (v0.2.0).

How to apply a custom function to a quanteda corpus

I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation there is a philosophy about applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So, I have a csv file with 2 columns, the first column is the misspelt term and the second column is the correct version of that term.
Using tm package previously I did this:
# Write a custom function to pass to tm_map
# "Spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(
  function(x, lut = spellingdoc)
    stri_replace_all_regex(str = x,
                           pattern = paste0("\\b", lut[,1], "\\b"),
                           replacement = lut[,2],
                           vectorize_all = FALSE))
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivalent way to apply this custom function to my quanteda corpus?
Impossible to know if that will work from your example, which leaves some parts out, but generally:
If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.
So in your case, assuming that mycorpus is a tm corpus, you could do this:
library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
  stringi::stri_replace_all_regex(str = x,
                                  pattern = paste0("\\b", lut[,1], "\\b"),
                                  replacement = lut[,2],
                                  vectorize_all = FALSE)
}
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
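A self-contained toy sketch of the same idea, reusing stringi_spelling_update2 from above (spellingdoc here is a made-up two-column lookup; note that newer quanteda releases deprecate texts() in favour of as.character(), so adjust if needed):
library("quanteda")
# hypothetical lookup: column 1 = misspelling, column 2 = correction
spellingdoc <- data.frame(wrong = c("teh", "recieve"),
                          right = c("the", "receive"),
                          stringsAsFactors = FALSE)
mycorpus <- corpus(c(doc1 = "teh cat sat", doc2 = "I recieve mail"))
texts(mycorpus) <- stringi_spelling_update2(texts(mycorpus), spellingdoc)
texts(mycorpus)  # inspect the corrected texts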
I think I found an indirect answer over here.
texts(myCorpus) <- myFunction(myCorpus)

r text analysis stem completion

How to complete words after stemming in R?
x <- c("completed","complete","completion","teach","taught")
tm <- Corpus(VectorSource(x))
tm <- tm_map(tm, stemDocument)
inspect(tm)
This example is for illustration purposes; the actual text corpus is much bigger.
I've searched for earlier examples, which point to creating a set of synonyms, but for a large corpus, how is it possible to get such a synonym dictionary? And for verbs, how can I complete stemmed words back to the present tense? Thanks
tm has a stemCompletion() function:
x <- c("completed","complete","completion","teach","taught")
tm <- Corpus(VectorSource(x))
tm <- tm_map(tm, stemDocument)
inspect(tm)
dictCorpus <- tm
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, stripWhitespace, mc.cores=cores)
tm<-tm_map(tm, stemCompletion,dictionary=dictCorpus)
As for completing verbs to the present tense, I am not sure that is possible with tm. Maybe RWeka, word2vec or qdap will have methods, but I am not sure.
A quick and dirty solution may be to set type = "shortest" in stemCompletion; generally, present-tense words tend to be shorter than past-tense forms and gerunds.
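For example, a minimal tweak of the last tm_map call above (assuming dictCorpus is the unstemmed dictionary corpus from the snippet):
# complete each stem to the shortest matching word in the dictionary
tm <- tm_map(tm, stemCompletion, dictionary = dictCorpus, type = "shortest")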

Scaling and parallel processing 'tm' package Term-Document Matrix calculations in R studio?

I need some help making the calculation of cosine similarity scores between vectors in a term-document matrix much faster. I have a matrix of strings, and I need to get the word-similarity scores between the strings in each row of the matrix.
I am using the 'tm' package to create a term document matrix for each row of a data frame of text strings and the lsa package to get the cosine similarity score between the two vectors of words in the strings. I'm also using apply() to run the function below on an entire data frame:
similarity_score <- function(x) {
  x <- VectorSource(x)
  x <- Corpus(x)
  x <- tm_map(x, tolower)
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, stemDocument)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, PlainTextDocument)
  x <- TermDocumentMatrix(x)
  x <- as.matrix(x)
  return(as.numeric(cosine(x[,1], x[,2])))
}

apply_similarity <- function(x) {
  return(as.data.frame(apply(x, 1, similarity_score)))
}
list_data_frames <- list(df_1, df_2, df_3,...)
output <- as.data.frame(lapply(list_data_frames, apply_similarity))
It gives me the values I need but I need to do this on a massive dataset and it is extremely slow. Running it on 1% of the dataset took 3 hours on my local machine. I need to do this on around 40 different data frames so I am using lapply on a list of data frames and applying that function to each data frame.
1) Is there a better way to do this that is faster? Maybe with another package or more efficient code? Am I using apply and lapply wrong in my code?
2) Can I parallelize this code and run it on multiple processors?
I tried using the snowfall package and the sfLapply and sfApply functions, but when the cluster is created with snowfall it doesn't load packages and cannot find the functions from the 'tm' package. If I end up doing this on Amazon's cloud, is there a way to have R use more than one processor and run functions from packages like 'tm' on multiple cores?
Have you tried the parallel package? There is a good guide on gforge here. Essentially, you just have to start up a cluster, load the libraries on the nodes with clusterEvalQ, and you should be good to go. I'm trying it out now. Then again, this answer comes more than a year after the question, so you've probably got a good solution by now. Do share with me if you've come across something.
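A rough sketch of that approach applied to the code above (assuming similarity_score, apply_similarity and list_data_frames are defined as in the question):
library(parallel)
cl <- makeCluster(detectCores() - 1)
# load the required packages on every worker node
clusterEvalQ(cl, { library(tm); library(lsa) })
# ship the user-defined functions to the workers
clusterExport(cl, c("similarity_score", "apply_similarity"))
output <- as.data.frame(parLapply(cl, list_data_frames, apply_similarity))
stopCluster(cl)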

How to add metadata to tm Corpus object with tm_map

I have been reading different questions/answers (especially here and here) without managing to apply any to my situation.
I have an 11,390-row matrix with the attributes id, author and text, such as:
library(tm)
m <- cbind(c("01","02","03","04","05","06"),
           c("Author1","Author2","Author2","Author3","Author3","Author4"),
           c("Text1","Text2","Text3","Text4","Text5","Text6"))
I want to create a tm corpus out of it. I can quickly create my corpus with
tm_corpus <- Corpus(VectorSource(m[,3]))
which, for my 11,390-row matrix, finishes in
   user  system elapsed
  2.383   0.175   2.557
But then when I try to add metadata to the corpus with
meta(tm_corpus, type="local", tag="Author") <- m[,2]
the execution time goes beyond 15 minutes and counting (at which point I stopped execution).
According to the discussion here, chances are that the time can be decreased significantly by processing the corpus with tm_map; something like
tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])
Still, I am not sure how to do this. It is probably going to be something like
addMeta <- function(text, vector) {
  meta(text, tag = "Author") <- vector[??]
  text
}
For one thing, how do I pass to tm_map a vector of values to be assigned to each text of the corpus? Should I call the function from within a loop? Should I wrap the tm_map call in vapply?
Have you already tried the excellent readTabular?
## your sample data
matrix <- cbind(c("01","02","03","04","05","06"),
                c("Author1","Author2","Author2","Author3","Author3","Author4"),
                c("Text1","Text2","Text3","Text4","Text5","Text6"))
## simple transformations
matrix <- as.data.frame(matrix)
names(matrix) <- c("id", "author", "content")
Now your ex-matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular wants you to define a reader, which itself takes a mapping. In your mapping, "content" points to the text data and the other names, well, to meta.
## define myReader, which will be used in creation of Corpus
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))
Now the creation of the Corpus is same as always, apart from small changes:
## create the corpus
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus,
readerControl = list(reader=myReader))
Now have a look at the content and meta data of the first items:
lapply(tm_corpus, as.character)
lapply(tm_corpus, meta)
## output just as expected.
This should be fast, as it is part of the package and extremely adaptable. In my own project I am using this on a data.table with some 20 variables, and it works like a charm.
However, I cannot provide a benchmark against the answer you have already accepted as suitable. I simply guess it is faster and more efficient.
Yes, tm_map is faster and it is the way to go. You can use it here with a global counter.
auths <- paste0('Author', seq(11390))
i <- 0
tm_corpus <- tm_map(tm_corpus, function(x) {
  i <<- i + 1
  meta(x, "Author") <- m[i, 2]
  x
})
Since readTabular from the tm package has been deprecated, the solution might now look like this:
matrix <- cbind(c("Author1","Author2","Author2","Author3","Author3","Author4"),
                c("Text1","Text2","Text3","Text4","Text5","Text6"))
matrix <- as.data.frame(matrix)
names(matrix) <- c("doc_id", "text")
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus)
inspect(tm_corpus)
meta(tm_corpus)
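Since the original question was about attaching an author to each document: with DataframeSource, any columns beyond doc_id and text are treated as document-level metadata, so a small sketch along these lines (with made-up values) should carry the author across:
df <- data.frame(doc_id = c("01", "02", "03"),
                 text   = c("Text1", "Text2", "Text3"),
                 author = c("Author1", "Author2", "Author2"),
                 stringsAsFactors = FALSE)
tm_corpus <- Corpus(DataframeSource(df))
meta(tm_corpus)   # data frame of document-level metadata, including author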
