How to complete words after stemming in R?
x <- c("completed","complete","completion","teach","taught")
tm <- Corpus(VectorSource(x))
tm <- tm_map(tm, stemDocument)
inspect(tm)
This example is for illustration purposes; the actual text corpus is much bigger.
I've searched for earlier examples, which point to creating a set of synonyms, but for a large corpus how is it possible to build such a synonym dictionary? And for verbs, how can I complete stemmed words back to the present tense? Thanks
tm has a function stemCompletion().
x <- c("completed","complete","completion","teach","taught")
tm <- Corpus(VectorSource(x))
tm <- tm_map(tm, stemDocument)
inspect(tm)
dictCorpus <- tm
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, stripWhitespace, mc.cores=cores)
tm<-tm_map(tm, stemCompletion,dictionary=dictCorpus)
As for completing verbs to the present tense, I am not sure that is possible with tm. RWeka, word2vec or qdap may have methods, but I am not certain.
A quick and dirty solution may be to set type = "shortest" in stemCompletion; generally, I think present-tense words will be shorter than past-tense forms and gerunds.
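For example, here is a minimal sketch of type = "shortest" applied directly to a character vector of stems, using the example words from the question as the dictionary:
library(tm)
dict <- c("completed", "complete", "completion", "teach", "taught")
stems <- stemDocument(dict)                        # stem each word individually
stemCompletion(stems, dictionary = dict, type = "shortest")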
Related
I am trying to do some text mining, using the tm package, on reviews that Italian users of a certain website wrote there. I scraped the texts, stored them in a corpus, and did some cleaning, but when I try to get the stems of the words by removing the common endings, I have a problem specifying Italian instead of the default language, i.e. English.
reviews_corpus <- tm_map(reviews_corpus, removeNumbers)
reviews_corpus <- tm_map(reviews_corpus, removePunctuation)
reviews_corpus <- tm_map(reviews_corpus, stripWhitespace)
reviews_corpus <- tm_map(reviews_corpus, content_transformer(tolower))
reviews_corpus <- tm_map(reviews_corpus, removeWords, stopwords("italian"))
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))
The first five lines work fine, but for the last one R gives me:
Error in UseMethod("stemDocument", x) :
no applicable method for 'stemDocument' applied to an object of class "c('VCorpus', 'Corpus')"
So, my problem is that how can I use stemDocument on a corpus but specify the language I want to be used?
There is a bug in stemDocument: if you use any language other than English, it reverts back to English. But there is a way around it: call the word stemmer that stemDocument points to directly.
Instead of
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))
use
reviews_corpus <- tm_map(reviews_corpus, function(x) SnowballC::wordStem(x, language = "italian"))
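Note that wordStem stems each element of a character vector as if it were a single word, so for whole documents a sketch like the following may work better (my assumptions here: a VCorpus and simple whitespace tokenisation):
library(SnowballC)
# split each document into words, stem them with the Italian stemmer, and re-join
stem_italian <- content_transformer(function(x) {
  words <- unlist(strsplit(x, "\\s+"))
  paste(SnowballC::wordStem(words, language = "italian"), collapse = " ")
})
reviews_corpus <- tm_map(reviews_corpus, stem_italian)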
But my advice, if you are working with a non-English language, is to use the quanteda package.
I turned about 50,000 rows of varchar data into a corpus, and then proceeded to clean said corpus using the tm package, getting rid of stopwords, punctuation, and numbers.
I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most frequent words and the number of times each appears in the data.
However, I want to use a function that says search for "word" and return how many times "word" appears in the TermDocumentMatrix.
Is there a function in tm that achieves this? Do I have to change my data to a data.frame and use a different package & function?
Since you have not given a reproducible example, I will give one using the crude dataset available in the tm package.
You can do it in (at least) 2 different ways. But anything that turns a sparse matrix into a dense matrix can use a lot of memory, so I will give you 2 options. The first one is more memory-friendly as it makes use of the sparse tdm matrix. The second one first transforms the tdm into a dense matrix before creating a frequency vector.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)
# Making use of the fact that a tdm or dtm is a simple_triplet_matrix from slam
my_func <- function(data, word){
  slam::row_sums(data[data$dimnames$Terms == word, ])
}
my_func(tdm, "crude")
crude
21
my_func(tdm, "oil")
oil
85
# turn tdm into dense matrix and create frequency vector.
freq <- rowSums(as.matrix(tdm))
freq["crude"]
crude
21
freq["oil"]
oil
85
Edit:
As requested in the comments:
# all words starting with cru. Adjust regex to find what you need.
freq[grep("^cru", names(freq))]
crucial crude
2 21
# separate words
freq[c("crude", "oil")]
crude oil
21 85
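If you want counts for words that may not occur in the matrix at all, a small helper over the dense freq vector (just a sketch, not part of tm) can return 0 instead of NA:
# return 0 for words that are not in the frequency vector
count_words <- function(freq, words) {
  out <- setNames(rep(0, length(words)), words)
  found <- intersect(words, names(freq))
  out[found] <- freq[found]
  out
}
count_words(freq, c("oil", "notaword"))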
I'm trying to migrate a script from tm to quanteda. Reading the quanteda documentation, I see there is a philosophy of applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So I have a csv file with 2 columns: the first column is the misspelt term and the second column is the correct version of that term.
Using tm package previously I did this:
# Write a custom function to pass to tm_map
# "Spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) stri_replace_all_regex(str = x, pattern = paste0("\\b", lut[,1], "\\b"), replacement = lut[,2], vectorize_all = FALSE))
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivalent way to apply this custom function to my quanteda corpus?
Impossible to know if that will work from your example, which leaves some parts out, but generally:
If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.
So in your case, assuming that mycorpus is a tm corpus, you could do this:
library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
  stringi::stri_replace_all_regex(str = x,
                                  pattern = paste0("\\b", lut[,1], "\\b"),
                                  replacement = lut[,2],
                                  vectorize_all = FALSE)
}
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
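As a quick sanity check (a sketch; the document index is arbitrary), you can compare one document before and after the replacement:
cat(texts(corpus(mycorpus))[1])      # original text from the tm corpus
cat(texts(myquantedacorpus)[1])      # text after the spelling update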
I think I found an indirect answer over here.
texts(myCorpus) <- myFunction(myCorpus)
I made a wordcloud from a csv file in R, using the TermDocumentMatrix function from the tm package. Here is my code:
csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"
# useSejongDic() - KoNLP package
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F)
#create Corpus
myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#remove StopWord
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#create Matrix
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths=c(2,5)))
m <- as.matrix(TDM)
This process seemed to take too much time; I think extractNoun accounts for most of it. To make the code more time-efficient, I want to save the resulting TDM to a file. When I read that saved file back in, can I simply call m <- as.matrix() on the object? Or is there a better alternative?
I'm not an expert, but I've done some NLP work.
I use parSapply from the parallel package. Here's the documentation: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
parallel ships with base R, and here is a toy example:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
base <- 2
clusterExport(cl, "base")   # export "base" after it has been defined
parSapply(cl, as.character(2:4),
          function(exponent){
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })
stopCluster(cl)
So, parallelize nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F) and it will be faster :)
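For the extractNoun call from the question, a parallel version could look roughly like this (a sketch; it assumes KoNLP is installed and csvData is loaded as in the question):
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(KoNLP))      # make extractNoun available on each worker
# clusterEvalQ(cl, useSejongDic())    # if you use the Sejong dictionary, load it on the workers too
nouns <- parSapply(cl, csvData$content, extractNoun, USE.NAMES = FALSE)
stopCluster(cl)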
I noticed that you also call several tm commands, and these can easily be parallelized as well. For the tm library this functionality was updated in March 2017, a month after your question.
In the new features section of the release notes for tm version 0.7 (2017-03-02) it is indicated:
tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).
To set up parallelization for the tm commands the following has worked for me:
library(parallel)
cores <- detectCores()
cl <- makeCluster(cores) # use cores-1 if you want to do anything else on the PC.
tm_parLapply_engine(cl)
## insert your commands for create corpus,
## tm_map and TermDocumentMatrix commands here
tm_parLapply_engine(NULL)
stopCluster(cl)
If you have a function that you are applying through a tm_map content transformer, you will need to use clusterExport to pass that function to the parallelized environment before the tm_map(MyCorpus, content_transformer(clean)) command, e.g. passing my clean function to the environment:
clusterExport(cl, "clean")
One last comment: keep an eye on your memory usage. If your computer starts paging memory out to disk, the CPU is no longer the critical path and all the parallelization won't make a difference.
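Put together with the corpus commands from the question, the pattern would look roughly like this (a sketch; nouns and myStopwords come from the question):
library(tm)
library(parallel)
cl <- makeCluster(detectCores() - 1)
tm_parLapply_engine(cl)
myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))
tm_parLapply_engine(NULL)
stopCluster(cl)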
It's straightforward to build a document-term matrix from a corpus with the tm package.
I'd like to build a corpus from a document-term-matrix.
Let M be the number of documents in a document set.
Let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M x V matrix.
I also have a vocabulary vector of length V, containing the words represented by the indices of the document-term matrix.
From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab manually - i.e. there never was a tm "corpus" object representing my dataset, so I can't use the function,
tm_map(corpus, stemDocument, language="english")
I've been trying to build a workaround where I stem the vocabulary and only keep unique words, but then it gets somewhat complicated trying to maintain the correspondence between the dtm and the vocabulary vector.
Ideally, the end result would be that my vocabulary vector is stemmed and only contains unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of some other way to do that, I would appreciate that as well.
My troubles would be fixed if I could simply build a tm "corpus" from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to make those conversions).
Let me know if I can clarify the problem any further.
Here's one approach, providing my own minimal reproducible example (as a new user you may not be aware that this is your responsibility) using the crude data from the tm package:
## Minimal Reproducible Example
library(tm)
data("crude")
## Keep the default term-frequency weighting: rep() below needs integer
## counts to reconstruct each document from the matrix.
dtm <- DocumentTermMatrix(crude, control = list(stopwords = TRUE))
## Convert the dtm to a list of texts
dtm2list <- apply(dtm, 1, function(x) {
  paste(rep(names(x), x), collapse = " ")
})
## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)
## Stemming
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
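From there you can complete the round trip; a minimal sketch of rebuilding a document-term matrix from the stemmed corpus:
## Rebuild a document-term matrix from the stemmed corpus
dtm_stemmed <- DocumentTermMatrix(myCorp)
inspect(dtm_stemmed)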