I have the following problem: I converted a corpus into a dfm, and this dfm has some zero entries (documents with no remaining tokens) that I need to remove before fitting an LDA model. I would usually do as follows:
OutDfm <- dfm_trim(dfm(corpus, tolower = TRUE, remove = c(stopwords("english"), stopwords("german"), stopwords("french"), stopwords("italian")), remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE, stem = TRUE, verbose = TRUE), min_docfreq = 5)
Creating a dfm from a corpus input...
... lowercasing
... found 272,912 documents, 112,588 features
... removed 613 features
... stemming features (English)
... trimmed 27,491 feature variants
... created a 272,912 x 84,515 sparse dfm
... complete.
Elapsed time: 78.7 seconds.
# remove zero-entries
raw.sum <- apply(OutDfm, 1, FUN = sum)
which(raw.sum == 0)
OutDfm <- OutDfm[raw.sum != 0, ]
However, when I try to perform the last operation I get: Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105, hinting that the matrix is too large to be manipulated.
Is there anyone who has met and solved this issue before? Any alternative strategy to remove the 0 entries?
Thanks a lot!
Your apply() call with sum transforms the dfm from a sparse matrix into a dense matrix in order to calculate the row sums.
Either use slam::row_sums(), since slam functions work on sparse matrices, or, better yet, just use quanteda::dfm_subset() to select all the documents with more than 0 tokens:
dfm_subset(OutDfm, ntoken(OutDfm) > 0)
An example to show how it works, subsetting on ntoken(x) > 5000:
library(quanteda)
x <- corpus(data_corpus_inaugural)
x <- dfm(x)
x
Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives : among vicissitudes
1789-Washington 1 71 116 1 48 2 2 1 1 1
# subset based on amount of tokens.
dfm_subset(x, ntoken(x) > 5000)
Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives : among vicissitudes
1841-Harrison 11 604 829 5 231 1 4 1 3 0
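For completeness, the sparse row-sum idea mentioned above can also be sketched without dfm_subset(). This is an untested sketch assuming OutDfm is the dfm from the question; it uses rowSums() from the Matrix package (which dispatches on sparse matrices, so no dense coercion occurs), and slam::row_sums() works analogously on slam's triplet matrices:

```r
library(Matrix)  # quanteda dfms are sparse Matrix objects underneath

# sparse row sums: no dense coercion, so no Cholmod "problem too large" error
raw.sum <- Matrix::rowSums(OutDfm)

# drop the all-zero documents
OutDfm <- OutDfm[raw.sum != 0, ]
```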
I use the quanteda R package to extract ngrams (here 1-grams and 2-grams) from the text in Data_clean$Review, but I am looking for a way in R to compute the chi-square between each document and the extracted ngrams:
Here is the R code that I wrote to clean up the text (reviews) and generate the n-grams.
Any idea please?
thank you
# delete rows with empty value columns
Data_clean <- Data[Data$Note != "" & Data$Review != "", ]
Data_clean$id <- seq.int(nrow(Data_clean))
train.index <- 1:50000
test.index <- 50001:nrow(Data_clean)
# clean up: remove punctuation and digits
Data_clean$Review.clean <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))
train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]
temp.tf <- Data_clean$Review.clean %>%  # note: object name fixed (was Raison.Reco.clean)
    tokens(ngrams = 1:2) %>%            # generate tokens
    dfm()                               # generate dfm
You would not use ngrams for this, but rather a function called textstat_collocations().
It's a bit hard to follow your exact example since none of those objects are explained or supplied, but let's try it with some of quanteda's built-in data. I'll get the texts from the inaugural corpus and apply some filters similar to what you have above.
So to score bigrams for chi^2, you would use:
# create the corpus, subset on some conditions (could be Note != "" for instance)
corp_example <- data_corpus_inaugural
corp_example <- corpus_subset(corp_example, Year > 1960)
# this will remove punctuation and numbers
toks_example <- tokens(corp_example, remove_punct = TRUE, remove_numbers = TRUE)
# find and score chi^2 bigrams
coll2 <- textstat_collocations(toks_example, method = "chi2", max_size = 2)
head(coll2, 10)
# collocation count X2
# 1 reverend clergy 2 28614.00
# 2 Majority Leader 2 28614.00
# 3 Information Age 2 28614.00
# 4 Founding Fathers 3 28614.00
# 5 distinguished guests 3 28614.00
# 6 Social Security 3 28614.00
# 7 Chief Justice 9 23409.82
# 8 middle class 4 22890.40
# 9 Abraham Lincoln 2 19075.33
# 10 society's ills 2 19075.33
Added:
# needs to be a list of the collocations as separate character elements
coll2a <- sapply(coll2$collocation, strsplit, " ", USE.NAMES = FALSE)
# compound the tokens using top 100 collocations
toks_example_comp <- tokens_compound(toks_example, coll2a[1:100])
toks_example_comp[[1]][1:20]
# [1] "Vice_President" "Johnson" "Mr_Speaker" "Mr_Chief" "Chief_Justice"
# [6] "President" "Eisenhower" "Vice_President" "Nixon" "President"
# [11] "Truman" "reverend_clergy" "fellow_citizens" "we" "observe"
# [16] "today" "not" "a" "victory" "of"
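To then build the document-feature matrix over these compounded ngrams (the step the question's dfm call was heading toward), a short sketch, assuming the toks_example_comp object created above:

```r
# build the dfm over the compounded tokens, so each scored collocation
# (e.g. "Chief_Justice") becomes a single feature column
dfm_comp <- dfm(toks_example_comp)

# inspect the most frequent features of the resulting matrix
topfeatures(dfm_comp, 5)
```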
I have a big dataset (>1 million rows) and each row is a multi-sentence text. For example following is a sample of 2 rows:
mydat <- data.frame(text=c('I like apple. Me too','One two. Thank you'),stringsAsFactors = F)
What I was trying to do is extract the bigram terms in each row (the "." should separate the ngram terms, so that bigrams do not span sentence boundaries). If I simply use the dfm function:
mydfm = dfm(mydat$text,toLower = T,removePunct = F,ngrams=2)
dtm = as.DocumentTermMatrix(mydfm)
txt_data = as.data.frame(as.matrix(dtm))
These are the terms I got:
"i_like" "like_apple" "apple_." "._me" "me_too" "one_two" "two_." "._thank" "thank_you"
These are what I expect; basically "." is skipped and used to separate the terms:
"i_like" "like_apple" "me_too" "one_two" "thank_you"
I believe writing slow loops could solve this as well, but given that it is a huge dataset I would prefer an efficient approach similar to dfm() in quanteda. Any suggestions would be appreciated!
@Jota's answer works, but there is a way to control the tokenisation more finely while calling it only once:
(toks <- tokenize(toLower(mydat$text), removePunct = TRUE, ngrams = 2))
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "apple_me" "me_too"
##
## Component 2 :
## [1] "one_two" "two_thank" "thank_you"
dfm(toks)
## Document-feature matrix of: 2 documents, 7 features.
## 2 x 7 sparse Matrix of class "dfmSparse"
## features
## docs i_like like_apple apple_me me_too one_two two_thank thank_you
## text1 1 1 1 1 0 0 0
## text2 0 0 0 0 1 1 1
Added:
Then to remove any ngram containing the "." punctuation, tokenize again keeping the punctuation and use removeFeatures(), which defaults to valuetype = "glob":
toks2 <- tokenize(toLower(mydat$text), removePunct = FALSE, ngrams = 2)
removeFeatures(toks2, "*.*")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "me_too"
##
## Component 2 :
## [1] "one_two" "thank_you"
If your goal is just to extract those bigrams, then you could use tokens twice: once to tokenize to sentences, then again to make the ngrams within each sentence.
library("quanteda")
mydat$text %>%
    tokens(what = "sentence") %>%
    as.character() %>%
    tokens(ngrams = 2, remove_punct = TRUE) %>%
    as.character()
#[1] "I_like" "like_apple" "Me_too" "One_two" "Thank_you"
Insert a tokens_tolower() after the first tokens() call if you like, or use char_tolower() at the end.
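A hedged sketch of the lowercasing variant just mentioned, following the same older-quanteda API used in the answer above (newer versions would use tokens_ngrams() instead of the ngrams argument):

```r
library("quanteda")

mydat <- data.frame(text = c("I like apple. Me too", "One two. Thank you"),
                    stringsAsFactors = FALSE)

# lowercase the tokens before forming the ngrams
mydat$text %>%
    tokens(what = "sentence") %>%
    as.character() %>%
    tokens(ngrams = 2, remove_punct = TRUE) %>%
    tokens_tolower() %>%
    as.character()

# alternatively, lowercase the finished character vector at the end
# with char_tolower() for the same result
```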
Does there exist a method to concatenate two dfm matrices that differ in both the number of columns and the number of rows? It can be done with some additional coding, so I am not interested in ad hoc code but in a general and elegant solution, if one exists.
An example:
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
gives an error.
The 'tm' package can concatenate its document-term matrices out of the box, but it is too slow for my purposes.
Also recall that a 'dfm' from 'quanteda' is an S4 class.
It should work "out of the box" if you are using the latest version:
packageVersion("quanteda")
## [1] ‘0.9.6.9’
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## is one sample surprise text this
## doc1 1 1 2 0 1 1
## doc2 1 1 2 1 1 1
See also ?selectFeatures for the case where features is a dfm object (there are examples in the help file).
Added:
Note that this will correctly align the two texts in a common feature set, unlike the normal rbind() methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:
require(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
This almost gets it, but seems to duplicate the repeated feature:
as.matrix(rbind(c(dtm1, dtm2)))
## Terms
## Docs one sample sample. text this surprise!
## 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1
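As a side note, newer quanteda versions expose the feature-alignment step directly as dfm_match(). A minimal sketch (dfm_match() is not in the 0.9.x version shown above, and this assumes a version where dfm() still accepts character input directly):

```r
library("quanteda")

dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)

# align dfm1 to dfm2's feature set: shared features keep their counts,
# and features unique to dfm2 ("surprise") appear as zero columns
dfm_match(dfm1, features = featnames(dfm2))
```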
My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)
GetCorpus <- function(textVector) {
    doc.corpus <- Corpus(VectorSource(textVector))
    doc.corpus <- tm_map(doc.corpus, tolower)
    doc.corpus <- tm_map(doc.corpus, removeNumbers)
    doc.corpus <- tm_map(doc.corpus, removePunctuation)
    doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
    doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
    doc.corpus <- tm_map(doc.corpus, stripWhitespace)
    doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
    return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
corp <- GetCorpus(data[,1])
inspect(corp)
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
The output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
Terms
Docs bar big child dogs holds honor hunt let stud
character(0) 0 1 0 1 0 0 1 1 0
character(0) 1 0 0 0 1 0 0 0 0
character(0) 0 0 1 0 0 1 0 0 1
My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow when I use over 300k rows.
I have heard that I could use data.table but I am not sure how.
I have also looked at the qdap package, but it gives an error when I try to load it, and I don't even know whether it will work.
Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf
Which approach?
data.table is definitely the right way to go. Regex operations are slow, although the ones in stringi are much faster (in addition to being much better).
I went through many iterations of solving this problem when creating quanteda::dfm() for my quanteda package (see the GitHub repo here). The fastest solution, by far, involves using the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix.
In the code below, I've taken as an example the texts found in the quanteda package, which you can (and should!) install from CRAN, or as the development version from GitHub:
devtools::install_github("kbenoit/quanteda")
I'd be very interested to see how it works on your 4M documents. Based on my experience working with corpora of that size, it will work pretty well (if you have enough memory).
Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C++.
Core of the quanteda dfm() function
Here are the bare bones of the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents or corpus objects, and implements lowercasing, removal of numbers, and removal of spacing by default (all of which can be modified if wished).
require(data.table)
require(Matrix)
dfm_quanteda <- function(x) {
    docIndex <- 1:length(x)
    if (is.null(names(x)))
        names(docIndex) <- factor(paste("text", 1:length(x), sep = "")) else
        names(docIndex) <- names(x)

    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features
    alltokens[, "n" := 1L]
    alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

    uniqueFeatures <- sort(unique(alltokens$features))
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)
    setkey(alltokens, features)
    setkey(featureTable, features)

    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

    sparseMatrix(i = alltokens$docIndex,
                 j = alltokens$featureIndex,
                 x = alltokens$V1,
                 dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
That's just a snippet of course but the full source code is easily found on the GitHub repo (dfm-main.R).
quanteda on your example
How's this for simplicity?
require(quanteda)
mytext <- c("Let the big dogs hunt",
"No holds barred",
"My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.
# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
# features
# docs bar big child dog hold honor hunt let student
# text1 0 1 0 1 0 0 1 1 0
# text2 1 0 0 0 1 0 0 0 0
# text3 0 0 1 0 0 1 0 0 1
I think you may want to consider a more regex-focused solution. These are some of the problems/thinking I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has some consistently named functions that are wicked fast for string manipulation.
In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead, I focus on string manipulation with stringi, keeping the data in a matrix and manipulating it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
text=c("Let the big dogs hunt",
"No holds barred",
"My child is an honor student"
), stringsAsFactors = F)
## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]
library(stringi)
library(SnowballC)
# in old package versions the function was named 'stri_extract_words'
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))
names(out) <- paste0("doc", 1:length(out))
lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)
library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]
library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)
tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm
## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
dtm
You have a few choices. @TylerRinker commented about qdap, which is certainly a way to go.
Alternatively (or additionally), you could also benefit from a healthy dose of parallelism. There's a nice CRAN page detailing HPC resources in R. It's a bit dated, though, and the multicore package's functionality is now contained within parallel.
You can scale up your text mining using the multicore apply functions of the parallel package or with cluster computing (also supported by that package, as well as by snowfall and biopara).
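A hedged sketch of that idea (not from the original answer): the per-document cleaning step is embarrassingly parallel, so it can be split across cores with parallel::mclapply() (which forks workers on Unix-alikes; on Windows you would use parLapply() with a cluster instead):

```r
library(parallel)

# toy stand-in for a large text column
texts <- rep(c("Let the big dogs hunt",
               "No holds barred",
               "My child is an honor student"), 1000)

# clean each document independently on its own worker;
# mc.cores controls how many forked processes are used
cleaned <- mclapply(texts, function(txt) {
    tolower(gsub("[[:punct:]0-9]", " ", txt))
}, mc.cores = detectCores())

cleaned <- unlist(cleaned)
```

Note that for a simple vectorised operation like gsub() the serial call may still win; the parallel split pays off once the per-document work (tokenizing, stemming) is heavier.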
Another way to go is to employ a MapReduce approach. A nice presentation on combining tm and MapReduce for big data is available here. While that presentation is a few years old, all of the information is still current, valid, and relevant. The same authors have a newer academic article on the topic, which focuses on the tm.plugin.dc plugin. To get around having a VectorSource instead of a DirSource, you can use coercion:
data("crude")
as.DistributedCorpus(crude)
If none of those solutions fit your taste, or if you're just feeling adventurous, you might also see how well your GPU can tackle the problem. There's a lot of variation in how well GPUs perform relative to CPUs and this may be a use case. If you'd like to give it a try, you can use gputools or the other GPU packages mentioned on the CRAN HPC Task View.
Example:
library(tm)
install.packages("tm.plugin.dc")
library(tm.plugin.dc)
GetDCorpus <- function(textVector) {
    doc.corpus <- as.DistributedCorpus(VCorpus(VectorSource(textVector)))
    doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
    doc.corpus <- tm_map(doc.corpus, content_transformer(removeNumbers))
    doc.corpus <- tm_map(doc.corpus, content_transformer(removePunctuation))
    # doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))  # won't accept this for some reason...
    return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
dcorp <- GetDCorpus(data[,1])
tdm <- TermDocumentMatrix(dcorp)
inspect(tdm)
Output:
> inspect(tdm)
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 10/20
Sparsity : 67%
Maximal term length: 7
Weighting : term frequency (tf)
Docs
Terms 1 2 3
barred 0 1 0
big 1 0 0
child 0 0 1
dogs 1 0 0
holds 0 1 0
honor 0 0 1
hunt 1 0 0
let 1 0 0
student 0 0 1
the 1 0 0
This is better than my earlier answer.
The quanteda package has evolved significantly and is now faster and much simpler to use given its built-in tools for this sort of problem -- which is exactly what we designed it for. Part of the OP asked how to prepare the texts for a Bayesian classifier. I've added an example for this too, since quanteda's textmodel_nb() would crunch through 300k documents without breaking a sweat, plus it correctly implements the multinomial NB model (which is the most appropriate for text count matrices -- see also https://stackoverflow.com/a/54431055/4158274).
Here I demonstrate on the built-in inaugural corpus object, but the functions below would also work with a plain character vector input. I've used this same workflow to process and fit models to 10s of millions of Tweets in minutes, on a laptop, so it's fast.
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
# use a built-in data object
data <- data_corpus_inaugural
data
## Corpus consisting of 58 documents and 3 docvars.
# here we input a corpus, but plain text input works fine too
dtm <- dfm(data, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
dfm_wordstem(language = "english") %>%
dfm_remove(stopwords("english"))
dtm
## Document-feature matrix of: 58 documents, 5,346 features (89.0% sparse).
tail(dtm, nf = 5)
## Document-feature matrix of: 6 documents, 5 features (83.3% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs bleed urban sprawl windswept nebraska
## 1997-Clinton 0 0 0 0 0
## 2001-Bush 0 0 0 0 0
## 2005-Bush 0 0 0 0 0
## 2009-Obama 0 0 0 0 0
## 2013-Obama 0 0 0 0 0
## 2017-Trump 1 1 1 1 1
This is a rather trivial example, but for illustration, let's fit a Naive Bayes model, holding out the Trump document. This was the last inaugural speech at the time of this posting ("2017-Trump"), equal in position to the ndoc()th document.
# fit a Bayesian classifier
postwar <- ifelse(docvars(data, "Year") > 1945, "post-war", "pre-war")
textmod <- textmodel_nb(dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)], prior = "docfreq")
The same sorts of commands that work with other fitted model objects (e.g. lm(), glm(), etc.) will work with a fitted Naive Bayes textmodel object. So:
summary(textmod)
##
## Call:
## textmodel_nb.dfm(x = dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)],
## prior = "docfreq")
##
## Class Priors:
## (showing first 2 elements)
## post-war pre-war
## 0.2982 0.7018
##
## Estimated Feature Scores:
## fellow-citizen senat hous repres among vicissitud incid
## post-war 0.02495 0.4701 0.2965 0.06968 0.213 0.1276 0.08514
## pre-war 0.97505 0.5299 0.7035 0.93032 0.787 0.8724 0.91486
## life event fill greater anxieti notif transmit order
## post-war 0.3941 0.1587 0.3945 0.3625 0.1201 0.3385 0.1021 0.1864
## pre-war 0.6059 0.8413 0.6055 0.6375 0.8799 0.6615 0.8979 0.8136
## receiv 14th day present month one hand summon countri
## post-war 0.1317 0.3385 0.5107 0.06946 0.4603 0.3242 0.307 0.6524 0.1891
## pre-war 0.8683 0.6615 0.4893 0.93054 0.5397 0.6758 0.693 0.3476 0.8109
## whose voic can never hear vener
## post-war 0.2097 0.482 0.3464 0.2767 0.6418 0.1021
## pre-war 0.7903 0.518 0.6536 0.7233 0.3582 0.8979
predict(textmod, newdata = dtm[ndoc(dtm), ])
## 2017-Trump
## post-war
## Levels: post-war pre-war
predict(textmod, newdata = dtm[ndoc(dtm), ], type = "probability")
## post-war pre-war
## 2017-Trump 1 1.828083e-157