Similar to this post, I'm trying to use the Affective Norms for English Words (in French) for sentiment analysis with quanteda. I ultimately want to create a "mean sentiment" per text in my corpus.
First, I load the ANEW dictionary (FAN in French) and create a named vector of weights. ANEW differs from other dictionaries in that it does not use a key: value pair format, but instead assigns a numerical score to each word. The goal is to select features and then score them using weighted counts.
The ANEW file looks like this (MOT / VALENCE): cancer: 1.01, potato: 3.56, love: 6.56
#### FAN DATA ####
# read in the FAN data
df_fan <- read.delim("fan_anew.txt", stringsAsFactors = FALSE)
# construct a vector of weights with the term as the name
vector_fan <- df_fan$valence
names(vector_fan) <- df_fan$mot
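For reference, with just the three example entries shown above, the resulting named vector would look like this (a toy illustration, not the real FAN file):
# toy version of vector_fan built from the three example entries above
vector_toy <- c(cancer = 1.01, potato = 3.56, love = 6.56)
vector_toy
## cancer potato love
##   1.01   3.56 6.56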
Then I tried to apply dfm_weight() to my corpus of 27 documents.
# create a dfm selecting on the FAN words
dfm_fan <- dfm(my_corpus, select = df_fan$mot, language = "French")
dfm_fan_weighted <- dfm_fan %>%
dfm_weight(scheme = "prop") %>%
dfm_weight(weights = vector_fan)
## Warning messages:
## 1: dfm_weight(): ignoring 696 unmatched weight features
## 2: In diag(weight) : NAs introduced by coercion
Here is what I get: only 6 documents are included in the generated dfm object, and the code doesn't estimate the ANEW mean score for each document in the original corpus.
tail(dfm_fan_weighted)
## Document-feature matrix of: 6 documents, 335 features (72.6% sparse).
tail(dfm_fan_weighted)[, c("absent", "politique")]
## Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : invalid character indexing
tail(rowSums(dfm_fan_weighted))
## text22 text23 text24 text25 text26 text27
## NA NA NA NA NA NA
tail(dfm_fan_weighted)[, c("beau")]
## Document-feature matrix of: 6 documents, 1 feature (100% sparse).
## 6 x 1 sparse Matrix of class "dfm"
## features
## docs beau
## text22 0
## text23 0
## text24 0
## text25 0
## text26 0
## text27 0
Any ideas on how to fix it? I think the code needs just some small changes to work properly.
Edit: I edited the code following Ken Benoit's comment.
I'm just getting to grips with the tm package in R.
This is probably a simple question, but I'm trying to use the findAssocs function to get an idea of the word associations in my customer enquiries insight document, and I can't seem to get findAssocs to work correctly.
When I use the following:
findAssocs(dtm, words, corlimit = 0.30)
$population
numeric(0)
$migration
numeric(0)
What does this mean? Words is a character vector of 667 words - surely there must be some correlative relationships?
Consider the following example:
library(tm)
corp <- VCorpus(VectorSource(
c("hello world", "hello another World ", "and hello yet another world")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
# Docs
# Terms 1 2 3
# and 0 0 1
# another 0 1 1
# hello 1 1 1
# world 1 1 1
# yet 0 0 1
Now consider
findAssocs(x=tdm, terms=c("hello", "yet"), corlimit=.4)
# $hello
# numeric(0)
#
# $yet
# and another
# 1.0 0.5
From what I understand, findAssocs looks at the correlations of "hello" with everything except "hello" and "yet", as well as of "yet" with everything except "hello" and "yet". The terms "yet" and "and" have a correlation coefficient of 1.0, which is above the lower limit of 0.4. The correlation between "yet" and "another" is 0.5 - that's also above our 0.4 limit.
Here's another example showcasing this:
findAssocs(x=tdm, terms=c("yet", "another"), corlimit=0)
# $yet
# and
# 1
#
# $another
# and
# 0.5
Note that "hello" (and "world") don't yield any results because they occur in every document. This means the term frequency has zero variance, and cor under the hood yields NA (like cor(rep(1,3), 1:3), which gives NA plus a zero-standard-deviation warning).
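To see where those numbers come from, you can reproduce them by hand on the dense matrix (a small sketch of what findAssocs does under the hood):
m <- as.matrix(tdm)
cor(m["yet", ], m["and", ])      # 1.0 -- matches $yet above
cor(m["yet", ], m["another", ])  # 0.5
cor(m["hello", ], m["and", ])    # NA plus a zero-sd warning, hence numeric(0) for "hello"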
I have a big dataset (>1 million rows) where each row is a multi-sentence text. For example, the following is a sample of 2 rows:
mydat <- data.frame(text=c('I like apple. Me too','One two. Thank you'),stringsAsFactors = F)
What I was trying to do is extract the bigram terms in each row (the "." should act as a separator, so that no bigram spans it). If I simply use the dfm function:
mydfm = dfm(mydat$text,toLower = T,removePunct = F,ngrams=2)
dtm = as.DocumentTermMatrix(mydfm)
txt_data = as.data.frame(as.matrix(dtm))
These are the terms I got:
"i_like" "like_apple" "apple_." "._me" "me_too" "one_two" "two_." "._thank" "thank_you"
This is what I expect; basically, "." is skipped and used to separate the terms:
"i_like" "like_apple" "me_too" "one_two" "thank_you"
I believe writing slow loops could solve this as well, but given that it is a huge dataset, I would prefer an efficient approach similar to dfm() in quanteda. Any suggestions would be appreciated!
@Jota's answer works, but there is a way to control the tokenisation more finely while calling it only once:
(toks <- tokenize(toLower(mydat$text), removePunct = TRUE, ngrams = 2))
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "apple_me" "me_too"
##
## Component 2 :
## [1] "one_two" "two_thank" "thank_you"
dfm(toks)
## Document-feature matrix of: 2 documents, 7 features.
## 2 x 7 sparse Matrix of class "dfmSparse"
## features
## docs i_like like_apple apple_me me_too one_two two_thank thank_you
## text1 1 1 1 1 0 0 0
## text2 0 0 0 0 1 1 1
Added:
Then, to remove any ngram containing the "." punctuation, tokenize again keeping the punctuation (toks2 below) and use removeFeatures(), which defaults to valuetype = "glob":
toks2 <- tokenize(toLower(mydat$text), removePunct = FALSE, ngrams = 2)
removeFeatures(toks2, "*.*")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "me_too"
##
## Component 2 :
## [1] "one_two" "thank_you"
If your goal is just to extract those bigrams, then you could use tokens() twice: once to tokenize to sentences, then again to make the ngrams for each sentence.
library("quanteda")
mydat$text %>%
    tokens(what = "sentence") %>%
    as.character() %>%
    tokens(ngrams = 2, remove_punct = TRUE) %>%
    as.character()
#[1] "I_like" "like_apple" "Me_too" "One_two" "Thank_you"
Insert a tokens_tolower() after the first tokens() call if you like, or use char_tolower() at the end.
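For instance, a minimal sketch of the char_tolower() variant, lowercasing at the very end of the same pipeline:
mydat$text %>%
    tokens(what = "sentence") %>%
    as.character() %>%
    tokens(ngrams = 2, remove_punct = TRUE) %>%
    as.character() %>%
    char_tolower()
#[1] "i_like"     "like_apple" "me_too"     "one_two"    "thank_you"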
My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix such that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)
GetCorpus <- function(textVector)
{
    doc.corpus <- Corpus(VectorSource(textVector))
    doc.corpus <- tm_map(doc.corpus, tolower)
    doc.corpus <- tm_map(doc.corpus, removeNumbers)
    doc.corpus <- tm_map(doc.corpus, removePunctuation)
    doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
    doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
    doc.corpus <- tm_map(doc.corpus, stripWhitespace)
    doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
    return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
corp <- GetCorpus(data[,1])
inspect(corp)
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
The output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
Terms
Docs bar big child dogs holds honor hunt let stud
character(0) 0 1 0 1 0 0 1 1 0
character(0) 1 0 0 0 1 0 0 0 0
character(0) 0 0 1 0 0 1 0 0 1
My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.
I have heard that I could use data.table but I am not sure how.
I have also looked at the qdap package, but it gives me an error when trying to load the package, plus I don't even know if it will work.
Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf
Which approach?
data.table is definitely the right way to go. Regex operations are slow, although the ones in stringi are much faster (in addition to being much better). Anything with
I went through many iterations of solving this problem when creating quanteda::dfm() for my quanteda package (see the GitHub repo here). The fastest solution, by far, involves using the data.table and Matrix packages to index the documents and tokenised features, counting the features within documents, and plugging the result straight into a sparse matrix.
In the code below, I've taken as an example the texts included with the quanteda package, which you can (and should!) install from CRAN, or get the development version from
devtools::install_github("kbenoit/quanteda")
I'd be very interested to see how it works on your 4m documents. Based on my experience working with corpuses of that size, it will work pretty well (if you have enough memory).
Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C++.
Core of the quanteda dfm() function
Here is the bare bones of the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents or corpus objects, and implements lowercasing, removal of numbers, and removal of spacing by default (but these can all be modified if wished).
require(data.table)
require(Matrix)
dfm_quanteda <- function(x) {
    # index the documents, using existing names if available
    docIndex <- 1:length(x)
    if (is.null(names(x)))
        names(docIndex) <- factor(paste("text", 1:length(x), sep = "")) else
        names(docIndex) <- names(x)

    # long table of (document, feature) pairs
    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features

    # count each feature within each document
    alltokens[, "n" := 1L]
    alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

    # index the sorted unique features
    uniqueFeatures <- unique(alltokens$features)
    uniqueFeatures <- sort(uniqueFeatures)
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)

    # keyed join of the counts to the feature indexes
    setkey(alltokens, features)
    setkey(featureTable, features)
    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

    # plug the (docIndex, featureIndex, count) triplets straight into a sparse matrix
    sparseMatrix(i = alltokens$docIndex,
                 j = alltokens$featureIndex,
                 x = alltokens$V1,
                 dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
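And, as a quick sanity check, the same function applied to the three-text example from the question (a sketch; it assumes the dfm_quanteda() defined above plus quanteda's tokenize() and toLower()):
toksSmall <- tokenize(toLower(c("Let the big dogs hunt",
                                "No holds barred",
                                "My child is an honor student")),
                      removePunct = TRUE)
dfm_quanteda(toksSmall)
## a 3 x 14 sparse matrix of document-feature counts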
That's just a snippet, of course, but the full source code is easily found on the GitHub repo (dfm-main.R).
quanteda on your example
How's this for simplicity?
require(quanteda)
mytext <- c("Let the big dogs hunt",
"No holds barred",
"My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.
# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
# features
# docs bar big child dog hold honor hunt let student
# text1 0 1 0 1 0 0 1 1 0
# text2 1 0 0 0 1 0 0 0 0
# text3 0 0 1 0 0 1 0 0 1
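Since the original question mentions feeding a Bayesian classifier, note that the dfm can also be handed to tm if your classifier expects a DocumentTermMatrix. This is a sketch using quanteda's convert() (it assumes the tm package is installed):
myDfm <- dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# convert the quanteda dfm into a tm DocumentTermMatrix
convert(myDfm, to = "tm")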
I think you may want to consider a more regex-focused solution. These are some of the problems/thinking I'm wrestling with as a developer. I'm currently looking at the stringi package heavily for development, as it has some consistently named functions that are wicked fast for string manipulation.
In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr, and instead focus on string manipulation with stringi, keeping the data in a matrix, and manipulating it with specific packages meant to handle that format. I take your example and multiply it 100000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
text=c("Let the big dogs hunt",
"No holds barred",
"My child is an honor student"
), stringsAsFactors = F)
## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]
library(stringi)
library(SnowballC)
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english"))) #in old package versions it was named 'stri_extract_words'
names(out) <- paste0("doc", 1:length(out))
lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)
library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]
library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)
tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm
## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
dtm
You have a few choices. @TylerRinker commented about qdap, which is certainly a way to go.
Alternatively (or additionally) you could also benefit from a healthy dose of parallelism. There's a nice CRAN page detailing HPC resources in R. It's a bit dated though, and the multicore package's functionality is now contained within parallel.
You can scale up your text mining using the multicore apply functions of the parallel package or with cluster computing (also supported by that package, as well as by snowfall and biopara).
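As a rough sketch of that idea, the cleaning steps could be run chunk-wise with parallel::mclapply (mclapply forks, so on Windows you would use parLapply instead; the chunk splitting below is an illustrative assumption, not a tuned recipe):
library(tm)
library(parallel)
# split the raw text vector into roughly one chunk per core and clean the chunks in parallel
chunks <- split(data[, 1], cut(seq_along(data[, 1]), detectCores(), labels = FALSE))
cleaned <- mclapply(chunks, function(txt) {
    removePunctuation(removeNumbers(tolower(txt)))
}, mc.cores = detectCores())
cleaned <- unlist(cleaned, use.names = FALSE)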
Another way to go is to employ a MapReduce approach. A nice presentation on combining tm and MapReduce for big data is available here. While that presentation is a few years old, all of the information is still current, valid and relevant. The same authors have a newer academic article on the topic, which focuses on the tm.plugin.dc plugin. To get around having a VectorSource instead of a DirSource, you can use coercion:
data("crude")
as.DistributedCorpus(crude)
If none of those solutions fit your taste, or if you're just feeling adventurous, you might also see how well your GPU can tackle the problem. There's a lot of variation in how well GPUs perform relative to CPUs and this may be a use case. If you'd like to give it a try, you can use gputools or the other GPU packages mentioned on the CRAN HPC Task View.
Example:
library(tm)
install.packages("tm.plugin.dc")
library(tm.plugin.dc)
GetDCorpus <- function(textVector)
{
    doc.corpus <- as.DistributedCorpus(VCorpus(VectorSource(textVector)))
    doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
    doc.corpus <- tm_map(doc.corpus, content_transformer(removeNumbers))
    doc.corpus <- tm_map(doc.corpus, content_transformer(removePunctuation))
    # doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english")) # won't accept this for some reason...
    return(doc.corpus)
}
data <- data.frame(
c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)
dcorp <- GetDCorpus(data[,1])
tdm <- TermDocumentMatrix(dcorp)
inspect(tdm)
Output:
> inspect(tdm)
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 10/20
Sparsity : 67%
Maximal term length: 7
Weighting : term frequency (tf)
Docs
Terms 1 2 3
barred 0 1 0
big 1 0 0
child 0 0 1
dogs 1 0 0
holds 0 1 0
honor 0 0 1
hunt 1 0 0
let 1 0 0
student 0 0 1
the 1 0 0
This is better than my earlier answer.
The quanteda package has evolved significantly and is now faster and much simpler to use, given its built-in tools for this sort of problem -- which is exactly what we designed it for. Part of the OP's question asked how to prepare the texts for a Bayesian classifier. I've added an example for this too, since quanteda's textmodel_nb() would crunch through 300k documents without breaking a sweat, and it correctly implements the multinomial NB model (which is the most appropriate for text count matrices -- see also https://stackoverflow.com/a/54431055/4158274).
Here I demonstrate on the built-in inaugural corpus object, but the functions below would also work with a plain character vector input. I've used this same workflow to process and fit models to 10s of millions of Tweets in minutes, on a laptop, so it's fast.
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
# use a built-in data object
data <- data_corpus_inaugural
data
## Corpus consisting of 58 documents and 3 docvars.
# here we input a corpus, but plain text input works fine too
dtm <- dfm(data, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
dfm_wordstem(language = "english") %>%
dfm_remove(stopwords("english"))
dtm
## Document-feature matrix of: 58 documents, 5,346 features (89.0% sparse).
tail(dtm, nf = 5)
## Document-feature matrix of: 6 documents, 5 features (83.3% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs bleed urban sprawl windswept nebraska
## 1997-Clinton 0 0 0 0 0
## 2001-Bush 0 0 0 0 0
## 2005-Bush 0 0 0 0 0
## 2009-Obama 0 0 0 0 0
## 2013-Obama 0 0 0 0 0
## 2017-Trump 1 1 1 1 1
This is a rather trivial example, but for illustration, let's fit a Naive Bayes model, holding out the Trump document. This was the last inaugural speech at the time of this posting ("2017-Trump"), equal in position to the ndoc()th document.
# fit a Bayesian classifier
postwar <- ifelse(docvars(data, "Year") > 1945, "post-war", "pre-war")
textmod <- textmodel_nb(dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)], prior = "docfreq")
The same sorts of commands that work with other fitted model objects (e.g. lm(), glm(), etc.) will work with a fitted Naive Bayes textmodel object. So:
summary(textmod)
##
## Call:
## textmodel_nb.dfm(x = dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)],
## prior = "docfreq")
##
## Class Priors:
## (showing first 2 elements)
## post-war pre-war
## 0.2982 0.7018
##
## Estimated Feature Scores:
## fellow-citizen senat hous repres among vicissitud incid
## post-war 0.02495 0.4701 0.2965 0.06968 0.213 0.1276 0.08514
## pre-war 0.97505 0.5299 0.7035 0.93032 0.787 0.8724 0.91486
## life event fill greater anxieti notif transmit order
## post-war 0.3941 0.1587 0.3945 0.3625 0.1201 0.3385 0.1021 0.1864
## pre-war 0.6059 0.8413 0.6055 0.6375 0.8799 0.6615 0.8979 0.8136
## receiv 14th day present month one hand summon countri
## post-war 0.1317 0.3385 0.5107 0.06946 0.4603 0.3242 0.307 0.6524 0.1891
## pre-war 0.8683 0.6615 0.4893 0.93054 0.5397 0.6758 0.693 0.3476 0.8109
## whose voic can never hear vener
## post-war 0.2097 0.482 0.3464 0.2767 0.6418 0.1021
## pre-war 0.7903 0.518 0.6536 0.7233 0.3582 0.8979
predict(textmod, newdata = dtm[ndoc(dtm), ])
## 2017-Trump
## post-war
## Levels: post-war pre-war
predict(textmod, newdata = dtm[ndoc(dtm), ], type = "probability")
## post-war pre-war
## 2017-Trump 1 1.828083e-157
I have a large data set in the following format, where on each line there is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is to create a little search engine, where the documents (in the same format) matching a query are ranked; I thought about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather use these data, unless the format is truly something alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct the matrix?). But this way of converting does not seem like a good approach; the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
      doc1 doc2 doc3 doc4 ...
aword    3    0    0    0
bword    2    4    0    0
cword   15   20    0    0
dword    2    0    0    0
fword    0    1    0    0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on a spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
    cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
        dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: Answer edited to use matrices in the creation of "out" to minimize the number of calls to read.table, which would be a major bottleneck with bigger data.
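If you then want to continue with the tm-based tutorial, one way to go (a sketch, assuming the slam and tm packages) is to push that cross-tabulation into a TermDocumentMatrix:
library(slam)
library(tm)
## cross-tabulate as above, then drop the table class to get a plain term-by-document matrix
m <- unclass(xtabs(count ~ word + document, out))
tdm <- as.TermDocumentMatrix(as.simple_triplet_matrix(m), weighting = weightTf)
inspect(tdm)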