How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.
For instance:
I create a tokens object from a list of documents and apply collocation analysis:
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)
However, as far as I can see, there is no clear method for seeing which collocations are frequent in, or present in, which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only identifies the documents as text1, text2, etc.
I would like to group the collocation analysis results based on docvars. Is that possible with quanteda?

It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.
So the solution to group by document (or any other variable) is to:
1. Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
2. Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document.
The last step is the slightly trickier part, so the full workflow is worth walking through.
Implementation using the built-in inaugural corpus:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 let us 34 0 2 6.257000 17.80637
## 2 fellow citizens 14 0 2 6.451738 16.18314
## 3 fellow americans 15 0 2 6.221678 16.16410
## 4 one another 14 0 2 6.592755 14.56082
## 5 god bless 15 0 2 8.628894 13.57027
## 6 united states 12 0 2 9.192044 13.22077
Now we compound them and keep only the collocations, then get the frequencies by document:
dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
That dfm already contains the counts by document of each collocation, but if you want counts in a data.frame format, with a grouping option, use textstat_frequency(). Here I've only output the top two by document, but if you remove the n = 2 then it will give you the frequencies of all collocations by document.
textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
## feature frequency rank docfreq group
## 1 nuclear weapons 4 1 1 1985-Reagan
## 2 human freedom 3 2 1 1985-Reagan
## 3 new breeze 4 1 1 1989-Bush
## 4 new engagement 3 2 1 1989-Bush
## 5 let us 7 1 1 1993-Clinton
## 6 fellow americans 4 2 1 1993-Clinton
## 7 let us 6 1 1 1997-Clinton
## 8 new century 6 1 1 1997-Clinton
## 9 nation's promise 2 1 1 2001-Bush
## 10 common good 2 1 1 2001-Bush
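Since the original question asked about grouping by docvars rather than by document, note that the groups argument of textstat_frequency() also accepts a grouping vector, so you can pass a docvar instead of docnames(). A minimal sketch using the President docvar carried by the built-in inaugural corpus; substitute whichever docvar your own corpus has:
# same counts as above, but grouped by the "President" docvar
# rather than by document name
textstat_frequency(dfmat, groups = docvars(dfmat, "President"), n = 2) %>%
  head(10)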

Related

Identify WHICH words in a document have been matched by dictionary lookup and how many times

Quanteda question.
For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much.
Put differently, I want to get a matrix of the features in each dictionary category that have been matched using the tokens_lookup and dfm_lookup functions, and their frequency per document. So not the aggregated frequency of all words in the category, but of each of them separately.
Is there an easy way to get this?
The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
The loop iterates over the dictionary keys to build up a list, each time doing the following:
- select the tokens, leaving a pad for the ones not selected;
- compound the multi-word tokens into single tokens;
- rename the pad ("") to OTHER, so that we can count non-matches; and
- create the dfm.
library("quanteda")
## Package version: 2.1.0
toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
## features
## docs clouds raging storms crisis war against violence hatred badly
## 2009-Obama 1 1 2 4 2 1 1 1 1
## 2013-Obama 0 1 1 1 3 1 0 0 0
## 2017-Trump 0 0 0 0 0 1 0 0 0
## features
## docs weakened
## 2009-Obama 1
## 2013-Obama 0
## 2017-Trump 0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
## features
## docs grateful trust mindful thank well generosity cooperation
## 2009-Obama 1 2 1 1 2 1 2
## 2013-Obama 0 0 0 0 4 0 0
## 2017-Trump 1 0 0 1 0 0 0
## features
## docs prosperity peace skill
## 2009-Obama 3 4 1
## 2013-Obama 1 3 1
## 2017-Trump 1 0 0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
## features
## docs not_apologize OTHER
## 2009-Obama 1 2687
## 2013-Obama 0 2317
## 2017-Trump 0 1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
## features
## docs not_fight not_sap not_grudgingly not_fail OTHER
## 2009-Obama 0 0 1 0 2687
## 2013-Obama 1 1 0 0 2313
## 2017-Trump 0 0 0 1 1658
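If you prefer each key's matches in a flat tabular form, the dfms in the list can be converted with quanteda's convert(); a short sketch, assuming the same dfm_list built above:
# view one key's matches as a data.frame; the OTHER column counts the
# tokens that matched no value in that key
df_negative <- convert(dfm_list$negative, to = "data.frame")
df_negative[, 1:6]  # first few matched features per document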

Speed of Cleaning Text in R using a Dictionary

I currently have a list of misspellings and a list of corrections, indexed with a 1 to 1 relationship.
These corrections are specific to the work I am doing so I cannot use existing spelling correction packages.
Given a list of strings which I want to apply these corrections to, I have the following code:
for (i in 1:n) {
  new_text <- gsub(match[i], dict[i], new_text)
  new_text <- gsub('[[:punct:]]', '', new_text)
}
Although this gives the results I want, it takes most of the day to run.
I cannot figure out how to use apply functions because the operations happen in a specific order on the same object.
Is there anything else I can try to speed this up?
Edit: This is the very small test set I have put together to benchmark performance.
match <- c("\\b(abouta|aobut|bout|abot|abotu)\\b","\\b(avdised|advisd|advized|advsied)\\b","\\b(posible|possibl)\\b","\\b(replacment|repalcement|replacemnt|replcement|rplacement)\\b","\\b(tommorrow|tomorow|tommorow|tomorro|tommoro)\\b")
dict <- c('about','advised','possible','replacement','tomorrow')
new_text <- c('be advisd replacment coming tomorow','did you get the email aobut the repalcement tomorro','the customer has been avdised of a posible replacement','there is a replacement coming tomorrow','what time tommorow is the replacment coming')
n <- 5
Running my current code 1000 times on this data gives 0.424 elapsed.
Try the corpus library, using a custom stemmer. The library lets you provide an arbitrary stemmer function. In your case you would use something like the following for your stemmer:
library(corpus)
dict <- strsplit(split = "\\|",
                 c("about" = "abouta|aobut|bout|abot|abotu",
                   "advised" = "avdised|advisd|advized|advsied",
                   "possible" = "posible|possibl",
                   "replacement" = "replacment|repalcement|replacemnt|replcement|rplacement",
                   "tomorrow" = "tommorrow|tomorow|tommorow|tomorro|tommoro"))
my_stemmer <- new_stemmer(unlist(dict), rep(names(dict), lengths(dict)))
Then, you can either pass this function as the stemmer argument to any function expecting text, or else you can create a corpus_text object with the stemmer attribute (as part of its token_filter that defines how text gets transformed to tokens):
new_text <- c('be advisd replacment coming tomorow',
              'did you get the email aobut the repalcement tomorro',
              'the customer has been avdised of a posible replacement',
              'there is a replacement coming tomorrow',
              'what time tommorow is the replacment coming')
Use term_stats to count (stemmed) token occurrences:
text <- as_corpus_text(new_text, stemmer = my_stemmer, drop_punct = TRUE)
term_stats(text)
#> term count support
#> 1 replacement 5 5
#> 2 tomorrow 4 4
#> 3 the 4 3
#> 4 coming 3 3
#> 5 a 2 2
#> 6 advised 2 2
#> 7 is 2 2
#> 8 about 1 1
#> 9 be 1 1
#> 10 been 1 1
#> 11 customer 1 1
#> 12 did 1 1
#> 13 email 1 1
#> 14 get 1 1
#> 15 has 1 1
#> 16 of 1 1
#> 17 possible 1 1
#> 18 there 1 1
#> 19 time 1 1
#> 20 what 1 1
#> ⋮ (21 rows total)
Use text_locate to find instances of (stemmed) tokens in the original text:
text_locate(text, "replacement")
#> text before instance after
#> 1 1 be advisd replacment coming tomorow
#> 2 2 …u get the email aobut the repalcement tomorro
#> 3 3 …been avdised of a posible replacement
#> 4 4 there is a replacement coming tomorrow
#> 5 5 what time tommorow is the replacment coming
The results of the stemming function get cached, so this is all very fast.
More examples at http://corpustext.com/articles/stemmer.html
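If you would rather stay with plain string substitution than tokenize, a vectorised alternative (not part of the answer above) is stringi::stri_replace_all_regex() with vectorize_all = FALSE, which applies every pattern/replacement pair to each string in a single call. A sketch, assuming the punctuation strip can be deferred to one pass at the end:
library(stringi)
# apply all misspelling patterns and their corrections to every element
# of new_text at once; vectorize_all = FALSE cycles each pattern/
# replacement pair over the whole vector instead of pairing them
# element-wise
corrected <- stri_replace_all_regex(new_text, match, dict,
                                    vectorize_all = FALSE)
corrected <- gsub("[[:punct:]]", "", corrected)  # strip punctuation once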

Quanteda - Apply Function to DFM Over Document Variables

I am using R's quanteda package and the latest versions for both R and the package. I have a corpus of documents which number in the millions.
Let's suppose I have a DFM generated from quanteda with each document having a docvar of the date. There are thousands of documents generated in a given day, but I want to obtain the DFMs applied to the documents by day (so that I have total word counts for each term by day). I know that quanteda is built using data.table, so it should be possible to do this, but I have found little in the "Getting Started with Quanteda" or on StackOverflow that gives a clean way of doing this.
Any suggestions?
You want the 'groups' argument to dfm:
> # Add some random dates to an existing corpus
> docvars(data_corpus_inaugural)$date <- rep(as.Date(runif(19, 1, 18000), origin='1970-01-01'), 3)
> dfm_inaugural <- dfm(data_corpus_inaugural, groups='date')
> head(dfm_inaugural)
Document-feature matrix of: 19 documents, 9,215 features (80.8% sparse).
(showing first 6 documents and first 6 features)
features
docs fellow citizens i appear before you
1970-12-27 4 7 39 2 10 17
1972-04-25 8 13 29 1 8 8
1973-08-22 1 3 48 1 6 1
1973-10-11 2 4 25 0 3 5
1974-01-05 3 9 57 0 7 2
1975-04-12 7 21 63 4 6 16
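Note that in more recent quanteda releases (v3 and later) the groups argument has been removed from dfm(); the equivalent is to build the dfm first and then aggregate it with dfm_group(). A sketch under that assumption, reusing the date docvar created above:
# quanteda >= 3: tokenize, build the dfm, then sum counts by day
dfmat <- dfm(tokens(data_corpus_inaugural))
dfm_byday <- dfm_group(dfmat, groups = docvars(dfmat, "date"))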

Include ID number in dfm() output

I have a dataset with an ID number column and a text column, and I am running a LIWC analysis on the text data using the quanteda package. Here's an example of my data setup:
mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)
I have been able to conduct the LIWC analysis using scores <- dfm(as.character(mydata$text), dictionary = liwc)
However, when I view the results (View(scores)), I find that the function does not reference the original ID numbers (19, 101, 43, 12) in the final results. Instead, a row.names column is included, but it contains non-descriptive identifiers (e.g., "text1", "text2").
How can I get the dfm() function to include the ID numbers in its output? Thank you!
It sounds like you would like the row names of the dfm object to be the ID numbers from your mydata$id. This will happen automatically if you declare this ID to be the docnames for the texts. The easiest way to do this is to create a quanteda corpus object from your data.frame.
The corpus() call below assigns the docnames from your id variable. Note: The "Text" from the summary() call looks like a numeric value but it's actually the document name for the text.
require(quanteda)
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]])
summary(myCorpus)
# Corpus consisting of 4 documents.
#
# Text Types Tokens Sentences
# 19 11 11 1
# 101 13 14 1
# 43 12 12 1
# 12 12 14 1
#
# Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Tue Dec 29 11:54:00 2015
# Notes:
From there, the document name is automatically the row label in your dfm. (You can add the dictionary = argument for your LIWC application.)
myDfm <- dfm(myCorpus, verbose = FALSE)
head(myDfm)
# Document-feature matrix of: 4 documents, 45 features.
# (showing first 4 documents and first 6 features)
# features
# docs no wonder then that ever gathering
# 19 1 1 1 1 1 1
# 101 0 0 0 2 0 0
# 43 0 0 0 0 0 0
# 12 0 0 0 0 0 0
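As an aside, if your quanteda version has the data.frame method for corpus() (added in releases later than this answer), you can pass the ID column directly instead of supplying docnames yourself; a brief sketch:
# take document names from the "id" column and texts from the "text" column
myCorpus <- corpus(mydata, docid_field = "id", text_field = "text")
docnames(myCorpus)  # the id values become the document names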

Sentiment Analysis in R

I am new to sentiment analysis and have no idea how to go about it in R, so I would like to seek help and guidance on this.
I have a set of data consisting of opinions and would like to analyse those opinions.
Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart..
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80.
Thank you.
Sentiment analysis encompasses a broad category of methods designed to measure positive versus negative sentiment from text, so that makes this a fairly difficult question to answer simply. But here is a simple answer: You can apply a dictionary to your document-term matrix and then combine the positive versus negative key categories of your dictionary to create a sentiment measure.
I suggest trying this in the text analysis package quanteda, which handles a variety of existing dictionary formats and allows you to create very flexible custom dictionaries.
For example:
require(quanteda)
mycorpus <- subset(inaugCorpus, Year>1980)
mydict <- dictionary(list(negative = c("detriment*", "bad*", "awful*", "terrib*", "horribl*"),
                          positive = c("good", "great", "super*", "excellent")))
myDfm <- dfm(mycorpus, dictionary = mydict)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 9 documents
## ... indexing features: 3,113 feature types
## ... applying a dictionary consisting of 2 keys
## ... created a 9 x 2 sparse dfm
## ... complete.
## Elapsed time: 0.057 seconds.
myDfm
## Document-feature matrix of: 9 documents, 2 features.
## 9 x 2 sparse Matrix of class "dfmSparse"
## features
## docs negative positive
## 1981-Reagan 0 6
## 1985-Reagan 0 6
## 1989-Bush 0 18
## 1993-Clinton 1 2
## 1997-Clinton 2 8
## 2001-Bush 1 6
## 2005-Bush 0 8
## 2009-Obama 2 3
## 2013-Obama 1 3
# use a LIWC dictionary - obviously you need this file
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
myDfmLIWC <- dfm(mycorpus, dictionary = liwcdict)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 9 documents
## ... indexing features: 3,113 feature types
## ... applying a dictionary consisting of 68 keys
## ... created a 9 x 68 sparse dfm
## ... complete.
## Elapsed time: 1.844 seconds.
myDfmLIWC[, grep("^Pos|^Neg", features(myDfmLIWC))]
## Document-feature matrix of: 9 documents, 4 features.
## 9 x 4 sparse Matrix of class "dfmSparse"
## features
## docs Negate Posemo Posfeel Negemo
## 1981-Reagan 46 89 5 24
## 1985-Reagan 28 104 7 33
## 1989-Bush 40 102 10 8
## 1993-Clinton 25 51 3 23
## 1997-Clinton 27 64 5 22
## 2001-Bush 40 80 6 27
## 2005-Bush 25 117 5 31
## 2009-Obama 40 83 5 46
## 2013-Obama 42 80 13 22
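To turn these dictionary counts into a single document-level sentiment measure, as described above, one simple option (assuming a quanteda version that provides convert()) is to difference the positive and negative key counts:
# net sentiment per document = positive matches minus negative matches
sent <- convert(myDfm, to = "data.frame")
sent$net_sentiment <- sent$positive - sent$negative
sent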
For your corpus, assuming that you get it into a data.frame called data, you can create a quanteda corpus using:
mycorpus <- corpus(data$Content, docvars = data[, 1:2])
See also ?textfile for loading in content from files in one easy command. This works with .csv files for instance, although you would have problems with that file because the Content field contains text containing commas.
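In current quanteda versions, the file-loading functionality described above lives in the companion readtext package rather than ?textfile. A minimal sketch, assuming your opinions are saved in a hypothetical opinions.csv with the columns shown above:
library(readtext)
# read the CSV, taking the document text from the "Content" column;
# the remaining columns (Title, Date) become docvars
dat <- readtext("opinions.csv", text_field = "Content")
mycorpus <- corpus(dat)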
There are many other ways to measure sentiment of course, but if you are new to sentiment mining and R, that should get you started. You can read more on sentiment mining methods (and apologies if you already have encountered them) from:
Liu, Bing. 2010. "Sentiment Analysis and Subjectivity." Handbook of Natural Language Processing 2: 627–66.
Liu, Bing. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
