Identify WHICH words in a document have been matched by dictionary lookup and how many times - r

Quanteda question.
For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much.
Put differently, I want to get a matrix of the features in each dictionary category that have been matched using the tokens_lookup and dfm_lookup functions, and their frequency per document. So not the aggregated frequency of all words in the category, but of each of them separately.
Is there an easy way to get this?

The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
The loop iterates over the dictionary keys to build up a list, each time doing the following:
select the tokens but leave a pad for ones not selected;
compound the multi-word tokens into single tokens;
rename the pad ("") to OTHER, so that we can count non-matches; and
create the dfm.
library("quanteda")
## Package version: 2.1.0
toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
## features
## docs clouds raging storms crisis war against violence hatred badly
## 2009-Obama 1 1 2 4 2 1 1 1 1
## 2013-Obama 0 1 1 1 3 1 0 0 0
## 2017-Trump 0 0 0 0 0 1 0 0 0
## features
## docs weakened
## 2009-Obama 1
## 2013-Obama 0
## 2017-Trump 0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
## features
## docs grateful trust mindful thank well generosity cooperation
## 2009-Obama 1 2 1 1 2 1 2
## 2013-Obama 0 0 0 0 4 0 0
## 2017-Trump 1 0 0 1 0 0 0
## features
## docs prosperity peace skill
## 2009-Obama 3 4 1
## 2013-Obama 1 3 1
## 2017-Trump 1 0 0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
## features
## docs not_apologize OTHER
## 2009-Obama 1 2687
## 2013-Obama 0 2317
## 2017-Trump 0 1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
## features
## docs not_fight not_sap not_grudgingly not_fail OTHER
## 2009-Obama 0 0 1 0 2687
## 2013-Obama 1 1 0 0 2313
## 2017-Trump 0 0 0 1 1658
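If you would rather have a single tidy table than a list of dfm objects, the list can be flattened into one long data frame of per-document match counts. This is only a sketch (not part of the original answer), using base R on the dfm_list created above; it drops the "OTHER" pad and zero counts.
# a sketch: flatten dfm_list into key / document / feature / count rows
counts_long <- do.call(rbind, lapply(names(dfm_list), function(key) {
  mat <- as.matrix(dfm_list[[key]])   # documents x features
  long <- data.frame(
    key      = key,
    document = rep(rownames(mat), times = ncol(mat)),
    feature  = rep(colnames(mat), each = nrow(mat)),
    count    = as.vector(mat),
    stringsAsFactors = FALSE
  )
  subset(long, feature != "OTHER" & count > 0)   # keep only actual dictionary matches
}))
head(counts_long)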


How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.
For instance:
I create a tokens object from a list of documents and apply collocation analysis.
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)
However, as far as I can see, there is no clear method for seeing which collocations are frequent in, or exist in, which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only includes row IDs such as text1, text2, etc.
I would like to group the collocation analysis results based on docvars. Is this possible with quanteda?
It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.
So the solution for grouping by document (or any other variable) is to:
Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document.
The last step is a bit trickier. Here is an implementation using the built-in inaugural corpus:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 let us 34 0 2 6.257000 17.80637
## 2 fellow citizens 14 0 2 6.451738 16.18314
## 3 fellow americans 15 0 2 6.221678 16.16410
## 4 one another 14 0 2 6.592755 14.56082
## 5 god bless 15 0 2 8.628894 13.57027
## 6 united states 12 0 2 9.192044 13.22077
Now we compound them and keep only the collocations, then get the frequencies by document:
dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
That dfm already contains the counts by document of each collocation, but if you want counts in a data.frame format, with a grouping option, use textstat_frequency(). Here I've only output the top two by document, but if you remove the n = 2 then it will give you the frequencies of all collocations by document.
textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
## feature frequency rank docfreq group
## 1 nuclear weapons 4 1 1 1985-Reagan
## 2 human freedom 3 2 1 1985-Reagan
## 3 new breeze 4 1 1 1989-Bush
## 4 new engagement 3 2 1 1989-Bush
## 5 let us 7 1 1 1993-Clinton
## 6 fellow americans 4 2 1 1993-Clinton
## 7 let us 6 1 1 1997-Clinton
## 8 new century 6 1 1 1997-Clinton
## 9 nation's promise 2 1 1 2001-Bush
## 10 common good 2 1 1 2001-Bush
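Because the docvars of the inaugural corpus carry over into dfmat, you can also group by a document variable instead of by document name. A small sketch (not in the original answer, and assuming a quanteda version where $ accesses docvars), tallying the top collocations by the Party docvar:
# top two collocations per party rather than per document
textstat_frequency(dfmat, groups = dfmat$Party, n = 2)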

Remove words from a dtm

I have created a dtm.
library(tm)
corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)
I used it to remove rare terms.
dtm = removeSparseTerms(dtm, 0.98)
After removeSparseTerms there are still some terms in the dtm which are useless for my analysis.
The tm package has a function to remove words. However, this function can only be applied to a corpus or a vector.
How can I remove defined terms from a dtm?
Here is a small sample of the input data:
samp = dat %>%
  select(Reviews) %>%
  sample_n(20)
dput(samp)
structure(list(Reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"work perfect time", "amaze buy phone smoothly update charm glte yet comparably fast several different provider sims perfectly small size definitely replacemnent simple",
"phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend",
"perfect", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy",
"", "phone verizon contract phone buyer beware", "good phone",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"
)), row.names = c(12647L, 10088L, 14055L, 3720L, 6588L, 10626L,
10362L, 1428L, 12580L, 5381L, 10431L, 2803L, 6644L, 12969L, 348L,
10582L, 3215L, 13358L, 12708L, 7049L), class = "data.frame")
You should try quanteda, which calls a DocumentTermMatrix a "dfm" (document feature matrix) and has more options to trim it to reduce sparsity, including a function dfm_remove() for removing specific features (terms).
If we rename your samp object as dat, then:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus(dat, text_field = "Reviews")
corp
## Corpus consisting of 20 documents and 0 docvars.
tail(texts(corp), 2)
## 12708 7049
## "good phone price fine" "phone star battery little soon yes"
dtm <- dfm(corp)
dtm
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).
Now we can trim this. For this small one, the sparsity setting of 0.98 has no effect, but we can trim based on frequency thresholds.
# does not actually have an effect
dfm_trim(dtm, sparsity = 0.98, verbose = TRUE)
## Note: converting sparsity into min_docfreq = 1 - 0.98 = NULL .
## No features removed.
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).
# trimming based on rare terms
dtm <- dfm_trim(dtm, min_termfreq = 3, verbose = TRUE)
## Removing features occurring:
## - fewer than 3 times: 119
## Total features removed: 119 (93.0%).
head(dtm)
## Document-feature matrix of: 6 documents, 9 features (83.3% sparse).
## 6 x 9 sparse Matrix of class "dfm"
## features
## docs phone screen sim card work good perfect buy never
## 12647 0 0 0 0 0 0 0 0 0
## 10088 0 0 0 0 0 0 0 0 0
## 14055 0 0 0 0 0 0 0 0 0
## 3720 1 0 0 0 0 0 0 0 0
## 6588 1 1 1 1 1 1 0 0 0
## 10626 0 0 0 0 1 0 1 0 0
Anyway, to answer your question directly: you want dfm_remove() to get rid of specific features.
# removing from a specific list of terms
dtm <- dfm_remove(dtm, c("screen", "buy", "sim", "card"), verbose = TRUE)
## removed 4 features
##
dtm
## Document-feature matrix of: 20 documents, 5 features (75.0% sparse).
head(dtm)
## Document-feature matrix of: 6 documents, 5 features (80.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs phone work good perfect never
## 12647 0 0 0 0 0
## 10088 0 0 0 0 0
## 14055 0 0 0 0 0
## 3720 1 0 0 0 0
## 6588 1 1 1 0 0
## 10626 0 1 0 1 0
And finally, if you still really want to, you can convert the dtm into the tm format using quanteda's convert() function:
convert(dtm, to = "tm")
## <<DocumentTermMatrix (documents: 20, terms: 5)>>
## Non-/sparse entries: 25/75
## Sparsity : 75%
## Maximal term length: 7
## Weighting : term frequency (tf)
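As a side note (not part of the original answer), if you would rather stay entirely within tm, a DocumentTermMatrix can also be subset by column index to drop specific terms. A minimal sketch, rebuilding the tm objects from the question:
library(tm)
tm_corpus <- Corpus(VectorSource(dat$Reviews))   # as in the question
dtm_tm <- DocumentTermMatrix(tm_corpus)
remove_terms <- c("screen", "buy", "sim", "card")
# keep only the columns whose terms are not in the removal list
dtm_tm <- dtm_tm[, which(!(Terms(dtm_tm) %in% remove_terms))]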

Feature extraction using Chi2 with Quanteda

I have a dataframe df with this structure:
Rank Review
5 good film
8 very good film
..
Then I tried to create a DocumentTermMatrix using the quanteda package:
mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)
I would like to know how to calculate, for each feature (term), the chi-squared value with respect to the documents, in order to extract the best features in terms of chi-squared value.
Can you help me to resolve this problem, please?
EDIT:
head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
features
docs bon accueil conseillèr efficac écout répond
text1 0 0 0 0 0 0
text2 1 1 1 1 1 1
text3 0 0 0 0 0 0
text4 0 0 0 0 0 0
text5 0 0 1 0 0 0
text6 0 0 0 0 1 0
...
text60300 0 0 1 1 1 1
Here I have my dfm matrix; then I create my tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
I would like to determine the chi-squared value between these features and the documents (here I have 60,300 documents):
textstat_keyness(mydfm, target = 2)
But since I have 60,300 targets, I don't know how to do this automatically.
I see in the quanteda manual that the groups option in the dfm function may resolve this problem, but I don't see how to do it. :(
EDIT 2:
Rank Review
10 always good
1 nice film
3 fine as usual
Here I try to group the documents with dfm:
mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)
But it fails to group the documents.
Can you please help me to resolve this problem?
Thank you.
See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies, e.g.
textstat_keyness(mydfm, target = 1)
for the first document against the frequencies of all others, or
textstat_keyness(mydfm, target = 2)
for the second against all others, etc.
If you want to compare categories of frequencies that group documents, you would need to use the groups = option in dfm(), for a supplied variable or one in the docvars. See the example in ?textstat_keyness.
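To make the grouping concrete for your EDIT 2, here is a sketch (not from the original answer). It assumes your data frame df with Rank and Review columns, and a recent quanteda where textstat_keyness() lives in quanteda.textstats:
library("quanteda")
library("quanteda.textstats")

corp <- corpus(df, text_field = "Review")   # Rank becomes a docvar
mydfm <- dfm(tokens(corp, remove_punct = TRUE))

# collapse documents into one row per Rank, then compare one rank against the rest
dfm_rank <- dfm_group(mydfm, groups = docvars(mydfm, "Rank"))
textstat_keyness(dfm_rank, target = "5")    # chi-squared for rank-5 reviews vs all other ranks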

Converting a term-document matrix stored in a data frame to a TermDocumentMatrix supported by the tm library

I have a CSV file in which all of my documents are stemmed and stored in term-document-matrix form, along with a categorical variable for sentiment.
I'd like to use tm's capabilities (term frequencies, etc.). Is there a way to do so, given the data I started with?
# given:
dtm = read.csv(file_path, na.strings="")
dtm$rating = as.factor(dtm$rating)
str(dtm)
# 'data.frame': 2000 obs. of 2002 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ abl : int 0 0 0 0 0 0 0 0 0 0 ...
# ...
head(dtm)
#ID abl absolut absorb accept
#1 1 0 0 0
#2 2 0 0 1
# I'd like to achieve...
tdm <- TermDocumentMatrix(dtm,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
Can you use as.TermDocumentMatrix(df, weighting = weightTf) (in the R package tm) to do what you seek?
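A minimal sketch of that suggestion, assuming the term counts sit in every column except ID and rating and that the default as.TermDocumentMatrix() method accepts a plain numeric matrix:
library(tm)
counts <- as.matrix(dtm[, setdiff(names(dtm), c("ID", "rating"))])
rownames(counts) <- dtm$ID
# tm stores terms as rows and documents as columns, so transpose before converting
tdm <- as.TermDocumentMatrix(t(counts), weighting = weightTf)
inspect(tdm)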

R text mining: how to segment a document into phrases, not terms

When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But as in Chinese, English also has certain set phrases, such as "semantic distance" or "machine learning"; if you segment them into single words, the meaning is completely different. I want to know how to segment documents into phrases rather than words (terms).
You can do this in R using the quanteda package, which can detect multi-word expressions as statistical collocations; these are probably the multi-word phrases you are referring to in English. To avoid collocations that contain stop words, you would first tokenise the text, then remove the stop words, leaving a "pad" in place to prevent false adjacencies in the results (that is, two words that were not actually adjacent before the stop words between them were removed).
require(quanteda)
pres_tokens <- tokens(data_corpus_inaugural) %>%
  tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%
  tokens_remove(stopwords("english"), padding = TRUE)
pres_collocations <- textstat_collocations(pres_tokens, size = 2)
head(pres_collocations)
# collocation count count_nested length lambda z
# 1 united states 157 0 2 7.893307 41.19459
# 2 let us 97 0 2 6.291128 36.15520
# 3 fellow citizens 78 0 2 7.963336 32.93813
# 4 american people 40 0 2 4.426552 23.45052
# 5 years ago 26 0 2 7.896626 23.26935
# 6 federal government 32 0 2 5.312702 21.80328
# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])
tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon" "shall_endeavor" "high_sense" "official_act"
Using this "compounded" token set, we can now turn this into a document-feature matrix where the features consist of a mixture of original terms (those not found in a collocation) and the collocations. As can be seen below, "united" occurs alone and as part of the collocation "united_states".
pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
# features
# docs united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
# 1789-Washington 4 2 0 0 0 0 0 0 0 0
# 1793-Washington 1 0 0 0 0 0 0 0 0 0
# 1797-Adams 3 9 0 0 0 0 0 0 0 0
# 1801-Jefferson 0 0 0 0 0 0 0 0 0 0
# 1805-Jefferson 1 4 0 0 0 0 0 0 0 0
If you want a more brute-force approach, it's possible simply to create a document-by-bigram matrix this way:
# just form all bigrams
head(dfm(data_inaugural_corpus, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens_of of_the the_senate senate_and and_of the_house
## 1789-Washington 1 20 1 1 2 2
## 1797-Adams 0 29 0 0 2 0
## 1793-Washington 0 4 0 0 1 0
## 1801-Jefferson 0 28 0 0 3 0
## 1805-Jefferson 0 17 0 0 1 0
## 1809-Madison 0 20 0 0 2 0
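In recent quanteda versions the ngrams argument has moved out of dfm(); a sketch of the same brute-force bigram approach with the current API (assumes quanteda >= 3.0, where ngrams are formed at the tokens stage):
bigram_dfm <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_ngrams(n = 2) %>%
  dfm()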
