Discard longer dictionary matches which contain a nested target word - r

I am using tokens_lookup to see whether some texts contain the words in my dictionary. Now I am trying to find a way to discard the matches that occur when the dictionary word is part of a longer, fixed sequence of words. To give an example, suppose that Ireland is in the dictionary. I would like to exclude the cases where, for instance, Northern Ireland is mentioned (or, if Britain were in the dictionary, any fixed set of words that contains Britain, such as Great Britain). The only indirect solution I have figured out is to build another dictionary with these sets of words (e.g. Great Britain). However, this solution would not work when both Britain and Great Britain are cited. Thank you.
library("quanteda")
dict <- dictionary(list(IE = "Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = dict)

You can do this by specifying another dictionary key for "Northern Ireland", with the value also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), then this will match the longer phrase first and only once, separating "Ireland" from "Northern Ireland". By using the same key as the value, you replace it like for like (with the side benefit of now having the two tokens "Northern" and "Ireland" combined into a single token).
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks,
dictionary = dict, exclusive = FALSE,
nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE" "lorem" "ipsum"
##
## doc2 :
## [1] "Lorem" "ipsum" "Northern Ireland"
##
## doc3 :
## [1] "IE" "lorem" "ipsum" "Northern Ireland"
Here I used exclusive = FALSE for illustration purposes, so you could see what got looked up and replaced. You can remove that and the capkeys argument when you run it.
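For reference, this is roughly what the default (exclusive) call looks like once those two arguments are dropped, reusing the toks and dict objects from above; the keys are uppercased because capkeys defaults to TRUE:
tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
##
## doc2 :
## [1] "NORTHERN IRELAND"
##
## doc3 :
## [1] "IE" "NORTHERN IRELAND"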
If you want to discard the "Northern Ireland" tokens, just use
tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
##
## doc2 :
## character(0)
##
## doc3 :
## [1] "IE"

Related

How to remove punctuation from tokens, when quanteda tokenizes at sentence level?

My ultimate goal is to select sentences from a corpus that match a certain pattern and to perform a sentiment analysis on these selected excerpts. I am trying to do all of that with a current version of quanteda in R.
I noticed that remove_punct does not remove punctuation when tokens() is applied at the sentence level (what = "sentence"). When decomposing the selected sentence tokens into word tokens for the sentiment analysis, the word tokens still contain punctuation such as "," or ".". Dictionaries are then no longer able to match on these tokens. Reproducible example:
mypattern <- c("country", "honor")
#
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.",
blind <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.")
#
toks <- tokens_select(tokens(txt, what = "sentence", remove_punct = TRUE),
pattern = paste0(mypattern, collapse = "|"),
valuetype = "regex",
selection = "keep")
#
toks
For instance, the tokens in toks contain "citizens," or "arrive,". I thought about splitting the tokens back into word tokens with tokens_split(toks, separator = " "), but separator accepts only a single input parameter.
Is there a way to remove the punctuation from the sentences when tokenizing at the sentence-level?
There is a better way to achieve your goal, which is to perform sentiment analysis on just the sentences containing your target pattern. You can do this by first reshaping your corpus into sentences, then tokenising them, and then using tokens_select() with the window argument to keep tokens only in those sentence-documents that contain the pattern. Here you set the window so large that it will always include the entire sentence.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.
When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.
Lorem ipsum dolor sit amet.")
corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_sent
#> Corpus consisting of 3 documents.
#> text1.1 :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
#> text1.2 :
#> "When the occasion proper for it shall arrive, I shall endeav..."
#>
#> text1.3 :
#> "Lorem ipsum dolor sit amet."
# sentiment on just the documents with the pattern
mypattern <- c("country", "honor")
toks <- tokens(corp_sent) %>%
tokens_select(pattern = mypattern, window = 10000000)
toks
#> Tokens consisting of 3 documents.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 11 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 12 more ]
#>
#> text1.3 :
#> character(0)
# now perform sentiment analysis on the selected tokens
tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
dfm()
#> Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs negative positive neg_positive neg_negative
#> text1.1 0 0 0 0
#> text1.2 0 5 0 0
#> text1.3 0 0 0 0
Created on 2022-03-22 by the reprex package (v2.0.1)
Note that if you want to exclude the sentences that were empty, just drop the zero-token documents with dfm_subset(dfmat, ntoken(dfmat) > 0), where dfmat is your saved sentiment analysis dfm.
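For instance, a minimal sketch, assuming the sentiment dfm from the pipeline above has been saved under the (hypothetical) name dfmat:
dfmat <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
  dfm()
dfm_subset(dfmat, ntoken(dfmat) > 0)  # keeps only documents with a nonzero row total
Note that in this small example text1.1 would also be dropped, since it happens to contain no sentiment terms.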

quanteda collocations and lemmatization

I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) preprocess text and identify collocations and lemmatize for downstream tasks.
# I used a blank space as the concatenator and the phrase() function as explained in the documentation, and I followed the multi-multi substitution example in the documentation:
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) test results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature        count
this               1
column             1
has                1
a                  2
lot                1
of                 1
almost             1
i                  2
am                 1
interested         1
in                 1
problems           1
is                 1
headache           1
how                1
do                 1
you                1
handle             1
missing data       4
"missing data" should be "miss datum".
This only works if each document in df is a single word. I can make the process work if I generate my collocations using a tokens object from the get-go, but that's not what I want.
The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by supplying the phrase() wrapper to tokens_replace(), you are telling it to look for two sequential tokens rather than the single token containing a space.
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
Alternatives would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"

How to maintain ngrams in a quanteda dfm?

I'm using quanteda to create a document feature matrix (dfm) from a tokens object. My tokens object contains many ngrams (ex: "united_states"). When I create a dfm using the dfm() function, my ngrams are split by the understcore ("united_states" gets split into "united" "states"). How can I create a dfm while maintaining my ngrams?
Here's my process:
my_tokens <- tokens(my_corpus, remove_symbols=TRUE, remove_punct = TRUE, remove_numbers = TRUE)
my_tokens <- tokens_compound(my_tokens, pattern = phrase(my_ngrams))
my_dfm <- dfm(my_tokens, stem= FALSE, tolower=TRUE)
I see "united_states" in my_tokens, but in the dfm it becomes "united" and "states" as separate tokens.
Thank you for any help you can offer!
It's not clear which version of quanteda you are using, but basically this should work, since the default tokenizer (from tokens()) will not split words containing an inner _.
Demonstration:
library("quanteda")
## Package version: 2.1.1
# tokens() will not separate _ words
tokens("united_states")
## Tokens consisting of 1 document.
## text1 :
## [1] "united_states"
Here's a reproducible example for the phrase "United States":
my_corpus <- tail(data_corpus_inaugural, 3)
# show that the phrase exists
head(kwic(my_corpus, phrase("united states"), window = 2))
##
## [2009-Obama, 2685:2686] bless the | United States | of America
## [2013-Obama, 13:14] of the | United States | Congress,
## [2013-Obama, 2313:2314] bless these | United States | of America
## [2017-Trump, 347:348] , the | United States | of America
## [2017-Trump, 1143:1144] to the | United States | of America
my_tokens <- tokens(my_corpus,
remove_symbols = TRUE,
remove_punct = TRUE, remove_numbers = TRUE
)
my_tokens <- tokens_compound(my_tokens, pattern = phrase("united states"))
my_dfm <- dfm(my_tokens, stem = FALSE, tolower = TRUE)
dfm_select(my_dfm, "*_*")
## Document-feature matrix of: 3 documents, 1 feature (0.0% sparse) and 4 docvars.
## features
## docs united_states
## 2009-Obama 1
## 2013-Obama 2
## 2017-Trump 2

how to extract ngrams from a text in R (newspaper articles)

I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:
dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE)
I am trying to extract bigrams (e.g. "climate change", "global warming") but keep getting an error message when I type the following, saying the ngrams argument is not used.
dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE, ngrams = 2)
I have installed the tokenizer, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries.
In my corpus, Doc_iD contains the article titles, and I need the bigrams to be extracted from the "texts" column.
Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?
Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But it sounds like you would rather get more interesting multi-word expressions than "of the", etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm - the dfm will have already split your tokens into bag-of-words features, from which ngrams or MWEs can no longer be formed.
Here's an example from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.
library("quanteda")
## Package version: 2.0.1
toks <- tokens(data_corpus_inaugural) %>%
tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 united states 157 0 2 7.893348 41.19480
## 2 let us 97 0 2 6.291169 36.15544
## 3 fellow citizens 78 0 2 7.963377 32.93830
## 4 american people 40 0 2 4.426593 23.45074
## 5 years ago 26 0 2 7.896667 23.26947
## 6 federal government 32 0 2 5.312744 21.80345
These are by default scored and sorted in order of descending score.
To "extract" them, just take the collocation column:
head(colls$collocation, 50)
## [1] "united states" "let us" "fellow citizens"
## [4] "american people" "years ago" "federal government"
## [7] "almighty god" "general government" "fellow americans"
## [10] "go forward" "every citizen" "chief justice"
## [13] "four years" "god bless" "one another"
## [16] "state governments" "political parties" "foreign nations"
## [19] "solemn oath" "public debt" "religious liberty"
## [22] "public money" "domestic concerns" "national life"
## [25] "future generations" "two centuries" "social order"
## [28] "passed away" "good faith" "move forward"
## [31] "earnest desire" "naval force" "executive department"
## [34] "best interests" "human dignity" "public expenditures"
## [37] "public officers" "domestic institutions" "tariff bill"
## [40] "first time" "race feeling" "western hemisphere"
## [43] "upon us" "civil service" "nuclear weapons"
## [46] "foreign affairs" "executive branch" "may well"
## [49] "state authorities" "highest degree"
I think you need to create the ngrams directly from the corpus. This is an example adapted from the quanteda tutorial website:
library(quanteda)
corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)
tokens_ngrams(toks, n = 2)
Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens_of" "of_the" "the_Senate" "Senate_and" "and_of" "of_the" "the_House"
[8] "House_of" "of_Representatives" "Representatives_:" ":_Among" "Among_the"
[ ... and 1,524 more ]
EDITED: this example from the dfm() help page may be useful:
library(quanteda)
# You say you're already creating the corpus?
# where it says "data_corpus_inaugural" put your corpus name
# where it says "the_senate" put "climate_change"
# where it says "the_house" put "global_warming"
tokens(data_corpus_inaugural) %>%
tokens_ngrams(n = 2) %>%
dfm(stem = TRUE, select = c("the_senate", "the_house"))
#> Document-feature matrix of: 58 documents, 2 features (89.7% sparse) and 4 docvars.
#> features
#> docs the_senat the_hous
#> 1789-Washington 1 2
#> 1793-Washington 0 0
#> 1797-Adams 0 0
#> 1801-Jefferson 0 0
#> 1805-Jefferson 0 0
#> 1809-Madison 0 0
#> [ reached max_ndoc ... 52 more documents ]

substituting several ngrams in quanteda

In my text of news articles I would like to convert several different ngrams that refer to the same political party to an acronym. I would like to do this because I would like to avoid any sentiment dictionaries confusing the words in the party's name (Liberal Party) with the same word in different contexts (liberal helping).
I can do this below with str_replace_all and I know about the tokens_compound() function in quanteda, but it doesn't seem to do exactly what I need.
library(stringr)
text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')
Should I somehow just preprocess the text before turning it into a corpus? Or is there a way to do this after turning it into a corpus in quanteda.
Here is some expanded sample code that specifies the problem a little better:
text<-c('a text about some political parties called the new democratic party
the new democrats and the liberal party and the liberals. I would like the
word democratic to be counted in the dfm but not the words new democratic.
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
olp = c("liberal party", "liberals"),
ndp = c("new democrats", "new democratic party"),
sentiment=c('liberal', 'democratic')
))
dfm(text, dictionary=partydict)
This example counts democratic in both the new democratic and the democratic sense, but I would like those counted separately.
You want the function tokens_lookup(), after having defined a dictionary that defines the canonical party labels as keys, and lists all the ngram variations of the party names as values. By setting exclusive = FALSE it will keep the tokens that are not matched, in effect acting as a substitution of all variations with the canonical party names.
In the example below, I've modified your input text a bit to illustrate the ways that the party names will be combined to be different from the phrases using "liberal" but not "liberal party".
library("quanteda")
text<-c('a text about some political parties called the new democratic party
which is conservative the new democrats and the liberal party and the
liberals which are liberal helping poor people')
toks <- tokens(text)
partydict <- dictionary(list(
olp = c("liberal party", "the liberals"),
ndp = c("new democrats", "new democratic party")
))
(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
## [1] "a" "text" "about" "some" "political" "parties"
## [7] "called" "the" "NDP" "which" "is" "conservative"
## [13] "the" "NDP" "and" "the" "OLP" "and"
## [19] "OLP" "which" "are" "liberal" "helping" "poor"
## [25] "people"
So that has replaced the party name variances with the party keys.
Constructing a dfm from this new tokens now occurs on these new tokens, preserving the uses of (e.g.) "liberal" that might be linked to sentiment, but having already combined the "liberal party" and replaced it with "OLP". Applying a dictionary to the dfm will now work for your example of "liberal" in "liberal helping" without having confused it with the use of "liberal" in the party name.
sentdict <- dictionary(list(
left = c("liberal", "left"),
right = c("conservative", "")
))
dfm(toks2) %>%
dfm_lookup(dictionary = sentdict, exclusive = FALSE)
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
## features
## docs olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
## text1 2 2 1 1 1 1 1 1 1 3 2 1 1 2 1 1 1
## features
## docs poor people
## text1 1 1
Two additional notes:
If you do not want the keys uppercased in the replacement tokens, set capkeys = FALSE (see the sketch below).
You can set different matching types using the valuetype argument, including valuetype = "regex". (Note, too, that regular expressions like the ones in your example are easy to get wrong, because the scope of the | operator depends on how you group with parentheses; with tokens_lookup() you won't need to worry about that.)
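A minimal sketch of the first note, reusing the toks and partydict objects from above (output abbreviated):
tokens_lookup(toks, partydict, exclusive = FALSE, capkeys = FALSE)
## same tokens as toks2 above, but with lowercase "ndp" and "olp" in place of "NDP" and "OLP"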
