How to maintain ngrams in a quanteda dfm? - r

I'm using quanteda to create a document-feature matrix (dfm) from a tokens object. My tokens object contains many ngrams (e.g., "united_states"). When I create a dfm using the dfm() function, my ngrams are split at the underscore ("united_states" gets split into "united" "states"). How can I create a dfm while keeping my ngrams intact?
Here's my process:
my_tokens <- tokens(my_corpus, remove_symbols = TRUE, remove_punct = TRUE, remove_numbers = TRUE)
my_tokens <- tokens_compound(my_tokens, pattern = phrase(my_ngrams))
my_dfm <- dfm(my_tokens, stem = FALSE, tolower = TRUE)
I see "united_states" in my_tokens, but in the dfm it becomes "united" and "states" as separate tokens.
Thank you for any help you can offer!

It's not clear which version of quanteda you are using, but basically this should work, since the default tokenizer (from tokens()) will not split words containing an inner _.
Demonstration:
library("quanteda")
## Package version: 2.1.1
# tokens() will not separate _ words
tokens("united_states")
## Tokens consisting of 1 document.
## text1 :
## [1] "united_states"
Here's a reproducible example for the phrase "United States":
my_corpus <- tail(data_corpus_inaugural, 3)
# show that the phrase exists
head(kwic(my_corpus, phrase("united states"), window = 2))
##
## [2009-Obama, 2685:2686] bless the | United States | of America
## [2013-Obama, 13:14] of the | United States | Congress,
## [2013-Obama, 2313:2314] bless these | United States | of America
## [2017-Trump, 347:348] , the | United States | of America
## [2017-Trump, 1143:1144] to the | United States | of America
my_tokens <- tokens(my_corpus,
remove_symbols = TRUE,
remove_punct = TRUE, remove_numbers = TRUE
)
my_tokens <- tokens_compound(my_tokens, pattern = phrase("united states"))
my_dfm <- dfm(my_tokens, stem = FALSE, tolower = TRUE)
dfm_select(my_dfm, "*_*")
## Document-feature matrix of: 3 documents, 1 feature (0.0% sparse) and 4 docvars.
## features
## docs united_states
## 2009-Obama 1
## 2013-Obama 2
## 2017-Trump 2
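If you are on quanteda 3.x, note that dfm() no longer takes a stem argument; a rough sketch of the same pipeline on the newer API (with stemming applied separately via dfm_wordstem() only if you actually want it) is:
packageVersion("quanteda")  # worth checking first if you still see the split
my_tokens <- tokens(my_corpus, remove_symbols = TRUE, remove_punct = TRUE, remove_numbers = TRUE)
my_tokens <- tokens_compound(my_tokens, pattern = phrase("united states"))
my_dfm <- dfm(my_tokens, tolower = TRUE)   # use dfm_wordstem(my_dfm) afterwards if stemming is wanted
dfm_select(my_dfm, "*_*")                  # "united_states" survives as a single feature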

Related

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
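If you want a single score per document rather than raw counts, one option (a sketch, not part of the original answer) is to convert that dfm to a data frame and take positive minus negative:
sentdf <- convert(dfm(tokens_lookup(toks, data_dictionary_LSD2015)), to = "data.frame")
sentdf$net <- sentdf$positive - sentdf$negative
sentdf[, c("doc_id", "positive", "negative", "net")]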
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which cleaning steps to apply before running kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), here a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first).
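To follow through on that last suggestion, here is a sketch (using the LSD2015 dictionary from the first answer) that pastes the windows back together and scores them:
kw <- as.data.frame(x)
window_text <- paste(kw$pre, kw$keyword, kw$post)
dfm(tokens_lookup(tokens(window_text), data_dictionary_LSD2015))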

quanteda collocations and lemmatization

I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) Preprocess the text, identify collocations, and lemmatize for downstream tasks.
# I used a blank space as the concatenator and the phrase() function as explained in the
# documentation, following the multi-multi substitution example at
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) test results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature        count
this               1
column             1
has                1
a                  2
lot                1
of                 1
almost             1
i                  2
am                 1
interested         1
in                 1
problems           1
is                 1
headache           1
how                1
do                 1
you                1
handle             1
missing data       4
"missing data" should be "miss datum".
This is only works if each document in df is a single word. I can make the process work if I generate my collocations using a token object from the get-go but that's not what I want.
The problem is that you have already compounded the elements of the collocation into a single "token" containing a space, but by supplying the phrase() wrapper in tokens_replace(), you are telling it to look for sequences of separate tokens, not the single token containing a space.
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
Alternatives would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"

Discard longer dictionary matches which contain a nested target word

I am using tokens_lookup to see whether some texts contain the words in my dictionary. Now I am trying to find a way to discard the matches that occur when the dictionary word is nested in a longer, fixed sequence of words. To make an example, suppose that Ireland is in the dictionary: I would like to exclude the cases where, for instance, Northern Ireland is mentioned (or, in general, any fixed expression that contains a dictionary word, such as Great Britain containing Britain). The only indirect solution I figured out is to build another dictionary with these sets of words (e.g. Great Britain). However, this solution would not work when both Britain and Great Britain are cited. Thank you.
library("quanteda")
dict <- dictionary(list(IE = "Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = dict)
You can do this by specifying another dictionary key for "Northern Ireland", with the value also "Northern Ireland". If you use the argument nested_scope = "dictionary" in tokens_lookup(), then this will match the longer phrase first and only once, separating "Ireland" from "Northern Ireland". By using the same key as the value, you replace it like for like (with the side benefit of now having the two tokens "Northern" and "Ireland" combined into a single token).
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"))
txt <- c(
doc1 = "Ireland lorem ipsum",
doc2 = "Lorem ipsum Northern Ireland",
doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)
tokens_lookup(toks,
dictionary = dict, exclusive = FALSE,
nested_scope = "dictionary", capkeys = FALSE
)
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE" "lorem" "ipsum"
##
## doc2 :
## [1] "Lorem" "ipsum" "Northern Ireland"
##
## doc3 :
## [1] "IE" "lorem" "ipsum" "Northern Ireland"
Here I used exclusive = FALSE for illustration purposes, so you could see what got looked up and replaced. You can remove that and the capkeys argument when you run it.
If you want to discard the "Northern Ireland" tokens, just use
tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary") %>%
tokens_remove("Northern Ireland")
## Tokens consisting of 3 documents.
## doc1 :
## [1] "IE"
##
## doc2 :
## character(0)
##
## doc3 :
## [1] "IE"

Find in a dfm non-english tokens and remove them

In a dfm, how is it possible to detect non-English words and remove them?
dftest <- data.frame(id = 1:3,
text = c("Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"))
For example, the construction of the dfm is this:
testDfm <- dftest$text %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    tokens_wordstem() %>%
    dfm()
I found the textcat package as an alternative solution, but in a real dataset there are many cases where a whole row that is in English is recognized as another language just because of a single character. Is there any alternative way to find non-English rows in a data frame, or non-English tokens in the dfm, using quanteda?
You can do this using a word list of all English words. One place where such a list exists is in the hunspell package, which is meant for spell checking.
library(quanteda)
# find the path in which the right dictionary file is stored
hunspell::dictionary(lang = "en_US")
#> <hunspell dictionary>
#> affix: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.aff
#> dictionary: /home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic
#> encoding: UTF-8
#> wordchars: ’
#> added: 0 custom words
# read this into a vector
english_words <- readLines("/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/hunspell/dict/en_US.dic") %>%
# the vector contains extra information on the words, which is removed
gsub("/.+", "", .)
# let's display a sample of the words
set.seed(1)
sample(english_words, 50)
#> [1] "furnace" "steno" "Hadoop" "alumna"
#> [5] "gonorrheal" "multichannel" "biochemical" "Riverside"
#> [9] "granddad" "glum" "exasperation" "restorative"
#> [13] "appropriate" "submarginal" "Nipponese" "hotting"
#> [17] "solicitation" "pillbox" "mealtime" "thunderbolt"
#> [21] "chaise" "Milan" "occidental" "hoeing"
#> [25] "debit" "enlightenment" "coachload" "entreating"
#> [29] "grownup" "unappreciative" "egret" "barre"
#> [33] "Queen" "Tammany" "Goodyear" "horseflesh"
#> [37] "roar" "fictionalization" "births" "mediator"
#> [41] "resitting" "waiter" "instructive" "Baez"
#> [45] "Muenster" "sleepless" "motorbike" "airsick"
#> [49] "leaf" "belie"
Armed with this vector, which should in theory contain all English words and only English words, we can remove non-English tokens:
testDfm <- dftest$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_keep(english_words, valuetype = "fixed") %>%
tokens_wordstem() %>%
dfm()
testDfm
#> Document-feature matrix of: 3 documents, 9 features (66.7% sparse).
#> features
#> docs this a spanish word english onlin here comment va
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 0 0
#> text3 0 0 0 0 0 0 0 1 1
As you can see, this works pretty well but isn't perfect. The "va" from "ça va" has been picked up as an English word, as has "comment". What you want to do is thus a matter of finding the right word list and/or cleaning it. You can also think about removing texts in which too many words have been removed.
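If you would rather not hard-code the path to the .dic file, a variant of the same idea (a sketch, assuming the hunspell package is installed) is to check the unique token types with hunspell::hunspell_check() and keep only the ones it recognises:
toks <- tokens(dftest$text,
    remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# check every unique token type against the en_US dictionary
all_types <- types(toks)
english_types <- all_types[hunspell::hunspell_check(all_types,
    dict = hunspell::dictionary("en_US"))]
testDfm <- dfm(tokens_wordstem(tokens_keep(toks, english_types, valuetype = "fixed")))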
The question is not entirely clear as to whether you want to remove non-English "rows" first, or remove non-English words later. There are a lot of cognates between European languages (words that are homographs appearing in more than one language) so the tokens_keep() strategy will be imperfect.
You could remove the non-English documents after detecting the language, using the cld3 library:
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
And then input that into quanteda::dfm().
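For completeness, a sketch of feeding that subset into a dfm (column names follow the example data above):
eng <- subset(dftest, cld3::detect_language(dftest$text) == "en")
dfm(tokens(eng$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE))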

Implementing N-grams in my corpus, Quanteda Error

I am trying to implement quanteda on my corpus in R, but I am getting:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
duplicate row.names: character(0)
I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
library(tm)
library(quanteda)
tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors = FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)
The processing that you're doing with tm is preparing an object for tm, and quanteda doesn't know what to do with it. quanteda does all of these steps itself, as you can see from the options listed in help("dfm").
If you try the following you can move ahead:
dfm(tweets$Tweet, verbose = TRUE, toLower= TRUE, removeNumbers = TRUE, removePunct = TRUE,removeTwitter = TRUE, language = "english", ignoredFeatures=stopwords("english"), stem=TRUE)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 6,943 documents
... indexing features: 15,164 feature types
... removed 161 features, from 174 supplied (glob) feature types
... stemming features (English), trimmed 2175 feature variants
... created a 6943 x 12828 sparse dfm
... complete.
Elapsed time: 0.756 seconds.
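Note that the call above uses argument names from an old quanteda release; on quanteda 3.x a roughly equivalent pipeline (a sketch, output omitted) would be:
dfmat <- tweets$Tweet %>%
    tokens(remove_numbers = TRUE, remove_punct = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem() %>%
    dfm()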
HTH
No need to start with the tm package, or even to use read.csv() at all - this is what the quanteda companion package readtext is for.
So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
##
## Text Types Tokens Sentences Sentiment Sentiment_Confidence
## text1 19 21 1 2 0.7579
## text2 18 20 2 2 0.8775
## text3 23 24 1 -1 0.6805
## text5 17 19 2 0 1.0000
## text4 18 19 1 -1 0.8820
##
## Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
From there, you can perform all of the pre-processing steps directly in the dfm() call, including the choice of ngrams:
# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete.
## Elapsed time: 0.662 seconds.
# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete.
## Elapsed time: 1.419 seconds.
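As with the previous answer, the remove = and ngrams = arguments belong to an older quanteda API; on current versions you would form the bigrams on the tokens object with tokens_ngrams() before calling dfm(). One reasonable ordering (a sketch, not the answer's original code) is:
dfm2_new <- tokens(myCorpus, remove_punct = TRUE) %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem() %>%
    tokens_ngrams(n = 2) %>%
    dfm()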
