I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) preprocess text and identify collocations and lemmatize for downstream tasks.
# I used a blank space as concatenator and the phrase function as explained in the documentation and I followed the multi multi substitution example in the documentation
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) test results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature
count
this
1
column
1
has
1
a
2
lot
1
of
1
almost
1
i
2
am
1
interested
1
in
1
problems
1
is
1
headache
1
how
1
do
1
you
1
handle
1
missing data
4
"missing data" should be "miss datum".
This is only works if each document in df is a single word. I can make the process work if I generate my collocations using a token object from the get-go but that's not what I want.
The problem is that you have already compounded the elements of the collocations into a single "token" containing a space, but by supplying the phrase() wrapper in tokens_compound(), you are telling tokens_replace() to look for two sequential tokens, not the one with a space.
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
Alternatives would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"
Related
This is my first time asking a question on here so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interest in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated !!!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in console) to see what kind of patterns you can use. With tokens you can specify which data cleaning you want to use before using kwic. In my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s). In this example a window of length 3. After that you can do some form of sentiment analyses on the pre and post results (or paste them together first).
For some reason, stop words is not working for my corpus, entirely in French. I've been trying repeatedly over the past few days, but many words that should have been filtered simply are not. I am not sure if anyone else has a similar issue? I read somewhere that it could be because of the accents. I tried stringi::stri_trans_general(x, "Latin-ASCII") but I am not certain I did this correctly. Also, I notice that French stop words are sometimes referred to as "french" or "fr".
This is one example of code I tried, I would be extremely grateful for any advice.
I also manually installed quanteda, because I had difficulties downloading it, so it could be linked to that.
text_corp <- quanteda::corpus(data,
text_field="text")
head(stopwords("french"))
summary(text_corp)
my_dfm <- dfm(text_corp)
myStemMat <- dfm(text_corp, remove = stopwords("french"), stem = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE)
myStemMat[, 1:5]
topfeatures(myStemMat 20)
In this last step, there are still words like "etre" (to be), "plus" (more), comme ("like"), avant ("before"), avoir ("to have")
I also tried to filter stop words in a different way, through token creation:
tokens <-
tokens(
text_corp,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
split_hyphens = TRUE,
include_docvars = TRUE,
)
mydfm <- dfm(tokens,
tolower = TRUE,
stem = TRUE,
remove = stopwords("french")
)
topfeatures(mydfm, 20)
The stopwords are working just fine, however the default Snowball list of French stopwords simply does not include the words you wish to remove.
You can see that by inspecting the vector of stopwords returned by stopwords("fr"):
library("quanteda")
## Package version: 2.1.2
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr")
## [1] FALSE FALSE FALSE FALSE FALSE
This is the full list of words:
sort(stopwords("fr"))
## [1] "à" "ai" "aie" "aient" "aies" "ait"
## [7] "as" "au" "aura" "aurai" "auraient" "aurais"
## [13] "aurait" "auras" "aurez" "auriez" "aurions" "aurons"
## [19] "auront" "aux" "avaient" "avais" "avait" "avec"
## [25] "avez" "aviez" "avions" "avons" "ayant" "ayez"
## [31] "ayons" "c" "ce" "ceci" "cela" "celà"
## [37] "ces" "cet" "cette" "d" "dans" "de"
## [43] "des" "du" "elle" "en" "es" "est"
## [49] "et" "étaient" "étais" "était" "étant" "été"
## [55] "étée" "étées" "étés" "êtes" "étiez" "étions"
## [61] "eu" "eue" "eues" "eûmes" "eurent" "eus"
## [67] "eusse" "eussent" "eusses" "eussiez" "eussions" "eut"
## [73] "eût" "eûtes" "eux" "fûmes" "furent" "fus"
## [79] "fusse" "fussent" "fusses" "fussiez" "fussions" "fut"
## [85] "fût" "fûtes" "ici" "il" "ils" "j"
## [91] "je" "l" "la" "le" "les" "leur"
## [97] "leurs" "lui" "m" "ma" "mais" "me"
## [103] "même" "mes" "moi" "mon" "n" "ne"
## [109] "nos" "notre" "nous" "on" "ont" "ou"
## [115] "par" "pas" "pour" "qu" "que" "quel"
## [121] "quelle" "quelles" "quels" "qui" "s" "sa"
## [127] "sans" "se" "sera" "serai" "seraient" "serais"
## [133] "serait" "seras" "serez" "seriez" "serions" "serons"
## [139] "seront" "ses" "soi" "soient" "sois" "soit"
## [145] "sommes" "son" "sont" "soyez" "soyons" "suis"
## [151] "sur" "t" "ta" "te" "tes" "toi"
## [157] "ton" "tu" "un" "une" "vos" "votre"
## [163] "vous" "y"
That's why they are not removed. We can see this with an example I created, using many of your words:
toks <- tokens("Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
remove_punct = TRUE
)
tokens_remove(toks, stopwords("fr"))
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "avoir" "glace" "être" "heureux" "comme" "enfant"
## [8] "avant" "dîner"
How to remove them? Either use a more complete list of stopwords, or customize the Snowball list by appending the stopwords you want to the existing ones.
mystopwords <- c(stopwords("fr"), "comme", "avoir", "plus", "avant", "être")
tokens_remove(toks, mystopwords)
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "glace" "heureux" "enfant" "dîner"
You could also use one of the other stopword sources, such as the "stopwords-iso", which does contain all of the words you wish to remove:
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr", source = "stopwords-iso")
## [1] TRUE TRUE TRUE TRUE TRUE
With regard to the language question, see the help for ?stopwords::stopwords, which states:
The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English names of the stopwords from the quanteda package may also be used, although these are deprecated.
With regard to what you tried with stringi::stri_trans_general(x, "Latin-ASCII"), this would only help you if you wanted to remove "etre" and your stopword list contained only "être". In the example below, the stopword vector containing the accented character is concatenated with a version of itself in which the accents have been removed.
sw <- "être"
tokens("etre être heureux") %>%
tokens_remove(sw)
## Tokens consisting of 1 document.
## text1 :
## [1] "etre" "heureux"
tokens("etre être heureux") %>%
tokens_remove(c(sw, stringi::stri_trans_general(sw, "Latin-ASCII")))
## Tokens consisting of 1 document.
## text1 :
## [1] "heureux"
c(sw, stringi::stri_trans_general(sw, "Latin-ASCII"))
## [1] "être" "etre"
What's the Quanteda way of cleaning a corpus like shown in the example below using tm (lowercase, remove punct., remove numbers, stem words)? To be clear, I don't want to create a document-feature matrix with dfm(), I just want a clean corpus that I can use for a specific downstream task.
# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
PS I am aware that I could just do quanteda_corpus <- quanteda::corpus(crude)to get what I want, but I would much prefer being able to do everything in Quanteda.
I think what you want to do is deliberately impossible in quanteda.
You can, of course, do the cleaning quite easily without losing the order of words using the tokens* set of functions:
library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_wordstem()
print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> [1] "Diamond" "Shamrock" "Corp" "said" "that" "effect"
#> [7] "today" "it" "had" "cut" "it" "contract"
#> [ ... and 78 more ]
#>
#> reut-00002.xml :
#> [1] "OPEC" "may" "be" "forc" "to" "meet" "befor"
#> [8] "a" "schedul" "June" "session" "to"
#> [ ... and 427 more ]
#>
#> reut-00004.xml :
#> [1] "Texaco" "Canada" "said" "it" "lower" "the"
#> [7] "contract" "price" "it" "will" "pay" "for"
#> [ ... and 40 more ]
#>
#> [ reached max_ndoc ... 17 more documents ]
But it is not possible to return this tokens object into a corpus. Now it would be possible to write a new function to do this:
corpus.tokens <- function(x, ...) {
quanteda:::build_corpus(
unlist(lapply(x, paste, collapse = " ")),
docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
)
}
corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#>
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#>
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#>
#> [ reached max_ndoc ... 17 more documents ]
But this object, while technically being a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level
metadata, and default settings for subsequent processing of the
corpus.
The object above does not meet this description as the original texts have been processed already. Yet the class of the object communicates otherwise. I don't see a reason to break this logic as all subsequent analyses steps should be possible using either tokens* or dfm* functions.
I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:
dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE)
I am trying to extract bigrams (e.g. "climate change", "global warming") but keep getting an error message when I type the following, saying the ngrams argument is not used.
dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE, ngrams = 2)
I have installed the tokenizer, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries.
Below is a screenshot of my corpus.
Doc_iD is the article titles. I need the bigrams to be extracted from the "texts" column.
Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?
Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But sounds like you rather get more interesting multi-word expressions than "of the" etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm - the dfm will have already split your tokens into bag of words features, from which ngrams or MWEs can no longer be formed.
Here's an example from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.
library("quanteda")
## Package version: 2.0.1
toks <- tokens(data_corpus_inaugural) %>%
tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 united states 157 0 2 7.893348 41.19480
## 2 let us 97 0 2 6.291169 36.15544
## 3 fellow citizens 78 0 2 7.963377 32.93830
## 4 american people 40 0 2 4.426593 23.45074
## 5 years ago 26 0 2 7.896667 23.26947
## 6 federal government 32 0 2 5.312744 21.80345
These are by default scored and sorted in order of descending score.
To "extract" them, just take the collocation column:
head(colls$collocation, 50)
## [1] "united states" "let us" "fellow citizens"
## [4] "american people" "years ago" "federal government"
## [7] "almighty god" "general government" "fellow americans"
## [10] "go forward" "every citizen" "chief justice"
## [13] "four years" "god bless" "one another"
## [16] "state governments" "political parties" "foreign nations"
## [19] "solemn oath" "public debt" "religious liberty"
## [22] "public money" "domestic concerns" "national life"
## [25] "future generations" "two centuries" "social order"
## [28] "passed away" "good faith" "move forward"
## [31] "earnest desire" "naval force" "executive department"
## [34] "best interests" "human dignity" "public expenditures"
## [37] "public officers" "domestic institutions" "tariff bill"
## [40] "first time" "race feeling" "western hemisphere"
## [43] "upon us" "civil service" "nuclear weapons"
## [46] "foreign affairs" "executive branch" "may well"
## [49] "state authorities" "highest degree"
I think you need to create the ngram directly from the corpus. This is an example adapted from the quanteda tutorial website:
library(quanteda)
corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)
tokens_ngrams(toks, n = 2)
Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens_of" "of_the" "the_Senate" "Senate_and" "and_of" "of_the" "the_House"
[8] "House_of" "of_Representatives" "Representatives_:" ":_Among" "Among_the"
[ ... and 1,524 more ]
EDITED Hi this example from the help dfm may be useful
library(quanteda)
# You say you're already creating the corpus?
# where it says "data_corpus_inaugaral" put your corpus name
# Where is says "the_senate" put "climate change"
# where is says "the_house" put "global_warming"
tokens(data_corpus_inaugural) %>%
tokens_ngrams(n = 2) %>%
dfm(stem = TRUE, select = c("the_senate", "the_house"))
#> Document-feature matrix of: 58 documents, 2 features (89.7% sparse) and 4 docvars.
#> features
#> docs the_senat the_hous
#> 1789-Washington 1 2
#> 1793-Washington 0 0
#> 1797-Adams 0 0
#> 1801-Jefferson 0 0
#> 1805-Jefferson 0 0
#> 1809-Madison 0 0
#> [ reached max_ndoc ... 52 more documents ]
I'm using the quanteda dictionary lookup. I'll try to formulate entries where i can lookup logical combinations of words.
For example:
Teddybear = (fluffy AND adorable AND soft)
Is this possible? I only found a solution yet to test for phrases like (Teddybear = (soft fluffy adorable)). But then it has to be the exact phrase match in the text. But how can I get results neglecting the order of the words?
This is not currently something that is directly possible in quanteda (v1.2.0). However, there are workarounds in which you create dictionary sequences that are permutations of your desired sequence. Here is one such solution.
First, I will create some example texts. Note that the sequences are separated by either "," or "and" in some cases. Also, the third text has just two of your phrases rather than three. (More on that in a moment.)
txt <- c("The toy was fluffy, adorable and soft, he said.",
"The soft, adorable, fluffy toy was on the floor.",
"The fluffy, adorable toy was shaped like a bear.")
Now, let's generate a pair of functions to generate permutation sequences and subsequences from a vector. These will use some functions from the combinat package. The first is an inner function to generate permutations, the second is the main calling function that can generate full-length permutations or any subsample down to subsample_limit. (To use these more generally, of course, I'd add error checking, but I've skipped that for this example.)
genperms <- function(vec) {
combs <- combinat::permn(vec)
sapply(combs, paste, collapse = " ")
}
# vec any vector
# subsample_limit integer from 1 to length(vec), subsamples from
# which to return permutations; default is no subsamples
permutefn <- function(vec, subsample_limit = length(vec)) {
ret <- character()
for (i in length(vec):subsample_limit) {
ret <- c(ret,
unlist(lapply(combinat::combn(vec, i, simplify = FALSE),
genperms)))
}
ret
}
To demonstrate how these work:
fas <- c("fluffy", "adorable", "soft")
permutefn(fas)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
# and with subsampling:
permutefn(fas, 2)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
# [7] "fluffy adorable" "adorable fluffy" "fluffy soft"
# [10] "soft fluffy" "adorable soft" "soft adorable"
Now apply these to the texts using tokens_lookup(). I've avoided the punctuation issue by setting remove_punct = TRUE. To show the original tokens not replaced, I have also uses exclusive = FALSE.
tokens(txt, remove_punct = TRUE) %>%
tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "fluffy" "adorable" "and" "soft"
# [8] "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the"
# [8] "floor"
#
# text3 :
# [1] "The" "fluffy" "adorable" "toy" "was" "shaped" "like"
# [8] "a" "bear"
The first case here was not caught, because the second and third elements were separated by "and". We can remove that using tokens_remove(), and then get the match:
tokens(txt, remove_punct = TRUE) %>%
tokens_remove("and") %>%
tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "TEDDYBEAR" "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the" "floor"
#
# text3 :
# [1] "The" "fluffy" "adorable" "toy" "was" "shaped" "like"
# [8] "a" "bear"
Finally, to match the third text in which just two of the three dictionary elements exist, we can pass 2 as the subsample_limit argument:
tokens(txt, remove_punct = TRUE) %>%
tokens_remove("and") %>%
tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas, 2))),
exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "TEDDYBEAR" "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the" "floor"
#
# text3 :
# [1] "The" "TEDDYBEAR" "toy" "was" "shaped" "like" "a"
# [8] "bear"
#
If you want to know which documents have all the words, just do:
require(quanteda)
txt <- c("The toy was fluffy, adorable and soft, he said.",
"The soft, adorable, fluffy toy was on the floor.",
"The fluffy, adorable toy was shaped like a bear.")
dict <- dictionary(list(teddybear = list(c1 = "fluffy", c2 = "adorable", c3 = "soft")))
mt <- dfm_lookup(dfm(txt), dictionary = dict["teddybear"], levels = 2)
cbind(mt, "teddybear" = as.numeric(rowSums(mt > 0) == length(dict[["teddybear"]])))
# Document-feature matrix of: 3 documents, 4 features (16.7% sparse).
# 3 x 4 sparse Matrix of class "dfm"
# features
# docs c1 c2 c3 teddybear
# text1 1 1 1 1
# text2 1 1 1 1
# text3 1 1 0 0