Extract top positive and negative features when applying dictionary in quanteda - r

I have a data frame with around 100k rows that contain textual data. Using the quanteda package, I apply sentiment analysis (Lexicoder dictionary) to eventually calculate a sentiment score.
For an additional - more qualitative - step of analysis I would like to extract the top features (i.e. the negative/positive words from the dictionary that occur most frequently in my data) to examine whether the discourse is driven by particular words.
my_corpus <- corpus(my_df, docid_field = "ID", text_field = "my_text", metacorpus = NULL, compress = FALSE)
sentiment_corp <- dfm(my_corpus, dictionary = data_dictionary_LSD2015)
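To illustrate the score step (a rough sketch of what I have in mind, assuming the standard LSD2015 keys "positive" and "negative"), I would compute something like:
# sketch only: convert the dictionary counts to a data frame and score each document
sent_df <- convert(sentiment_corp, to = "data.frame")
sent_df$score <- (sent_df$positive - sent_df$negative) / (sent_df$positive + sent_df$negative)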
However, going through the quanteda documentation, I couldn't figure out how to achieve this - is there a way?
I'm aware of topfeatures and I did read this question, but it didn't help.

In all of the quanteda functions that take a pattern argument, the valid types of patterns are character vectors, lists, and dictionaries. So the best way to assess the top features in each dictionary category (what we also call a dictionary key) is to select on that dictionary and then use topfeatures().
Here is how to do this using the built-in data_corpus_irishbudget2010 object, as an example, with the Lexicoder Sentiment Dictionary.
library("quanteda")
## Package version: 1.4.3
# tokenize and select just the dictionary value matches
toks <- tokens(data_corpus_irishbudget2010) %>%
  tokens_select(pattern = data_dictionary_LSD2015)
lapply(toks[1:5], head)
## $`Lenihan, Brian (FF)`
## [1] "severe" "distress" "difficulties" "recovery"
## [5] "benefit" "understanding"
##
## $`Bruton, Richard (FG)`
## [1] "failed" "warnings" "sucking" "losses" "debt" "hurt"
##
## $`Burton, Joan (LAB)`
## [1] "remarkable" "consensus" "Ireland" "opposition" "knife"
## [6] "dispute"
##
## $`Morgan, Arthur (SF)`
## [1] "worst" "worst" "well" "corrupt" "golden" "protected"
##
## $`Cowen, Brian (FF)`
## [1] "challenge" "succeeding" "challenge" "oppose"
## [5] "responsibility" "support"
To explore the top matches for the positive entry, we can select them further by subsetting the dictionary for the Positive key.
# top positive matches
tokens_select(toks, pattern = data_dictionary_LSD2015["positive"]) %>%
  dfm() %>%
  topfeatures()
## benefit support recovery fair create confidence
## 68 52 44 41 39 37
## provide well credit help
## 36 33 31 29
And for Negative:
# top negative matches
tokens_select(toks, pattern = data_dictionary_LSD2015[["negative"]]) %>%
  dfm() %>%
  topfeatures()
## ireland benefit not support crisis recovery
## 79 68 52 52 47 44
## fair create deficit confidence
## 41 39 38 37
Why is "Ireland" a negative match? Because the LSD2015 includes ir* as a negative word that is intended to match ire and ireful but with the default case insensitive matching, also matches Ireland (a term frequently used in this example corpus). This is an example of a "false positive" match, always a risk in dictionaries when using wildcarding or when using a language such as English that has a very high rate of polysemes and homographs.

Related

Computing frequency of words for each document in a corpus/DFM for R

I want to replicate a measure of common words from a paper in R.
They describe their procedure as follows: "To construct Common words, ..., we first determine the relative frequency of all words occurring in all documents. We then calculate Common words as the average of this proportion for every word occurring in a given document. The higher the value of common words, the more ordinary is the document's language and thus the more readable it should be." (Loughran & McDonald 2014)
Can anybody help me with this? I work with corpus objects in order to analyse the text documents in R.
I have already computed the relative frequency of all words occurring in all documents as follows:
dfm_Notes_Summary <- dfm(tokens_Notes_Summary)
Summary_FreqStats_Notes <- textstat_frequency(dfm_Notes_Summary)
Summary_FreqStats_Notes$RelativeFreq <- Summary_FreqStats_Notes$frequency/sum(Summary_FreqStats_Notes$frequency)
-> I basically transformed the tokens object (tokens_Notes_Summary) into a dfm object (dfm_Notes_Summary) and got the relative frequency of all words across all documents.
Now I struggle to calculate the average of this proportion for every word occurring in a given document.
I had to infer what Loughran and McDonald (2014) meant, since I could not find code for it, but I think the measure is based on the average of a document's terms' document frequencies. The code will probably make this clearer:
library("quanteda")
#> Package version: 3.2.3
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
dfmat <- data_corpus_inaugural |>
  head(5) |>
  tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  dfm()
readability_commonwords <- function(x) {
  # compute document frequencies of all features
  relative_docfreq <- docfreq(x) / nfeat(x)
  # average of all words by the relative document frequency
  result <- x %*% relative_docfreq
  # return as a named vector
  structure(result[, 1], names = rownames(result))
}
readability_commonwords(dfmat)
#> 1789-Washington 1793-Washington 1797-Adams 1801-Jefferson 1805-Jefferson
#> 2.6090768 0.2738525 4.2026818 3.0928314 3.8256833
To know the full details, though, you should ask the authors.
Created on 2022-11-30 with reprex v2.0.2
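A variant that sticks closer to the relative term frequencies computed in the question is sketched below; the use of featfreq() and the averaging over each document's token count are my own reading of the measure, not the authors':
commonwords_termfreq <- function(x) {
  # overall relative frequency of each word across all documents
  relfreq <- featfreq(x) / sum(featfreq(x))
  # sum those proportions over each document's tokens, then average per token
  structure(as.vector(x %*% relfreq) / ntoken(x), names = docnames(x))
}
commonwords_termfreq(dfmat)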

Count words in texts that are NOT in a given dictionary

How can I find and count words that are NOT in a given dictionary?
The example below counts every time specific dictionary words (clouds and storms) appear in the text.
library("quanteda")
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()
dfmat
The output:
docs clouds storms
text1 1 1
How can I instead generate a count of all words that are NOT in the dictionary (clouds/storms)? Ideally with stopwords excluded.
E.g., desired output:
docs Forty-four Americans ...
text1 1 1
When you check the help file for tokens_select (run ?tokens_select), you can see that the third argument is selection. The default value is "keep", yet what you want is "remove". Since this is a common thing to do, there is also a dedicated tokens_remove command, which I use below to remove stopwords.
dfmat <- tokens(txt) %>%
  tokens_select(mydict, selection = "remove") %>%
  tokens_remove(stopwords::stopwords(language = "en")) %>%
  dfm()
dfmat
#> Document-feature matrix of: 1 document, 38 features (0.00% sparse) and 0 docvars.
#> features
#> docs forty-four americans now taken presidential oath . words spoken rising
#> text1 1 1 1 2 1 2 4 1 1 1
#> [ reached max_nfeat ... 28 more features ]
I think this is what you are trying to do.
Created on 2021-12-28 by the reprex package (v2.0.1)
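As an aside (not part of either answer): if all you need is the total number of remaining non-dictionary, non-stopword tokens per document rather than the full feature counts, you can sum the dfm rows:
ntoken(dfmat) # equivalently rowSums(dfmat)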
This is kind of a textbook use case for the setdiff() function. Here is an example of how to extract the words used by Obama (in $2013-Obama) that are not used by Biden (in $2021-Biden) from your example:
diff <- setdiff(toks[[1]], toks[[3]])
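The toks object above comes from a different example; a self-contained setup might look like the following (assuming a quanteda build whose data_corpus_inaugural runs through the 2021 address):
# tokens for the last three inaugural speeches
toks <- tokens(tail(data_corpus_inaugural, 3), remove_punct = TRUE)
# in that build, toks[[1]] is "2013-Obama" and toks[[3]] is "2021-Biden"
diff <- setdiff(toks[[1]], toks[[3]])
head(diff)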

How to count frequency of a multiword expression in Quanteda?

I am trying to count the frequency of a multiword expression in Quanteda. I know several articles in the corpus contain this expression, as when I look for it using 're' in Python it can find them. However, with Quanteda it doesn't seem to be working. Can anybody tell me what I am doing wrong?
> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
First off, apologies for not being able to use a fully Chinese text. But here's a presidential address into which I've taken the liberty of inserting your Mandarin words:
data <- "I stand here today humbled by the task before us 抗美 援朝,
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation,
as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
What you can do, if you want to use quanteda, is compute 4-grams (I take it your expression consists of four characters, which will hence be treated as four words).
Step 1: split text into word tokens:
data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)
Step 2: compute 4-grams and make a frequency list of them
fourgrams <- sort(table(unlist(as.character(
  tokens_ngrams(data_tokens, n = 4, concatenator = " ")
))), decreasing = TRUE)
You can inspect the first ten:
fourgrams[1:10]
抗 美 援 朝 美 援 朝 have America has carried on Americans 抗 美 援
4 2 1 1
amidst gathering clouds and ancestors I thank President and cooperation he has and raging storms At
1 1 1 1
and the still waters and true to our
1 1
If you just want to know the frequency of your target compound:
fourgrams["抗 美 援 朝"]
抗 美 援 朝
4
Alternatively, and much more simply, especially if your interest is really just in a single compound, you could use str_extract_all from stringr. This will give you the frequency count immediately:
library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
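As a small aside (not in the original answer), stringr::str_count() gives the same number in one step:
str_count(data, "抗美 援朝")
[1] 4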
Generally speaking, it is best to use a dictionary to look up or compound tokens in Chinese or Japanese, but the dictionary values should be segmented in the same way that tokens() segments the texts.
require(quanteda)
require(stringi)
txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")
## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)
## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs mwe1 mwe2
## text1 1 1
You're on the right track, but quanteda's default tokenizer seems to separate the tokens in your phrase into four characters:
> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"
For these reasons, you should consider an alternative tokenizer. Fortunately the excellent spaCy Python library offers a means to do this, and has Chinese language models. Using the spacyr package and quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese language model.
To count just these expressions, you can use a combination of tokens_select() and then textstat_frequency() on the dfm.
library("quanteda")
## Package version: 2.1.0
txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
spacy_tokenize(txt) %>%
  as.tokens() %>%
  tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
  tokens_select("抗美 援朝") %>%
  dfm() %>%
  textstat_frequency()
## feature frequency rank docfreq group
## 1 抗美 援朝 3 1 1 all

Find frequency of terms from Function

I need to find the frequency of terms from the function that I have created, which finds terms with punctuation in them.
library("tm")
my.text.location <- "C:/Users/*/"
newpapers <- VCorpus(DirSource(my.text.location))
I read it in and then make the function:
library("stringr")
punctterms <- function(x){str_extract_all(x, "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}")}
terms <- lapply(newpapers, punctterms)
Now I'm lost as to how I will find the frequency of each term in each file. Do I turn it into a DTM, or is there a better way without it?
Thank you!
This task is better suited to quanteda than to tm. Your function creates a list and pulls everything out of the corpus. Using quanteda, you can just use its commands to get everything you want.
Since you didn't provide any reproducible data, I will use a data set that comes with quanteda. Comments above the code explain what is going on. The most important function in this code is dfm_select. Here you can use a diverse set of selection patterns to find terms in the text.
library(quanteda)
# load corpus
my_corpus <- corpus(data_corpus_inaugural)
# create document features (like document term matrix)
my_dfm <- dfm(my_corpus)
# dfm_select can use regex selections to select terms
my_dfm_punct <- dfm_select(my_dfm,
                           pattern = "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}",
                           selection = "keep",
                           valuetype = "regex")
# show frequency of selected terms.
head(textstat_frequency(my_dfm_punct))
feature frequency rank docfreq group
1 fellow-citizens 39 1 19 all
2 america's 35 2 11 all
3 self-government 30 3 16 all
4 world's 24 4 15 all
5 nation's 22 5 13 all
6 god's 15 6 14 all
So I got it to work without using quanteda:
m <- as.data.frame(table(unlist(terms)))
names(m) <- c("Terms", "Frequency")
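Since terms is a list with one element per file, the same idea can keep the per-file breakdown (a sketch, not taken from either answer):
freq_by_file <- lapply(terms, function(x) as.data.frame(table(unlist(x))))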

R Text Mining with quanteda

I have a data set of Facebook posts (obtained via Netvizz) and I use the quanteda package in R. Here is my R code.
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read File
# Facebooks posts could be generated by FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <- as.character(fbpost$comment_message) # one column with 2700 entries
# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# LIWC Application
fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
View(fb_liwc)
Everything works until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message? Are there any suggestions to solve the problem?
There was a bug in quanteda version 0.7.2 that caused dfm() to fail when using a dictionary if one of the documents contained no features. Your example fails because some of the Facebook post "documents" end up having all of their features removed during the cleaning stage.
Not only is this fixed in 0.8.0, but we also changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and the regular expressions still mean that it is much slower to use than simply indexing tokens. We will work on optimising this further.)
devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
## ... indexing 57 documents
## ... lowercasing
## ... tokenizing
## ... shaping tokens into data.table, found 134,024 total tokens
## ... applying a dictionary consisting of 68 key entries
## ... summing dictionary-matched features by document
## ... indexing 68 feature types
## ... building sparse matrix
## ... created a 57 x 68 sparse dfm
## ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers Nonfl Swear TV Eating Sleep Groom Death Sports Sexual
## 0 0 0 42 47 49 53 76 81 100
It will also work if a document contains zero features after tokenization and cleaning, which is probably what was breaking the older dfm() call you are using with your Facebook texts.
mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams
## 3