Count words in texts that are NOT in a given dictionary - r

How can I find and count words that are NOT in a given dictionary?
The example below counts every time specific dictionary words (clouds and storms) appear in the text.
library("quanteda")
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()
dfmat
The output:
docs clouds storms
text1 1 1
How can I instead generate a count of all words that are NOT in the dictionary (clouds/storms)? Ideally with stopwords excluded.
E.g., desired output:
docs Forty-four Americans ...
text1 1 1

When you check the help file for tokens_select() (run ?tokens_select), you can see that the third argument is selection. The default value is "keep", yet what you want is "remove". Since this is a common operation, there is also a dedicated tokens_remove() command, which I use below to remove the stopwords.
dfmat <- tokens(txt) %>%
  tokens_select(mydict, selection = "remove") %>%
  tokens_remove(stopwords::stopwords(language = "en")) %>%
  dfm()
dfmat
#> Document-feature matrix of: 1 document, 38 features (0.00% sparse) and 0 docvars.
#> features
#> docs forty-four americans now taken presidential oath . words spoken rising
#> text1 1 1 1 2 1 2 4 1 1 1
#> [ reached max_nfeat ... 28 more features ]
I think this is what you are trying to do.
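If you prefer the counts as a data frame rather than a dfm, convert() works here as well (the same approach used in the answers further down):
convert(dfmat, to = "data.frame")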
Created on 2021-12-28 by the reprex package (v2.0.1)

This is rather a case for the setdiff() function. Here is an example of how to extract the words used by Obama (in $2013-Obama) but not by Biden (in $2021-Biden) from your example:
diff <- setdiff(toks[[1]], toks[[3]])
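Since toks is not defined in this answer, here is a minimal sketch of how it might have been built (an assumption on my part, inferred from the $2013-Obama and $2021-Biden names, using the built-in inaugural corpus):
# assumed setup: the three most recent inaugural speeches, so that
# toks[[1]] is 2013-Obama and toks[[3]] is 2021-Biden
toks <- data_corpus_inaugural %>%
  corpus_subset(Year >= 2013) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower()
# words used by Obama in 2013 but not by Biden in 2021
diff <- setdiff(toks[[1]], toks[[3]])
head(diff)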

Related

Computing frequency of words for each document in a corpus/DFM for R

I want to replicate a measure of common words from a paper in R.
They describe their procedure as follows: "To construct Common words, ..., we first determine the relative frequency of all words occurring in all documents. We then calculate Common words as the average of this proportion for every word occurring in a given document. The higher the value of common words, the more ordinary is the document's language and thus the more readable it should be." (Loughran & McDonald 2014)
Can anybody help me with this? I work with corpus objects in order to analyse the text documents in R.
I have already computed the relative frequency of all words occurring in all documents as follows:
dfm_Notes_Summary <- dfm(tokens_Notes_Summary)
Summary_FreqStats_Notes <- textstat_frequency(dfm_Notes_Summary)
Summary_FreqStats_Notes$RelativeFreq <- Summary_FreqStats_Notes$frequency/sum(Summary_FreqStats_Notes$frequency)
I basically transformed the tokens object (tokens_Notes_Summary) into a dfm object (dfm_Notes_Summary) and got the relative frequency of all words across all documents.
Now I struggle to calculate the average of this proportion for every word occurring in a given document.
I reread Loughran and McDonald (2014) to work out what they meant, since I could not find code for it, but I think the measure is based on the average of a document's terms' document frequencies. The code will probably make this clearer:
library("quanteda")
#> Package version: 3.2.3
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
dfmat <- data_corpus_inaugural |>
  head(5) |>
  tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  dfm()
readability_commonwords <- function(x) {
  # compute document frequencies of all features
  relative_docfreq <- docfreq(x) / nfeat(x)
  # average of all words by the relative document frequency
  result <- x %*% relative_docfreq
  # return as a named vector
  structure(result[, 1], names = rownames(result))
}
readability_commonwords(dfmat)
#> 1789-Washington 1793-Washington 1797-Adams 1801-Jefferson 1805-Jefferson
#> 2.6090768 0.2738525 4.2026818 3.0928314 3.8256833
To know full details though you should ask the authors.
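If you read the quoted description as a per-token average rather than a weighted sum, one possible variant (my assumption, not part of the original answer) divides by each document's token count with ntoken():
readability_commonwords_avg <- function(x) {
  # relative document frequencies, as in the function above
  relative_docfreq <- docfreq(x) / nfeat(x)
  # weighted sum divided by the document's token count = average per word (assumption)
  result <- (x %*% relative_docfreq) / ntoken(x)
  structure(result[, 1], names = rownames(result))
}
readability_commonwords_avg(dfmat)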
Created on 2022-11-30 with reprex v2.0.2

In R, how can I count specific words in a corpus?

I need to count the frequency of particular words. Lots of words. I know how to do this by putting all words in one group (see below), but I would like to get the count for each specific word.
This is what I have at the moment:
library(quanteda)
#function to count
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split), function(z) na.omit(length(grep(pattern, z)))))
}
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df<-data.frame(txt)
mydict<-dictionary(list(all_terms=c("clouds","storms")))
corp <- corpus(df, text_field = 'txt')
#count terms and save output to "overview"
overview<-dfm(corp,dictionary = mydict)
overview<-convert(overview, to ='data.frame')
As you can see, the counts for "clouds" and "storms" are in the "all_terms" category in the resulting data.frame. Is there an easy way to get the count for all terms in "mydict" in individual columns, without writing the code for each individual term?
E.g.
clouds, storms
1, 1
Rather than
all_terms
2
You want to use the dictionary values as a pattern in tokens_select(), rather than using them in a lookup function, which is what dfm(x, dictionary = ...) does. Here's how:
library("quanteda")
## Package version: 2.1.2
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
This creates the dfm where each column is the term, not the dictionary key:
dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()
dfmat
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs clouds storms
## text1 1 1
You can turn this into a data.frame of counts in two ways:
convert(dfmat, to = "data.frame")
## doc_id clouds storms
## 1 text1 1 1
textstat_frequency(dfmat)
## feature frequency rank docfreq group
## 1 clouds 1 1 1 all
## 2 storms 1 1 1 all
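A small aside (my note, not part of the original answer): in quanteda version 3 and later, textstat_frequency() has moved to the companion quanteda.textstats package, so you may need to load that first:
library("quanteda.textstats")
textstat_frequency(dfmat)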
And while a dictionary is a valid input for a pattern (see ?pattern), you could also have just fed the character vector of values to tokens_select():
# no need for dictionary
tokens(txt) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs clouds storms
## text1 1 1
You can use the unnest_tokens() function from tidytext in combination with pivot_wider() from tidyr to get the count for each word in separate columns:
library(dplyr)
library(tidytext)
library(tidyr)
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
mydict <- c("clouds", "storms")
df <- data.frame(text = txt) %>%
  unnest_tokens(word, text) %>%
  count(word) %>%
  pivot_wider(names_from = word, values_from = n)
df %>% select(mydict)
# A tibble: 1 x 2
clouds storms
<int> <int>
1 1 1
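One small addition of my own: with recent dplyr versions, passing an external character vector straight to select() triggers a tidyselect warning; wrapping it in all_of() keeps the same behaviour while making the intent explicit:
df %>% select(all_of(mydict))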

How to count frequency of a multiword expression in Quanteda?

I am trying to count the frequency of a multiword expression in Quanteda. I know several articles in the corpus contain this expression, as when I look for it using 're' in Python it can find them. However, with Quanteda it doesn't seem to be working. Can anybody tell me what I am doing wrong?
> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
First off, apologies for not being able to use a fully Chinese text. But here's a presidential address into which I've taken the liberty of inserting your Mandarin words:
data <- "I stand here today humbled by the task before us 抗美 援朝,
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation,
as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
What you can do, if you want to use quanteda, is compute 4-grams (I take it your expression consists of four characters, which will hence be treated as four words).
Step 1: split text into word tokens:
data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)
Step 2: compute 4-grams and make a frequency list of them
fourgrams <- sort(table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))),
                  decreasing = TRUE)
You can inspect the first ten:
fourgrams[1:10]
抗 美 援 朝 美 援 朝 have America has carried on Americans 抗 美 援
4 2 1 1
amidst gathering clouds and ancestors I thank President and cooperation he has and raging storms At
1 1 1 1
and the still waters and true to our
1 1
If you just want to know the frequency of your target compound:
fourgrams["抗 美 援 朝"]
抗 美 援 朝
4
Alternatively, and much more simply, especially if your interest is really just in a single compound, you could use str_extract_all from stringr. This will provide you the frequency count immediately:
library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
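Equivalently, if you only need the count, stringr::str_count() returns it directly (a small variant of the same idea, not from the original answer):
# counts non-overlapping matches of the pattern
str_count(data, "抗美 援朝")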
Generally speaking, it is best to make a dictionary to look up or compound tokens in Chinese or Japanese, but the dictionary values should be segmented in the same way that tokens() segments the text.
require(quanteda)
require(stringi)
txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")
## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)
## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs mwe1 mwe2
## text1 1 1
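If you would rather keep the expressions as single tokens in the dfm than collapse them into dictionary keys, the same segmented values can be fed to tokens_compound() (a sketch of my own, reusing the lis and toks objects above; the same caveat about segmentation applies):
## compound instead of lookup: each matched sequence becomes one token
toks_comp <- tokens_compound(toks, pattern = phrase(unlist(lis)), concatenator = "")
dfm(toks_comp) %>%
  dfm_select(pattern = c("抗美援朝", "向北開進"))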
You're on the right track, but quanteda's default tokenizer seems to separate the tokens in your phrase into four characters:
> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"
For this reason, you should consider an alternative tokenizer. Fortunately, the excellent spaCy Python library offers a means to do this and has Chinese language models. Using the spacyr package together with quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese language model.
To count just these expressions, you can use a combination of tokens_select() and then textstat_frequency() on the dfm.
library("quanteda")
## Package version: 2.1.0
txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
spacy_tokenize(txt) %>%
  as.tokens() %>%
  tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
  tokens_select("抗美 援朝") %>%
  dfm() %>%
  textstat_frequency()
## feature frequency rank docfreq group
## 1 抗美 援朝 3 1 1 all

How to calculate proximity of words to a specific term in a document

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text:
song <- "Far over the misty mountains cold To dungeons deep and caverns old We
must away ere break of day To seek the pale enchanted gold. The dwarves of
yore made mighty spells, While hammers fell like ringing bells In places deep,
where dark things sleep, In hollow halls beneath the fells. For ancient king
and elvish lord There many a gleaming golden hoard They shaped and wrought,
and light they caught To hide in gems on hilt of sword. On silver necklaces
they strung The flowering stars, on crowns they hung The dragon-fire, in
twisted wire They meshed the light of moon and sun. Far over the misty
mountains cold To dungeons deep and caverns old We must away, ere break of
day, To claim our long-forgotten gold. Goblets they carved there for
themselves And harps of gold; where no man delves There lay they long, and
many a song Was sung unheard by men or elves. The pines were roaring on the
height, The winds were moaning in the night. The fire was red, it flaming
spread; The trees like torches blazed with light. The bells were ringing in
the dale And men they looked up with faces pale; The dragon’s ire more fierce
than fire Laid low their towers and houses frail. The mountain smoked beneath
the moon; The dwarves they heard the tramp of doom. They fled their hall to
dying fall Beneath his feet, beneath the moon. Far over the misty mountains
grim To dungeons deep and caverns dim We must away, ere break of day,
To win our harps and gold from him!"
I want to be able to see what words appear within 15 (I would like this number to be interchangeable) words on either side (15 to the left and 15 to the right) of the word "fire" (also interchangeable) every time it appears. I want to see each word and the number of times it appears in this 15 word span for each instance of "fire." So, for example, "fire" is used 3 times. Of those 3 times the word "light" falls within 15 words on either side twice. I would want a table that shows the word, the number of times it appears within the specified proximity of 15, the maximum distance (which in this case is 12), the minimum distance (which is 7), and the average distance (which is 9.5).
I figured I would need several steps and packages to make this work. My first thought was to use the "kwic" function from quanteda since it allows you to choose a "window" around a specific term. Then a frequency count of terms based on the kwic results is not that hard (with stopwords removed for the frequency, but not for the word proximity measure). My real problem is finding the maximum, minimum, and average distances from the focus term and then getting the results into a nice neat table with the terms as rows in descending order by frequency and the columns giving me the frequency count, max distance, minimum distance, and average distance.
Here is what I have so far:
library(quanteda)
library(tm)
mysong <- char_tolower(song)
toks <- tokens(mysong, remove_hyphens = TRUE, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)
mykwic <- kwic(toks, "fire", window = 15, valuetype ="fixed")
thekwic <- as.character(mykwic)
thekwic <- removePunctuation(thekwic)
thekwic <- removeNumbers(thekwic)
thekwic <- removeWords(thekwic, stopwords("en"))
kwicFreq <- termFreq(thekwic)
Any help is much appreciated.
I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.
You can start by tokenizing it into a one-row-per-word data frame, adding a position column, and removing stopwords:
library(tidytext)
library(dplyr)
all_words <- data_frame(text = song) %>%
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))
You can then find just the word fire, and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. You can then use group_by() and summarize() to get your desired statistics for each word.
library(fuzzyjoin)
nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))
words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))
Output in this case:
# A tibble: 49 × 5
word number maximum_distance minimum_distance average_distance
<chr> <int> <dbl> <dbl> <dbl>
1 fire 3 0 0 0.0
2 light 2 12 7 9.5
3 moon 2 13 9 11.0
4 bells 1 14 14 14.0
5 beneath 1 11 11 11.0
6 blazed 1 10 10 10.0
7 crowns 1 5 5 5.0
8 dale 1 15 15 15.0
9 dragon 1 1 1 1.0
10 dragon’s 1 5 5 5.0
# ... with 39 more rows
Note that this approach also lets you perform the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_word, word) (renaming the focus column accordingly), as sketched below.
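For instance, a minimal sketch of that multi-word variant (with "gold" as an arbitrary second focus term and the column renamed to focus_word; my adaptation, not the answer author's code):
nearby_words_multi <- all_words %>%
  filter(word %in% c("fire", "gold")) %>%
  select(focus_word = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))
nearby_words_multi %>%
  group_by(focus_word, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(focus_word, desc(number))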
The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function to count within a window is not kwic() but rather fcm() (feature co-occurrence matrix).
require(quanteda)
# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)
# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
## features
## features fire
## Far 1
## over 1
## the 5
## misty 1
## mountains 0
## cold 0
head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
## features
## features fire
## light 2
Getting the average distance of the words from the target requires a bit of a hack of the weights argument. Below, the weights count each co-occurrence by its distance from the target, so that summing them and dividing by the total frequency within the window gives a weighted mean distance. For your example of "light", for instance:
# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##          features
## features  fire
##    light   9.5
Getting the minimum and maximum positions is a bit more complicated. I can see a way to "hack" this by using the weights to place a binary mask at each position and then converting that to a distance, but it is too ungainly to show, so I'm recommending the tidy solution unless I think of a more elegant way.
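For completeness, one hedged workaround (my own, not the fcm hack alluded to above) is to take token positions directly and compute the distances in base R, reusing the toks object created earlier:
# positions of the target and of a word of interest in the token stream
# (note: toks was not lowercased; tokens_tolower(toks) would make matching case-insensitive)
toks_vec <- as.character(toks)
fire_pos <- which(toks_vec == "fire")
light_pos <- which(toks_vec == "light")
# pairwise distances, kept only when within the 15-token window
dists <- abs(outer(light_pos, fire_pos, "-"))
dists <- dists[dists <= 15]
c(min = min(dists), max = max(dists), mean = mean(dists))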

Find the most frequently occuring words in a text in R

Can someone help me with how to find the most frequently used two and three words in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
  unnest_tokens(word, text) %>%  # split words
  anti_join(stop_words) %>%      # take out "a", "an", "the", etc.
  count(word, sort = TRUE)       # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%  # tokenize bigrams and trigrams
  as_data_frame() %>%                                           # structure
  count(value, sort = TRUE)                                     # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
This answer uses the same text as in the question.
In natural language processing, 2-word phrases are referred to as "bigrams", 3-word phrases as "trigrams", and so forth. Generally, a given combination of n words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the text:
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the text:
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (n-grams) using various methods as below:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bigram)
lnth <- 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (trigram)
lnth <- 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- unlist(strsplit(text, "[ ,.\\(\\)\"]"))
sort(table(words, exclude = ""), decreasing = TRUE)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
## 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector with the frequency count and the names of the vector correspond to the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words
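Since the question also asks for two- and three-word phrases, a base R extension in the same spirit (my own sketch, not part of the original answer) pastes adjacent words together before tabulating:
# split into words as above (lower-cased so "The" and "the" count together),
# then pair up neighbours to form bigrams
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(text)), " +"))
bigrams <- paste(head(words, -1), tail(words, -1))
head(sort(table(bigrams), decreasing = TRUE), 5)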
