Find the most frequently occurring words in a text in R

Can someone help me find the most frequently used two-word and three-word phrases in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")

The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
tibble(text = text) %>%
  unnest_tokens(word, text) %>%          # split into words
  anti_join(stop_words, by = "word") %>% # take out "a", "an", "the", etc.
  count(word, sort = TRUE)               # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>% # tokenize bigrams and trigrams
  as_tibble() %>%                                              # structure
  count(value, sort = TRUE)                                    # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...

Your text is:
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
In Natural Language Processing, 2-word phrases are referred to as "bi-grams", 3-word phrases as "tri-grams", and so forth. In general, a sequence of n consecutive words is called an "n-gram".
First, we install the ngram package (available on CRAN):
# Install package "ngram"
install.packages("ngram")
Then, we find the most frequent two-word and three-word phrases:
library(ngram)
# To find all two-word phrases in the test "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the test "text":
ng3 <- ngram(text, n = 3)
Finally, we print the n-gram objects in various ways; get.phrasetable() tabulates each phrase together with its frequency:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
get.phrasetable(ng3)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)

We can split the text into words and use table() to summarize the frequencies:
words <- unlist(strsplit(text, "[ ,.\\(\\)\"]"))
sort(table(words, exclude = ""), decreasing = TRUE)

Simplest?
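library(quanteda)
# bi-grams (recent quanteda versions build n-grams from tokens rather than via dfm(ngrams = ))
topfeatures(dfm(tokens_ngrams(tokens(text, remove_punct = TRUE), n = 2)))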
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(tokens_ngrams(tokens(text, remove_punct = TRUE), n = 3)))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
## 2 2 2 2

Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
This returns a table of integer frequency counts whose names are the words that were counted. Step by step:
gsub("[[:punct:]]", "", text) removes punctuation, which you presumably don't want to count
strsplit(gsub("[[:punct:]]", "", text), " ") splits the string on spaces
table() counts how often each unique element occurs
sort(..., decreasing = TRUE) sorts the counts in decreasing order
head(..., 5) keeps only the 5 most frequent words
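The same base R idea can be stretched to two-word and three-word phrases, which is what the question asks for. The following is only a minimal sketch: count_ngrams() is a hypothetical helper (not from any package), it lower-cases the text before counting, and it is shown on a short made-up sentence rather than on the question's text.

```r
# Hypothetical helper: count word n-grams in a single string with base R only.
count_ngrams <- function(x, n = 2) {
  # Lower-case, strip punctuation, then split on runs of whitespace.
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(table(character(0)))
  # Paste each window of n consecutive words into one phrase.
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}

# Made-up sample text, used only for illustration.
sample_text <- "The cat sat on the mat. The cat ran off the mat."
bigrams <- count_ngrams(sample_text, n = 2)
trigrams <- count_ngrams(sample_text, n = 3)
head(bigrams, 3)
head(trigrams, 3)
```

Calling count_ngrams(text, n = 2) on the question's text should give counts comparable to the package-based answers, although punctuation and case handling differ slightly between tokenizers.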

Related

Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large data frame. I'm trying to use grepl to search for a series of text patterns within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e. the word tumo(u)r within 3 words of the word colon).
I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.
The error I get is
Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"
Example below: trying to search for tumo(u)r within 3 words of colon.
Can anyone help?
library(tibble)
example.df <- tibble(number = 1:4, AB = c('tumor of the colon is a very hard disease to cure', 'breast cancer is also known as a neoplasia of the breast', 'tumour of the colon is bad', 'colon cancer is also bad'))
filtered.df <- example.df %>%
filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T)
R uses backslashes as escapes in string literals, and the regex engine does too, so you need to double your backslashes. This is explained in multiple prior questions on StackOverflow as well as in the help page brought up by ?regex. Try the escaped operators in a simpler set of tests before attempting complex operations, and pay close attention to the placement of parentheses and quotes in the pattern argument.
filtered.df <- example.df %>%
#filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB,
# errors here ....^.^..............^..^...^..^.............^.^
filter(grepl( "(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
ignore.case=T) )
> filtered.df
# A tibble: 2 × 2
number AB
<int> <chr>
1 1 tumor of the colon is a very hard disease to cure
2 3 tumour of the colon is bad
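As a smaller test of the escaping rule, here is a hedged base R sketch that does not need dplyr. The pattern is my own simplification rather than the one from the question: it only finds tumo(u)r followed by colon with at most three words in between (not the reverse order), and it relies on perl = TRUE for the non-capturing group. The raw string at the end assumes R 4.0.0 or later.

```r
# Sample rows, copied from the question's example data.
notes <- c(
  "tumor of the colon is a very hard disease to cure",
  "breast cancer is also known as a neoplasia of the breast",
  "tumour of the colon is bad",
  "colon cancer is also bad"
)

# In R source, "\\b" is the two-character regex token \b.
#   tumou?r            matches "tumor" or "tumour"
#   (?:\\w+\\W+){0,3}  allows up to three intervening words
pattern <- "\\btumou?r\\W+(?:\\w+\\W+){0,3}colon\\b"

hits <- grepl(pattern, notes, ignore.case = TRUE, perl = TRUE)
notes[hits]

# A raw string (R >= 4.0.0) avoids doubling the backslashes.
pattern_raw <- r"(\btumou?r\W+(?:\w+\W+){0,3}colon\b)"
identical(pattern, pattern_raw)
```

Inside dplyr the same call would go into filter(), e.g. filter(grepl(pattern, AB, ignore.case = TRUE, perl = TRUE)).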

How to count frequency of a multiword expression in Quanteda?

I am trying to count the frequency of a multiword expression in Quanteda. I know several articles in the corpus contain this expression, as when I look for it using 're' in Python it can find them. However, with Quanteda it doesn't seem to be working. Can anybody tell me what I am doing wrong?
> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
First off, apologies for not being able to use a fully Chinese text. But here's a presidential address into which I've taken the liberty of inserting your Mandarin words:
data <- "I stand here today humbled by the task before us 抗美 援朝,
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation,
as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
If you want to use quanteda, you can compute 4-grams (I take it your expression consists of four characters and will hence be treated as four words):
Step 1: split text into word tokens:
data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)
Step 2: compute 4-grams and make a frequency list of them
fourgrams <- sort(table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))), decreasing = T)
You can inspect the first ten:
fourgrams[1:10]
抗 美 援 朝 美 援 朝 have America has carried on Americans 抗 美 援
4 2 1 1
amidst gathering clouds and ancestors I thank President and cooperation he has and raging storms At
1 1 1 1
and the still waters and true to our
1 1
If you just want to know the frequency of your target compound:
fourgrams["抗 美 援 朝"]
抗 美 援 朝
4
Alternatively, and much more simply, especially if your interest is really just in a single compound, you could use str_extract_all from stringr. This will provide you the frequency count immediately:
library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
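If stringr is not available, a literal phrase can also be counted with base R. This is only a sketch: count_phrase() is a hypothetical helper name, the two documents are shortened stand-ins for the text above, and fixed = TRUE means the phrase is matched character for character rather than as a regular expression.

```r
# Hypothetical helper: count literal occurrences of `phrase` in each element of `x`.
count_phrase <- function(x, phrase) {
  matches <- gregexpr(phrase, x, fixed = TRUE)
  # gregexpr() returns -1 when nothing matches, so count only positive positions.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# Shortened stand-in documents, for illustration only.
docs <- c(
  "the oath 抗美 援朝 is taken; We the People 抗美 援朝 have remained faithful",
  "no target phrase here"
)
phrase_counts <- count_phrase(docs, "抗美 援朝")
phrase_counts
```

Because the helper is vectorised over documents, sum(count_phrase(docs, ...)) gives the corpus-wide total and count_phrase(docs, ...) > 0 flags the documents that contain the expression.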
Generally speaking, it is best to build a dictionary for looking up or compounding tokens in Chinese or Japanese, but the dictionary values should be segmented in the same way as the tokens are.
require(quanteda)
require(stringi)
txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")
## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)
## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs mwe1 mwe2
## text1 1 1
You're on the right track, but quanteda's default tokenizer seems to separate the tokens in your phrase into four characters:
> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"
For these reasons, you should consider an alternative tokenizer. Fortunately the excellent spaCy Python library offers a means to do this, and has Chinese language models. Using the spacyr package and quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese language model.
To count just these expressions, you can use a combination of tokens_select() and then textstat_frequency() on the dfm.
library("quanteda")
## Package version: 2.1.0
txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
spacy_tokenize(txt) %>%
as.tokens() %>%
tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
tokens_select("抗美 援朝") %>%
dfm() %>%
textstat_frequency()
## feature frequency rank docfreq group
## 1 抗美 援朝 3 1 1 all

Update qdap Dictionary for Sentiment Analysis

I am using the polarity function from qdap. There are a few words that I want to add to the dictionary as negative when they occur in combination. For instance:
"Pretty Bad"
The polarity score comes out neutral when this is passed to the polarity function.
> polarity("Pretty Bad")
all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 all 1 2 0 NA NA
Because it treats "pretty" as a positive word and "bad" as a negative one, the aggregate comes out neutral.
I want to avoid this and add a couple of custom words.
To add words to the dictionary, use sentiment_frame to build your own lexicon. You can add more words as required. By default, the polarized words in key.pol are used; see ?polarity.
library(qdap)
polarity("pretty bad")
# customised lexicon
positives = c("good","great")
negatives = c("bad","badly")
new_lexicon <- sentiment_frame(positives,negatives, pos.weights = 1, neg.weights = -1)
counts(polarity("pretty bad",polarity.frame = new_lexicon))

R text mining: Create document term matrix from dataframe, convert to dataframe, retain columns from original dataframe

Thanks to lawyeR for recommending the tidytext package. Here is some code based on that package that seems to work pretty well on my sample data. It doesn't work quite so well, though, when the value of the text column is blank. (There are times when this will happen, and it will make sense to keep the blank rather than filter it out.) I've set the first observation of TVAR to a blank to illustrate. The code drops this observation. How can I get R to keep the observation and set the frequencies for each word to zero? I tried some ifelse statements, with and without the pipe, but they did not work well. The trouble seems to center on the unnest_tokens function from the tidytext package.
sampletxt$TVAR[1] <- ""
chunk_words <- sampletxt %>%
group_by(PTNO, DATE, TYPE) %>%
unnest_tokens(word, TVAR, to_lower = FALSE) %>%
count(word) %>%
spread(word, n, 0)
I have an R data frame. I want to use it to create a document term matrix. Presumably I would want to use the tm package to do that but there might be other ways. I then want to convert that matrix back to a data frame. I want the final data frame to contain identifying variables from the original data frame.
Question is, how do I do that? I found some answers to a similar question, but that was for a data frame with text and a single ID variable. My data are likely to have about half a dozen variables that identify a given text record. So far, attempts to scale up the solution for a single ID variable haven't proven all that successful.
Below are some sample data. I created these for another task which I did manage to solve.
How can I get a version of this data frame that has an additional frequency column for each word in the text entries and that retains variables like PTNO, DATE, and TYPE?
sampletxt <-
structure(
list(
PTNO = c(1, 2, 2, 3, 3),
DATE = structure(c(16801, 16436, 16436, 16832, 16845), class = "Date"),
TYPE = c(
"Progress note",
"Progress note",
"CAT scan",
"Progress note",
"Progress note"
),
TVAR = c(
"This sentence contains the word metastatic and the word Breast plus the phrase denies any symptoms referable to.",
"This sentence contains tnm code T-1, N-O, M-0. This sentence contains contains tnm code T-1, N-O, M-1. This sentence contains tnm code T1N0M0. This sentence contains contains tnm code T1NOM1. This sentence is a sentence!?!?",
"This sentence contains Dr. Seuss and no target words. This sentence contains Ms. Mary J. blige and no target words.",
"This sentence contains the term stageIV and the word Breast. This sentence contains no target words.",
"This sentence contains the word breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)), .Names = c("PTNO", "DATE", "TYPE", "TVAR"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The quanteda package is faster and more straightforward than tm, and works nicely with tidytext as well. Here's how to do it:
These operations create a corpus from your object, create a document-feature matrix, and then return a data.frame that combines the variables with the feature counts. (Additional options are available when creating the dfm, see ?dfm).
library("quanteda")
samplecorp <- corpus(sampletxt, text_field = "TVAR")
sampledfm <- dfm(samplecorp)
result <- cbind(docvars(sampledfm), as.data.frame(sampledfm))
You can then group by the variables to get your result. (Here I am showing just the first 6 columns.)
dplyr::group_by(result[, 1:6], PTNO, DATE, TYPE)
# # A tibble: 5 x 6
# # Groups: PTNO, DATE, TYPE [5]
# PTNO DATE TYPE this sentence contains
# * <dbl> <date> <chr> <dbl> <dbl> <dbl>
# 1 1 2016-01-01 Progress note 1 1 1
# 2 2 2015-01-01 Progress note 5 6 6
# 3 2 2015-01-01 CAT scan 2 2 2
# 4 3 2016-02-01 Progress note 2 2 2
# 5 3 2016-02-14 Progress note 2 2 2
packageVersion("quanteda")
# [1] ‘0.99.6’
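For comparison, the same shape of result can be sketched in base R, which also shows one way to keep a row whose text is blank (its counts are simply all zero). This is an illustration under assumptions, not the quanteda method: the toy data frame below only mimics the layout of sampletxt, and the tokenizer is a crude split on non-alphanumeric characters.

```r
# Toy data with the same layout as sampletxt; the second text is deliberately blank.
notes <- data.frame(
  PTNO = c(1, 2, 2),
  TYPE = c("Progress note", "Progress note", "CAT scan"),
  TVAR = c("Breast and the word breast", "", "no target words"),
  stringsAsFactors = FALSE
)

# Lower-case, split on anything that is not a letter or digit, drop empty strings.
tokens <- lapply(strsplit(tolower(notes$TVAR), "[^a-z0-9]+"),
                 function(tk) tk[nzchar(tk)])
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word; tabulate() yields zeros for blank text.
counts <- t(vapply(tokens,
                   function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                   integer(length(vocab))))
colnames(counts) <- vocab

# Keep the identifying variables next to the word counts.
result <- cbind(notes[c("PTNO", "TYPE")], counts)
result
```

Any number of identifying columns can be carried along by listing them in the notes[...] selection, since the count matrix keeps the original row order.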
The ngram_tokenize() function from the SentimentAnalysis package, used together with tm, is an easy way to do this, especially if you are converting a data frame column to a term-document matrix (it also works on a text file):
library("tm")                # VCorpus(), VectorSource() and TermDocumentMatrix() come from tm
library("SentimentAnalysis") # provides ngram_tokenize()
corpus <- VCorpus(VectorSource(df$column_or_txt))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = function(x) ngram_tokenize(x, char = FALSE,
                                                                               ngmin = 1, ngmax = 2)))
It is simple and works reliably with both Chinese and English, for those doing text mining in Chinese.

How do I keep intra-word periods in unigrams? R quanteda

I would like to preserve two-letter acronyms separated by periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build my unigram frequency table with quanteda, the terminating period gets truncated. Here is a small test corpus to illustrate. I have removed periods as sentence separators:
SOS This is the u.s. where our politics is crazy EOS
SOS In the US we watch a lot of t.v. aka TV EOS
SOS TV is an important part of life in the US EOS
SOS folks outside the u.s. probably don't watch so much t.v. EOS
SOS living in other countries is probably not any less crazy EOS
SOS i enjoy my sanity when it comes to visit EOS
which I load into R as character vector:
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")
Here is the code I use to build my unigram frequency table:
library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ", toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable
This produces the following:
ngram frequency
1 SOS 6
2 EOS 6
3 the 4
4 is 3
5 . 3
6 u.s 2
7 crazy 2
8 US 2
9 watch 2
10 of 2
11 t.v 2
12 TV 2
13 in 2
14 probably 2
15 This 1
16 where 1
17 our 1
18 politics 1
19 In 1
20 we 1
21 a 1
22 lot 1
23 aka 1
etc...
I would like to keep the terminal periods on t.v. and u.s. as well as eliminate the entry in the table for . with a frequency of 3.
I also don't understand why the period (.) would have a count of 3 in this table while counting the u.s and t.v unigrams correctly (2 each).
The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition for word boundaries (from the stringi package). u.s. appears as the word u.s. followed by a period . token. This is great if your name is will.i.am but maybe not so great for your purposes. But you can easily switch to the white-space tokeniser, using the argument what = "fasterword" passed to tokens(), an option available in dfm() through the ... part of the function call.
tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS" "This" "is" "the" "u.s." "where" "our" "politics" "is" "crazy" "EOS"
You can see that here, u.s. is preserved. In response to your last question, the terminal . had a document frequency of 3 because it appeared in three documents as a separate token, which is the default word tokeniser behaviour when remove_punct = FALSE.
To pass this through to dfm() and then construct your data.frame of the document frequency of the words, the following code works (I've tidied it up a bit for efficiency). Note the comment about the difference between document and term frequency - I've noted that some users are a bit confused about docfreq().
# I removed the options that were the same as the default
# note also that stopwords = TRUE is not a valid argument - see remove parameter
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")
# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
# not the same as docfreq
# dat.dfm <- sort(dat.dfm)
# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
## ngram frequency
## 1 SOS 6
## 2 EOS 6
## 3 the 4
## 4 is 3
## 5 u.s. 2
## 6 crazy 2
## 7 US 2
## 8 watch 2
## 9 of 2
## 10 t.v. 2
In my view the named vector produced by docfreq() on the dfm is a more efficient method for storing the results than your data.frame approach, but you may wish to add other variables.
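The difference between term frequency and document frequency mentioned above can be checked without quanteda. The sketch below is an assumption-laden stand-in for what = "fasterword": it just splits on whitespace with base R, using three of the six test sentences.

```r
# Three of the test sentences from the question.
acro <- c(
  "SOS This is the u.s. where our politics is crazy EOS",
  "SOS In the US we watch a lot of t.v. aka TV EOS",
  "SOS folks outside the u.s. probably don't watch so much t.v. EOS"
)

# Whitespace-only tokenisation keeps "u.s." and "t.v." intact.
tokens <- strsplit(acro, "\\s+")

# Term frequency: every occurrence counts.
term_freq <- table(unlist(tokens))
# Document frequency: each document counts a token at most once.
doc_freq <- table(unlist(lapply(tokens, unique)))

term_freq[c("is", "u.s.", "t.v.")]
doc_freq[c("is", "u.s.", "t.v.")]
```

Here "is" occurs twice in the first sentence, so its term frequency and document frequency differ, while the acronyms keep their terminal periods in both tables.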
