Maximum occurrence of any set of words in text in R

Given a set of lines, I have to find the maximum occurrence of words (need not be a single word; it can be a set of words as well).
Say I have a text like:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
I want the output to be:
john beck - 3
chemical engineer - 2
Is there any function or package which does this?

Try this:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
library(tau)
library(tm)
tokens <- MC_tokenizer(string)   # tokenize the text
tokens <- tokens[tokens != ""]   # drop empty tokens
## stem each token, then complete the stems back to full words,
## so that e.g. "engineers" and "engineer" count as the same type
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")
## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# john beck 3
# chemical engineer 2
## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x) )), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# beck 3
# john 3
# john beck 3
# chemical 2
# engineer 2
# is 2
# chemical engineer 2
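The uni-gram rows of the mixed table include function words such as "is". A minimal sketch to drop exact stopword matches from the table, using tm's stopwords() (tm is already loaded above):
tab <- tab[!names(tab) %in% stopwords("english")]
data.frame(Freq = tab[tab > 1])
# Freq
# beck 3
# john 3
# john beck 3
# chemical 2
# engineer 2
# chemical engineer 2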

Could also try this, using the quanteda package:
require(quanteda)
mydfm <- dfm(string, ngrams = 1:2, concatenator = "_", stem = TRUE, verbose = FALSE)
topfeatures(mydfm)
## beck john john_beck chemic chemical_engin engin is
## 3 3 3 2 2 2 2
## an an_chem are
## 1 1 1
You lose the stems, but this counts "john beck" three times instead of just two (without stemming, "john beck's" would be a separate type).
It's simpler, though!
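Note that dfm(string, ngrams = ...) is the quanteda v1 API. As a rough sketch, the same idea under the v3 tokens-based workflow would look like this (assuming current quanteda; untested against this example):
library(quanteda)
toks <- tokens_wordstem(tokens(string, remove_punct = TRUE))   # stem the tokens
mydfm <- dfm(tokens_ngrams(toks, n = 1:2, concatenator = "_")) # uni- and bi-grams
topfeatures(mydfm)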

Related

Compute word count with data.table in R by a value

I am new to data.table, I have a dataset with person names and countries, and I want to know the most frequent names by country.
The dataset looks like this:
library(data.table)
DT <- data.table(person_id = c(1, 2, 3, 4, 5, 6),
                 person_name = c('John Smith', 'Marty Mcfly', 'Amélie Poulain',
                                 'John Wick', 'Clark Kent', 'Marcel Poulain'),
                 person_ctry = c('US', 'US', 'FR', 'US', 'US', 'FR'))
I would like to obtain a data.table like this:
person_ctry word count
US John 2
US Smith 1
US Marty 1
FR Poulain 2
FR Amélie 1
....
I tried this:
all_names <- DT[, lapply(.(person_name), paste0, collapse=" "), by=person_ctry]
wordcount <- function(str) {
  as.data.frame(table(unlist(strsplit(str, "\\ "))))
}
all_names[, c("word","count") := wordcount(V1), by=person_ctry]
But the last line gives an error saying the RHS doesn't match the LHS length exactly, and I don't know how to correct it. Any ideas?
Thanks.
Here's a slightly modified approach to count those words:
DT[, .(word = unlist(strsplit(person_name, '\\s+'), use.names = FALSE)),
   by = .(person_ctry)][, .(count = .N), by = .(person_ctry, word)]
# person_ctry word count
# 1: US John 2
# 2: US Smith 1
# 3: US Marty 1
# 4: US Mcfly 1
# 5: US Wick 1
# 6: US Clark 1
# 7: US Kent 1
# 8: FR Amélie 1
# 9: FR Poulain 2
# 10: FR Marcel 1
The approach has two steps:
1. Split the words (names) at whitespace, by country, to create an intermediate long-format data.table.
2. Count the number of rows per unique word in each country using data.table's special .N symbol.
I combined both steps using a chain of [] calls; see the sketch below for reducing the result further.
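If you only need the single most frequent name per country rather than the full counts, a sketch building on the same chain (res is an assumed name for the counts):
res <- DT[, .(word = unlist(strsplit(person_name, '\\s+'), use.names = FALSE)),
          by = .(person_ctry)][, .(count = .N), by = .(person_ctry, word)]
## order by descending count, then keep the first row per country
res[order(-count), .SD[1], by = person_ctry]
# person_ctry word count
# 1: US John 2
# 2: FR Poulain 2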

how to get a sentiment score (and keep the sentiment words) in quanteda?

Consider this simple example
library(tibble)
library(quanteda)
tibble(mytext = c('this is a good movie',
'oh man this is really bad',
'quanteda is great!'))
# A tibble: 3 x 1
mytext
<chr>
1 this is a good movie
2 oh man this is really bad
3 quanteda is great!
I would like to perform some basic sentiment analysis, but with a twist. Here is my dictionary, stored in a regular tibble:
mydictionary <- tibble(sentiment = c('positive', 'positive','negative'),
word = c('good', 'great', 'bad'))
# A tibble: 3 x 2
sentiment word
<chr> <chr>
1 positive good
2 positive great
3 negative bad
Essentially, I would like to count how many positive and negative words are detected in each sentence, but also keep track of the matching words. In other words, the output should look like
mytext nb.pos nb.neg pos.words
1 this is a good and great movie 2 0 good, great
2 oh man this is really bad 0 1 bad
3 quanteda is great! 1 0 great
How can I do that in quanteda? Is this possible?
Thanks!
Stay tuned for quanteda v2.1, in which we will have greatly expanded, dedicated functions for sentiment analysis. In the meantime, see below. Note that I made some adjustments, since there is a discrepancy between the text you report and your input text; also, your expected pos.words column contains all sentiment words, not just the positive ones. Below, I compute both the positive and all sentiment matches.
# note the amended input text
mytext <- c(
"this is a good and great movie",
"oh man this is really bad",
"quanteda is great!"
)
mydictionary <- tibble::tibble(
sentiment = c("positive", "positive", "negative"),
word = c("good", "great", "bad")
)
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
# make the dictionary into a quanteda dictionary
qdict <- as.dictionary(mydictionary)
Now we can use the lookup functions to get to your final data.frame.
# get the sentiment scores
toks <- tokens(mytext)
df <- toks %>%
tokens_lookup(dictionary = qdict) %>%
dfm() %>%
convert(to = "data.frame")
names(df)[2:3] <- c("nb.neg", "nb.pos")
# get the matches for pos and all words
poswords <- tokens_keep(toks, qdict["positive"])
allwords <- tokens_keep(toks, qdict)
data.frame(
mytext = mytext,
df[, 2:3],
pos.words = sapply(poswords, paste, collapse = ", "),
all.words = sapply(allwords, paste, collapse = ", "),
row.names = NULL
)
## mytext nb.neg nb.pos pos.words all.words
## 1 this is a good and great movie 0 2 good, great good, great
## 2 oh man this is really bad 1 0 bad
## 3 quanteda is great! 0 1 great great
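By the same pattern, if you also want a column of just the matched negative words, a short sketch reusing the objects above:
negwords <- tokens_keep(toks, qdict["negative"])
sapply(negwords, paste, collapse = ", ")
## only the second text has a negative match ("bad"); the others are empty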

Select phrases found in dictionary and return dataframe of doc_id and phrase

I have a dictionary file of medical phrases and a corpus of raw texts. I'm trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are 1 to 5-word n-grams. In the end, I would like the selected phrases in a dataframe with two columns: doc_id, phrase
I've been trying to use the quanteda package to do this but haven't been successful. Below is some code to reproduce my latest attempt. I'd appreciate any advice; I've tried a variety of methods but keep getting back only single-word matches.
Session info:
R version 3.6.2 (2019-12-12)
OS: Windows 10 x64
System: x86_64, mingw32
UI: RStudio
Packages: dbplyr 1.4.2, quanteda 1.5.2
library(quanteda)
library(dplyr)
raw <- data.frame("doc_id" = c("1", "2", "3"),
"text" = c("diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."))
term = c("diffuse intrinsic pontine glioma", "brain tumors", "brain", "pontine glioma", "mri", "medical imaging", "radiology", "anatomy", "physiological processes", "radiation therapy", "radiotherapy", "cancer treatment", "malignant cells")
medTerms = list(term = term)
dict <- dictionary(medTerms)
corp <- raw %>% group_by(doc_id) %>% summarise(text = paste(text, collapse=" "))
corp <- corpus(corp, text_field = "text")
dfm <- dfm(corp,
tolower = TRUE, stem = FALSE, remove_punct = TRUE,
remove = stopwords("english"))
dfm <- dfm_select(dfm, pattern = phrase(dict))
What I'd eventually like to get back is something like the following:
doc_id term
1 diffuse intrinsic pontine glioma
1 pontine glioma
1 brain tumors
1 brain
2 mri
2 medical imaging
2 radiology
2 anatomy
2 physiological processes
3 radiation therapy
3 radiotherapy
3 cancer treatment
3 malignant cells
If you want to match multi-word patterns from a dictionary, you can do so by constructing your dfm using ngrams.
library(quanteda)
library(dplyr)
library(tidyr)
raw$text <- as.character(raw$text) # you forgot to use stringsAsFactors = FALSE while constructing the data.frame, so I convert your factor to character before continuing
corp <- corpus(raw, text_field = "text")
dfm <- tokens(corp) %>%
tokens_ngrams(1:5) %>% # This is the new way of creating ngram dfms. 1:5 means to construct all from unigram to 5-grams
dfm(tolower = TRUE,
stem = FALSE,
remove_punct = TRUE) %>% # I wouldn't remove stopwords for this matching task
dfm_select(pattern = dict)
Now we just have to convert the dfm to a data.frame and bring it into a long format:
convert(dfm, "data.frame") %>%
pivot_longer(-document, names_to = "term") %>%
filter(value > 0)
#> # A tibble: 13 x 3
#> document term value
#> <chr> <chr> <dbl>
#> 1 1 brain 2
#> 2 1 pontine_glioma 1
#> 3 1 brain_tumors 1
#> 4 1 diffuse_intrinsic_pontine_glioma 1
#> 5 2 mri 1
#> 6 2 radiology 1
#> 7 2 anatomy 1
#> 8 2 medical_imaging 1
#> 9 2 physiological_processes 1
#> 10 3 radiotherapy 1
#> 11 3 radiation_therapy 1
#> 12 3 cancer_treatment 1
#> 13 3 malignant_cells 1
You could remove the value column but it might be of interest later on.
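If you do want to drop it, dplyr's select() (already loaded above) does it at the end of the chain:
convert(dfm, "data.frame") %>%
  pivot_longer(-document, names_to = "term") %>%
  filter(value > 0) %>%
  select(-value)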
You could form all ngrams from 1 to 5 in length and then select from those, but for large texts this would be very inefficient. Here's a more direct way. I've reproduced the entire problem here with a few modifications (such as stringsAsFactors = FALSE and skipping some unnecessary steps).
Granted, this does not double-count the terms as in your expected example, but I submit that you probably did not want that. Why count "brain" if it occurred within "brain tumors"? You would be better off counting "brain tumors" when it occurs as that phrase, and "brain" only when it occurs without "tumors". The code below does that.
library(quanteda)
## Package version: 2.0.1
raw <- data.frame(
"doc_id" = c("1", "2", "3"),
"text" = c(
"diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."
),
stringsAsFactors = FALSE
)
dict <- dictionary(list(
term = c(
"diffuse intrinsic pontine glioma",
"brain tumors", "brain", "pontine glioma", "mri", "medical imaging",
"radiology", "anatomy", "physiological processes", "radiation therapy",
"radiotherapy", "cancer treatment", "malignant cells"
)
))
Here's the key to the answer: using the dictionary first to select the tokens, then to concatenate them, then to reshape them one dictionary match per new "document". The last step creates the data.frame you want.
toks <- corpus(raw) %>%
tokens() %>%
tokens_select(dict) %>% # select just dictionary values
tokens_compound(dict, concatenator = " ") %>% # turn phrase into single "tokens"
tokens_segment(pattern = "*") # make one token per "document"
# make into data.frame
data.frame(
doc_id = docid(toks), term = as.character(toks),
stringsAsFactors = FALSE
)
## doc_id term
## 1 1 diffuse intrinsic pontine glioma
## 2 1 brain tumors
## 3 1 brain
## 4 2 mri
## 5 2 medical imaging
## 6 2 radiology
## 7 2 anatomy
## 8 2 physiological processes
## 9 3 radiation therapy
## 10 3 radiotherapy
## 11 3 cancer treatment
## 12 3 malignant cells

udpipe (keywords_rake): how to link keywords to the documents they were extracted from

I am using the function keywords_rake from the udpipe package (for R) to extract keywords from a bunch of documents.
library(udpipe)
udmodel_en <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_en, x = data$text)
x <- as.data.frame(x)
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
relevant = x$xpos %in% c("NN", "JJ"), ngram_max = 2)
where data looks like this
Text
"cats are nice but dogs are better..."
"I really like dogs..."
"red flowers are pretty, especially roses..."
"once I saw a blue whale ..."
....
(each row is a separate document)
However, the output does not include the origin of the keywords; it provides a single list of keywords for all the documents.
How can I link these keywords to the documents they were taken from?
(I.e., get a list of keywords for each of the documents.)
something like this:
keywords
doc1 dog, cat, blue whale
doc2 dog
doc3 red flower, tower, Donald Trump
You can use txt_recode_ngram together with the outcome of keywords_rake to do this. The advantage is that everything is back in the original data.frame and you can then select what you need. See example below using the dataset supplied with udpipe.
Disclaimer: Code copied from jwijffels' answer in issue 41 on the github page of udpipe.
library(udpipe)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
relevant = x$xpos %in% c("NN", "JJ"), sep = "-")
head(keywords)
keyword ngram freq rake
1 openbaar-vervoer 2 19 2.391304
2 heel-fijn 2 2 2.236190
3 heel-vriendelijk 2 3 2.131092
4 herhaling-vatbaar 2 6 2.000000
5 heel-appartement 2 2 1.935450
6 steenworp-afstand 2 4 1.888889
x$term <- txt_recode_ngram(x$lemma, compound = keywords$keyword, ngram = keywords$ngram, sep = "-")
x$term <- ifelse(!x$term %in% keywords$keyword, NA, x$term)
head(x[!is.na(x$term), ])
doc_id language sentence_id token_id token lemma xpos term
67039 19991431 nl 4379 11 erg erg JJ erg-centraal
67048 19991431 nl 4379 20 leuk leuk JJ leuk-adres
67070 21054450 nl 4380 6 goede goed JJ goed-locatie
67077 21054450 nl 4380 13 Europese europees JJ europees-wijk
67272 23542577 nl 4393 84 uitstekende uitstekend JJ uitstekend-gastheer
67299 40676307 nl 4396 25 gezellige gezellig JJ gezellig-buurt
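To get the per-document keyword lists sketched in the question, you can then aggregate the non-missing terms by doc_id; a minimal sketch in base R:
found <- x[!is.na(x$term), c("doc_id", "term")]
## one comma-separated string of unique keywords per document
aggregate(term ~ doc_id, data = found,
          FUN = function(t) paste(unique(t), collapse = ", "))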

Extract n Words Around Defined Term (Multicase)

I have a vector of text strings, such as:
Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn’t good enough.Let me help you with your baggage.",
"Everyone was busy, so I went to the movie alone. Two seats were vacant.",
"TEST Rock music approaches at high velocity.",
"I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
"A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.",
"Rock music approaches at high velocity TEST.")
I would like to extract n (for example, three) words (a word being characterized by whitespace before and after its characters) AROUND (i.e., before and after) a particular term (e.g., 'TEST').
Important: several matches should be allowed (i.e., if a particular term occurs more than once, the intended solution should capture all those cases).
The result might look like this (the format can be improved):
S1 <- c(before = "the promotion, but", after = "my attendance wasn’t")
S2 <- c(before = "", after = "")
S3 <- c(before = "", after = "Rock music approaches")
S4a <- c(before = "to take your", after = "donation; any amount")
S4b <- c(before = "will be greatly", after = "appreciated.")
S5a <- c(before = "a green donkey", after = "flew a TEST")
S5b <- c(before = "TEST flew", after = "kite in the")
S6 <- c(before = "at high velocity", after = "")
How can I do this? I have already looked at other posts, but they either handle only single matches or relate to fixed sentence structures.
The quanteda package has a great function for this: kwic() (keywords in context).
Out of the box, this works pretty well on your example:
library("quanteda")
names(Sentences) <- paste0("S", seq_along(Sentences))
(kw <- kwic(Sentences, "TEST", window = 3))
#
# [S1, 9] promotion, but | TEST | my attendance wasn't
# [S3, 1] | TEST | Rock music approaches
# [S4, 7] to take your | TEST | donation; any
# [S4, 15] will be greatly | TEST | appreciated.
# [S5, 8] a green donkey | TEST | flew a TEST
# [S5, 11] TEST flew a | TEST | kite in the
# [S6, 7] at high velocity | TEST | .
(kw2 <- as.data.frame(kw)[, c("docname", "pre", "post")])
# docname pre post
# 1 S1 promotion , but my attendance wasn't
# 2 S3 Rock music approaches
# 3 S4 to take your donation ; any
# 4 S4 will be greatly appreciated .
# 5 S5 a green donkey flew a TEST
# 6 S5 TEST flew a kite in the
# 7 S6 at high velocity .
That's probably a better format than the separate objects you ask for in the question. But to get as close as possible to your target, you can further transform it as follows.
# this picks up the empty matching sentence S2
(kw3 <- merge(kw2,
data.frame(docname = names(Sentences), stringsAsFactors = FALSE),
all.y = TRUE))
# replaces the NA with the empty string
kw4 <- as.data.frame(lapply(kw3, function(x) { x[is.na(x)] <- ""; x} ),
stringsAsFactors = FALSE)
# renames pre/post to before/after
names(kw4)[2:3] <- c("before", "after")
# makes the docname unique
kw4$docname <- make.unique(kw4$docname)
kw4
# docname before after
# 1 S1 promotion , but my attendance wasn't
# 2 S2
# 3 S3 Rock music approaches
# 4 S4 to take your donation ; any
# 5 S4.1 will be greatly appreciated .
# 6 S5 a green donkey flew a TEST
# 7 S5.1 TEST flew a kite in the
# 8 S6 at high velocity .
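If you really want the separate before/after vectors from the question (S1, S4a, ...), a sketch converting each row of kw4 into a named vector:
out <- lapply(seq_len(nrow(kw4)), function(i) {
  c(before = kw4$before[i], after = kw4$after[i])
})
names(out) <- kw4$docname
out[["S4.1"]]
##            before             after
## "will be greatly"   "appreciated ."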
