Keep EXACT words from R corpus

From an answer posted on "Keep document ID with R corpus" by @MrFlick:
I am trying to slightly modify what is a great example.
Question: How do I modify the content_transformer function to keep only exact words? You can see in the inspect output that wonderful is counted as wonder and rationale is counted as ratio. I do not have a strong understanding of gregexpr and regmatches.
Create the data frame:
dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE
)
Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader
library(tm)
myReader <- readTabular(mapping = list(content = "text", id = "id"))
Here we specify the columns to use for the contents and the id in the data.frame. Now we read it in with DataframeSource, but use our custom reader.
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
Now if we want to only keep a certain set of words, we can create our own content_transformer function. One way to do this is
keepOnlyWords <- content_transformer(function(x, words) {
  regmatches(x,
             gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x),
             invert = TRUE) <- " "
  x
})
This will replace everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like
keep <- c("wonder", "then", "that", "the")
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
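To sanity-check the transformation before building the matrix, you can peek at the content of a single document (the exact output will depend on your keep list), e.g.:
content(tm[[1]])  # transformed text of the first document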
Create the document-term matrix and inspect it:
> dtm <- DocumentTermMatrix(tm)
> inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity : 56%
Maximal term length: 6
Weighting : term frequency (tf)
     Terms
Docs ratio that the wonder
  10     0    1   1      1
  11     0    1   0      0
  12     0    0   1      0
  13     1    0   1      0

Switching grammars to tidytext, your current transformation would be
library(tidyverse)
library(tidytext)
library(stringr)
dd %>% unnest_tokens(word, text) %>%
mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
inner_join(data_frame(word = keep))
## id word
## 1 10 wonder
## 2 10 the
## 3 10 that
## 4 11 that
## 5 12 the
## 6 12 the
## 7 13 the
Keeping exact matches is easier, as you can use joins (which use ==) instead of regex:
dd %>% unnest_tokens(word, text) %>%
inner_join(data_frame(word = keep))
## id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13 the
To take it back to a document-term matrix,
library(tm)
dd %>% mutate(id = factor(id)) %>% # to keep empty rows of DTM
unnest_tokens(word, text) %>%
inner_join(data_frame(word = keep)) %>%
mutate(i = 1) %>%
cast_dtm(id, word, i) %>%
inspect()
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
##     Terms
## Docs then that the
##   10    1    1   0
##   11    0    1   0
##   12    0    0   0
##   13    0    0   1
Currently, your function is matching words with a boundary before or after. To change it to before and after, change the collapse parameter to include boundaries:
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
keepOnlyWords <- content_transformer(function(x, words) {
  regmatches(x,
             gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x),
             invert = TRUE) <- " "
  x
})
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
inspect(DocumentTermMatrix(tm))
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
##     Terms
## Docs that the then
##   10    1   0    1
##   11    1   0    0
##   12    0   0    0
##   13    0   1    0

I got the same result as @alistaire with tm, using the following modified line in the keepOnlyWords content transformer first defined by @BEMR:
gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x)
There was a misplaced ")" in the gregexpr first specified by @BEMR, i.e. it should be ")\\b", not "\\b)".
I think the above gregexpr is equivalent to the one specified by @alistaire:
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
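A quick way to convince yourself that the two patterns behave the same is to compare their matches on a small sample string (the string below is only an illustration):
keep <- c("wonder", "then", "that", "the")
x <- "no wonderful, then, that ever"
pat1 <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")
pat2 <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")
regmatches(x, gregexpr(pat1, x))
regmatches(x, gregexpr(pat2, x))
# both should return the same whole-word matches ("then", "that")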

Related

I am wondering how to make a term document matrix of keywords consisting of several words in R

Is there a way to make a data frame like this into a term document matrix? Each keyword consists of two or more words.
Example data
Data type is a data frame.
doc_id text
1 c('cat dog', 'cat rat')
2 c('cat dog')
3 c('cat rat')
Desired result
I want to get the result below. The existing TermDocumentMatrix function does not handle multi-word keywords.
         Docs
Terms     1 2 3
  cat dog 1 1 0
  cat rat 1 0 1
Using tidyr and tidytext you first unnest the list column before replacing the white space with _ (you can use something else but _ is usually used to represent n-grams). This way the words are not separated when producing the tdm:
library(dplyr)
library(tidyr)
library(tidytext)
library(stringr)
# bring toy data into useful form
df <- tibble::tribble(
~doc_id, ~text,
1, c('cat dog', 'cat rat'),
2, c('cat dog'),
3, c('cat rat')
)
tdm <- df %>%
unnest(text) %>%
mutate(text = str_replace(text, "\\s+", "_")) %>% # replace whitespace
unnest_tokens(word, text) %>%
count(word, doc_id) %>%
cast_tdm(word, doc_id, n)
tdm
#> <<TermDocumentMatrix (terms: 2, documents: 3)>>
#> Non-/sparse entries: 4/2
#> Sparsity : 33%
#> Maximal term length: 7
#> Weighting : term frequency (tf)
To display it as a regular matrix:
tdm %>%
as.matrix()
#> Docs
#> Terms 1 2 3
#> cat_dog 1 1 0
#> cat_rat 1 0 1
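If you prefer the original space-separated keyword labels in the final output, one option (just a sketch) is to rename the terms after converting to a regular matrix:
m <- as.matrix(tdm)
rownames(m) <- str_replace(rownames(m), "_", " ")
m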

R - text mining with the tm package

I want to use the tm package, so I have created the following code:
x<-inspect(DocumentTermMatrix(docs,
list(dictionary = c("survive", "survival"))))
I need to find any word beginning with "surviv" in the text, so as to include words like "survival", "survivor", "survive" and others. Is there any way to write that condition - words beginning with "surviv" - in the code?
What you could do is create a general TermDocumentMatrix and then filter it, using startsWith, to keep only the rows whose terms start with "surviv".
corp <- VCorpus(VectorSource(c("survival", "survivance", "survival",
"random", "yes", "survive")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm[startsWith(rownames(tdm), "surv"),])
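The same idea works with a DocumentTermMatrix, where the terms are the columns rather than the rows (a small sketch on the same toy corpus):
dtm <- DocumentTermMatrix(corp)
inspect(dtm[, startsWith(colnames(dtm), "surviv")])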
You can stem the words with stemDocument. Then you only need to look for "surviv" and "survivor", as these are the stems you are looking for. Using and expanding the list of words from @AshOfFire:
my_corpus <- VCorpus(VectorSource(c("survival", "survivance", "survival",
"random", "yes", "survive", "survivors", "surviving")))
my_corpus <- tm_map(my_corpus, stemDocument)
my_dtm <- DocumentTermMatrix(my_corpus, control = list(dictionary = c("surviv", "survivor")))
inspect(my_dtm)
<<DocumentTermMatrix (documents: 8, terms: 2)>>
Non-/sparse entries: 6/10
Sparsity : 62%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
    Terms
Docs surviv survivor
   1      1        0
   2      1        0
   3      1        0
   4      0        0
   5      0        0
   6      1        0
   7      0        1
   8      1        0
P.S.: only do x <- inspect(DocumentTermMatrix(docs, .....)) if you want to get just the first 10 rows and 10 columns in your x variable.
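If you want the full matrix rather than the preview that inspect() prints, you can convert it explicitly, e.g.:
m <- as.matrix(my_dtm)  # full document-term matrix as a plain matrix
m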

Text Mining - Count Frequencies of Phrases (more than one word)

I am familiar with using the tm library to create a tdm and count frequencies of terms.
But these terms are all single-word.
How can I count the number of times a multi-word phrase occurs in a document and/or corpus?
EDIT:
I am adding the code I have now to improve/clarify my post.
This is pretty standard code to build a term-document matrix:
library(tm)
cname <- ("C:/Users/George/Google Drive/R Templates/Gospels corpus")
corpus <- Corpus(DirSource(cname))
#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))
#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)
#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing=T)
word.freq<-word.freq[1:100]
The problem is that this returns a matrix of single word terms, example:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
I want to be able to search for multi-word terms in the corpus instead. So for example "came from" instead of just "came" and "from" separately.
Thank you.
I created the following function for obtaining word n-grams and their corresponding frequencies:
library(tau)
library(data.table)
# given a string vector and size of ngrams this function returns word ngrams with corresponding frequencies
createNgram <- function(stringVector, ngramSize){
  ngram <- data.table()
  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)
  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  } else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
Given a string like
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the argument:
res <- createNgram(text, 2)
Printing res outputs:
w1w2 freq length
1: I want 2 6
2: R text 2 6
3: This is 2 7
4: and I 2 5
5: and is 1 6
6: count the 2 9
7: example and 2 11
8: frequency of 2 12
9: is my 3 5
10: little R 2 8
11: my little 2 9
12: my of 1 5
13: of This 1 7
14: of some 2 7
15: pattern and 1 11
16: some patter 1 11
17: some pattern 1 12
18: text example 2 12
19: the frequency 2 13
20: to count 2 8
21: want to 2 7
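For trigrams the call is the same, only with 3 as the size argument, e.g.:
res3 <- createNgram(text, 3)
head(res3)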
Given the text:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
To find the frequency of individual words:
table(strsplit(text, ' '))
   -   (and   and   count   example   frequency   I   is   little   my
   3      1     2       2         2           2   2    3        2    3
  of   of).   patter.   pattern   R   some   text   the   This   to
   2      1         1         1   2      2      2     2      2    2
want
   2
For frequency of a pattern:
attr(regexpr('is', text), "match.length")
[1] 3
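Note that regexpr() only reports the first match. If the goal is to count how many times a multi-word pattern occurs, a base-R sketch is:
hits <- gregexpr("is my", text, fixed = TRUE)[[1]]
sum(hits > 0)  # gregexpr() returns -1 when there is no match, so count only positive positions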
Here is a nice example with code using Tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf
The same technique can be extended to larger n values.
bigram_tf_idf <- bigrams %>%
count(year, bigram) %>%
filter(n > 2) %>%
bind_tf_idf(bigram, year, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf.plot <- bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
filter(tf_idf > 0) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram))))
bigram_tf_idf.plot %>%
group_by(year) %>%
top_n(10) %>%
ungroup %>%
ggplot(aes(bigram, tf_idf, fill = year)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~year, ncol = 3, scales = "free") +
theme(text = element_text(size = 10)) +
coord_flip()
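The snippet above assumes a bigrams data frame with year and bigram columns; a minimal sketch of building one with tidytext (the headlines data frame and its text column are placeholders, not part of the linked example) is:
library(dplyr)
library(tidytext)
# headlines is assumed to have columns year and text
bigrams <- headlines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)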

"Bag of characters" n-grams in R

I would like to create a Term document matrix containing character n-grams. For example, take the following sentence:
"In this paper, we focus on a different but simple text representation."
Character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, etc.
I have used the RWeka package to work with "bag of words" n-grams, but I'm having difficulty adapting tokenizers such as the one below to work with characters:
BigramTokenizer <- function(x){
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
tdm_bigram <- TermDocumentMatrix(corpus,
                                 control = list(tokenize = BigramTokenizer,
                                                wordLengths = c(2, Inf)))
Any thoughts on how to use RWeka or another package to create character n-grams?
I find quanteda quite useful:
library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
tokens <- tokenize(gsub("\\s", "_", txts), "character", ngrams=4L, conc="")
dfm <- dfm(tokens)
tdm <- as.TermDocumentMatrix(t(dfm), weighting=weightTf)
as.matrix(tdm)
# Docs
# Terms text1 text2
# In_t 1 1
# n_th 1 1
# _thi 1 2
# this 1 2
# his_ 1 1
# is_p 1 0
# s_pa 1 0
# _pap 1 0
# pape 1 0
# aper 1 0
# per. 1 0
# is_l 0 1
# s_li 0 1
# _lin 0 1
# line 0 1
# ines 0 1
# nes_ 0 1
# es_t 0 1
# s_th 0 1
# his. 0 1
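Note that tokenize() has since been replaced by tokens() in quanteda; under a current release, roughly the same result should be obtainable along these lines (treat the exact arguments as assumptions to check against your installed version):
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
toks <- tokens(gsub("\\s", "_", txts), what = "character")
toks <- tokens_ngrams(toks, n = 4, concatenator = "")
dfmat <- dfm(toks)
convert(dfmat, to = "tm")  # a tm document-term matrix, if you need one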
You need to use the CharacterNGramTokenizer instead.
The NGramTokenizer splits on characters like spaces.
##########
### the following lines are mainly a one-to-one copy from RWeka.
### Only the hardcoded CharacterNGramTokenizer is new
library(rJava)
library(RWeka)  # for Weka_control used below

CharacterNGramTokenizer <- structure(function (x, control = NULL)
{
    tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
    x <- Filter(nzchar, as.character(x))
    if (!length(x))
        return(character())
    .jcall("RWekaInterfaces", "[S", "tokenize",
           .jcast(tokenizer, "weka/core/tokenizers/Tokenizer"),
           .jarray(as.character(control)),
           .jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"),
   meta = structure(list(name = "weka/core/tokenizers/NGramTokenizer",
                         kind = "R_Weka_tokenizer_interface", class = "character",
                         init = NULL),
                    .Names = c("name", "kind", "class", "init")))
### copy till here
###################

BigramTokenizer <- function(x){
  CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))
}
Sadly, it is not included in RWeka by default. However, if you want to use Weka, this seems to be a workable way to get character n-grams.

Extract and count common word-pairs from character vector

How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.
Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.
library(qdap)
library(tm)
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here is where the effort falters.
Not knowing Java or Python, the related questions "Java count word pairs" and "Python count word pairs" did not help, but they may be useful references for others.
Thank you.
First, modify your initial text list from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new Corpus:
v <- Corpus(VectorSource(text))
Create a bigram tokenizer function:
BigramTokenizer <- function(x) {
unlist(
lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE
)
}
Create your TermDocumentMatrix using the control parameter tokenize:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have your new tdm, to get your desired output, you could do:
library(dplyr)
data.frame(inspect(tdm)) %>%
add_rownames() %>%
mutate(total = rowSums(.[,-1])) %>%
arrange(desc(total))
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens.
A recursive function to extract bigram :
bigram <-
function(xs){
if (length(xs) >= 2)
c(paste(xs[seq(2)],collapse='_'),bigram(tail(xs,-1)))
}
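A quick sanity check on a toy vector:
bigram(c("crude", "oil", "prices"))
# [1] "crude_oil"  "oil_prices"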
Then apply this to the crude data from the tm package. (I did some text cleaning here, but these steps depend on the text.)
res <- unlist(lapply(crude, function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)
  ## after cleaning, compute frequencies using table
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))
as.data.frame(tail(sort(res), 5))
#                           tail(sort(res), 5)
# reut-00022.xml.hold_a                      3
# reut-00022.xml.in_the                      3
# reut-00011.xml.of_the                      4
# reut-00022.xml.a_futures                   4
# reut-00010.xml.abdul_aziz                  5
The bigrams "abdul aziz" and "a futures" are the most common. You should re-clean the data to remove stopwords (of, the, ...). But this should be a good start.
Edit, after OP comments:
In case you want to get bigram frequencies over the whole corpus, one idea is to compute the bigrams in the loop and then compute the frequencies of the combined result. I take the opportunity to add some better text cleaning:
res <- unlist(lapply(crude, function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)
  ## after cleaning, extract the bigrams
  words <- strsplit(x, " ")[[1]]
  bigrams <- bigram(words[nchar(words) > 2])
}))
library(data.table)
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14
