I am familiar with using the tm library to create a tdm and count frequencies of terms.
But these terms are all single-word.
How can I count the number of times a multi-word phrase occurs in a document and/or corpus?
EDIT:
I am adding the code I have now to improve/clarify my post.
This is pretty standard code to build a term-document matrix:
library(tm)
cname <- ("C:/Users/George/Google Drive/R Templates/Gospels corpus")
corpus <- Corpus(DirSource(cname))
#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))
#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)
#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing = TRUE)
word.freq <- word.freq[1:100]
The problem is that this returns a matrix of single word terms, example:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
I want to be able to search for multi-word terms in the corpus instead. So for example "came from" instead of just "came" and "from" separately.
Thank you.
I created the following function for obtaining word n-grams and their corresponding frequencies:
library(tau)
library(data.table)
# given a string vector and size of ngrams this function returns word ngrams with corresponding frequencies
createNgram <- function(stringVector, ngramSize){
  ngram <- data.table()
  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)
  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  } else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
Given a string like
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the second argument.
res <- createNgram(text, 2)
printing res outputs
w1w2 freq length
1: I want 2 6
2: R text 2 6
3: This is 2 7
4: and I 2 5
5: and is 1 6
6: count the 2 9
7: example and 2 11
8: frequency of 2 12
9: is my 3 5
10: little R 2 8
11: my little 2 9
12: my of 1 5
13: of This 1 7
14: of some 2 7
15: pattern and 1 11
16: some patter 1 11
17: some pattern 1 12
18: text example 2 12
19: the frequency 2 13
20: to count 2 8
21: want to 2 7
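For trigrams the call is the same with 3 as the second argument (note that, as the function is written, the n-gram column is named w1w2 for any size other than 1):
res3 <- createNgram(text, 3)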
Given the text:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
To find the frequency of individual words:
table(strsplit(text, ' '))
- (and and count example frequency I is little my
3 1 2 2 2 2 2 3 2 3
of of). patter. pattern R some text the This to
2 1 1 1 2 2 2 2 2 2
want
2
To count how many times a pattern occurs in the text (gregexpr() returns the positions of all non-overlapping matches):
length(gregexpr('is', text)[[1]])
[1] 5
Here is a nice example with code using Tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf
The same technique can be extended to larger n values.
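The snippet below assumes a bigrams tibble with columns year and bigram already exists; a hedged sketch of how it might be built with unnest_tokens(), where headlines and headline_text are hypothetical names for the input data frame and its text column:
library(dplyr)
library(tidytext)
# hypothetical input: a data frame `headlines` with columns `year` and `headline_text`
bigrams <- headlines %>%
  unnest_tokens(bigram, headline_text, token = "ngrams", n = 2)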
bigram_tf_idf <- bigrams %>%
  count(year, bigram) %>%
  filter(n > 2) %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf.plot <- bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  filter(tf_idf > 0) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram))))

bigram_tf_idf.plot %>%
  group_by(year) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(bigram, tf_idf, fill = year)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~year, ncol = 3, scales = "free") +
  theme(text = element_text(size = 10)) +
  coord_flip()
Related
I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(text = c('This is a great movie from yesterday',
                               'I went to the movies',
                               'Great movie time at the theater',
                               'I went to the theater yesterday'))
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!
You could use the quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
features
docs great movie yesterday great_movie went theater
text1 1 1 1 1 0 0
text2 0 0 0 0 1 0
text3 1 1 0 1 0 1
text4 0 0 1 0 1 1
# use convert function to turn this into a data.frame
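# for example, a minimal sketch of that step:
convert(dfm_trim(my_dfm, min_termfreq = 2), to = "data.frame")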
Alternatively, you could use the tidytext package, tm, tokenizers, etc. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
  unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                stopwords = stopwords::stopwords()) %>%
  count(bigs) %>%
  filter(n >= 2)
bigs n
1 great 2
2 great_movie 2
3 movie 2
4 theater 2
5 went 2
6 yesterday 2
Both quanteda and tidytext have a lot of online help available. See the vignettes for both packages on CRAN.
I need to break a corpus into chunks of N words each. Say this is my corpus:
corpus <- "I need to break this corpus into chunks of ~3 words each"
One way around this problem is turning the corpus into a dataframe, tokenizing it
library(tidytext)
corpus_df <- data.frame(text = corpus)
tokens <- corpus_df %>% unnest_tokens(word, text)
and then splitting the dataframe rowwise using the code below (taken from here).
chunk <- 3
n <- nrow(tokens)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]  # chunk index for each token row
d <- split(tokens, r)                              # list of data frames, 3 words each
This works, but there must be a more direct way. Any takers?
To split a string into chunks of N words you can use tokenizers::chunk_text():
corpus <- "I need to break this corpus into chunks of ~3 words each"
library(tokenizers)
library(tidytext)
library(tibble)
corpus %>%
chunk_text(3)
[[1]]
[1] "i need to"
[[2]]
[1] "break this corpus"
[[3]]
[1] "into chunks of"
[[4]]
[1] "3 words each"
To return a data frame you can do:
corpus %>%
  chunk_text(3) %>%
  enframe(name = "group", value = "text") %>%
  unnest_tokens(word, text)
# A tibble: 12 x 2
group word
<int> <chr>
1 1 i
2 1 need
3 1 to
4 2 break
5 2 this
6 2 corpus
7 3 into
8 3 chunks
9 3 of
10 4 3
11 4 words
12 4 each
If you want these as a list of data frames of 3 separate words:
corpus %>%
  chunk_text(3) %>%
  enframe(name = "group", value = "text") %>%
  unnest_tokens(word, text) %>%
  group_split(group)
I have a dataset with measured values (txt file, whitespace separated) and some numbers stick together like this:
Currently all columns are of class "character", since after conversion the pasted-together numbers became NA. I created a routine for negative numbers, which was easy so far:
library(readr)  # for read_table2()

findandreplace <- function(file_name){
  dat <- read_table2(file_name, col_names = FALSE)
  # insert a space between a digit and a following minus sign, e.g. "12-34" -> "12 -34"
  for (n in 0:9) {
    dat <- data.frame(lapply(dat, function(x) {gsub(paste0(n, "-"), paste0(n, " -"), x)}))
  }
  # save dat as txt and read it again
}
But now, I have no idea how to separate positive values. If you want you can use this MWE:
b = c("340.9","341","316.1","336.8316.39","378.8","315","386.57317.33",NA,NA)
a =c(1,2,3,4,5,6,7,8,9)
c = data.frame(a,b)
This is how it should be:
b = c("340.9","341","316.1","336.8","316.39","378.8","315","386.57", "317.33")
a =c(1,2,3,4,5,6,7,8,9)
c = data.frame(a,b)
# insert a space before the "3xx." start of the second, pasted-on number
# (here every measurement starts with 3 and has three digits before the decimal point),
# then split each element on the space
x <- unlist(strsplit(gsub("(.*)(3(?>\\d{2}\\.))", "\\1 \\2", b, perl = TRUE), " "))
# keep only the pieces that contain a digit (drops the NAs and empty strings)
grep("\\d", x, value = TRUE)
[1] "340.9" "341" "316.1" "336.8" "316.39" "378.8" "315" "386.57" "317.33"
transform(c,b=grep("\\d",x,value = T))
a b
1 1 340.9
2 2 341
3 3 316.1
4 4 336.8
5 5 316.39
6 6 378.8
7 7 315
8 8 386.57
9 9 317.33
How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.
Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.
library(qdap)
library(tm)
library(stringr)
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ") # replace double spaces with a single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here is where the effort falters.
Not knowing Java or Python, the related questions on counting word pairs in Java and in Python did not help me, but they may be useful references for others.
Thank you.
First, modify your initial text list from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new Corpus:
v <- Corpus(VectorSource(text))
Create a bigram tokenizer function:
BigramTokenizer <- function(x) {
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "),
    use.names = FALSE
  )
}
Create your TermDocumentMatrix using the control parameter tokenize:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have your new tdm, to get your desired output, you could do:
library(dplyr)
data.frame(inspect(tdm)) %>%
  add_rownames() %>%
  mutate(total = rowSums(.[,-1])) %>%
  arrange(desc(total))
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
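Side note: in newer tm versions inspect() mainly prints and may not hand back a plain matrix, so if the data.frame(inspect(tdm)) step misbehaves, a minimal base-R sketch of the same idea is:
m <- as.matrix(tdm)
df <- data.frame(bigram = rownames(m), m, row.names = NULL, check.names = FALSE)
df$total <- rowSums(m)
head(df[order(-df$total), ], 10)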
One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens
A recursive function to extract bigrams:
bigram <- function(xs){
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
}
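A quick check of the helper on a toy token vector:
bigram(c("crude", "oil", "prices"))
# [1] "crude_oil"  "oil_prices"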
Then apply this to the crude data from the tm package. (I did some text cleaning here, but these steps depend on the text.)
res <- unlist(lapply(crude, function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)      # collapse runs of spaces to a single space
  ## after cleaning, compute bigram frequencies using table
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))
as.data.frame(tail(sort(res), 5))

                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5
The bigrams "abdul aziz" and "a futures" are the most common. You should reclean the data to remove (of, the,..). But this should be a good start.
Edit after OP comments:
If you want bigram frequencies over the whole corpus, one idea is to compute the bigrams inside the loop and then tabulate the frequencies of the combined result. I also take the opportunity to add some better text cleaning.
library(data.table)

res <- unlist(lapply(crude, function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)      # collapse runs of spaces to a single space
  ## after cleaning, extract bigrams from words longer than 2 characters
  words <- strsplit(x, " ")[[1]]
  bigrams <- bigram(words[nchar(words) > 2])
}))
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14
I have been using the tm package to run some text analysis.
My problem is with creating a list of words and their associated frequencies.
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code for generating a list of words in a frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem I face is converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors.
Is there a simple solution for this?
Try this
data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm),
                      Freq = rowSums(myTdm),
                      row.names = NULL)
head(FreqMat, 10)
# ST Freq
# 1 "(it) 1
# 2 "demand 1
# 3 "expansion 1
# 4 "for 1
# 5 "growth 1
# 6 "if 1
# 7 "is 2
# 8 "may 1
# 9 "none 2
# 10 "opec 2
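To list terms from most to least frequent, the same data frame can simply be sorted (a small addition, not part of the original answer):
head(FreqMat[order(FreqMat$Freq, decreasing = TRUE), ], 10)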
The following lines in R read a text file (.txt format), create word frequencies, and put them in a table; I hope this can help anyone interested.
avisos <- scan("anuncio.txt", what = "character", sep = "\n")
avisos1 <- tolower(avisos)
avisos2 <- strsplit(avisos1, "\\W")
avisos3 <- unlist(avisos2)
freq <- table(avisos3)
freq1 <- sort(freq, decreasing = TRUE)
temple.sorted.table <- paste(names(freq1), freq1, sep = "\t")
# note: this writes the frequency table back to the same file name, overwriting the input
cat("Word\tFREQ", temple.sorted.table, file = "anuncio.txt", sep = "\n")
Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance:
data(crude)
slam::row_sums(TermDocumentMatrix(crude))
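Building on that, a minimal sketch for turning the result into the requested word/frequency data frame while keeping the matrix sparse (which also helps with the memory errors mentioned in the question):
freqs <- slam::row_sums(TermDocumentMatrix(crude))
freq_df <- data.frame(term = names(freqs), freq = freqs, row.names = NULL)
head(freq_df[order(freq_df$freq, decreasing = TRUE), ], 10)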
Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in terms of how you handle capitalization, punctuation, and stop words:
text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'
stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about
library(tidyverse)
data_frame(text = text_string) %>%
  mutate(text = tolower(text)) %>%
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>%
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest() %>%
  count(tokens) %>%
  filter(!tokens %in% stop_words) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(n))
# A tibble: 64 x 3
tokens n freq
<chr> <int> <dbl>
1 i 5 0.0581
2 with 5 0.0581
3 is 4 0.0465
4 words 3 0.0349
5 into 2 0.0233
6 list 2 0.0233
7 of 2 0.0233
8 problem 2 0.0233
9 run 2 0.0233
10 that 2 0.0233
# ... with 54 more rows
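A note on package versions: data_frame() and unnest() without an explicit column are superseded in current tidyverse releases, so here is a sketch of the same pipeline with tibble() and unnest(tokens) (same logic, just newer idioms):
library(tidyverse)
tibble(text = text_string) %>%
  mutate(text = tolower(text),
         text = str_remove_all(text, '[[:punct:]]'),
         tokens = str_split(text, "\\s+")) %>%
  unnest(tokens) %>%
  count(tokens) %>%
  filter(!tokens %in% stop_words) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(n))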
library(plyr)  # for count()
a = scan(file = '~/Desktop//test.txt', what = "character")
a1 = data.frame(lst = a)
count(a1, vars = "lst")
seems to work to get simple frequencies. I've used scan because I had a txt file, but it should work with read.csv too.
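For what it's worth, a hedged sketch of the read.csv variant mentioned above, assuming the file has one word per line and plyr is attached for count():
a1 <- read.csv("~/Desktop//test.txt", header = FALSE, col.names = "lst")
count(a1, vars = "lst")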
Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?