Extract and count common word-pairs from character vector

How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.
Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.
library(qdap)
library(tm)
library(stringr)
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here is where the effort falters.
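For reference, part of the problem is that the pattern above is a literal string, so the contents of freq.terms.pat are never used. A minimal sketch of building the alternation into the regex (assuming the text and freq.terms.pat objects created above):
# build "(?:word1|word2|...) (?:word1|word2|...)" so both words of a pair are captured
pat <- sprintf("(?:%s) (?:%s)", freq.terms.pat, freq.terms.pat)
pairs <- unlist(str_extract_all(text, pat))
as.data.frame(table(Pairs = pairs), responseName = "Count")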
Not knowing Java or Python, these related questions (Java count word pairs, Python count word pairs) did not help me, but they may be useful references for others.
Thank you.

First, modify your initial text list from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new Corpus:
v <- Corpus(VectorSource(text))
Create a bigram tokenizer function:
# ngrams() and words() come from the NLP package (loaded with tm)
BigramTokenizer <- function(x) {
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "),
    use.names = FALSE
  )
}
Create your TermDocumentMatrix using the control parameter tokenize:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have your new tdm, to get your desired output, you could do:
library(dplyr)
data.frame(inspect(tdm)) %>%
  add_rownames() %>%
  mutate(total = rowSums(.[, -1])) %>%
  arrange(desc(total))
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
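Note that add_rownames() has since been deprecated in dplyr. An equivalent sketch with current dplyr/tibble, using as.matrix() instead of inspect() (which also prints the matrix):
library(dplyr)
library(tibble)
as.data.frame(as.matrix(tdm)) %>%
  rownames_to_column("bigram") %>%
  mutate(total = rowSums(across(where(is.numeric)))) %>%
  arrange(desc(total))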

One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens.
A recursive function to extract bigrams:
bigram <- function(xs) {
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
}
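A quick check of the helper on a toy token vector:
bigram(c("crude", "oil", "prices"))
# [1] "crude_oil"  "oil_prices"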
Then apply this to the crude data from the tm package. (I did some text cleaning here, but this step depends on the text.)
res <- unlist(lapply(crude, function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)
  ## after cleaning, compute frequencies using table
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))
as.data.frame(tail(sort(res), 5))
#                           tail(sort(res), 5)
# reut-00022.xml.hold_a                      3
# reut-00022.xml.in_the                      3
# reut-00011.xml.of_the                      4
# reut-00022.xml.a_futures                   4
# reut-00010.xml.abdul_aziz                  5
The bigrams "abdul aziz" and "a futures" are the most common. You should reclean the data to remove (of, the,..). But this should be a good start.
edit after OP comments :
In case you want to get bigrams-frequency over all the corpus , on idea is to compute the bigrams in the loop and then compute the frequency for the loop result. I profit to add better text processing-cleanings.
res <- unlist(lapply(crude, function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)
  ## after cleaning, split into words and keep only words longer than 2 characters
  words <- strsplit(x, " ")[[1]]
  bigram(words[nchar(words) > 2])
}))
library(data.table)
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14
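If you want the exact Pairs/Count layout the question asks for, a small follow-up sketch (assuming xx from above is already a data.table):
library(data.table)
ans <- setnames(copy(xx), c("res", "Freq"), c("Pairs", "Count"))[order(-Count)]
head(ans)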

Related

Word length manipulation in data frame

I'm printing a data frame that should list word, length, and frequency for any simple text document. I have everything set up, but 1) length isn't counting the number of characters (I'm not sure what it's actually counting), and 2) I need to reorder the word list from longest word to shortest for a final print of the list.
file <- c(scan("a.txt",character()))
file <- as.data.frame(table(file))
Freq <- file$Freq
Word <- file$file
Len <- sapply(c(Word),nchar)
A plain a.txt file with the following:
the the the bus ran over two two people and when
prints
Word Len Freq
1 and 1 1
2 bus 1 1
3 over 1 1
4 people 1 1
5 ran 1 1
6 the 1 3
7 two 1 2
8 when 1 1
Len should be the number of characters, but here it always counts 1; in longer tests it sometimes says 2, so I'm not sure what it counts. After this, it prints:
[1] and bus over people ran the two when
Levels: and bus over people ran the two when
I'm trying to get the full word print to be in order from longest to shortest. I should be able to use Len to sort the words but I can't seem to get sapply to work right.
You can convert file$file (which is a factor here) to a string using as.character() and count its characters with a plain nchar() call, without sapply(), since nchar() is vectorized.
file <- c(scan("a.txt",character()))
file <- as.data.frame(table(file))
Freq <- file$Freq
Word <- as.character(file$file)
Len <- nchar(Word)
x <- data.frame(Word, Len, Freq)
print(x)
print(Word[order(Len, decreasing = T)])
Ordering is done with order().
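order() returns the permutation of indices that sorts its argument, for example:
Len <- c(3, 3, 4, 6)
order(Len, decreasing = TRUE)
# [1] 4 3 1 2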
Results:
print(x)
# Word Len Freq
# 1 and 3 1
# 2 bus 3 1
# 3 over 4 1
# 4 people 6 1
# 5 ran 3 1
# 6 the 3 3
# 7 two 3 2
# 8 when 4 1
print(Word[order(Len, decreasing = T)])
# [1] "people" "over" "when" "and" "bus" "ran" "the" "two"
With text produced by Lorem Ipsum, this sequence of instructions does what the question asks for.
Word <- scan(file = 'a.txt', what = character())
Word <- gsub('[[:punct:]]', '', Word) # remove punctuation characters
Word <- tolower(Word) # all characters lower case
tbl <- table(Word) # now get their frequencies
Len <- nchar(names(tbl)) # the words are the table's names
x <- as.data.frame(tbl) # to data.frame
x$Len <- Len # assign the lengths column
The data is now in lexicographic order. If the class of x$Word is "factor", use argument stringsAsFactors = FALSE in the call to as.data.frame.
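That is:
x <- as.data.frame(tbl, stringsAsFactors = FALSE) # keeps Word as character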
Finally, order by Len and assign new row numbers.
x <- x[order(x$Len, decreasing = TRUE), ]
row.names(x) <- NULL
head(x)
# Word Freq Len
#1 sollicitudin 3 12
#2 pellentesque 4 12
#3 ullamcorper 5 11
#4 suspendisse 1 11
#5 scelerisque 2 11
#6 consectetur 2 11
I don't have your data, but you could probably do something like this. The $ operator extracts data by name, so file$Freq gets the column Freq from the data.frame file.
file$Len <- nchar(file$file)
x <- file[,c('file', 'Len', 'Freq')]
names(x) <- c('Word', 'Len', 'Freq')
length() calculates the length of a vector. For example:
x <- c("apple", "pie", "math", "this is sentance")
x
[1] "apple" "pie" "math" "this is sentance"
length(x)
[1] 4
x is a character vector of length 4 (it has 4 elements). If you want to calculate the number of characters in each element of a character vector, use nchar():
nchar(x)
[1] 5 3 4 16
As you can see, nchar() is vectorized: it calculates the number of characters (not only letters) for each element of a character vector.

Remove part of a string based on overlapping patterns

I have the following data:
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short"),
some_other_cols = c(1, 2, 2))
Further, I have the following vector of patterns:
my_patterns <- c("my example", "is my", "my other text")
What I want to achieve is to remove any text of my_patterns that occurs in dat$x.
I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution is not able to detect the occurrence of the second (here: "is my") or third pattern anymore.
Wrong solution:
library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")
dat_new <- dat %>%
mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
I guess I could do something like looping through all patterns, collecting the string positions in dat$x that match my patterns, then combining them into a range and deleting that range from the text. E.g. I add columns to my dat data frame like start_pattern_1 and end_pattern_1 and so on. So for row 1 I get 9 (start) and 18 (end) for the first pattern, and 6/10 for the second pattern. I then need to check if any end position overlaps with any start position (here start 9 and end 10), combine them into a range 6-18, and remove this range from the text.
Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.
So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.
Desired Output of dat would be:
x some_other_cols short_x
this is my example text 1 this text
and here is my other text example 2 and here example
my other text is short 2 is short
Appreciate your help! Thanks.
A new option with str_locate_all, mentioned by Uwe in a comment under the question, greatly simplifies the code:
library(stringr)
# Create a function to remove matching parts of the text
# First argument is the text, second argument is a matrix of start/end positions
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text, "")[[1]]
  lapply(1:nrow(positions), function(x) { ret[positions[x, 1]:positions[x, 2]] <<- NA })
  paste0(ret[!is.na(ret)], collapse = "")
}
# Loop over the data to apply the patterns
# (one matrix of start/end positions per row of dat)
matches <- lapply(dat$x, function(x) {
  do.call(rbind, str_locate_all(x, my_patterns)) # transform the list output of str_locate_all into a table of start/end positions
})
# Avoid growing a vector in a for loop; create it beforehand, the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[[i]])
}
If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:
> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this text" "and here example" " is short"
The idea is to create the pattern with optional parts (the ? after the grouping parentheses).
So we have roughly:
(is )? <= optional "is" followed by space
my <= literal "my" followed by space
(other text|example)? <= Optional text after "my ", either "other text" or (the |) "example"
If you don't have control, things get messy. I hope I've commented enough for it to be understandable; given the number of loops involved, don't expect it to be quick:
# Given data
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short", "yet another text"),
                  some_other_cols = c(1, 2, 2, 4))
my_patterns <- c("my example", "is my", "my other text")
# Create a function to remove matching parts of the text
# First argument is the text, second argument is a list of start/end vectors (or NULLs)
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text, "")[[1]]
  lapply(positions, function(x) { if (!is.null(x)) ret[x[1]:x[2]] <<- NA })
  paste0(ret[!is.na(ret)], collapse = "")
}
# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of character strings
match_pat_to_vector <- function(pattern, vector) {
  sapply(regexec(pattern, vector),
         function(x) {
           if (x > -1) {
             c(start = as.numeric(x), end = as.numeric(x + attr(x, "match.length"))) # start/end vector from the index and length of the match
           }
         })
}
# Loop over the patterns to create a matrix of matches
# (rows = length of the vector, columns = number of patterns)
matches <- sapply(my_patterns, match_pat_to_vector, vector = dat$x)
# Avoid growing a vector in a for loop; create it beforehand, the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[i, ])
}
Result after run:
> dat
x some_other_cols result
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
There are two crucial points here:
The patterns to remove from a string may overlap
There may be multiple non-overlapping patterns to remove from the string
The solution below tries to address both issues using my favorite tools:
library(data.table)
setDT(dat)[, rn := .I] # add row numbers to join on later
library(stringr)
library(magrittr) # piping used to improve readability
pos <-
  # find start and end positions for each pattern
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
           lapply(as.data.table) %>%
           rbindlist(idcol = "rn")) %>%
  rbindlist() %>%
  # collapse overlapping positions
  setorder(rn, start, end) %>%
  .[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>%
  .[, .(start = min(start), end = max(end)), by = .(rn, grp)]
Now, pos has become:
rn grp start end
1: 1 1 6 18
2: 2 1 10 25
3: 3 1 1 13
4: 5 1 6 10
5: 5 2 24 28
6: 6 1 1 13
7: 6 2 15 27
8: 7 1 3 7
9: 8 1 1 10
10: 8 2 12 16
11: 8 3 22 34
12: 9 1 1 10
13: 9 2 19 31
# remove patterns from strings from back to front
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
  # update join
  dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][                      # remove row number
  , short_x := str_squish(short_x)][]   # remove whitespace
x some_other_cols short_x
1: this is my example text 1 this text
2: and here is my other text example 2 and here example
3: my other text is short 2 is short
4: yet another text 4 yet another text
5: this is my text where 'is my' appears twice 5 this text where '' appears twice
6: my other text is my example 6
7: This myself 7 Thself
8: my example is my not my other text 8 not
9: my example is not my other text 9 is not
The code to collapse overlapping positions is modified from this answer.
The intermediate result
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
         lapply(as.data.table) %>%
         rbindlist(idcol = "rn"))
[[1]]
rn start end
1: 1 9 18
2: 6 18 27
3: 8 1 10
4: 9 1 10
[[2]]
rn start end
1: 1 6 10
2: 2 10 14
3: 5 6 10
4: 5 24 28
5: 6 15 19
6: 7 3 7
7: 8 12 16
[[3]]
rn start end
1: 2 13 25
2: 3 1 13
3: 6 1 13
4: 8 22 34
5: 9 19 31
shows that patterns 1 and 2 overlap in row 1 and patterns 2 and 3 overlap in row 2. Rows 5, 8, and 9 have non-overlapping patterns. Row 7 is to show that patterns are extracted regardless of word boundaries.
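To see how the collapsing step groups overlapping matches, take row 1 above, where "is my" covers positions 6-10 and "my example" covers positions 9-18. A toy illustration of the cumsum()/cummax() trick on just those two intervals (c(0, head(end, -1)) mimics data.table's shift(end, fill = 0)):
start <- c(6, 9)
end <- c(10, 18)
# an interval starts a new group only if it begins after every previous end
cummax(c(0, head(end, -1))) < start
# [1]  TRUE FALSE
cumsum(cummax(c(0, head(end, -1))) < start) # both intervals fall into grp 1
# [1] 1 1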
EDIT: dplyr version
The OP has mentioned that he/she has "successfully avoided data.table so far". So, I felt challenged to add a dplyr version:
library(dplyr)
library(stringr)
pos <-
  # find start and end positions for each pattern
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
           lapply(as_tibble) %>%
           bind_rows(.id = "rn")) %>%
  bind_rows() %>%
  # collapse overlapping positions
  arrange(rn, start, end) %>%
  group_by(rn) %>%
  mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>%
  group_by(rn, grp) %>%
  summarize(start = min(start), end = max(end))
# remove patterns from strings from back to front
dat <- dat %>%
  mutate(rn = row_number() %>% as.character(),
         short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
  dat <- dat %>%
    left_join(pos %>% filter(grp == g), by = "rn") %>%
    mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>%
    select(-grp, -start, -end)
}
# remove row number
dat %>%
  select(-rn) %>%
  mutate(short_x = str_squish(short_x))
x some_other_cols short_x
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
5 this is my text where 'is my' appears twice 5 this text where '' appears twice
6 my other text is my example 6
7 This is myself 7 This self
8 my example is my not my other text 8 not
9 my example is not my other text 9 is not
The algorithm is essentially the same. However, there are two challenges here where dplyr differs from data.table:
dplyr requires explicit coercion from factor to character
there is no update join available in dplyr, so the for loop has become more verbose than the data.table counterpart (Perhaps, someone knows a fancy purrr function or a map-reduce trick to accomplish the same?)
EDIT 2
There are some bug fixes and improvements to the code above:
Collapsing positions has been corrected to work also for some edge case I have added to dat.
seq() has been replaced by seq_len().
str_squish() reduces repeated whitespace inside a string and removes whitespace from start and end of a string.
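For reference, str_squish() collapses internal runs of whitespace and trims both ends:
stringr::str_squish("  and here   example  ")
# [1] "and here example"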
Data
I have added some use cases to test for non-overlapping patterns and complete removal, e.g.:
dat <- data.frame(
x = c(
"this is my example text",
"and here is my other text example",
"my other text is short",
"yet another text",
"this is my text where 'is my' appears twice",
"my other text is my example",
"This myself",
"my example is my not my other text",
"my example is not my other text"
),
some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")

How to get all possible 2 words combinations with their frequency without tm package

I have a text like this:
dat<-c("this is my farm this is my land")
I would like to get all possible two-word combinations with their frequencies.
I can't use tm package so any other solution will be appreciated.
The output should be something like this:
two words freq
this is 2
is my 2
my farm 1
my land 1
The combinations can be generated by splitting dat and then extracting the consecutive two-word combinations. Then gregexpr can be used to count the appearances.
temp = unlist(strsplit(dat, " "))
temp2 = unique(sapply(2:length(temp), function(i)
  paste(temp[(i-1):i], collapse = " ")))
sapply(temp2, function(x)
  length(unlist(gregexpr(pattern = x, text = dat))))
# this is is my my farm farm this my land
# 2 2 1 1 1
Or for three word combinations
temp = unlist(strsplit(dat, " "))
temp2 = unique(sapply(3:length(temp), function(i)
  paste(temp[(i-2):i], collapse = " ")))
sapply(temp2, function(x)
  length(unlist(gregexpr(pattern = x, text = dat))))
# this is my is my farm my farm this farm this is is my land
# 2 1 1 1 1
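A variant of the same idea tabulates the adjacent pairs directly with table(), so repeated pairs are counted without re-matching via gregexpr (a sketch using the same dat):
w <- unlist(strsplit(dat, " "))
pairs <- paste(head(w, -1), tail(w, -1))
as.data.frame(table(two_words = pairs), responseName = "freq")
#   two_words freq
# 1 farm this    1
# 2     is my    2
# 3   my farm    1
# 4   my land    1
# 5   this is    2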

Text Mining - Count Frequencies of Phrases (more than one word)

I am familiar with using the tm library to create a tdm and count frequencies of terms.
But these terms are all single words.
How can I count the number of times a multi-word phrase occurs in a document and/or corpus?
EDIT:
I am adding the code I have now to improve/clarify my post.
This is pretty standard code to build a term-document matrix:
library(tm)
cname <- ("C:/Users/George/Google Drive/R Templates/Gospels corpus")
corpus <- Corpus(DirSource(cname))
#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))
#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)
#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing=T)
word.freq<-word.freq[1:100]
The problem is that this returns a matrix of single word terms, example:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
I want to be able to search for multi-word terms in the corpus instead. So for example "came from" instead of just "came" and "from" separately.
Thank you.
I created the following function for obtaining word n-grams and their corresponding frequencies:
library(tau)
library(data.table)
# given a string vector and size of ngrams this function returns word ngrams with corresponding frequencies
createNgram <- function(stringVector, ngramSize){
  ngram <- data.table()
  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)
  if (ngramSize == 1) {
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
Given a string like
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the argument:
res <- createNgram(text, 2)
Printing res outputs:
w1w2 freq length
1: I want 2 6
2: R text 2 6
3: This is 2 7
4: and I 2 5
5: and is 1 6
6: count the 2 9
7: example and 2 11
8: frequency of 2 12
9: is my 3 5
10: little R 2 8
11: my little 2 9
12: my of 1 5
13: of This 1 7
14: of some 2 7
15: pattern and 1 11
16: some patter 1 11
17: some pattern 1 12
18: text example 2 12
19: the frequency 2 13
20: to count 2 8
21: want to 2 7
Given the text:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
To find the frequency of words:
table(strsplit(text, ' '))
- (and and count example frequency I is little my
3 1 2 2 2 2 2 3 2 3
of of). patter. pattern R some text the This to
2 1 1 1 2 2 2 2 2 2
want
2
For the frequency of a pattern:
attr(regexpr('is', text), "match.length")
[1] 3
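If the goal is to count how many times a pattern occurs, note that regexpr() only reports the first match; gregexpr() reports all of them, so a counting sketch would be:
m <- gregexpr("is my", text)[[1]]
sum(m > 0) # number of occurrences of "is my" (2 for this text)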
Here is a nice example with code using Tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf
The same technique can be extended to larger n values.
bigram_tf_idf <- bigrams %>%
  count(year, bigram) %>%
  filter(n > 2) %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))
bigram_tf_idf.plot <- bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  filter(tf_idf > 0) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram))))
bigram_tf_idf.plot %>%
  group_by(year) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(bigram, tf_idf, fill = year)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~year, ncol = 3, scales = "free") +
  theme(text = element_text(size = 10)) +
  coord_flip()

list of word frequencies using R

I have been using the tm package to run some text analysis.
My problem is with creating a list of words and their associated frequencies.
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code for generating a list of words in a frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that I face is with converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors.
Is there a simple solution for this?
Try this
data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm),
Freq = rowSums(myTdm),
row.names = NULL)
head(FreqMat, 10)
# ST Freq
# 1 "(it) 1
# 2 "demand 1
# 3 "expansion 1
# 4 "for 1
# 5 "growth 1
# 6 "if 1
# 7 "is 2
# 8 "may 1
# 9 "none 2
# 10 "opec 2
The following lines in R can help create word frequencies and put them in a table. The code reads a text file in .txt format and creates the word frequencies; I hope this helps anyone interested.
avisos<- scan("anuncio.txt", what="character", sep="\n")
avisos1 <- tolower(avisos)
avisos2 <- strsplit(avisos1, "\\W")
avisos3 <- unlist(avisos2)
freq<-table(avisos3)
freq1<-sort(freq, decreasing=TRUE)
temple.sorted.table<-paste(names(freq1), freq1, sep="\t")
cat("Word\tFREQ", temple.sorted.table, file="anuncio.txt", sep="\n")
Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance:
data(crude)
slam::row_sums(TermDocumentMatrix(crude))
Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in terms of how you handle capitalization, punctuation, and stop words:
text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'
stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about
library(tidyverse)
data_frame(text = text_string) %>%
  mutate(text = tolower(text)) %>%
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>%
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest() %>%
  count(tokens) %>%
  filter(!tokens %in% stop_words) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(n))
# A tibble: 64 x 3
tokens n freq
<chr> <int> <dbl>
1 i 5 0.0581
2 with 5 0.0581
3 is 4 0.0465
4 words 3 0.0349
5 into 2 0.0233
6 list 2 0.0233
7 of 2 0.0233
8 problem 2 0.0233
9 run 2 0.0233
10 that 2 0.0233
# ... with 54 more rows
library(plyr) # count(data, vars = ...) used below comes from plyr
a = scan(file='~/Desktop//test.txt', what="list")
a1 = data.frame(lst=a)
count(a1, vars="lst")
seems to work to get simple frequencies. I've used scan because I had a txt file, but it should work with read.csv too.
Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?
