I have been using the tm package to run some text analysis.
My problem is creating a list of words together with their frequencies.
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code to generate a list of words within a frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this so that we get a data frame with all words and their frequencies?
The other problem I face is converting the term-document matrix into a data frame. Because I am working with large samples of data, I run into memory errors.
Is there a simple solution for this?
Try this
data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm),
Freq = rowSums(myTdm),
row.names = NULL)
head(FreqMat, 10)
# ST Freq
# 1 "(it) 1
# 2 "demand 1
# 3 "expansion 1
# 4 "for 1
# 5 "growth 1
# 6 "if 1
# 7 "is 2
# 8 "may 1
# 9 "none 2
# 10 "opec 2
The following lines of R create word frequencies and put them in a table. They read a text file in .txt format and count how often each word occurs. I hope this helps anyone interested.
avisos<- scan("anuncio.txt", what="character", sep="\n")
avisos1 <- tolower(avisos)
avisos2 <- strsplit(avisos1, "\\W")
avisos3 <- unlist(avisos2)
freq<-table(avisos3)
freq1<-sort(freq, decreasing=TRUE)
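temple.sorted.table<-paste(names(freq1), freq1, sep="\t") # "\t" is a real tab; "\\t" would write a literal backslash-t
# note: this writes to the same file that was read above, so the input is overwritten
cat("Word\tFREQ", temple.sorted.table, file="anuncio.txt", sep="\n")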
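If the goal is a data frame rather than a file, the same base R steps can stop one step earlier. This is only a sketch on a made-up sentence; the names text, words and freq_df are illustrative and not part of the answer above.

```r
# A short made-up text stands in for the contents of the .txt file
text <- "the cat sat on the mat and the dog sat too"

# Lower-case, split on anything that is not a letter, drop empty pieces
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nzchar(words)]

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

# Keep the result as a data frame instead of writing it to disk
freq_df <- data.frame(word = names(freq),
                      freq = as.integer(freq),
                      stringsAsFactors = FALSE)
head(freq_df)
```

Splitting on "[^a-z]+" is a deliberately crude tokenizer; accented characters or digits would need a different pattern.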
Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance:
data(crude)
slam::row_sums(TermDocumentMatrix(crude))
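Building on the row_sums idea, the totals can be turned into a data frame without calling as.matrix on the whole term-document matrix, which is usually the step that exhausts memory on a large corpus. A minimal sketch, assuming the tm and slam packages are installed; term_totals and freq_df are illustrative names.

```r
library(tm)

data("crude")
tdm <- TermDocumentMatrix(crude)

# Sum each term's counts on the sparse representation (no dense matrix is built)
term_totals <- slam::row_sums(tdm)

# One row per term, most frequent first
freq_df <- data.frame(term = names(term_totals),
                      freq = as.vector(term_totals),
                      row.names = NULL,
                      stringsAsFactors = FALSE)
freq_df <- freq_df[order(-freq_df$freq), ]
head(freq_df, 10)
```

Only the vector of totals is materialized, so memory use grows with the number of terms rather than with terms times documents.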
Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in terms of how you handle capitalization, punctuation, and stop words:
text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'
stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about
library(tidyverse)
tibble(text = text_string) %>% # tibble() replaces the deprecated data_frame()
  mutate(text = tolower(text)) %>%
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>%
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest(tokens) %>% # name the list column explicitly
  count(tokens) %>%
  filter(!tokens %in% stop_words) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(n))
# A tibble: 64 x 3
tokens n freq
<chr> <int> <dbl>
1 i 5 0.0581
2 with 5 0.0581
3 is 4 0.0465
4 words 3 0.0349
5 into 2 0.0233
6 list 2 0.0233
7 of 2 0.0233
8 problem 2 0.0233
9 run 2 0.0233
10 that 2 0.0233
# ... with 54 more rows
library(plyr) # count(df, vars = ...) is plyr's count(), not dplyr's
a = scan(file='~/Desktop/test.txt',what="list")
a1 = data.frame(lst=a)
count(a1,vars="lst")
This seems to work for getting simple frequencies. I've used scan because I had a txt file, but it should work with read.csv too.
Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?
I ran into an issue using the unnest_tokens function on a data_frame. I am working with pdf files I want to compare.
text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text("c:/.../text1.pdf")
text1df<- data_frame(Zeile = 1:25,
text_raw)
So far so good. But here is my problem:
unnest_tokens(output = token, input = content) -> text1_long
Error: Must extract column with a single valid subscript.
x Subscript var has the wrong type function.
i It must be numeric or character.
I want to tokenize my pdf files so I can analyse the word frequencies and maybe compare multiple pdf files on wordclouds.
Here is a piece of simple code. I kept your German words so you can copy paste everything.
library(pdftools)
library(dplyr)
library(stringr)
library(tidytext)
file_location <- "d:/.../my_doc.pdf"
text_raw <- pdf_text(file_location)
# Zeile runs to 12 because my document only has 12 pages
text1df <- tibble(Zeile = 1:12, # tibble() replaces the deprecated data_frame()
                  text_raw)
text1df_long <- unnest_tokens(text1df , output = wort, input = text_raw ) %>%
filter(str_detect(wort, "[a-z]"))
text1df_long
# A tibble: 4,134 x 2
Zeile wort
<int> <chr>
1 1 training
2 1 and
3 1 development
4 1 policy
5 1 contents
6 1 policy
7 1 statement
8 1 scope
9 1 induction
10 1 training
# ... with 4,124 more rows
I need to break a corpus into chunks of N words each. Say this is my corpus:
corpus <- "I need to break this corpus into chunks of ~3 words each"
One way around this problem is turning the corpus into a dataframe, tokenizing it
library(dplyr) # for %>%
library(tidytext)
corpus_df <- data.frame(text = corpus) # as.data.frame(text = corpus) errors because no object is passed to convert
tokens <- corpus_df %>% unnest_tokens(word, text)
and then splitting the dataframe rowwise using the code below (taken from here).
chunk <- 3
n <- nrow(tokens)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(tokens,r)
This works, but there must be a more direct way. Any takes?
To split a string into chunks of N words you can use tokenizers::chunk_text():
corpus <- "I need to break this corpus into chunks of ~3 words each"
library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr) # for %>% and group_split()
corpus %>%
chunk_text(3)
[[1]]
[1] "i need to"
[[2]]
[1] "break this corpus"
[[3]]
[1] "into chunks of"
[[4]]
[1] "3 words each"
To return a data frame you can do:
corpus %>%
chunk_text(3) %>%
enframe(name = "group", value = "text") %>%
unnest_tokens(word, text)
# A tibble: 12 x 2
group word
<int> <chr>
1 1 i
2 1 need
3 1 to
4 2 break
5 2 this
6 2 corpus
7 3 into
8 3 chunks
9 3 of
10 4 3
11 4 words
12 4 each
If you want these as a list of data frames of 3 separate words:
corpus %>%
chunk_text(3) %>%
enframe(name = "group", value = "text") %>%
unnest_tokens(word, text) %>%
group_split(group)
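For comparison, the same chunking can be sketched in base R without any tokenizer package. This is an illustrative alternative, not what chunk_text() does internally: it splits on whitespace only, so case and the "~" are kept. The names words, groups and chunks are made up for the example.

```r
corpus <- "I need to break this corpus into chunks of ~3 words each"

# Split on whitespace into individual words
words <- strsplit(corpus, "\\s+")[[1]]

# Assign each word to a group of `chunk` consecutive words
chunk <- 3
groups <- ceiling(seq_along(words) / chunk)

# Paste the words of each group back together
chunks <- unname(vapply(split(words, groups), paste, character(1), collapse = " "))
chunks
```

If the number of words is not a multiple of chunk, the last element simply holds the remaining words.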
Here is the reprex of a tally I have:
word_tally <- data.frame(scarred = c(1,1,0,0,0,0,0,0,0,0,0,0,0),
happy = c(0,0,1,0,0,0,0,0,0,0,0,0,0),
cheerful = c(0,0,0,1,0,0,0,0,0,0,0,0,0),
mad = c(0,0,0,0,1,1,1,1,1,0,0,0,0),
curious = c(0,0,0,0,0,0,0,0,0,1,1,1,1))
To make a word cloud, it seems I need one column with all the words. How could I transform the data frame above into that structure?
Using rep and colSums:
words <- rep(names(word_tally), colSums(word_tally))
words
[1] "scarred" "scarred" "happy" "cheerful" "mad"
[6] "mad" "mad" "mad" "mad" "curious"
[11] "curious" "curious" "curious"
Or, since the frequencies are just the column sums, pass the word names and their sums to wordcloud directly:
library(wordcloud)
wordcloud(names(word_tally), freq=colSums(word_tally), min.freq = 1)
You could get the data in long format and remove rows where value = 0.
library(dplyr)
tidyr::pivot_longer(word_tally, cols = everything(), names_to = "word") %>%
filter(value != 0) %>%
select(word)
# A tibble: 13 x 1
# word
# <chr>
# 1 scarred
# 2 scarred
# 3 happy
# 4 cheerful
# 5 mad
# 6 mad
# 7 mad
# 8 mad
# 9 mad
#10 curious
#11 curious
#12 curious
#13 curious
This would give all the words in one column which can be used as input for wordcloud.
In base R, another way could be:
names(word_tally)[which(word_tally != 0, arr.ind = TRUE)[,2]]
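The two shapes discussed in these answers, one row per word with a count versus one entry per occurrence, can be converted into each other in a couple of lines. A small sketch on a shortened, made-up tally; freq_df and words are illustrative names.

```r
# A shortened, made-up version of the tally: one column per word, 0/1 per row
word_tally <- data.frame(scarred = c(1, 1, 0, 0),
                         happy   = c(0, 0, 1, 0),
                         mad     = c(0, 0, 0, 1))

# Shape 1: one row per word with its total count
freq_df <- data.frame(word = names(word_tally),
                      freq = unname(colSums(word_tally)),
                      stringsAsFactors = FALSE)

# Shape 2: one entry per occurrence, rebuilt from the counts
words <- rep(freq_df$word, freq_df$freq)

freq_df
words
```

The first shape is usually the more compact one to keep around, since the second can always be regenerated from it.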
I am familiar with using the tm library to create a tdm and count frequencies of terms.
But these terms are all single-word.
How can I count the number of times a multi-word phrase occurs in a document and/or corpus?
EDIT:
I am adding the code I have now to improve/clarify my post.
This is pretty standard code to build a term-document matrix:
library(tm)
cname <- ("C:/Users/George/Google Drive/R Templates/Gospels corpus")
corpus <- Corpus(DirSource(cname))
#Cleaning
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))
#convert to a plain text file
corpus <- tm_map(corpus, PlainTextDocument)
#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
word.freq <- sort(rowSums(m1), decreasing=T)
word.freq<-word.freq[1:100]
The problem is that this returns a matrix of single-word terms, for example:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
I want to be able to search for multi-word terms in the corpus instead. So for example "came from" instead of just "came" and "from" separately.
Thank you.
I created the following function to obtain word n-grams and their corresponding frequencies:
library(tau)
library(data.table)
# given a string vector and the n-gram size, this function returns word n-grams with their frequencies
createNgram <- function(stringVector, ngramSize){
  ngram <- data.table()
  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)
  if(ngramSize == 1){
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
Given a string like
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
Here is how to call the function for pairs of words; for phrases of length 3, pass 3 as the second argument.
res <- createNgram(text, 2)
Printing res outputs:
w1w2 freq length
1: I want 2 6
2: R text 2 6
3: This is 2 7
4: and I 2 5
5: and is 1 6
6: count the 2 9
7: example and 2 11
8: frequency of 2 12
9: is my 3 5
10: little R 2 8
11: my little 2 9
12: my of 1 5
13: of This 1 7
14: of some 2 7
15: pattern and 1 11
16: some patter 1 11
17: some pattern 1 12
18: text example 2 12
19: the frequency 2 13
20: to count 2 8
21: want to 2 7
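If installing tau is not an option, a similar count can be sketched in base R. This is an illustrative alternative and not how textcnt works internally: it lower-cases the text and treats every non-letter as a separator. count_ngrams is a made-up helper name.

```r
# Count word n-grams in one or more strings using base R only
count_ngrams <- function(text, n = 2) {
  # Lower-case, split on non-letters, drop empty pieces
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) < n) {
    return(data.frame(ngram = character(0), freq = integer(0)))
  }
  # Paste together each run of n consecutive words
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  tab <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(tab), freq = as.integer(tab), stringsAsFactors = FALSE)
}

res <- count_ngrams("the oil price and the oil market", 2)
res
```

Because all strings are pooled before counting, n-grams can span the boundary between two input strings; call the function once per document if that matters.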
Given the text:
text <- "This is my little R text example and I want to count the frequency of some pattern (and - is - my - of). This is my little R text example and I want to count the frequency of some patter."
To find the frequency of words:
table(strsplit(text, ' '))
- (and and count example frequency I is little my
3 1 2 2 2 2 2 3 2 3
of of). patter. pattern R some text the This to
2 1 1 1 2 2 2 2 2 2
want
2
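For the frequency of a pattern (here the whole word "is"; regexpr's match.length would only report the length of the first match, not a count):
lengths(regmatches(text, gregexpr("\\bis\\b", text, perl = TRUE)))
[1] 3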
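The same idea extends to a multi-word phrase counted per document. A minimal sketch with a made-up helper, count_phrase, and two made-up documents; it matches the phrase literally (fixed = TRUE), so it is case-sensitive and also counts occurrences inside longer words.

```r
# Count non-overlapping literal occurrences of `phrase` in each element of `text`
count_phrase <- function(text, phrase) {
  hits <- gregexpr(phrase, text, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

docs <- c("came from the east and came from the west",
          "he came and went")
count_phrase(docs, "came from")
```

Summing the result gives the count over the whole corpus.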
Here is a nice example with code using Tidytext: https://www.kaggle.com/therohk/news-headline-bigrams-frequency-vs-tf-idf
The same technique can be extended to larger n values.
bigram_tf_idf <- bigrams %>%
count(year, bigram) %>%
filter(n > 2) %>%
bind_tf_idf(bigram, year, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf.plot <- bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
filter(tf_idf > 0) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram))))
bigram_tf_idf.plot %>%
group_by(year) %>%
top_n(10) %>%
ungroup %>%
ggplot(aes(bigram, tf_idf, fill = year)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~year, ncol = 3, scales = "free") +
theme(text = element_text(size = 10)) +
coord_flip()
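The snippet above assumes a bigrams data frame already exists. One way it could be built with tidytext is sketched below; the headlines tibble and its column names are made up for illustration and are not taken from the linked notebook.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per headline, with the year it was published
headlines <- tibble(
  year = c(2019, 2019, 2020),
  headline_text = c("oil prices rise again",
                    "oil prices fall",
                    "crude oil prices steady")
)

# One row per bigram; n = 3 would give trigrams instead
bigrams <- headlines %>%
  unnest_tokens(bigram, headline_text, token = "ngrams", n = 2)

# Bigram counts per year, most frequent first
bigram_counts <- bigrams %>%
  count(year, bigram, sort = TRUE)
bigram_counts
```

From here, bind_tf_idf(bigram, year, n) can be applied to bigram_counts as in the code above.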
How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.
Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.
library(qdap)
library(tm)
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here is where the effort falters.
Not knowing Java or Python, I did not find these questions helpful ("Java count word pairs", "Python count word pairs"), but they may be useful references for others.
Thank you.
First, modify your initial text list from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new Corpus:
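v <- VCorpus(VectorSource(text)) # VCorpus: in recent tm versions Corpus() returns a SimpleCorpus, for which a custom tokenizer is ignored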
Create a bigram tokenizer function:
BigramTokenizer <- function(x) {
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "),
    use.names = FALSE
  )
}
Create your TermDocumentMatrix using the control parameter tokenize:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have your new tdm, to get your desired output, you could do:
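library(dplyr)
data.frame(as.matrix(tdm)) %>% # as.matrix() keeps every term; inspect() only prints a sample in recent tm versions
  tibble::rownames_to_column("rowname") %>% # replaces the deprecated add_rownames()
  mutate(total = rowSums(.[,-1])) %>%
  arrange(desc(total))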
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens
A recursive function to extract bigrams:
bigram <-
  function(xs){
    if (length(xs) >= 2)
      c(paste(xs[seq(2)],collapse='_'),bigram(tail(xs,-1)))
  }
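The recursion calls itself once per word, so on long documents it may run into R's nesting limit. A vectorized sketch that produces the same bigrams pairs each word with its successor in one step; bigram_vec is a made-up name for this illustration.

```r
# Vectorized bigram extraction: pair every word with the word that follows it
bigram_vec <- function(xs) {
  if (length(xs) < 2) return(character(0))
  paste(head(xs, -1), tail(xs, -1), sep = "_")
}

bigram_vec(c("crude", "oil", "prices", "fell"))
```

It can be dropped in wherever bigram() is called, with the one difference that it returns character(0) rather than NULL for inputs shorter than two words.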
Then apply this to the crude data from the tm package. (I did some text cleaning here, but these steps depend on the text.)
res <- unlist(lapply(crude,function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub(' +',' ',x) # collapse runs of spaces into a single space
  ## after cleaning, compute frequencies using table
  freqs <- table(bigram(strsplit(x," ")[[1]]))
  freqs[freqs>1]
}))
as.data.frame(tail(sort(res),5))
tail(sort(res), 5)
reut-00022.xml.hold_a 3
reut-00022.xml.in_the 3
reut-00011.xml.of_the 4
reut-00022.xml.a_futures 4
reut-00010.xml.abdul_aziz 5
The bigrams "abdul aziz" and "a futures" are the most common. You should clean the data further to remove words such as "of" and "the", but this should be a good start.
Edit after the OP's comments:
If you want bigram frequencies over the whole corpus, one idea is to compute the bigrams in the loop and then compute the frequencies on the combined result. I also took the opportunity to add better text cleaning.
library(data.table) # for setDT()
res <- unlist(lapply(crude,function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words=c("the","of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub(' +',' ',x) # collapse runs of spaces into a single space
  ## after cleaning, build bigrams from the words longer than two characters
  words <- strsplit(x," ")[[1]]
  bigrams <- bigram(words[nchar(words)>2])
}))
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14