Find frequency of terms from Function - r

I need to find the frequency of the terms returned by a function I wrote that finds terms with punctuation in them.
library("tm")
my.text.location <- "C:/Users/*/"
newpapers <- VCorpus(DirSource(my.text.location))
I read the files in, then define the function:
library("stringr")
punctterms <- function(x){str_extract_all(x, "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}")}
terms <- lapply(newpapers, punctterms)
Now I'm lost as to how to find the frequency of each term in each file. Do I turn it into a DTM, or is there a better way without one?
Thank you!

This task is better suited to quanteda than to tm. Your function creates a list and strips everything else out of the corpus. With quanteda you can get everything you want using its own commands.
Since you didn't provide any reproducible data, I will use a data set that comes with quanteda. Comments above the code explain what is going on. The most important function in this code is dfm_select, which accepts a diverse set of selection patterns for finding terms in the text.
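library(quanteda)
# in quanteda >= 3, textstat_frequency() lives in quanteda.textstats
library(quanteda.textstats)
# load corpus
my_corpus <- corpus(data_corpus_inaugural)
# create document features (like document term matrix);
# recent quanteda versions need tokens() before dfm()
my_dfm <- dfm(tokens(my_corpus))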
# dfm_select can use regex selections to select terms
my_dfm_punct <- dfm_select(my_dfm,
pattern = "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}",
selection = "keep",
valuetype = "regex")
# show frequency of selected terms.
head(textstat_frequency(my_dfm_punct))
          feature frequency rank docfreq group
1 fellow-citizens        39    1      19   all
2       america's        35    2      11   all
3 self-government        30    3      16   all
4         world's        24    4      15   all
5        nation's        22    5      13   all
6           god's        15    6      14   all

So I got it to work without using quanteda:
m <- as.data.frame(table(unlist(terms)))
names(m) <- c("Terms", "Frequency")
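The table above pools every file into a single count. If per-file frequencies are still needed, one possible approach is to tabulate each document's matches separately. This is only a sketch in base R: the docs vector is a made-up stand-in for the files in the corpus, and the pattern is a simplified greedy form of the one in the question.

```r
# Hypothetical stand-in for the documents; real input would come from the corpus files
docs <- c(
  doc1 = "The well-known fellow-citizens met; the nation's well-known leaders spoke.",
  doc2 = "It's the world's problem, not only the nation's."
)

# Simplified (greedy) version of the question's pattern: alnum, punctuation, alnum
pattern <- "[[:alnum:]]+[[:punct:]]+[[:alnum:]]+"

# One character vector of matched terms per document
terms <- regmatches(docs, gregexpr(pattern, docs))
names(terms) <- names(docs)

# One row per (document, term) pair with its count
per_doc <- do.call(rbind, lapply(names(terms), function(d) {
  tab <- table(terms[[d]])
  data.frame(doc = rep(d, length(tab)),
             term = names(tab),
             freq = as.vector(tab),
             stringsAsFactors = FALSE)
}))
per_doc
```

Each row of per_doc is one term in one document, so filtering on the doc column gives the frequencies for a single file.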

Related

Extract total frequency of words from vector in R

This is the vector I have:
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)"
I want a data frame as a result that contains the words and the number of times each one occurs.
So result should look something like:
word count
a 300
and 260
be 200
... ...
... ...
What I tried was to use tm:
corpus <- VCorpus(VectorSource(posts))
corpus <-tm_map(corpus, removeNumbers)
corpus <-tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)
Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) just gives me the words. I understand it's a sparse matrix, so how do I extract the words and their frequencies?
Is there an easier way to do this, maybe by not using tm at all?
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", "", posts) # remove punctuation
posts <- gsub("[[:digit:]]", "", posts) # remove numbers
# split on spaces and tabulate ("\ " is not a valid escape in an R string, so use " ")
word_counts <- as.data.frame(table(unlist(strsplit(posts, " "))))
word_counts <- word_counts[word_counts$Var1 != "", ] # drop empty strings
head(word_counts)
# Var1 Freq
# 2 a 8
# 3 about 3
# 4 allows 1
# 5 although 1
# 6 am 1
# 7 an 1
Plain R solution, assuming all words are separated by a space:
words <- strsplit(posts, " ", fixed = TRUE)
words <- unlist(words)
counts <- table(words)
names(counts) holds the words, and the values are the counts.
You might want to use gsub to get rid of (),.?: and 's, 't or 're as in your example. As in:
posts <- gsub("'t|'s|'re", "", posts)
posts <- gsub("[(),.?:]", " ", posts)
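Putting the cleaning and counting steps together, a data frame like the one asked for could be built roughly as follows. This is a sketch only: the two short strings are invented stand-ins for the real posts vector, and the column names word and count are taken from the desired output in the question.

```r
# Hypothetical short stand-ins for the real posts
posts <- c("the csm posts there. the csm's forum is internal",
           "you're right, the csm is quiet")

clean <- gsub("'t|'s|'re", "", posts)    # drop common contraction endings
clean <- gsub("[(),.?:]", " ", clean)    # turn punctuation into spaces

words <- unlist(strsplit(clean, " +"))   # split on runs of spaces
words <- words[words != ""]              # drop empty strings

counts <- table(words)
word_counts <- data.frame(word = names(counts),
                          count = as.vector(counts),
                          stringsAsFactors = FALSE)

# Most frequent words first
word_counts <- word_counts[order(-word_counts$count), ]
head(word_counts)
```

Sorting by the negated count puts the most frequent words at the top, which matches the layout sketched in the question.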
You've got two options, depending on whether you want word counts per document or across all documents.
All Documents
library(dplyr)
# as.matrix() returns the full matrix; recent tm versions of inspect() only show a sample
count <- as.data.frame(t(as.matrix(m)))
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq(length = nrow(count))
count$count <- rowSums(count[,sel_cols])
count <- count %>% select(word,count)
count <- count[order(count$count, decreasing=TRUE), ]
### RESULT of head(count)
# word count
# 140 the 14
# 144 they 10
# 4 and 9
# 25 csm 7
# 43 for 5
# 55 had 4
This should capture occurrences across all documents (by use of rowSums).
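The same totals can also be read straight off the matrix with colSums, without transposing. The sketch below uses a small hand-made matrix as a stand-in for as.matrix(m); the terms and counts in it are invented for illustration.

```r
# Hypothetical stand-in for as.matrix(m): documents in rows, terms in columns
dtm_mat <- matrix(c(2, 0, 1,
                    2, 3, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Docs = c("1", "2"),
                                  Terms = c("csm", "nda", "the")))

# Sum each term's column over all documents, largest first
totals <- sort(colSums(dtm_mat), decreasing = TRUE)

count <- data.frame(word = names(totals),
                    count = as.vector(totals),
                    stringsAsFactors = FALSE)
count
```

For a large corpus, converting the sparse matrix to a dense one can use a lot of memory, so this is best suited to small inputs like the two posts here.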
Per Document
I would suggest using the tidytext package if you want word frequency per document.
library(tidytext)
m_td <- tidy(m)
The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it supplies a list of stop words ("a", "the", "to", etc.) to exclude with dplyr::anti_join. Here, you might do
library(dplyr) # or if you want it all, `library(tidyverse)`
library(tidytext)
tibble(posts) %>%    # tibble() replaces the deprecated data_frame()
  unnest_tokens(word, posts) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
## # A tibble: 101 × 2
## word n
## <chr> <int>
## 1 csm 7
## 2 0.0 3
## 3 nda 3
## 4 bit 2
## 5 ccp 2
## 6 dominion 2
## 7 forum 2
## 8 forums 2
## 9 hard 2
## 10 internal 2
## # ... with 91 more rows

Problems with Naive Bayes

I'm trying to run Naive Bayes in R for making predictions from textual data (by building a Document Term Matrix).
I read several posts warning about terms that could be missing in both the training and the testing set, so I decided to work with only one data frame and split it afterwards. The code I'm using is this:
data <- read.csv(file="path",header=TRUE)
########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)
# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])
# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)
# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)
# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)
# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE
completecorpus <- tm_map(completecorpus,tolower)
completecorpus <- tm_map(completecorpus,PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords,stopwords("english"))
completecorpus <- tm_map(completecorpus,removePunctuation)
completecorpus <- tm_map(completecorpus,removeNumbers)
completecorpus <- tm_map(completecorpus,stripWhitespace)
# CREATE DOCUMENT TERM MATRIX
completematrix<-DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]
# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)
# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual'))
conf.matrix
The problem is that I'm getting weird results like this:
         actual
predicted   1   2   3
        1  60 833 107
        2   0   0   0
        3   0   0   0
Any idea why this is happening?
The raw data looks like this:
head(complete)
Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer. easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer. I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well
InfoType
13000 2
13001 2
13002 2
13003 3
13004 2
13005 2
Apparently the problem was that the document-term matrix was far too sparse, so I removed the sparsest terms by adding:
completematrix<-removeSparseTerms(completematrix, 0.95)
And it started working!!
         actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
Thank you all for your ideas (thank you Chelsey Hill!!)
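To make the effect of the sparsity threshold concrete, here is a base R sketch of the filtering rule as I understand it: a term is kept only if it appears in more than (1 - sparse) of the documents. The matrix, its term names, and the 0.75 threshold are made up for illustration, and this is an approximation of the idea rather than tm's actual code.

```r
# Hypothetical document-term counts: 10 documents (rows) x 3 terms (columns)
mat <- cbind(
  rare   = c(1, rep(0, 9)),        # appears in 1 of 10 documents
  medium = c(2, 1, 1, rep(0, 7)),  # appears in 3 of 10 documents
  common = rep(1, 10)              # appears in every document
)

sparse <- 0.75                     # assumed threshold for this example

# Number of documents in which each term occurs at least once
doc_count <- colSums(mat > 0)

# Keep terms present in more than (1 - sparse) of the documents (here: more than 2.5)
keep <- doc_count > nrow(mat) * (1 - sparse)
reduced <- mat[, keep, drop = FALSE]
colnames(reduced)
```

With 0.95 on the 2000 documents above, the same rule would keep only terms found in more than 100 documents, which shrinks the feature set considerably.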

How do I keep intra-word periods in unigrams? R quanteda

I would like to preserve two-letter acronyms that are separated by periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build my unigram frequency table with quanteda, the terminating period is getting truncated. Here is a small test corpus to illustrate. I have removed periods as sentence separators:
SOS This is the u.s. where our politics is crazy EOS
SOS In the US we watch a lot of t.v. aka TV EOS
SOS TV is an important part of life in the US EOS
SOS folks outside the u.s. probably don't watch so much t.v. EOS
SOS living in other countries is probably not any less crazy EOS
SOS i enjoy my sanity when it comes to visit EOS
which I load into R as character vector:
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")
Here is the code I use to build my unigram frequency table:
library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ", toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable
This produces the following:
ngram frequency
1 SOS 6
2 EOS 6
3 the 4
4 is 3
5 . 3
6 u.s 2
7 crazy 2
8 US 2
9 watch 2
10 of 2
11 t.v 2
12 TV 2
13 in 2
14 probably 2
15 This 1
16 where 1
17 our 1
18 politics 1
19 In 1
20 we 1
21 a 1
22 lot 1
23 aka 1
etc...
I would like to keep the terminal periods on t.v. and u.s. as well as eliminate the entry in the table for . with a frequency of 3.
I also don't understand why the period (.) would have a count of 3 in this table while counting the u.s and t.v unigrams correctly (2 each).
The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition for word boundaries (from the stringi package). u.s. appears as the word u.s. followed by a period . token. This is great if your name is will.i.am but maybe not so great for your purposes. But you can easily switch to the white-space tokeniser, using the argument what = "fasterword" passed to tokens(), an option available in dfm() through the ... part of the function call.
tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS" "This" "is" "the" "u.s." "where" "our" "politics" "is" "crazy" "EOS"
You can see that here, u.s. is preserved. In response to your last question, the terminal . had a document frequency of 3 because it appeared in three documents as a separate token, which is the default word tokeniser behaviour when remove_punct = FALSE.
To pass this through to dfm() and then construct your data.frame of the document frequency of the words, the following code works (I've tidied it up a bit for efficiency). Note the comment about the difference between document and term frequency - I've noted that some users are a bit confused about docfreq().
# I removed the options that were the same as the default
# note also that stopwords is not a valid argument - see the remove parameter
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")
# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
# not the same as docfreq
# dat.dfm <- sort(dat.dfm)
# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
## ngram frequency
## 1 SOS 6
## 2 EOS 6
## 3 the 4
## 4 is 3
## 5 u.s. 2
## 6 crazy 2
## 7 US 2
## 8 watch 2
## 9 of 2
## 10 t.v. 2
In my view the named vector produced by docfreq() on the dfm is a more efficient method for storing the results than your data.frame approach, but you may wish to add other variables.
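For readers unsure about the difference between term frequency and document frequency, the distinction can be sketched in base R with a plain whitespace split. This only approximates the white-space tokeniser idea and is not quanteda's implementation; acro.test is the vector from the question.

```r
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")

# Split on whitespace only, so "u.s." and "t.v." keep their final period
toks <- strsplit(acro.test, "\\s+")

# Term frequency: total occurrences across all documents
term_freq <- table(unlist(toks))

# Document frequency: number of documents containing the token at least once
doc_freq <- table(unlist(lapply(toks, unique)))

term_freq[["is"]]    # "is" occurs twice in the first document
doc_freq[["is"]]
doc_freq[["u.s."]]
```

Here "is" has a term frequency of 4 but a document frequency of 3, because the first sentence contains it twice.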

Can I check the frequencies of predetermined words or phrases in document clustering using R?

I'm doing text mining with the "tm" package in R, and I can get word frequencies after I generate a term document matrix:
freq <- colSums(as.matrix(dtm))
ord <- order(freq)
freq[head(ord)]
# abit acal access accord across acsess
# 1 1 1 1 1 1
freq[tail(ord)]
# direct save month will thank list
# 106 107 116 122 132 154
This only gives me the word frequencies as an ordered list. Can I check an individual word's frequency? Can I also check a phrase's frequency? For example, how many times does the word "thank" occur in a text corpus, or how often does the phrase "contact number" appear in it?
Many thanks for any hints and suggestions.
I'll demonstrate this with the crude data set from the tm package:
library(tm)
data(crude)
dtm <- as.matrix(DocumentTermMatrix(crude))
# find the column that contains the word "demand"
columnindices <- which(colnames(dtm) == "demand")
# how often does the word "demand" show up?
sum(dtm[, columnindices])
# [1] 6
If you want to do this with phrases, your dtm must contain those phrases, not just the bag of single words used in most cases. If that data is available, the procedure is the same as for a single word.
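One way to get phrase counts without a phrase-aware dtm is to build two-word sequences (bigrams) by hand and count those. The following is a base R sketch under stated assumptions: the three texts are invented examples, and punctuation is simply stripped before splitting on whitespace.

```r
# Hypothetical example texts
texts <- c("please leave a contact number",
           "thank you, my contact number is listed",
           "thank you")

# Lower-case, strip punctuation, split into words
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", texts)), "\\s+")

# Single-word lookup: a named table can be indexed by the word itself
freq <- table(unlist(tokens))
freq[["thank"]]

# Bigrams: pair each word with the one that follows it in the same text
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Phrase lookup
sum(bigrams == "contact number")
```

The same name-based lookup should work on the freq vector from the question, since colSums keeps the term names.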

How to get frequency of word in a sentence in R?

I have one input file which contains one paragraph. I need to find the frequency of a particular word in that paragraph.
cat file:
Text Index
train is good 1
let the train come 5
train is best 3
i m great 3
what is best 2
Code:
input<-read.table("file",sep="\t",header=TRUE)
paragraph1<-input[1][1]
word<-"train"
I need to find the frequency of the word "train" in paragraph1. How can I get it using R?
If you gave a little more info I could probably provide more info in return. Using qdap you could:
library(qdap)
dat <- readLines(n=5)
train is good 1
let the train come 5
train is best 3
i m great 3
what is best 2
dat <- do.call(rbind.data.frame, strsplit(dat, " +"))
colnames(dat) <- c("Text", "Index")
termco(dat$Text, , " train ")
## > termco(dat$Text, , " train ")
## all word.count train
## 1 all 16 3(18.75%)
You could probably do all the paragraphs at once with termco. For more on termco see this link.
A lot of this depends on what's separating paragraphs, how you're reading it in, how things are indented, etc.
The poster found the following useful:
length(gregexpr("the", "the dog ate the word the", fixed = TRUE)[[1]])
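Two caveats with that one-liner are worth sketching. gregexpr returns -1 when there is no match, so the length is 1 rather than 0, and a fixed match also counts substrings (for example "train" inside "trainer"). The helper names count_fixed and count_word below are hypothetical, and count_word assumes the word contains no regex metacharacters.

```r
# Count literal occurrences, returning 0 when there is no match
count_fixed <- function(pattern, text) {
  m <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

# Count whole-word occurrences only, using word boundaries
count_word <- function(word, text) {
  m <- gregexpr(paste0("\\b", word, "\\b"), text, perl = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

count_fixed("the", "the dog ate the word the")
count_fixed("cat", "the dog ate the word the")
count_fixed("train", "train is good, let the trainer train")
count_word("train", "train is good, let the trainer train")

# Applied to the Text column from the question, one count per row
text_col <- c("train is good", "let the train come", "train is best",
              "i m great", "what is best")
per_row <- vapply(text_col, function(x) count_word("train", x), integer(1))
sum(per_row)
```

Summing the per-row counts gives the total for the whole paragraph, while per_row itself shows which lines contain the word.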

Resources