Here is the code I use to create bi-grams with a frequency list:
library(tm)
library(RWeka)
#data <- myData[,2]
tdm.generate <- function(string, ng){
  # tutorial on RWeka - http://tm.r-forge.r-project.org/faq.html
  corpus <- Corpus(VectorSource(string)) # create corpus for tm processing
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  # corpus <- tm_map(corpus, removeWords, stopwords("english"))
  options(mc.cores = 1) # http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng)) # create n-grams
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)) # build the tdm from n-grams
  tdm
}
source("GenerateTDM.R") # generatetdm function in appendix
tdm <- tdm.generate("The book The book The greatest The book", 2)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
hist(as.numeric(topwords), breaks = 10)
head(sort(topwords, decreasing = TRUE))
The result for the above code is:
     the     book greatest
       4        3        1
Instead, I'm looking for the result where bi-grams are shown like:
"the book" "book the"
3 2
What needs to be changed in the above code to get the output as above?
You need to use VCorpus instead of Corpus. I was having the same issue; you can check for more details here.
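For what it's worth: in current tm, Corpus() on a VectorSource returns a SimpleCorpus, and TermDocumentMatrix ignores a custom tokenize control for that class, while VCorpus honours it. A minimal sketch of the change inside tdm.generate(), everything else staying the same:
corpus <- VCorpus(VectorSource(string)) # was: Corpus(VectorSource(string))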
I have code like this:
nf <- read.csv("test2.csv") # test2.csv contains 79 rows (document names) and one column with the text of each document
library(tm)
corpus <- Corpus(VectorSource(nf$segment))
corpus <- tm_map(corpus, content_transformer(tolower)) # wrap tolower so the corpus class is preserved
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "shall", "will", "can", "could"))
corpus <- tm_map(corpus, stemDocument, language = "english")
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(td.mat)) # transpose so distances are between documents
ft <- hclust(dist.mat, method = "ward.D2")
plot(ft)
my dendrogram
I have a cluster dendrogram of the documents. If I cut it at height = 50, how can I get the terms at that level?
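One way to get there (a sketch, assuming the ft and td.mat objects built above): cut the tree with cutree() and collect, for each resulting cluster, the terms occurring in its documents.
groups <- cutree(ft, h = 50) # cluster membership of each document at height 50
# for each cluster, the terms appearing in at least one of its documents
lapply(split(seq_along(groups), groups), function(docs) {
  rownames(td.mat)[rowSums(td.mat[, docs, drop = FALSE]) > 0]
})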
I know this has been asked multiple times. For example
Finding 2 & 3 word Phrases Using R TM Package
However, none of these solutions works with my data: the result is always unigrams, no matter which n (2, 3, or 4) I choose for the n-grams.
Does anybody know why? I suspect the encoding is the reason.
Edit: a small part of the data:
comments <- c("Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into problem_70918\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into tm-247\n",
"Merge branch 'php5.3-upgrade-sprint6-7' of git.internal.net:/git/pn-project/LegacyCodebase into release2012.08\n",
"Merge remote-tracking branch 'dmann1/p71148-s3-callplan_mapping' into lcst-operational-changes\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into TASK-360148\n",
"Merge remote-tracking branch 'grockett/rpr-pre' into rpr-lite\n"
)
cleanCorpus <- function(vector){
  corpus <- Corpus(VectorSource(vector), readerControl = list(language = "en_US"))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, tolower)
  #corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  #corpus <- tm_map(corpus, PlainTextDocument)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  return(corpus)
}
# this function was provided by a team member (in the link I posted above)
test <- function(keywords_doc){
  BigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  # create the document matrix
  keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
  # remove sparse terms
  keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.99)
  # frequency of the words appearing
  keyword.freq <- rowSums(as.matrix(keywords_naremoval))
  subsetkeyword.freq <- subset(keyword.freq, keyword.freq >= 20)
  frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
  # sort the words by frequency
  frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
  frequentKeywordSubsetDF <- frequentKeywordSubsetDF[order(-frequentKeywordSubsetDF$freq), ]
  frequentKeywordDF <- frequentKeywordDF[order(-frequentKeywordDF$freq), ]
  # plot the words
  # wordcloud(frequentKeywordDF$term, freq = frequentKeywordDF$freq, random.order = FALSE, rot.per = 0.35, scale = c(5, 0.5), min.freq = 30, colors = brewer.pal(8, "Dark2"))
  return(frequentKeywordDF)
}
corpus <- cleanCorpus(comments)
t <- test(corpus)
> head(t)
             term freq
added       added    6
html         html    6
tracking tracking    6
common     common    4
emails     emails    4
template template    4
Thanks,
I haven't found the reason either, but if you are only interested in the counts, regardless of which documents the bigrams occurred in, you could get them via this alternative pipeline:
library(tm)
library(dplyr)
library(quanteda)
# ... construct the corpus as in your post ...
corpus %>%
  unlist() %>%
  tokens() %>%
  tokens_ngrams(2:2, concatenator = " ") %>%
  unlist() %>%
  as.data.frame() %>%
  group_by_(".") %>%
  summarize(cnt = n()) %>%
  arrange(desc(cnt))
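If the tm corpus itself turns out to be the obstacle, a self-contained sketch that skips tm entirely (assuming the raw comments vector from the question):
library(quanteda)
toks <- tokens(comments, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(tokens_tolower(toks), stopwords("en"))
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
sort(colSums(dfm(bigrams)), decreasing = TRUE) # bigram counts across all documents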
I have created a corpus and processed it using the tm package; a snippet is below:
cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  corpus.tmp <- tm_map(corpus.tmp, stemDocument)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  return(corpus.tmp)
}
myCorpus <- Corpus(VectorSource(Data$body), readerControl = list(reader = readPlain))
cln.corpus <- cleanCorpus(myCorpus)
Now I am using the MPQA lexicon to get the total number of positive and negative words in each document of the corpus, so I have the two lists:
pos.words <- lexicon$word[lexicon$Polarity=="positive"]
neg.words <- lexicon$word[lexicon$Polarity=="negative"]
How should I go about comparing the content of each document with the positive and negative lists and getting both counts per document?
I checked other posts on tm dictionaries, but it looks like that feature has been withdrawn.
For example
library(tm)
data("crude")
myCorpus <- crude[1:2]
pos.words <- c("advantag", "easy", "cut")
neg.words <- c("problem", "weak", "uncertain")
weightSenti <- structure(function(m) {
  m$v <- rep(1, length(m$v))              # binary weight for every stored entry
  neg <- rownames(m)[m$i] %in% neg.words  # entries whose term is a negative word
  m$v[neg] <- -1                          # flip their sign
  attr(m, "weighting") <- c("binarySenti", "binSenti")
  m
}, class = c("WeightFunction", "function"), name = "binarySenti", acronym = "binSenti")
tdm <- TermDocumentMatrix(myCorpus, control = list(weighting = weightSenti,
                                                   dictionary = c(pos.words, neg.words)))
colSums(as.matrix(tdm))
# 127 144
#   2  -2
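If you only need the two raw counts per document rather than a signed sum, a simpler sketch (reusing cln.corpus, pos.words and neg.words from the question) builds one dictionary-restricted matrix per list. Note that because cln.corpus is stemmed, the lexicon entries have to be stemmed the same way before they will match.
tdm.pos <- TermDocumentMatrix(cln.corpus, control = list(dictionary = pos.words))
tdm.neg <- TermDocumentMatrix(cln.corpus, control = list(dictionary = neg.words))
data.frame(doc      = colnames(tdm.pos),
           positive = colSums(as.matrix(tdm.pos)), # positive-word hits per document
           negative = colSums(as.matrix(tdm.neg))) # negative-word hits per document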
I have a large corpus of Russian text. When I build a wordcloud, I see some characters like 'ч' are not rendered. The code looks like this:
library(tm)
library(wordcloud)
library(RColorBrewer)
dat <- read.csv("news.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
pal2 <- brewer.pal(8, "Dark2")
png("wordcloud.png", width = 640, height = 640)
wordcloud(d$word, d$freq, scale = c(8, .2), min.freq = 5, max.words = 200,
          random.order = FALSE, rot.per = 0, colors = pal2)
dev.off()
EDIT
Oh, I did it myself. I just added one line of code to do the trick:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
[from the OP's own edit, repeated here to complete the question and answer]
You need to add the following line along with the other tm_map() calls:
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8')
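A sketch of the placement (assuming dat$Article is cp1251-encoded, as in the question); converting right after building the corpus keeps mis-encoded text out of all later transformations:
corpus <- Corpus(VectorSource(dat$Article),
                 readerControl = list(reader = readPlain, language = "ru"))
corpus <- tm_map(corpus, iconv, 'cp1251', 'UTF-8') # convert first ...
corpus <- tm_map(corpus, removePunctuation)        # ... then transform as before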
I need to create a similarity matrix from a document-term matrix in order to perform maximum-capturing clustering on documents. So far I have only found solutions for distance matrices. I tried the dist method, but it gives me the wrong output. Is there a way to create similarity matrices in R? I used the tm package for the following code, but I am not restricted to it; if there is another good package, let me know. The code so far:
install.packages("tm")
install.packages("rJava")
install.packages("Snowball")
install.packages("RWeka")
install.packages("RWekajars")
install.packages("XML")
install.packages("openNLP")
install.packages("openNLPmodels.en")
Sys.setenv(NOAWT=TRUE)
library(XML)
library(rJava)
library(Snowball)
library(RWeka)
library(tm)
library(openNLP)
library(openNLPmodels.en)
sample <- c(
  "cc ee aa",
  "dd bb ee",
  "bb cc ee dd",
  "cc ee dd aa",
  "bb ee",
  "cc dd aa",
  "bb cc aa",
  "bb cc",
  "cc ee dd"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tmTagPOS)
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
# need to create similarity matrix here
dist(dtm, method = "manhattan", diag = FALSE, upper = FALSE)
The output for the given sample should look like the similarity matrix defined by:
if (i < j)
    a[i][j] = sim[i][j]
else
    a[i][j] = 0
i.e., only the strict upper triangle holds similarity values; the diagonal and lower triangle are zero.
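dist() only yields distances, but a matrix in that form can be built directly. A minimal sketch using cosine similarity (one common choice; the question does not pin down a measure) on the dtm created above:
m <- as.matrix(dtm)                        # DocumentTermMatrix: documents in rows
norms <- sqrt(rowSums(m^2))                # Euclidean norm of each document vector
sim <- (m %*% t(m)) / (norms %*% t(norms)) # cosine similarity for every document pair
sim[lower.tri(sim, diag = TRUE)] <- 0      # zero the diagonal and lower triangle, per the definition
sim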