R Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

I am creating a DocumentTermMatrix with the following code. I have no problem creating the matrix, but when I try to remove sparse terms or find frequent terms, I get an error.
text <- c("Since I love to travel, this is what I rely on every time.",
          "I got this card for the no international transaction fee",
          "I got this card mainly for the flight perks",
          "Very good card, easy application process",
          "The customer service is outstanding!")
library(tm)
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
dtm <- as.matrix(DocumentTermMatrix(corpus))
Here is the result:
Docs application card customer easy every ... etc.
   1           0    0        0    1     0
   2           0    1        0    0     1
   3           0    1        0    0     0
   4           1    1        0    0     0
   5           0    0        1    0     0
Here is where I get the error, using either removeSparseTerms or findFreqTerms:
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)
Result:
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

removeSparseTerms and findFreqTerms expect a DocumentTermMatrix or a TermDocumentMatrix object, not a plain matrix; the as.matrix() call strips the class those functions check for.
Create the DocumentTermMatrix without converting to a matrix and you won't get the error.
dtm <- DocumentTermMatrix(corpus)
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)
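If you need an ordinary matrix for viewing or further math, convert only after the tm operations are done. A minimal sketch, assuming the dtm above:

sparse <- removeSparseTerms(dtm, 0.80)
sparse_m <- as.matrix(sparse)  # coerce for display only, after filtering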

Related

Stopwords Remaining in Corpus After Cleaning

I am attempting to remove the stopword "the" from my corpus; however, not all instances are being removed.
library(RCurl)
library(tm)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
tm_map(shakespeare, content_transformer(tolower))
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
The first inspect call reveals:
      Terms
Docs    the thee
   1  11665  752
   2  11198  660
   3   4866  382
And the second, after cleaning:
      Terms
Docs   the thee
   1  1916 1298
   2  1711 1140
   3   760  740
What am I missing here about the logic of removeWords that it would ignore all these instances of "the"?
EDIT
I was able to get the instances of "the" down to below 1000 by slightly changing the call and making the removeWords call the very first cleaning step:
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the","The"))
Which gets me down to:
Docs  the thee
   1  145  752
   2  130  660
   3   71  382
Still though, I'd like to know why I can't seem to eliminate them all.
Here is reproducible code that leads to 0 instances of "the". I fixed your typo and used your code from before the edit.
library(RCurl)
library(tm)
library(SnowballC)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
leads to:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 6/0
Sparsity           : 0%
Maximal term length: 4
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs    the thee
   1  11665  752
   2  11198  660
   3   4866  382
and after cleaning with the typo fixed (the original tolower line never assigned its result back to shakespeare, so the corpus was never actually lowercased, and the case-sensitive removeWords skipped every capitalized "The"):
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
shakespeare <- tm_map(shakespeare, content_transformer(tolower)) ## FIXED TYPO: assign the result
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"), "the"))
list <- inspect(
  DocumentTermMatrix(shakespeare, list(dictionary = c("the", "thee")))
)
it leads to:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 3/3
Sparsity           : 50%
Maximal term length: 4
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs  the thee
   1    0 1298
   2    0 1140
   3    0  740
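The fix works because removeWords matches case-sensitively: once the lowercasing result is actually assigned back, every "the" is caught. If you would rather not depend on the order of cleaning steps, a hedged alternative is a custom transformer that strips the word case-insensitively (removeTheAnyCase is an illustrative name, not a tm function):

removeTheAnyCase <- content_transformer(function(x) gsub("\\bthe\\b", " ", x, ignore.case = TRUE))
shakespeare <- tm_map(shakespeare, removeTheAnyCase)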

How to remove words in corpus that start with $ in R?

I am trying to do preprocessing on a corpus in R, and I need to remove the words that start with $. The code below removes the $ but not the $words; I am puzzled.
inspect(data.corpus1[1:2])
# <<SimpleCorpus>>
# Metadata: corpus specific: 1, document level (indexed): 0
# Content: documents: 2
#
# [1] $rprx loading mid .60's, think potential. 12m vol fri already 11m today
# [2] members report success see track record $itek $rprx $nete $cnet $zn $cwbr $inpx
removePunctWords <- function(x) {
  gsub(pattern = "\\$", "", x)
}
data.corpus1 <- tm_map(data.corpus1,
                       content_transformer(removePunctWords))
inspect(data.corpus1[1:2])
# <<SimpleCorpus>>
# Metadata: corpus specific: 1, document level (indexed): 0
# Content: documents: 2
#
# [1] rprx loading mid .60's, think potential. 12m vol fri already 11m today
# [2] members report success see track record itek rprx nete cnet zn cwbr inpx
Your regular expression only specifies the $. You need to include the rest of the word.
removePunctWords <- function(x) {
  gsub(pattern = "\\$\\w*", "", x)
}
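Note that \w* consumes only word characters, so a cashtag followed by punctuation (e.g. "$rprx,") loses the word but keeps the comma; a later removePunctuation pass handles that. A usage sketch with the corrected pattern, assuming data.corpus1 from the question:

data.corpus1 <- tm_map(data.corpus1, content_transformer(removePunctWords))
inspect(data.corpus1[1:2])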

Implementing N-grams in my corpus, Quanteda Error

I am trying to implement quanteda on my corpus in R, but I am getting:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
  duplicate row.names: character(0)
I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
library(tm)
library(quanteda)
tweets <- read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(tweets$Tweet))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)
The processing that you're doing with tm prepares an object for tm, and quanteda doesn't know what to do with it. quanteda does all of these steps itself; see help("dfm") for the options.
If you try the following you can move ahead:
dfm(tweets$Tweet, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE,
    removePunct = TRUE, removeTwitter = TRUE, language = "english",
    ignoredFeatures = stopwords("english"), stem = TRUE)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 6,943 documents
... indexing features: 15,164 feature types
... removed 161 features, from 174 supplied (glob) feature types
... stemming features (English), trimmed 2175 feature variants
... created a 6943 x 12828 sparse dfm
... complete.
Elapsed time: 0.756 seconds.
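Note: the argument names in that dfm() call come from an older quanteda release; in current versions (3.x and later) the preprocessing happens on a tokens object first. A minimal sketch of the equivalent modern pipeline, assuming the current API:

library(quanteda)
toks <- tokens(tweets$Tweet, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))    # drop English stopwords
toks <- tokens_wordstem(toks, language = "english")  # stem the features
dfmat <- dfm(toks)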
HTH
No need to start with the tm package, or even to use read.csv() at all: this is what the quanteda companion package readtext is for.
So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:
library(quanteda)
library(readtext)
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv",
                            text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
##
##  Text Types Tokens Sentences Sentiment Sentiment_Confidence
## text1    19     21         1         2               0.7579
## text2    18     20         2         2               0.8775
## text3    23     24         1        -1               0.6805
## text5    17     19         2         0               1.0000
## text4    18     19         1        -1               0.8820
##
## Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
From there, you can perform all of the pre-processing steps directly in the dfm() call, including the choice of ngrams:
# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete.
## Elapsed time: 0.662 seconds.
# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete.
## Elapsed time: 1.419 seconds.
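In current quanteda releases the ngrams argument has moved out of dfm(); n-grams are formed on the tokens object instead. A hedged equivalent for the bigram case, assuming the modern API:

toks <- tokens(myCorpus, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)
dfm2 <- dfm(tokens_ngrams(toks, n = 2))  # bigrams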

Is it possible to remove parts or sections of the documents in the corpus of the R tm package?

I have built a corpus with the R tm package consisting of several papers, and I would like to remove the Reference section of all of them. Is that possible?
Do you mean a section within the documents? Yes:
library(tm)
txt <- c("Reference Section 1: Foo", "Reference Section 2: Bar")
corp <- Corpus(VectorSource(txt))
removeRefSec <- content_transformer(function(x) sub("^Reference Section \\d+: ", "", x))
corp[[1]]
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 24
removeRefSec(corp[[1]])
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 3
corp <- tm_map(corp, removeRefSec)
corp[[2]]
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 3
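If your papers instead end with a reference list, a hedged variant of the same idea drops everything from a trailing "References" heading onward. removeRefList is an illustrative name, and the pattern will need adjusting to your documents ((?si) makes the match case-insensitive and lets . span newlines):

removeRefList <- content_transformer(function(x) sub("(?si)\\bReferences\\b.*$", "", x, perl = TRUE))
corp <- tm_map(corp, removeRefList)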

Document term matrix in R

I have the following code:
rm(list=ls(all=TRUE)) #clear data
setwd("~/UCSB/14 Win 15/Issy/text.fwt") #set working directory
files <- list.files(); head(files) #load & check working directory
fw1 <- scan(what="c", sep="\n",file="fw_chp01.fwt")
library(tm)
corpus2 <- Corpus(VectorSource(c(fw1)))
skipWords <- function(x) removeWords(x, stopwords("english"))
#remove punc, numbers, stopwords, etc
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc,
                                   control = list(wordLengths = c(1, 110))) #create document term matrix
I'm trying to use some of the operations detailed in the tm reference manual (http://cran.r-project.org/web/packages/tm/tm.pdf) with little success. For example, when I try to use findFreqTerms, I get the following error:
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
Can anyone clue me in as to why this isn't working and what I can do to fix it?
Edited for @lawyeR:
head(fw1) produces the first six lines of the text (Episode 1 of Finnegans Wake by James Joyce):
[1] "003.01 riverrun, past Eve and Adam's, from swerve of shore to bend"
[2] "003.02 of bay, brings us by a commodius vicus of recirculation back to"
[3] "003.03 Howth Castle and Environs."
[4] "003.04 Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
[5] "003.05 core rearrived from North Armorica on this side the scraggy"
[6] "003.06 isthmus of Europe Minor to wielderfight his penisolate war: nor"
inspect(corpus2) outputs each line of the text in the following format (this is the final line of the text):
[[960]]
<<PlainTextDocument (metadata: 7)>>
029.36 borough. #this part differs by line of course
inspect(corpus2a.dtm) returns a table of all the types (there are 4,163 in total) in the text, in the following format:
Docs youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
   1      0     0  0     0     0   0         0    0        0      0    0
   2      0     0  0     0     0   0         0    0        0      0    0
Here is a simplified form of what you provided and did, and tm does its job; it is likely that one or more of your cleaning steps caused the problem.
> library(tm)
> fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
+ of bay, brings us by a commodius vicus of recirculation back to
+ Howth Castle and Environs.
+ Sir Tristram, violer d'amores, fr'over the short sea, had passen-
+ core rearrived from North Armorica on this side the scraggy
+ isthmus of Europe Minor to wielderfight his penisolate war: nor")
>
> corpus<-Corpus(VectorSource(c(fw1)))
> inspect(corpus)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
riverrun, past Eve and Adam's, from swerve of shore to bend
of bay, brings us by a commodius vicus of recirculation back to
Howth Castle and Environs.
Sir Tristram, violer d'amores, fr'over the short sea, had passen-
core rearrived from North Armorica on this side the scraggy
isthmus of Europe Minor to wielderfight his penisolate war: nor
> dtm <- DocumentTermMatrix(corpus)
> findFreqTerms(dtm)
[1] "adam's," "and" "armorica" "back" "bay," "bend"
[7] "brings" "castle" "commodius" "core" "d'amores," "environs."
[13] "europe" "eve" "fr'over" "from" "had" "his"
[19] "howth" "isthmus" "minor" "nor" "north" "passen-"
[25] "past" "penisolate" "rearrived" "recirculation" "riverrun," "scraggy"
[31] "sea," "shore" "short" "side" "sir" "swerve"
[37] "the" "this" "tristram," "vicus" "violer" "war:"
[43] "wielderfight"
As another point, I find it useful at the start to load a few other packages that complement tm.
library(SnowballC); library(RWeka); library(rJava); library(RWekajars)
For what it's worth, compared to your somewhat complicated cleaning steps, I usually trudge along like this (replace comments$comment with your text vector):
library(stringr)
comments$comment <- tolower(comments$comment)
comments$comment <- removeNumbers(comments$comment)
comments$comment <- stripWhitespace(comments$comment)
# replace all double spaces internally with a single space
comments$comment <- str_replace_all(comments$comment, "  ", " ")
# better to remove punctuation with str_replace_all, because the tm function
# doesn't insert a space
comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")
comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
From another ticket, this should also help: tm 0.6.0 has a bug, and it can be addressed with this statement (corp_stemmed is your corpus after stemming):
corpus_clean <- tm_map( corp_stemmed, PlainTextDocument)
Hope this helps.
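Along those lines, a hedged repair of the original pipeline is to skip tm_reduce and apply each transformation with tm_map directly, which keeps the corpus in a state DocumentTermMatrix and findFreqTerms accept:

corpus2 <- Corpus(VectorSource(c(fw1)))
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, stripWhitespace)
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2a.dtm <- DocumentTermMatrix(corpus2, control = list(wordLengths = c(1, 110)))
findFreqTerms(corpus2a.dtm, 5)  # now runs without the inherits() error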
