Stopwords Remaining in Corpus After Cleaning - r

I am attempting to remove the stopword "the" from my corpus, but not all instances are being removed.
library(RCurl)
library(tm)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
tm_map(shakespeare, content_transformer(tolower))
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
The first inspect call reveals:
    Terms
Docs   the thee
   1 11665  752
   2 11198  660
   3  4866  382
And the second, after cleaning:
    Terms
Docs  the thee
   1 1916 1298
   2 1711 1140
   3  760  740
What am I missing about the logic of removeWords that makes it ignore all these instances of "the"?
EDIT
I was able to get the instances of "the" down to below 1000 by slightly changing the call and making the removeWords call the very first cleaning step:
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the","The"))
Which gets me down to:
Docs the thee
   1 145  752
   2 130  660
   3  71  382
Still though, I'd like to know why I can't seem to eliminate them all.

Here is reproducible code that leads to 0 instances of "the". I fixed your typo and used your code from before the edit.
library(RCurl)
library(tm)
library(SnowballC)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
leads to:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 6/0
Sparsity           : 0%
Maximal term length: 4
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs   the thee
   1 11665  752
   2 11198  660
   3  4866  382
and after cleaning and fixing the typo:
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
shakespeare = tm_map(shakespeare, content_transformer(tolower)) ## FIXED TYPO
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
it leads to:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 3/3
Sparsity           : 50%
Maximal term length: 4
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs the thee
   1   0 1298
   2   0 1140
   3   0  740
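For what it's worth, the reason so many instances survived in the original pipeline is that removeWords only strips whole, case-sensitive tokens; since the tolower result was never assigned back to the corpus, every capitalized "The" was left untouched. A minimal sketch of that case sensitivity (my own illustration, not part of the code above):

library(tm)
# removeWords deletes whole words and is case-sensitive
removeWords("The cat sat on the mat", "the")
# lowercase "the" is removed; the capitalized "The" at the start is kept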

Related

tm_map and stopwords failed to remove unwanted words from the corpus created in R

I have a resulting data frame which has the following data:
                   word freq
credit           credit  790
account         account  451
xxxxxxxx       xxxxxxxx  430
report           report  405
information information  368
reporting     reporting  345
consumer       consumer  331
accounts       accounts  300
debt               debt  170
company         company  152
xxxxxx           xxxxxx  147
I want to do the following: remove all the words that contain two or more x's, such as xx, xxx, xxxx and so forth. Since these words can be in lower or upper case, I have to bring them into lower case first and then remove them.
I am using tm_map to remove the stopwords, but it doesn't seem to work and I still get the unwanted words in the data frame shown above.
myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'),"xxx", "xxxx", "xxxxx",
"XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
"xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus<- tm_map(myCorpus,removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)
The code above didn't work for me for removing the unwanted words from the corpus.
Is there any alternative way to deal with this issue?
One possibility involving dplyr and stringr could be:
library(dplyr)
library(stringr)

df %>%
  mutate(word = tolower(word)) %>%
  filter(str_count(word, fixed("x")) <= 1)
         word freq
1      credit  790
2     account  451
3      report  405
4 information  368
5   reporting  345
6    consumer  331
7    accounts  300
8        debt  170
9     company  152
Or a base R possibility using a similar logic:
df[sapply(df[, 1],
function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1,
USE.NAMES = FALSE), ]
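If you would rather strip these tokens out of the corpus itself before building the term-document matrix, a regex-based content_transformer is another option. A sketch, reusing the myCorpus object from the question and assuming the unwanted tokens are always just runs of two or more x's:

library(tm)
# lowercase first, then delete any standalone run of two or more x's
dropXWords <- content_transformer(function(x) gsub("\\bx{2,}\\b", "", tolower(x), perl = TRUE))
myCorpus <- tm_map(myCorpus, dropXWords)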

R Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

I am creating a Document Term Matrix with the following code. I have no problem creating the matrix, but when I try to remove sparse terms or find frequent terms, I get an error.
text<- c("Since I love to travel, this is what I rely on every time.",
"I got this card for the no international transaction fee",
"I got this card mainly for the flight perks",
"Very good card, easy application process",
"The customer service is outstanding!")
library(tm)
corpus<- Corpus(VectorSource(text))
corpus<- tm_map(corpus, content_transformer(tolower))
corpus<- tm_map(corpus, removePunctuation)
corpus<- tm_map(corpus, removeWords, stopwords("english"))
corpus<- tm_map(corpus, stripWhitespace)
dtm<- as.matrix(DocumentTermMatrix(corpus))
Here is the result:
Docs application card customer easy every ... etc.
   1           0    0        0    1     0
   2           0    1        0    0     1
   3           0    1        0    0     0
   4           1    1        0    0     0
   5           0    0        1    0     0
Here is where I get the error, using either removeSparseTerms or findFreqTerms:
sparse<- removeSparseTerms(dtm, 0.80)
freq<- findFreqTerms(dtm, 2)
Result
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
removeSparseTerms and findFreqTerms expect a DocumentTermMatrix or a TermDocumentMatrix object, not a plain matrix.
Create the DocumentTermMatrix without converting it to a matrix and you won't get the error:
dtm <- DocumentTermMatrix(corpus)
sparse <- removeSparseTerms(dtm, 0.80)
freq <- findFreqTerms(dtm, 2)
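If you still need a plain matrix later on, for example to look at the counts, convert only after the trimming step:

sparse_m <- as.matrix(sparse)  # convert the trimmed DocumentTermMatrix, not the original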

Implementing N-grams in my corpus, Quanteda Error

I am trying to implement quanteda on my corpus in R, but I am getting:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
duplicate row.names: character(0)
I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)
The processing that you're doing with tm is preparing an object for tm, and quanteda doesn't know what to do with it. quanteda does all of these steps itself; see help("dfm") for the available options.
If you try the following you can move ahead:
dfm(tweets$Tweet, verbose = TRUE, toLower= TRUE, removeNumbers = TRUE, removePunct = TRUE,removeTwitter = TRUE, language = "english", ignoredFeatures=stopwords("english"), stem=TRUE)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 6,943 documents
... indexing features: 15,164 feature types
... removed 161 features, from 174 supplied (glob) feature types
... stemming features (English), trimmed 2175 feature variants
... created a 6943 x 12828 sparse dfm
... complete.
Elapsed time: 0.756 seconds.
HTH
No need to start with the tm package, or even to use read.csv() at all - this is what the quanteda companion package readtext is for.
So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:
library(quanteda)
library(readtext)
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
##
##    Text Types Tokens Sentences Sentiment Sentiment_Confidence
##   text1    19     21         1         2               0.7579
##   text2    18     20         2         2               0.8775
##   text3    23     24         1        -1               0.6805
##   text5    17     19         2         0               1.0000
##   text4    18     19         1        -1               0.8820
##
## Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
From there, you can perform all of the pre-processing steps directly in the dfm() call, including the choice of ngrams:
# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete.
## Elapsed time: 0.662 seconds.
# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete.
## Elapsed time: 1.419 seconds.
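The dfm() arguments above come from an older quanteda release; on a recent quanteda version (3.x, an assumption about your installation) the same n-gram workflow is spelled out through the tokens functions, roughly:

library(quanteda)
toks <- tokens(myCorpus, remove_punct = TRUE)        # tokenize and drop punctuation
toks <- tokens_remove(toks, stopwords("english"))    # drop stopwords
toks <- tokens_wordstem(toks, language = "english")  # stem
dfm_bigrams <- dfm(tokens_ngrams(toks, n = 2))       # bigram dfm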

Is it possible to remove parts or sections of the documents in the corpus of the R tm package?

I have built a corpus with the R tm package consisting of several papers, and I would like to remove the References section from all of them. Is that possible?
Do you mean a section within the documents? Yes:
library(tm)
txt <- c("Reference Section 1: Foo", "Reference Section 2: Bar")
corp <- Corpus(VectorSource(txt))
removeRefSec <- content_transformer(function(x) sub("^Reference Section \\d+: ", "", x))
corp[[1]]
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 24
removeRefSec(corp[[1]])
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 3
corp <- tm_map(corp, removeRefSec)
corp[[2]]
# <<PlainTextDocument>>
# Metadata: 7
# Content: chars: 3
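For actual papers the reference section typically starts at a "References" heading and runs to the end of the document; under that assumption (adapt the pattern to how your files are laid out), a similar transformer can cut everything from the heading onwards:

# cut from a case-insensitive "References" heading to the end of each document
dropReferences <- content_transformer(function(x) sub("(?si)\\bReferences\\b.*$", "", x, perl = TRUE))
corp <- tm_map(corp, dropReferences)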

using findAssocs to build a correlation matrix of all word combinations in R

I'm trying to write code that builds a table that shows all the correlations between all the words from a corpus.
I know that I can use findAssocs in the tm package to find all word correlations for a single word, i.e. findAssocs(dtm, "quick", 0.5) would give me all the words that have a correlation with "quick" above 0.5, but I do not want to do this manually for each word in my text.
#Loading a .csv file into R
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header=FALSE)
require (tm)
corp <- Corpus(DataframeSource(x))
dtm <- DocumentTermMatrix(corp)
#Clean up the text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)
From here I can find the word correlations for individual words:
findAssocs(dtm, "quick", 0.4)
But I want to find all the correlations like this:
      quick easy  the  and
quick  1.00 0.54 0.72 0.92
easy   0.54 1.00 0.98 0.54
the    0.72 0.98 1.00 0.05
and    0.92 0.54 0.05 1.00
Any suggestions?
Example of the "TESTER.csv" data file (starting from cell A1)
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
You can probably use as.matrix and cor. findAssocs has a lower limit of 0:
(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#              all along
# there       1.00  1.00
# information 0.65  0.65
# needed      0.65  0.65
# the         0.47  0.47
# was         0.47  0.47
cor gets you all pearson correlations, for what it's worth:
cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
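cor_2 is already the full symmetric word-by-word correlation matrix the question asks for; to inspect a corner of it, subset and round, e.g.:

round(cor_2[1:4, 1:4], 2)  # first four terms against each other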
The preceding code:
x <- readLines(n = 7)
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
library(tm)
corp <- Corpus(VectorSource(x))
dtm <- DocumentTermMatrix(corp)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)
