Text Mining in R: Counting 2-3 word phrases

I found a very useful piece of code on Stack Overflow - Finding 2 & 3 word Phrases Using R TM Package
(credit @patrick perry) - to show the frequency of 2 and 3 word phrases within a corpus:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
How do you ensure that frequency counts of phrases like "the tin" are not also included in the frequency count of "the tin woodman" or "tin woodman"?
Thanks

Removing stopwords can remove noise from the data and help avoid issues such as the one you are having above:
library(tm)
library(corpus)
library(dplyr)
library(stringr)  # needed for str_extract() below

corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

term_stats(corpus, ngrams = 2:3) %>%
  arrange(desc(count)) %>%
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>%
  mutate(count_unique = ifelse(length(unique(count)) > 1,
                               max(count) - min(count), count)) %>%
  ungroup() %>%
  select(-grp)
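The nested-count part of the question is not directly addressed by the stopword approach above. As a minimal sketch (not from the original answer, and handling only one specific case), you could subtract the count of a given trigram from the bigrams nested inside it, e.g. "the tin woodman" versus "the tin" and "tin woodman":
library(corpus)
library(dplyr)
corpus <- gutenberg_corpus(55)
text_filter(corpus)$drop_punct <- TRUE
stats <- term_stats(corpus, ngrams = 2:3)
# count of the trigram whose occurrences we want to exclude
tri <- stats$count[stats$term == "the tin woodman"]
# subtract it from the two bigrams it contains
stats %>%
  mutate(adjusted = ifelse(term %in% c("the tin", "tin woodman"),
                           count - tri, count)) %>%
  head(10)
Generalising this would mean repeating the same subtraction for every trigram and its component bigrams.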

Related

How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.
For instance:
I create token object from a list of documents, and apply collocation analysis.
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)
However, as far as I can see, there is no clear way to see which collocation(s) are frequent in, or even present in, a particular document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only includes row IDs such as text1, text2, etc.
I would like to group the collocation analysis results based on docvars. Is this possible with quanteda?
It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.
So the solution to group by document (or any other variable) is to:
1. Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
2. Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document.
This is a bit trickier than it sounds, so here is an implementation using the built-in inaugural corpus:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 let us 34 0 2 6.257000 17.80637
## 2 fellow citizens 14 0 2 6.451738 16.18314
## 3 fellow americans 15 0 2 6.221678 16.16410
## 4 one another 14 0 2 6.592755 14.56082
## 5 god bless 15 0 2 8.628894 13.57027
## 6 united states 12 0 2 9.192044 13.22077
Now we compound them and keep only the collocations, then get the frequencies by document:
dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
That dfm already contains the counts by document of each collocation, but if you want counts in a data.frame format, with a grouping option, use textstat_frequency(). Here I've only output the top two by document, but if you remove the n = 2 then it will give you the frequencies of all collocations by document.
textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
## feature frequency rank docfreq group
## 1 nuclear weapons 4 1 1 1985-Reagan
## 2 human freedom 3 2 1 1985-Reagan
## 3 new breeze 4 1 1 1989-Bush
## 4 new engagement 3 2 1 1989-Bush
## 5 let us 7 1 1 1993-Clinton
## 6 fellow americans 4 2 1 1993-Clinton
## 7 let us 6 1 1 1997-Clinton
## 8 new century 6 1 1 1997-Clinton
## 9 nation's promise 2 1 1 2001-Bush
## 10 common good 2 1 1 2001-Bush
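If you want the counts grouped by a docvar rather than by document name - which was the original question - any docvar can be passed to groups. A hedged example, assuming the Party docvar carried over from data_corpus_inaugural:
# tally the compounded collocations by the "Party" docvar instead of by document
textstat_frequency(dfmat, groups = docvars(dfmat, "Party"), n = 2)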

Count number of words in a Dictionary file in R

I am reading a dictionary into R via the quanteda package. This package is preloaded with some great dictionaries, one of which is the Moral Foundations Dictionary that I am interested in. This dictionary has several categories (Harm, Fairness, Ingroup, etc.) which are broken down into virtue and vice subcategories.
I want to count the number of words that are in each subcategory for each foundation in R. How can I go about doing that?
For a reproducible example, I can access the Moral Foundations Dictionary (labeled as data_dictionary_MFD) by running library(quanteda.dictionaries)
Thank you!
It's not entirely clear what you are looking for, but this probably comes down to terminology. quanteda dictionaries use the terminology of "keys" for the canonical categories (in R, the names of the list elements), and "values" for the patterns used to match words for counting occurrences of each key.
The MFD has two sets of "keys": moral "foundations" such as care, fairness, etc, and "valences" represented by "vice" and "virtue" for each foundation category. As we have recorded it in quanteda.dictionaries::data_dictionary_MFD, however -- in v0.22 of quanteda.dictionaries at least -- the dictionary is flattened to just one level.
We can see this, and count the values in each dictionary "key" (which here combines the foundation and the valence), as follows:
library("quanteda")
## Package version: 1.5.2
data(data_dictionary_MFD, package = "quanteda.dictionaries")
# number of "words" in each MFD dictionary key
lengths(data_dictionary_MFD)
## care.virtue care.vice fairness.virtue fairness.vice
## 182 288 115 236
## loyalty.virtue loyalty.vice authority.virtue authority.vice
## 142 49 301 130
## sanctity.virtue sanctity.vice
## 272 388
# first 5 values in each dictionary key
lapply(data_dictionary_MFD, head, 5)
## $care.virtue
## [1] "alleviate" "alleviated" "alleviates" "alleviating" "alleviation"
##
## $care.vice
## [1] "abused" "abuser" "abusers" "abuses" "abusing"
##
## $fairness.virtue
## [1] "avenge" "avenged" "avenger" "avengers" "avenges"
##
## $fairness.vice
## [1] "am partial" "bamboozle" "bamboozled" "bamboozles" "bamboozling"
##
## $loyalty.virtue
## [1] "all for one" "allegiance" "allegiances" "allegiant" "allied"
##
## $loyalty.vice
## [1] "against us" "apostate" "apostates" "backstab" "backstabbed"
##
## $authority.virtue
## [1] "acquiesce" "acquiesced" "acquiescent" "acquiesces" "acquiescing"
##
## $authority.vice
## [1] "anarchist" "anarchistic" "anarchists" "anarchy" "apostate"
##
## $sanctity.virtue
## [1] "abstinance" "abstinence" "allah" "almighty" "angel"
##
## $sanctity.vice
## [1] "abhor" "abhored" "abhors" "addict" "addicted"
To apply this to count the words matching a "key" (the combination of foundation and valence), we can create a dfm and then use dfm_lookup():
# number of words in a text matching the MFD dictionary
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFD) %>%
  tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
However, there is a better way that makes use of the nested structure of the MFD, but we will need to modify the dictionary object first to make it nested. As supplied, the MFD is already "flattened". We want to unflatten it so that the foundations form the first-level keys and the valences form the second-level keys. Then, using the levels argument in tokens_lookup() and dfm_lookup(), we will be able to choose the level at which we count matches in our text.
First, recreate the dictionary to make it nested.
# remake the dictionary into nested categories of foundation and valence
data_dictionary_MFDnested <-
  dictionary(list(
    care = list(
      virtue = data_dictionary_MFD[["care.virtue"]],
      vice = data_dictionary_MFD[["care.vice"]]
    ),
    fairness = list(
      virtue = data_dictionary_MFD[["fairness.virtue"]],
      vice = data_dictionary_MFD[["fairness.vice"]]
    ),
    loyalty = list(
      virtue = data_dictionary_MFD[["loyalty.virtue"]],
      vice = data_dictionary_MFD[["loyalty.vice"]]
    ),
    authority = list(
      virtue = data_dictionary_MFD[["authority.virtue"]],
      vice = data_dictionary_MFD[["authority.vice"]]
    ),
    sanctity = list(
      virtue = data_dictionary_MFD[["sanctity.virtue"]],
      vice = data_dictionary_MFD[["sanctity.vice"]]
    )
  ))
Inspecting this we can see details on the dictionary:
lengths(data_dictionary_MFDnested)
## care fairness loyalty authority sanctity
## 2 2 2 2 2
lapply(data_dictionary_MFDnested, lengths)
## $care
## virtue vice
## 182 288
##
## $fairness
## virtue vice
## 115 236
##
## $loyalty
## virtue vice
## 142 49
##
## $authority
## virtue vice
## 301 130
##
## $sanctity
## virtue vice
## 272 388
And now we can apply it to our texts:
# now apply it to texts
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
  tail()
## Document-feature matrix of: 6 documents, 5 features (0.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs care fairness loyalty authority sanctity
## 1997-Clinton 12 8 37 3 22
## 2001-Bush 29 12 37 20 22
## 2005-Bush 26 20 41 37 16
## 2009-Obama 24 9 34 20 21
## 2013-Obama 20 17 41 12 14
## 2017-Trump 23 6 44 21 16
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 2) %>%
  tail()
## Document-feature matrix of: 6 documents, 2 features (0.0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
## features
## docs virtue vice
## 1997-Clinton 68 14
## 2001-Bush 107 13
## 2005-Bush 117 23
## 2009-Obama 95 13
## 2013-Obama 94 10
## 2017-Trump 95 15
Specifying both levels (or the default of levels = 1:5) matches what we had originally with the flattened dictionary:
dfm(data_corpus_inaugural) %>%
  dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1:2) %>%
  tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
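For completeness, the same lookup can be done on tokens rather than on a dfm; a brief sketch with tokens_lookup(), again counting at the foundation level. Note that counts may differ slightly, because multi-word values such as "all for one" can match in tokens_lookup() but not in a unigram dfm.
# same nested dictionary, applied to tokens and counted at levels = 1
tokens(data_corpus_inaugural) %>%
  tokens_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
  dfm() %>%
  tail()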
Not sure what your MFD corpus looks like; if it is the one hosted at osf.io/whjt2, then the first six lines will look like this (with mfd as the name for the data set and Wordtoken and MFDcategory as my column headers):
head(mfd)
Wordtoken MFDcategory
1 compassion 1
2 empathy 1
3 kindness 1
4 caring 1
5 generosity 1
6 benevolence 1
If your aim is just to find out how many words are listed under each of the ten levels of MFDcategory, then all you have to do is use table() for that column:
table(mfd$MFDcategory)
1 2 3 4 5 6 7 8 9 10
182 288 115 236 143 49 301 130 272 388
That is, there are 182 word tokens for category 1, namely care.virtue, as opposed to 288 tokens for category 2, namely care.vice, and so on. Does this help?
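If you would rather see named categories than numeric codes, one option is to relabel the factor before tabulating. The labels vector below is an assumption, based on the codes following the foundation.valence order described above (1 = care.virtue, 2 = care.vice, and so on):
# hypothetical relabelling, assuming codes 1-10 follow the order shown above
mfd_labels <- c("care.virtue", "care.vice", "fairness.virtue", "fairness.vice",
                "loyalty.virtue", "loyalty.vice", "authority.virtue", "authority.vice",
                "sanctity.virtue", "sanctity.vice")
table(factor(mfd$MFDcategory, levels = 1:10, labels = mfd_labels))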

searching for deleted documents from corpus in R

I want to preprocess my texts before analysis.
mydat
Production of banners 1,2x2, Cutting
Production of a plate with the size 2330 * 600mm
Delivery
Placement of advertising information on posters 0.85 * 0.65 at Ordzhonikidze Street (TSUM) -Gerzen, side A2 April 2014
Manufacturing of a banner 3,7х2,7
Placement of advertising information on the prismatron 3 * 4 at 60, Ordzhonikidze, Aldjonikidze Street, A (01.12.2011-14.12.2011)
Placement of advertising information on the multipanel 3 * 12 at Malygina-M.Torez street, side A, (01.12.2011-14.12.2011)
Designer services
41526326
12
Mounting and rolling of the RIM on the prismatron 3 * 6
The code:
library(tm)

mydat <- read.csv("C:/kr_csv.csv", sep = ";", dec = ",")
tw.corpus <- Corpus(VectorSource(mydat$descr))
tw.corpus <- tm_map(tw.corpus, removePunctuation)
tw.corpus <- tm_map(tw.corpus, removeNumbers)
tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
tw.corpus <- tm_map(tw.corpus, stemDocument)

# deleting empty documents
doc.m <- DocumentTermMatrix(tw.corpus)
rowTotals <- apply(doc.m, 1, sum) # find the sum of words in each document
doc.m.new <- doc.m[rowTotals > 0, ]
1. How do I find out which observations were deleted during preprocessing (for example, that the first and second texts were deleted)?
2. How do I remove those observations from the original dataset (mydat)?
After pre-processing and stemming your corpus, you are counting the number of words that are left in each document. Naturally, the "documents" with no words in them have a count of zero. The documents containing only numbers and punctuation are also empty, because you removed those strings.
In your data, you have many "documents" that are empty lines. In total, you have 28 "documents" in your corpus, but more than half of them are empty lines (i.e. they contain zero words).
You calculate the word-count for each document in rowTotals. If you check which of the entries in rowTotals are equal to zero, you would get the document numbers that are subsequently removed from doc.m:
rowTotals
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
# 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 2 8 8 2 0 0 0 7
You can see that documents 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, etc. all contain zero words and are therefore not present in doc.m. You can get these numbers automatically with which():
which( rowTotals == 0)
# [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 25 26 27
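To answer the second question, a short sketch, assuming each row of mydat corresponds to one document in tw.corpus in the same order:
# drop the observations whose processed documents ended up empty
empty_docs <- which(rowTotals == 0)
mydat.new <- mydat[-empty_docs, ]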

Word frequency over time by user in R

I'm aiming to make a bump chart of word frequency over time. I have about 36000 individual entries of a user's comment and an associated date. I have a 25 user sample available here: http://pastebin.com/kKfby5kf
I'm trying to get the most frequent words (maybe top 10?) on a given date. I feel like my methodology is close, but not quite right:
library("tm")
frequencylist <- list(0)
for(i in unique(sampledf[,2])){
subset <- subset(sampledf, sampledf[,2]==i)
comments <- as.vector(subset[,1])
verbatims <- Corpus(VectorSource(comments))
verbatims <- tm_map(verbatims, stripWhitespace)
verbatims <- tm_map(verbatims, content_transformer(tolower))
verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
verbatims <- tm_map(verbatims, removePunctuation)
stopwords2 <- c("game")
verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
dtm <- DocumentTermMatrix(verbatims2)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
frequencydf <- data.frame(frequency)
frequencydf$comments <- row.names(frequencydf)
frequencydf$date <- i
frequencylist[[i]] <- frequencydf
}
An explanation of my madness: the pastebin example goes into sampledf. For each unique date in the sample, I'm trying to get a word frequency. I'm then attempting to store that tabulated word frequency in a list (might not be the best approach, though). First, I subset by date, then strip whitespace, common English words, punctuation, and lowercase it all. I then do another pass of word removal for "game" since it's not too interesting but very common. To get the word frequency, I then pass it into a document term matrix and do a simple colSums(). Then I append the date for that table and try to store it in a list.
I'm not sure if my strategy is valid to begin with. Is there a simpler, better approach to this problem?
The commenters are correct in that there are better ways to set up a reproducible example. In addition, your question could be more specific about what you are trying to accomplish as output. (I could not get your code to execute without error.)
However: You asked for a simpler, better approach. Here is what I think is both. It uses the quanteda text package and exploits the groups feature when creating the document-feature matrix. Then it performs some rankings on the "dfm" to get what you need in terms of daily term rankings.
Note that this is based on my having loaded your linked data using read.delim("sampledf.tsv", stringsAsFactors = FALSE).
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip,
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))
# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete.
## Elapsed time: 0.009 seconds.
myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types
# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
## â great nice play go will can get ever first
## 2013-10-02 1 18 19 20 21 22 23 24 25 26
## 2013-10-04 3 1 2 4 5 6 7 8 9 10
## 2013-10-05 3 9 28 29 1 2 4 5 6 7
## 2013-10-06 7 4 8 10 11 30 31 32 33 34
## 2013-10-07 5 1 2 3 4 6 7 8 9 10
## 2013-10-09 12 42 43 1 2 3 4 5 6 7
## 2013-10-13 1 14 6 9 10 13 44 45 46 47
## 2013-10-16 2 3 84 85 1 4 5 6 7 8
## 2013-10-18 15 1 2 3 4 5 6 7 8 9
## 2013-10-19 3 86 1 2 4 5 6 7 8 9
## 2013-10-22 2 87 88 89 90 91 92 93 94 95
## 2013-10-23 13 98 99 100 101 102 103 104 105 106
## 2013-10-25 4 6 5 12 16 109 110 111 112 113
## 2013-10-27 8 4 6 15 17 124 125 126 127 128
## 2013-10-30 11 1 2 3 4 5 6 7 8 9
## 2014-10-01 7 16 139 1 2 3 4 5 6 8
## 2014-10-02 140 1 2 3 4 5 6 7 8 9
## 2014-10-03 141 142 143 1 2 3 4 5 6 7
## 2014-10-05 144 145 146 147 148 1 2 3 4 5
## 2014-10-06 17 149 150 1 2 3 4 5 6 7
# top n features by day
n <- 10
as.data.frame(apply(featureRanksByDate, 1, function(x) {
  todaysTopFeatures <- names(featureRanksByDate)
  names(todaysTopFeatures) <- x
  todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
## 2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1 â great go triple great play â go great nice year year
## 2 win nice will niple nice go created â nice play â give
## 3 year â â backflip play will wasnt great play â give good
## 4 give play can great go can money will go go good hard
## 5 good go get scope â get prizes can will will hard time
## 6 hard will ever ball will ever nice get can can time triple
## 7 time can first â can first piece ever get get triple niple
## 8 triple get fun nice get fun dead first ever ever niple backflip
## 9 niple ever great testical ever win play fun first first backflip scope
## 10 backflip first win play first year go win fun fun scope ball
## 2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1 scope scope great play great play will play
## 2 ball ball nice go nice go can go
## 3 testical testical play will play will get will
## 4 â great go can go can ever can
## 5 nice shot will get will get first get
## 6 great nice can ever can ever fun ever
## 7 shot head get â get first win first
## 8 head â ever first ever fun year fun
## 9 dancing dancing first fun first win give win
## 10 cow cow fun win fun year good year
BTW interesting spellings of niple and testical.
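The quanteda API used above has changed since this answer was written; here is a hedged sketch of the same grouped-frequency idea, assuming quanteda >= 3.0 with quanteda.textstats installed, and the same sampledf columns as above:
library(quanteda)
library(quanteda.textstats)
# build tokens, drop "game" and stopwords, group the dfm by date, then rank terms per day
corp <- corpus(sampledf$content_strip,
               docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))
dfmat <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(c("game", stopwords("en"))) %>%
  dfm()
dfmat <- dfm_group(dfmat, groups = docvars(dfmat, "date"))
textstat_frequency(dfmat, n = 10, groups = docnames(dfmat))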

R dynamic stop word list with terms of frequency one

I am working on a text mining assignment and am stuck at the moment. The following is based on Zhao's Text Mining with Twitter. I cannot get it to work; maybe one of you has a good idea?
Goal: I would like to remove all terms from the corpus with a word count of one instead of using a stopword list.
What I did so far: I have downloaded the tweets and converted them into a data frame.
library(tm)

tf1 <- Corpus(VectorSource(tweets.df$text))
tf1 <- tm_map(tf1, content_transformer(tolower))
removeUser <- function(x) gsub("#[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeUser))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeNumPunct))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeURL))
tf1 <- tm_map(tf1, stripWhitespace)
# Using TermDocumentMatrix in order to find terms with count 1; don't know any other way
tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))
ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)
tf1Copy <- tf1
tf1List <- setdiff(tf1Copy, ones)
tf1CList <- paste(unlist(tf1List), sep = "", collapse = " ")
tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)
tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))
# Just to test success...
ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)
(ones2)
The Error:
Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(senior data scientist global strategy firm
25.0010230541229 48 17 6 6 115 1 186 0 1 en kdnuggets poll primary programming language for analytics data mining data scienc
25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet of everything official statistics in the information age anu june 25.0020229816437 48 17 6 6 115 1 186 0 3 en handling and processing strings in r an ebook in pdf format pages
25.0020229816437 48 17 6 6 115 1 186 0 4 en webinar getting your data into r by hadley wickham am edt june th
25.0020229816437 48 17 6 6 115 1 186 0 5 en before loading the rdmtweets dataset please run librarytwitter to load required package
25.0020229816437 48 17 6 6 115 1 186 0 6 en an infographic on sas vs r vs python datascience via
25.0020229816437 48 17 6 6 115 1 186 0 7 en r is again the kdnuggets poll on top analytics data mining science software
25.0020229816437 48 17 6 6 115 1 186 0 8 en i will run
In Addition:
Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error
'regular expression is too large'
at ''
PS sorry for the bad format at the end could not get it fixed.
Here's a way to remove all terms from the corpus with a word count of one:
library(tm)
mytweets <- c("This is a doc", "This is another doc")
corp <- Corpus(VectorSource(mytweets))
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is another doc
## ^^^
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
# Terms
# Docs another doc this
# 1 0 1 1
# 2 1 1 1
(stopwords <- findFreqTerms(dtm, 1, 1))
# [1] "another"
corp <- tm_map(corp, removeWords, stopwords)
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is doc
## ^ 'another' is gone
(As a side note: The token 'a' from 'This is a...' is gone, too, because DocumentTermMatrix cuts out tokens with a length < 3 by default.)
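If you want to keep single-character tokens such as "a", you can lower the minimum word length, using the same control the question's own code uses:
dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))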
Here's a simpler method using the dfm() and trim() functions from the quanteda package:
require(quanteda)
mydfm <- dfm(c("This is a doc", "This is another doc"), verbose = FALSE)
mydfm
## Document-feature matrix of: 2 documents, 5 features.
## 2 x 5 sparse Matrix of class "dfmSparse"
## features
## docs a another doc is this
## text1 1 0 1 1 1
## text2 0 1 1 1 1
trim(mydfm, minCount = 2)
## Features occurring less than 2 times: 2
## Document-feature matrix of: 2 documents, 3 features.
## 2 x 3 sparse Matrix of class "dfmSparse"
## features
## docs doc is this
## text1 1 1 1
## text2 1 1 1
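In more recent quanteda releases trim() has been replaced; a hedged equivalent, assuming a version where the function is dfm_trim() and the argument is min_termfreq:
dfm_trim(mydfm, min_termfreq = 2)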
