Sentiment Analysis in R

I am new to sentiment analysis and have no idea how to go about it using R, so I would like to seek help and guidance on this.
I have a set of data consisting of opinions, and would like to analyse the opinions.
Title     Date            Content
Boy       May 13 2015     "She is pretty", Tom said.
Animal    June 14 2015    The penguin is cute, lion added.
Human     March 09 2015   Mr Koh predicted that every human is smart..
Monster   Jan 22 2015     Ms May, a student, said that John has $10.80.
Thank you.

Sentiment analysis encompasses a broad category of methods designed to measure positive versus negative sentiment from text, so that makes this a fairly difficult question to answer simply. But here is a simple answer: You can apply a dictionary to your document-term matrix and then combine the positive versus negative key categories of your dictionary to create a sentiment measure.
I suggest trying this in the text analysis package quanteda, which handles a variety of existing dictionary formats and allows you to create very flexible custom dictionaries.
For example:
require(quanteda)
mycorpus <- subset(inaugCorpus, Year>1980)
mydict <- dictionary(list(negative = c("detriment*", "bad*", "awful*", "terrib*", "horribl*"),
                          positive = c("good", "great", "super*", "excellent")))
myDfm <- dfm(mycorpus, dictionary = mydict)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 9 documents
## ... indexing features: 3,113 feature types
## ... applying a dictionary consisting of 2 keys
## ... created a 9 x 2 sparse dfm
## ... complete.
## Elapsed time: 0.057 seconds.
myDfm
## Document-feature matrix of: 9 documents, 2 features.
## 9 x 2 sparse Matrix of class "dfmSparse"
## features
## docs negative positive
## 1981-Reagan 0 6
## 1985-Reagan 0 6
## 1989-Bush 0 18
## 1993-Clinton 1 2
## 1997-Clinton 2 8
## 2001-Bush 1 6
## 2005-Bush 0 8
## 2009-Obama 2 3
## 2013-Obama 1 3
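To combine the two keys into the single sentiment measure described above, one simple option (a sketch, not something shown in the quoted output) is to take the difference of the two columns, optionally normalising by document length:
mat <- as.matrix(myDfm)                        # dense matrix: rows = documents, cols = dictionary keys
sentiment <- mat[, "positive"] - mat[, "negative"]
sort(sentiment, decreasing = TRUE)
# length-normalised version: (mat[, "positive"] - mat[, "negative"]) / ntoken(mycorpus)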
# use a LIWC dictionary - obviously you need this file
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
myDfmLIWC <- dfm(mycorpus, dictionary = liwcdict)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 9 documents
## ... indexing features: 3,113 feature types
## ... applying a dictionary consisting of 68 keys
## ... created a 9 x 68 sparse dfm
## ... complete.
## Elapsed time: 1.844 seconds.
myDfmLIWC[, grep("^Pos|^Neg", features(myDfmLIWC))]
## Document-feature matrix of: 9 documents, 4 features.
## 9 x 4 sparse Matrix of class "dfmSparse"
## features
## docs Negate Posemo Posfeel Negemo
## 1981-Reagan 46 89 5 24
## 1985-Reagan 28 104 7 33
## 1989-Bush 40 102 10 8
## 1993-Clinton 25 51 3 23
## 1997-Clinton 27 64 5 22
## 2001-Bush 40 80 6 27
## 2005-Bush 25 117 5 31
## 2009-Obama 40 83 5 46
## 2013-Obama 42 80 13 22
For your corpus, assuming that you get it into a data.frame called data, you can create a quanteda corpus using:
mycorpus <- corpus(data$Content, docvars = data[, 1:2])
See also ?textfile for loading content from files in one easy command. This works with .csv files, for instance, although you would have problems with that file unless the Content field were quoted, since the text itself contains commas.
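For instance, a minimal sketch that types the four example opinions above into a data.frame by hand (column names follow the question's table) and then builds the corpus:
data <- data.frame(
  Title   = c("Boy", "Animal", "Human", "Monster"),
  Date    = c("May 13 2015", "June 14 2015", "March 09 2015", "Jan 22 2015"),
  Content = c('"She is pretty", Tom said.',
              "The penguin is cute, lion added.",
              "Mr Koh predicted that every human is smart..",
              "Ms May, a student, said that John has $10.80."),
  stringsAsFactors = FALSE
)
mycorpus <- corpus(data$Content, docvars = data[, 1:2])
summary(mycorpus)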
There are many other ways to measure sentiment of course, but if you are new to sentiment mining and R, that should get you started. You can read more on sentiment mining methods (and apologies if you already have encountered them) from:
Liu, Bing. 2010. "Sentiment Analysis and Subjectivity." Handbook of Natural Language Processing 2: 627–66.
Liu, Bing. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.

Related

How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.
For instance:
I create a tokens object from a list of documents and apply collocation analysis.
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)
However, as far as I can see, there is no clear method for seeing which collocations are frequent in, or occur in, which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only identifies documents by row id, as text1, text2, etc.
I would like to group the collocation analysis results based on docvars. Is this possible with quanteda?
It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.
So the solution to group by document (or any other variable) is to:
1. Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
2. Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document. This last step is a bit trickier.
Implementation using the built-in inaugural corpus:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 let us 34 0 2 6.257000 17.80637
## 2 fellow citizens 14 0 2 6.451738 16.18314
## 3 fellow americans 15 0 2 6.221678 16.16410
## 4 one another 14 0 2 6.592755 14.56082
## 5 god bless 15 0 2 8.628894 13.57027
## 6 united states 12 0 2 9.192044 13.22077
Now we compound them and keep only the collocations, then get the frequencies by document:
dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
That dfm already contains the counts by document of each collocation, but if you want counts in a data.frame format, with a grouping option, use textstat_frequency(). Here I've only output the top two by document, but if you remove the n = 2 then it will give you the frequencies of all collocations by document.
textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
head(10)
## feature frequency rank docfreq group
## 1 nuclear weapons 4 1 1 1985-Reagan
## 2 human freedom 3 2 1 1985-Reagan
## 3 new breeze 4 1 1 1989-Bush
## 4 new engagement 3 2 1 1989-Bush
## 5 let us 7 1 1 1993-Clinton
## 6 fellow americans 4 2 1 1993-Clinton
## 7 let us 6 1 1 1997-Clinton
## 8 new century 6 1 1 1997-Clinton
## 9 nation's promise 2 1 1 2001-Bush
## 10 common good 2 1 1 2001-Bush
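If you want the counts grouped by a document variable rather than by document name, you can pass that docvar to groups instead; a sketch using the President docvar that the inaugural corpus carries:
# group the collocation counts by the President docvar
textstat_frequency(dfmat, groups = docvars(dfmat, "President"), n = 2) %>%
  head(10)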

Count number of words in a Dictionary file in R

I am reading a dictionary into R via the quanteda package. This package comes preloaded with some great dictionaries, one of which is the Moral Foundations Dictionary that I am interested in. This dictionary has several categories (Harm, Fairness, Ingroup, etc.) which are broken down into virtue and vice subcategories.
I want to count the number of words that are in each subcategory for each foundation in R. How can I go about doing that?
For a reproducible example, I can access the Moral Foundations Dictionary (labeled as data_dictionary_MFD) by running library(quanteda.dictionaries)
Thank you!
It's not entirely clear what you are looking for, but this probably comes down to terminology. quanteda dictionaries use the terminology of "keys" for the canonical categories (in R, the names of the list elements), and "values" for the patterns used to match words for counting occurrences of each key.
The MFD has two sets of "keys": moral "foundations" such as care, fairness, etc, and "valences" represented by "vice" and "virtue" for each foundation category. As we have recorded it in quanteda.dictionaries::data_dictionary_MFD, however -- in v0.22 of quanteda.dictionaries at least -- the dictionary is flattened to just one level.
We can see this, and count the values in each dictionary "key" that combines here the foundation and the valence, as follows:
library("quanteda")
## Package version: 1.5.2
data(data_dictionary_MFD, package = "quanteda.dictionaries")
# number of "words" in each MFD dictionary key
lengths(data_dictionary_MFD)
## care.virtue care.vice fairness.virtue fairness.vice
## 182 288 115 236
## loyalty.virtue loyalty.vice authority.virtue authority.vice
## 142 49 301 130
## sanctity.virtue sanctity.vice
## 272 388
# first 5 values in each dictionary key
lapply(data_dictionary_MFD, head, 5)
## $care.virtue
## [1] "alleviate" "alleviated" "alleviates" "alleviating" "alleviation"
##
## $care.vice
## [1] "abused" "abuser" "abusers" "abuses" "abusing"
##
## $fairness.virtue
## [1] "avenge" "avenged" "avenger" "avengers" "avenges"
##
## $fairness.vice
## [1] "am partial" "bamboozle" "bamboozled" "bamboozles" "bamboozling"
##
## $loyalty.virtue
## [1] "all for one" "allegiance" "allegiances" "allegiant" "allied"
##
## $loyalty.vice
## [1] "against us" "apostate" "apostates" "backstab" "backstabbed"
##
## $authority.virtue
## [1] "acquiesce" "acquiesced" "acquiescent" "acquiesces" "acquiescing"
##
## $authority.vice
## [1] "anarchist" "anarchistic" "anarchists" "anarchy" "apostate"
##
## $sanctity.virtue
## [1] "abstinance" "abstinence" "allah" "almighty" "angel"
##
## $sanctity.vice
## [1] "abhor" "abhored" "abhors" "addict" "addicted"
To apply this to count the words matching a "key" (the combination of foundation and valence), we can create a dfm and then use dfm_lookup():
# number of words in a text matching the MFD dictionary
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFD) %>%
tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
However there is a better way that makes use of the nested structure of the MFD, but we will need to modify the dictionary object first to make it nested. As supplied, the MFD is already "flattened". We want to unflatten it so that the foundations form the 1st level keys, and the valences form the second level keys. Then, using the levels argument in tokens_lookup() and dfm_lookup(), we will be able to choose the level at which we count matches in our text.
First, recreate the dictionary to make it nested.
# remake the dictionary into nested categories of foundation and valence
data_dictionary_MFDnested <-
  dictionary(list(
    care = list(
      virtue = data_dictionary_MFD[["care.virtue"]],
      vice = data_dictionary_MFD[["care.vice"]]
    ),
    fairness = list(
      virtue = data_dictionary_MFD[["fairness.virtue"]],
      vice = data_dictionary_MFD[["fairness.vice"]]
    ),
    loyalty = list(
      virtue = data_dictionary_MFD[["loyalty.virtue"]],
      vice = data_dictionary_MFD[["loyalty.vice"]]
    ),
    authority = list(
      virtue = data_dictionary_MFD[["authority.virtue"]],
      vice = data_dictionary_MFD[["authority.vice"]]
    ),
    sanctity = list(
      virtue = data_dictionary_MFD[["sanctity.virtue"]],
      vice = data_dictionary_MFD[["sanctity.vice"]]
    )
  ))
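Writing out the nested list by hand works, but the same structure can also be built programmatically by splitting the flattened key names on the "."; a sketch, assuming every key follows the foundation.valence naming pattern:
keys <- names(data_dictionary_MFD)
foundations <- sub("\\..*$", "", keys)   # e.g. "care" from "care.virtue"
# note: split() orders the foundations alphabetically
nested <- lapply(split(keys, foundations), function(k) {
  setNames(lapply(k, function(key) data_dictionary_MFD[[key]]),
           sub("^.*\\.", "", k))         # e.g. "virtue", "vice"
})
data_dictionary_MFDnested2 <- dictionary(nested)
lengths(data_dictionary_MFDnested2)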
Inspecting this we can see details on the dictionary:
lengths(data_dictionary_MFDnested)
## care fairness loyalty authority sanctity
## 2 2 2 2 2
lapply(data_dictionary_MFDnested, lengths)
## $care
## virtue vice
## 182 288
##
## $fairness
## virtue vice
## 115 236
##
## $loyalty
## virtue vice
## 142 49
##
## $authority
## virtue vice
## 301 130
##
## $sanctity
## virtue vice
## 272 388
And now we can apply it to our texts:
# now apply it to texts
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
tail()
## Document-feature matrix of: 6 documents, 5 features (0.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs care fairness loyalty authority sanctity
## 1997-Clinton 12 8 37 3 22
## 2001-Bush 29 12 37 20 22
## 2005-Bush 26 20 41 37 16
## 2009-Obama 24 9 34 20 21
## 2013-Obama 20 17 41 12 14
## 2017-Trump 23 6 44 21 16
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 2) %>%
tail()
## Document-feature matrix of: 6 documents, 2 features (0.0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
## features
## docs virtue vice
## 1997-Clinton 68 14
## 2001-Bush 107 13
## 2005-Bush 117 23
## 2009-Obama 95 13
## 2013-Obama 94 10
## 2017-Trump 95 15
Specifying both levels (or the default of levels = 1:5) matches what we had originally with the flattened dictionary:
dfm(data_corpus_inaugural) %>%
dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1:2) %>%
tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
Not sure what your MFD file looks like; if it is the one hosted on osf.io/whjt2, then the first six lines will look like this (with mfd as the name for the data set and Wordtoken and MFDcategory as my column headers):
head(mfd)
Wordtoken MFDcategory
1 compassion 1
2 empathy 1
3 kindness 1
4 caring 1
5 generosity 1
6 benevolence 1
If your aim is just to find out how many words are listed under each of the ten levels of MFDcategory, then all you have to do is use table() for that column:
table(mfd$MFDcategory)
1 2 3 4 5 6 7 8 9 10
182 288 115 236 143 49 301 130 272 388
That is, there are 182 word tokens for category 1, namely care.virtue, as opposed to 288 tokens for category 2, namely care.vice, and so on. Does this help?
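If you would rather see those counts labelled with the key names instead of the numeric codes, you can attach the names in the order given above; a small sketch, assuming codes 1-10 follow the care/fairness/loyalty/authority/sanctity ordering with virtue before vice:
mfd_labels <- c("care.virtue", "care.vice", "fairness.virtue", "fairness.vice",
                "loyalty.virtue", "loyalty.vice", "authority.virtue", "authority.vice",
                "sanctity.virtue", "sanctity.vice")
setNames(table(mfd$MFDcategory), mfd_labels)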

Quanteda - Apply Function to DFM Over Document Variables

I am using R's quanteda package and the latest versions for both R and the package. I have a corpus of documents which number in the millions.
Let's suppose I have a DFM generated from quanteda with each document having a docvar of the date. There are thousands of documents generated in a given day, but I want to obtain the DFMs applied to the documents by day (so that I have total word counts for each term by day). I know that quanteda is built using data.table, so it should be possible to do this, but I have found little in the "Getting Started with Quanteda" or on StackOverflow that gives a clean way of doing this.
Any suggestions?
You want the 'groups' argument to dfm:
> # Add some random dates to an existing corpus
> docvars(data_corpus_inaugural)$date <- rep(as.Date(runif(19, 1, 18000), origin='1970-01-01'), 3)
> dfm_inaugural <- dfm(data_corpus_inaugural, groups='date')
> head(dfm_inaugural)
Document-feature matrix of: 19 documents, 9,215 features (80.8% sparse).
(showing first 6 documents and first 6 features)
features
docs fellow citizens i appear before you
1970-12-27 4 7 39 2 10 17
1972-04-25 8 13 29 1 8 8
1973-08-22 1 3 48 1 6 1
1973-10-11 2 4 25 0 3 5
1974-01-05 3 9 57 0 7 2
1975-04-12 7 21 63 4 6 16
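Note that this answer uses an older quanteda API. In current versions (>= 3.0) the groups argument has moved out of dfm(); a sketch of the equivalent, assuming each document carries a date docvar:
library("quanteda")
# assign a date to each document (the asker's corpus would already have one)
docvars(data_corpus_inaugural, "date") <-
  sample(seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "day"),
         ndoc(data_corpus_inaugural), replace = TRUE)
dfmat <- dfm(tokens(data_corpus_inaugural))
# collapse rows so there is one row of term counts per day
dfm_byday <- dfm_group(dfmat, groups = docvars(dfmat, "date"))
head(dfm_byday)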

Feature selection in document-feature matrix by using chi-squared test

I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test.
I know a lot of people have already asked this question. However, I couldn't find the relevant code for it. (The answers just gave the general concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)
I learned that I could use chi.squared in the FSelector package, but I don't know how to apply this function to a dfm-class object (trainingtfidf below). (As shown in the manual, it applies to the predictor variables.)
Could anyone give me a hint? I appreciate it!
Example code:
description <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
code <- c(4,3,6)
example <- data.frame(description, code)
library(quanteda)
trainingcorpus <- corpus(example$description)
trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem=TRUE, toLower=TRUE, removePunct= TRUE, removeSeparators=TRUE, language="english", ignoredFeatures = stopwords("english"), removeNumbers=TRUE, ngrams = 2)
# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize=TRUE)
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
Here's a general method for computing Chi-squared values for features. It requires that you have some variable against which to form the associations, which here could be some classification variable you are using for training your classifier.
Note that I am showing how to do this in the quanteda package, but the results should be general enough to work for other text-package matrix objects. Here, I am using data from the auxiliary quanteda.corpora package, which has all of the State of the Union addresses of US presidents.
data(data_corpus_sotu, package = "quanteda.corpora")
table(docvars(data_corpus_sotu, "party"))
## Democratic Democratic-Republican Federalist Independent
## 90 28 4 8
## Republican Whig
## 9 8
sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))
# compute chi-squared values for each feature
chi2vals <- apply(sotuDfm, 2, function(x) {
  chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
})
head(sort(chi2vals, decreasing = TRUE), 10)
## government will united states year public congress upon
## 85.19783 74.55845 68.62642 66.57434 64.30859 63.19322 59.49949 57.83603
## war people
## 57.43142 57.38697
These can now be selected using the dfm_select() command. (Note that column indexing by name would also work.)
# select just 100 top Chi^2 vals from dfm
dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs)
## Document-feature matrix of: 182 documents, 100 features.
## (showing first 6 documents and first 6 features)
## features
## docs citizens government upon duties constitution present
## Jackson-1830 14 68 67 12 17 23
## Jackson-1831 21 26 13 7 5 22
## Jackson-1832 17 36 23 11 11 18
## Jackson-1829 17 58 37 16 7 17
## Jackson-1833 14 43 27 18 1 17
## Jackson-1834 24 74 67 11 11 29
Added: With >= v0.9.9 this can be done using the textstat_keyness() function.
# to avoid empty factors
docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- data_corpus_sotu %>%
corpus_subset(party %in% c("Democratic", "Republican")) %>%
dfm(remove = stopwords("english"))
chi2vals <- dfm_group(sotuDfm, "party") %>%
textstat_keyness(measure = "chi2")
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 5 million 132.3267 0 366 131
# 6 texas 101.1991 0 174 37
This information can then be used to select the most discriminating features, after the sign of the chi^2 score is removed.
# remove sign
chi2vals$chi2 <- abs(chi2vals$chi2)
# sort
chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 29044 commission 190.3010 0 175 588
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 29043 law 137.8330 0 607 1178
dfmTop100cs <- dfm_select(sotuDfm, chi2vals$feature)
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs, nf = 6)
Document-feature matrix of: 6 documents, 6 features (0% sparse).
6 x 6 sparse Matrix of class "dfm"
features
docs fellow citizens senate house representatives :
Jackson-1829 5 17 2 3 5 1
Jackson-1830 6 14 4 6 9 3
Jackson-1831 9 21 3 1 4 1
Jackson-1832 6 17 4 1 2 1
Jackson-1833 2 14 7 4 6 1
Jackson-1834 3 24 5 1 3 5
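If you specifically want to use FSelector::chi.squared(), the main obstacle is that it expects an ordinary data.frame with a class column rather than a dfm. A rough sketch of the conversion, assuming a document-level class variable code as in the question (densifying is only feasible for small dfms, and feature names need sanitising because n-grams are not valid R names):
library(FSelector)
df <- as.data.frame(as.matrix(trainingtfidf))  # dense data.frame of features
names(df) <- make.names(names(df))             # make feature names valid R names
df$code <- factor(example$code)                # append the class variable
weights <- chi.squared(code ~ ., data = df)    # chi-squared importance per feature
head(weights[order(-weights$attr_importance), , drop = FALSE])
cutoff.k(weights, k = 5)                       # names of the 5 highest-scoring features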

Word frequency over time by user in R

I'm aiming to make a bump chart of word frequency over time. I have about 36000 individual entries of a user's comment and an associated date. I have a 25 user sample available here: http://pastebin.com/kKfby5kf
I'm trying to get the most frequent words (maybe top 10?) on a given date. I feel like my methodology is close, but not quite right:
library("tm")
frequencylist <- list(0)
for (i in unique(sampledf[, 2])) {
  subset <- subset(sampledf, sampledf[, 2] == i)
  comments <- as.vector(subset[, 1])
  verbatims <- Corpus(VectorSource(comments))
  verbatims <- tm_map(verbatims, stripWhitespace)
  verbatims <- tm_map(verbatims, content_transformer(tolower))
  verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
  verbatims <- tm_map(verbatims, removePunctuation)
  stopwords2 <- c("game")
  verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
  dtm <- DocumentTermMatrix(verbatims2)
  dtm2 <- as.matrix(dtm)
  frequency <- colSums(dtm2)
  frequency <- sort(frequency, decreasing = TRUE)
  frequencydf <- data.frame(frequency)
  frequencydf$comments <- row.names(frequencydf)
  frequencydf$date <- i
  frequencylist[[i]] <- frequencydf
}
An explanation of my madness: the pastebin example goes into sampledf. For each unique date in the sample, I'm trying to get a word frequency. I'm then attempting to store that tabulated word frequency in a list (might not be the best approach, though). First, I subset by date, then strip whitespace, common English words, punctuation, and lowercase it all. I then do another pass of word removal for "game" since it's not too interesting but very common. To get the word frequency, I then pass it into a document term matrix and do a simple colSums(). Then I append the date for that table and try to store it in a list.
I'm not sure if my strategy is valid to begin with. Is there a simpler, better approach to this problem?
The commenters are correct that there are better ways to set up a reproducible example. In addition, your question could be more specific about what you are trying to accomplish as output. (I could not get your code to execute without error.)
However: You asked for a simpler, better approach. Here is what I think is both. It uses the quanteda text package and exploits the groups feature when creating the document-feature matrix. Then it performs some rankings on the "dfm" to get what you need in terms of daily term rankings.
Note that this is based on my having loaded your linked data using read.delim("sampledf.tsv", stringsAsFactors = FALSE).
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip,
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))
# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete.
## Elapsed time: 0.009 seconds.
myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types
# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
## â great nice play go will can get ever first
## 2013-10-02 1 18 19 20 21 22 23 24 25 26
## 2013-10-04 3 1 2 4 5 6 7 8 9 10
## 2013-10-05 3 9 28 29 1 2 4 5 6 7
## 2013-10-06 7 4 8 10 11 30 31 32 33 34
## 2013-10-07 5 1 2 3 4 6 7 8 9 10
## 2013-10-09 12 42 43 1 2 3 4 5 6 7
## 2013-10-13 1 14 6 9 10 13 44 45 46 47
## 2013-10-16 2 3 84 85 1 4 5 6 7 8
## 2013-10-18 15 1 2 3 4 5 6 7 8 9
## 2013-10-19 3 86 1 2 4 5 6 7 8 9
## 2013-10-22 2 87 88 89 90 91 92 93 94 95
## 2013-10-23 13 98 99 100 101 102 103 104 105 106
## 2013-10-25 4 6 5 12 16 109 110 111 112 113
## 2013-10-27 8 4 6 15 17 124 125 126 127 128
## 2013-10-30 11 1 2 3 4 5 6 7 8 9
## 2014-10-01 7 16 139 1 2 3 4 5 6 8
## 2014-10-02 140 1 2 3 4 5 6 7 8 9
## 2014-10-03 141 142 143 1 2 3 4 5 6 7
## 2014-10-05 144 145 146 147 148 1 2 3 4 5
## 2014-10-06 17 149 150 1 2 3 4 5 6 7
# top n features by day
n <- 10
as.data.frame(apply(featureRanksByDate, 1, function(x) {
  todaysTopFeatures <- names(featureRanksByDate)
  names(todaysTopFeatures) <- x
  todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
## 2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1 â great go triple great play â go great nice year year
## 2 win nice will niple nice go created â nice play â give
## 3 year â â backflip play will wasnt great play â give good
## 4 give play can great go can money will go go good hard
## 5 good go get scope â get prizes can will will hard time
## 6 hard will ever ball will ever nice get can can time triple
## 7 time can first â can first piece ever get get triple niple
## 8 triple get fun nice get fun dead first ever ever niple backflip
## 9 niple ever great testical ever win play fun first first backflip scope
## 10 backflip first win play first year go win fun fun scope ball
## 2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1 scope scope great play great play will play
## 2 ball ball nice go nice go can go
## 3 testical testical play will play will get will
## 4 â great go can go can ever can
## 5 nice shot will get will get first get
## 6 great nice can ever can ever fun ever
## 7 shot head get â get first win first
## 8 head â ever first ever fun year fun
## 9 dancing dancing first fun first win give win
## 10 cow cow fun win fun year good year
BTW interesting spellings of niple and testical.
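As a footnote, in current quanteda (>= 3.0, together with quanteda.textstats) the per-day top-feature step can be done more directly with textstat_frequency() and its groups argument; a sketch, assuming sampledf is loaded as above:
library("quanteda")
library("quanteda.textstats")
corp <- corpus(sampledf$content_strip,
               docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))
dfmat <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(c(stopwords("en"), "game")) %>%
  dfm()
# top 10 features for each day
textstat_frequency(dfmat, n = 10, groups = docvars(dfmat, "date"))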
