Feature selection in a document-feature matrix using a chi-squared test - r

I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test.
I know a lot of people have already asked this question, but I couldn't find the relevant code for it. (The answers just give a brief concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)
I learned that I could use chi.squared in the FSelector package, but I don't know how to apply this function to a dfm-class object (trainingtfidf below). (The manual shows it being applied to a data frame of predictor variables.)
Could anyone give me a hint? I appreciate it!
Example code:
description <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
code <- c(4,3,6)
example <- data.frame(description, code)
library(quanteda)
trainingcorpus <- corpus(example$description)
trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem = TRUE, toLower = TRUE,
                   removePunct = TRUE, removeSeparators = TRUE, language = "english",
                   ignoredFeatures = stopwords("english"), removeNumbers = TRUE, ngrams = 2)
# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize=TRUE)
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
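The closest I have got so far is converting the dfm to an ordinary data frame and passing that to FSelector (just a sketch; I am not sure this is the right way to use chi.squared):
# my attempt (a sketch only): treat each dfm column as a predictor and the
# code column from the example as the class label
library(FSelector)
training_df <- as.data.frame(as.matrix(trainingtfidf))
training_df$code <- as.factor(example$code)
weights <- chi.squared(code ~ ., data = training_df)
# inspect the highest-weighted features
head(weights[order(-weights$attr_importance), , drop = FALSE])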

Here's a general method for computing Chi-squared values for features. It requires that you have some variable against which to form the associations, which here could be some classification variable you are using for training your classifier.
Note that I am showing how to do this in the quanteda package, but the results should be general enough to work for other text packages' matrix objects. Here, I am using the data from the auxiliary quanteda.corpora package, which has all of the State of the Union addresses of US presidents.
data(data_corpus_sotu, package = "quanteda.corpora")
table(docvars(data_corpus_sotu, "party"))
##            Democratic Democratic-Republican            Federalist 
##                    90                    28                     4 
##           Independent            Republican                  Whig 
##                     8                    92                     8 
sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))
# compute chi-squared values for each feature
chi2vals <- apply(sotuDfm, 2, function(x) {
    chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
})
head(sort(chi2vals, decreasing = TRUE), 10)
## government       will     united     states       year     public   congress       upon 
##   85.19783   74.55845   68.62642   66.57434   64.30859   63.19322   59.49949   57.83603 
##        war     people 
##   57.43142   57.38697 
These can now be selected using the dfm_select() command. (Note that column indexing by name would also work.)
# select just 100 top Chi^2 vals from dfm
dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs)
## Document-feature matrix of: 182 documents, 100 features.
## (showing first 6 documents and first 6 features)
## features
## docs citizens government upon duties constitution present
## Jackson-1830 14 68 67 12 17 23
## Jackson-1831 21 26 13 7 5 22
## Jackson-1832 17 36 23 11 11 18
## Jackson-1829 17 58 37 16 7 17
## Jackson-1833 14 43 27 18 1 17
## Jackson-1834 24 74 67 11 11 29
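As noted above, ordinary column indexing by feature name gives the same result; a minimal sketch:
# equivalent selection by indexing dfm columns by name
top100 <- names(head(sort(chi2vals, decreasing = TRUE), 100))
dfmTop100cs2 <- sotuDfm[, top100]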
Added: With quanteda >= v0.9.9, this can be done using the textstat_keyness() function.
# to avoid empty factors
docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- data_corpus_sotu %>%
    corpus_subset(party %in% c("Democratic", "Republican")) %>%
    dfm(remove = stopwords("english"))
chi2vals <- dfm_group(sotuDfm, "party") %>%
    textstat_keyness(measure = "chi2")
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 5 million 132.3267 0 366 131
# 6 texas 101.1991 0 174 37
This information can then be used to select the most discriminating features, after the sign of the chi^2 score is removed.
# remove sign
chi2vals$chi2 <- abs(chi2vals$chi2)
# sort
chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 29044 commission 190.3010 0 175 588
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 29043 law 137.8330 0 607 1178
dfmTop100cs <- dfm_select(sotuDfm, head(chi2vals$feature, 100))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs, nf = 6)
Document-feature matrix of: 6 documents, 6 features (0% sparse).
6 x 6 sparse Matrix of class "dfm"
features
docs fellow citizens senate house representatives :
Jackson-1829 5 17 2 3 5 1
Jackson-1830 6 14 4 6 9 3
Jackson-1831 9 21 3 1 4 1
Jackson-1832 6 17 4 1 2 1
Jackson-1833 2 14 7 4 6 1
Jackson-1834 3 24 5 1 3 5


How to set target in quanteda's "textstat_keyness" function associated with "date"

It works when I set target = year(dfmat_news$datee) >= 2016:
tstat_key <- textstat_keyness(hr_dfm,
                              measure = "chi2", sort = TRUE, correction = c("default"),
                              target = year(dfmat_news$datee) >= 2016)
But when I set target = date(dfmat_news$datee) >= 2016-02-01, it does not work.
datee is in date format (YYYY-MM-DD).
How do I set "target" with a date in the "textstat_keyness" function of the R package quanteda?
Thanks a lot!
The problem is that the comparison operator (>=) is comparing a numeric value to the "Date"-formatted field dfmat_news$datee, and this does not produce what you expect. The bare expression 2016-02-01 is evaluated as arithmetic (2016 - 02 - 01 = 2013), while Date fields are very different numbers when treated as integers. For instance:
> as.numeric(as.Date("2016-01-01"))
[1] 16801
So you should brush up on date operations in R, starting with ?Ops.Date.
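To see both sides of that comparison directly:
> 2016-02-01                                   # plain arithmetic, not a date
[1] 2013
> as.Date("2016-02-01") >= as.Date("2016-01-01")  # a proper Date comparison
[1] TRUE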
Here's a reproducible example solving this in the way that you want, using textstat_keyness(). You can see that this works on an expression that evaluates to a logical.
library("quanteda")
## Package version: 2.1.1
# this has a date field, but the package must be installed
# from https://github.com/quanteda/quanteda.corpora
data(data_corpus_sotu, package = "quanteda.corpora")
hr_dfm <- dfm(tail(data_corpus_sotu, 10)) %>%
    dfm_remove(stopwords("en"))
hr_dfm$Date
## [1] "2011-01-25" "2012-01-24" "2013-02-12" "2014-01-28" "2015-01-20"
## [6] "2016-01-12" "2017-02-28" "2018-01-30" "2019-02-05" "2020-02-04"
textstat_keyness(hr_dfm, target = hr_dfm$Date >= "2016-01-01") %>%
    head()
## feature chi2 p n_target n_reference
## 1 thank 65.17899 6.661338e-16 85 12
## 2 much 33.17024 8.443305e-09 49 9
## 3 great 28.21748 1.084209e-07 66 22
## 4 , 21.21601 4.103217e-06 1822 1791
## 5 drug 20.95085 4.712182e-06 21 1
## 6 border 20.25901 6.763404e-06 27 4
textstat_keyness(hr_dfm, target = rep(c(FALSE, TRUE), each = 5)) %>%
    head()
## feature chi2 p n_target n_reference
## 1 thank 65.17899 6.661338e-16 85 12
## 2 much 33.17024 8.443305e-09 49 9
## 3 great 28.21748 1.084209e-07 66 22
## 4 , 21.21601 4.103217e-06 1822 1791
## 5 drug 20.95085 4.712182e-06 21 1
## 6 border 20.25901 6.763404e-06 27 4
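The cutoff from the question works the same way once the right-hand side is an actual Date (note that with this cutoff the 2016-01-12 address falls on the reference side):
# the original cutoff, with a proper Date object on the right-hand side
textstat_keyness(hr_dfm, target = hr_dfm$Date >= as.Date("2016-02-01")) %>%
    head()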

Count number of words in a Dictionary file in R

I am reading a dictionary into R via the quanteda package. This package is preloaded with some great dictionaries, one of which is the Moral Foundations Dictionary that I am interested in. This dictionary has several categories (Harm, Fairness, Ingroup, etc.) which are broken down into virtue and vice subcategories.
I want to count the number of words that are in each subcategory for each foundation in R. How can I go about doing that?
For a reproducible example, I can access the Moral Foundations Dictionary (labeled as data_dictionary_MFD) by running library(quanteda.dictionaries)
Thank you!
It's not entirely clear what you are looking for, but this probably comes down to terminology. quanteda dictionaries use the terminology of "keys" for the canonical categories (in R, the names of the list elements), and "values" for the patterns used to match words for counting occurrences of each key.
The MFD has two sets of "keys": moral "foundations" such as care, fairness, etc, and "valences" represented by "vice" and "virtue" for each foundation category. As we have recorded it in quanteda.dictionaries::data_dictionary_MFD, however -- in v0.22 of quanteda.dictionaries at least -- the dictionary is flattened to just one level.
We can see this, and count the values in each dictionary "key" that combines here the foundation and the valence, as follows:
library("quanteda")
## Package version: 1.5.2
data(data_dictionary_MFD, package = "quanteda.dictionaries")
# number of "words" in each MFD dictionary key
lengths(data_dictionary_MFD)
## care.virtue care.vice fairness.virtue fairness.vice
## 182 288 115 236
## loyalty.virtue loyalty.vice authority.virtue authority.vice
## 142 49 301 130
## sanctity.virtue sanctity.vice
## 272 388
# first 5 values in each dictionary key
lapply(data_dictionary_MFD, head, 5)
## $care.virtue
## [1] "alleviate" "alleviated" "alleviates" "alleviating" "alleviation"
##
## $care.vice
## [1] "abused" "abuser" "abusers" "abuses" "abusing"
##
## $fairness.virtue
## [1] "avenge" "avenged" "avenger" "avengers" "avenges"
##
## $fairness.vice
## [1] "am partial" "bamboozle" "bamboozled" "bamboozles" "bamboozling"
##
## $loyalty.virtue
## [1] "all for one" "allegiance" "allegiances" "allegiant" "allied"
##
## $loyalty.vice
## [1] "against us" "apostate" "apostates" "backstab" "backstabbed"
##
## $authority.virtue
## [1] "acquiesce" "acquiesced" "acquiescent" "acquiesces" "acquiescing"
##
## $authority.vice
## [1] "anarchist" "anarchistic" "anarchists" "anarchy" "apostate"
##
## $sanctity.virtue
## [1] "abstinance" "abstinence" "allah" "almighty" "angel"
##
## $sanctity.vice
## [1] "abhor" "abhored" "abhors" "addict" "addicted"
To apply this to count the words matching a "key" (the combination of foundation and valence), we can create a dfm and then use dfm_lookup():
# number of words in a text matching the MFD dictionary
dfm(data_corpus_inaugural) %>%
    dfm_lookup(dictionary = data_dictionary_MFD) %>%
    tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
However, there is a better way that makes use of the nested structure of the MFD, but we will need to modify the dictionary object first to make it nested. As supplied, the MFD is "flattened". We want to unflatten it so that the foundations form the first-level keys and the valences form the second-level keys. Then, using the levels argument of tokens_lookup() and dfm_lookup(), we will be able to choose the level at which we count matches in our text.
First, recreate the dictionary to make it nested.
# remake the dictionary into nested categories of foundation and valence
data_dictionary_MFDnested <-
    dictionary(list(
        care = list(
            virtue = data_dictionary_MFD[["care.virtue"]],
            vice = data_dictionary_MFD[["care.vice"]]
        ),
        fairness = list(
            virtue = data_dictionary_MFD[["fairness.virtue"]],
            vice = data_dictionary_MFD[["fairness.vice"]]
        ),
        loyalty = list(
            virtue = data_dictionary_MFD[["loyalty.virtue"]],
            vice = data_dictionary_MFD[["loyalty.vice"]]
        ),
        authority = list(
            virtue = data_dictionary_MFD[["authority.virtue"]],
            vice = data_dictionary_MFD[["authority.vice"]]
        ),
        sanctity = list(
            virtue = data_dictionary_MFD[["sanctity.virtue"]],
            vice = data_dictionary_MFD[["sanctity.vice"]]
        )
    ))
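The same nesting can also be built programmatically from the flat key names; here is a sketch (same result, assuming the foundation.valence naming shown above; the manual data_dictionary_MFDnested object is what the rest of this answer uses):
# split each flat key ("care.virtue", ...) into foundation and valence,
# then rebuild the two-level list and turn it back into a dictionary
keyparts <- strsplit(names(data_dictionary_MFD), ".", fixed = TRUE)
foundations <- unique(sapply(keyparts, `[`, 1))
nested <- lapply(foundations, function(f) {
    list(virtue = data_dictionary_MFD[[paste(f, "virtue", sep = ".")]],
         vice   = data_dictionary_MFD[[paste(f, "vice", sep = ".")]])
})
names(nested) <- foundations
data_dictionary_MFDnested2 <- dictionary(nested)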
Inspecting this we can see details on the dictionary:
lengths(data_dictionary_MFDnested)
## care fairness loyalty authority sanctity
## 2 2 2 2 2
lapply(data_dictionary_MFDnested, lengths)
## $care
## virtue vice
## 182 288
##
## $fairness
## virtue vice
## 115 236
##
## $loyalty
## virtue vice
## 142 49
##
## $authority
## virtue vice
## 301 130
##
## $sanctity
## virtue vice
## 272 388
And now we can apply it to our texts:
# now apply it to texts
dfm(data_corpus_inaugural) %>%
    dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1) %>%
    tail()
## Document-feature matrix of: 6 documents, 5 features (0.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs care fairness loyalty authority sanctity
## 1997-Clinton 12 8 37 3 22
## 2001-Bush 29 12 37 20 22
## 2005-Bush 26 20 41 37 16
## 2009-Obama 24 9 34 20 21
## 2013-Obama 20 17 41 12 14
## 2017-Trump 23 6 44 21 16
dfm(data_corpus_inaugural) %>%
    dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 2) %>%
    tail()
## Document-feature matrix of: 6 documents, 2 features (0.0% sparse).
## 6 x 2 sparse Matrix of class "dfm"
## features
## docs virtue vice
## 1997-Clinton 68 14
## 2001-Bush 107 13
## 2005-Bush 117 23
## 2009-Obama 95 13
## 2013-Obama 94 10
## 2017-Trump 95 15
Specifying both levels (or the default of levels = 1:5) matches what we had originally with the flattened dictionary:
dfm(data_corpus_inaugural) %>%
    dfm_lookup(dictionary = data_dictionary_MFDnested, levels = 1:2) %>%
    tail()
## Document-feature matrix of: 6 documents, 10 features (10.0% sparse).
## 6 x 10 sparse Matrix of class "dfm"
## features
## docs care.virtue care.vice fairness.virtue fairness.vice
## 1997-Clinton 8 4 6 2
## 2001-Bush 21 8 11 1
## 2005-Bush 14 12 16 4
## 2009-Obama 18 6 8 1
## 2013-Obama 14 6 15 2
## 2017-Trump 16 7 2 4
## features
## docs loyalty.virtue loyalty.vice authority.virtue authority.vice
## 1997-Clinton 37 0 3 0
## 2001-Bush 36 1 18 2
## 2005-Bush 38 3 33 4
## 2009-Obama 33 1 18 2
## 2013-Obama 39 2 12 0
## 2017-Trump 44 0 20 1
## features
## docs sanctity.virtue sanctity.vice
## 1997-Clinton 14 8
## 2001-Bush 21 1
## 2005-Bush 16 0
## 2009-Obama 18 3
## 2013-Obama 14 0
## 2017-Trump 13 3
Not sure what your MFD corpus looks like; if it is the one hosted on osf.io/whjt2, then the first six lines will look like this (with mfd as the name for the data set and Wordtoken and MFDcategory as my column headers):
head(mfd)
Wordtoken MFDcategory
1 compassion 1
2 empathy 1
3 kindness 1
4 caring 1
5 generosity 1
6 benevolence 1
If your aim is just to find out how many words are listed under each of the ten levels of MFDcategory, then all you have to do is use table() for that column:
table(mfd$MFDcategory)
1 2 3 4 5 6 7 8 9 10
182 288 115 236 143 49 301 130 272 388
That is, there are 182 word tokens for category 1, namely care.virtue, as opposed to 288 tokens for category 2, namely care.vice, and so on. Does this help?
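If you want those counts labelled rather than numbered, here is a sketch that assumes the categories follow the foundation/valence order listed in the other answer (1 = care.virtue, 2 = care.vice, and so on):
# label the ten numeric categories (assumed order: care, fairness, loyalty,
# authority, sanctity, each as virtue then vice)
mfd$MFDlabel <- factor(mfd$MFDcategory, levels = 1:10,
                       labels = c("care.virtue", "care.vice", "fairness.virtue",
                                  "fairness.vice", "loyalty.virtue", "loyalty.vice",
                                  "authority.virtue", "authority.vice",
                                  "sanctity.virtue", "sanctity.vice"))
table(mfd$MFDlabel)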

Avoid memory increase in foreach loop in R

I am trying to create summary statistics combining two different spatial data sets: a big raster file and a polygon file. The idea is to get summary statistics of the raster values within each polygon.
Since the raster is too big to process at once, I create subtasks and process them in parallel, i.e. process each polygon from the SpatialPolygonsDataFrame separately.
The code works fine; however, after around 100 iterations I run into memory problems. Here is my code and what I intend to do:
# session setup
library("raster")
library("rgdal")
# multicore processing.
library("foreach")
library("doSNOW")
# assign three clusters to be used for current R session
cluster = makeCluster(3, type = "SOCK",outfile="")
registerDoSNOW(cluster)
getDoParWorkers()# check if it worked
# load base data
r.terra.2008<-raster("~/terra.tif")
spodf.malha.2007 <- readOGR("~/", "composed")
# bring both data-sets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007<-spTransform(spodf.malha.2007,CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008)==proj4string(spodf.malha.2007) # should be TRUE
# create a function to extract areas
function.landcover.sum <- function(r.landuse, spodf.pol){
    return(table(extract(r.landuse, spodf.pol)))
}
# apply it one one subset to see if it is working
function.landcover.sum(r.terra.2008,spodf.malha.2007[1,])
## parallel loop
# define package(s) to be use in the parallel loop
l.packages<-c("raster","sp")
# try a parallel loop for the first 6 polygons
l.results <- foreach(i = 1:6,
                     .packages = l.packages) %dopar% {
    print(paste("Processing Polygon ", i, ".", sep = ""))
    return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i,]))
}
Here the output is a list that looks like this:
l.results
[[1]]
9 10
193159 2567
[[2]]
7 9 10 12 14 16
17 256 1084 494 67 15
[[3]]
3 5 6 7 9 10 11 12
2199 1327 8840 8579 194437 1061 1073 1834
14 16
222 1395
[[4]]
3 6 7 9 10 12 16
287 102 728 329057 1004 1057 31
[[5]]
3 5 6 7 9 12 16
21 6 20 495 184261 4765 28
[[6]]
6 7 9 10 12 14
161 161 386 943 205 1515
So the result is rather small and should not be the source of the memory allocation problem. However, the following loop over the whole polygon dataset, which has >32,000 rows, causes memory allocation to exceed 8 GB after around 100 iterations.
# apply the parallel loop on the whole dataset
l.results <- foreach(i = 1:nrow(spodf.malha.2007),
                     .packages = l.packages) %dopar% {
    print(paste("Processing Polygon ", i, ".", sep = ""))
    return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i,]))
    # gc(reset=TRUE) # does not resolve the problem
    # closeAllConnections() # does not resolve the problem
}
What am I doing wrong?
Edit:
I tried (as suggested in the comments) to remove the object after each iteration in the internal loop, but it did not resolve the problem. I furthermore tried to rule out possible problems from repeated data imports by exporting the objects to the cluster environment up front:
clusterExport(cl = cluster,
              varlist = c("r.terra.2008", "function.landcover.sum", "spodf.malha.2007"))
without major changes. My R version is 3.4 on a Linux platform, so supposedly the patch linked in the first comment should already be included in this version. I also tried the parallel package as suggested in the first comment, but it made no difference.
You can try exact_extract in the exactextractr package. It is the fastest and most memory-safe function for extracting values from a raster. The main function is implemented in C++ and usually it doesn't need parallelization. Since you do not provide any example data, I post an example with real data:
library(raster)
library(sf)
library(exactextractr)
# Pull municipal boundaries for Brazil
brazil <- st_as_sf(getData('GADM', country='BRA', level=2))
# Pull gridded precipitation data
prec <- getData('worldclim', var='prec', res=10)
#transform precipitation data in a dummy land use map
lu <- prec[[1]]
values(lu) <- sample(1:10,ncell(lu),replace = T)
plot(lu)
#extract land uses class for each pixel inside each polygon
ex <- exact_extract(lu, brazil)
#apply table to the resulting list. Here I use just the first 5 elements to avoid long output
lapply(ex[1:5], function(x){
    # use x[, 1] because, by default, exact_extract provides in the second column
    # the coverage fraction of each pixel by each polygon
    table(x[, 1])
})
Here is the example output:
[[1]]
1 2 4 6 7 9 10
1 1 1 2 3 1 1
[[2]]
2 3 4 5 6 7 8 10
2 4 3 2 1 2 2 2
[[3]]
1 2 4 6 7 8 9 10
4 5 1 1 4 2 5 5
[[4]]
1 2 3 4 5 6 7 8 9 10
2 2 4 2 2 4 1 4 1 2
[[5]]
3 4 5 6 8 10
2 3 1 1 2 3
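Since the second column returned by exact_extract is the coverage fraction of each pixel, you can also weight the tabulation by it so that partially covered pixels count proportionally; a small sketch along the same lines (not part of the original example):
# area-weighted class tabulation: sum the coverage fractions per land-use class
lapply(ex[1:5], function(x){
    round(tapply(x$coverage_fraction, x$value, sum), 2)
})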

Word frequency over time by user in R

I'm aiming to make a bump chart of word frequency over time. I have about 36,000 individual entries of a user's comment and an associated date. A 25-user sample is available here: http://pastebin.com/kKfby5kf
I'm trying to get the most frequent words (maybe top 10?) on a given date. I feel like my methodology is close, but not quite right:
library("tm")
frequencylist <- list(0)
for(i in unique(sampledf[,2])){
    subset <- subset(sampledf, sampledf[,2]==i)
    comments <- as.vector(subset[,1])
    verbatims <- Corpus(VectorSource(comments))
    verbatims <- tm_map(verbatims, stripWhitespace)
    verbatims <- tm_map(verbatims, content_transformer(tolower))
    verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
    verbatims <- tm_map(verbatims, removePunctuation)
    stopwords2 <- c("game")
    verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
    dtm <- DocumentTermMatrix(verbatims2)
    dtm2 <- as.matrix(dtm)
    frequency <- colSums(dtm2)
    frequency <- sort(frequency, decreasing=TRUE)
    frequencydf <- data.frame(frequency)
    frequencydf$comments <- row.names(frequencydf)
    frequencydf$date <- i
    frequencylist[[i]] <- frequencydf
}
An explanation of my madness: the pastebin example goes into sampledf. For each unique date in the sample, I'm trying to get a word frequency. I'm then attempting to store that tabulated word frequency in a list (might not be the best approach, though). First, I subset by date, then strip whitespace, common English words, punctuation, and lowercase it all. I then do another pass of word removal for "game" since it's not too interesting but very common. To get the word frequency, I then pass it into a document term matrix and do a simple colSums(). Then I append the date for that table and try to store it in a list.
I'm not sure if my strategy is valid to begin with. Is there a simpler, better approach to this problem?
The commenters are correct that there are better ways to set up a reproducible example. In addition, your question could be more specific about what you are trying to accomplish as an output. (I could not get your code to execute without error.)
However: You asked for a simpler, better approach. Here is what I think is both. It uses the quanteda text package and exploits the groups feature when creating the document-feature matrix. Then it performs some rankings on the "dfm" to get what you need in terms of daily term rankings.
Note that this is based on my having loaded your linked data using read.delim("sampledf.tsv", stringsAsFactors = FALSE).
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip,
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))
# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete.
## Elapsed time: 0.009 seconds.
myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types
# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
## â great nice play go will can get ever first
## 2013-10-02 1 18 19 20 21 22 23 24 25 26
## 2013-10-04 3 1 2 4 5 6 7 8 9 10
## 2013-10-05 3 9 28 29 1 2 4 5 6 7
## 2013-10-06 7 4 8 10 11 30 31 32 33 34
## 2013-10-07 5 1 2 3 4 6 7 8 9 10
## 2013-10-09 12 42 43 1 2 3 4 5 6 7
## 2013-10-13 1 14 6 9 10 13 44 45 46 47
## 2013-10-16 2 3 84 85 1 4 5 6 7 8
## 2013-10-18 15 1 2 3 4 5 6 7 8 9
## 2013-10-19 3 86 1 2 4 5 6 7 8 9
## 2013-10-22 2 87 88 89 90 91 92 93 94 95
## 2013-10-23 13 98 99 100 101 102 103 104 105 106
## 2013-10-25 4 6 5 12 16 109 110 111 112 113
## 2013-10-27 8 4 6 15 17 124 125 126 127 128
## 2013-10-30 11 1 2 3 4 5 6 7 8 9
## 2014-10-01 7 16 139 1 2 3 4 5 6 8
## 2014-10-02 140 1 2 3 4 5 6 7 8 9
## 2014-10-03 141 142 143 1 2 3 4 5 6 7
## 2014-10-05 144 145 146 147 148 1 2 3 4 5
## 2014-10-06 17 149 150 1 2 3 4 5 6 7
# top n features by day
n <- 10
as.data.frame(apply(featureRanksByDate, 1, function(x) {
    todaysTopFeatures <- names(featureRanksByDate)
    names(todaysTopFeatures) <- x
    todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
## 2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1 â great go triple great play â go great nice year year
## 2 win nice will niple nice go created â nice play â give
## 3 year â â backflip play will wasnt great play â give good
## 4 give play can great go can money will go go good hard
## 5 good go get scope â get prizes can will will hard time
## 6 hard will ever ball will ever nice get can can time triple
## 7 time can first â can first piece ever get get triple niple
## 8 triple get fun nice get fun dead first ever ever niple backflip
## 9 niple ever great testical ever win play fun first first backflip scope
## 10 backflip first win play first year go win fun fun scope ball
## 2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1 scope scope great play great play will play
## 2 ball ball nice go nice go can go
## 3 testical testical play will play will get will
## 4 â great go can go can ever can
## 5 nice shot will get will get first get
## 6 great nice can ever can ever fun ever
## 7 shot head get â get first win first
## 8 head â ever first ever fun year fun
## 9 dancing dancing first fun first win give win
## 10 cow cow fun win fun year good year
BTW interesting spellings of niple and testical.
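If you want to go on to the bump chart from your original question, here is a rough sketch (not part of the answer above) that computes per-day ranks directly with rank() and plots the daily top 10; ggplot2 is assumed to be installed:
library(ggplot2)
m <- as.matrix(myDfm)                       # daily term counts, rows = dates
ranksLong <- do.call(rbind, lapply(rownames(m), function(d) {
    data.frame(date    = as.Date(d),
               feature = colnames(m),
               rank    = rank(-m[d, ], ties.method = "first"),
               stringsAsFactors = FALSE)
}))
top10 <- subset(ranksLong, rank <= 10)
ggplot(top10, aes(x = date, y = rank, group = feature, colour = feature)) +
    geom_line() + geom_point() +
    scale_y_reverse(breaks = 1:10)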

Sentiment Analysis in R

I am new to sentiment analysis and have no idea how to go about it using R, so I would like to seek help and guidance on this.
I have a set of data consisting of opinions and would like to analyse them.
Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart..
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80.
Thank you.
Sentiment analysis encompasses a broad category of methods designed to measure positive versus negative sentiment from text, so that makes this a fairly difficult question to answer simply. But here is a simple answer: You can apply a dictionary to your document-term matrix and then combine the positive versus negative key categories of your dictionary to create a sentiment measure.
I suggest trying this in the text analysis package quanteda, which handles a variety of existing dictionary formats and allows you to create very flexible custom dictionaries.
For example:
require(quanteda)
mycorpus <- subset(inaugCorpus, Year>1980)
mydict <- dictionary(list(negative = c("detriment*", "bad*", "awful*", "terrib*", "horribl*"),
                          postive = c("good", "great", "super*", "excellent")))
myDfm <- dfm(mycorpus, dictionary = mydict)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 9 documents
## ... indexing features: 3,113 feature types
## ... applying a dictionary consisting of 2 keys
## ... created a 9 x 2 sparse dfm
## ... complete.
## Elapsed time: 0.057 seconds.
myDfm
## Document-feature matrix of: 9 documents, 2 features.
## 9 x 2 sparse Matrix of class "dfmSparse"
## features
## docs negative postive
## 1981-Reagan 0 6
## 1985-Reagan 0 6
## 1989-Bush 0 18
## 1993-Clinton 1 2
## 1997-Clinton 2 8
## 2001-Bush 1 6
## 2005-Bush 0 8
## 2009-Obama 2 3
## 2013-Obama 1 3
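As mentioned at the top of this answer, the two keys can then be combined into a single sentiment measure; one simple sketch (not part of the output above) is the net count per document:
# net sentiment: positive minus negative dictionary matches per speech
# (the key is spelled "postive" to match the dictionary defined above)
m <- as.matrix(myDfm)
sentiment <- m[, "postive"] - m[, "negative"]
sort(sentiment, decreasing = TRUE)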
# use a LIWC dictionary - obviously you need this file
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
myDfmLIWC <- dfm(mycorpus, dictionary = liwcdict)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 9 documents
## ... indexing features: 3,113 feature types
## ... applying a dictionary consisting of 68 keys
## ... created a 9 x 68 sparse dfm
## ... complete.
## Elapsed time: 1.844 seconds.
myDfmLIWC[, grep("^Pos|^Neg", features(myDfmLIWC))]
## Document-feature matrix of: 9 documents, 4 features.
## 9 x 4 sparse Matrix of class "dfmSparse"
## features
## docs Negate Posemo Posfeel Negemo
## 1981-Reagan 46 89 5 24
## 1985-Reagan 28 104 7 33
## 1989-Bush 40 102 10 8
## 1993-Clinton 25 51 3 23
## 1997-Clinton 27 64 5 22
## 2001-Bush 40 80 6 27
## 2005-Bush 25 117 5 31
## 2009-Obama 40 83 5 46
## 2013-Obama 42 80 13 22
For your corpus, assuming that you get it into a data.frame called data, you can create a quanteda corpus using:
mycorpus <- corpus(data$Content, docvars = data[, 1:2])
See also ?textfile for loading in content from files in one easy command. This works with .csv files for instance, although you would have problems with that file because the Content field contains text containing commas.
There are many other ways to measure sentiment of course, but if you are new to sentiment mining and R, that should get you started. You can read more on sentiment mining methods (and apologies if you already have encountered them) from:
Liu, Bing. 2010. "Sentiment Analysis and Subjectivity." Handbook of Natural Language Processing 2: 627–66.
Liu, Bing. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
