R Text Mining with quanteda

I have a data set of Facebook posts (exported via Netvizz) and I am using the quanteda package in R. Here is my R code.
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read File
# Facebook posts can be exported with FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep = ";")
# Define the relevant column(s)
fb_test <- as.character(fbpost$comment_message)  # one column with 2700 entries
# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# LIWC Application
fb_liwc <- dfm(fb_corp, dictionary = liwcdict)
View(fb_liwc)
Everything works until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message, and are there any suggestions for solving the problem?

There was a bug in quanteda version 0.7.2 that caused dfm() to fail when applying a dictionary if one of the documents contained no features. Your example fails because some of the Facebook post "documents" end up having all of their features removed during the cleaning step.
This is not only fixed in 0.8.0, but we also changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and its regular expressions still mean that it is much slower to apply than simply indexing tokens. We will work on optimising this further.)
devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
## ... indexing 57 documents
## ... lowercasing
## ... tokenizing
## ... shaping tokens into data.table, found 134,024 total tokens
## ... applying a dictionary consisting of 68 key entries
## ... summing dictionary-matched features by document
## ... indexing 68 feature types
## ... building sparse matrix
## ... created a 57 x 68 sparse dfm
## ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
##  Fillers    Nonfl    Swear       TV   Eating    Sleep    Groom    Death   Sports   Sexual
##        0        0        0       42       47       49       53       76       81      100
It will also work if a document contains zero features after tokenization and cleaning, which is probably what is breaking the older dfm you are using with your Facebook texts.
mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams
## 3
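If you prefer to drop those empty documents before any downstream analysis, here is a minimal sketch using plain dfm row subsetting (my addition, not part of the original answer):
# keep only documents with at least one dictionary match (sketch)
mydfm_nonempty <- mydfm[rowSums(mydfm) > 0, ]
ndoc(mydfm_nonempty)   # 56 here, since only "1797-Adams" was emptied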

Related

Extract top positive and negative features when applying dictionary in quanteda

I have a data frame with around 100k rows that contain textual data. Using the quanteda package, I apply sentiment analysis (Lexicoder dictionary) to eventually calculate a sentiment score.
For an additional, more qualitative step of the analysis, I would like to extract the top features (i.e., the negative/positive words from the dictionary that occur most frequently in my data) to examine whether the discourse is driven by particular words.
my_corpus <- corpus(my_df, docid_field = "ID", text_field = "my_text", metacorpus = NULL, compress = FALSE)
sentiment_corp <- dfm(my_corpus, dictionary = data_dictionary_LSD2015)
However, going through the quanteda documentation, I couldn't figure out how to achieve this. Is there a way?
I'm aware of topfeatures and I did read this question, but it didn't help.
In all of the quanteda functions that take a pattern argument, the valid types of patterns are character vectors, lists, and dictionaries. So the best way to assess the top features in each dictionary category (what we also call a dictionary key) is to select on that dictionary and then use topfeatures().
Here is how to do this using the built-in data_corpus_irishbudget2010 object, as an example, with the Lexicoder Sentiment Dictionary.
library("quanteda")
## Package version: 1.4.3
# tokenize and select just the dictionary value matches
toks <- tokens(data_corpus_irishbudget2010) %>%
  tokens_select(pattern = data_dictionary_LSD2015)
lapply(toks[1:5], head)
## $`Lenihan, Brian (FF)`
## [1] "severe" "distress" "difficulties" "recovery"
## [5] "benefit" "understanding"
##
## $`Bruton, Richard (FG)`
## [1] "failed" "warnings" "sucking" "losses" "debt" "hurt"
##
## $`Burton, Joan (LAB)`
## [1] "remarkable" "consensus" "Ireland" "opposition" "knife"
## [6] "dispute"
##
## $`Morgan, Arthur (SF)`
## [1] "worst" "worst" "well" "corrupt" "golden" "protected"
##
## $`Cowen, Brian (FF)`
## [1] "challenge" "succeeding" "challenge" "oppose"
## [5] "responsibility" "support"
To explore the top matches for the positive entry, we can select them further by subsetting the dictionary for the Positive key.
# top positive matches
tokens_select(toks, pattern = data_dictionary_LSD2015["positive"]) %>%
  dfm() %>%
  topfeatures()
## benefit support recovery fair create confidence
## 68 52 44 41 39 37
## provide well credit help
## 36 33 31 29
And for Negative:
# top negative matches
tokens_select(toks, pattern = data_dictionary_LSD2015[["negative"]]) %>%
  dfm() %>%
  topfeatures()
## ireland benefit not support crisis recovery
## 79 68 52 52 47 44
## fair create deficit confidence
## 41 39 38 37
Why is "Ireland" a negative match? Because the LSD2015 includes ir* as a negative word that is intended to match ire and ireful but with the default case insensitive matching, also matches Ireland (a term frequently used in this example corpus). This is an example of a "false positive" match, always a risk in dictionaries when using wildcarding or when using a language such as English that has a very high rate of polysemes and homographs.

Find frequency of terms from Function

I need to find the frequency of terms returned by a function I have created that finds terms containing punctuation.
library("tm")
my.text.location <- "C:/Users/*/"
newpapers <- VCorpus(DirSource(my.text.location))
I read it in, then make the function:
library("stringr")
punctterms <- function(x){str_extract_all(x, "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}")}
terms <- lapply(newpapers, punctterms)
Now I'm lost as to how I will find the frequency of each term in each file. Do I turn it into a DTM, or is there a better way without it?
Thank you!
This task is better suited to quanteda than tm. Your function creates a list and strips everything out of the corpus object. With quanteda you can use the package's own commands to get everything you want.
Since you didn't provide any reproducible data, I will use a data set that comes with quanteda. Comments above the code explain what is going on. The most important function in this code is dfm_select(), in which you can use a diverse set of selection patterns to find terms in the text.
library(quanteda)
# load corpus
my_corpus <- corpus(data_corpus_inaugural)
# create document features (like document term matrix)
my_dfm <- dfm(my_corpus)
# dfm_select can use regex selections to select terms
my_dfm_punct <- dfm_select(my_dfm,
                           pattern = "[[:alnum:]]{1,}[[:punct:]]{1,}?[[:alnum:]]{1,}",
                           selection = "keep",
                           valuetype = "regex")
# show frequency of selected terms.
head(textstat_frequency(my_dfm_punct))
          feature frequency rank docfreq group
1 fellow-citizens        39    1      19   all
2       america's        35    2      11   all
3 self-government        30    3      16   all
4         world's        24    4      15   all
5        nation's        22    5      13   all
6           god's        15    6      14   all
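The table above gives corpus-wide totals; if you also want the frequency of each term in each file (as asked), a minimal sketch (my addition) converts the same dfm to a plain data frame:
# per-document counts of the selected terms (sketch)
per_doc_counts <- convert(my_dfm_punct, to = "data.frame")
head(per_doc_counts[, 1:5])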
So I got it to work without using quanteda:
m <- as.data.frame(table(unlist(terms)))
names(m) <- c("Terms", "Frequency")
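Note that table(unlist(terms)) pools the matches from all files into one table; for per-file counts with the same base-R approach, a small variation (my sketch):
# one frequency table per file (sketch)
per_file <- lapply(terms, function(x) as.data.frame(table(unlist(x))))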

R: partial match dictionary terms using grep and tm package

Hi: I have a dictionary of negative terms that was prepared by others. I am not sure how they went about the stemming, but it looks like they used something other than the Porter Stemmer. The dictionary has a wildcard character (*) that I think is supposed to enable stemming to happen. But I don't know how to make use of that with grep() or the tm package in R, so I stripped it out, hoping to find a way to grep the partial match.
So the original dictionary looks like this:
#load libraries
library(tm)
#sample dictionary terms for polarize and outlaw
negative <- c('polariz*', 'outlaw*')
#strip out the wildcard
negative <- gsub('*', '', negative, fixed = TRUE)
#test corpus
test <- c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')
#Here is how R's porter stemmer stems the text
stemDocument(test)
So if I stemmed my corpus with R's stemmer, terms like 'outlaw' would be found in the dictionary, but terms like 'polarized' would not match because they are stemmed differently from what is in the dictionary.
So what I would like is some way to have the tm package match only on exact parts of each word. Without stemming my documents, I would like it to pick out 'outlaw' in 'outlawing' and 'outlaws', and 'polariz' in 'polarized', 'polarizing', and 'polarizes'. Is this possible?
#Define corpus
test.corp <- Corpus(VectorSource(test))
#make Document-Term Matrix
dtm <- DocumentTermMatrix(test.corp, control = list(dictionary = negative))
#inspect
inspect(dtm)
I haven't seen any tm answers, so here's one using the quanteda package as an alternative. It allows you to use "glob" wildcard values in your dictionary entries, which is the default valuetype for quanteda's dictionary functions. (See ?dictionary.) With this approach, you do not need to stem your text.
library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.6.2’
# create a quanteda dictionary, essentially a named list
negative <- dictionary(list(polariz = 'polariz*', outlaw = 'outlaw*'))
negative
## Dictionary object with 2 key entries.
## - polariz: polariz*
## - outlaw: outlaw*
test <- c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')
dfm(test, dictionary = negative, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 7 documents, 2 features.
## 7 x 2 sparse Matrix of class "dfmSparse"
## features
## docs polariz outlaw
## text1 1 0
## text3 1 0
## text2 1 0
## text4 1 0
## text5 0 1
## text6 0 1
## text7 0 1
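As a quick follow-up (my addition, not part of the original answer), the total matches per dictionary key across the whole test vector can be read straight off that dfm with colSums():
dict_dfm <- dfm(test, dictionary = negative, valuetype = "glob", verbose = FALSE)
colSums(dict_dfm)
##  polariz   outlaw
##        4        3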

R readr package - written and read in file doesn't match source

I apologize in advance for the lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and do the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source, which can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes with the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
          row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
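Two things that may help here (my suggestions, not from the original thread): readr records its parse failures so they can be inspected with problems(), and since the object only needs to round-trip through R, a binary format such as RDS avoids CSV parsing (and any embedded-null issues) altogether:
# inspect exactly which rows and columns failed to parse (sketch)
parse_issues <- readr::problems(sa_all2a)
head(parse_issues)
# write/read the R object directly instead of going through CSV (sketch)
saveRDS(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.rds')
sa_all3 <- readRDS('D:\\Open_Payments\\data\\written_files\\sa_all.rds')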

Can I check the frequencies of predetermined words or phrases in document clustering using R?

I'm doing text mining using the "tm" package in R, and I can get word frequencies after I generate a term-document matrix:
freq <- colSums(as.matrix(dtm))
ord <- order(freq)
freq[head(ord)]
# abit acal access accord across acsess
# 1 1 1 1 1 1
freq[tail(ord)]
# direct save month will thank list
# 106 107 116 122 132 154
It only presents me with a list of word frequencies in sequence. I was wondering whether I can check a word's frequency individually, and whether I can also check a phrase's frequency. For example, how many times does the word "thank" appear in the corpus, and what is the frequency of the phrase "contact number" in this corpus?
Many thanks for any hints and suggestions.
I will show this using the data that comes with the tm package:
library(tm)
data(crude)
dtm <- as.matrix(DocumentTermMatrix(crude))
#find the column that contains the word "demand"
columnindices <- which(colnames(dtm) == "demand")
#how often does the word "demand" show up?
sum(dtm[, columnindices])
## [1] 6
If you want to do this with phrases, your dtm must contain these phrases, not just the bag of single words as is used in most cases. If this data is available, the procedure is the same as for a single word.
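For the phrase case, one option (my sketch, switching to quanteda and using "crude oil" as the example phrase) is to build a bigram document-feature matrix and look the phrase up there:
library(quanteda)
# rebuild plain texts from the tm corpus, then tokenize
crude_texts <- vapply(crude, function(d) paste(as.character(d), collapse = " "), character(1))
toks <- tokens(crude_texts, remove_punct = TRUE)
# bigram dfm: phrases become features joined by "_"
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2))
# frequency of the phrase "crude oil" across the corpus
sum(bigram_dfm[, "crude_oil"])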
