R sentiment analysis: 'lexicon' not found; 'sentiments' corrupted?

I am trying to follow this online tutorial on sentiment analysis. The code:
new_sentiments <- sentiments %>% # from the tidytext package
  filter(lexicon != "loughran") %>% # remove the finance lexicon
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                            ifelse(lexicon == "AFINN" & score < 0,
                                   "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
Generates the error:
Error in filter_impl(.data, quo) :
  Evaluation error: object 'lexicon' not found.
Perhaps related: the "sentiments" table also appears to me to be acting strangely (corrupted?). Here is a head of 'sentiments':
> head(sentiments, 3)
  element_id sentence_id word_count sentiment                                 chapter
1          1           1          7         0 The First Book of Moses: Called Genesis
2          2           1         NA         0 The First Book of Moses: Called Genesis
3          3           1         NA         0 The First Book of Moses: Called Genesis
                                 category
1 The First Book of Moses: Called Genesis
2 The First Book of Moses: Called Genesis
3 The First Book of Moses: Called Genesis
If I use get_sentiments() for bing, AFINN, or NRC, though, I get what looks like an appropriate response:
> get_sentiments("bing")
# A tibble: 6,788 x 2
  word     sentiment
  <chr>    <chr>
1 2-faced  negative
2 2-faces  negative
3 a+       positive
4 abnormal negative
I tried removing (remove.packages()) and reinstalling tidytext; no change in behavior. I am running R 3.5.
Even if I am completely misunderstanding the problem, I would appreciate any insights anyone can give me.

The following instructions will fix the new_sentiments dataset as shown in the DataCamp tutorial.
bing <- get_sentiments("bing") %>%
  mutate(lexicon = "bing",
         words_in_lexicon = n_distinct(word))

nrc <- get_sentiments("nrc") %>%
  mutate(lexicon = "nrc",
         words_in_lexicon = n_distinct(word))

afinn <- get_sentiments("afinn") %>%
  mutate(lexicon = "afinn",
         words_in_lexicon = n_distinct(word))

new_sentiments <- bind_rows(bing, nrc, afinn)
names(new_sentiments)[names(new_sentiments) == 'value'] <- 'score'

new_sentiments %>%
  group_by(lexicon, sentiment, words_in_lexicon) %>%
  summarise(distinct_words = n_distinct(word)) %>%
  ungroup() %>%
  spread(sentiment, distinct_words) %>%
  mutate(lexicon = color_tile("lightblue", "lightblue")(lexicon),
         words_in_lexicon = color_bar("lightpink")(words_in_lexicon)) %>%
  my_kable_styling(caption = "Word Counts per Lexicon")
The subsequent graphs will work too!
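Note that color_tile() and color_bar() come from the formattable package, and my_kable_styling() is a helper defined earlier in the tutorial, not an exported function. If you skipped that part, a minimal stand-in might look like this (a sketch, assuming knitr and kableExtra are installed; the exact styling options are up to you):

library(formattable) # color_tile(), color_bar()
library(kableExtra)  # kable_styling()

# hypothetical stand-in for the tutorial's my_kable_styling() helper
my_kable_styling <- function(dat, caption) {
  knitr::kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}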

It appears tidytext had to be changed, which broke some of the code in the tutorial.
To make the code run, replace
new_sentiments <- sentiments %>% # from the tidytext package
  filter(lexicon != "loughran") %>% # remove the finance lexicon
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                            ifelse(lexicon == "AFINN" & score < 0,
                                   "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
with
new_sentiments <- get_sentiments("afinn")
names(new_sentiments)[names(new_sentiments) == 'value'] <- 'score'
new_sentiments <- new_sentiments %>%
  mutate(lexicon = "afinn",
         sentiment = ifelse(score >= 0, "positive", "negative"),
         words_in_lexicon = n_distinct(word))
The next few graphs won't make as much sense (since we now only use one lexicon), but the rest of the tutorial will work.
UPDATE: here's an excellent explanation from the tidytext package author as to what happened.
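In short, recent tidytext versions ship only the bing lexicon in sentiments (NRC and AFINN moved to the textdata package for licensing reasons), and the AFINN score column is now called value. A quick way to confirm what your installed version provides (a sketch, assuming textdata is installed):

library(tidytext)
library(textdata) # needed for get_sentiments("afinn") and get_sentiments("nrc")

head(sentiments)              # bing only: columns word, sentiment
head(get_sentiments("afinn")) # columns word, value (formerly word, score)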

I ran into a similar problem and tried the code below.
I hope it helps.
library(tm)
library(tidyr)
library(ggthemes)
library(ggplot2)
library(dplyr)
library(tidytext)
library(textdata)

# Choose the lexicons
get_sentiments("bing")
get_sentiments("afinn")
get_sentiments("nrc")

# Define new objects
afinn <- get_sentiments("afinn")
bing  <- get_sentiments("bing")
nrc   <- get_sentiments("nrc")

# Check
head(afinn)
head(bing)
head(nrc)
head(sentiments) # from the tidytext package

# Merge the data frames
merge_sentiments <- rbind(sentiments, get_sentiments('bing'), get_sentiments('nrc'))
head(merge_sentiments) # check
merge2_sentiments <- merge(merge_sentiments, afinn, by = 1, all = TRUE)
head(merge2_sentiments) # check

# Make a new data frame with a lexicon column added
new_sentiments <- merge2_sentiments
new_sentiments <- new_sentiments %>%
  mutate(lexicon = ifelse(sentiment == 'positive', 'bing',
                   ifelse(sentiment == 'negative', 'bing',
                   ifelse(is.na(sentiment), 'afinn', 'nrc')))) # rows without a bing/nrc label came from AFINN
colnames(new_sentiments)[colnames(new_sentiments) == 'value'] <- 'score'

# Check
head(new_sentiments)
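The nested ifelse() above works only because bing happens to use those labels and the AFINN-only rows come through with a missing sentiment; a less fragile way to get the same lexicon column (a sketch along the lines of the earlier answer, assuming dplyr and textdata are available) is to tag each lexicon before combining:

library(dplyr)
library(tidytext)
library(textdata)

# tag each lexicon explicitly, then stack them; AFINN's numeric 'value'
# is renamed to 'score' to match the tutorial's column name
new_sentiments <- bind_rows(
  get_sentiments("bing")  %>% mutate(lexicon = "bing"),
  get_sentiments("nrc")   %>% mutate(lexicon = "nrc"),
  get_sentiments("afinn") %>% rename(score = value) %>% mutate(lexicon = "afinn")
)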


Errors in counting + combining bing sentiment score variables in Tidytext?

I'm doing sentiment analysis on a large corpus of text. I'm using the bing lexicon in tidytext to get simple binary pos/neg classifications, but want to calculate the ratios of positive to total (positive & negative) words within a document. I'm rusty with dplyr workflows, but I want to count the number of words coded as "positive" and divide it by the total count of words classified with a sentiment.
I tried this approach, using sample code and stand-in data...
library(tidyverse)
library(tidytext)

# Creating a fake tidytext corpus
df_tidytext <- data.frame(
  doc_id = c("Iraq_Report_2001", "Iraq_Report_2002"),
  text = c("xxxx", "xxxx") # Placeholder for text
)

# Creating a fake set of scored words with bing sentiments
# for each doc in the corpus
df_sentiment_bing <- data.frame(
  doc_id = c(rep("Iraq_Report_2001", each = 3),
             rep("Iraq_Report_2002", each = 3)),
  word = c("improve", "democratic", "violence",
           "sectarian", "conflict", "insurgency"),
  bing_sentiment = c("positive", "positive", "negative",
                     "negative", "negative", "negative") # Stand-ins for sentiment classification
)

# Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_sentiment_scored <- df_tidytext %>%
  left_join(df_sentiment_bing) %>%
  group_by(doc_id) %>%
  count(bing_sentiment) %>%
  pivot_wider(names_from = bing_sentiment, values_from = n) %>%
  summarise(bing_score = count(positive)/(count(negative) + count(positive)))
But I get the following error:
"Error: Problem with `summarise()` input `bing_score`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
ℹ Input `bing_score` is `count(positive)/(count(negative) + count(positive))`.
ℹ The error occurred in group 1: doc_id = "Iraq_Report_2001".
Would love some insight into what I'm doing wrong with my summarizing workflow here.
I don't see the point of calling count() there if the columns are already numeric. By the way, that is also why you are getting the error: count() works on data frames, not on numeric vectors.
One solution could be:
# Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_tidytext %>%
  left_join(df_sentiment_bing) %>%
  group_by(doc_id) %>%
  dplyr::count(bing_sentiment) %>%
  pivot_wider(names_from = bing_sentiment, values_from = n) %>%
  replace(is.na(.), 0) %>%
  summarise(bing_score = sum(positive)/(sum(negative) + sum(positive)))
The result you should get is:
Joining, by = "doc_id"
# A tibble: 2 × 2
  doc_id           bing_score
  <fct>                 <dbl>
1 Iraq_Report_2001      0.667
2 Iraq_Report_2002      0
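Two smaller notes on that solution: values_fill = 0 in pivot_wider() can replace the separate replace(is.na(.), 0) step, and because each document ends up as a single row, sum() effectively just reads one value. A slightly more compact version of the same idea (a sketch, using the same stand-in data):

df_tidytext %>%
  left_join(df_sentiment_bing, by = "doc_id") %>%
  count(doc_id, bing_sentiment) %>%
  pivot_wider(names_from = bing_sentiment, values_from = n, values_fill = 0) %>%
  mutate(bing_score = positive / (positive + negative))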

step_mutate with textrecipes tokenlists

I'm doing NLP with the tidymodels framework, taking advantage of the textrecipes package, which has recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. Now, I want to perform spell checking on the new tokenized variable with a custom function for correct spelling, using functions from the hunspell package, but I get the following error (link to the spell check blog post):
Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE
Apparently, tokenlists don't parse easily to character vectors. I've noticed the existence of step_untokenize, but it simply dissolves the tokenlist by pasting and collapsing, and that's not what I need.
REPREX
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)

product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_mutate(desc = correct_spelling(desc))

product_recipe %>% prep()
WHAT I WANT, BUT WITHOUT RECIPES
product_descriptions %>%
  unnest_tokens(word, desc) %>%
  mutate(word = correct_spelling(word))
There isn't a canonical way to do this using {textrecipes} yet. We need two things: a function that takes a vector of tokens and returns spell-checked tokens (you provided that), and a way to apply that function to each element of the tokenlist. For now there isn't a general step that lets you do that, but you can cheat by passing the function as custom_stemmer in step_stem(), giving you the results you want:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
library(textrecipes)
library(hunspell)

product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_stem(desc, custom_stemmer = correct_spelling) %>%
  step_tf(desc)

product_recipe %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 3 × 6
#>   price tf_desc_cad tf_desc_good tf_desc_not tf_desc_product tf_desc_sou
#>   <dbl>       <dbl>        <dbl>       <dbl>           <dbl>       <dbl>
#> 1  1000           0            1           0               1           0
#> 2   700           0            1           1               0           1
#> 3   250           1            0           0               1           0
Not nearly as short, but this should work:
library(tidyverse)
library(hunspell)

product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

my_stopwords <- c("sou")

product_descriptions %>%
  # create a row identifier
  mutate(id = row_number()) %>%
  # separate all `desc` into separate words (by space) into separate rows
  separate_rows(desc, sep = " ") %>%
  # helper for naming later on
  mutate(word_id = "word") %>%
  # word identifier
  group_by(id) %>%
  mutate(word = row_number()) %>%
  ungroup() %>%
  # exclude stopwords as defined above
  filter(!desc %in% my_stopwords) %>%
  # add spellchecker
  mutate(desc = correct_spelling(desc)) %>%
  # make tibble wide again
  pivot_wider(names_from = c(word_id, word), values_from = desc) %>%
  # unite all strings that were put into separate columns
  unite(desc, starts_with("word_"), remove = FALSE, sep = " ", na.rm = TRUE) %>%
  # omit all helper columns
  select(-c(id, starts_with("word_"))) %>%
  # clean up column ordering
  relocate(desc, price)
In this case "sou" is deleted as a stopword and "produkt" gets corrected to "product". The spellcheck function changes "vad" to "cad" instead of "bad", though.
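If you want to see why "vad" ends up as "cad" rather than "bad", you can look at the ranked suggestions directly; correct_spelling() always keeps the first one. A quick check (a sketch):

library(hunspell)

# ranked suggestion lists for each input word; correct_spelling() takes element 1
hunspell_suggest(c("vad", "produkt", "goood"), dict = dictionary("en_US"))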

R tidytext sentiment analysis- how to use the drop parameter

I recently asked a question about entries that are omitted after a sentiment analysis. The tweets that I analyse don't always contain words that are in the lexicon. I would like to know which ones can't be translated. So I would like to keep these even if zero words were scored. In my previous question, the drop parameter was given as a solution. However, I think I might be doing it wrong or missing something. This is my first time working with these techniques.
The following function takes a data frame and returns a new one containing the number of positive and negative words along with the sentiment score.
The input (with one text in Dutch on purpose, so it can't be scored):
id <- c(1, 2, 3)
date <- c("12-05-2021", "12-06-2021", "12-07-2021")
text <- c("Dit is tekst in het Nederlands", "I,m so happy that websites like this exsist", "This icecream tastes terrible. It made me upset")
df <- data.frame(id, date, text)
What I want as output is:
sentiment positive negative
        0        0        0
        2        2        0
       -2        0        2
But my function gives me something else:
sentimentAnalysis <- function(tweetData){
  sentimentDataframe <- data.frame()
  for(row in 1:nrow(tweetData)){
    tekst <- as.character(tweetData[row, "text"])
    positive <- 0
    negative <- 0
    tokens <- tibble(text = tekst) %>% unnest_tokens(word, text, drop = FALSE)
    sentiment <- tokens %>%
      inner_join(get_sentiments("bing")) %>%
      count(sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(sentiment = positive - negative)
    sentimentDataframe <- bind_rows(sentimentDataframe, sentiment)
  }
  sentimentDataframe[is.na(sentimentDataframe)] <- 0
  return(sentimentDataframe)
}
This still returns a data frame with the unscored texts missing. As you can see, the first text is omitted:
sentiment positive negative
        2        2        0
       -2        0        2
If no rows are returned after the join, you can return a tibble with all 0 values; an if condition can check for this.
When a sentence has only positive or only negative words, complete() creates a row for the missing sentiment and assigns it the value 0. I also replaced spread() with pivot_wider(), since spread() is now superseded.
library(tidyverse)
library(tidytext)

map_df(df$text, ~{
  tibble(text = .x) %>%
    unnest_tokens(word, text, drop = FALSE) %>%
    inner_join(get_sentiments("bing")) -> tmp

  if(nrow(tmp) == 0) tibble(sentiment = 0, positive = 0, negative = 0)
  else {
    tmp %>%
      count(sentiment) %>%
      complete(sentiment = c('positive', 'negative'), fill = list(n = 0)) %>%
      pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
      mutate(sentiment = positive - negative)
  }
}) -> res

res
#   sentiment positive negative
#       <dbl>    <dbl>    <dbl>
# 1         0        0        0
# 2         2        2        0
# 3        -2        0        2
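As an aside on the original question: drop = FALSE in unnest_tokens() only keeps the input text column next to the tokens; it does not stop inner_join() from discarding rows whose words are not in the lexicon. If you prefer keeping unmatched rows over special-casing the empty result, a left_join() variant could look like this (a sketch on the df defined above; unmatched words get NA sentiment and simply count as zero):

library(dplyr)
library(tidytext)

df %>%
  unnest_tokens(word, text) %>%
  left_join(get_sentiments("bing"), by = "word") %>%
  group_by(id) %>%
  summarise(positive  = sum(sentiment == "positive", na.rm = TRUE),
            negative  = sum(sentiment == "negative", na.rm = TRUE),
            sentiment = positive - negative)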

Deal with phrasal verb in text mining

Phrasal verbs are really important in day-to-day English usage. Is there any library in R that allows us to deal with them?
I have tried two ways, but neither seems able to handle them.
For example:
library(sentimentr)
library(tidytext)
library(tidyverse)

x <- 'i vomit when i see her'
y <- 'i throw up when i see her'

# sentimentr
sentiment(x) # gives a sentiment of -0.4
sentiment(y) # gives a sentiment of 0
# Similarly, using tidytext
y %>% as_tibble() %>%
  unnest_tokens(word, value) %>%
  left_join(get_sentiments('bing')) # no word gets a sentiment
I came up with a (clumsy) strategy to deal with phrasal verbs:
# create a dummy phrasal-verb sentiment score
phrasel_verb <- data.frame(bigram = c("throw up"),
                           bigram_score = -1)

# use tidytext to make bigrams --> join
y %>% as_tibble() %>%
  unnest_tokens(bigram, value, 'ngrams', n = 2) %>%
  separate(bigram, c('word', 'word2'), remove = F) %>%
  left_join(phrasel_verb) %>%
  left_join(get_sentiments('bing')) %>%
  mutate(sentiment_all = coalesce(bigram_score, as.numeric(sentiment))) %>%
  summarise(sentiment_sum = sum(na.fill(sentiment_all, 0))) # na.fill() comes from the zoo package
The result is -1, which suggests a negative sentiment.
Any ideas to improve it? Is there any data set that has sentiment scores for phrasal verbs?

Can I combine pairwise_cor and pairwise_count to get the phi coefficient AND number of occurrences for each pair of words?

I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.
Code as follows:
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)
I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:
word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)
What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account for matching, so I couldn't use a regular join reliably since there may be multiple pairs that have the same word in the item1 column.
Is this possible using widyr? If not, is there another package that will allow me to do this?
Here is the full code:
# Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)

# Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")

# Split into one word per row
review_words <- prod_reviews_df %>%
  unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
  anti_join(stop_words, by = c("word" = "word"))

# Find phi coefficient
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

# Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")
I want to add in pairwise_count, but I need it alongside the phi coefficient.
Thank you!
If you are getting into using tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr to do the joins you are interested in. You can use left_join to connect the calculations from pairwise_cor() and pairwise_count(), and you can just pipe from one to the other, if you like.
library(dplyr)
library(tidytext)
library(janeaustenr)
library(widyr)

austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE) %>%
  left_join(austen_section_words %>%
              pairwise_count(word, section, sort = TRUE),
            by = c("item1", "item2"))

#> # A tibble: 154,842 x 4
#>    item1     item2     correlation     n
#>    <chr>     <chr>           <dbl> <dbl>
#>  1 bourgh    de          0.9508501    29
#>  2 de        bourgh      0.9508501    29
#>  3 pounds    thousand    0.7005808    17
#>  4 thousand  pounds      0.7005808    17
#>  5 william   sir         0.6644719    31
#>  6 sir       william     0.6644719    31
#>  7 catherine lady        0.6633048    82
#>  8 lady      catherine   0.6633048    82
#>  9 forster   colonel     0.6220950    27
#> 10 colonel   forster     0.6220950    27
#> # ... with 154,832 more rows
I discovered and used merge today, and it appears to have used both relevant columns to merge the data. I'm not sure how to check for accuracy, but I think it worked.
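For reference, merge() joins on all columns the two data frames share by default, so with word_cors and word_pairs both having item1 and item2 it behaves like the two-column left_join above. A sketch of that variant (using the word_cors and word_pairs objects defined earlier):

word_cors_counts <- merge(word_cors, word_pairs, by = c("item1", "item2"))
head(word_cors_counts)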
