Dealing with phrasal verbs in text mining - R

Phrasal verbs are really important in day-to-day English usage. Is there any library in R that allows us to deal with them?
I have tried two approaches, but neither seems able to handle phrasal verbs.
For example:
library(sentimentr)
library(tidytext)
library(tidyverse)

x <- 'i vomit when i see her'
y <- 'i throw up when i see her'

# sentimentr
sentiment(x) # gives a sentiment of -0.4
sentiment(y) # gives a sentiment of 0

# similarly, using tidytext
y %>%
  as_tibble() %>%
  unnest_tokens(word, value) %>%
  left_join(get_sentiments('bing')) # no word is assigned a sentiment, so the score is 0
I came up with a (clumsy) strategy to deal with phrasal verbs:
library(zoo) # for na.fill()

# create a dummy phrasal-verb sentiment score
phrasal_verb <- data.frame(bigram = c("throw up"),
                           bigram_score = -1)

# use tidytext to make bigrams, then join
y %>%
  as_tibble() %>%
  unnest_tokens(bigram, value, 'ngrams', n = 2) %>%
  separate(bigram, c('word', 'word2'), remove = FALSE) %>%
  left_join(phrasal_verb) %>%
  left_join(get_sentiments('bing')) %>%
  mutate(sentiment_all = coalesce(bigram_score, as.numeric(sentiment))) %>%
  summarise(sentiment_sum = sum(na.fill(sentiment_all, 0)))
The result is -1, which suggests a negative sentiment.
Any ideas on how to improve this? Is there any dataset that provides sentiment scores for phrasal verbs?
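One direction that might be worth trying, sketched below: normalize known phrasal verbs to one-word synonyms before scoring, so an ordinary unigram lexicon can handle them. The lookup table here is invented for illustration; you would have to curate your own list, and the scores you get depend on whatever the sentimentr lexicon contains for the substituted words.
library(sentimentr)
library(stringr)

# hypothetical lookup of phrasal verbs and rough one-word equivalents
phrasal_lookup <- c("throw up" = "vomit",
                    "give up"  = "quit")

y <- 'i throw up when i see her'

# replace phrasal verbs with their one-word equivalents, then score as usual
y_normalized <- str_replace_all(y, phrasal_lookup)
sentiment(y_normalized) # should now score like 'i vomit when i see her'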

Related

step_mutate with textrecipes tokenlists

I'm doing NLP with the tidymodels framework, taking advantage of the textrecipes package, which has recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. Now, I want to perform spell checking on the new tokenized variable with a custom spelling-correction function, using functions from the hunspell package, but I get the following error (link to the spell-check blog post):
Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE
Apparently, tokenlists don't parse easily to character vectors. I've noticed the existence of step_untokenize, but it simply dissolves the tokenlist by pasting and collapsing, and that's not what I need.
REPREX
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)
product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_mutate(desc = correct_spelling(desc))
product_recipe %>% prep()
WHAT I WANT, BUT WITHOUT RECIPES
product_descriptions %>%
  unnest_tokens(word, desc) %>%
  mutate(word = correct_spelling(word))
There isn't a canonical way to do this with {textrecipes} yet. We need two things: a function that takes a vector of tokens and returns spell-checked tokens (which you provided), and a way to apply that function to each element of the tokenlist. For now there isn't a general step that lets you do that, but you can work around it by passing the function as the custom_stemmer argument of step_stem(), which gives you the results you want:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(textrecipes)
library(hunspell)
product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_stem(desc, custom_stemmer = correct_spelling) %>%
  step_tf(desc)

product_recipe %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 3 × 6
#>   price tf_desc_cad tf_desc_good tf_desc_not tf_desc_product tf_desc_sou
#>   <dbl>       <dbl>        <dbl>       <dbl>           <dbl>       <dbl>
#> 1  1000           0            1           0               1           0
#> 2   700           0            1           1               0           1
#> 3   250           1            0           0               1           0
Not nearly as short, but this should work:
library(tidyverse)
library(hunspell)
product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

my_stopwords <- c("sou")

product_descriptions %>%
  # create a row identifier
  mutate(id = row_number()) %>%
  # separate all `desc` into separate words (by space) into separate rows
  separate_rows(desc, sep = " ") %>%
  # helper for naming later on
  mutate(word_id = "word") %>%
  # word identifier
  group_by(id) %>%
  mutate(word = row_number()) %>%
  ungroup() %>%
  # exclude stopwords as defined above
  filter(!desc %in% my_stopwords) %>%
  # add spellchecker
  mutate(desc = correct_spelling(desc)) %>%
  # make tibble wide again
  pivot_wider(names_from = c(word_id, word), values_from = desc) %>%
  # unite all strings that were put into separate columns
  unite(desc, starts_with("word_"), remove = FALSE, sep = " ", na.rm = TRUE) %>%
  # omit all helper columns
  select(-c(id, starts_with("word_"))) %>%
  # clean up column ordering
  relocate(desc, price)
In this case "sou" is deleted as a stopword and "produkt" gets corrected to "product". The spellcheck function changes "vad" to "cad" rather than "bad", though.
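To see why "cad" wins over "bad", you can inspect the raw suggestion list; correct_spelling() simply keeps the first element, so whatever ordering hunspell returns decides the outcome (a quick check, nothing more):
library(hunspell)

# the first suggestion is the one correct_spelling() keeps
hunspell_suggest("vad", dict = dictionary("en_US"))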

R - Count exact matches in string from list of words, then calculate overall sentiment using score per word

I have a dataset containing a column of strings from which I wish to calculate an overall sentiment score, and a data frame containing all the unique words that appear in the strings, each of which is assigned a score:
library(stringr)
df <- data.frame(text = c('recommend good value no problem',
                          'terrible quality no good',
                          'good service excellent quality commend'),
                 score = 0)
words <- c('recommend', 'good', 'value', 'problem', 'terrible', 'quality', 'service', 'excellent', 'commend')
scores <- c(1, 2, 1, -2, -3, 1, 0, 3, 1)
wordsdf <- data.frame(words, scores)
The only way I have been able to get close to this is by using a nested for loop and the str_count function from the stringr package:
for (i in 1:3) {
  count <- 0
  for (j in 1:9) {
    count <- count + (str_count(df$text[i], as.character(wordsdf$words[j])) * wordsdf$scores[j])
  }
  df$score[i] <- count
}
This almost achieves what I want:
  text                                    score
1 recommend good value no problem             3
2 terrible quality no good                    0
3 good service excellent quality commend      7
However, since the word 'commend' is also contained in the word 'recommend', my code calculates the scores as if both words appear in the string.
So I have two queries:
1 - Is there a way to get it to match only exact words?
2 - Is there a way to achieve this without using the nested loop?
One tidyverse possibility could be:
df %>%
  rowid_to_column() %>%
  mutate(text = strsplit(text, " ", fixed = TRUE)) %>%
  unnest() %>%
  full_join(wordsdf, by = c("text" = "words")) %>%
  group_by(rowid) %>%
  summarise(text = paste(text, collapse = " "),
            scores = sum(scores, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-rowid)
  text                                    scores
  <chr>                                    <dbl>
1 recommend good value no problem              2
2 terrible quality no good                     0
3 good service excellent quality commend       7
First, it splits the "text" column into separate words. Second, it performs a full join on these words. Finally, it pastes the words of each "text" back together and sums the scores.
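If you only want to fix the exact-match problem (query 1) while keeping str_count, you can also anchor each lexicon word with regex word boundaries, so "commend" no longer matches inside "recommend". A sketch, still using the loop from the question:
library(stringr)

for (i in 1:3) {
  count <- 0
  for (j in 1:9) {
    # \\b ... \\b ensures only whole-word matches are counted
    pattern <- paste0("\\b", as.character(wordsdf$words[j]), "\\b")
    count <- count + str_count(df$text[i], pattern) * wordsdf$scores[j]
  }
  df$score[i] <- count
}
df
#   text                                    score
# 1 recommend good value no problem             2
# 2 terrible quality no good                    0
# 3 good service excellent quality commend      7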

R sentiment analysis; 'lexicon' not found; 'sentiments' corrupted?

I am trying to follow this on-line tutorial on sentiment analysis. The code:
new_sentiments <- sentiments %>% # from the tidytext package
  filter(lexicon != "loughran") %>% # remove the finance lexicon
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                     ifelse(lexicon == "AFINN" & score < 0,
                            "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
Generates the error:
Error in filter_impl(.data, quo) :
  Evaluation error: object 'lexicon' not found.
Related, perhaps, is that the "sentiments" table appears to be acting strangely (corrupted?). Here is the head of 'sentiments':
> head(sentiments, 3)
  element_id sentence_id word_count sentiment                                 chapter                                category
1          1           1          7         0 The First Book of Moses: Called Genesis The First Book of Moses: Called Genesis
2          2           1         NA         0 The First Book of Moses: Called Genesis The First Book of Moses: Called Genesis
3          3           1         NA         0 The First Book of Moses: Called Genesis The First Book of Moses: Called Genesis
If I use get_sentiments() for bing, AFINN, or NRC, though, I get what looks like an appropriate response:
> get_sentiments("bing")
# A tibble: 6,788 x 2
  word     sentiment
  <chr>    <chr>
1 2-faced  negative
2 2-faces  negative
3 a+       positive
4 abnormal negative
I tried removing (remove.packages) and re-installing tidytext; there was no change in behavior. I am running R 3.5.
Even if I am completely misunderstanding the problem, I would appreciate any insights anyone can give me.
The following instructions will fix the new_sentiments dataset as shown in the Data Camp tutorial.
bing <- get_sentiments("bing") %>%
  mutate(lexicon = "bing",
         words_in_lexicon = n_distinct(word))

nrc <- get_sentiments("nrc") %>%
  mutate(lexicon = "nrc",
         words_in_lexicon = n_distinct(word))

afinn <- get_sentiments("afinn") %>%
  mutate(lexicon = "afinn",
         words_in_lexicon = n_distinct(word))

new_sentiments <- bind_rows(bing, nrc, afinn)
names(new_sentiments)[names(new_sentiments) == 'value'] <- 'score'

# color_tile() and color_bar() come from the formattable package;
# my_kable_styling() is the helper function defined earlier in the tutorial
new_sentiments %>%
  group_by(lexicon, sentiment, words_in_lexicon) %>%
  summarise(distinct_words = n_distinct(word)) %>%
  ungroup() %>%
  spread(sentiment, distinct_words) %>%
  mutate(lexicon = color_tile("lightblue", "lightblue")(lexicon),
         words_in_lexicon = color_bar("lightpink")(words_in_lexicon)) %>%
  my_kable_styling(caption = "Word Counts per Lexicon")
The subsequent graphs will work too!
It appears tidytext had to be changed, which broke some of the code in the tutorial.
To make the code run, replace
new_sentiments <- sentiments %>% # from the tidytext package
  filter(lexicon != "loughran") %>% # remove the finance lexicon
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                     ifelse(lexicon == "AFINN" & score < 0,
                            "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
with
new_sentiments <- get_sentiments("afinn")
names(new_sentiments)[names(new_sentiments) == 'value'] <- 'score'
new_sentiments <- new_sentiments %>%
  mutate(lexicon = "afinn",
         sentiment = ifelse(score >= 0, "positive", "negative"),
         words_in_lexicon = n_distinct(word))
The next few graphs won't make as much sense (since we now only use one lexicon), but the rest of the tutorial will work.
UPDATE: here's an excellent explanation from the tidytext package author of what happened.
I ran into a similar problem and tried the code below; I hope it helps.
library(tm)
library(tidyr)
library(ggthemes)
library(ggplot2)
library(dplyr)
library(tidytext)
library(textdata)
# look at the available lexicons
get_sentiments("bing")
get_sentiments("afinn")
get_sentiments("nrc")

# define new objects
afinn <- get_sentiments("afinn")
bing  <- get_sentiments("bing")
nrc   <- get_sentiments("nrc")

# check
head(afinn)
head(bing)
head(nrc)
head(sentiments) # from the tidytext package

# merge the data frames
merge_sentiments <- rbind(sentiments, get_sentiments('bing'), get_sentiments('nrc'))
head(merge_sentiments) # check

merge2_sentiments <- merge(merge_sentiments, afinn, by = 1, all = TRUE)
head(merge2_sentiments) # check

# make a new data frame with a lexicon column added
new_sentiments <- merge2_sentiments
new_sentiments <- new_sentiments %>%
  mutate(lexicon = ifelse(sentiment == 'positive', 'bing',
                   ifelse(sentiment == 'negative', 'bing',
                   ifelse(sentiment == 'NA', 'afinn', 'nrc'))))
colnames(new_sentiments)[colnames(new_sentiments) == 'value'] <- 'score'

# check
head(new_sentiments)

Chronological Sentiment Analysis -- Cannot group by lines

My text data is a novel in plain text. I used the packages tm and tidytext. Data processing went well, and I created my DocumentTermMatrix without trouble.
text <- read_lines("GoneWithTheWind2.txt")
set.seed(314)
text <- iconv(text, 'UTF-8', sub = "")
# (corpus creation is not shown in the question; something like
#  myCorpus <- VCorpus(VectorSource(text)) is implied before this step)
myCorpus <- tm_map(myCorpus, removeWords, c(stopwords("english"),
                                            stopwords("SMART"), mystopwords, Top200Words))
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
However, I could not get the inner_join between the bing lexicon and the term matrix to run for a chronological sentiment analysis of this novel over time. I wrote the code below based on an online example, but I did not know what to group by in count(sentiment) (I put ???? as a placeholder), because neither the plain text nor the term matrix has a "lines" column.
bing <- get_sentiments("bing")

m <- as.matrix(myDtm)
v <- sort(rowSums(m), decreasing = TRUE)
myNames <- names(v)
d <- data.frame(term = myNames, freq = v)

wind_polarity <- d %>%
  # Inner join to the lexicon
  inner_join(bing, by = c("term" = "word")) %>%
  # Count by sentiment, ????
  count(sentiment, ????) %>%
  # Spread sentiments
  spread(sentiment, n, fill = 0) %>%
  mutate(
    # Add polarity field
    polarity = positive - negative,
    # Add line number field
    line_number = row_number())
Then I plot with ggplot.
I tried adding an "Index" column indicating the line number for each document (line) in the text, but this column disappears somewhere in the process. Any suggestions would be highly appreciated.
Below is an approach that calculates the polarity per line (based on a minimal example of three lines). You might join your dtm with the lexicon directly to maintain the information on the counts. Then turn the polarity information into a numeric representation and do your calculations per line. You can certainly rewrite the code and make it more elegant (I am not very familiar with dplyr vocabulary, sorry). I hope that helps anyway.
library(tm)
library(tidytext)
library(dplyr) # for the pipe, inner_join() and mutate()

text <- c("I like coffe.",
          "I rather like tea.",
          "I hate coffee and tea, but I love orange juice.")

myDtm <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE))

bing <- tidytext::get_sentiments("bing")

wind_polarity <- as.matrix(myDtm) %>%
  data.frame(terms = rownames(myDtm), ., stringsAsFactors = FALSE) %>%
  inner_join(bing, by = c("terms" = "word")) %>%
  mutate(terms = NULL,
         polarity = ifelse((.[, "sentiment"] == "positive"), 1, -1),
         sentiment = NULL) %>%
  { . * .$polarity } %>%
  mutate(polarity = NULL) %>%
  colSums
# the polarity per line, which you may plot, e.g., with base graphics or ggplot
# X1 X2 X3
#  1  1  0
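To get from the colSums() result to the chronological plot mentioned in the question, a minimal ggplot2 sketch (assuming wind_polarity is the named vector produced above) could be:
library(ggplot2)

# turn the named polarity vector into a data frame with a line index
polarity_df <- data.frame(line = seq_along(wind_polarity),
                          polarity = as.numeric(wind_polarity))

ggplot(polarity_df, aes(x = line, y = polarity)) +
  geom_col() +
  labs(x = "Line", y = "Polarity")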

Can I combine pairwise_cor and pairwise_count to get the phi coefficient AND number of occurrences for each pair of words?

I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.
Code as follows:
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)
I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:
word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)
What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account for matching, so I couldn't use a regular join reliably, since there may be multiple pairs that share the same word in the item1 column.
Is this possible using widyr? If not, is there another package that will allow me to do this?
Here is the full code:
# Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)

# Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")

# Split into one word per row
review_words <- prod_reviews_df %>%
  unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
  anti_join(stop_words, by = c("word" = "word"))

# Find phi coefficient
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

# Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")
I want to add in pairwise_count, but I need it alongside the phi coefficient.
Thank you!
If you are getting into using tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr to do the joins you are interested in. You can use left_join to connect the calculations from pairwise_cor() and pairwise_count(), and you can just pipe from one to the other, if you like.
library(dplyr)
library(tidytext)
library(janeaustenr)
library(widyr)
austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE) %>%
  left_join(austen_section_words %>%
              pairwise_count(word, section, sort = TRUE),
            by = c("item1", "item2"))
#> # A tibble: 154,842 x 4
#>    item1     item2     correlation     n
#>    <chr>     <chr>           <dbl> <dbl>
#>  1 bourgh    de          0.9508501    29
#>  2 de        bourgh      0.9508501    29
#>  3 pounds    thousand    0.7005808    17
#>  4 thousand  pounds      0.7005808    17
#>  5 william   sir         0.6644719    31
#>  6 sir       william     0.6644719    31
#>  7 catherine lady        0.6633048    82
#>  8 lady      catherine   0.6633048    82
#>  9 forster   colonel     0.6220950    27
#> 10 colonel   forster     0.6220950    27
#> # ... with 154,832 more rows
I discovered and used merge today, and it appears to have used both relevant columns to merge the data. I'm not sure how to check for accuracy, but I think it worked.
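For what it's worth, base merge() does match on every column name you pass to by, so with both pair columns it should line up with the left_join() above. A quick sanity check (assuming word_cors and word_pairs from the code in the question) might look like:
# join on both pair columns so ("good", "service") only matches ("good", "service")
combined <- merge(word_cors, word_pairs,
                  by = c("item1", "item2"),
                  all.x = TRUE) # keep every correlated pair, like a left join

head(combined[order(-combined$correlation), ])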
