I have done some basic sentiment analysis in R and wanted to know if there is a way to have the sentiment of a sentence or row analyzed, and then have a column appended with the sentiment of that sentence. All analysis I have done up until now gives me an overview of the sentiment or pulls out specific words, but doesn't link back to the original row of data.
The input would be fed in through BI software and would look something like the example below, with a case number and some text:
"12345","I am extremely angry with my service"
"23456","I was happy with how everything turned out"
"34567","The rep did a great job helping me"
I would like the output to be returned as below:
"12345","I am extremely angry with my service","Anger"
"23456","I was happy with how everything turned out","Positive"
"34567","The rep did a great job helping me","Positive"
Any point in the right direction of a package or resource would be greatly appreciated!
The problem you run into with sentences is that sentiment lexicons are based on words. If you look at the nrc lexicon, the word "angry" has three sentiment values: anger, disgust and negative. Which one do you choose? Or a sentence may contain several words that appear in a lexicon. Try testing different lexicons with your text, for example with tidytext, to see what happens.
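For example, a quick lexicon lookup (a small sketch assuming tidytext and dplyr; the first call to get_sentiments("nrc") may ask you to download the lexicon via the textdata package):
library(tidytext)
library(dplyr)
# look up "angry" in the nrc lexicon; it comes back with the three labels
# anger, disgust and negative
get_sentiments("nrc") %>%
  filter(word == "angry")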
If you want a package that can analyse sentiment at the sentence level, you can look into sentimentr. You will not get sentiment values like anger back, but a sentiment/polarity score. More about sentimentr can be found in the package documentation and on the sentimentr GitHub page.
A small code example:
library(sentimentr)
text <- data.frame(id = c("12345","23456","34567"),
sentence = c("I am extremely angry with my service", "I was happy with how everything turned out", "The rep did a great job helping me"),
stringsAsFactors = FALSE)
sentiment(text$sentence)
element_id sentence_id word_count sentiment
1: 1 1 7 -0.5102520
2: 2 1 8 0.2651650
3: 3 1 8 0.3535534
# add sentiment score to data.frame
text$sentiment <- sentiment(text$sentence)$sentiment
text
id sentence sentiment
1 12345 I am extremely angry with my service -0.5102520
2 23456 I was happy with how everything turned out 0.2651650
3 34567 The rep did a great job helping me 0.3535534
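Note that sentiment() scores each sentence separately, so the column assignment above only lines up when every row holds a single sentence. If a row can contain several sentences, sentiment_by() (also in sentimentr) averages the scores per element; a small sketch:
# one averaged score per element, even when an element contains several sentences
text$sentiment <- sentiment_by(text$sentence)$ave_sentiment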
Related
For example, I have a line of text "i appreciate the help"
I want to remove the word "appreciate" from the sentimentr dictionary, so that it will not factor to any sentiment score moving forward.
You can create your own sentiment table, either from scratch or by adjusting the default one.
Example:
library(sentimentr)
txt <- "i appreciate the help"
sentiment(txt)
element_id sentence_id word_count sentiment
1: 1 1 4 0.25
Adjust the sentiment table. Since the sentiment tables are stored as data.tables, first load data.table.
library(data.table)
# remove word we do not want from default sentiment table coming from lexicon package
my_sent_table <- lexicon::hash_sentiment_jockers_rinker[x != "appreciate"]
sentiment(txt, polarity_dt = my_sent_table)
element_id sentence_id word_count sentiment
1: 1 1 4 0
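For the "from scratch" route, a rough sketch (this uses sentimentr::as_key(), which builds a valid polarity table from a two-column data frame of words and scores; the words and values below are made up for illustration):
my_own_table <- as_key(data.frame(words = c("thanks", "awful", "great"),
                                  polarity = c(0.8, -0.75, 0.9),
                                  stringsAsFactors = FALSE))
sentiment(txt, polarity_dt = my_own_table)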
I have a large tidy data set with columns containing text responses (e.g., in a grant application) and rows for the individual organizations applying for the grant. I'm trying to find the topics and phrases grouped with a specific word (e.g., "funder"/"funding"). More specifically, what adjectives and verbs are being grouped with these tokens?
So for example
text <- "This funding would help us create a new website and hire talented people."
So "funding" can be grouped with verbs like "create", "hire", and adjective phrases like "new website", "talented people".
I'm doing this in R. Does anyone have a package or program they'd recommend for this? I've found cleanNLP, but I'm not sure it is the most convenient package. Would I need to tokenize all the words? If so, wouldn't I have problems grouping phrases?
I'm fairly new to NLP/text mining, so I apologize for the introductory question.
Thank you!
This is a huge area to start exploring.
I would strongly recommend taking a look at the tidytextmining book and package, as well as the authors' personal blogs (https://juliasilge.com, http://varianceexplained.org); there is a huge amount of great work there to get you started, and it's really well written for people new to NLP.
Also really helpful for what you are looking for are the widyr and udpipe libraries.
Here's a couple of examples:
Using widyr we can look at the pairwise PMI (pointwise mutual information) between a word, say funding, and all other words that it has some relationship with. For more on PMI check out: https://stackoverflow.com/a/13492808/2862791
library(tidytext)
library(tidyverse)
library(widyr)   # provides pairwise_pmi()

texts <- tibble(text = c('This funding would help us create a new website and hire talented people',
                         'this random funding function talented people',
                         'hire hire hire new website funding',
                         'fun fun fun for all'))

# one row per word, keeping track of which sentence (id) it came from
tidy_texts <- texts %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, text)

tidy_texts %>%
  pairwise_pmi(word, id) %>%
  filter(item1 == 'funding') %>%
  top_n(5, wt = pmi) %>%
  arrange(desc(pmi))
item1 item2 pmi
<chr> <chr> <dbl>
1 funding this -0.0205
2 funding would -0.0205
3 funding help -0.0205
4 funding us -0.0205
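The top neighbours above are mostly stop words; one option (a sketch using tidytext's built-in stop_words table) is to drop those before computing the PMI:
tidy_texts %>%
  anti_join(stop_words, by = "word") %>%   # drop common stop words first
  pairwise_pmi(word, id) %>%
  filter(item1 == 'funding') %>%
  arrange(desc(pmi))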
So to introduce adjectives and phrases, you could look at udpipe, as boski suggested.
I'm going to reproduce the above to calculate the PMI too, as it's a really intuitive and quick-to-compute metric.
library(udpipe)

# download (once) and load the pre-trained English model
english <- udpipe_download_model(language = "english")
ud_english <- udpipe_load_model(english$file_model)

# tokenise and part-of-speech tag the example sentences
tagged <- udpipe_annotate(ud_english, x = texts$text)
tagged_df <- as.data.frame(tagged)

# keep only adjectives plus the word of interest, then compute PMI per document
tagged_df %>%
  filter(upos == 'ADJ' | token == 'funding') %>%
  pairwise_pmi(token, doc_id) %>%
  filter(item1 == 'funding')
item1 item2 pmi
<chr> <chr> <dbl>
1 funding new 0.170
2 funding talented 0.170
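Since the question also mentions verbs, the same idea works if you keep tokens tagged VERB as well (just a wider filter on the same toy data):
tagged_df %>%
  filter(upos %in% c('ADJ', 'VERB') | token == 'funding') %>%
  pairwise_pmi(token, doc_id) %>%
  filter(item1 == 'funding')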
You've mentioned cleanNLP, which is a great library for this kind of work. It makes it easy to access udpipe and spacyr and a few other methods which do the kind of tokenisation and tagging needed for that adjective finding.
If you can get past the setup details, spacyr is my preferred option just because it's the fastest, but if speed isn't an issue I would just go with udpipe as it's very easy to use.
Would I need to tokenize all the words? If so, wouldn't I have problems grouping phrases?
So udpipe, and other text annotators, have a solution for this.
In udpipe you can use keywords_collocation(), which identifies words that occur together more frequently than expected by random chance.
We would need a text dataset bigger than the few junk sentences I've written above to get a reproducible example.
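Still, the shape of the call looks roughly like this (a sketch only; tagged_df is the annotated data frame from above and the argument names follow the udpipe documentation):
# phrases of up to 3 words that co-occur within a sentence more often than chance
colloc <- keywords_collocation(tagged_df,
                               term = "token",
                               group = c("doc_id", "sentence_id"),
                               ngram_max = 3)
head(colloc)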
But you can find out a lot through this blog:
https://bnosac.github.io/udpipe/docs/doc7.html
Sorry this reply is kind of a collection of links ... but as I said it's a huge area of study.
I am trying to clean a bunch of tweets using gsub.
V3
1 Well: Getting Insurance to Pay for Midwives http://xxxxxxxxx
2 Lightning may be giving you a headache http://xxxxxxxx
3 New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? http://xxxxxxxx
4 VIDEO: Can we erase memories entirely? http://xxxxxxxx
5 Artificial sweeteners are a $1.5-billion-a-year market #kchangnyt reported last year. http://xxxxxxxx
I tried to use the following code to remove all the links (taken from a previous question at SO):
newdf1$V3 <- gsub("http\\w+", "", newdf1$V3)
However, there was no change in the tweets.
Further, when I use the code newdf1$V3 <- gsub("http.*", "", newdf1$V3), I am able to remove the links:
V3
1 Well: Getting Insurance to Pay for Midwives
2 Lightning may be giving you a headache
3 New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot?
4 VIDEO: Can we erase memories entirely?
5 Artificial sweeteners are a $1.5-billion-a-year market #kchangnyt reported last year.
Can someone explain why the code in the first case does not yield the desired results?
That's because \w only matches word characters (letters, digits and the underscore). Since "http" is always followed by "://", there is no word character directly after "http", so the pattern http\w+ finds nothing to match and the tweets are left unchanged.
In contrast, .* matches anything that follows "http" up to the end of the string, so the whole link (and anything after it) is removed.
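To see the difference side by side (a small sketch on one of the tweets above; the \S variant is just a common alternative, not taken from the linked answers):
x <- "Lightning may be giving you a headache http://xxxxxxxx"
gsub("http\\w+", "", x)   # no match: the ":" right after "http" is not a word character
gsub("http.*",   "", x)   # matches "http" and everything after it to the end of the string
gsub("http\\S+", "", x)   # matches up to the next whitespace, so only the link itself is removed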
I have the following text data:
I always prefer old-school guy. I have a PhD degree in science. I am
really not interested in finding someone with the same background,
otherwise life is gonna be boring.
And I am trying to extract the sentiment scores of the above text, but what I get is all NAs.
dating3 = annotateString(bio)
bio.emo = getSentiment(dating3)
id sentimentValue sentiment
1 1 NA NA
2 2 NA NA
3 3 NA NA
I do not know why this is occurring and googled around but did not find any relevant answers. Meanwhile, when I tried the sample data provided within the coreNLP package:
getSentiment(annoHp)
id sentimentValue sentiment
1 1 4 Verypositive
It gives me an answer, so I don't know why this is happening. I would greatly appreciate it if anyone could offer some insight.
Hopefully by now you have already found this, but for you and anyone else: this is a known bug which is fixed in the GitHub version, see here: https://github.com/statsmaths/coreNLP/issues/9
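For anyone hitting the same thing, installing the GitHub version looks roughly like this (a sketch assuming the devtools package is available):
# install.packages("devtools")   # if not already installed
devtools::install_github("statsmaths/coreNLP")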
I have many large text files with the following basic composition:
text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
As you can see, it is composed of: 1) random text, 2) a person's name in uppercase, 3) speech.
I've managed to separate all the words into a vector using:
textw<-unlist(strsplit(text," "))
I then find the positions of all the words which are uppercase:
grep(pattern = "^[[:upper:]]*$",x = textw)
And I have separated the names of the persons into a vector:
upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]
The desired outcome would be a data frame or table like this:
Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
message=c("this is a speech test.","hi all, thank you for coming.","thank you for inviting us"))
Result
person message
1 this is a speech text.
2 FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON thank you for inviting us
I'm having trouble "linking" each message to its author.
Also note: there are uppercase words which are NOT an author, for example "I". How could I specify a split only where two or more uppercase words appear next to each other?
In other words, if positions 2 and 3 are uppercase, then take as the message everything from position 4 until the next occurrence of consecutive uppercase words.
Any help appreciated.
Here's one approach using the stringi package:
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
library(stringi)

# split on whitespace that is followed by a run of 2+ capitals (a speaker name)
# but not preceded by one, so the space inside "FIRST PERSON" is not a split point
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))

data.frame(
  person  = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),  # the "NAME:" prefix, if present
  message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")    # the text with that prefix stripped
)
## person message
## 1 <NA> this is a speech text.
## 2 FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON thank you for inviting us
Basic Approach
1) To get the text, I will follow Tyler Rinker's approach of splitting the text on a sequence of one or more (+) upper-case letters ([[:upper:]]) that may also contain spaces and colons ([ [:upper:]:]): "[[:upper:]]+[ [:upper:]:]+"
2) To extract the persons speaking, nearly the same regex is used (just not allowing colons anymore): "[[:upper:]]+[ [:upper:]]+" (again, the basic idea is stolen from Tyler Rinker)
stringr
require(stringr)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame (
person = c( NA,
unlist(str_extract_all(text, "[[:upper:]]+[ [:upper:]]+"))
),
message = unlist(str_split(text, "[[:upper:]]+[ [:upper:]:]+"))
)
## person message
## 1 <NA> this is a speech text.
## 2 FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON thank you for inviting us
stringi
require(stringi)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame (
person = c( NA,
unlist(stri_extract_all(text, regex="[[:upper:]]+[ [:upper:]]+"))
),
message = unlist(stri_split(text, regex="[[:upper:]]+[ [:upper:]:]+"))
)
## person message
## 1 <NA> this is a speech text.
## 2 FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON thank you for inviting us
Hints (that reflect my preferences rather than rules)
1) I would prefer "[A-Z]+" over "[A-Z]{1,1000}" because in the first case one does not have to decide what might actually be a reasonable upper bound to put in.
2) I would prefer "[[:upper:]]" over "[A-Z]" because the former works like this ...
str_extract("Á", "[[:upper:]]")
## [1] "Á"
... while the latter works like this ...
str_extract("Á", "[A-Z]")
## [1] NA
... in the case of special characters.