I am trying to the sentiment of a dataset of Tweets using the AFINN dictionary (get_sentiments("afinn"). A sample of the dataset is provided below:
A tibble: 10 x 2
Date TweetText
<dttm> <chr>
1 2018-02-10 21:58:19 "RT #RealSirTomJones: Still got the moves! That was a lo~
2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual
4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK
7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk
8 2018-02-10 21:58:21 RT #tonysheps: Sir Thomas Jones you’re a bloody legend #~
9 2018-02-10 21:58:22 #ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT #RealSirTomJones: Still got the moves! That was a lo~
What I want to do is:
1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the score of all the words of each Tweet
4. Return this sum into a new third column, so I can see the score per Tweet.
For a similar lexicon I found the following code:
# Initiate the scoreTopic
scoreTopic <- 0
# Start a loop over the documents
for (i in 1:length (myCorpus)) {
# Store separate words in character vector
terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
# Determine the number of positive matches
pos_matches <- sum(terms %in% positive_words)
# Determine the number of negative matches
neg_matches <- sum(terms %in% negative_words)
# Store the difference in the results vector
scoreTopic [i] <- pos_matches - neg_matches
} # End of the for loop
dsMyTweets$score <- scoreTopic
I am however not able to adjust this code to get it working with the afinn dictionary.
This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).
library(tidytext)
library(tidyverse)
tweets <- tribble(
~tweetID, ~TweetText,
1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
2, "#darinself OMG, this is my favorite.",
3, "#treycausey #ftrain THIS IS AMAZING.",
4, "#nest No, no, not in error. Just the turkey!",
5, "The #nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")
Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package allows you to use a special tokenizer for tweets) and the inner_join() implements the sentiment analysis.
tweet_sentiment <- tweets %>%
unnest_tokens(word, TweetText, token = "tweets") %>%
inner_join(get_sentiments("afinn"))
#> Joining, by = "word"
Now we can find the scores for each tweet. Take the original data set of tweets and left_join() on to it the sum() of the scores for each tweet. The handy function replace_na() from tidyr lets you replace the resulting NA values with zero.
tweets %>%
left_join(tweet_sentiment %>%
group_by(tweetID) %>%
summarise(score = sum(score))) %>%
replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#> tweetID TweetText score
#> <dbl> <chr> <dbl>
#> 1 1. Was Julie helping me because I don't know anything about … 4.
#> 2 2. #darinself OMG, this is my favorite. 2.
#> 3 3. #treycausey #ftrain THIS IS AMAZING. 4.
#> 4 4. #nest No, no, not in error. Just the turkey! -4.
#> 5 5. The #nest people should write a blog post about how many … 0.
Created on 2018-05-09 by the reprex package (v0.2.0).
If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.
For future reference:
Score_word <- function(x) {
word_bool_vec <- get_sentiments("afinn")$word==x
score <- get_sentiments("afinn")$score[word_bool_vec]
return (score) }
Score_tweet <- function(sentence) {
words <- unlist(strsplit(sentence, " "))
words <- as.vector(words)
scores <- sapply(words, Score_word)
scores <- unlist(scores)
Score_tweet <- sum(scores)
return (Score_tweet)
}
dsMyTweets$score<-apply(df, 1, Score_tweet)
This executes what I initially wanted! :)
Related
One question that I can't find the answer for in R is that how I can find the dominant topic in NLP model for each sentence?
Imagine I have data frame like this:
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
id <- c(1,2,3,4,5,6,7)
data <- data.frame(id, comment)
I do preprocess as shown below:
text_cleaning_tokens <- data %>%
tidytext::unnest_tokens(word, comment)
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>%
anti_join(stop_words)
stemmed_token <- text_cleaning_tokens %>% mutate(word=wordStem(word))
tokens <- stemmed_token %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
tidyr::spread(key = ind, value = word)
tokens [is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark,-id,sep =" " )
tokens$clean_remark <- trimws(tokens$clean_remark)
The I ran FitLdaModel function on this data and finally, found the best topics based on 2 groups:
t_1 t_2
1 beauti built
2 block home
3 renov legal
4 bathroom locat
5 bdm bachelor
6 bdm_heart bachelor_suit
7 beauti_street block_norquai
8 beauti_upgr blueridg
9 bedroom blueridg_legal
10 bedroom_unit built_design
now based on the result I have, I want to find the most dominant topic in each sentence in topic modelling. For example, I want to know that for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?
Can anyone help me with this question? do we have any package that can do this?
It is pretty easy to work with quanteda and topicmodels. The former is for data management and quantitative analysis of textual data, the latter is for topic modeling inference.
Here I take your comment object and transform it to a corpus and then to a dfm. I then convert it to be understandable by topicmodels.
The function LDA() gives you all you need to easily extract information. In particular, with get_topics() you get the most probable topic for each document. If you instead want to see the document-topic-weights you can do so with ldamodel#gamma. You will see that get_topics() does exactly what you asked.
Please, see if this works for you.
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(topicmodels)
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
mycorp <- corpus(comment)
docvars(mycorp, "id") <- 1L:7L
mydfm <- dfm(mycorp)
# convert the DFM to a Document Matrix for topicmodels
forTM <- convert(mydfm, to = "topicmodels")
myLDA <- LDA(forTM, k = 2)
dominant_topics <- get_topics(myLDA)
dominant_topics
#> text1 text2 text3 text4 text5 text6 text7
#> 2 2 2 2 1 1 1
dtw <- myLDA#gamma
dtw
#> [,1] [,2]
#> [1,] 0.4870600 0.5129400
#> [2,] 0.4994974 0.5005026
#> [3,] 0.4980144 0.5019856
#> [4,] 0.4938985 0.5061015
#> [5,] 0.5037667 0.4962333
#> [6,] 0.5000727 0.4999273
#> [7,] 0.5176960 0.4823040
Created on 2021-03-18 by the reprex package (v1.0.0)
I agree with the other answer that quanteda and topicmodels are a better choice. Maybe also look into seededlda which is an LDA implementation from one of the quanteda authors (with extra features you don't have to use).
However, if you want to stick with your choice of tidytext and textmineR, this is how you would do it.
First, I simplified your preprocessing a bit, since you did some steps that seemed unnecessary to me:
library(tidyverse)
library(tidytext)
text_cleaning_tokens <- data %>%
unnest_tokens(word, comment) %>%
mutate(word = str_remove(word, "[[:digit:]]|[[:punct:]]")) %>%
filter(!(nchar(word) <= 1))%>%
anti_join(stop_words, by = "word") %>%
mutate(word = SnowballC::wordStem(word))
Then I run LDA according to the textmineR example:
lda <- text_cleaning_tokens %>%
cast_sparse(id, word) %>%
textmineR::FitLdaModel(k = 2,
iterations = 200,
burnin = 175,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_r2 = TRUE)
Now all implementations of LDA deliver two important results:
phi (φ) which shows for each word in the corpus how it scored on each topic. The higher the phi-value, the more prevalent the word in this specific topic.
theta (θ) which shows for each document in the corpus how it scored on each topic. The higher the theta-value, the more prevalent the topic is in the document. (topicmodels calls it gamma for some reason.)
In other words, all you have to do to find the most dominant topic in a text is:
lda$theta %>%
as_tibble() %>%
rowwise() %>%
mutate(top = which.max(c_across(everything()))) %>% # find highest value per row dplyr style
bind_cols(data, .) %>% # bind to original data
as_tibble() # just for nicer printing
#> # A tibble: 7 x 5
#> id comment t_1 t_2 top
#> <int> <chr> <dbl> <dbl> <int>
#> 1 1 1 . outstanding renovation all improvements are t… 0.892 0.108 1
#> 2 2 solidly constructed lovingly maintained sf crest … 0.0161 0.984 2
#> 3 3 one year since built new this well designed store… 0.0238 0.976 2
#> 4 4 beautiful street large bdm in the heart of lynn v… 0.986 0.0139 1
#> 5 5 rare to find legal beautiful upgr in port moody c… 0.992 0.00820 1
#> 6 6 fantastic opportunity to get value for the money … 0.266 0.734 2
#> 7 7 original owner tired but rock solid perfect locat… 0.00549 0.995 2
Created on 2021-03-18 by the reprex package (v1.0.0)
I also recommend you read Julia Silge's stuff on the matter. For example, this and this.
I'm doing some very basic sentiment analysis on a pretty large set of data that continues to grow every day. I need to feed this data into a shiny app where I can adjust the date range. Rather than running the analysis over and over again, what I'd like to do is create a new CSV with the sum of each sentiment score by date. I'm having trouble iterating over the date though. Here's some sample data and the lapply() statement I tried that is not working.
library(tidyverse)
library(syuzhet)
library(data.table)
df <- data.frame(date = c("2021-01-18", "2021-01-18", "2021-01-18", "2021-01-17","2021-01-17", "2021-01-16", "2021-01-15", "2021-01-15", "2021-01-15"),
text = c("Some text here", "More text", "Some other words", "Just making this up", "as I go along", "hope the example helps", "thank you in advance", "I appreciate the help", "the end"))
> df
date text
1 2021-01-18 Some text here
2 2021-01-18 More text
3 2021-01-18 Some other words
4 2021-01-17 Just making this up
5 2021-01-17 as I go along
6 2021-01-16 hope the example helps
7 2021-01-15 thank you in advance
8 2021-01-15 I appreciate the help
9 2021-01-15 the end
dates_scores_df <- lapply(df, function(i){
data <- df %>%
# Filter to the unique date
filter(date == unique(df$date[i]))
# Sentiment Analysis for each date
sentiment_data <- get_nrc_sentiment(df$text)
# Convert to df
score_df <- data.frame(sentiment_data[,])
# Transpose the data frame and adjust column names
daily_sentiment_data <- transpose(score_df)
colnames(daily_sentiment_data) <- rownames(score_df)
# Add a date column
daily_sentiment_data$date <- df$date[i]
})
sentiment_scores_by_date <- do.call("rbind.data.frame", dates_scores_df)
What I'd like to get to is something like this (data here is made up and will not match the example above)
date anger anticipation disgust fear joy sadness surprise trust negative positive
2021-01-18 1 2 0 1 2 0 2 1 1 2
2021-01-17 1 2 0 2 3 3 1 2 0 1
You can try :
library(dplyr)
library(purrr)
library(syuzhet)
df %>%
split(.$date) %>%
imap_dfr(~get_nrc_sentiment(.x$text) %>%
summarise(across(.fns = sum)) %>%
mutate(date = .y, .before = 1)) -> result
result
Function lapply iterates over elements of a list. Data frame is technically a list with each column as an element of that list. So in your example you are iterating over columns rather than rows, or even dates (that seems to be your goal). Instead of lapply I'd use dplyr::group_by in combination with one of: dplyr::do, dplyr::summarize or tidyr::nest. See documentations for each function to figure out which function suits the most your need.
I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
The below image makes it easier to see what it represents:
What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differs by no more than one.
The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates and it seems to work
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work ok but if the modes of transportation and number of people are significant, it becomes really difficult finding/knowing not just which rows are duplicates, but which indexes actually matched. In this case that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?
Happy to elaborate further and very grateful for any and all help.
library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", )
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
group_by(person) %>%
filter(value == "X") %>%
summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
arrange(desc(Modalities), Which) %>%
mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get a membership list in any particular IndenticalGrp you can just pull like this.
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
I am trying to do ngram analysis for in tidytext, I have a corpus of 770 speeches. However the function unnest_tokens in tidytext takes data frame as input. when i checked with the example (jane austin books) each line of the book is stored as row in a data frame. i am not able to convert the corpus into a dataframe, neither for one speech at a time nor for all the corpus at once.
What is the way i can run ngrams (n=2,3, etc) analysis on tidytext using unnest tokens on my corpus. Can someone please suggest?
Thanks
You can use library ngram & tm for this.You can replace "myCorpus" with the corpus you created.
library(tm)
library(ngarm)
myCorpus<-c("Hi How are you","Hello World","I love Stackoverflow","Good Bye All")
ng <- ngram (myCorpus , n =2)
get.phrasetable (ng)
If you want to tokenize and convert your corpus into a dataframe then use the below code.
tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "),stringsAsFactors = FALSE)
You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:
library(tidyverse)
library(tidytext)
speech_vec <- c("I am giving a speech!",
"My second speech is even better.",
"Unfortunately, this speech is terrible!",
"For my final speech, I will wow you all.")
speech_df <- tibble(text = speech_vec) %>%
mutate(speech = row_number())
tidy_speeches <- speech_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_speeches
#> # A tibble: 21 x 2
#> speech bigram
#> <int> <chr>
#> 1 1 i am
#> 2 1 am giving
#> 3 1 giving a
#> 4 1 a speech
#> 5 2 my second
#> 6 2 second speech
#> 7 2 speech is
#> 8 2 is even
#> 9 2 even better
#> 10 3 unfortunately this
#> # … with 11 more rows
Created on 2020-02-15 by the reprex package (v0.3.0)
If instead, you mean that you have a DocumentTermMatrix from the tm package, check out this chapter for details on how to convert to a tidy data structure.
I have a dataframe that contains survey responses with each row representing a different person. One column - "Text" - is an open-ended text question. I would like to use Tidytext::unnest_tokens so that I do text analysis by each row, including sentiment scores, word counts, etc.
Here is the simple dataframe for this example:
Satisfaction<-c ("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)
I then turned the Text column into character...
df$Text<-as.character(df$Text)
Next I grouped by the id column and nested the dataframe.
df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)
Getting this far seems to have worked ok, but now how do I use purrr::map functions to work on the nested list column "word"? For example, if I want to create a new column using dplyr::mutate with word counts for each row?
Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?
I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.
You can set up your dataframe like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
"Satisfied",
"Dissatisfied",
"Satisfied",
"Dissatisfied")
Text <- c("I'm very satisfied with the services",
"Your service providers are always late which causes me a lot of frustration",
"You should improve your staff training, service providers have bad customer service",
"Everything is great!",
"Service is bad")
Gender <- c("M","M","F","M","F")
df <- data_frame(Satisfaction, Text, Gender)
tidy_df <- df %>%
mutate(id = row_number()) %>%
unnest_tokens(word, Text)
Then to find, for example, the number of words per line, you can use group_by and mutate.
tidy_df %>%
group_by(id) %>%
mutate(num_words = n()) %>%
ungroup
#> # A tibble: 37 × 5
#> Satisfaction Gender id word num_words
#> <chr> <chr> <int> <chr> <int>
#> 1 Satisfied M 1 i'm 6
#> 2 Satisfied M 1 very 6
#> 3 Satisfied M 1 satisfied 6
#> 4 Satisfied M 1 with 6
#> 5 Satisfied M 1 the 6
#> 6 Satisfied M 1 services 6
#> 7 Satisfied M 2 your 13
#> 8 Satisfied M 2 service 13
#> 9 Satisfied M 2 providers 13
#> 10 Satisfied M 2 are 13
#> # ... with 27 more rows
You can do sentiment analysis by implementing an inner join; check out some examples here.