Twitter Data Sentiment Analysis - r

I'm a novice, so apologies if my question is trivial.
I am trying to do sentiment analysis on some Twitter data I downloaded, but I am having trouble. I am trying to follow an example that creates a bar plot showing positive/negative sentiment counts. The code for the example is:
original_books %>%
unnest_tokens(output = word,input = text) %>%
inner_join(get_sentiments("bing")) %>%
count(book, index, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n) %>%
mutate(sent_score = positive - negative) %>%
ggplot() +
geom_col(aes(x = index, y = sent_score,
fill = book),
show.legend = F) +
facet_wrap(~book,scales = "free_x")
Here is the code I have so far for my own analysis:
#twitter scraping
ref <- search_tweets(
"#refugee", n = 18000, include_rts = FALSE,lang = "en"
)
data(stop_words)
new_stops <- tibble(word = c("https", "t.co", "1", "refugee", "#refugee", "amp", "refugees",
"day", "2022", "dont", "0", "2", "#refugees", "4", "2021") ,lexicon = "sabs")
full_stop <- stop_words %>%
bind_rows(new_stops) #bind_rows adds more rows (way to merge data)
Now I want to make a bar graph similar to the one above but I get an error because I don't have a column called "index." I tried making one but it didn't work. Here is the code I am trying to use:
ref %>%
unnest_tokens(word,text,token = "tweets") %>%
anti_join(full_stop) %>%
inner_join(get_sentiments("bing")) %>%
count(word, index, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n) %>%
mutate(sent_score = positive - negative) %>%
ggplot() + #plot the overall sentiment (pos - neg) versus index,
geom_col(aes(x = index, y = sent_score), show.legend = F)
Here is an image of the error
Any suggestions are really appreciated! Thank you
[Image: contents of ref]

In the example, index just refers to a group of lines from the book, in order (i.e., 1, 2, 3...). It's a way to group the text; you could think of it like a page, which would also be in numerical order. The text must be split up into some kind of groups in order to compute the sentiment within each group. Tweets are natural groups of words, and you want to compute the sentiment within a single tweet, so you don't need to split it up further. In the example, the figure has a bar for each "page" of the book. You'll have a bar for each tweet.
You need to assign the tweets consecutive numbers because they don't have a natural order. I did that below using rowid_to_column(), and I named the new column "tweet". It just contains the row numbers of the tweets, so once the ref dataframe is split up by word, each word is still tied back to the original tweet it belonged to by that number.
Note that many tweets don't have enough words with an associated sentiment to even calculate their sentiment score, so I then re-assigned a consecutive number to those that did -- this one is called "index".
I also added the argument values_fill = 0 to the pivot_wider() line because tweets with only positive (or negative) sentiment were not getting included, because the other value was NA instead of 0.
Along the way there are a couple of places where I just stop and look at the data -- this is really helpful in understanding errors.
library(tidyverse)
library(rtweet)
library(tidytext)
#twitter scraping
ref <- search_tweets(
"#refugee", n = 18000, include_rts = FALSE,lang = "en"
)
data(stop_words)
new_stops <- tibble(word = c("https", "t.co", "1", "refugee", "#refugee", "amp", "refugees",
"day", "2022", "dont", "0", "2", "#refugees", "4", "2021") ,lexicon = "sabs")
full_stop <- stop_words %>%
bind_rows(new_stops) #bind_rows adds more rows (way to merge data)
ref_w_sentiments <- ref %>%
rowid_to_column("tweet") %>%
unnest_tokens(word, text, token = "tweets") %>%
anti_join(full_stop) %>%
inner_join(get_sentiments("bing"))
# look at what the data looks like
select(ref_w_sentiments, tweet, word, sentiment)
#> # A tibble: 811 × 3
#> tweet word sentiment
#> <int> <chr> <chr>
#> 1 2 helping positive
#> 2 3 inspiring positive
#> 3 4 support positive
ref_w_scores <- ref_w_sentiments %>%
group_by(tweet) %>%
count(sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n, values_fill = 0) %>%
mutate(sent_score = positive - negative) %>%
# not all tweets were scored, so create a new index
rowid_to_column("index")
# look at the data again
ref_w_scores
#> # A tibble: 418 × 5
#> # Groups: tweet [418]
#> index tweet positive negative sent_score
#> <int> <int> <int> <int> <int>
#> 1 1 2 1 0 1
#> 2 2 3 1 0 1
#> 3 3 4 1 0 1
ggplot(ref_w_scores) + #plot the overall sentiment (pos - neg) versus index,
geom_col(aes(x = index, y = sent_score), show.legend = F)
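The effect of values_fill = 0 and of the second index can be checked without calling the Twitter API. The following is a minimal sketch on a made-up tibble (toy and its values are hypothetical stand-ins for the tokenised tweets after the bing join), not the real data:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for ref_w_sentiments: one row per scored word
toy <- tibble(
  tweet = c(2, 2, 3, 5, 5, 5),
  sentiment = c("positive", "negative", "positive",
                "negative", "negative", "positive")
)

toy_scores <- toy %>%
  count(tweet, sentiment) %>%
  # tweet 3 has no negative words; values_fill = 0 stores 0 instead of NA
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sent_score = positive - negative) %>%
  # tweets 1 and 4 had no scored words, so renumber the remaining rows
  rowid_to_column("index")

toy_scores
```

Without values_fill = 0, the negative count for tweet 3 would be NA, and so would its sent_score.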

Related

number of matches for keywords in specified categories

For a large-scale text analysis problem, I have a data frame containing words that fall into different categories, and a data frame containing a column with strings and (empty) counting columns for each category. I now want to take each individual string, check which of the defined words appear, and count them within the appropriate category.
As a simplified example, given the two data frames below, I want to count how many of each animal type appear in the text cell.
df_texts <- tibble(
text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the
grasshopper"),
mammals=NA,
reptiles=NA,
birds=NA,
insects=NA
)
df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))
So my desired result would be:
df_result <- tibble(
text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the
grasshopper"),
mammals=c(2,1,0),
reptiles=c(0,1,0),
birds=c(0,0,1),
insects=c(0,0,1)
)
Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?
Thanks in advance!
Here's a way to do it in the tidyverse. First check whether the strings in df_texts$text contain each animal, then sum the matches by text and type.
library(tidyverse)
cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>%
pivot_longer(-text, names_to = "animals") %>%
left_join(df_animals) %>%
group_by(text, type) %>%
summarise(sum = sum(value)) %>%
pivot_wider(id_cols = text, names_from = type, values_from = sum)
text bird insect mammal reptile
<chr> <int> <int> <int> <int>
1 "the ape and the fox" 0 0 2 0
2 "the owl and the the \n grasshopper" 1 0 0 0
3 "the tortoise and the hare" 0 0 1 1
To account for the several occurrences per text:
cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = F))) %>%
setNames(c("text", df_animals$animals)) %>%
pivot_longer(-text, names_to = "animals") %>%
left_join(df_animals) %>%
group_by(text, type) %>%
summarise(sum = sum(value)) %>%
pivot_wider(id_cols = text, names_from = type, values_from = sum)
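For a much larger dataset, a tokenise-then-join variant may be worth trying, since it avoids building a text-by-keyword matrix. This is only a sketch under two assumptions: every keyword is a single whitespace-delimited word, and the third text is simplified to a single-line string. It is an alternative to the approach above, not a replacement for it.

```r
library(dplyr)
library(tidyr)
library(tibble)

df_texts <- tibble(
  text = c("the ape and the fox",
           "the tortoise and the hare",
           "the owl and the grasshopper")  # simplified single-line text
)
df_animals <- tibble(
  animals = c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
  type    = c("mammal", "mammal", "reptile", "mammal", "bird", "insect")
)

counts <- df_texts %>%
  mutate(word = strsplit(text, "\\s+")) %>%   # one list element per text
  unnest(word) %>%                            # one row per word
  inner_join(df_animals, by = c("word" = "animals")) %>%
  count(text, type) %>%
  pivot_wider(names_from = type, values_from = n, values_fill = 0)

# Join back so that texts without any keyword are kept, with zero counts
result <- df_texts %>%
  left_join(counts, by = "text") %>%
  mutate(across(-text, ~ replace_na(.x, 0L)))

result
```

Because each word is matched exactly, repeated keywords are counted once per occurrence, and substrings (for example "ape" inside "paper") are not matched.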

Errors in counting + combining bing sentiment score variables in Tidytext?

I'm doing sentiment analysis on a large corpus of text. I'm using the bing lexicon in tidytext to get simple binary pos/neg classifications, but want to calculate the ratios of positive to total (positive & negative) words within a document. I'm rusty with dplyr workflows, but I want to count the number of words coded as "positive" and divide it by the total count of words classified with a sentiment.
I tried this approach, using sample code and stand-in data . . .
library(tidyverse)
library(tidytext)
#Creating a fake tidytext corpus
df_tidytext <- data.frame(
doc_id = c("Iraq_Report_2001", "Iraq_Report_2002"),
text = c("xxxx", "xxxx") #Placeholder for text
)
#Creating a fake set of scored words with bing sentiments
#for each doc in corpus
df_sentiment_bing <- data.frame(
doc_id = c((rep("Iraq_Report_2001", each = 3)),
rep("Iraq_Report_2002", each = 3)),
word = c("improve", "democratic", "violence",
"sectarian", "conflict", "insurgency"),
bing_sentiment = c("positive", "positive", "negative",
"negative", "negative", "negative") #Stand-ins for sentiment classification
)
#Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_sentiment_scored <- df_tidytext %>%
left_join(df_sentiment_bing) %>%
group_by(doc_id) %>%
count(bing_sentiment) %>%
pivot_wider(names_from = bing_sentiment, values_from = n) %>%
summarise(bing_score = count(positive)/(count(negative) + count(positive)))
But I get the following error:
"Error: Problem with `summarise()` input `bing_score`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
ℹ Input `bing_score` is `count(positive)/(count(negative) + count(positive))`.
ℹ The error occurred in group 1: doc_id = "Iraq_Report_2001".
Would love some insight into what I'm doing wrong with my summarizing workflow here.
After pivot_wider(), positive and negative are already numeric columns holding the counts, so there is nothing left to count. That is also the cause of the error: count() expects a data frame, not a numeric vector. Use sum() instead.
One solution could be:
#Summarizing count of positive and negative words
# (number of positive words out of total scored words in each doc)
df_tidytext %>%
left_join(df_sentiment_bing) %>%
group_by(doc_id) %>%
dplyr::count(bing_sentiment) %>%
pivot_wider(names_from = bing_sentiment, values_from = n) %>%
replace(is.na(.), 0) %>%
summarise(bing_score = sum(positive)/(sum(negative) + sum(positive)))
The result you should get is:
Joining, by = "doc_id"
# A tibble: 2 × 2
doc_id bing_score
<fct> <dbl>
1 Iraq_Report_2001 0.667
2 Iraq_Report_2002 0
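Since the score is just the share of scored words that are positive, the pivot step can also be skipped. A minimal sketch using the same stand-in data from the question (an alternative, assuming every row of df_sentiment_bing carries a sentiment label):

```r
library(dplyr)

# Stand-in scored words, as in the question
df_sentiment_bing <- data.frame(
  doc_id = rep(c("Iraq_Report_2001", "Iraq_Report_2002"), each = 3),
  word = c("improve", "democratic", "violence",
           "sectarian", "conflict", "insurgency"),
  bing_sentiment = c("positive", "positive", "negative",
                     "negative", "negative", "negative")
)

# The mean of a logical vector is the proportion of TRUE values
bing_scores <- df_sentiment_bing %>%
  group_by(doc_id) %>%
  summarise(bing_score = mean(bing_sentiment == "positive"))

bing_scores
```

A document with no positive words gets a score of 0 here without any NA handling, because no wide positive/negative columns are created.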

How do I remove (custom) stopwords from unigrams but keep them in bigrams?

I am working with the IMDB movie ratings dataset and am struggling with the data preprocessing. There are some movie-related words that appear in many ratings but are uninformative as a unigram, e.g. "film". However, if the rating says "good film" or "bad movie", that is informative and I would like to keep it as a bigram. Unfortunately, I have not yet been able to get my code to do this:
library(tidyverse)
library(tidymodels)
library(textrecipes)
movie_stopwords <- tibble(word = c("movie","movies","movie's","act","acts","actor","actors",
"actress","actresses","actor's","actress´s",
"film","film's","director","directors","director's",
"character", "characters", "character's"))
my_corpus <- tibble(sentiment = c("positive","negative","positive"),
rating = c("this is a good movie","this movie sucks", "this movie has a good plot"))
# print the final unigrams, bigrams and trigrams
recipe(sentiment ~ rating, data = my_corpus) %>%
step_tokenize(rating) %>%
step_stopwords(rating, stopword_source = "marimo") %>%
step_ngram(rating, min_num_tokens = 1, num_tokens = 3) %>%
step_stopwords(rating, custom_stopword_source = movie_stopwords) %>%
step_untokenize(rating) %>%
prep() %>% bake(new_data = NULL)
This outputs the following tibble:
# OUTPUT AS IS
# A tibble: 3 x 2
rating sentiment
<fct> <fct>
1 good movie good_movie positive
2 movie sucks movie_sucks negative
3 movie good plot movie_good good_plot movie_good_plot positive
However, I would prefer the unigram "movie" to be removed, and I honestly expected the second step_stopwords to do just that.
Does anyone have an idea how to do that efficiently (i.e. for 50k ratings)?
# OUTPUT AS I WANT IT TO BE
# A tibble: 3 x 2
rating sentiment
<fct> <fct>
1 good good_movie positive
2 sucks movie_sucks negative
3 good plot movie_good good_plot movie_good_plot positive
The custom_stopword_source argument should be a character vector, not a data.frame/tibble.
According to ?step_stopwords
custom_stopword_source - A character vector to indicate a custom list of words that cater to the users specific problem.
library(tidymodels)
library(magrittr)
library(textrecipes)
recipe(sentiment ~ rating, data = my_corpus) %>%
step_tokenize(rating) %>%
step_stopwords(rating, stopword_source = "marimo") %>%
step_ngram(rating, min_num_tokens = 1, num_tokens = 3) %>%
step_stopwords(rating, custom_stopword_source = movie_stopwords$word) %>%
step_untokenize(rating) %>%
prep() %>%
bake(new_data = NULL)
-output
# A tibble: 3 x 2
# rating sentiment
# <fct> <fct>
#1 good good_movie positive
#2 sucks movie_sucks negative
#3 good plot movie_good good_plot movie_good_plot positive

How to use a month index in a count function in R using Pipr

I have the following problem: I have a dataframe with 3 columns: line number, date, and a single word. I am trying to perform text analysis on GitHub commit comments using the https://www.tidytextmining.com/ method. I would like to have my aggregate sentiment score on a quarterly basis rather than by the number of comments, which I did with count(index = line %/% 10, sentiment) %>%. Is there an easy way to count all my "sentiment scores" by quarter?
Many thanks for any suggestions.
single_word_with_date$date <- substr(single_word_with_date$date,1,nchar(single_word_with_date$date)-10)
single_word_with_date$date <- as.Date(single_word_with_date$date , format = "%Y-%m-%d")
comment_sentiments_with_date <- single_word_with_date %>%
inner_join(get_sentiments("bing")) %>%
count(index = date %/% month(date) , sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
This is the dataframe: line is the comment number (e.g. the line 2 comment had several words in it), date is a datetime, and word is a string.
> head(single_word_with_date)
line date word
1 1 2011-11-16 love
2 2 2012-04-13 random
2.1 2 2012-04-13 question
2.8 2 2012-04-13 answered
2.14 2 2012-04-13 darwin
2.19 2 2012-04-13 purpose
Try this :
library(tidytext)
library(dplyr)
library(tidyr)
single_word_with_date %>%
inner_join(get_sentiments("bing")) %>%
# count() creates the n column that pivot_wider() needs
count(quarter = paste(format(date, '%Y'), quarters(date), sep = '-'), sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) -> result
result
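The quarterly grouping can be tried on its own without the sentiment lexicon. A small sketch on made-up rows (the words tibble and its labels are hypothetical, standing in for the data after the bing join):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical scored words with their commit dates
words <- tibble(
  date = as.Date(c("2011-11-16", "2012-04-13", "2012-04-13", "2012-05-02")),
  sentiment = c("positive", "negative", "positive", "negative")
)

by_quarter <- words %>%
  # base R quarters() returns "Q1".."Q4"; prefix the year to keep years apart
  count(quarter = paste(format(date, "%Y"), quarters(date), sep = "-"),
        sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)

by_quarter
```

Putting the year first keeps the quarter labels in chronological order when they are sorted as text.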

summarise function in R

I am trying to create an R data frame that includes some numerical variables.
While doing this, I made a typing mistake whose result looks weird to me, and I would like to understand why (I am surely missing something here).
I have looked around for a possible explanation but haven't found what I am looking for.
library("dplyr")
library("tidyr")
data <-
data.frame(FS = c(1), FS_name = c("Armenia"), Year = c(2015), class =
c("class190"), area_1000ha = c(66.447)) %>%
mutate(FS_name = as.character(FS_name)) %>%
mutate(Year = as.integer(Year)) %>%
mutate(class = as.character(class)) %>%
as_tibble() # tbl_df() is deprecated in current dplyr
data
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type correctly, I have the right result on area_1000ha variable (10.5).
If I don't - i.e. keeping rm.na= I get 11.5, instead (+1, in fact).
What am I missing?
I think rm.na = TRUE is treated as one more value to add: TRUE is coerced to 1, so the result is your original sum plus 1.
If you change TRUE to 2 for example
x <- data %>%
group_by(FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5
sum() has no argument called rm.na, so R treats it as just another value to be summed, and TRUE counts as 1.
Use na.rm = TRUE and you will get the right result.
Even if you change the name of the variable
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
ungroup()
I have replaced rm.na with tester variable.
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5
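The same behaviour can be reproduced in base R without dplyr. A short sketch, assuming a single value of 10.5 as in the question's description:

```r
area <- 10.5

correct <- sum(area, na.rm = TRUE)   # na.rm is a real argument of sum()
typo    <- sum(area, rm.na = TRUE)   # rm.na is summed as a value: TRUE becomes 1
two     <- sum(area, rm.na = 2)      # any number passed this way is added
renamed <- sum(area, tester = TRUE)  # the argument name makes no difference

c(correct = correct, typo = typo, two = two, renamed = renamed)
```

The signature is sum(..., na.rm = FALSE), so every argument other than na.rm, named or not, is collected and added.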
