Twitter sentiment analysis with twitteR, all scores are zero? - r

I'm new to Twitter sentiment analysis with twitteR, and I used the positive.txt and negative.txt word lists from Hu and Liu. I was glad that everything ran smoothly, but the scores for over 1000 tweets all came out neutral (score = 0). I can't figure out what went wrong; any help is greatly appreciated!
setup_twitter_oauth(consumer_key, consumer_secret, token, token_secret)
#Get tweets about "House of Cards", due to the limitation, we'll set n=1500
netflix.tweets<- searchTwitter("#HouseofCards",n=1500)
tweet=netflix.tweets[[1]]
tweet$getScreenName()
tweet$getText()
netflix.text=laply(netflix.tweets,function(t)t$getText())
head(netflix.text)
write(netflix.text, "HouseofCards_Tweets.txt", ncolumns = 1)
#loaded the positive and negative.txt from Hu and Liu
positive <- scan("/users/xxx/desktop/positive_words.txt", what = character(), comment.char = ";")
negative <- scan("/users/xxx/desktop/negative_words.txt", what = character(), comment.char = ";")
#add positive words
pos.words =c(positive,"miss","Congratulations","approve","watching","enlightening","killing","solid")
scoredsentiment <- function(hoc.vec, pos.word, neagtive)
{
clean <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "",hoc.vec)
clean <- gsub("^\\s+|\\s+$", "", clean)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[^[:graph:]]", "", clean)
clean <- gsub("[[:cntrl:]]", "", clean)
clean <- gsub("#\\w+", "", clean)
clean <- gsub("\\d+", "", clean)
clean <- tolower(clean)
hoc.list <- strsplit(clean, "")
hoc=unlist(hoc.list)
pos.matches = match(hoc, pos.words)
scoredpositive <- sapply(hoc.list, function(x) sum(!is.na(match(pos.matches, positive))))
scorednegative <- sapply(hoc.list, function(x) sum(!is.na(match(x, negative))))
hoc.df <- data.frame(score = scoredpositive - scorednegative, message = hoc.vec, stringsAsFactors = F)
return (hoc.df)
}
twitter_scores <- scoredsentiment(netflix.text, scoredpositive, scorednegative)
print(twitter_scores)
write.csv(twitter_scores, file=paste('twitter_scores.csv'), row.names=TRUE)
#draw a graph to show the final outcome
hist(twitter_scores$score)
qplot(twitter_scores$score)
Everything runs, but the score for each tweet is the same (score = 0).

You can use Microsoft Cognitive Services to calculate sentiment scores.
The Microsoft Cognitive Services Text Analytics API can detect sentiment, key phrases, topics, and language in your text.
Refer to this link to use Microsoft Cognitive Services in R for sentiment analysis.
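For illustration only, here is a minimal sketch of calling the Text Analytics sentiment endpoint from R with httr; the endpoint URL, API version, and response shape below are assumptions on my part, so check the linked documentation for the current values.
library(httr)
library(jsonlite)
# Assumed v2.0 sentiment endpoint and request format -- verify against the docs
endpoint <- "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
body <- list(documents = list(list(id = "1", language = "en",
                                   text = "House of Cards is brilliant")))
res <- POST(endpoint,
            add_headers("Ocp-Apim-Subscription-Key" = "your_subscription_key"),
            body = toJSON(body, auto_unbox = TRUE),
            content_type_json())
content(res)$documents[[1]]$score  # assumed response field, 0 (negative) to 1 (positive)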

From your code, I don't think that a simple match will work. You need some form of fuzzy matching scheme. With match, you need the exact word to be repeated, which will not happen often, and furthermore you are matching a single word against a whole string of words.
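As an illustration only, here is a minimal word-level sketch (not the asker's exact pipeline; score_sentiment is a name introduced here): splitting each cleaned tweet on whitespace with strsplit(clean, "\\s+") rather than strsplit(clean, ""), which splits into single characters, and then matching those words against the word lists produces non-zero scores.
# A sketch, assuming pos.words and negative are the character vectors
# built with scan()/c() in the question above.
score_sentiment <- function(tweets, pos.words, neg.words) {
  scores <- sapply(tweets, function(tweet) {
    clean <- gsub("[[:punct:]]|[[:cntrl:]]|\\d+", "", tweet)
    words <- unlist(strsplit(tolower(clean), "\\s+"))  # split on whitespace, not ""
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  }, USE.NAMES = FALSE)
  data.frame(score = scores, message = tweets, stringsAsFactors = FALSE)
}
twitter_scores <- score_sentiment(netflix.text, pos.words, negative)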

Related

Removing apostrophes while text mining, or don't

This is absolutely driving me nuts, and I'm ashamed to say that I've spent the last 3 hours trying to figure this out.
I'm mining Twitter data, and I'd like to do some text analysis, but words like "doesn't" are throwing me off. I either want to keep the ' or replace it with an empty string (""). I've tried:
tweets$text <- gsub("\'", "", tweets$text)
tweets$text <- gsub("\\'", "", tweets$text)
tweets$text <- gsub("'", "", tweets$text)
tweets$text <- gsub("\W", "", tweets$text)
What I WANT is doesn't -> doesnt, or else to keep doesn't as it is.
I want to remove the rest of the special characters, but because what comes after ' changes the word, I want to keep that. Later in the code I'm using gsub("[^A-Za-z]", " ", twt_txt_url) to clean the special characters.
I shared this code earlier for a different question, but it should still get the point across. Note that this is split into two parts: one for pulling the data, and one showing how I'm cleaning it.
PULLING DATA:
library(rtweet)
library(tidyverse)
library(httpuv)
# API access keys
app_name = "app_name"
consumer_key <- "consumer_key"
consumer_secret <- "consumer_secret"
access_token <- "access_token"
access_secret <- "access_secret"
# Create twitter connection
create_token(app = app_name,
consumer_key = consumer_key,
consumer_secret = consumer_secret,
access_token = access_token,
access_secret = access_secret)
# who do we want to observe
account <- "#BillGates"
# 3200 is the max we can pull at once
account.timeline <- get_timeline(account, n=100, includeRts =TRUE)
# create data frame and csv from tweets
write_as_csv(account.timeline, "BillGates.csv", fileEncoding = "UTF-8")
CLEANING DATA
library(tidyverse)
library(qdapRegex)
library(tm)
library(qdap)
library(wordcloud)
tweets <- read_csv("BillGates.csv")
# This is where I'm trying to remove the ' from words
tweets$text <- gsub("\\'","", tweets$text)
# Separate out the text column
twt_txt <- tweets$text
# remove URLs
twt_txt_url <- rm_twitter_url(twt_txt)
# remove special characters
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
# convert to a text corpus
twt_corpus <- twt_txt_chrs %>%
VectorSource() %>%
Corpus()
Here's some of the data that you can manipulate as well. It seems like I can remove the first ', but nothing after that.
df = data.frame(
tweet = c(1, 2, 3, 4, 5),
text = c(
"Standing up for science has never been more important. Congratulations to Dr. Anthony Fauci and Dr. Salim Abdool Karim on receiving this honor.",
"I've known and learned from #RonConway for more than 40 years. I'm glad to see #svangel team up with #bchesky to mentor and support companies working to create more economic empowerment opportunities for people across the world.",
"This book has nothing to do with viruses or pandemics. But it is surprisingly relevant for these times. #exlarson provides a brilliant and gripping account of another era of widespread anxiety: the years 1940 and 1941.",
"The season finale of our podcast features two incredible people who are using their positions as artists to change the world for the better.",
"Like many people, I’ve tried to deepen my understanding of systemic racism in recent months. If you’re interested in learning more about the lives caught up in our country's justice system, I highly recommend #thenewjimcrow by Michelle Alexander."
)
)
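One thing worth checking (an assumption on my part, not something confirmed in the thread): several of the tweets above use the curly apostrophe ’ (U+2019) rather than the straight ', and none of the patterns tried so far match that character. A sketch of both options, using the sample df above:
# Either drop both apostrophe variants, so "doesn't" / "doesn’t" -> "doesnt" ...
no_apostrophes <- gsub("['\u2019]", "", df$text)
# ... or keep them while turning the remaining special characters into spaces:
keep_apostrophes <- gsub("[^A-Za-z'\u2019 ]", " ", df$text)
head(no_apostrophes)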

Using regex and/or removing duplicates

I am scraping a website, and as a result I have half-cleaned output:
[3] "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%"*
Above is one example, and I am trying to remove the number before or after that ♠ symbol.
Desired output is:
[3] "2:2: Texas:28,,845:25,46,5:4.4%:36:55,32:9,23:698,53:8.68%"
Basically, removing the numbers between the ♠ and the colon, including the ♠ itself.
I will greatly appreciate any help. I have tried the following code, but it did not work.
str_replace_all(dataSet, "♠*:", "", fixed = T)
gsub("*♠", "", data, fixed = T)
library(rvest)
library(stringr)
website <- read_html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population")
results <- website %>% html_nodes("table")
data_body <- results[1] %>% html_nodes("tbody")
rows <- data_body %>% html_nodes("tr")
rows_text <- rows %>% html_text()  # text of each table row
clean_rows_text <- str_replace_all(rows_text, "[7000100000000000000]", "")
clean_rows_text <- str_replace_all(clean_rows_text, "\n\n", ":")
clean_rows_text <- str_replace_all(clean_rows_text, "\n", "")
Desired output is:
[3] "2:2: Texas:28,,845:25,46,5:4.4%:36:55,32:9,23:698,53:8.68%"
From this point, I can handle the rest.
This should do it:
data <- "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%*"
gsub("♠.+?(?=:)", "", data, perl=T)

R Regex seemingly not working properly in Linux

I'm trying to scrape the webpage of Fangraphs with alphabetical player indices to get a single column dataframe of each letter reference.
I have been able to get the code below to successfully work on a Windows version of R 3.4.1, but cannot get it to work on the Linux side at all, and I can't figure out what exactly is going wrong/different.
library(XML)
# Scrape to get the webpage
url <- paste0("http://www.fangraphs.com/players.aspx?")
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
# Below this is where I can notice that the code is not operating the same
# as on my Windows machine. None of the gsub commands seem to impact
# the strings at all.
# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)
# Replacing patterns like "AzB Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)
# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)
From what I could find, the only real difference between Windows and Linux regex handling would be the line-break implementation, so I went back and tried to see if that was making the difference... but still got no change.
I also tried to substitute the R-specific "[[:space:]]" and "[[:upper:]]" style notation with the more standardized "\s" to see if that would fix anything.
As for fixes, I know there are a handful of other packages that I can look into to simply get the result I'm looking for, but more generally, are there just simply differences in how Windows and Linux implement regex that I'm unaware of and am oblivious to? And if so, how would I implement them into gsub to get the same result I get on Windows?
Thanks.
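One difference that is easy to overlook (an assumption on my part, not something confirmed here): base R interprets POSIX classes such as [[:space:]] and [[:upper:]] according to the current locale, which usually differs between a Windows and a Linux install, and the scraped strings may also come back with a different encoding on each system. A quick check, plus a locale-independent variant of the trailing-whitespace cleanup:
Sys.getlocale("LC_CTYPE")          # compare this value on Windows vs. Linux
table(Encoding(letterz$letters))   # see how the scraped strings are marked
# perl = TRUE with explicit escapes avoids the locale-dependent classes
letterz$letters <- gsub("\\s+$", "", letterz$letters, perl = TRUE)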

Twitter slang lookup in R

I am writing an R script to analyse the sentiment of tweets. I am using the twitteR and ROAuth packages to get the tweets based on some search keywords. I am using the code below to achieve this.
library(twitteR)
library(ROAuth)
library(httr)
# Set API Keys
api_key <- "xxxxxx"
api_secret <- "yyyyyy"
acs_token <- "aaxxbbbb"
access_token_secret <- "xyyzziiassss"
setup_twitter_oauth(api_key, api_secret, acs_token, access_token_secret)
# Grab latest tweets
tweets_results <- searchTwitter('xfinity x1 netflix', n=1500)
# Loop over tweets and extract text
feed_results = lapply(tweets_results, function(t) t$getText())
Now I am using the following function to clean up the tweets.
clean_text = function(x)
{
x = gsub("rt", "", x) # remove Retweet
x = gsub("#\\w+", "", x) # remove at(#)
x = gsub("[[:punct:]]", "", x) # remove punctuation
x = gsub("[[:digit:]]", "", x) # remove numbers/Digits
x = gsub("http\\w+", "", x) # remove links http
x = gsub("[ |\t]{2,}", "", x) # remove tabs
x = gsub("^ ", "", x) # remove blank spaces at the beginning
x = gsub(" $", "", x) # remove blank spaces at the end
try.error = function(z) #To convert the text in lowercase
{
y = NA
try_error = tryCatch(tolower(z), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(z)
return(y)
}
x = sapply(x, try.error)
return(x)
}
Now, after this clean-up is done, certain Twitter slang words remain (like "Luv", "BFF", "BAE", etc.). For effective sentiment analysis, these slang words need to be transformed into standard English words. I was hoping to find a dictionary in R that would help me achieve this, but didn't find one. Does anyone know of such a dictionary? If not, can someone suggest the best way to get around this problem?
Here are some useful resources -
Acronyms
Jargons
More Slang
You can download the data and use it as a dictionary or lookup. Don't forget to remove stop words and perform stemming.
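As a sketch of what the lookup could look like once you have such a list (the slang data frame and expand_slang function below are made up for illustration, not taken from those resources):
slang <- data.frame(term = c("luv", "bff", "bae"),
                    standard = c("love", "best friend", "partner"),
                    stringsAsFactors = FALSE)
expand_slang <- function(text, lexicon) {
  sapply(text, function(sentence) {
    tokens <- unlist(strsplit(tolower(sentence), "\\s+"))
    idx <- match(tokens, lexicon$term)                # look each token up
    tokens[!is.na(idx)] <- lexicon$standard[idx[!is.na(idx)]]
    paste(tokens, collapse = " ")
  }, USE.NAMES = FALSE)
}
expand_slang("Luv this show BFF", slang)
# [1] "love this show best friend"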

I'm trying to create a dictionary of brands and then clean a transaction string to extract only the brand name

I'm working with gsub to erase every word after a brand in the dictionary, but how can I erase the words before it too?
Hi, I'm trying to clean transactions to look clearly at the brands that the clients use. This is an example using gsub and erasing every word after the brand "cabify":
tabla1_texto <- "exppcabify u.s.2313; 1212; 534"
tabla1_texto <- gsub("cabify", "cabify-", tabla1_texto)
tabla1_texto <- gsub(";", " ;",tabla1_texto)
tabla1_texto <- gsub("-\\S* ","", tabla1_texto)
This erases every character up to the ";"; how can I delete the "expp" too?
Does someone also know how I can create a dictionary of brands automatically?
Thanks
To delete the prior word, you can use:
gsub("\\w+(?=cabify)", "", tabla1_texto, perl = TRUE)
To delete everything before, you can use:
gsub(".*(?=cabify)", "", tabla1_texto, perl = TRUE)
A starting point for a "dictionary" could be:
brands <- c("cabify", "thundersausage")
for (brand in brands) {
tabla1_texto <- gsub(brand, paste0(brand, "-"), tabla1_texto)
tabla1_texto <- gsub(";", " ;",tabla1_texto)
tabla1_texto <- gsub("-\\S* ","", tabla1_texto)
tabla1_texto <- gsub(paste0("\\w+(?=", brand, ")"), "", tabla1_texto, perl = TRUE)
}
tabla1_texto # view the result
