Twitter slang lookup in R

I am writing an R script to analyse the sentiment of tweets. I am using the twitteR and ROAuth packages to get tweets based on some search keywords. I am using the code below to achieve this.
library(twitteR)
library(ROAuth)
library(httr)
# Set API Keys
api_key <- "xxxxxx"
api_secret <- "yyyyyy"
acs_token <- "aaxxbbbb"
access_token_secret <- "xyyzziiassss"
setup_twitter_oauth(api_key, api_secret, acs_token, access_token_secret)
# Grab latest tweets
tweets_results <- searchTwitter('xfinity x1 netflix', n=1500)
# Loop over tweets and extract text
feed_results = lapply(tweets_results, function(t) t$getText())
Now I am using the following function to clean up the tweets:
clean_text = function(x)
{
x = gsub("rt", "", x) # remove retweet marker "rt"
x = gsub("#\\w+", "", x) # remove hashtags (#)
x = gsub("[[:punct:]]", "", x) # remove punctuation
x = gsub("[[:digit:]]", "", x) # remove numbers/digits
x = gsub("http\\w+", "", x) # remove http links
x = gsub("[ |\t]{2,}", "", x) # remove runs of spaces/tabs
x = gsub("^ ", "", x) # remove blank spaces at the beginning
x = gsub(" $", "", x) # remove blank spaces at the end
try.error = function(z) # convert the text to lowercase, catching encoding errors
{
y = NA
try_error = tryCatch(tolower(z), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(z)
return(y)
}
x = sapply(x, try.error)
return(x)
}
Now, after this clean-up is done, certain Twitter slang words remain (like "Luv", "BFF", "BAE", etc.). For effective sentiment analysis these slang words need to be transformed into standard English words. I was hoping to find a dictionary in R that would help me achieve this, but didn't find one. Does anyone know of such a dictionary? If not, can someone suggest the best way to get around this problem?

Here are some useful resources -
Acronyms
Jargons
More Slang
You can download the data and use it as a dictionary or lookup. Don't forget to remove stop words and perform stemming.
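As an illustration, once you have such a list, one simple approach is a named lookup vector applied with gsub. This is a minimal sketch; the entries below are placeholders for whatever dictionary you build from the resources above, and it assumes the text has already been lowercased by clean_text():
slang_lookup <- c("luv" = "love", "bff" = "best friend", "bae" = "significant other")
replace_slang <- function(text, lookup) {
  for (slang in names(lookup)) {
    # \\b restricts the replacement to whole words
    text <- gsub(paste0("\\b", slang, "\\b"), lookup[[slang]], text)
  }
  text
}
replace_slang("luv this show bff", slang_lookup)
# [1] "love this show best friend"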

Related

Using regex and/or removing duplicates

I am scraping a website and, as a result, I have half-cleaned data:
[3] "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%"*
Above is one example, and I am trying to remove the number before or after each ♠ character.
Desired output is:
[3] "2:2: Texas:28,,845:25,46,5:4.4%:36:55,32:9,23:698,53:8.68%"
Basically, I want to remove the number between each ♠ and the following colon, including the ♠ itself.
I would greatly appreciate any help. I have tried the following, but neither worked.
str_replace_all(dataSet, "♠*:", "", fixed = T)
gsub("*♠", "", data, fixed = T)
library(rvest)
library(stringr)
website <- read_html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population")
results <- website %>% html_nodes("table")
data_body <- results[1] %>% html_nodes("tbody")
rows <- data_body %>% html_nodes("tr")
rows_text <- rows %>% html_text() # extract the raw cell text from each row
clean_rows_text <- str_replace_all(rows_text, "[7000100000000000000]", "")
clean_rows_text <- str_replace_all(clean_rows_text, "\n\n", ":")
clean_rows_text <- str_replace_all(clean_rows_text, "\n", "")
Desired output is:
[3] "2:2: Texas:28,,845:25,46,5:4.4%:36:55,32:9,23:698,53:8.68%"
From this point, I can handle the rest.
This should do it:
data <- "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%*"
gsub("♠.+?(?=:)", "", data, perl=T)

Twitter sentiment analysis with twitteR: all scores are zero?

I'm new to Twitter sentiment analysis with twitteR, and I used the positive.txt and negative.txt word lists from Hu and Liu. Everything ran smoothly, but the scores for over 1000 tweets all turned out to be neutral (score = 0). I can't figure out what went wrong; any help is greatly appreciated!
setup_twitter_oauth(consumer_key, consumer_secret, token, token_secret)
#Get tweets about "House of Cards", due to the limitation, we'll set n=1500
netflix.tweets<- searchTwitter("#HouseofCards",n=1500)
tweet=netflix.tweets[[1]]
tweet$getScreenName()
tweet$getText()
netflix.text=laply(netflix.tweets,function(t)t$getText())
head(netflix.text)
write(netflix.text, "HouseofCards_Tweets.txt", ncolumns = 1)
#loaded the positive and negative.txt from Hu and Liu
positive <- scan("/users/xxx/desktop/positive_words.txt", what = character(), comment.char = ";")
negative <- scan("/users/xxx/desktop/negative_words.txt", what = character(), comment.char = ";")
#add positive words
pos.words =c(positive,"miss","Congratulations","approve","watching","enlightening","killing","solid")
scoredsentiment <- function(hoc.vec, pos.word, neagtive)
{
clean <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "",hoc.vec)
clean <- gsub("^\\s+|\\s+$", "", clean)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[^[:graph:]]", "", clean)
clean <- gsub("[[:cntrl:]]", "", clean)
clean <- gsub("#\\w+", "", clean)
clean <- gsub("\\d+", "", clean)
clean <- tolower(clean)
hoc.list <- strsplit(clean, "")
hoc=unlist(hoc.list)
pos.matches = match(hoc, pos.words)
scoredpositive <- sapply(hoc.list, function(x) sum(!is.na(match(pos.matches, positive))))
scorednegative <- sapply(hoc.list, function(x) sum(!is.na(match(x, negative))))
hoc.df <- data.frame(score = scoredpositive - scorednegative, message = hoc.vec, stringsAsFactors = F)
return (hoc.df)
}
twitter_scores <- scoredsentiment(netflix.text, scoredpositive, scorednegative)
print(twitter_scores)
write.csv(twitter_scores, file=paste('twitter_scores.csv'), row.names=TRUE)
#draw a graph to show the final outcome
hist(twitter_scores$score)
qplot(twitter_scores$score)
Everything runs, but the score for each tweet is the same (score = 0).
You can use Microsoft Cognitive Services to calculate sentiment scores.
The Microsoft Cognitive Services Text Analytics API can detect sentiment, key phrases, topics, and language in your text.
Refer to this link to use Microsoft Cognitive Services from R, and this link for sentiment analysis in R.
From your code, I don't think a simple match will work. You need some form of fuzzy matching scheme: with match you need the exact word to occur, which will not happen often, and furthermore you are matching a single word against a string of words.
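For reference, a word-level scorer in the style of the widely used Breen approach (a sketch, not a drop-in replacement for the function above) splits each cleaned tweet on whitespace and counts hits against the two word lists:
score_sentiment <- function(tweets, pos.words, neg.words) {
  sapply(tweets, function(tweet) {
    # strip punctuation, control characters and digits, then lowercase
    clean <- tolower(gsub("[[:punct:][:cntrl:][:digit:]]", " ", tweet))
    # split on whitespace (note: strsplit(clean, "") would split into single characters)
    words <- unlist(strsplit(clean, "\\s+"))
    # match() returns NA for words not found in the list
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  }, USE.NAMES = FALSE)
}
# e.g. scores <- score_sentiment(netflix.text, pos.words, negative)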

I'm trying to create a dictionary of brands and then clean a transaction string to extract only the brand name

I'm working with gsub to erase every word after a brand in the dictionary, but how can I also erase the words before it?
Hi, I'm trying to clean transactions to look clearly at the brands that the clients use. This is an example using gsub, erasing every word after the brand "cabify":
tabla1_texto <- "exppcabify u.s.2313; 1212; 534"
tabla1_texto <- gsub("cabify", "cabify-", tabla1_texto)
tabla1_texto <- gsub(";", " ;",tabla1_texto)
tabla1_texto <- gsub("-\\S* ","", tabla1_texto)
This erases every character up to the ";"; how can I delete the "expp" too?
Also, does anyone know how I can create a dictionary of brands automatically?
Thanks
To delete the prior word, you can use:
gsub("\\w+(?=cabify)", "", tabla1_texto, perl = TRUE)
To delete everything before, you can use:
gsub(".*(?=cabify)", "", tabla1_texto, perl = TRUE)
A starting point for a "dictionary" could be:
brands <- c("cabify", "thundersausage")
for (brand in brands) {
tabla1_texto <- gsub(brand, paste0(brand, "-"), tabla1_texto)
tabla1_texto <- gsub(";", " ;",tabla1_texto)
tabla1_texto <- gsub("-\\S* ","", tabla1_texto)
tabla1_texto <- gsub(paste0("\\w+(?=", brand, ")"), "", tabla1_texto, perl = TRUE)
}
tabla1_texto # view the result

Search-and-replace on a set of columns - getting an error trying to gsub

this is a follow-up to this question: Search-and-replace on a list of strings - gsub eapply?
I have the following code:
library(quantmod)
library(stringr)
stockData <- new.env()
stocksLst <- c("AAB.TO", "BBD-B.TO", "BB.TO", "ZZZ.TO")
nrstocks = length(stocksLst)
startDate = as.Date("2016-09-01")
for (i in 1:nrstocks) {
getSymbols(stocksLst[i], env = stockData, src = "yahoo", from = startDate)
}
stockData = as.list(stockData)
names(stockData) = gsub("[.].*$", "", names(stockData))
names(stockData) = gsub("-", "", names(stockData))
symbolsLstCl <- ls(stockData)
The last post got me this far and I greatly appreciate the help. Now, I am trying to do a similar replace for the column names as quantmod includes the symbol name in the columns:
colnames(stockData$ZZZ)
# [1] "ZZZ.TO.Open" "ZZZ.TO.High" "ZZZ.TO.Low" "ZZZ.TO.Close" "ZZZ.TO.Volume" "ZZZ.TO.Adjusted"
I can easily update one of the xts objects using colnames, but I want to include this in a loop so I can do it to all. This is what I had tried, but it fails:
eval(parse(text = paste0("colnames(stockData$", symbolsLstCl[i], ")"))) <- eval(parse(text = (paste0("str_replace(colnames(stockData$", symbolsLstCl[i], "), ", "\".TO\", ", "\"\")"))))
Which I find strange, as if I use this (where the left side is hard-coded), it works:
colnames(stockData$ZZZ) <- eval(parse(text = (paste0("str_replace(colnames(stockData$", symbolsLstCl[i], "), ", "\".TO\", ", "\"\")"))))
I have a sneaking suspicion that there is a much better way to update all of the columns for each element in these lists... any suggestions are appreciated. Thanks, Adam
allnames <- lapply(stockData, function(x) gsub(".TO", "", names(x)))
# replace column names
for (i in 1:length(stockData)) {
names(stockData[[i]]) <- allnames[[i]]
}
# print all column names
for (i in 1:length(stockData)) {
print(names(stockData[[i]]))
}
[1] "AAB.Open" "AAB.High" "AAB.Low" "AAB.Close" "AAB.Volume" "AAB.Adjusted"
[1] "BBD-B.Open" "BBD-B.High" "BBD-B.Low" "BBD-B.Close" "BBD-B.Volume" "BBD-B.Adjusted"
[1] "ZZZ.Open" "ZZZ.High" "ZZZ.Low" "ZZZ.Close" "ZZZ.Volume" "ZZZ.Adjusted"
[1] "BB.Open" "BB.High" "BB.Low" "BB.Close" "BB.Volume" "BB.Adjusted"
Edited: the output shown above was not correct earlier; it has been fixed.
I suppose this is what you hope to get.
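If you want to skip the separate allnames list, the same rename can be done in one pass. This is a sketch along the same lines; fixed = TRUE makes the "." match literally:
stockData <- lapply(stockData, function(x) {
  colnames(x) <- gsub(".TO", "", colnames(x), fixed = TRUE)
  x
})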

How do I clean twitter data in R?

I extracted tweets from Twitter using the twitteR package and saved them into a text file.
I have carried out the following transformations on the corpus:
xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,stripWhitespace, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,removePunctuation, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,strip_retweets, lazy=TRUE, 'mc.cores=1')
xx<-tm_map(xx,removeWords,stopwords("english"), lazy=TRUE, 'mc.cores=1')
(using mc.cores=1 and lazy=TRUE, as otherwise R on Mac runs into errors)
tdm<-TermDocumentMatrix(xx)
But this term document matrix has a lot of strange symbols, meaningless words and the like.
If a tweet is
RT @Foxtel: One man stands between us and annihilation: @IanZiering.
Sharknado 3: OH HELL NO! - July 23 on Foxtel #SyfyAU
After cleaning the tweet I want only proper, complete English words to be left, i.e. a sentence/phrase stripped of everything else (user names, shortened words, URLs).
example:
One man stands between us and annihilation oh hell no on
(Note: the transformation commands in the tm package can only remove stop words, punctuation, and whitespace, and convert to lowercase.)
Using gsub and the stringr package, I have figured out part of the solution for removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.
clean_tweet = gsub("&amp", "", unclean_tweet)
clean_tweet = gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", clean_tweet)
clean_tweet = gsub("#\\w+", "", clean_tweet)
clean_tweet = gsub("[[:punct:]]", "", clean_tweet)
clean_tweet = gsub("[[:digit:]]", "", clean_tweet)
clean_tweet = gsub("http\\w+", "", clean_tweet)
clean_tweet = gsub("[ \t]{2,}", "", clean_tweet)
clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)
ref: (Hicks, 2014)
After the above, I did the following:
# Get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet," {2,}"," ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
# Take out retweet header, there is only one
clean_tweet <- str_replace(clean_tweet,"RT #[a-z,A-Z]*: ","")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet,"#[a-z,A-Z]*","")
# Get rid of references to other screennames
clean_tweet <- str_replace_all(clean_tweet,"#[a-z,A-Z]*","")
ref: (Stanton 2013)
Before doing any of the above, I collapsed the whole thing into a single long character string using the line below.
paste(mytweets, collapse=" ")
This cleaning process has worked quite well for me, as opposed to the tm_map transforms.
All that I am left with now is a set of proper words and a very few improper words.
Now I only have to figure out how to remove the non-proper English words.
Probably I will have to subtract my set of words from a dictionary of words.
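For that last step, one option (a sketch, assuming you are willing to use the qdapDictionaries package and its GradyAugmented word list) is to keep only the words found in the dictionary:
library(qdapDictionaries)
# split the cleaned, collapsed text into words and keep only recognised English words
words <- unlist(strsplit(clean_tweet, "\\s+"))
english_only <- words[tolower(words) %in% GradyAugmented]
paste(english_only, collapse = " ")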
library(tidyverse)
clean_tweets <- function(x) {
x %>%
# Remove URLs
str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
# Remove mentions e.g. "@my_account"
str_remove_all("@[[:alnum:]_]{4,}") %>%
# Remove hashtags
str_remove_all("#[[:alnum:]_]+") %>%
# Replace "&" character reference with "and"
str_replace_all("&", "and") %>%
# Remove punctuation, using a standard character class
str_remove_all("[[:punct:]]") %>%
# Remove "RT: " from beginning of retweets
str_remove_all("^RT:? ") %>%
# Replace any newline characters with a space
str_replace_all("\\\n", " ") %>%
# Make everything lowercase
str_to_lower() %>%
# Remove any trailing whitespace around the text
str_trim("both")
}
tweets %>% clean_tweets
To remove the URLs you could try the following:
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
xx <- tm_map(xx, content_transformer(removeURL))
Possibly you could define similar functions to further transform the text.
For me, this code did not work, for some reason:
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
The error was:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
So, instead, I used
clean_tweet4 <- str_replace_all(clean_tweet3, "https://t.co/[a-z,A-Z,0-9]*","")
clean_tweet5 <- str_replace_all(clean_tweet4, "http://t.co/[a-z,A-Z,0-9]*","")
to get rid of URLs
The code below does some basic cleaning.
Convert to lowercase:
df <- tm_map(df, content_transformer(tolower))
Remove punctuation:
df <- tm_map(df, removePunctuation)
Remove numbers:
df <- tm_map(df, removeNumbers)
Remove common stop words:
df <- tm_map(df, removeWords, stopwords('english'))
Remove URLs:
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
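Then apply it to the corpus in the same way as the other transformations (assuming df is a tm corpus):
df <- tm_map(df, content_transformer(removeURL))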
