Removing Apostrophes while text mining, or don't - r

This is absolutely driving me nuts, and I'm ashamed to say that I've spent the last 3 hours trying to figure this out.
I'm mining Twitter data, and I'd like to do some text analysis, but words like "doesn't" are throwing me off. I either want to keep the ' or replace it with an empty string (""). I've tried:
tweets$text <- gsub("\'", "", tweets$text)
tweets$text <- gsub("\\'", "", tweets$text)
tweets$text <- gsub("'", "", tweets$text)
tweets$text <- gsub("\W", "", tweets$text)
What I WANT is doesn't -> doesnt, OR to leave it as doesn't.
I want to remove the rest of the special characters, but because what comes after the ' changes the word, I want to keep that part. Later in the code I'm using gsub("[^A-Za-z]", " ", twt_txt_url) to clean out the special characters.
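To illustrate, keeping the apostrophe inside that character class would let it survive the special-character pass (a sketch, not verified against the full pipeline):
# Sketch: keep letters and apostrophes, replace everything else with a space
twt_txt_chrs <- gsub("[^A-Za-z']", " ", twt_txt_url)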
I shared this code earlier for a different question, but it should still get the point across. Note that this is split into two code blocks: one for pulling the data, and one showing how I'm cleaning it.
PULLING DATA:
library(rtweet)
library(tidyverse)
library(httpuv)
# API access keys
app_name = "app_name"
consumer_key <- "consumer_key"
consumer_secret <- "consumer_secret"
access_token <- "access_token"
access_secret <- "access_secret"
# Create twitter connection
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)
# who do we want to observe
account <- "BillGates"
# 3200 is the max we can pull at once
account.timeline <- get_timeline(account, n=100, includeRts =TRUE)
# create data frame and csv from tweets
write_as_csv(account.timeline, "BillGates.csv", fileEncoding = "UTF-8")
CLEANING DATA:
library(tidyverse)
library(qdapRegex)
library(tm)
library(qdap)
library(wordcloud)
tweets <- read_csv("BillGates.csv")
# This is where I'm trying to remove the ' from words
tweets$text <- gsub("\\'","", tweets$text)
# Separate out the text column
twt_txt <- tweets$text
# remove URLs
twt_txt_url <- rm_twitter_url(twt_txt)
# remove special characters
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
# convert to a text corpus
twt_corpus <- twt_txt_chrs %>%
  VectorSource() %>%
  Corpus()
Here's some of the data that you can manipulate as well. It seems like I can remove the first ', but nothing after that.
df = data.frame(
  tweet = c(1, 2, 3, 4, 5),
  text = c(
    "Standing up for science has never been more important. Congratulations to Dr. Anthony Fauci and Dr. Salim Abdool Karim on receiving this honor.",
    "I've known and learned from #RonConway for more than 40 years. I'm glad to see #svangel team up with #bchesky to mentor and support companies working to create more economic empowerment opportunities for people across the world.",
    "This book has nothing to do with viruses or pandemics. But it is surprisingly relevant for these times. #exlarson provides a brilliant and gripping account of another era of widespread anxiety: the years 1940 and 1941.",
    "The season finale of our podcast features two incredible people who are using their positions as artists to change the world for the better.",
    "Like many people, I’ve tried to deepen my understanding of systemic racism in recent months. If you’re interested in learning more about the lives caught up in our country's justice system, I highly recommend #thenewjimcrow by Michelle Alexander."
  )
)
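Note that the sample text above contains the typographic right single quotation mark (U+2019, ’) rather than the ASCII apostrophe, so a plain gsub("'", "", ...) will never match it. A sketch that strips both variants (an assumption about the data, based on the rows shown):
# Sketch: remove both the ASCII apostrophe and the curly one (U+2019)
df$text <- gsub("[\u2019']", "", df$text)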

Related

How do I turn this code into a working function?

This is my data
[1] "the rooms were clean very comfortable and the staff was amazing they went over and beyond to help make our stay enjoyable i highly recommend this hotel for anyone visiting downtown "
[2] "excellent property and very convenient to activities front desk staff is extremely efficient, pleasant and helpful property is clean and has a fantastic old time charm "
[3] "the rooftop cafeteria of hotel was great. wen i say food was great "
I want to create a function that returns the count of positive sentiments per row. This is my code so far.
x=data$sentences[1:3]
library(tidytext)
library(tm)
bing <- get_sentiments("bing")
positive = bing %>% filter(sentiment %in% "positive")
positive = subset(positive, select = -c(sentiment))
positive = as.vector(positive$word)
positive = paste0(positive," ")
positive_reviews <- function(x) {
  data = as.vector(x)
  data = Corpus(VectorSource(data))
  data = tm_map(data, removePunctuation)
  data = as.character(data)
  positive_count = sapply(positive, function(x) str_count(data, x))
  return(sum(positive_count))
}
print(positive_reviews(x))
I am running into an error that I do not know how to fix.
Error: no function to return from, jumping to top level
How would I write my code to make this work?
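For reference, a minimal sketch of a per-row counter that skips the tm round-trip entirely (the helper name count_positive and the whitespace tokenization are my own choices, not taken from the question):
library(tidytext)
# Bing lexicon, positive words only
bing_positive <- subset(get_sentiments("bing"), sentiment == "positive")$word
count_positive <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))   # drop punctuation, lower-case
  # split each review on whitespace and count exact positive-word matches
  sapply(strsplit(x, "\\s+"), function(words) sum(words %in% bing_positive))
}
# count_positive(x) then returns one positive-word count per review.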

How can I replace emojis with text and treat them as single words?

I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
Dummy data set, reproducible by using dput() (including the force-to-lowercase step):
Emoji_struct <- c(
  list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
  list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
#The rest should be standard r packages for pre-processing
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lowercase would work, too.
Best regards. Stay safe. Stay healthy.
Answer
Replace the default conversion table in replace_emoji with a version where the spaces and punctuation are removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
replace_emoji(Emoji_struct[,1], emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!๐Ÿ˜ฎ that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: ๐Ÿ”ฅ", "list_element_2: โŒ") %>%
lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
# x y
#1: <e2><86><95> up-down arrow
#2: <e2><86><99> down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a> watch
#6: <e2><8c><9b> hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation. Note that this also allows you to add new escaped byte representations and an accompanying replacement string.
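For instance, a hedged sketch of appending a custom entry (the escaped byte string below is the UTF-8 sequence for U+1F92F, exploding head, and the label "explodinghead" is arbitrary; neither comes from the lexicon itself):
# Sketch: add one extra emoji-to-text mapping to the copied table
hash2 <- rbind(hash2, data.frame(x = "<f0><9f><a4><af>", y = "explodinghead"))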

R Regex seemingly not working properly in Linux

I'm trying to scrape the Fangraphs webpage of alphabetical player indices to get a single-column data frame of each letter reference.
I have been able to get the code below to successfully work on a Windows version of R 3.4.1, but cannot get it to work on the Linux side at all, and I can't figure out what exactly is going wrong/different.
library(XML)
# Scrape to get the webpage
url <- paste0("http://www.fangraphs.com/players.aspx?")
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
# Below this is where I can notice that the code is not operating the same
# as on my Windows machine. None of the gsub commands seem to impact
# the strings at all.
# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)
# Replacing patterns like "AzB Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)
# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)
From what I could find, the only real difference between Windows and Linux regex would be the line-break implementation, so I went back and checked whether that was making the difference... but still got no change.
I also tried substituting the POSIX-style "[[:space:]]" and "[[:upper:]]" notation with the more standardized "\s" to see if that would fix anything.
As for fixes, I know there are a handful of other packages I could look into to simply get the result I'm after, but more generally: are there differences in how Windows and Linux implement regex that I'm unaware of? And if so, how would I account for them in gsub to get the same result I get on Windows?
Thanks.
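One thing that may be worth comparing between the two machines (a hedged aside, not a confirmed fix): POSIX classes such as [[:upper:]] follow the current locale, which often differs between a Windows and a Linux install, and gsub() can also be switched to the PCRE engine:
# Sketch: check the locale on each machine; POSIX classes follow it
Sys.getlocale("LC_CTYPE")
# Forcing the PCRE engine can make class behaviour more uniform across platforms
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters, perl = TRUE)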

Twitter Sentiment Analysis with twitteR, all scores are zero?

I'm new to Twitter sentiment analysis with twitteR, and I used the positive.txt and negative.txt word lists from Hu and Liu. I was glad that everything ran smoothly, but the scores for over 1,000 tweets all turned out to be neutral (score = 0). I can't figure out what went wrong; any help is greatly appreciated!
library(twitteR)
library(plyr)      # laply()
library(ggplot2)   # qplot()
setup_twitter_oauth(consumer_key, consumer_secret, token, token_secret)
#Get tweets about "House of Cards", due to the limitation, we'll set n=1500
netflix.tweets<- searchTwitter("#HouseofCards",n=1500)
tweet=netflix.tweets[[1]]
tweet$getScreenName()
tweet$getText()
netflix.text=laply(netflix.tweets,function(t)t$getText())
head(netflix.text)
write(netflix.text, "HouseofCards_Tweets.txt", ncolumns = 1)
#loaded the positive and negative.txt from Hu and Liu
positive <- scan("/users/xxx/desktop/positive_words.txt", what = character(), comment.char = ";")
negative <- scan("/users/xxx/desktop/negative_words.txt", what = character(), comment.char = ";")
#add positive words
pos.words =c(positive,"miss","Congratulations","approve","watching","enlightening","killing","solid")
scoredsentiment <- function(hoc.vec, pos.word, neagtive)
{
  clean <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", hoc.vec)
  clean <- gsub("^\\s+|\\s+$", "", clean)
  clean <- gsub("[[:punct:]]", "", clean)
  clean <- gsub("[^[:graph:]]", "", clean)
  clean <- gsub("[[:cntrl:]]", "", clean)
  clean <- gsub("#\\w+", "", clean)
  clean <- gsub("\\d+", "", clean)
  clean <- tolower(clean)
  hoc.list <- strsplit(clean, "")
  hoc = unlist(hoc.list)
  pos.matches = match(hoc, pos.words)
  scoredpositive <- sapply(hoc.list, function(x) sum(!is.na(match(pos.matches, positive))))
  scorednegative <- sapply(hoc.list, function(x) sum(!is.na(match(x, negative))))
  hoc.df <- data.frame(score = scoredpositive - scorednegative, message = hoc.vec, stringsAsFactors = F)
  return(hoc.df)
}
twitter_scores <- scoredsentiment(netflix.text, scoredpositive, scorednegative)
print(twitter_scores)
write.csv(twitter_scores, file=paste('twitter_scores.csv'), row.names=TRUE)
#draw a graph to show the final outcome
hist(twitter_scores$score)
qplot(twitter_scores$score)
Everything runs, but the score for each tweet is the same (score = 0).
You can use Microsoft Cognitive Services for calculating the sentiment scores.
The Microsoft Cognitive Services (Text Analytics) API can detect sentiment, key phrases, topics, and language from your text.
Refer to this link to use Microsoft Cognitive Services in R.
For sentiment analysis in R:
From your code, I don't think that a simple match will work. You need to use some form of fuzzy matching scheme. With match, you need the exact word to occur, which will not happen often; furthermore, you are matching a single word against a whole string of words.
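A minimal sketch of what word-level matching could look like instead (the function name score_sentiment is mine; it assumes positive and negative are character vectors of single words, as loaded above):
# Sketch: split each cleaned tweet into words, then count exact dictionary hits
score_sentiment <- function(sentences, pos.words, neg.words) {
  sapply(sentences, function(s) {
    s <- tolower(gsub("[[:punct:]]|[[:digit:]]", "", s))
    words <- unlist(strsplit(s, "\\s+"))        # words, not single characters
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  }, USE.NAMES = FALSE)
}
# e.g. scores <- score_sentiment(netflix.text, pos.words, negative)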

Web scraping techniques to obtain links that the website of interest contains

I am working with the following website:
http://www.crowdrise.com/skollsechallenge
Specifically, on this page there are 57 crowdfunding campaigns. Each of those crowdfunding campaigns has text that details why they want to raise money, the total money raised so far, and the team members. Some of the campaigns also specify the fundraising goal. I want to write some R code that will scrape and organize this information from each of the 57 sites.
For now, I am trying to scrape each of the 57 links that lead to the 57 different campaigns.
Below is the code I tried:
library("RCurl")
library("XML")
library("stringr")
url <- "http://www.crowdrise.com/skollSEchallenge"
cat("URL:", url)
url.data <- readLines(url)
doc <- htmlTreeParse(url.data, useInternalNodes=TRUE)
xp_exp <- "//a[@href]"
links <- xpathSApply(doc, xp_exp,xmlValue)
The variable links, however, does not contain links to the 57 websites... I am a little confused.
Can someone help me?
Thanks.
Using this, for example:
xpathApply(doc, '//*[@id="teams-results"]/div/div/div/h4/a',
           xmlGetAttr, 'href')
You will get the 16 links of the first page. But you still have the problem of activating the JavaScript behind "SHOW MORE TEAMS" to see the rest of the links.
This very ugly solution gets 32 of them; it is very verbose, but it does not need to evaluate JavaScript.
library(httr)
x <- as.character(GET("http://www.crowdrise.com/skollSEchallenge"))
x <- unlist(strsplit(x, split = "\n", fixed = TRUE))
x <- gsub("\t", "", grep('class="profile">', x, value = TRUE, fixed = TRUE))
x <- unlist(strsplit(x, split = 'class="profile">', fixed = TRUE))[-1]
x <- gsub("\r<div class=\"content\">\r<a href=\"/", "", x, fixed = TRUE)
x <- substr(x, 1, as.integer(regexpr('\"><img', x)) - 1)
x <- paste("www.crowdrise.com/", x, sep = '')
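Alternatively, if evaluating the JavaScript is acceptable, a hedged sketch with RSelenium (an assumption on my part, not part of the original answer; the link text and wait time are guesses about the page) could click "SHOW MORE TEAMS" before scraping:
library(RSelenium)
library(XML)
# Sketch: drive a real browser so the hidden teams get loaded before parsing
rd <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rd$client
remDr$navigate("http://www.crowdrise.com/skollSEchallenge")
remDr$findElement(using = "link text", "SHOW MORE TEAMS")$clickElement()
Sys.sleep(3)                                    # give the extra teams time to load
doc <- htmlParse(remDr$getPageSource()[[1]])
links <- xpathSApply(doc, '//*[@id="teams-results"]/div/div/div/h4/a', xmlGetAttr, 'href')
remDr$close(); rd$server$stop()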
