How to use unnest_token on twitter text data? - r

I'm trying to run the following and gives me an error message.
data <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- as.data.frame(data)
remove_reg <- "&|<|>"
tidy_data <- data_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token = "data_df") %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]"))
It gives me the following error message:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
argument str should be a character vector (or an object coercible to)"
How can I fix it?

The main problem is that you gave your text column the name data but then referred to it later as text. Try it something more like this:
library(tidyverse)
library(tidytext)
text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)
remove_reg <- "&|<|>"
data_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text) %>%
anti_join(get_stopwords()) %>%
filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#> word
#> <chr>
#> 1 said
#> 2 cant
#> 3 lil
#> 4 dance
#> 5 party
#> 6 stuck
#> 7 quarantine
#> 8 happy
#> 9 friday
#> 10 cousins
#> # … with 95 more rows
If you are specifically interested in Twitter data, consider using token = "tweets":
data_df %>%
unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#> word
#> <chr>
#> 1 who
#> 2 said
#> 3 we
#> 4 cant
#> 5 have
#> 6 a
#> 7 lil
#> 8 dance
#> 9 party
#> 10 while
#> # … with 111 more rows
Created on 2020-04-12 by the reprex package (v0.3.0)
This option handles hashtags and usernames well.

Related

Cannot identify html node for scraping in rvest

trying to grab links from a page for subsequent analysis and can only grab about 1/2 of them which may be due to filtering. I'm trying to extract the links highlighted here:
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)
#initiate session
session <- html_session("https://www.backlisted.fm/episodes")
#collect links for all episodes from the index page:
session %>%
read_html() %>%
html_nodes(".underline-body-links a") %>%
html_attr("href") %>%
tibble(link_temp = .) %>%
filter(str_detect(link_temp, pattern = "episodes/")) %>%
distinct()
#css:
#.underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt ahere
#result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.
Try this
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows

How to tokenize my dataset in R using the tidytext library?

I have been trying to follow Text Mining with R by Julia Silge, however, I cannot tokenize my dataset with the unnest_tokens function.
Here are the packages I have loaded:
# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)
Here is the dataset I tried to use which is online, so the results should be reproducible:
bible <- readLines('http://bereanbible.com/bsb.txt')
And here is where everything falls apart.
Input:
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
From what I have read about this error, in Rstudio, the issue is that the dataset needs to be a matrix, so I tried transforming the dataset into a matrix table and I received the same error message.
Input:
bible <- readLines('http://bereanbible.com/bsb.txt')
bible <- as.matrix(bible, nrow = 31105, ncol = 2 )
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
Any recommendations for what next steps I could take or maybe some good Text mining sources I could use as I continue to dive into this would be very much appreciated.
The problem is that readLines()creates a vector, not a dataframe, as expected by unnest_tokens(), so you need to convert it. It is also helpful to separate the verse to it's own column:
library(tidytext)
library(tidyverse)
bible_orig <- readLines('http://bereanbible.com/bsb.txt')
# Get rid of the copyright etc.
bible_orig <- bible_orig[4:length(bible_orig)]
# Convert to df
bible <- enframe(bible_orig)
# Separate verse from text
bible <- bible %>%
separate(value, into = c("verse", "text"), sep = "\t")
tidy_bible <- bible %>%
unnest_tokens(word, text)
tidy_bible
#> # A tibble: 730,130 x 3
#> name verse word
#> <int> <chr> <chr>
#> 1 1 Genesis 1:1 in
#> 2 1 Genesis 1:1 the
#> 3 1 Genesis 1:1 beginning
#> 4 1 Genesis 1:1 god
#> 5 1 Genesis 1:1 created
#> 6 1 Genesis 1:1 the
#> 7 1 Genesis 1:1 heavens
#> 8 1 Genesis 1:1 and
#> 9 1 Genesis 1:1 the
#> 10 1 Genesis 1:1 earth
#> # … with 730,120 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)

Rvest reading separated article data

I am looking to scrape article data from inquirer.net.
This is a follow-up question to Scrape Data through RVest
Here is the code that works based on the answer:
library(rvest)
#> Loading required package: xml2
library(tibble)
year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)
div <- read_html(url) %>% html_node(xpath = '//*[#id ="index-wrap"]')
links <- html_nodes(div, xpath = '//a[#rel = "bookmark"]')
post_date <- html_nodes(div, xpath = '//span[#class = "index-postdate"]') %>%
html_text()
test <- tibble(date = post_date,
text = html_text(links),
link = html_attr(links, "href"))
test
#> # A tibble: 261 x 3
#> date text link
#> <chr> <chr> <chr>
#> 1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#> 2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#> 3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#> 4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#> 5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#> 6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#> 7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#> 8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#> 9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows
I now want to add a new column to this output which has the full article for each row. Before doing the for-loop, I was investigating the html code for the first article: https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light
Digging into the html code, I'm noticing it is not that clean. From my findings so far, the main article data falls under #article_content , p. So my output right now is multiple rows separated and there is a lot of non-article data appearing. here is what I have currently:
article_data<-data.frame(test)
article_url<- read_html(article_data[2, 3])
article<-article_url %>%
html_nodes("#article_content , p") %>%
html_text()
View(article)
I'm ok with this being multiple rows because I can just union the final result. But since there are other non-article items then it will mess up what I am trying to do (sentiment analysis).
Can someone please assist on how to clean this data so that the full article is next to each article link?
I could simply just union the results excluding the first row and last 2 rows but looking for a cleaner way because I want to do this for all article data and not just this one.
After a short look in the structure of the article page, I suggest using the css selector: ".article_align div p".
library(rvest)
library(dplyr)
url <- "https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light"
read_html(url) %>%
html_nodes(".article_align div p") %>%
html_text()

tidyverse: Combining a column with different length into exisitng tibble

I have tibble which looks like:
Review_Text
<chr>
Because it is a nice game
Best trump soumd board out there
Boring hated because it does not work when I get done
but you can make better game if game has unlimeted chemicals bottles
cant get pass loading screen
Can't play video
Casting from Note 3 to Roku 3 screen appears to start loading then back to Roku home screen. Roku software version 6.1 build 5604. It is up to date but still not able to cast Showbox. ..
Crashes all the time in the middle of the show. Whining ensues. Ugh.
Crashing
Does not work on tab 3
Doesn't work
Doesn't work with S7 which is unacceptable in this day and age.
Doesn't work... I absolutely hate it
Dont use this app battery consumers
Dose this work for snmsung I tried some many times 😡
😄I loved it so much I would recommend this to other families 😄
Every time i pressed apply it just took me to the home screen
Everytime it says collect on T.V. it won't obtain the magisword
Excellent!!! My grandchildren watch it all the time...
Feel like Lizzie McGuire 😂â\u009d¤
I want to remove the stopwords from the Review_Text and append the column (that does not have stopwords) with the existing tibble. I am using following code, to remove the stopwords:
no_stpwrd <- tibble(line = 1:nrow(tb), text = tb$Review_Text) %>%
unnest_tokens(word, text)%>%
anti_join(stop_words, by = c("word" = "word")) %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
Then I use the following command to merge the no_stpwrd with the existing tibble:
add_column(tb,no_stpwrd).
However, when I run the above command, it throws an error message because of mismatch between the number of rows tibble and no_stowrd have. There are few row values in tibble which contains the only stopword (for example, line 11 of tibble), so when I remove stopwords it returns null hence the number of rows reduced in a no_stpwrd column. Is there any way to fix the issue?
Instead of trying to use add_column() here, what you want to do is use a join.
library(tidyverse)
library(tidytext)
review_df <- tibble(review_text = c("Because it is a nice game",
"cant get pass loading screen",
"Because I don't",
"Dont use this app battery consumers")) %>%
mutate(line = row_number())
review_df
#> # A tibble: 4 x 2
#> review_text line
#> <chr> <int>
#> 1 Because it is a nice game 1
#> 2 cant get pass loading screen 2
#> 3 Because I don't 3
#> 4 Dont use this app battery consumers 4
no_stpwrd <- review_df %>%
unnest_tokens(word, review_text) %>%
anti_join(get_stopwords()) %>%
group_by(line) %>%
summarise(title = paste(word,collapse =' '))
#> Joining, by = "word"
no_stpwrd
#> # A tibble: 3 x 2
#> line title
#> <int> <chr>
#> 1 1 nice game
#> 2 2 cant get pass loading screen
#> 3 4 dont use app battery consumers
Notice that the third document is no longer there because it was made up of all stop words. It's time for a left_join().
review_df %>%
left_join(no_stpwrd)
#> Joining, by = "line"
#> # A tibble: 4 x 3
#> review_text line title
#> <chr> <int> <chr>
#> 1 Because it is a nice game 1 nice game
#> 2 cant get pass loading screen 2 cant get pass loading screen
#> 3 Because I don't 3 <NA>
#> 4 Dont use this app battery consumers 4 dont use app battery consumers
Created on 2020-03-20 by the reprex package (v0.3.0)

Sentiment analysis in R for cyrillic

I can't comment on this page where i found a function Sentiment Analysis Text Analytics in Russian / Cyrillic languages
get_sentiment_rus <- function(char_v, method="custom", lexicon=NULL, path_to_tagger = NULL, cl = NULL, language = "english") {
language <- tolower(language)
russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"
russ.char.no <- "[^\u0401\u0410-\u044F\u0451]"
if (is.na(pmatch(method, c("syuzhet", "afinn", "bing", "nrc",
"stanford", "custom"))))
stop("Invalid Method")
if (!is.character(char_v))
stop("Data must be a character vector.")
if (!is.null(cl) && !inherits(cl, "cluster"))
stop("Invalid Cluster")
if (method == "syuzhet") {
char_v <- gsub("-", "", char_v)
}
if (method == "afinn" || method == "bing" || method == "syuzhet") {
word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
if (is.null(cl)) {
result <- unlist(lapply(word_l, get_sent_values,
method))
}
else {
result <- unlist(parallel::parLapply(cl = cl, word_l,
get_sent_values, method))
}
}
else if (method == "nrc") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
lexicon <- dplyr::filter_(syuzhet:::nrc, ~lang == tolower(language),
~sentiment %in% c("positive", "negative"))
lexicon[which(lexicon$sentiment == "negative"), "value"] <- -1
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "custom") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "stanford") {
if (is.null(path_to_tagger))
stop("You must include a path to your installation of the coreNLP package. See http://nlp.stanford.edu/software/corenlp.shtml")
result <- get_stanford_sentiment(char_v, path_to_tagger)
}
return(result)
}
It gives an error
> mysentiment <- get_sentiment_rus(as.character(corpus))
Show Traceback
Rerun with Debug
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "NULL"
And the sentiment scores are equal to 0
> SentimentScores <- data.frame(colSums(mysentiment[,]))
> SentimentScores
colSums.mysentiment.....
anger 0
anticipation 0
disgust 0
fear 0
joy 0
sadness 0
surprise 0
trust 0
negative 0
positive 0
Could you please point out where a problem might be? Or suggest any other working method for sentiment analysis в R? Just wonder what package supports russian language.
I am looking for any working method for sentiment analysis of a text in russian.
It looks to me like your function did not really find any sentiment words in your text. This might have to do with the sentiment dictionary you are using. Instead of trying to repair this function, you might want to consider a tidy approach instead, which is outlined in the book "Text Mining with R. A Tidy Approach". The advantage is that it does not mind the cyrillic letters and that it is really easy to understand and tweak.
First, we need a dictionary with sentiment values. I found one on GitHub, which we can directly read into R:
library(rvest)
library(stringr)
library(tidytext)
library(dplyr)
dict <- readr::read_csv("https://raw.githubusercontent.com/text-machine-lab/sentimental/master/sentimental/word_list/russian.csv")
Next, let's get some test data to work with. For no particular reason, I use the Russian Wikipedia entry for Brexit and scrape the text:
brexit <- "https://ru.wikipedia.org/wiki/%D0%92%D1%8B%D1%85%D0%BE%D0%B4_%D0%92%D0%B5%D0%BB%D0%B8%D0%BA%D0%BE%D0%B1%D1%80%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B8_%D0%B8%D0%B7_%D0%95%D0%B2%D1%80%D0%BE%D0%BF%D0%B5%D0%B9%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%BE%D1%8E%D0%B7%D0%B0" %>%
read_html() %>%
html_nodes("body") %>%
html_text() %>%
tibble(text = .)
Now this data can be turned into a tidy format. I split the text into paragraphs first, so we can check sentiment scores for paragraphs individually.
brexit_tidy <- brexit %>%
unnest_tokens(output = "paragraph", input = "text", token = "paragraphs") %>%
mutate(id = seq_along(paragraph)) %>%
unnest_tokens(output = "word", input = "paragraph", token = "words")
The way a dictionary is used with tidy data is incredibly straightwoard from this point. You just combine the data frame with sentiment values (i.e., the dictionary) and the data frame with the words in your text. Where text and dictionary match, the sentiment value is added. All other values are dropped.
# apply dictionary
brexit_sentiment <- brexit_tidy %>%
inner_join(dict, by = "word")
head(brexit_sentiment)
#> # A tibble: 6 x 3
#> id word score
#> <int> <chr> <dbl>
#> 1 7 затяжной -1.7
#> 2 13 против -5
#> 3 22 популярность 5
#> 4 22 против -5
#> 5 23 нужно 1.7
#> 6 39 против -5
Instead of the value for each word, you probably prefer the values per paragraphs. This can easily be done by getting the mean for each paragraph:
# group sentiment by paragraph
brexit_sentiment %>%
group_by(id) %>%
summarise(sentiment = mean(score))
#> # A tibble: 25 x 2
#> id sentiment
#> <int> <dbl>
#> 1 7 -1.7
#> 2 13 -5
#> 3 22 0
#> 4 23 1.7
#> 5 39 -5
#> 6 42 5
#> 7 43 -1.88
#> 8 44 -3.32
#> 9 45 -3.35
#> 10 47 1.7
#> # … with 15 more rows
There are a couple of ways this approach could be improved if necessary:
to get rid of different word forms, you could lemmatize the words, making matches more likely
in case your text includes misspellings, you could consider matching words which are similar with e.g. fuzzyjoin
you can find or create a better dictionary than the one I pulled of the first page I found when googling "russian sentiment dictionary"

Resources