R - Count with tidytext data

I'm working on text mining with some Freud books from the Gutenberg project. When I try to do a sentiment analysis using the following code:
library(dplyr)
library(tidytext)
library(gutenbergr)

freud_books <- gutenberg_download(c(14969, 15489, 34300, 35875, 35877, 38219, 41214),
                                  meta_fields = "title")

tidy_books <- freud_books %>%
  unnest_tokens(word, text)

f_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(title, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
I get the error:
Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to binary operator.
I can see that the problem is in the last block, in the count function. Any help with this?

You should add line to your data with mutate after the inner_join, because it is not a column of your data; if you need it, you have to create it yourself. Pay attention to the mutate(line = row_number()) part: you can modify it if you need another way of assigning line numbers, and then use index = line %/% 80 in count.
Try this:
library(dplyr)
library(tidytext)
library(gutenbergr)

freud_books <- gutenberg_download(c(14969, 15489, 34300, 35875, 35877, 38219, 41214),
                                  meta_fields = "title")

tidy_books <- freud_books %>%
  unnest_tokens(word, text)

f_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(line = row_number()) %>%
  count(title, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
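One caveat worth adding (not from the original answer): row_number() after the join counts rows of the one-word-per-row data, not lines of the books. If you want the index to follow the original line numbers, as in the tidytext book, a sketch is to number the lines per title before tokenizing; this assumes tidyr is attached for spread():

tidy_books <- freud_books %>%
  group_by(title) %>%
  mutate(linenumber = row_number()) %>%  # one row per line of text at this point
  ungroup() %>%
  unnest_tokens(word, text)

f_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(title, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)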

Related

How to repeat a vector of strings N times in a dataframe using dplyr

I am working with a list of dataframes and want to create a new column with the names of the variables. There are three variables and each dataframe is 684 rows long, so I need the variable names to repeat 228 times. However, I can't get this to work.
Here is the snippet I am currently using:
empleo = lapply(lista.empleo, function(x) {
  x = x %>%
    read_excel(skip = 4) %>%
    head(23) %>%
    drop_na() %>%
    clean_names() %>%
    pivot_longer(!1,
                 names_to = 'fecha',
                 values_to = 'valor') %>%
    mutate(variable = rep(c('trabajadores', 'masa', 'salario'),
                          times = 228))
})
So far, I have tried to use mutate, but I get the following error:
Error in `mutate()`:
! Problem while computing `variable = rep(c("trabajadores", "masa", "salario"), times = 228)`.
x `variable` must be size 0 or 1, not 684.
I will add the structure of a sample df in the comments since it is too big.
Thanks in advance for any help!
The rep may fail because some datasets in the list have a different number of rows. Use length.out to make sure it returns exactly n() elements (the number of rows):
library(readxl)
library(tidyr)
library(dplyr)
library(janitor)

empleo <- lapply(lista.empleo, function(x) {
  x = x %>%
    read_excel(skip = 4) %>%
    head(23) %>%
    drop_na() %>%
    clean_names() %>%
    pivot_longer(!1,
                 names_to = 'fecha',
                 values_to = 'valor') %>%
    mutate(variable = rep(c('trabajadores', 'masa', 'salario'),
                          228, length.out = n()))
})
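For a quick sense of what length.out does, here is a standalone illustration (not tied to your data):

rep(c('trabajadores', 'masa', 'salario'), times = 2)       # 6 values: the vector repeated twice
rep(c('trabajadores', 'masa', 'salario'), length.out = 7)  # 7 values: recycled and cut off at length 7

With length.out = n(), the labels are recycled to exactly the number of rows in each data frame, so the size mismatch error disappears.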

How do I remove a specific term from my dataframe string?

df <- dataframe$Data %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
Hey guys, I'm using these functions to get a word-frequency list (I'm working with a giant text), but I'm getting repeated words like "banana", "banana.", "banana?" etc., and they are counted separately. How do I remove the periods, question marks and other punctuation so that "banana" is counted correctly? Thanks!
Try using:
df <- dataframe$Data %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>%
  unlist() %>%
  gsub('[[:punct:]]', '', .) %>%
  table() %>%
  sort(decreasing = TRUE)
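As a quick sanity check on made-up tokens (not your data), stripping the punctuation collapses the variants into a single count:

words <- c("banana", "banana.", "banana?", "apple")
table(gsub('[[:punct:]]', '', words))
#  apple banana
#      1      3

One caveat: [[:punct:]] also removes apostrophes, so "don't" becomes "dont"; if that matters, use a narrower pattern such as '[.?!,;:]'.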

Cosine Similarity: Function Can't Calculate the Matrix

I am building a music recommender system using collaborative filtering in RStudio. I have a problem with the cosine similarity function: R reports "subscript out of bounds" on the matrix I want to calculate.
I am using the cosine similarity approach from this post: https://bgstieber.github.io/post/recommending-songs-using-cosine-similarity-in-r/
I've tried to fix the script, but the output still isn't working.
##cosinesim-crossprod
cosine_sim <- function(a, b) {crossprod(a, b) / sqrt(crossprod(a) * crossprod(b))}

##User data
play_data <- "https://static.turi.com/datasets/millionsong/10000.txt" %>%
  read_tsv(col_names = c('user', 'song_id', 'plays'))

##Song data
song_data <- read_csv("D:/3rd Term/DataAnalysis/dataSet/song_data.csv") %>%
  distinct(song_id, title, artist_name)

##Grouped
all_data <- play_data %>%
  group_by(user, song_id) %>%
  summarise(plays = sum(plays, na.rm = TRUE)) %>%
  inner_join(song_data)

top_1k_songs <- all_data %>%
  group_by(song_id, title, artist_name) %>%
  summarise(sum_plays = sum(plays)) %>%
  ungroup() %>%
  top_n(1000, sum_plays) %>%
  distinct(song_id)

all_data_top_1k <- all_data %>%
  inner_join(top_1k_songs)

top_1k_wide <- all_data_top_1k %>%
  ungroup() %>%
  distinct(user, song_id, plays) %>%
  spread(song_id, plays, fill = 0)

ratings <- as.matrix(top_1k_wide[, -1])

##Function
calc_cos_sim <- function(song_code = top_1k_songs,
                         rating_mat = ratings,
                         songs = song_data,
                         return_n = 5) {
  song_col_index <- which(colnames(ratings) == song_code) %>%
  cos_sims <- apply(rating_mat, 2, FUN = function(y)
    cosine_sim(rating_mat[, song_col_index], y))
  ##output
  data_frame(song_id = names(cos_sims), cos_sim = cos_sims) %>%
    filter(song_id != song_code) %>% # remove self reference
    inner_join(songs) %>%
    arrange(desc(cos_sim)) %>%
    top_n(return_n, cos_sim) %>%
    select(song_id, title, artist_name, cos_sim)
}
I expect that when I use this script:
shots <- 'SOJYBJZ12AB01801D0'
knitr::kable(calc_cos_sim(shots))
the output will be a data frame of 5 songs.
The pipe at the end of this line looks like a typo:
song_col_index <- which(colnames(ratings) == song_code) %>%
Replace it with:
song_col_index <- which(colnames(ratings) == song_code)
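Once the stray pipe is removed, you can convince yourself that cosine_sim itself behaves as expected by testing it on toy vectors (values made up for illustration):

cosine_sim <- function(a, b) {crossprod(a, b) / sqrt(crossprod(a) * crossprod(b))}
cosine_sim(c(1, 0, 1), c(1, 0, 1))  # 1: identical direction
cosine_sim(c(1, 0, 0), c(0, 1, 0))  # 0: orthogonal vectors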

Error: `by` required, because the data sources have no common variables

I am trying to apply the code from this link to my data:
https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join
The code in the book is:
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
I wrote it like the following (I excluded the "filter" step because my data has only filename and word columns):
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

abc %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
I get this error:
Error: by required, because the data sources have no common variables
Any ideas how to deal with it?
After running into a similar issue, this is what I found.
The complete code from the website is:
library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
The 'abc' dataset is unspecified in the question; however, it is easy to make up a substitute dataset that stores the words under a different column name, say 'DifferentColumnNameForWord'.
library(tidytext)
abc <- data.frame(DifferentColumnNameForWord = stop_words$word,
                  stringsAsFactors = FALSE)
The way to find which column of the data frame stores the words is the 'names' function:
> names(abc)
[1] "DifferentColumnNameForWord"
Once the name of the column is identified, the code needs to be modified as follows:
abc %>%
  inner_join(nrc_joy, by = c("DifferentColumnNameForWord" = "word")) %>%
  count(DifferentColumnNameForWord, sort = TRUE)
In my situation, one dataset had the words under the 'word' column while another had the words under the 'term' column.
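An alternative, if you prefer the default join behaviour, is to rename the column up front; a sketch using the same made-up column name:

abc %>%
  rename(word = DifferentColumnNameForWord) %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)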

R tidytext stop_words are not filtering consistently from gutenbergr downloads

This is a bizarre puzzle. I downloaded two texts with gutenbergr: Alice in Wonderland and Ulysses.
The stop_words disappear from Alice, but they are still in Ulysses.
The issue persisted even when I replaced anti_join with filter(!word %in% stop_words$word).
How do I get the stop_words out of Ulysses?
Thanks for your help!
[Plot: top 15 tf-idf terms for Alice's Adventures in Wonderland and Ulysses]
library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)

titles <- c("Alice's Adventures in Wonderland", "Ulysses")

books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = c("title", "author"))

data(stop_words)

tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(title, word, sort = TRUE) %>%
  ungroup()

plot_tidy_books <- tidy_books %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  mutate(title = factor(title, levels = unique(title)))

plot_tidy_books %>%
  group_by(title) %>%
  arrange(desc(n)) %>%
  top_n(15, tf_idf) %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~title, ncol = 2, scales = "free") +
  coord_flip()
After a bit of digging into the tokenized Ulysses, it turns out the token "it's" actually uses a right single quotation mark (U+2019) instead of an apostrophe, while stop_words in tidytext uses an apostrophe. You have to replace the right single quotation mark with an apostrophe.
I found this out by:
> utf8ToInt('it’s')
[1] 105 116 8217 115
Googling the 8217 led me here. From there it's as easy as grabbing the C++/Java escape \u2019 and adding a mutate/gsub step before your anti_join.
tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  mutate(word = gsub("\u2019", "'", word)) %>%
  anti_join(stop_words) %>%
  count(title, word, sort = TRUE) %>%
  ungroup()
This results in the expected plot, with the stop words now removed from Ulysses as well.
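If you want to double-check that the replacement worked, the code points now show 39, the ASCII apostrophe; stringr (already attached above) offers an equivalent replacement if you prefer it:

utf8ToInt(gsub("\u2019", "'", "it’s"))
# [1] 105 116  39 115

# equivalent with stringr:
# mutate(word = str_replace_all(word, "\u2019", "'"))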
