Error: `by` required, because the data sources have no common variables


I am trying to apply the code from this link to my own data:
https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join
The code in the book is:
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
I wrote it as follows (I left out "filter" because I have just filenames and words columns in my data):
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
abc %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
I get this error:
Error: `by` required, because the data sources have no common variables
Any ideas how to deal with it?

After running into a similar issue, this is what I found.
The complete code from the website is:
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)  # provides unnest_tokens(), get_sentiments() and stop_words
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
When no 'by' argument is given, inner_join() joins on the columns that have the same name in both data frames. nrc_joy has the columns 'word' and 'sentiment', so the error means that 'abc' has neither of them.
The 'abc' dataset is unspecified in the question; however, it is easy to make up a substitute dataset that stores its words in a column named 'differentColumnNameForWord'.
abc <- data.frame(differentColumnNameForWord = stop_words$word, stringsAsFactors = FALSE)
To find out which column of the data frame holds the words, use the 'names' function.
> names(abc)
[1] "differentColumnNameForWord"
Once the column name is identified, the code needs to be modified as follows:
abc %>%
  inner_join(nrc_joy, by = c("differentColumnNameForWord" = "word")) %>%
  count(differentColumnNameForWord, sort = TRUE)
In my situation, one dataset had the words under the 'word' column while another had the words under the 'term' column.
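A minimal sketch of the same fix on made-up data, assuming one table keeps its words in a column called term and the other in word. The toy_words and toy_lexicon tibbles are hypothetical stand-ins, not the real NRC lexicon.

```r
library(dplyr)

# Hypothetical stand-ins: the word column is named differently in each table
toy_words   <- tibble(term = c("happy", "table", "delight", "happy"))
toy_lexicon <- tibble(word = c("happy", "delight"), sentiment = "joy")

# by = c("term" = "word") matches toy_words$term against toy_lexicon$word
joy_counts <- toy_words %>%
  inner_join(toy_lexicon, by = c("term" = "word")) %>%
  count(term, sort = TRUE)

joy_counts
```

The joined result keeps the name from the left-hand table (term), which is why count() refers to term rather than word.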

Related

Classification issues between Rvest and Map_dfr during web-scrape

I'm currently scraping stats from a website, but on certain stat pages I hit a snag with the following error:
Error: Column avg can't be converted from numeric to character
I tried something like mutate(avg = avg %>% as.numeric), but then I get an error saying the column avg can't be found.
The issue in the code below occurs whenever I add stat_id 336 or 340. Any ideas on how to solve this?
library(dplyr)
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr)
df <- expand.grid(
  tournament_id = c("t464", "t054", "t047"),
  stat_id = c("02564", "101", "102", "336", "340")
) %>%
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y2019.eon.',
      tournament_id,
      '.html'
    )
  ) %>%
  as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = rank_this_week %>%
             as.character) %>%
    mutate(tournament)
}
# Retrieve the tables and bind them
test12 <- df %$%
  map2_dfr(links, tournament_id, get_info)
test12

You generally don't want to put a pipe inside of a dplyr verb, or at least I have never seen that done. I'm not sure why you need it in this example, as avg automatically parses as numeric on most pages. The error appears to come from binding tables in which avg is numeric on some pages and character on others, so the version below stores it as character wherever it exists. Try this instead:
# Function to get the table
get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d+")))
  # try() lets the function carry on when a page's table lacks the column
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  data
}
test12 <- df %>%
  mutate(tables = map2(links, tournament_id, get_info)) %>%
  tidyr::unnest(everything())
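As an alternative to try(), here is a minimal sketch that coerces the columns only where they exist, assuming dplyr 1.0 or later for across() and any_of(). The two small tibbles and the helper name to_chr are made up to stand in for scraped tables.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for two scraped tables whose avg column parsed differently
tables <- list(
  tibble(player = "A", avg = 71.5),    # parsed as numeric
  tibble(player = "B", avg = "n/a")    # parsed as character
)

# any_of() selects only the listed columns that are present,
# so tables without total_distance_feet pass through unchanged
to_chr <- function(tbl) {
  mutate(tbl, across(any_of(c("avg", "total_distance_feet")), as.character))
}

combined <- tables %>% map(to_chr) %>% bind_rows()
combined
```

Once every table has the same type for avg, bind_rows() no longer has to convert between numeric and character.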

How do I remove a specific term in my dataframe string?

df <- dataframe$Data %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
Hey guys, I'm using these functions to get a list of word frequencies (I'm working with a giant text), but I'm getting repeated words like "banana", "banana.", "banana?" etc., and they are counted separately. How do I delete the periods, question marks and other punctuation so that banana is summed correctly? Thanks!

Try removing the punctuation with gsub() before counting:
df <- dataframe$Data %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>%
  unlist() %>%
  gsub('[[:punct:]]', '', .) %>%
  table() %>%
  sort(decreasing = TRUE)
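A small base R sketch of the same idea on made-up input, to show what the gsub() step changes. The txt vector is a hypothetical example, not the asker's data.

```r
# Made-up input; NA stands in for a missing row
txt <- c("Banana banana. banana?", NA, "apple, banana!")

words <- unlist(strsplit(tolower(na.omit(txt)), split = " "))
words <- gsub("[[:punct:]]", "", words)  # "banana." and "banana?" both become "banana"
freq  <- sort(table(words), decreasing = TRUE)
freq
```

Note that [[:punct:]] also removes apostrophes and hyphens, so "don't" becomes "dont".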

Unnest and concatenate values in r

I am trying to unnest two columns that do not always have the same number of values per cell and then concatenate the values that correspond to each other across the two columns. For example:
library('dplyr')
library('tidyr')
#Sample Data
df <- data.frame(id = c(1:4),
                 first.names = c('Michael, Jim', 'Michael, Michael', 'Creed', 'Creed, Jim'),
                 last.names = c('Scott, Halpert', 'Scott, Cera', '', 'Halpert'))
Not all values in df$first.names are associated with a value in df$last.names. I am trying to get the following results:
#Desired output
df.results <- data.frame(id = c(1,1,2,2,3,4,4),
                         first.names = c('Michael', 'Jim', 'Michael', 'Michael', 'Creed', 'Creed', 'Jim'),
                         last.names = c('Scott', 'Halpert', 'Scott', 'Cera', '', '', 'Halpert'),
                         full.names = c('Michael Scott', 'Jim Halpert', 'Michael Scott', 'Michael Cera', 'Creed', 'Creed', 'Jim Halpert'))
I have tried using unnest; it works for first.names, but not for last.names (it drops the row where last.names is blank):
#convert to characters
df$first.names <- as.character(df$first.names)
df$last.names <- as.character(df$last.names)
#Unnest first names
df <- df %>%
  transform(first.names = strsplit(first.names, ',')) %>%
  unnest(first.names) %>%
  transform(last.names = strsplit(last.names, ',')) %>%
  unnest(last.names)
I was then going to delete duplicate lines, but that still does not solve the issue with the values in df$first.names that do not have a value in df$last.names.
Is there a better way to do this?

Check this solution:
library(tidyverse)
df %>%
  as_tibble() %>%
  mutate_at(2:3, ~ strsplit(as.character(.x), ',') %>% map(~ str_trim(.x))) %>%
  mutate(
    First = map2_chr(first.names, last.names, ~ paste(.x[1], .y[1])),
    Second = map2_chr(first.names, last.names, ~ paste(.x[2], .y[2]))
  ) %>%
  mutate_at(4:5, ~ str_remove_all(.x, 'NA') %>% str_trim()) %>%
  gather('x', 'full.names', First:Second) %>%
  filter(full.names != '') %>%
  mutate(
    first.names = map_chr(full.names, ~ str_split(.x, ' ')[[1]][1]),
    last.names = map_chr(full.names, ~ str_split(.x, ' ')[[1]][2]) %>%
      replace_na('')
  ) %>%
  select(-x) %>%
  arrange(id)
I could add logic so that, when there is only one last.names value, it is combined with the second first.names value to reproduce your desired output, but I don't think that is what you want in general. A vector indicating which first.names have no last.names would solve the problem.

Standardize column names in excel sheets before combining with purrr and readxl

I would like to compile an Excel file with multiple tabs labeled by year (2016, 2015, 2014, etc). Each tab has identical data, but column names may be spelled differently from year-to-year.
I would like to standardize columns in each sheet before combining.
This is the generic way of combining using purrr and readxl for such tasks:
combined.df <- excel_sheets(my.file) %>%
  set_names() %>%
  map_dfr(read_excel, path = my.file, .id = "sheet")
...however as noted, this creates separate columns for "COLUMN ONE", and "Column One", which have the same data.
Inserting make.names into the pipeline would probably be the best solution.
Keeping it all together would be ideal...something like:
combined.df <- excel_sheets(my.file) %>%
  set_names() %>%
  map(read_excel, path = my.file) %>%
  map(~(names(.) %>% #<---WRONG
          make.names() %>%
          str_to_upper() %>%
          str_trim() %>%
          set_names()) )
...but the syntax is all wrong.

Rather than defining your own function, the clean_names function from the janitor package may be able to help you. It takes a dataframe/tibble as an input and returns a dataframe/tibble with clean names as an output.
Here's an example:
library(tidyverse)
tibble(" a col name" = 1,
       "another-col-NAME" = 2,
       "yet another name " = 3) %>%
  janitor::clean_names()
#> # A tibble: 1 x 3
#>   a_col_name another_col_name yet_another_name
#>        <dbl>            <dbl>            <dbl>
#> 1          1                2                3
You can then plop it right into the code you gave:
combined.df <- excel_sheets(my.file) %>%
  set_names() %>%
  map(read_excel, path = my.file) %>% #<Import as list, not dfr
  map(janitor::clean_names) %>%       #<janitor::clean_names
  bind_rows(.id = "sheet")

Creating a new function is doable but is verbose and uses two maps:
# User defined function: col_rename
col_rename <- function(df){
  names(df) <- names(df) %>%
    str_trim() %>%      # trim first; make.names() would turn stray spaces into dots
    str_to_upper() %>%
    make.names()
  return(df)
}
combined.df <- excel_sheets(my.file) %>%
  set_names() %>%
  map(read_excel, path = my.file) %>% #<Import as list, not dfr
  map(col_rename) %>%                 #<Fix colnames (user defined function)
  bind_rows(.id = "sheet")
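To try the idea without an Excel file, here is a minimal sketch that uses two in-memory tibbles as hypothetical stand-ins for sheets read with read_excel(). The helper std_names is made up for illustration and uses base R's trimws() and toupper() so that only dplyr and purrr are needed.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for two sheets whose headers are spelled differently
sheets <- list(
  `2016` = tibble(`COLUMN ONE` = 1, `Column Two` = "a"),
  `2015` = tibble(`Column One` = 2, `column two ` = "b")
)

# Trim and upper-case before make.names() so the spelling variants collapse
std_names <- function(df) {
  names(df) <- make.names(toupper(trimws(names(df))))
  df
}

combined <- sheets %>% map(std_names) %>% bind_rows(.id = "sheet")
names(combined)
```

Because the names are standardized before bind_rows(), both sheets land in the same two data columns instead of four.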

R - Count with tidytext data

I'm working on text mining with some Freud books from the Gutenberg project. When I try to do a sentiment analysis, using following code:
library(dplyr)
library(tidyr)  # provides spread()
library(tidytext)
library(gutenbergr)
freud_books <- gutenberg_download(c(14969, 15489, 34300, 35875, 35877, 38219, 41214), meta_fields = "title")
tidy_books <- freud_books %>%
  unnest_tokens(word, text)
f_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(title, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
I get the error:
Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to binary operator.
I can see that the problem is in the last block, in the count function. Any help with this?

Your data has no column named line, so you have to create it yourself with mutate after the inner_join. Without that column, line in line %/% 80 resolves to R's built-in function stats::line, which is why you get the non-numeric argument error.
Note the mutate(line = row_number()) step: you can change it if you need another way of assigning line numbers. After that, index = line %/% 80 works in count.
Try this:
library(dplyr)
library(tidyr)  # provides spread()
library(tidytext)
library(gutenbergr)
freud_books <- gutenberg_download(c(14969, 15489, 34300, 35875, 35877, 38219, 41214),
                                  meta_fields = "title")
tidy_books <- freud_books %>%
  unnest_tokens(word, text)
f_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(line = row_number()) %>%
  count(title, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
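A toy sketch of how the line %/% n binning behaves, using a made-up table of five sentiment rows and a bin size of 3 instead of 80. The toy tibble is hypothetical. In the real data the rows come from unnest_tokens() followed by inner_join().

```r
library(dplyr)

# Made-up rows standing in for the joined tokens
toy <- tibble(
  title     = "Book",
  sentiment = c("positive", "negative", "positive", "positive", "negative")
)

binned <- toy %>%
  mutate(line = row_number()) %>%              # create the column that count() needs
  count(title, index = line %/% 3, sentiment)  # rows 1-2 fall in bin 0, rows 3-5 in bin 1

binned
```

Integer division groups consecutive rows into fixed-size chunks, so each index value covers one stretch of the text.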
