Classification issues between rvest and map_dfr during web scrape in R

I'm currently scraping stats from a website, but on certain stat pages I hit a snag with the following error:
Error: Column avg can't be converted from numeric to character
I tried something like mutate(avg = avg %>% as.numeric), but then I get an error that the column avg can't be found.
The issue in the code below occurs whenever I add stat_id 336 or 340. Any ideas on how to solve this?
library(dplyr)
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr)

df <- expand.grid(
  tournament_id = c("t464", "t054", "t047"),
  stat_id = c("02564", "101", "102", "336", "340")
) %>%
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y2019.eon.',
      tournament_id,
      '.html'
    )
  ) %>%
  as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = rank_this_week %>% as.character) %>%
    mutate(tournament)
}
# Retrieve the tables and bind them
test12 <- df %$%
  map2_dfr(links, tournament_id, get_info)
test12

You generally don't want to put a pipe inside a dplyr verb, or at least I have never seen that done. It also isn't clear why you need it in this example, since avg parses as numeric automatically. The underlying issue is that not all stat pages return the same column types, so map2_dfr fails when it binds a numeric avg from one page to a character avg from another. Try this instead:
# Function to get the table
get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d+")))
  # These columns only exist on some stat pages; coerce them when present
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  data
}
test12 <- df %>%
  mutate(tables = map2(links, tournament_id, get_info)) %>%
  tidyr::unnest(everything())
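If more stat pages with other mismatched columns turn up, a more generic variant (a sketch, not from the original answer; across() assumes dplyr >= 1.0) is to read every column as character inside the scraper and re-parse the types once after binding:

# Hedged sketch: normalize all columns to character, then let readr re-guess types
get_info_chr <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    mutate(across(everything(), as.character), tournament = tournament)
}

test12 <- df %$%
  map2_dfr(links, tournament_id, get_info_chr) %>%
  readr::type_convert()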

Related

Creating a function that loops through page numbers

I have a script that is importing data, which looks like this:
library(tidyverse)
library(rvest)
library(magrittr)

page_number <- 1:20
base_url <- read_html("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%2FViews%2FSkyNet%2FPlayerSportRanking%2F_SimpleSetForSeason.ascx&Page=1")
rankings <- base_url %>%
  html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" ") %>%
  unlist() %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  as.data.frame()
You will notice in the base_url, at the very end, it includes &Page=1. Well, I'm trying to do that for 20 pages, hence the:
page_number <- 1:20
What would be the most efficient way to loop those numbers into the URL without having to write 20 different sets of code?
You can use paste0 or sprintf to construct all the URLs:
all_urls <- paste0("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%2FViews%2FSkyNet%2FPlayerSportRanking%2F_SimpleSetForSeason.ascx&Page=", 1:20)
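The sprintf() equivalent is below; note that the literal % characters already in the URL must be doubled inside a format string:

all_urls <- sprintf("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%%2FViews%%2FSkyNet%%2FPlayerSportRanking%%2F_SimpleSetForSeason.ascx&Page=%d", 1:20)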
You can then iterate over each URL and extract the data needed.
library(tidyverse)
library(rvest)

rankings <- map(all_urls, ~ .x %>%
  read_html() %>%
  html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
  html_text() %>%
  str_trim() %>%
  str_split(" ") %>%
  unlist() %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  as.data.frame())
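If a single data frame is more convenient than a list of 20, you can stack the per-page results afterwards; all pages share the same four columns, so bind_rows() applies directly:

# Stack the 20 per-page data frames into one
rankings_combined <- bind_rows(rankings)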

Can I use purrr to execute a dplyr query and save the result of each query output

I have the following dataset:
combined <- data.frame(
  client = c('aaa','aaa','aaa','bbb','bbb','ccc','ccc','ddd','ddd','ddd'),
  type = c('norm','reg','opt','norm','norm','reg','opt','opt','opt','reg'),
  age = c('>50','>50','75+','<25','<25','>50','75+','25-50','25-50','75+'),
  cases = c('1','2','2','1','0','1','2','0','3','2'),
  IsActive = c('1','0','0','1','1','0','1','1','1','0')
)
And I have identified the unique variable combinations with:
# get unique variable combinations
unique_vars <- combined %>%
  select(1:3, 5) %>%
  distinct()
I am trying to iterate the query combined %>% anti_join(slice(unique_vars, 1)) using purrr, saving both the full output of each query and a summary of cases from each output back to the unique_vars table. The slice should step through each row of unique_vars rather than staying fixed at row 1.
I tried:
qry <- combined %>% anti_join(slice(unique_vars, 1))
map(.x = unique_vars %>% slice(.),
    ~ qry %>%
      summarise(CaseCnt = sum(cases)) %>%
      inner_join(.x))
My desired output would be two things:
1. the full output of each query
2. the new field CaseCnt added to the unique_vars data frame
Is this possible?
Although I don't completely follow the intuition behind your query, it seems that for #1 you would want:
lapply(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x))
})
And for #2 you would want:
unique_vars$CaseCnt <- lapply(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(as.numeric(cases)))
}) %>%
  do.call(what = rbind.data.frame, args = .)
Alternatively, for #2 with purrr::map_df():
unique_vars$CaseCnt <- map_df(1:nrow(unique_vars), function(x) {
  combined %>%
    anti_join(slice(unique_vars, x)) %>%
    summarise(CaseCnt = sum(as.numeric(cases)))
})
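If you want both outputs from a single pass over the rows, a combined sketch along the same lines (not in the original answer) is:

results <- lapply(1:nrow(unique_vars), function(x) {
  out <- combined %>% anti_join(slice(unique_vars, x))
  # keep the full query output and its case count together
  list(query = out, CaseCnt = sum(as.numeric(out$cases)))
})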
Just as an aside -- you could do this directly with:
combined %>%
  mutate(cases = as.numeric(cases)) %>%
  mutate(tot_cases = sum(cases)) %>%  # sum total cases across unique_ids
  group_by(client, type, age, IsActive) %>%
  summarize(CaseCnt = mean(tot_cases) - sum(cases))
Or, if what you were actually looking for is the sum of cases within each group:
combined %>%
  mutate(cases = as.numeric(cases)) %>%
  group_by(client, type, age, IsActive) %>%
  summarize(CaseCnt = sum(cases))

Cosine Similarity: Function Can't Calculate the Matrix

So, I am currently building a music recommender system using collaborative filtering in RStudio, and I have a problem with the cosine similarity function: it fails with "subscript out of bounds" on the matrix I want to calculate.
I took the cosine similarity code from this reference: https://bgstieber.github.io/post/recommending-songs-using-cosine-similarity-in-r/
I've tried to fix the script, but the output still isn't working.
## cosine_sim via crossprod
library(tidyverse)  # for read_tsv/read_csv, dplyr verbs, spread

cosine_sim <- function(a, b) {
  crossprod(a, b) / sqrt(crossprod(a) * crossprod(b))
}

## User data
play_data <- "https://static.turi.com/datasets/millionsong/10000.txt" %>%
  read_tsv(col_names = c('user', 'song_id', 'plays'))

## Song data
song_data <- read_csv("D:/3rd Term/DataAnalysis/dataSet/song_data.csv") %>%
  distinct(song_id, title, artist_name)

## Grouped
all_data <- play_data %>%
  group_by(user, song_id) %>%
  summarise(plays = sum(plays, na.rm = TRUE)) %>%
  inner_join(song_data)

top_1k_songs <- all_data %>%
  group_by(song_id, title, artist_name) %>%
  summarise(sum_plays = sum(plays)) %>%
  ungroup() %>%
  top_n(1000, sum_plays) %>%
  distinct(song_id)

all_data_top_1k <- all_data %>%
  inner_join(top_1k_songs)

top_1k_wide <- all_data_top_1k %>%
  ungroup() %>%
  distinct(user, song_id, plays) %>%
  spread(song_id, plays, fill = 0)

ratings <- as.matrix(top_1k_wide[, -1])
## Function
calc_cos_sim <- function(song_code = top_1k_songs,
                         rating_mat = ratings,
                         songs = song_data,
                         return_n = 5) {
  song_col_index <- which(colnames(ratings) == song_code) %>%
  cos_sims <- apply(rating_mat, 2, FUN = function(y)
    cosine_sim(rating_mat[, song_col_index], y))
  ## output
  data_frame(song_id = names(cos_sims), cos_sim = cos_sims) %>%
    filter(song_id != song_code) %>%  # remove self reference
    inner_join(songs) %>%
    arrange(desc(cos_sim)) %>%
    top_n(return_n, cos_sim) %>%
    select(song_id, title, artist_name, cos_sim)
}
I expect that when I use this script:
shots <- 'SOJYBJZ12AB01801D0'
knitr::kable(calc_cos_sim(shots))
the output will be a data frame of 5 songs.
The pipe at the end of this line looks like a typo:
song_col_index <- which(colnames(ratings) == song_code) %>%
Replace it with:
song_col_index <- which(colnames(ratings) == song_code)
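As a quick sanity check (not part of the original answer), cosine_sim() behaves as expected on toy vectors once the stray pipe is removed:

a <- c(1, 0, 1)
b <- c(1, 1, 0)
cosine_sim(a, b)  # crossprod(a, b) = 1, sqrt(2 * 2) = 2, so 0.5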

Error: `by` required, because the data sources have no common variables

I am trying to apply the code from this link to my data:
https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join
The code in the book is:

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
I wrote it like the following (I excluded the filter step because my data has only filenames and words columns):
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

abc %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
I get this error:
Error: by required, because the data sources have no common variables
Any ideas how to deal with it?
After running into a similar issue, this is what I found.
The complete code from the website is:
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)  # for unnest_tokens() and get_sentiments()

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
The 'abc' dataset is unspecified in the question; however, it is easy to make up a substitute dataset with a 'differentColumnNameForWord' column.
library(tidytext)
abc <- data.frame(differentColumnNameForWord = stop_words$word, stringsAsFactors = FALSE)
The way to find which column of a data frame holds the words is the 'names' function:
> names(abc)
[1] "differentColumnNameForWord"
Once the column name is identified, the code needs to be modified as follows:
abc %>%
  inner_join(nrc_joy, by = c("differentColumnNameForWord" = "word")) %>%
  count(differentColumnNameForWord, sort = TRUE)
In my situation, one dataset had the words under the 'word' column while another had the words under the 'term' column.
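An equivalent approach (not in the original answer) is to rename the column up front so that the default join by common variables works:

abc %>%
  rename(word = differentColumnNameForWord) %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)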

Web scraping with Rvest -- Return NA if node is not found?

I am a bit stuck here. I would like to scrape data from a website and extract a few things like user ratings, comments, etc., adding the data to a data frame.
Below is the code I have so far:
# Read html and select the URLs for each game review.
library(rvest)
library(dplyr)
library(plyr)

# Read the webpage and the number of ratings.
getGame <- function(metacritic_game) {
  total_ratings <- metacritic_game %>%
    html_nodes("strong") %>%
    html_text()
  total_ratings <- ifelse(length(total_ratings) == 0, NA,
                          as.numeric(strsplit(total_ratings, " ")[[1]][1]))
  # Get the game title and the platform.
  game_title <- metacritic_game %>%
    html_nodes("h1") %>%
    html_text()
  game_platform <- metacritic_game %>%
    html_nodes(".platform a") %>%
    html_text()
  game_platform <- strsplit(game_platform, " ")[[1]][57:58]
  game_platform <- gsub("\n", "", game_platform)
  game_platform <- paste(game_platform[1], game_platform[2], sep = " ")
  game_publisher <- metacritic_game %>%
    html_nodes(".publisher a:nth-child(1)") %>%
    html_attr("href") %>%
    strsplit("/company/") %>%
    unlist()
  game_publisher <- gsub("\\W", " ", game_publisher)
  game_publisher <- strsplit(game_publisher, "\\t")[[2]][1]
  release_date <- metacritic_game %>%
    html_nodes(".release_data .data") %>%
    html_text()
  user_ratings <- metacritic_game %>%
    html_nodes("#main .indiv") %>%
    html_text() %>%
    as.numeric()
  user_name <- metacritic_game %>%
    html_nodes(".name a") %>%
    html_text()
  review_date <- metacritic_game %>%
    html_nodes("#main .date") %>%
    html_text()
  user_comment <- metacritic_game %>%
    html_nodes("#main .review_section .review_body") %>%
    html_text()
  record_game <- data.frame(game_title = game_title,
                            game_platform = game_platform,
                            game_publisher = game_publisher,
                            username = user_name,
                            ratings = user_ratings,
                            date = review_date,
                            comments = user_comment)
}
metacritic_home <- read_html("https://www.metacritic.com/browse/games/score/metascore/90day/all/filtered")
game_urls <- metacritic_home %>%
  html_nodes("#main .product_title a") %>%
  html_attr("href")

get100games <- function(game_urls) {
  data <- data.frame()
  for (i in 1:length(game_urls)) {
    metacritic_game <- read_html(paste0("https://www.metacritic.com",
                                        game_urls[i], "/user-reviews"))
    record_game <- getGame(metacritic_game)
    data <- rbind.fill(data, record_game)
    print(i)
  }
  data
}
df100games <- get100games(game_urls)
Some of the links, though, do not have any user reviews, so rvest cannot find the node and I get the following error:
Error in data.frame(game_title = game_title, game_platform = game_platform, : arguments imply differing number of rows: 1, 0
I have tried to include ifelse statements such as:
username = ifelse(length(user_name) != 0, user_name, NA),
ratings = ifelse(length(user_ratings) != 0, user_ratings, NA),
date = ifelse(length(review_date) != 0, review_date, NA),
comments = ifelse(length(user_comment) != 0, user_comment, NA))
However, the data frame then contains only one review per game instead of all the reviews. Any thoughts on this?
Thanks!
You can use the function operator possibly from the purrr package:
df100games <- purrr::map(game_urls, purrr::possibly(get100games, NULL)) %>%
  purrr::compact() %>%
  dplyr::bind_rows()
I believe this will return your desired output.
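As an aside, the ifelse() attempts return only one review per game because ifelse() returns a result the same length as its condition, and length(user_name) != 0 has length 1. If you would rather get per-field NA placeholders than skip whole pages, a small helper along these lines (a sketch; text_or_na is a hypothetical name) avoids that truncation:

# Return the matched text, or NA when the selector matches nothing,
# without ifelse()'s length-1 result
text_or_na <- function(page, css) {
  out <- page %>% html_nodes(css) %>% html_text()
  if (length(out) == 0) NA_character_ else out
}

# e.g. inside getGame():
user_name <- text_or_na(metacritic_game, ".name a")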
