Creating a function that loops through page numbers - r

I have a script that it importing data, which looks like this:
library(tidyverse)
library(rvest)
library(magrittr)
page_number <- 1:20
base_url <- read_html("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%2FViews%2FSkyNet%2FPlayerSportRanking%2F_SimpleSetForSeason.ascx&Page=1")
rankings <- base_url %>% html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
unlist %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame
You will notice in the base_url, at the very end, it includes &Page=1. Well, I'm trying to do that for 20 pages, hence the:
page_number <- 1:20
What would be the most efficient way to loop those numbers into the URL without having to write 20 different sets of code?

You can use paste0 or sprintf to construct all the URL's
all_urls <- paste0("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?ViewPath=~%2FViews%2FSkyNet%2FPlayerSportRanking%2F_SimpleSetForSeason.ascx&Page=", 1:20)
You can then iterate over each URL and extract the data needed.
library(tidyverse)
library(rvest)
rankings <- map(all_urls, ~.x %>% read_html %>%
html_nodes(".meta , .score , .position , .rankings-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
unlist %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame)

Related

received Error in open.connection(x, "rb") : HTTP error 404. after running a for-loop in r

While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.
I feel like it has something to do with the first part of my for-loop, so I tried changing numbers from character to numeric, but that did not fix the problem. I also tried advice here, however, it returned more problems.
Think you can spot where I went wrong?
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)
numbers = as.numeric(numbers)
## create empty vector ----------------------------
master1.tree = data.frame()
## Create for loop ---------------------------------
for (i in length(numbers)){
url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()
gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text()
matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, data.frame(matches))
}
I also ran this, but it did not return the data frame for the scraped data. Instead it printed the results on the screen instead
map_df(get_links, function(i){
url <- read_html(i)
matches <- data.frame(ageDivision <- url %>%
html_nodes('.category-title__age-division') %>% html_text(),
gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text() )
master1.tree <- rbind(master1.tree, matches)
})
Here is an alternative to your code. First, it's not necessary to extract the numbers. You can directly loop over the vector get_links. Second, I use purrr::map_df for the looping part which is a more concise way than using the for loop. To this end I use a custom function to scrape one of your pages. Finally, I use trim=TRUE with html_text to remove the leading and trailing white space:
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
scrape_page <- function(url) {
html <- read_html(url)
data.frame(
division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
gender = html %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text(trim = TRUE)
)
}
master1.tree <- purrr::map_df(get_links[1:5], scrape_page)
master1.tree
#> division gender
#> 1 Master 1 Male
#> 2 Master 1 Male
#> 3 Master 1 Male
#> 4 Master 1 Male
#> 5 Master 1 Male
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)
numbers = as.numeric(numbers)
## create empty vector ----------------------------
master1.tree = data.frame()
## Create for loop ---------------------------------
for (i in numbers){
url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
ageDivision <- url %>%
html_nodes('.category-title__age-division') %>%
html_text()
gender <- url %>%
html_nodes('.category-title__age-division+ .category-title__label') %>%
html_text()
matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, matches)
}

Classification issues between Rvest and Map_dfr during web-scrape

I'm currently scraping stats from a website, but on certain stat pages I hit a snag with the following prompt:
Error: Column avg can't be converted from numeric to character
I try something like mutate(avg = avg %>% as.numeric), but then I get the prompt the column avg can't be found.
The issue in the code below occurs whenever I add stat_id 336 or 340. Any ideas on how to solve this?
library(dplyr)
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr)
df <- expand.grid(
tournament_id = c("t464", "t054", "t047"),
stat_id = c("02564", "101", "102", "336", "340")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/stat.',
stat_id,
'.y2019.eon.',
tournament_id,
'.html'
)
) %>%
as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
link %>%
read_html() %>%
html_table() %>%
.[[2]] %>%
clean_names() %>%
select(-rank_last_week ) %>%
mutate(rank_this_week = rank_this_week %>%
as.character) %>%
mutate(tournament)
}
# Retrieve the tables and bind them
test12 <- df %$%
map2_dfr(links, tournament_id, get_info)
test12
You generally don't want to put a pipe inside of a dplyr verb, or at least I have never before seen that done. Not sure why you need that in this example as average automatically parses as numeric. Try this instead:
# Function to get the table
get_info <- function(link, tournament_id) {
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]] %>%
clean_names() %>%
select(-rank_last_week ) %>%
mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d+")))
try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
data
}
test12 <- df %>%
mutate(tables = map2(links, tournament_id, get_info)) %>%
tidyr::unnest(everything())

How do i remove a specific term in my dataframe string?

df <- dataframe$Data %>%
na.omit() %>%
tolower() %>%
strsplit(split = " ") %>%
unlist() %>%
table() %>%
sort(decreasing = TRUE)
Hey guys, im using these functions to get a list of word frequency (im working with a giant text), but im getting repeated words like "banana" , "banana.", "banana?" etc. and they are counting separately. How do i delete the dots, interrogation and others to sum banana correctly? Thx!!!
Try using :
df <- dataframe$Data %>%
na.omit() %>%
tolower() %>%
strsplit(split = " ") %>%
unlist() %>%
gsub('[[:punct:]]', '', .) %>%
table() %>%
sort(decreasing = TRUE)

Web scraping with Rvest -- Return NA if node is not found?

I am a bit stuck here. I would like to scrape data from a website, and extract a few things like user ratings, comments etc.
I am trying to add the data to a data frame.
Below is the code i have so far:
# Read html and select the URLs for each game review.
library(rvest)
library(dplyr)
library(plyr)
# Read the webpage and the number of ratings.
getGame <- function(metacritic_game) {
total_ratings<- metacritic_game %>%
html_nodes("strong") %>%
html_text()
total_ratings <- ifelse(length(total_ratings) == 0, NA,
as.numeric(strsplit(total_ratings, " ") [[1]][1]))
# Get the game title and the platform.
game_title <- metacritic_game %>%
html_nodes("h1") %>%
html_text()
game_platform <- metacritic_game %>%
html_nodes(".platform a") %>%
html_text()
game_platform <- strsplit(game_platform," ")[[1]][57:58]
game_platform <- gsub("\n","", game_platform)
game_platform<- paste(game_platform[1], game_platform[2], sep = " ")
game_publisher <- metacritic_game %>%
html_nodes(".publisher a:nth-child(1)") %>%
html_attr("href") %>%
strsplit("/company/")%>%
unlist()
game_publisher <- gsub("\\W", " ", game_publisher)
game_publisher <- strsplit(game_publisher,"\\t")[[2]][1]
release_date <- metacritic_game %>%
html_nodes(".release_data .data") %>%
html_text()
user_ratings <- metacritic_game %>%
html_nodes("#main .indiv") %>%
html_text() %>%
as.numeric()
user_name <- metacritic_game %>%
html_nodes(".name a") %>%
html_text()
review_date <- metacritic_game %>%
html_nodes("#main .date") %>%
html_text()
user_comment <- metacritic_game %>%
html_nodes("#main .review_section .review_body") %>%
html_text()
record_game <- data.frame(game_title = game_title,
game_platform = game_platform,
game_publisher = game_publisher,
username = user_name,
ratings = user_ratings,
date = review_date,
comments = user_comment)
}
metacritic_home <-read_html("https://www.metacritic.com/browse/games/score/metascore/90day/all/filtered")
game_urls <- metacritic_home %>%
html_nodes("#main .product_title a") %>%
html_attr("href")
get100games <- function(game_urls) {
data <- data.frame()
i = 1
for(i in 1:length(game_urls)) {
metacritic_game <- read_html(paste0("https://www.metacritic.com",
game_urls[i], "/user-reviews"))
record_game <- getGame(metacritic_game)
data <-rbind.fill(data, record_game)
print(i)
}
data
}
df100games <- get100games(game_urls)
Some of the links, though, do not have any user reviews and as a result
rvest is not able to find the node and I get the following error: Error in data.frame(game_title = game_title, game_platform = game_platform, :
arguments imply differing number of rows: 1, 0.
I have tried to include ifelse statements such as:
username = ifelse(length(user_name) !=0 , user_name, NA),
ratings = ifelse(length(user_ratings) != 0,
user_ratings, NA),
date = ifelse(length(review_date) != 0,
review_date, NA),
comments = ifelse(length(user_comment) != 0,
user_comment, NA))
However, the data frame only returns one review per game instead of returning all the reviews.. Any thoughts on this?
Thanks..!
You can use the function operator possibly form the purrr package:
df100games <- purrr::map(game_urls, purrr::possibly(get100games, NULL)) %>%
purrr::compact() %>%
dplyr::bind_rows()
I believe this will return your desired output.

R Web Scraping - data frame

I am trying to create a data frame with the following variables. However, after using the SelectorGadget tool to determine the CSS selector needed to scrape this information, the vectors yield different values. Even after copying the selector straight from the HTML source code. If done correctly, this table should have 34 rows. Here it is my code and the respective error:
womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")
womens_opponents <- womens_bb %>%
html_nodes(".sidearm-schedule-game-opponent-name a") %>%
html_text()
womens_locations <- womens_bb %>%
html_nodes(".sidearm-schedule-game-location span:nth-child(1)") %>%
html_text()
womens_dates <- womens_bb %>%
html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>%
html_text()
womens_times <- womens_bb %>%
html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>%
html_text()
as.numeric()
womens_scores <- womens_bb %>%
html_nodes("div.sidearm-schedule-game-result span:nth-child(3)") %>%
html_text()
as.numeric()
womens_win_loss <- womens_bb %>%
html_nodes(".text-italic span:nth-child(2)") %>%
html_text() %>%
str_replace("\\,", "")
womens_df <- data_frame(
date = womens_dates, time = womens_times, opponent = womens_opponents, location = womens_locations, score = womens_scores, win_loss = womens_win_loss)
Error: Columns `date`, `time`, `opponent`, `score`, `win_loss` must be length 1 or 37, not 36, 36, 34, 34, 35
How can I resolve this issue?
I think there are some issue with the img tag. So to avoid, these you can first gather the global div tag ( 36 when i do the script), and do a loop inside to get your result. And perform a little if controle on tag that appear weirds :
womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")
divs <- womens_bb %>% html_nodes(".sidearm-schedule-game")
for (div in divs){
womens_opponents <- div %>%
html_nodes(".sidearm-schedule-game-opponent-name, .sidearm-schedule-game-opponent-name a") %>%
html_text
womens_opponents <- gsub("\\s{2,}","",womens_opponents[1])
womens_locations <- div %>%
html_nodes(".sidearm-schedule-game-location span:nth-child(1)") %>%
html_text()
womens_locations <- womens_locations[1]
womens_dates <- div %>%
html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>%
html_text()
womens_times <- div %>%
html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>%
html_text()
womens_scores <- div %>%
html_nodes("div.sidearm-schedule-game-result span:nth-child(3)") %>%
html_text()
if(length(womens_scores)==0) womens_scores = ""
womens_win_loss <- div %>%
html_nodes(".text-italic span:nth-child(2)") %>%
html_text()
womens_win_loss <- gsub("\\,", "",womens_win_loss)
res <- c(date = womens_dates, time = womens_times, opponent = womens_opponents, location = womens_locations, score = womens_scores, win_loss = womens_win_loss)
print(length(res))
df <- rbind(df,res)
}
Hope that will helps,
Gottavianoni

Resources