Same webscrape code works on one page, not another using rvest - r

I built a simple scrape to get a data frame with NFL draft results for 2020. I intend to use this code to map over several years of results, but for some reason, when I change the single-page scrape to any year other than 2020, I get the error at the bottom.
library(tidyverse)
library(rvest)
library(httr)
library(curl)
This scrape for 2020 works flawlessly, although the column names end up in row 1, which isn't a big deal to me as I can deal with it later (mentioning it though in case it has something to do with the problem):
x <- "https://www.pro-football-reference.com/years/2020/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_nodes("table") %>%
  html_table() %>%
  as.data.frame()
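As an aside, if you do want proper column names later, a minimal sketch for promoting that first row to the header (using the janitor package, which is not loaded above, and assuming the header text really is in row 1 as described):
library(janitor)
df <- df %>% row_to_names(row_number = 1) # promote row 1 to column names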
Below, the URL is changed from 2020 to 2019, which is an active page with a table of the same format. For some reason, the same call as above does not work:
x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_nodes("table") %>%
  html_table() %>%
  as.data.frame()
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 261, 2

There are two tables at the URL provided: the core draft (table 1, id = "drafts") and the supplemental draft (table 2, id = "drafts_supp").
The as.data.frame() call fails because it tries to combine the two tables, but they differ in both the number and the names of their columns. You can direct rvest to read just the table you are interested in by giving html_node() either an XPath or a CSS selector. You can find the XPath or selector by inspecting the table you want (right-click > Inspect in Chrome/Firefox). Note that for the CSS selector to match on id you need #drafts, not just drafts, and for the XPath you typically have to wrap the expression in single quotes.
This works: html_node(xpath = '//*[@id="drafts"]')
This doesn't, because the inner double quotes terminate the string early: html_node(xpath = "//*[@id="drafts"]")
Note that I believe the html_nodes("table") used in your example is unnecessary, as html_table() already selects only tables.
x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
raw_html <- read_html(x)
# use xpath
raw_html %>%
  html_node(xpath = '//*[@id="drafts"]') %>%
  html_table()
# use selector
raw_html %>%
  html_node("#drafts") %>%
  html_table()
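Since the eventual goal is to map several years of results, here is a minimal sketch of how the fixed selector could be looped over a vector of seasons. The year range and the Sys.sleep() pause are assumptions on my part, not something from the original question:
library(rvest)
library(purrr)
years <- 2015:2020  # hypothetical range of seasons
draft_tables <- map(set_names(years), function(yr) {
  Sys.sleep(3)  # pause between requests to be gentle with the site
  paste0("https://www.pro-football-reference.com/years/", yr, "/draft.htm") %>%
    read_html() %>%
    html_node("#drafts") %>%   # only the core draft table
    html_table()
})
# draft_tables is a named list with one data frame per season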

Rvest Pulls Empty Tables

The site I scrape data from has changed and I'm having issues pulling the data into table format. I used two different approaches below to try to get the tables, but both return blanks instead of tables.
I'm a novice when it comes to scraping and would appreciate the expertise of the group. Should I look for other solutions in rvest, or try to learn a program like RSelenium?
https://www.pgatour.com/stats/detail/02675
Scrape for Multiple Links
library("dplyr")
library("purrr")
library("rvest")
df23 <- expand.grid(
  stat_id = c("02568", "02674", "02567", "02564", "101")
) %>%
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/detail/',
      stat_id
    )
  ) %>%
  as_tibble()
# replaced tournament_id with stat_id
get_info <- function(link, stat_id){
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]]
}
test_main_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_main_stats <- test_main_stats %>%
  unnest(everything())
Alternative Code
url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
  html_nodes(".css-8atqhb") %>%
  html_table()
This page uses JavaScript to create the table, so rvest will not work on it directly. But if you examine the page's source code, all of the data is stored in JSON format in a "<script>" node.
This code finds that node and converts the JSON into a list. The rows element extracted at the end is the main table, but there is a wealth of other information contained in the JSON data structure.
# read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
# find the script with the correct id tag, strip the html code
datascript <- page %>%
  html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()
# convert from JSON
output <- jsonlite::fromJSON(datascript)
# explore the output
str(output)
# get the main table
answer <- output$props$pageProps$statDetails$rows
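A minimal sketch of wrapping this into a function and mapping it over the stat ids from the question (assuming the other stat pages store their data under the same JSON path; the Sys.sleep() pause is my own addition, and the results are kept in a list because the row structure may differ between stats):
library(rvest)
library(purrr)
library(jsonlite)
# hypothetical helper: pull the main stat table for one stat id
get_stat_rows <- function(stat_id) {
  Sys.sleep(3)  # pause between requests
  read_html(paste0("https://www.pgatour.com/stats/detail/", stat_id)) %>%
    html_element(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON() %>%
    purrr::pluck("props", "pageProps", "statDetails", "rows")
}
stat_ids <- c("02568", "02674", "02567", "02564", "101")
all_stats <- map(set_names(stat_ids), possibly(get_stat_rows, otherwise = NULL))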

Rvest scraping child nodes but filling missing values with NA

I am trying to scrape some data from the SEC website. Each parent node has child nodes that contain the text of interest. However, in some cases a particular child node does not exist. So for example in this link:
urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"
There are 728 parent nodes. Each parent node has a number of entries that are child nodes that have a specific tag. Here is an example of one full entry (of the 728):
<infoTable>
<nameOfIssuer>APPLE INC</nameOfIssuer>
<titleOfClass>COM</titleOfClass>
<cusip>037833100</cusip>
<value>1486</value>
<shrsOrPrnAmt>
<sshPrnamt>11200</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<putCall>Put</putCall>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>11200</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
In this example the "putCall" tag may or may not exist. When it exists I want to be able to get the relevant text, so "Put" in this instance. However, for this link only 8 of the 728 parent nodes have the "putCall" node. I want to fill the entries where there is no "putCall" node with NA, so that I always have 728 entries for each tag, which I can then coerce into a data frame. This is what I have tried so far, inspired by "Inputting NA where there are missing values when scraping with rvest".
library(polite)
library(rvest)
library(purrr)
library(tidyverse)
library(httr)
session <- bow("https://www.sec.gov/")
urll <- "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml"
test <- session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ list(
      name_of_issuer = html_elements(.x, xpath = "//*[local-name()='nameOfIssuer']") %>%
        html_text() %>%
        {
          if (length(.) == 0) NA else .
        },
      title_of_class = html_elements(.x, xpath = "//*[local-name()='titleOfClass']") %>%
        html_text() %>%
        {
          if (length(.) == 0) NA else .
        },
      put_or_call = html_elements(.x, xpath = "//*[local-name()='putCall']") %>%
        html_text() %>%
        {
          if (length(.) == 0) NA else .
        }
    )
  )
This fails with the error message:
Error: Can't recycle `name_of_issuer` (size 728) to match `put_or_call` (size 8).
It seems that the NA fill is not working for the "putCall" node, and the scrape only returns the 8 entries that do exist.
Any suggestions on what I am doing wrong and how to fix it?
Thanks much!
If you simply use httr you can pass in a valid User-Agent header, and the code can be rewritten to use a data.frame() call instead of list(), so that NA is returned where a value is not present.
Swap out html_elements() for html_element().
You also need to amend your XPaths to be relative to the current node (prefix them with .), to avoid getting the first node's value repeated for each row.
library(tidyverse)
library(rvest) # needed for html_elements()/html_element()
library(httr)
headers <- c("User-Agent" = "Safari/537.36")
r <- httr::GET(url = "https://www.sec.gov/Archives/edgar/data/1002784/000139834421003391/fp0061633_13fhr-table.xml", httr::add_headers(.headers = headers))
r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']") %>% # select enclosing nodes
  # iterate over each parent node, pulling out desired parts and coerce to data.frame
  # not the complete list
  map_df(
    ~ data.frame(
      name_of_issuer = html_element(.x, xpath = ".//*[local-name()='nameOfIssuer']") %>%
        html_text(),
      title_of_class = html_element(.x, xpath = ".//*[local-name()='titleOfClass']") %>%
        html_text(),
      put_or_call = html_element(.x, xpath = ".//*[local-name()='putCall']") %>%
        html_text()
    )
  )
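As a side note: because html_element() (singular) is vectorised over a node set and returns a missing node (and hence NA from html_text()) where a tag is absent, a minimal sketch that builds the columns directly without map_df() should give the same result. This assumes rvest >= 1.0 and reuses the r response object from above:
library(rvest)
library(httr)
library(dplyr)
info_tables <- r %>%
  content() %>%
  html_elements(xpath = "//*[local-name()='infoTable']")
holdings <- tibble(
  name_of_issuer = info_tables %>% html_element(xpath = ".//*[local-name()='nameOfIssuer']") %>% html_text(),
  title_of_class = info_tables %>% html_element(xpath = ".//*[local-name()='titleOfClass']") %>% html_text(),
  put_or_call = info_tables %>% html_element(xpath = ".//*[local-name()='putCall']") %>% html_text()
)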

R - using rvest to scrape <p> tag only if a sister <img> tag is also present in nodes

I am scraping college basketball player images and https://unfospreys.com/sports/womens-basketball/roster/2020-21 is one of many pages with these images. Unfortunately, the 15th player on this page, Britney Gore, does not have a player image. As a result, the data.frame() below is not created, because the column imgSrc has length 14 while the column playerName has length 15. (You can run the code separately for each column in the data.frame() and each line works individually.)
library(rvest)
library(xml2)
rosters_url = 'https://unfospreys.com/sports/womens-basketball/roster/2020-21'
rosters_page = rosters_url %>% read_html()
this_rosters_df <- data.frame(
  baseUrl = 'https://unfospreys.com/sports/womens-basketball/roster/2020-21',
  imgSrc = rosters_page %>% html_nodes('div.sidearm-roster-player-image a img') %>% html_attr("data-src"),
  playerName = rosters_page %>% html_nodes('div.sidearm-roster-player-name p') %>% html_text() %>% trimws(),
  stringsAsFactors = FALSE
)
Is there any way for the code to identify, on this page, that a player doesn't have an image tag and therefore not pull their name, so we don't have this mismatch in the data frame? I cannot change the fact that there are only 14 image tags for the 15 players, but perhaps I can change the code for playerName to exclude all nodes that don't have a child/sister img tag?
The key to solving this problem is to retrieve the parent nodes for all of the players, then parse this vector of parent nodes for the requested information for each player using the html_node() function (notice: no ending s).
This technique works for this problem because there is a one-to-one relationship between each player's parent node and the requested information, for example one name, one position. The advantage of using html_node() instead of html_nodes() is that html_node() will always return a value, even if it is NA. So when there is no image node, an NA is returned and your vectors stay aligned.
library(rvest)
rosters_url <- "https://unfospreys.com/sports/womens-basketball/roster/2020-21"
rosters_page <- rosters_url %>% read_html()
#find the parent node which has all of the desired information for each player
players <- rosters_page %>% html_nodes(".sidearm-roster-player")
#Extract the requested information for each player
baseUrl = 'https://unfospreys.com/sports/womens-basketball/roster/2020-21'
imgSrc = players %>% html_node('img') %>% html_attr("data-src")
playername <- players %>% html_node('.sidearm-roster-player-name p') %>% html_text() %>% trimws()
#build the final answer
data.frame(baseUrl, imgSrc, playername)
You could grab a shared parent node and thereby restrict yourself to entries where both targets live under the same parent. I chose a parent-node selector that allows me to pull the name from an aria-label attribute (to match the displayed name after a substring replacement).
library(rvest)
library(xml2) # url_absolute() comes from xml2
library(purrr)
library(stringr)
rosters_url <- "https://unfospreys.com/sports/womens-basketball/roster/2020-21"
rosters_page <- rosters_url %>% read_html()
parent_nodes <- rosters_page %>% html_nodes(".sidearm-roster-player-image.column")
this_rosters_df <- map_df(parent_nodes, ~ {
  data.frame(
    imgSrc = .x %>% html_node(".lazyload") %>% html_attr("data-src") %>% url_absolute(., rosters_url),
    playerName = .x %>% html_node("a") %>% html_attr('aria-label') %>% str_replace(' - View Full Bio', ''),
    stringsAsFactors = FALSE
  )
})
head(this_rosters_df)

RVEST package seems to collect data in random order

I have the following question.
I am trying to harvest data from the Booking website (for myself only, in order to learn the functionality of the rvest package). Everything seems good and fine: the package collects what I want and puts everything in a table (data frame).
Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
So in this chunk I collect the data from the first 60 pages, after first manually feeding the Booking search engine the country of my choice (Spain), the dates I am interested in (just some arbitrary interval), and the number of people (I used the defaults here).
Then, I add this code to select the properties I want:
read_hotel <- function(url){ # collecting hotel names
  ho <- read_html(url)
  headline <- ho %>%
    html_nodes("span.sr-hotel__name") %>% # the node I want to read
    html_text() %>%
    as_tibble()
}
hotels <- map_dfr(page_booking, read_hotel)
read_pr <- function(url){ # collecting price tags
  pr <- read_html(url)
  full_pr <- pr %>%
    html_nodes("div.bui-price-display__value") %>% # the node I want to read
    html_text() %>%
    as_tibble()
}
fullprice <- map_dfr(page_booking, read_pr)
... and eventually save the whole data in the dataframe:
dfr <- tibble(hotels = hotels,
price_fact = fullprice)
I collect more parameters, but that doesn't matter here. The final data frame of 1,500 rows and two columns is then created. But the problem is that the data in the second column does not correspond to the data in the first one, which is really strange and renders my data frame useless.
I don't really understand how the package works in the background and why it behaves that way. I also noticed that the first rows in the first column of the data frame (hotel name) do not correspond to the first hotels I see on the website, so it seems rvest is using different search/sort/filter criteria.
Could you please explain the processes that take place while rvest hops from node to node?
I would really appreciate at least some explanation, just to better understand the tool we work with.
You shouldn't scrape the hotels' names and prices separately like that. What you should do is get all of the item (hotel) nodes first, then scrape the name and price relative to each hotel node. With this method, you can't mess up the order.
library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      map_dfr(
        items,
        function(item) {
          data.frame(
            hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = T),
            price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = T)
          )
        }
      )
    }
  )
(The dot at the start of each XPath expression refers to the current node, which here is the hotel item.)
Update:
Here is updated code that I think is faster but still does the job:
hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      data.frame(
        hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = T),
        price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = T)
      )
    }
  )

Scraping Lineup Data From Football Reference Using R

I seem to always have a problem scraping reference sites using either Python or R. Whenever I use my normal xpath approach (Python) or Rvest approach in R, the table I want never seems to be picked up by the scraper.
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for(x in boxscore_links){
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_table()
  # code that will bind lineup tables with some master table -- code to be written later
}
I'm trying to scrape the starting lineup tables. The first bit of code pulls the URLs for all boxscores in 2016, and the for loop goes to each boxscore page with the hope of extracting the tables headed by "[Insert Team Here] Starters".
Here's one link for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or elements of the table, that I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. This is how it should be done. The tables you want are embedded inside HTML comments, so you have to extract the comment text and re-parse it as HTML. This uses my own example, but I'm sure you could apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
  html_text() %>% # extract comment text
  paste(collapse = '') %>% # collapse to single string
  read_html() %>% # reread as HTML
  html_node('table#returns') %>% # select desired node
  html_table()
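Applied to the boxscore pages from the original question, a sketch along the same lines might look like the code below. The table id home_starters inside the commented-out markup is an assumption based on the div_home_starters id seen in the question, so verify it against the page source:
library(rvest)
box_url <- "https://www.pro-football-reference.com/boxscores/201609110rav.htm"
home_starters <- box_url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>% # the starters tables live inside HTML comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#home_starters') %>% # assumed table id; check the commented-out markup
  html_table()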
