On pgatour.com/stats I am trying to scrape multiple stats over multiple tournaments over multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, the PGA Tour's stat URLs looked like:
https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html
STAT_ID, YEAR_ID, and TOURNAMENT_ID would change to match the unique IDs of the selected stat, year, and tournament. Because of this, I was able to use a function that iterated through all combinations of stat_id, year_id, and tournament_id to scrape the website.
Now the URLs don’t change except for the particular stat_id being searched. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged. This prevents targeting different tournaments or years.
https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id
@Dave2e has been very helpful in showing me that the PGA site renders its pages with JavaScript and how to access some of the JSON data. I combined his teachings with my past code to scrape all stats for the most recent tournament. However, I can’t figure out how to get the stats for past years or tournaments. In the JSON str I see that there are IDs for $tournamentId and $year, but I’m uncertain how to use this info to search for past tournaments and years.
How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be trying to access this data with RSelenium as opposed to a package like rvest?
Code
library(tidyverse)
library(rvest)
library(dplyr)
df23 <- expand.grid(
stat_id = c("02568","02675", "101")
) %>%
mutate(
links = paste0(
"https://www.pgatour.com/stats/detail/",
stat_id
)
) %>%
as_tibble()
get_info <- function(link, stat_id) {
data <- link %>%
read_html() %>%
html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
html_text() %>%
jsonlite::fromJSON()
answer <- data$props$pageProps$statDetails$rows %>%
#NA's in player name stops data from being collected
drop_na(playerName)
# get lists of dataframes into single dataframe, then merge back with original dataframe
answer2 <- answer$stats
answer2 <- bind_rows(answer2, .id = "column_label") %>%
select(-color) %>%
pivot_wider(
values_from = statValue,
names_from = statName)
#All stats combined and unnested
stats2 <- dplyr::bind_cols(answer, answer2)
}
test_stats <- df23 %>%
mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_stats <- test_stats %>%
unnest(everything())
Simplified code courtesy of @Dave2e
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
#find the script with the correct id tag, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
#convert from JSON
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)
#get the main table
answer <-output$props$pageProps$statDetails$rows
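The same extraction can be tried offline. The sketch below runs the XPath and JSON steps against a small inline HTML string; the JSON content (player names, ranks) is made up for illustration and only mimics the nesting used above, not the real page.

```r
library(rvest)
library(jsonlite)

# Made-up stand-in for the page: a script tag with id "__NEXT_DATA__"
# holding JSON nested as props > pageProps > statDetails > rows
html <- '<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"statDetails":{"rows":[
  {"playerName":"Player A","rank":1},
  {"playerName":"Player B","rank":2}
]}}}}
</script>
</body></html>'

page <- read_html(html)

# Note the "@id" in the XPath: "@" selects an attribute
datascript <- page %>%
  html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()

# Convert the JSON text to nested R lists and data frames
output <- fromJSON(datascript)
rows <- output$props$pageProps$statDetails$rows
rows
```

Once this works on the toy string, swapping `html` for the real URL should be the only change needed, assuming the page still embeds its data the same way.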
If you open the developer tools (F12 key in your browser) and watch the Network tab while you select a different year, you can see a background request being made to retrieve that year's data.
It returns a JSON dataset similar to the one in your original post.
To scrape this you need to replicate this GraphQL POST request in your R program. Note that it sends a JSON document with the query details, which include the tournament code and the year.
Finally, to ensure that your GraphQL request succeeds, make sure your R program sends the same headers you see in the inspector, in particular Origin, Referer and the X- prefixed ones
(you can probably hardcode these).
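A rough sketch of how such a request could be assembled in R follows. The operation name, the variable names (`statId`, `year`, `tournamentId`), the header names and the endpoint are all placeholders and assumptions; copy the real values from the request shown in the Network tab. Only the payload builder is executed here, and the `httr::POST()` call is left commented out because it needs those real values.

```r
library(jsonlite)

# Build the JSON body of a GraphQL POST request.
# All field names below are hypothetical; use the ones from the
# request payload you see in the browser's Network tab.
build_stat_payload <- function(stat_id, year, tournament_id, query) {
  toJSON(
    list(
      operationName = "StatDetails",        # hypothetical operation name
      variables = list(
        statId       = stat_id,
        year         = year,
        tournamentId = tournament_id
      ),
      query = query                          # GraphQL query text copied from the browser
    ),
    auto_unbox = TRUE                        # send scalars, not length-1 arrays
  )
}

payload <- build_stat_payload("02675", 2021, "<tournament id>", "<query text>")

# Sending it (not run): fill in the endpoint and headers from the Network tab.
# resp <- httr::POST(
#   url  = "<GraphQL endpoint from the Network tab>",
#   body = payload,
#   httr::content_type_json(),
#   httr::add_headers(
#     Origin  = "https://www.pgatour.com",
#     Referer = "https://www.pgatour.com/",
#     `X-Some-Header` = "<value from the Network tab>"   # placeholder header
#   )
# )
# result <- fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
```

Looping this builder over a grid of stat IDs, years and tournament IDs would then mirror the old URL-based approach.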
Related
I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is dynamically generated, thus troublesome for rvest to handle. But the key to scrape this page is to realize the data is stored as JSON in a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, extracted the JSON data with some trial and error, and then cleaned up the player and team name columns for the final answer.
library(rvest)
library(dplyr)
library(stringr)
url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)
#the data for the page is stored in a script element
scripts <-page %>% html_elements("script")
#get column names
headers <- page %>% html_elements("thead th") %>% html_text()
#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()
#examined the data string and notice the start of the JSON was '[ ['
#end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")
#convert the JSON into data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame
#rename column titles
names(answer) <- headers
#function to clean up html code in columns
cleanhtml <- function(text) {
out<-text %>% read_html() %>% html_text()
}
#drop column 32 and strip the html markup from the Player and Team columns
answer <- answer[ , -32] %>% rowwise() %>%
mutate(Player = cleanhtml(Player), Team=cleanhtml(Team))
answer
I would like to scrape product name and the rating from a webpage. Upon inspecting the element, I know I need to get the data from product__title and attraqt-star-rating-stars__bar. But I am not sure how to do it as this is embedded within the multiple layers of tag. I've tried the following with no avail; any suggestions are welcome.
library(rvest)
library(dplyr)
url = 'https://www.chemistwarehouse.com.au/shop-online/159/oral-hygiene-and-dental-care'
stores <- read_html(url)
stores %>% html_nodes('body') %>%
html_nodes('.product__title') %>%
rvest::html_text()
stores %>% html_nodes('body') %>%
html_nodes('attraqt-star-rating-stars__bar') %>%
rvest::html_text()
Data is pulled dynamically from an API call. As the json returned is nested you need to extract the desired info e.g., by writing a couple of user-defined functions.
I first extract the listings (list of products), then have a function get_info, which takes an individual product listing and extracts the title and rating and returns a tibble. As the index at which the rating may appear can vary, I have an additional helper function get_rating_index, which retrieves dynamically the correct index for the rating. This function passes the index back to get_info.
I apply get_info over the list of product info, listings, using map_dfr to generate a final DataFrame from each tibble.
library(jsonlite)
library(purrr)
library(dplyr)
data <- jsonlite::read_json("https://www.chemistwarehouse.com.au/searchapi/webapi/search/category?category=159&index=0&sort=")
listings <- data$universes$universe[[1]]$`items-section`$items$item
get_info <- function(listing) {
tibble(
title = listing$attribute[[2]]$value[[1]]$value,
rating = listing$attribute[[get_rating_index(listing$attribute)]]$value[[1]]$value %>% as.numeric()
) -> t
return(t)
}
get_rating_index <-function(attribute){
return(match(T, map(attribute, ~{.x$name == 'bv_star_rating'})))
}
dental_product_ratings <- purrr::map_dfr(listings, get_info)
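As a variation on the helper above, attributes can be looked up by name and fall back to NA when one is missing, so a product without a rating does not stop the whole run. This is only a sketch on a hand-made list: `get_attr` is a hypothetical helper, and the attribute names other than `bv_star_rating` are invented to mimic the nesting described above.

```r
# Return the first value of the attribute called `attr_name`,
# or NA if the listing has no such attribute.
get_attr <- function(listing, attr_name) {
  hit <- Filter(function(a) identical(a$name, attr_name), listing$attribute)
  if (length(hit) == 0) {
    return(NA_character_)
  }
  as.character(hit[[1]]$value[[1]]$value)
}

# Hand-made listing that mimics the nested structure (names are invented,
# apart from "bv_star_rating")
listing <- list(attribute = list(
  list(name = "id",             value = list(list(value = "123"))),
  list(name = "name",           value = list(list(value = "Toothpaste 100g"))),
  list(name = "bv_star_rating", value = list(list(value = "4.5")))
))

rating  <- as.numeric(get_attr(listing, "bv_star_rating"))
missing <- get_attr(listing, "not_there")   # NA instead of an error
```

Looking up the title by name in the same way would also remove the reliance on it always sitting at index 2.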
As a practice project, I am trying to scrape property data from a website. (I only intend to practice my web scraping skills with no intention to further take advantage of the data scraped). But I found that some properties don't have price available, therefore, this creates an error of different length when I am trying to combine them into one data frame.
Here is the code for scraping:
library(tidyverse)
library(rvest)
web_page <- read_html("https://wx.fang.anjuke.com/loupan/all/a1_p2/")
community_name <- web_page %>%
html_nodes(".items-name") %>%
html_text()
length(community_name)
listed_price <- web_page %>%
html_nodes(".price") %>%
html_text()
length(listed_price)
property_data <- data.frame(
name=community_name,
price=listed_price
)
How can I identify the properties with no listed price and fill the price variable with NA when no value is scraped?
Inspection of the web page shows that the class is .price when price has a value, and .price-txt when it does not. So one solution is to use an XPath expression in html_nodes() and match classes that start with "price":
listed_price <- web_page %>%
html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>%
html_text()
length(listed_price)
[1] 60
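If a literal NA is wanted for listings without a price, another option is to select each listing's container first and then use html_element() (singular) inside it, which returns one result per container and NA where nothing matches. The sketch below uses a made-up HTML fragment; the `.item` container class is an assumption and would need to be replaced with whatever wraps each listing on the real page.

```r
library(rvest)

# Made-up fragment: the second listing has "price-txt" instead of "price"
html <- '<div class="item"><span class="items-name">Community A</span><p class="price">12000</p></div>
<div class="item"><span class="items-name">Community B</span><p class="price-txt">Price pending</p></div>'

# One node per listing (".item" is a hypothetical container class)
items <- read_html(html) %>% html_elements(".item")

# html_element() keeps one entry per listing, NA when no match is found
property_data <- data.frame(
  name  = items %>% html_element(".items-name") %>% html_text(),
  price = items %>% html_element(".price") %>% html_text(),
  stringsAsFactors = FALSE
)
property_data
```

Because both columns are derived from the same set of containers, they always have the same length.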
I'm scraping the ASN database (http://aviation-safety.net/database/). I've written code to paginate through each of the years (1919-2019) and scrape all relevant nodes except fatalities (represented as "fat."). Selector Gadget tells me the fatalities node is called "'#contentcolumnfull :nth-child(5)'". For some reason ".list:nth-child(5)" doesn't work.
When I scrape #contentcolumnfull :nth-child(5), the first element is blank, represented as "".
How can I write a function to delete the first empty element for every year/page that's scraped? It's simple to delete the first element when I scrape a single page on its own:
fat <- html_nodes(webpage, '#contentcolumnfull :nth-child(5)')
fat <- html_text(fat)
fat <- fat[-1]
but I'm finding it difficult to write into a function.
I also have a second question regarding date-time and formatting. My days data are represented as day-month-year. Several element days and months are missing (ex: ??-??-1985, JAN-??-2004). Ideally, I'd like to transform the dates into a lubridate object, but I can't with missing data or if I only keep the years.
At this point, I've used gsub() and regex to clean the data (delete "??" and floating dashes), so I have a mixed bag of data formats. However, this makes it difficult to visualize the data. Thoughts on best practice?
# Load libraries
library(tidyverse)
library(rvest)
library(xml2)
library(httr)
years <- seq(1919, 2019, by=1)
pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
paste0(years)
# Leaving out the category, location, operator, etc. nodes for sake of brevity
read_date <- function(url){
az <- read_html(url)
date <- az %>%
html_nodes(".list:nth-child(1)") %>%
html_text() %>%
as_tibble()
}
read_type <- function(url){
az <- read_html(url)
type <- az %>%
html_nodes(".list:nth-child(2)") %>%
html_text() %>%
as_tibble()
}
date <- bind_rows(lapply(pages, read_date))
type <- bind_rows(lapply(pages, read_type))
# Writing to dataframe
aviation_df <- cbind(type, date)
aviation_df <- data.frame(aviation_df)
# Excluding data cleaning
It is bad practice to ping the same page more than once in order to extract the requested information. You should read the page, extract all of the desired information and then move to the next page.
In this case the individual nodes are all stored in one master table. rvest's html_table() function is handy for converting an HTML table into a data frame.
library(rvest)
library(dplyr)
years <- seq(2010, 2015, by=1)
pages <- c("http://aviation-safety.net/database/dblist.php?Year=") %>%
paste0(years)
# Leaving out the category, location, operator, etc. nodes for sake of brevity
read_table <- function(url){
#add delay so that one is not attacking the host server (be polite)
Sys.sleep(0.5)
#read page
page <- read_html(url)
#extract the table out (the data frame is stored in the first element of the list)
answer<-(page %>% html_nodes("table") %>% html_table())[[1]]
#convert the fatalities column to character so the column type is the same on every page
answer$fat. <-as.character(answer$fat.)
answer
}
# Writing to dataframe
aviation_df <- bind_rows(lapply(pages, read_table))
There are a few extra columns which will need clean-up.
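For the partial dates mentioned in the question, one possible approach is to keep year, month and day as separate columns and only build a Date when all three are present. The sketch below uses base R only; `parse_partial_date` is a hypothetical helper, and it assumes the parts are separated by dashes with English three-letter month abbreviations, as in the examples from the question.

```r
# Split dates such as "12-JAN-2004" or "??-??-1985" into parts,
# leaving NA wherever a part is unknown.
parse_partial_date <- function(x) {
  parts <- strsplit(toupper(x), "-", fixed = TRUE)

  # Helper: first element of `v`, or NA if `v` is empty
  first_or_na <- function(v) if (length(v)) as.integer(v[1]) else NA_integer_

  year  <- vapply(parts, function(p) first_or_na(grep("^[0-9]{4}$", p, value = TRUE)), integer(1))
  day   <- vapply(parts, function(p) first_or_na(grep("^[0-9]{1,2}$", p, value = TRUE)), integer(1))
  month <- vapply(parts, function(p) {
    m <- match(p, toupper(month.abb))      # month.abb is always English
    first_or_na(m[!is.na(m)])
  }, integer(1))

  # Build a real Date only for fully specified rows
  full <- !is.na(year) & !is.na(month) & !is.na(day)
  date <- rep(as.Date(NA), length(x))
  date[full] <- as.Date(sprintf("%04d-%02d-%02d", year[full], month[full], day[full]))

  data.frame(raw = x, year, month, day, date, stringsAsFactors = FALSE)
}

parsed <- parse_partial_date(c("12-JAN-2004", "??-??-1985", "JAN-??-2004"))
parsed
```

Plots by year can then use the `year` column for every row, while anything needing exact dates can filter on rows where `date` is not NA.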
I know how to loop when a page is paginated, but I wish to scrape multiple information/html_nodes in one loop function, but I am not sure if you can set it up. So far I have tried the following. It's basically a jobsearch website, where I want company name, company description and number of open positions.
I use sprintf to get page 1-14.
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a loop, which works to scrape one data source.
virksomhed <- function(company){
company %>% read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
}
virk <- lapply(urlingtek, virksomhed)
But I wish to scrape all the utilities down at once if possible.
I have so far tried using
jobvirksom <- function(alt){
alt %>%
read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
html_nodes('.jix_active a') %>%
html_text()
}
So far without any luck. Would be a lot better if I could scrape it all at once, press lapply and turn into one list.
Here is the start of a solution. In this case, with only 14 web pages to parse, it is sometimes easier to just use a loop. With this number of pages the time difference between a for loop and lapply is insignificant.
I notice the web pages are not consistently formatted so this solution will need additional work when the data is missing or inconsistent. This will work for the first 2 pages and fail on the third where the overview is missing.
library(rvest)
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
#define empty data frame to store all data
alllistings<-data.frame()
for (i in urlingtek){
print(i)
#read the page just once
page<-read_html(i)
#parse company name
company<-page%>%html_nodes('.jix_company_name_link a') %>% html_text()
#remove blank company names
company<-trimws(company)
company<-company[nchar(company)>1]
#parse company overview
overv<-page %>% html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
#parse active information
active<-page %>% html_nodes('.jix_active a') %>% html_text()
#create temporary dataframe to store data from this loop
tempdf<-data.frame(company, overv, active)
#combine temp with all data
alllistings<-rbind(alllistings, tempdf)
}