I want to scrape the statistics from this page:
url <- "http://www.pgatour.com/players/player.20098.stuart-appleby.html/statistics"
Specifically, I want to grab the data in the table that's underneath Stuart's headshot. It's headlined by "Stuart Appleby - 2015 STATS PGA TOUR"
I attempt to use rvest, in combo with the Selector Gadget (http://selectorgadget.com/).
url_html <- url %>% html()
url_html %>%
html_nodes(xpath = '//*[(@id = "playerStats")]//td')
'Should' get me the table without, for example, the row on top that says "Recap -- Rank -- Additional Stats"
url_html <- url %>% html()
url_html %>%
html_nodes(xpath = '//*[(@id = "playerStats")] | //th//*[(@id = "playerStats")]//td')
'Should' get me the table with that "Recap -- Rank -- Add'l Stats" line.
Neither do.
Obvs I'm a complete newb when it comes to web scraping. When I click on 'view source' for that webpage, the data contained in the table isn't there.
In the source code, where I think the table should be starting, is this bit of code:
<script id="playerStatsTourTemplate" type="text/x-jquery-tmpl">
{{each(t, tour) tours}}
{{if pgatour.players.shouldProcessTour(tour.tourCodeLC)}}
<div class="statistics-head">
<h2 class="title">Stuart Appleby - <b>${year} STATS
.
.
.
So it appears the table is stored somewhere (JSON? jQuery? JavaScript? Are those terms applicable here?) that isn't accessible to the html() function. Is there any way to use rvest to grab this data? Is there an rvest equivalent for grabbing data that is stored in this manner?
Thanks.
I'd probably use the GET request that the page itself makes to fetch the raw data from their API, and then work on parsing that. In the code below:
content(a) gives you a list representation, basically the output from fromJSON()
or
as(a, "character") gives you the raw JSON
library("httr")
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)
as(a, "character")
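To make the difference between raw JSON and its parsed form concrete, here is a minimal offline sketch. It assumes the jsonlite package is installed. The JSON string and its field names (plrName, stats) are made up for illustration and are not the real structure of the PGA Tour file.

```r
library(jsonlite)

# Made-up JSON standing in for the API response (NOT the real PGA Tour schema)
raw_json <- '{"plrName": "Example Player", "stats": [{"name": "Driving Distance", "value": "290.1"}, {"name": "Scoring Average", "value": "70.5"}]}'

# fromJSON() turns the raw text into nested R lists and data frames
parsed <- fromJSON(raw_json)

# An array of objects that share the same keys is simplified to a data frame
stats <- parsed$stats

# Values arrive as strings in this made-up example, so convert them
stats$value <- as.numeric(stats$value)
print(stats)
```

Once the real response is parsed, str(parsed) is a quick way to find where the table you care about sits in the nested list.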
Check this out.
Open source project on GitHub scraping PGA data: https://github.com/zachwill/golf/blob/master/pga.py
Related
I am trying to scrape the following: 13.486 Kč from: https://www.aofis.cz/informace-pro-klienty/elba-opf/
For some reason, the following code does not seem to find the number. I am rather a newbie to this so perhaps it is because the string in xml_find_all is wrong. Can anyone please have a look why?
library(xml)
library(xml2)
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- read_html(page)
Price <- read_page %>%
rvest::html_nodes('page-content') %>%
xml2::xml_find_all("//strong[contains(@class 'sg_selected')]") %>%
rvest::html_text()
Price
Thank you!!
Michael
The HTML you see in your browser's developer panel (or in SelectorGadget) is not the same as the content that is delivered to your R session. What is delivered is actually JavaScript, which then builds the web page in the browser. This is why your rvest call isn't finding the correct HTML node: there are no HTML nodes in the string you are processing!
There are a few different ways to get the information you want, but perhaps the best is to just get the monetary values from the javascript code using regex:
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- httr::content(httr::GET(page), "text")
stringr::str_extract_all(read_page, "\\d+\\.\\d+ K")[[1]][1]
#> [1] "13.486 K"
I want to compare rookies across leagues with stats like Points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-reference), but I just found out that they're not stored in html, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use HTTR and JSON for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from:
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info (e.g. weight, height) comes from:
https://www.nba.com/players/active_players.json
So you could use jsonlite to parse them, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the network tab of the browser's developer tools when refreshing the page. It looks like you can change the player id in the URL to get a different player's info for the season.
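As a rough sketch of that idea, the URL can be built from the season and player id. The helper names profile_url and fetch_profile are hypothetical, and the pattern is only inferred from the single URL above, so it may not hold for every player or season.

```r
# Hypothetical helper: build the profile URL from a player id and a season,
# following the pattern of the URL shown above
profile_url <- function(player_id, season) {
  sprintf("https://data.nba.net/prod/v1/%d/players/%d_profile.json",
          as.integer(season), as.integer(player_id))
}

# Hypothetical helper: download and parse one profile
# (needs the jsonlite package and network access, so it is not called here)
fetch_profile <- function(player_id, season) {
  jsonlite::read_json(profile_url(player_id, season))
}

url <- profile_url(1629632, 2019)
print(url)
```

To compare several rookies, something like lapply(ids, fetch_profile, season = 2019) would return one parsed list per player, assuming ids is a vector of valid player ids.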
You actually can scrape this with rvest. Here's an example that scrapes White's totals table from Basketball Reference. On Sports Reference's sites, any table that is not the first table on the page is wrapped in an HTML comment, so we must extract the comment nodes first and then extract the desired table from them.
library(rvest)
library(dplyr)
cobywhite = 'https://www.basketball-reference.com/players/w/whiteco01.html'
totalsdf = cobywhite %>%
read_html %>%
html_nodes(xpath = '//comment()') %>%
html_text() %>%
paste(collapse='') %>%
read_html() %>%
html_node("#totals") %>%
html_table()
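The same comment trick can be tried offline on a small inline page. This is a sketch that assumes rvest is installed. The HTML and the numbers in it are made up, and only the table id "totals" is taken from the code above.

```r
library(rvest)

# Made-up page: the first table is ordinary HTML, the second sits inside a comment
html <- '<html><body>
<table id="per_game"><tr><th>G</th></tr><tr><td>10</td></tr></table>
<!-- <table id="totals"><tr><th>G</th><th>PTS</th></tr><tr><td>10</td><td>100</td></tr></table> -->
</body></html>'

page <- read_html(html)

# A direct lookup finds nothing, because the table is not part of the parsed tree
n_direct <- length(html_nodes(page, "#totals"))

# Pull the text of every comment node and parse that text as HTML again
comment_text <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "")

totals <- read_html(comment_text) %>%
  html_node("#totals") %>%
  html_table()

print(n_direct)
print(totals)
```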
I am scraping data from this website and for some reason, I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with Rvest.
I managed to scrape the seller's name with RSelenium but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
html_nodes(".link-4200870613") %>%
html_text()
You can easily regex the seller name out of the response, because it is contained in a script tag (presumably the browser fills in the page from there when it runs the JavaScript, which rvest does not do).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)
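Regex: "sellerName":"(.*?)" matches the literal key and lazily captures everything up to the next double quote, so column 2 of the match matrix holds the seller's name.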
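For anyone who wants to try the pattern without hitting the site, here is a small base R sketch. The page text is made up, and only the "sellerName" key comes from the answer above.

```r
# Made-up page text; the real page embeds a much larger JSON blob in a script tag
p <- 'window.__data = {"adId":"123","sellerName":"Example Seller","price":"20"};'

# regexec() returns the full match followed by each capture group;
# perl = TRUE makes sure the lazy quantifier .*? is supported
m <- regmatches(p, regexec('"sellerName":"(.*?)"', p, perl = TRUE))

# Element 1 is the whole match, element 2 is the first capture group
seller_name <- m[[1]][2]
print(seller_name)
```

The lazy .*? matters here: a greedy .* would run on to the last double quote in the string and capture far more than the name.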
There are a number of NBA Fantasy Projections that I would like to scrape in a more streamlined approach. Currently I use a combination of importhtml function in google sheets and simple archaic cut'n'paste.
I use R regularly to scrape other data from the internet, however, I can't manage to get these tables to scrape. The tables I am having trouble with are located at three separate addresses (1 table per page), they are:
1) http://www.sportsline.com/nba/player-projections/player-stats/all-players/
2) https://swishanalytics.com/optimus/nba/daily-fantasy-projections
3) http://www.sportingcharts.com/nba/dfs-projections/
For all my other scraping activities I use the packages rvest and XML. Following the same process, I've tried both methods listed below, which result in the outputs shown. I'm sure this has something to do with the format of the table on the website, but I haven't been able to find anything that helps.
Method 1
library(XML)
projections1 <- readHTMLTable("http://www.sportsline.com/nba/player-projections/player-stats/all-players/")
projections2 <- readHTMLTable("https://swishanalytics.com/optimus/nba/daily-fantasy-projections")
projections3 <- readHTMLTable("http://www.sportingcharts.com/nba/dfs-projections/")
Output
projections1
named list()
projections2
named list()
Warning message:
XML content does not seem to be XML: 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
projections3 - I get the headers of the table but not the content of the table.
Method 2
library(rvest)
URL <- "http://www.sportsline.com/nba/player-projections/player-stats/all-players/"
projections1 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "https://swishanalytics.com/optimus/nba/daily-fantasy-projections"
projections2 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
URL <- "http://www.sportingcharts.com/nba/dfs-projections/"
projections3 <- URL %>%
read_html %>%
html_nodes("table") %>%
html_table(trim=TRUE,fill=TRUE)
Output
projections1
list()
projections2 - I get the headers of the table but not the content of the table.
projections3 - I get the headers of the table but not the content of the table.
If anybody could point me in the right direction it would be greatly appreciated.
The content of the table is generated by JavaScript, so readHTMLTable and read_html find nothing. You can find the data as shown below.
projections1: the page loads its data from this endpoint:
import requests
# Endpoint the page calls to load the table data
url = 'http://www.sportsline.com/sportsline-web/service/v1/playerProjections?league=nba&position=all-players&sourceType=FD&game=&page=PS&offset=0&max=25&orderField=&optimal=false&release2Ver=true&auth=3'
r = requests.get(url)
# Parse the JSON response into Python objects
print(r.json())
projections2: view-source:https://swishanalytics.com/optimus/nba/daily-fantasy-projections Line 1181
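import requests
url = 'https://swishanalytics.com/optimus/nba/daily-fantasy-projections'
r = requests.get(url)
# Use the decoded text (r.content is bytes in Python 3)
text = r.text
# Take what follows "this.players = " up to the first ";"
# Note: eval() executes whatever it is given, so only use it on content you trust
print(eval(text.split('this.players = ')[1].split(';')[0]))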
projections3: view-source Line 918
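Since the question is about R, the same split-and-parse idea can be sketched in R as well. This assumes jsonlite is installed and that the array assigned to this.players is valid JSON, which is not guaranteed for every JavaScript literal. The page source below is made up.

```r
library(jsonlite)

# Made-up page source; the real page assigns a much longer array to this.players
page_source <- 'var app = {}; this.players = [{"name":"Player A","proj":31.5},{"name":"Player B","proj":27.0}]; this.other = 1;'

# Keep what follows "this.players = " ...
after_marker <- strsplit(page_source, "this.players = ", fixed = TRUE)[[1]][2]

# ... and cut it at the first ";" (this breaks if a value itself contains ";")
array_text <- strsplit(after_marker, ";", fixed = TRUE)[[1]][1]

# Assumption: the array literal is valid JSON
players <- fromJSON(array_text)
print(players)
```

On the live page, page_source would be the raw page text, for example from httr::content(httr::GET(url), "text").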
I am a beginner in scraping data from website. It seems difficult for me to interpret the structure of html using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about the investment from China. The character set is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
html() %>%
html_nodes(xpath='//*[@id="Grid1MainLayer"]/table[1]') %>%
html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable function from the XML package. It should read all the tables in the page and return each one already formatted as a data.frame.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then since there is only one table in the page you linked it should be enough to get the first element so:
target_table = all_tables[[1]]
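To see what readHTMLTable returns without depending on the site being reachable, here is a small offline sketch. It assumes the XML package is installed. The table contents are made up and do not reflect the real page.

```r
library(XML)

# Made-up HTML with a single table (not the content of the real page)
html_string <- '<html><body><table>
  <tr><th>Firm</th><th>Country</th></tr>
  <tr><td>Company A</td><td>Country X</td></tr>
  <tr><td>Company B</td><td>Country Y</td></tr>
</table></body></html>'

# Parse the string as HTML, then convert every <table> to a data.frame
doc <- htmlParse(html_string, asText = TRUE)
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# readHTMLTable returns a list with one element per table
target_table <- all_tables[[1]]
print(target_table)
```

If the real page builds its table with JavaScript, all_tables will come back empty and the data has to be fetched from whatever request the page makes instead.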