I am new to web scraping and working on a test project in which I am trying to scrape every table of data on the following website for this particular team. There should be 15 tables, but when I run my code, it only pulls the first 6 of the 15. How do I go about getting the rest of the tables?
Here is the code:
library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
iowa_stats %>% html_table()
Edit: I decided to dig a little deeper into the problem to see if I could get any more insight. I started with the first table that doesn't appear when you call html_table, which is the 'Totals' table, and followed the HTML all the way down to the table to see if I could figure out what's wrong. To do so, I used the following code.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")
This is as far as I can get before hitting an error. The next step down should be div#div_totals.table_container.is_setup, which holds the table, but if I add that selector to the chain above, nothing is found. The following returns nothing as well.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")
Does someone who knows HTML/CSS better than I do have any idea why this is the case?
It looks like this webpage stores some of the tables inside HTML comments. To solve this, read and save the web page, remove the comment tags, and then process normally.
library(rvest)
library(dplyr)

iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")

# Only save and work with the body
body <- html_node(iowa_stats, "body")
write_xml(body, "temp.xml")

# Find and remove the lines carrying comment markers so the hidden tables are exposed
lines <- readLines("temp.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "temp2.xml")

# Read the file back in and process normally
body <- read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()
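For what it's worth, the same idea also works without temp files: pull the comment nodes directly with an XPath query, re-parse their text as HTML, and extract the tables from that. A minimal sketch of this variant:

library(rvest)

iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")

# Collect the text of every comment node, re-parse it as HTML,
# then pull out any tables that were hiding inside the comments
hidden_tables <- iowa_stats %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()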
I'm new to web scraping, so I may not be doing all the proper checks here. I'm attempting to scrape information from a URL, but I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen 7), which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then use Ctrl+F to search for where the product name exists within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, as the code gives clear indications of what is being selected.
I've updated the syntax and reduced the number of imported dependencies.
library(magrittr)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)

# The product name is exposed as a product-name attribute in the static HTML
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")

# Spec tables carry the class "sprocket__table spec"
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()

# The display price sits in the value attribute of a hidden element
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
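A quick sanity check on what comes back (shapes assumed from the selectors above):

print(name)    # a single character string with the product title
print(price)   # a numeric price
length(specs)  # number of spec tables matched; each element is a data frame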
Background: Using rvest I'd like to scrape all details of all art pieces for the painter Paulo Uccello on wikiart.org. The endgame will look something like this:
> names(uccello_dt)
[1] "title"   "year"    "style"   "genre"   "media"   "imgSRC"  "infoSRC"
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful for understanding exactly what path the scrape took to end up at character(0). I'd like my scrape attempts to output the path they actually took so that I can troubleshoot my failures more effectively.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector tool to make sure that I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been trial and error that's taking much longer than I think it should. Here's a cleaned-up source of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)
sample_url <- read_html(
  "https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)

imgSrc <- sample_url %>%
  html_nodes(".wiki-detailed-item-container") %>%
  html_nodes(".masonry-detailed-artwork-item") %>%
  html_nodes("aside") %>%
  html_nodes(".wiki-layout-artist-image-wrapper") %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  as.character()

title <- sample_url %>%
  html_nodes(".masonry-detailed-artwork-title") %>%
  html_text() %>%
  as.character()
Thank you in advance.
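One way to get the kind of step-by-step visibility asked about above is to break the chain apart and check the nodeset length after each selector; a length of 0 shows exactly where matching stopped. A debugging sketch reusing the selectors from the question:

library(rvest)

sample_url <- read_html(
  "https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)

# Walk the chain one selector at a time and report how many nodes survive
containers <- sample_url %>% html_nodes(".wiki-detailed-item-container")
length(containers)

items <- containers %>% html_nodes(".masonry-detailed-artwork-item")
length(items)

images <- items %>% html_nodes("aside img")
length(images)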
This is my first post and I'm a beginner with R, so patience is requested if I should have found an answer to my question elsewhere.
I'm trying to cobble together a table with data pulled from multiple sites from CME (https://www.cmegroup.com/trading/energy/crude-oil/western-canadian-select-wcs-crude-oil-futures.html is one).
I've tried using rvest but get a blank table.
I think this is because the table is populated in real time by JavaScript. I've fumbled my way around this site looking for similar problems and haven't quite figured out how best to pull this data. Any help is much appreciated.
library(rvest)
library(dplyr)
WCS_page <- "https://www.cmegroup.com/trading/energy/crude-oil/canadian-heavy-crude-oil-net-energy-index-futures_quotes_globex.html"
WCS_diff <- read_html(WCS_page)
month <- WCS_diff %>%
  rvest::html_nodes('th') %>%
  xml2::xml_find_all("//scope[contains(@col, 'Month')]") %>%
  rvest::html_text()

price <- WCS_diff %>%
  rvest::html_nodes('tr') %>%
  xml2::xml_find_all("//td[contains(@class, 'quotesFuturesProductTable1_CLK0_last')]") %>%
  rvest::html_text()

WTI_df <- data.frame(month, price)

knitr::kable(WTI_df %>% head(10))
Yes, the page is using JavaScript to load the data.
The easy way to check is to view the page source and search for some of the text you saw in the table. For example, the word "May" never shows up in the raw HTML, so it must have been loaded later.
The next step is to use something like the Chrome DevTools to inspect the network requests that were made. In this case there is a clear winner, and your structured data is coming down from here:
https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G
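A minimal sketch of pulling that endpoint directly with jsonlite (the response layout is an assumption, so inspect it with str() before extracting fields; the endpoint may also insist on browser-like request headers in practice):

library(jsonlite)

# Fetch the same JSON the page requests in the background
quotes <- fromJSON("https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G")

# Explore the structure to locate the month and price fields
str(quotes, max.level = 2)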
I want to compare rookies across leagues with stats like points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-Reference), but I just found out that they're not stored in the HTML, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and jsonlite for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
library(magrittr)  # for %>%

coby.white <- GET('https://www.nba.com/players/coby/white/1629632')

out <- content(coby.white, as = "text") %>%
  fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html><html class="" l
                     (right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info (e.g. weight, height) from
https://www.nba.com/players/active_players.json
So you could use jsonlite to parse it, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these in the network tab when refreshing the page. It looks like you can swap the player id into the URL to get different players' info for the season.
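From there it's a matter of drilling into the parsed list. The path below is an assumption based on the usual shape of that feed, so verify against str() output first:

library(jsonlite)

data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')

# Inspect the nesting before committing to a path
str(data, max.level = 3)

# Per-game stats typically sit under league$standard$stats$latest
# (assumed path; adjust after checking the str() output)
ppg <- data$league$standard$stats$latest$ppg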
You actually can web scrape this with rvest; here's an example of scraping White's totals table from Basketball-Reference. Anything on Sports Reference's sites that is not the first table on the page is embedded in an HTML comment, meaning we must extract the comment nodes first and then pull the desired table out of them.
library(rvest)
library(dplyr)
cobywhite <- 'https://www.basketball-reference.com/players/w/whiteco01.html'

totalsdf <- cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#totals") %>%
  html_table()
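A quick check that the table came through:

dim(totalsdf)      # rows and columns of the totals table
head(totalsdf, 3)  # first few seasons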
Here I am, a total beginner in R. I am trying to learn more about rvest and how to scrape from the web. Here is the wiki page (https://en.wikipedia.org/wiki/Andy_Murray) with the table I want to transfer to R.
Using a CSS selector, I found that the particular table matches ".wikitable". Following some tutorials on other webpages, here is the code that I used:
library(rvest)
tennis <- read_html("https://en.wikipedia.org/wiki/Andy_Murray")
trial <- tennis %>% html_nodes(".wikitable") %>% html_table(fill = TRUE)
trial
I could not isolate the result to the table that I wanted. Can someone please teach me how? Another thing: what does the pipe (%>%) do?
You were almost there. What you extracted was a list. To get to your desired element you need to use indexing:
trial[[2]]
To clean it further use:
df <- trial[[2]]
df <- df[-1, ]       # drop the first row
df[, 17:20] <- NULL  # drop columns 17 to 20
%>% is called a pipe and comes from the magrittr package (dplyr re-exports it). It passes the result on its left as the first argument to the function on its right.
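A tiny illustration: the two calls below are equivalent, because the pipe feeds its left-hand side in as the first argument of the function on its right.

library(magrittr)

head(trial[[2]])       # plain function call
trial[[2]] %>% head()  # the same call written with the pipe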