I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is the website defaults to PS4, while I wanted the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that will define the platform in this way, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience using JavaScript and I'm not sure how to perform an action on the page and then scrape the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)
url1 <- "https://www.futbin.com/19/players?page="
pge <- 1
tbl <- paste0(url1, pge) %>%
read_html() %>%
html_nodes(xpath='//*[@id="repTb"]') %>%
html_table()
Thanks in advance for any help.
I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
html_nodes(".buying-zone__title") %>%
html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page. You can confirm this by disabling JavaScript in the browser, or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then use Ctrl+F to search for where the product name exists within the non-rendered content.
The title you want is present in several places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, where the code gives clear indications of what is being selected.
I've updated the syntax and reduced the volume of imported external dependencies.
library(magrittr)
library(rvest)
url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
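To illustrate why the class selector comes back empty while the attribute selector works, here is a minimal offline sketch. The HTML string is a made-up stand-in for the page source (the product name and price are invented for illustration, not taken from the real page); it only assumes that the rendered title element is added later by JavaScript while the attribute is already in the static source.

```r
library(rvest)

# Hypothetical stand-in for the non-rendered page source: there is no
# .buying-zone__title element yet, but the name is present as an attribute.
static_html <- '<html><body>
  <div id="app"><div product-name="Example Bike 1"></div></div>
  <input id="gtm-product-display-price" type="hidden" value="1299.99">
</body></html>'

page <- read_html(static_html)

# The class only exists after JavaScript has run, so nothing is matched here.
n_rendered <- length(page %>% html_elements(".buying-zone__title"))

# The attribute is in the static source, so it can be read directly.
name <- page %>% html_element("[product-name]") %>% html_attr("product-name")
price <- page %>%
  html_element("#gtm-product-display-price") %>%
  html_attr("value") %>%
  as.numeric()
```

If a selector copied from the browser inspector matches zero nodes, checking the page source for the same text in an attribute or script is usually the first thing to try.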
I'm attempting to scrape tables from this page:
https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds
I'm trying to gather the info under "player","over", and "under"
so the first row would be Joe Flacco 1.5 +140 1.5 -190 (these numbers change so when you're reading it might be different)
As an example of code I used on the same website, but different table/link, I used this:
library(rvest)
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds'
test <- url %>%
read_html() %>%
html_nodes('.default-color , .sportsbook-outcome-cell__line , .sportsbook-row-name') %>%
html_text()
This code gives me the exact data that I want.
Note that this working code is for a separate page: https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-yds
I used selectorGadget extension to ascertain the selector value
The 2 different pages I'm looking at are accessible from the header above "Bal Ravens"
Pass Yds is the default table selection for the page; Pass TDs is next to it, which gets you to the page which I'm also attempting to scrape.
For some reason, scraping the Pass TDs table using the same method as Pass Yds table leaves me with an empty string:
url <- 'https://sportsbook.draftkings.com/leagues/football/nfl?category=passing-props&subcategory=pass-tds'
test <- url %>%
read_html() %>%
html_nodes('.sportsbook-row-name , .default-color , .sportsbook-outcome-cell__line') %>%
html_text()
Note that when using selectorGadget for this page, it gives me a different html_nodes selector
I have also tried using xpath and finding the individual tables (with html_table) on the Inspect page. Again, this process works with the Pass Yds page, but not the Pass Tds page. I assume this problem relates to the fact that the table on the website is variable, with Pass Yds being the default.
If anyone could help me with this, or point me in the direction to information regarding scraping these menu-selectable tables, I would greatly appreciate it.
Thanks!
I am looking to extract all the links for each episode on this webpage, however I appear to be having difficulty using html_nodes() where I haven't experienced such difficulty before. I am trying to iterate the code using "." such that all the attributes for the page are obtained with that CSS. This code is meant to give an output of all the attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have all the attributes in order to obtain the links specifically out of them, but this step is proving a stumbling block for this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
read_html() %>%
html_node("body") %>%
html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
html_attrs()
rvest does not work here because this page uses JavaScript to insert another webpage into an iframe on this page to display the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which will redirect you to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON, rendered via JavaScript.
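Searching the embedded script can also be done from R rather than by eye. The sketch below runs on a made-up host page (the script body is an assumption about how such an iframe might be inserted, not a copy of the real page) and pulls every URL out of the script text.

```r
library(rvest)

# Hypothetical host page: the iframe is created by a script, so no <iframe>
# element exists in the static HTML, only a URL inside the script text.
host_html <- '<html><head><script>
  var frame = document.createElement("iframe");
  frame.src = "https://datawrapper.dwcdn.net/eoqPA/66/";
  document.body.appendChild(frame);
</script></head><body><p>Episode list</p></body></html>'

page <- read_html(host_html)

# Collect the text of every script element, then extract anything URL-like.
script_text <- page %>% html_elements("script") %>% html_text()
matches <- regmatches(script_text, gregexpr('https://[^"\']+', script_text))
embedded_urls <- unique(unlist(matches))
```

Any URL found this way can then be passed to read_html() in place of the original address.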
The links to the shows are extractable, and there is a link to a Google doc that is the full index.
Searching this page turns up a link to a Google doc:
library(rvest)
library(dplyr)
library(stringr)
page2 <-read_html("https://datawrapper.dwcdn.net/eoqPA/67/")
#find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')
#isolate the Google docs
print(str_extract_all(html_text(page2), 'https://docs.*?\\"') )
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
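As written, the patterns above appear to keep the closing quote at the end of each match. A small variation that stops before the quote is sketched below in base R; the input string and its URLs are invented for illustration and are not the real page content.

```r
# Hypothetical fragment of embedded JSON text, standing in for html_text(page2).
page_text <- paste0(
  '{"source":"https://docs.example.com/sheet/edit?usp=sharing",',
  '"other":"https://example.com/a"}'
)

# [^"]* consumes everything up to, but not including, the closing quote.
doc_links <- regmatches(page_text, gregexpr('https://docs[^"]*', page_text))[[1]]
```

The resulting strings can be used as URLs directly, without trimming a trailing quote first.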
I have a script below that works for simple HTML scraping, but nothing is returned for this particular site. I am new to using HTML with R and SelectorGadget, but I have other sites that work. I am wondering why this one does not see the element. The picture below has the path in the highlighted red box, and I am curious if it is because of the # before the fancy-box that makes this hidden. Any tips and language corrections would be helpful, as I am still learning how to scrape HTML.
library(rvest)
library(dplyr)
library(tm)
library(stringi)
library(readr)
url <- read_html('https://www.draftkings.com/draft/contest/84207356')
rot <- url %>%
html_nodes('.prize-payouts td+ td') %>%
html_text()
roster <- data.frame(ROT = rot)
The website is using JavaScript to render the page. One solution is to download the data as JSON, which you can find by examining the network requests under the developer tools in your web browser.
This file should provide the information you are looking for:
library(jsonlite)
fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
Be sure to comply with the terms of service of this website.
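Once the JSON is downloaded, fromJSON() returns nested lists and data frames that you can index into. The sketch below parses a made-up JSON string; the field names (contest, payouts, minPosition, maxPosition, payout) are hypothetical and are not the actual structure returned by the API, so inspect the real response with str() first.

```r
library(jsonlite)

# Hypothetical response body; the real API's field names will differ.
json_text <- '{
  "contest": {
    "name": "Example contest",
    "payouts": [
      {"minPosition": 1, "maxPosition": 1, "payout": 100},
      {"minPosition": 2, "maxPosition": 5, "payout": 20}
    ]
  }
}'

parsed <- fromJSON(json_text)

# An array of objects with the same keys is simplified to a data frame.
payouts <- parsed$contest$payouts
total_rows <- nrow(payouts)
```

Because fromJSON() accepts either a URL or a JSON string, the same indexing works on the live response once you know which fields it contains.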
I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest);library(selectr);library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(as.character(zoom.turbo), asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser running in a Docker container, you can navigate websites with R commands.
I have used this method to scrape a website that had a dynamic login script, that I could not get to work with other methods.
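A rough sketch of that approach follows. It assumes a Selenium server with Firefox is already running and reachable on localhost port 4445 (for example from a Docker image); the function names scrape_rendered and extract_text, the port, and the fixed wait are my own illustrative choices, not part of any package. The parsing step is shown on the span from the question so it can be tried without a browser.

```r
library(rvest)

# Parse already-rendered HTML and return the text of the first match.
extract_text <- function(html, css) {
  read_html(html) %>% html_element(css) %>% html_text()
}

# Sketch only: assumes a Selenium server is listening on localhost:<port>.
# Not called here, because it needs a running browser.
scrape_rendered <- function(url, css, port = 4445L, wait_seconds = 5) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost", port = port, browserName = "firefox"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)
  remDr$navigate(url)
  Sys.sleep(wait_seconds)              # crude wait for JavaScript to finish
  html <- remDr$getPageSource()[[1]]   # HTML after rendering
  extract_text(html, css)
}

# Offline check of the parsing step, using the span shown in the question.
rendered <- '<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>'
rate <- as.numeric(extract_text(rendered, "span.widgetRate"))
```

With a server running, something like scrape_rendered("https://www.zoomtrader.com/trade-now?game=turbo", "span.widgetRate") would be the equivalent live call, though the selector and wait time may need adjusting.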