I have a script below that works for simple html scraping, but nothing is returned for this particular site. I am new to using html with R and selectorgadget, and I have other sites that work, so I am wondering why this one does not see the element. The picture below shows the path in the highlighted red box, and I am curious whether it is because of the # before the fancy-box that makes this hidden. Any tips and language corrections would be helpful as I am still learning how to scrape html.
library(rvest)
library(dplyr)
library(tm)
library(stringi)
library(readr)
url <- read_html('https://www.draftkings.com/draft/contest/84207356')
rot <- url %>%
  html_nodes('.prize-payouts td+ td') %>%
  html_text()
roster <- data.frame(ROT = rot)
The website is using javascript to render the page. One solution is to download the data as JSON: if you examine the network requests under the developer tools in your web browser, you can see the files the page loads.
This file should provide the information you are looking for:
library(jsonlite)
fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
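Since the response comes back as a nested list, it helps to assign it and inspect the structure before pulling out the payout fields (a sketch; the element names depend on the API response and are not shown here):
# Store the parsed response and look at the top-level structure;
# the prize/payout information should be somewhere in these elements
contest <- fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
str(contest, max.level = 2)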
Be sure to comply with the terms of service of this website.
I am looking to extract all the links for each episode on this webpage; however, I appear to be having difficulty using html_nodes(), where I haven't had such difficulty before. I am trying to prefix the class with "." in the selector so that all the attributes of the matching elements are obtained with that CSS. This code is meant to output all of those attributes, but instead I get {xml_nodeset (0)}. I know what to do once I have the attributes in order to pull the links out of them, but this step is proving a stumbling block for this website.
Here is the code I have begun in R:
episode_list_page_1 <- "https://jrelibrary.com/episode-list/"
episode_list_page_1 %>%
  read_html() %>%
  html_node("body") %>%
  html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
  html_attrs()
This rvest code does not work here because this page uses javascript to insert another webpage into an iframe on this page to display the information.
If you search the embedded script you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which will redirect you to "https://datawrapper.dwcdn.net/eoqPA/67/". This second page contains the data you are looking for as embedded JSON generated via javascript.
The links to the shows are extractable, and searching this page turns up a link to a Google doc that serves as the full index:
library(rvest)
library(dplyr)
library(stringr)
page2 <- read_html("https://datawrapper.dwcdn.net/eoqPA/67/")
#find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\\"')
#isolate the Google docs
print(str_extract_all(html_text(page2), 'https://docs.*?\\"') )
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
I am trying to scrape the following: 13.486 Kč from: https://www.aofis.cz/informace-pro-klienty/elba-opf/
For some reason, the following code does not seem to find the number. I am rather a newbie to this, so perhaps it is because the string in xml_find_all is wrong. Can anyone please have a look at why?
library(rvest)
library(xml2)
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- read_html(page)
Price <- read_page %>%
  rvest::html_nodes('page-content') %>%
  xml2::xml_find_all("//strong[contains(#class 'sg_selected')]") %>%
  rvest::html_text()
Price
Thank you!!
Michael
The html code you see in your browser developer panel (or selector gadget) is not the same as the content that is being delivered to your R session. It is actually a javascript file which then builds the web page. This is why your rvest call isn't finding the correct html node: there are no html nodes in the string you are processing!
There are a few different ways to get the information you want, but perhaps the best is to just get the monetary values from the javascript code using regex:
page <- "https://www.aofis.cz/informace-pro-klienty/elba-opf/"
read_page <- httr::content(httr::GET(page), "text")
stringr::str_extract_all(read_page, "\\d+\\.\\d+ K")[[1]][1]
#> [1] "13.486 K"
I am trying to scrape synonyms from the National Cancer Institute Thesaurus database; however, I am having some trouble finding the right html to point to for this. Below is my code and the data frame I am using. When I run my script to pull the synonyms I get: Error in open.connection(x, "rb") : HTTP error 404. I can't seem to figure out what the right html link should be and how to find it.
library(xml2)
library(rvest)
library(dplyr)
library(tidyverse)
synonyms <- read_csv("terms.csv")
##list of acronyms
words <- c(synonyms$Keyword)
##Designate the html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Data <- data.frame(Pages = c(htmls))
results <- sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text()
  )
})
I suspect there's a problem with this line of code:
##Designate the html link and the values to search
htmls <- paste0("https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/", words)
Because paste0() just concatenates text together, this will give you URLs like
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Ketamine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Azacitidine
https://ncit.nci.nih.gov/ncitbrowser/pages/concept_details.jsf/Axicabtagene+Ciloleucel
While I do not have particular experience with rvest, the 404 error you see almost certainly means those URLs do not resolve to real pages. I recommend logging or printing out htmls so you can check whether they work in a web browser.
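A quick way to check this from R (a sketch; it uses the httr package, which the original code does not load) is to request a few of the constructed URLs and look at their status codes:
library(httr)
# A 404 status here confirms the constructed URL pattern is wrong
for (u in head(htmls, 3)) {
  cat(u, "->", status_code(GET(u)), "\n")
}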
I will point out that in this particular case the website offers a downloadable database; you might find it easier to download and query that offline than to do this web scraping.
I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is that the website defaults to PS4, while I want the data for Xbox (this is changed via the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that will define the platform in this way, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience using Javascript and I'm not sure how I would perform an action on the page and then scrape the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)
url1 <- "https://www.futbin.com/19/players?page="
pge <- 1
tbl <- paste0(url1, pge) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
Thanks in advance for any help.
I'm trying to harvest data using rvest (also tried using XML and selectr) but I am having difficulties with the following problem:
In my browser's web inspector the html looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest);library(selectr);library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and Selectr
doc <- htmlParse(zoom.turbo, asText = TRUE)
xmlValue(querySelector(doc, 'span'))
For websites that are difficult to scrape, for example where the content is dynamic, you can use RSelenium. With this package and a browser running in a docker container, you are able to navigate websites with R commands.
I have used this method to scrape a website that had a dynamic login script, that I could not get to work with other methods.
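A minimal sketch of that approach for the page above (an assumption-laden example: it presumes a Selenium server is already running locally on port 4444, e.g. via a selenium/standalone-firefox docker container):
library(RSelenium)
library(rvest)

# Connect to a Selenium server already running in a docker container
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()

# Navigate to the page and give the javascript time to render
remDr$navigate("https://www.zoomtrader.com/trade-now?game=turbo")
Sys.sleep(5)

# Hand the fully rendered html over to rvest and target the widget span from the question
rendered <- read_html(remDr$getPageSource()[[1]])
rate <- rendered %>%
  html_nodes("span[data-widget='turboBinary_tradologic1_rate']") %>%
  html_text()

remDr$close()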