Harvesting data with rvest retrieves no value from data-widget

I'm trying to harvest data using rvest (I also tried XML and selectr), but I'm running into the following problem:
In my browser's web inspector the HTML looks like
<span data-widget="turboBinary_tradologic1_rate" class="widgetPlaceholder widgetRate rate-down">1226.45</span>
(Note: rate-down and 1226.45 are updated periodically.) I want to harvest the 1226.45, but when I run my code (below) it says there is no information stored there. Does this have something to do with the fact that it's a widget? Any suggestions on how to proceed would be appreciated.
library(rvest); library(selectr); library(XML)
zoom.turbo.url <- "https://www.zoomtrader.com/trade-now?game=turbo"
zoom.turbo <- read_html(zoom.turbo.url)
# Navigate to the node
zoom.turbo <- zoom.turbo %>% html_nodes("span") %>% `[[`(90)
# No value
as.character(zoom.turbo)
html_text(zoom.turbo)
# Using XML and selectr (htmlParse() expects a character string, not a node)
doc <- htmlParse(as.character(zoom.turbo), asText = TRUE)
xmlValue(querySelector(doc, "span"))

For websites that are difficult to scrape, for example where the content is rendered dynamically with JavaScript, you can use RSelenium. With this package and a browser docker image, you can navigate websites with R commands.
I have used this method to scrape a website with a dynamic login script that I could not get working with other methods.
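As a rough illustration, here is a minimal RSelenium sketch for the question above. It assumes a Selenium server is already running on localhost:4444 (for example via the selenium/standalone-firefox docker image); the CSS selector is taken from the HTML snippet in the question, and the five-second wait is an arbitrary guess.
library(RSelenium)

# Connect to a Selenium server assumed to be running on localhost:4444
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                      browserName = "firefox")
remDr$open()
remDr$navigate("https://www.zoomtrader.com/trade-now?game=turbo")

# Give the widget's JavaScript time to fill in the rate, then read the text
Sys.sleep(5)
elem <- remDr$findElement(using = "css selector",
                          "span[data-widget='turboBinary_tradologic1_rate']")
elem$getElementText()[[1]]

remDr$close()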

Related

R: Webscraping data not contained in HTML

I'm trying to web scrape in R from web pages such as these, but the HTML is only 50 lines, so I'm assuming the numbers are hidden in a JavaScript file or on their server. I'm not sure how to find the numbers I want (e.g., the enrollment number under student population).
When I try to use rvest, as in
num <- school_webpage %>%
  html_elements(".number no-mrg-btm") %>%
  html_text()
I get an error that says "could not find function "html_elements"", even though I've installed and loaded rvest.
What's my best strategy for getting those various numbers, and why am I getting that error message? Thanks.
That data is coming from an API request, which you can find in the browser's Network tab. It returns JSON, so make a request directly to that endpoint (since you don't have a browser to trigger it from the landing page). As for the error: html_elements() was only added in rvest 1.0.0, so on older versions of the package use html_nodes() instead.
library(jsonlite)
data <- jsonlite::read_json('https://api.caschooldashboard.org/LEAs/01611766000590/6/true')
print(data$enrollment)
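If you'd rather have the nested lists simplified into data frames, jsonlite::fromJSON() on the same endpoint does that automatically. A safe habit (sketched below) is to inspect the top-level structure before drilling into a field:
library(jsonlite)

data <- fromJSON("https://api.caschooldashboard.org/LEAs/01611766000590/6/true")
str(data, max.level = 1)  # locate the enrollment field before drilling in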

How can I extract various tables from Wikipedia using R?

I have the following link:
https://en.wikipedia.org/wiki/List_of_prime_ministers_of_Spain
I'm trying to extract the information about the Prime Ministers, but it returns a table of data without any apparent order.
This is the code that I am currently using:
library(XML)
library(httr)
url <- "https://en.wikipedia.org/wiki/List_of_prime_ministers_of_Spain"
url <- GET(url)
datos <- readHTMLTable(rawToChar(url$content), header = TRUE, stringsAsFactors = FALSE)
tabla2 <- datos[[2]]
I would suggest using Selenium. Through the Selenium API you can access all functionality of the DOM. I had previously used the urllib library in Python, but it didn't help when a page uses a lot of JavaScript and the DOM keeps changing.
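That said, this particular Wikipedia page is static HTML, so a plain rvest sketch may be enough without Selenium. The "table.wikitable" selector and the table index below are assumptions to check against the page:
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_prime_ministers_of_Spain"
tables <- url %>%
  read_html() %>%
  html_nodes("table.wikitable") %>%
  html_table(fill = TRUE)

length(tables)           # see how many tables were matched
pm_table <- tables[[2]]  # index 2 mirrors the question; adjust as needed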

HTML scraping with R and selectorgadget

I have a script below that works for simple HTML scraping, but nothing is returned for this particular site. I'm new to using HTML with R and selectorgadget, and since I have other sites that work, I'm wondering why this one does not see the element. The screenshot I took shows the path in the highlighted red box, and I'm curious whether the # before fancy-box is what makes it hidden. Any tips and language corrections would be helpful, as I'm still learning how to scrape HTML.
library(rvest)
library(dplyr)
library(tm)
library(stringi)
library(readr)
url <- read_html('https://www.draftkings.com/draft/contest/84207356')
rot <- url %>%
  html_nodes('.prize-payouts td+ td') %>%
  html_text()
roster <- data.frame(ROT = rot)
The website is using JavaScript to render the page. One solution is to download the data as JSON; you can find the relevant request by examining the files in the Network tab of your browser's developer tools.
This file should provide the information you are looking for:
library(jsonlite)
fromJSON("https://api.draftkings.com/contests/v1/contests/84207356?format=json")
Be sure to comply with the terms of service of this website.
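On that last point, a quick courtesy check with the robotstxt package (an extra suggestion, not part of the original answer) can tell you whether a path is disallowed for crawlers:
library(robotstxt)

# TRUE means the path is not disallowed by the site's robots.txt
paths_allowed("https://www.draftkings.com/draft/contest/84207356")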

Scraping an HTML table in R after changing a JavaScript dropdown option

I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is that the website defaults to PS4, while I want the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that defines the platform, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience using JavaScript, and I'm not sure how you would perform an action on the page and then scrape the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)
url1 <- "https://www.futbin.com/19/players?page="
pge <- 1
tbl <- paste0(url1, pge) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
Thanks in advance for any help.
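One possible direction, sketched with RSelenium rather than PhantomJS: drive a real browser, click the platform dropdown, then hand the rendered page source to rvest. The two dropdown selectors below are hypothetical placeholders; inspect the page to find the real ones.
library(RSelenium)
library(rvest)

# Assumes a Selenium server is running on localhost:4444
remDr <- remoteDriver(port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.futbin.com/19/players?page=1")

# Click the platform dropdown and pick Xbox (selectors are guesses)
remDr$findElement("css selector", "#platform-dropdown")$clickElement()
remDr$findElement("css selector", "#platform-dropdown .xbox")$clickElement()
Sys.sleep(3)  # allow the table to re-render

# Scrape the rendered HTML with the same XPath as before
tbl <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()

remDr$close()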

R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. I can get it to work just fine for websites with only a few tables, but I can't figure it out for websites with several tables. As an example, take the following page from pro-football-reference: https://www.pro-football-reference.com/boxscores/201609110atl.htm
# To get just the box score by quarter, which is the first table:
URL <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"
URL <- GET(URL)
SnapTable <- readHTMLTable(rawToChar(URL$content), stringsAsFactors = FALSE)[[1]]
# Return the number of tables:
AllTables <- readHTMLTable(rawToChar(URL$content), stringsAsFactors = FALSE)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
If it comes down to web scraping in R, make intensive use of the rvest package.
While getting hold of the HTML is the easy part, rvest works with CSS selectors, and SelectorGadget helps you find a styling pattern for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for instead of relying on their position by coincidence. (On pro-football-reference specifically, most of the later tables are wrapped in HTML comments, which is why readHTMLTable only sees the first two; see the sketch after the code below.)
To get you started, read the rvest vignette for more detailed information.
# install.packages("rvest")
library(rvest)
library(magrittr)
# Store the web URL
fb_url <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"
linescore <- fb_url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
  html_table()
Hope this helps.
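Following up on the comment-wrapped tables mentioned above: one common workaround is to parse the comment nodes themselves and re-read them as HTML. This is a sketch; verify which positions in the resulting list hold the Starters and Officials tables.
library(rvest)
library(magrittr)

page <- read_html("https://www.pro-football-reference.com/boxscores/201609080den.htm")

# Pull out the comment nodes, re-parse their text, and extract all tables
hidden <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_table(fill = TRUE)

length(hidden)  # should now include the Starters and Officials tables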
