Get table from HTML with htmltab - R

I am trying to get a table from a website into R. The code that I am currently running is:
library(htmltab)
url1 <- 'https://covid19-dashboard.ages.at/dashboard_Hosp.html'
TAB <- htmltab(url1, which = "//table[@id = 'tblIcuTimeline']")
This XPath seems to select the correct table, because the variables are the ones I want, but the table comes back empty. It might be a problem with my XPath. The error that I am getting is:
No encoding supplied: defaulting to UTF-8.
Error in Node[[1]] : subscript out of bounds

This website has a convenient JSON file available, which you can extract like so:
library(jsonlite)
url <- "https://covid19-dashboard.ages.at/data/JsonData.json"
ll <- jsonlite::fromJSON(txt = url)
From there you can subset and extract what you want. My guess is that you are after ll$CovidFallzahlen. My German is not so good, so I couldn't isolate the exact values you are after.
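If it helps, here is a minimal sketch of how you might inspect that object before subsetting; I cannot confirm the exact column names inside CovidFallzahlen, so check them with str() first:
# Continuing from ll above: inspect the JSON before picking columns
names(ll)                 # top-level elements of the JSON
str(ll$CovidFallzahlen)   # structure and column names of the candidate table
hosp <- ll$CovidFallzahlen
head(hosp)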

The problem is (probably) that the table is empty in the initial HTML and only gets filled on page load by JavaScript. So on a direct request of the page (using your code), the table is still empty.
Below is an RSelenium approach that results in a list all.table with all of the filled tables. Pick the one you need.
Requirement: Firefox is installed.
library(RSelenium)
library(rvest)
library(xml2)
#setup driver, client and server
driver <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
server <- driver$server
browser <- driver$client
#goto url in browser
browser$navigate("https://covid19-dashboard.ages.at/dashboard_Hosp.html")
#get all tables
doc <- xml2::read_html(browser$getPageSource()[[1]])
all.table <- rvest::html_table(doc)
#close everything down properly
browser$close()
server$stop()
# needed, else the port 4545 stays occupied by the java process
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)

Related

Scraping a Webpage with RSelenium

I am trying to scrape this page: https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer and return the player props on the page in some sort of workable table within R that I can clean into a final result.
I am working with the RSelenium package in combination with the tidyverse and rvest in order to scrape this info into R. I have had success on other pages on this website in the past, but can't seem to crack this one.
I've gotten as far as inspecting the webpage down to the most granular <div> that contains the entire list of players on the page, and copying the corresponding XPath from that line of the inspection.
My code looks as such:
# Run this code to scrape the player props for goals from draftkings
library(tidyverse)
library(RSelenium)
library(rvest)
# start up local selenium server
rD <- rsDriver(browser = "chrome", port=6511L, chromever = "96.0.4664.45")
remote_driver <- rD$client
# Open chrome
remote_driver$open()
# Navigate to URL
url <- "https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer"
remote_driver$navigate(url)
# Find the table via the XML path
table_xml <- remote_driver$findElement(using = "xpath", value = "//*[@id='root']/section/section[2]/section/div[3]/div/div[3]/div/div/div[2]/div")
# Locates the table, turns it into a list, and binds into a single dataframe
player_prop_table <- table_xml$getElementAttribute("innerHTML")
That last line, instead of returning a workable list, tibble, or data frame like I'm used to, returns a large list that contains the same values I see in the Chrome inspect tool.
What am I missing here in terms of successfully scraping this page?
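For what it's worth, getElementAttribute("innerHTML") always comes back as a length-one list holding a raw HTML string, so it still has to be parsed; a rough sketch of one way to pull text out of it with rvest (the CSS selector is only a placeholder; you would need to inspect which tags actually wrap the player names and odds):
# innerHTML is a length-one list holding a single HTML string
html_string <- player_prop_table[[1]]
parsed <- read_html(html_string)
# Placeholder selector -- inspect the page for the tags that wrap names and odds
props <- html_text2(html_elements(parsed, "div"))
head(props)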

How to download embedded PDF files from webpage using RSelenium?

EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message: Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here: Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach described here, https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") generates an empty vector.
I am not used to web scraping codes in R, so I cannot figure out what is happening.
I appreciate any help!
In case someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open the PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Need sometime to finish download and then close the window
remote_driver$close() # Close the window
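If you also want the PDFs to land in a specific folder rather than the browser's default download directory, the Firefox profile above can be extended with the standard download preferences; a sketch, where the directory path is just a placeholder:
# Optional: send downloads to a specific folder (the path is a placeholder)
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf',
  "browser.download.folderList" = 2L,   # 2 = use the custom directory below
  "browser.download.dir" = "C:/Users/me/Downloads/cvm_pdfs"))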

Unable to download a file with rvest/httr after submitting search form

This seems like a simple problem but I've been struggling with it for a few days. This is a minimum working example rather than the actual problem:
This question seemed similar, but I was unable to use its answer to solve my problem.
In a browser, I go to this url, and click on [Search] (no need to make any choices from the lists), and then on [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)
url1 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <-html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
url2 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 and adding the format query does initiate the download (after doing the search from url1)
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help !
You are almost there and your intuition is correct about using the session info.
You just need to use rvest::jump_to to navigate to the second url and then write it to disk:
library(rvest)
url1 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <-html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
#### The above is your original code - below is the additional code you need:
download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
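As a quick sanity check you can read the downloaded file straight back into R (readxl is an extra dependency, not part of the code above):
# Quick check that the download worked (readxl is an extra dependency)
results <- readxl::read_xlsx("down.xlsx")
head(results)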

Using R to mimic “clicking” a download file button on a webpage

There are two parts to my question, as I explored two methods in this exercise; however, I succeeded with neither. I would greatly appreciate it if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt I'm using it correctly.
[PART 2:]
I realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage, but I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata <- function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res <- read.csv(resfile)
  return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started off with body = NULL, assuming it would apply no filtering. However, the result is still unsatisfactory. The downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call that returns JSON. You can find this in the network tab of the browser dev tools.
The following returns that content. I find the total number of pages of results and loop, combining the data frame returned from each call into one final data frame containing all the results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  for(i in seq(1, num_pages - 1)){   # pagestart is 0-indexed and page 0 was fetched above
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
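Since the original goal was to pull a particular business day, the hard-coded 20190708 in the URLs above can be turned into a parameter; a sketch of one way to wrap the same logic in a function (the endpoint and query parameters are taken as-is from the code above):
library(jsonlite)
crawlSGXdata <- function(date){   # date in "yyyymmdd" form, e.g. "20190708"
  page_url <- function(page) paste0(
    "https://api.sgx.com/negotiatedlargetrades/v1.0",
    "?order=asc&orderby=contractcode&category=futures",
    "&businessdatestart=", date, "&businessdateend=", date,
    "&pagestart=", page, "&pageSize=250")
  r <- jsonlite::fromJSON(page_url(0))
  df <- r$data
  num_pages <- r$meta$totalPages
  if(num_pages > 1){
    for(i in seq_len(num_pages - 1)){   # remaining 0-indexed pages
      df <- rbind(df, jsonlite::fromJSON(page_url(i))$data)
    }
  }
  df
}
sgx_20190708 <- crawlSGXdata("20190708")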

Scraping with RSelenium and understanding html tags

I've been learning how to scrape the web using RSelenium by trying to gather sports data. I have a difficult time understanding how to get certain elements based on their tags. In the following example, I get the player names that I want, but only the top 28. I don't understand why, since when I inspect the lower elements, they have similar XPaths. Example:
library(rvest)
library(RCurl)
library(RSelenium)
library(XML)
library(dplyr)
# URL
rotoURL = as.character("https://rotogrinders.com/lineuphq/nba?site=draftkings")
# Start remote driver
remDrall <- rsDriver(browser = "chrome", verbose = F)
remDr <- remDrall$client
remDr$open(silent = TRUE)
Sys.sleep(1)
# Go to URL
remDr$navigate(rotoURL)
Sys.sleep(3)
# Get player names and clean
plyrNms <- remDr$findElement(using = "xpath", "//*[@id='primary-pane']/div/div[3]/div/div/div/div")
plyrNmsText <- plyrNms$getElementAttribute("outerHTML")[[1]]
plyrNmsClean <- htmlTreeParse(plyrNmsText, useInternalNodes=T)
plyrNmsCleaner <- trimws(unlist(xpathApply(plyrNmsClean, '//a', xmlValue)))
plyrNmsCleaner <- plyrNmsCleaner[!plyrNmsCleaner=='']
If you run this, you'll see that the list stops at Ben McLemore, even though there are 50+ names below. I tried this code yesterday as well and it still limited me to 28 names, which tells me the 28 isn't arbitrary.
What part of my code is preventing me from grabbing all of the names? I'm assuming it has to do with findElement, but I've tried a hundred different XPaths with the InspectorGadget HTML selector tool, and nothing seems to work. Any help would be much appreciated!
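One thing worth checking (a guess rather than a confirmed fix): findElement() only returns the first node that matches the XPath, so if the player names are split across several sibling containers you will only ever see the first block. Grabbing every matching link with findElements() and pulling the text from each lets you test whether the cutoff comes from that single container node or from the page only rendering part of the list; the CSS selector below is an assumption to adjust against the actual markup:
# Grab every player link under the primary pane, not just the first container
plyrElems <- remDr$findElements(using = "css", "#primary-pane a")
plyrNames <- unlist(lapply(plyrElems, function(x) x$getElementText()[[1]]))
plyrNames <- trimws(plyrNames[plyrNames != ""])
length(plyrNames)
head(plyrNames, 60)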
