I am trying to extract URLs from the website below. The tricky thing is that the site automatically loads new pages as you scroll. I have not managed to write an XPath that scrapes all URLs, including those on the newly loaded pages - I only get the first 15 URLs (of more than 70). I assume the XPath in the last line (new_results...) is missing some crucial element to account for the later pages as well. Any ideas? Thank you!
# load packages
library(rvest)
library(httr)
library(RCurl)
library(XML)
library(stringr)
library(xml2)
# aim: download all speeches stored at:
# https://sheikhmohammed.ae/en-us/Speeches
# first, create vector which stores all urls to each single speech
all_links <- character()
new_results <- "/en-us/Speeches"
signatures = system.file("CurlSSL", cainfo = "cacert.pem", package = "RCurl")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
while(length(new_results) > 0){
  new_results <- str_c("https://sheikhmohammed.ae", new_results)
  results <- getURL(new_results, cainfo = signatures)
  results_tree <- htmlParse(results)
  all_links <- c(all_links, xpathSApply(results_tree, "//div[@class='speech-share-board']", xmlGetAttr, "data-url"))
  new_results <- xpathSApply(results_tree, "//div[@class='speech-share-board']//after", xmlGetAttr, "data-url")
}
# or, alternatively, with phantomjs (this also loads only the first 15 urls):
url <- "https://sheikhmohammed.ae/en-us/Speeches#"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html")
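The saved page source can then be parsed with rvest; a minimal sketch (assuming the links carry the same data-url attribute targeted by the XPath above):
library(rvest)
# read the page source saved by phantomjs and pull out the data-url attributes
page <- read_html("scrape.html")
links <- page %>%
  html_nodes(xpath = "//div[@class='speech-share-board']") %>%
  html_attr("data-url")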
Running the JavaScript for the lazy loading in RSelenium, or in Selenium in Python, would be the most elegant way to solve the problem. As a less elegant but faster alternative, you can manually change the settings of the JSON query in the Firefox developer tools' network tab so that it loads not just 15 but all speeches at once. This worked fine for me, and I was able to extract all the links from the JSON response.
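For illustration, a rough sketch of that idea in R (the endpoint path, parameter names and link field below are assumptions - copy the real ones from the lazy-load request shown in the network tab):
library(httr)
library(jsonlite)
# hypothetical endpoint and parameters - inspect the actual request the page makes
resp <- GET("https://sheikhmohammed.ae/en-us/Speeches/LoadSpeeches",  # assumed path
            query = list(pageIndex = 1, pageSize = 100))              # assumed parameter names
speeches <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# the column holding the links is also an assumption - check str(speeches) first
all_links <- speeches$Url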
I am trying to get a table from a website into R. The code that I am currently running is:
library(htmltab)
url1 <- 'https://covid19-dashboard.ages.at/dashboard_Hosp.html'
TAB <- htmltab(url1, which = "//table[@id = 'tblIcuTimeline']")
This is selecting the correct table because the variables are the ones I want but the table is empty. It might be a problem with my XPath. The error that I am getting is:
No encoding supplied: defaulting to UTF-8.
Error in Node[[1]] : subscript out of bounds
This website has a convenient JSON file available, which you can extract like so:
library(jsonlite)
url <- "https://covid19-dashboard.ages.at/data/JsonData.json"
ll <- jsonlite::fromJSON(txt = url)
From there you can subset and extract what you want. My guess is you are after ll$CovidFallzahlen. My German is not so good, so I couldn't isolate the exact values you are after.
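A small sketch of how one might poke around in that object (the column name and value below are guesses; check with str() first):
# inspect the structure, then drill down to the piece you need
str(ll, max.level = 1)
str(ll$CovidFallzahlen)
# e.g. keep only the nationwide rows, assuming a 'Bundesland' column with an 'Alle' level exists
austria_wide <- subset(ll$CovidFallzahlen, Bundesland == "Alle")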
The problem is (probably) that the table is empty in the page source you get from a direct request; it is only filled by JavaScript after the page loads in a browser. So on the initial approach (using your code), the table is still empty.
Below is an RSelenium approach that results in a list all.table containing all the filled tables. Pick the one you need.
Requirement: Firefox is installed.
library(RSelenium)
library(rvest)
library(xml2)
#setup driver, client and server
driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE )
server <- driver$server
browser <- driver$client
#goto url in browser
browser$navigate("https://covid19-dashboard.ages.at/dashboard_Hosp.html")
#get all tables
doc <- xml2::read_html(browser$getPageSource()[[1]])
all.table <- rvest::html_table(doc)
#close everything down properly
browser$close()
server$stop()
# needed, else the port 4545 stays occupied by the java process
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
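Since html_table() returns one data frame per <table> in the page, all.table is a list. A minimal sketch of picking the ICU table directly by its id instead (assuming the tblIcuTimeline id from your XPath is present in the rendered page):
# target the table by its id instead of guessing its position in the list
icu_table <- rvest::html_node(doc, "#tblIcuTimeline") %>% rvest::html_table(fill = TRUE)
head(icu_table)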
This seems like a simple problem, but I've been struggling with it for a few days. This is a minimal working example rather than the actual problem.
This question seemed similar, but I was unable to use the answer to solve my problem.
In a browser, I go to this url, and click on [Search] (no need to make any choices from the lists), and then on [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)
library(httr)   # for GET() below
url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 and adding the format query does initiate the download (after doing the search from url1).
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help !
You are almost there and your intuition is correct about using the session info.
You just need to use rvest::jump_to to navigate to the second URL and then write the response to disk:
library(rvest)
url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
#### The above is your original code - below is the additional code you need:
download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
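Note that in rvest 1.0 and later these session functions were renamed; if the code above throws deprecation errors or warnings, a sketch of the equivalent calls would be:
sesh1    <- session(url1)
form1    <- html_form(sesh1)[[1]]
subform  <- session_submit(sesh1, form1)
download <- session_jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")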
How can I check my session cookies and specify those cookies before making a subsequent web request?
I want to scrape a page but I cannot submit the cookies.
I'm using the rvest library.
My code:
library(rvest)
WP <- html_session("http://www.wp.pl/")
headers <- httr::headers(WP)
cookies <- unlist(headers[names(headers) == "set-cookie"])
crumbs <- stringr::str_split_fixed(cookies, "; ", 4)
# method 1
stringr::str_split_fixed(crumbs[, 1], "=", 2)
# method 2
cookies(WP)
How do I set my cookies to do the web scraping?
Keep in mind that rvest is built on top of the httr library.
For some reason that I can't explain, this code didn't work until I rebooted RStudio.
Here's some code that'll do the trick:
library(httr)
library(rvest)
httr::GET("http://www.wp.pl/",
set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
`__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
`__utmb` = "29983421.5.10.1413649536",
`__utmc` = "29983421",
`__utmt` = "1",
`__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)")) %>%
  read_html() %>%           # sample rvest code
  html_table(fill = TRUE)   # sample rvest code
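If you'd rather not hard-code the cookie values, here is a sketch of reusing whatever cookies the first request returned (httr::cookies() gives a data frame with name and value columns):
library(httr)
library(rvest)
first <- GET("http://www.wp.pl/")
ck <- cookies(first)                                # data frame of the session cookies
cookie_args <- setNames(as.list(ck$value), ck$name) # named values for set_cookies()
second <- GET("http://www.wp.pl/", do.call(set_cookies, cookie_args))
content(second, as = "text", encoding = "UTF-8") %>%
  read_html() %>%
  html_table(fill = TRUE)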
I am attempting to scrape this website using the rvest package in R. I have done it successfully with several other websites, but this one doesn't seem to work and I am not sure why.
I copied the XPath from inside Chrome's inspector tool, but when I specify it in the rvest script it shows that it doesn't exist. Does it have anything to do with the fact that the table is generated dynamically and not static?
Appreciate the help!
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
a<-read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a <- html_node(a, xpath = "//*[@id='indicator10']")
a<-html_table(a)
a
Regarding your question: yes, you are unable to get it because the table is being generated dynamically. In these cases, it's better to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# specify the url of the website to be scraped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <- html_node(html_obj, xpath = "//*[@id='indicator10']")
I guess that you are trying to get the first table. In that case, maybe it's better to just get the table with html_table:
# get the table with the indicator10 id
indicator10_table <-html_node(html_obj, "#indicator10 table") %>% html_table()
I'm using the CSS selector this time instead of the XPath.
Hope it helps! Happy scraping!
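One small addition: the Selenium server started via shell() keeps running after the script finishes, so it is worth closing the browser session when you are done (and, on Windows, killing the java process if the port stays occupied, as in the taskkill call shown earlier):
# close the browser session when finished
remDr$close()
# if port 4567 stays occupied by the java process (Windows):
# system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)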
I am trying to use an R script hosted on GitHub, plugin-draw.R. How should I use this plugin?
You can simply use source_url from the devtools package:
library(devtools)
source_url("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")
Based on @Matifou's reply, but using the "new" method of appending ?raw=TRUE at the end of your URL:
devtools::source_url("https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/bingSearchXScraper/bingSearchXScraper.R?raw=TRUE")
You can use solution offered on R-Bloggers:
source_github <- function(u) {
# load package
require(RCurl)
# read script lines from website
script <- getURL(u, ssl.verifypeer = FALSE)
# parse the lines and evaluate them
eval(parse(text = script))
}
source_github("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")
For the script to be evaluated in the global environment (I'm guessing that you will prefer this solution), you can use:
source_https <- function(u, unlink.tmp.certs = FALSE) {
# load package
require(RCurl)
# read script lines from website using a security certificate
if(!file.exists("cacert.pem")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
script <- getURL(u, followlocation = TRUE, cainfo = "cacert.pem")
if(unlink.tmp.certs) unlink("cacert.pem")
# parse the lines and evaluate them in the global environment
eval(parse(text = script), envir= .GlobalEnv)
}
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/bingSearchXScraper/bingSearchXScraper.R")
source_https("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R", unlink.tmp.certs = TRUE)
As mentioned in the original article by Tony Breyal, this discussion on SO should also be credited, as it is relevant to the discussed question.
If it is a link on GitHub where you can click on Raw (next to Blame), you can actually just use ordinary base::source. Go to the R script of your choice and find the Raw button.
The link will now contain raw.githubusercontent.com, and the page shows nothing but the R script itself. Then, for this example,
source(
paste0(
"https://raw.githubusercontent.com/betanalpha/knitr_case_studies/master/",
"stan_intro/stan_utility.R"
)
)
(paste0 was used just to fit the URL into a narrower screen.)