Scrape web page with splashr and load more button - r

I'm trying to scrape a few 1801 census pages with splashr that may have 0 to many Load More buttons (since 50 records are loaded at a time). This page should have 174.
url <- "https://digitalarkivet.no/en/census/district/tf01058443000001"
doc <- splash("localhost") %>% render_html(url, wait = 3)
html_nodes(doc, xpath="//h4[not(@class)]/a") %>% length()
[1] 50
I tried following the url by Load More, but that just gets the first 50 records again.
url2 <- html_nodes(doc, xpath="//div[@class='load-more']") %>% html_attr("data-url")
[1] "https://digitalarkivet.no/en/census/related/rural-residences/tf01058443000001?page=2"
Note that most districts have fewer than 50 records, so I don't need to click load more for every page.

Thx for trying the splashr package (I'm the author).
Thankfully, you won't need it in this case. The data load is done through XHR requests which we can mimic in R:
library(httr)
library(rvest)
census_page <- function(district, page=1L) {
  GET(
    url = "https://digitalarkivet.no",
    path=sprintf("en/census/related/rural-residences/%s", district),
    accept_json(),
    add_headers(
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.45 Safari/537.36",
      Referer = "https://digitalarkivet.no/en/census/district/tf01058443000001",
      `X-Requested-With` = "XMLHttpRequest"
    ),
    query = list(page=page)
  ) -> res
  stop_for_status(res)
  res <- content(res)
  list(
    divs = read_html(res$view),
    next_page = parse_url(res$nextPage)$query$page
  )
}
Now, just pass-in the district and page of data you want:
res <- census_page("tf01058443000001", 1)
And get the results:
str(res, 1)
## List of 2
## $ divs :List of 2
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
## $ next_page: chr "2"
The function returns a list with:
- divs, the parsed content containing the <div>s of the info you want
- next_page, which can be passed to another call of the function
I didn't try it through to the end (i.e. I don't know if there will always be a 'next page') and you will need to extract the data from the <div>s on your own, but this will help you avoid a third-party dependency.
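One way to follow next_page until the records run out is a small loop around the function above. This is only a sketch: collect_pages() and its fetch and max_pages arguments are names made up here, and it assumes the server stops returning a usable next page on the last one, which was not verified. If census_page() itself errors on the last page, the call would need a tryCatch() around it.

```r
# Hypothetical helper (not part of httr or rvest): keep calling a page fetcher
# until it stops reporting a next page. `fetch` is any function shaped like
# census_page(): it takes (district, page) and returns list(divs, next_page).
collect_pages <- function(fetch, district, max_pages = 100L) {
  pages <- list()
  page <- 1L
  while (length(pages) < max_pages) {
    res <- fetch(district, page)
    pages[[length(pages) + 1L]] <- res$divs
    nxt <- suppressWarnings(as.integer(res$next_page))
    # Assumption: a missing, empty, or non-increasing next_page means "done".
    if (length(nxt) != 1L || is.na(nxt) || nxt <= page) break
    page <- nxt
  }
  pages
}

# Usage, assuming census_page() from above is defined:
# all_divs <- collect_pages(census_page, "tf01058443000001")
```

The max_pages cap is there only so a misbehaving endpoint cannot keep the loop running forever.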

Related

Click button using R + httr

I'm trying to scrape randomly generated names from a website.
library(httr)
library(rvest)
url <- "https://letsmakeagame.net//tools/PlanetNameGenerator/"
mywebsite <- read_html(url) %>%
html_nodes(xpath="//div[contains(@id,'title')]")
However, that does not work. I'm assuming I have to «click» the «generate» button before extracting the content. Is there a simple way (without RSelenium) to achieve that?
Something similar to:
POST(url,
body = list("EntryPoint.generate()" = T),
encode = "form") -> res
res_t <- content(res, as="text")
Thanks!
rvest isn't much help here: planet names are not requested from a remote service but generated locally with JavaScript, which is what the EntryPoint.generate() call does. A relatively simple way is to use chromote, though its session/process closing seems kind of messy at the moment:
library(chromote)
b <- ChromoteSession$new()
{
b$Page$navigate("https://letsmakeagame.net/tools/PlanetNameGenerator")
b$Page$loadEventFired()
}
# call EntryPoint.generate(), read result from <p id="title"></p> element,
# replicate 10x
replicate(10, b$Runtime$evaluate('EntryPoint.generate();document.getElementById("title").innerText')$result$value)
#> [1] "Torade" "Ukiri" "Giconerth" "Dunia" "Brihoria"
#> [6] "Tiulaliv" "Giahiri" "Zuthewei 4A" "Elov" "Brachomia"
b$close()
#> [1] TRUE
b$parent$close()
#> Error in self$send_command(msg, callback = callback_, error = error_, : Chromote object is closed.
b$parent$get_browser()$close()
#> [1] TRUE
Created on 2023-01-25 with reprex v2.0.2

Webscraping with R a continuous page with "view more"

I'm new to R and need to scrape the titles and the dates on the posts on this website https://www.healthnewsreview.org/news-release-reviews/
Using rvest I was able to write the basic code to get the info:
url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)
date_data_html <- html_nodes(webpage,'span.date')
date_data <- html_text(date_data_html)
head(date_data)
webpage <- read_html(url)
title_data_html <- html_nodes(webpage,'h2')
title_data <- html_text(title_data_html)
head(title_data)
But since the website only displays 10 items at first, and then you have to click "view more" I don't know how to scrape the whole site. Thank you!!
Introducing third-party dependencies should be done as a last resort. RSelenium (as r2evans posited as the only solution, originally) is not necessary the vast majority of the time, including now. (It is necessary for gosh-awful sites that use horrible tech like SharePoint, since maintaining state without a browser context for that is more pain than it's worth.)
If we start with the main page:
library(rvest)
pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")
We can get the first set of links (10 of them):
pg %>%
html_nodes("div.item-content") %>%
html_attr("onclick") %>%
gsub("^window.location.href='|'$", "", .)
## [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"
## [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
## [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"
## [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"
## [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"
## [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"
## [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"
## [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"
## [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"
I guess you want to scrape the content of those ^^ so have at it.
But, there's that pesky "View more" button.
When you click on it, it issues this POST request:
With curlconverter we can convert it into a callable httr function (which may not exist given the impossibility of this task). We can wrap that function call in another function with a pagination parameter:
view_more <- function(current_offset=10) {
  httr::POST(
    url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest"
    ),
    body = list(
      action = "viewMore",
      current_offset = as.character(as.integer(current_offset)),
      page_id = "22332",
      btn = "btn btn-gray",
      active_filter = "latest"
    ),
    encode = "form"
  ) -> res
  list(
    links = httr::content(res) %>%
      html_nodes("div.item-content") %>%
      html_attr("onclick") %>%
      gsub("^window.location.href='|'$", "", .),
    next_offset = current_offset + 4
  )
}
Now, we can run it (since it defaults to the 10 issued in the first View More click):
x <- view_more()
str(x)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
## $ next_offset: num 14
We can pass that new offset to another call:
y <- view_more(x$next_offset)
str(y)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
## $ next_offset: num 18
You can do the hard part of scraping the initial article count (it's on the main page) and doing the math to put that in a loop and stop efficiently.
NOTE: If you are doing this scraping to archive the complete site (whether for them or independently) since it's dying at the end of the year, you should comment to that effect and I have better suggestions for that use-case than manual coding in any programming language. There are free, industrial "site preservation" frameworks designed to preserve these types of dying resources. If you just need the article content, then an iterator and custom scraper is likely a 👍🏼 (but, apparently impossible) choice.
NOTE also that the pagination increment of 4 is what the site does when you literally press the button, so this just mimics that functionality.
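The "doing the math" part could look roughly like this. It is a sketch, not something tested against the site: view_more_offsets() is a made-up helper, and it assumes the first page shows 10 items and each click adds 4, as described above. The total article count still has to be scraped from the main page.

```r
# Hypothetical helper: the offsets to pass to view_more(), given the total
# number of articles. Assumes 10 items on the initial page and 4 per click.
view_more_offsets <- function(total, first = 10, step = 4) {
  # Nothing left to request if everything fits on the first page.
  if (total <= first) return(numeric(0))
  seq(from = first, to = total - 1, by = step)
}

view_more_offsets(22)
# [1] 10 14 18

# Usage, assuming view_more() from above and a scraped `total`:
# links <- unlist(lapply(view_more_offsets(total), function(o) view_more(o)$links))
```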

web scraping multiple web pages with different directory strings in r with rvest

I know there are a lot of questions similar to this, but I haven't found one that asks this (please forgive me if I am wrong). I am trying to scrape a website for weather data and I was successful at doing so for one of the web pages. However, I would like to loop the process. I have looked at similar questions, but I don't believe they solve my problem.
The directory changes slightly at the end, from
http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=avgt
to
http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=pcpn
and so on. How could I loop through them even though they aren't increasing by numbers?
Code:
nj_weather_data<-read_html("http://climate.rutgers.edu/stateclim_v1/nclimdiv/")
### Get info you want from web page###
hurr<-html_nodes(nj_weather_data,"#climdiv_table")
### Extract info and turn into dataframe###
precip_table<-as.data.frame(html_table(hurr))%>%
select(-Rank)
Assuming you want average T, minimum T, precipitation... Look at the way the url changes when you click any of the links in the table above the temperature table. This is done through javascript, and in order to obtain that you would have to load the page through some sort of (headless) browser such as phantomJS.
Another way is to just get the names for individual page and append it to the url and load the data.
library(rvest)
# notice the %s at the end - this is replaced by elements of cs in sprintf
# statement below
x <- "http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=%s"
cs <- c("mint", "avgt", "pcpn", "hdd", "cdd")
# you could paste together new url using paste, too
customstat <- sprintf(x, cs) # %s is replaced with mint, avgt...
# prepare empty object for results
out <- vector("list", length(customstat))
names(out) <- cs
# get individual table and insert it into the output
for (i in customstat) {
  out[[which(i == customstat)]] <- read_html(i) %>%
    html_nodes("#climdiv_table") %>%
    html_table() %>%
    .[[1]]
}
> str(out)
List of 5
$ mint:'data.frame': 131 obs. of 15 variables:
..$ Rank : logi [1:131] NA NA NA NA NA NA ...
..$ Year : chr [1:131] "1895" "1896" "1897" "1898" ...
..$ Jan : chr [1:131] "18.1" "18.6" "18.7" "23.2" ...
..$ Feb : chr [1:131] "11.7" "20.7" "22.5" "22.1" ...
You can now glue together tables (e.g. using do.call(rbind, out)) or do whatever is required for your analysis.
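On its own, do.call(rbind, out) leaves the element name only in the row names. A minimal sketch that keeps it as a column instead; stack_tables() and the elem column name are invented for illustration, and it assumes every table has the same columns:

```r
# Hypothetical helper: tag each table with the name of its list element,
# then stack them into one data frame.
stack_tables <- function(tables) {
  tagged <- Map(
    function(tbl, nm) {
      data.frame(elem = nm, tbl, stringsAsFactors = FALSE, check.names = FALSE)
    },
    tables, names(tables)
  )
  stacked <- do.call(rbind, tagged)
  rownames(stacked) <- NULL  # drop the "mint.1"-style row names rbind adds
  stacked
}

# Usage, assuming `out` from the loop above:
# all_stats <- stack_tables(out)
```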

Rvest unexpectedly stopped working - Scraping tables

After having used the following script for a while, it suddenly stopped working. I constructed a simple function that finds a table - based on its xpath - within a web page.
library(rvest)
url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'
find_table <- function(x){read_html(x) %>%
html_nodes(xpath = '//*[@id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()}
table <- find_table(url)
I also tried to use httr::GET before read_html, passing the following argument:
query = list(r_date = "2017-12-22")
but nothing changed. Any ideas?
Well, that code doesn't work since you missed a ) in the url <- line.
We'll add in httr:
library(httr)
library(rvest)
url is the name of a base function. Using base function names as variables can make problems in code hard to debug. Unless you write perfect code, it's a good idea not to use the names that way.
URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')
I don't know if you know the "rules" about web scraping but if you're making repeated requests to this site, then a "crawl delay" should be used. They don't have one set in their robots.txt so 5 seconds is the accepted alternative. I point this out as you may be getting rate limited.
find_table <- function(x, crawl_delay=5) {
  Sys.sleep(crawl_delay) # you can put this in a loop vs here if you aren't often doing repeat gets
  # switch to httr::GET so you can get web server interaction info.
  # since you're scraping, it's expected that you use a custom user agent
  # that also supplies contact info.
  res <- GET(x, user_agent("My scraper"))
  # check to see if there's a non HTTP 200 response which there may be
  # if you're getting rate-limited
  stop_for_status(res)
  # now, try to do the parsing. It looks like you're trying to target a
  # single table, so i switched it from `html_nodes()` to `html_node()` since
  # the former returns a list-like node set and the pipe can error out if it
  # holds more than one element.
  content(res, "parsed") %>%
    html_node(xpath = '//*[@id="center"]/table[2]') %>%
    html_table() %>%
    as.data.frame()
}
table is a base function name, too (see above)
result <- find_table(URL)
Worked fine for me:
str(result)
## 'data.frame': 11 obs. of 5 variables:
## $ ENTI EROGATORI : chr "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
## $ : logi NA NA NA NA NA NA ...
## $ ACCENSIONE ACCERTAMENTI : chr "4.638.500,83" "0,00" "0,00" "953.898,47" ...
## $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
## $ RIMBORSO IMPEGNI : chr "438.696,57" "975,07" "45.584,55" "182.897,01" ...

Extract table from

I would like to extract the following table using rvest from http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp (for any date):
I tried the following but failed to produce any result:
library(rvest)
url <- "http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp"
htmlSession <-html_session(url) ## create session
goForm <- html_form(htmlSession)[[2]] ## pull form from session
#filledGoForm <- set_values(goForm, value="04/26/2017") # This does not work
filledGoForm <- goForm
filledGoForm$fields[[1]]$value <- "04/26/2017"
htmlSession <- submit_form(htmlSession, filledGoForm)
> htmlSession <- submit_form(htmlSession, filledGoForm)
Submitting with ''
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Not Found (HTTP 404).
Any hints on how to do this highly appreciated.
That site uses many XHR requests to populate the tables. And, it establishes a server session with a hidden POST request which won't be replicated with html_session().
We'll need to add in httr for some help:
library(httr)
library(rvest)
The first thing we need to do is to just hit the site to get an initial qs_wid cookie into the implicit cookie jar curl/httr/rvest share:
init <- GET("http://finra-markets.morningstar.com/MarketData/Default.jsp")
Next, we need to mimic the hidden "login" that the web page does:
nxt <- POST(url = "http://finra-markets.morningstar.com/finralogin.jsp",
body = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"),
encode = "form")
That creates a session on the server back-end and places a few other cookies in our cookie jar.
Finally:
GET(
  url = "http://finra-markets.morningstar.com/transferPage.jsp",
  query = list(
    `path`="http://muni-internal.morningstar.com/public/MarketBreadth/C",
    `date`="04/24/2017",
    `_`=as.numeric(Sys.time())
  )
) -> res
makes the request. You can make a function out of all three steps (together) and parameterize that last GET.
Unfortunately, that returns a very broken HTML <table> that html_table() can't translate into a data frame automagically for you, but that shouldn't stop you:
library(dplyr) # needed for as_data_frame(), mutate_all(), rename() and mutate()
content(res) %>%
  html_nodes("td") %>%
  html_text() %>%
  matrix(ncol=4, byrow=TRUE) %>%
  as_data_frame() %>%
  mutate_all(as.numeric) %>%
  rename(all_issues=V1, investment_grade=V2, high_yield=V3, convertible=V4) %>%
  mutate(category = c("total_issues_traded", "advances", "declines", "unchanged", "high_52", "low_52", "dollar_volume"))
## # A tibble: 7 × 5
## all_issues investment_grade high_yield convertible category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 7983 5602 2194 187 total_issues_traded
## 2 3025 1798 1100 127 advances
## 3 4448 3575 824 49 declines
## 4 124 42 75 7 unchanged
## 5 257 66 175 16 high_52
## 6 139 105 33 1 low_52
## 7 22601 16143 5742 715 dollar_volume
To get the other data tables, go to the Developer Tools option in your browser (switch to one that has it if yours doesn't … you're likely on Windows given that you're doing finance things and IE/Edge aren't very good browsers for introspection) and refresh the page to see the other requests that get made.
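Wrapping the three steps into one function, as suggested above, might look something like this. It is a sketch only: trace_market_breadth() is a made-up name, the endpoints and parameters are copied from the requests shown earlier, and whether the site still responds this way is an assumption.

```r
# Hypothetical wrapper around the three requests. httr reuses one handle per
# host within an R session, so cookies set by steps 1 and 2 carry over to 3.
trace_market_breadth <- function(date) {
  base <- "http://finra-markets.morningstar.com"
  # 1. hit the site once to pick up the initial cookie
  httr::GET(paste0(base, "/MarketData/Default.jsp"))
  # 2. the hidden "login" that creates the server-side session
  httr::POST(
    url = paste0(base, "/finralogin.jsp"),
    body = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"),
    encode = "form"
  )
  # 3. the data request, parameterized by date (same format as "04/24/2017")
  res <- httr::GET(
    url = paste0(base, "/transferPage.jsp"),
    query = list(
      path = "http://muni-internal.morningstar.com/public/MarketBreadth/C",
      date = date,
      `_` = as.numeric(Sys.time())
    )
  )
  httr::stop_for_status(res)
  res
}

# Usage:
# res <- trace_market_breadth("04/24/2017")
```

The returned response can then go through the same content(res) %>% html_nodes("td") pipeline as before.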

Resources