Webscraping with R a continuous page with "view more" - r

I'm new to R and need to scrape the titles and the dates on the posts on this website https://www.healthnewsreview.org/news-release-reviews/
Using rvest I was able to write the basic code to get the info:
url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)
date_data_html <- html_nodes(webpage,'span.date')
date_data <- html_text(date_data_html)
head(date_data)
webpage <- read_html(url)
title_data_html <- html_nodes(webpage,'h2')
title_data <- html_text(title_data_html)
head(title_data)
But since the website only displays 10 items at first, and then you have to click "view more", I don't know how to scrape the whole site. Thank you!!

Introducing third-party dependencies should be done as a last resort. RSelenium (which r2evans originally posited as the only solution) is not necessary the vast majority of the time, including now. (It is necessary for gosh-awful sites that use horrible tech like SharePoint, since maintaining state without a browser context for that is more pain than it's worth.)
If we start with the main page:
library(rvest)
pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")
We can get the first set of links (10 of them):
pg %>%
html_nodes("div.item-content") %>%
html_attr("onclick") %>%
gsub("^window.location.href='|'$", "", .)
## [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"
## [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
## [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"
## [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"
## [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"
## [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"
## [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"
## [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"
## [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"
I guess you want to scrape the content of those ^^ so have at it.
But, there's that pesky "View more" button.
When you click on it, it issues a POST request to the site's admin-ajax.php endpoint.
With curlconverter we can convert that request into a callable httr function (which may not exist given the impossibility of this task). We can wrap that function call in another function with a pagination parameter:
view_more <- function(current_offset=10) {
httr::POST(
url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
httr::add_headers(
`X-Requested-With` = "XMLHttpRequest"
),
body = list(
action = "viewMore",
current_offset = as.character(as.integer(current_offset)),
page_id = "22332",
btn = "btn btn-gray",
active_filter = "latest"
),
encode = "form"
) -> res
list(
links = httr::content(res) %>%
html_nodes("div.item-content") %>%
html_attr("onclick") %>%
gsub("^window.location.href='|'$", "", .),
next_offset = current_offset + 4
)
}
Now, we can run it (since it defaults to the 10 issued in the first View More click):
x <- view_more()
str(x)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
## $ next_offset: num 14
We can pass that new offset to another call:
y <- view_more(x$next_offset)
str(y)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
## $ next_offset: num 18
You can do the hard part of scraping the initial article count (it's on the main page) and doing the math to put that in a loop and stop efficiently.
NOTE: If you are doing this scraping to archive the complete site (whether for them or independently) since it's dying at the end of the year, you should comment to that effect and I have better suggestions for that use-case than manual coding in any programming language. There are free, industrial "site preservation" frameworks designed to preserve these types of dying resources. If you just need the article content, then an iterator and custom scraper is likely a 👍🏼 (but, apparently impossible) choice.
NOTE also that the pagination increment of 4 is what the site does when you literally press the button, so this just mimics that functionality.
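The loop mentioned above could look roughly like the sketch below. It is not the only way to do it: instead of scraping the total article count, it simply stops when a batch comes back with no links, which assumes the endpoint returns an empty result once the data runs out (not verified here). `collect_links()`, `fake_links`, and `fake_view_more()` are made-up names for illustration; the fake function stands in for `view_more()` so the example runs offline.

```r
# Sketch: keep requesting batches until one comes back empty.
# `fetch_page` is any function shaped like view_more(): it takes an offset
# and returns list(links = <character vector>, next_offset = <number>).
collect_links <- function(fetch_page, start_offset = 10, max_calls = 1000, delay = 5) {
  links <- character()
  offset <- start_offset
  for (i in seq_len(max_calls)) {       # hard cap so the loop cannot run forever
    res <- fetch_page(offset)
    if (length(res$links) == 0) break   # ASSUMED end-of-data signal
    links <- c(links, res$links)
    offset <- res$next_offset
    Sys.sleep(delay)                    # pause between requests
  }
  unique(links)
}

# Offline stand-in for view_more(): serves 10 made-up links, 4 at a time
fake_links <- sprintf("https://example.org/item-%02d", 1:10)
fake_view_more <- function(current_offset = 0) {
  idx <- seq_along(fake_links)
  idx <- idx[idx > current_offset & idx <= current_offset + 4]
  list(links = fake_links[idx], next_offset = current_offset + 4)
}

all_links <- collect_links(fake_view_more, start_offset = 0, delay = 0)
length(all_links)
```

Against the real site you would call `collect_links(view_more)` and prepend the 10 links scraped from the main page. If the endpoint signals the end differently (an error status, say), the stop condition needs to change accordingly.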

Related

Click button using R + httr

I'm trying to scrape randomly generated names from a website.
library(httr)
library(rvest)
url <- "https://letsmakeagame.net//tools/PlanetNameGenerator/"
mywebsite <- read_html(url) %>%
html_nodes(xpath="//div[contains(@id,'title')]")
However, that does not work. I'm assuming I have to «click» the «generate» button before extracting the content. Is there a simple way (without RSelenium) to achieve that?
Something similar to:
POST(url,
body = list("EntryPoint.generate()" = T),
encode = "form") -> res
res_t <- content(res, as="text")
Thanks!
rvest isn't much help here: the planet names are not requested from a remote service but generated locally with JavaScript, which is what the EntryPoint.generate() call does. A relatively simple way is to use chromote, though its session/process closing seems kind of messy at the moment:
library(chromote)
b <- ChromoteSession$new()
{
b$Page$navigate("https://letsmakeagame.net/tools/PlanetNameGenerator")
b$Page$loadEventFired()
}
# call EntryPoint.generate(), read result from <p id="title"></p> element,
# replicate 10x
replicate(10, b$Runtime$evaluate('EntryPoint.generate();document.getElementById("title").innerText')$result$value)
#> [1] "Torade" "Ukiri" "Giconerth" "Dunia" "Brihoria"
#> [6] "Tiulaliv" "Giahiri" "Zuthewei 4A" "Elov" "Brachomia"
b$close()
#> [1] TRUE
b$parent$close()
#> Error in self$send_command(msg, callback = callback_, error = error_, : Chromote object is closed.
b$parent$get_browser()$close()
#> [1] TRUE
Created on 2023-01-25 with reprex v2.0.2

How can I GET youtube video title with httr request?

I'm not a programmer, so I might be tripping over something really silly to solve.
I'm trying to get the title of multiple youtube videos for my research. I recently found the httr package, and I think the GET function reaches this info really well, the problem is that I don't know how to access the response.
I tried
x <- GET("https://www.youtube.com/watch?v=2lAe1cqCOXo")
content(x)
and it gave me this response
{html_document}
<html style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" lang="pt-BR" system-icons="" typography="" typography-spacing="">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http ...
[2] <body>\n<div id="watch7-content" class="watch-main-col" itemscope itemid="" itemtype="h ...
I know that every video title is in the [1] <head> part as:
<title>TITLE OF THE VIDEO - YouTube</title>
or as
<meta name="title" content="TITLE OF THE VIDEO">
Is there a way to browse the content response to extract this information?
Here is another approach that can be considered :
library(rvest)
library(stringr)
vector_URL_Title_To_Extract <- c("https://www.youtube.com/watch?v=DyX5RFSxWOY",
"https://www.youtube.com/watch?v=2lAe1cqCOXo",
"https://www.youtube.com/watch?v=ndTktsXlN7w")
nb_URL <- length(vector_URL_Title_To_Extract)
list_URL_Names <- list()
for(i in 1 : nb_URL)
{
print(i)
webpage <- read_html(vector_URL_Title_To_Extract[i])
webpage_Text <- html_text(webpage)
title <- stringr::str_extract_all(webpage_Text, "(;[^;]*#YouTubeRewind - YouTube)|(;[^;]*- YouTube\\{)")[[1]]
list_URL_Names[[i]] <- title
}
list_URL_Names
[[1]]
[1] ";CAQ celebrates victory in Quebec City, as opposition absorbs loss in Montreal - YouTube{"
[[2]]
[1] ";YouTube Rewind 2019: For the Record | #YouTubeRewind - YouTube"
[[3]]
[1] ";Crazy Capoeira Master Setting the UFC on Fire - Michel Pereira - YouTube{"
The best way to do this would probably be to use YouTube's API, which is designed for this purpose.
But if you know the format of the YouTube page's HTML, you could represent the whole HTML document as a string and use a string function to find the index of the title tag and isolate it that way. Or you could use an HTML parser to parse the HTML and find the data.
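As a minimal sketch of the parser route, here is how rvest can pull the title out of either tag mentioned in the question. The HTML string is a hand-written stand-in, not a real watch page, and the real markup may differ or change over time.

```r
library(rvest)

# Hypothetical, hand-written page standing in for a downloaded watch page
html <- '<html><head>
  <title>Some video title - YouTube</title>
  <meta name="title" content="Some video title">
</head><body></body></html>'

page <- read_html(html)

# Option 1: the <title> element, minus the trailing " - YouTube"
from_title <- page %>% html_node("title") %>% html_text()
from_title <- sub(" - YouTube$", "", from_title)

# Option 2: the content attribute of <meta name="title">
from_meta <- page %>% html_node("meta[name='title']") %>% html_attr("content")

from_title
from_meta
```

Since `content(x)` in the question already prints as an `{html_document}`, the same `html_node()` calls should work on it directly in place of `page`.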
Here is an approach that can be considered :
library(RSelenium)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
vector_URL_Title_To_Extract <- c("https://www.youtube.com/watch?v=DyX5RFSxWOY",
"https://www.youtube.com/watch?v=2lAe1cqCOXo",
"https://www.youtube.com/watch?v=ndTktsXlN7w")
nb_URL <- length(vector_URL_Title_To_Extract)
list_URL_Names <- list()
for(i in 1 : nb_URL)
{
print(i)
remDr$navigate(vector_URL_Title_To_Extract[i])
Sys.sleep(5)
web_Obj <- remDr$findElement("css selector", '#title > h1 > yt-formatted-string')
list_URL_Names[[i]] <- web_Obj$getElementText()
}
list_URL_Names
[[1]]
[[1]][[1]]
[1] "CAQ celebrates victory in Quebec City, as opposition absorbs loss in Montreal"
[[2]]
[[2]][[1]]
[1] "YouTube Rewind 2019: For the Record | #YouTubeRewind"
[[3]]
[[3]][[1]]
[1] "Crazy Capoeira Master Setting the UFC on Fire - Michel Pereira"

web scraping multiple web pages with different directory strings in r with rvest

I know there are a lot of questions similar to this, but I haven't found one that asks this (please forgive me if I am wrong). I am trying to scrape a website for weather data, and I was successful at doing so for one of the web pages. However, I would like to loop the process. I have looked at other questions, but I don't believe they solve my problem.
The URL changes slightly at the end, from http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=avgt to
http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=pcpn
and so on. How could I loop through them even though they aren't increasing by numbers?
Code:
nj_weather_data<-read_html("http://climate.rutgers.edu/stateclim_v1/nclimdiv/")
### Get info you want from web page###
hurr<-html_nodes(nj_weather_data,"#climdiv_table")
### Extract info and turn into dataframe###
precip_table<-as.data.frame(html_table(hurr))%>%
select(-Rank)
Assuming you want average T, minimum T, precipitation... Look at the way the URL changes when you click on any of the options in the table above the temperature table. This is done through JavaScript, and to reproduce that you would have to load the page through some sort of (headless) browser such as PhantomJS.
Another way is to just get the names for the individual pages, append each one to the URL, and load the data.
library(rvest)
# notice the %s at the end - this is replaced by elements of cs in sprintf
# statement below
x <- "http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=%s"
cs <- c("mint", "avgt", "pcpn", "hdd", "cdd")
# you could paste together new url using paste, too
customstat <- sprintf(x, cs) # %s is replaced with mint, avgt...
# prepare empty object for results
out <- vector("list", length(customstat))
names(out) <- cs
# get individual table and insert it into the output
for (i in customstat) {
out[[which(i == customstat)]] <- read_html(i) %>%
html_nodes("#climdiv_table") %>%
html_table() %>%
.[[1]]
}
> str(out)
List of 5
$ mint:'data.frame': 131 obs. of 15 variables:
..$ Rank : logi [1:131] NA NA NA NA NA NA ...
..$ Year : chr [1:131] "1895" "1896" "1897" "1898" ...
..$ Jan : chr [1:131] "18.1" "18.6" "18.7" "23.2" ...
..$ Feb : chr [1:131] "11.7" "20.7" "22.5" "22.1" ...
You can now glue together tables (e.g. using do.call(rbind, out)) or whatever it is required for your analysis.
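A plain `do.call(rbind, out)` loses track of which table each row came from unless you read it back from the row names. One way around that, shown here with small made-up tables in place of the scraped ones, is to tag each table with its element name before stacking. The values below are illustrative only.

```r
# Made-up stand-ins for two of the scraped tables (values are illustrative)
out <- list(
  mint = data.frame(Year = c("1895", "1896"), Jan = c("18.1", "18.6"),
                    stringsAsFactors = FALSE),
  pcpn = data.frame(Year = c("1895", "1896"), Jan = c("3.91", "1.52"),
                    stringsAsFactors = FALSE)
)

# Tag each table with the name of the element it came from, then stack them
tagged <- Map(
  function(df, elem) data.frame(elem = elem, df, stringsAsFactors = FALSE),
  out, names(out)
)
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

# The scraped columns arrive as character, so convert before doing arithmetic
combined$Jan <- as.numeric(combined$Jan)
combined
```

The same pattern should carry over to the real `out` list, as long as all five tables share the same columns.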

Rvest unexpectedly stopped working - Scraping tables

After having used the following script for a while, it suddenly stopped working. I constructed a simple function that finds a table - based on its xpath - within a web page.
library(rvest)
url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'
find_table <- function(x){read_html(x) %>%
html_nodes(xpath = '//*[@id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()}
table <- find_table(url)
I also tried to use httr::GET before read_html, passing the following argument:
query = list(r_date = "2017-12-22")
but nothing changed. Any ideas?
Well, that code doesn't work since you missed a ) in the url <- line.
We'll add in httr:
library(httr)
library(rvest)
url is the name of a base function. Using base function names as variable names can make problems in code hard to debug. Unless you write perfect code, it's a good idea not to use the names that way.
URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')
I don't know if you know the "rules" about web scraping but if you're making repeated requests to this site, then a "crawl delay" should be used. They don't have one set in their robots.txt so 5 seconds is the accepted alternative. I point this out as you may be getting rate limited.
find_table <- function(x, crawl_delay=5) {
Sys.sleep(crawl_delay) # you can put this in a loop vs here if you aren't often doing repeat gets
# switch to httr::GET so you can get web server interaction info.
# since you're scraping, it's expected that you use a custom user agent
# that also supplies contact info.
res <- GET(x, user_agent("My scraper"))
# check to see if there's a non HTTP 200 response which there may be
# if you're getting rate-limited
stop_for_status(res)
# now, try to do the parsing. It looks like you're trying to target a
# single table, so i switched it from `html_nodes()` to `html_node()` since
# the former returns a `list` and the pipe will error out if there's more
# than one list element.
content(res, "parsed") %>%
html_node(xpath = '//*[@id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()
}
table is a base function name, too (see above)
result <- find_table(URL)
Worked fine for me:
str(result)
## 'data.frame': 11 obs. of 5 variables:
## $ ENTI EROGATORI : chr "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
## $ : logi NA NA NA NA NA NA ...
## $ ACCENSIONE ACCERTAMENTI : chr "4.638.500,83" "0,00" "0,00" "953.898,47" ...
## $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
## $ RIMBORSO IMPEGNI : chr "438.696,57" "975,07" "45.584,55" "182.897,01" ...

Scrape web page with splashr and load more button

I'm trying to scrape a few 1801 census pages with splashr that may have zero to many Load More buttons (since 50 records are loaded at a time). This page should have 174.
url <- "https://digitalarkivet.no/en/census/district/tf01058443000001"
doc <- splash("localhost") %>% render_html(url, wait = 3)
html_nodes(doc, xpath="//h4[not(@class)]/a") %>% length()
[1] 50
I tried following the URL given by Load More, but that just gets the first 50 records again.
url2 <- html_nodes(doc, xpath="//div[@class='load-more']") %>% html_attr("data-url")
[1] "https://digitalarkivet.no/en/census/related/rural-residences/tf01058443000001?page=2"
Note that most districts have fewer than 50 records, so I don't need to click load more for every page.
Thx for trying the splashr package (I'm the author).
Thankfully, you won't need it in this case. The data load is done through XHR requests which we can mimic in R:
library(httr)
library(rvest)
census_page <- function(district, page=1L) {
GET(
url = "https://digitalarkivet.no",
path=sprintf("en/census/related/rural-residences/%s", district),
accept_json(),
add_headers(
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.45 Safari/537.36",
Referer = "https://digitalarkivet.no/en/census/district/tf01058443000001",
`X-Requested-With` = "XMLHttpRequest"
),
query = list(page=page)
) -> res
stop_for_status(res)
res <- content(res)
list(
divs = read_html(res$view),
next_page = parse_url(res$nextPage)$query$page
)
}
Now, just pass-in the district and page of data you want:
res <- census_page("tf01058443000001", 1)
And get the results:
str(res, 1)
## List of 2
## $ divs :List of 2
## ..- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
## $ next_page: chr "2"
The function returns a list with:
divs which is the parsed content containing the <div>s of the info you want
next_page can be used to pass to another call of the function
I didn't try it through to the end (i.e. I don't know if there will always be a 'next page') and you will need to extract the data from the <div>s on your own, but this will help you avoid a third-party dependency.
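To walk through every page, you could loop until no next page comes back. The sketch below assumes `next_page` ends up `NULL`, `NA`, or empty on the last page, which is untested against the live site. `census_page()` may also need its own guard for a missing `nextPage` field. `collect_pages()` and `fake_fetch()` are made-up names; the fake function stands in for `census_page()` so the example runs offline.

```r
# Sketch: follow next_page until it runs out.
# `fetch_page` is any function shaped like census_page(district, page)
# returning list(divs = ..., next_page = ...).
collect_pages <- function(fetch_page, district, max_pages = 100L) {
  pages <- list()
  page <- 1L
  while (length(pages) < max_pages) {   # hard cap as a safety net
    res <- fetch_page(district, page)
    pages[[length(pages) + 1L]] <- res$divs
    nxt <- res$next_page
    # ASSUMED end-of-data signal: next_page missing, NA, or empty
    if (is.null(nxt) || length(nxt) == 0 || is.na(nxt) || !nzchar(nxt)) break
    page <- as.integer(nxt)
  }
  pages
}

# Offline stand-in for census_page(): pretends the district has 3 pages
fake_fetch <- function(district, page) {
  list(
    divs = paste("page", page),
    next_page = if (page < 3) as.character(page + 1L) else NULL
  )
}

pages <- collect_pages(fake_fetch, "tf01058443000001")
length(pages)
```

With the real function this would be `collect_pages(census_page, "tf01058443000001")`, ideally with a `Sys.sleep()` between requests.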

Resources