I am attempting to scrape the World Health Organization website (https://www.who.int/publications/m) using the "WHO document type" dropdown for "Press Briefing transcript".
In the past I've been able to use the following script to download all specified file types to the working directory; however, I haven't been able to deal with the dropdown properly.
# Working example
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.github.com/rstudio/cheatsheets")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://www.github.com", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.github.com", .) %>% # prepend the website again to get a full url
walk2(., basename(.), download.file, mode = "wb") # use purrr to download the pdf associated with each url to the current working directory
If I start with the code below, what steps would I need to include to account for the "WHO document type" dropdown for "Press Briefing transcript" and download all files to the working directory?
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.who.int/publications/m")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://www.who.int", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.who.int", .) %>% # prepend the website again to get a full url
walk2(., basename(.), download.file, mode = "wb") # use purrr to download the pdf associated with each url to the current working directory
Currently, I get the following:
Error in .f(.x[[1L]], .y[[1L]], ...) : cannot open URL 'NA'
Desired result
PDFs downloaded to the working directory
There's not much rvest can do here: that document list is not included in the page's source (which is all rvest can access) but is pulled in by javascript that the browser executes (and rvest can't do that). You can, however, make those same calls yourself:
library(jsonlite)
library(dplyr)
library(purrr)
library(stringr)
# get list of reports, partial API documentation can be found
# at https://www.who.int/api/hubs/meetingreports/sfhelp
# additional parameters (i.e. select & filter) recovered from who.int web requests
# skip: number of articles to skip
get_reports <- function(skip = 0){
read_json(URLencode(paste0("https://www.who.int/api/hubs/meetingreports?",
"$select=TrimmedTitle,PublicationDateAndTime,DownloadUrl,Tag&",
"$filter=meetingreporttypes/any(x:x eq f6c6ebea-eada-4dcb-bd5e-d107a357a59b)&",
"$orderby=PublicationDateAndTime desc&",
"$count=true&",
"$top=100&",
"$skip=", skip
)), simplifyVector = T) %>%
pluck("value") %>%
tibble()
}
# make 2 requests to collect all current (164) reports ("...&skip=0", "...&skip=100")
report_urls <- map_dfr(c(0,100), ~ get_reports(.x))
report_urls
#> # A tibble: 164 × 4
#> PublicationDateAndTime TrimmedTitle Downl…¹ Tag
#> <chr> <chr> <chr> <chr>
#> 1 2023-01-24T19:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 2 2023-01-11T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 3 2023-01-04T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 4 2022-12-21T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 5 2022-12-02T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 6 2022-11-16T16:00:00Z COVID-19, Monkeypox & Other Global Heal… https:… Pres…
#> 7 2022-11-10T22:00:00Z WHO press conference on global health i… https:… Pres…
#> 8 2022-10-19T21:00:00Z WHO press conference on global health i… https:… Pres…
#> 9 2022-10-19T21:00:00Z WHO press conference on global health i… https:… Pres…
#> 10 2022-10-12T21:00:00Z WHO press conference on COVID-19, monke… https:… Pres…
#> # … with 154 more rows, and abbreviated variable name ¹DownloadUrl
# get the first 3 transcripts; for destfile, split the url by "?", take the 1st part and use basename to extract the file name from the url
walk(report_urls$DownloadUrl[1:3],
~ download.file(
url = .x,
destfile = basename(str_split_i(.x, "\\?", 1)), mode = "wb"))
# str_split_i() requires stringr >= 1.5.0, feel free to replace with:
# destfile = basename(str_split(.x, "\\?")[[1]][1]), mode = "wb"))
# list downloaded files
list.files(pattern = "press.*pdf")
#> [1] "who-virtual-press-conference-on-global-health-issues-11-jan-2023.pdf"
#> [2] "virtual-press-conference-on-global-health-issues-24-january-2023.pdf"
#> [3] "virtual-press-conference-on-global-health-issues_4-january-2023.pdf"
Created on 2023-01-28 with reprex v2.0.2
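If you want every transcript rather than just the first three, the same walk() pattern can run over the full DownloadUrl column; a minimal sketch, assuming you also want a short pause between requests to stay polite:
# download every transcript; the 1-second pause is optional
walk(report_urls$DownloadUrl,
     ~ {
       download.file(url = .x,
                     destfile = basename(str_split_i(.x, "\\?", 1)),
                     mode = "wb")
       Sys.sleep(1)
     })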
That "working example" in question comes from https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199 , it is rather difficult to take and apply anything from that article unless you are already familiar with everything written there. To understand why applying scraping logic built for one site almost never works for another, maybe check https://rvest.tidyverse.org/articles/rvest.html & https://r4ds.hadley.nz/webscraping.html (both from rvest author).
Related
I am trying to scrape a website with a script in R in order to gather details for pictures.
What I need is:
Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")
There are over 16,000 files, and thankfully the web url goes "...asp?photo=1, 2, 3, 4", so there is a base url which doesn't change, just the last section with the image number. I would like the script to either loop for a set number (I tell it where to start) or simply break when it gets to a page which doesn't exist.
Using the code below, I can get the caption of the photo, but only one line. I would like to get the photo credit, which is on a separate line; there are three lines between the main caption and the photo credit. I'd be fine if the table which is generated had two or three blank columns to account for those lines, as I can delete them later.
library(rvest)
library(dplyr)
link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)
caption = page %>% html_nodes(".text7 i") %>% html_text()
info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")
Scraping with rvest and tidyverse
library(tidyverse)
library(rvest)
get_picture <- function(page) {
cat("Scraping page", page, "\n")
page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
read_html()
tibble(
image_name = page %>%
html_element(".text7 img") %>%
html_attr("src"),
caption = page %>%
html_element(".text7") %>%
html_text() %>%
str_split(pattern = "\r\n\t\t\t\t") %>%
unlist %>%
nth(1),
credit = page %>%
html_element(".text7") %>%
html_text() %>%
str_split(pattern = "\r\n\t\t\t\t") %>%
unlist %>%
nth(3)
)
}
# Get the first 50 pages (1:50)
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))
# A tibble: 42 × 3
image_name caption credit
<chr> <chr> <chr>
1 /photos/1.jpg Recruit Clay Hamric demonstrates the use… James…
2 /photos/2.jpg A recruit demonstrates the proper use of… James…
3 /photos/3.jpg Recruit Paul Melnick demonstrates the pr… James…
4 /photos/4.jpg Rescue 104 James…
5 /photos/5.jpg Rescue 104 James…
6 /photos/6.jpg Rescue 104 James…
7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
For the images, you can use the command-line tool curl. For example, to download images 1.jpg through 100.jpg:
curl -O "http://fallschurchvfd.org/photos/[0-100].jpg"
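If you'd rather stay in R, here is a minimal sketch of the same idea (the 1:100 range and output file names are assumptions; missing image numbers are simply skipped):
library(purrr)
# download 1.jpg through 100.jpg into the working directory
walk(1:100, function(i) {
  url <- paste0("http://fallschurchvfd.org/photos/", i, ".jpg")
  # try() keeps the loop going when an image number doesn't exist
  try(download.file(url, destfile = paste0(i, ".jpg"), mode = "wb"), silent = TRUE)
})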
For the R code, if you grab the whole .text7 section, you can then split it into the caption and photo credit:
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
As a loop
library(rvest)
library(tidyverse)
df<-data.frame(id=1:20,
image=NA,
caption=NA,
credit=NA)
for (i in 1:20){
cat(i, " ") # to monitor progress and debug
link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
tryCatch({ # This is to avoid stopping on an error message for missing pages
page <- read_html(link)
df$image[i] <- page %>% html_nodes(".text7 img") %>% html_attr("src")
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1] # This is an awkward way of saying "list 1, element 1"
df$credit[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
},
error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
I get inconsistent results with this current code; for example, page 15 has more line breaks than page 1.
TODO: enhance string extraction; switch to an 'append' method of adding data to a data.frame (vs pre-allocate and insert).
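A minimal sketch of that TODO, assuming the caption is the first non-empty line of .text7 and the credit the last one (that assumption may need adjusting for pages with extra text): rows are collected in a list and bound together at the end instead of being inserted into a pre-allocated data.frame.
library(rvest)
library(tidyverse)
rows <- list()
for (i in 1:20) {
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({
    page <- read_html(link)
    txt <- page %>% html_node(".text7") %>% html_text()
    # split on line breaks, trim whitespace, drop empty pieces
    parts <- str_split(txt, "\r\n")[[1]] %>% str_squish() %>% discard(~ .x == "")
    rows[[length(rows) + 1]] <- tibble(
      id      = i,
      image   = page %>% html_node(".text7 img") %>% html_attr("src"),
      caption = parts[1],   # assumed: caption is the first non-empty line
      credit  = last(parts) # assumed: credit is the last non-empty line
    )
  },
  error = function(e) cat("ERROR on page", i, ":", conditionMessage(e), "\n"))
}
df <- bind_rows(rows)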
I am using rvest to scrape a website to download all the data in tables. Step 1 is working. I am not getting Step 2 to work correctly:
Step 1:
library(rvest)
library(httr)
url<-'http://www.ahw.gov.ab.ca/IHDA_Retrieval/ihdaData.do'
sess<-html_session(url)
sess %>% follow_link(css='#content > div > p:nth-child(8) > a') -> sess
sess %>% follow_link(css='#content > div > table:nth-child(3) > tbody > tr:nth-child(10) > td > a') -> sess
Step 2:
pg_form<-html_form(sess)[[2]]
filled_form <-set_values(pg_form, `displayObject.id` = "1006")
d<-submit_form(session=sess, form=filled_form)
I am not sure how to submit the selected form. Do I need to use Selenium instead of rvest?
You don't need to use RSelenium. You can scrape this particular site using rvest and httr, but it's a little tricky. You need to learn how to send forms in http requests. This requires a bit of exploration of the underlying html and the http requests sent by your web browser.
In your case, the form is actually pretty simple. It only has two fields: a command field, which is always "doSelect" and a displayObject.id, which is a unique number for each selection item, obtained from the "value" attributes of the "option" tags in the html.
Here's how we can look at the drop-downs and their associated ids:
library(tidyverse)
library(rvest)
library(httr)
url <- "http://www.ahw.gov.ab.ca/IHDA_Retrieval/"
paste0(url, "ihdaData.do") %>%
GET() %>%
read_html() %>%
html_node('#content > div > p:nth-child(8) > a') %>%
html_attr("href") %>%
{paste0(url, .)} %>%
GET() %>%
read_html() %>%
html_node('#content > div > table:nth-child(3) > tbody > tr:nth-child(10) > td > a') %>%
html_attr("href") %>%
{paste0(url, .)} %>%
GET() %>%
read_html() -> page
pages <- tibble(id = page %>% html_nodes("option") %>% html_attr("value"),
item = page %>% html_nodes("option") %>% html_text())
pages <- pages[which(pages$item != ""), ]
This gives us a listing of the available items on the page:
pages
#> # A tibble: 8 x 2
#> id item
#> <chr> <chr>
#> 1 724 Human Immunodeficiency Virus (HIV) Incidence Rate (Age Specific)
#> 2 723 Human Immunodeficiency Virus (HIV) Incidence Rate (by Geography)
#> 3 886 Human Immunodeficiency Virus (HIV) Proportion (Ethnicity)
#> 4 887 Human Immunodeficiency Virus (HIV) Proportion (Exposure Cateogory)
#> 5 719 Notifiable Diseases - Age-Sex Specific Incidence Rate
#> 6 1006 Sexually Transmitted Infections (STI) - Age-Sex Specific Case Counts (P~
#> 7 466 Sexually Transmitted Infections (STI) - Age-Sex Specific Rates of Repor~
#> 8 1110 Sexually Transmitted Infections (STI) - Quarterly Congenital Syphilis C~
Now, if we want to select the first one, we just post a list with the required parameters to the correct url, which you can find by checking the developer console in your browser (F12 in Chrome, Firefox or IE). In this case, it is the relative url "selectSubCategory.do":
params <- list(command = "doSelect", displayObject.id = pages$id[1])
next_page <- POST(paste0(url, "selectSubCategory.do"), body = params)
So now next_page contains the html of the page you were looking for. Unfortunately, in this case it is another drop-down selection page.
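A hedged sketch of one way to keep going: re-parse next_page, read the new form's action attribute and its option values, and POST again. The field names below are assumed to match the first form; verify them in the developer console before relying on this.
next_html <- read_html(content(next_page, "text"))
# where the next form posts to (relative url taken from the page itself;
# html_node() grabs the first form, so check there is only one)
action <- next_html %>% html_node("form") %>% html_attr("action")
next_options <- tibble(
  id   = next_html %>% html_nodes("option") %>% html_attr("value"),
  item = next_html %>% html_nodes("option") %>% html_text()
) %>% filter(item != "")
# assumption: this form also uses 'command' and 'displayObject.id' fields
params2 <- list(command = "doSelect", displayObject.id = next_options$id[1])
result <- POST(paste0(url, action), body = params2)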
Hopefully by following the methods above, you will be able to navigate the pages well enough to get the data you need.
I'm trying to scrape & download csv files from a webpage with tons of csv's.
Code:
# Libraries
library(rvest)
library(httr)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# The csv's I want are list items 14 through 378 (the 2018 files)
selector_nodes <- seq(from = 14, to = 378, by = 1)
# HTML read / rvest action
link <- url %>%
read_html() %>%
html_nodes(paste0("body > ul > li:nth-child(", (gdelt_nodes), ")> a")) %>%
html_attr("href")
I get this error:
Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Expecting a single string value: [type=character; extent=365].
How do I tell it I want the nodes 14 to 378 correctly?
Once I have that assigned, I'm going to run a quick for loop and download all of the 2018 csv's.
See the comments in the code for the step-by-step solution.
library(rvest)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# Read the page in once then attempt to process it.
page <- url %>% read_html()
#extract the file list
filelist<-page %>% html_nodes("ul li a") %>% html_attr("href")
#filter for files from 2018
filelist<-filelist[grep("2018", filelist)]
# a loop to download all of the 2018 files would go here (see the sketch after this block)
#pause between file downloads and then download a file
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1])
I am trying to download a bunch of zip files from the website
https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml
Any suggestions? I have tried using rvest to identify the href, but have not had any luck.
We can avoid platform-specific issues with download.file() and handle the downloads with httr.
First, we'll read in the page:
library(xml2)
library(httr)
library(rvest)
library(tidyverse)
pg <- read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml")
Now, we'll target all the .zip file links. They're relative paths (the link text is just "Zip" and the href is something like /pickup/wwa/1986_all.zip), so we'll prepend the URL prefix to them as well:
html_nodes(pg, xpath=".//a[contains(@href, '.zip')]") %>% # this href test gets _all_ of them
html_attr("href") %>%
sprintf("https://mesonet.agron.iastate.edu%s", .) -> zip_urls
Here's a sample of what ^^ looks like:
head(zip_urls)
## [1] "https://mesonet.agron.iastate.edu/data/gis/shape/4326/us/current_ww.zip"
## [2] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip"
## [3] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_tsmf.zip"
## [4] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_all.zip"
## [5] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_tsmf.zip"
## [6] "https://mesonet.agron.iastate.edu/pickup/wwa/1988_all.zip"
There are 84 of them:
length(zip_urls)
## [1] 84
So we'll make sure to include a Sys.sleep(5) in our download walker so we aren't hammering their servers since our needs are not more important than the site's.
Make a place to store things:
dir.create("mesonet-dl")
This could also be done with a for loop but using purrr::walk makes it fairly explicit we're generating side effects (i.e. downloading to disk and not modifying anything in the R environment):
walk(zip_urls, ~{
message("Downloading: ", .x) # keep us informed
# this is way better than download.file(). Read the httr man page on write_disk
httr::GET(
url = .x,
httr::write_disk(file.path("mesonet-dl", basename(.x)))
)
Sys.sleep(5) # be kind
})
We use file.path() to construct the save-file location in a platform-agnostic way, and we use basename() to extract the filename portion rather than regex hacking, since it's a C-backed R internal that is aware of platform idiosyncrasies.
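For example, basename() pulls the file name straight off one of the URLs above:
basename("https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip")
## [1] "1986_all.zip"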
This should work
library(tidyverse)
library(rvest)
setwd("YourDirectoryName") # set the directory where you want to download all files
read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml") %>%
html_nodes(".table-striped a") %>%
html_attr("href") %>%
lapply(function(x) {
filename <- str_extract(x, pattern = "(?<=wwa/).*") # this extracts the filename from the url
paste0("https://mesonet.agron.iastate.edu",x) %>% # this creates the relevant url from href
download.file(destfile=filename, mode = "wb")
Sys.sleep(5)})
I'm new to R and web scraping. I'm trying to read a table from the World Bank website into R.
Here is the url link for one of the projects as an example (my goal is to read the left table under "Basic Information"): http://projects.worldbank.org/P156880/?lang=en&tab=details
I'm using Chrome's DevTools to identify the selector nodes that I need for that particular table.
Here is my code:
library(rvest)
url <- "http://projects.worldbank.org/P156880/?lang=en&tab=details"
details <- url %>%
read_html() %>%
html_nodes(css = '#projectDetails > div:nth-child(2) > div.column-left > table') %>%
html_table()
Unfortunately, I get an empty list:
> details
list()
Any help on how to resolve this would be greatly appreciated.
This site loads the table with XML HTTP requests, which you can reproduce using httr. Open Chrome developer tools, go to the Network tab and then load your url above. You will notice four other urls are requested when loading the page; click on projectdetails? and you should see the html table in the Preview tab. Next, right-click on projectdetails?, choose Copy as cURL, paste it into a text editor, and copy the URL, Referer, and X-Requested-With values into the httr GET function below.
library(httr)
library(rvest)
res <- GET(
url = "http://projects.worldbank.org/p2e/projectdetails?projId=P156880&lang=en",
add_headers(Referer = "http://projects.worldbank.org/P156880/?lang=en&tab=details",
`X-Requested-With` = "XMLHttpRequest")
)
content(res) %>% html_node("table") %>% html_table( header=TRUE)
Project ID P156880
1 Status Active
2 Approval Date December 14, 2017
3 Closing Date December 15, 2023
4 Country Colombia
5 Region Latin America and Caribbean
6 Environmental Category B
Or write a function to get any project ID
get_project <-function(id){
res <- GET(
url = "http://projects.worldbank.org",
path = paste0("p2e/projectdetails?projId=", id, "&lang=en"),
add_headers(
Referer = paste0("http://projects.worldbank.org/", id, "/?lang=en&tab=details"),
`X-Requested-With` = "XMLHttpRequest")
)
content(res) %>% html_node("table") %>% html_table(header=TRUE)
}
get_project("P156880")