I am trying to download a bunch of zip files from the website
https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml
Any suggestions? I have tried using rvest to identify the href, but have not had any luck.
We can avoid the platform-specific issues of download.file() by handling the downloads with httr.
First, we'll read in the page:
library(xml2)
library(httr)
library(rvest)
library(tidyverse)
pg <- read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml")
Now, we'll target all the .zip file links. They're relative paths (e.g. /pickup/wwa/1986_all.zip), so we'll prepend the site prefix to them:
html_nodes(pg, xpath=".//a[contains(@href, '.zip')]") %>% # this xpath gets _all_ of them
html_attr("href") %>%
sprintf("https://mesonet.agron.iastate.edu%s", .) -> zip_urls
Here's a sample of what ^^ looks like:
head(zip_urls)
## [1] "https://mesonet.agron.iastate.edu/data/gis/shape/4326/us/current_ww.zip"
## [2] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip"
## [3] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_tsmf.zip"
## [4] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_all.zip"
## [5] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_tsmf.zip"
## [6] "https://mesonet.agron.iastate.edu/pickup/wwa/1988_all.zip"
There are 84 of them:
length(zip_urls)
## [1] 84
So we'll make sure to include a Sys.sleep(5) in our download walker so we aren't hammering their servers since our needs are not more important than the site's.
Make a place to store things:
dir.create("mesonet-dl")
This could also be done with a for loop (a sketch of that follows the walk() call below), but using purrr::walk makes it fairly explicit that we're generating side effects (i.e. downloading to disk, not modifying anything in the R environment):
walk(zip_urls, ~{
  message("Downloading: ", .x) # keep us informed
  # this is way better than download.file(). Read the httr man page on write_disk
  httr::GET(
    url = .x,
    httr::write_disk(file.path("mesonet-dl", basename(.x)))
  )
  Sys.sleep(5) # be kind
})
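For reference, here is a minimal sketch of the same side-effecting download written as a plain for loop (reusing zip_urls and the "mesonet-dl" directory from above):
for (u in zip_urls) {
  message("Downloading: ", u) # keep us informed
  # save each zip under its original filename inside mesonet-dl/
  httr::GET(u, httr::write_disk(file.path("mesonet-dl", basename(u))))
  Sys.sleep(5) # be kind
}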
We use file.path() to construct the save-file location in a platform-agnostic way, and we use basename() to extract the filename portion instead of regex hacking, since it's a C-backed R internal that is aware of platform idiosyncrasies.
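For instance, with one of the URLs from above, basename() and file.path() give us:
u <- "https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip"
basename(u)
## [1] "1986_all.zip"
file.path("mesonet-dl", basename(u))
## [1] "mesonet-dl/1986_all.zip"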
This should work:
library(tidyverse)
library(rvest)
setwd("YourDirectoryName") # set the directory where you want to download all files
read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml") %>%
html_nodes(".table-striped a") %>%
html_attr("href") %>%
lapply(function(x) {
  filename <- str_extract(x, pattern = "(?<=wwa/).*") # this extracts the filename from the url
  paste0("https://mesonet.agron.iastate.edu", x) %>%  # this creates the relevant url from the href
    download.file(destfile = filename, mode = "wb")
  Sys.sleep(5)
})
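As a quick check, the lookbehind regex pulls just the filename out of one of the page's hrefs:
stringr::str_extract("/pickup/wwa/1986_all.zip", "(?<=wwa/).*")
## [1] "1986_all.zip"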
I am attempting to scrape the World Health Organization website (https://www.who.int/publications/m) using the "WHO document type" dropdown for "Press Briefing transcript".
In the past I've been able to use the following script to download all specified file types to the working directory; however, I haven't been able to deal with the dropdown properly.
# Working example
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.github.com/rstudio/cheatsheets")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://www.github.com", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.github.com", .) %>% # prepend the website again to get a full url
walk2(., basename(.), download.file, mode = "wb") # use purrr to download the pdf associated with each url to the current working directory
If I start with the below, what steps would I need to include to account for the "WHO document type" dropdown for "Press Briefing transcript" and download all files to the working directory?
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.who.int/publications/m")
raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>% # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://www.who.int", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.who.int", .) %>% # prepend the website again to get a full url
walk2(., basename(.), download.file, mode = "wb") # use purrr to download the pdf associated with each url to the current working directory
Currently, I get the following:
Error in .f(.x[[1L]], .y[[1L]], ...) : cannot open URL 'NA'
Desired result: PDFs downloaded to the working directory.
There's not much rvest can do here: that document list is not included in the page's source (which is all rvest can access) but is pulled in by JavaScript that the browser executes (and rvest can't do that). You can, however, make those same API calls yourself:
library(jsonlite)
library(dplyr)
library(purrr)
library(stringr)
# get list of reports, partial API documentation can be found
# at https://www.who.int/api/hubs/meetingreports/sfhelp
# additional parameters (i.e. select & filter) recovered from who.int web requests
# skip: number of articles to skip
get_reports <- function(skip = 0){
read_json(URLencode(paste0("https://www.who.int/api/hubs/meetingreports?",
"$select=TrimmedTitle,PublicationDateAndTime,DownloadUrl,Tag&",
"$filter=meetingreporttypes/any(x:x eq f6c6ebea-eada-4dcb-bd5e-d107a357a59b)&",
"$orderby=PublicationDateAndTime desc&",
"$count=true&",
"$top=100&",
"$skip=", skip
)), simplifyVector = T) %>%
pluck("value") %>%
tibble()
}
# make 2 requests to collect all current (164) reports ("...&skip=0", "...&skip=100")
report_urls <- map_dfr(c(0,100), ~ get_reports(.x))
report_urls
#> # A tibble: 164 × 4
#> PublicationDateAndTime TrimmedTitle Downl…¹ Tag
#> <chr> <chr> <chr> <chr>
#> 1 2023-01-24T19:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 2 2023-01-11T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 3 2023-01-04T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 4 2022-12-21T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 5 2022-12-02T16:00:00Z Virtual Press conference on global heal… https:… Pres…
#> 6 2022-11-16T16:00:00Z COVID-19, Monkeypox & Other Global Heal… https:… Pres…
#> 7 2022-11-10T22:00:00Z WHO press conference on global health i… https:… Pres…
#> 8 2022-10-19T21:00:00Z WHO press conference on global health i… https:… Pres…
#> 9 2022-10-19T21:00:00Z WHO press conference on global health i… https:… Pres…
#> 10 2022-10-12T21:00:00Z WHO press conference on COVID-19, monke… https:… Pres…
#> # … with 154 more rows, and abbreviated variable name ¹DownloadUrl
# get 1st 3 transcripts; for destfile, split the url by "?", take the 1st part, use basename to extract the file name from the url
walk(report_urls$DownloadUrl[1:3],
~ download.file(
url = .x,
destfile = basename(str_split_i(.x, "\\?", 1)),mode = "wb"))
# str_split_i() requires stringr >= 1.5.0, feel free to replace with:
# destfile = basename(str_split(.x, "\\?")[[1]][1]),mode = "wb"))
# list downloaded files
list.files(pattern = "press.*pdf")
#> [1] "who-virtual-press-conference-on-global-health-issues-11-jan-2023.pdf"
#> [2] "virtual-press-conference-on-global-health-issues-24-january-2023.pdf"
#> [3] "virtual-press-conference-on-global-health-issues_4-january-2023.pdf"
Created on 2023-01-28 with reprex v2.0.2
That "working example" in the question comes from https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199 ; it is rather difficult to take and apply anything from that article unless you are already familiar with everything written there. To understand why scraping logic built for one site almost never transfers to another, check https://rvest.tidyverse.org/articles/rvest.html & https://r4ds.hadley.nz/webscraping.html (both from the rvest author).
I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. In order not to be blocked by the main site I'm visiting, I need to rest about 2 minutes after each 200 URLs I visit (learned this via trial and error). I'm wondering how I can use Sys.sleep to do this.
My current code is below. I am going to each URL in url_tibble and pulling data (".verified").
# Function to extract data
get_data <- function(x) {
read_html(x) %>%
html_nodes(".verified") %>%
html_attr("href")
}
# Extract data
data_I_need <- url_tibble %>%
mutate(profile = map(url, ~ get_data(.x)))
This code works for a limited number of URLs, until I get blocked for trying to scrape the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after each 200 URLs using Sys.sleep. Can you help me figure out how to do this?
The best recommendation I found for how to do this was from the solution posted on Recommendation when using Sys.sleep() in R with rvest, but I can't figure out how to integrate the solution with my code. This solution uses loops instead of map. I tried doing something like this:
output <- vector(length = length(url_tibble$url))
for(i in 1:length(url_tibble$url)) {
data_I_need <- read_html(url_tibble$url[i]) %>%
html_nodes(".verified") %>%
html_attr("href")
output[i] <- data_I_need
if((i %% 200) == 0){
Sys.sleep(160)
}
}
However, this does not work either, and I receive an error message.
We can use lapply() in lieu of a loop. Also, I have added an https:// prefix to each URL so that read_html() recognises them as links rather than local files. Replace 2 with 200 (and the 20-second sleep with 120 seconds) for the actual data.
lapply(1:length(url_tibble$url), function(x){
if(x%%2 == 0){
print(paste0("Sleeping at ", x))
Sys.sleep(20)
}
read_html(paste0("https://",url_tibble$url[x])) %>%
html_nodes(".verified") %>%
html_attr("href")
})
Output (truncated)
[1] "Sleeping at 2"
[1] "Sleeping at 4"
[1] "Sleeping at 6"
[[1]]
[1] "https://www.psychologytoday.com/us/therapists/aak-bright-start-rego-park-ny/936718"
[2] "https://www.psychologytoday.com/us/therapists/leslie-aaron-new-york-ny/148793"
[3] "https://www.psychologytoday.com/us/therapists/lindsay-aaron-frieman-new-york-ny/761657"
[4] "https://www.psychologytoday.com/us/therapists/fay-m-aaronson-brooklyn-ny/840861"
[5] "https://www.psychologytoday.com/us/therapists/anita-aasen-staten-island-ny/291614"
[6] "https://www.psychologytoday.com/us/therapists/aask-therapeutic-services-fishkill-ny/185423"
[7] "https://www.psychologytoday.com/us/therapists/amanda-abady-brooklyn-ny/935849"
[8] "https://www.psychologytoday.com/us/therapists/denise-abatemarco-new-york-ny/143678"
[9] "https://www.psychologytoday.com/us/therapists/raya-abat-robinson-new-york-ny/810730"
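If you would rather keep the map()/mutate() style from the question, a minimal sketch (assuming url_tibble has a url column and reusing the https:// prefix and .verified selector from above) is to switch to imap() so the element index is available for the pause logic:
library(tidyverse)
library(rvest)

get_data <- function(x, i) {
  if (i %% 200 == 0) {
    message("Sleeping at ", i)
    Sys.sleep(120) # pause about 2 minutes after every 200 URLs
  }
  read_html(paste0("https://", x)) %>%
    html_nodes(".verified") %>%
    html_attr("href")
}

# imap() passes each url as .x and its index as .y
data_I_need <- url_tibble %>%
  mutate(profile = imap(url, ~ get_data(.x, .y)))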
I want to scrape a large number of websites. To do this, I first read in the websites' html source and store the results as xml_nodesets. As I only need the contents, I then extract each website's contents from the xml_nodesets. To achieve this, I have written the following code:
# required packages
library(purrr)
library(dplyr)
library(xml2)
library(rvest)
# urls of the example sources
test_files <- c("https://en.wikipedia.org/wiki/Web_scraping", "https://en.wikipedia.org/wiki/Data_scraping")
# reading in the html sources, storing them as xml_nodesets
test <- test_files %>%
map(., ~ xml2::read_html(.x, encoding = "UTF-8"))
# extracting selected nodes (contents)
test_tbl <- test %>%
map(., ~tibble(
# scrape contents
test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
))
Unfortunately, this produces the following error:
Error: All columns in a tibble must be vectors.
x Column `test_html` is a `xml_nodeset` object.
I think I understand the substance of this error, but I can't find a way around it. It's also a bit strange, because I was able to run this code smoothly in January and suddenly it is not working anymore. I suspected package updates to be the reason, but installing older versions of xml2, rvest or tibble didn't help. Scraping a single website doesn't produce any errors either:
test <- read_html("https://en.wikipedia.org/wiki/Web_scraping", encoding = "UTF-8") %>%
rvest::html_nodes(xpath = '//*[(@id = "toc")]')
Do you have any suggestions on how to solve this issue? Thank you very much!
EDIT: I removed %>% html_text from ...
test_tbl <- test %>%
map(., ~tibble(
# scrape contents
test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
))
... because the version with html_text doesn't produce this error. The edited code above (without it) does, though.
You need to store the objects in a list.
test %>%
  purrr::map(~tibble(
    # scrape contents
    test_html = list(rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]'))
  ))
#[[1]]
# A tibble: 1 x 1
# test_html
# <list>
#1 <xml_ndst>
#[[2]]
# A tibble: 1 x 1
# test_html
# <list>
#1 <xml_ndst>
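If you only need the text of those nodes rather than the node objects themselves, piping through html_text() also sidesteps the list-column requirement (a small sketch, using the same test object):
test %>%
  purrr::map(~tibble(
    # scrape contents as plain text, which tibble accepts directly
    test_html = rvest::html_text(
      rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
    )
  ))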
I'm trying to scrape & download csv files from a webpage with tons of csv's.
Code:
# Libraries
library(rvest)
library(httr)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# The csv's I want are from 14 through 378 (2018 year)
selector_nodes <- seq(from = 14, to = 378, by = 1)
# HTML read / rvest action
link <- url %>%
read_html() %>%
html_nodes(paste0("body > ul > li:nth-child(", selector_nodes, ") > a")) %>%
html_attr("href")
I get this error:
Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Expecting a single string value: [type=character; extent=365].
How do I tell it I want the nodes 14 to 378 correctly?
After I can get that assigned, I'm going to run a quick for loop and download all of the 2018 csv's.
See the comments in the code for the step-by-step solution.
library(rvest)
# URL
url <- "http://data.gdeltproject.org/events/index.html"
# Read the page in once then attempt to process it.
page <- url %>% read_html()
#extract the file list
filelist<-page %>% html_nodes("ul li a") %>% html_attr("href")
#filter for files from 2018
filelist<-filelist[grep("2018", filelist)]
#Loop would go here to download all of the pages
#pause between file downloads and then download a file
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1])
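The loop flagged in the comment could then look something like this (a sketch; the 1-second pause and saving each file under its original name in the working directory are assumptions):
# download every 2018 file, pausing between requests
for (f in filelist) {
  Sys.sleep(1)
  download.file(paste0("http://data.gdeltproject.org/events/", f),
                destfile = f, mode = "wb")
}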
I know that this question has been asked here tons of times, but after reading a bunch of topics I'm still stuck on this :( . I have a list of scraped html nodes like this
http://bit.d o/bnRinN9
and I just want to clean out all the code part. Unfortunately I'm a newbie, and the only thing that comes to mind is the Cthulhu way (regex, argh!). Which way can I do this?
*I put a space between "d" and "o" in the domain name because SO doesn't allow posting that link
This uses the data linked in "Why R can't scrape these links?", which was downloaded locally (see the readLines() path below).
library(rvest)
library(stringr)
# read the saved htm page and make one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")
# the links are within a table, within spans. There isn't much structure
# and no identifiers, so it needs a little hacking to get the right elements.
# There probably are smarter css selectors that could avoid the hacks
spans <- read_html(text) %>% xml_nodes(css = "table tbody tr td span")
# extract all the short links -- but remove the links to edit
# note these links have a trailing dash - links to the statistics
# not the content
short_links <- spans %>% xml_nodes("a") %>% xml_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]
# the real urls are in the html text, prefixed with http
span_text <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]
# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
The rvest library includes many simple functions for scraping and processing html; it depends on the xml2 package. Generally you can scrape and filter in one step.
It's not clear whether you want to extract the href value or the html text, which are the same in your example. This code extracts the href value by finding the a nodes and then reading their href attribute. Alternatively, you can use html_text to get the link display text.
library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">
')
# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% xml_nodes("a") %>% xml_attr("href")
hrefs
## [1] "http://anydomain.com/bnRinN9" "domain.com/page"
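And, as noted above, swapping xml_attr("href") for html_text() returns the display text of each link instead (a small sketch using the same text object; the output depends on each anchor's text content):
link_text <- read_html(text) %>% xml_nodes("a") %>% html_text()
link_text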