Download multiple files from a URL, using R

I have this URL with geographic information about Brazilian states: https://www.cnpm.embrapa.br/projetos/relevobr/download/index.htm. If you click on any state, you will find a set of grids, and if you click on any grid, you can download the geographic information for that specific grid.
What I need: to download all the grids at once. Is that possible?

You can scrape the page to get the URLs for the zip files, then iterate across the URLs to download everything:
library(rvest)
# get page source
h <- read_html('https://www.cnpm.embrapa.br/projetos/relevobr/download/mg/mg.htm')
urls <- h %>%
  html_nodes('area') %>%   # get all `area` nodes
  html_attr('href') %>%    # get the link attribute of each node
  sub('.htm$', '.zip', .) %>%   # change the file suffix
  paste0('https://www.cnpm.embrapa.br/projetos/relevobr/download/mg/', .)   # prepend the base URL
# create a directory for it all
dir <- file.path(tempdir(), 'mg')
dir.create(dir)
# iterate and download
lapply(urls, function(url) download.file(url, file.path(dir, basename(url))))
# check it's there
list.files(dir)
#> [1] "sd-23-y-a.zip" "sd-23-y-b.zip" "sd-23-y-c.zip" "sd-23-y-d.zip" "sd-23-z-a.zip" "sd-23-z-b.zip"
#> [7] "sd-23-z-c.zip" "sd-23-z-d.zip" "sd-24-y-c.zip" "sd-24-y-d.zip" "se-22-y-d.zip" "se-22-z-a.zip"
#> [13] "se-22-z-b.zip" "se-22-z-c.zip" "se-22-z-d.zip" "se-23-v-a.zip" "se-23-v-b.zip" "se-23-v-c.zip"
#> [19] "se-23-v-d.zip" "se-23-x-a.zip" "se-23-x-b.zip" "se-23-x-c.zip" "se-23-x-d.zip" "se-23-y-a.zip"
#> [25] "se-23-y-b.zip" "se-23-y-c.zip" "se-23-y-d.zip" "se-23-z-a.zip" "se-23-z-b.zip" "se-23-z-c.zip"
#> [31] "se-23-z-d.zip" "se-24-v-a.zip" "se-24-v-b.zip" "se-24-v-c.zip" "se-24-v-d.zip" "se-24-y-a.zip"
#> [37] "se-24-y-c.zip" "sf-22-v-b.zip" "sf-22-x-a.zip" "sf-22-x-b.zip" "sf-23-v-a.zip" "sf-23-v-b.zip"
#> [43] "sf-23-v-c.zip" "sf-23-v-d.zip" "sf-23-x-a.zip" "sf-23-x-b.zip" "sf-23-x-c.zip" "sf-23-x-d.zip"
#> [49] "sf-23-y-a.zip" "sf-23-y-b.zip" "sf-23-z-a.zip" "sf-23-z-b.zip" "sf-24-v-a.zip"
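If you also want to extract the downloaded archives, here is a minimal follow-up sketch (not part of the original answer) using utils::unzip(), assuming each zip extracts cleanly into the same directory:
# extract every downloaded zip into the same directory
zips <- list.files(dir, pattern = "\\.zip$", full.names = TRUE)
invisible(lapply(zips, unzip, exdir = dir))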

Related

How to download data from the Reptile database using R

I am using R to try to download images from the Reptile-database by filling in their form to search for specific images. For that, I am following previous suggestions on filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. You can then map or apply a function over them to download each one with download.file() (see the sketch after the output below).
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
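As a follow-up sketch (not from the original answer), the pics vector can be fed straight into download.file(); the destination names here are just the URL basenames, which are assumed to be unique:
# download each picture into a temporary directory
dest <- file.path(tempdir(), basename(pics))
invisible(Map(function(u, d) download.file(u, d, mode = "wb"), pics, dest))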

How to download several datasets from a website stored as an index of data/months/years?

I need to download climatic datasets at monthly resolution and for several years. The data is available here: https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/
I can download single files by clicking on them and saving them. But how can I download several datasets (and how can I filter for, e.g., specific years?), or simply download all of the files within a directory? I am sure there should be an automatic way using some FTP connection or some R code (in RStudio), but I can't find any relevant suggestions. I am a Windows 10 user. Where should I start?
Try this:
library(rvest)
baseurl <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/"
res <- read_html(baseurl)
urls1 <- html_nodes(res, "a") %>%
  html_attr("href") %>%
  Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
  paste0(baseurl, .)
This gets us the first level,
urls1
# [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/"
# [2] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/02_Feb/"
# [3] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/03_Mar/"
# [4] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/04_Apr/"
# [5] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/05_May/"
# [6] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/06_Jun/"
# [7] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/07_Jul/"
# [8] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/08_Aug/"
# [9] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/09_Sep/"
# [10] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/10_Oct/"
# [11] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/11_Nov/"
# [12] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/12_Dec/"
# [13] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/BESCHREIBUNG_gridsgermany_monthly_air_temperature_mean_de.pdf"
# [14] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/DESCRIPTION_gridsgermany_monthly_air_temperature_mean_en.pdf"
As you can see, some are files, some are directories. We can iterate over these URLs to do the same thing:
urls2 <- lapply(grep("/$", urls1, value = TRUE), function(url) {
  res2 <- read_html(url)
  html_nodes(res2, "a") %>%
    html_attr("href") %>%
    Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
    paste0(url, .)
})
Each of those folders contains 141-142 files:
lengths(urls2)
# [1] 142 142 142 142 142 142 141 141 141 141 141 141
### confirm no more directories
sapply(urls2, function(z) any(grepl("/$", z)))
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(This would not be difficult to transform into a recursive search instead of a fixed two-level search; a sketch follows.)
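For illustration, a minimal recursive sketch; the helper name crawl_dir is hypothetical, and it assumes every level of the site uses the same directory-listing layout:
# hypothetical helper: recursively collect file URLs from an index-style listing
crawl_dir <- function(url) {
  hrefs <- read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    Filter(function(z) grepl("^[[:alnum:]]", z), .) %>%
    paste0(url, .)
  dirs  <- grep("/$", hrefs, value = TRUE)
  files <- grep("/$", hrefs, value = TRUE, invert = TRUE)
  c(files, unlist(lapply(dirs, crawl_dir)))   # recurse into sub-directories
}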
These files can all be combined with those from urls1 that were files (the two .pdf files):
allurls <- c(grep("/$", urls1, value = TRUE, invert = TRUE), unlist(urls2))
length(allurls)
# [1] 1700
head(allurls)
# [1] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/BESCHREIBUNG_gridsgermany_monthly_air_temperature_mean_de.pdf"
# [2] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/DESCRIPTION_gridsgermany_monthly_air_temperature_mean_en.pdf"
# [3] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188101.asc.gz"
# [4] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188201.asc.gz"
# [5] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188301.asc.gz"
# [6] "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188401.asc.gz"
And now you can filter as desired and download those that are needed:
needthese <- allurls[c(3,5)]
ign <- mapply(download.file, needthese, basename(needthese))
# trying URL 'https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188101.asc.gz'
# Content type 'application/octet-stream' length 221215 bytes (216 KB)
# downloaded 216 KB
# trying URL 'https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/01_Jan/grids_germany_monthly_air_temp_mean_188301.asc.gz'
# Content type 'application/octet-stream' length 217413 bytes (212 KB)
# downloaded 212 KB
file.info(list.files(pattern = "gz$"))
# size isdir mode mtime ctime atime exe
# grids_germany_monthly_air_temp_mean_188101.asc.gz 221215 FALSE 666 2022-07-06 09:17:21 2022-07-06 09:17:19 2022-07-06 09:17:52 no
# grids_germany_monthly_air_temp_mean_188301.asc.gz 217413 FALSE 666 2022-07-06 09:17:22 2022-07-06 09:17:21 2022-07-06 09:17:52 no
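If you only want particular years, you can filter allurls on the trailing YYYYMM before downloading; a sketch, assuming all data files end in _YYYYMM.asc.gz:
# keep only the 1991-2000 files, then download them as above
wanted <- grep("_(199[1-9]|2000)[0-1][0-9]\\.asc\\.gz$", allurls, value = TRUE)
ign <- mapply(download.file, wanted, basename(wanted))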
You can use the rvest package to scrape the links and then use those links to download the files for a specific month in the following way:
library(rvest)
library(stringr)
page_link <- "https://opendata.dwd.de/climate_environment/CDC/grids_germany/monthly/air_temperature_mean/"
month_name <- "01_Jan"
# you can set month_name as "05_May" to get the data from 05_May
# getting the html page for 01_Jan folder
page <- read_html(paste0(page_link, month_name, "/"))
# getting the link text
link_text <- page %>%
  html_elements("a") %>%
  html_text()
# creating links, dropping the first entry (typically the parent-directory link)
links <- paste0(page_link, month_name, "/", link_text)[-1]
# extracting the numbers for the file names
filenames <- stringr::str_extract(pattern = "\\d+", string = link_text[-1])
# creating a directory
dir.create(month_name)
# raising the time limit for the downloads
options(timeout = max(600, getOption("timeout")))
# downloading the files
for (i in seq_along(links)) {
  download.file(links[i], paste0(month_name, "/", filenames[i], ".asc.gz"))
}
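To cover several months (or all of them), the same steps can be wrapped in a loop; a sketch, assuming every month folder has the same layout and the data files end in .asc.gz:
months <- sprintf("%02d_%s", 1:12, month.abb)   # "01_Jan" ... "12_Dec"
for (m in months) {
  hrefs <- read_html(paste0(page_link, m, "/")) %>%
    html_elements("a") %>%
    html_attr("href")
  files <- hrefs[grepl("\\.asc\\.gz$", hrefs)]
  dir.create(m, showWarnings = FALSE)
  for (f in files) {
    download.file(paste0(page_link, m, "/", f), file.path(m, f))
  }
}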

Web Scraping with R on multiple pages/links

I have a list of 5000 movies in an excel file:
Avatar
Tangled
Superman Returns
Avengers : Endgame
Man of Steel
And so on....
I need to extract the weekend collections of these movies.
The weekend collections are available on the boxofficemojo.com website.
By writing the following code, I am only able to fetch the weekend collections of a single movie, 'Avatar', since the URL in the code contains only the details for 'Avatar'.
library(rvest)
webpage <- read_html("https://www.boxofficemojo.com/release/rl876971521/weekend/?ref_=bo_rl_tab#tabs")
weekend_collections <- webpage %>%
  html_nodes(".mojo-field-type-rank+ .mojo-estimatable") %>%
  html_text()
Other movies have different URLs; the weekend collections of 5000 different movies are at 5000 different URLs.
Is it possible to just give R the list of movies and ask it to fetch the weekend collections of every movie without providing their respective URLs?
I could add the movies' URLs to the code manually, but that isn't a great idea for 5000 titles.
So how do I fetch the weekend collections of these 5000 movies? I am new to R and need help.
It is possible to automate the search process on this site, since it is easy enough to generate the search string and parse the incoming html to navigate to the weekends page.
The problem is that the search will sometimes generate several hits, so you can't be sure you are getting exactly the right movie. You can only examine the title afterwards to find out.
Here is a function you can use. You supply it with a movie title and it will try to get the url to the weekend collections for the original release. It will select the first hit on the search page, so you have no guarantee it's the correct movie.
library(rvest)
library(xml2)   # url_escape() comes from xml2
get_weekend_url <- function(movie)
{
  site            <- "https://www.boxofficemojo.com"
  search_query    <- paste0(site, "/search/?q=")
  search_xpath    <- "//a[@class = 'a-size-medium a-link-normal a-text-bold']"
  release_xpath   <- "//option[text() = 'Original Release']"
  territory_xpath <- "//option[text() = 'Domestic']"
  weekend         <- "weekend/?ref_=bo_rl_tab#tabs"
  movie_url <- url_escape(movie) %>%
    {gsub("%20", "+", .)} %>%        # build the search query string
    {paste0(search_query, .)} %>%
    read_html() %>%
    html_nodes(xpath = search_xpath) %>%
    html_attr("href")
  if (!is.na(movie_url[1]))
  {
    release <- read_html(paste0(site, movie_url[1])) %>%
      html_node(xpath = release_xpath) %>%
      html_attr("value") %>%
      {paste0(site, .)}
  } else release <- NA   # we can stop if no original release is found
  if (!is.na(release))
  {
    target <- read_html(release) %>%
      html_node(xpath = territory_xpath) %>%
      html_attr("value") %>%
      {paste0(site, ., weekend)}
  } else target <- "Movie not found"
  return(target)
}
Now you can use sapply to get the urls you want:
movies <- c("Avatar",
            "Tangled",
            "Superman Returns",
            "Avengers : Endgame",
            "Man of Steel")
urls <- sapply(movies, get_weekend_url)
urls
#> Avatar
#> "https://www.boxofficemojo.com/release/rl876971521/weekend/?ref_=bo_rl_tab#tabs"
#> Tangled
#> "https://www.boxofficemojo.com/release/rl980256257/weekend/?ref_=bo_rl_tab#tabs"
#> Superman Returns
#> "https://www.boxofficemojo.com/release/rl4067591681/weekend/?ref_=bo_rl_tab#tabs"
#> Avengers : Endgame
#> "https://www.boxofficemojo.com/release/rl3059975681/weekend/?ref_=bo_rl_tab#tabs"
#> Man of Steel
#> "https://www.boxofficemojo.com/release/rl4034037249/weekend/?ref_=bo_rl_tab#tabs"
Now you can use these to get the tables for each movie:
css <- ".mojo-field-type-rank+ .mojo-estimatable"
weekends <- lapply(urls, function(x) read_html(x) %>% html_nodes(css) %>% html_text)
Which gives you:
weekends
#> $`Avatar`
#> [1] "Weekend\n " "$77,025,481" "$75,617,183"
#> [4] "$68,490,688" "$50,306,217" "$42,785,612"
#> [7] "$54,401,446" "$34,944,081" "$31,280,029"
#> [10] "$22,850,881" "$23,611,625" "$28,782,849"
#> [13] "$16,240,857" "$13,655,274" "$8,118,102"
#> [16] "$6,526,421" "$4,027,005" "$2,047,475"
#> [19] "$980,239" "$1,145,503" "$844,651"
#> [22] "$1,002,814" "$920,204" "$633,124"
#> [25] "$425,085" "$335,174" "$188,505"
#> [28] "$120,080" "$144,241" "$76,692"
#> [31] "$64,767" "$45,181" "$44,572"
#> [34] "$28,729" "$35,706" "$36,971"
#> [37] "$15,615" "$16,817" "$13,028"
#> [40] "$10,511"
#>
#> $Tangled
#> [1] "Weekend\n " "$68,706,298" "$56,837,104"
#> [4] "$48,767,052" "$21,608,891" "$14,331,687"
#> [7] "$8,775,344" "$6,427,816" "$9,803,091"
#> [10] "$5,111,098" "$3,983,009" "$5,638,656"
#> [13] "$3,081,926" "$2,526,561" "$1,850,628"
#> [16] "$813,849" "$534,351" "$743,090"
#> [19] "$421,474" "$790,248" "$640,753"
#> [22] "$616,057" "$550,994" "$336,339"
#> [25] "$220,670" "$85,574" "$31,368"
#> [28] "$16,475" "$5,343" "$6,351"
#> [31] "$910,502" "$131,938" "$135,891"
#>
#> $`Superman Returns`
#> [1] "Weekend\n " "$52,535,096" "$76,033,267"
#> [4] "$21,815,243" "$12,288,317" "$7,375,213"
#> [7] "$3,788,228" "$2,158,227" "$1,242,461"
#> [10] "$848,255" "$780,405" "$874,141"
#> [13] "$1,115,228" "$453,273" "$386,424"
#> [16] "$301,373" "$403,377" "$296,502"
#> [19] "$331,938" "$216,430" "$173,300"
#> [22] "$40,505"
#>
#> $`Avengers : Endgame`
#> [1] "Weekend\n " "$357,115,007" "$147,383,211"
#> [4] "$63,299,904" "$29,973,505" "$17,200,742"
#> [7] "$22,063,855" "$8,037,491" "$4,870,963"
#> [10] "$3,725,855" "$1,987,849" "$6,108,736"
#> [13] "$3,118,317" "$2,104,276" "$1,514,741"
#> [16] "$952,609" "$383,158" "$209,992"
#> [19] "$100,749" "$50,268" "$70,775"
#> [22] "$86,837" "$12,680"
#>
#> $`Man of Steel`
#> [1] "Weekend\n " "$116,619,362" "$41,287,206"
#> [4] "$20,737,490" "$11,414,297" "$4,719,084"
#> [7] "$1,819,387" "$749,233" "$466,574"
#> [10] "$750,307" "$512,308" "$353,846"
#> [13] "$290,194" "$390,175" "$120,814"
#> [16] "$61,017"
If you have 5000 movies to look up, it is going to take a long time to send and parse all these requests. Depending on your internet connection, it may well take 2-3 seconds per movie. That's not bad, but it may still be 4 hours of processing time. I would recommend starting with an empty list and writing each result to the list as it is received, so that if something breaks after an hour or two, you don't lose everything you have so far.
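A sketch of that incremental pattern, reusing get_weekend_url() and the css selector from above (the checkpoint file name is just an example):
results <- list()
for (m in movies) {
  results[[m]] <- tryCatch({
    u <- get_weekend_url(m)
    read_html(u) %>% html_nodes(css) %>% html_text()
  }, error = function(e) NA)
  saveRDS(results, "weekend_collections.rds")   # checkpoint after each movie
  Sys.sleep(2)                                  # be polite to the server
}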

Unable to extract image links using Rvest

I am unable to extract links of images from a website.
I am new to data scraping. I have used SelectorGadget as well as the browser's inspect-element tool to get the class of the image, but to no avail.
main.page <- read_html(x = "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
urls <- main.page %>%
  html_nodes(".match-detail--item:nth-child(9) .lazyloaded") %>%
  html_attr("src")
sotu <- data.frame(urls = urls)
I am getting the following output:
<0 rows> (or 0-length row.names)
Certain classes and parameters don't show up in the scraped data for some reason. Just target img instead of .lazyloaded and data-src instead of src:
library(rvest)
main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
main.page %>%
  html_nodes(".match-detail--item:nth-child(9) img") %>%
  html_attr("data-src")
#### OUTPUT ####
[1] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/1.png&h=25&w=25"
[2] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[3] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[4] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[5] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[6] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[7] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[8] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[9] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[10] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[11] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[12] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
Because the DOM is modified via JavaScript (the page uses React) when viewed in a browser, rvest does not see the same layout. You could, less optimally, regex out the info from the JavaScript object that the links are housed in, then use a JSON parser to extract the links:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)
url <- "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r, 'debuts":(.*?\\])')
json <- jsonlite::fromJSON(x[[1]][, 2])
print(json$imgicon)

html_nodes returning two results for a link

I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note, the use of download.file is to get around my company's firewall, per this answer
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
  html_nodes("a") %>%   # note that I don't know the significance of "a"; this was trial and error
  html_attr("href") %>%
  data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical() check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath vs CSS selectors and picks the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
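If you then want to download the files, a short sketch (assuming every link ends in downfile=data%2F followed by the file name, so the local name can be taken from the encoded path):
urls <- html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .)
dest <- sub(".*%2F", "", urls)   # e.g. "aact_ali01.tsv.gz"
invisible(Map(function(u, d) download.file(u, d, mode = "wb"),
              head(urls, 3), head(dest, 3)))   # first three, as a test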
