Download an online folder with Windows 10 - R

I wish to download an online folder using Windows 10 on my Dell laptop. In this example the folder I wish to download is named Targetfolder. I am trying to use the Command Window, but I am also wondering whether there is a simple solution in R. I have included an image at the bottom of this post showing the target folder. I should add that Targetfolder contains a file and multiple subfolders containing files. Not all files have the same extension. Also, please note this is a hypothetical site; I did not want to include the real site for privacy reasons.
EDIT
Here is a real site that can serve as a functional, reproducible example. The folder rel2020 can take the place of the hypothetical Targetfolder:
https://www2.census.gov/geo/docs/maps-data/data/rel2020/
None of the answers to the following question seem to work with Targetfolder:
How to download HTTP directory with all files and sub-directories as they appear on the online files/folders list?
Below are my attempts based on answers posted at the link above and the result I obtained:
Attempt One
lftp -c 'mirror --parallel=300 https://www.examplengo.org/datadisk/examplefolder/userdirs/user3/Targetfolder/ ;exit'
Returned:
'lftp' is not recognized as an internal or external command, operable program or batch file.
Attempt Two
wget -r -np -nH --cut-dirs=3 -R index.html https://www.examplengo.org/datadisk/examplefolder/userdirs/user3/Targetfolder/
Returned:
'wget' is not recognized as an internal or external command, operable program or batch file.
Attempt Three
https://sourceforge.net/projects/visualwget/files/latest/download
VisualWget returned "Unsupported scheme" next to the URL.

Here is a way with packages httr and rvest.
First, get the folder links from the page.
Then loop through the folders with Map, getting the file names and downloading them in a lapply loop.
If errors such as time-out conditions occur, they will be trapped by tryCatch. The last code lines will tell if and where there were errors.
Note: I only downloaded from folders[1:2]; to download everything, change this to folders in the Map call below.
suppressPackageStartupMessages({
  library(httr)
  library(rvest)
  library(dplyr)
})

link <- "https://www2.census.gov/geo/docs/maps-data/data/rel2020/"
page <- read_html(link)

# the folder links are the 8th through 14th <a> elements of the listing page
folders <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  .[8:14] %>%
  paste0(link, .)

# for each folder, collect the .txt file links and download them
files_txt <- Map(\(x) {
  x %>%
    read_html() %>%
    html_elements("a") %>%
    html_attr("href") %>%
    grep("\\.txt$", ., value = TRUE) %>%
    paste0(x, .) %>%
    lapply(\(y) {
      tryCatch(
        download.file(y, destfile = file.path("~/Temp", basename(y))),
        error = function(e) e
      )
    })
}, folders[1:2])

# check which downloads, if any, resulted in errors
err <- sapply(unlist(files_txt, recursive = FALSE), inherits, "error")
lapply(unlist(files_txt, recursive = FALSE)[err], simpleError)
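Since the question notes that not all files share the same extension, here is a hedged variation of the same loop (an addition, not part of the original answer) that keeps every file link instead of only the .txt ones. It assumes that directory links in the listing end in "/" and that sort links start with "?"; adjust the filter for the actual site.
files_all <- Map(\(x) {
  x %>%
    read_html() %>%
    html_elements("a") %>%
    html_attr("href") %>%
    # assumption: directory links end in "/" and sort links start with "?"
    grep("/$|^\\?", ., value = TRUE, invert = TRUE) %>%
    paste0(x, .) %>%
    lapply(\(y) {
      tryCatch(
        # mode = "wb" so binary files are not mangled on Windows
        download.file(y, destfile = file.path("~/Temp", basename(y)), mode = "wb"),
        error = function(e) e
      )
    })
}, folders[1:2])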

Related

Downloading multiple files using purrr

I am trying to download all of the Excel files at: https://www.grants.gov.au/reports/gaweeklyexport
Using the code below I get errors similar to the following for each link (77 in total):
[[1]]$error
<simpleError in download.file(.x, .y, mode = "wb"): scheme not supported in URL '/Reports/GaWeeklyExportDownload?GaWeeklyExportUuid=0db183a2-11c6-42f8-bf52-379aafe0d21b'>
I get this error when trying to iterate over the full list, but when I call download.file on an individual list item it works fine.
I would be grateful if someone could tell me what I have done wrong or suggest a better way of doing it.
The code that produces the error:
library(tidyverse)
library(rvest)
# Reading links to the Excel files to be downloaded
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- html_attr(html_nodes(webpage, '.u'), "href")
# Creating names for the files to supply to the download.file function
wb_names = str_c(1:77, ".xlsx")
# Defining a function that uses purrr's safely() so the loop doesn't fail if there is a dead link
safe_download <- safely(~ download.file(.x, .y, mode = "wb"))
# Combining the links, file names, and the function; this is the step that returns the errors
map2(links, wb_names, safe_download)
You need to prepend 'https://www.grants.gov.au/' to the URL to get the absolute path of the file which can be used to download the file.
library(rvest)
library(purrr)
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- paste0('https://www.grants.gov.au/', html_attr(html_nodes(webpage, '.u'), "href"))
safe_download <- safely(~ download.file(.x , .y, mode = "wb"))
# Creating names for the files to supply to the download.file function
wb_names = paste0(1:77, ".xlsx")
map2(links, wb_names, safe_download)
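A side note, not part of the original answer: xml2::url_absolute() resolves relative hrefs against the page URL, which avoids hard-coding the site prefix:
# resolve relative hrefs against the page URL instead of pasting the prefix
links <- xml2::url_absolute(html_attr(html_nodes(webpage, '.u'), "href"), url)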

how to download pdf file with R from web (encode issue)

I am trying to download a pdf file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. As a result, if I pass the same url to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct url.
How do I correct the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimum reproducible code:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage, ".borderTD a") %>%
  html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep = "")
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
          browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as urls - they contain spaces and other special characters. In order to convert them you must use url_escape(), which should be available to you as loading rvest also loads xml2, which contains url_escape().
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path like this: "C://Users/Manoel/Documents/downloaded/testes.pdf", or a relative path like this: path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
  read_html() %>%
  html_nodes(".borderTD a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
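If the "downloaded" folder does not exist yet, it can be created first (a small addition, not in the original answer):
# create ~/downloaded if it is missing, so download.file has somewhere to write
dir.create(path.expand("~/downloaded"), showWarnings = FALSE)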

Download data from web using R gives libcurl error

I have the following R script for downloading data but it gives me an error. How can I fix this error?
rm(list=ls(all=TRUE))
library('purrr')
years <- c(1980:1981)
days <- c(001:002)
walk(years, function(x) {
  map(x, ~sprintf("https://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%s/.grb", years, days)) %>%
    flatten_chr() -> urls
  download.file(urls, basename(urls), method = "libcurl")
})
Error:
Error in download.file(urls, basename(urls), method = "libcurl") :
download.file(method = "libcurl") is not supported on this platform
That means that libcurl may not be installed or available for your operating system. Please note that the method argument has other options and that the available methods vary across operating systems (roughly what the error message means by "platform"). I would try other methods (e.g., wget, curl, ...).
From the help of download.file...
The supported ‘method’s do change: method ‘libcurl’ was introduced
in R 3.2.0 and is still optional on Windows - use
‘capabilities("libcurl")’ in a program to see if it is available.
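A minimal sketch of that check and fallback (an illustration, not from the original answer; the URL is hypothetical, and method = "curl" assumes a curl executable is on the PATH):
# check whether this R build supports method = "libcurl"
capabilities("libcurl")

# hypothetical URL and destination, just to illustrate the fallback
url  <- "https://example.com/somefile.grb"
dest <- file.path(tempdir(), basename(url))
if (capabilities("libcurl")) {
  download.file(url, dest, method = "libcurl")
} else {
  download.file(url, dest, method = "curl")  # assumes curl is installed and on the PATH
}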
I had started to do a light edit to @gballench's answer (since I don't really need the points), but it's more complex than you have it, since you're not going to get to the files you need with that idiom (which I'm 99% sure is from an answer of mine :-) for a whole host of reasons.
First, days needs to be padded to length 3 with 0s, but the way you did it won't do that. Second, you likely want to download all the .grb files from each year/00x combo, so you need a way to get those. Finally, that site requires authentication, so you need to register and use basic authentication for it.
Something like this:
library(purrr)
library(httr)
library(rvest)

years <- c(1980:1981)
days <- sprintf("%03d", 1:2)

sprintf("http://hydro1.gesdisc.eosdis.nasa.gov/data/NLDAS/NLDAS_MOS0125_H.002/%s/%%s/", years) %>%
  map(~sprintf(.x, days)) %>%
  flatten_chr() %>%
  map(~{
    base_url <- .x
    sprintf("%s/%s", base_url, read_html(.x) %>%
              html_nodes(xpath = ".//a[contains(@href, '.grb')]") %>%
              html_attr("href"))
  }) %>%
  flatten_chr() %>%
  discard(~grepl("xml$", .)) %>%
  walk(~{
    output_path <- file.path("FULL DIRECTORY PATH", basename(.x))
    if (!file.exists(output_path)) {
      message(.x)
      GET(
        url = .x,
        config = httr::config(ssl_verifypeer = FALSE),
        write_disk(output_path, overwrite = TRUE),
        authenticate(user = "me@example.com", password = "xldjkdjfid8y83"),
        progress()
      )
    }
  })
You'll need to install the httr package which will install the curl package and ultimately make libcurl available for simpler batch downloads in the future.
I remembered that I had an account so I linked it with this app & tested this (killed it at 30 downloads) and it works. I added progress() to the GET() call so you can see it downloading individual files. It skips over already downloaded files (so you can kill it and restart it at any time). If you need to re-download any, just remove the file you want to re-download.
If you also need the .xml files, then remove the discard() call.

Scraping PDF from iframe into R

I am trying to scrape the text of U.N. Security Council (UNSC) resolutions into R. The U.N. maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be do-able.
If I click on the hyperlink for a specific year and then click on the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a stack of error messages.
Here's what I'm trying that's failing. If you run it, you'll see the error messages I'm talking about.
library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)
What am I missing? I think it has to do with the link addresses to the downloadable versions of these files differing from the link addresses for the in-browser display, but I can't figure out how to get the path to the former. I tried right-clicking on the download icon; using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link); and pointing the rest of my process at it. Again, the download.file part executes, but I get the same error messages when I run pdf_text. I also tried a) varying the mode part of the call to download.file and b) tacking ".pdf" onto the end of the path to tmp, but neither of those helped.
The pdf you are looking to download is in an iframe in the main page, so the link you are downloading only contains html.
You need to follow the link in the iframe to get the actual link to the pdf. You need to jump to several pages to get cookies/temporary urls before getting to the direct link to download the pdf.
Here's an example for the link you posted:
rm(list=ls())
library(rvest)
library(pdftools)
s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
# get the link in the mainFrame iframe holding the pdf
frame_link <- s %>% read_html() %>%
  html_nodes(xpath = "//frame[@name='mainFrame']") %>%
  html_attr("src")

# go to that link
s <- s %>% jump_to(url = frame_link)

# there is a meta refresh with a link to another page, get it and go there
temp_url <- s %>% read_html() %>%
  html_nodes("meta") %>%
  html_attr("content") %>% {gsub(".*URL=", "", .)}
s <- s %>% jump_to(url = temp_url)

# get the LtpaToken cookie then come back
s %>% jump_to(url = "https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
  back()

# get the pdf link and download it
pdf_link <- s %>% read_html() %>%
  html_nodes(xpath = "//meta[@http-equiv='refresh']") %>%
  html_attr("content") %>% {gsub(".*URL=", "", .)}
s <- s %>% jump_to(pdf_link)
tmp <- tempfile()
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
doc
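A side note, not part of the original answer: in rvest 1.0 and later, html_session(), jump_to() and back() were renamed, so the same flow would start along these lines:
# rvest >= 1.0 equivalents of the session helpers used above
s <- session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
s <- session_jump_to(s, frame_link)  # instead of jump_to()
s <- session_back(s)                 # instead of back()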

batch download zipped files in R

I am trying to download zipped files from website like http://cdo.ncdc.noaa.gov/qclcd_ascii/.
Since there are many files, is there a way to download them in batch instead of one by one? Ideally, the downloaded files can be unzipped in batch after downloading.
I tried to use system("curl http://cdo.ncdc.noaa.gov/qclcd_ascii/QCLCD") etc., but got many errors and status 127 warnings.
Any idea or suggestions?
Thanks!
This should work.
library(XML)

url <- "http://cdo.ncdc.noaa.gov/qclcd_ascii/"
doc <- htmlParse(url)

# get <a> nodes
Anodes <- getNodeSet(doc, "//a")

# get the ones with .zip's and .gz's
files <- grep("*.gz|*.zip", sapply(Anodes, function(Anode) xmlGetAttr(Anode, "href")), value = TRUE)

# make the full url
urls <- paste(url, files, sep = "")

# download each file
mapply(function(x, y) download.file(x, y), urls, files)
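The question also asks about unzipping in batch; here is a minimal sketch (an addition, not part of this answer), assuming the zip files were saved to the working directory as above:
# unzip every downloaded .zip into an "extracted" subdirectory
zips <- list.files(pattern = "\\.zip$")
lapply(zips, unzip, exdir = "extracted")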
It's not R, but you could easily use the program wget, ignoring robots.txt:
wget -r --no-parent -e robots=off --accept "*.gz" http://cdo.ncdc.noaa.gov/qclcd_ascii/
Here's my take on it:
### Load XML package, for 'htmlParse'
require(XML)
### Read in HTML contents, extract file names.
root <- 'http://cdo.ncdc.noaa.gov/qclcd_ascii/'
doc <- htmlParse(root)
fnames <- xpathSApply(doc, '//a[@href]', xmlValue)
### Keep only zip files, and create url paths to scrape.
fnames <- grep('zip$', fnames, value = T)
paths <- paste0(root, fnames)
Now that you have a vector of URLs and the corresponding file names in R, you can download them to your hard disk. You have two options: you can download in serial, or in parallel.
### Download data in serial, saving to the current working directory.
mapply(download.file, url = paths, destfile = fnames)
### Download data in parallel, also saving to current working directory.
require(parallel)
cl <- makeCluster(detectCores())
clusterMap(cl, download.file, url = paths, destfile = fnames,
           .scheduling = 'dynamic')
stopCluster(cl)
If you choose to download in parallel, I recommend 'dynamic' scheduling, which means that each core won't have to wait for others to finish before starting its next download. The downside to dynamic scheduling is the added communication overhead, but since downloading ~50 MB files is not very resource intensive, it is worth using this option as long as files download at slightly varying speeds.
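Another option, not part of the original answer and assuming the curl package is at version 5.0 or later, is curl::multi_download(), which fetches many URLs concurrently:
# concurrent batch download with the curl package (curl >= 5.0 assumed)
library(curl)
multi_download(paths, destfiles = fnames)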
Lastly, if you want to include the tar.gz files as well, change the regular expression to
fnames <- grep('(zip|gz)$', fnames, value = T)
To download everything under that directory you can do this:
wget -r -e robots=off http://cdo.ncdc.noaa.gov/qclcd_ascii/
