I need to download a file, save it in a folder while keeping the original filename from the website.
url <- "http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1"
From a web browser, if you click on that link, you get to download an excel file with this filename:
AfiliadosMuni-02-2015.xlsx
I know I can easily download it with the command download.file in R like this:
download.file(url, "test.xlsx", method = "curl")
But what I really need for my script is to download it keeping the original filename intact. I also know I can do this with curl from my console like this.
curl -O -J $"http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1"
But, again, I need this within an R script. Is there a way similar to the one above but in R? I have looked into the RCurl package but I couldn't find a solution.
You could always do something like:
library(httr)
library(stringr)
# alternate way to "download.file"
fil <- GET("http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1",
write_disk("tmp.fil"))
# get the name the site suggests it should be
fname <- str_match(headers(fil)$`content-disposition`, "\"(.*)\"")[2]
# rename
file.rename("tmp.fil", fname)
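As a rough sketch of the header-parsing step, the base R helper below pulls the suggested name out of a Content-Disposition value without stringr. The function name filename_from_disposition and the fallback value are made up for illustration, and it assumes the plain filename="..." form (the RFC 5987 filename*= variant is not handled).

```r
# Hypothetical helper: extract the suggested filename from a
# Content-Disposition header value such as
#   attachment; filename="AfiliadosMuni-02-2015.xlsx"
filename_from_disposition <- function(header, fallback = "download.bin") {
  # Missing or malformed header: use the fallback name
  if (!is.character(header) || length(header) != 1 || is.na(header)) {
    return(fallback)
  }
  # Capture everything after filename= up to a quote or semicolon
  m <- regmatches(header, regexec('filename="?([^";]+)"?', header))[[1]]
  if (length(m) < 2) {
    return(fallback)
  }
  # basename() drops any directory part the server might have included
  basename(m[2])
}
```

With the httr response from above, this could be called as filename_from_disposition(headers(fil)$`content-disposition`) before file.rename().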
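I think basename() would be the simplest option when the URL path itself ends in the filename (for the query-string URL in the question it returns "idcplg?IdcService=..." rather than AfiliadosMuni-02-2015.xlsx): https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/basename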
e.g.
download.file(url, basename(url))
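A small sketch of how basename() behaves on different URL shapes. The helper name url_basename and the example.com URL are assumptions for illustration; the helper simply strips any query string or fragment before calling basename().

```r
# Hypothetical helper: basename() of a URL with any "?query" or
# "#fragment" part removed first
url_basename <- function(url) {
  path <- sub("[?#].*$", "", url)
  basename(path)
}

# URL whose path ends in the filename: this gives a usable name
url_basename("https://cran.r-project.org/doc/manuals/r-release/R-data.pdf")

# URL from the question: the path ends in a script name, not the filename,
# so the server's Content-Disposition header is still needed
url_basename("http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1")
```

So this approach only recovers the original filename when the server exposes it in the URL path.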
I'm reading ncdf files directly from a website into my RMarkdown document.
When I try to read the file using the nc_open function, as in the code below, I get the error 'Passed a filename that is NOT a string of characters!'
Any idea how I can solve this?
ps: I even tried uncompressing the files with the gzcon function but the result is the same when I try to read the data.
Thanks for your help!
Kami
library(httr)
library(ncdf4)
nc<-GET("https://crudata.uea.ac.uk/cru/data/hrg/cru_ts_4.05/cruts.2103051243.v4.05/pre/cru_ts4.05.2011.2020.pre.dat.nc.gz")
cru_nc<-nc_open(nc)
OK, here is the full answer:
library(httr)
library(ncdf4)
library(R.utils)
url <- "https://crudata.uea.ac.uk/cru/data/hrg/cru_ts_4.05/cruts.2103051243.v4.05/pre/cru_ts4.05.2011.2020.pre.dat.nc.gz"
filename <- "/tmp/file.nc.gz"
# Download the file and store it as a temp file
download.file(url, filename, mode = "wb")
# Unzip the temp file
gunzip(filename)
# The unzipped filename drops the .gz
unzip_filename <- "/tmp/file.nc"
# You can now open the unzipped file with its **filename** rather than the object
cru_nc<-nc_open(unzip_filename)
Is this a mode="w" vs. mode="wb" issue? I've had this with files before, though I have no experience of ncdf4.
I'm not sure if you can pass mode="wb" to GET, but does
download.file(yourUrl, destfile, mode="wb")
work / help?
Edit:
Ah. The other thing is that you are storing the response as an R object (nc), but nc_open wants to open a file.
I think you need to save the file locally (unless nc_open can just take the URL) and then open it, possibly after unzipping.
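If you would rather not depend on R.utils for the unzipping step, something like the following base R sketch might do. The function name gunzip_base is made up for illustration, and it assumes the download is an ordinary gzip file; it streams the decompressed bytes to a new file and returns that file's path, which is what nc_open expects.

```r
# Hypothetical helper: decompress a .gz file with base R connections
# and return the path of the decompressed copy
gunzip_base <- function(gz_path, out_path = sub("\\.gz$", "", gz_path)) {
  input <- gzfile(gz_path, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(out_path, open = "wb")
  on.exit(close(output), add = TRUE)
  repeat {
    # Read the decompressed stream in chunks of up to 1 MiB
    chunk <- readBin(input, what = raw(), n = 1024 * 1024)
    if (length(chunk) == 0) break
    writeBin(chunk, output)
  }
  invisible(out_path)
}
```

After download.file(url, filename, mode = "wb"), the result could then be opened with nc_open(gunzip_base(filename)).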
I am new to R and would like to seek some advice.
I am trying to download multiple URLs (PDF files, not HTML) and save them as PDF files using R.
The links are stored as character strings (taken from the HTML code of the website).
I tried the download.file() function, but it takes one specific URL (written in the R script) and so downloads only one file per call. However, I have many URLs and would like help doing this.
Thank you.
I believe what you are trying to do is download a list of URLs, you could try something like this approach:
Store all the links in a vector using c(), e.g.:
urls <- c("http://link1", "http://link2", "http://link3")
Iterate through the vector and download each file:
for (url in urls) {
download.file(url, destfile = basename(url))
}
If you're using Linux/Mac and https you may need to specify method and extra attributes for download.file (note that -k tells curl to skip certificate verification):
download.file(url, destfile = basename(url), method="curl", extra="-k")
If you want, you can test my proof of concept here: https://gist.github.com/erickthered/7664ec514b0e820a64c8
Hope it helps!
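One possible refinement of the loop above, sketched under a few assumptions: the function name download_all and the dest_dir argument are invented here, and basename(url) is assumed to give a sensible file name for every link. It wraps each download in tryCatch so that one broken link does not abort the whole run, and reports which downloads succeeded.

```r
# Hypothetical helper: download every URL into dest_dir and return a
# logical vector (named by URL) saying which downloads succeeded
download_all <- function(urls, dest_dir = ".") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(urls, function(u) {
    dest <- file.path(dest_dir, basename(u))
    tryCatch(
      # download.file() returns 0 on success; mode = "wb" keeps PDFs intact
      suppressWarnings(
        download.file(u, destfile = dest, mode = "wb", quiet = TRUE)
      ) == 0,
      error = function(e) FALSE
    )
  }, logical(1))
}
```

For example, ok <- download_all(urls, "pdfs") followed by urls[!ok] would list the links that still need attention.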
URL
url = c('https://cran.r-project.org/doc/manuals/r-release/R-data.pdf',
'https://cran.r-project.org/doc/manuals/r-release/R-exts.pdf',
'http://kenbenoit.net/pdfs/text_analysis_in_R.pdf')
Designated names
names = c('manual1',
'manual2',
'manual3')
Iterate through the vector and download each file with the corresponding name:
for (i in seq_along(url)){
download.file(url[i], destfile = names[i], mode = 'wb')
}
I am trying to download zipped files from a website like http://cdo.ncdc.noaa.gov/qclcd_ascii/.
Since there are many files, is there a way to download them in batch instead of one by one? Ideally, the downloaded files could also be unzipped in batch after downloading.
I tried to use system("curl http://cdo.ncdc.noaa.gov/qclcd_ascii/QCLCD") etc., but got many errors and status 127 warnings.
Any idea or suggestions?
Thanks!
This should work.
library(XML)
url<-c("http://cdo.ncdc.noaa.gov/qclcd_ascii/")
doc<-htmlParse(url)
#get <a> nodes.
Anodes<-getNodeSet(doc,"//a")
#get the ones ending in .zip or .gz
files<-grep("\\.(gz|zip)$",sapply(Anodes, function(Anode) xmlGetAttr(Anode,"href")),value=TRUE)
#make the full url
urls<-paste(url,files,sep="")
#Download each file.
mapply(function(x,y) download.file(x,y,mode="wb"),urls,files)
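The question also asks about unzipping in batch. A minimal sketch for the .zip files, assuming they were saved to a single directory; the function name unzip_all and its arguments are made up for illustration, and .gz files would need a separate step.

```r
# Hypothetical helper: extract every .zip archive found in `dir`
# into `exdir`, returning the extracted paths per archive
unzip_all <- function(dir = ".", exdir = dir) {
  zips <- list.files(dir, pattern = "\\.zip$", full.names = TRUE)
  extracted <- lapply(zips, function(z) unzip(z, exdir = exdir))
  names(extracted) <- basename(zips)
  extracted
}
```

Calling unzip_all(".") after the downloads finish would extract everything in the working directory.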
It's not R, but you could easily use the program wget, ignoring robots.txt:
wget -r --no-parent -e robots=off --accept "*.gz" http://cdo.ncdc.noaa.gov/qclcd_ascii/
Here's my take on it:
### Load XML package, for 'htmlParse'
require(XML)
### Read in HTML contents, extract file names.
root <- 'http://cdo.ncdc.noaa.gov/qclcd_ascii/'
doc <- htmlParse(root)
fnames <- xpathSApply(doc, '//a[@href]', xmlValue)
### Keep only zip files, and create url paths to scrape.
fnames <- grep('zip$', fnames, value = TRUE)
paths <- paste0(root, fnames)
Now that you have a vector of URLs and the corresponding file names in R, you can download them to your hard disk. You have two options: download in serial or in parallel.
### Download data in serial, saving to the current working directory.
mapply(download.file, url = paths, destfile = fnames)
### Download data in parallel, also saving to current working directory.
require(parallel)
cl <- makeCluster(detectCores())
clusterMap(cl, download.file, url = paths, destfile = fnames,
.scheduling = 'dynamic')
### Shut the worker processes down once the downloads finish.
stopCluster(cl)
If you choose to download in parallel, I recommend considering 'dynamic' scheduling, which means that each core won't have to wait for others to finish before starting its next download. The downside to dynamic scheduling is the added communication overhead, but since downloading ~50 MB files is not very resource intensive, this option is worth using as long as files download at slightly varying speeds.
Lastly, if you also want to include gzipped files (such as .tar.gz archives), change the regular expression to
fnames <- grep('(zip|gz)$', fnames, value = TRUE)
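A note on grouping in patterns like this: the $ anchor binds only to the alternative it touches, so '(zip)|(gz)$' means "contains zip anywhere, or ends in gz". The sketch below uses invented file names and a hypothetical helper, keep_archives, to show the difference.

```r
# Invented file names, for illustration only
candidates <- c("a.zip", "b.tar.gz", "zipcodes.txt", "c.gz.txt")

# The anchor applies to "gz" only, so any name containing "zip" is kept
loose <- grep("(zip)|(gz)$", candidates, value = TRUE)

# Hypothetical helper: keep names that end in .zip or .gz
keep_archives <- function(fnames) {
  grep("\\.(zip|gz)$", fnames, value = TRUE)
}
strict <- keep_archives(candidates)
```

Here loose still contains "zipcodes.txt", while strict keeps only "a.zip" and "b.tar.gz".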
To download everything under that directory you can do this:
wget -r -e robots=off http://cdo.ncdc.noaa.gov/qclcd_ascii/
I'm trying to adopt the Reproducible Research paradigm while meeting halfway the people who prefer looking at Excel rather than text data files. My plan is to host Excel files on Dropbox and then access them using the xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL: the one you are using just goes to the landing page. I think the actual download URL is different, and I managed to get it sort of working using the code below.
I don't think you need RCurl or the getURL() function at all, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx")
closeAllConnections()
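One way to check whether you got the real file or the HTML landing page: an .xlsx file is a zip archive, so it should start with the bytes 50 4B 03 04 ("PK"). The helper name looks_like_zip is made up for this sketch.

```r
# Hypothetical helper: TRUE if the file starts with the zip signature
# "PK\003\004", which a genuine .xlsx file should
looks_like_zip <- function(path) {
  magic <- readBin(path, what = "raw", n = 4)
  identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}
```

If looks_like_zip("your.destination.xlsx") is FALSE, the download was most likely the landing page rather than the spreadsheet.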
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
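download_file_url <- function (
url,
outfile,
..., sha1 = NULL)
{
  # Only httr is needed here; sha1 is carried over from source_url but unused
  require(httr)
  stopifnot(is.character(url), length(url) == 1)
  # Fetch first, so no empty output file is left behind if the request fails
  request <- GET(url)
  stop_for_status(request)
  # Write the raw response body in binary mode; close the file even on error
  filetag <- file(outfile, "wb")
  on.exit(close(filetag))
  writeBin(content(request, as = "raw"), filetag)
}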
This seems to work for producing local versions of binary files, Excel included. Nicer, neater, smarter improvements on this would be gratefully received.