I'm trying to adopt the Reproducible Research paradigm while meeting halfway the people who prefer looking at Excel files rather than text data files. My plan is to host the Excel files on Dropbox and then read them using the xlsx package.
Much like downloading and unpacking a zipped file, I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL: the one you are using only goes to the landing page. I think the actual download URL is different; I managed to get it more or less working using the code below.
I don't think you need RCurl or the getURL() function at all. getURL() returns the content of the page rather than a URL, which is why download.file() then complains about an unsupported URL scheme. You were also leaving out some relatively important /'s when pasting the link together.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx")
closeAllConnections()
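To see why the slashes matter, here is a minimal sketch comparing the two ways of pasting the link together. The key and file name are made-up placeholders, and build_dropbox_url is a hypothetical helper name, not something from a package.

```r
# Hypothetical helper: join the URL pieces with "/" between them
build_dropbox_url <- function(key, filename) {
  paste("https://dl.dropboxusercontent.com/s", key, filename, sep = "/")
}

# Placeholder values, for illustration only
key <- "abc123"
filename <- "data.xlsx"

# paste0() adds no separator, so the key and the file name run together
broken <- paste0("https://www.dropbox.com/s/", key, filename)
fixed <- build_dropbox_url(key, filename)

print(broken)
print(fixed)
```

Printing both strings side by side makes the missing separator in the first one easy to spot.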
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
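download_file_url <- function (
url,
outfile,
..., sha1 = NULL)
{
# Only httr is needed; sha1 is left over from source_url() and is not used here
require(httr)
stopifnot(is.character(url), length(url) == 1)
# Fetch first, so that no empty file is left behind if the request fails
request <- GET(url, ...)
stop_for_status(request)
# Open in binary mode ("wb") so that files such as .xlsx are written unchanged
filetag <- file(outfile, "wb")
on.exit(close(filetag))
# as = "raw" returns the response body as unparsed bytes
writeBin(content(request, as = "raw"), filetag)
}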
This seems to work for producing local versions of binary files, Excel included. Nicer, neater, smarter improvements on this are gratefully received.
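One possibly simpler variant, offered as a sketch rather than a definitive fix, is to stay with base R and let download.file() write in binary mode. The name download_binary is hypothetical, and whether https links work this way depends on your R version and platform.

```r
# Hypothetical wrapper around download.file() for binary files
download_binary <- function(url, destfile, ...) {
  stopifnot(is.character(url), length(url) == 1)
  # mode = "wb" writes the bytes unchanged, which matters on Windows
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE, ...)
  # download.file() returns 0 (invisibly) on success
  if (status != 0) stop("download failed with status ", status)
  invisible(destfile)
}
```

It would be called as download_binary(link, "your.destination.xlsx"), with link built as in the answer above.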
I have a problem downloading data over HTTPS in R. I tried using curl, but it doesn't work.
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
options('download.file.method'='curl')
download.file(URL, destfile = "./data.csv", method="auto")
I downloaded the CSV file with that code, but when I checked the data the format had changed, so it didn't download correctly.
Could someone please help me?
I think you might actually have the URL wrong. I think you want:
https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv
Then you can pass the URL straight to download.file() rather than creating a variable for it. download.file() ships with base R (in the utils package), so you don't need RCurl:
download.file("https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv",destfile="./data.csv",method="libcurl")
You can also load the file directly into R from the site, since read.csv() accepts a URL. Use the raw link here as well, not the blob page:
URL <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
out <- read.csv(URL)
You can use the 'raw.githubusercontent.com' link. In the browser, when you go to "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv", you can click on the "View raw" link (it's above "Sorry about that, but we can’t show files that are this big right now."), which takes you to the actual data. You also have some minor typos.
This worked as expected for me:
url <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
download.file(url, destfile = "./data.csv", method="auto")
df <- read.csv("./data.csv")
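If you would rather not click through the browser, the blob link can be rewritten into the raw link with a regular expression. This is only a sketch: github_raw_url is a hypothetical helper name, and it assumes the link follows the github.com/USER/REPO/blob/REF/PATH layout seen in the question.

```r
# Hypothetical helper: turn a GitHub "blob" page URL into its raw-file URL
github_raw_url <- function(blob_url) {
  sub("^https://github\\.com/([^/]+)/([^/]+)/blob/",
      "https://raw.githubusercontent.com/\\1/\\2/",
      blob_url)
}

blob <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
raw_url <- github_raw_url(blob)
print(raw_url)
```

A URL that does not match the blob layout is returned unchanged, so it is safe to call on links that are already raw.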
(NB- I am very much a beginner in R.)
This is the code I tried:
read_xlsx("valid/url")
For some reason I get the error message:
'path' does not exist:'valid/url'
I know the URL works, I have tested it many times. I am mystified, so any help would be much appreciated.
If I understand your issue correctly, you are passing the URL directly to the read_xlsx command. As far as I am aware, this will not work when your Excel file is online; you will need to download it locally first.
I suggest the following adjustment:
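library(readxl)
url <- "valid/url"
# the .xlsx extension lets read_excel() recognise the format
temp <- tempfile(fileext = ".xlsx")
# mode = "wb" keeps the binary file intact on Windows
download.file(url, temp, mode="wb")
df1 <- read_excel(path = temp)
This will download the Excel file into a temporary file, which you can then read into a data frame, since it will be saved locally.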
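The same download-then-read pattern can be wrapped up so that it works for any reader function. The sketch below is an assumption-laden illustration: read_remote is a hypothetical name, and it assumes the reader loads the whole file before returning, because the temporary file is deleted afterwards.

```r
# Hypothetical helper: download to a temporary file, then hand it to a reader
read_remote <- function(url, reader, fileext = "", ...) {
  temp <- tempfile(fileext = fileext)
  # remove the temporary file once the reader has finished
  on.exit(unlink(temp))
  status <- download.file(url, temp, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  # extra arguments in ... are passed on to the reader
  reader(temp, ...)
}
```

For the Excel case this would be called as read_remote("valid/url", readxl::read_excel, fileext = ".xlsx"), and for a CSV file as read_remote(url, read.csv, fileext = ".csv").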
I need to download a file and save it in a folder while keeping the original filename from the website.
url <- "http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1"
From a web browser, if you click on that link, you get to download an Excel file with this filename:
AfiliadosMuni-02-2015.xlsx
I know I can easily download it with the command download.file in R like this:
download.file(url, "test.xlsx", method = "curl")
But what I really need for my script is to download it keeping the original filename intact. I also know I can do this with curl from my console like this.
curl -O -J $"http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1"
But, again, I need this within an R script. Is there a way similar to the one above but in R? I have looked into the RCurl package but I couldn't find a solution.
You could always do something like:
library(httr)
library(stringr)
# alternate way to "download.file"
fil <- GET("http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1",
write_disk("tmp.fil"))
# get the name the site suggests the file should have
fname <- str_match(headers(fil)$`content-disposition`, "\"(.*)\"")[2]
# rename
file.rename("tmp.fil", fname)
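If you want to avoid the stringr dependency, the header can be parsed with base R. This is a rough sketch: filename_from_disposition is a hypothetical name, it assumes a header of the form attachment; filename="...", and it does not handle the encoded filename*= variant.

```r
# Hypothetical helper: pull the suggested file name out of a
# Content-Disposition header value, using base R only
filename_from_disposition <- function(header) {
  if (is.null(header) || length(header) != 1 || is.na(header)) {
    return(NA_character_)
  }
  # capture what follows filename=, with or without surrounding quotes
  m <- regmatches(header,
                  regexec("filename=\"?([^\";]+)\"?", header, ignore.case = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  # basename() drops any directory part the server might have sent
  basename(trimws(m[2]))
}

# The header value below is made up to match the file name in the question
print(filename_from_disposition("attachment; filename=\"AfiliadosMuni-02-2015.xlsx\""))
```

With the httr code above, the argument would be the content-disposition element of headers(fil). The function returns NA when no file name is present, so check for that before calling file.rename().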
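I think basename() would be the simplest option when the URL itself ends with the file name https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/basename
e.g.
download.file(url, basename(url))
Note that for the URL in the question, basename(url) returns the script name plus the query string ("idcplg?IdcService=GET_FILE&...") rather than AfiliadosMuni-02-2015.xlsx, so this only helps for URLs whose path ends in the file name.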
I am getting an error from fread:
Internal error: ch>eof when detecting eol
when trying to read a csv file downloaded from an https server, using R 3.2.0. I found something related on Github, https://github.com/Rdatatable/data.table/blob/master/src/fread.c, but don't know how I could use this, if at all. Thanks for any help.
Added info: the data was downloaded from here:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
then I used
download.file(fileURL, "Idaho2006.csv", method = "Internal")
The problem is that download.file() doesn't work with https when method = "internal", unless you're on Windows and set an option. Since fread() uses download.file() when you pass it a URL rather than a local file, it will fail. You have to download the file yourself and then open it from the local copy.
If you're on Linux, or already have wget or curl installed, use method = "wget" or method = "curl" instead.
If you're on Windows, don't have either of them and don't want to install them, call setInternet2(use = TRUE) before your download.file().
http://www.inside-r.org/r-doc/utils/setInternet2
For example:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
download.file(fileURL, tempf, method = "curl")
DT <- fread(tempf)
unlink(tempf)
Or
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
setInternet2(use = TRUE)  # Windows only; must be a function call, not an assignment
download.file(fileURL, tempf)
DT <- fread(tempf)
unlink(tempf)
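If the script has to run on several machines, the choice of method can be made at run time. The following is only a sketch: pick_download_method is a hypothetical helper, and it assumes an R version recent enough to report libcurl support through capabilities().

```r
# Hypothetical helper: choose a download.file() method that can handle https
pick_download_method <- function() {
  # prefer R's built-in libcurl support when it is compiled in
  if (isTRUE(unname(capabilities("libcurl")))) return("libcurl")
  # otherwise look for an external curl or wget on the PATH
  for (tool in c("curl", "wget")) {
    if (nzchar(Sys.which(tool))) return(tool)
  }
  # fall back to letting R decide
  "auto"
}

print(pick_download_method())
```

The result can then be passed along as download.file(fileURL, tempf, method = pick_download_method()).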
fread() now uses the curl package for downloading files, and this seems to work just fine at the moment:
require(data.table) # v1.9.6+
fread(fileURL, showProgress = FALSE)
In my experience, the easiest way to fix this problem is to just remove the s from https. Also remove the method argument; you don't need it. My OS is Windows, and I have tried the following code and it works.
fileURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(fileURL, "Idaho2006.csv")