Downloading large files with R/RCurl efficiently

I see that many examples for downloading binary files with RCurl look like this:
library("RCurl")
curl = getCurlHandle()
bfile=getBinaryURL (
"http://www.example.com/bfile.zip",
curl= curl,
progressfunction = function(down, up) {print(down)}, noprogress = FALSE
)
writeBin(bfile, "bfile.zip")
rm(curl, bfile)
If the download is very large, I suppose it would be better to write it to the storage medium as it arrives, instead of fetching it all into memory first.
The RCurl documentation has some examples that fetch files in chunks and manipulate them as they are downloaded, but they all seem to refer to text chunks.
Can you give a working example?
UPDATE
A user suggests using R's native download.file() with mode = 'wb' for binary files.
In many cases the native function is a viable alternative, but there are a number of use cases it does not cover (https, cookies, forms etc.), which is the reason RCurl exists.
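For instance (a sketch only; the URL and cookie file below are placeholders, not from the original question), getBinaryURL() can pass libcurl options such as followlocation and cookiefile that are awkward to reproduce with the native function:
library(RCurl)
# hypothetical https URL behind a redirect and a cookie-based login
bfile <- getBinaryURL("https://www.example.com/bfile.zip",
                      .opts = list(followlocation = TRUE, cookiefile = "cookies.txt"))
writeBin(bfile, "bfile.zip")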

This is the working example:
library(RCurl)
f = CFILE("bfile.zip", mode = "wb")
curlPerform(url = "http://www.example.com/bfile.zip", writedata = f@ref)
close(f)
It will download straight to the file. The returned value is the status of the request (0 if no error occurred), rather than the downloaded data.
The mention of CFILE in the RCurl manual is a bit terse; hopefully future versions will include more details/examples.
For your convenience the same code is packaged as a function (and with a progress bar):
bdown=function(url, file){
library('RCurl')
f = CFILE(file, mode="wb")
a = curlPerform(url = url, writedata = f#ref, noprogress=FALSE)
close(f)
return(a)
}
## ...and now just give remote and local paths
ret = bdown("http://www.example.com/bfile.zip", "path/to/bfile.zip")
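If the server needs credentials or issues redirects, the same pattern still applies, since curlPerform() accepts the corresponding libcurl options directly (a sketch; the userpwd value below is a placeholder and not part of the original answer):
bdown_auth = function(url, file, userpwd = NULL) {
  library(RCurl)
  f = CFILE(file, mode = "wb")
  opts = list(url = url, writedata = f@ref, noprogress = FALSE, followlocation = TRUE)
  if (!is.null(userpwd)) opts$userpwd = userpwd   # e.g. "user:password"
  a = do.call(curlPerform, opts)
  close(f)
  return(a)
}
ret = bdown_auth("https://www.example.com/bfile.zip", "path/to/bfile.zip", userpwd = "user:pass")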

um.. use mode = 'wb' :) ..run this and follow along with my comments.
# create a temporary file and a temporary directory on your local disk
tf <- tempfile()
td <- tempdir()
# run download.file, downloading as binary.. save the result to the temporary file
download.file(
  "http://sourceforge.net/projects/peazip/files/4.8/peazip_portable-4.8.WINDOWS.zip/download",
  tf,
  mode = 'wb'
)
# unzip the files to the temporary directory
files <- unzip(tf, exdir = td)
# here are your files
files
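Once you have extracted what you need, the downloaded zip can be removed (the extracted files under td stay available for the rest of the session):
# remove the downloaded archive; keeps the extracted files in td
unlink(tf)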

Related

download.file() download corrupt xls

I am trying to create a package to download, import and clean data from the Dominican Republic Central Bank web page. I have done all the coding in rstudio.cloud and everything works just fine, but when I try the functions on my local machine they do not work.
After digging a bit into each function, I realized that the problem was the downloaded file: it is corrupt.
I am including the first steps of a function just to illustrate my issue.
# Packages
library(readxl)
# file url
url <- paste0("https://cdn.bancentral.gov.do/documents/",
              "estadisticas/precios/documents/",
              "ipc_base_2010.xls?v=1570116997757")
# temporary path
file_path <- tempfile(pattern = "", fileext = ".xls")
# downloading
download.file(url, file_path, quiet = TRUE)
# reading the file
ipc_general <- readxl::read_excel(
  file_path,
  sheet = 1,
  col_names = FALSE,
  skip = 7
)
Error:
filepath: C:\Users\Johan Rosa\AppData\Local\Temp\RtmpQ1rOT3\2a74778a1a64.xls
libxls error: Unable to open file
I am using temporary files, but that is not the problem; you can try to download the file into your working directory and the problem persists.
I want to know:
Why does this code work in rstudio.cloud and not locally?
What can I do to get the job done? (alternative approaches, packages, functions)
By the way, I am using Windows 10.
Edit
Answer:
1- rstudio.cloud runs on Linux; on Windows, I need to make some adjustments to the download.file() call.
2- download.file(url, file_path, quiet = TRUE, mode = "wb")
This is what I was looking for.
Now I have a different problem. I have to think of a way to detect whether the function is running on Linux or Windows, to set that argument accordingly.
I could write a new download function using if/else on the result of .Platform$OS.type.
Or can I just set mode = "wb" for all download.file() calls?
Do you have any recommendations?
From the Documentation of download.file()
The choice of binary transfer (mode = "wb" or "ab") is important on
Windows, since unlike Unix-alikes it does distinguish between text and
binary files and for text transfers changes \n line endings to \r\n
(aka CRLF).
Code written to download binary files must use mode = "wb" (or "ab"),
but the problems incurred by a text transfer will only be seen on
Windows.
From the source of download.file:
head(print(download.file), 12)
function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE,
    extra = getOption("download.file.extra"), headers = NULL,
    ...)
{
    destfile
    method <- if (missing(method))
        getOption("download.file.method", default = "auto")
    else match.arg(method, c("auto", "internal", "wininet", "libcurl",
        "wget", "curl", "lynx"))
    if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rd[as]|RData)$",
        URLdecode(url))))
        mode <- "wb"
So, looking at the source: if you do not set mode, the function automatically uses "w", except when the URL contains gz, bz2, xz, zip, etc. (that is why you got the first error).
In my humble opinion, on Unix-alikes (e.g. Linux) "w" and "wb" are the same, because they do not differentiate between text and binary files, but Windows does.
So you could set mode = "wb" for all download.file calls (as long as it is not a text transfer under Windows); this will not affect the function on Linux.
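If you prefer to be explicit rather than rely on the URL-extension heuristic, a thin wrapper works on both platforms (a sketch, not a built-in function):
download_binary <- function(url, destfile, ...) {
  # mode = "wb" is required for binary files on Windows and is harmless on
  # Unix-alikes, which do not distinguish text from binary transfers
  download.file(url, destfile, mode = "wb", ...)
}
download_binary(url, file_path, quiet = TRUE)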

Import excel from Azure blob using R

I have the basic setup done following the link below:
http://htmlpreview.github.io/?https://github.com/Microsoft/AzureSMR/blob/master/inst/doc/tutorial.html
There is a method 'azureGetBlob' which allows you to retrieve objects from the containers. However, it seems to only allow "raw" and "text" formats, which is not very useful for Excel. I've tested the connection etc.; I can retrieve .txt / .csv files but not .xlsx files.
Does anyone know any workaround for this?
Thanks
Does anyone know any workaround for this?
Azure Blob Storage has no notion of file type; a blob is just a name, and the extension only matters to the OS. If you want to open an Excel file in R, you can use a third-party library such as readxl.
Workaround:
You could use the Get Blob API to download the blob to a local path and then use readxl to read the file. You can also find more demo code at this link.
# install
install.packages("readxl")
# Loading
library("readxl")
# xls files
my_data <- read_excel("my_file.xls")
# xlsx files
my_data <- read_excel("my_file.xlsx")
Solved with the following code. Basically, read the file in as bytes, write it to disk, then read it into R:
library(AzureSMR)   # provides azureGetBlob()
library(xlsx)       # provides read.xlsx()
excel_bytes <- azureGetBlob(sc, storageAccount = "accountname", container = "containername",
                            blob = blob_name, type = "raw")
q <- tempfile()
f <- file(q, 'wb')
writeBin(excel_bytes, f)
close(f)
result <- read.xlsx(q, sheetIndex = sheetIndex)
unlink(q)
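If you'd rather avoid the Java-dependent xlsx package, the same bytes can be read with readxl instead (an untested sketch; giving the temp file an .xlsx extension helps read_excel() detect the format):
library(readxl)
q <- tempfile(fileext = ".xlsx")   # extension lets read_excel() detect the format
writeBin(excel_bytes, q)           # writeBin() accepts a file name directly
result <- read_excel(q, sheet = 1)
unlink(q)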

R Downloading multiple file from FTP using Rcurl

I'm a new R user.
I am trying to download 7,000 files (.nc format) from an FTP server (for which I have a user name and password). On the website, each file is a link to download. I would like to download all the files (.nc).
I would be grateful to anyone who can help me run those jobs in R. Here is an example of what I have tried using RCurl and a loop; it just tells me: cannot download all files.
library(RCurl)
url <- "ftp://ftp.my.link.fr/1234/"
userpwd <- "user:password"
destination <- "/Users/ME/Documents"
filenames <- getURL(url, userpwd = userpwd,
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)
for (i in seq_along(url)) {
  download.file(url[i], destination[i], mode = "wb")
}
how can I do that?
The first thing you'd see is that the files in your directory, i.e. the object filenames, are listed as one long string. To obtain a character vector of all the file names, you may try:
files <- unlist(strsplit(filenames, '\n'))
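Depending on the server's line endings, each name may keep a trailing "\r" after the split, so it can be worth trimming as well (an assumption about this particular server):
files <- trimws(files)        # drops any trailing "\r" left over from "\r\n" listings
files <- files[nzchar(files)] # drops empty entries from a trailing newline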
From here on, it's simply a matter of looping through all the files in the directory. I recommend you use the curl package rather than RCurl to download the files, as it's easier to supply auth info with every download request.
library(curl)
h <- new_handle()
handle_setopt(h, userpwd = "user:pwd")
and then
lapply(files, function(filename) {
  curl_download(paste(url, filename, sep = ""), destfile = filename, handle = h)
})

Write R data as csv directly to s3

I would like to be able to write data from a data.frame / data.table object directly to a bucket in AWS S3 as a csv file, without first writing it to disk and uploading it with the AWS CLI.
obj.to.write.s3 <- data.frame(cbind(x1=rnorm(1e6),x2=rnorm(1e6,5,10),x3=rnorm(1e6,20,1)))
At the moment I write to csv first, then upload to an existing bucket, then remove the file, using:
fn <- 'new-file-name.csv'
write.csv(obj.to.write.s3, file = fn)
system(paste0('aws s3 cp ', fn, ' s3://my-bucket-name/', fn))
system(paste0('rm ', fn))
Is there a function that writes directly to S3? Is that possible?
In aws.s3 0.2.2 the s3write_using() (and s3read_using()) functions were added.
They make things much simpler:
s3write_using(iris, FUN = write.csv,
              bucket = "bucketname",
              object = "objectname")
The easiest solution is just to save the .csv in a tempfile(), which will be purged automatically when you close your R session.
If you need to only work in memory you can do this by doing write.csv() to a rawConnection:
# write to an in-memory raw connection
zz <- rawConnection(raw(0), "r+")
write.csv(iris, zz)
# upload the object to S3
aws.s3::put_object(file = rawConnectionValue(zz),
                   bucket = "bucketname", object = "iris.csv")
# close the connection
close(zz)
In case you're unsure, you can then check that this worked correctly by downloading the object from S3 and reading it back into R:
# check that it worked
## (option 1: save locally)
save_object(object = "iris.csv", bucket = "bucketname", file = "iris.csv")
read.csv("iris.csv")
## (option 2: keep in memory)
read.csv(text = rawToChar(get_object(object = "iris.csv", bucket = "bucketname")))
Sure -- but 'saving to file' requires that your OS sees the desired target directory as an accessible filesystem. So in essence you "just" need to mount S3. Here is a quick Google search for that topic.
An alternative is writing to a temporary file, and then using whatever you use to transfer files. You could code up both operations as a simple helper function.
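A minimal version of such a helper (a sketch; it assumes the AWS CLI is installed and configured, and the bucket name is a placeholder):
s3_write_csv <- function(df, object, bucket = "my-bucket-name") {
  fn <- tempfile(fileext = ".csv")   # temporary local copy
  write.csv(df, file = fn, row.names = FALSE)
  system(paste0("aws s3 cp ", shQuote(fn), " s3://", bucket, "/", object))
  unlink(fn)                         # remove the local copy right away
}
s3_write_csv(obj.to.write.s3, "new-file-name.csv")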

Using R to download gzipped data file, extract, and import data

A follow-up to this question: How can I download and uncompress a gzipped file using R? For example (from the UCI Machine Learning Repository), I have a file of insurance data. How can I download it using R?
Here is the data url: http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz.
I like Ramnath's approach, but I would use temp files like so:
tmpdir <- tempdir()
url <- 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
file <- basename(url)
download.file(url, file)
untar(file, compressed = 'gzip', exdir = tmpdir )
list.files(tmpdir)
The list.files() should produce something like this:
[1] "TicDataDescr.txt" "dictionary.txt" "ticdata2000.txt" "ticeval2000.txt" "tictgts2000.txt"
which you could parse if you needed to automate this process for a lot of files.
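For example, to pull one of the extracted data files into R (a sketch; the UCI data files appear to be whitespace-delimited with no header row, so adjust sep/header if the layout differs):
ticdata <- read.table(file.path(tmpdir, "ticdata2000.txt"), header = FALSE)
dim(ticdata)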
Here is a quick way to do it.
# create download directory and set it
.exdir = '~/Desktop/tmp'
dir.create(.exdir)
.file = file.path(.exdir, 'tic.tar.gz')
# download file
url = 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
download.file(url, .file)
# untar it
untar(.file, compressed = 'gzip', exdir = path.expand(.exdir))
Please see help(download.file) for that. If the file in question is merely gzipped but otherwise a readable file, you can feed the complete URL to read.table() et al. too.
Using library(archive) one can also read a particular csv file within the archive, without having to unzip it first:
library(archive)
library(readr)
read_csv(archive_read("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", file = 1), col_types = cols())
This is quite a bit faster.
To unzip everything, one can use archive_extract("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", dir = XXX).
That worked very well for me and is faster than the built-in untar(). It also works on all platforms. It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.
