R not downloading the complete tar.gz using download.file

I am trying to download a 286 MB tarball (tar.gz) from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_hcn.tar.gz (you have to be registered to try it). I can download the full 286 MB with a browser (Firefox), but when I try to download it in R I get files of varying sizes and never the whole 286 MB. Any reason for this behavior?
Here is my R code.
tb = "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_hcn.tar.gz"
tempfile = paste(path, "/ghcnd_hcn.tar.gz", sep = "")  # temporary file name
download.file(tb, destfile = tempfile, replace = TRUE)  ## download data

I can download the whole file in R without problems, but it is quite slow. Maybe a time-out issue?
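If it is a timeout: download.file() gives up after getOption("timeout") seconds (60 by default), so raising that limit before the call may help. A minimal sketch, reusing tb and tempfile from the question:
options(timeout = max(600, getOption("timeout")))  # allow up to 10 minutes for the 286 MB file
download.file(tb, destfile = tempfile, mode = "wb")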
You can also try different values for the method parameter of download.file():
download.file(tb, destfile = tempfile, method = "curl")
file.size(tempfile)
download.file(tb, destfile = tempfile, method = "wget")
file.size(tempfile)

Related

download.file() download corrupt xls

I am trying to create a package to download, import, and clean data from the Dominican Republic Central Bank web page. I did all the coding on rstudio.cloud and everything works just fine, but when I try the functions on my local machine they do not work.
After digging into each function, I realized that the problem is the downloaded file: it is corrupt.
I am including the first steps of a function just to illustrate my issue.
# Packages
library(readxl)
# file url
url <- paste0("https://cdn.bancentral.gov.do/documents/",
              "estadisticas/precios/documents/",
              "ipc_base_2010.xls?v=1570116997757")
# temporary path
file_path <- tempfile(pattern = "", fileext = ".xls")
# downloading
download.file(url, file_path, quiet = TRUE)
# reading the file
ipc_general <- readxl::read_excel(
  file_path,
  sheet = 1,
  col_names = FALSE,
  skip = 7
)
Error:
filepath: C:\Users\Johan Rosa\AppData\Local\Temp\RtmpQ1rOT3\2a74778a1a64.xls
libxls error: Unable to open file
I am using temporary files, but that is not the problem; you can try downloading the file into your working directory and the problem persists.
I want to know:
Why does this code work on rstudio.cloud and not locally?
What can I do to get the job done? (alternative approaches, packages, functions)
By the way, I am using Windows 10
Edit
Answer:
1- rstudio.cloud runs on Linux, but on Windows I need to make an adjustment to the download.file() call.
2- download.file(url, file_path, quiet = TRUE, mode = "wb")
This is what I was looking for.
Now I have a different problem: I need a way to detect whether the function is running on Linux or Windows, so I can set that argument accordingly.
I can write a new download function using an if/else on the .Platform$OS.type result.
Or, can I set mode = "wb" for all download.file() calls?
Do you have any recommendations?
From the Documentation of download.file()
The choice of binary transfer (mode = "wb" or "ab") is important on
Windows, since unlike Unix-alikes it does distinguish between text and
binary files and for text transfers changes \n line endings to \r\n
(aka CRLF).
Code written to download binary files must use mode = "wb" (or "ab"),
but the problems incurred by a text transfer will only be seen on
Windows.
From the source of download.file
head(print(download.file), 12)
function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE,
    extra = getOption("download.file.extra"), headers = NULL,
    ...)
{
    destfile
    method <- if (missing(method))
        getOption("download.file.method", default = "auto")
    else match.arg(method, c("auto", "internal", "wininet", "libcurl",
        "wget", "curl", "lynx"))
    if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rd[as]|RData)$",
        URLdecode(url))))
        mode <- "wb"
So, looking at the source: if you do not set mode, the function automatically uses "w", unless the URL ends in gz, bz2, xz, tgz, zip, etc. (that is why you got the error in the first place).
In my humble opinion, on Unix-alikes (e.g. Linux) "w" and "wb" are the same, because they do not differentiate between text and binary files, whereas Windows does.
So you could set mode = "wb" for all download.file calls (as long as it is not a text transfer under Windows); this will not affect the function on Linux.
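If you do want an explicit wrapper along the lines you describe, a minimal sketch (the function names here are just illustrative, not from any package) could be:
# always request binary mode; required for binary files on Windows, harmless on Unix-alikes
download_binary <- function(url, destfile, ...) {
  download.file(url, destfile, mode = "wb", ...)
}
# or branch on the platform explicitly, as you suggested
download_os_aware <- function(url, destfile, ...) {
  mode <- if (.Platform$OS.type == "windows") "wb" else "w"
  download.file(url, destfile, mode = mode, ...)
}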

Downloading multiple files from URL using R not working

I am trying to download multiple NetCDF (.nc) format files from multiple URLs in a loop. However, when I try to open the files, they seem to be corrupted.
You will find my code below. I have tried different methods, for instance, using download.file or system.
This is an example of the files I need to download:
http://thredds.met.no/thredds/catalog/metusers/senorge2/seNorge2/provisional_archive/PREC1d/gridded_dataset/201601/catalog.html
But I need to download hundreds of files, since each file represents a day.
Here's my code so far:
year = c("2016","2017")
mon = c("01","02")
day = c("01","02","03","04","05","06","07","08","09","10",
"11","12","13","14","15","16","17","18","19","20",
"21","22","23","24","25","26","27","28","29","30","31")
for (y in year){
for (m in mon){
for (d in day){
download.file(paste("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
"PREC1d/gridded_dataset/",y,m,"/seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),
destfile=paste("seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),method="curl",mode="wb")
#try(system(paste("wget ",paste("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
# "PREC1d/gridded_dataset/",y,m,"/seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),sep=""),
# intern = TRUE, ignore.stderr = TRUE, wait=TRUE))
}
}
}
Any help is appreciated.
Thank you!
Best,
Michel
When I try your code I get 503 Service Temporarily Unavailable for some files. To retry the download in this case, add --retry-on-http-error=503. Maybe also add --random-wait. I changed the method from curl to wget and removed mode="wb", as the manual says it is "Not used for methods 'wget' and 'curl'". Hope the following solves your problem.
year = c("2016","2017")
mon = c("01","02")
day = c("01","02","03","04","05","06","07","08","09","10",
"11","12","13","14","15","16","17","18","19","20",
"21","22","23","24","25","26","27","28","29","30","31")
for (y in year){
for (m in mon){
for (d in day){
download.file(paste("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
"PREC1d/gridded_dataset/",y,m,"/seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),
destfile=paste("seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),method="wget",extra="--random-wait --retry-on-http-error=503")
}
}
}
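One further tweak, offered as a sketch rather than part of the answer above: the loop also requests dates that do not exist, such as 2016-02-30, so building the file list from a real date sequence avoids those failed requests:
dates  <- seq(as.Date("2016-01-01"), as.Date("2017-02-28"), by = "day")
dates  <- dates[format(dates, "%m") %in% c("01", "02")]   # keep only Jan/Feb, as in the question
stamps <- format(dates, "%Y%m%d")
urls   <- paste0("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
                 "PREC1d/gridded_dataset/", format(dates, "%Y%m"),
                 "/seNorge_v2_0_PREC1d_grid_", stamps, "_", stamps, ".nc")
for (u in urls) {
  download.file(u, destfile = basename(u), method = "wget",
                extra = "--random-wait --retry-on-http-error=503")
}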
What do you mean when you say that the file is 'corrupted'? How are you trying to read the nc files?
Your code seems to work and I can read the downloaded files. You can use the raster package in R to read the file. Please also ensure you have the ncdf4 package installed.
library(raster)
r = raster('seNorge_v2_0_PREC1d_grid_20160101_20160101.nc')
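If all the daily files share the same grid, a sketch along these lines (assuming each .nc file holds a single layer) reads them into one stack:
library(raster)
files <- list.files(pattern = "^seNorge_v2_0_PREC1d_grid_.*\\.nc$")
prec  <- stack(files)  # requires the ncdf4 package to read NetCDF
prec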

small files in R use of RAM

I'm trying to open a 163MB .xlsx file in R.
library(openxlsx)
df <- read.xlsx(xlsxFile = "df.xlsx", sheet = 1, colNames = T)
Doing this with this (relatively) small file uses all the 8 GB of RAM I have on my laptop.
I have a CSV version of this file, but due to the use of , and ; in one of the columns, using the CSV is not an option. How does this happen, knowing that I recently loaded a Kaggle file (a 0.5 GB csv) into R and could still use my laptop for browsing the internet?
Edit: the RAM usage + output of pryr::object_size(df)
Did you try the readxl package? https://blog.rstudio.org/2017/04/19/readxl-1-0-0/
read_xlsx(path, sheet = NULL, range = NULL, col_names = TRUE,
          col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
          guess_max = min(1000, n_max))
You can also save it as a tab-delimited .txt file and read it with read.csv(..., sep = "\t").
It looks like this is (or at least was) a problem with openxlsx. This Github issue describes the problem of inflated file sizes and suggests a solution (use the development version): https://github.com/awalker89/openxlsx/issues/161
So, potential solutions:
Try the development version of openxlsx (devtools::install_github("awalker89/openxlsx"))
As suggested by @Ajay Ohri, try the readxl package instead (see the sketch below).
Load the file and save it as a binary R file with save() or saveRDS().
Try again with the .csv file with readr::read_csv() or data.table::fread(); both are faster than base R's read.csv().
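A rough sketch combining the readxl and saveRDS suggestions (the file name is the one from the question):
library(readxl)
df <- read_xlsx("df.xlsx", sheet = 1, col_names = TRUE)  # readxl tends to be lighter on memory than openxlsx
saveRDS(df, "df.rds")    # write once as a compact binary R file
df <- readRDS("df.rds")  # later sessions can reload it quickly without touching the xlsx again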

trying to use fread() on .csv file but getting internal error "ch>eof"

I am getting an error from fread:
Internal error: ch>eof when detecting eol
when trying to read a csv file downloaded from an https server, using R 3.2.0. I found something related on Github, https://github.com/Rdatatable/data.table/blob/master/src/fread.c, but don't know how I could use this, if at all. Thanks for any help.
Added info: the data was downloaded from here:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
then I used
download.file(fileURL, "Idaho2006.csv", method = "Internal")
The problem is that download.file doesn't work with https with method=internal unless you're on Windows and set an option. Since fread uses download.file when you pass it a URL and not a local file, it'll fail. You have to download the file manually then open it from a local file.
If you're on Linux or already have wget or curl installed, then use method = "wget" or method = "curl" instead.
If you're on Windows, don't have either, and don't want to download them, then call setInternet2(use = TRUE) before your download.file call.
http://www.inside-r.org/r-doc/utils/setInternet2
For example:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
download.file(fileURL, tempf, method = "curl")
DT <- fread(tempf)
unlink(tempf)
Or
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
setInternet2(use = TRUE)
download.file(fileURL, tempf)
DT <- fread(tempf)
unlink(tempf)
fread() now uses the curl package for downloading files, and this seems to work just fine at the moment:
require(data.table) # v1.9.6+
fread(fileURL, showProgress = FALSE)
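If you prefer to keep the download and the read as two explicit steps, a sketch using the curl package directly (assuming it is installed, and reusing fileURL from above) would be:
library(curl)
library(data.table)
tmp <- tempfile(fileext = ".csv")
curl_download(fileURL, tmp)  # handles https on all platforms and writes straight to disk
DT <- fread(tmp)
unlink(tmp)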
The easiest way to fix this problem, in my experience, is to just remove the s from https. Also remove the method argument; you don't need it. My OS is Windows and I have tried the following code and it works.
fileURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(fileURL, "Idaho2006.csv")

Downloading large files with R/RCurl efficiently

I see that many examples for downloading binary files with RCurl look like this:
library("RCurl")
curl = getCurlHandle()
bfile=getBinaryURL (
"http://www.example.com/bfile.zip",
curl= curl,
progressfunction = function(down, up) {print(down)}, noprogress = FALSE
)
writeBin(bfile, "bfile.zip")
rm(curl, bfile)
If the download is very large, I suppose it would be better to write it to disk as it arrives, instead of fetching it all into memory first.
In the RCurl documentation there are some examples that get files by chunks and manipulate them as they are downloaded, but they all seem to refer to text chunks.
Can you give a working example?
UPDATE
A user suggests using R's native download.file() with the mode = 'wb' option for binary files.
In many cases the native function is a viable alternative, but there are a number of use-cases where this native function does not fit (https, cookies, forms etc.) and this is the reason why RCurl exists.
This is the working example:
library(RCurl)
f = CFILE("bfile.zip", mode = "wb")
curlPerform(url = "http://www.example.com/bfile.zip", writedata = f@ref)
close(f)
It will download straight to file. The returned value will be (instead of the downloaded data) the status of the request (0, if no errors occur).
The mention of CFILE in the RCurl manual is a bit terse. Hopefully in the future it will include more details/examples.
For your convenience the same code is packaged as a function (and with a progress bar):
bdown = function(url, file){
  library('RCurl')
  f = CFILE(file, mode = "wb")
  a = curlPerform(url = url, writedata = f@ref, noprogress = FALSE)
  close(f)
  return(a)
}
## ...and now just give remote and local paths
ret = bdown("http://www.example.com/bfile.zip", "path/to/bfile.zip")
um.. use mode = 'wb' :) run this and follow along with my comments.
# create a temporary file and a temporary directory on your local disk
tf <- tempfile()
td <- tempdir()
# run the download file function, download as binary.. save the result to the temporary file
download.file(
  "http://sourceforge.net/projects/peazip/files/4.8/peazip_portable-4.8.WINDOWS.zip/download",
  tf,
  mode = 'wb'
)
# unzip the files to the temporary directory
files <- unzip( tf , exdir = td )
# here are your files
files
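For what it's worth, the newer curl package (not RCurl) also streams the response straight to disk, so it is another option for very large binary files; a minimal sketch, assuming the same example URL as above:
library(curl)
# curl_download() writes the body to the destination file as it arrives,
# so the whole download never has to fit in memory
curl_download("http://www.example.com/bfile.zip", "bfile.zip", quiet = FALSE)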
