download.file() download corrupt xls - r

I am trying to create a package to download, import and clean data from the Dominican Republic Central Bank web page. I did all the coding in rstudio.cloud and everything worked just fine there, but when I try the functions on my local machine they do not work.
After digging into each function, I realized that the problem is the downloaded file: it is corrupt.
I am including the first steps of a function just to illustrate my issue.
# Packages
library(readxl)
# file url.
url <- paste0("https://cdn.bancentral.gov.do/documents/",
              "estadisticas/precios/documents/",
              "ipc_base_2010.xls?v=1570116997757")
# temporary path
file_path <- tempfile(pattern = "", fileext = ".xls")
# downloading
download.file(url, file_path, quiet = TRUE)
# reading the file
ipc_general <- readxl::read_excel(
  file_path,
  sheet = 1,
  col_names = FALSE,
  skip = 7
)
Error:
filepath: C:\Users\Johan Rosa\AppData\Local\Temp\RtmpQ1rOT3\2a74778a1a64.xls
libxls error: Unable to open file
I am using temporary files here, but that is not the problem; you can download the file into your working directory and the problem persists.
I want to know:
Why does this code work in rstudio.cloud but not locally?
What can I do to get the job done? (alternative approaches, packages, functions)
By the way, I am using Windows 10.
Edit
Answer:
1- rstudio.cloud runs on Linux, but on Windows I need to make an adjustment to the download.file() call.
2- download.file(url, file_path, quiet = TRUE, mode = "wb")
This is what I was looking for.
Now I have a different problem: I have to think of a way to detect whether the function is running on Linux or Windows, to set that argument accordingly.
I can write a new download function using an if/else on the .Platform$OS.type result.
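For instance, a minimal sketch of such a wrapper (download_file_os is just a name I made up; .Platform$OS.type returns "windows" on Windows and "unix" on Unix-alikes):
download_file_os <- function(url, destfile, ...) {
  # binary mode on Windows, default text mode elsewhere (where the two are equivalent)
  mode <- if (.Platform$OS.type == "windows") "wb" else "w"
  download.file(url, destfile, mode = mode, ...)
}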
Or, can I set mode = "wb" for all download.file() calls?
Do you have any recommendations?

From the Documentation of download.file()
The choice of binary transfer (mode = "wb" or "ab") is important on
Windows, since unlike Unix-alikes it does distinguish between text and
binary files and for text transfers changes \n line endings to \r\n
(aka CRLF).
Code written to download binary files must use mode = "wb" (or "ab"),
but the problems incurred by a text transfer will only be seen on
Windows.
From the source of download.file
head(download.file, 12)
function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE,
    extra = getOption("download.file.extra"), headers = NULL,
    ...)
{
    destfile
    method <- if (missing(method))
        getOption("download.file.method", default = "auto")
    else match.arg(method, c("auto", "internal", "wininet", "libcurl",
        "wget", "curl", "lynx"))
    if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rd[as]|RData)$",
        URLdecode(url))))
        mode <- "wb"
So looking at the source: if you do not set mode, the function automatically uses "w", unless the URL ends in .gz, .bz2, .xz, .tgz, .zip, .rda, .rds or .RData, in which case it switches to "wb" (that is why you got the error: .xls is not on that list).
In my humble opinion, on Unix-alikes (e.g. Linux) "w" and "wb" are the same, because they do not differentiate between text and binary files, but Windows does.
So you could set mode = "wb" for all download.file() calls (as long as you never rely on the text-mode line-ending conversion on Windows); this will not affect the function on Linux.
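A quick way to see the difference for yourself (a sketch for a Windows machine, reusing the url from the question): download the same binary file in both modes and compare sizes; text mode rewrites every 0x0A byte as 0x0D 0x0A, which is exactly the corruption above.
tf_text <- tempfile(fileext = ".xls")
tf_bin  <- tempfile(fileext = ".xls")
download.file(url, tf_text, quiet = TRUE, mode = "w")   # text mode: mangles binary data on Windows
download.file(url, tf_bin,  quiet = TRUE, mode = "wb")  # binary mode: byte-for-byte copy
c(text = file.size(tf_text), binary = file.size(tf_bin))  # sizes will usually differ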

Related

How to use UTF-8 folder names in download_html and download.file

I would like to scrape country-specific information from a Chinese webpage into folders that are named according to these countries. Since I am extracting the list of countries from the Chinese page as well, the folder names would contain Chinese characters - which seems to be problematic.
My code is
url <- "https://www.baidu.com/"
path <- file.path("中国", "江苏")
dir.create(path, recursive = TRUE)
download_html(url, file = file.path(path, "baidu.html"))
download.file(url, destfile = file.path(path, "baidu.html"))
The error message of the last line reads
Error in download.file(url, destfile = file.path(path, "baidu.html")) :
cannot open destfile '<U+4E2D><U+56FD>/<U+6C5F><U+82CF>/baidu.html', reason 'Invalid argument'
so it seems that download.file converts Chinese characters internally. Interestingly, file.path has no issues creating folders containing Chinese characters. I am running Windows 10 64 bit and R version 4.0.2.
Is there a way (or alternative function) that accepts Chinese characters or coerces download.file to use the correct encoding? If not, what alternatives do I have? I could think of:
navigating into the folder using setwd (which does work but forces me to use a loop)
converting the Chinese names, for example by using their romanization (which is ambiguous and probably does not exist as an R function)
EDIT:
Perhaps this is part of a bigger issue on my machine. The first line of the following code works (i.e. shows "two" as a result), whereas the second line does not:
stringr::str_replace_all("one", c("one" = "two"))
stringr::str_replace_all("阿富汗", c("阿富汗" = "Afghanistan"))
Instead, the second line produces a warning similar to the error above:
Warning message:
unable to translate '<U+963F><U+5BCC><U+6C57>' to native encoding
However, when I create a string containing Chinese characters, the result seems to be in UTF-8:
string <- "阿富汗"
stringi::stri_enc_isutf8(string)
shows TRUE.
EDIT 2:
On my old laptop running Ubuntu, stringr::str_replace_all() works just fine with Chinese characters.
I think the leading cause of your error in download.file is that the default encoding for Chinese on your Windows system is UTF-16. I ran some trials and met similar errors with download_html, although I could not reproduce your exact error; I think its essence is the same as with download.file. In my case, download_html can only handle a file argument whose Chinese characters are encoded in UTF-8. The code is below:
library(xml2)
url <- 'https://www.baidu.com/'
path <- file.path('中国', '江苏')
dir.create(path, recursive = TRUE)
download_html(url, file = file.path(path, 'baidu.html'))
download.file(url, destfile = file.path(path, "baidu.html"))
This error occurred:
Error in curl::curl_download(url, file, quiet = quiet, mode = mode, handle = handle) :
Failed to open file C:\Users\16071098\Documents\涓浗\鍖椾含\baidu.html.
But when I change the commands as below:
download_html(url, file = enc2utf8(file.path(path, 'baidu.html')))
download.file(url, destfile = enc2utf8(file.path(path, "baidu.html")))
it works.
Your error shows that the Chinese part of your path is encoded as UTF-16.
Whatever the exact cause, I think it comes down to your default encoding being different from the encoding the function accepts.
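If this keeps coming up, here is a minimal sketch of a wrapper that applies the enc2utf8() fix to every call (download_utf8 is just an illustrative name):
download_utf8 <- function(url, destfile, ...) {
  # convert the destination path to UTF-8 before handing it to download.file
  download.file(url, destfile = enc2utf8(destfile), ...)
}
download_utf8(url, file.path(path, "baidu.html"), mode = "wb")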

R not downloading total tar.gz using download.file

I am trying to download a 286 MB tarball (tar.gz) from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_hcn.tar.gz (you have to be registered to try it). I am able to download the whole 286 MB in a browser (Firefox), but when I try to download it in R, I get files of varying sizes and never the full 286 MB. Any reason for this behavior?
Here is my R code.
tb = "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_hcn.tar.gz"
tempfile = paste(path,"/ghcnd_hcn.tar.gz",sep="") # temporary file name
download.file(tb,destfile = tempfile,replace = TRUE) ## download data
I can download the whole file in R without problems, but it is quite slow.
Maybe a time-out issue?
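If it is, a hedged first thing to try is raising R's download timeout (the default is 60 seconds) before the call, and forcing binary mode:
# allow up to 10 minutes for the 286 MB transfer
options(timeout = max(600, getOption("timeout")))
download.file(tb, destfile = tempfile, mode = "wb")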
You can also try different values for the method parameter of download.file():
download.file(tb, destfile = tempfile, method = "curl")
file.size(tempfile)
download.file(tb, destfile = tempfile, method = "wget")
file.size(tempfile)

How to source data from a web-based shared code editor in R

I am interested in directly reading (sourcing) R scripts from a multi-user web-based code-sharing site. I have not found any sites which store code in a format that is accessible to R (which may very well be due to my ignorance). I have tried a workaround in which I download a Google Doc to a .txt file and then source it from my local machine, but there seems to be an encoding issue that I don't understand. I have searched for solutions but have not found anything current.
(1) Does anyone have specific solutions for how to accomplish a direct source()-like operation from an online code editor (e.g. codeshare.io, kobra.io, etc)? To be clear, I want to be able to read in scripts from shared coding sessions and run them on my local machine in one or two keystrokes. I am not interested in github.
(2) If not, can anyone tell me why the following code snippet fails to source, and what I must do to correct the error?
Example...
dl_from_GoogleD <- function(output, key, format) {
  require(RCurl)
  bin <- getBinaryURL(paste0("https://docs.google.com/document/d/", key,
                             "/export?format=", format),
                      ssl.verifypeer = FALSE)
  con <- file(output, open = "wb")
  writeBin(bin, con)
  close(con)
  message(noquote(paste(output, "read into", getwd())))
}
setwd(tempdir())
dl_from_GoogleD(output = "test.txt", key = "11jYc5uvDOWrHmYRXOJtLsycxvojiW4qIN6aVQsJCYQM", format = "txt")
source("test.txt", echo=T)
Error received:
Error in source("test.txt", echo = T) : test.txt:1:2: unexpected input
1: ï»
^
I'm running Windows 7 and RStudio.
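A note on the error: the ï» shown at position 1 looks like the first bytes of a UTF-8 byte-order mark (EF BB BF) displayed in a single-byte encoding, and Google's .txt export prepends one. Assuming the BOM is the culprit, a sketch of a possible fix is to open the file with an encoding that strips it:
source("test.txt", echo = TRUE, encoding = "UTF-8-BOM")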

trying to use fread() on .csv file but getting internal error "ch>eof"

I am getting an error from fread:
Internal error: ch>eof when detecting eol
when trying to read a .csv file downloaded from an https server, using R 3.2.0. I found something related on GitHub, https://github.com/Rdatatable/data.table/blob/master/src/fread.c, but don't know how I could use it, if at all. Thanks for any help.
Added info: the data was downloaded from here:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
then I used
download.file(fileURL, "Idaho2006.csv", method = "internal")
The problem is that download.file doesn't work with https with method = "internal" unless you're on Windows and set an option. Since fread uses download.file when you pass it a URL rather than a local file, it fails too. You have to download the file manually and then open it from a local file.
If you're on Linux, or already have either of the following tools, use method = "wget" or method = "curl" instead.
If you're on Windows and don't have either and don't want to download them, run setInternet2(use = TRUE) before your download.file call.
http://www.inside-r.org/r-doc/utils/setInternet2
For example:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
download.file(fileURL, tempf, method = "curl")
DT <- fread(tempf)
unlink(tempf)
Or
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
setInternet2(use = TRUE)
download.file(fileURL, tempf)
DT <- fread(tempf)
unlink(tempf)
fread() now uses the curl package for downloading files, and this seems to work just fine at the moment:
require(data.table) # v1.9.6+
fread(fileURL, showProgress = FALSE)
The easiest way to fix this problem, in my experience, is to just remove the s from https. Also remove the method argument; you don't need it. My OS is Windows, and I have tried the following code and it works.
fileURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(fileURL, "Idaho2006.csv")

Downloading large files with R/RCurl efficiently

I see that many examples for downloading binary files with RCurl are like such:
library("RCurl")
curl = getCurlHandle()
bfile = getBinaryURL(
  "http://www.example.com/bfile.zip",
  curl = curl,
  progressfunction = function(down, up) { print(down) },
  noprogress = FALSE
)
writeBin(bfile, "bfile.zip")
rm(curl, bfile)
If the download is very large, I suppose it would be better to write it to the storage medium as it arrives, instead of fetching it all into memory.
The RCurl documentation has some examples that fetch files in chunks and manipulate them as they are downloaded, but they all seem to refer to text chunks.
Can you give a working example?
UPDATE
A user suggests using R's native download.file with the mode = 'wb' option for binary files.
In many cases the native function is a viable alternative, but there are a number of use cases where it does not fit (https, cookies, forms, etc.), and this is the reason why RCurl exists.
This is the working example:
library(RCurl)
#
f = CFILE("bfile.zip", mode = "wb")
curlPerform(url = "http://www.example.com/bfile.zip", writedata = f@ref)
close(f)
It downloads straight to file. The returned value is the status of the request (0 if no errors occur), rather than the downloaded data.
The mention of CFILE in the RCurl manual is a bit terse. Hopefully future versions will include more details/examples.
For your convenience, here is the same code packaged as a function (with a progress bar):
bdown = function(url, file) {
  library("RCurl")
  f = CFILE(file, mode = "wb")
  a = curlPerform(url = url, writedata = f@ref, noprogress = FALSE)
  close(f)
  return(a)
}
## ...and now just give remote and local paths
ret = bdown("http://www.example.com/bfile.zip", "path/to/bfile.zip")
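For comparison, the newer curl package also streams straight to disk and handles https, cookies, and so on; a minimal sketch, assuming the package is installed:
library(curl)
# curl_download() writes the response directly to destfile,
# never holding the whole payload in memory
curl_download("http://www.example.com/bfile.zip", "path/to/bfile.zip", mode = "wb")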
um.. use mode = 'wb' :) Run this and follow along with my comments.
# create a temporary file and a temporary directory on your local disk
tf <- tempfile()
td <- tempdir()
# run the download file function, download as binary.. save the result to the temporary file
download.file(
  "http://sourceforge.net/projects/peazip/files/4.8/peazip_portable-4.8.WINDOWS.zip/download",
  tf,
  mode = 'wb'
)
# unzip the files to the temporary directory
files <- unzip( tf , exdir = td )
# here are your files
files
