I am trying to download multiple NetCDF (.nc) format files from multiple URLs in a loop. However, when I try to open the files, they seem to be corrupted.
You will find my code below. I have tried different methods, for instance, using download.file or system.
This is an example of the files I need to download:
http://thredds.met.no/thredds/catalog/metusers/senorge2/seNorge2/provisional_archive/PREC1d/gridded_dataset/201601/catalog.html
But I need to download hundreds of files, since each file represents a day.
Here's my code so far:
year = c("2016","2017")
mon = c("01","02")
day = c("01","02","03","04","05","06","07","08","09","10",
"11","12","13","14","15","16","17","18","19","20",
"21","22","23","24","25","26","27","28","29","30","31")
for (y in year){
  for (m in mon){
    for (d in day){
      download.file(paste("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
                          "PREC1d/gridded_dataset/",y,m,"/seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),
                    destfile=paste("seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),method="curl",mode="wb")
      #try(system(paste("wget ",paste("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
      #                               "PREC1d/gridded_dataset/",y,m,"/seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),sep=""),
      #           intern = TRUE, ignore.stderr = TRUE, wait=TRUE))
    }
  }
}
Any help is appreciated.
Thank you!
Best,
Michel
When I try your code, some files come back with 503 Service Temporarily Unavailable. To retry the download in that case, add --retry-on-http-error=503. Maybe also add --random-wait. I changed the method from curl to wget and removed mode="wb", as the manual says it is "Not used for methods "wget" and "curl"". Hope the following solves your problem.
year = c("2016","2017")
mon = c("01","02")
day = c("01","02","03","04","05","06","07","08","09","10",
"11","12","13","14","15","16","17","18","19","20",
"21","22","23","24","25","26","27","28","29","30","31")
for (y in year){
  for (m in mon){
    for (d in day){
      download.file(paste("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
                          "PREC1d/gridded_dataset/",y,m,"/seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),
                    destfile=paste("seNorge_v2_0_PREC1d_grid_",y,m,d,"_",y,m,d,".nc",sep=""),
                    method="wget",extra="--random-wait --retry-on-http-error=503")
    }
  }
}
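One more thing worth noting (not part of the original answer): the day loop also requests dates that do not exist, such as 20160231, and those failed requests can leave behind files that are not valid NetCDF. A sketch that builds only real dates for the same period and URL pattern:
dates <- seq(as.Date("2016-01-01"), as.Date("2017-02-28"), by = "day")
dates <- format(dates[format(dates, "%m") %in% c("01", "02")], "%Y%m%d")   # keep only Jan/Feb, as in the question
for (d in dates) {
  url  <- paste0("http://thredds.met.no/thredds/fileServer/metusers/senorge2/seNorge2/provisional_archive/",
                 "PREC1d/gridded_dataset/", substr(d, 1, 6),
                 "/seNorge_v2_0_PREC1d_grid_", d, "_", d, ".nc")
  dest <- basename(url)
  download.file(url, destfile = dest, method = "wget",
                extra = "--random-wait --retry-on-http-error=503")
}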
What do you mean when you say that the file is 'corrupted'? How are you trying to read the nc files?
Your code seems to work and I can read the downloaded files. You can use the raster package in R to read the file. Please also ensure you have the ncdf4 package installed.
library(raster)
r = raster('seNorge_v2_0_PREC1d_grid_20160101_20160101.nc')
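To check which of the downloaded files are actually readable, one option (a sketch, not part of the original answer; assumes the ncdf4 package is installed) is to try opening each file and flag the ones that fail:
library(ncdf4)
files <- list.files(pattern = "^seNorge_v2_0_PREC1d_grid_.*\\.nc$")
bad <- character(0)
for (f in files) {
  ok <- tryCatch({ nc <- nc_open(f); nc_close(nc); TRUE },
                 error = function(e) FALSE)
  if (!ok) bad <- c(bad, f)   # files that fail to open and should be re-downloaded
}
bad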
Looking for a bit of support:
I have done some expect_snapshot_file() tests from the testthat package - and the files end up not matching when they are created in different RStudio sessions (I'm guessing that this is related to this).
I opened both files (the original and the .new file) and attempted to compare them using waldo::compare(old, new) - but waldo does not find any differences. Both the snapshot tests and git note that these files are different, but I don't know why or where in the file. Note: it is possible to recreate the difference by setting all this up in a git folder - commit the outcomes.RData file, then rename the outcomes.new.RData file to outcomes.RData, and git will display a difference - but again, because it's a binary file, it won't tell me where...
I cannot use the snapshot_review() function because it does not seem to work with .RData files - it immediately gives me a complicated error.
So I would like to investigate where the differences are. I presume that it might be something related to the R session that I'm working in.
I have provided both files here:
https://transfer.sh/7UwTLD/outcomes.new.RData
https://transfer.sh/xgqhL7/outcomes.RData
and then compare them:
download.file("https://transfer.sh/xgqhL7/outcomes.RData", destfile = "outcomes.RData")
download.file("https://transfer.sh/7UwTLD/outcomes.new.RData", destfile = "outcomes.new.RData")
load("outcomes.RData")
old <- outcomes
load("outcomes.new.RData")
new <- outcomes
waldo::compare(old, new)
#> v No differences
Created on 2022-02-13 by the reprex package (v0.3.0)
The original saving function always stays the same:
save_file <- function(outcomes){
  path <- tempfile(fileext = ".RData")
  save(outcomes, file = path)
  path
}
expect_snapshot_file(save_file(outcomes), name = "outcomes.RData")
Is there anything else that the snapshot machinery does "under the hood"? What other aspects could I consider? Are there any other tools with which I could usefully compare files under Windows?
Many thanks!
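One way to see where two binary files actually differ (a base-R sketch, not from the original thread) is to compare their raw bytes directly:
old_bytes <- readBin("outcomes.RData", what = "raw", n = file.size("outcomes.RData"))
new_bytes <- readBin("outcomes.new.RData", what = "raw", n = file.size("outcomes.new.RData"))
length(old_bytes) == length(new_bytes)                      # are the files even the same size?
n <- min(length(old_bytes), length(new_bytes))
which(old_bytes[seq_len(n)] != new_bytes[seq_len(n)])[1]    # position of the first differing byte
Note that save() compresses the file and writes a header recording the R version used, so byte-level differences can appear even when the stored objects are identical.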
Basically, there are a few Excel reports that are emailed to us every morning. I download them directly to a specified drive and then wrangle them in R.
The thing is that I have to manually open and save each file before running my script in R. If I don't, this happens:
When I manually open and save the files and re-run my script, I get the correct results:
Firstly, do you know why this happens? And secondly, is there a function that will allow me to open and save these files? I did try openxlsx, but I still have to manually press the save button.
Here is the function I created to bring in the files:
library(readxl)
library(stringr)
library(dplyr)

store.FUN = function(x)
{
  # read one report and keep only the store name plus its "Total Income" row
  m = as.data.frame(read_excel(x))
  names(m) = c(1:length(m))
  m[1, 1] = str_sub(m[2, 1], 13)
  m = bind_rows((m)[1, ], subset(m, (m)[1] == "Total Income"))
  m[2, 1] = m[2, 2]
  m = m[-c(2)]
  return(m)
}
district_1.stores = sapply(store.file, store.FUN, simplify=FALSE) %>%
bind_rows(.id = "Store ID")
Thanks!
Edit: So it looks like the cells contain formulas:
But if I do nothing other than save the file and go back to R to run the script, the numbers pull in just fine.
Here is an example of the Excel file (a screenshot was attached to the original post).
I took the time to post the issue on GitHub for openxlsx.
Tl;dr: it's not a bug, it's a built-in "problem" when importing from/exporting to Excel, and it is true for all such packages. The developer suggests exactly what the OP did when Excel sheets contain formulas: open the file in Excel first, save it, and only then import it into R. This doesn't answer the OP's question (which was how to open and save an Excel file automatically through R), but I'm posting it nonetheless because it adds some helpful context.
https://github.com/ycphs/openxlsx/issues/261 and https://github.com/ycphs/openxlsx/issues/188#issuecomment-832591241
Actually just found that I can use the "reticulate" package in R to run a python module for this purpose.
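The post above doesn't say which Python module was used; a minimal sketch of how this could look with reticulate, assuming the xlwings Python package and a local Excel installation are available:
library(reticulate)
xw <- import("xlwings")                # assumed Python dependency; the original post does not name the module
resave_excel <- function(path) {
  app <- xw$App(visible = FALSE)       # start a hidden Excel instance
  book <- app$books$open(path)         # opening makes Excel recalculate the formulas
  book$save()                          # write the calculated values back into the file
  book$close()
  app$quit()
  invisible(path)
}
resave_excel("report.xlsx")            # hypothetical file name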
Thanks for your help everyone!
I have multiple .xls (~100 MB) files from which I would like to load multiple sheets (from each) into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which always run for a very long time (>15 mins) and never finish, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per sheet and cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) the way XLConnect::loadWorkbook can.
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen a recommendation to use the function readxl::read_xls, which seems to be widely recommended for this task and should be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!
UPDATE:
It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up to date versions of R, Rstudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):
install.packages("readxl")
library(readxl)
library(data.table)
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get dt like this (for example, only part of the dataset for the "Area" sheet):
Just as well, you can use the read_xls function instead of read_excel.
I checked, and it also works correctly and is even a little faster, since read_excel is a wrapper over the read_xls and read_xlsx functions from the readxl package.
Also, you can use the excel_sheets function from the readxl package to list all the sheets of your Excel file and then read them in a loop.
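For example, a small sketch that reads every sheet of the workbook into a named list (using the file name from the question):
library(readxl)
sheet_names <- excel_sheets("test_file.xls")    # names of all sheets in the workbook
all_sheets <- lapply(sheet_names, function(s) read_excel("test_file.xls", sheet = s))
names(all_sheets) <- sheet_names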
UPDATE
Benchmarking was done with the microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile and readxl::read_excel.
But XLConnect is a Java-based solution, so it requires a lot of RAM.
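The measured timings are not included above; a sketch of how such a comparison could be set up (sheet name taken from the earlier code):
library(microbenchmark)
microbenchmark(
  gdata     = gdata::read.xls("test_file.xls", sheet = "Overall"),
  XLConnect = XLConnect::readWorksheetFromFile("test_file.xls", sheet = "Overall"),
  readxl    = readxl::read_excel("test_file.xls", sheet = "Overall"),
  times = 5
)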
I found that I was unable to open the file with readxl::read_xls immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xls was able to open it without issue.
My suggested workaround for handling hundreds of files is to build a little C# command-line utility that opens, saves, and closes an Excel file. The source code is below; the utility can be compiled with Visual Studio Community Edition.
using System.IO;
using Excel = Microsoft.Office.Interop.Excel;

namespace resaver
{
    class Program
    {
        static void Main(string[] args)
        {
            string srcFile = Path.GetFullPath(args[0]);
            Excel.Application excelApplication = new Excel.Application();
            excelApplication.Application.DisplayAlerts = false;
            Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
            srcworkBook.Save();
            srcworkBook.Close();
            excelApplication.Quit();
        }
    }
}
Once compiled, the utility can be called from R using e.g. system2().
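For example (hypothetical paths, assuming the compiled binary is named resaver.exe):
resaver <- "C:/tools/resaver.exe"                                            # hypothetical location of the compiled utility
xls_files <- list.files("C:/data", pattern = "\\.xls$", full.names = TRUE)   # hypothetical data folder
for (f in xls_files) {
  system2(resaver, args = shQuote(f))   # open, save and close each workbook via Excel
}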
I will propose a different workflow. If you happen to have LibreOffice installed, then you can convert your Excel files to csv programmatically. I have Linux, so I do it in bash, but something similar should be possible on macOS.
So open a terminal and navigate to the folder with your excel files and run in terminal:
for i in *.xls
do soffice --headless --convert-to csv "$i"
done
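The same conversion can also be driven from R rather than a terminal (a sketch, assuming the soffice binary is on the PATH):
xls_files <- list.files("path/to/files", pattern = "\\.xls$", full.names = TRUE)
for (f in xls_files) {
  # --headless runs LibreOffice without a GUI; the .csv lands in the current working directory (use --outdir to change that)
  system2("soffice", args = c("--headless", "--convert-to", "csv", shQuote(f)))
}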
Now in R you can use data.table::fread to read your files with a loop:
Scenario 1: the structure of files is different
If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:
files <- dir("path/to/files", pattern = ".csv")
all_files <- list()
for (i in 1:length(files)){
fileName <- gsub("(^.*/)(.*)(.csv$)", "\\2", files[i])
all_files[[fileName]] <- fread(files[i])
}
If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:
list2env(all_files, envir = .GlobalEnv)
Please be aware of two things: First, in the gsub call, the direction of the slash. And second, list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.
Scenario 2: the structure of files is the same
In that case it's likely you want to rbind them all together. You could run in R:
files <- dir("path/to/files", pattern = ".csv")
joined <- list()
for (i in 1:length(files)){
joined <- rbindlist(joined, fread(files[i]), fill = TRUE)
}
On my system, I had to use path.expand.
R> file = "~/blah.xls"
R> read_xls(file)
Error:
filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
libxls error: Unable to open file
R> read_xls(path.expand(file)) # fixed
Resaving your file can solve your problem easily.
I also ran into this problem before, but I found the answer in this discussion.
I used read_excel() to open those files.
I was seeing a similar error and wanted to share a short-term solution.
library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBpayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
Yields (on some systems in my classroom but not others):
Error: filepath: MLBPayrolls.xls libxls error: Unable to open file
The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.
This was happening today on Windows 10, with R 3.6.2 and R Studio 1.2.5033.
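A likely explanation (not part of the original answer): on Windows, download.file() can transfer files in text mode unless mode = "wb" is given, which silently corrupts binary formats such as .xls. Downloading in binary mode may avoid the manual browser step:
library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls",
              "MLBPayrolls.xls", mode = "wb")   # "wb" writes the file in binary mode so the .xls is not mangled
MLBpayrolls <- read_excel("MLBPayrolls.xls", sheet = "MLB Payrolls", na = "n/a")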
If you have downloaded the .xls data from the internet, then even when you open it in MS Excel, it first shows a prompt asking you to confirm that you trust the source (see the screenshot in the original post). I am guessing this is also the reason R (read_xls) can't open it, as it is considered unsafe. Save it as an .xlsx file and then use read_xlsx() or read_excel().
Even though this is not a code-based solution, I just changed the file type. For instance, instead of .xls I saved it as .csv or .xlsx, then opened it as a regular file.
This worked for me because, when I opened my .xls file, a message popped up: "The file format and extension of 'file.xls' don't match. The file could be corrupted or unsafe..."
I run an automated script to download 3 .xls files from 3 websites every hour. When I later try to read in the .xls files in R to further work with them, R produces the following error message:
"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"
When I manually open and save the .xls files, this problem no longer appears and everything works normally, but since the total number of files increases by 72 every day, this is not a nice workaround.
The script I use to download and save the files:
library(httr)
setwd("WORKDIRECTION")
orig_wd <- getwd()
FOLDERS <- c("NAME1","NAME2","NAME3") #representing folder names
LINKS <- c("WEBSITE_1", #the urls from which I download
"WEBSITE_2",
"WEBSITE_3")
NO <- length(FOLDERS)
for(i in 1:NO){
  today <- as.character(Sys.Date())
  if (!file.exists(paste(FOLDERS[i],today,sep="/"))){
    dir.create(paste(FOLDERS[i],today,sep="/"))
  }
  setwd(paste(orig_wd,FOLDERS[i],today,sep="/"))
  dat <- GET(LINKS[i])
  bin <- content(dat,"raw")
  now <- as.character(format(Sys.time(),"%X"))
  now <- gsub(":",".",now)
  writeBin(bin,paste(now,".xls",sep=""))
  setwd(orig_wd)
}
I then read in the files with the following script:
require(gdata)
require(XLConnect)
require(xlsReadWrite)
wb = loadWorkbook("FILEPATH")
df = readWorksheet(wb, "Favourite List" , header = FALSE)
Does anybody have experience with this type of error, and knows a solution or workaround?
The problem is partly resolved by using the readxl package, available on CRAN. After installation, files can be read in with:
library(readxl)
read_excel("PathToFile")
The only problem is that the last column is omitted when reading in. If I find a solution for this I'll update the answer.
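For parity with the readWorksheet() call in the question, the sheet name and the missing header can be passed explicitly (a sketch using the sheet name from the original script):
library(readxl)
df <- read_excel("PathToFile", sheet = "Favourite List", col_names = FALSE)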
I'm trying to adopt the Reproducible Research paradigm, but also meet people who prefer looking at Excel rather than text data files halfway, by using Dropbox to host Excel files which I can then access using the xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL - the one you are using just goes to the landing page. I think the actual download URL is different; I managed to get it sort of working using the code below.
I actually don't think you need to use RCurl or the getURL() function, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx")
closeAllConnections()
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
download_file_url <- function(url, outfile, ..., sha1 = NULL)
{
  require(RCurl)
  require(devtools)
  require(repmis)
  require(httr)
  require(digest)
  stopifnot(is.character(url), length(url) == 1)
  filetag <- file(outfile, "wb")
  request <- GET(url)
  stop_for_status(request)
  writeBin(content(request, type = "raw"), filetag)
  close(filetag)
}
This seems to work for producing local versions of binary files - Excel included. Nicer, neater, smarter improvements in this gratefully received.
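For example, using the direct-download URL pattern from the earlier answer (placeholders kept as-is):
download_file_url(
  "https://dl.dropboxusercontent.com/s/{THE SHA-1 KEY}/{THE FILE NAME}",   # placeholder direct-download URL
  outfile = "your.destination.xlsx"
)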