Download, unzip, and load Excel file in R using tempfiles only - r

I am trying and failing to write a process that will download a .zip archive, extract a particular Excel file from that archive, and load that Excel file into my R workspace without ever writing any of those files (the .zip or the .xls) to my hard drive.
I have written a version of this process that works for zipped .csvs, but it doesn't work for .xls. Here's how that version goes, using one of the URLs I'm targeting in my current project and using readWorksheetFromFile() instead of read.csv() at the appropriate moment:
library(XLConnect)
waed.old.link <- "http://eventdata.parusanalytics.com/data.dir/pitf.world.19950101-20121231.xls.zip"
waed.old.file <- "pitf.world.19950101-20121231.xls"
tmp <- tempfile()
download.file(waed.old.link, tmp)
tmp2 <- tempfile()
tmp2 <- unz(tmp, waed.old.file)
WAED.old <- readWorksheetFromFile(tmp2, sheet = 1, startRow = 3, startCol = 1, endCol = 73)
unlink(tmp)
unlink(tmp2)
And here's what pops up after line 8, the one that tries to ingest the spreadsheet as WAED.old:
Error in path.expand(filename) : invalid 'path' argument
I also tried read_excel() at that step and got the same result:
> WAED.old <- read_excel(tmp2, skip = 2)
Error in file.exists(path) : invalid 'file' argument
I gather that this has something to do with pointing readWorksheetFromFile() at a connection rather than a file, but I'm not sure that's right, and I don't know how to fix it if it is. I searched Stack Overflow and the web for an answer but couldn't find one that was right on point. I'd really appreciate some help.

As you say, it is because unz returns a connection object for the file within the zip (but does not explicitly unzip that file), while readWorksheetFromFile expects a path to a file.
Use unzip to explicitly unzip the file.
tmp2 <- unzip(zipfile=tmp, files = waed.old.file, exdir=tempdir())
# readWorksheetFromFile(tmp2, ...)
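To make the distinction concrete, here is a minimal sketch. The helper name read_zipped and the placeholder file names are made up for illustration and are not part of any package. It shows that unz() hands back a connection, and wraps the unzip() approach so the extracted copy lives in a throwaway folder and is removed afterwards.

```r
# unz() describes a file inside an archive and returns a connection, not a path.
# The archive and member names here are placeholders; nothing is opened yet.
con <- unz("archive.zip", "inner.xls")
is_connection <- inherits(con, "connection")   # what unz() hands back
is_path <- is.character(con)                   # what path-based readers need
close(con)

# Hypothetical helper: extract one member into a throwaway folder, pass its
# path to a path-based reader, then delete the extracted copy again.
read_zipped <- function(zip_path, member, reader, ...) {
  out_dir <- tempfile("unzipped_")
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  path <- utils::unzip(zip_path, files = member, exdir = out_dir)
  reader(path, ...)
}
```

With the question's variables, a call might look like read_zipped(tmp, waed.old.file, XLConnect::readWorksheetFromFile, sheet = 1, startRow = 3, startCol = 1, endCol = 73). Note that the file is still written to the session's temporary directory for a moment, which path-based readers such as readWorksheetFromFile() require.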

Related

Unzip failing due to long name in zipped folder

I want to be able to read and edit spatial SQLite tables that are downloaded from a server. These come compressed.
These zip files contain a folder whose name encodes information about the model that has been run, so the name can sometimes be quite long.
When this folder name gets too long, unzipping the folder fails. I ultimately don't need to unzip the file, but I seem to get the same error when I use unz within readOGR.
I can't think of how to create a reproducible example, but I can give an example of a path that works and one that doesn't.
Works:
"S:\3_Projects\CRC00001\4699-12103\scenario_initialised model\performance_assessment.sqlite"
4699-12103 is the zip file name
and "scenario_initialised model" is the offending subfolder
Fails:
"S:\3_Projects\CRC00001\4699-12129\scenario_tree_canopy_7, number_of_trees_0, roads_False, compliance_75, year_2030, nrz_cover_0.6, green_roofs_0\performance_assessment.sqlite"
4699-12129 is the zip file name
and "scenario_tree_canopy_7, number_of_trees_0, roads_False, compliance_75, year_2030, nrz_cover_0.6, green_roofs_0" is the offending subfolder
The code would work in a similar fashion to this.
list_zips <- list.files(pattern = "\\.zip$", recursive = TRUE, include.dirs = TRUE)
# for one entry i of list_zips: extract into a folder named after the zip file
zip_path <- file.path(getwd(), list_zips[i])
unzip(zipfile = zip_path, exdir = tools::file_path_sans_ext(zip_path))
But I would prefer to directly be able to load the spatial file in without unzipping. Such as:
sq_path <- unzip(list_zips[i], list=TRUE)[2,1]
temp <- unz(paste(getwd(),"/",list_zips[i],sep = ""),sq_path)
vectorImport <- readOGR(dsn=temp, layer="micro_climate_grid")
Any help would be appreciated! Tim
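One thing that may be worth trying, assuming the failure comes from the length of the extracted path rather than from the archive itself: unzip() has a junkpaths argument that drops the folder names stored inside the zip, so a single member can be extracted into a short temporary folder. The helper name extract_flat below is hypothetical, and this is a sketch rather than a confirmed fix for this case.

```r
# Hypothetical helper: extract one archive member into a short temp folder,
# dropping the (possibly very long) folder names stored inside the zip.
extract_flat <- function(zip_path, member, exdir = tempfile("unz_")) {
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  utils::unzip(zip_path, files = member, exdir = exdir, junkpaths = TRUE)
  # with junkpaths = TRUE the file lands directly in exdir
  file.path(exdir, basename(member))
}
```

The returned path could then be passed as the dsn, for example readOGR(dsn = extract_flat(list_zips[i], sq_path), layer = "micro_climate_grid"), since readOGR expects a data source name rather than a connection.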

Error comes while importing files by data.table

I'm new to RStudio and was not well aware of this site's terms and conditions, so I was blocked from asking questions for 5 days.
I have code for importing multiple files from a directory into R.
The problem is that this code sometimes runs and sometimes fails with the error below.
I have tried to find a solution but have not found one yet.
library(data.table)
t = setwd("/home/dp/vishan/olp_data/19164/1/")
files <- file.info(list.files(path = t,pattern = "", full.names=TRUE))
files = rownames(files)[files$size > 0]
temp <- lapply(files, fread, sep=",")
Error:
Error in FUN(X[[i]], ...) :
'input' can not be a directory name, but must be a single character string containing a file name, a command, full path to a file, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or the input data itself.
Thanks in advance!
try using
files <- file.info(list.files(path = t,pattern = "", full.names=TRUE))
files <- subset(files, !isdir & size > 0)
temp <- lapply(rownames(files), fread, sep=',')
since list.files also shows directories. The data.frame you create in files can be easily subset on the isdir column which indicates if this is a directory or a file.
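A small self-contained sketch of that filter, using a throwaway folder (the folder and file names are invented for the demonstration). Separately, note that setwd() returns the previous working directory, so t in the question may not point at the folder that was intended; passing the path string directly avoids that.

```r
# Build a throwaway folder with a sub-directory, an empty file and a real file.
demo_dir <- tempfile("fread_demo_")
dir.create(file.path(demo_dir, "subdir"), recursive = TRUE)
file.create(file.path(demo_dir, "empty.csv"))
writeLines(c("a,b", "1,2"), file.path(demo_dir, "data.csv"))

# file.info() reports one row per path, including an isdir column.
info <- file.info(list.files(demo_dir, full.names = TRUE))

# Keep only regular, non-empty files; directories are dropped via isdir.
keep <- rownames(info)[!info$isdir & info$size > 0]
basename(keep)
```

Only data.csv survives the filter here, so a reader such as fread is never handed a directory name.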

R download.file() rename the downloaded file, if the filename already exists

In R, I am trying to download files from the internet using download.file() in a simple script (I am a complete newbie). The files download properly. However, if a file already exists in the download destination, I would like to rename the downloaded file with an increment instead of overwriting it, which seems to be the default behaviour.
nse.url = "https://www1.nseindia.com/content/historical/DERIVATIVES/2016/FEB/fo04FEB2016bhav.csv.zip"
nse.folder = "D:/R/Download files from Internet/"
nse.destfile = paste0(nse.folder,"fo04FEB2016bhav.csv.zip")
download.file(nse.url,nse.destfile,mode = "wb",method = "libcurl")
Question for this specific code: if "fo04FEB2016bhav.csv.zip" already exists, how can I get, say, "fo04FEB2016bhav.csv(2).zip" instead?
A general answer to the problem (and not just for the code above) would be appreciated, as this could come up in other situations too.
The function below will automatically assign the filename based on the file being downloaded. It will check the folder you are downloading to for the presence of a similarly named file. If it finds a match, it will add an incrementation and download to the new filename.
ekstroem's suggestion to fiddle with the curl settings is probably a much better approach, but I wasn't clever enough to figure out how to make that work.
download_without_overwrite <- function(url, folder)
{
filename <- basename(url)
base <- tools::file_path_sans_ext(filename)
ext <- tools::file_ext(filename)
file_exists <- grepl(base, list.files(folder), fixed = TRUE)
if (any(file_exists))
{
filename <- paste0(base, " (", sum(file_exists), ")", ".", ext)
}
download.file(url, file.path(folder, filename), mode = "wb", method = "libcurl")
}
download_without_overwrite(
url = "https://raw.githubusercontent.com/nutterb/redcapAPI/master/README.md",
folder = "[path_to_folder]")
Try this:
nse.url = "https://www1.nseindia.com/content/historical/DERIVATIVES/2016/FEB/fo04FEB2016bhav.csv.zip"
nse.folder = "D:/R/Download files from Internet/"
# Get file name from url, with file extension
fname.x <- gsub(".*/(.*)", "\\1", nse.url)
# Get file name from url, without file extension
fname <- gsub("(.*)\\.csv.*", "\\1", fname.x)
# Get extension of file from url
xt <- gsub(".*(\\.csv.*)", "\\1", fname.x)
# How many times does the file exist in folder
exist.times <- sum(grepl(fname, list.files(path = nse.folder)))
if(exist.times){
# if it does increment by 1
fname.x <- paste0(fname, "(", exist.times + 1, ")", xt)
}
nse.destfile = paste0(nse.folder, fname.x)
download.file(nse.url, nse.destfile, mode = "wb",method = "libcurl")
Issues
This approach will not work in cases where part of the file name already exists for example you have url/test.csv.zip and in the folder you have a file testABC1234blahblah.csv.zip. It will think the file already exists, so it will save it as test(2).csv.zip.
You will need to change the part of the code that counts how many times the file exists in the folder accordingly.
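One possible way around the partial-match issue, sketched under the assumption that everything after the first dot counts as the extension: test candidate names with file.exists() instead of pattern matching. The function name next_free_name is hypothetical.

```r
# Hypothetical helper: return the first file name that is not yet taken in
# `folder`, checking exact names rather than partial pattern matches.
next_free_name <- function(folder, filename) {
  # split on the first dot so that "x.csv.zip" becomes "x" + ".csv.zip"
  stem <- sub("\\..*$", "", filename)
  ext  <- substring(filename, nchar(stem) + 1)
  candidate <- filename
  n <- 1
  while (file.exists(file.path(folder, candidate))) {
    n <- n + 1
    candidate <- paste0(stem, "(", n, ")", ext)
  }
  candidate
}

# Possible use with the question's variables (not run here):
# download.file(nse.url,
#               file.path(nse.folder, next_free_name(nse.folder, basename(nse.url))),
#               mode = "wb")
```

Because each candidate is checked by its exact name, a file such as testABC1234blahblah.csv.zip no longer causes test.csv.zip to be renamed.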
This is not a proper answer and shouldn't be considered as such, but the comment section above was too small to write it all.
I thought the -O -n options to curl could be used too, but when I looked at it more closely it turned out that this wasn't implemented yet. wget, on the other hand, automatically increments the filename when downloading a file that already exists. However, setting method="wget" doesn't work with download.file because you are forced to set the destination file name, and once you do that you override the automatic file increments.
I like the solution that @Benjamin provided. Alternatively, you can use
system(paste0("wget ", nse.url))
to get the file through the system (provided that you have wget installed) and let wget handle the increment.

Reading in Excel (downloaded with automated script) produces error when not manually opened and saved first

I run an automated script to download 3 .xls files from 3 websites every hour. When I later try to read in the .xls files in R to further work with them, R produces the following error message:
"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"
When I manually open and save the .xls files, the problem no longer appears and everything works normally, but since the total number of files increases by 72 every day, this is not a practical workaround.
The script I use to download and save the files:
library(httr)
setwd("WORKDIRECTION")
orig_wd <- getwd()
FOLDERS <- c("NAME1","NAME2","NAME3") #representing folder names
LINKS <- c("WEBSITE_1", #the urls from which I download
"WEBSITE_2",
"WEBSITE_3")
NO <- length(FOLDERS)
for(i in 1:NO){
today <- as.character(Sys.Date())
if (!file.exists(paste(FOLDERS[i],today,sep="/"))){
dir.create(paste(FOLDERS[i],today,sep="/"))
}
setwd(paste(orig_wd,FOLDERS[i],today,sep="/"))
dat<-GET(LINKS[i])
bin <- content(dat,"raw")
now <- as.character(format(Sys.time(),"%X"))
now <- gsub(":",".",now)
writeBin(bin,paste(now,".xls",sep=""))
setwd(orig_wd)
}
I then read in the files with the following script:
require(gdata)
require(XLConnect)
require(xlsReadWrite)
wb = loadWorkbook("FILEPATH")
df = readWorksheet(wb, "Favourite List" , header = FALSE)
Does anybody have experience with this type of error, and knows a solution or workaround?
The problem is partly resolved by using the readxl package, available on CRAN. After installation, files can be read in with:
library(readxl)
read_excel("PathToFile")
The only problem is that the last column is omitted while reading in. If I find a solution for this I'll update the answer.

Using R to download zipped data file, extract, and import data

@EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
Create a temp. file name (eg tempfile())
Use download.file() to fetch the file into the temp. file
Use unz() to extract the target file from temp. file
Remove the temp file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
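To illustrate the point about single-file compression, here is a small round trip with a gzipped CSV. The data frame is made up for the example; gzfile() is part of base R.

```r
# Write a small data frame to a gzip-compressed CSV through a connection.
gz_path <- tempfile(fileext = ".csv.gz")
out_con <- gzfile(gz_path, "w")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), out_con, row.names = FALSE)
close(out_con)

# Read it straight back from a gzfile() connection: no separate extraction
# step is needed, because the .gz file holds exactly one compressed file.
dat <- read.csv(gzfile(gz_path))
unlink(gz_path)
```

A zip archive, by contrast, can hold many files, which is why the member name has to be given to unz() or unzip().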
Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html . Much easier.
library(downloader)
# url is the address of the zip file to fetch
download(url, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = "./")
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the command-line tool funzip, in conjunction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
library(readxl)
url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()    # will hold the downloaded zip
temp2 <- tempfile()   # will become the extraction directory
download.file(url, temp, mode = "wb")
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(temp)
unlink(temp2, recursive = TRUE)   # temp2 is a directory
To do this using data.table, I found that the following works. Unfortunately, the link from the question no longer works, so I used a link to another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
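# extract into the session's temporary directory rather than the working directory
timeUse <- fread(unzip(temp, files = "atusact_0315.dat", exdir = tempdir()))
unlink(temp)  # delete the downloaded zip (rm() would only remove the variable)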
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.
Using library(archive), one can also read a particular csv file from within the archive without having to unzip it first, which I find more convenient and faster:
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
It also supports all major archive formats & is quite a bit faster than the base R untar or unz - it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz & uuencoded files.
To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms & given the superior performance for me would be the preferred option.
Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")
The rio package would be very suitable for this: its import() function uses the file extension to determine what kind of file it is, so it works with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
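# delete the downloaded zip and the extracted files (leave tempdir() itself in place)
unlink(c(tf, file.path(td, file_names$Name)), recursive = TRUE)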
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)
