Creating zip file from folders in R - r

I am trying to create a zip file from a folder using R.
The "Rcompression" package is mentioned here:
Creating zip file from folders
But I couldn't find where to download this package for Windows.
Any suggestions? Or are there other functions to create a zip file?

You can create a zip file quite easily with the zip function from the utils package. Say you have a directory testDir and you wish to zip a file (or multiple files) inside the directory,
dir('testDir')
# [1] "cats.csv" "test.csv" "txt.txt"
zip(zipfile = 'testZip', files = 'testDir/test.csv')
# adding: testDir/test.csv (deflated 68%)
The zipped file is saved in the current working directory, unless a different path is specified in the zipfile argument. We can see its size relative to the original unzipped file with
file.info(c('testZip.zip', 'testDir/test.csv'))['size']
# size
# testZip.zip 805
# testDir/test.csv 1493
You can zip the whole directory of files (if no sub-folders) with
files2zip <- dir('testDir', full.names = TRUE)
zip(zipfile = 'testZip', files = files2zip)
# updating: testDir/test.csv (deflated 68%)
# updating: testDir/cats.csv (deflated 27%)
# updating: testDir/txt.txt (stored 0%)
And list its contents (without extracting anything) to view the files,
unzip('testZip.zip', list = TRUE)
# Name Length Date
# 1 testDir/test.csv 1493 2014-05-14 20:54:00
# 2 testDir/cats.csv 116 2014-05-14 20:54:00
# 3 testDir/txt.txt 32 2014-05-08 09:37:00
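If the directory does contain sub-folders, one option is to build the file list recursively before passing it to zip. The following is only a sketch: the folder layout is made up for illustration, and the final zip call is left commented out because it needs an external zip program.

```r
# Hypothetical layout: one file at the top level and one in a sub-folder.
src <- file.path(tempdir(), "testDir")
dir.create(file.path(src, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines("a,b", file.path(src, "test.csv"))
writeLines("c,d", file.path(src, "sub", "more.csv"))

# recursive = TRUE descends into sub-folders and returns files only
files2zip <- list.files(src, recursive = TRUE, full.names = TRUE)
basename(files2zip)

# zip(zipfile = "testZip", files = files2zip)  # needs an external zip program
```

With the default flags ("-r9X"), passing the directory itself as files should also recurse into it.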
Note: From ?zip, regarding the zip argument.
On Windows, the default relies on a zip program (for example that from Rtools) being in the path.
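To find out up front whether such a program is on the path, something like the following may help. It is a minimal sketch that assumes zip is called with its default zip argument, which is Sys.getenv("R_ZIPCMD", "zip"); the variable names are my own.

```r
# utils::zip() uses the program named by R_ZIPCMD, falling back to "zip".
zip_cmd <- Sys.getenv("R_ZIPCMD", "zip")

# Sys.which() returns "" when the program is not found on the PATH.
zip_path <- unname(Sys.which(zip_cmd))
has_zip <- nzchar(zip_path)

if (!has_zip) {
  message("No zip program found; utils::zip() will not be able to create archives.")
}
```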

To avoid (a) problems with relative paths (i.e., the zip file itself containing the full folder structure of the path being zipped) and (b) for loops (well, that one is a matter of style), you may use
my_wd<-getwd() # save your current working directory path
dest_path<-"C:/.../folder_with_files_to_be_zipped"
setwd(dest_path)
files<-list.files(dest_path)
named<-paste0(files,".zip")
mapply(zip,zipfile=named,files=files)
setwd(my_wd) # reset working directory path
Unlike R's built-in unzip function, zip requires a zip program, such as 7-Zip (Windows) or the one that is part of Rtools, to be present on your system path.

For people still looking for this: there is now a "zip" package that does not depend on external executables.
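A minimal sketch of how that might look, assuming the zip package is installed from CRAN. The folder and file names are made up, and the block does nothing but print a message if the package is missing.

```r
out <- file.path(tempdir(), "zipDemo.zip")

if (requireNamespace("zip", quietly = TRUE)) {
  # Made-up input: a folder with a single small file.
  src <- file.path(tempdir(), "zipDemo")
  dir.create(src, showWarnings = FALSE)
  writeLines("a,b", file.path(src, "test.csv"))

  # zipr() stores paths relative to the zipped folder's parent,
  # so the archive holds "zipDemo/test.csv" rather than the full path.
  zip::zipr(out, src)
  print(zip::zip_list(out)$filename)
} else {
  message("Package 'zip' is not installed; install.packages('zip') to try this.")
}
```

Because zipr() keeps only the last path component, this also sidesteps the setwd() juggling needed with utils::zip to avoid full paths inside the archive.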

You can install it from the Omegahat repository:
install.packages('Rcompression', repos = "http://www.omegahat.org/R", type = "source")
On Windows you will need to jump through hoops installing zlib and bzip2 and linking them appropriately.
utils::zip can be used in some cases, but it has a number of issues. One is that on Windows the maximum length of a string you can use at the command prompt is 8191 characters (2047 characters on some versions). If you are zipping a directory whose directory and file names add up to a lot of characters, for example your Firefox profile directory, this will cause problems. I also found that the zip command had to be issued from the directory I was zipping in order to get relative directory names. Rcompression has an altNames argument that handles this.
That being said, I have always had problems getting Rcompression to run on Windows.

It's worth noting that zip() will fail silently if it cannot find a zip program.
zip returns an error code (or exit code) invisibly. That is, it will not print unless you explicitly ask it to.
You can run print(zip(output, input)) to print the exit code, which will be 127 if no zip program is found.
Alternatively, you can do something along the lines of
# exit code 0 for success, all other codes are for failure
# (assign first, then compare; `x <- f() != 0` would store the comparison result instead of the code)
exit_code <- zip(output, input)
if (exit_code != 0) {
  stop("Zipping ", input, " failed with exit code: ", exit_code)
}

You can do it like this:
# Convert all folders to .zip
d <- "C:/Users/Eric/Documents/R/win-library/3.3"
folders <- list.files(d)
for (i in seq_along(folders)) {
  name <- paste0(folders[i], ".zip")
  zip(name, files = file.path(d, folders[i]))
}

Related

Is there a way of reading shapefiles directly into R from an online source?

I am trying to find a way of loading shapefiles (.shp) from an online repository/folder/url directly into my global environment in R, for the purpose of making plots in ggplot2 using geom_sf. In the first instance I'm using my Google Drive to store these files but I'd ideally like to find a solution that works with any folder with a valid url and appropriate access rights.
So far I have tried two options, both of which involve zipping the source folder on Google Drive where the shapefiles are stored and then downloading and unzipping it in some way. I have included reproducible examples using a small test shapefile:
Using utils::download.file() to retrieve the compressed folder and unzipping using either base::system('unzip..') or zip::unzip() (loosely following this thread: Downloading County Shapefile from ONS):
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Download the zipped file/folder
download.file("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing", destfile = "data/test_shp.zip")
# Unzip folder using unzip (fails)
unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Unzip folder using system (also fails)
system("unzip data/test_shp.zip")
If you can't run the above code then FYI the 2 error messages are:
Warning message:
In unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", :
error 1 in extracting from zip file
AND
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data/test_shp.zip or
data/test_shp.zip.zip, and cannot find data/test_shp.zip.ZIP, period.
Worth noting here that I can't even manually unzip this folder outside R so I think there's something going wrong with the download.file() step.
Using the googledrive package:
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Specify googledrive url:
test_shp = drive_get(as_id("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing"))
# Download zipped folder
drive_download(test_shp, path = "data/test_shp.zip")
# Unzip folder
zip::unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Load test.shp
test_shp <- read_sf("data/test_shp/test.shp")
And that works!
...Except it's still a hacky workaround, which requires me to zip, download, unzip and then use a separate function (such as sf::read_sf or st_read) to read in the data into my global environment. And, as it's using the googledrive package it's only going to work for files stored in this system (not OneDrive, DropBox and other urls).
I've also tried sf::read_sf, st_read and fastshp::read.shp directly on the folder url but those approaches all fail as one might expect.
So, my question: is there a workflow for reading shapefiles stored online directly into R or should I stop looking? If there is not, but there is a way of expanding my above solution (2) beyond googledrive, I'd appreciate any tips on that too!
Note: I should also add that I have deliberately ignored any option requiring the rgdal package due to its imminent permanent retirement, so I am looking for options that are at least somewhat future-proof (I understand all packages drop off the map at some point). Thanks in advance!
I ran into a similar problem recently, having to read shapefiles directly from Dropbox into R.
As a result, this solution applies only to Dropbox.
The first thing you will need to do is create a refreshable token for Dropbox using rdrop2, given recent changes from Dropbox that limit the lifetime of a single token to 4 hours. You can follow this SO post.
Once you have set up your refreshable token, identify all the files in your spatial data folder on Dropbox using:
library(rdrop2); library(dplyr); library(stringr)  # drop_dir(), %>% / filter(), str_detect()
shp_files_on_db <- drop_dir("Dropbox path/to your/spatial data/", dtoken = refreshable_token) %>%
  filter(str_detect(name, "adm2"))
My 'spatial data' folder contained two sets of shapefiles, adm1 and adm2. I used the above code to choose only those associated with adm2.
Then create a vector of the names of the shp, csv, shx, dbf, cpg files in the 'spatial data' folder, as follows:
shp_filenames<- shp_files_on_db$name
I choose to read in shapefiles into a temporary directory, avoiding the need to have to store the files on my disk – also useful in a Shiny implementation. I create this temporary directory as follows:
# create a new directory under tempdir
dir.create(dir1 <- file.path(tempdir(), "testdir"))
# If needed later on, you can delete this temporary directory
unlink(dir1, recursive = TRUE)
# And test that it no longer exists
dir.exists(dir1)
Now download the Dropbox files to this temporary directory:
for (i in seq_along(shp_filenames)) {
  drop_download(paste0("Dropbox path/to your/spatial data/", shp_filenames[i]),
                dtoken = refreshable_token,
                local_path = dir1)
}
And finally, read in your shapefile as follows:
# path to the shapefile in the temporary directory
path1_shp <- file.path(dir1, "myfile_adm2.shp")
# reading in the shapefile using the sf package - a recommended replacement for rgdal
library(sf)
shp1a <- st_read(path1_shp)
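For sources other than Dropbox, the same temporary-directory idea can be sketched with base R alone. This assumes the URL is a direct-download link to a genuine zip archive (a sharing or preview page will not work). The function names find_shp and fetch_shp and the example URL are hypothetical.

```r
# Return the paths of all .shp files found under `dir`.
find_shp <- function(dir) {
  list.files(dir, pattern = "\\.shp$", full.names = TRUE,
             recursive = TRUE, ignore.case = TRUE)
}

# Download a zip archive from `url` (assumed to be a direct-download link),
# extract it into a fresh temporary directory, and return the .shp paths.
fetch_shp <- function(url) {
  zip_path <- tempfile(fileext = ".zip")
  out_dir <- tempfile("shp_")
  dir.create(out_dir)
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = out_dir)
  find_shp(out_dir)
}

# Hypothetical usage (the URL is a placeholder, not a real file):
# shp_paths <- fetch_shp("https://example.com/test_shp.zip")
# test_shp <- sf::read_sf(shp_paths[1])
```

Extracting the whole archive keeps the .shx, .dbf and .prj companions next to the .shp file, which the reader needs.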

Can't unzip file in R

I downloaded some data from a URL, but I'm not able to unzip any of the downloaded files and I can't understand why. The code for downloading the data follows.
library(downloader)
path <- getwd()
for (i in 1:15) {
  fileName <- sprintf("%02d", i)
  # check for the same name the file is saved under ("R01.zip", ...)
  if (!file.exists(paste0("R", fileName, ".zip"))) {
    urlFile <- paste0("http://www.censo2017.cl/wp-content/uploads/2016/12/R",
                      fileName, ".zip")
    download(urlFile, dest = paste0("./R", fileName, ".zip"), mode = "wb")
  }
}
Then I have 15 zip files named:
R01.zip
R02.zip
... and so on, but when I use
unzip("R01.zip")
or try to unzip any other file, I get the following warning:
In unzip("R01.zip") : error 1 in extracting from zip file
I've read related StackOverflow posts such as this one or this one, but no solution works in my case.
I can unzip the files manually, but I would like to do it directly within RStudio. Any ideas?
PS: The .zip files contain geographical data, by the way, that is, .dbf, .prj, .shp files, etc.
Thanks!
They're not zip files; they are RAR archives:
$ unrar v 01.zip
UNRAR 5.00 beta 8 freeware Copyright (c) 1993-2013 Alexander Roshal
Archive: 01.zip
Details: RAR 4
Attributes Size Packed Ratio Date Time Checksum Name
----------- --------- -------- ----- -------- ----- -------- ----
..A.... 1213 240 19% 23-11-16 16:12 C6C40C6D R01/Comuna.dbf
..A.... 151 138 91% 23-11-16 16:12 A3C83CE4 R01/Comuna.prj
..A.... 212 165 77% 23-11-16 16:12 01752C2A R01/Comuna.sbn
..A.... 132 101 76% 23-11-16 16:12 C4CA93A2 R01/Comuna.sbx
I don't know if there's an R function for extracting RAR archives.
They probably shouldn't have .zip file extensions, but .rar instead. I've extracted the above using unrar on the command line.
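One way to check this from within R, without relying on the file extension, is to look at the first few bytes of the file. The sketch below uses the well-known signatures ("PK\003\004" for a zip with at least one entry, "Rar!" for RAR); the function name archive_type is my own, and anything else is reported as unknown.

```r
# Guess the archive type from the first four bytes of a file.
archive_type <- function(path) {
  magic <- readBin(path, what = "raw", n = 4)
  if (identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))) {
    "zip"      # "PK\003\004": local file header of a zip archive
  } else if (identical(magic, charToRaw("Rar!"))) {
    "rar"      # "Rar!": start of the RAR signature
  } else {
    "unknown"  # e.g. an HTML page saved under a .zip name
  }
}

# Hypothetical usage:
# archive_type("R01.zip")
```

The same check can help diagnose a download that saved a web page instead of the archive.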
OK, so based on this post I was able to work out a solution.
Since the files were not actually .zip files and since 7-zip supported the extraction of the files manually, I looked for a way of calling 7-zip within R. The link I just posted shows how to do that.
I modified my code, now the files are downloaded and unzipped automatically.
# load necessary packages
library(downloader)
library(installr)
install.7zip(page_with_download_url = "http://www.7-zip.org/download.html")
# download and extract the data
# the files correspond to the fifteen administrative regions of Chile, in order
path <- getwd()
for (i in 1:15) {
  fileName <- sprintf("%02d", i)  # add a leading zero to one-digit indices
  # download only if the file is not already downloaded
  if (!file.exists(paste0("R", fileName, ".zip"))) {
    urlFile <- paste0("http://www.censo2017.cl/wp-content/uploads/2016/12/R",
                      fileName, ".zip")  # URL of the file
    download(urlFile, dest = paste0("./R", fileName, ".zip"), mode = "wb")
  }
  # extract only if the file is not already extracted
  if (!file.exists(paste0("R", fileName))) {
    z7path <- shQuote('C:\\Program Files (x86)\\7-Zip\\7z')
    file <- paste0(getwd(), "/", "R", fileName, ".zip")
    cmd <- paste0(z7path, ' e ', file, ' -y -o', paste0(getwd(), "/R", fileName),
                  '/')
    shell(cmd)
  }
}
It would be awesome if someone can tell me if this solution works for you too!

R: possible truncation of >= 4GB file

I have a 370MB zip file and the content is a 4.2GB csv file.
I did:
unzip("year2015.zip", exdir = "csv_folder")
And I got this message:
1: In unzip("year2015.zip", exdir = "csv_folder") :
possible truncation of >= 4GB file
Have you experienced that before? How did you solve it?
I agree with @Sixiang.Hu's answer: R's unzip() won't work reliably with files greater than 4GB.
As for "how did you solve it?": I've tried a few different tricks, and in my experience the result of anything using R's built-ins is (almost) invariably an incorrect identification of the end-of-file (EOF) marker before the actual end of the file.
I deal with this issue in a set of files I process on a nightly basis, and to handle it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip. This is basically what you're doing with system(unzip()), but it gives you a bit more flexibility in its behavior and allows you to check for errors more systematically.
decompress_file <- function(directory, file, .file_cache = FALSE) {
if (.file_cache == TRUE) {
print("decompression skipped")
} else {
# Set working directory for decompression
# simplifies unzip directory location behavior
wd <- getwd()
setwd(directory)
# Run decompression
decompression <-
system2("unzip",
args = c("-o", # include override flag
file),
stdout = TRUE)
# uncomment to delete archive once decompressed
# file.remove(file)
# Reset working directory
setwd(wd); rm(wd)
# Test for success criteria
# change the search depending on
# your implementation
if (grepl("Warning message", tail(decompression, 1))) {
print(decompression)
}
}
}
Notes:
The function does a few things, which I like and recommend:
- uses system2 over system because the documentation says "system2 is a more portable and flexible interface than system"
- separates the directory and file arguments, and moves the working directory to the directory argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
  - it's not pure, but resetting the working directory is a nice step toward the function having fewer side effects
  - you can technically do it without this, but in my experience it's easier to make the function more verbose than have to deal with generating filepaths and remembering unzip CLI flags
- I set it to use the -o flag to automatically overwrite when rerun, but you could supply any number of arguments
- includes a .file_cache argument which allows you to skip decompression
  - this comes in handy if you're testing a process which runs on the decompressed file, since 4GB+ files tend to take some time to decompress
- commented out in this instance, but if you know you don't need the archive after decompressing, you can remove it inline
- the system2 command redirects the stdout to decompression, a character vector
- an if + grepl check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression
Checking ?unzip, found the following comment in Note:
It does have some support for bzip2 compression and > 2GB zip files
(but not >= 4GB files pre-compression contained in a zip file: like
many builds of unzip it may truncate these, in R's case with a warning
if possible).
You can try to unzip it outside of R (using 7-Zip for example).
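If an external unzip program is installed, utils::unzip() can also be told to use it through its unzip argument instead of the internal code. The helper below is a sketch with a made-up name (pick_unzip_method). Whether the external program copes with 4GB+ entries depends on how it was built, so treat that as an assumption to verify.

```r
# Prefer an external unzip program when one is on the PATH,
# otherwise fall back to R's internal implementation.
pick_unzip_method <- function() {
  external <- unname(Sys.which("unzip"))
  if (nzchar(external)) external else "internal"
}

# Hypothetical usage with the file from the question:
# unzip("year2015.zip", exdir = "csv_folder", unzip = pick_unzip_method())
```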
To add to the list of possible solutions, in case you have Java (JDK) available on your machine, you can wrap jar xf into an R function similar to utils::unzip() in interface, a very simple example:
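unzipLarge <- function(zipfile, exdir = getwd()) {
  # resolve the archive path before changing directory, so that a
  # relative zipfile is still found after setwd(exdir)
  zipfile <- normalizePath(zipfile, mustWork = TRUE)
  oldWd <- getwd()
  on.exit(setwd(oldWd))
  setwd(exdir)
  system2("jar", args = c("xf", shQuote(zipfile)))
}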
And then use:
unzipLarge("year2015.zip", exdir = "csv_folder")

Check existence of file in archive (zip)

I'm using unz to extract data from a file within an archive. This actually works pretty well but unfortunately I've a lot of zip files and need to check the existence of a specific file within the archive. I could not manage to get a working solution with if exists or else.
Has anyone an idea how to perform a check if a file exists in an archive without extracting the whole archive before?
Example:
read.table(unz("D:/Data/Test.zip", "data.csv"), sep = ";")[-1,]
This works pretty well if data.csv exists but gives an error if the file is not available in the archive Test.zip.
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot locate file 'data.csv' in zip file 'D:/Data/Test.zip'
Any comments are welcome!
You could use unzip(file, list = TRUE)$Name to get the names of the files in the zip without having to unzip it. Then you can check to see if the files you need are in the list.
## character vector of all file names in the zip
fileNames <- unzip("D:/Data/Test.zip", list = TRUE)$Name
## check if any of those are 'data.csv' (or others)
check <- basename(fileNames) %in% "data.csv"
## extract only the matching files
if(any(check)) {
unzip("D:/Data/Test.zip", files = fileNames[check], junkpaths = TRUE)
}
You could probably put another if() statement to run unz() in cases where there is only one matched file name, since it's faster than running unzip() on a single file.
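That idea could be wrapped up roughly as follows. This is only a sketch: the function name read_zip_entry is made up, it matches the entry by its full name inside the archive, and it returns NULL both when the entry is missing and when the archive itself cannot be opened.

```r
# Read one file from a zip archive via unz(), but only if it is listed
# in the archive; otherwise return NULL instead of raising an error.
read_zip_entry <- function(zipfile, entry, ...) {
  entries <- tryCatch(
    unzip(zipfile, list = TRUE)$Name,
    error = function(e) character(0)  # archive missing or unreadable
  )
  if (!entry %in% entries) {
    return(NULL)
  }
  read.table(unz(zipfile, entry), ...)
}

# Hypothetical usage with the paths from the question:
# dat <- read_zip_entry("D:/Data/Test.zip", "data.csv", sep = ";")
# if (is.null(dat)) message("data.csv not found in archive")
```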

Package compilation and relative path

I must be very confused. Have looked around but cannot find a suitable answer and have a feeling I am doing something wrong.
Here is a minimalist example:
My function test imports a file from a folder and does subsequent analysis on that file. I have dozens of compressed files in the folder specified by path = "inst/extdata/input_data"
test = structure(function(path,letter) {
file = paste0(path, "/file_",letter,".tsv.gz")
data = read.csv(file,sep="\t",header=F,quote="\"",stringsAsFactors=F)
return(mean(data$var1))
}, ex = function(){
path = "inst/extdata/input_data"
m1 = test(path,"A")
})
I am building a package with the function in the folder R/ of the package directory.
When I set the working directory to the package parent and run the example line by line, everything goes fine. However when I check the package with R CMD check it gives me the following:
cannot open file 'inst/extdata/input_data/file_A.tsv.gz': No such file or directory
Error in file(file, "rt") : cannot open the connection
I thought that when checking and building the package, the working directory is automatically set to the parent directory of the package (which in my case is "C:/Users/yuhu/R/Projects/ABCDpackage"), but that seems not to be the case.
What is the best practice in this case? I would avoid converting all data in .rda format and put it in the data folder as there are too many files. Is there a way to compile the package and set in the function example the relative working directory where the package is located? This would be helpful also when the package is distributed (therefore it should not be my own path)
Many thanks for your help.
When R CMD check (or the user later for that matter) runs the example, you need to provide the full path to the file! You can build that path easily with the system.file or the path.package command.
If your package is called foo, the following should do the trick:
}, ex = function(){
path = paste0(system.file(package = "foo"), "/extdata/input_data")
m1 = test(path,"A")
})
You might want to add a file.path command somewhere to be OS independent.
Since read.csv is just a wrapper for read.table, I would not expect any fundamental difference w.r.t. reading compressed files.
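A quick way to check that expectation is to write a small gzip-compressed file and read it with both functions. The data below are made up, and the file name merely mirrors the one in the question.

```r
# Write a small gzip-compressed, tab-separated file (made-up data).
gz_path <- file.path(tempdir(), "file_A.tsv.gz")
con <- gzfile(gz_path, "wt")
writeLines(c("1\tx", "2\ty", "3\tz"), con)
close(con)

# Both readers are given the path directly; the compression is
# detected when the file is opened for reading.
via_csv <- read.csv(gz_path, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
via_table <- read.table(gz_path, sep = "\t", header = FALSE, stringsAsFactors = FALSE)

# Compare the two results
all.equal(via_csv, via_table)
```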
Comment: R removes the "inst/" part of the directory when it builds the system directory. This thread has a discussion on the inst directory
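Putting those two remarks together, system.file() can assemble the path itself from its components, which avoids pasting separators by hand. A small sketch, where "foo" is the placeholder package name from above and the second call uses a package that ships with R just to show a successful lookup:

```r
# system.file() builds the path inside an installed package; the leading
# "inst/" is dropped, so the files live under "extdata/input_data".
path <- system.file("extdata", "input_data", package = "foo")

# An empty string means the package or the directory was not found.
if (!nzchar(path)) message("extdata/input_data not found in package 'foo'")

# The same kind of call against a package that ships with R:
desc <- system.file("DESCRIPTION", package = "utils")
file.exists(desc)
```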
I think you might just want to go with read.table... At any rate give this a try.
fopen <- file(paste0(path,"/file_",letter,".tsv.gz"),open="rt")
data <- read.table(fopen,sep="\t",header=F,quote="\"",stringsAsFactors=F)
Refinement:
At the end of the day I think your problem is mainly because you are using read.csv instead of read.table which can open up .gz zipped files directly. So just to be sure. Here is a little experiment I did.
Experiment:
# zip up a .csv file (in this case example_A.csv) that exists in my working directory into .gz format
system("gzip example_A.csv")
# just wanted to pass the path as a variable like you did
path <- getwd()
file <- paste0(path, "/example_", "A", ".csv.gz")
data <- read.table(file, sep=",", header=FALSE, stringsAsFactors=FALSE) # I think
# these are the only options you need.
# stringsAsFactors=FALSE is a good one.
data <- data[1:5,1:7] # a subset of the data
V1 V2 V3 V4 V5 V6 V7
1 id Scenario Region Fuel X2005 X2010 X2015
2 1 BSE9VOG4 R1 Biomass 0 2.2986 0.8306
3 2 BSE9VOG4 R1 Coal 7.4339 13.3548 9.2918
4 3 BSE9VOG4 R1 Gas 1.9918 2.4623 2.5558
5 4 BSE9VOG4 R1 LFG 0.2111 0.2111 0.2111
At the end of the day (I say that too much) you can be certain that the problem is in either the method you used to read the zipped up files or the text string you've constructed for the file names (haven't looked into the latter). At any rate best of luck with the package. I hope it turns tides.

Resources