Can't unzip file in R

I downloaded some data from a certain URL but I'm not able to unzip any of the downloaded files and I can't understand why. The code for downloading the data follows.
library(downloader)
path <- getwd()
for (i in 1:15) {
  fileName <- sprintf("%02d", i)
  if (!file.exists(paste0("R", fileName, ".zip"))) {
    urlFile <- paste0("http://www.censo2017.cl/wp-content/uploads/2016/12/R",
                      fileName, ".zip")
    download(urlFile, dest = paste0("./R", fileName, ".zip"), mode = "wb")
  }
}
Then I have 15 zip files named:
R01.zip
R02.zip
... and so on, but when I use
unzip("R01.zip")
or try to unzip any other file, I get the following warning:
Warning message:
In unzip("R01.zip") : error 1 in extracting from zip file
I've read related StackOverflow posts such as this one or this one, but no solution works in my case.
I can unzip the files manually, but I would like to do it directly within RStudio. Any ideas?
PS: The .zip files contain geographical data, by the way, that is, .dbf, .prj, .shp files, etc.
Thanks!

They're not zip files; they are RAR archives:
$ unrar v 01.zip
UNRAR 5.00 beta 8 freeware Copyright (c) 1993-2013 Alexander Roshal
Archive: 01.zip
Details: RAR 4
Attributes Size Packed Ratio Date Time Checksum Name
----------- --------- -------- ----- -------- ----- -------- ----
..A.... 1213 240 19% 23-11-16 16:12 C6C40C6D R01/Comuna.dbf
..A.... 151 138 91% 23-11-16 16:12 A3C83CE4 R01/Comuna.prj
..A.... 212 165 77% 23-11-16 16:12 01752C2A R01/Comuna.sbn
..A.... 132 101 76% 23-11-16 16:12 C4CA93A2 R01/Comuna.sbx
I don't know if there's an R function for extracting RAR archives.
They probably shouldn't have .zip file extensions, but .rar instead. I've extracted the above using unrar on the command line.
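(A hedged addendum: as a later answer on this page notes, the archive package reads RAR through libarchive, so a sketch like the following might extract these mislabeled files without leaving R.)
library(archive)
# despite the .zip extension the content is RAR; libarchive detects the
# real format from the file contents, not the extension
archive_extract("R01.zip", dir = "R01")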

OK, so based on this post I was able to work out a solution.
Since the files were not actually .zip files, and since 7-Zip could extract them manually, I looked for a way of calling 7-Zip from within R. The link I just posted shows how to do that.
I modified my code; now the files are downloaded and unzipped automatically.
# load necessary packages
library(downloader)
library(installr)
install.7zip(page_with_download_url = "http://www.7-zip.org/download.html")
# download the data and unzip it
path <- getwd()
for (i in 1:15) { # the files correspond to the fifteen administrative
                  # regions of Chile, and they are ordered
  fileName <- sprintf("%02d", i) # add a leading zero to one-digit indices
  if (!file.exists(paste0("R", fileName, ".zip"))) { # download only if the
                                                     # file is not already there
    urlFile <- paste0("http://www.censo2017.cl/wp-content/uploads/2016/12/R",
                      fileName, ".zip") # build the url
    download(urlFile, dest = paste0("./R", fileName, ".zip"), mode = "wb")
  }
  if (!file.exists(paste0("R", fileName))) { # if not already unzipped, do it
    z7path <- shQuote('C:\\Program Files (x86)\\7-Zip\\7z')
    file <- paste0(getwd(), "/R", fileName, ".zip")
    cmd <- paste0(z7path, ' e ', file, ' -y -o',
                  paste0(getwd(), "/R", fileName), '/')
    shell(cmd)
  }
}
It would be awesome if someone could tell me whether this solution works for you too!
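A hedged portability note: shell() is Windows-only, so on other systems the same extraction could be issued through system2(), assuming the 7z binary is on the PATH (same flags as above, with fileName taken from the surrounding loop):
# equivalent 7-Zip call via system2(); assumes 7z is on the PATH
system2("7z", args = c("e", paste0("R", fileName, ".zip"), "-y",
                       paste0("-o", file.path(getwd(), paste0("R", fileName)))))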

Related

Read latest SPSS file from directory

I am trying to read the latest SPSS file from a directory that contains several SPSS files. I want to read only the newest file from a list of 3 files, which changes with time. Currently, I have manually entered the filename (SPSS-1568207835.sav, for example), which works absolutely fine, but I want to make this dynamic and automatically fetch the latest file. Any help would be greatly appreciated.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
mydata = read_spss("SPSS-1568207835.sav",reencode = TRUE)
w <- data.frame(mydata)
args <- commandArgs(TRUE)
This should return a character string with the filename of the most recently modified .sav file:
# get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
# use file.info to get the index of the file most recently modified
all_sav[with(file.info(all_sav), which.max(mtime))]
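To tie it back to the snippet in the question, a minimal sketch (assuming the same expss setup) could be:
library(expss)
# pick the most recently modified .sav file and read it
all_sav <- list.files(pattern = '\\.sav$')
latest <- all_sav[with(file.info(all_sav), which.max(mtime))]
mydata <- read_spss(latest, reencode = TRUE)
w <- data.frame(mydata)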

Reading in Excel (downloaded with automated script) produces error when not manually opened and saved first

I run an automated script to download 3 .xls files from 3 websites every hour. When I later try to read the .xls files into R to work with them further, R produces the following error message:
"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"
When I manually open and save the .xls files, this problem no longer appears and everything works normally, but since the total number of files increases by 72 every day, this is not a nice workaround.
The script I use to download and save the files:
library(httr)
setwd("WORKDIRECTION")
orig_wd <- getwd()
FOLDERS <- c("NAME1", "NAME2", "NAME3") # representing folder names
LINKS <- c("WEBSITE_1",                 # the urls from which I download
           "WEBSITE_2",
           "WEBSITE_3")
NO <- length(FOLDERS)
for (i in 1:NO) {
  today <- as.character(Sys.Date())
  if (!file.exists(paste(FOLDERS[i], today, sep = "/"))) {
    dir.create(paste(FOLDERS[i], today, sep = "/"))
  }
  setwd(paste(orig_wd, FOLDERS[i], today, sep = "/"))
  dat <- GET(LINKS[i])
  bin <- content(dat, "raw")
  now <- as.character(format(Sys.time(), "%X"))
  now <- gsub(":", ".", now)
  writeBin(bin, paste(now, ".xls", sep = ""))
  setwd(orig_wd)
}
I then read in the files with the following script:
require(gdata)
require(XLConnect)
require(xlsReadWrite)
wb = loadWorkbook("FILEPATH")
df = readWorksheet(wb, "Favourite List" , header = FALSE)
Does anybody have experience with this type of error, and knows a solution or workaround?
The problem is partly resolved by using the readxl package, available on CRAN. After installation, files can be read in with:
library(readxl)
read_excel("PathToFile")
The only problem is that the last column is omitted while reading in. If I find a solution for this I'll update the answer.
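A hedged workaround for the dropped column, assuming you know (or can over-specify) the sheet's column extent, is to force the read range with read_excel's range argument; the column letters below are placeholders:
library(readxl)
# force readxl to read a fixed set of columns instead of guessing the extent;
# "A:J" is a placeholder for the sheet's actual column range
df <- read_excel("PathToFile", range = cell_cols("A:J"))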

Package compilation and relative path

I must be very confused. I have looked around but cannot find a suitable answer, and I have a feeling I am doing something wrong.
Here is a minimalist example:
My function test imports a file from a folder and runs a subsequent analysis on that file. I have dozens of compressed files in the folder specified by path = "inst/extdata/input_data".
test = structure(function(path, letter) {
  file = paste0(path, "/file_", letter, ".tsv.gz")
  data = read.csv(file, sep = "\t", header = FALSE, quote = "\"",
                  stringsAsFactors = FALSE)
  return(mean(data$var1))
}, ex = function() {
  path = "inst/extdata/input_data"
  m1 = test(path, "A")
})
I am building a package with the function in the folder R/ of the package directory.
When I set the working directory to the package parent and run the example line by line, everything works fine. However, when I check the package with R CMD check, it gives me the following:
cannot open file 'inst/extdata/input_data/file_A.tsv.gz': No such file or directory
Error in file(file, "rt") : cannot open the connection
I thought that when checking and building the package, the working directory was automatically set to the parent directory of the package (which in my case is "C:/Users/yuhu/R/Projects/ABCDpackage"), but that seems not to be the case.
What is the best practice in this case? I would rather avoid converting all the data to .rda format and putting it in the data folder, as there are too many files. Is there a way to build the package and have the function example use a relative path based on where the package is installed? This would also be helpful when the package is distributed (so it should not be my own path).
Many thanks for your help.
When R CMD check (or the user later for that matter) runs the example, you need to provide the full path to the file! You can build that path easily with the system.file or the path.package command.
If your package is called foo, the following should do the trick:
}, ex = function() {
  path = paste0(system.file(package = "foo"), "/extdata/input_data")
  m1 = test(path, "A")
})
You might want to add a file.path command somewhere to be OS independent.
Since read.csv is just a wrapper for read.table, I would not expect any fundamental difference with respect to reading compressed files.
Comment: R removes the "inst/" part of the directory when it builds the system directory. This thread has a discussion on the inst directory.
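As a hedged variant of the snippet above, system.file can assemble the installed path itself (note the dropped inst/, per the comment above):
}, ex = function() {
  # system.file resolves the installed location of the package's files;
  # "inst/extdata/input_data" in the source tree becomes "extdata/input_data"
  path = system.file("extdata", "input_data", package = "foo")
  m1 = test(path, "A")
})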
I think you might just want to go with read.table... At any rate give this a try.
fopen <- file(paste0(path, "/file_", letter, ".tsv.gz"), open = "rt")
data <- read.table(fopen, sep = "\t", header = FALSE, quote = "\"",
                   stringsAsFactors = FALSE)
close(fopen) # close the connection once the data is read
Refinement:
At the end of the day, I think your problem is mainly that you are using read.csv instead of read.table, which can open .gz-compressed files directly. So, just to be sure, here is a little experiment I did.
Experiment:
# gzip a .csv file (in this case example_A.csv) that exists in my working directory
system("gzip example_A.csv")
# pass the path as a variable, like you did
path <- getwd()
file <- paste0(path, "/example_", "A", ".csv.gz")
data <- read.table(file, sep = ",", header = FALSE, stringsAsFactors = FALSE)
# I think these are the only options you need; stringsAsFactors = FALSE is a good one
data <- data[1:5, 1:7] # a subset of the data
#   V1       V2     V3      V4     V5      V6     V7
# 1 id Scenario Region    Fuel  X2005   X2010  X2015
# 2  1 BSE9VOG4     R1 Biomass      0  2.2986 0.8306
# 3  2 BSE9VOG4     R1    Coal 7.4339 13.3548 9.2918
# 4  3 BSE9VOG4     R1     Gas 1.9918  2.4623 2.5558
# 5  4 BSE9VOG4     R1     LFG 0.2111  0.2111 0.2111
At the end of the day (I say that too much), you can be certain that the problem lies either in the method you used to read the compressed files or in the text string you've constructed for the file names (I haven't looked into the latter). At any rate, best of luck with the package. I hope it turns the tide.

Creating zip file from folders in R

I am trying to create a zip file from a folder using R.
The "Rcompression" package is mentioned here:
Creating zip file from folders
But I couldn't find where to download this package for a Windows system.
Any suggestions, or other functions to create a zip file?
You can create a zip file with the function zip from the utils package quite easily. Say you have a directory testDir and you wish to zip a file (or multiple files) inside the directory:
dir('testDir')
# [1] "cats.csv" "test.csv" "txt.txt"
zip(zipfile = 'testZip', files = 'testDir/test.csv')
# adding: testDir/test.csv (deflated 68%)
The zipped file is saved in the current working directory, unless a different path is specified in the zipfile argument. We can see its size relative to the original unzipped file with
file.info(c('testZip.zip', 'testDir/test.csv'))['size']
# size
# testZip.zip 805
# testDir/test.csv 1493
You can zip the whole directory of files (if no sub-folders) with
files2zip <- dir('testDir', full.names = TRUE)
zip(zipfile = 'testZip', files = files2zip)
# updating: testDir/test.csv (deflated 68%)
# updating: testDir/cats.csv (deflated 27%)
# updating: testDir/txt.txt (stored 0%)
And unzip it to view the files,
unzip('testZip.zip', list = TRUE)
# Name Length Date
# 1 testDir/test.csv 1493 2014-05-14 20:54:00
# 2 testDir/cats.csv 116 2014-05-14 20:54:00
# 3 testDir/txt.txt 32 2014-05-08 09:37:00
Note: From ?zip, regarding the zip argument.
On Windows, the default relies on a zip program (for example that from Rtools) being in the path.
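As a hedged sanity check on Windows, you can verify that a zip program is actually visible on the path before calling zip():
# returns the full path to the zip executable, or "" if none is found
Sys.which("zip")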
To avoid (a) an issue with relative paths (i.e., the zip file itself containing the full folder structure of the path being zipped) and (b) for loops (well, for style), you may use:
my_wd <- getwd() # save your current working directory path
dest_path <- "C:/.../folder_with_files_to_be_zipped"
setwd(dest_path)
files <- list.files(dest_path)
named <- paste0(files, ".zip")
mapply(zip, zipfile = named, files = files)
setwd(my_wd) # reset working directory path
Unlike R's built-in unzip function, zip requires a zip program, such as 7-Zip (on Windows) or the one shipped with Rtools, to be present on your system path.
For people still looking for this: there is now a "zip" package that does not depend on external executables.
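A minimal sketch with that package (assuming the same testDir layout as above) might look like:
# install.packages("zip")  # from CRAN
library(zip)
# zip::zip bundles its own zip implementation, so no external executable is needed
zip::zip(zipfile = "testZip.zip", files = dir("testDir", full.names = TRUE))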
You can install Rcompression from the omegahat repos:
install.packages('Rcompression', repos = "http://www.omegahat.org/R", type = "source")
For Windows you will need to jump through some hoops, installing zlib and bzip2 and linking them appropriately.
utils::zip can be used in some cases, but there are a number of issues with it. One is that the maximum length of a command-prompt string on Windows is 8191 characters (2047 on some versions). If you are zipping a directory whose directory and file names contain a lot of characters, this will cause problems, for example when zipping your Firefox profile directory. I also found that the zip command needed to be issued relative to the directory being zipped in order to use relative directory names. Rcompression has an altNames argument which handles this.
That being said, I have always had problems getting Rcompression to run on Windows.
It's worth noting that zip() will fail silently if it cannot find a zip program.
zip returns an error code (or exit code) invisibly. That is, it will not print, unless you explicitly ask it to.
You can run print(zip(output, input)) to print the exit code; in the case where no zip program is found, it will print 127.
Alternatively you can do something along the lines of
# exit code 0 means success; all other codes indicate failure
# (note the extra parentheses: without them, `!=` binds before `<-` and
# exit_code would capture TRUE/FALSE instead of the numeric code)
if ((exit_code <- zip(output, input)) != 0) {
  stop("Zipping ", input, " failed with exit code: ", exit_code)
}
# convert all folders into .zip files
d <- "C:/Users/Eric/Documents/R/win-library/3.3"
array <- list.files(d)
for (i in seq_along(array)) {
  name <- paste0(array[i], ".zip")
  zip(name, files = paste0(d, "/", array[i]))
}

Using R to download zipped data file, extract, and import data

#EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more of a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketched out above, you need to:
1. Create a temporary file name (e.g. with tempfile())
2. Use download.file() to fetch the file into the temporary file
3. Use unz() to extract the target file from the temporary file
4. Remove the temporary file via unlink()
which in code (thanks for the basic example, but this is simpler) looks like:
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
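For instance, a minimal sketch of that direct route (the .gz file name and URL here are hypothetical):
# read.table opens local .gz files transparently
data <- read.table(gzfile("a1.dat.gz"))
# for a remote .gz, wrap the url connection in gzcon() to decompress it
data <- read.table(gzcon(url("http://www.newcl.org/data/zipfiles/a1.dat.gz")))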
Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.
download(url, dest = "dataset.zip", mode = "wb")
unzip("dataset.zip", exdir = "./")
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the bash command funzip, in conjunction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
library(readxl) # read_xls comes from the readxl package
url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(c(temp, temp2))
To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
rm(temp)
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.
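A hedged sketch of such a one-liner (Unix-like shells only; it assumes curl and unzip are on the PATH and that your data.table version supports fread's cmd argument):
library(data.table)
# download to a temp file, then stream the named member to stdout with unzip -p
timeUse <- fread(cmd = paste(
  "curl -s -o /tmp/atus.zip https://www.bls.gov/tus/special.requests/atusact_0315.zip",
  "&& unzip -p /tmp/atus.zip atusact_0315.dat"))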
Using library(archive), one can also read a particular csv file within the archive, without having to unzip it first:
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
I find this more convenient, and it is faster.
It also supports all major archive formats and is quite a bit faster than base R's untar or unz; it handles tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz, and uuencoded files.
To extract everything, one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir = XXX)
This works on all platforms and, given the superior performance, it would be my preferred option.
Try this code. It works for me:
unzip(zipfile = "<directory and filename>",
      exdir = "<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")
The rio package would be very suitable for this: its import() function uses the file extension to determine what kind of file it is, so it works with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
# delete the files and directories (recursive is needed to remove a directory)
unlink(td, recursive = TRUE)
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)
