R Reading in a zip data file without unzipping it


I have a very large zip file and I am trying to read it into R without unzipping it, like so:
temp <- tempfile("Sales", fileext=c("zip"))
data <- read.table(unz(temp, "Sales.dat"), nrows=10, header=T, quote="\"", sep=",")
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open zip file 'C:\Users\xxx\AppData\Local\Temp\RtmpyAM9jH\Sales13041760345azip'

If your zip file is called Sales.zip and contains only a file called Sales.dat, I think you can simply do the following (assuming the file is in your working directory):
data <- read.table(unz("Sales.zip", "Sales.dat"), nrows=10, header=T, quote="\"", sep=",")

The methods of the readr package also support compressed files if the file suffix indicates the nature of the file, that is files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
require(readr)
myData <- read_csv("foo.txt.gz")

No need to use unz, as now read.table can handle the zipped file directly:
data <- read.table("Sales.zip", nrows=10, header=T, quote="\"", sep=",")
See this post

This should work just fine if the file inside the archive is Sales.csv:
data <- readr::read_csv(unzip("Sales.zip", "Sales.csv"))
To check the file names without extracting anything, list the archive contents:
unzip("Sales.zip", list = TRUE)
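Note that unzip() with a file name extracts that file to the working directory and returns its path. To avoid writing anything to disk, the listing step can be combined with unz(). The following is a minimal sketch in base R; read_zip_member is a hypothetical helper name, and Sales.zip and Sales.dat are the names assumed in the question.

```r
# Hypothetical helper: read one member of a zip archive through a connection,
# without extracting anything to disk.
read_zip_member <- function(zip_path, member = NULL, ...) {
  # unzip(list = TRUE) only reads the archive index; nothing is extracted.
  contents <- utils::unzip(zip_path, list = TRUE)
  if (is.null(member)) {
    member <- contents$Name[1]  # default to the first entry
  }
  if (!member %in% contents$Name) {
    stop("'", member, "' not found in ", zip_path,
         "; available: ", paste(contents$Name, collapse = ", "))
  }
  # read.table() opens the unz() connection and closes it when done.
  utils::read.table(unz(zip_path, member), ...)
}

# Usage, assuming Sales.zip is in the working directory:
# sales <- read_zip_member("Sales.zip", "Sales.dat",
#                          nrows = 10, header = TRUE, quote = "\"", sep = ",")
```

Checking the member name against the listing first turns the rather opaque "cannot open the connection" error into a message that shows what the archive actually contains.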

If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin), you could also use:
zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.

The gzfile function, combined with read_csv or read.table, can read gzip-compressed files.
library(readr)
df = read_csv(gzfile("file.csv.gz"))
# read.table is base R, so no extra package is needed; assuming a CSV with a header row
df = read.table(gzfile("file.csv.gz"), header = TRUE, sep = ",")
read_csv from the readr package can also read compressed files without the gzfile function.
library(readr)
df = read_csv("file.csv.gz")
read_csv is recommended because it is faster than read.table.
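For readers who want to try this without readr, here is a small base R round trip. It is only a sketch: the file is a temporary one created for the example, and the column names are made up.

```r
# Write a small gzip-compressed CSV to a temporary file (base R only).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "wt")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), con, row.names = FALSE)
close(con)

# Read it back through an explicit gzfile() connection ...
via_connection <- read.csv(gzfile(gz_path))

# ... or by path: file-based readers detect gzip compression when reading.
via_path <- read.csv(gz_path)

# Both routes should give the same data frame.
same_result <- identical(via_connection, via_path)

unlink(gz_path)
```

This applies to single compressed files (.gz, .bz2, .xz). A zip archive is a container with its own index, which is why it needs unz() or unzip() instead.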

In this expression you are missing a dot:
temp <- tempfile("Sales", fileext=c("zip"))
It should be:
temp <- tempfile("Sales", fileext=c(".zip"))
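A quick check of what tempfile() actually returns may help here. The sketch below uses only base R. It also shows that tempfile() generates a name without creating a file, so unz(temp, ...) would still have nothing to open unless the archive is first downloaded or copied to that path, which may be the underlying cause of the error in the question.

```r
# fileext is appended verbatim, so the leading dot must be included.
no_dot   <- tempfile("Sales", fileext = "zip")
with_dot <- tempfile("Sales", fileext = ".zip")

grepl("\\.zip$", no_dot)    # FALSE: the name ends in "zip" with no dot
grepl("\\.zip$", with_dot)  # TRUE

# tempfile() only generates a unique name; no file is created.
file.exists(with_dot)       # FALSE
```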

For remote zipped files:
samhsa2015 <- fread("curl https://www.opr.princeton.edu/workshops/Downloads/2020Jan_LatentClassAnalysisPratt_samhsa_2015F.zip | funzip")
(answer from here: https://stackoverflow.com/a/37824192/12387385)

Related

Parsing issue, unexpected character when loading a folder

I am using this answer to load in a folder of Excel Files:
# Get the list of files
#----------------------------#
folder <- "path/to/files"
fileList <- dir(folder, recursive=TRUE) # grep through these, if you are not loading them all
# use platform appropriate separator
files <- paste(folder, fileList, sep=.Platform$file.sep)
So far, so good.
# Load them in
#----------------------------#
# Method 1:
invisible(sapply(files, source, local=TRUE))
#-- OR --#
# Method 2:
sapply(files, function(f) eval(parse(text=f)))
But the source function (Method 1) gives me the error:
Error in source("C:/Users/Username/filename.xlsx") :
C:/Users/filename :1:3: unexpected input
1: PK
^
For Method 2 I get the error:
Error in parse(text = f) : <text>:1:3: unexpected '/'
1: C:/
^
EDIT: I tried circumventing the issue by setting the working directory to the directory of the folder, but that did not help.
Any ideas why this happens?
EDIT 2: It works when doing the following (from "How can I read multiple (excel) files into R?"):
setwd("...")
library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)

Just to provide a proper answer outside of the comment section...
If your goal is to read many Excel files, you shouldn't use source.
source is meant for running external R code.
If you need to read many Excel files you can use the following code and the support of one of these libraries: readxl, openxlsx, tidyxl (with unpivotr).
filelist <- dir(folder, recursive = TRUE, full.names = TRUE, pattern = ".xlsx$|.xls$", ignore.case = TRUE)
l_df <- lapply(filelist, readxl::read_excel)
Note that we are using dir to list the full paths (full.names = TRUE) of all the files that end with .xlsx, .xls (pattern = ".xlsx$|.xls$"), .XLSX, .XLS (ignore.case = TRUE) in the folder folder and all its subfolders (recursive = TRUE).
readxl is integrated with tidyverse. It is pretty easy to use. It is most likely what you're looking for.
Personally, I advise using openxlsx if you need to write (rather than read) customized Excel files with many specific features.
tidyxl is the best package I've seen for reading Excel files, but it may be rather complicated to use. However, it is very careful about preserving types.
With the support of unpivotr it allows you to handle complicated Excel structures.
For example, when you find multiple headers and multiple left index columns.
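One detail worth checking in the pattern argument: an unescaped dot in a regular expression matches any character, so ".xlsx$" also matches names that merely end in "xlsx". The sketch below compares that pattern with a stricter one using base R only. The file names are invented for illustration, and the stricter pattern is a suggestion, not part of the original answer.

```r
# Candidate file names, invented for illustration.
candidates <- c("sales.xlsx", "OLD.XLS", "notes_xlsx", "data.xlsx.bak", "summary.csv")

# Unescaped dots match any character, so "notes_xlsx" slips through.
loose  <- ".xlsx$|.xls$"
# Escaped dot, optional trailing "x": matches only real .xls/.xlsx extensions.
strict <- "\\.xlsx?$"

loose_hits  <- candidates[grepl(loose,  candidates, ignore.case = TRUE)]
strict_hits <- candidates[grepl(strict, candidates, ignore.case = TRUE)]

loose_hits
strict_hits
```

The stricter pattern can be passed to dir() in the same way, for example dir(folder, pattern = "\\.xlsx?$", ignore.case = TRUE, full.names = TRUE, recursive = TRUE).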

Read .tar.gz file in R

I have a compressed file like cat.txt.tar.gz that I just need to load into R and process as follows:
zip <-("cat.txt.tar.gz")
data <- read.delim(file=(untar(zip,"cat.txt")),sep="\t")
but "data" is empty after running the code. Is there any way to read a file from a .tar.gz archive?

Are you sure your file is named correctly?
Usually compressed files are named cat.tar.gz, excluding the .txt.
Second, untar() extracts to disk and returns a status code rather than a file path, so its result cannot be passed to read.delim. Extract the file first, and then read the extracted file:
tarfile <- "cat.txt.tar.gz" # Or "cat.tar.gz" if that is right
untar(tarfile, files = "cat.txt", exdir = tempdir())
data <- read.delim(file.path(tempdir(), "cat.txt"), sep = "\t")
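A self-contained round trip in base R might look like the sketch below. It builds a tiny .tar.gz first so that it can be run anywhere; the file contents and directory names are made up, and tar = "internal" is passed so that R's built-in implementation is used instead of an external tar program.

```r
# Build a small .tar.gz so the example is self-contained.
src_dir <- tempfile("src_")
dir.create(src_dir)
write.table(data.frame(animal = c("cat", "dog"), legs = c(4L, 4L)),
            file.path(src_dir, "cat.txt"), sep = "\t",
            row.names = FALSE, quote = FALSE)
tarfile <- tempfile(fileext = ".tar.gz")
old_wd <- setwd(src_dir)  # store the member path relative to src_dir
tar(tarfile, files = "cat.txt", compression = "gzip", tar = "internal")
setwd(old_wd)

# untar() writes the member to exdir; the return value is a status code.
out_dir <- tempfile("out_")
dir.create(out_dir)
status <- untar(tarfile, files = "cat.txt", exdir = out_dir, tar = "internal")

# Read the extracted file, then clean up.
data <- read.delim(file.path(out_dir, "cat.txt"), sep = "\t")
unlink(c(src_dir, out_dir, tarfile), recursive = TRUE)
```

Calling untar(tarfile, list = TRUE) beforehand is a quick way to confirm the exact member names inside the archive.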

To read a particular csv or txt file within a .tar.gz archive without extracting it first, one can use library(archive):
library(archive)
library(readr)
# read_csv has no sep argument; read_tsv reads tab-separated data
read_tsv(archive_read("cat.txt.tar.gz", file = 1), col_types = cols())
This should work.

Error comes while importing files by data.table

I'm new to RStudio and was not well aware of this portal's T&C, so I was blocked from asking questions for 5 days.
I have code for importing multiple files from a directory into R.
I am using the code below, but it runs sometimes and at other times fails with the error shown.
I tried to find a solution but have not found one yet.
library(data.table)
t = setwd("/home/dp/vishan/olp_data/19164/1/")
files <- file.info(list.files(path = t,pattern = "", full.names=TRUE))
files = rownames(files)[files$size > 0]
temp <- lapply(files, fread, sep=",")
Error:
Error in FUN(X[[i]], ...) :
'input' can not be a directory name, but must be a single character string containing a file name, a command, full path to a file, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or the input data itself.
Thanks in advance!

Try using
files <- file.info(list.files(path = t,pattern = "", full.names=TRUE))
files <- subset(files, !isdir & size > 0)
temp <- lapply(rownames(files), fread, sep=',')
list.files also returns directories, which is why fread complains. The data.frame you create in files can easily be subset on the isdir column, which indicates whether an entry is a directory or a file.
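The filter can be tried on a throwaway directory. The sketch below uses base R only: the directory layout is invented, and read.csv() stands in for data.table::fread() so that no package is needed. It also avoids `t = setwd(...)`, because setwd() returns the previous working directory rather than the new one, which might explain why the original code only fails some of the time.

```r
# Set up a directory containing a data file, an empty file, and a subdirectory.
data_dir <- tempfile("olp_")
dir.create(data_dir)
dir.create(file.path(data_dir, "archive"))
writeLines(c("a,b", "1,2"), file.path(data_dir, "part1.csv"))
file.create(file.path(data_dir, "empty.csv"))

# Use the path itself; setwd() would return the PREVIOUS working directory.
info <- file.info(list.files(path = data_dir, full.names = TRUE))

# Keep regular, non-empty files only.
keep <- rownames(info)[!info$isdir & info$size > 0]
kept_names <- basename(keep)

# read.csv() stands in for data.table::fread() to keep the sketch base-only.
tables <- lapply(keep, read.csv)

unlink(data_dir, recursive = TRUE)
```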

R read.xlsx gives me java.io.FileNotFoundException

I am trying to use the R package xlsx to load a file available at this URL:
http://www.plosgenetics.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pgen.1002236.s019
library(xlsx)
filename="/home/avilella/00x/mobile.element.insertions.1000g.journal.pgen.1002236.s019.xlsx"
system(paste("ls -l",filename))
-rw-rw-r-- 1 avilella avilella 2372143 2011-12-11 16:36 /home/avilella/00x/mobile.element.insertions.1000g.journal.pgen.1002236.s019.xlsx
Once downloaded, I try to load it in R using read.xlsx or read.xlsx2:
file <- system.file("mobile.element.insertions.1000g", filename, package = "xlsx")
res <- read.xlsx2(file, 1) # read first sheet
But I get an error:
Error in .jnew("java/io/FileInputStream", file) :
java.io.FileNotFoundException: (No such file or directory)
Any ideas?

1) xlsx package. Try using file.choose which will allow you to interactively navigate to the file and thereby eliminate the possibility of misidentifying it:
fn <- file.choose()
DF <- read.xlsx(fn, 1)
2) gdata package. If the above still does not work then you might try read.xls in gdata. It uses a perl program rather than java. It can read both xls and xlsx files and can read data right off the net (downloading it into a temporary file and reading it from there in a manner that is transparent to the user):
library(gdata)
URL <- "http://www.plosgenetics.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pgen.1002236.s019"
DF <- read.xls(URL)
?read.xls in gdata has more info.
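One possible reason for the empty file name in the exception: the question passes the path through system.file(), which looks for files inside an installed package and returns an empty string when nothing matches. The sketch below illustrates that behaviour with base R only; the file names are placeholders, not the file from the question.

```r
# system.file() searches inside an installed package and returns ""
# when nothing matches.
missing_path <- system.file("no-such-file.xlsx", package = "utils")
missing_path == ""

# A file that lives elsewhere on disk should be referenced by its own path
# and checked before it is handed to a reader such as read.xlsx2().
# The path below is made up for illustration.
filename <- file.path(tempdir(), "example.xlsx")
if (!file.exists(filename)) {
  message("File not found: ", filename)
}
```

If this is indeed the cause, passing filename directly to read.xlsx2(filename, 1) would be the fix.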

Using R to download zipped data file, extract, and import data

@EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?

Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
1. Create a temp. file name (e.g. tempfile())
2. Use download.file() to fetch the file into the temp. file
3. Use unz() to extract the target file from the temp. file
4. Remove the temp. file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
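The same steps can be wrapped in a small function so that the temporary file is removed even if the download or the read fails. This is a sketch only: read_zipped_url is a hypothetical name, and mode = "wb" is added because a binary download keeps the archive intact on Windows.

```r
# Hypothetical helper wrapping the download / unz() / unlink() steps.
read_zipped_url <- function(url, member, ...) {
  temp <- tempfile(fileext = ".zip")   # temporary file name
  on.exit(unlink(temp), add = TRUE)    # remove it on exit, even after an error
  # Fetch the archive in binary mode.
  utils::download.file(url, temp, mode = "wb", quiet = TRUE)
  # Read the requested member through an unz() connection.
  utils::read.table(unz(temp, member), ...)
}

# Usage with the URL from the question (it may no longer resolve):
# data <- read_zipped_url("http://www.newcl.org/data/zipfiles/a1.zip", "a1.dat")
```

Extra arguments such as header = TRUE or sep = "," are passed through to read.table().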

Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)

I used the CRAN package "downloader", found at http://cran.r-project.org/web/packages/downloader/index.html. Much easier.
library(downloader)
download(url, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = "./")

For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the command-line tool funzip, in conjunction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")

Here is an example that works for files that cannot be read with the read.table function. This example reads an .xls file.
library(readxl)
url <- "https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp, mode = "wb") # binary mode keeps the zip intact on Windows
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(c(temp, temp2), recursive = TRUE) # temp2 is a directory

To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
unlink(temp) # delete the downloaded archive; rm(temp) would only remove the variable
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.

Using library(archive) one can also read a particular csv file within the archive without having to unzip it first:
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
I find this more convenient, and it is faster.
It supports all major archive formats (tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz and uuencoded files) and is quite a bit faster than the base R untar or unz.
To unzip everything one can use:
archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms and, given the superior performance, would be my preferred option.

Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")

The rio package would be very suitable for this: it uses the file extension of a file name to determine what kind of file it is, so it will work with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
# delete the files and directories
unlink(td)
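A note on the cleanup step: unlink() does not delete directories unless recursive = TRUE is given, and tempdir() is the session-wide temporary directory that R itself relies on. A safer pattern is to extract into a dedicated subdirectory and remove only that. The sketch below uses base R only; the directory prefix and the placeholder file are made up and stand in for real unzip() output.

```r
# A dedicated extraction directory inside the session temp directory.
extract_dir <- tempfile("unzipped_")
dir.create(extract_dir)
writeLines("placeholder", file.path(extract_dir, "a1.dat"))  # stands in for unzip() output

# Without recursive = TRUE, unlink() leaves directories in place.
unlink(extract_dir)
still_there <- dir.exists(extract_dir)   # TRUE

# With recursive = TRUE the directory and its contents are removed.
unlink(extract_dir, recursive = TRUE)
gone <- !dir.exists(extract_dir)         # TRUE
```

In the code above, that would mean passing exdir = extract_dir to unzip() and building the import() paths from extract_dir instead of td.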

I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)
