R: Read single file from within a tar.gz directory - r

Consider a tar.gz file of a directory which containing a lot of individual files.
From within R I can easily extract the name of the individual files with this command:
fileList <- untar(my_tar_dir.tar.gz, list=T)
Using only R is it possible to directly read/load a single of those files into R (aka without first unpacking and writing the file to the disk)?

It is possible, but I don't know of any clean implementation (it may exist). Below is some very basic R code that should work in many cases (e.g. file names with full path inside the archive should be less than 100 characters). In a way, it's just re-implementing "untar" in an extremely crude way, but in such a way that it will point to the desired file in a gzipped file.
The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.
ParseTGZ<- function(archname){
# open tgz archive
tf <- gzfile(archname, open='rb')
on.exit(close(tf))
fnames <- list()
offset <- 0
nfile <- 0
while (TRUE) {
# go to beginning of entry
# never use "seek" to re-locate in a gzipped file!
if (seek(tf) != offset) readBin(tf, what="raw", n= offset - seek(tf))
# read file name
fName <- rawToChar(readBin(tf, what="raw", n=100))
if (nchar(fName)==0) break
nfile <- nfile + 1
fnames <- c(fnames, fName)
attr(fnames[[nfile]], "offset") <- offset+512
# read size, first skip 24 bytes (file permissions etc)
# again, we only use readBin, not seek()
readBin(tf, what="raw", n=24)
# file size is encoded as a length 12 octal string,
# with the last character being '\0' (so 11 actual characters)
sz <- readChar(tf, nchars=11)
# convert string to number of bytes
sz <- sum(as.numeric(strsplit(sz,'')[[1]])*8^(10:0))
attr(fnames[[nfile]], "size") <- sz
# cat(sprintf('entry %s, %i bytes\n', fName, sz))
# go to the next message
# don't forget entry header (=512)
offset <- offset + 512*(ceiling(sz/512) + 1)
}
# return a named list of characters strings with attributes?
names(fnames) <- fnames
return(fnames)
}
This will give you the exact position and length of all files in the tar.gz archive.
Now the next step is to actually extact a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.
extractTGZ <- function(archfile, filename) {
# this function returns a raw vector
# containing the desired file
fp <- ParseTGZ(archfile)
offset <- attributes(fp[[filename]])$offset
fsize <- attributes(fp[[filename]])$size
gzf <- gzfile(archfile, open="rb")
on.exit(close(gzf))
# jump to the byte position, don't use seek()
# may be a bad idea on really large archives...
readBin(gzf, what="raw", n=offset)
# now read the data into a raw vector
result <- readBin(gzf, what="raw", n=fsize)
result
}
now, finally:
ff <- rawConnection(ExtractTGZ("myarchive", "myfile"))
Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.

One can read in a csv within an archive using library(archive) as follows (this should be a lot more elegant than the currently accepted answer, this package also supports all major archive formats - 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' & 'xz' and it works on all platforms):
library(archive)
library(readr)
read_csv(archive_read("my_tar_dir.tar.gz", file = 1), col_types = cols())

Related

How to simultaneously read and write a file line by line?

I would like to remove all lines from a file which start with a certain pattern. I would like to do this with R. It is good practice to not first read the whole file, then remove all matching lines and afterwards write the whole file, as the file can be huge. I am thus wondering if I can have both a read and a write connection (open all the time, one at a time?) to the same file. The following shows the idea (but 'hangs' and thus fails).
## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()
## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
while(TRUE) {
rcon <- file(fnm, "r") # read connection
line <- readLines(rcon, n = 1) # read one line
close(rcon)
if(length(line) == 0) { # end of file
break
} else {
if(!grepl(pat, line)) {
wcon <- file(fnm, "w")
writeLines(line, con = wcon)
close(wcon)
}
}
}
Note:
1) See here for an answer if one writes to a new file. One could then delete the old file and rename the new one to the old one, but that does not seem very elegant :-).
2) Update: The following MWE produces
Hello
world
-
world
See:
## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()
## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
con <- file(fnm, "r+") # read and write connection
while(TRUE) {
line <- readLines(con, n = 1L) # read one line
if(length(line) == 0) break # end of file
if(!grepl(pat, line))
writeLines(line, con = con)
}
close(con)
I think you just need open = 'r+'. From ?file:
Modes
"r+", "r+b" -- Open for reading and writing.
I don't have your sample file, so I'll instead just have the following minimal example:
take a file with a-z on 26 lines and replace them one by one with A-Z:
tmp = tempfile()
writeLines(letters, tmp)
f = file(tmp, 'r+')
while (TRUE) {
l = readLines(f, n = 1L)
if (!length(l)) break
writeLines(LETTERS[match(l, letters)], f)
}
close(f)
readLines(f) afterwards confirms this worked.
I understand you want to use R, but just in case you're not aware, there are some really simple scripting tools that excel in this type of task. E.g gawk is designed for pretty much exactly this type of operation and is simple enough to learn that you could write a script for this within minutes even without any prior knowledge.
Here's a one-liner to do this in gawk (or awk if you are on Unix):
gawk -i inplace '!/^pat/ {print}' foo.txt
Of course, it is trivial to do this from within R using
system(paste0("gawk -i inplace '!/^", pat, "/ {print}' ", fnm))

How to determine gzfile size OR reading it raw in R?

Following this answer for reading a whole file, I need to determine the uncompressed file size of a gzfile.
It's saved at the last 4 bytes of the gzfile, but I couldn't find how to open the file without r will wrap it with an uncompressing layer, so I have no access to the raw gz file. I haven't found a method that provides this information as well.
Provided you are sure this is a complete gzip'd file with a single stream and <2GB uncompressed:
gz_size <- function(path) {
path <- path.expand(path)
f <- file(path, open="rb", raw=TRUE)
seek(f, -4L, "end", "read")
ret <- readBin(f, "integer", 1)
close(f)
return(ret)
}

I have a zip folder which contains 332 csv file. I have to first unzip it using R and then save it to a directory. How do i do that?

I have tried-
read.zip(file ="C:/Users/dm/Downloads/rprog-data-specdata.zip")
and-
l = list.files("C:/Users/dm/Downloads/rprog-data-specdata")
read.csv(l[1:332])
But it's not working
Unless you really want them all extracted, you don't have to. You can read them all in directly from the archive:
# you
zipped_csvs <- "rprog-data-specdata.zip"
# get data.frame of file info in the zip
fils <- unzip(zipped_csvs, list=TRUE)
# read them all into a list (or you can read individual ones)
dats <- lapply(fils$Name, function(x) {
read.csv(unzip(zipped_csvs, x), stringsAsFactors=FALSE)
})

l_ply Confused About how to Pass Variables Through to Function

I'm sure this must have been answered somewhere, so; if you have a pointer to an answer that helps please let me know... ;o)
I have a number of fairly sizeable processing tasks (mainly multi-label text classifiers) which read in large volumes of files, do stuff with that, output a result then move on to the next.
I have this working neatly sequentially but wanted to parallelise things.
By way of a really basic example...
require(plyr)
fileDir <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- as.list(list.files(full.names=TRUE,recursive=FALSE,pattern=".csv"))
l_ply(files, function(x){
print(x)
#change to dir containing source files
setwd(fileDir)
# read file
content <- read.csv(file=x,header=TRUE)
# change directory to output
setwd(outputDir)
# append the itemID from CSV file to
write.table(content$itemID,file="ids.csv", append = TRUE, sep=",", row.names=FALSE,col.names=TRUE)
}, .parallel=FALSE )
Will iterate through all the files in directory fileDir, opening each CSV, extracting a value from the file and appending this to an output CSV held in the directory outputDir. A basic example but it runs just fine to illustrate the problem.
To run this in parallel creates a problem in so far as the directory variables (fileDir & outputDir) are essentially unknown by the anonymous function (x), ala...
require(plyr)
require(doParallel)
fileDir <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- as.list(list.files(full.names=TRUE,recursive=FALSE,pattern=".csv"))
cl<-makeCluster(4) # make a cluster of available cores
registerDoParallel(cl) # raise cluster
l_ply(files, function(x){
print(x)
#change to dir containing source files
#setwd(fileDir)
# read file
content <- read.csv(file=x,header=TRUE)
# change directory to output
setwd(y)
# append the itemID from CSV file to
write.table(content$itemID,file="ids.csv", append = TRUE, sep=",", row.names=FALSE,col.names=TRUE)
}, .parallel=TRUE )
stopCluster() # kill the cluster
Can anyone shed light on how I pass those two directory variables through to the function here?
So thanks to #Roland my parallel function would now be...
require(plyr)
require(doParallel)
fileDir <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- as.list(list.files(full.names=TRUE,recursive=FALSE,pattern=".csv"))
cl<-makeCluster(4) # make a cluster of available cores
registerDoParallel(cl) # raise cluster
l_ply(files, function(x,y,z){
filename <- x
fileDir <- y
outputDir <- z
#change to dir containing source files
setwd(fileDir)
# read file
content <- read.csv(file=filename,header=TRUE)
# change directory to output
setwd(outputDir)
# append the itemID from CSV file to
write.table(content$itemID,file="ids.csv", append = TRUE, sep=",", row.names=FALSE,col.names=TRUE)
}, y=fileDir, z=outputDir, .parallel=TRUE )
stopCluster() # kill the cluster

Use R to unzip and rename file

I have used R to download about 200 zip files. The zipped files are in mmyy.dat format. The next step is to use R to unzip all the files and rename it as yymm.txt. I know the function unzip can unpack the files. But I am not sure which argument in the function can change the name and format of the unzipped files as well.
And when I unzip the files using
for (i in 1:length(destfile)){
unzip(destfile[i],exdir='C:/data/cps1')
}
The files extrated are jan94pub.cps which is supposed to be jan94pub.dat. The code I use to download the files are here.
month_vec <- c('jan','feb','mar','apr','may', jun','jul','aug','sep','oct','nov','dec')
year_vec <- c('94','95','96','97','98','99','00','01','02','03','04','05','06','07','08','09','10','11','12','13','14')
url <- "http://www.nber.org/cps-basic/"
month_year_vec <- apply(expand.grid(month_vec, year_vec), 1, paste, collapse="")
bab <-'pub.zip'
url1 <- paste(url,month_year_vec,bab,sep='')
for (i in 1:length(url1)){
destfile <- paste('C:/data/cps1/',month_year_vec,bab,sep='')
download.file(url1[i],destfile[i])
}
for (i in 1:length(destfile)){
unzip(destfile[i],exdir='C:/data/cps1')
}
When I use str(destfile), the filenames are correct, jan94pub.dat. I don't see where my code goes wrong.
I'd do something like:
file_list = list.files('*zip')
lapply(file_list, unzip)
Next you want to use the same kind of lapply trick in combination with strptime to convert the name of the file to a date:
t = strptime('010101.txt', format = '%d%m%y.txt') # Note I appended 01 (day) before the name, you can use paste for this (and its collapse argument)
[1] "2001-01-01"
You will need to tweak the filename a bit to get a reliable date, as only the month and the year is not enough. Next you can use strftime to transform it back to you desired yymm.txt format:
strftime(t, format = '%y%d.txt')
[1] "0101.txt"
Then you can use file.rename to perform the actual moving. To get this functionality into one function call, create a function which performs all the steps:
unzip_and_move = function(path) {
# - Get file list
# - Unzip files
# - create output file list
# - Move files
}

Resources