read table from subfolder of tar.gz file - r

table can open a file directly from tar.gz file
myData <- read.table('myFile.tar.gz')
However without having to unzip and then delete is there a way to read the table of a specific file in the compress file, so for example under firstf/secondf/table.txt?
there has been a similar post but note quite what I need.
thanks.

Ok I found an answer by using a package call archive.
library ( archive )
file <- "test.tar.gz"
x <- archive::archive_read(archive = file , file = "firstf/secondf/table.txt")
df <- read.table(x, header=TRUE,sep="\t",stringsAsFactors = FALSE)

Related

Read .tar.gz file in R

I have compressed file like cat.txt.tar.gz, I just need to load into R and process as follows
zip <-("cat.txt.tar.gz")
data <- read.delim(file=(untar(zip,"cat.txt")),sep="\t")
but "data" is empty while running the code.Is there any way to read a file from .tar.gz
Are you sure your file is named correctly?
Usually compressed files are named cat.tar.gz, excluding the .txt.
Second, try the following code:
tarfile <- "cat.txt.tar.gz" # Or "cat.tar.gz" if that is right
data <- read.delim(file = untar(tarfile,compressed="gzip"),sep="\t")
If this doesn't work, you might need to extract the file first, and then read the extracted file.
To read in a particular csv or txt within a gz archive without having to UNZIP it first one can use library(archive) :
library(archive)
library(readr)
read_csv(archive_read("cat.txt.tar.gz", file = 1), col_types = cols(), sep="\t")
should work.

Read files with a specific a extension from a folder in R

I want to read files with extension .output with the function read.table.
I used pattern=".output" but its'not correct.
Any suggestions?
As an example, heres how you could read in files with the extension ".output" and create a list of tables
list.filenames <- list.files(pattern="\\.output$")
trialsdata <- lapply(list.filenames,read.table,sep="\t")
or if you just want to read them one at a time manually just include the extention in the filename argument.
read.table("ACF.output",sep=...)
So finally because i didn't found a solution(something is going wrong with my path) i made a text file including all the .output files with ls *.output > data.txt.
After that using :
files = read.table("./data.txt")
i am making a data.frame including all my files and using
files[] <- lapply(files, as.character)
Finally with test = read.table(files[i,],header=F,row.names=1)
we could read every file which is stored in i (i = no of line).

Read latest SPSS file from directory

I am trying to read the latest SPSS file from the directory which has several SPSS files. I want to read only the newest file from a list of 3 files which changes with time. Currently, I have manually entered the filename (SPSS-1568207835.sav for ex.) which works absolutely fine, but I want to make this dynamic and automatically fetch the latest file. Any help would be greatly appreciated.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
mydata = read_spss("SPSS-1568207835.sav",reencode = TRUE)
w <- data.frame(mydata)
args <- commandArgs(TRUE)
This should return a character string for the filename of the .sav file modified most recently
# get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
# use file.info to get the index of the file most recently modified
all_sav[with(file.info(all_sav), which.max(mtime))]

Using R to download gzipped data file, extract, and import data

A follow up to this question: How can I download and uncompress a gzipped file using R? For example (from the UCI Machine Learning Repository), I have a file of insurance data. How can I download it using R?
Here is the data url: http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz.
I like Ramnath's approach, but I would use temp files like so:
tmpdir <- tempdir()
url <- 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
file <- basename(url)
download.file(url, file)
untar(file, compressed = 'gzip', exdir = tmpdir )
list.files(tmpdir)
The list.files() should produce something like this:
[1] "TicDataDescr.txt" "dictionary.txt" "ticdata2000.txt" "ticeval2000.txt" "tictgts2000.txt"
which you could parse if you needed to automate this process for a lot of files.
Here is a quick way to do it.
# create download directory and set it
.exdir = '~/Desktop/tmp'
dir.create(.exdir)
.file = file.path(.exdir, 'tic.tar.gz')
# download file
url = 'http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz'
download.file(url, .file)
# untar it
untar(.file, compressed = 'gzip', exdir = path.expand(.exdir))
Please the content of help(download.file) for that. If the file in question is merely a gzipped but otherwise readable file, you can feed the complete URL to read.table() et al too.
Using library(archive) one can also read in a particular csv file within an archive without having to UNZIP it first : read_csv(archive_read("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", file = 1), col_types = cols())
This is quite a bit faster.
To unzip everything one can do archive_extract("http://archive.ics.uci.edu/ml/databases/tic/tic.tar.gz", dir=XXX).
That worked very well for me & is faster than the unbuilt untar(). It also works on all platforms. It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.

Using R to download zipped data file, extract, and import data

#EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
Create a temp. file name (eg tempfile())
Use download.file() to fetch the file into the temp. file
Use unz() to extract the target file from temp. file
Remove the temp file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
I used CRAN package "downloader" found at http://cran.r-project.org/web/packages/downloader/index.html . Much easier.
download(url, dest="dataset.zip", mode="wb")
unzip ("dataset.zip", exdir = "./")
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the bash command funzip, in conjuction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
url <-"https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(c(temp, temp2))
To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
rm(temp)
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.
Using library(archive) one can also read in a particular csv file within the archive, without having to UNZIP it first; read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
which I find more convenient & is faster.
It also supports all major archive formats & is quite a bit faster than the base R untar or unz - it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz & uuencoded files.
To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms & given the superior performance for me would be the preferred option.
Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")
rio() would be very suitable for this - it uses the file extension of a file name to determine what kind of file it is, so it will work with a large variety of file types. I've also used unzip() to list the file names within the zip file, so its not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
# delete the files and directories
unlink(td)
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)

Resources