Reading data from zip files located in zip files with R

I'd like to use R to extract data from zip files located in zip files (i.e. perform some ZIP file inception).
An example "directory" of one of my datapoints looks like this:
C:\ZipMother.zip\ZipChild.zip\index.txt
My goal is to read in the "index.txt" from each ZipChild.zip. The issue is that I have 324 ZipMother.zip files containing an average of 2000 ZipChild.zip files each, so unzipping the ZipMother.zip files in full is a concern due to space constraints (the ZipMother files are about 600 megabytes on average).
With the unzip function, I can successfully get the file paths of each ZipChild located in a ZipMother, but I cannot use it to list the files located inside the ZipChild archives.
Therefore,
unzip("./ZipMother.zip",list=TRUE)
works just fine, but...
unzip("./ZipMother.zip/ZipChild.zip",list=TRUE)
gives me the following error
Error in unzip("./ZipMother.zip/ZipChild.zip", list = TRUE) :
zip file './ZipMother.zip/ZipChild.zip' cannot be opened
Is there any way to use unzip or another method to extract the data from the ZipChild files?
Once I get this to work, I plan on using the ldply function to compile the index.txt files into a dataset.
Any input is very much appreciated. Thank you!

A reproducible example (i.e. a link to a zip file with the appropriate structure) would be useful, but how about:
tmpd <- tempdir()
## extract just the child zip from the mother zip
unzip("./ZipMother.zip",
      files = "ZipChild.zip", exdir = tmpd)
ff <- file.path(tmpd, "ZipChild.zip")
index <- unzip(ff, list = TRUE)
unlink(ff)
This could obviously be packaged into a function for convenience.
It could be slow, but it means you never have to unpack more than one child at a time ...
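For example, here is a minimal sketch of such a helper combined with the plyr::ldply step mentioned in the question. It assumes index.txt is a plain delimited table readable by read.table (adjust that call to the actual format), and that the child names can be taken from the mother's listing:
library(plyr)

## read the index.txt out of one child zip inside a mother zip
read_child_index <- function(child, mother, tmpd = tempdir()) {
  unzip(mother, files = child, exdir = tmpd)   # unpack just this one child
  ff <- file.path(tmpd, child)
  ## read index.txt straight out of the child zip; its format is an assumption
  index <- read.table(unz(ff, "index.txt"), header = TRUE)
  unlink(ff)                                   # remove the extracted child again
  index
}

## list the children of one mother and stack all of their index.txt files
children <- grep("\\.zip$", unzip("./ZipMother.zip", list = TRUE)$Name, value = TRUE)
big_index <- ldply(children, read_child_index, mother = "./ZipMother.zip")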

Related

Extract a CSV inside a zip file inside another zip online in R?

I am attempting to extract a CSV that is inside of a zip file nested in another zip file posted online.
The analysis I am doing draws on files that usually have the same name but are updated regularly. Every so often they update the format. This time they decided to put multiple versions of the data embedded in zip files inside of a larger zip file.
What have I done and tried?
I now have a list of many other files that I have downloaded and then loaded into objects. In all of those cases the code block looks similar to this:
temp <- tempfile()
download.file("http://fakeurl.com/data/filename.zip",temp, mode="wb")
unzip(temp, "data.csv")
db <- read.csv("data.csv", header=T)
I cannot wrap my head around taking it to the next level. Because I am not downloading it 'directly' I do not know how to manipulate it.
Ideally, I want to unzip the outer file into a temp location, then unzip the inner file, then read the csv into a data frame.
I thank you all for your help and will answer any questions you might have to help clarify.
Unzip the downloaded file into the current directory and then iterate through the extracted files, unzipping any that is itself a zip file.
Files <- unzip(temp)
for(File in Files) if (grepl("\\.zip$", File)) unzip(File)
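Put together with the download step from your existing code, a minimal sketch might look like this (the URL is the placeholder from the question, and the read.csv call assumes the inner archives contain ordinary csv files):
temp <- tempfile(fileext = ".zip")
download.file("http://fakeurl.com/data/filename.zip", temp, mode = "wb")

## unzip the outer archive into its own temporary directory
outdir <- tempfile("unzipped")
outer_files <- unzip(temp, exdir = outdir)

## unzip any nested zip files into the same directory
for (z in outer_files[grepl("\\.zip$", outer_files)]) unzip(z, exdir = outdir)

## read every csv that was extracted, wherever it ended up
csvs <- list.files(outdir, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
db_list <- lapply(csvs, read.csv)   # a list of data frames, one per csv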
There are also various approaches listed here:
How do I recursively unzip nested ZIP files?
Creating zip file from folders in R
https://superuser.com/questions/1287028/how-to-explore-nested-zip-without-extract
https://github.com/ankitkaushal/nzip

Convert multiple AVI files to JPEG

I am trying to convert multiple (250 or so) .avi video files into .jpeg files with R.
I have managed to convert single .avi files using the av_video_images() function from the av package, but I would love to know how to iterate this over multiple input files.
av::av_video_images("FILE001.AVI", destdir = "Site_1_JPEG", format = "jpg", fps = 1)
I have the 250 .avi files in a folder and would like all frames produced in the output folder Site_1_JPEG.
This is not a complete solution since I cannot reproduce your issue, but I think it will get you closer. First, your example suggests that the files you want to process are in your current working directory. Second, your code will not produce the desired results, because av_video_images names the extracted files "image_000001.jpg", "image_000002.jpg", "image_000003.jpg", and so on, and I see no way to alter those names. That means your code will successively overwrite the previous files, and at the end you will only have the final set of jpg's. To prevent that you need to create a separate output folder for each video file. Here is one solution:
library(av)
path <- "Videos/"   # directory holding the .avi files (adjust to your setup)
flist <- list.files(path, pattern = "\\.AVI$", ignore.case = TRUE)
sapply(flist[1:3], function(x) av_video_images(paste0(path, x), destdir = x, fps = .5))
To test the code I specified that only the first 3 files will be processed. There are two differences between my code and yours. First, my video files are located in a different directory (path), so I pasted the path onto the file name. Second, I provided a different destination directory for each file, which is just the file name. This produced three folders with jpg files in each.
If you see an error message, it could indicate that one or more of the .avi files is corrupt. You can get the directory information on all of the files with
file.info(file.path(path, flist))
The main thing to look at is the size column to make sure the size is large enough.
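For example, a quick way to flag files that look too small to be real videos, using the path and flist objects from the code above (the size threshold is an arbitrary assumption):
info <- file.info(file.path(path, flist))
## file paths are the row names; anything tiny is worth inspecting
rownames(info)[info$size < 10000]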

Is there a way to reference files in a folder within the working directory in R?

I have already finished my RMarkdown document and I'm trying to clean up the workspace a little. This isn't strictly necessary, more of an organizational habit (which I'm not even sure is good practice), so that I can keep the data separate from the scripts and other R- and git-related files.
I have a bunch of .csv files for data that I used. Previously they were on (for example)
C:/Users/Documents/Project
which is what I set as my working directory. But now I want them in
C:/Users/Documents/Project/Data
The problem is that this breaks the following code, because the files are no longer in the wd.
#create one big dataframe by unioning all the data
bigfile <- vroom(list.files(pattern = "*.csv"))
I've tried pointing list.files() at the full path to where the csvs are, but no luck.
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv"))
Error: 'data1.csv' does not exist in current working directory ('C:/Users/Documents/Project').
Is there a way to only access the /Data folder once for creating my dataframe with vroom() instead of changing the working directory multiple times?
You can list files including those in all subdirectories (Data in particular) using list.files(pattern = "*.csv", recursive = TRUE)
Best practices
Have one directory of raw and only raw data (the stuff you measured)
Have another directory of external data (e.g. reference databases). This is something you can remove afterwards and redownload if required.
Have another directory for the source code
Put only the source code directory under version control, plus one other file containing checksums of the raw and external data to prove their integrity (a sketch follows this list)
Everything else must be reproducible using the raw data and the source code, and can be removed after the project. Maybe you want to keep small result files (e.g. tables) which take a long time to reproduce.
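A minimal sketch of the checksum idea, assuming the raw data live in a folder called raw_data (the folder and output file names are placeholders):
## record an md5 checksum for every raw data file
sums <- tools::md5sum(list.files("raw_data", recursive = TRUE, full.names = TRUE))
write.csv(data.frame(file = names(sums), md5 = sums), "checksums.csv", row.names = FALSE)

## later: verify that nothing has changed
stopifnot(identical(sums, tools::md5sum(names(sums))))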
You can list the files and capture the full file path, right?
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv", full.names = T))
and that should read the file in the directory without reference to your wd
Try one of these:
# list all csv files within Data within current directory
Sys.glob("Data/*.csv")
# list all csv files within immediate subdirectories of current directory
Sys.glob("*/*.csv")
If you only have csv files then these would also work, but they seem less desirable. They might be useful, though, if you quickly want to review what files and directories are there. (I would be very careful not to use the second one within statements that delete files: if you are not in the directory you think you are in, you can wind up deleting files you did not intend to delete. The first one might too, but it is a bit safer, since it would only lead to deleting the wrong files if the directory you are in happens to have a Data subdirectory.)
# list all files & directories within Data within current directory
Sys.glob("Data/*")
# list all files & directories within immediate subdirectories of current directory
Sys.glob("*/*")
If the subfolder always has the same name (or the same number of characters), you should be able to do it with substring. In your example, "Data" has 4 characters (5 with the /), so the following strips it from the end of getwd():
Repository <- substring(getwd(), 1, nchar(getwd())-5)
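An alternative that does not depend on counting characters is dirname(), which drops the last path component whatever its length:
Repository <- dirname(getwd())   # parent of the current working directory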

Vemco Acoustic Telemetry Data (vrl files) in R

Does anyone know a good way to read .vrl files from Vemco acoustic telemetry receivers directly into R as an object? Converting .vrl files to .csv files in the program VUE prior to analyzing the data in R seems like a waste of time if there is a way to bring them in directly. My internet searches have not turned up anything that worked for me.
I figured out a way using glatos to convert all .vrl files to .csv, then read the .csv files in and bind them.
glatos has to be installed from GitHub.
Convert all .vrl files to .csv files using vrl2csv. The help page has info on finding the path for vueExePath.
library(glatos)
vrl2csv(vrl = "VRLFileInput",outDir = "VRLFilesToCSV", vueExePath = "C:/Program Files (x86)/VEMCO/VUE")
The following will pull in all .csv files in the output folder from vrl2csv and rbind them together. I had to add the paste0 function to create the full file path for each .csv in the list.
library(data.table)
AllDetections <- do.call(rbind, lapply(paste0("VRLFilesToCSV/", list.files(path = "VRLFilesToCSV")), read.csv))
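As a side note, list.files() can return full paths directly with full.names = TRUE, which avoids the paste0 step; and since data.table is already loaded, fread and rbindlist are a natural fit when there are many files. A sketch using the same folder name as above:
csvs <- list.files("VRLFilesToCSV", pattern = "\\.csv$", full.names = TRUE)
AllDetections <- rbindlist(lapply(csvs, fread))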

Import multiple csv files into R from zip folder

I know that this question has been asked exhaustively on this site, but I cannot find any question which addresses my problem.
I am trying to import multiple .csv files into R which are located in nested .zip files on my PC. The other questions seem to relate to importing a single file from a URL, which is not my issue.
I have set my working directory to the folder which contains the first .zip file, but there is another one inside of it, which then contains normal folders, and finally hundreds of .csv files which I am looking to access.
Up to now I have always manually extracted the data since I have no idea where to begin with unzipping code, but considering this folder contains around 20GB of data, I'm going to need to try something else.
Any help would be appreciated!
EDIT - CODE:
setwd("C:/docs/data/241115")
temp <- tempfile()
unzip("C:/docs/data/241115/Requested.zip",exdir=temp)
l = list.files(temp)
unzip("C:/docs/data/241115/Requested/Data Requested.zip",exdir=temp)
> error 1 in extracting from zip file
Without a minimal reproducible example it's difficult to know exactly where the problem lies. My best guess is that using a tempfile() is causing problems.
I would create a folder within your working directory to unzip the files to. You can do this from within R if you like:
# Create the folder 'temp' in your wd
dir.create("temp")
Now, assuming your zip file is in the working directory, I would unzip the first .zip into temp in one step:
unzip("Requested.zip", exdir = "temp")
Finally, unzip the inner .zip:
unzip("temp/Data Requested.zip", exdir = "temp")
