UNIX untar content into multiple folders

UNIX untar content into multiple folders - unix

I have a tar.gz file about 13GB in size. It contains about 1.2 million documents. When I untar this all these files sit in one single directory & any reads from this directory takes ages. Is there any way I can split the files from the tar into multiple new folders?
e.g.: I would like to create new folders named [1,2,...] each having 1000 files.

This is a quick and dirty solution but it does the job in Bash without using any temporary files.
i=0 # file counter
dir=0 # folder name counter
mkdir $dir
tar -tzvf YOURFILE.tar.gz |
cut -d ' ' -f12 | # get the filenames contained in the archive
while read filename
do
i=$((i+1))
if [ $i == 1000 ] # new folder for every 1000 files
then
i=0 # reset the file counter
dir=$((dir+1))
mkdir $dir
fi
tar -C $dir -xvzf YOURFILE.tar.gz $filename
done
Same as a one liner:
i=0; dir=0; mkdir $dir; tar -tzvf YOURFILE.tar.gz | cut -d ' ' -f12 | while read filename; do i=$((i+1)); if [ $i == 1000 ]; then i=0; dir=$((dir+1)); mkdir $dir; fi; tar -C $dir -xvzf YOURFILE.tar.gz $filename; done
Depending on your shell settings the "cut -d ' ' -f12" part for retrieving the last column (filename) of tar's content output could cause a problem and you would have to modify that.
It worked with 1000 files but if you have 1.2 million documents in the archive, consider testing this with something smaller first.

Obtain filename list with --list
Make files containing filenames with grep
untar only these files using --files-from
Thus:
tar --list archive.tar > allfiles.txt
grep '^1' allfiles.txt > files1.txt
tar -xvf archive.tar --files-from=files1.txt

If you have GNU tar you might be able to make use of the --checkpoint and --checkpoint-action options. I have not tested this, but I'm thinking something like:
# UNTESTED
cd /base/dir
mkdir $(printf "dir%04d\n" {1..1500}) # probably more than you need
ln -s dest0 linkname
tar -C linkname ... --checkpoint=1000 \
--checkpoint-action='sleep=1' \
--checkpoint-action='exec=ln -snf dest%u linkname ...

you can look at the man page and see if there are options like that. worst comes to worst, just extract the files you need (maybe using --exclude ) and put them into your folders.

tar doesn't provide that capability directly. It only restores its files into the same structure from which it was originally generated.
Can you modify the source directory to create the desired structure there and then tar the tree? If not, you could untar the files as they are in the file and then post-process that directory using a script to move the files into the desired arrangement. Given the number of files, this will take some time but at least it can be done in the background.

Related

Split files and process in chunks

I believe this is pretty simple, but so far I've had no luck.
I have a directory containing many large files which I split and move to /temp. I then proceed to process the files using script.R.
# create temp dir
mkdir -p "$1/temp"
# split and move files to /temp
for file in "$1"/*; do
split --verbose -b 10M --numeric-suffixes "$file" "$file"
mv -t "$1/temp" "$1/"*[0-99]
done
# process files in /temp
script.R "$1/temp"
The splitting of files result in nearly 8000 files. And for some reason the whole process crashes past a couple of thousand files. This is a problem which I have no idea how to construct a question for. :)
When I test this on smaller number of files it runs smoothly which is why I would like to perform the whole thing in chunks.
So how do I split, let's say, 10 files at a time, process them and then move on to the next 10 files.
I believe this can be achieved using xargs, nested for loops and other approaches... But welp, I'm a GNU noob.
Thanks in advance.

Could you please try this script:
cd "$1" || exit
# create temp dir
rm -rf temp && mkdir temp || exit
# split files to temp
for file in *; do
if [ -f "$file" ]; then
split --verbose -b 10M --numeric-suffixes "$file" temp/"$file"
fi
done
# process files in temp
script.R temp

Linux one line command to gzip and move

I have some .txt file in a particular /path/doc.txt and i wish to gzip all the files and move the new file that zipped all txt file into another path. How will i achieve that in one line of code.

maybe something like:
find /path/doc/ -type f -name \*.txt | xargs tar -z -c -f save.tar.gz && mv save.tar.gz other/path
use
tar -vtf save.tar.gz
to check archive content

How to recursively diff without transversing filesystems?

I'd like to create a patchset for two directory trees both of which contain (bind-)mounts which should be ignored. Is there any diff -r option similar to rsync's -x, --one-file-system? Or is another tool more appropriate for this? I considered using rsync --compare-dest, but the problem is a "diff"-directory obtained this way contains no information on file deletions.
Background: I want to store the modifications made to a chrooted-into Gentoo stage3 archive

As a workaround, I currently waste a lot of time by running rsync twice:
ORIG=/path/to/original
MOD=/path/to/modded
# find the modified/added files:
mkdir modded && rsync -axP --prune-empty-dirs --compare-dest=$ORIG $MOD/ modded
# the other way around, includes both deleted and modded files
mkdir deleted && rsync -axP --prune-empty-dirs --compare-dest=$MOD $ORIG/ deleted
# find the modded files and remove them
for i in $(find deleted); do [ -e modded${i#deleted} ] && rm $i; done
# delete the empty directories
find modded delete -type d -empty -delete
# create a list of the deleted files
cd deleted && find -type f > ../deleted.list && cd ..
# tar the modifications
cd modded && tar czf ../modded.tgz && cd ..
rm -rf deleted modded
Now modded.tgz contains the files that were modified/added, while deleted.list contains the names of deleted files, so to apply them run
tar xf modded.tgz
while read -r line; do rm $line; done < deleted.list
This can probably also be used to create a patchfile instead...

Shell script to sort & mv file based on date

Im new to unix,I have search a lot of info but still don not how to make it in a bash
What i know is used this command ls -tr|xargs -i ksh -c "mv {} ../tmp/" to move file by file.
Now I need to make a script that sorts all of these files by system date and moves them into a directory, The first 1000 oldest files being to be moved.
Example files r like these
KPK.AWQ07102011.66.6708.01
KPK.AWQ07102011.68.6708.01
KPK.EER07102011.561.8312.13
KPK.WWS07102011.806.3287.13
-----------This is the script tat i hv been created-------
if [ ! -d /app/RAID/Source_Files/test/testfolder ] then
echo "test directory does not exist!"
mkdir /app/RAID/Source_Files/calvin/testfolder
echo "unused_file directory created!"
fi
echo "Moving xx oldest files to test directory"
ls -tr /app/RAID/Source_Files/test/*.Z|head -1000|xargs -i ksh -c "mv {} /app/RAID/Source_Files/test/testfolder/"
the problem of this script is
1) unix prompt a syntax erro 'if'
2) The move command is working but it create a new filename testfolder instead move to directory testfolder (testfolder alredy been created in this path)
anyone can gv me a hand ? thanks

Could this help?
mv `ls -tr|head -1000` ../tmp/
head -n takes the n first lines of the previous command (here the 1000 oldest files). The backticks allow for the result of ls and head commands to be used as arguments to mv.

Performing grep operation in tar files without extracting

I have list of files which contain particular patterns, but those files have been tarred. Now I want to search for the pattern in the tar file, and to know which files contain the pattern without extracting the files.
Any idea...?

the tar command has a -O switch to extract your files to standard output. So you can pipe those output to grep/awk
tar xvf test.tar -O | awk '/pattern/{print}'
tar xvf test.tar -O | grep "pattern"
eg to return file name one pattern found
tar tf myarchive.tar | while read -r FILE
do
if tar xf test.tar $FILE -O | grep "pattern" ;then
echo "found pattern in : $FILE"
fi
done

The command zgrep should do exactly what you want, directly.
for example
zgrep "mypattern" *.gz
http://linux.about.com/library/cmd/blcmdl1_zgrep.htm

GNU tar has --to-command. With it you can have tar pipe each file from the archive into the given command. For the case where you just want the lines that match, that command can be a simple grep. To know the filenames you need to take advantage of tar setting certain variables in the command's environment; for example,
tar xaf thing.tar.xz --to-command="awk -e '/thing.to.match/ {print ENVIRON[\"TAR_FILENAME\"] \":\", \$0}'"
Because I find myself using this often, I have this:
#!/bin/sh
set -eu
if [ $# -lt 2 ]; then
echo "Usage: $(basename "$0") <pattern> <tarfile>"
exit 1
fi
if [ -t 1 ]; then
h="$(tput setf 4)"
m="$(tput setf 5)"
f="$(tput sgr0)"
else
h=""
m=""
f=""
fi
tar xaf "$2" --to-command="awk -e '/$1/{gsub(\"$1\", \"$m&$f\"); print \"$h\" ENVIRON[\"TAR_FILENAME\"] \"$f:\", \$0}'"

This can be done with tar --to-command and grep --label:
tar xaf archive.tar.gz --to-command 'egrep -Hn --label="$TAR_FILENAME" your_pattern_here || true'
--label gives grep the filename
-H tells grep to display the filename, and -n the line number
|| true because otherwise grep will exit with an error if the pattern is not found, and tar will complain about that.
xaf means to extract, and automagically decompress based off the file extension
--to-command has tar pass each file in the tarfile to a separate invocation of grep, and sets various environment variables with info about the file. See the manpage for more info.
Pretty heavily based off of Chipaca's answer (and Daniel H's comment), but this should be a bit easier to use and just uses tar and grep.

Python's tarfile module along with Tarfile.extractfile() will allow you to inspect the tarball's contents without extracting it to disk.

The easiest way is probably to use avfs. I've used this before for such tasks.
Basically, the syntax is:
avfsd ~/.avfs # Sets up a avfs virtual filesystem
rgrep pattern ~/.avfs/path/to/file.tar#/
/path/to/file.tar is the path to the actual tar file.
Pre-pending ~/.avfs/ (the mount point) and appending # lets avfs expose the tar file as a directory.

That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
For example:
ugrep -z PATTERN archive.tgz
This greps each of the archived files to display PATTERN matches with the archived filenames. Archived filenames are shown in braces to distinguish them from ordinary filenames. Everything else is the same as grep (ugrep has the same options and produces the same output). For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt
Tarballs (.tar.gz/.tgz, .tar.bz2/.tbz, .tar.xz/.txz, .tar.lzma/.tlz) are searched as well as .zip archives.

You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:
pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/
This should be much faster than iterating over each file and printing it to stdout, especially for compressed TARs.
Here is a simple comparison benchmark:
function checkFilesWithRatarmount()
{
local pattern=$1
local archive=$2
ratarmount "$archive" "$archive.mountpoint"
'grep' -r -l "$pattern" "$archive.mountpoint/"
}
function checkEachFileViaStdOut()
{
local pattern=$1
local archive=$2
tar --list --file "$archive" | while read -r file; do
if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
echo "Found pattern in: $file"
fi
done
}
function createSampleTar()
{
for i in $( seq 40 ); do
head -c $(( 1024 * 1024 )) /dev/urandom | base64 > $i.dat
done
tar -czf "$1" [0-9]*.dat
}
createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint
Results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:
Compression
Ratarmount
Bash Loop over tar -O
none
0.31 +- 0.01
0.55 +- 0.02
gzip
1.1 +- 0.1
13.5 +- 0.1
bzip2
1.2 +- 0.1
97.8 +- 0.2
Of course, these results are highly dependent on the archive size and how many files the archive contains. These test examples are pretty small because I didn't want to wait too long but they already show the problem. The more files there are, the longer it takes for tar -O to jump to the correct file. And for compressed archives, it will be quadratically slower the larger the archive size is because everything before the requested file has to be decompressed and each file is requested separately. Both of these problems are solved by ratarmount.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

UNIX untar content into multiple folders - unix

Obtain filename list with --list Make files containing filenames with grep untar only these files using --files-from Thus: tar --list archive.tar > allfiles.txt grep '^1' allfiles.txt > files1.txt tar -xvf archive.tar --files-from=files1.txt

you can look at the man page and see if there are options like that. worst comes to worst, just extract the files you need (maybe using --exclude ) and put them into your folders.

Related

Split files and process in chunks

Linux one line command to gzip and move

How to recursively diff without transversing filesystems?

Shell script to sort & mv file based on date

Performing grep operation in tar files without extracting

Categories

Resources