Unix: how to tar only the first N files of each folder?

I have a folder containing 2Gb of images, with sub-folders several levels deep.
I'd like to archive only N files of each (sub) folder in a tar file. I tried to use find then tail then tar but couldn't manage to get it to work. Here is what I tried (assuming N = 10):
find . | tail -n 10 | tar -czvf backup.tar.gz
… which outputs this error:
Cannot stat: File name too long
What's wrong here? Thinking about it, even if it worked I think it would tar only the first 10 files across all folders, not the first 10 files of each folder.
How can I get the first N files of each folder?

A proposal with some quirks: order is only determined by the order out of find, so "first" isn't well-defined.
find . -type f |
awk -v N=10 -F / 'match($0, /.*\//, m) && a[m[0]]++ < N' |
xargs -r -d '\n' tar -rvf /tmp/backup.tar
gzip /tmp/backup.tar
Comments:
use find . -type f so that every path has a leading directory-name prefix, which the next step relies on
the awk command tracks those leading directory names and emits full path names until N (10, here) files with the same leading directory have been emitted; the three-argument match() is a gawk extension, so this needs GNU awk
use xargs to invoke tar: we're gathering regular file names, and they need to be arguments to that archiving command (-d '\n' is a GNU xargs option)
xargs may invoke tar more than once, so we append (-r option) to a plain archive, then compress it after it's all written
Also, you may not want to write a backup file into the current directory, since you're scanning that - that's why this suggestion writes into /tmp.
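If you want "first" to mean something stable, one option is to sort the paths before counting. Here is a minimal sketch of that idea in plain POSIX sh and awk (no gawk match()). The helper name first_n_per_dir is made up for illustration, and it assumes file names contain no newlines:

```shell
# first_n_per_dir N DIR: print at most N regular files from each directory
# under DIR, in byte-sorted path order. Hypothetical helper name.
# Assumption: no file name contains a newline.
first_n_per_dir() {
    n=$1
    dir=$2
    find "$dir" -type f | LC_ALL=C sort |
    awk -v N="$n" '{
        d = $0
        sub("/[^/]*$", "", d)        # drop the last component, leaving the directory
        if (seen[d]++ < N) print     # emit only the first N paths per directory
    }'
}
```

The output can then be fed to tar, for example first_n_per_dir 10 . | tar -cf /tmp/backup.tar -T - on a tar that supports -T (GNU tar and bsdtar do).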

Related

Mac OS: How to use RSYNC to copy files modified within the last 24 hours and keep folder structure?

It's a simple question that I can't seem to figure out. I'm on a Mac with Big Sur with all the latest updates, and I'm going through Terminal to get these commands to run. If there's a better way please let me know.
This is, in basic terms, what I'm trying to do--I want RSYNC to recursively go through a source directory (which in this case would ideally be an entire drive), find any files modified within the last 24 hours, and copy those to another drive, while preserving the folder structure. So if I have:
/Volumes/Drive1/Folder1/File1.file
/Volumes/Drive1/Folder1/File2.file
/Volumes/Drive1/Folder1/File3.file
And File1 has been modified in the last 24 hours, but the other two haven't, I want it to copy that file, so that on the second drive I wind up with:
/Volumes/Drive2/Folder1/File1.file
But without copying File2 and File3.
I've tried a lot of different solutions and strings, but I'm running into problems. The closest I've been able to get is this:
find /Volumes/Drive1/ -type f -mtime -1 -exec cp -a "{}" /Volumes/Drive2/ \;
The problem is that while this one does go through Drive1 and find all the files newer than a day like I want, when it copies them it just dumps them all into the root of Drive2.
This one also seems to come close:
rsync --progress --files-from=<(find /Volumes/Drive1/ -mtime -1 -type f -exec basename {} \;) /Volumes/Drive1/ /Volumes/Drive2/
This one also identifies all the files modified in the last 24 hours, but instead of copying them it gives an error, "link_stat (filename and path) failed: no such file or directory (2)."
I've spent several days trying to figure out what I'm doing wrong but I can't figure it out. Help please!
I think this'll work:
srcDir=/Volumes/Drive1
destDir=/Volumes/Drive2
(cd "$srcDir" && find . -type f -mtime -1 -print0) |
while IFS= read -r -d $'\0' filepath; do
mkdir -p "$(dirname "$destDir/$filepath")"
cp -a "$srcDir/$filepath" "$destDir/$filepath"
done
Explanation:
Using cd "$srcDir"; find . -whatever will generate relative paths (starting with "./") from the source directory to the found files; that means appending the results to $srcDir and $destDir will give the full source and destination paths for each file.
Putting it in parentheses makes it run in a subshell, so the cd won't affect other commands. Coupling cd and find with && means that if cd fails, it won't run find (which would run in the wrong place, generate a list of the wrong files, and generally cause trouble).
Using -print0 and while IFS= read -r -d $'\0' is a standard weird-filename-safe way of iterating over found files (see BashFAQ #20). Note that if anything in the loop reads from standard input (e.g. cp -i asking for confirmation), it'll steal part of the file list; if this is a worry, use this variant (instead of the pipe) to send the file list over file descriptor #3 instead of standard input:
while IFS= read -r -d $'\0' filepath <&3; do
...
done 3< <(cd "$srcDir" && find . -type f -mtime -1 -print0)
Finally, mkdir -p is used to make sure the destination directory exists, and then cp to copy the file.
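If you'd rather avoid the bash-specific read -d, the same idea can be sketched with find -exec and a small inline sh script. This is only a sketch: the function name copy_recent is made up, it assumes the destination directory already exists, and it uses cp -p rather than cp -a:

```shell
# copy_recent SRC DEST: copy files modified within the last day from SRC
# into DEST, recreating the directory structure. Hypothetical helper name.
# Assumption: DEST already exists.
copy_recent() {
    src=$1
    dest=$(cd "$2" && pwd) || return 1    # make DEST absolute before we cd
    (
        cd "$src" || exit 1
        # find passes batches of relative paths ("./Folder1/File1.file")
        find . -type f -mtime -1 -exec sh -c '
            dest=$1; shift
            for f do
                mkdir -p "$dest/$(dirname "$f")" &&
                cp -p "$f" "$dest/$f"
            done
        ' sh "$dest" {} +
    )
}
```

Usage would be something like copy_recent /Volumes/Drive1 /Volumes/Drive2. Because find hands the names straight to sh as arguments, spaces and other odd characters in file names are not a problem.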

Is there a method to enter every subdirectory in a directory and perform analysis on file with certain extension?

I have one directory, with multiple subdirectories. In each subdirectory there is a file on which I want to perform analysis (code already written).
Common for all subdirectories is that they have file with same extension on which analysis should be performed.
Using Unix shell, is there a way to write a commands which will:
for each subdirectory in main directory, use file with certain extension and perform further commands on that file (further commands include creation of some new directories and files)
repeat it for all subdirectories in main directory and files inside them
I will appreciate all suggestions.
Use the find command. find . -type f -name '*.txt' -exec prog \{} \; will execute program prog with the name of every file in the current directory . and below with the extension .txt (i.e. that matches the pattern *.txt). The -type f excludes directories (and pipes and devices). The -exec means execute this command; the \{} will be replaced with the filename; \; means end of command.
The -exec form is safe even if your filenames contain spaces, quote marks, or backslashes, because find passes each name to prog as a single argument without involving the shell. If you would rather pipe the list to xargs, use the null-terminated form so such names survive: find . -type f -name '*.txt' -print0 | xargs -0 -n1 prog, assuming the filename argument goes at the end of the line. The -print0 means output each filename with null termination (zero character) and the -0 means read input with null termination. xargs takes its input and invokes prog for every null-terminated word. -n1 means only use one argument per invocation; you can omit it if the program accepts multiple filenames as arguments. You can use -I {} if you need to insert text after the argument (GNU xargs also accepts the older -i, which is deprecated).
Note: I am aware that using -exec for various obscure reasons may not be preferable for, say, secure system shell scripts, but for a use case like this it is fine.
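Since the analysis creates new directories and files, you may want it to run from inside each file's own subdirectory. A rough sketch of that, assuming a made-up wrapper called for_each_match that takes the command to run as its trailing arguments:

```shell
# for_each_match TOP PATTERN CMD [ARGS...]: for every regular file under TOP
# whose name matches PATTERN, cd into that file's directory and run
# CMD ARGS... <basename-of-file>. Hypothetical helper name.
for_each_match() {
    top=$1
    pattern=$2
    shift 2
    find "$top" -type f -name "$pattern" -exec sh -c '
        file=$1; shift
        cd "$(dirname "$file")" && "$@" "$(basename "$file")"
    ' sh {} "$@" \;
}
```

For example, for_each_match . '*.txt' wc -l runs wc -l on each .txt file from within the directory that contains it. Anything the command creates with a relative path ends up next to the input file.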

Exclude files from tar gzipping a directory in unix

I have a directory (dir) (with files and subdirectories):
ls -1 dir
plot.pdf
subdir.1
subdir.2
obj.RDS
And then ls -1 for either subdir.1 or subdir.2:
plot.pdf
PC.pdf
results.csv
de.pdf
de.csv
de.RDS
I would like to tar and gzip dir (in unix) and I'd like to exclude all RDS files (both the one at the level right below dir and the ones in its subdirectories).
What's the easiest way to achieve that? Perhaps in a one liner
Something like:
find dir -type f -not -name '*.RDS' -print0 |
tar --null -T- -czf TARGET.tgz
should do it.
First, find finds the files, and then tar accepts the list via -T- (= --files-from /dev/stdin).
-print0 on find combined with --null on tar protects against weird filenames.
-czf == Create gZipped File
You can add v to get verbose output.
To later inspect the contents, you can do:
tar tf TARGET.tgz
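tar --exclude='*.RDS' -czf outputball.tar.gz dir_to_compress
this will ignore *.RDS files in dir_to_compress and any of its subdirectories (quote the pattern so the shell does not expand it; -z selects gzip, whereas -J would select xz)
decompress using
tar -xzvf outputball.tar.gz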
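To convince yourself the exclusion worked, you can list the archive and look for leftovers. A small sketch, assuming a tar that understands --exclude (GNU tar and bsdtar both do); the helper names archive_without and count_in_archive are invented here:

```shell
# archive_without OUT DIR PATTERN: gzip-compressed tar of DIR that skips
# entries whose names match the glob PATTERN. Hypothetical helper name.
# Assumption: tar supports --exclude (GNU tar, bsdtar).
archive_without() {
    tar --exclude="$3" -czf "$1" "$2"
}

# count_in_archive ARCHIVE REGEX: print how many archive members match REGEX.
count_in_archive() {
    tar -tzf "$1" | grep -c -- "$2"
}
```

After archive_without TARGET.tgz dir '*.RDS', running count_in_archive TARGET.tgz '\.RDS$' should print 0.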

Find and tar files on Solaris

I've got a little problem with my bash script. I'm a newbie in the Unix world, so I'm finding this exercise difficult. What I have to do is find files on a Solaris server with a specific name, modified within a specific time, and archive them in one .tar file. The first two points are easy, but I'm having a nightmare trying to archive them. The thing is, I keep archiving the whole directory tree (with the file at the end) into the .tar file, but I need just the file. My code looks like this:
find ~ -name "$maska" -mtime -$dni | xargs -t -L 1 tar -cvf $3 -C
where $maska is the name of the file, $dni refers to the modification time and $3 is just the archive name. I found out about the -C switch, which lets me jump into the folder where the desired file is, but when I use it with xargs, it seems to just jump there and do nothing else.
So my question is:
1) is there any possibility of achieving my goal this way?
Please remember, I don't work on gnu tar. And I HAVE TO use commands: tar, find.
Edit: I'd like to specify my problem in more detail. When I use the script for, say, file a, it should search from the starting point given in the script (here ~), and everything it finds should end up in one tar file.
What I got right now is (I'm in /home/me/Scripts):
-bash-3.2$ ./Script.sh a 1000 backup
a /home/me/Program/Test/a/ 0K
a /home/me/Program/Test/a/a.c 1K
a /home/me/Program/Test/a/a.out 8K
So script has done some packing. Next I want to see my packed file, so:
-bash-3.2$ tar -tf backup
/home/me/Program/Test/a/
/home/me/Program/Test/a/a.c
/home/me/Program/Test/a/a.out
And that's the problem. The tar file has all the paths in it, so if I untar it, instead of getting just the files I wanted to archive, they are put back in their old places. To illustrate:
-bash-3.2$ ls
Script.sh* Script.sh~* backup
-bash-3.2$ tar -xvf backup
x /home/me/Program/Test/a, 0 bytes, 0 tape blocks
x /home/me/Program/Test/a/a.c, 39 bytes, 1 tape blocks
x /home/me/Program/Test/a/a.out, 7928 bytes, 16 tape blocks
-bash-3.2$ ls
Script.sh* Script.sh~* backup
That's the problem.
So all I want is to pack all the desired files (a in the example above) into one tar file without those paths, so that it simply untars into the directory I run Script.sh from.
I'm not sure I understand what you want, but this might be it:
find ~ -name "$maska" -mtime "-$dni" -exec tar cvf "$3" {} +
Edit: second attempt, after you wrote that the main issue is the absolute path:
( cd ~ && find . -name "$maska" -type f -mtime "-$dni" -exec tar cvf "$3" {} + )
Edit: third attempt, after you wrote that you want no path at all in the archive, that maska is a directory name, and that $3 needs to be in the current directory:
mkdir ~/foo && \
find ~ -name "$maska" -type d -mtime "-$dni" -exec sh -c 'ln -s "$1"/* ~/foo/' sh {} \; && \
( cd ~/foo && tar chf - * ) > "$3" && \
rm -rf ~/foo
Replace ~/foo with ~/somethingElse if ~/foo already exists for some reason.
Maybe you can do something like this:
#!/bin/bash
# IFS= and -r keep leading whitespace and backslashes in file names intact
find ~ -name "$maska" -mtime "-$dni" -print0 | while IFS= read -r -d '' file; do
d=$(dirname "$file")
f=$(basename "$file")
echo "$d: $f" # Show directory and file for debug purposes
tar -rvf tarball.tar -C "$d" "$f"
done
I don't have a Solaris box at hand for testing :-)
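A variant of the same per-file tar -r ... -C idea that avoids -print0 and read -d (neither of which I'd count on outside GNU find and bash) could look roughly like this. It is an untested-on-Solaris sketch; the function name tar_flat and its argument order are made up, and it assumes your tar accepts r together with -C:

```shell
# tar_flat ARCHIVE TOP NAME DAYS: append every regular file called NAME under
# TOP, modified within DAYS days, to ARCHIVE with its directory part stripped.
# Hypothetical helper name. Assumption: tar supports "r" (append) and -C.
tar_flat() {
    archive=$1 top=$2 name=$3 days=$4
    case $archive in
        /*) ;;                           # already absolute
        *)  archive=$PWD/$archive ;;     # anchor it to the caller's directory
    esac
    find "$top" -type f -name "$name" -mtime "-$days" -exec sh -c '
        archive=$1; shift
        for f do
            # -C switches to the directory of the file, so only the base name is stored
            tar -rf "$archive" -C "$(dirname "$f")" "$(basename "$f")" || exit 1
        done
    ' sh "$archive" {} +
}
```

Called as tar_flat backup.tar ~ "$maska" "$dni", it leaves backup.tar in the directory you ran it from. Note that two files with the same name in different directories end up as two members with the same name, so the later one wins on extraction.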
First of all, my assumptions:
1. "one tar file", like you said, and
2. no absolute paths, i.e. if you back up ~/dir/file, you should be able to test extracting it in /tmp, obtaining /tmp/dir/file.
If the problem is the full paths, you should replace
find ~ # etc
with
cd ~ || exit
find . # etc
If, instead, the name of the tar archive is not an absolute path, it should be something like
(
cd ~ || exit
find . etc etc | xargs tar cf - etc etc
) > $3
Explanation
"(...)" runs a subshell, meaning some of the things you change in there have no effect outside of the parens; the current directory is one of them, so "(cd whatever; foo)" means you run another shell, change its current directory, run foo from there, and then you're back in your script which never changed directory.
"cd ~ || exit" is paranoia, it means "cd ~; if that fails, exit".
"." is an alias meaning "the current directory, whatever that is"; play with "find ." vs "find ~" if you don't know what it means, you'll understand it better than if I explained it here.
"tar cf -" means that you create the tar archive on standard output; I think the syntax is portable enough, you may have to replace "-" with "/dev/stdout" or whatever works on solaris (the simplest solution is simply "tar", without the "c" command, but it's ugly to read).
The final "> $3", outside of the parens, is output redirection: rather than writing the output to the terminal, you save it into a file.
So the whole script reads like this:
- open a subshell
- change the subshell's current directory to ~
- in the subshell, find the files newer than requested, archive them, and write the contents of the resulting tar archive to standard output
- the subshell's stdout is saved to $3; because the redirection is outside the parens, relative paths are resolved relative to your script's $PWD, meaning that e.g. if you run the script from the /tmp directory you'll get a tar archive in the /tmp directory (it would be in ~ if the redirection happened in the subshell).
If I misunderstood your question, the solution doesn't work or the explanation isn't clear let me know (the answer is too long, but I already know that :).
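Put together, the subshell-plus-redirection pattern described above might look like the sketch below. The function name tar_relative is made up, and I've used find -exec ... + instead of xargs so odd file names survive; if the file list is long enough for find to run tar more than once, the output would be several archives back to back, so treat it as an illustration rather than a finished script:

```shell
# tar_relative TOP OUT: archive the regular files under TOP with paths
# relative to TOP, writing the archive to OUT. Hypothetical helper name.
tar_relative() {
    (
        cd "$1" || exit 1
        # "tar cf -" writes the archive to the subshell's standard output
        find . -type f -exec tar cf - {} +
    ) > "$2"    # redirection is outside the parens: OUT is relative to the caller
}
```

Running tar_relative ~ backup.tar from /tmp leaves /tmp/backup.tar containing names like ./dir/file, and your shell is still in /tmp afterwards.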
The pax command will output tar-compatible archives and has the flexibility you need to rewrite pathnames.
find ~ -name "$maska" -mtime -$dni | pax -w -x ustar -f "$3" -s '!.*/!!'
Here is what the options mean, paraphrasing from the man page:
-w write the contents of the file operands to the standard output (or to the pathname specified by the -f option) in an archive format.
-x ustar the output archive format is the extended tar interchange format specified in the IEEE POSIX standard.
-s '!.*/!!' Modifies file operands according to the substitution expression, using regular expression syntax. Here, it deletes all characters in each file name from the beginning to the final /.

Unix shell file copy flattening folder structure

On the UNIX bash shell (specifically Mac OS X Leopard) what would be the simplest way to copy every file having a specific extension from a folder hierarchy (including subdirectories) to the same destination folder (without subfolders)?
Obviously there is the problem of having duplicates in the source hierarchy. I wouldn't mind if they are overwritten.
Example: I need to copy every .txt file in the following hierarchy
/foo/a.txt
/foo/x.jpg
/foo/bar/a.txt
/foo/bar/c.jpg
/foo/bar/b.txt
To a folder named 'dest' and get:
/dest/a.txt
/dest/b.txt
In bash:
find /foo -iname '*.txt' -exec cp \{\} /dest/ \;
find will find all the files under the path /foo matching the wildcard *.txt, case insensitively (That's what -iname means). For each file, find will execute cp {} /dest/, with the found file in place of {}.
The only problem with Magnus' solution is that it forks off a new "cp" process for every file, which is not terribly efficient especially if there is a large number of files.
On Linux (or other systems with GNU coreutils) you can do:
find . -name "*.xml" -print0 | xargs -0 echo cp -t a
(The -0 allows it to work when your filenames have weird characters -- like spaces -- in them.)
Unfortunately I think Macs come with BSD-style tools. Anyone know a "standard" equivalent to the "-t" switch?
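One way to get the batching without -t is to let find hand the names to a tiny sh script that puts the destination last. A minimal sketch, with a made-up function name flat_copy_batched:

```shell
# flat_copy_batched TOP PATTERN DEST: copy every regular file under TOP whose
# name matches PATTERN into DEST, using as few cp processes as find's
# batching allows. Hypothetical helper name.
flat_copy_batched() {
    top=$1 pattern=$2 dest=$3
    find "$top" -type f -name "$pattern" -exec sh -c '
        dest=$1; shift
        cp "$@" "$dest"/     # all found files first, target directory last
    ' sh "$dest" {} +
}
```

For the example in the question that would be flat_copy_batched /foo '*.txt' /dest. When two source files in one batch share a name, cp's behavior varies (GNU cp refuses to overwrite the file it just created), so this suits the case where duplicates don't matter or don't exist.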
The answers above don't allow for name collisions as the asker didn't mind files being over-written.
I do mind files being overwritten, so I came up with a different approach. Replacing each / in the path with - keeps the hierarchy in the names and puts all the files in one flat folder.
We use find to get the list of all files, then awk to create a mv command with the original filename and the modified filename then pass those to bash to be executed.
find ./from -type f | awk '{ str=$0; sub(/\.\//, "", str); gsub(/\//, "-", str); print "mv " $0 " ./to/" str }' | bash
where ./from and ./to are directories to mv from and to.
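Generating mv commands as text and piping them to bash breaks on names with spaces or quotes. Here is a sketch of the same renaming scheme that never passes file names through the shell parser. The function name flatten_copy is invented, it copies rather than moves so nothing is lost while you try it, and it assumes the destination directory already exists:

```shell
# flatten_copy SRC DEST: copy every regular file under SRC into DEST,
# naming each copy after its relative path with "/" replaced by "-".
# Hypothetical helper name. Assumption: DEST already exists.
flatten_copy() {
    src=$1
    dest=$(cd "$2" && pwd) || return 1    # make DEST absolute before we cd
    (
        cd "$src" || exit 1
        find . -type f -exec sh -c '
            dest=$1; shift
            for f do
                rel=${f#./}                            # "./bar/a.txt" -> "bar/a.txt"
                flat=$(printf "%s" "$rel" | tr / -)    # "bar/a.txt"   -> "bar-a.txt"
                cp "$f" "$dest/$flat"
            done
        ' sh "$dest" {} +
    )
}
```

With flatten_copy ./from ./to, a file ./from/bar/a.txt becomes ./to/bar-a.txt. Swap cp for mv once you're happy with the result.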
If you really want to run just one command, why not cons one up and run it? Like so:
$ find /foo -name '*.txt' | xargs echo | sed -e 's/^/cp /' -e 's|$| /dest|' | bash -sx
But that won't matter too much performance-wise unless you do this a lot or have a ton of files. Be careful of name collisions, however. I noticed in testing that GNU cp at least warns of collisions:
cp: will not overwrite just-created `/dest/tubguide.tex' with `./texmf/tex/plain/tugboat/tubguide.tex'
I think the cleanest is:
$ find /foo -name '*.txt' | xargs -I {} cp {} /dest
Less syntax to remember than the -exec option.
As far as the man page for cp on a FreeBSD box goes, there's no need for a -t switch. cp will assume the last argument on the command line to be the target directory if more than two names are passed.
