Find when tar finds n matches using wildcards

I'm trying to extract, from a huge tar file, some files matched by wildcard patterns in a list. I'm using a loop to read the list, but moving from one element of the list to the next takes too long, I'm guessing because tar tries to match each element against the whole tar file. I want the loop to continue with the next element as soon as the current element has had 2 matches.
while read line;do
tar --wildcards -xzvf file.tar.gz "$line"
done <$file
One line of the list looks like this:
dataset/0113947.*

I went aggressive and killed the tar process as soon as it finds two files. Here is my solution:
file=list.txt
while read line;do
tar --wildcards --checkpoint=10000 --checkpoint-action=exec='sh stop.sh dummy.txt 1' -xzvf file.tar.gz "$line" > dummy.txt
done <$file
where stop.sh checks whether dummy.txt already has at least two lines and, if so, kills the tar process:
#!/bin/sh
# $1 is the file that tar's verbose output is redirected to
n=$(wc -l < "$1")
if [ "$n" -gt 1 ]; then
    # grab tar's PID and kill it (cut relies on the column spacing of ps aux)
    kill $(ps aux | grep "[t]ar --wildcards" | cut -d " " -f 4)
fi
I had to use cut to recover the process ID because the single quotes needed by awk were causing trouble.
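As an aside, a hedged alternative for stop.sh that avoids parsing ps columns, assuming pkill (from procps) is available:
#!/bin/sh
# stop.sh: once the capture file ($1) has at least two lines, kill the running tar
n=$(wc -l < "$1")
if [ "$n" -gt 1 ]; then
    pkill -f 'tar --wildcards'   # match against tar's full command line
fi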

Related

Append "/" to end of directory

Completely noob question, but: using ls piped to grep, I need to find files or directories that have all-capital names, and directories need to have "/" appended to indicate that they are directories. Appending the "/" is the only part I am stuck on. Again, I apologize for the amateur question. I currently have ls | grep [A-Z] and the example output should be: BIRD, DOG, DOGDIR/
It's an interesting question because it's a somewhat difficult thing to accomplish with a bash one-liner.
Here's what I came up with. It doesn't seem very elegant, but I'm not sure how to improve.
find /animals -type d -or -type f \
| grep '/[A-Z]*$' \
| xargs -I + bash -c 'echo -n $(basename +)$( test -d + && echo -n /),\\ ' \
| sed -e 's/, *$//'; echo
I'll break that down for you
find /animals -type d -or -type f writes out, one per line, the directories and files it found in /animals (see below for my test environment Dockerfile - I created /animals to match your desired output). find can't do a regex match on the name, as far as I know, so...
grep '/[A-Z]*$' filters find's output so that only paths are shown where the last part of the file or directory name, after the final /, is all uppercase
xargs -I + bash -c '...' when you're in a shell and you want to use a "for" loop, chances are what you should be using is xargs. Learn it, know it, love it. xargs takes its input, separated by default by $IFS, and runs the command you give it for each piece of input. So this is going to run a bash shell for each path that passed the grep filter. In my case, -I + makes xargs replace the literal '+' character with its current input filename. -I also makes xargs pass the inputs through one at a time. For more information, see the xargs manual page.
'echo -n $(basename +)$( test -d + && echo -n /),\\ ' this is the inner bash script that will be run by xargs for each path that got through grep.
basename + cuts the directory component off the path; from your example output you don't want eg /animals/DOGDIR/, you want DOGDIR/. basename is the program that trims the directories for us.
test -d + && echo -n / checks whether + (remember xargs will replace it with the filename) is a directory, and if so, runs echo -n /. The -n argument to echo suppresses the newline, which is important to get the output in the CSV format you specified.
Now we can put it all together: we echo -n the output of basename +, with / appended if it's a directory, and then , appended to that. All the echos run with -n to suppress newlines and keep the output looking like CSV.
| sed -e 's/, *$//'; echo is purely for formatting. Adding , to each individual output was an easy way to get the CSV, but it leaves us with a final , at the end of the list. The sed invocation removes a , followed by any number of spaces at the end of the output so far - i.e. the entire output from all the xargs invocations. And since we never output a newline at the end of that output, the final echo adds it.
Usually in unix shells, you probably wouldn't want CSV-style output. In most cases you'd instead want newline-separated output, one matching file per line, and that would be somewhat simpler to do because you wouldn't need all that faffing with -n and , to make it CSV style. But it's a valid requirement if the need is there; see the sketch below.
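A hedged sketch of that simpler newline-separated variant (same pipeline, with the CSV bits dropped):
find /animals -type d -or -type f \
| grep '/[A-Z]*$' \
| xargs -I + bash -c 'echo $(basename +)$(test -d + && echo /)'
And here is the Dockerfile for my test environment: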
FROM debian
RUN mkdir -p /animals
WORKDIR /animals
RUN mkdir -p DOGDIR lowerdir && touch DOGDIR/DOG DOGDIR/lowerDOG2 lowerdir/BIRD
ENTRYPOINT [ "/bin/bash" ]
CMD [ "-c" , "find /animals -type d -or -type f | grep '/[A-Z]*$'| xargs -I + bash -c 'echo -n $(basename +)$( test -d + && echo -n /),\\ ' | sed -e 's/, *$//'; echo"]
$ docker run --rm test
BIRD, DOGDIR/, DOG
You can start looking at
ls -F | grep -v "[[:lower:]]"
I did not add anything to produce a comma-separated line, because that is the wrong method: parsing ls should be avoided! It will go wrong for filenames like
I am a terribble filename,
with newlines inside me,
and the ls command combined with grep
will only show the last line
BECAUSE THIS LINE HAS NO LOWERCASE CHARACTERS
To get the files without a pipe, you can use
shopt -s extglob
ls -dp +([[:upper:]])
shopt -u extglob
An explanation of the extglob and uppercase can be found at https://unix.stackexchange.com/a/389071/57293
When you want the output on one line, you can get into trouble with filenames that have newlines or commas in their names. You might want something like
# parsing ls, yes wrong and failing for some files
ls -dp +([[:upper:]]) | tr "\n" "," | sed 's/,$/\n/'
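If you do want the comma-separated single line without parsing ls, a hedged sketch using a plain bash loop (it will still misbehave for names that themselves contain commas or newlines):
shopt -s extglob nullglob
out=""
for f in +([[:upper:]]); do
    # append a trailing / for directories, as ls -p would
    [ -d "$f" ] && f="$f/"
    out="$out$f, "
done
printf '%s\n' "${out%, }"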

unix combine grep w and v command

I want to search a file and include the text #!/bin/bash, but exclude any other line that has a # sign. These two commands: grep -w '#!/bin/bash' file and grep -v '^#' file each do one part of this job. I would like this to be a single command, so here's what I've tried.
grep -w '#!/bin/bash' | grep -v '^#' file
This excludes lines beginning with #, but doesn't include the line #!/bin/bash
grep -w '#!/bin/bash' -v '^#' file
This just prints every line but #!/bin/bash
grep "^[^#]\|^#\!/bin/bash$" test.sh
Explanation:
^[^#] matches lines that start with something other than #
\| is an OR
^#\!/bin/bash$ matches exactly the line #!/bin/bash
So .. it looks as if you're trying to strip comments from bash files without removing their shebang.
The grep command can search for regular expressions, but isn't so good at applying rules of logic. You could do something like this:
grep -v '^#[^!]' input.sh
But you'd fail to strip comments that are affixed to the ends of lines. Note that I'm being a little more liberal with this regex, since it's entirely possible that a script might use something other than /bin/bash for its shebang. :-)
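For example, with a hypothetical demo.sh, the pattern drops full-line comments but, as noted, leaves trailing comments (and the shebang) alone:
$ printf '#!/bin/sh\n# a comment\necho hi # trailing comment\n' > demo.sh
$ grep -v '^#[^!]' demo.sh
#!/bin/sh
echo hi # trailing comment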
Another possibility would be to use awk. This lets you apply logic that cannot be expressed within a regular expression. For example, if you want to keep the commented line only if it is a shebang on the first line of the file, and remove all other comments, awk can express that as follows:
awk '
NR==1 && /^#!/; # if we're on the first line and it is a shebang, print it.
/^#/ { next } # if this is a comment line, skip it.
1 # print everything else.
' input.sh
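For example, with a hypothetical input.sh:
$ printf '#!/bin/bash\n# setup\necho hello\n' > input.sh
$ awk 'NR==1 && /^#!/; /^#/ { next } 1' input.sh
#!/bin/bash
echo hello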

Rename files based on pattern in path

I have thousands of files named "DOCUMENT.PDF" and I want to rename them based on a numeric identifier in the path. Unfortunately, I don't seem to have access to the rename command.
Three examples:
/000/000/002/605/950/ÐÐ-02605950-00001/DOCUMENT.PDF
/000/000/002/591/945/ÐÐ-02591945-00002/DOCUMENT.PDF
/000/000/002/573/780/ÐÐ-02573780-00002/DOCUMENT.PDF
To be renamed as, without changing their parent directory:
2605950.pdf
2591945.pdf
2573780.pdf
Use a for loop, and then use the mv command
for file in ./*/DOCUMENT.PDF   # adjust the glob so it matches your DOCUMENT.PDF paths
do
    # take the second-to-last path component (e.g. ÐÐ-02605950-00001) and keep the second dash-separated field
    num=$(echo "$file" | awk -F "/" '{print $(NF-1)}' | cut -d "-" -f2)
    mv "$file" "$(dirname "$file")/$num.pdf"
done
You could do this with globstar in Bash 4.0+:
cd _your_base_dir_
shopt -s globstar
for file in **/DOCUMENT.PDF; do # loop picks only DOCUMENT.PDF files
# here, we assume that the serial number is extracted from the 7th component in the directory path - change it according to your need
# and we don't strip out the leading zero in the serial number
new_name=$(dirname "$file")/$(cut -f7 -d/ <<< "$file" | cut -f2 -d-).pdf
echo "Renaming $file to $new_name"
# mv "$file" "$new_name" # uncomment after verifying
done
See this related post that talks about a similar problem: How to recursively traverse a directory tree and find only files?
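A hedged alternative sketch that needs neither globstar nor awk, using find and shell parameter expansion (it assumes the identifier always sits between the first and last dash of the parent directory name):
find . -type f -name DOCUMENT.PDF | while IFS= read -r file; do
    dir=$(dirname "$file")
    id=${dir##*/}    # e.g. ÐÐ-02605950-00001
    id=${id#*-}      # drop everything up to the first dash  -> 02605950-00001
    id=${id%-*}      # drop the trailing counter             -> 02605950
    echo mv "$file" "$dir/$id.pdf"   # remove the echo after verifying
done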

Print labels using awk

On my FreeBSD 10.1 I'm writing a little piece of code that basically calls ls and automatically breaks the results down into something like this:
directory:
2.4M .git
528K src
380K dist
184K test
file:
856K CONDUCT.md
20K README.md
........
You only need to list directories and regular files, and you don't have to list . and .., but you do have to list hidden files, and sort each group from largest to smallest separately.
The challenge is to complete it as a one-line command without using $(cmd), &&, ||, >, >>, <, ;, & and within 12 pipes (back quotes count as well).
Currently my progress is:
ls -Alh | sort -d -h -r |
awk 'BEGIN {print "Directories:"}
NR>1 {if(substr($1,1,1)~"d")print" "$5" "$9}'
which prints out only up to the last directory entry. But since the awk body runs once for every record, I can't find a way to print "files:" only once and then print out the remaining output.
Well, you may have to store the files in an array and print at the end:
ls -Alh|sed 1d|
sort -h -k5r|
awk 'BEGIN {print "Directories:"}
/^d/{print "\t"$5"\t"$9}
/^-/{f[n++]="\t"$5"\t"$9}
END{print "Files:";
for(i=0;i<n;++i)print f[i]}'
One additional problem you'll need to work out: files and dirs may have spaces in the name, and the simple $9 will be insufficient for that case.
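A hedged sketch of one way to handle spaces while still (unsafely) parsing ls: rebuild the name from field 9 onward. Names containing newlines will still break it, and runs of spaces collapse to one:
ls -Alh | sed 1d | sort -h -k5r |
awk 'BEGIN{print "Directories:"}
     {name=$9; for(i=10;i<=NF;++i) name=name" "$i}
     /^d/{print "\t"$5"\t"name}
     /^-/{f[n++]="\t"$5"\t"name}
     END{print "Files:"; for(i=0;i<n;++i) print f[i]}'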

Breaking out of "tail -f" that's being read by a "while read" loop in HP-UX

I'm trying to write a (sh, Bourne shell) script that processes lines as they are written to a file. I'm attempting to do this by feeding the output of tail -f into a while read loop. This tactic seems proper based on my research on Google, as well as this question dealing with a similar issue but using bash.
From what I've read, it seems that I should be able to break out of the loop when the file being followed ceases to exist. It doesn't. In fact, it seems the only way I can break out of this is to kill the process in another session. tail does seem to be working fine otherwise as testing with this:
touch file
tail -f file | while read line
do
echo $line
done
Data I append to file in another session appears just fine from the loop processing written above.
This is on HP-UX version B.11.23.
Thanks for any help/insight you can provide!
If you want to break out when your file no longer exists, just do it:
test -f file || break
Placing this in your loop should break out of it.
The remaining problem is how to break out of the read line, as it blocks.
You could do this by applying a timeout, like read -t 5 line. Then the read returns every 5 seconds, and in case the file no longer exists, the loop will break. Attention: write your loop so that it can handle the case where the read times out but the file is still present.
EDIT: It seems that read returns false on a timeout, so you can combine the test with the timeout; the result would be:
tail -f test.file | while read -t 3 line || test -f test.file; do
echo "$line"   # do some stuff with $line
done
I don't know about HP-UX tail but GNU tail has the --follow=name option which will follow the file by name (by re-opening the file every few seconds instead of reading from the same file descriptor which will not detect if the file is unlinked) and will exit when the filename used to open the file is unlinked:
tail --follow=name test.txt
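So, assuming GNU tail, the loop from the question can stay as it is; once test.txt is unlinked, tail exits and the pipeline ends (a sketch):
tail --follow=name test.txt | while read line
do
    echo "$line"
done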
Unless you're using GNU tail, there is no way it'll terminate of its own accord when following a file. The -f option is really only meant for interactive monitoring--indeed, I have a book that says that -f "is unlikely to be of use in shell scripts".
But for a solution to the problem, I'm not wholly sure this isn't an over-engineered way to do it, but I figured you could send the tail to a FIFO, then have a function or script that checked the file for existence and killed off the tail if it'd been unlinked.
#!/bin/sh
sentinel ()
{
    # poll until the watched file disappears, then kill the tail and clean up
    while true
    do
        if [ ! -e "$1" ]
        then
            kill "$2"
            rm "/tmp/$1"
            break
        fi
        sleep 1    # avoid spinning at 100% CPU
    done
}
touch "$1"
mkfifo "/tmp/$1"
tail -f "$1" >"/tmp/$1" &
sentinel "$1" $! &
cat "/tmp/$1" | while read line
do
    echo "$line"
done
Did some naïve testing, and it seems to work okay, and not leave any garbage lying around.
I've never been happy with this answer but I have not found an alternative either:
kill $(ps -o pid,cmd --no-headers --ppid $$ | grep tail | awk '{print $1}')
Get all processes that are children of the current process, look for the tail, print out the first column (tail's pid), and kill it. Sin-freaking-ugly indeed, such is life.
The following approach backgrounds the tail -f file command and echoes its process ID plus a custom string prefix (here tailpid: ) into the while loop, where the line with the custom prefix triggers another (backgrounded) while loop that checks every 5 seconds whether file still exists. If not, tail -f file gets killed and the subshell containing the backgrounded while loop exits.
# cf. "The Heirloom Bourne Shell",
# http://heirloom.sourceforge.net/sh.html,
# http://sourceforge.net/projects/heirloom/files/heirloom-sh/ and
# http://freecode.com/projects/bournesh
/usr/local/bin/bournesh -c '
touch file
(tail -f file & echo "tailpid: ${!}" ) | while IFS="" read -r line
do
case "$line" in
tailpid:*) while sleep 5; do
#echo hello;
if [ ! -f file ]; then
IFS=" "; set -- ${line}
kill -HUP "$2"
exit
fi
done &
continue ;;
esac
echo "$line"
done
echo exiting ...
'
