Recursively finding files in list of directories - unix

How do I recursively count files in a list of Linux directories?
Example:
/dog/
/a.txt
/b.txt
/c.ipynb
/cat/
/d.txt
/e.pdf
/f.png
/g.txt
/owl/
/h.txt
I want the following output:
5 .txt
1 .ipynb
1 .pdf
1 .png
I tried the following, with no luck.
find . -type f | sed -n 's/..*\.//p' | sort | uniq -c

This find + gawk may work for you:
find . -type f -print0 |
awk -v RS='\0' -F/ '{sub(/^.*\./, ".", $NF); ++freq[$NF]} END {for (i in freq) print freq[i], i}'
Using -print0 in find safely handles file names that contain whitespace or glob characters. Likewise, -v RS='\0' makes awk use the NUL byte as the record separator.

Use Perl one-liners to make the output in the format you need, like so:
find . -type f | perl -pe 's{.*[.]}{.}' | sort | uniq -c | perl -lane 'print join "\t", @F;' | sort -nr
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

Assume you have a known directory path with the subdirectories foo, bar, baz, qux, quux, gorge, and you want to count the file types by extension, but only for the subdirectories foo, baz and qux.
The best approach is simply:
$ find /path/{foo,baz,qux} -type f -exec sh -c 'echo "${0##*.}"' {} \; | sort | uniq -c
The exec part just uses a simple sh variable substitution to print the extension.
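A hedged variant of the same idea, for cases where starting one sh per file is too slow or where some files have no extension. The sketch below batches files with -exec ... {} + and looks only at the last path component, so a dot in a directory name is not mistaken for an extension. The function name count_ext and the "(none)" label are illustrative assumptions, not part of the answer above, and file names containing newlines would still be miscounted.

```shell
# count_ext DIR...  (hypothetical helper name)
# Print "count .ext" for the regular files under the given directories.
count_ext() {
    find "$@" -type f -exec sh -c '
        for path in "$@"; do
            name=${path##*/}                         # keep only the file name
            case $name in
                *.*) printf ".%s\n" "${name##*.}" ;; # text after the last dot
                *)   printf "%s\n" "(none)" ;;       # assumed label for "no extension"
            esac
        done' sh {} + | sort | uniq -c | sort -rn
}
```

It would be called as count_ext /path/foo /path/baz /path/qux.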

Related

Combine find, grep and xargs with printf

I have a find command combined with exec grep and a printf option :
find -L /home/blast/dirtest -maxdepth 3 -exec grep -q "pattern" {} \; -printf '%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n' 2> /dev/null
Result :
f/#/2018-01-01 10:00:00/#/191/#/filee.xml/#//#//home/blast/dirtest/01/05
I need the printf to get all the desired file information at once (date, type, size, etc.).
The above command works fine, but the exec option is too slow compared to xargs.
I tried to do the same with xargs but did not succeed.
Any idea how to achieve that using xargs while keeping the desired printf or something similar?
Thanks
Your code is:
find -L /home/blast/dirtest -maxdepth 3 \
-exec grep -q "pattern" {} \; \
-printf '%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n' 2> /dev/null
This invokes a new grep process for each file.
If you are using GNU utilities, you can reduce the number of grep processes by something like:
(
format=\''%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n'\'
find -L /home/blast/dirtest -maxdepth 3 -print0 |\
xargs -0 grep -l -Z "pattern" |\
xargs -0 sh -c 'find "$@" -printf '"$format" --
) 2>/dev/null
for clarity, store the formatstring in a variable
use -print0 / -0 / -Z options to enable null-delimited data
generate initial filelist with find
filter on "pattern" with grep (use of xargs minimises the number of times grep gets called)
feed the filtered filelist into another xargs to run a minimal number of find -printf
in second xargs, call a subshell so that extra arguments can be appended (find requires the paths to precede the operators)
dummy second argument (--) to the sh -c invocation prevents the first filename being lost due to assignment to $0
To do it exactly how you want:
find -L /home/blast/dirtest/ -maxdepth 3 \
-printf '%p#%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n' \
> tmp.out
cut -d# -f1 tmp.out \
| xargs grep -l "pattern" 2>/dev/null \
| sed 's/^/^/; s/$/#/' \
| grep -f /dev/stdin tmp.out \
| sed 's/^.*#//'
This operates under the assumption that you have no character # in your file names.
What it does is avoid the grep at first and just dump all the files with the requested metadata to a temporary file.
But it also prefixes each line with the full path (%p#).
Then we extract (cut) the full paths out of this list and list the files which contains the pattern (xargs grep).
We then use sed to prefix each such file name with ^ and suffix it with #, which makes it a greppable pattern in our tmp.out file.
Then we use this pattern (grep -f /dev/stdin) to extract only those paths from the big list in tmp.out.
Now all that's left is to remove the artificial full path we prefixed using the last sed command.
Seeing how you used /home, there's a good chance you're on Linux, which, if you're willing to accept some output format changes, allows you to do it somewhat more elegantly:
find -L /home/blast/dirtest/ -maxdepth 3 \
| xargs grep -l "pattern" 2>/dev/null \
| xargs stat --printf '%F/#/%y/#/%s/#/%n\n'
The output of stat --printf is different from that of find -printf (and from that of MacOS' stat -f), but it's the same information.
Do note, however, that because you passed -L to find, and you're grepping the result:
The results are limited to file types which can be grepped, so they will never be directories, links, etc..
If you stumble upon a broken link, it will not be in the output because it cannot be grepped.
I've found an interesting thing about the -exec option.
We can run grep far fewer times by using -exec with the plus sign (+):
-exec command {} +
This variant of the -exec option runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines. Only one instance of '{}' is allowed within the command. The command is executed in the starting directory.
That means that changing this:
-exec grep -l 'pattern' {} \;
to this (replacing the semicolon with the plus sign):
-exec grep -l 'pattern' {} \+
improves performance significantly.
The output then needs to be piped through only one xargs, for the format printing.
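A minimal sketch of what that complete pipeline might look like, assuming GNU find, grep and xargs (-printf, -Z, -0 and -r are GNU extensions). The function name list_matching is hypothetical, the format string is shortened for readability, and -type f is added here so that grep is never handed a directory.

```shell
# list_matching DIR PATTERN  (hypothetical helper name; GNU tools assumed)
# grep runs on batches of files, then one xargs stage prints the metadata.
list_matching() {
    # -l prints only the names of matching files, -Z terminates each with NUL
    find -L "$1" -maxdepth 3 -type f -exec grep -lZ -e "$2" {} + |
        # paths must precede the expression, hence the small sh wrapper
        xargs -0 -r sh -c 'find -L "$@" -maxdepth 0 -printf "%y/#/%s/#/%f/#/%h\n"' sh
}
```

For example, list_matching /home/blast/dirtest pattern 2>/dev/null.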

Append "/" to end of directory

Complete noob question, but using ls piped to grep, I need to find files or directories that have all capitals in their name, and directories need to have "/" appended to indicate that they are directories. Appending the "/" is the only part I am stuck on. Again, I apologize for the amateur question. I currently have ls | grep [A-Z] and the example output should be: BIRD, DOG, DOGDIR/
It's an interesting question because it's a somewhat difficult thing to accomplish with a bash one-liner.
Here's what I came up with. It doesn't seem very elegant, but I'm not sure how to improve.
find /animals -type d -or -type f \
| grep '/[A-Z]*$' \
| xargs -I + bash -c 'echo -n $(basename +)$( test -d + && echo -n /),\\ ' \
| sed -e 's/, *$//'; echo
I'll break that down for you
find /animals -type d -or -type f writes out, once per line, the directories and files it found in /animals (see below for my test environment dockerfile - I created /animals to match your desired output). Find can't do a regex match as far as I know on the name, so...
grep '/[A-Z]*$' filters find's output so that only paths are shown where the last part of the file or directory name, after the final /, is all uppercase
xargs -I + bash -c '...' when you're in a shell and you want to use a "for" loop, chances are what you should be using is xargs. Learn it, know it, love it. xargs takes its input, separated by default by blanks and newlines (by newlines only when -I is used), and runs the command you give it for each piece of input. So this is going to run a bash shell for each path that passed the grep filter. In my case, -I + makes xargs replace the literal '+' character with its current input filename. -I also makes it pass one input line at a time through xargs. For more information, see the xargs manual page.
'echo -n $(basename +)$( test -d + && echo -n /),\\ ' this is the inner bash script that will be run by xargs for each path that got through grep.
basename + cuts the directory component off the path; from your example output you don't want eg /animals/DOGDIR/, you want DOGDIR/. basename is the program that trims the directories for us.
test -d + && echo -n / checks whether + (remember xargs will replace it with the filename) is a directory, and if so, runs echo -n /. The -n argument to echo suppresses the newline, which is important to get the output in the CSV format you specified.
Now we can put it all together: we echo -n the output of basename +, with / appended if it's a directory, and then , appended to that. All the echos run with -n to suppress newlines and keep the output looking like CSV.
| sed -e 's/, *$//'; echo is purely for formatting. Adding , to each individual output was an easy way to get the CSV, but it leaves us with a final , at the end of the list. The sed invocation removes , followed by any number of spaces at the end of the output so far - eg the entire output from all the xargs invocations. And since we never did output a newline at the end of that output, the final echo is adding that.
Usually in unix shells, you probably wouldn't want a CSV style output. You'd probably instead want a newline-separated output in most cases, one matching file per line, and that would be somewhat simpler to do because you wouldn't need all that faffing with -n and , to make it CSV style. But, valid requirement if the need is there.
FROM debian
RUN mkdir -p /animals
WORKDIR /animals
RUN mkdir -p DOGDIR lowerdir && touch DOGDIR/DOG DOGDIR/lowerDOG2 lowerdir/BIRD
ENTRYPOINT [ "/bin/bash" ]
CMD [ "-c" , "find /animals -type d -or -type f | grep '/[A-Z]*$'| xargs -I + bash -c 'echo -n $(basename +)$( test -d + && echo -n /),\\ ' | sed -e 's/, *$//'; echo"]
$ docker run --rm test
BIRD, DOGDIR/, DOG
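For comparison, here is a sketch that avoids find, grep and xargs entirely and prints one name per line instead of CSV. It assumes only the entries directly inside one directory matter (no recursion), and the function name list_upper is made up for illustration.

```shell
# list_upper DIR  (hypothetical helper name)
# Print entries of DIR whose names are entirely uppercase letters,
# with "/" appended to directories. Runs in a subshell so the cd stays local.
list_upper() (
    cd "$1" || exit
    for name in *; do
        case $name in
            *[![:upper:]]*) continue ;;   # contains a non-uppercase character
        esac
        if [ -d "$name" ]; then
            printf '%s/\n' "$name"
        else
            printf '%s\n' "$name"
        fi
    done
)
```

An unmatched glob leaves a literal *, which the case pattern rejects, so an empty directory prints nothing.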
You can start looking at
ls -F | grep -v "[[:lower:]]"
I did not add anything for a comma-separated line, because this is the wrong method: parsing ls should be avoided! It will go wrong for filenames like
I am a terribble filename,
with newlines inside me,
and the ls command combined with grep
will only show the last line
BECAUSE THIS LINE HAS NO LOWERCASE CHARACTERS
To get the files without a pipe, you can use
shopt -s extglob
ls -dp +([[:upper:]])
shopt -u extglob
An explanation of the extglob and uppercase can be found at https://unix.stackexchange.com/a/389071/57293
When you want the output on one line, you can run into trouble with filenames that contain newlines or commas. You might want something like
# parsing ls, yes wrong and failing for some files
ls -dp +([[:upper:]]) | tr "\n" "," | sed 's/,$/\n/'

Remove underscores from all filenames within a directory

I have a folder "model" with files named like:
a_EmployeeData
a_TableData
b_TestData
b_TestModel
I basically need to drop the underscore and make them:
aEmployeeData
aTableData
bTestData
bTestModel
Is there a way to do this on the Unix command line?
This will correctly process files containing odd characters like spaces or even newlines, and should work on any Unix / Linux distribution, as it relies only on POSIX syntax.
find model -type f -name "*_*" -exec sh -c 'd=$(dirname "$1"); mv "$1" "$d/$(basename "$1" | tr -d _)"' sh {} \;
Here is what it does:
For each file (not directory) containing an underscore in its name under the model directory and its subdirectories, rename the file in place with all the underscores stripped out.
You can do this simply with bash.
for file in /path/to/model/*; do
mv "$file" "${file/_/}"
done
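Note that ${file/_/} removes only the first underscore and applies to the whole path, so an underscore in a directory name would be hit first. A more defensive sketch, assuming the goal is to strip every underscore from the file name only: the function name strip_underscores and the skip-on-collision behavior are illustrative choices, not part of the answer above.

```shell
# strip_underscores DIR  (hypothetical helper name)
# Remove all underscores from the names of regular files directly inside DIR.
strip_underscores() {
    for file in "$1"/*_*; do
        [ -f "$file" ] || continue               # skip directories and an unmatched glob
        dir=${file%/*}
        base=${file##*/}
        new=$(printf '%s' "$base" | tr -d '_')   # touch the name, not the path
        if [ -z "$new" ] || [ -e "$dir/$new" ]; then
            printf 'skipping %s\n' "$file" >&2   # empty or already-taken target name
            continue
        fi
        mv -- "$file" "$dir/$new"
    done
}
```

It would be called as strip_underscores /path/to/model.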
If you have rename command available then simply do
rename 's/_//' /path/to/model/*
for f in model/* ; do mv "$f" `echo "$f" | sed 's/_//g'` ; done
Edit: modified a few things thanks to suggestions by others, but I'm afraid my code is still bad for strange filenames.
maybe this:
find model -name "*_*" -type f -maxdepth 1 -print | sed -e 'p;s/_//g' | xargs -n2 echo mv
Decomposition:
find all plain files in the directory model whose names contain at least one underscore, without searching subdirectories
with sed, print the old name first
then adjust the filename by replacing each _ with nothing
feed the two filenames to xargs, which renames the files with mv
The above is for a dry-run. When satisfied, remove the echo before mv for actual rename.
Warning: this will not work if a filename contains spaces. If you have GNU sed, you can use:
find . -name "*_*" -maxdepth 1 -print0 | sed -z 'p;s/_//g' | xargs -0 -n2 echo mv
and it will work with filenames that contain spaces, too.
In zsh:
autoload zmv # in ~/.zshrc
cd model && zmv '(**/)(*)' '$1${2//_}'
marc@panic:~$ echo 'a_EmployeeData' | tr -d '_'
aEmployeeData
I had the same problem on my machine, but the filenames had more than one underscore. I used rename with the g option so that all underscores get removed:
find model/ -maxdepth 1 -type f | rename 's/_//g'
Or if there are no subdirectories, just
rename 's/_//g'
If you don't have rename, see Jaypal Singh's answer.
Use the global flag /g with your replace pattern to replace all occurrences within the filename.
find . -type f -print0 | xargs -0 rename 's/_//g'
Or if you want underscores replaced with spaces then use this:
find . -type f -print0 | xargs -0 rename 's/_/ /g'
If you like to live dangerously add the force flag -f in front of your replace pattern rename -f 's/_//g'

unix: how to concatenate files matched in grep

I want to concatenate the files whose name does not include "_BASE_". I thought it would be somewhere along the lines of ...
ls | grep -v _BASE_ | cat > all.txt
the cat part is what I am not getting right. Can anybody give me some idea about this?
Try this
ls | grep -v _BASE_ | xargs cat > all.txt
You can ignore some files with ls using --ignore option and then cat them into a file.
ls --ignore="*_BASE_*" | xargs cat > all.txt
Also you can do that without xargs:
cat $( ls --ignore="*_BASE_*" ) > all.txt
UPD:
Dale Hagglund noticed that a filename like "Some File" will be treated as two filenames, "Some" and "File". To avoid that, you can use the --quoting-style=WORD option, where WORD can be shell or escape.
For example, with --quoting-style=shell, Some File is printed as 'Some File' and is interpreted as one file.
Another problem is that the output file could be one of the files listed by ls. We need to ignore it too.
So answer is:
outputFile=a.txt; ls --ignore="*sh*" --ignore="${outputFile}" --quoting-style=shell | xargs cat > ${outputFile}
If you also want files from subdirectories, find is your friend:
find . -type f ! -name '*_BASE_*' ! -path ./all.txt -exec cat {} >> all.txt \+
It searches files in the current directory and its subdirectories, it finds only files (-type f), ignores files matching to wildcard pattern *_BASE_*, ignores all.txt, and executes cat in the same manner as xargs would.
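For the non-recursive case, the same result can be sketched without parsing ls at all, which sidesteps the quoting problems with names such as "Some File". The function name concat_non_base is hypothetical, and the sketch assumes that if the output file lives inside the directory, it is passed as DIR/name so the string comparison can skip it.

```shell
# concat_non_base DIR OUT  (hypothetical helper name)
# Concatenate regular files in DIR whose names do not contain _BASE_ into OUT.
concat_non_base() {
    dir=$1
    out=$2
    : > "$out"                              # start with an empty output file
    for f in "$dir"/*; do
        [ -f "$f" ] || continue             # regular files only
        [ "$f" = "$out" ] && continue       # never read the output file itself
        case ${f##*/} in
            *_BASE_*) continue ;;           # excluded by name
        esac
        cat -- "$f" >> "$out"
    done
}
```

For example, concat_non_base . ./all.txt.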

How to copy files in shell that do not end with a certain file extension

For example copy all files that do not end with .txt
Bash will accept a "not" pattern when the extglob shell option is enabled (shopt -s extglob):
cp !(*.txt)
You can use ls with grep -v option:
for i in `ls | grep -v ".txt"`
do
cp $i $dest_dir
done
Depending on how many assumptions you can afford to make about the characters in the file names, it might be as simple as:
cp $(ls | grep -v '\.txt$') /some/other/place
If that won't work for you, then maybe find ... -print0 | xargs -0 cp ... can be used instead (though that has issues - because the destination goes at the end of the argument list).
On MacOS X, xargs has an option -J that supports what is needed:
-J replstr
If this option is specified, xargs will use the data read from standard input to replace the first occurrence of replstr instead of append-
ing that data after all other arguments. This option will not affect how many arguments will be read from input (-n), or the size of the
command(s) xargs will generate (-s). The option just moves where those arguments will be placed in the command(s) that are executed. The
replstr must show up as a distinct argument to xargs. It will not be recognized if, for instance, it is in the middle of a quoted string.
Furthermore, only the first occurrence of the replstr will be replaced. For example, the following command will copy the list of files and
directories which start with an uppercase letter in the current directory to destdir:
/bin/ls -1d [A-Z]* | xargs -J % cp -rp % destdir
It appears the GNU xargs does not have -J but does have the related but slightly restrictive -I option (which is also present in MacOS X):
-I replace-str
Replace occurrences of replace-str in the initial-arguments with
names read from standard input. Also, unquoted blanks do not
terminate input items; instead the separator is the newline
character. Implies -x and -L 1.
You can rely on:
find . -not -name "*.txt"
By using:
find -x . -not -name "*.txt" -d 1 -exec cp '{}' toto/ \;
This copies all files in the current directory that are not .txt to a subdirectory toto/. The -d 1 is used to prevent recursion here.
Either do:
for f in $(ls | grep -v "\.txt$")
do
cp -- "$f" ⟨destination-directory⟩
done
or if you have a huge amount of files:
find -prune \! -name "*.txt" -exec cp -- "{}" ⟨destination-directory⟩ .. \;
Two things here to comment on: the use of the double hyphen in the invocation of cp, and the quoting of $f. The first guards against "wacky" filenames that begin with a hyphen and might be interpreted as options. The second guards against filenames with spaces (or whatever is in IFS) in them.
In zsh:
setopt extendedglob
cp *^.txt /some/folder
(if you just want files)...
cp *.^txt(.) /some/folder
More information on zsh globbing here and here.
I would do it like this, where destination is the destination directory:
ls | grep -v "\.txt$" | xargs cp -t destination
Edit: added "-t" thanks to the comments
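If cp -t is not available, or file names may contain spaces, one possible sketch lets find select the files and uses a small sh wrapper to put the destination last. The function name copy_non_txt is made up for illustration, and -maxdepth, while supported by GNU and BSD find, is not in POSIX.

```shell
# copy_non_txt SRC DEST  (hypothetical helper name)
# Copy regular files directly inside SRC that do not end in .txt into DEST.
copy_non_txt() {
    src=$1
    dest=$2
    find "$src" -maxdepth 1 -type f ! -name '*.txt' -exec sh -c '
        dest=$1; shift              # first argument is the destination
        cp -- "$@" "$dest"/         # remaining arguments are the batched files
    ' sh "$dest" {} +
}
```

It would be called as copy_non_txt . /some/other/place.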
