UNIX: How to count number of rows in multiple files without headers - unix

I have a set of files with a similar naming pattern. I am trying to get the total row count of all the files combined, excluding the headers, in one go. But I am having trouble with the commands.
I have tried:
sed '1d' IN-Pass-30* | wc -l
and
awk 'END {print NR-1}' IN-Pass-30*
But each time it only subtracts the header count from just one file. What am I doing wrong here?

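You were close. sed treats all the files on its command line as one continuous stream, so '1d' deletes only the first line of the first file. Run sed once per file by wrapping it in a glob loop: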
for f in IN-Pass-30*; do sed '1d' "$f"; done | wc -l
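The awk attempt in the question can be repaired in a similar spirit. The following is a minimal sketch, not part of the original answer; the function name count_rows is an assumption made for illustration. It relies on awk's FNR, which restarts at 1 for each input file, whereas NR keeps counting across files.

```shell
# count_rows FILE...: print the number of lines across all files,
# skipping the first (header) line of each file.
# The function name is hypothetical, chosen for this example.
count_rows() {
    # FNR is the line number within the current file; NR is the global one.
    # "n + 0" forces a numeric 0 when no data lines were seen.
    awk 'FNR > 1 { n++ } END { print n + 0 }' "$@"
}

# Usage (with the file pattern from the question):
#   count_rows IN-Pass-30*
```

This starts a single awk process regardless of how many files match.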

I propose the following "simple" solution:
Prompt> find ./ -maxdepth 1 -name "IN-Pass-30*" | wc -l
53
Prompt> cat IN-Pass-30* | wc -l
1418549
Prompt> echo $(($(cat IN-Pass-30* | wc -l) - $(find ./ -maxdepth 1 -name "IN-Pass-30*" | wc -l)))
1418496
What does this mean?
Prompt> find ./ -maxdepth 1 -name "IN-Pass-30*" | wc -l
// find all files inside that directory without checking subdirectories.
// once they are found, count them.
Prompt> cat IN-Pass-30* | wc -l
// use `cat` to concatenate all files' content.
// at the end, count the amount of lines.
Prompt> echo $((a - b))
// calculate the difference between a and b.
Prompt> echo $(command)
// show (or do whatever with it) the result of a command
The whole idea is that each file's header takes exactly one line, so the total number of lines in all the files minus the number of files (which equals the number of header lines) gives the desired result.
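The same subtraction can be sketched without find by letting the shell count the file arguments. This is an illustrative variant under stated assumptions, not the answer's exact method: the function name count_rows_by_subtraction is hypothetical, and it assumes every file is non-empty and starts with exactly one header line.

```shell
# count_rows_by_subtraction FILE...: total lines minus one header per file.
# Hypothetical helper; assumes each file has exactly one header line.
count_rows_by_subtraction() {
    # Total number of lines in all files combined.
    total=$(cat -- "$@" | wc -l)
    # $# is the number of file arguments, i.e. the number of headers.
    echo $(( total - $# ))
}

# Usage (with the file pattern from the question):
#   count_rows_by_subtraction IN-Pass-30*
```

Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline is counted one line short.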

Related

Recursively finding files in list of directories

How do I recursively count files in a list of Linux directories?
Example:
/dog/
    /a.txt
    /b.txt
    /c.ipynb
/cat/
    /d.txt
    /e.pdf
    /f.png
    /g.txt
/owl/
    /h.txt
I want the following output:
5 .txt
1 .ipynb
1 .pdf
1 .png
I tried the following, with no luck.
find . -type f | sed -n 's/..*\.//p' | sort | uniq -c
This find + gawk may work for you:
find . -type f -print0 |
awk -v RS='\0' -F/ '{sub(/^.*\./, ".", $NF); ++freq[$NF]} END {for (i in freq) print freq[i], i}'
Using -print0 in find makes it safe to handle file names containing whitespace and other special glob characters. Likewise, -v RS='\0' in awk makes the NUL byte the record separator (this relies on gawk; other awk implementations may not support a NUL separator).
Use Perl one-liners to make the output in the format you need, like so:
find . -type f | perl -pe 's{.*[.]}{.}' | sort | uniq -c | perl -lane 'print join "\t", @F;' | sort -nr
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start
Assume you have a known directory path with the subdirectories foo, bar, baz, qux, quux, gorge, and you want to count the file types by extension, but only for the subdirectories foo, baz and qux.
The simplest approach is:
$ find /path/{foo,baz,qux} -type f -exec sh -c 'echo "${0##*.}"' {} \; | sort | uniq -c
The exec part just uses a simple sh variable substitution to print the extension.
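A minimal sketch of the same idea that batches files with -exec ... {} + so that one sh handles many files instead of one sh per file. The function name count_extensions and the "(none)" label for files without an extension are assumptions for illustration. It takes the extension from the file's base name only, so a dot in a directory name does not get mistaken for an extension.

```shell
# count_extensions DIR...: count files per extension under the given directories.
# Hypothetical helper; "(none)" is an illustrative label for extensionless files.
count_extensions() {
    find "$@" -type f -exec sh -c '
        for path in "$@"; do
            name=${path##*/}              # strip the directory part
            case $name in
                *.*) printf ".%s\n" "${name##*.}" ;;   # text after the last dot
                *)   printf "%s\n" "(none)" ;;
            esac
        done' sh {} + | sort | uniq -c | sort -rn
}

# Usage:
#   count_extensions /dog /cat /owl
```

A hidden file such as .bashrc has a dot in its name, so this sketch would report it under ".bashrc"; whether that is desirable depends on the use case.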

Finding and sorting files by size in Unix

I want to create a function in shell programming that gets 2 parameters, directory-name and file-name and that does the following: it searches starting in the given directory-name for the file-name and then goes in all subdirectories of the directory-name to continue the search. I want the output to be every parent-directory where the file-name has been found, sorted using the file-name size.
Help would be much appreciated, thanks.
Not sure which Unix you are asking about, but on Linux (and probably most common Unix systems):
find <directory> -name "<filename>" -ls | sort -k 7 -n -r | awk '{print $NF}' | xargs -n 1 dirname
sort => sort by file size (the 7th column of find output is filesize)
awk => print the filename full path
dirname => get parent directory of the matched file
Example:
# Find parent directory of all types.h under /usr/include, sorted by file size in desc order
$ find /usr/include/ -name "types.h" -ls | sort -k 7 -n -r | awk '{print $NF}' | xargs -n 1 dirname
/usr/include/x86_64-linux-gnu/bits
/usr/include/x86_64-linux-gnu/sys
/usr/include/c++/7/parallel
/usr/include/rpc
/usr/include/linux/sched
/usr/include/linux/iio
/usr/include/linux
/usr/include/asm-generic
/usr/include/x86_64-linux-gnu/asm
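Since the question asked for a function taking a directory name and a file name, here is one possible sketch. It assumes GNU find, whose -printf can emit the size (%s) and the parent directory (%h) directly, which avoids parsing -ls columns and copes with spaces in paths. The function name find_parents_by_size is hypothetical.

```shell
# find_parents_by_size DIR NAME: print the parent directory of every file
# called NAME under DIR, largest file first.
# Hypothetical helper; assumes GNU find (-printf is not POSIX).
find_parents_by_size() {
    # Emit "size<TAB>parent-dir" per match, sort numerically descending,
    # then drop the size column.
    find "$1" -type f -name "$2" -printf '%s\t%h\n' | sort -rn | cut -f2-
}

# Usage:
#   find_parents_by_size /usr/include types.h
```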

Combine find, grep and xargs with printf

I have a find command combined with exec grep and a printf option :
find -L /home/blast/dirtest -maxdepth 3 -exec grep -q "pattern" {} \; -printf '%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n' 2> /dev/null
Result :
f/#/2018-01-01 10:00:00/#/191/#/filee.xml/#//#//home/blast/dirtest/01/05
I need the printf to get all the desired file information at once (date, type, size, etc.).
The above command works fine, but the exec option is too slow compared to xargs.
I tried to do the same with xargs but did not succeed.
Any idea how to achieve that using xargs while keeping the desired printf or something similar?
Thanks
Your code is:
find -L /home/blast/dirtest -maxdepth 3 \
-exec grep -q "pattern" {} \; \
-printf '%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n' 2> /dev/null
This invokes a new grep process for each file.
If you are using GNU utilities, you can reduce the number of grep processes by something like:
(
format=\''%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n'\'
find -L /home/blast/dirtest -maxdepth 3 -print0 |\
xargs -0 grep -l -Z "pattern" |\
xargs -0 sh -c 'find "$@" -printf '"$format" --
) 2>/dev/null
for clarity, store the formatstring in a variable
use -print0 / -0 / -Z options to enable null-delimited data
generate initial filelist with find
filter on "pattern" with grep (use of xargs minimises the number of times grep gets called)
feed the filtered filelist into another xargs to run a minimal number of find -printf
in second xargs, call a subshell so that extra arguments can be appended (find requires the paths to precede the operators)
dummy second argument (--) to the sh -c invocation prevents the first filename being lost due to assignment to $0
To do it exactly how you want:
find -L /home/blast/dirtest/ -maxdepth 3 \
-printf '%p#%y/#/%TY-%Tm-%Td %TX/#/%s/#/%f/#/%l/#/%h\n' \
> tmp.out
cut -d# -f1 tmp.out \
| xargs grep -l "pattern" 2>/dev/null \
| sed 's/^/^/; s/$/#/' \
| grep -f /dev/stdin tmp.out \
| sed 's/^.*#//'
This operates under the assumption that you have no character # in your file names.
What it does is avoid the grep at first and just dump all the files with the requested metadata to a temporary file.
But it also prefixes each line with the full path (%p#).
Then we extract (cut) the full paths out of this list and list the files which contains the pattern (xargs grep).
We then use sed to prefix each such file name with ^ and suffix it with #, which makes it a greppable pattern in our tmp.out file.
Then we use this pattern (grep -f /dev/stdin) to extract only those paths from the big list in tmp.out.
Now all that's left is to remove the artificial full path we prefixed using the last sed command.
Seeing how you used /home, there's a good chance you're on Linux, which, if you're willing to accept some output format changes, allows you to do it somewhat more elegantly:
find -L /home/blast/dirtest/ -maxdepth 3 \
| xargs grep -l "pattern" 2>/dev/null \
| xargs stat --printf '%F/#/%y/#/%s/#/%n\n'
The output of stat --printf is different from that of find -printf (and from that of MacOS' stat -f), but it's the same information.
Do note, however, that because you passed -L to find, and you're grepping the result:
The results are limited to file types which can be grepped, so they will never be directories, links, etc..
If you stumble upon a broken link, it will not be in the output because it cannot be grepped.
I've found an interesting thing about the -exec option.
We can run grep far fewer times by using -exec with the plus sign (+):
-exec command {} +
This variant of the -exec option runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total
number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command
lines. Only one instance of ’{}’ is allowed within the command. The command is executed in the starting directory.
That means that changing this:
-exec grep -l 'pattern' {} \;
to this (replacing the semicolon with the plus sign):
-exec grep -l 'pattern' {} \+
improves performance significantly.
Then I only need a single xargs in the pipeline, for the formatted output.
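One way this could look end to end is sketched below. It is an assumption-laden illustration rather than the poster's exact command: it relies on GNU grep (-Z), GNU xargs (-0, -r) and GNU stat (--printf), and it uses the stat format from the earlier answer instead of find -printf. The function name grep_then_describe is hypothetical. Because -exec ... {} + always evaluates to true in find, it cannot act as a per-file filter in front of -printf, which is why the matching names are piped onward instead.

```shell
# grep_then_describe DIR PATTERN: print metadata for files under DIR
# (at most 3 levels deep) whose content matches PATTERN.
# Hypothetical helper; assumes GNU grep, xargs and stat.
grep_then_describe() {
    # grep -l prints matching file names; -Z terminates each name with NUL.
    find -L "$1" -maxdepth 3 -type f -exec grep -lZ -- "$2" {} + |
        # -0 reads NUL-delimited names; -r skips stat when nothing matched.
        xargs -0 -r stat --printf '%F/#/%y/#/%s/#/%n\n'
}

# Usage:
#   grep_then_describe /home/blast/dirtest "pattern" 2>/dev/null
```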

Pipe output to parameter

So I wanted to write a simple command that counts one less than the number of files in my current directory. I have this command that comes close but is off by one.
ls | wc -l
How can I pipe this to bc so I can subtract one from it?
Thanks!
To pipe to bc you could use something like this
echo " $(ls | wc -l) - 1 " | bc
EDIT: replace the part in the $( ) with steve's answer, or any other command you need.
That's really not what you want to do. Use find instead:
find . -maxdepth 1 -type f | wc -l
Also, you can exclude hidden files, with:
find . -maxdepth 1 -type f ! -name ".*" | wc -l
For completeness, you can handle files containing newlines and spaces like:
find . -maxdepth 1 -type f -print0 | tr -dc '\0' | wc -c
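For the original goal (one less than the number of files), shell arithmetic can do the subtraction without bc. This is a minimal sketch; the function name count_files_minus_one is hypothetical, and -maxdepth is assumed to be available (it is in GNU and BSD find, but is not POSIX).

```shell
# count_files_minus_one [DIR]: print (number of regular files in DIR) - 1.
# Hypothetical helper; DIR defaults to the current directory.
count_files_minus_one() {
    n=$(find "${1:-.}" -maxdepth 1 -type f | wc -l)
    # $(( ... )) performs integer arithmetic in the shell itself.
    echo $(( n - 1 ))
}

# Usage:
#   count_files_minus_one
```

Like the plain find | wc -l variant above, this miscounts file names that contain newlines.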

Adding data line by line in file in Unix

I am extracting file names with a command that returns many file names, and I am putting them into one file.
Code:
echo `find ${FILE_SYSTEM}/${dir_name}/${sub_dir_name} -type f -size +${BADFILES_SIZE} -exec ls -1lutr {} \; | sort -rn | awk '{print $9}'` >> Somefile.txt
The problem is that I am not getting one file name per line.
It puts two file names on one line.
But I want each file name on its own line.
E.g.:
/informatica/ETD/PC9/scripts/kamil/temp/temp1.txt /informatica/ETD/PC9/scripts/kamil/temp/temp2.txt
I am getting file names as shown above, and I want them as shown below.
/informatica/ETD/PC9/scripts/kamil/temp/temp1.txt
/informatica/ETD/PC9/scripts/kamil/temp/temp2.txt
Please give your suggestions.
The problem is that you're using echo and backticks. Don't! The echo flattens all its arguments (a list of two files, it seems) into a single line of output.
Wrong:
echo `find ${FILE_SYSTEM}/${dir_name}/${sub_dir_name} -type f -size +${BADFILES_SIZE} -exec ls -1lutr {} \; | sort -rn | awk '{print $9}'` >> Somefile.txt
Right:
find ${FILE_SYSTEM}/${dir_name}/${sub_dir_name} -type f \
-size +${BADFILES_SIZE} -exec ls -1lutr {} + |
sort -rn |
awk '{print $9}' >> Somefile.txt
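The flattening can be reproduced in isolation. In this small sketch the two paths under /tmp/demo are made up for illustration and do not need to exist. An unquoted expansion is split on whitespace (including newlines) into separate arguments, and echo joins its arguments with single spaces.

```shell
# Two file names, one per line, held in a variable (paths are illustrative).
files=$(printf '%s\n' /tmp/demo/temp1.txt /tmp/demo/temp2.txt)

# Unquoted: the shell splits on the newline and echo joins with a space,
# so both names end up on one line.
echo $files

# Quoted: the newline is preserved, so each name is on its own line.
echo "$files"
```

Dropping echo and the backticks entirely, as in the corrected command, avoids the problem without needing any quoting.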

Resources