Count number of blank lines in a file - unix

In "Count (non-blank) lines-of-code in bash" they explain how to count the number of non-empty lines.
But is there a way to count the number of blank lines in a file? By blank line I also mean lines that contain only spaces.

Another way is:
grep -cvP '\S' file
-P '\S' (Perl regex) matches any line that contains a non-space character
-v selects non-matching lines
-c prints a count of the selected lines
If your grep doesn't support the -P option, use -E '[^[:space:]]' instead
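For example, with GNU grep both variants should give the same count on a small sample; the two counted lines here are the empty line and the line containing only spaces:
$ printf 'one\n\n  \nthree\n' | grep -cvP '\S'   # sample input: one empty line, one space-only line
2
$ printf 'one\n\n  \nthree\n' | grep -cvE '[^[:space:]]'
2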

One way using grep:
grep -c "^$" file
Or with whitespace:
grep -c "^\s*$" file

You can also use awk for this:
awk '!NF {sum += 1} END {print sum}' file
From the manual, "The variable NF is set to the total number of fields in the input record". Since the default field separator is whitespace, any line consisting of nothing or only whitespace will have NF=0.
Then, it is a matter of counting how many times this happens.
Test
$ cat a
aa dd
ddd
he llo
$ cat -vet a # -vet to show tabs (^I) and line endings ($)
aa dd$
$
ddd$
$
^I$
he^Illo$
Now let's count the number of blank lines:
$ awk '!NF {s+=1} END {print s}' a
3

grep -v '\S' | wc -l
(On OS X the -P option for Perl regular expressions is not available.)

grep -cx '\s*' file
or
grep -cx '[[:space:]]*' file
That is faster than the code in Steve's answer.

Using Perl one-liner:
perl -lne '$count++ if /^\s*$/; END { print int $count }' input.file
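As a quick sanity check, running it on the test file a from the awk answer above (which has three blank lines) should print the same count:
$ perl -lne '$count++ if /^\s*$/; END { print int $count }' a
3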

To count how many useless blank lines your colleague has inserted in a project you can launch a one-line command like this:
blankLinesTotal=0; for file in $( find . -name "*.cpp" ); do blankLines=$(grep -cvE '\S' ${file}); blankLinesTotal=$[${blankLines} + ${blankLinesTotal}]; echo $file" has" ${blankLines} " empty lines." ; done; echo "Total: "${blankLinesTotal}
This prints:
<filename0>.cpp #blankLines
....
....
<filenameN>.cpp #blankLines
Total #blankLinesTotal
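The same loop, spread over several lines and made safe for file names containing spaces, could look like this (a sketch, assuming bash and a find that supports -print0; the per-file grep is the same as above):
blankLinesTotal=0
while IFS= read -r -d '' file
do
    # count lines with no non-space character, exactly as in the one-liner
    blankLines=$(grep -cvE '\S' "${file}")
    blankLinesTotal=$((blankLines + blankLinesTotal))
    echo "${file} has ${blankLines} empty lines."
done < <(find . -name "*.cpp" -print0)
echo "Total: ${blankLinesTotal}"
The process substitution keeps the while loop in the current shell, so blankLinesTotal is still set after the loop finishes.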

Related

Retrieving a variable name that starts with a specific string

I have a variable name that appears in multiple locations of a text file. This variable will always start with the same string but not always end with the same characters. For example, it can be var_name or var_name_TEXT.
I'm looking for a way to extract the first occurrence in the text file of this string starting with var_name and ending with , (but I don't want the comma in the output).
Example1: var_name, some_other_var, another_one, ....
Output: var_name
Example2: var_name_TEXT, some_other_var, another_one, ...
Output: var_name_TEXT
grep -oPm1 '\bvar_name[^, ]*(?=,)' file | head -1
match and output only the variable starting with var_name and ending just before a comma, without including the comma in the output; -m1 quits after the first matching line, and head -1 picks the first match on that line (if there is more than one)
P.S. You have to include the space in the character class as well.
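For instance, on a demo file f holding the line from Example2, the command should print just the variable name (assuming a GNU grep built with PCRE support):
$ echo 'var_name_TEXT, some_other_var, another_one, ...' > f
$ grep -oPm1 '\bvar_name[^, ]*(?=,)' f | head -1
var_name_TEXT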
I suggest with GNU grep:
grep -o '\bvar_name[^,]*' file | head -n 1
All you need is (GNU awk):
$ awk 'match($0,/\<var_name[^,]*/,a){print a[0]; exit}' file
var_name_TEXT
To print the field only (i.e., var_name or var_name_TEXT only; not the line containing it) you could use awk:
awk -F, '{for (i=1;i<=NF;i++) if ($i~/^var_name/) print $i}' file
If you actually have spaces before or after the commas (as you show in your example) you can change to awk field separator:
awk -F"[, ]+" '{for (i=1;i<=NF;i++) if ($i~/^var_name/) print $i}' file
You can also use GNU grep with a word boundary assertion:
grep -o '\bvar_name[^,]*' file
Or GNU awk:
awk '/\<var_name/' file
If you want only one considered, add exit to awk or -m 1 to grep to exit after the first match.

Grep files containing two or more occurrences of a specific string

I need to find files where a specific string appears twice or more.
For example, for three files:
File 1:
Hello World!
File 2:
Hello World!
Hello !
File 3:
Hello World!
Hello
Hello Again.
--
I want to grep Hello and only get files 2 & 3.
What about this:
grep -o -c Hello * | awk -F: '{if ($2 > 1){print $1}}'
Since the question is tagged grep, here is a solution using only that utility and bash (no awk required):
#!/bin/bash
for file in *
do
    if [ "$(grep -c "Hello" "${file}")" -gt 1 ]
    then
        echo "${file}"
    fi
done
Can be a one-liner:
for file in *; do if [ "$(grep -c "Hello" "${file}")" -gt 1 ]; then echo "${file}"; fi; done
Explanation
You can modify the for file in * statement with whatever shell expansion you want to get all the data files.
grep -c returns the number of lines that match the pattern, with multiple matches on a line still counting for just one matched line.
if [ ... -gt 1 ] tests whether more than one line is matched in the file. If so:
echo ${file} prints the file name.
This awk one-liner will print the file names of all files in which Hello appears on two or more lines:
awk 'FNR==1 {if (a>1) print f;a=0} /Hello/ {a++} {f=FILENAME} END {if (a>1) print f}' *
file2
file3
What you need is a grep that can recognise patterns across line endings ("hello" followed by anything (possibly even line endings), followed by "hello")
As grep processes your files line by line, it is (by itself) not the right tool for the job - unless you manage to cram the whole file into one single line.
Now, that is easy, for example using the tr command, replacing line endings by spaces:
if cat $file | tr '\n' ' ' | grep -q 'hello.*hello'
then
    echo "$file matches"
fi
This is quite efficient, even on large files with many (say 100000) lines, and can be made even more efficient by calling grep with --max-count=1, making it stop the search after a match has been found. It doesn't matter whether the two hellos are on the same line or not.
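Wrapped in a loop over the files, that test could look like this (a sketch; the inner pipeline is exactly the one above):
for file in *
do
    # join all lines into one, then look for two occurrences of hello
    if cat "$file" | tr '\n' ' ' | grep -q 'hello.*hello'
    then
        echo "$file matches"
    fi
done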
After reading your question, I think you also want to find the case hello hello in one line (find files where a specific string appears twice or more), so I came up with this one-liner:
awk -v p="hello" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' *
in the above line, p is the pattern you want to search
it will print the filename if the file contains the pattern two or more times, no matter whether the occurrences are on the same line or on different lines
During processing, as soon as two or more occurrences have been found, it prints the filename and stops processing the current file (nextfile), moving on to the next input file if there is one. This is helpful if you have big files.
A little test:
kent$ head f*
==> f <==
hello hello world
==> f2 <==
hello
==> f3 <==
hello
hello
SK-Arch 22:27:00 /tmp/test
kent$ awk -v p="hello" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' f*
f
f3
Another way:
grep Hello * | cut -d: -f1 | uniq -d
Grep for lines containing 'Hello'; keep only the file names; print only the duplicates.
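Assuming the three files from the question are named file1, file2 and file3, this should give:
$ grep Hello file1 file2 file3 | cut -d: -f1 | uniq -d
file2
file3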
grep -c Hello * | egrep -v ':[01]$' | sed 's/:[0-9]*$//'
Piping to a scripting language might be overkill, but it's oftentimes much easier than just using awk
grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts "#{file}: #{count}" if count&.to_i >= 2'
So for your input, we get
$ grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts "#{file}: #{count}" if count&.to_i >= 2'
./2: 2
./3: 3
Or to omit the count
grep -rnc "Hello" . | ruby -ne 'file, _ = $_.split(":"); puts file if count&.to_i >= 2'

invalid sum expression while trying to obtain sum of index

I need to take the sum of all the values present at a particular index in every line of a CSV file. The file may contain more than 50000 records, so efficiency is a given.
I was trying the following code, but it doesn't seem to be working.
#!/bin/sh
FILE=$1
# read $FILE using the file descriptors
exec 3<&0
exec 0<$FILE
while read line
do
valindex=`cut -d "," -f 3`
echo $valindex
sum=`expr $sum+$valindex`
done
echo $sum
You should initialise sum before your while loop:
sum=0
You need to cut the line you are reading:
valindex=`echo $line|cut -d "," -f 3`
You need a space before and after the plus in expr:
sum=`expr $sum + $valindex`
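Putting those three fixes together, a corrected version of the script could look like this (a sketch that reads the file with a plain redirection instead of the exec juggling):
#!/bin/sh
FILE=$1
sum=0
while read line
do
    # take the third comma-separated field of the current line
    valindex=`echo $line | cut -d "," -f 3`
    sum=`expr $sum + $valindex`
done < "$FILE"
echo $sum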
Alternatively, use awk. It's a lot simpler:
awk -F, '{SUM+=$3} END{print SUM}' $FILE
Or one of my favourite patterns:
cut -d "," -f 3 "$FILE" | paste -sd+ | bc

Unix - Need to cut a file which has multiple blanks as delimiter - awk or cut?

I need to get the records from a text file in Unix. The delimiter is multiple blanks. For example:
2U2133   1239
1290fsdsf   3234
From this, I need to extract
1239
3234
The delimiter for all records will always be 3 blanks.
I need to do this in a Unix script (.scr) and write the output to another file or use it as an input to a do-while loop. I tried the below:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then
int_1=0
else
int_2=0
fi
done < awk -F' ' '{ print $2 }' ${Directoty path}/test_file.txt
test_file.txt is the input file and file1.txt is a lookup file. But the above way is not working and giving me syntax errors near awk -F
I tried writing the output to a file. The following worked in command line:
more test_file.txt | awk -F' ' '{ print $2 }' > output.txt
This is working and writing the records to output.txt on the command line. But the same command does not work in the Unix script (it is a .scr file).
Please let me know where I am going wrong and how I can resolve this.
Thanks,
Visakh
The job of replacing multiple delimiters with just one is left to tr:
cat <file_name> | tr -s ' ' | cut -d ' ' -f 2
tr translates or deletes characters, and is perfectly suited to prepare your data for cut to work properly.
The manual states:
-s, --squeeze-repeats
replace each sequence of a repeated character that is
listed in the last specified SET, with a single occurrence
of that character
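A quick illustration on the sample records from the question (three blanks between the columns):
$ printf '2U2133   1239\n1290fsdsf   3234\n' | tr -s ' ' | cut -d ' ' -f 2
1239
3234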
It depends on the version or implementation of cut on your machine. Some versions support an option, usually -i, that means 'ignore blank fields' or, equivalently, allow multiple separators between fields. If that's supported, use:
cut -i -d' ' -f 2 data.file
If not (and it is not universal — and maybe not even widespread, since neither GNU nor MacOS X have the option), then using awk is better and more portable.
You need to pipe the output of awk into your loop, though:
awk -F' ' '{print $2}' ${Directory_path}/test_file.txt |
while read readline
do
    read_int=`echo "$readline"`
    cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
    if [ $cnt_exc -gt 0 ]
    then int_1=0
    else int_2=0
    fi
done
The only residual issue is whether the while loop is in a sub-shell and therefore not modifying your main shell script's variables, just its own copy of those variables.
With bash, you can use process substitution:
while read readline
do
    read_int=`echo "$readline"`
    cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
    if [ $cnt_exc -gt 0 ]
    then int_1=0
    else int_2=0
    fi
done < <(awk -F' ' '{print $2}' ${Directory_path}/test_file.txt)
This leaves the while loop in the current shell, but arranges for the output of the command to appear as if from a file.
The blank in ${Directory path} is not normally legal — unless it is another Bash feature I've missed out on; you also had a typo (Directoty) in one place.
Other ways of doing the same thing aside, the error in your program is this: You cannot redirect from (<) the output of another program. Turn your script around and use a pipe like this:
awk -F' ' '{ print $2 }' ${Directory path}/test_file.txt | while read readline
etc.
Besides, the use of "readline" as a variable name may or may not get you into problems.
In this particular case, you can use the following line
sed 's/  */\t/g' <file_name> | cut -f 2
to get your second column (the sed expression turns each run of spaces into a single tab, so that cut sees exactly one delimiter between fields).
In bash you can start from something like this:
for n in `cut -d " " -f 4 ${Directory_path}/test_file.txt`
do
    grep -c $n ${Directory_path}/file*.txt
done
This should have been a comment, but since I cannot comment yet, I am adding this here.
This is from an excellent answer here: https://stackoverflow.com/a/4483833/3138875
tr -s ' ' <text.txt | cut -d ' ' -f4
tr -s '<character>' squeezes multiple repeated instances of <character> into one.
It's not working in the script because of the typo ("Directoty path") in the last line of your script.
Cut isn't flexible enough. I usually use Perl for that:
cat file.txt | perl -F'   ' -ane 'print $F[1]."\n"'
Instead of a triple space after -F you can put any Perl regular expression. You access fields as $F[n], where n is the field number (counting starts at zero). This way there is no need for sed or tr.

grep invert search with context

I want to filter out several lines before and after a matching line in a file.
This will remove the line that I don't want:
$ grep -v "line that i don't want"
And this will print the 2 lines before and after the line I don't want:
$ grep -C 2 "line that i don't want"
But when I combine them it does not filter out the 2 lines before and after the line I don't want:
# does not remove 2 lines before and after the line I don't want:
$ grep -v -C 2 "line that i don't want"
How do I filter out not just the line I don't want, but also the lines before and after it? I'm guessing sed would be better for this...
Edit: I know this could be done in a few lines of awk/Perl/Python/Ruby/etc, but I want to know if there is a succinct one-liner I could run from the command line.
If the lines are all unique you could grep the lines you want to remove into a file, and then use that file to remove the lines from the original, e.g.
grep -C 2 "line I don't want" < A.txt > B.txt
grep -v -f B.txt A.txt
Give this a try:
sed 'h;:b;$b;N;N;/PATTERN/{N;d};$b;P;D' inputfile
You can vary the number of N commands before the pattern to affect the number of lines to delete.
You could programmatically build a string containing the number of N commands:
C=2 # corresponds to grep -C
N=N
for ((i = 0; i < C - 1; i++)); do N=$N";N"; done
sed "h;:b;\$b;$N;/PATTERN/{N;d};\$b;P;D" inputfile
awk 'BEGIN{n=2}{a[++i]=$0}
/dont/{
for(j=1;j<=i-(n+1);j++)print a[j];
for(o=1;o<=n;o++)getline;
delete a}
END{for(i in a)print a[i]} ' file
I solved it with two sequential greps, actually. It seems way more straightforward to me.
grep -C 2 "match" yourfile | grep -v -f - yourfile
I think #fxm27 has an excellent, bash-y answer.
I would add that you could solve this another way by using egrep if you knew in advance the patterns of the subsequent lines.
command | egrep -v "words|from|lines|you|dont|want"
That will do an "inclusive OR", meaning that a line that matches any of those will be excluded.
2019 Solution
This is a simple solution, found elsewhere:
grep --invert-match "test*"
Selects all lines not matching "test*". Super useful and easy to remember!
(Edit)
This doesn't completely answer the original question and returns the entire set of lines not matching.
