I would like to count the number of lines below a sed command and append filename to the output.
Sample file.txt
Aaaaaaa
Bbbbbbb
Ccccccc
Ddddddd
I would like to grep for Bbbbbbb, find the number of lines below it, and output that number plus the filename.
I tried this: cat ${samplename}.txt | sed -n '/Bbbbbbb/,$p' | wc -l, but the filename is not in the output.
In order to know the line where "Bbbbb" is found:
grep -n "Bbbbb" file.txt | cut -d ':' -f 1
// grep -n puts the line number in front of each match, followed by a colon.
// You get that number by splitting on the colon and taking the first field.
In order to know the number of lines in a file:
wc -l file.txt
In order to perform calculations:
echo $((43 - 7))
Just combine everything :-)
Have fun
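Put together, the three steps might look like the sketch below. The function name lines_below is made up for illustration, and the sketch assumes that only the first match matters.

```shell
#!/bin/sh
# lines_below PATTERN FILE
# Hypothetical helper (not from the original answer): prints the number of
# lines after the first line matching PATTERN, followed by the file name.
lines_below() {
    pattern=$1
    file=$2
    # grep -n prefixes each match with "N:"; keep N of the first match only.
    match=$(grep -n -e "$pattern" "$file" | head -n 1 | cut -d ':' -f 1)
    # No match: print nothing and report failure.
    [ -n "$match" ] || return 1
    # Reading from stdin keeps wc from printing the file name.
    total=$(wc -l < "$file")
    echo "$((total - match)) $file"
}
```

With the sample file from the question, lines_below Bbbbbbb file.txt prints 2 file.txt.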
I have a file named tt.txt and the contents of this file is as follows:
fdgs
jhds
fdgs
I am trying to get the similar row as the output in a text file.
my expected output is:
fdgs
fdgs
to do so, I used this command:
uniq -u tt.txt > output.txt
but it returns:
fdgs
jhds
fdgs
do you know how to fix it?
Assuming that by similar rows you mean rows with the same content:
According to the uniq man page, uniq only compares adjacent lines, so repeated lines are detected only when they sit next to each other. You therefore need to sort the input first and use the -D option to print all duplicated lines, as below. However, the -D option is specific to the GNU implementation, and sorting means the output order differs from the input order.
sort tt.txt | uniq -D
If you want the output in the same order as the input, you need to record each line's number and sort by it again afterwards, like this:
cat -n tt.txt | sort -k 2 | uniq -f 1 -D | sort -k 1,1n | sed -E 's/^\s*[0-9]+\s+//'
cat -n prints the content with line numbers
sort -k 2 sorts the input starting at the 2nd column
uniq -f 1 ignores the first column when comparing
sort -k 1,1n sorts the output back into the original line order (numerically)
sed -E 's/^\s*[0-9]+\s+//' deletes the first column with the line number (-E is needed for + to mean "one or more")
The uniq -u command outputs only the lines that are not repeated, which is the opposite of what you want.
One in awk:
$ awk '++seen[$0]==2;seen[$0]>1' file
fdgs
fdgs
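If GNU uniq -D is not available, the same order-preserving result can be sketched with awk reading the file twice: the first pass counts each line, the second prints the lines whose count exceeds one. The function name print_duplicates is an assumption for illustration.

```shell
#!/bin/sh
# print_duplicates FILE
# Hypothetical helper: prints every line that occurs more than once,
# in the original input order.
print_duplicates() {
    # NR == FNR is true only while awk reads the first copy of the file,
    # so that pass only counts; the second pass does the printing.
    awk 'NR == FNR { count[$0]++; next } count[$0] > 1' "$1" "$1"
}
```

For the tt.txt from the question, print_duplicates tt.txt prints fdgs twice.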
I need to find files where a specific string appears twice or more.
For example, for three files:
File 1:
Hello World!
File 2:
Hello World!
Hello !
File 3:
Hello World!
Hello
Hello Again.
--
I want to grep Hello and only get files 2 & 3.
What about this:
grep -o -c Hello * | awk -F: '{if ($2 > 1){print $1}}'
Since the question is tagged grep, here is a solution using only that utility and bash (no awk required):
#!/bin/bash
for file in *
do
if [ "$(grep -c "Hello" "${file}")" -gt 1 ]
then
echo "${file}"
fi
done
Can be a one-liner:
for file in *; do if [ "$(grep -c "Hello" "${file}")" -gt 1 ]; then echo "${file}"; fi; done
Explanation
You can replace the * in the for file in * statement with whatever shell expansion selects your data files.
grep -c returns the number of lines that match the pattern; multiple matches on one line still count as a single matched line.
if [ ... -gt 1 ] tests whether more than one line matched in the file. If so:
echo "${file}" prints the file name.
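Because grep -c counts matching lines rather than matches, a file whose only line is "Hello Hello" would not be reported. If every occurrence should count, one possible sketch uses grep -o, which prints each match on its own line, and counts those lines. The -o option is not in POSIX but is supported by GNU and BSD grep; the function name files_with_repeats is hypothetical.

```shell
#!/bin/sh
# files_with_repeats PATTERN FILE...
# Hypothetical helper: prints the names of files in which PATTERN
# occurs at least twice, on the same line or on different lines.
files_with_repeats() {
    pattern=$1
    shift
    for file in "$@"; do
        # Skip directories and anything else that is not a regular file.
        [ -f "$file" ] || continue
        # grep -o emits one line per match, so wc -l counts occurrences.
        n=$(( $(grep -o -e "$pattern" "$file" | wc -l) ))
        if [ "$n" -gt 1 ]; then
            echo "$file"
        fi
    done
}
```

Called as files_with_repeats Hello *, it plays the same role as the loop above.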
This awk will print the file name of all files with 2 or more Hello
awk 'FNR==1 {if (a>1) print f;a=0} /Hello/ {a++} {f=FILENAME} END {if (a>1) print f}' *
file2
file3
What you need is a grep that can recognise patterns across line endings ("hello" followed by anything (possibly even line endings), followed by "hello")
As grep processes your files line by line, it is (by itself) not the right tool for the job - unless you manage to cram the whole file into one single line.
Now, that is easy, for example using the tr command, replacing line endings by spaces:
if tr '\n' ' ' < "$file" | grep -q 'hello.*hello'
then
echo "$file matches"
fi
This is quite efficient, even on large files with many (say 100000) lines. Because grep -q exits as soon as a match is found, no extra option such as --max-count=1 is needed. It doesn't matter whether the two hellos are on the same line or not.
After reading your question, I think you also want to catch the case where hello appears twice on one line (find files where a specific string appears twice or more), so I came up with this one-liner:
awk -v p="hello" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' *
In the line above, p is the pattern you want to search for.
It prints the filename if the file contains the pattern two or more times, whether on the same line or on different lines.
While processing, as soon as two or more matches have been found, it prints the filename, stops processing the current file and moves on to the next input file, if there is one. This is helpful if you have big files.
A little test:
kent$ head f*
==> f <==
hello hello world
==> f2 <==
hello
==> f3 <==
hello
hello
SK-Arch 22:27:00 /tmp/test
kent$ awk -v p="hello" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' f*
f
f3
Another way:
grep Hello * | cut -d: -f1 | uniq -d
Grep for lines containing 'Hello'; keep only the file names; print only the duplicates.
grep -c Hello * | egrep -v ':[01]$' | sed 's/:[0-9]*$//'
Piping to a scripting language might be overkill, but it's oftentimes much easier than just using awk
grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts "#{file}: #{count}" if count&.to_i >= 2'
So for your input, we get
$ grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts "#{file}: #{count}" if count&.to_i >= 2'
./2: 2
./3: 3
Or to omit the count
grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts file if count.to_i >= 2'
In count (non-blank) lines-of-code in bash they explain how to count the number of non-empty lines.
But is there a way to count the number of blank lines in a file? By blank line I also mean lines that have spaces in them.
Another way is:
grep -cvP '\S' file
-P '\S' (Perl regex) matches any line that contains a non-space character
-v selects the non-matching lines
-c prints a count of the selected lines
If your grep doesn't support -P option, please use -E '[^[:space:]]'
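A portable variant can be sketched with a POSIX bracket expression, which works in basic regular expressions without -P or -E. The function name count_blank_lines is an assumption for illustration.

```shell
#!/bin/sh
# count_blank_lines FILE
# Hypothetical helper: counts lines that are empty or contain only
# whitespace (spaces, tabs and similar).
count_blank_lines() {
    # [^[:space:]] matches any non-whitespace character; -v keeps the
    # lines without one, and -c prints how many such lines there are.
    grep -cv '[^[:space:]]' "$1"
}
```

Note that grep exits with status 1 when the count is 0, which matters in scripts that run under set -e.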
One way using grep:
grep -c "^$" file
Or with whitespace:
grep -c "^\s*$" file
You can also use awk for this:
awk '!NF {sum += 1} END {print sum + 0}' file
From the manual, "The variable NF is set to the total number of fields in the input record". Since the default field separator is whitespace (spaces and tabs), any line consisting of either nothing or only whitespace will have NF=0.
Then it is a matter of counting how many times this happens; the + 0 makes awk print 0 rather than an empty line when there are no blank lines.
Test
$ cat a
aa dd
ddd
he llo
$ cat -vet a # -vet to show tabs and spaces
aa dd$
$
ddd$
$
^I$
he^Illo$
Now let's count the number of blank lines:
$ awk '!NF {s+=1} END {print s}' a
3
grep -v '\S' file | wc -l
(On OS X, grep does not support Perl expressions, i.e. the -P option.)
grep -cx '\s*' file
or
grep -cx '[[:space:]]*' file
That is faster than the code in Steve's answer.
Using Perl one-liner:
perl -lne '$count++ if /^\s*$/; END { print int $count }' input.file
To count how many useless blank lines your colleague has inserted in a project you can launch a one-line command like this:
blankLinesTotal=0; for file in $(find . -name "*.cpp"); do blankLines=$(grep -cvE '\S' "${file}"); blankLinesTotal=$((blankLines + blankLinesTotal)); echo "${file} has ${blankLines} empty lines."; done; echo "Total: ${blankLinesTotal}"
This prints:
<filename0>.cpp #blankLines
....
....
<filenameN>.cpp #blankLines
Total #blankLinesTotal
I have a text file which has over 60MB size. It has got entries in 5105043 lines, but when I am doing wc -l it is giving only 5105042 results which is one less than actual. Does anyone have any idea why it is happening?
Is it a common thing when the file size is large?
The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed just to print the line number of each line in your file which wc then counts. There are probably better solutions, but this works.
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try: cat -A file.txt | tail does your last line contain a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
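To check for the missing newline without modifying the file, something like the sketch below may help. Both function names are hypothetical. The second function relies on grep -c with an empty pattern, which matches every line and also counts an unterminated last line.

```shell
#!/bin/sh
# missing_final_newline FILE
# Hypothetical helper: succeeds if FILE is non-empty and its last byte
# is not a newline.
missing_final_newline() {
    # Command substitution strips a trailing newline, so the result is
    # non-empty only when the last byte is some other character.
    [ -n "$(tail -c 1 "$1")" ]
}

# count_all_lines FILE
# Hypothetical helper: counts lines, including an unterminated last line.
count_all_lines() {
    grep -c '' "$1"
}
```

For a file whose last line lacks the newline, count_all_lines reports one more than wc -l.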
60 MB is a fairly big file, but for small files one option could be
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1
In order to use the uniq command, you have to sort your file first.
But in the file I have, the order of the information is important, thus how can I keep the original format of the file but still get rid of duplicate content?
Another awk version:
awk '!_[$0]++' infile
This awk keeps the first occurrence. Same algorithm as other answers use:
awk '!($0 in lines) { print $0; lines[$0]; }'
Here's one that only needs to store duplicated lines (as opposed to all lines) using awk:
sort file | uniq -d | awk '
FNR == NR { dups[$0] }
FNR != NR && (!($0 in dups) || !lines[$0]++)
' - file
There's also the "line-number, double-sort" method.
nl -b a -n ln file | sort -u -k 2 | sort -k 1n | cut -f 2-
You can run uniq -d on the sorted version of the file to find the duplicate lines, then run some script that says:
if this_line is in duplicate_lines {
if not i_have_seen[this_line] {
output this_line
i_have_seen[this_line] = true
}
} else {
output this_line
}
Using only uniq and grep:
Create d.sh:
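#!/bin/sh
# Usage: ./d.sh infile  (appends the de-duplicated lines to infile_out)
sort "$1" | uniq > "${1}_uniq"
# Read whole lines (no word splitting) in their original order
while IFS= read -r line; do
    # The line is still in the pool only the first time it is seen
    grep -m1 -Fx -e "$line" "${1}_uniq" >> "${1}_out"
    # Drop it from the pool so later occurrences print nothing
    grep -v -Fx -e "$line" "${1}_uniq" > "${1}_uniq2"
    mv "${1}_uniq2" "${1}_uniq"
done < "$1"
rm "${1}_uniq"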
Example:
./d.sh infile
You could use some horrible O(n^2) thing, like this (Pseudo-code):
file2 = EMPTY_FILE
for each line in file1:
if not line in file2:
file2.append(line)
This is potentially rather slow, especially if implemented at the Bash level. But if your files are reasonably short, it will probably work just fine, and would be quick to implement (not line in file2 is then just grep -v, and so on).
Otherwise you could of course code up a dedicated program, using some more advanced data structure in memory to speed it up.
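As a rough illustration, the quadratic pseudo-code above could be written in shell as follows. The function name dedupe_quadratic is made up, and the membership test uses grep -q with fixed-string, whole-line matching (-F -x) so that lines containing regex characters are compared literally.

```shell
#!/bin/sh
# dedupe_quadratic INFILE OUTFILE
# Hypothetical helper: copies INFILE to OUTFILE, keeping only the first
# occurrence of each line and preserving the original order.
dedupe_quadratic() {
    in=$1
    out=$2
    : > "$out"    # start with an empty output file
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Append the line only if OUTFILE does not contain it yet.
        if ! grep -qxF -e "$line" "$out"; then
            printf '%s\n' "$line" >> "$out"
        fi
    done < "$in"
}
```

Each input line triggers a scan of the output file, which is where the O(n^2) cost comes from.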
sort file1 | uniq | while IFS= read -r line; do
    grep -n -m1 -Fx -e "$line" file1 >> out
done
sort -n out
First do the sort.
For each unique value, grep for the first match (-m1), treating the value as a fixed string that must match the whole line (-Fx),
and preserve the line numbers (-n).
Sort the output numerically (-n) by line number.
You could then remove the line numbers with sed or awk, or with cut -d: -f2-.