Difference between line numbers of cat file | nl and wc -l file

I have a file with e.g. 9818 lines. When I use wc -l file, I see 9818 lines. When I open the file in vi, I see 9818 lines. When I :set number, I see 9818 lines. But when I run cat file | nl, the final line number is 9750 (e.g.). Basically I'm asking why the line numbers from cat file | nl and wc -l file do not match.

wc -l: counts all lines
nl: numbers only non-empty lines by default (body style -bt)
Try:
nl -ba: numbers all lines, including blank ones
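A minimal demonstration of the difference (demo.txt is just an invented example file with a blank second line):

$ printf 'one\n\nthree\n' > demo.txt
$ wc -l demo.txt
3 demo.txt
$ nl demo.txt | tail -n 1
     2	three
$ nl -ba demo.txt | tail -n 1
     3	three

The blank line gets no number under the default body style, so plain nl ends at 2 while wc -l reports 3, which is exactly the kind of mismatch in the question.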

nl(1) says the default is for header and footer lines not to be numbered (-hn -fn), and those sections are delimited by lines consisting only of repeated \: characters. Perhaps your input file includes some of these?
I suggest reading the output of nl line by line against the cat -n output to see where things diverge. Or use diff -u if you want to take the fun out of reading 9818 lines. :)

nl does not number blank lines by default, so this is almost certainly the reason. If you can point us to the file, we can confirm that, but I suspect this is the case.

Related

Excluding options for AWK, separating breadXI from breadX

I'm manipulating lines from a .vcf file where bread is listed 1 through 20 in Roman numerals. I only want the lines corresponding to bread 10, so I've used
awk '/breadX/ {print}' file.vcf > Test.txt
to output a list of lines containing "breadX" to Test.txt. That much works, however it also includes "breadXI" through "breadXX" in the list. Is there a way to exclude the cases I don't want, assuming "breadX" can appear out of order and towards the middle of the file (XIV...X...XX), and that there is more information on each line? I only want lines for bread 10, not any of the other numerals. Any help would be appreciated.
Lacking a definitive data sample showing what might follow the breadX, just exclude every string where one of the Roman numeral symbols I, V, X, L, D, M follows:
$ awk '/^breadX([^IVXLDM]|$)/' file
Sample test file:
$ cat file
breadX
breadXI
breadX2
3
Test it:
$ awk '/^breadX([^IVXLDM]|$)/' file
Output:
breadX
breadX2
If breadX is a standalone word, you can use word boundaries to limit your search.
cat file
test breadXI more
hi breadX yes
cat home breadXX
awk '/\<breadX\>/' file
hi breadX yes
\< matches the start of a word
\> matches the end of a word
PS: you do not need the print, since printing is the default action when the test is true.
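As an aside, grep's whole-word match does the same filtering; a quick sketch on the sample file above (grep -w is widely supported, though not strictly POSIX):

$ grep -w 'breadX' file
hi breadX yes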

Grep but very literal

This question might have been asked a million times before, but I didn't see my exact case.
Suppose a text file contains:
a
ab
bac
Now I want to grep on ‘a’ and have a hit only on the 1st line. After the ‘a’ there’s always a [tab] character.
Anyone any ideas?
Thanks!
Ronald
Try this:
head -1 *.txt | grep -P "a\t"
head will give you the specified number of lines of each file (all .txt files in my example), and grep -P uses regular expressions as defined by Perl (Perl has \t for tab).
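If your grep lacks -P, a sketch of an alternative is to put a literal tab into the pattern with the shell's $'...' quoting (bash/ksh/zsh) and anchor the match at the start of the line:

$ grep $'^a\t' file.txt

This matches only lines beginning with a followed by a tab, with no dependence on Perl regex support.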

xargs does not read line by line properly

Here is my current commands which i use:
xargs -n 1 -P 100 php b <links
I have a script 'b' and a file of links in 'links'. Sometimes, and I don't know why, this doesn't work correctly and a "?" symbol gets added to every link,
like this:
root 19714 0.0 1.0 19880 5480 pts/2 R+ 11:19 0:00 php b http://www.google.com?
root 19715 0.0 0.9 19524 4892 pts/2 R+ 11:19 0:00 php b http://www.alexa.com?
root 19716 0.0 1.0 19880 5496 pts/2 R+ 11:19 0:00 php b http://www.amazon.com?
See, at the end of each line of the ps aux output there is a "?" symbol, and my file doesn't contain that... but it doesn't happen every time... what could be the problem?
Converting comments into an answer
Could your links file have CRLF (Windows or DOS style) line endings? In which case, the ? represents the CR (carriage return)… And the fix is to convert the DOS file into a Unix file somewhere along the line. For example:
tr -d '\r' < links | xargs -n 1 -P 100 php b
I build the links file with cat "[another file]" | sort | uniq > links. So I don't think it is a Windows/DOS style issue.
Does [another file] have DOS-format line endings? If so, links will too; the pipeline preserves CRLF line endings if they're in the input.
Also, the use of cat is often unnecessary (UUoC, or Useless Use of Cat), and in this case uniq is not needed either; you could use
sort -u "[another file]" > links
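To check whether a file really has CRLF endings, a quick sketch using standard tools (the sample output here is assumed):

$ file links
links: ASCII text, with CRLF line terminators
$ cat -v links | head -n 1
http://www.google.com^M

cat -v makes the carriage return visible as ^M.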
[another file] is like this:
line1\r\nline2\r\nline3\r\n
The \r\n is one of the ways the CRLF line endings are reported on Unix. The \r notation is used for carriage return, CR, in C and other languages; the \n notation is used for newline (NL, aka line feed or LF).
BTW: If I paste the same file with copy/paste via vi or nano it works.
The editors may be converting the file format for you automatically. See: How to convert the ^M linebreak to normal linebreak in a file opened in vim? Alternatively, if you are copying and pasting, then the copy may well not see, and therefore not copy, the CR characters.
I tried with tr -d … and it works.
Good; that at least means this is a reasonable explanation, and it all makes sense. If you don't need the list of links separately, you can use:
sort -u "[another file]" | tr -d '\r' | xargs -n 1 -P 100 php b
This does it all in one command line with no intermediate files to clean up. You could add tee links | before xargs if you do want the links file too, of course.

Command line: output lines that are specified in another file

I am searching for a command line that takes a text file and a file with line numbers (one per line, alternatively from stdin) and outputs only those lines from the first file.
The text file may be several hundred MB large and the line list may contain several thousand entries (sorted ascending).
in short:
one file contains data
another file contains indexes
a command should extract only indexed lines
first file:
many lines
of course they are all very different
and contain very important data
...
more lines
...
even more lines
second file:
1
5
7
expected output:
many lines
more lines
even more lines
The second (line number) file does not necessarily have to exist; its data may also come from stdin (indeed, that would be the optimum). The format of that data may also vary from what is shown, if that makes the task easier.
This can be an approach:
$ awk 'FNR==NR {a[$1]; next} FNR in a' file_with_line_numbers file_with_data
many lines
more lines
even more lines
It reads file_with_line_numbers and stores the numbers as keys of the array a[]. Then it reads the other file, checking whether each line number is a key in the array, in which case the line is printed.
The trick used is the following:
awk 'FNR==NR {something; next} {other things}' file1 file2
FNR is the record number within the current file, while NR counts records across all files, so FNR==NR holds only while reading file1. The {something} block therefore performs the actions related to file1, and the {other things} block the actions related to file2.
What if the line numbers are given through stdin?
For this you can use awk '...' - file, where stdin is named by -. This is called naming standard input. So you can do:
your_commands | awk 'FNR==NR {a[$1]; next} FNR in a' - file_with_data
Test
$ echo "1
5
7" | awk 'FNR==NR {a[$1]; next} FNR in a' - file_with_data
many lines
more lines
even more lines
With sed, convert the line numbers to a sed program, and use that generated program to print out the wanted lines:
$ sed -n "$( sed 's/$/p/' second_file )" first_file
many lines
more lines
even more lines
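To see the program the inner sed generates, run it on its own; each line number simply gets a p (print) command appended:

$ sed 's/$/p/' second_file
1p
5p
7p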
This works too; note the backquotes (csh/tcsh syntax):
foreach line ( `cat file2` )
foreach? sed -n "$line p" file1
foreach? end
many lines
more lines
even more lines
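Since the data file may be several hundred MB, a small tweak to the awk approach (a sketch, assuming the requested line numbers are unique) lets it stop reading as soon as the last requested line has been printed:

$ awk 'FNR==NR {a[$1]; n++; next} FNR in a {print; if (!--n) exit}' file_with_line_numbers file_with_data
many lines
more lines
even more lines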

wc command on Mac showing one less result

I have a text file which is over 60 MB in size. It has entries on 5105043 lines, but when I do wc -l it gives only 5105042, which is one less than the actual count. Does anyone have any idea why this is happening?
Is it a common thing when the file size is large?
The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed to print just the line number of each line in your file, which wc then counts. There are probably better solutions, but this works.
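A quick illustration of why the counts differ and why this trick compensates (printf deliberately omits the final newline):

$ printf 'a\nb' | wc -l
1
$ printf 'a\nb' | sed -n '=' | wc -l
2

wc -l counts only the single newline character, while sed numbers both lines, including the unterminated last one.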
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try cat -A file.txt | tail, does your last line contain a trailing dollar sign ($)?
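For example (the second line deliberately lacks a trailing newline, so no $ appears after it; on macOS/BSD, cat -e is the equivalent of GNU cat -A here):

$ printf 'complete\npartial' | cat -A
complete$
partial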
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
60 MB seems a bit big for this, but for small files one option could be
cat -n file.txt
OR
cat -n file.txt | cut -f1 | tail -1
