WC command of mac showing one less result - unix

I have a text file of over 60 MB. It contains 5105043 lines, but wc -l reports only 5105042, which is one fewer than the actual count. Does anyone have any idea why this happens?
Is this common when the file is large?

The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' yourfile | wc -l
This tells sed just to print the line number of each line in your file which wc then counts. There are probably better solutions, but this works.

The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try: cat -A file.txt | tail (cat -A is a GNU option; on macOS use cat -e instead), does your last line contain a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
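A minimal sketch of the behaviour described above, using a throwaway three-line file made with mktemp (the file contents and variable names are invented for illustration). It compares wc -l, which counts newline characters, with awk's record count, and appends a newline only when the last byte is not already one.

```shell
#!/bin/sh
# Build a sample file whose last line has no trailing newline.
sample=$(mktemp)
printf 'one\ntwo\nthree' > "$sample"

# wc -l counts newline characters, so the unterminated last line is missed.
newline_count=$(wc -l < "$sample" | tr -d ' ')

# awk counts records, including a final unterminated one.
record_count=$(awk 'END { print NR }' "$sample")

echo "wc -l: $newline_count, awk NR: $record_count"   # wc -l: 2, awk NR: 3

# Append a newline only if the last byte is not already a newline.
# Command substitution strips a trailing newline, so the result is
# empty when the file already ends with one.
if [ -n "$(tail -c 1 "$sample")" ]; then
    printf '\n' >> "$sample"
fi

fixed_count=$(wc -l < "$sample" | tr -d ' ')
echo "wc -l after fix: $fixed_count"                  # wc -l after fix: 3

rm -f "$sample"
```

Unlike an unconditional printf "\n" >> file.txt, the guard should not add an extra blank line if the fix is run a second time.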

60 MB is a fairly big file, but for small files one option could be
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1

Related

How to read nth line and mth field of text file in unix

Suppose I have a |-delimited file:
Line1: 1|2|3|4
Line2: 5|6|7|8
Line3: 9|9|1|0
Now I need to read the 3rd field of the second line, which is 7 in the example above. How can I do that using the cut or sed command? I'm new to Unix, please help.
A job for awk:
awk -F '|' 'NR==2{print $3}' file
or
awk -F '|' -v row=2 -v col=3 'NR==row{print $col}' file
Output:
7
This should work:
sed -n '2p' file |awk -F '|' '{print $3}'
This might work for you (GNU sed):
sed -rn '2s/^(([^|]*)\|?){3}.*/\2/p' file
Turn off automatic printing by setting the -n option, turn on easier regexp declaration by -r option. Use pattern matching and back references to replace the whole of the second line by the third field of the same line and print the result.
The address of the substitution command is limited to only the second line.
The regexp groups a run of non-delimiter characters followed by a delimiter, and repeats that group a specific number of times. The second (inner) group retains only the non-delimiter characters. Each repetition overwrites the previous capture, so only the last one is reported; the .* consumes the remainder of the line, so only the third field (the contents of the second group) is printed.
N.B. the delimiter is not present after the final column and is therefore optional: \|?
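The answers above can be wrapped in a small reusable form. This is only a sketch: field_at is a hypothetical helper name, the delimiter defaults to | as an assumption taken from the question, and the sample file reproduces the question's three lines. The exit in awk and the q in sed stop reading once the wanted line has been seen, which may help on large files.

```shell
#!/bin/sh
# field_at FILE ROW COL [DELIM]: print field COL of line ROW.
# (Hypothetical helper; DELIM defaults to "|".)
field_at() {
    awk -F "${4:-|}" -v row="$2" -v col="$3" \
        'NR == row { print $col; exit }' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
1|2|3|4
5|6|7|8
9|9|1|0
EOF

# awk version: third field of the second line.
with_awk=$(field_at "$sample" 2 3)

# sed + cut version, as asked for in the question:
# print line 2, quit, then cut out field 3.
with_sed_cut=$(sed -n '2{p;q;}' "$sample" | cut -d '|' -f 3)

echo "awk: $with_awk, sed+cut: $with_sed_cut"   # awk: 7, sed+cut: 7

rm -f "$sample"
```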

unix merging character and next line to one line

$ cat TPSCIS1705291200.err
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
Here I want to join each line ending in N with the following line, which begins with #, so that both end up on one line.
The expected output should be something like:
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
Please assist. I am not able to find a good way to do this, as the characters N and # are on different rows.
By default sed processes one line at a time, so a simple substitution never sees the newline between two lines. One possible trick is to first "translate" the newlines to another character and then do the sed substitution.
In this code I use the tr command to replace newlines with a form feed character ('\f'), then replace it with sed, and finally translate the remaining form feeds back to newlines:
tr '\n' '\f' < myfile | sed -e "s/N$(printf '\f')#/N,#/g" | tr '\f' '\n'
Another dirty trick above is the command substitution $(printf '\f'): it puts a literal form feed into the pattern, since not every sed understands \f in a regular expression. printf is used rather than echo because bash's builtin echo does not interpret '\f' by default.
Working code (in bash for MacOS):
-- /tmp » cat in
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
--- /tmp » tr '\n' '\f' < in | sed -e "s/N$(printf '\f')#/N,#/g" | tr '\f' '\n'
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
It appears you are just looking to merge every other line:
awk 'NR%2 { printf "%s,", $0; next} 1' input
This might work for you (GNU sed):
sed -i ':a;/N$/N;s/\n#/,#/;ta;P;D' file
If the current line ends in N and the following line begins with a # replace the newline with a , and repeat. Otherwise print the first line and repeat.
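Two further variants, offered as a sketch rather than as a replacement for the answers above. The first uses paste, a different tool from the ones already suggested, and assumes the file is strictly alternating record and error lines. The second uses awk and only assumes that error lines start with #. The sample data is copied from the question.

```shell
#!/bin/sh
sample=$(mktemp)
cat > "$sample" <<'EOF'
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
EOF

# Strict pairs: paste reads two lines per output row and joins them with ",".
paired=$(paste -d , - - < "$sample")

# Looser: attach any line starting with "#" to the line before it.
attached=$(awk '
    /^#/   { printf ",%s", $0; next }   # continuation: stay on the current row
    NR > 1 { print "" }                 # close the previous row
           { printf "%s", $0 }          # start a new row
    END    { if (NR) print "" }         # close the last row
' "$sample")

printf '%s\n' "$paired"

rm -f "$sample"
```

The awk form leaves a record untouched when no error line follows it, whereas paste would pair it with whatever line comes next.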

How to remove blank lines from a Unix file

I need to remove all the blank lines from an input file and write the result to an output file. My data is shown below.
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$ i.e. every empty line. The -i flag edits the file in-place, if your sed doesn't support that you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF variants also remove lines containing only blanks or tabs; the regex /^$/ does not.
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example (the -P option, which GNU grep provides), but there are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
Deletes all lines which consist only of blanks (or is completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.
If I execute the above command still I have blank lines in my output file. What could be the reason?
There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see previous discussion for blanks or tabs.
Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c
Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD — which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.)
You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed. You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/   /'
If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
grep . file
grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs something like this in perl will do it:
cat file.txt | perl -lane "print if /\S/"
Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank as ^$ would do.
Cheers
You can use sed's -i option to edit in-place without using a temporary file:
sed -i '/^$/d' file
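To see how the variants above differ, here is a small sketch on a made-up six-line file containing an empty line, a line of spaces and a tab, and Windows-style carriage returns. The file contents and the count_lines helper are assumptions for illustration only.

```shell
#!/bin/sh
sample=$(mktemp)
# Six lines: data, empty, spaces+tab, data, lone carriage return, data with CRLF.
printf 'a,1\n\n  \t\nb,2\n\r\nc,3\r\n' > "$sample"

# Hypothetical helper: count lines on stdin without wc's padding.
count_lines() { wc -l | tr -d ' '; }

# Only the truly empty line goes away: 5 lines remain.
empty_only=$(sed '/^$/d' "$sample" | count_lines)

# [[:space:]] also matches tab and carriage return: 3 lines remain.
space_class=$(sed '/^[[:space:]]*$/d' "$sample" | count_lines)

# Strip carriage returns first, then keep lines that have a field: 3 lines remain.
crlf_safe=$(tr -d '\r' < "$sample" | awk 'NF' | count_lines)

echo "empty only: $empty_only, space class: $space_class, tr + awk NF: $crlf_safe"

rm -f "$sample"
```

Note that the [[:space:]] version removes the blank CRLF line but leaves the carriage return on the end of c,3, while the tr version strips it from every line.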

Using grep to search DNA sequence files

I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) and consist of 'A's, 'T's, 'C's, and 'G's. They also span many, many lines, with each line being a word of 60ish characters. The problem I am having is that when I search for a specific sequence within these files, grep returns matches that occur on a single line, but not matches that span a line break (that is, have a line break somewhere in the middle). For example:
Using
$ grep -i -n "GACGGCT" grep3.txt
To search the file grep3.txt (I put the target 'GACGGCT's in double stars)
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
Returns
3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.
How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
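I assume each of your lines is 60 characters long. In that case the command below should work. Note that it joins all the lines and re-wraps them at 60 characters, so a match that straddles one of the new line boundaries would still be missed.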
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
output :
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
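If only the number of occurrences matters, one option is to delete the line breaks and let grep -o print each match on its own line. This is a sketch under assumptions: grep -o is not in POSIX (GNU and BSD grep both provide it), the sample is the six lines from the question, and the joined input is one very long line, which may use a lot of memory on a ~1 GB file.

```shell
#!/bin/sh
sample=$(mktemp)
cat > "$sample" <<'EOF'
GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
EOF

pattern=GACGGCT

# Line-by-line search: misses the occurrence split across lines 2 and 3.
per_line=$(grep -o -i "$pattern" "$sample" | wc -l | tr -d ' ')

# Delete newlines (and any carriage returns) first, then count matches.
joined=$(tr -d '\r\n' < "$sample" | grep -o -i "$pattern" | wc -l | tr -d ' ')

echo "per line: $per_line, joined: $joined"   # per line: 3, joined: 4

rm -f "$sample"
```

This counts matches but loses the original line numbers; the pcregrep answer above keeps them.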

use of grep commands in unix

I have a file and I want to sort it by a word and remove the special characters.
The grep command is used to search for the characters
-b Display the block number at the beginning of each line.
-c Display the number of matched lines.
-h Display the matched lines, but do not display the filenames.
-i Ignore case sensitivity.
-l Display the filenames, but do not display the matched lines.
-n Display the matched lines and their line numbers.
-s Silent mode.
-v Display all lines that do NOT match.
-w Match whole word
But how do I use the grep command to sort the file and remove the special characters and numbers?
grep searches inside all the files to find matching text. It doesn't really sort and it doesn't really chop and change output. What you want is probably to use the sort command
sort <filename>
and then send the output to either the awk command or the sed command, which are common tools for manipulating text:
sort <filename> | sed 's/REPLACE/NEW_TEXT/g'
something like above I'd imagine.
The following command would do it.
sort FILE | tr -d 'LIST OF SPECIAL CHARS' > NEW_FILE
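A concrete sketch of the same idea, assuming that "special characters and numbers" means everything except letters and whitespace. Instead of listing the characters to delete, tr -cd keeps only the listed classes. The sample lines are made up, and the stripping is done before sorting so the removed characters do not influence the order.

```shell
#!/bin/sh
sample=$(mktemp)
printf 'b@nana3\n#apple1\ncherry!\n' > "$sample"

# -c complements the set, -d deletes: everything that is not a letter
# or whitespace is removed, then the cleaned lines are sorted.
cleaned=$(LC_ALL=C tr -cd '[:alpha:][:space:]' < "$sample" | LC_ALL=C sort)

printf '%s\n' "$cleaned"
# apple
# bnana
# cherry

rm -f "$sample"
```

Keeping [:space:] in the set matters: without it the newlines would be deleted too and the whole file would collapse into one line.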

Resources