wc command on Mac showing one less result - unix
I have a text file that is over 60 MB in size. It contains 5105043 lines, but when I run wc -l it reports only 5105042, one less than the actual count. Does anyone have any idea why this is happening?
Is this common when the file is large?
The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed just to print the line number of each line in your file which wc then counts. There are probably better solutions, but this works.
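One of those better solutions: awk counts records rather than newline characters, so it also counts a final line that lacks a trailing newline. A small sketch (sample.txt is a throwaway file for illustration):

```shell
# A 3-line file whose last line has no trailing newline.
printf 'one\ntwo\nthree' > sample.txt

wc -l < sample.txt                  # counts newlines: 2
awk 'END { print NR }' sample.txt   # counts records: 3
```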
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try cat -A file.txt | tail (on macOS, cat -e file.txt | tail), does your last line end with a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
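A minimal way to reproduce the discrepancy and verify the fix, using a throwaway file.txt:

```shell
# Reproduce: the last line lacks a newline, so wc -l under-counts.
printf 'one\ntwo\nthree' > file.txt
wc -l < file.txt      # reports 2

# Fix: append the missing newline.
printf '\n' >> file.txt
wc -l < file.txt      # reports 3
```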
60 MB seems a bit big for this, but for smaller files one option could be:
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1
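Because cat -n numbers even a final line that lacks a trailing newline, the last number it prints is the true line count. A sketch of the idea, with the number padding stripped (the file name is illustrative):

```shell
printf 'a\nb\nc' > file.txt   # last line has no newline

# Number the lines, keep the last number, strip the padding.
cat -n file.txt | tail -1 | cut -f1 | tr -d ' '   # prints 3
```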
Related
How to read nth line and mth field of text file in unix
Suppose I have a |-delimited file:
Line1: 1|2|3|4
Line2: 5|6|7|8
Line3: 9|9|1|0
Now I need to read the 3rd field of the second line, which is 7 in the example above. How can I do that using the cut or sed command? I'm new to Unix, please help.
A job for awk:
awk -F '|' 'NR==2{print $3}' file
or
awk -F '|' -v row=2 -v col=3 'NR==row{print $col}' file
Output:
7
This should work:
sed -n '2p' file | awk -F '|' '{print $3}'
This might work for you (GNU sed):
sed -rn '2s/^(([^|]*)\|?){3}.*/\2/p' file
Turn off automatic printing by setting the -n option, and turn on easier regexp declaration with the -r option. Use pattern matching and back references to replace the whole of the second line with the third field of that line, and print the result. The address of the substitution command limits it to the second line only. The regexp groups the non-delimiter characters followed by a delimiter, repeated a specific number of times. The second group retains only the non-delimiter characters; each repetition overwrites it, so it ends up holding the last field matched. The .* consumes the remainder of the line, and so only the third field (the contents of the second group) is printed.
N.B. The delimiter may be absent after the final column and is therefore made optional: \|?
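Since the question asks for cut or sed, the two can also be combined: sed picks the line, cut picks the field. A sketch on the sample data (data.txt is a throwaway file):

```shell
# Sample |-delimited data from the question.
printf '1|2|3|4\n5|6|7|8\n9|9|1|0\n' > data.txt

# Pick line 2 with sed, then field 3 with cut.
sed -n '2p' data.txt | cut -d'|' -f3   # prints 7
```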
unix merging character and next line to one line
$ cat TPSCIS1705291200.err
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
Here I am looking to join the line ending in the character N with the next line starting with #, so they come out on one line. Expected output should be something like:
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
Please assist; I am not able to find the best way to do this, as the character N and the # are on different rows.
sed won't match newlines. One possible trick is to first "translate" them to another character, do the sed substitution, and then translate back. In this code I use the tr command to replace newlines with a form-feed character (\f), do the replacement with sed, and finally turn the remaining form feeds back into newlines:
cat myfile | tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
Another dirty trick above is the command substitution around echo '\f', since the form-feed character cannot easily be typed as part of the regular expression either. Working code (in bash on macOS):
$ cat in
0301705293504895 000003330011868452100001742N
ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
ERROR - Can not find Account:12404797-010
$ cat in | tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
It appears you are just looking to merge every other line:
awk 'NR%2 { printf "%s,", $0; next} 1' input
This might work for you (GNU sed):
sed -i ':a;/N$/N;s/\n#/,#/;ta;P;D' file
If the current line ends in N and the following line begins with a #, replace the newline with a , and repeat. Otherwise print the first line and repeat.
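The awk one-liner above can be checked on a minimal two-line sample (the data is illustrative):

```shell
# Join each odd-numbered line to the next line with a comma.
printf 'recordN\n#ERROR - details\n' |
  awk 'NR%2 { printf "%s,", $0; next } 1'
# prints: recordN,#ERROR - details
```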
How to remove blank lines from a Unix file
I need to remove all the blank lines from an input file and write the result to an output file. Here is my data:
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$, i.e. every empty line. The -i flag edits the file in place; if your sed doesn't support that, you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines), then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: if you also want to remove trailing whitespace at the end of every line:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
Note: the NF tests also remove lines containing only blanks or tabs; the regex /^$/ does not.
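That difference is easy to see on a small file containing both an empty line and a whitespace-only line (demo.txt is a throwaway file):

```shell
# One empty line and one line holding only a space and a tab.
printf 'a\n\n \t\nb\n' > demo.txt

awk 'NF' demo.txt        # prints only a and b
grep -v '^$' demo.txt    # still prints the whitespace-only line
```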
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines containing only whitespace, you can still use grep. This version uses Perl regular expressions (grep -P, where available):
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
This deletes all lines which consist only of blanks (or are completely empty). You can change the blank to [ \t], where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in place, if that's what you want, with the -i option.

If I execute the above command I still have blank lines in my output file. What could be the reason?

There could be several reasons. It might be that you don't have blank lines, but you have lots of spaces at the end of a line, so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see the previous discussion for blanks or tabs.

Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output

If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c

Judging from the comments that sed -i does not work for you, you are not working on Linux, Mac OS X or BSD; which platform are you working on? (AIX, Solaris and HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.) You can try the POSIX named character classes, such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed.
You can test whether your sed supports the POSIX classes with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/   /'
(the replacement text is three spaces). If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
grep . file
grep looks at your file line by line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
With awk:
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs, something like this in perl will do it:
perl -lane 'print if /\S/' file.txt
Of course there are awk and sed equivalents. Best not to assume the lines are totally blank, as ^$ would do. Cheers
You can use sed's -i option to edit in place without using a temporary file:
sed -i '/^$/d' file
Using grep to search DNA sequence files
I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) of 'A's, 'T's, 'C's, and 'G's. These files also span many, many lines, with each line being a word of 60-ish characters. The problem I am having is that when I search for a specific sequence within these files, grep will return results for the pattern when it occurs on a single line, but not if the pattern spans a line (has a line break somewhere in the middle). For example, using
$ grep -i -n "GACGGCT" grep3.txt
to search the file grep3.txt (I put the target 'GACGGCT's in double stars):
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
returns
3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3. How can I use grep to find target sequences that may or may not include a line break at any point in the string? Or how can I tell grep to ignore line breaks in the target string? Is there a simple way to do this?
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
I assume each of your lines is 60 characters long. Then the command below should work:
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
Output:
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
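Another sketch, if line numbers are not needed: delete every newline first, then count the occurrences with grep -o (supported by GNU and BSD grep). The file dna.txt and the short sample sequence here are illustrative:

```shell
# 'GACGGCT' is split across the line break in this sample.
printf 'GGGACG\nGCTAAA\n' > dna.txt

# Remove all newlines, then count matches of the pattern.
tr -d '\n' < dna.txt | grep -o 'GACGGCT' | wc -l   # 1 match
```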
use of grep commands in unix
I have a file and I want to sort it according to a word and to remove the special characters. The grep command is used to search for characters:
-b Display the block number at the beginning of each line.
-c Display the number of matched lines.
-h Display the matched lines, but do not display the filenames.
-i Ignore case sensitivity.
-l Display the filenames, but do not display the matched lines.
-n Display the matched lines and their line numbers.
-s Silent mode.
-v Display all lines that do NOT match.
-w Match whole word.
But how do I use the grep command to sort the file and remove the special characters and numbers?
grep searches inside files to find matching text. It doesn't really sort and it doesn't really chop and change output. What you probably want is the sort command:
sort <filename>
with the output sent to either awk or sed, which are common tools for manipulating text:
sort <filename> | sed 's/REPLACE/NEW_TEXT/g'
Something like the above, I'd imagine.
The following command would do it:
sort FILE | tr -d 'LIST OF SPECIAL CHARS' > NEW_FILE
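For example, assuming the special characters to delete are # and ! (the file name and character list here are placeholders):

```shell
printf 'b#anana\na!pple\n' > words.txt

# Sort the lines, then strip the unwanted characters.
sort words.txt | tr -d '#!' > cleaned.txt
cat cleaned.txt   # apple, then banana
```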