Using grep to search DNA sequence files


I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) and consist of 'A's, 'T's, 'C's, and 'G's. They span many lines, each about 60 characters long. The problem is that when I search for a specific sequence, grep returns matches that occur on a single line, but not matches that span a line break. For example:
Using
$ grep -i -n "GACGGCT" grep3.txt
To search the file grep3.txt (I put the target 'GACGGCT's in double stars)
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
Returns
3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.
How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?

pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
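
If only the presence or the count of the target matters, and not the line it sits on, a simpler sketch is to delete every newline before searching. This assumes the file holds nothing but sequence lines (no header lines), and the helper name count_motif and the three-line sample file are made up for illustration. Note that grep then sees the whole file as a single line, which may need a lot of memory for a ~1 GB input, and that grep -o reports non-overlapping matches only.

```shell
# count_motif MOTIF FILE: hypothetical helper, not part of any tool.
# Deletes all newlines, then counts case-insensitive, non-overlapping matches.
count_motif() {
  tr -d '\n' < "$2" | grep -o -i -- "$1" | wc -l
}

# Small sample taken from the first three lines of the question's file.
dna_sample=$(mktemp)
printf '%s\n' \
  'GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT' \
  'CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA' \
  'CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC' > "$dna_sample"

count_motif GACGGCT "$dna_sample"
```

The count includes the occurrence that is split across the second and third lines, which a plain line-by-line grep misses.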

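I assume that each line is 60 characters long. The command below joins all lines, re-wraps the text at 60 characters, and then searches. Note that this only moves the line breaks, so a match that straddles one of the new 60-character boundaries would still be missed.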
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
output :
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC

Related

how to list a specific string or number in a file in Unix

For example, if your file has the following lines
1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459
1=10200|54=9546i96i|10=2459|2=2343i|3=otit|5=8
1=10200|5=IGY|14=897|459=122|132=1|54=9546i96i|10=2459
1=10200|2=2343i|5=0|54=9546i96i
The output should be
5=89898
5=8
5=IGY
5=0

You could use grep with the -o flag to return only the regexp matches.
Assuming you have a file.txt that you want to parse:
cat file.txt | grep -o -E "(\||^)5=[^|]*" | grep -o "5=[^|]*"
This will match anything that starts with 5= up until the first |.
By running this command on the input you provided I get:
5=89898
5=8
5=IGY
5=0
Cheers
Edit: as Walter A suggested, my previous solution did not cover all cases.
I have added an extra parsing step: first, you get all strings that match 5=... at the start of a line, or |5=..., and then you remove the |.

Use (^|[|]) for matching start of field (start of line or |) and remember/match string until next | or end-of-line.
sed -nr 's/.*(^|[|])(5=[^|]*).*/\2/p' file
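
Another way to sketch the same idea is to put every field on its own line first, so that a plain start-of-line anchor is enough. The helper name extract_field and the temporary sample file are assumptions for illustration; the data is the input from the question.

```shell
# extract_field TAG FILE: hypothetical helper.
# Splits each record on '|' into one field per line, then keeps the
# fields that start with "TAG=".
extract_field() {
  tr '|' '\n' < "$2" | grep -- "^$1="
}

fields_sample=$(mktemp)
printf '%s\n' \
  '1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459' \
  '1=10200|54=9546i96i|10=2459|2=2343i|3=otit|5=8' \
  '1=10200|5=IGY|14=897|459=122|132=1|54=9546i96i|10=2459' \
  '1=10200|2=2343i|5=0|54=9546i96i' > "$fields_sample"

extract_field 5 "$fields_sample"
```

Because the anchor applies to a whole field, 54=9546i96i is not picked up when asking for tag 5.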

unix merging character and next line to one line

$ cat TPSCIS1705291200.err
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
Here I want to merge each line ending in N with the following line, which starts with #, so that both end up on one line separated by a comma.
Expected output should be something like
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
Please assist. I am not able to find the best way to do this, as the characters N and # are on different rows.

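sed works on one line at a time, so by default a pattern cannot match the newline between two lines. One possible trick is to first "translate" the newlines to another character, then do the sed substitution.
In this code I use the tr command to replace the newlines with a form feed character ('\f'), do the substitution with sed, and finally turn the form feeds back into newlines:
tr '\n' '\f' < myfile | sed -e "s/N$(printf '\f')#/N,#/g" | tr '\f' '\n'
The other dirty trick above is the command substitution $(printf '\f'): it puts a literal form feed into the expression, since not every sed understands '\f' inside a regular expression. printf is used rather than echo because not every echo expands '\f'.
Sample run (in bash for MacOS) of an earlier form of the command, with echo in place of printf and an input whose error lines have no leading #, so the pattern there is just N plus the form feed: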
-- /tmp » cat in
0301705293504895 000003330011868452100001742N
ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
ERROR - Can not find Account:12404797-010
--- /tmp » cat in| tr '\n' '\f' | sed -e "s/N$(echo '\f')/N,#/g" | tr '\f' '\n'
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010

It appears you are just looking to merge every other line:
awk 'NR%2 { printf "%s,", $0; next} 1' input
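
The same pairing of every two lines can also be sketched with paste, which reads two lines at a time from standard input when given two - operands. The temporary sample file below is only for illustration and reuses the data from the question.

```shell
# Sample input: each data line is followed by its #ERROR line.
err_sample=$(mktemp)
printf '%s\n' \
  '0301705293504895 000003330011868452100001742N' \
  '#ERROR - Can not find Account:3504895' \
  '04117052912404797-010000005947011868455100001410N' \
  '#ERROR - Can not find Account:12404797-010' > "$err_sample"

# "- -" makes paste take two consecutive lines per output line; -d, joins them with a comma.
paste -d, - - < "$err_sample"
```

Like the awk version, this relies purely on line position, so it assumes every data line has exactly one error line after it.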

This might work for you (GNU sed):
sed -i ':a;/N$/N;s/\n#/,#/;ta;P;D' file
If the current line ends in N and the following line begins with a # replace the newline with a , and repeat. Otherwise print the first line and repeat.

Join lines depending on the line beginning

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with a space, empty line or a nonnumeric character. E.g.
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
I'd like to join the split line back onto the previous line, as shown below:
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
using a Unix command like sed/awk. I'm not clear on how to join a line with the preceding one.
Any suggestion?

awk to the rescue!
awk -v ORS='' 'NR>1 && /^[0-9]/{print "\n"} NF' file
This prints a newline only when the current line starts with a digit; otherwise the line is appended to the previous one (you may want to add a space to ORS if the line break didn't preserve the space). Empty lines are skipped, and the output does not end with a newline.
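
A slightly longer sketch of the same rule buffers the current record and joins continuation lines with a single space, so the output also ends with a newline. The function name join_split_lines and the sample file are hypothetical. It shares the assumption that only new records start with a digit, so a continuation line that happens to start with a digit would be treated as a new record.

```shell
# join_split_lines FILE: hypothetical helper.
join_split_lines() {
  awk '
    /^[0-9]/ {                  # a line starting with a digit begins a new record
      if (have) print buf       # flush the previous record first
      buf = $0; have = 1
      next
    }
    NF {                        # non-empty continuation line
      sub(/^[ \t]+/, "")        # drop leading blanks
      buf = (have ? buf " " : "") $0
      have = 1
    }
    END { if (have) print buf } # flush the last record
  ' "$1"
}

split_sample=$(mktemp)
printf '%s\n' \
  '40403813|7|Failed|No such file or directory|1' \
  '40403816|7|Hi,' \
  'The Conversion System could not be reached.|No such file or directory||1' \
  '40403818|7|Failed|No such file or directory|1' > "$split_sample"

join_split_lines "$split_sample"
```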

Don't do anything based on the values of the strings in your fields as that could go wrong. You COULD get a wrapping line that starts with a digit, for example. Instead just print after every complete record of 5 fields:
$ awk -F'|' '{rec=rec $0; nf+=NF} nf>=5{print rec; nf=0; rec=""}' file
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1

Try:
awk 'NF{printf("%s",$0 ~ /^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
OR
awk 'NF{printf("%s",/^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
For each non-empty line, this checks whether the line starts with a digit and is not the first line. If so, it prints a newline (RS) before the line; otherwise it prints the line as-is, which appends it to the previous one. The END block prints the final newline, which would otherwise be missing from the output.

If you only ever have the line split into two, you can use this sed command:
sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D' infile
This appends the next line to the pattern space, checks if the linebreak is followed by something other than a digit, and if so, removes the linebreak, prints the pattern space up to the first linebreak, then deletes the printed part.
If a single line can be broken across more than two lines, we have to loop over the substitution:
sed ':a;N;s/\n\([^[:digit:]]\)/\1/;ta;P;D' infile
This branches from ta to :a if a substitution took place.
To use with Mac OS sed, the label and branching command must be separate from the rest of the command:
sed -e ':a' -e 'N;s/\n\([^[:digit:]]\)/\1/;ta' -e 'P;D' infile

If the continuation lines always begin with a single space:
perl -0000 -lape 's/\n / /g' input
If the continuation lines can begin with an arbitrary amount of whitespace:
perl -0000 -lape 's/\n(\s+)/$1/g' input
It is probably more idiomatic to write:
perl -0777 -ape 's/\n / /g' input

You can use sed when you have a file without \r :
tr "\n" "\r" < inputfile | sed 's/\r\([^0-9]\)/\1/g' | tr '\r' '\n'

WC command of mac showing one less result

I have a text file which has over 60MB size. It has got entries in 5105043 lines, but when I am doing wc -l it is giving only 5105042 results which is one less than actual. Does anyone have any idea why it is happening?
Is it a common thing when the file size is large?

The last line probably does not end with a newline character.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed just to print the line number of each line in your file which wc then counts. There are probably better solutions, but this works.

The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try cat -A file.txt | tail (on macOS, where cat has no -A option, cat -e file.txt | tail), does your last line end with a dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
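
The difference can be reproduced on a tiny file without modifying anything. In this sketch the three-line temporary file is made up for illustration; grep -c '' and awk count lines rather than newline characters, so they include an unterminated last line.

```shell
# Three lines of text, but the last one has no trailing newline.
wc_sample=$(mktemp)
printf 'a\nb\nc' > "$wc_sample"

wc -l < "$wc_sample"                 # counts newline characters: 2
grep -c '' "$wc_sample"              # counts lines, unterminated one included: 3
awk 'END { print NR }' "$wc_sample"  # counts records, also 3
```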

60 MB is a fairly big file for this, but for small files one option is to number the lines and read the last number:
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1

use of grep commands in unix

I have a file and I want to sort it according to a word and remove the special characters.
The grep command is used to search for the characters
-b Display the block number at the beginning of each line.
-c Display the number of matched lines.
-h Display the matched lines, but do not display the filenames.
-i Ignore case sensitivity.
-l Display the filenames, but do not display the matched lines.
-n Display the matched lines and their line numbers.
-s Silent mode.
-v Display all lines that do NOT match.
-w Match whole word
But how can I use the grep command to sort the file and remove the special characters and numbers?

grep searches inside all the files to find matching text. It doesn't really sort and it doesn't really chop and change output. What you want is probably to use the sort command
sort <filename>
and the output sent to either the awk command or the sed command, which are common tools for manipulating text.
sort <filename> | sed 's/REPLACE/NEW_TEXT/g'
something like above I'd imagine.

The following command would do it.
sort FILE | tr -d 'LIST OF SPECIAL CHARS' > NEW_FILE
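
As a concrete sketch of that pipeline, the POSIX character classes [:digit:] and [:punct:] can stand in for the list of characters to drop. Whether punctuation is what "special characters" means here is an assumption, and the function name clean_and_sort and the sample file are hypothetical.

```shell
# clean_and_sort FILE: hypothetical helper.
# Deletes digits and punctuation, then sorts the remaining lines.
clean_and_sort() {
  tr -d '[:digit:][:punct:]' < "$1" | sort
}

words_sample=$(mktemp)
printf '%s\n' 'pear42!' 'apple#1' 'fig_7' > "$words_sample"

clean_and_sort "$words_sample"
```

Deleting the characters before sorting, rather than after, means the order is based on the cleaned text.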
