Finding the number of sequence matches per line - unix

I'm looking to use grep or something similar to find the total number of matches of a 5-letter sequence (AATTC) in every line of a file, and then print the results to a new file. For example:
File 1:
GGGGGAATTCGAATTC
GGGGGAATTCGGGGGG
GGGGGAATTCCAATTC
The match count for each line is then written, line by line, to another file.
File 2:
2
1
2

Awk solution:
awk '{ print gsub(/AATTC/,"") }' file1 > file2
The gsub() function returns the number of substitutions made
$ cat file2
2
1
2
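Note that gsub() rewrites the string it searches, and that it counts non-overlapping matches. A minimal sketch, assuming you also want to keep each original line next to its count (the `copy` and `n` variable names and the scratch directory are illustrative, not part of the original answer):

```shell
# Hypothetical scratch directory holding the sample input from the question.
workdir=$(mktemp -d)
printf '%s\n' GGGGGAATTCGAATTC GGGGGAATTCGGGGGG GGGGGAATTCCAATTC > "$workdir/file1"

# gsub() edits its third argument in place, so count on a copy and leave $0 untouched.
counts=$(awk '{ copy = $0; n = gsub(/AATTC/, "", copy); print $0 "\t" n }' "$workdir/file1")
printf '%s\n' "$counts"

rm -rf "$workdir"
```

Each output line is the original sequence, a tab, and the number of matches found in it.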

If you have to use grep, put it in a while loop:
$ while IFS= read -r line; do grep -o 'AATTC' <<<"$line" | wc -l; done < file1 > file2
$ cat file2
2
1
2
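Another way, using perl. The substitution returns the number of replacements made; adding 0 makes a line without any match print 0 instead of an empty line.
$ perl -ne 'print s/AATTC//g + 0, "\n"' file1 > file2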

Related

Append data from 1 file to another using AWK

I have an already existing script to check the exclusive data between 2 files and load it in 3rd file. The command is below.
var='FNR == NR {keys[$1 $2]; next} !($1 $2 in keys)'
awk -F\| "$var" file1.dat file2.dat > file3.dat
The requirement is to reuse the same command but simply append the data from file2 to file3, ignoring file1. I tried the command below, but it outputs the data from both file1 and file2. Even though two file names are passed to awk, only the data from the 2nd file should be appended.
var='{print $0}'
awk -F\| "$var" file1.dat file2.dat > file3.dat
Can anyone help with the exact command?
Below is the data in each file and expected output.
File1 (Can have 0 or more) - We should not look at this file at all
123
456
789
File2:
123
ABC
XYZ
456
Expected output in File3 (All from file2 and just ignore file1 input, but I have to have the file1 name in awk command)
123
ABC
XYZ
456
If you must pass both file1 and file2 as arguments to awk but want output from file2 only, you can use:
awk 'BEGIN {delete ARGV[1]} 1' file1 file2 > file3
123
ABC
XYZ
456
delete ARGV[1] removes the first argument (file1) from the argument list, so awk never opens it.
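Since the question asks for file2 to be appended to file3, the same idea can be combined with the shell's `>>` redirection. A small sketch under that assumption, using the question's sample data; the scratch directory and the pre-existing "existing row" line in file3.dat are made up for illustration:

```shell
workdir=$(mktemp -d)
printf '%s\n' 123 456 789 > "$workdir/file1.dat"
printf '%s\n' 123 ABC XYZ 456 > "$workdir/file2.dat"
printf '%s\n' 'existing row' > "$workdir/file3.dat"   # hypothetical earlier content

# file1.dat stays on the command line, but deleting ARGV[1] means awk never reads it.
# ">>" appends to file3.dat instead of truncating it.
awk -F'|' 'BEGIN { delete ARGV[1] } 1' "$workdir/file1.dat" "$workdir/file2.dat" >> "$workdir/file3.dat"

appended=$(cat "$workdir/file3.dat")
printf '%s\n' "$appended"
rm -rf "$workdir"
```

The earlier content of file3.dat is kept and the four lines of file2.dat follow it.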
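With your shown samples, please try the following awk code, written and tested in GNU awk. It uses nextfile to skip the first input file (file1) and reads from the 2nd file onwards. Note that the NR==1 test assumes file1 is not empty; if it can be empty, testing FILENAME==ARGV[1] instead avoids skipping file2 by mistake.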
awk 'NR==1{nextfile} 1' file1 file2
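Also remember not to waste time splitting fields you don't need. Setting FS to "^$" keeps awk from splitting each record into fields. The expression _^=FS="^$" assigns FS and, because 0^0 is 1, leaves _ equal to 1, which serves both as the ARGV index and as the always-true pattern. Here {m,g}awk stands for either mawk or gawk: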
{m,g}awk 'BEGIN { delete ARGV[_^=FS="^$"] }_' file1 file2
It is also much faster not to read the input one row at a time:
mawk2 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:01 [1.11GiB/s] [1.11GiB/s] [ <=>]
f9d2e18d22eb58e5fc2173863cff238e stdin
mawk2 'BEGIN { delete ARGV[_^=RS=FS="^$"] }_^(ORS=__)' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:00 [1.92GiB/s] [1.92GiB/s] [<=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin
Also try to avoid the slow default mode of gawk:
gawk 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:03 [ 620MiB/s] [ 620MiB/s] [ <=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin

Using awk, how to merge fields between files: F2 of file1 plus the last 8 characters of F2 in file2

I have two files, file1 and file2. I need to replace the F2 value in file1 with F2 of file1 followed by the last 8 characters of F2 in file2.
File 1 :
123456|AAAAAAA|BBBBBB|CCCCCCC
444444|kkkkkkk|rrrrrr|NNNNNNN
File 2:
AAAAAAA|DDDDDD12345678
kkkkkkk|987654321aaaaa
Expected Output
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
I tried the awk command below, but I am not sure how to fetch the last 8 characters of F2 from file2:
# awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2a[$2];print}' OFS='|' File2 File1
123456|AAAAAAADDDDDD12345678|BBBBBB|CCCCCCC
444444|kkkkkkk987654321aaaaa|rrrrrr|NNNNNNN
In order to get the last 8 characters of a[$2], you need to use substr:
substr(a[$2],length(a[$2])-7)
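The above takes the substring of a[$2] starting at position length(a[$2])-7. Since no length argument is given, it runs to the end of the string, which gives exactly the last 8 characters.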
With that one change, your code produces your desired output:
$ awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
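To see the substr arithmetic in isolation, here is a small sketch that applies the same expression to the two F2 values from File2 (the variable names `s`, `tail1` and `tail2` are only illustrative):

```shell
# length(s) - 7 is the 1-based position of the 8th character from the end.
tail1=$(awk 'BEGIN { s = "DDDDDD12345678"; print substr(s, length(s) - 7) }')
tail2=$(awk 'BEGIN { s = "987654321aaaaa"; print substr(s, length(s) - 7) }')
printf '%s\n' "$tail1" "$tail2"
```

Both strings are 14 characters long, so the substring starts at position 7 and yields 12345678 and 321aaaaa respectively.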
As Ghoti points out in the comments, the more usual awk style is to use next so as to avoid the need for the second condition, NR>FNR, as follows:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
When awk encounters next, it skips the rest of the commands and starts over on the next line.
As awk programmers often value conciseness over clarity, it is common to see the print statement replaced with a 1:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7)} 1' OFS='|' File2 File1
In this case, 1 is a condition and it always evaluates to true. Since no command is associated with that condition, the default command is executed which is print.
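A quick way to convince yourself that the trailing 1 and an explicit print are equivalent is to run both on the same input. This is only a sketch; the sample record x|y|z is made up:

```shell
# "1" is a pattern with no action, so awk falls back to the default action: print.
with_one=$(printf 'x|y|z\n' | awk -F'|' -v OFS='|' '{ $2 = "Y" } 1')
with_print=$(printf 'x|y|z\n' | awk -F'|' -v OFS='|' '{ $2 = "Y"; print }')
printf '%s\n' "$with_one" "$with_print"
```

Both commands print x|Y|z, because assigning to $2 rebuilds the record with OFS before it is printed.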

How to get a pattern from a file and search in another file in unix

I have 2 files File1 and File2.
File1 has some values separated by "|". For example,
A|a
C|c
F|f
File2 also has some values separated by "|". For example,
a|1
b|2
c|3
d|4
e|5
That is, the 2nd column of File1 corresponds to the 1st column of File2.
I have to create 3rd file File3 with expected output
A|a|1
C|c|3
I tried to take each record in loop and searched for that in File2 using "awk".
It worked, but the problem is both File1 and File2 are having more than 5 million records.
I need an optimized solution.
You can use this awk,
awk -F'|' 'NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$1,$2 }' OFS="|" file1 file2 > file3
A clearer way:
awk 'BEGIN{ OFS=FS="|";} NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$1,$2 }' file1 file2 > file3
As per @Kent's suggestion:
If file2 has more than two columns and you want all of them in file3, then:
awk 'BEGIN{ OFS=FS="|";} NR==FNR{a[$2]=$1;next} $1 in a { print a[$1],$0 }' file1 file2 > file3
Here,
FS - Field Separator
OFS - Output Field Separator
This is what join was created to do:
$ join -t '|' -o '1.1,1.2,2.2' -1 2 -2 1 file1 file2
A|a|1
C|c|3
See man join for more details, and pay particular attention to the requirement that the files be sorted on the join fields (i.e. the 2nd field for file1 and the 1st field for file2), as your posted sample input is.
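If the real files are not already sorted, they can be sorted on their join fields first. A sketch under that assumption, using a shuffled version of the sample data; the scratch directory and the .sorted file names are made up, and LC_ALL=C is set so that sort and join agree on the ordering:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'F|f' 'A|a' 'C|c' > "$workdir/file1"
printf '%s\n' 'e|5' 'c|3' 'a|1' 'd|4' 'b|2' > "$workdir/file2"

# Sort each file on its own join field: field 2 for file1, field 1 for file2.
LC_ALL=C sort -t '|' -k2,2 "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort -t '|' -k1,1 "$workdir/file2" > "$workdir/file2.sorted"

joined=$(LC_ALL=C join -t '|' -o '1.1,1.2,2.2' -1 2 -2 1 \
    "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$joined"
rm -rf "$workdir"
```

For multi-million-line inputs the two sorts dominate the run time, so whether this beats the awk hash lookup depends on the data and available memory.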

Unix command for counting the number of words which contain a letter combination (with repeats and letters in between)

How would you count the number of words in a text file which contain all of the letters a, b, and c? These letters may occur more than once in the word and the word may contain other letters as well. (For example, "cabby" should be counted.)
Using sample input which should return 2:
abc abb cabby
I tried both:
grep -E "[abc]" test.txt | wc -l
grep 'abcdef' testCount.txt | wc -l
both of which return 1 instead of 2.
Thanks in advance!
You can use awk and the return value of the sub function. sub returns the number of substitutions made: 1 if it replaced something, 0 otherwise.
$ echo "abc abb cabby" |
awk '{
    for(i=1;i<=NF;i++)
        if(sub(/a/,"",$i)>0 && sub(/b/,"",$i)>0 && sub(/c/,"",$i)>0) {
            count+=1
        }
}
END{print count}'
2
We require the return value to be greater than 0 for all three letters. The for loop iterates over every word of every line, incrementing the counter when all three letters are found in the word.
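Because sub() removes the letters it finds, the words are modified as a side effect. If that matters, a variant that only tests each word with the match operator could look like this (a sketch, not the original answer's code; `count + 0` just makes the result print as 0 when nothing matches):

```shell
# Count words containing a, b and c without modifying the fields.
count=$(echo "abc abb cabby" | awk '
    { for (i = 1; i <= NF; i++) if ($i ~ /a/ && $i ~ /b/ && $i ~ /c/) count++ }
    END { print count + 0 }')
printf '%s\n' "$count"
```

For the sample input this prints 2, since abc and cabby qualify and abb does not.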
I don't think you can get around using multiple invocations of grep. Thus I would go with (GNU grep; -E is needed so that + means "one or more"):
<file grep -owE '\w+' | grep a | grep b | grep c
Output:
abc
cabby
The first grep puts each word on a line of its own.
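The question asks for a count rather than the words themselves. One way to get it, sketched here with the sample line fed in through printf, is to let the last grep count its matching lines with -c. The bracket expression `[[:alpha:]]+` is an assumption about what counts as a word:

```shell
# Split into one word per line, filter on each letter, and count what is left.
count=$(printf '%s\n' 'abc abb cabby' |
    grep -oE '[[:alpha:]]+' | grep a | grep b | grep -c c)
printf '%s\n' "$count"
```

With the sample input the result is 2.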
Try this (with GNU sed, \n in the replacement inserts a newline):
sed 's/ /\n/g' test.txt |grep a |grep b|grep c
$ cat test.txt
abc abb cabby
$ sed 's/ /\n/g' test.txt |grep a |grep b|grep c
abc
cabby
hope this helps..

How to interleave lines from two text files

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
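The paste command above also copes with files of different lengths: once a file is exhausted, paste behaves as if it kept reading empty lines from it. A small sketch with a shortened file2 (the scratch directory is made up for the example):

```shell
workdir=$(mktemp -d)
printf '%s\n' line1.1 line1.2 line1.3 > "$workdir/file1"
printf '%s\n' line2.1 line2.2 > "$workdir/file2"

# Newline as the delimiter puts the "columns" on consecutive lines.
merged=$(paste -d '\n' "$workdir/file1" "$workdir/file2")
total=$(paste -d '\n' "$workdir/file1" "$workdir/file2" | wc -l)
printf '%s\n' "$merged"
rm -rf "$workdir"
```

Here paste emits 6 lines in total: the five real lines in interleaved order, followed by one empty line standing in for the missing third line of file2.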
Here's a solution using awk:
awk '{print; if((getline < "file2") > 0) print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if((getline < "file2") > 0) print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 has at least as many lines as file2; any extra lines in file2 are silently dropped.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if((getline < "file2") > 0) print; else print ""}' file1
or
awk '{print "1: "$0; if((getline < "file2") > 0) print "2: "$0; else print "2: "}' file1
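For the opposite case, where file2 is the longer file, one option is to drain whatever is left of it in an END block. This is a sketch rather than part of the original answer; the `other` and `line` variable names and the scratch directory are assumptions made for the example:

```shell
workdir=$(mktemp -d)
printf '%s\n' line1.1 line1.2 > "$workdir/file1"
printf '%s\n' line2.1 line2.2 line2.3 > "$workdir/file2"

interleaved=$(awk -v other="$workdir/file2" '
    # For every line of file1, also emit the next line of the other file, if any.
    { print; if ((getline line < other) > 0) print line }
    # After file1 ends, print whatever is still unread in the other file.
    END { while ((getline line < other) > 0) print line }
' "$workdir/file1")
printf '%s\n' "$interleaved"
rm -rf "$workdir"
```

Comparing the getline result with `> 0` matters here, because getline returns -1 on a read error, which a bare `if (getline ...)` would treat as true.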
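@Sujoy's answer points in a useful direction. You can add line numbers, sort on them, and strip the line numbers again. The -s flag (supported by GNU and BSD sort) keeps the sort stable, so that for equal line numbers the line from file1 stays ahead of the one from file2:
(cat -n file1 ; cat -n file2 ) | sort -s -n | cut -f2-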
Note (of interest to me): this needs a little more work to get the ordering right if, instead of static files, you use the output of commands that may run slower or faster than one another. In that case you need to add, sort on, and remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 | sort -t. -k 2.1
Here the separator is specified as "." and the sort key starts at the first character of the second field. This relies on each sample line ending in an index after the dot, so it is not a general solution.
