Need code in awk (Unix) or use substr - unix

I want to print the words that appear inside "ctr{...}" and count how many times each word occurs in a file.
I tried:
sed -n 's/.*ctr{\(.[^}]*\).*/\1/p' file
but it did not find all the words, just one word.
The file is:
962796057604|mar0101|0|00000107A20E00000A6C331650B920340C00|0|0|400019FD7DBFBF7F|1001|962796057604|0 |01001|||-1|795971936| 00962795971936|16||-1| 00962795971936|-1|0|2|0|416019000659493|0||||||0|0|2012.12.01 00:07:09|12|30|0|516|16|1|2012.12.01 00:06:39|1|0||202|20001||0B12F1001104697209100300000000000000|1|1|11000|0|0||0881006972091003F000||0 714F610045584E6|000000000000|3|1|0000000000000000|0|140|0|0|0|0|0|0|||0|2|||||||||||||||||||||0|||0| |0|1|143|acf{0}cif{0}fcf{0}con{0}cuf{0}ctr{**Mo7afazat**}cgpa{962796057604}vlr{0096279001300}cff{0}roaf{0}mpty{0}ftksn{JMT}ftksr{0001}ftktp{CallTicketCPOCS} ||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962796298894|mar0101|0|000001028225AE4AD868A8B750B900980C00|1|0|4000018001000002||962796298894|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|0|||||||||||||0|0|||3797|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|244|tid{111210532409329884}pfid{20}gob{1}rid{globitel} afid{}uid1{962796298894}aid1{1}ar1{0}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC COMMIT AMOUNT 0}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**JaishanaIN**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
1|34|2012.12.01 00:08:35|12|4|100-50-0-962796605155|mar0101|0|00000102A20400000A6A439D50B920520C00|0|0|400019FD7DBFBF7F|1001|962796605155|1 6||||-1|b116c||16||-1||-1|0|0|0|416017002233360|0||||||0|0|1970.01.01 02:00:00|0|0|0|220|0|1|1970.01.01 02:00:00|1|0||194|0||000000000000000000000000000000000000|0|0||0|0||00000000000000000000||0000000000 000000|000000000000|0|0|0000000000000000|0|370|0|0|0|0|0|0|||0|0|||||||||||||||||||||0|||0||0|1|70|a cf{3}ussd{1}ctr{**ZainElKul**}ftksn{JMT}ftksr{0001}ftktp{CallTicketCPOCS}||
1|34|2012.12.01 00:08:35|12|4|100-10-0
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797611253|mar0101|0|0000010282B54BD015FF4C4B50B8F96E0C00|1|0|4000018001000002||962797611253|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|0|||||||||||||0|0|||885|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|243|tid{111220371293561120}pfid{20}gob{1}rid{globitel} afid{}uid1{962797611253}aid1{1}ar1{0}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC COMMIT AMOUNT 0}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**ZainElKul**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
-962795292027|mar0101|0|00000101A20200000A6A96B750B920300C00|0|0|400019FD7DBFBF7F|1001|962795292027|0 |01004|||-1|797196452| 00962797196452|16||-1| 00962797196452|-1|0|2|0|416018002276781|0||||||0|0|2012.12.01 00:07:09|12|12|23|516|16|1|2012.12.01 00:06:34|1|0||202|1||0B12F1001104697209100300000000000000|1|1|11000|0|0||0881006972091003F000||0714F 6100455AD67|000000000000|3|1|0000000000000000|0|30|0|0|0|0|0|0|||0|0|||||||||||||||||||||0|||0||0|1| 171|acf{0}cif{0}fcf{0}con{0}cuf{0}ctr{ZainUnlimited}cgpa{962795292027}vlr{0096279001300}cff{0}roaf{0}mpty{0}cacc{1;0;30}cquo{1;230;}ftksn{JMT}ftksr{000 1}ftktp{CallTicketCPOCS}||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962796012818|mar0101|0|0000010882218115085D5F9150B920520C00|0|0|4000018001000002||962796012818|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||70|0|0|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|258|tid{111221366974701289}pfid{17}gob{1}rid{globitel} afid{}uid1{962796012818}aid1{1}ar1{-2147483648}uid2{}aid2{-1}pid{DEFAULT_DECISION}pur{!GDRC Balance Check}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**AlBarakehNew**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797251349|mar0101|0|0000010282A451483EDFCFD350B920400C00|1|0|4000018001000002||962797251349|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|0|||||||||||||0|0|||440|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|245|tid{111211342745325133}pfid{20}gob{1}rid{globitel} afid{}uid1{962797251349}aid1{1}ar1{0}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC COMMIT AMOUNT 0}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{0}ctr{**ZainElKulSN**}ftksn{JMT}ftksr{0001}ftktp{PayCallTicket}||
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-

It looks like you are just missing the counts. The easiest way to get them is to pipe your output through sort and uniq -c:
$ sed -n 's/.*ctr{\(.[^}]*\).*/\1/p' file | sort | uniq -c
1 **AlBarakehNew**
1 **JaishanaIN**
1 **Mo7afazat**
2 **ZainElKul**
1 **ZainElKulSN**
1 ZainUnlimited
Another way, using GNU awk only (the three-argument match() used here is a gawk extension):
$ awk 'match($0,".*ctr{([^}]*)}.*",m){a[m[1]]++}END{for(i in a) print i,a[i]}' file
ZainUnlimited 1
**ZainElKulSN** 1
**Mo7afazat** 1
**ZainElKul** 2
**JaishanaIN** 1
**AlBarakehNew** 1
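Since the question title also asks about substr, here is a sketch in portable (POSIX) awk that uses index() and substr() instead of the gawk-only match() extension; it assumes at most one ctr{...} per line, as in the sample data:
$ awk '{
    i = index($0, "ctr{")                          # position of the tag, 0 if absent
    if (i) {
      rest = substr($0, i + 4)                     # text after "ctr{"
      word = substr(rest, 1, index(rest, "}") - 1) # text up to the closing brace
      count[word]++
    }
  } END { for (w in count) print count[w], w }' file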

When searching for matches in files, grep is the best choice more often than not.
Using grep with a positive lookbehind and uniq -c:
$ grep -Po "(?<=ctr{)[^}]+" file | uniq -c
1 Mo7afazat
1 JaishanaIN
2 ZainElKul
1 ZainUnlimited
1 AlBarakehNew
1 ZainElKulSN
From man uniq:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
For files where the duplicates are not adjacent, pipe to sort first; however, the order in which each match is found in the original file will then be lost:
grep -Po "(?<=ctr{)[^}]+" file | sort | uniq -c
1 AlBarakehNew
1 JaishanaIN
1 Mo7afazat
2 ZainElKul
1 ZainElKulSN
1 ZainUnlimited
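If you want the counts but also want to keep the order in which each value first appears, one option (a sketch, not part of the original answers) is to let awk do the counting:
$ grep -Po "(?<=ctr{)[^}]+" file |
  awk '!seen[$0]++ { order[++n] = $0 }   # remember first-appearance order
       { cnt[$0]++ }                     # count every occurrence
       END { for (i = 1; i <= n; i++) print cnt[order[i]], order[i] }'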

Related

Finding the number of sequence matches per line

I'm looking to use grep or something similar to find the total number of matches of a 5-letter sequence (AATTC) on every line of a file, and then print the results to a new file. For example:
File 1:
GGGGGAATTCGAATTC
GGGGGAATTCGGGGGG
GGGGGAATTCCAATTC
Then, in another file, it prints the number of matches line by line:
File 2:
2
1
2
Awk solution:
awk '{ print gsub(/AATTC/,"") }' file1 > file2
The gsub() function returns the number of substitutions made:
$ cat file2
2
1
2
If you have to use grep, put it in a while loop (grep -c counts matching lines rather than matches, which is why grep -o piped to wc -l is used here):
$ while read -r line; do grep -o 'AATTC' <<< "$line" | wc -l >> file2; done < file1
$ cat file2
2
1
2
Another way is to use perl:
$ perl -ne 'print s/AATTC/x/g ."\n"' file1 > file2
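One caveat worth noting (an observation added here, not part of the original answer): when a line contains no matches, s///g returns an empty string rather than 0, so the corresponding output line would be blank. Forcing numeric context avoids that:
$ perl -ne 'print 0+(s/AATTC/x/g), "\n"' file1 > file2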

Count number of unique genes by type

I'm new to this, so if this has been answered somewhere else, please kindly refer me to that question. I searched extensively and there are somewhat similar questions, but none are really applicable to my problem.
I want to count the number of unique names per class. I have a sheet with a list of names (column 1) and their class (column 2). I need to know how many unique names are in the list, per class. The list is tab delimited.
I think probably awk will be able to solve this quickly, but I'm really not that skilled in awk.
Example input:
Name Class
ABCD protein-coding
ABCD protein-coding
DCFG lincRNA
GTFR lincRNA
Desired output:
Class Count
protein-coding 1
lincRNA 2
$ cat f
Name Class
ABCD protein-coding
ABCD protein-coding
DCFG lincRNA
GTFR lincRNA
$ awk 'FNR>1{a[$2]+=!( ($1,$2) in b); b[$1,$2]}END{for(i in a)print i, a[i]}' f
lincRNA 2
protein-coding 1
You can skip the header line, sort the file, and then count the unique values:
code:
sed 1d test_file.txt | sort | uniq | awk '{print $2}' | uniq -c
Output:
1 protein-coding
2 lincRNA
With GNU awk for true multi-dimensional arrays:
$ awk 'NR>1{names[$2][$1]} END{for (class in names) print class, length(names[class])}' file
lincRNA 2
protein-coding 1
With any awk:
$ awk 'NR>1{if (!seen[$2,$1]++) cnt[$2]++} END{for (class in cnt) print class, cnt[class]}' file
lincRNA 2
protein-coding 1
If working with uniq and sort, a solution may be:
sed 1d input.tsv | sort -t $'\t' | uniq | awk '{print $2}' | uniq -c | awk 'BEGIN{print "Class\tCount"}{print $2"\t"$1}'
I skipped the header with sed 1d and produced tab-separated output with a Class/Count header line.

Count lines that contain a repeated pattern

I have a file (File.txt) that contains 10 lines, which may contain a repeated pattern at a certain position (14-17):
fsdf sfkljkl4565
fjjf lmlkfdm1235
fkljfgdfgdfg6583
eretjioijolj6933
ioj ijijsfoi4565
dgodiiopkpok6933
fsj opkjfiej4565
ihfzejijjijf4565
dfsdkfjlfeff1235
dijdijojijdz4565
The desired output is a count of the lines that contain each pattern:
#occurrences pattern
5 4565
2 1235
1 6583
2 6933
I have tried to filter the file:
cat File.txt | cut -c14-17 | sort -n -K1,1-1,3 >> File_Filtered.txt
I need your help to add the first column (#occurrences).
To get a count of repeats, use uniq -c. (In the sample data the pattern actually starts at column 13, hence cut -c13-17 rather than -c14-17.) Thus, try:
$ cut -c13-17 File.txt | sort -n | uniq -c | sort -nr
5 4565
2 6933
2 1235
1 6583
The above was tested using Linux with GNU utilities. (Judging by your sample code, you may be using different tools.)
Including a header
The following includes the header and uses column -t to ensure that everything lines up nicely:
$ { echo '#occurrences pattern'; cut -c13-17 File.txt | sort -n | uniq -c | sort -nr; } | column -t
#occurrences pattern
5 4565
2 6933
2 1235
1 6583
$ awk '{cnt[substr($0,13)]++} END{for (i in cnt) print cnt[i], i}' file
2 6933
1 6583
5 4565
2 1235

Compare 2 files in Unix: file1 (2M numbers/rows/lines), file2 (2,000,480 numbers/rows/lines)

How can I compare these 2 big files in Unix?
I've already tried using 'grep -Fxvf file1.txt file2.txt | wc -l', but the output is 2,000,480, and when switching file1 and file2 the output is 1,999,999.
How can I get an output of '480', because that's what I am expecting?
I've also tried using the diff/cmp commands, but the output is too complicated.
I think you want the absolute value of the difference in line counts between the 2 files. You can achieve that easily with awk and get a decent result: read the number of lines of each file into an array and subtract the two values in the END block. In pure shell it has to get a bit more complex. Imagine you generate some test data (a 10-line and a 14-line file):
$ seq 1 10 > ten
$ seq 1 14 > fourteen
And then you do (note that xargs -J is a BSD xargs extension; GNU xargs does not have it):
$ ( wc -l ten ; wc -l fourteen ) | awk '{ print $1}' | sort -rn | xargs -J % echo % - p | dc
The result:
4
But a much better way would be to just do it in 3 steps: get the line count of file1, then of file2, and then subtract, as sketched below.
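For instance, a minimal sketch of both ideas, using the file1.txt and file2.txt names from the question:
$ a=$(wc -l < file1.txt)
$ b=$(wc -l < file2.txt)
$ echo $(( a > b ? a - b : b - a ))
or, doing the counting and the subtraction in a single awk pass:
$ awk 'FNR == 1 { n++ } { c[n]++ } END { d = c[1] - c[2]; print (d < 0 ? -d : d) }' file1.txt file2.txt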

Using grep -f and -w together

I have two files like this:
abc.txt
a
b
z
1
10
and abcd.txt
a
b
c
d
1
10
100
1000
I would like:
a
b
1
10
I would like to use grep -wf abc.txt abcd.txt to search through every line of abc.txt and print the lines which match an entire word. If I just use grep -f, I also get the line 100, since the pattern '10' matches '100'. But grep -w -f abc.txt abcd.txt produces:
a
b
1
and doesn't print out the 10. So, I guess, what is the best way to match every line of abc.txt against the entire lines of abcd.txt?
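One approach worth trying (a suggestion added here; the original question has no answer attached): grep's -x option matches whole lines, so combining it with -F (fixed strings) and -f compares every line of abcd.txt against the complete lines of abc.txt, which should print a, b, 1 and 10 for the sample files above:
$ grep -Fxf abc.txt abcd.txt
If 10 still fails to match even with -w or -x, a common culprit is trailing whitespace or carriage returns in abc.txt.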
