awk to parse text file issue - unix

I have just started learning how to use awk to parse and print text files.
I have started with the code below; who can help me?
NB: the quotes are mandatory in the output file (see the desired output).
awk '/^IPDATA=/ && /A|B|C| '{print "ADD IP ="$0"\n(\n \Ref "$1",Type vlan="$2"\"\n)\n"}' file > file1
NB: Ref is the IPREF line count; in the example here we have three counts: [2], [2] and [1].
The sample text file is in fact huge, but I have summarized it:
IPDATA=A IPID A
IPDATA=A IPREF [2] =
--- IPREF = VRID=A_1
--- IPREF = VRID=A_2
IPDATA=B IPID B
IPDATA=B IPREF [2] =
--- IPREF = VRID=B_1
--- IPREF = VRID=B_2
IPDATA=C IPID C
IPDATA=C IPREF [1] =
--- IPREF = VRID=C_1
I want to get the result below:
"ADD IP=A "
( Ref 2
"Type vlan=VRID=A_1"
"Type vlan=VRID=A_2"
)
"ADD IP=B "
( Ref 2
"Type vlan=VRID=B_1"
"Type vlan=VRID=B_2"
)
"ADD IP=C "
( Ref 1
"Type vlan=VRID=C_1"
)
thanks

Could you please try the following, written and tested with the shown samples only in GNU awk.
awk -v s1="\"" '
/^IPDATA/ && /IPID .*/{
if(FNR>1){ print ")" }
print s1 "ADD IP=" $NF OFS s1
next
}
/^IPDATA.*IPREF.*\[[0-9]+\]/{
match($0,/\[[^]]*/)
print "( Ref " substr($0,RSTART+1,RLENGTH-1)
next
}
/^--- IPREF/{
print s1 "Type vlan="$NF s1
}
END{
print ")"
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -v s1="\"" ' ##Starting awk program from here.
/^IPDATA/ && /IPID .*/{ ##Checking condition if line starts IPDATA and having IPID here.
if(FNR>1){ print ")" } ##Checking condition if FNR>1 then printing ) here.
print s1 "ADD IP=" $NF OFS s1 ##Printing the opening s1, then ADD IP= with the last field, OFS and the closing s1 here.
next ##next will skip all further statements from here.
}
/^IPDATA.*IPREF.*\[[0-9]+\]/{ ##Checking condition if line starts from IPDATA and having IPREF till [ digits ].
match($0,/\[[^]]*/) ##Using match to match from [ till ] in line.
print "( Ref " substr($0,RSTART+1,RLENGTH-1) ##Printing "( Ref " followed by the sub-string from RSTART+1 with length RLENGTH-1 here.
next
}
/^--- IPREF/{ ##Checking condition if line starts from --- IPREF then do following.
print s1 "Type vlan="$NF s1 ##Printing s1 string last field and s1 here.
}
END{ ##Starting END block of this program from here.
print ")" ##Printing ) here.
}
' Input_file ##Mentioning Input_file name here.
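For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same approach (the here-doc simply reproduces the sample above; tested against that sample only):

```shell
#!/bin/sh
# Reproduce the sample input from the question.
cat > sample.txt <<'EOF'
IPDATA=A IPID A
IPDATA=A IPREF [2] =
--- IPREF = VRID=A_1
--- IPREF = VRID=A_2
IPDATA=B IPID B
IPDATA=B IPREF [2] =
--- IPREF = VRID=B_1
--- IPREF = VRID=B_2
IPDATA=C IPID C
IPDATA=C IPREF [1] =
--- IPREF = VRID=C_1
EOF

awk -v s1='"' '
/^IPDATA/ && /IPID /{                 # header line: close the previous block, open a new one
  if(FNR>1){ print ")" }
  print s1 "ADD IP=" $NF OFS s1
  next
}
/^IPDATA.*IPREF.*\[[0-9]+\]/{         # count line: pull the number between [ and ]
  match($0,/\[[^]]*/)
  print "( Ref " substr($0,RSTART+1,RLENGTH-1)
  next
}
/^--- IPREF/{                         # detail line: quote the VRID value
  print s1 "Type vlan=" $NF s1
}
END{ print ")" }
' sample.txt
```

Running it prints the fourteen lines of the desired output, starting with "ADD IP=A ".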

Related

Is there way to extract all the duplicate records based on a particular column?

I'm trying to extract all (and only) the duplicate values from a pipe-delimited file.
My data file has 800 thousand rows with multiple columns, and I'm particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all of those duplicate rows from the file.
I am, however, able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and then I run the above through a loop as shown below:
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have a large number of records in the file, it's taking ages to complete. So I'm looking for an easy and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
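As a quick, self-contained check of the two-pass idea above (the file is recreated via a here-doc; names are illustrative):

```shell
#!/bin/sh
# Recreate the sample pipe-delimited file.
cat > file <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

# Pass 1 counts each $3 value; pass 2 prints rows whose $3 occurred more than once.
awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
```

This prints the six Unix/Linux rows in file order; the Windows and Mac rows, being unique in column 3, are dropped.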
EDIT2: As per Ed sir's suggestion, fine-tuned my suggestion with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
EDIT: Since the OP changed the Input_file to be |-delimited, changing the code a bit as follows to deal with the new Input_file (thanks to Ed Morton sir for pointing it out).
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Could you please try the following; it will give output in the same sequence in which the lines occur in the Input_file.
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation for above code:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using the match function of awk to match everything up to and including the first space.
val=substr($0,RSTART+RLENGTH) ##Creating variable val whose value is the sub-string from position RSTART+RLENGTH to the end of the line.
if(!a[val]++){ ##Checking if array a has no entry yet for index val, and incrementing its count.
b[++count]=val ##Creating array b whose index is the incremented value of variable count and whose value is val.
} ##Closing BLOCK for the if condition on array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating array c indexed by val, appending $0 to its value each time this index is seen.
d[val]++ ##Creating array d indexed by val, increasing its value by 1 each time this index is seen.
} ##Closing BLOCK for match here.
END{ ##Starting END BLOCK section for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from i=1 to till value of count here.
if(d[b[i]]>1){ ##Checking if value of array d with index b[i] is greater than 1 then go inside block.
print c[b[i]] ##Printing value of array c whose index is b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.
Another in awk:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or -k2,5 or something similar? Nope, as the delimiter changed from space to pipe.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
Second step:
Use awk as given in other, better, answers.
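A self-contained sketch of this first step (sample data via here-doc; the sed pattern here is a slightly tightened variant that anchors the match to exactly column 3 with [^|]*, since a bare .* can also stretch across later pipes; it assumes plain alphanumeric column values):

```shell
#!/bin/sh
# Recreate a small pipe-delimited report.
cat > Report.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
7|learning|Windows|Business|Requirements
EOF

# Collect duplicated column-3 values, one per line.
cut -d '|' -f3 < Report.txt | sort | uniq -d > dup.txt

# Turn each value into a grep pattern anchored to column 3, then match in one pass.
sed 's/.*/^[^|]*|[^|]*|&|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
```

This prints the four Unix/Linux rows and skips the unique Windows row.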

awk if statement with simple math

I'm just trying to do some basic calculations on a CSV file.
Data:
31590,Foo,70
28327,Bar,291
25155,Baz,583
24179,Food,694
28670,Spaz,67
22190,bawk,4431
29584,alfred,142
27698,brian,379
24372,peter,22
25064,weinberger,8
Here's my simple awk script:
#!/usr/local/bin/gawk -f
BEGIN { FPAT="([^,]*)|(\"[^\"]+\")"; OFS=","; OFMT="%.2f"; }
NR > 1
END { if ($3>1336) $4=$3*0.03; if ($3<1336) $4=$3*0.05;}1
Wrong output:
31590,Foo,70
28327,Bar,291
28327,Bar,291
25155,Baz,583
25155,Baz,583
24179,Food,694
24179,Food,694
28670,Spaz,67
28670,Spaz,67
22190,bawk,4431
22190,bawk,4431
29584,alfred,142
29584,alfred,142
27698,brian,379
27698,brian,379
24372,peter,22
24372,peter,22
25064,weinberger,8
25064,weinberger,8
Expected output:
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
The simple math is:
if field $3 > 1336, then $4 = $3 * 0.03
if field $3 < 1336, then $4 = $3 * 0.05
There's no need to force awk to recompile every record (by assigning to $4), just print the current record followed by the result of your calculation:
awk 'BEGIN{FS=OFS=","; OFMT="%.2f"} {print $0, $3*($3>1336?0.03:0.05)}' file
You shouldn't have anything in the END block:
BEGIN {
FS = OFS = ","
OFMT="%.2f"
}
{
if ($3 > 1336)
$4 = $3 * 0.03
else
$4 = $3 * 0.05
print
}
This results in
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
$ awk -F, -v OFS=, '{if ($3>1336) $4=$3*0.03; else $4=$3*0.05;} 1' data
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
Discussion
The END block is not executed at the end of each line but at the end of the whole file. Consequently, it is not helpful here.
The original code has two free standing conditions, NR>1 and 1. The default action for each is to print the line. That is why, in the "wrong output," all lines after the first were doubled in the output.
With awk:
awk -F, -v OFS=, '$3>1336?($4=$3*.03):($4=$3*.05)' file
The condition ? expr1 : expr2 form is awk's much shorter ternary operator; the parentheses around the assignments keep the parse unambiguous, and the non-zero assigned value makes the default print action fire.
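Putting the ternary inside a single assignment gives a compact version of all the answers above; here is a self-contained check against two sample rows, one on each side of the threshold:

```shell
#!/bin/sh
# Recreate two rows of the sample CSV.
cat > data.csv <<'EOF'
31590,Foo,70
22190,bawk,4431
EOF

# Append $4: 3% of $3 above the 1336 threshold, 5% below it.
awk -F, -v OFS=, '{ $4 = $3 * ($3 > 1336 ? 0.03 : 0.05) } 1' data.csv
```

This prints 31590,Foo,70,3.5 and 22190,bawk,4431,132.93.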

How to match a list of strings in two different files using a loop structure?

I have a file-processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
BEGIN {
file = ARGV[--ARGC]
while ((getline line < file) > 0) {
a[chomp(line)]++
}
RS = ""
FS = "\n"
ORS = "\n\n"
}
{
id = chomp($1)
sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script reading first the file with the IDs to exclude, and then the file containing the sequences. The script is as follows (comments make it long; it's just three useful lines in fact):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
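A self-contained run of that idea on made-up data (the IDs and sequences below are illustrative; only the file layout matches the question):

```shell
#!/bin/sh
# One ID to exclude, and three header+sequence pairs.
cat > multiple_hits.list <<'EOF'
ID2
EOF
cat > matched_sequences.list <<'EOF'
>P001 ID1
ABCD
>P002 ID2
EFGH
>P003 ID3
IJKL
EOF

awk 'FNR==NR { hits[$1]=1; next }                # first file: remember the IDs to drop
     /^>/    { grab_flag = (hits[$2] ? 0 : 1) }  # header: decide whether to keep this block
     grab_flag == 1' multiple_hits.list matched_sequences.list
```

This keeps the P001 and P003 blocks and drops the P002 block, whose ID is listed in multiple_hits.list.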

Unix for loop, prompt for value. Read value into ArrayVar[i]

I am trying to do a nested if statement inside of a while loop. I get an
unexpected "end of file" error.
while :
do
if [ "$CHOICE" != "x" -o "$CHOICE != "X" ]
then
echo "Enter two whole numbers seperated by a space ex:1 123"
read Num1 Num2
if echo "$Num1$Num2" | egrep '^[0-9]+$' 2>/dev/null
then
# Num1 and Num 2 are INTS
break
else
break
fi
else
# One of the numbers is not an INT
printf 'Error: You did not enter two whole numbers, Try Again.\n\n'
continue
fi
done
echo "$Num1 $Num2"
The problem is here:
if [ "$CHOICE" != "x" -o "$CHOICE != "X" ]
The second $CHOICE is missing its closing quote; the test should read:
if [ "$CHOICE" != "x" -o "$CHOICE" != "X" ]
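A minimal sketch of the repaired test (the CHOICE value here is illustrative). Note that beyond the quote fix, -o likely isn't what was intended: "not x OR not X" is true for every value, since nothing equals both; "neither x nor X" needs AND:

```shell
#!/bin/sh
# Quote fixed, and -o replaced with && so the test means "neither x nor X".
CHOICE="a"
if [ "$CHOICE" != "x" ] && [ "$CHOICE" != "X" ]; then
  echo "CHOICE is not an exit value"
fi
```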

extracting a pattern and a certain field from the line above it using awk and grep preferably

I have a text file like this:
********** time1 **********
line of text1
line of text1.1
line of text1.2
********** time2 **********
********** time3 **********
********** time4 **********
line of text2.1
line of text2.2
********** time5 **********
********** time6 **********
line of text3.1
I want to extract each line of text and the time (without the stars) above it, and store them in a file. (Times with no line of text beneath them have to be ignored.) I want to do this preferably with grep and awk.
So for example, my output for the above code should be
time1 : line of text1
time1 : line of text1.1
time1 : line of text1.2
time4 : line of text2.1
time4 : line of text2.2
time6 : line of text3.1
How do I go about it?
This assumes that there are no spaces in the time and that there is only one (or zero) line of text after each time marker.
awk '$1 ~ /\*+/ {prev = $2} $1 !~ /\*+/ {print prev, ":", $0}' inputfile
Works with spaces in the time:
awk '/^[^*]+/ { gsub(/\*/,"",x); printf x": "; print };{x=$0}' data.txt
You can do it like this with vim:
:%s_\*\+ \(YOUR TIME PATTERN\) \*\+\_.\(\[^*\].*\)$_\1 : \2_ | g_\*\+ YOUR TIME PATTERN \*\+_d
That is search for TIME PATTERN lines and saves the time pattern and the next line if it's not started with *. Then create the new line from them. Then delete every remaining TIME PATTERN line.
Note this assumes, that the time pattern lines are ending with *, etc.
With awk:
awk '/\*+ YOUR TIME PATTERN \*+/ { time=gensub(/\*+ (YOUR TIME PATTERN) \*+/,"\\1","g") }
! /\*+ YOUR TIME PATTERN \*+/ { print time " : " $0 }' INPUTFILE
And there are other ways to do it.
In awk, see :
#!/bin/bash
awk '
BEGIN{
t=0
}
{
if ($0 ~ " time[0-9]+ ") {
v=$2
t=1
}
else if ($0 ~ "line of text") {
if (t==1) {
printf("%s : %s\n", v, $0)
} else {
t=0;
}
}
}
' FILE
Just replace FILE by your filename.
This might work for you (GNU sed):
sed '/^\*\+ \S\+.*/!d;s/[ *]//g;$!N;/\n[^*]/!D;s/\n/ : /' file
Explanation:
Look for lines beginning with *'s; if not, delete: /^\*\+ \S\+.*/!d
Got a time line. Delete *'s and spaces (leaving the time): s/[ *]//g
Get the next line: $!N
Check that the second line doesn't begin with *'s, otherwise delete the first line: /\n[^*]/!D
Got the intended pattern; replace \n with a spaced colon and print: s/\n/ : /
awk '{ if( $0 ~ /^\*+ time[0-9] \*+$/ ) { time = $2 } else { print time " : " $0 } }' file
$ uniq -f 2 input-file | awk '{getline n; print $2 " : " n}'
If your timestamp has spaces in it, change the argument to the -f option so that uniq is only comparing the final string of *'s. E.g., use -f X, where X-2 is the number of spaces in the timestamp. Also, if there are spaces in the timestamp, the awk will need to change. Either of these will work:
$ uniq -f 3 input-file | awk -F '**********' '{getline n; print $2 " : " n}'
$ uniq -f 3 input-file | awk '{getline n; $1=""; $NF=""; print $0 ": " n }'
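For a concrete run, here is a self-contained version of the simplest idea above: remember the last header's time, and prefix each following text line with it (sample trimmed from the question; headers with no text under them naturally print nothing):

```shell
#!/bin/sh
cat > times.txt <<'EOF'
********** time1 **********
line of text1
line of text1.1
********** time2 **********
********** time3 **********
line of text2.1
EOF

awk '/^\*+ / { t = $2; next }   # header line: remember the time, print nothing yet
             { print t " : " $0 }' times.txt
```

This prints time1 : line of text1, time1 : line of text1.1, and time3 : line of text2.1.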
