I am learning file comparison using awk.
I found syntax like the example below:
awk '
(NR == FNR) {
    s[$0]
    next
}
{
    for (i=1; i<=NF; i++)
        if ($i in s)
            delete s[$i]
}
END {
    for (i in s)
        print i
}' tests tests2
I couldn't understand the syntax. Can you please explain it in detail?
What exactly does it do?
awk '                             # use awk
(NR == FNR) {                     # process the first file
    s[$0]                         # hash the whole record into array s
    next                          # move on to the next record of the first file
}
{                                 # process the second file
    for (i=1; i<=NF; i++)         # for each field in the record
        if ($i in s)              # if the field value is found in the hash
            delete s[$i]          # delete the value from the hash
}
END {                             # after processing both files
    for (i in s)                  # all leftover values in s
        print i                   # are output
}' tests tests2
For example, for files:
tests:
1
2
3
tests2:
1 2
4 5
the program would output:
3
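The heart of the idiom is NR == FNR: NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. You can watch the counters diverge with a quick one-liner on the same two files:
$ awk '{ print FILENAME, NR, FNR }' tests tests2
tests 1 1
tests 2 2
tests 3 3
tests2 4 1
tests2 5 2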
I need your help with the scenario below.
File 1: Is a config file
a|b|c|d|e|f|g
File 2: Input file
a|c|d|g
I have to compare "File 1" and "File 2" and print the following from File 2:
a||c|d|||g
So basically I need to compare the records field by field: for matching fields I have to print the value from File 2, and where there is no match I have to put NULLs.
I think I can interpret what you're asking: you want to "remap" File2 onto the headers of File1.
$ cat File1
first|last|email|phone|address|colour|size
$ cat File2
first|last|phone|size
Glenn|Jackman|555-555-1212|L
$ awk -F '|' '
NR == FNR {
    n = NF
    for (i=1; i<=NF; i++) head[i] = $i
    print
    next
}
FNR == 1 {
    for (i=1; i<=NF; i++) f2head[i] = $i
    next
}
{
    split("", data)    # reset for each row so values cannot leak between records
    for (i=1; i<=NF; i++) data[f2head[i]] = $i
    for (i=1; i<=n; i++)
        printf "%s%s", data[head[i]], (i<n ? FS : RS)
}
' File1 File2
first|last|email|phone|address|colour|size
Glenn|Jackman||555-555-1212|||L
From the first block, the head array will be:
{1:"first",2:"last",3:"email",4:"phone",5:"address",6:"colour",7:"size"}
From the second block, the f2head array will be:
{1:"first",2:"last",3:"phone",4:"size"}
In the third block, the data array will be
{"first":"Glenn","last":"Jackman","phone":"555-555-1212","size":"L"}
Another awk solution:
$ awk 'BEGIN {FS=OFS="|"};
NR==1 {n=split($0,ht);
for(i=1;i<=n;i++) h[ht[i]]=i; next}
NR==2 {n2=split($0,h2)}
{split($0,t); $0="";
for(i=1;i<=n2;i++) $(h[h2[i]])=t[i]}1' file1 file2
seems cryptic, but in essence it does the following:
find the index of each header2 (h2) value in header1 (h) and set the corresponding field ($(h[h2[i]])) from the file2 values (t[i]).
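For readability, here is the same logic spelled out with comments (same behavior; note that the NR==2 line has no next, so it falls through and File2's header row is itself remapped and printed):
awk '
BEGIN { FS = OFS = "|" }
NR == 1 {                       # file1 header
    n = split($0, ht)           # ht[i] = i-th header name
    for (i = 1; i <= n; i++)
        h[ht[i]] = i            # h[name] = column position in file1
    next
}
NR == 2 { n2 = split($0, h2) }  # file2 header: h2[i] = name at position i
{                               # every file2 line (including its header)
    split($0, t)                # t[i] = i-th field of the current line
    $0 = ""                     # start from an empty record
    for (i = 1; i <= n2; i++)
        $(h[h2[i]]) = t[i]      # drop each value into its file1 position
}
1                               # print the rebuilt record
' file1 file2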
I have a detail.txt file, which contains:
cat >detail.txt
Student ID,Student Name, Percentage
101,A,75
102,B,77
103,C,34
104,D,42
105,E,75
106,F,42
107,G,77
1. I want to print concatenated output grouped by Percentage, with the student names printed on a single line separated by commas (,).
Expected Output:
75-A,E
77-B,G
42-D,F
34-C
For the above I worked out how to achieve this for a single value such as 75, 77, or 42, but I did not get how to write code that groups by the third field (Percentage).
I tried the code below:
awk -F"," '{OFS=",";if($3=="75") print $2}' detail.txt
2. I want to get output based on the grading system given below.
marks < 45 = THIRD
marks >= 45 and marks < 60 = SECOND
marks >= 60 and marks <= 75 = FIRST
marks > 75 = DIST
Expected Output:
DIST:B,G
FIRST:A,E
THIRD:C,D,F
Please help me get the expected output. Thank you.
awk solution:
awk -F, 'NR>1{
  if ($3<45) k="THIRD"; else if ($3>=45 && $3<60) k="SECOND";
  else if ($3>=60 && $3<=75) k="FIRST"; else k="DIST";
  a[k] = a[k]? a[k]","$2 : $2;
}END{ for(i in a) print i":"a[i] }' detail.txt
k - variable assigned the "grading system" name by one of the if (...) <exp>; else if (...) <exp> ... branches
a[k] - array a indexed by the determined "grading system" name k
a[k] = a[k]? a[k]","$2 : $2 - all "student names" (the 2nd field, $2) are accumulated/grouped under the matching "grading system" key
The output:
DIST:B,G
THIRD:C,D,F
FIRST:A,E
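The groups print in an arbitrary order because for (i in a) visits array keys in an unspecified order. If you want sorted output and have GNU awk, you can set the traversal order in the END block, for example:
END{ PROCINFO["sorted_in"]="@ind_str_asc"; for(i in a) print i":"a[i] }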
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR>1 {
    stud = $2
    pct  = $3
    # band boundaries per the question: <45, 45-59, 60-75, >75
    if      ( pct < 45 )  { band = "THIRD" }
    else if ( pct < 60 )  { band = "SECOND" }
    else if ( pct <= 75 ) { band = "FIRST" }
    else                  { band = "DIST" }
    pcts[pct][stud]
    bands[band][stud]
}
END {
    for (pct in pcts) {
        out = ""
        for (stud in pcts[pct]) {
            out = (out == "" ? pct "-" : out OFS) stud
        }
        print out
    }
    print "----"
    for (band in bands) {
        out = ""
        for (stud in bands[band]) {
            out = (out == "" ? band ":" : out OFS) stud
        }
        print out
    }
}
$ gawk -f tst.awk file
34-C
42-D,F
75-A,E
77-B,G
----
DIST:B,G
THIRD:C,D,F
FIRST:A,E
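A note on that script: pcts[pct][stud] and bands[band][stud] rely on gawk's arrays of arrays, which is why GNU awk is required. In POSIX awk you would emulate the grouping with a composite key instead, e.g.:
pcts[pct, stud]     # one flat array; the key is pct SUBSEP stud
though the END loops would then need to split() the combined key back apart, so the array-of-arrays form is genuinely simpler here.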
For your first question, the following awk one-liner should do:
awk -F, 'NR>1{a[$3]=a[$3] (a[$3] ? "," : "") $2} END {for(i in a) printf "%s-%s\n", i, a[i]}' input.txt
The second question can work almost the same way, storing your mark divisions in an array (each entry keyed by the lowest mark in its band, assuming whole-number marks), then stepping through that array to determine the subscript for a new array:
BEGIN { FS=","; m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[75]="DIST" } { for (i=0;i<=100;i++) if ((i in m) && $3 > i) mdiv=m[i]; marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2 } END { for(i in marks) printf "%s:%s\n", i, marks[i] }
But this is unreadable. When you need this level of complexity, you're past the point of a one-liner. :)
So, combining the two and breaking them out for easier reading (and commenting), we get the following:
BEGIN {
    FS=","
    m[0]="THIRD"                            # each entry: lowest mark in the band
    m[45]="SECOND"
    m[60]="FIRST"
    m[76]="DIST"                            # DIST starts above 75 (whole-number marks assumed)
}
NR>1 {                                      # skip the header line
    a[$3]=a[$3] (a[$3] ? "," : "") $2       # Build an array with percentage as the index
    for (i=0;i<=100;i++)                    # Walk through the possible marks
        if ((i in m) && $3 >= i) mdiv=m[i]  # selecting the correct divider on the way
    marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2
                                            # then build another array with divider
                                            # as the index
}
END {                                       # Once we've processed all the input,
    for(i in a)                             # step through the first array,
        printf "%s-%s\n", i, a[i]           # printing the results.
    print "----"
    for(i in marks)                         # step through the second array,
        printf "%s:%s\n", i, marks[i]       # printing the results.
}
You may be wondering why we use for (i=0;i<=100;i++) instead of simply for (i in m). This is because awk does not guarantee the order in which it walks array elements, and when stepping through the m array it's important that we see the keys in increasing order, so that the highest band the mark reaches is the one that sticks.
I have n files; in these files a specific column named "thrudate" appears at a different column number in each file.
I just want to extract the values of this column from all the files in one go, so I tried using awk. Here I'm considering only one file and extracting the values of thrudate:
awk -F, -v header=1,head="" '{for(j=1;j<=2;j++){if($header==1){for(i=1;i<=$NF;i++){if($i=="thrudate"){$head=$i;$header=0;break}}} elif($header==0){print $0}}}' file | head -10
How I have approached it:
used the find command to find all the similar files, then executed the next step for every file
looped over all fields in the first row, comparing each column name while header is 1 (initialized to 1 so that only the first row is checked); once a field matched 'thrudate' I set header to 0 and broke out of the loop
once I had the column number, printed that column for every row
You can use the following awk script:
print_col.awk:
# Find the column number in the first line of a file
FNR==1{
    for(n=1;n<=NF;n++) {
        if($n == header) {
            next    # found it: n keeps this column number for the rest of this file
        }
    }
}
# Print that column on all other lines
{
    print $n
}
Then use find to execute this script on every file:
find ... -exec awk -v header="foo" -f print_col.awk {} +
In comments you've asked for a version that could print multiple columns based on their header names. You may use the following script for that:
print_cols.awk:
BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for(i in a) {
        h[a[i]]=1
    }
}
# Find the column numbers in the first line of a file
FNR==1{
    split("", cols) # This will re-init cols
    for(i=1;i<=NF;i++) {
        if($i in h) {
            cols[i]=1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for(i=1;i<=NF;i++) {
        if(i in cols) {
            s = res ? OFS : ""
            res = res s $i
        }
    }
    if (res) {
        print res
    }
}
Call it like this:
find ... -exec awk -v header="foo,bar,test" -f print_cols.awk {} +
I want to replace the second occurrence of a pattern in Unix.
Input File:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID|12
1234|25345|TaskID|dksj|TaskID|kdjfdsjf|12|TaskID
123425|65345|TaskID|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID|dksj|kdjfdsjf|12
Sample Output file:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID2|dksj|kdjfdsjf|TaskID3|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID2
123425|65345|TaskID3|dksj|kdjfdsjf|12|TaskID4
123425|15325|TaskID|dksj|kdjfdsjf|12
Your example does not match your question, so I'll only show how to replace every second match of the given pattern.
Use awk; it's a very powerful tool for command-line text processing.
replace.sh is as follows:
#!/bin/sh
# reads stdin; replaces every second match of $1 with $2
awk -v search="$1" -v repl="$2" '
BEGIN {
    flag = 0
}
{
    len = split($0, a, search)    # pieces of the line between matches (search is a regex)
    for (f = 1; f < len; f += 1) {
        printf "%s%s", a[f], (flag % 2 == 0 ? search : repl)
        flag += 1
    }
    printf "%s%s", a[len], ORS
}
'
cat input.txt | ./replace.sh TaskID TaskID1
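For the sample input above this prints the following. Note that flag keeps counting across lines, so "every second match" is counted through the whole file, not per line:
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID1|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID
123425|65345|TaskID1|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID1|dksj|kdjfdsjf|12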
I have a file-processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with the list of IDs stored in multiple_hits.list, then create a new matched_sequences file similar to the original but excluding all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
# trim leading and trailing whitespace (including a trailing CR)
function chomp(s) {
    sub(/^[ \t]*/, "", s)
    sub(/[ \t\r]*$/, "", s)
    return s
}
BEGIN {
    # pull the last file name off the argument list and read it here,
    # so the main loop only ever sees matched_sequences.list
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++        # remember every ID to exclude
    }
    RS = ""                     # paragraph mode: each blank-line-separated
    FS = "\n"                   # block is one record, one line per field
    ORS = "\n\n"
}
{
    id = chomp($1)              # $1 is the ">Pnnn ID" header line
    sub(/^.* /, "", id)         # keep only its last word: the ID
}
!(id in a)                      # print records whose ID is not excluded
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script reading first the file with the IDs to exclude, and then the file containing the sequences. The script is as follows (the comments make it long; it's just three useful lines in fact):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list