How to concatenate values based on another field in unix
I have a file detail.txt, which contains:
cat >detail.txt
Student ID,Student Name, Percentage
101,A,75
102,B,77
103,C,34
104,D,42
105,E,75
106,F,42
107,G,77
1. I want to print output grouped by Percentage: for each percentage value, print the student names on a single line, separated by commas (,).
Expected Output:
75-A,E
77-B,G
42-D,F
34-C
For the above question, I worked out how to achieve this for a single value such as 75, 77, or 42, but I could not work out how to group on the third field (Percentage). I tried the code below:
awk -F"," '{OFS=",";if($3=="75") print $2}' detail.txt
2. I want to get output based on the grading system given below.
marks < 45 = THIRD
marks >= 45 and marks < 60 = SECOND
marks >= 60 and marks <= 75 = FIRST
marks > 75 = DIST
Expected Output:
DIST:B,G
FIRST:A,E
THIRD:C,D,F
Please help me get the expected output. Thank you.
awk solution:
awk -F, 'NR>1{
    if ($3<45) k="THIRD"; else if ($3>=45 && $3<60) k="SECOND";
    else if ($3>=60 && $3<=75) k="FIRST"; else k="DIST";
    a[k] = a[k]? a[k]","$2 : $2;
}END{ for(i in a) print i":"a[i] }' detail.txt
k - the variable that is assigned the grading-system name by the if (...) <exp>; else if (...) <exp> ... chain
a[k] - array a is indexed by the determined grading-system name k
a[k] = a[k]? a[k]","$2 : $2 - all student names (the 2nd field, $2) are accumulated/grouped under the matching grading-system name
The output:
DIST:B,G
THIRD:C,D,F
FIRST:A,E
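The same accumulation idiom also answers the first question, grouping on $3 directly; a minimal sketch, assuming the same detail.txt with its header line (NR>1 skips it):

awk -F, 'NR>1{ p[$3] = p[$3] ? p[$3] "," $2 : $2 }
END{ for(i in p) print i "-" p[i] }' detail.txt

As with the grading version, for(i in p) makes no ordering guarantee, so the groups may print in any order.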
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR>1 {
    stud = $2
    pct  = $3
    if      ( pct < 45 )  { band = "THIRD" }   # marks < 45
    else if ( pct < 60 )  { band = "SECOND" }  # 45 <= marks < 60
    else if ( pct <= 75 ) { band = "FIRST" }   # 60 <= marks <= 75
    else                  { band = "DIST" }    # marks > 75
    pcts[pct][stud]
    bands[band][stud]
}
END {
    for (pct in pcts) {
        out = ""
        for (stud in pcts[pct]) {
            out = (out == "" ? pct "-" : out OFS) stud
        }
        print out
    }
    print "----"
    for (band in bands) {
        out = ""
        for (stud in bands[band]) {
            out = (out == "" ? band ":" : out OFS) stud
        }
        print out
    }
}
$ gawk -f tst.awk file
34-C
42-D,F
75-A,E
77-B,G
----
DIST:B,G
THIRD:C,D,F
FIRST:A,E
For your first question, the following awk one-liner should do (NR>1 skips the header line):
awk -F, 'NR>1{a[$3]=a[$3] (a[$3] ? "," : "") $2} END {for(i in a) printf "%s-%s\n", i, a[i]}' detail.txt
The second question can work almost the same way: store your mark divisions in an array, then step through that array to determine the subscript for a new array. Each division marks where a band starts; since the marks are integers and marks > 75 means DIST, the DIST division is 76. Again, NR>1 skips the header line:
BEGIN { FS=","; m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[76]="DIST" } NR>1 { for (i=0;i<=100;i++) if ((i in m) && $3 >= i) mdiv=m[i]; marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2 } END { for(i in marks) printf "%s:%s\n", i, marks[i] }
But this is unreadable. When you need this level of complexity, you're past the point of a one-liner. :)
So .. combining the two and breaking them out for easier reading (and commenting) we get the following:
BEGIN {
    FS=","
    m[0]="THIRD"
    m[45]="SECOND"
    m[60]="FIRST"
    m[76]="DIST"                            # DIST starts at 76: marks are integers and marks > 75 means DIST
}
NR>1 {                                      # Skip the header line
    a[$3]=a[$3] (a[$3] ? "," : "") $2       # Build an array with percentage as the index
    for (i=0;i<=100;i++)                    # Walk through the possible marks
        if ((i in m) && $3 >= i) mdiv=m[i]  # selecting the correct divider on the way
    marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2
                                            # then build another array with divider
                                            # as the index
}
END {                                       # Once we've processed all the input,
    for(i in a)                             # step through the array,
        printf "%s-%s\n", i, a[i]           # printing the results.
    print "----"
    for(i in marks)                         # step through the array,
        printf "%s:%s\n", i, marks[i]       # printing the results.
}
You may be wondering why we use for (i=0;i<=100;i++) instead of simply for (i in m). This is because awk does not guarantee the order of array elements, and when stepping through the m array it's important that we see the keys in increasing order.
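If you happen to have GNU awk 4.0 or later, a gawk-only alternative is to request ordered traversal via PROCINFO["sorted_in"], which avoids scanning 0..100. A minimal sketch of the same banding loop under that assumption:

BEGIN {
    FS=","
    m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[76]="DIST"
    PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk-only: for-in visits indices in ascending numeric order
}
NR>1 {
    for (i in m)                 # guaranteed order: 0, 45, 60, 76
        if ($3 >= i+0)           # i is a string index, so force a numeric comparison with i+0
            mdiv = m[i]
    marks[mdiv] = marks[mdiv] (marks[mdiv] ? "," : "") $2
}
END { for (i in marks) printf "%s:%s\n", i, marks[i] }

Note that PROCINFO["sorted_in"] affects every for-in loop, including the one in END, so the grade lines also come out in a deterministic order.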
Related
Explain the AWK syntax in detail (NR, FNR, NF)
I am learning file comparison using awk. I found syntax like the one below:

awk '(NR == FNR) { s[$0]; next }
     { for (i=1; i<=NF; i++) if ($i in s) delete s[$i] }
     END { for (i in s) print i }' tests tests2

I couldn't understand the syntax. Can you please explain in detail what exactly it does?
awk '                      # use awk
(NR == FNR) {              # process first file
  s[$0]                    # hash the whole record to array s
  next                     # process the next record of the first file
}
{                          # process the second file
  for (i=1; i<=NF; i++)    # for each field in record
    if ($i in s)           # if field value found in the hash
      delete s[$i]         # delete the value from the hash
}
END {                      # after processing both files
  for (i in s)             # all leftover values in s
    print i                # are output
}' tests tests2

For example, for these files --

tests:
1
2
3

tests2:
1
2
4
5

-- the program would output:

3
Get a specific column number by column name using awk
I have n files, and a specific column named "thrudate" appears at a different column number in each of them. I just want to extract the values of this column from all the files in one go, so I tried awk. Here, considering only one file, I am extracting the values of thrudate:

awk -F, -v header=1,head="" '{for(j=1;j<=2;j++){if($header==1){for(i=1;i<=$NF;i++){if($i=="thrudate"){$head=$i;$header=0;break}}} elif($header==0){print $0}}}' file | head -10

How I have approached it:
use the find command to find all the similar files, then execute the second step for every file
loop over all the fields in the first row, checking the column name, with header initialized to 1 so only the first row is checked; once it matches 'thrudate', set header to 0 and break out of the loop
once I have the column number, print that column for every row
You can use the following awk script, print_col.awk:

# Find the column number in the first line of a file
FNR==1{
    for(n=1;n<=NF;n++) {
        if($n == header) {
            next
        }
    }
}
# Print that column on all other lines
{ print $n }

Then use find to execute this script on every file:

find ... -exec awk -v header="foo" -f print_col.awk {} +

In comments you've asked for a version that could print multiple columns based on their header names. You may use the following script for that, print_cols.awk:

BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for(i in a) {
        h[a[i]]=1
    }
}
# Find the column numbers in the first line of a file
FNR==1{
    split("", cols)   # This will re-init cols for each new file
    for(i=1;i<=NF;i++) {
        if($i in h) {
            cols[i]=1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for(i=1;i<=NF;i++) {
        if(i in cols) {
            s = res ? OFS : ""
            res = res "" s "" $i
        }
    }
    if (res) {
        print res
    }
}

Call it like this:

find ... -exec awk -v header="foo,bar,test" -f print_cols.awk {} +
Improve AWK script - Compare and add fields
I have a CSV file separated by semicolons; the file contains a sentiment analysis of customer reviews. The reviews are grouped by fields 1 and 6. What I have to do is add up the last (numeric) field of each line of a group and then compare the sum with field 3: if they match, the comparison yields 1, otherwise 0. At the same time, I also have to compare fields 6 and 7, applying the same rule as above. Finally, I calculate the number of 1s obtained for each of the two comparisons separately. I have the script below, but I think it could be improved. Any suggestions? Also, I am not sure that the script is correct!

BEGIN {
    OFS=FS=";";
    flag="";
    counter1=0;
    counter2=0;
    counter3=0;
}
{
    number=$1;
    topic=$6;
    id=number";"topic;
    if (id != flag) {
        for (i in topics) {
            if ((sum < 0) && (polarity[i] == "negative") || (sum > 0) && (polarity[i] == "positive")) {
                hit_2=1;
                counter2++;
            } else {
                hit_2=0;
            }
            s=split(topics[i],words,";")
            hit_1=0;
            for (k=1;k<=s;k++) {
                if ((words[k] == words[k+1]) && (words[k] != "") || (words[k] == "NULL") && (hit_2 == 1)) {
                    hit_1=1;
                }
            }
            if (hit_1 == 1) {
                counter1++;
            }
            print to_print[i]";"hit_1";"hit_2;
        }
        delete topics;
        delete to_print;
        delete polarity;
        counter3++;
        sum="";
        flag=id;
    }
    sum += $(NF-1);
    topics[$1";"$6]=topics[$1";"$6] ";"$6";"$7;
    to_print[$1";"$6]=$1";"$2";"$3";"$4";"$5";"$6
    polarity[$1";"$6]=$3;
}
END {
    print ""
    print "#### - sentiments: "counter3" - topic: "counter1" - polarity: "counter2;
}

A portion of the input data:

100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;place;place;good;good;2.000000;
100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;place;place;not longer;not longer;-3.000000;
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;food;food;lousy;lousy;-3.000000;
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;food;food;too sweet;too sweet;3.600000;
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;portions;portion;tiny;tiny;-1.000000;
103269521;"FOOD#QUALITY";positive;1032695;10326952;duck breast special;visit;visit;incredible;incredible;4.000000;

Output:

100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;1;1
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;1;1
103269521;"FOOD#QUALITY";positive;1032695;10326952;duck breast special;0;1

#### - sentiments: 57 - topic: 28 - polarity: 39
I rewrote part of it; perhaps you can extend this further.

$ awk -F';' -v OFS=';' '
    { key=$1 FS $3 FS $6
      sum[key]+=$(NF-1)
      line[key]=$1 FS $2 FS $3 FS $4 FS $5 FS $6
      sign[key]=($3=="negative"?-1:1) }
    END{ for(k in sum) print line[k], (sum[k]*sign[k]<0?0:1), sum[k], sign[k] }' data
100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;1;-1;-1
103269521;"FOOD#QUALITY";positive;1032695;10326952;duck breast special;1;4;1
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;1;-0.4;-1

This only does your first check (I guess hit_2), while also adding sum and sign information (the last two fields).
awk count and sum based on slab:
I would like to extract all the lines from the first file (gzipped, i.e. Input.csv.gz): whenever the first file's 4th field falls within the range formed by the second file's (Slab.csv) first field (StartRange) and second field (EndRange), produce a slab-wise count of rows and the sums of the first file's 4th and 5th fields.

Input.csv.gz (gunzipped):

Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2

Slab.csv:

StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000

Expected Output:

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3

I am using the two commands below to get the above output, except for the "NotFound" cases:

awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv

Op_step2.csv:

101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3

Any suggestions to make this a single command that achieves the expected output? I don't have perl or python access.
Here is another option using perl, which takes advantage of multi-dimensional arrays and hashes:

perl -F, -lane'
    BEGIN {
        $x = pop;
        ## Create an array of arrays from the start and end ranges:
        ## @range = ( [0,0], [1,10], ... )
        (undef, @range) = map { chomp; [split /,/] } <>;
        @ARGV = $x;
    }
    ## Skip the first line
    next if $. == 1;
    ## Create a hash of hashes:
    ## $line{"start end"} = { count => ..., sum4 => sum_of_col4, sum5 => sum_of_col5 }
    for (@range) {
        if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
            $line{"@$_"}{"count"}++;
            $line{"@$_"}{"sum4"} += $F[3];
            $line{"@$_"}{"sum5"} += $F[4];
        }
    }
    }{
    print "StartRange,EndRange,Count,Sum-4,Sum-5";
    print join ",", @$_,
        $line{"@$_"}{"count"} // "NotFound",
        $line{"@$_"}{"sum4"}  // "NotFound",
        $line{"@$_"}{"sum5"}  // "NotFound"
        for @range;
' slab input

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:

awk '
BEGIN {
    FS = OFS = SUBSEP = ","
    print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR { ranges[$1,$2]++; next }
{
    for (range in ranges) {
        split(range, tmp, SUBSEP)
        if ($4 >= tmp[1] && $4 <= tmp[2]) {
            count[range]++; sum4[range]+=$4; sum5[range]+=$5; next
        }
    }
}
END {
    for (range in ranges)
        print range,
            (count[range] ? count[range] : "NotFound"),
            (sum4[range]  ? sum4[range]  : "NotFound"),
            (sum5[range]  ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
}' slab input

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,NotFound,NotFound
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3

How it works: set the input and output field separators and SUBSEP to ,. Print the header line. Skip the first line of each file. Load the entire Slab.csv into an array called ranges. For every range in the ranges array, split the key to get the start and end of the range; if the 4th column falls in the range, increment the count array and add the values to the sum4 and sum5 arrays appropriately. In the END block, iterate through the ranges and print them, piping the output to sort to get it in order.
How to check if the variable value in AWK script is null or empty?
I am using an AWK script to process some logs. At one place I need to check whether a variable's value is null or empty in order to make a decision. Any idea how to achieve this?

awk '
{
    {
        split($i, keyVal, "#")
        key=keyVal[1];
        val=keyVal[2];
        if(val ~ /^ *$/)
            val="Y";
    }
}
' File

I have tried with:
1) if(val == "")
2) if(val ~ /^ *$/)
It is not working in either case.
The comparison with "" should have worked, so that's a bit odd. As one more alternative, you could use the length() function: if it returns zero, your variable is null/empty. E.g.:

if (length(val) == 0)

Also, perhaps the built-in variable NF (number of fields) could come in handy. Since we don't have access to your input data it's hard to say, but it's another possibility. A runnable sketch follows below.
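A minimal, runnable sketch of that length() check, reusing the asker's split-on-# pattern (the field position $1 and the default "Y" are taken from the question; the rest is an assumption):

awk '{
    split($1, keyVal, "#")    # "key#value" -> keyVal[1]="key", keyVal[2]="value"
    val = keyVal[2]
    if (length(val) == 0)     # true when the part after "#" is missing or empty
        val = "Y"
    print val
}' File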
You can directly use the variable without a comparison; an empty/null/zero value is considered false, everything else true. For example:

# setting default tag if not provided
if (! tag) {
    tag = "default-tag"
}

So this script will leave the variable tag with the value default-tag, except when the user calls it like this:

$ awk -v tag=custom-tag -f script.awk targetFile

This is true as of GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0).
It works just fine for me:

$ awk 'BEGIN{if(val==""){print "null or empty"}}'
null or empty

You can't differentiate between a variable being empty versus null: when you access an unset variable, awk just initializes it with the default value (here "", the empty string). You can use some sort of workaround, for example setting a val_accessed variable to 0 and then to 1 when you access it. Or, a simpler (somewhat hackish) approach: initialize val to "uninitialized" (or some other value that can't appear when running your program).

PS: your script looks strange to me; what are the nested brackets for?
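A tiny sketch of the val_accessed workaround mentioned above (the variable names and the use of $2 are illustrative only, not from the original script):

awk '
BEGIN { val_accessed = 0 }        # no value seen yet
{ val = $2; val_accessed = 1 }    # flips to 1 the moment val is actually assigned
END {
    if (!val_accessed)
        print "val was never assigned"
    else if (val == "")
        print "val was assigned an empty string"
}' File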
I accidentally discovered this less-used gawk-specific function that could help differentiate:

****** gawk-only ******
BEGIN {
    $0 = "abc"
    print NF, $0

    test_function()
    test_function($(NF + 1))
    test_function("")
    test_function($0)
}
function test_function(_) { print typeof(_) }

1 abc
untyped
unassigned
string
string

So it seems, for non-numeric-like data:
absolutely no input to the function at all: untyped
non-existent or empty field, including $0: unassigned
any non-numeric-appearing string, including "": string

Here's the chaotic part - numeric data. Strangely enough, for absolutely identical input, differing only between using $0 vs. $1 in the function call, you frequently get a different value from typeof(). Even a combination of both leading and trailing spaces doesn't prevent gawk from identifying it as strnum:

[123]:NF:1
    $0 = number:123
    $1 = strnum:123
    +$1 = number:123
[ 456.33]:NF:1
    $0 = string: 456.33
    $1 = strnum:456.33
    +$1 = number:456.33000
[ 19683 ]:NF:1
    $0 = string: 19683
    $1 = strnum:19683
    +$1 = number:19683
[-20.08554]:NF:1
    $0 = number:-20.08554
    $1 = strnum:-20.08554
    +$1 = number:-20.08554

+/- inf/nan (same for all 4):

[-nan]:NF:1
    $0 = string:-nan
    $1 = strnum:-nan
    +$1 = number:-nan

This one is a string because it was made from sprintf():

[0x10FFFF]:NF:1
    $0 = string:0x10FFFF
    $1 = string:0x10FFFF
    +$1 = number:0

Using the -n / --non-decimal-data flag, everything stays the same except:

[0x10FFFF]:NF:1
    $0 = string:0x10FFFF
    $1 = strnum:0x10FFFF
    +$1 = number:1114111

Long story short: if you want your gawk function to be able to differentiate between empty-string input ("") and actually no input at all (e.g. when the original intention is to directly apply changes to $0), then typeof(x) == "untyped" seems to be the most reliable indicator.

It gets worse when comparing a null string against a non-empty string of all zeros:

function __(_)   { return (!_) ":" (!+_) }
function ___(_)  { return (_ == "") }
function ____(_) { return (!_) ":" (!""_) }

$0--->[ "000" ]
    __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
    ___($0)-->{ $0=="" }-->[ 0 ]
    ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1000 ]

$0--->[ "" ]
    __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 1:1 ]
    ___($0)-->{ $0=="" }-->[ 1 ]
    ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1 ]

$0--->[ " -0.0 -0" ]
    __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
    ___($0)-->{ $0=="" }-->[ 0 ]
    ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 -0.0 -0 ]

$0--->[ " 0x5" ]
    __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
    ___($0)-->{ $0=="" }-->[ 0 ]
    ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 0x5 ]