I need to compare two files column by column in a Unix shell and store the differences in a result file.
For example, if column 1 of the first record in file 1 matches column 1 of the first record in file 2, the result file should contain '=' for that column; if the values differ, both values should be printed instead.
The exact requirement is below.
File 1:
id code name place
123 abc Tom phoenix
345 xyz Harry seattle
675 kyt Romil newyork
File 2:
id code name place
123 pkt Rosy phoenix
345 xyz Harry seattle
421 uty Romil Sanjose
Expected resulting file:
id_1 id_2 code_1 code_2 name_1 name_2 place_1 place_2
= = abc pkt Tom Rosy = =
= = = = = = = =
675 421 kyt uty = = newyork Sanjose
Columns are tab delimited.
This is rather crudely coded, but shows a way to use awk to emit what you want, and can handle files of identical "schema" - not just the particular 4-field files you give as tests.
This approach uses pr to do a simple merge of the files: the same line of each input file is concatenated to present one line to the awk script.
The awk script assumes clean input, and uses the fact that if a variable n has the value 2, the value of $n in the script is the same as $2. So, the script walks through pairs of fields using the i and j variables. For your test input, fields 1 and 5, then 2 and 6, etc., are processed.
Only very limited testing of input is performed: mainly, that the implied schema of the two input files (the names of columns/fields) is the same.
#!/bin/sh
[ $# -eq 2 ] || { echo "Usage: ${0##*/} <file1> <file2>" 1>&2; exit 1; }
[ -r "$1" -a -r "$2" ] || { echo "$1 or $2: cannot read" 1>&2; exit 1; }
set -e
pr -s -t -m "$@" | \
awk '
{
offset = int(NF/2)
tab = ""
for (i = 1; i <= offset; i++) {
j = i + offset
if (NR == 1) {
if ($i != $j) {
printf "\nColumn name mismatch (%s/%s)\n", $i, $j > "/dev/stderr"
exit
}
printf "%s%s_1\t%s_2", tab, $i, $j
} else if ($i == $j) {
printf "%s=\t=", tab
} else {
printf "%s%s\t%s", tab, $i, $j
}
tab = "\t"
}
printf "\n"
}
'
Tested on Linux: GNU Awk 4.1.0 and pr (GNU coreutils) 8.21.
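To make the field pairing concrete, here is a minimal sketch that feeds one hand-built merged line to awk and prints which fields get compared. The line is the first data row of each sample file joined with tabs, which is roughly what pr -m -t -s would emit. The function name show_pairs is made up for illustration and is not part of the script above.

```shell
#!/bin/sh
# show_pairs: hypothetical helper, not part of the script above.
# For each merged line on stdin, list the field pairs (i, i + NF/2)
# and say whether awk considers them equal.
show_pairs() {
    awk '{
        offset = int(NF / 2)
        for (i = 1; i <= offset; i++) {
            j = i + offset          # $j is the same column in the second file
            printf "$%d=%s $%d=%s %s\n", i, $i, j, $j, ($i == $j ? "same" : "differ")
        }
    }'
}

# First data row of file 1 and file 2, merged onto one line.
printf '123\tabc\tTom\tphoenix\t123\tpkt\tRosy\tphoenix\n' | show_pairs
```

Note that awk compares two fields numerically when both look like numbers, so values such as 1 and 1.0 would be reported as the same.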
I have detail.txt file ,which contains
cat >detail.txt
Student ID,Student Name, Percentage
101,A,75
102,B,77
103,C,34
104,D,42
105,E,75
106,F,42
107,G,77
1. I want to group the records by Percentage and print the student names for each percentage on a single line, separated by commas (,).
Expected Output:
75-A,E
77-B,G
42-D,F
34-C
For the question above, I know how to get the names for a single value such as 75, 77 or 42, but I don't know how to write code that groups by the third field (Percentage).
I tried the code below:
awk -F"," '{OFS=",";if($3=="75") print $2}' detail.txt
2. I want to get output based on grading system which is given below.
marks < 45=THIRD
marks>=45 and marks<60 =SECOND
marks>=60 and marks<=75 =FIRST
marks>75 =DIST
Expected Output:
DIST:B,G
FIRST:A,E
THIRD:C,D,F
Please help me to get the expected output. Thank You..
awk solution:
awk -F, 'NR>1{
if ($3<45) k="THIRD"; else if ($3>=45 && $3<60) k="SECOND";
else if ($3>=60 && $3<=75) k="FIRST"; else k="DIST";
a[k] = a[k]? a[k]","$2 : $2;
}END{ for(i in a) print i":"a[i] }' detail.txt
k - variable that holds the grade name chosen by the if (...) <exp>; else if (...) <exp> ... chain
a[k] - array a is indexed by that grade name k
a[k] = a[k]? a[k]","$2 : $2 - appends each student name (the 2nd field, $2) to the list for its grade, adding a comma only when the list is already non-empty
The output:
DIST:B,G
THIRD:C,D,F
FIRST:A,E
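The output above lists the grades in a different order from the expected output, because for (i in a) visits the keys in an unspecified order. A minimal sketch of one way to pin the order down, assuming you are happy to hard-code the grade names: walk a fixed list in END instead of the array. The function name grade_bands is hypothetical.

```shell
#!/bin/sh
# grade_bands: hypothetical wrapper around the same grouping idea,
# but printing the grades in a fixed, hard-coded order.
grade_bands() {
    awk -F, '
    NR > 1 {                                   # skip the header line
        if      ($3 < 45)  k = "THIRD"
        else if ($3 < 60)  k = "SECOND"
        else if ($3 <= 75) k = "FIRST"
        else               k = "DIST"
        if (k in a) a[k] = a[k] "," $2         # append to an existing list
        else        a[k] = $2                  # or start a new one
    }
    END {
        n = split("DIST FIRST SECOND THIRD", order, " ")
        for (i = 1; i <= n; i++)
            if (order[i] in a)                 # grades with no students are skipped
                print order[i] ":" a[order[i]]
    }' "$@"
}

printf '%s\n' 'Student ID,Student Name, Percentage' \
    101,A,75 102,B,77 103,C,34 104,D,42 105,E,75 106,F,42 107,G,77 | grade_bands
```

With a file on disk you would call it as grade_bands detail.txt.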
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR>1 {
stud = $2
pct = $3
if ( pct < 45 ) { band = "THIRD" }
else if ( pct < 60 ) { band = "SECOND" }
else if ( pct <= 75 ) { band = "FIRST" }
else { band = "DIST" }
pcts[pct][stud]
bands[band][stud]
}
END {
for (pct in pcts) {
out = ""
for (stud in pcts[pct]) {
out = (out == "" ? pct "-" : out OFS) stud
}
print out
}
print "----"
for (band in bands) {
out = ""
for (stud in bands[band]) {
out = (out == "" ? band ":" : out OFS) stud
}
print out
}
}
$ gawk -f tst.awk file
34-C
42-D,F
75-A,E
77-B,G
----
DIST:B,G
THIRD:C,D,F
FIRST:A,E
For your first question, the following awk one-liner should do:
awk -F, 'NR>1 {a[$3]=a[$3] (a[$3] ? "," : "") $2} END {for(i in a) printf "%s-%s\n", i, a[i]}' input.txt
The second question can work almost the same way, storing your mark divisions in an array, then stepping through that array to determine the subscript for a new array:
BEGIN { FS=","; m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[76]="DIST" } NR>1 { for (i=0;i<=100;i++) if ((i in m) && $3 >= i) mdiv=m[i]; marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2 } END { for(i in marks) printf "%s:%s\n", i, marks[i] }
But this is unreadable. When you need this level of complexity, you're past the point of a one-liner. :)
So .. combining the two and breaking them out for easier reading (and commenting) we get the following:
BEGIN {
  FS=","
  m[0]="THIRD"
  m[45]="SECOND"
  m[60]="FIRST"
  m[76]="DIST"                             # marks>75, assuming whole-number marks
}
NR>1 {                                     # Skip the header line
  a[$3]=a[$3] (a[$3] ? "," : "") $2        # Build an array with percentage as the index
  for (i=0;i<=100;i++)                     # Walk through the possible marks
    if ((i in m) && $3 >= i) mdiv=m[i]     # selecting the correct divider on the way
  marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2
                                           # then build another array with divider
                                           # as the index
}
END { # Once we've processed all the input,
for(i in a) # step through the array,
printf "%s-%s\n", i, a[i] # printing the results.
print "----"
for(i in marks) # step through the array,
printf "%s:%s\n", i, marks[i] # printing the results.
}
You may be wondering why we use for (i=0;i<=100;i++) instead of simply using for (i in m). This is because awk does not guarantee the order of array elements, and when stepping through the m array, it's important that we see the keys in increasing order.
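The same ordering caveat applies to the output loops in END. If you want the percentage groups printed in the order they first appear in the input, one common workaround is to keep a second, numerically indexed array of keys. A rough sketch, with made-up names (group_by_pct, keys, names):

```shell
#!/bin/sh
# group_by_pct: hypothetical helper that groups names by the 3rd field
# and prints the groups in first-seen order.
group_by_pct() {
    awk -F, '
    NR > 1 {                                   # skip the header line
        if ($3 in names) {
            names[$3] = names[$3] "," $2       # append to an existing group
        } else {
            keys[++n] = $3                     # remember when this key first appeared
            names[$3] = $2
        }
    }
    END {
        for (i = 1; i <= n; i++)               # numeric loop, so the order is fixed
            print keys[i] "-" names[keys[i]]
    }' "$@"
}

printf '%s\n' 'Student ID,Student Name, Percentage' \
    101,A,75 102,B,77 103,C,34 104,D,42 105,E,75 106,F,42 107,G,77 | group_by_pct
```

For the sample data this prints the groups as 75, 77, 34, 42, which is input order rather than the exact order shown in the question.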
I'm developing a script in which I have a hex string 31323334353637383930313233 that I want to transform into ASCII. The desired output is 1234567890123.
I already have it working using:
echo "31323334353637383930313233" | xxd -r -p
or
echo "31323334353637383930313233" | perl -pe 's/(..)/chr(hex($1))/ge'
But the point is to keep the script's requirements to a minimum. I want it to work on SUSE, Fedora, Debian, Ubuntu, Arch, etc. It seems the xxd command is part of the vim package. I'm wondering whether there is a way to achieve this using only awk or some other tool that is present by default on all Linux systems.
Found this script here:
#!/bin/bash
function hex2string () {
I=0
while [ $I -lt ${#1} ];
do
echo -en "\x"${1:$I:2}
let "I += 2"
done
}
hex2string "31323334353637383930313233"
echo
You may change the line hex2string "31323334353637383930313233" so that it takes the hex value from parameters, that is:
#!/bin/bash
function hex2string () {
I=0
while [ $I -lt ${#1} ];
do
echo -en "\x"${1:$I:2}
let "I += 2"
done
}
hex2string "$1"
echo
So when executed as:
./hexstring.sh 31323334353637383930313233
It will provide the desired ascii output.
NOTE: Can't test if it works in all Linux systems.
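The script above depends on bash features (${1:$I:2}, echo -en, let). If bash itself might be missing, here is a sketch of the same loop using only POSIX sh parameter expansion and printf. It is an untested-in-the-wild illustration, not a drop-in guarantee; the function name hex2ascii is made up, and it assumes printf accepts 0x-prefixed integers and the %b conversion, as POSIX describes.

```shell
#!/bin/sh
# hex2ascii: hypothetical POSIX sh variant of hex2string.
hex2ascii() {
    hex=$1
    # Refuse anything that is not an even-length string of hex digits.
    case $hex in
        ''|*[!0-9A-Fa-f]*) return 1 ;;
    esac
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1

    while [ -n "$hex" ]; do
        rest=${hex#??}              # everything after the first two characters
        pair=${hex%"$rest"}         # the first two characters
        hex=$rest
        # The inner printf turns the hex pair into octal;
        # %b then expands \0ooo into the corresponding byte.
        printf '%b' "\\0$(printf '%o' "0x$pair")"
    done
}

hex2ascii "31323334353637383930313233"
echo
```

It writes the decoded bytes without a trailing newline, so add echo after the call if you want one.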
Using gawk, from HEX to ASCII
$ gawk '{
gsub(/../,"0x& ");
for(i=1;i<=NF;i++)
printf("%c", strtonum($i));
print ""
}' <<<"31323334353637383930313233"
1234567890123
Using any awk
$ cat hex2asc_anyawk.awk
BEGIN{
split("0 1 2 3 4 5 6 7 8 9 A B C D E F", d, / /)
for(i in d)Decimal[d[i]]=i-1
}
function hex2dec(hex, h,i,j,dec)
{
hex = toupper(hex);
i = length(hex);
while(i)
{
dec += Decimal[substr(hex,i,1)] * 16 ^ j++
i--
}
return dec;
}
{
gsub(/../,"& ");
for(i=1;i<=NF;i++)
printf("%c",hex2dec($i));
print ""
}
Execution
$ awk -f hex2asc_anyawk.awk <<<"31323334353637383930313233"
1234567890123
Explanation
Steps :
Get the decimal equivalent of hex from table.
Multiply every digit with 16 power of digit location.
Sum all the multipliers.
Example :
BEGIN{
# Here we created decimal conversion array, like above table
split("0 1 2 3 4 5 6 7 8 9 A B C D E F", d, / /)
for(i in d)Decimal[d[i]]=i-1
}
function hex2dec(hex, h,i,j,dec)
{
hex = toupper(hex); # uppercase conversion if any A,B,C,D,E,F
i = length(hex); # length of hex string
while(i)
{
# dec var where sum is stored
# substr(hex,i,1) gives 1 char from RHS
# multiply by 16 power of digit location
dec += Decimal[substr(hex,i,1)] * 16 ^ j++
i-- # decrement by 1
}
return dec;
}
{
# it modifies record
# suppose if given string is 31323334353637383930313233
# after gsub it becomes 31 32 33 34 35 36 37 38 39 30 31 32 33
# thus re-evaluate the fields
gsub(/../,"& ");
# loop through fields , NF gives no of fields
for(i=1;i<=NF;i++)
# convert from hex to decimal
# and print equivalent ASCII value
printf("%c",hex2dec($i));
# print newline char
print ""
}
Meaning of dec += Decimal[substr(hex,i,1)] * 16 ^ j++
dec += Decimal[substr(hex,i,1)] * 16 ^ j++
^ ^ ^
| | |
| | 2.Multiply every digit with 16 power of digit location.
| |
| 1.Gives decimal equivalent of hex
|
|
3. Sum all the multipliers
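If you only want to check the arithmetic for a single value, here is a small sketch of the same positional idea written left to right: multiply the running total by 16 and add the next digit. This is a variation on the function above, not the same code; it uses index() on a digit string instead of the Decimal lookup table, and it assumes the input contains only valid hex digits. The shell wrapper name hex2dec is just for illustration.

```shell
#!/bin/sh
# hex2dec: hypothetical shell wrapper; prints the decimal value of one hex string.
# Assumes the argument contains only the characters 0-9, a-f, A-F.
hex2dec() {
    awk -v hex="$1" 'BEGIN {
        hex = toupper(hex)
        dec = 0
        for (i = 1; i <= length(hex); i++) {
            # index() returns the 1-based position of the digit, so subtract 1
            digit = index("0123456789ABCDEF", substr(hex, i, 1)) - 1
            dec = dec * 16 + digit
        }
        print dec
    }'
}

hex2dec 31     # 3*16 + 1
hex2dec 1F     # 1*16 + 15
```

For example, 31 gives 3*16 + 1 = 49, which is the character code of the digit 1.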
Here's a shortcut for a special case: because of how decimal digits were originally mapped to bytes, their hex codes are all 3[0-9].
So if you already know the input decodes to digits and nothing else, here's a fast shortcut:
echo "31323334353637383930313233" |
mawk 'gsub("..","_&") + gsub("_3",_)^_'
1234567890123
If it's already URL-percent-encoded, then it's even simpler:
echo '%31%32%33%34%35%36%37%38%39%30%31%32%33' |
mawk 'gsub("%3",_)^_'
or
gawk ++NF FS='%3' OFS=
1234567890123
This specialized approach can handle hex input of any length, even in awks that don't have built-in support for bigints.
TL;DR : don't "do math" when none is needed
Alternate (g)awk solution:
echo "31323334353637383930313233" | awk 'RT{printf "%c", strtonum("0x"RT)}' RS='[0-9A-Fa-f]{2}'
I would like to extract all the lines from the first file (a gzipped file, Input.csv.gz) whose 4th field falls within a range given by the
first field (Start Range) and second field (End Range) of the second file (Slab.csv), and then report, per slab, the count of rows and the sums of the 4th and 5th fields of the first file.
Input.csv.gz (GunZip)
Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2
Slab.csv
StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000
Expected Output:
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
I am using the two commands below to get the above output, except for the "NotFound" cases.
awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv
Op_step2.csv
101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3
Any suggestions for a one-liner command that achieves the expected output? I don't have Perl or Python access.
Here is another option using perl which takes benefits of creating multi-dimensional arrays and hashes.
perl -F, -lane'
BEGIN {
    $x = pop;
    ## Create array of arrays from start and end ranges
    ## @range = ( [0,0] , [1,10] ... )
    (undef, @range) = map { chomp; [split /,/] } <>;
    @ARGV = $x;
}
## Skip the first line
next if $. == 1;
## Create hash of hashes
## %line = ( "0 0" => { "count" => counts , "sum4" => sum_of_col4 , "sum5" => sum_of_col5 } , ... )
for (@range) {
    if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
        $line{"@$_"}{"count"}++;
        $line{"@$_"}{"sum4"} += $F[3];
        $line{"@$_"}{"sum5"} += $F[4];
    }
}
}{
print "StartRange,EndRange,Count,Sum-4,Sum-5";
print join ",", @$_,
    $line{"@$_"}{"count"} // "NotFound",
    $line{"@$_"}{"sum4"} // "NotFound",
    $line{"@$_"}{"sum5"} // "NotFound"
    for @range
' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:
awk '
BEGIN {
FS = OFS = SUBSEP = ",";
print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR {
ranges[$1,$2]++;
next
}
{
for (range in ranges) {
split(range, tmp, SUBSEP);
if ($4 >= tmp[1] && $4 <= tmp[2]) {
count[range]++;
sum4[range]+=$4;
sum5[range]+=$5;
next
}
}
}
END {
for(range in ranges)
print range, (range in count ? count[range] : "NotFound"), (range in count ? sum4[range] : "NotFound"), (range in count ? sum5[range] : "NotFound") | "sort -t, -nk1,2"
}' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Set the Input, Output Field Separators and SUBSEP to ,. Print the Header line.
Skip the first (header) line of each file.
Load the entire slab file into an array called ranges.
For every range in the ranges array, split the field to get start and end range. If the 4th column is in the range, increment the count array and add the value to sum4 and sum5 array appropriately.
In the END block, iterate through the ranges and print them.
Pipe the output to sort to get the output in order.
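If you would rather avoid the external sort, a possible variation is to store the slabs in numerically indexed arrays so they can be printed in file order, and to decide "NotFound" by whether a slab was ever counted rather than by whether its sum is zero. This is a sketch under those assumptions, not a tested replacement; the function name slab_summary and the array names lo and hi are made up.

```shell
#!/bin/sh
# slab_summary: hypothetical helper.
#   $1 = slab file, $2 = input file (plain text, or - for stdin);
#   both are expected to start with a header line.
slab_summary() {
    awk -F, -v OFS=, '
    FNR == 1  { next }                                  # skip each header
    NR == FNR { lo[++n] = $1; hi[n] = $2; next }        # first file: remember slabs in order
    {
        for (i = 1; i <= n; i++)
            if ($4 >= lo[i] && $4 <= hi[i]) {
                count[i]++; sum4[i] += $4; sum5[i] += $5
                break                                   # a row belongs to one slab only
            }
    }
    END {
        print "StartRange,EndRange,Count,Sum-4,Sum-5"
        for (i = 1; i <= n; i++) {
            if (i in count) print lo[i], hi[i], count[i], sum4[i], sum5[i]
            else            print lo[i], hi[i], "NotFound", "NotFound", "NotFound"
        }
    }' "$1" "$2"
}
```

With the files from the question it could be called as gzip -dc Input.csv.gz | slab_summary Slab.csv - so that no intermediate file is needed.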
I am using an AWK script to process some logs.
At one point I need to check whether a variable's value is null or empty in order to make a decision.
Any idea how to achieve this?
awk '
{
{
split($i, keyVal, "#")
key=keyVal[1];
val=keyVal[2];
if(val ~ /^ *$/)
val="Y";
}
}
' File
I have tried with
1) if(val == "")
2) if(val ~ /^ *$/)
not working in both cases.
The comparison with "" should have worked, so that's a bit odd.
As one more alternative, you could use the length() function: if it returns zero, your variable is null/empty. E.g.,
if (length(val) == 0)
Also, perhaps the built-in variable NF (number of fields) could come in handy? It's hard to say without access to your input data, but it's another possibility.
You can use the variable directly, without a comparison: an empty, null or zero value is considered false, and everything else is true.
See here:
# setting default tag if not provided
if (! tag) {
tag="default-tag"
}
So this script will have the variable tag set to default-tag unless the user calls it like this:
$ awk -v tag=custom-tag -f script.awk targetFile
This is true as of :
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
It works just fine for me
$ awk 'BEGIN{if(val==""){print "null or empty"}}'
null or empty
You can't differentiate between a variable being empty and being null: when you access an "unset" variable, awk just initializes it with a default value (here "", the empty string). You can use some sort of workaround, for example setting a val_accessed variable to 0 and then to 1 when you access it. A simpler (if somewhat "hackish") approach is to set val to "uninitialized" (or to some other value which can't appear when running your program).
PS: your script looks strange for me, what are the nested brackets for?
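For what it's worth, here is a small self-contained sketch showing that val == "" does behave as expected after split(), both when the part after # is empty and when there is no # at all. The loop over fields and the function name fill_default are my guesses at what the surrounding script does, since the question doesn't show it.

```shell
#!/bin/sh
# fill_default: hypothetical helper. Reads whitespace-separated key#value
# tokens and prints key=value, substituting Y when the value is empty or missing.
fill_default() {
    awk '{
        for (i = 1; i <= NF; i++) {
            split($i, keyVal, "#")      # split() clears keyVal before filling it
            key = keyVal[1]
            val = keyVal[2]             # "" if nothing follows # or there is no #
            if (val == "")
                val = "Y"
            printf "%s%s=%s", (i > 1 ? " " : ""), key, val
        }
        printf "\n"
    }'
}

printf 'a#1 b# c\n' | fill_default
```

If a check like this still fails on real data, stray characters such as a trailing carriage return in the input are worth ruling out.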
I accidentally discovered a lesser-used, gawk-specific function that could help differentiate:
****** gawk-only ******
BEGIN {
$0 = "abc"
print NF, $0
test_function()
test_function($(NF + 1))
test_function("")
test_function($0)
}
function test_function(_) { print typeof(_) }
1 abc
untyped
unassigned
string
string
So it seems, for non-numeric-like data :
absolutely no input to function at all : untyped
non-existent or empty field, including $0 : unassigned
any non-numeric-appearing string, including "" : string
Here's the chaotic part - numeric data :
strangely enough, for absolutely identical input, only differing between using $0 vs. $1 in function call, you frequently get a different value for typeof()
even a combination of both leading and trailing spaces doesn't prevent gawk from identifying it as strnum
[123]:NF:1
$0 = number:123 $1 = strnum:123 +$1 = number:123
[ 456.33]:NF:1
$0 = string: 456.33 $1 = strnum:456.33 +$1 = number:456.33000
[ 19683 ]:NF:1
$0 = string: 19683 $1 = strnum:19683 +$1 = number:19683
[-20.08554]:NF:1
$0 = number:-20.08554 $1 = strnum:-20.08554 +$1 = number:-20.08554
+/- inf/nan (same for all 4):
[-nan]:NF:1
$0 = string:-nan $1 = strnum:-nan +$1 = number:-nan
this one is a string because it was made from sprintf() :
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = string:0x10FFFF +$1 = number:0
using -n / --non-decimal-data flag, all stays same except
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = strnum:0x10FFFF +$1 = number:1114111
Long story short, if you want your gawk function to be able to differentiate between
empty-string input (""), versus
actually no input at all
e.g. when original intention is to directly apply changes to $0
then typeof(x) == "untyped" seems to be the most reliable indicator.
It gets worse when comparing null-string padding with a non-empty string of all zeros:
function __(_) { return (!_) ":" (!+_) }
function ___(_) { return (_ == "") }
function ____(_) { return (!_) ":" (!""_) }
$0--->[ "000" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1000 ]
$0--->[ "" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 1:1 ]
___($0)-->{ $0=="" }-->[ 1 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1 ]
$0--->[ " -0.0 -0" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 -0.0 -0 ]
$0--->[ " 0x5" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 0x5 ]