Bash: hex to ASCII. Is it possible without xxd or perl?

I'm developing a script in which I have a hex string 31323334353637383930313233 that I want to convert to ASCII. The desired output is 1234567890123.
I already have it working using:
echo "31323334353637383930313233" | xxd -r -p
or
echo "31323334353637383930313233" | perl -pe 's/(..)/chr(hex($1))/ge'
But the point is to keep the script's requirements to a minimum. I want it to work on SUSE, Fedora, Debian, Ubuntu, Arch, etc. It seems the xxd command is part of the vim package. I'm wondering whether there is a way to achieve this using only awk or some built-in tool that is present by default on all Linux systems.

Found this script here:
#!/bin/bash
function hex2string () {
I=0
while [ $I -lt ${#1} ];
do
echo -en "\x"${1:$I:2}
let "I += 2"
done
}
hex2string "31323334353637383930313233"
echo
You may change the line hex2string "31323334353637383930313233" so that it takes the hex value from parameters, that is:
#!/bin/bash
function hex2string () {
I=0
while [ $I -lt ${#1} ];
do
echo -en "\x"${1:$I:2}
let "I += 2"
done
}
hex2string "$1"
echo
So when executed as:
./hexstring.sh 31323334353637383930313233
It will print the desired ASCII output.
NOTE: I can't test whether it works on all Linux systems.
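For shells other than bash, a minimal sketch of the same loop is given below. It assumes only a POSIX sh and its printf: the \x escape and the ${1:$I:2} substring syntax are bash features, so the sketch uses prefix/suffix stripping and an octal escape instead. The function name hex2ascii is made up for this illustration.

```shell
#!/bin/sh
# hex2ascii: hypothetical helper name, not part of the script above.
hex2ascii() {
    hex=$1
    # Refuse odd-length input so the loop always consumes two characters.
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        rest=${hex#??}         # everything after the first two characters
        byte=${hex%"$rest"}    # the first two characters
        # The inner printf turns the hex byte into three octal digits;
        # the outer printf emits that byte through a \ooo escape.
        printf "\\$(printf '%03o' "0x$byte")"
        hex=$rest
    done
}

hex2ascii 31323334353637383930313233
echo
```

This starts one command substitution per byte, so it is only reasonable for short strings.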

Using gawk, from HEX to ASCII
$ gawk '{
gsub(/../,"0x& ");
for(i=1;i<=NF;i++)
printf("%c", strtonum($i));
print ""
}' <<<"31323334353637383930313233"
1234567890123
Using any awk
$ cat hex2asc_anyawk.awk
BEGIN{
split("0 1 2 3 4 5 6 7 8 9 A B C D E F", d, / /)
for(i in d)Decimal[d[i]]=i-1
}
function hex2dec(hex, h,i,j,dec)
{
hex = toupper(hex);
i = length(hex);
while(i)
{
dec += Decimal[substr(hex,i,1)] * 16 ^ j++
i--
}
return dec;
}
{
gsub(/../,"& ");
for(i=1;i<=NF;i++)
printf("%c",hex2dec($i));
print ""
}
Execution
$ awk -f hex2asc_anyawk.awk <<<"31323334353637383930313233"
1234567890123
Explanation
Steps:
1. Look up the decimal equivalent of each hex digit in the table.
2. Multiply each digit by 16 raised to the power of its position, counting from 0 at the rightmost digit.
3. Sum the products.
Example:
BEGIN{
# Here we created decimal conversion array, like above table
split("0 1 2 3 4 5 6 7 8 9 A B C D E F", d, / /)
for(i in d)Decimal[d[i]]=i-1
}
function hex2dec(hex, h,i,j,dec)
{
hex = toupper(hex); # uppercase conversion if any A,B,C,D,E,F
i = length(hex); # length of hex string
while(i)
{
# dec var where sum is stored
# substr(hex,i,1) gives 1 char from RHS
# multiply by 16 power of digit location
dec += Decimal[substr(hex,i,1)] * 16 ^ j++
i-- # decrement by 1
}
return dec;
}
{
# it modifies record
# suppose if given string is 31323334353637383930313233
# after gsub it becomes 31 32 33 34 35 36 37 38 39 30 31 32 33
# thus re-evaluate the fields
gsub(/../,"& ");
# loop through fields , NF gives no of fields
for(i=1;i<=NF;i++)
# convert from hex to decimal
# and print equivalent ASCII value
printf("%c",hex2dec($i));
# print newline char
print ""
}
Meaning of dec += Decimal[substr(hex,i,1)] * 16 ^ j++
Decimal[substr(hex,i,1)]   1. Gives the decimal equivalent of the hex digit.
* 16 ^ j++                 2. Multiplies that digit by 16 raised to the power of its position.
dec +=                     3. Sums the products.
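The same positional arithmetic can be checked on a single byte. The sketch below is a variation rather than the answer's code: it walks the digits left to right (Horner's rule) and uses index() on a digit string instead of a lookup array. The function name hex2dec is reused only for illustration.

```shell
#!/bin/sh
# Illustrative hex-to-decimal conversion for one value, using any POSIX awk.
hex2dec() {
    awk -v hex="$1" 'BEGIN {
        digits = "0123456789ABCDEF"
        hex = toupper(hex)
        dec = 0
        # Left to right: multiply the running total by 16, then add the digit.
        for (i = 1; i <= length(hex); i++)
            dec = dec * 16 + index(digits, substr(hex, i, 1)) - 1
        print dec
    }'
}

hex2dec 31    # 3*16 + 1  = 49, the character code of "1"
hex2dec 1F    # 1*16 + 15 = 31
```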

Here's a shortcut for a special case: because of how the decimal digits were originally mapped to bytes, their hex codes are all 3[0-9] (0x30 to 0x39).
So if you already know that the input decodes to digits and nothing else, here's a fast shortcut:
echo "31323334353637383930313233" |
mawk 'gsub("..","_&") + gsub("_3",_)^_'
1234567890123
If the input is already URL-percent-encoded, it's even simpler:
echo '%31%32%33%34%35%36%37%38%39%30%31%32%33' |
mawk 'gsub("%3",_)^_'
or
gawk ++NF FS='%3' OFS=
1234567890123
This specialized approach can handle hex input of any length, even in awks that have no built-in bigint support.
TL;DR: don't "do math" when none is needed.
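A sketch of the same digits-only shortcut with sed, for systems where mawk is not installed. It assumes every byte is in the range 30 to 39, and the guard makes it print nothing otherwise. The function name hexdigits2ascii is hypothetical.

```shell
#!/bin/sh
# hexdigits2ascii: hypothetical helper; valid only for input made of 3x pairs.
hexdigits2ascii() {
    # The address checks that the whole line is a sequence of 3[0-9] pairs;
    # the substitution then drops the leading 3 of each pair.
    printf '%s\n' "$1" | sed -n '/^\(3[0-9]\)*$/{s/3\(.\)/\1/g;p;}'
}

hexdigits2ascii 31323334353637383930313233
```

Like the mawk version, this does no arithmetic; it only deletes characters.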

Alternate (g)awk solution:
echo "31323334353637383930313233" | gawk 'RT{printf "%c", strtonum("0x"RT)}' RS='[0-9A-Fa-f]{2}'

Related

Awk program to compare number of fields by space of each line

I am trying to check whether every line in a file has the same length (that is, the same number of fields).
I am doing the following, but it does not seem to work.
NR==1 {length=NF}
NR>1 && NF!=length {print}
Can this be done with an awk one-liner? A longer program is fine too.
A sample of input would be:
12 34 54 56
12 89 34 33
12
29 56 42 42
My expected output would be "yes" or "no" if they have the same number of fields or not.
You could try this command which checks the number of fields in each line and compares it to the number of fields of the first line:
awk 'NR==1{a=NF; b=0} (NR>1 && NF!=a){print "No"; b=1; exit 1}END{if (b==0) print "Yes"}' test.txt
Checking stops at the first line whose number of fields differs from that of the first line of input.
For input
12 43 43
12 32
you will get "No"
Try:
awk 'BEGIN{a="yes"} last!="" && NF!=last{a="no"; exit} {last=NF} END{print a}' file
How it works
BEGIN{a="yes"}
This initializes the variable a to yes. (We assume all lines have the same number of fields until proven otherwise.)
last!="" && NF!=last{a="no"; exit}
If last has been assigned a value and the number of fields on the current line is not the same as last, then set a to no and exit.
{last=NF}
Update last to the number of fields on the current line.
END{print a}
Before exiting, print a.
Examples
$ cat file1
2 34 54 56
12 89 34 33
12
29 56 42 42
$ awk 'BEGIN{a="yes"} last!="" && NF!=last{a="no"; exit} {last=NF} END{print a}' file1
no
$ cat file2
2 34 54 56
12 89 34 33
29 56 42 42
$ awk 'BEGIN{a="yes"} last!="" && NF!=last{a="no"; exit} {last=NF} END{print a}' file2
yes
I am assuming that you want to check whether all lines have the same number of fields; if that is the case, try the following.
awk '
FNR==1{
value=NF
count++
next
}
{
count=NF==value?++count:count
}
END{
if(count==FNR){
print "All lines are of same fields"
}
else{
print "All lines are NOT of same fields."
}
}
' Input_file
Additional (only if required): in case you also want to print the contents of a file whose lines all have the same number of fields, along with the message saying so, try the following.
awk '
{
val=val?val ORS $0:$0
}
FNR==1{
value=NF
count++
next
}
{
count=NF==value?++count:count
}
END{
if(count==FNR){
print "All lines are of same fields" ORS val
}
else{
print "All lines are NOT of same fields."
}
}
' Input_file
This should do:
$ awk 'NR==1{p=NF} p!=NF{s=1; exit} END{print s?"No":"Yes"}' file
However, setting the exit status would be better if this is going to be part of a workflow.
Since equality is transitive, there is no need to remember NF for any line other than the first; and because the success value is 0, the flag needs no initialization to a default value.
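A minimal sketch of the exit-status variant mentioned above, so the check can be used directly in an if or && chain. The function name same_nf is made up for this illustration.

```shell
#!/bin/sh
# same_nf: hypothetical helper. Exit status 0 if every line has the same
# number of fields as the first line, 1 otherwise. Prints nothing.
same_nf() {
    awk 'NR == 1 { n = NF }
         NF != n { bad = 1; exit }
         END     { exit bad + 0 }' "$@"
}

if printf '12 34 54 56\n12 89 34 33\n12\n' | same_nf; then
    echo yes
else
    echo no
fi
```

The exit in the main rule still runs the END rule, which is where the final status is set.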
An efficient shell function for checking for even field counts: it uses sed to construct a regex from the first line of input and feeds it to GNU grep, which looks for lines whose field count does not match:
# Usage: ef filename
ef() { sed '1s/[^ ]*/[^ ]*/g;q' "$1" | grep -v -m 1 -q -f - "$1" \
&& echo no || echo yes ; }
For files with uneven fields grep -m 1 quits after the first non-uniform line -- so if the file is a million lines long, but the mismatch occurs on line #2, grep only needs to read two lines, not a million. On the other hand, if there's no mismatch grep would have to read a million lines.

Comparing two files column by column in unix shell

I need to compare two files column by column using unix shell, and store the difference in a resulting file.
For example, if column 1 of the 1st record of the 1st file matches column 1 of the 1st record of the 2nd file, then '=' is stored for that column in the resulting file; if the column values differ, both values need to be printed in the resulting file.
Below is the exact requirement.
File 1:
id code name place
123 abc Tom phoenix
345 xyz Harry seattle
675 kyt Romil newyork
File 2:
id code name place
123 pkt Rosy phoenix
345 xyz Harry seattle
421 uty Romil Sanjose
Expected resulting file:
id_1 id_2 code_1 code_2 name_1 name_2 place_1 place_2
= = abc pkt Tom Rosy = =
= = = = = = = =
675 421 kyt uty = = newyork Sanjose
Columns are tab delimited.
This is rather crudely coded, but shows a way to use awk to emit what you want, and can handle files of identical "schema" - not just the particular 4-field files you give as tests.
This approach uses pr to do a simple merge of the files: the same line of each input file is concatenated to present one line to the awk script.
The awk script assumes clean input, and uses the fact that if a variable n has the value 2, the value of $n in the script is the same as $2. So, the script walks through pairs of fields using the i and j variables. For your test input, fields 1 and 5, then 2 and 6, etc., are processed.
Only very limited testing of input is performed: mainly, that the implied schema of the two input files (the names of columns/fields) is the same.
#!/bin/sh
[ $# -eq 2 ] || { echo "Usage: ${0##*/} <file1> <file2>" 1>&2; exit 1; }
[ -r "$1" -a -r "$2" ] || { echo "$1 or $2: cannot read" 1>&2; exit 1; }
set -e
pr -s -t -m "$@" | \
awk '
{
offset = int(NF/2)
tab = ""
for (i = 1; i <= offset; i++) {
j = i + offset
if (NR == 1) {
if ($i != $j) {
printf "\nColumn name mismatch (%s/%s)\n", $i, $j > "/dev/stderr"
exit
}
printf "%s%s_1\t%s_2", tab, $i, $j
} else if ($i == $j) {
printf "%s=\t=", tab
} else {
printf "%s%s\t%s", tab, $i, $j
}
tab = "\t"
}
printf "\n"
}
'
Tested on Linux: GNU Awk 4.1.0 and pr (GNU coreutils) 8.21.
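The $i / $j pairing can be tried in isolation on one already-merged line. The sketch below is a cut-down illustration, not the script above: it assumes tab-separated input whose second half mirrors the first, and it prints a differing pair as old/new rather than in two columns. The function name compare_halves is hypothetical.

```shell
#!/bin/sh
# compare_halves: hypothetical helper showing field indirection ($i and $j).
compare_halves() {
    awk -F'\t' '{
        half = int(NF / 2)
        line = ""
        for (i = 1; i <= half; i++) {
            j = i + half                      # partner field in the second half
            cell = ($i == $j) ? "=" : ($i "/" $j)
            line = line (i > 1 ? "\t" : "") cell
        }
        print line
    }'
}

# Fields 1 and 3 match, fields 2 and 4 differ.
printf 'a\tb\ta\tc\n' | compare_halves
```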

How to gather characters usage statistics in text file using Unix commands?

I have a text file created by OCR software, about one megabyte in size.
Some uncommon characters appear all over the document, and most of them are OCR errors.
I would like to find all the characters used in the document to spot errors easily (like the uniq command, but for characters rather than lines).
I am on Ubuntu.
What Unix command should I use to display all characters used in a text file?
This should do what you're looking for:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
The premise is that the sed puts each character in the file onto a line by itself, then the usual sort | uniq -c sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.
Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
1
1 e
1 h
1 o
2 l
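An alternative sketch that does the counting inside awk, so only one line per distinct character reaches sort. It does not count the newline itself, and whether multibyte characters are counted as characters or as bytes depends on the awk implementation and locale. The function name char_counts is made up here.

```shell
#!/bin/sh
# char_counts: hypothetical helper. Prints "count character" pairs,
# sorted by ascending count. Reads the named files, or stdin if none.
char_counts() {
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++)
            count[substr($0, i, 1)]++
    }
    END {
        for (ch in count)
            printf "%d %s\n", count[ch], ch
    }' "$@" | sort -n
}

echo hello | char_counts
```

Piping the result through head or tail gives the rarest or most frequent characters.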
This will do it:
#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <tchrist@perl.com>
use open ":utf8";
++$seen{ ord() } for split //;
END {
for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
printf "%04X %d\n", $cp, $seen{$cp};
}
}
Run on itself, that program produces:
$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9
If you want the literal character and/or name of the character, too, that’s easy to add.
If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.
#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <tchrist@perl.com>
use strict;
use warnings qw<FATAL all>;
use open ":utf8";
my %cats;
our %Prop_Table;
build_prop_table();
if (@ARGV == 0 && -t STDIN) {
warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING
}
while (<>) {
for (split(//)) {
$cats{Total}++;
if (/\p{ASCII}/) { $cats{ASCII}++ }
else { $cats{Unicode}++ }
my $gcat = get_general_category($_);
$cats{$gcat}++;
my $subcat = get_general_subcategory($_);
$cats{$subcat}++;
}
}
my $width = length $cats{Total};
my $mask = "%*d %s\n";
for my $cat(qw< Total ASCII Unicode >) {
printf $mask, $width => $cats{$cat} || 0, $cat;
}
print "\n";
my @catnames = qw[
L Lu Ll Lt Lm Lo
N Nd Nl No
S Sm Sc Sk So
P Pc Pd Ps Pe Pi Pf Po
M Mn Mc Me
Z Zs Zl Zp
C Cc Cf Cs Co Cn
];
#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
next if length($cat) > 2;
next unless $cats{$cat};
my $prop = length($cat) == 1
? ( " " . q<\p> . $cat )
: ( q<\p> . "{$cat}" . "\t" )
;
my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});
printf $mask, $width => $cats{$cat}, $desc;
}
exit;
sub get_general_category {
local $_ = shift(); # lexical "my $_" is rejected by Perl 5.24 and later
return "L" if /\pL/;
return "S" if /\pS/;
return "P" if /\pP/;
return "N" if /\pN/;
return "C" if /\pC/;
return "M" if /\pM/;
return "Z" if /\pZ/;
die "not reached one: $_";
}
sub get_general_subcategory {
local $_ = shift(); # lexical "my $_" is rejected by Perl 5.24 and later
return "Lu" if /\p{Lu}/;
return "Ll" if /\p{Ll}/;
return "Lt" if /\p{Lt}/;
return "Lm" if /\p{Lm}/;
return "Lo" if /\p{Lo}/;
return "Mn" if /\p{Mn}/;
return "Mc" if /\p{Mc}/;
return "Me" if /\p{Me}/;
return "Nd" if /\p{Nd}/;
return "Nl" if /\p{Nl}/;
return "No" if /\p{No}/;
return "Pc" if /\p{Pc}/;
return "Pd" if /\p{Pd}/;
return "Ps" if /\p{Ps}/;
return "Pe" if /\p{Pe}/;
return "Pi" if /\p{Pi}/;
return "Pf" if /\p{Pf}/;
return "Po" if /\p{Po}/;
return "Sm" if /\p{Sm}/;
return "Sc" if /\p{Sc}/;
return "Sk" if /\p{Sk}/;
return "So" if /\p{So}/;
return "Zs" if /\p{Zs}/;
return "Zl" if /\p{Zl}/;
return "Zp" if /\p{Zp}/;
return "Cc" if /\p{Cc}/;
return "Cf" if /\p{Cf}/;
return "Cs" if /\p{Cs}/;
return "Co" if /\p{Co}/;
return "Cn" if /\p{Cn}/;
die "not reached two: <$_> " . sprintf("U+%vX", $_);
}
sub build_prop_table {
for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {
L Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark (combining characters, including diacritics)
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other (means not L/N/P/S/Z)
Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
Co Private_Use
Cn Unassigned
End_of_Property_List
my($short_prop, $long_prop) = $line =~ m{
\b
( \p{Lu} \p{Ll} ? )
\s +
( \p{Lu} [\p{L&}_] + )
\b
}x;
$Prop_Table{$short_prop} = $long_prop;
}
}
For example:
$ unicats book.txt
2357232 Total
2357199 ASCII
33 Unicode
1604949 \pL Letter
74455 \p{Lu} Uppercase_Letter
1530485 \p{Ll} Lowercase_Letter
9 \p{Lo} Other_Letter
10676 \pN Number
10676 \p{Nd} Decimal_Number
19679 \pS Symbol
10705 \p{Sm} Math_Symbol
8365 \p{Sc} Currency_Symbol
603 \p{Sk} Modifier_Symbol
6 \p{So} Other_Symbol
111899 \pP Punctuation
2996 \p{Pc} Connector_Punctuation
6145 \p{Pd} Dash_Punctuation
11392 \p{Ps} Open_Punctuation
11371 \p{Pe} Close_Punctuation
79995 \p{Po} Other_Punctuation
548529 \pZ Separator
548529 \p{Zs} Space_Separator
61500 \pC Other
61500 \p{Cc} Control
As far as using *nix commands, the answer above is good, but it doesn't get usage stats.
However, if you actually want stats (like the rarest used, median, most used, etc) on the file, this Python should do it.
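def get_char_counts(fname):
    """Return a dict mapping each character in fname to its number of occurrences."""
    usage = {}
    # Assumes the file is UTF-8 text; the with block closes the file when done.
    with open(fname, encoding="utf-8") as f:
        for c in f.read():
            usage[c] = usage.get(c, 0) + 1
    return usage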

Conversion between binary and decimal

How do I convert between decimal and binary? I'm working on a Solaris 10 platform
Decimal to Binary
4000000002 -> 11101110011010110010100000000010
Binary to Decimal
11101110011010110010100000000010 -> 4000000002
I used the following command in Unix, but it takes a lot of time. I have 20 million records like this.
For decimal to binary, set obase to 2:
echo 'obase=2;4000000002' | bc
For binary to decimal, set ibase to 2:
echo 'ibase=2;11101110011010110010100000000010' | bc
If you are running bc once for each number that will be slow.
Can you not arrange for the data to be delivered to a file and input in one go?
Here's a simple illustration, starting with your numbers in the file called input.txt:
# To binary
$ ( echo 'obase=2;ibase=16;'; cat input.txt ) | bc | paste input.txt - > output.txt
# To hex
$ ( echo 'obase=16;ibase=2;'; cat input.txt ) | bc | paste input.txt - > output.txt
The results are written to the file output.txt.
The paste is included to produce tab-separated output like
07 111
1A 11010
20 100000
2B 101011
35 110101
80 10000000
FF 11111111
showing input value versus output value.
If you just want the results you can omit the paste, e.g.:
$ ( echo 'obase=2;ibase=16;'; cat input.txt ) | bc > output.txt
Note that you probably have to set ibase as well as obase for the conversion to be correct.
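If bc turns out to be the bottleneck even when started only once, the conversion can also be sketched in awk, which likewise handles all records in a single process. This swaps in a different tool from the answer above and is exact only for values below 2^53, since awk computes in double precision. The function names dec2bin and bin2dec are made up, and on Solaris a POSIX awk such as nawk or /usr/xpg4/bin/awk is probably needed rather than the old /usr/bin/awk.

```shell
#!/bin/sh
# dec2bin / bin2dec: hypothetical helpers, one number per input line.
dec2bin() {
    awk '{
        n = $1 + 0
        s = (n == 0) ? "0" : ""
        while (n > 0) {
            s = (n % 2) s        # prepend the lowest bit
            n = int(n / 2)
        }
        print s
    }' "$@"
}

bin2dec() {
    awk '{
        d = 0
        for (i = 1; i <= length($1); i++)
            d = d * 2 + substr($1, i, 1)
        printf "%.0f\n", d       # avoid exponent notation for large values
    }' "$@"
}

echo 4000000002 | dec2bin
echo 11101110011010110010100000000010 | bin2dec
```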

sed/awk or other: one-liner to increment a number by 1 keeping spacing characters

EDIT: I don't know in advance at which "column" my digits are going to be and I'd like to have a one-liner. Apparently sed doesn't do arithmetic, so maybe a one-liner solution based on awk?
I've got a string: (notice the spacing)
eh oh 37
and I want it to become:
eh oh 36
(so I want to keep the spacing)
I haven't found a way to do it with awk. So far I have:
echo "eh oh 37" | awk '$3>=0&&$3<=99 {$3--} {print}'
But this gives:
eh oh 36
(the spacing characters were lost, because the field separator is ' ')
Is there a way to ask awk something like "print the output using the exact same field separators as the input had"?
Then I tried yet something else, using awk's sub(..,..) method:
' sub(/[0-9][0-9]/, ...) {print}'
but no cigar yet: I don't know how to reference the regexp and do arithmetic on it in the second argument (which I left with '...' for now).
Then I tried with sed, but got stuck after this:
echo "eh oh 37" | sed -e 's/\([0-9][0-9]\)/.../'
Can I do arithmetic from sed using a reference to the matching digits and have the output not modify the number of spacing characters?
Note that it's related to my question concerning Emacs and how to apply this to some (big) Emacs region (using a replace region with Emacs's shell-command-on-region) but it's not an identical question: this one is specifically about how to "keep spaces" when working with awk/sed/etc.
Here is a variation on ghostdog74's answer that does not require the number to be anchored at the end of the string. This is accomplished using match instead of relying on the number to be in a particular position.
This will replace the first number with its value minus one:
$ echo "eh oh 37 aaa 22 bb" | awk '{n = substr($0, match($0, /[0-9]+/), RLENGTH) - 1; sub(/[0-9]+/, n); print }'
eh oh 36 aaa 22 bb
Using gsub there instead of sub would replace both the "37" and the "22" with "36". If there's only one number on the line, it doesn't matter which you use. By doing it this way, though, it will handle numbers with trailing whitespace plus other non-numeric characters that may be there (after some whitespace).
If you have gawk, you can use gensub like this to pick out an arbitrary number within the string (just set the value of which):
$ echo "eh oh 37 aaa 22 bb 19" |
awk -v which=2 'BEGIN { regex = "([0-9]+)\\>[^0-9]*";
for (i = 1; i < which; i++) {regex = regex"([0-9]+)\\>[^0-9]*"}}
{ match($0, regex, a);
n = a[which] - 1; # do the math
print gensub(/[0-9]+/, n, which) }'
eh oh 37 aaa 21 bb 19
The second (which=2) number went from 22 to 21. And the embedded spaces are preserved.
It's broken out on multiple lines to make it easier to read, but it's copy/pastable.
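For awks without gensub, a sketch that decrements every number on the line while leaving all other characters, including runs of spaces, untouched. It walks the line with match() and rebuilds it piece by piece. The function name dec_all is hypothetical, and leading zeros are not preserved.

```shell
#!/bin/sh
# dec_all: hypothetical helper. Subtracts 1 from each run of digits.
dec_all() {
    awk '{
        out = ""
        rest = $0
        while (match(rest, /[0-9]+/)) {
            before = substr(rest, 1, RSTART - 1)     # text preceding the number
            n = substr(rest, RSTART, RLENGTH) - 1    # the number, minus one
            out = out before n
            rest = substr(rest, RSTART + RLENGTH)    # continue after the match
        }
        print out rest
    }'
}

echo "eh oh    37 aaa 22 bb" | dec_all
```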
$ echo "eh oh 37" | awk '{n=$NF+1; gsub(/[0-9]+$/,n) }1'
eh oh 38
or
$ echo "eh oh 37" | awk '{n=$NF+1; gsub(/..$/,n) }1'
eh oh 38
something like
number=$(echo "eh oh 37" | grep -o '[0-9][0-9]*')
echo "eh oh 37" | sed "s/$number/$(expr "$number" + 1)/"
How about:
$ echo "eh oh 37" | awk -F'[ \t]' '{$NF = $NF - 1;} 1'
eh oh 36
The solution above does not preserve the number of digits, so if the number is 10 the result is 9, even if one would like to have 09.
I did not write the shortest possible code; it should stay readable.
Here I construct the printf format using RLENGTH, so it becomes %02d (2 being the length of the matched pattern):
$ echo "eh oh 10 aaa 22 bb" |
awk '{n = substr($0, match($0, /[0-9]+/), RLENGTH)-1 ;
nn=sprintf("%0" RLENGTH "d", n)
sub(/[0-9]+/, nn);
print
}'
eh oh 09 aaa 22 bb
