Awk change decimal formats

I've got a file containing decimal values formatted like 9.85E-4. How can I make awk format this value to 0.000985?

Use printf with the %f option:
awk '{printf "%f\n", your_field .... }' file
Example
$ cat a
9.85E-4
23
$ awk '{printf "%f\n", $1}' a
0.000985
23.000000
From The GNU Awk User’s Guide # 5.5.2 Format-Control Letters:
%e, %E
Print a number in scientific (exponential) notation.
%f
Print a number in floating-point notation
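A precision can be placed between % and f to control the number of decimal places. A minimal sketch, assuming the value sits in the first field; the function name fixed is hypothetical and only serves the illustration:

```shell
# Hypothetical helper: print the first field of every input line
# in fixed-point notation with the requested number of decimals.
fixed() {
    # Build a format such as "%.6f\n"; awk -v expands the \n escape.
    awk -v fmt="%.${1}f\n" '{ printf fmt, $1 }'
}

printf '9.85E-4\n23\n' | fixed 6
printf '9.85E-4\n23\n' | fixed 2
```

The first call prints 0.000985 and 23.000000; the second prints 0.00 and 23.00, so pick a precision that keeps the digits you need.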

Related

How to run awk on a file with cedilla as delimiter

I have a file with below contents
cat file1.dat
anuÇ89Çhyd
binduÇ45Çchennai
I would like to print the second column with Ç as delimiter.
output should be
89
45

The manpage of awk mentions the following:
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
So, this command does what you want:
cat file1.dat | awk -F'Ç' '{print $2}'

Given:
$ cat file
anuÇ89Çhyd
binduÇ45Çchennai
You can use cut:
$ cut -f 2 -d 'Ç' file
awk:
$ awk -F'Ç' '{print $2}' file
sed:
$ sed -E 's/^[^Ç]*Ç([^Ç]*).*/\1/' file
GNU grep:
$ grep -oP '^[^Ç]*Ç\K[^Ç]+(?=Ç)' file
Perl:
$ perl -lnE 'print $1 if /^[^Ç]*Ç([^Ç]+)Ç/' file
All those print:
89
45
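The awk variant generalizes to any column and any separator. A minimal sketch, assuming the file is UTF-8 encoded and the shell passes Ç through unchanged; the function name nth_field is hypothetical:

```shell
# Hypothetical helper: print one column of the input.
#   $1 = field separator, $2 = column number
nth_field() {
    awk -F"$1" -v col="$2" '{ print $col }'
}

printf 'anuÇ89Çhyd\nbinduÇ45Çchennai\n' | nth_field 'Ç' 2
```

Reading the file directly (or from standard input, as here) also avoids the extra cat process.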

Remove consecutive duplicate words from a file using awk or sed

My input file looks like below:
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
Output should look like:
"true, rohith Rohith;
cold burn, and fact and fact good?"
I am trying to do this with awk, but I could not get the desired result.
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s ",$i,FS)}{printf("\n")}' input.txt
Could someone please help me here.
Regards,
Rohith

With GNU awk for the 4th arg to split():
$ cat tst.awk
{
n = split($0,words,/[^[:alpha:]]+/,seps)
prev = ""
for (i=1; i<=n; i++) {
word = words[i]
if (word != prev) {
printf "%s%s", seps[i-1], word
}
prev = word
}
print ""
}
$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”

Just match the same backreference in sed:
sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
How it works:
:l - create a label l to jump to. See tl below.
s - substitute
/
\(^\|[^[:alpha:]]\) - match beginning of the line or non-alphabetic character. This is so that the next part matches the whole word, not only suffix.
\([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
\2 - match the same thing as in the second \(...\) - ie. match the word.
\($\|[^[:alpha:]]\) - match the end of the line or match a non-alphabetic character. That is so we match the whole second word, not only its prefix.
/
\1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
/
g - substitute globally. But because the regex never goes back over text it has already consumed, a single pass collapses only 2 words at a time.
tl - jump to label l if the last s command was successful. This is here so that when the same word appears 3 times, like true true true, the words are properly replaced by a single true.
Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\) parts, true rue, for example, would be replaced by true, because the suffix rue followed by rue would match.
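To see what the tl loop and the boundary groups do, the command can be tried on a tripled word and on a near miss. A small sketch, assuming GNU sed (the \| alternation in a basic regular expression is a GNU extension); the function name dedup_words is hypothetical:

```shell
# Hypothetical wrapper around the sed command explained above (GNU sed assumed).
dedup_words() {
    sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
}

# A single pass leaves "true true, rue"; the tl loop reduces it further.
echo 'true true true, rue' | dedup_words
# "rue" is only a suffix of "true", so nothing is removed here.
echo 'true rue' | dedup_words
```

The first call prints true, rue and the second prints true rue unchanged.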
Below are my other solutions, which also remove repeated words across lines.
My first solution was with uniq. First I transform the input into pairs of the form <non-alphabetic sequence separating words, encoded in hex> <a word>. Then I run the pairs through uniq -f1, which ignores the first field, and convert them back. This will be very slow:
# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
# output hexadecimal representation of non-word
printf "%s" "$1" | xxd -p | tr -d "\n"
# and output space with the word
printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
# just `printf "%s" "$1" | sed 's/^-//' | xxd -r -p` for posix shell
# change non-word from hex to characters
printf "%s" "${1:1}" | xxd -r -p
# output word
printf "%s" "$2"
' --
But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between the word and non-word tokens, so the stream is easy to read. I can skip repeated words by reading the zero-separated stream in GNU awk and comparing each word with the last word read:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
NR%2==1{
nonword=$0
}
NR%2==0{
if (length(lastword) && lastword != $0) {
printf "%s%s", lastword, nonword
}
lastword=$0
}
END{
printf "%s%s", lastword, nonword
}'
Instead of a zero byte, some other unique character could be used as the record separator, for example ^. That way the script also works with non-GNU awk versions; tested with mawk available on repl. The script is shortened here by using shorter variable names:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
NR%2{ n=$0 }
NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
NR%2-1 { l=$0 }
END { printf "%s%s", l, n }
'
Tested on repl. The snippets output:
true, rohith Rohith;
cold burn, and fact and fact good?

Simple sed:
echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'

This is not exactly what you have shown in output but is close using gnu-awk:
awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file
“true , rohith Rohith;
cold burn, and fact and fact good ?”

sed -E 's/(\w+) *\1/\1/g' sample.txt
sample.txt
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
output:
:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”
Explanation
(\w+) *\1 - captures a word and matches it together with any following spaces and a repetition of the same word; the whole match is replaced by the captured word
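Note that this pattern has no word boundaries and allows zero spaces, so a doubled letter inside a word (the ll in all) also matches and is shortened. A minimal sketch of a stricter variant, assuming GNU sed for \b and \w; the function name dedup_adjacent is hypothetical:

```shell
# Hypothetical helper: collapse runs of the same whole word separated by spaces.
# \b keeps the match on word boundaries; the outer (...)+ handles 3 or more repeats.
dedup_adjacent() {
    sed -E 's/\b(\w+)( +\1\b)+/\1/g'
}

printf 'all all good good good?\nthe then\n' | dedup_adjacent
```

Like the original command, this is case-sensitive and only looks at repeats within one line.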

Depending on your expected input, this might work:
sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile
([a-zA-Z0-9_-]+) = words that might be repeated.
( *)\1 = check if the previous word is repeated after a space.
s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).
s/  / /g = removes double spaces.
This works with GNU sed.

Converting timestamp to EPOCH in awk

I am converting timestamps to epoch seconds in awk, but I get incorrect output for repeated timestamps.
Input:
20180614 00:00:00
20180614 00:00:23
20180614 22:45:00
20180614 22:45:21
20180614 00:00:00
20180614 00:00:23
Expected Output :
1528930800
1528930823
1528930800
1529012721
1528930800
1528930823
I did
awk '{ ts="\""$0"\""; ("date +%s -d "ts)| getline epochsec; print epochsec}'
output after running above command:
1528930800
1528930823
1529012700
1529012721
1529012721
1529012721

With GNU xargs:
xargs -I {} date +%s -d {} < file
Output:
1528927200
1528927223
1529009100
1529009121
1528927200
1528927223

A slightly shorter GNU awk version uses FIELDWIDTHS, which is available from gawk-2.13 onwards:
awk 'BEGIN{FIELDWIDTHS="4 2 3 2 1 2 1 2"}{print mktime($1" "$2" "$3$4" "$6" "$8)}'
Since gawk-4.2 you can skip intervening fields:
awk 'BEGIN{FIELDWIDTHS="4 2 2 1:2 1:2 1:2"}{print mktime($1" "$2" "$3" "$4" "$5" "$6)}'
Or even shorter using FPAT
awk 'BEGIN{FPAT="[0-9][0-9]"}{print mktime($1$2" "$3" "$4" "$5" "$6" "$7)}'
Note: a single awk process using mktime will be faster than anything that calls date for every line, because no external binary has to be started repeatedly. Nonetheless, the xargs solution given by Cyrus is by far the most convenient one.

You could use system function
$ awk '{system("date +%s -d \""$0"\"")}' ip.txt
1528914600
1528914623
1528996500
1528996521
1528914600
1528914623
Or use sed
$ sed 's/.*/date +%s -d "&"/e' ip.txt
1528914600
1528914623
1528996500
1528996521
1528914600
1528914623
As per AllAboutGetline article, you'll need
$ awk '{ ts="date +%s -d \""$0"\""; while ((ts|getline ep)>0) print ep; close(ts) }' ip.txt
1528914600
1528914623
1528996500
1528996521
1528914600
1528914623
However, getline is not needed at all in this case. Avoid using it unless you really need it and know how to use it.
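The repeated values in the question come from never closing the command: once the output of a given command string has been read, a later getline on the same string hits end of file and leaves the variable unchanged. A minimal sketch that closes each command and also caches results, so that date runs once per distinct timestamp, assuming GNU date; the function name to_epoch and the array name cache are hypothetical:

```shell
# Hypothetical helper: convert "YYYYMMDD HH:MM:SS" lines to epoch seconds.
# Assumes GNU date. Each line is passed to a shell, so the input must be trusted.
to_epoch() {
    awk '{
        if (!($0 in cache)) {
            cmd = "date +%s -d \"" $0 "\""
            # Read the single output line; keep an empty value if date fails.
            if ((cmd | getline out) <= 0) out = ""
            close(cmd)      # without this, a repeated cmd only returns EOF
            cache[$0] = out
        }
        print cache[$0]
    }' "$@"
}

export TZ=UTC   # make the result independent of the local time zone
printf '20180614 00:00:00\n20180614 22:45:21\n20180614 00:00:00\n' | to_epoch
```

With TZ=UTC this prints 1528934400, 1529016321 and 1528934400. Other time zones shift the values, which is why the outputs quoted in the different answers do not agree with each other.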

Using GNU awk mktime function:
awk '{gsub(":"," ",$2); print mktime(substr($1,1,4) " " substr($1,5,2) " " substr($1,7,2) " " $2)}' file
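mktime is a gawk extension. Where it is unavailable, the same conversion can be sketched with plain calendar arithmetic in POSIX awk. This swaps in the well-known days-from-civil calculation rather than anything from the answers here, and it assumes the timestamps are UTC (no time zone or DST handling); the function name utc_epoch is hypothetical:

```shell
# Hypothetical helper: "YYYYMMDD HH:MM:SS" (taken as UTC) to epoch seconds,
# using only arithmetic available in any POSIX awk.
utc_epoch() {
    awk '{
        y = substr($1, 1, 4) + 0
        m = substr($1, 5, 2) + 0
        d = substr($1, 7, 2) + 0
        split($2, t, ":")
        # Shift the year to start in March so the leap day falls at its end.
        if (m <= 2) { y--; m += 9 } else { m -= 3 }
        era = int(y / 400)                      # 400-year Gregorian cycle
        yoe = y - era * 400                     # year within the cycle
        doy = int((153 * m + 2) / 5) + d - 1    # day within the shifted year
        doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
        days = era * 146097 + doe - 719468      # days since 1970-01-01
        printf "%.0f\n", days * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
    }' "$@"
}

printf '20180614 00:00:00\n20180614 22:45:21\n' | utc_epoch
```

This prints 1528934400 and 1529016321. Unlike mktime, it ignores the local time zone, so subtract the UTC offset yourself if the timestamps are local times.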

To add to Cyrus's answer, the following works on macOS, which handles the conversion from a date-time format to epoch differently.
xargs -I {} date -j -u -f "%a %b %d %T %Z %Y" {} +%s < file

padding leading zeros in a column using awk

I would like to left-pad the first column with zeros to a fixed width (say 14).
input.txt
17: gdgdgd
117: aa
I need the output as below:
00000000000017: gdgdgd
00000000000117: aa
I have tried awk -F: '{ printf "%14i: %s\n", $1,$2 }' input.txt, but padding wider than %09i (nine) is not working.

Try:
awk -F: '{ printf "%014i: %s\n", $1,$2 }' input.txt
see here
A leading ‘0’ (zero) acts as a flag that indicates that output should
be padded with zeros instead of spaces. This applies only to the
numeric output formats. This flag only has an effect when the field
width is wider than the value to print.
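With -F: the second field keeps its leading space, so the format above prints two spaces after the colon. A minimal sketch that pads only the part before the first colon and leaves the rest of the line untouched, assuming the key is always a non-negative integer; the function name pad_key is hypothetical:

```shell
# Hypothetical helper: zero-pad the text before the first ":" to a given width.
#   $1 = width (defaults to 14)
pad_key() {
    # Build a format such as "%014d%s\n"; awk -v expands the \n escape.
    awk -v fmt="%0${1:-14}d%s\n" '{
        n = index($0, ":")                  # position of the first colon
        if (n > 1) {
            printf fmt, substr($0, 1, n - 1), substr($0, n)
        } else {
            print                           # no key: leave the line as is
        }
    }'
}

printf '17: gdgdgd\n117: aa\n' | pad_key 14
```

This prints 00000000000017: gdgdgd and 00000000000117: aa, matching the requested output.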
