AWK Preserve Header in Output - unix

Hi, I have a CSV file like so:
order,account,product
23023,Best Buy,productA
20342,Best Buy,productB
20392,Wal-Mart,productC
I am using this solution from a previous thread:
awk -F ',' '{ print > ("split-" $2 ".csv") }' dataset1.csv
However, the output produces two files without headers:
File1
23023,Best Buy,productA
20342,Best Buy,productB
File2
20392,Wal-Mart,productC
How can I modify the awk solution above to preserve the header line in each split file so that the output resembles:
File 1
order,account,product
23023,Best Buy,productA
20342,Best Buy,productB
File2
order,account,product
20392,Wal-Mart,productC
Many thanks!

I would write this:
awk -F, '
NR == 1 { header = $0; next }
!($2 in files) {
    files[$2] = "split-" $2 ".csv"
    print header > files[$2]
}
{ print > files[$2] }
' dataset1.csv
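On the sample dataset1.csv this should produce two files, split-Best Buy.csv (note the space, since the name comes straight from $2) and split-Wal-Mart.csv, each starting with the header:

$ cat 'split-Best Buy.csv'
order,account,product
23023,Best Buy,productA
20342,Best Buy,productB
$ cat split-Wal-Mart.csv
order,account,product
20392,Wal-Mart,productC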

You can use this awk script:
script.awk
NR == 1 { header = $0; next }
{
    fname = "split-" $2 ".csv"
    if ( !($2 in mem) ) {
        print header > fname
        mem[$2] = 1
    }
    print > fname
}
You use it like this: awk -F, -f script.awk dataset1.csv
Explanation
the first line of the script stores the header while reading the first line of the data file
for the remaining data lines, the script writes the header into fname, but only on the first write to each fname
this is tracked by storing $2 in mem

Another similar awk:
awk -F, 'NR==1 {h=$0; next}
{file="split-" $2 ".csv";
print (a[file]++?"":h ORS) $0 > file}' input
a[file]++ is a per-file line counter indexed by the output filename; the header, with ORS appended, is inserted only in front of the first line written to each file, so it becomes the header of each split file.
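If the a[file]++ trick looks cryptic, here is a minimal sketch of it in isolation (not part of the split script): a post-increment returns the old value, so the expression is false (0) the first time a given key is seen and true afterwards.

awk 'BEGIN {
    print (seen["x"]++ ? "already seen" : "first time")
    print (seen["x"]++ ? "already seen" : "first time")
}'
# prints "first time", then "already seen"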

Related

generate header and trailer after splitting files

This is the code I already use for splitting:
awk -v DATE="$(date +"%d%m%Y")" -F\, '
BEGIN { OFS="," }
NR==1 { h=$0; next }
{
    gsub(/"/, "", $1);
    file="Assgmt_"$1"_"DATE".csv";
    print (a[file]++?"":h ORS) $0 > file
}
' Test_01012020.CSV
But how can I add a header and trailer to the above command?
I hope this helps you,
awk -v DATE="$(date +"%d%m%Y")" -F\, '
BEGIN { OFS="," }
NR==1 { h=$0; next }
{
    gsub(/"/, "", $1);
    file="Assgmt_"$1"_"DATE".csv";
    print (a[file]++?"":DATE ORS h ORS) $0 > file
}
END { for (file in a) print "EOF" > file }
' Test_01012020.CSV
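With that version, each generated Assgmt_<field1>_<DATE>.csv should come out with this shape (a sketch with placeholders, since the contents of Test_01012020.CSV aren't shown; the EOF trailer is written by the END block while the output files are still open, so it lands after the data rows):

01012020                                          <- the DATE value for the run
<header line of Test_01012020.CSV>
<data rows whose first field matched this file>
EOF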

awk if statement with simple math

I'm just trying to do some basic calculations on a CSV file.
Data:
31590,Foo,70
28327,Bar,291
25155,Baz,583
24179,Food,694
28670,Spaz,67
22190,bawk,4431
29584,alfred,142
27698,brian,379
24372,peter,22
25064,weinberger,8
Here's my simple awk script:
#!/usr/local/bin/gawk -f
BEGIN { FPAT="([^,]*)|(\"[^\"]+\")"; OFS=","; OFMT="%.2f"; }
NR > 1
END { if ($3>1336) $4=$3*0.03; if ($3<1336) $4=$3*0.05;}1
Wrong output:
31590,Foo,70
28327,Bar,291
28327,Bar,291
25155,Baz,583
25155,Baz,583
24179,Food,694
24179,Food,694
28670,Spaz,67
28670,Spaz,67
22190,bawk,4431
22190,bawk,4431
29584,alfred,142
29584,alfred,142
27698,brian,379
27698,brian,379
24372,peter,22
24372,peter,22
25064,weinberger,8
25064,weinberger,8
Expected output:
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
The simple math is:
if field $3 > 1336, then $4 = $3 * 0.03
if field $3 < 1336, then $4 = $3 * 0.05
There's no need to force awk to rebuild every record (by assigning to $4); just print the current record followed by the result of your calculation:
awk 'BEGIN{FS=OFS=","; OFMT="%.2f"} {print $0, $3*($3>1336?0.03:0.05)}' file
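A small aside on OFMT, which this one-liner keeps from the original script: it controls how print renders computed, non-integral numbers, so it is what pins the result to two decimals. A quick check in isolation:

awk 'BEGIN { OFMT = "%.2f"; print 291 * 0.05 }'   # 14.55
awk 'BEGIN { OFMT = "%.2f"; print 70 * 0.05 }'    # 3.50 (with the default "%.6g" this would print as 3.5)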
You shouldn't have anything in the END block
BEGIN {
    FS = OFS = ","
    OFMT = "%.2f"
}
{
    if ($3 > 1336)
        $4 = $3 * 0.03
    else
        $4 = $3 * 0.05
    print
}
This results in
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
$ awk -F, -v OFS=, '{if ($3>1336) $4=$3*0.03; else $4=$3*0.05;} 1' data
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
Discussion
The END block is not executed at the end of each line but at the end of the whole file. Consequently, it is not helpful here.
The original code has two free-standing conditions, NR > 1 and 1. The default action for each is to print the line; that is why, in the "wrong output", every line after the first appears twice.
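To see those two conditions in isolation:

awk 'NR > 1' file   # a bare pattern defaults to { print }: every line except the first
awk '1' file        # 1 is always true: every line
A program containing both therefore prints the first line once and every later line twice.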
With awk:
awk -F, -v OFS=, '$3>1336?$4=$3*.03:$4=$3*.05' file
The construct condition ? expression1 : expression2 is the much shorter ternary operator in awk.
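If your awk objects to the assignments inside ?: (some implementations are picky about that), a variant that keeps the assignment outside the ternary does the same thing:

awk -F, -v OFS=, '{ $4 = $3 * ($3 > 1336 ? 0.03 : 0.05) } 1' file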

While read line, awk $line with multiple delimiters

I am trying a small variation of this, except I am telling awk that the delimiter of the file to be split (based on the 5th field) can be either a colon ":" or a tab \t. When I run the awk -F '[:\t]' part alone, it does indeed print the right $5 field.
However, when I try to incorporate this into the bigger command, it returns the following error:
print > f
awk: cmd. line:9: ^ syntax error
This is the code:
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = "Alignments_" $5 ".sam" print > f
} ' Tile_Number_List.txt little.sam
Why won't it work with the -F option?
The problem isn't with the value of FS; it's this line, as pointed to by the error:
f = "Alignments_" $5 ".sam" print > f
You have two statements on one line so either separate them with a ; or a newline:
f = "Alignments_" $5 ".sam"; print > f
Or:
f = "Alignments_" $5 ".sam"
print > f
As full one liner:
awk -F '[:\t]' 'FNR==NR{n[$1];next}$5 in n{print > ("Alignments_"$5".sam")}'
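Note the parentheses around the file name there: when the target of > is built by string concatenation, the unparenthesized form is parsed differently by different awks, so parenthesizing it is the usual portability advice:

print > "Alignments_" $5 ".sam"     # ambiguous, awk-dependent parsing
print > ("Alignments_" $5 ".sam")   # portable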
Or as a script file i.e script.awk:
BEGIN {
    FS = "[:\t]"
}
# read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    f = "Alignments_" $5 ".sam"
    print > f
}
To run in this form awk -f script.awk Tile_Number_List.txt little.sam.
Edit:
With many *nix tools, the character - is used to represent input from stdin instead of a file:
command | awk -f script.awk Tile_Number_List.txt -

awk field separator when the separator shows up inside double quotes

I am trying to use awk to read some input at field position 3, $3; field 3 is a string.
awk -F'","' '{print $1}' input.txt
My file input.txt looks like this:
field1,field2,field3,field4,field5
The problem is that these fields are separated by commas; some of them are double-quoted while others are not. Field 5 is double-quoted and can contain every type of symbol. Example:
imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"
Can awk handle a situation like this?
More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:
function parse_csv(..) {
    ..
}
{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    print csv[2]
}
If you're not hell-bent on Awk, Python also comes with a nice CSV parser:
import csv, sys
for row in csv.reader(sys.stdin):
    print row[2]
Or from the command line (bit tricky in one line):
python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up to the task:
awk -F , '
{
    if ($3 ~ /^".*"$/) {
        $3 = substr($3, 2, length($3)-2);   # strip the surrounding quotes
        gsub(/""/, "\"", $3);               # un-escape doubled quotes
    }
    print $3;
}' input.txt
This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
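That said, GNU awk can get further than plain awk here. The FPAT approach from the simple-math question above copes with the simple quoted fields (though not with the doubled "" escapes in field 5 of this sample), and a sufficiently recent gawk (5.3 or later, an assumption about your install) has a dedicated CSV mode:

gawk -v FPAT='([^,]*)|("[^"]+")' '{ print $3 }' input.txt   # prints "imfield3", quotes included
gawk --csv '{ print $5 }' input.txt                         # gawk 5.3+: real CSV parsing, "" escapes and quotes handled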
You can parse the line in awk by setting a null field separator. Instead of printf("%s", $i) you can assign $i to a variable and print it out when inda == 0.
#echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS="" }              # empty FS: every character is its own field (gawk)
{
    for ( i=1; i<=NF; i++ ) {     # i<=NF so the last character is not dropped
        if ( $i == "\"" )         # toggle the "inside quotes" flag
            if ( inda == 0 )
                inda = 1
            else
                inda = 0
        if ( $i == "," )          # a comma outside quotes becomes the new separator
            if ( inda == 0 )
                $i = "|"
        printf("%s", $i)
    }
    printf("\n")
}' uno
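With the loop running over every character (i <= NF) and assuming GNU awk (an empty FS splitting the record into one character per field is a gawk extension), the sample line from the commented-out echo should come out with only the field-separating commas turned into pipes:

input:  "AAA,BBB","CCC","DDD, EEE, FFF"
output: "AAA,BBB"|"CCC"|"DDD, EEE, FFF"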

how do you skip the last line w/ awk?

I am processing a file with awk and need to skip some lines. The internet doesn't have a good answer.
So far the only info I have is that you can skip a range by doing:
awk 'NR==6,NR==13 {print}' input.file
OR
awk 'NR <= 5 { next } NR > 13 {exit} { print}' input.file
You can skip the first line with:
awk 'NR == 1 { next } { print }' db_berths.txt
How do you skip the last line?
One way using awk:
awk 'NR > 1 { print prev } { prev = $0 }' file.txt
Or better with sed:
sed '$d' file.txt
You can first get the total number of lines:
awk 'END{print NR}' file
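That only gives you the line count, though; to actually drop the last line you can feed that count back into a second pass, as in this sketch (it reads the file twice):

n=$(awk 'END { print NR }' file)
awk -v n="$n" 'NR < n' file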
