generate header and trailer after splitting files - unix

This is the code I already use for splitting:
awk -v DATE="$(date +"%d%m%Y")" -F\, '
BEGIN{OFS=","}
NR==1 {h=$0; next}
{
gsub(/"/, "", $1);
file="Assgmt_"$1"_"DATE".csv";
print (a[file]++?"":h ORS) $0 > file
}
' Test_01012020.CSV
But how can I add a header and trailer to the command above?

I hope this helps you,
awk -v DATE="$(date +"%d%m%Y")" -F\, '
BEGIN{OFS=","}
NR==1 {h=$0; next}
{
gsub(/"/, "", $1);
file="Assgmt_"$1"_"DATE".csv";
print (a[file]++?"":DATE ORS h ORS) $0 > file
}
END{for(file in a) print "EOF" > file}
' Test_01012020.CSV
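For reference, here is a hypothetical sample run of that command. The input file content, the `id,name` column layout, and the `A1` key are made up for illustration; the awk program itself is the one above.

```shell
# Hypothetical throwaway input: a header plus rows whose quoted
# first column is the split key.
cat > Test_01012020.CSV <<'EOF'
id,name
"A1",foo
"A1",bar
EOF

awk -v DATE="$(date +"%d%m%Y")" -F, '
BEGIN { OFS = "," }
NR==1 { h = $0; next }
{
  gsub(/"/, "", $1)
  file = "Assgmt_" $1 "_" DATE ".csv"
  # On the first line written to each file, prepend the date line
  # and the header; afterwards just append the row.
  print (a[file]++ ? "" : DATE ORS h ORS) $0 > file
}
END { for (file in a) print "EOF" > file }   # trailer
' Test_01012020.CSV

cat "Assgmt_A1_$(date +"%d%m%Y").csv"
```

The resulting split file contains the date line, the original header, the two `A1` rows, and a final `EOF` trailer line.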

Related

UNIX shell script reading csv

I have a csv file and would like to put the fields into different variables. Suppose there are three fields in each line of the csv file. I have this code:
csvfile=test.csv
while read inline; do
var1=`echo $inline | awk -F',' '{print $1}'`
var2=`echo $inline | awk -F',' '{print $2}'`
var3=`echo $inline | awk -F',' '{print $3}'`
.
.
.
done < $csvfile
This code works. However, if a field contains an embedded comma, it does not. Any suggestions? For example:
how,are,you
I,"am, very",good
this,is,"a, line"
This may not be the perfect solution but it will work in your case.
File content:
[cloudera@quickstart Documents]$ cat cd.csv
a,b,c
d,"e,f",g
csvfile=cd.csv
while read inline; do
var1=`echo $inline | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "*", $i) }1' | awk -F',' '{print $1}' | sed 's/*/,/g'`
var2=`echo $inline | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "*", $i) }1' | awk -F',' '{print $2}' | sed 's/*/,/g'`
var3=`echo $inline | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "*", $i) }1' | awk -F',' '{print $3}' | sed 's/*/,/g'`
echo $var1 " " $var2 " " $var3
done < $csvfile
Output :
[cloudera@quickstart Documents]$ sh a.sh
a b c
d e,f g
So basically we first handle commas inside quoted fields by replacing them with "*", pick the desired column with awk on ",", and then turn "*" back into "," to recover the actual field value.
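The same masking idea can also be done in a single awk pass. This is a sketch in POSIX awk; it uses awk's built-in SUBSEP character as the mask instead of `*`, so data that itself contains `*` survives:

```shell
printf 'a,b,c\nd,"e,f",g\n' | awk -F'"' '{
  # With " as the field separator, even-numbered chunks were inside
  # quotes: mask their commas, rejoin, then split on the real commas.
  line = ""
  for (i = 1; i <= NF; i++) {
    s = $i
    if (i % 2 == 0) gsub(/,/, SUBSEP, s)   # comma inside quotes -> mask
    line = line s
  }
  n = split(line, f, ",")
  for (j = 1; j <= n; j++) gsub(SUBSEP, ",", f[j])   # unmask
  print f[1], f[2], f[3]
}'
```

This prints `a b c` and `d e,f g`, matching the output above, without spawning three pipelines per line.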

AWK Preserve Header in Output

Hi I have a csv file like so:
order,account,product
23023,Best Buy,productA
20342,Best Buy,productB
20392,Wal-Mart,productC
I am using this solution from a previous thread:
awk -F ',' '{ print > ("split-" $2 ".csv") }' dataset1.csv
However the output produces 2 files without headers:
File1
23023,Best Buy,productA
20342,Best Buy,productB
File2
20392,Wal-Mart,productC
How can I modify the awk solution above to preserve the header line in each split file so that the output resembles:
File 1
order,account,product
23023,Best Buy,productA
20342,Best Buy,productB
File2
order,account,product
20392,Wal-Mart,productC
Many thanks!
I would write this:
awk -F, '
NR == 1 { header = $0; next}
!($2 in files) {
files[$2] = "split-" $2 ".csv"
print header > files[$2]
}
{ print > files[$2] }
' dataset1.csv
You can use this awk script:
script.awk
NR == 1 { header = $0; next}
{ fname = "split-" $2 ".csv"
if( !( $2 in mem ) ) {
print header > fname
mem[ $2 ] = 1
}
print > fname
}
You use it like this: awk -F, -f script.awk dataset1.csv
Explanation
the first line of the script stores the header while the first line of the data file is read
for all other data lines, the script writes the header into fname, but only on the first write to that fname
this is tracked by storing $2 in mem
Another similar awk:
awk -F, 'NR==1 {h=$0; next}
{file="split-" $2 ".csv";
print (a[file]++?"":h ORS) $0 > file}' input
a[file]++ is a per-file line counter indexed by the output filename: the header (with ORS appended) is inserted only before the first line written to each file, so it becomes the header of every split file.
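A quick check of that idiom on the question's own data (note the account name contains a space, so one output file is `split-Best Buy.csv`):

```shell
printf '%s\n' 'order,account,product' \
  '23023,Best Buy,productA' '20342,Best Buy,productB' '20392,Wal-Mart,productC' |
awk -F, 'NR==1 {h=$0; next}
  {file = "split-" $2 ".csv"
   # a[file]++ is 0 (false) only on the first record for this file,
   # so the header h plus ORS is prepended exactly once per file.
   print (a[file]++ ? "" : h ORS) $0 > file}'

cat 'split-Wal-Mart.csv'
```

Each split file starts with `order,account,product`, followed by its own rows.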

How To Run Multiple "awk" commands:

I would like to run multiple "awk" commands in a single script.
For example, Master.csv.gz is located at /cygdrive/e/Test/Master.csv.gz, and
the input files are located in different subdirectories like /cygdrive/f/Jan/Input_Jan.csv.gz, /cygdrive/f/Feb/Input_Feb.csv.gz, and so on.
All input files have the *.gz extension.
The commands below work fine when executed one by one:
Command#1
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$2] = $0; next} ($2 in a) {print $0}' <(gzip -dc /cygdrive/e/Test/Master.csv.gz) <(gzip -dc /cygdrive/f/Jan/Input_Jan.csv.gz) >>Output.txt
Output#1:
Name,Age,Location
abc,20,xxx
Command#2
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$2] = $0; next} ($2 in a) {print $0}' <(gzip -dc /cygdrive/e/Test/Master.csv.gz) <(gzip -dc /cygdrive/f/Feb/Input_Feb.csv.gz) >>Output.txt
Output#2:
Name,Age,Location
def,40,yyy
cat Output.txt
Name,Age,Location
abc,20,xxx
def,40,yyy
I have tried running the commands below from a single script, and got errors:
Attempt#1: awk -f Test.awk
cat Test.awk
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$2] = $0; next} ($2 in a) {print $0}' <(gzip -dc /cygdrive/e/Test/Master.csv.gz) <(gzip -dc /cygdrive/f/Jan/Input_Jan.csv.gz) >>Output.txt
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$2] = $0; next} ($2 in a) {print $0}' <(gzip -dc /cygdrive/e/Test/Master.csv.gz) <(gzip -dc /cygdrive/f/Feb/Input_Feb.csv.gz) >>Output.txt
Error : Attempt#1: awk -f Test.awk
awk: Test.awk:1: ^ invalid char ''' in expression
awk: Test.awk:1: ^ syntax error
Attempt#2: sh Test.sh
cat Test.sh
#!/bin/sh
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$2] = $0; next} ($2 in a) {print $0}' <(gzip -dc /cygdrive/e/Test/Master.csv.gz) <(gzip -dc /cygdrive/f/Jan/Input_Jan.csv.gz) >>Output.txt
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$2] = $0; next} ($2 in a) {print $0}' <(gzip -dc /cygdrive/e/Test/Master.csv.gz) <(gzip -dc /cygdrive/f/Feb/Input_Feb.csv.gz) >>Output.txt
Error : Attempt#2: sh Test.sh
Test.sh: line 2: syntax error near unexpected token `('
Desired Output:
Name,Age,Location
abc,20,xxx
def,40,yyy
Looking forward to your suggestions.
Update#2-Month Name
Ed Morton, thanks for the input; however, the output order is not right: "Jan2014" is printed on the next line. Please suggest.
cat Output.txt:
Name,Age,Location
abc,20,xxx
Jan2014
def,40,yyy
Feb2014
Expected Output
Name,Age,Location
abc,20,xxx,Jan2014
def,40,yyy,Feb2014
All you need is:
#!/bin/bash
awk -F, 'FNR==NR{a[$2]; next} $2 in a' \
<(gzip -dc /cygdrive/e/Test/Master.csv.gz) \
<(gzip -dc /cygdrive/f/Jan/Input_Jan.csv.gz) \
<(gzip -dc /cygdrive/f/Feb/Input_Feb.csv.gz) \
>> Output.txt
If you want to print the month name too then the simplest thing would be:
#!/bin/bash
awk -F, 'FNR==NR{a[$2]; next} $2 in a{print $0, mth}' \
<(gzip -dc /cygdrive/e/Test/Master.csv.gz) \
mth="Jan" <(gzip -dc /cygdrive/f/Jan/Input_Jan.csv.gz) \
mth="Feb" <(gzip -dc /cygdrive/f/Feb/Input_Feb.csv.gz) \
>> Output.txt
but you could avoid spelling the month name out three times on each line with:
#!/bin/bash
mths=(Jan Feb)
awk -F, 'FNR==NR{a[$2]; next} $2 in a{print $0, mth}' \
<(gzip -dc /cygdrive/e/Test/Master.csv.gz) \
mth="${mths[$((i++))]}" <(gzip -dc "/cygdrive/f/${mths[$i]}/Input_${mths[$i]}.csv.gz") \
mth="${mths[$((i++))]}" <(gzip -dc "/cygdrive/f/${mths[$i]}/Input_${mths[$i]}.csv.gz") \
>> Output.txt
Your first attempt failed because you were trying to call awk from within an awk script, and your second failed because bash process substitution, <(...), is not defined by POSIX and is not guaranteed to work with /bin/sh. Here is an awk script that should work.
#!/usr/bin/awk -f
BEGIN {
if (ARGC < 3) exit 1;
ct = "cat ";
gz = "gzip -dc "
f = "\"" ARGV[1] "\"";
c = (f~/\.gz$/)?gz:ct;
while ((c f | getline t) > 0) {
split(t, a, ",");
A[a[2]] = t;
}
close(c f);
for (n = 2; n < ARGC; n++) {
f = "\"" ARGV[n] "\"";
c = (f~/\.gz$/)?gz:ct;
while ((c f | getline t) > 0) {
split(t, a, ",");
if (a[2] in A) print t;
}
close(c f);
}
exit;
}
usage
script.awk /cygdrive/e/Test/Master.csv.gz /cygdrive/f/Jan/Input_Jan.csv.gz
script.awk /cygdrive/e/Test/Master.csv.gz /cygdrive/f/Feb/Input_Feb.csv.gz
or
script.awk /cygdrive/e/Test/Master.csv.gz /cygdrive/f/Jan/Input_Jan.csv.gz \
    /cygdrive/f/Feb/Input_Feb.csv.gz
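The FNR==NR lookup at the heart of all of these can be checked with throwaway uncompressed files (the data below is made up; in the real setup both streams would come from `gzip -dc`). The first file's $2 values are remembered; rows of the later files are printed only if their $2 was seen:

```shell
# Master: remember every $2 (here, 20 and 40).
printf 'n1,20,l1\nn2,40,l2\n' > master.csv
# Input: only the row whose $2 matches (20) should survive.
printf 'abc,20,xxx\nghi,50,zzz\n' > input.csv

# FNR==NR is true only while reading the first file.
awk -F, 'FNR==NR {a[$2]; next} $2 in a' master.csv input.csv
```

This prints only `abc,20,xxx`, the row whose second field appears in the master file.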

How to find the distinct values in unix

I need distinct values from the below columns:
AA|BB|CC
a#gmail.com,c#yahoo.co.in|a#gmail.com|a#gmail.com
y#gmail.com|x#yahoo.in,z#redhat.com|z#redhat.com
c#gmail.com|b#yahoo.co.in|c#uix.xo.in
Here records are '|'-separated, and in the first column we can have two email IDs that are ','-separated, so I want to handle that as well. I want the distinct email IDs in the AA, BB, and CC columns, whether they are '|'-separated or ','-separated.
Expected output:
c#yahoo.co.in|a#gmail.com|
y#gmail.com|x#yahoo.in|z#redhat.com
c#gmail.com|b#yahoo.co.in|c#uix.xo.in
is awk unix enough for you?
{
for(i=1; i <= NF; i++) {
if ($i ~ /#/) {
mail[$i]++
}
}
}
END {
for (x in mail) {
print mail[x], x
}
}
output:
$ awk -F'[|,]' -f v.awk f1
2 z#redhat.com
3 a#gmail.com
1 x#yahoo.in
1 c#yahoo.co.in
1 c#gmail.com
1 y#gmail.com
1 b#yahoo.co.in
Using awk :
cat file | tr ',' '|' | awk -F '|' '{ line=""; for (i=1; i<=NF; i++) {if ($i != "" && list[NR"#"$i] != 1){line=line $i "|"}; list[NR"#"$i]=1 }; print line}'
Prints :
a#gmail.com|c#yahoo.co.in|
y#gmail.com|x#yahoo.in|z#redhat.com|
c#gmail.com|b#yahoo.co.in|c#uix.xo.in|
Edit :
Now works properly with inputs such as :
a#gmail.com|c#yahoo.co.in|
y#gmail.com|x#yahoo.in|a#gmail.com|
c#gmail.com|c#yahoo.co.in|c#uix.xo.in|
Prints :
a#gmail.com|c#yahoo.co.in|
y#gmail.com|x#yahoo.in|a#gmail.com|
c#gmail.com|c#yahoo.co.in|c#uix.xo.in|
The following python code will solve your problem:
#!/usr/bin/env python3
while True:
    try:
        addrs = input()
    except EOFError:
        break
    print('|'.join(set(addrs.replace(',', '|').split('|'))))
In Bash only:
while read s; do
IFS='|,'
for e in $s; do
echo "$e"
done | sort | uniq
unset IFS
done
This seems to work, although I'm not sure what to do if there are more than three unique mails. Run with awk -f filename.awk dataname.dat
BEGIN { FS = "[,|]" }
NF {
    delete uniqmails
    for (i = 1; i <= NF; i++)
        uniqmails[$i] = 1
    sep = ""
    n = 0
    for (m in uniqmails) {
        printf "%s%s", sep, m
        sep = "|"
        n++
    }
    for (; n < 3; n++) printf "|"
    print ""   # EOL
}
There's also this "one-liner" that doesn't need awk:
while read line; do
echo $line | tr ",|" "\n" | sort -u |\
paste <( seq 3) - | cut -f 2 |\
tr "\n" "|" |\
rev | cut -c 2- | rev;
done
With perl:
perl -lane '$s{$_}++ for split /[|,]/; END { print for keys %s;}' input
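Several of the answers above print one global set of addresses, while the expected output in the question is deduplicated per row. A sketch for that, using the question's own sample rows (with its `#`-for-`@` obfuscation) and keeping the first occurrence of each address; `delete seen` on a whole array is a widely supported awk extension, though not strictly POSIX:

```shell
printf 'a#gmail.com,c#yahoo.co.in|a#gmail.com|a#gmail.com\ny#gmail.com|x#yahoo.in,z#redhat.com|z#redhat.com\n' |
awk -F'[|,]' '{
  delete seen                      # per-line duplicate tracker
  out = ""
  for (i = 1; i <= NF; i++)
    if (!($i in seen)) {
      seen[$i]                     # mark address as seen on this line
      out = out (out == "" ? "" : "|") $i
    }
  print out
}'
```

This prints `a#gmail.com|c#yahoo.co.in` for the first row and `y#gmail.com|x#yahoo.in|z#redhat.com` for the second.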
I have edited this post; hope it works now.
while read line
do
val1=`echo $line|awk -F"|" '{print $1}'`
val2=`echo $line|awk -F"|" '{print $2}'`
val3=`echo $line|awk -F"|" '{print $3}'`
a=`echo $line|awk -F"|" '{print $2,"|",$3}'|sed 's/'$val1'//g'`
aa=`echo "$val1|$a"`
b=`echo $aa|awk -F"|" '{print $1,"|",$3}'|sed 's/'$val2'//g'`
b1=`echo $b|awk -F"|" '{print $1}'`
b2=`echo $b|awk -F"|" '{print $2}'`
bb=`echo "$b1|$val2|$b2"`
c=`echo $bb|awk -F"|" '{print $1,"|",$2}'|sed 's/'$val3'//g'`
cc=`echo "$c|$val3"|sed 's/,,/,/;s/,|/|/;s/|,/|/;s/^,//;s/ //g'`
echo "$cc">>abcd
done<ab.dat
cat abcd
c#yahoo.co.in||a#gmail.com
y#gmail.com|x#yahoo.in|z#redhat.com
c#gmail.com|b#yahoo.co.in|c#uix.xo.in
You can extract the ","-separated values and parse them in the same way, if all of your values are ","-separated.

how do you skip the last line w/ awk?

I am processing a file with awk and need to skip some lines. The internet doesn't have a good answer.
So far the only info I have is that you can skip a range by doing:
awk 'NR==6,NR==13 {print}' input.file
OR
awk 'NR <= 5 { next } NR > 13 {exit} { print}' input.file
You can skip the first line by inputting:
awk 'NR > 1 { print }' db_berths.txt
How do you skip the last line?
One way using awk:
awk 'NR > 1 { print prev } { prev = $0 }' file.txt
Or better with sed:
sed '$d' file.txt
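Both approaches can be sanity-checked on a throwaway three-line file:

```shell
printf 'one\ntwo\nthree\n' > /tmp/skiplast.txt

# awk: buffer each line and print the previous one,
# so the last line is never printed.
awk 'NR > 1 { print prev } { prev = $0 }' /tmp/skiplast.txt

# sed: $ addresses the last line, d deletes it.
sed '$d' /tmp/skiplast.txt
```

Each command prints `one` and `two`, dropping `three`.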
You can also get the line count first, then skip the last line in a second pass:
awk -v n="$(awk 'END{print NR}' file)" 'NR < n' file
