Remove column in awk command - unix

I want to remove columns. For now I have one command that does the splitting and adds the header and trailer. I want to do the splitting, header, trailer and removal of the first 3 columns in one command.
My original output is
Country Gender Plate Name Account Number Car Name
X F A Fara XXXXXXXXXX Ixora
X F b Jiha XXXXXXXXXX Saga
X M c Jiji XXXXXXXXXX Proton
My split files:
first file
06032017
Country Gender Plate Name Account Number Car Name
X F A Fara XXXXXXXXXX Ixora
EOF 1
second file
06032017
Country Gender Plate Name Account Number Car Name
X F b Jiha XXXXXXXXXX Saga
EOF1
My desired output:
06032017
Name Account Number Car Name
Fara XXXXXXXXXX Ixora
EOF 1
06032017
Name Account Number Car Name
Jiha XXXXXXXXXX Saga
EOF1
This is my splitting command:
awk -v date="$(date +"%d%m%Y")" -F\| 'NR==1 {h=$0; next}
{file="CAR_V1_"$1"_"$2"_"date".csv"; print (a[file]++?"": "DETAIL "date"" ORS h ORS) $0 > file}
END{for(file in a) print "EOF " a[file] > file}' HIRE_PURCHASE_testing.csv

To get an idea how to skip some fields:
$ echo "Field1 Field2 Field3 Field4 Field5 Field6" |awk '{print $0}'
Field1 Field2 Field3 Field4 Field5 Field6 #No skipping here. Printed just for comparison
$ echo "Field1 Field2 Field3 Field4 Field5 Field6" |awk '{print substr($0, index($0,$4))}'
Field4 Field5 Field6
$ echo "Field1 Field2 Field3 Field4 Field5 Field6" |awk '{$1=$2=$3="";print}'
Field4 Field5 Field6
#Mind the gaps at the beginning. You can use this if you need to "delete" particular columns.
#Actually you are not really deleting columns, just replacing their contents with an empty value.
$ echo "Field1 Field2 Field3 Field4 Field5 Field6" |awk '{gsub($1FS$2FS$3FS$4,$4);print}'
Field4 Field5 Field6 #Columns 1-2-3-4 have been replaced by column4. No gaps here.
$ echo "Field1 Field2 Field3 Field4 Field5 Field6" |awk '{print $4,$5,$6}'
Field4 Field5 Field6
# If you have a fixed number of fields (e.g. 6 fields), you can simply skip the first three fields and print the last three.
Adapted to your existing code, the line
print (a[file]++?"": "DETAIL "date"" ORS h ORS) $0 > file
If changed to
print (a[file]++?"": "DETAIL "date"" ORS h ORS) substr($0, index($0,$4)) > file
Or to
print (a[file]++?"": "DETAIL "date"" ORS h ORS) $4,$5,$6 > file #if you have a fixed number of columns in the file
should be enough.
I would suggest that next time you need help, you simplify the problem so that it is easier to answer. As posted, it is kind of hard for other users to fully follow you.
Also have a look here to see how all those functions work in detail: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
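Putting the pieces together, a minimal sketch of the whole command might look like the following. It assumes the real file is pipe-delimited (as the -F\| in the question suggests); the sample.csv contents and the strip3 helper are made up for illustration, and the header line is stripped the same way as the detail lines.

```shell
# Hypothetical pipe-delimited sample; the real file layout is an assumption.
printf '%s\n' \
  'Country|Gender|Plate|Name|Account Number|Car Name' \
  'X|F|A|Fara|XXXXXXXXXX|Ixora' \
  'X|M|c|Jiji|XXXXXXXXXX|Proton' > sample.csv

awk -F'|' -v date="$(date +%d%m%Y)" '
# strip3 (hypothetical helper) removes the first three pipe-delimited fields
function strip3(rec) { sub(/^[^|]*\|[^|]*\|[^|]*\|/, "", rec); return rec }
NR == 1 { h = strip3($0); next }            # keep the shortened header
{
  file = "CAR_V1_" $1 "_" $2 "_" date ".csv"
  if (!(a[file]++)) print "DETAIL " date ORS h > file   # first record of this file
  print strip3($0) > file
}
END { for (file in a) print "EOF " a[file] > file }' sample.csv
```

The file name is built from $1 and $2 before anything is stripped, so the split still works on the original columns while only the shortened records are written.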

Related

LINUX: How to AWK to find maximums in every unique line

I have a file with these records
myFile.txt :
firstname lastname note
Ben Quick 12
Fred Free 10
Mark Marked 18
Mark Marked 20
Ben Quick 8
and want to print every unique person with his maximum note
the result will be like this
firstname lastname maxNote
Ben Quick 12
Fred Free 10
Mark Marked 20
so I have this awk script
#!/bin/awk -f
BEGIN {
print "firstname lastname maxNote"
}
NR > 1 {
list[$1," ",$2]++
}
END {
for(l in list)
print l
}
It prints only the unique firstname + lastname pairs.
My question is: how can I sort them by note and print the whole line for each?
Your help is appreciated!
With GNU awk:
awk 'NR==1{print; next}
{
note=$NF # save note from last field
NF-- # remove last field from $0
name=$0
if(a[name]<note){a[name]=note}
}
END{
for(i in a){print i,a[i]}
}' file
Output:
firstname lastname note
Fred Free 10
Ben Quick 12
Mark Marked 20
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
With your shown samples, please try the following. Written and tested with GNU awk; it should work with any awk.
awk '
FNR==1{
print
next
}
match($0,/[^0-9]*/){
val=substr($0,RSTART,RLENGTH)
sub(/[[:space:]]+$/,"",val)
}
{
arr[val]=(arr[val]>$NF?arr[val]:$NF)
}
END{
for(key in arr){
print key,arr[key]
}
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line.
print ##printing current line here.
next ##next will skip all further statements from here.
}
match($0,/[^0-9]*/){ ##Using match to capture everything up to the first digit in the line.
val=substr($0,RSTART,RLENGTH) ##Creating val as the substring of the current line given by RSTART and RLENGTH.
sub(/[[:space:]]+$/,"",val) ##Removing trailing spaces from val.
}
{
arr[val]=(arr[val]>$NF?arr[val]:$NF) ##Storing in arr, indexed by val, the larger of the stored value and the last field of the current line.
}
END{ ##Starting END block of this program from here.
for(key in arr){ ##Traversing the elements of arr now.
print key,arr[key] ##Printing key and its value here.
}
}
' Input_file ##Mentioning Input_file name here.
$ awk 'NR==1 {print; next}
{k=$1 FS $2}
max[k]<$3 {max[k]=$3}
END {for(k in max) print k, max[k]}' file
firstname lastname note
Fred Free 10
Ben Quick 12
Mark Marked 20
another approach
$ sed 1q file; sed 1d file | sort -k1,2 -k3nr | awk '!a[$1,$2]++'
firstname lastname note
Ben Quick 12
Fred Free 10
Mark Marked 20
Using any awk, sort, and cut in any shell on every Unix box:
$ awk '{print (NR>1), $0}' file | sort -k1,1n -k4,4rn | cut -d' ' -f2- | awk '!seen[$1,$2]++'
firstname lastname note
Mark Marked 20
Ben Quick 12
Fred Free 10
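If the output should come out in the order each person first appears (the for-in loops above return keys in an unspecified order), one possible sketch keeps a separate index array. The notes.txt file and the max_note wrapper are hypothetical names used only for this example.

```shell
# Hypothetical copy of the sample data from the question
cat > notes.txt <<'EOF'
firstname lastname note
Ben Quick 12
Fred Free 10
Mark Marked 18
Mark Marked 20
Ben Quick 8
EOF

max_note() {
  awk 'NR == 1 { print; next }
  {
    k = $1 FS $2
    if (!(k in max)) { order[++n] = k; max[k] = $3 }   # remember first-seen order
    else if ($3 + 0 > max[k] + 0) max[k] = $3          # numeric comparison
  }
  END { for (i = 1; i <= n; i++) print order[i], max[order[i]] }' "$1"
}

max_note notes.txt
```

For the sample this prints the header followed by Ben Quick 12, Fred Free 10 and Mark Marked 20, in that order.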

Is there way to extract all the duplicate records based on a particular column?

I'm trying to extract all (only) the duplicate values from a pipe delimited file.
My data file has 800 thousand rows with multiple columns and I'm particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all the duplicate rows from that file.
However, I'm able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and I feed the result into a loop as shown below:
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have a large number of records in the file, it's taking ages to complete. So I'm looking for an easy and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
EDIT2: As per Ed sir's suggestion, I fine-tuned my solution with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
EDIT: Since the OP changed Input_file to be | delimited, I changed the code a bit as follows to deal with the new Input_file (thanks to Ed Morton sir for pointing it out).
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Could you please try the following? It gives output in the same order in which the lines occur in Input_file.
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation for above code:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using awk's match function to match the text up to and including the first space.
val=substr($0,RSTART+RLENGTH) ##Creating variable val as the substring from position RSTART+RLENGTH to the end of the line.
if(!a[val]++){ ##Checking if val has not been seen before (a[val] is unset), then incrementing its count in a.
b[++count]=val ##Storing val in array b under the next value of counter count, to remember first-seen order.
} ##Closing BLOCK for if condition of array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Appending the current line to c[val], separating lines with ORS.
d[val]++ ##Counting in d[val] how many lines share this val.
} ##Closing BLOCK for match here.
END{ ##Starting END BLOCK section for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from i=1 to till value of count here.
if(d[b[i]]>1){ ##Checking if value of array d with index b[i] is greater than 1 then go inside block.
print c[b[i]] ##Printing value of array c whose index is b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.
Another in awk:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or with -k2,5 or something similar? Nope, not anymore, as the delimiter changed from space to pipe.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
Second step:
Use awk as given in other, better, answers.

Unix- using awk to print in table format

I want to create a table that looks something like this, where I display a # in front of every $2 and a title header, but I do not want to store them in my file:
Name ID Gender
John Smith #4567734D Male
Kyle Toh #2437882L Male
Julianne .T #0324153T Female
I got a result like this instead:
Name ID Gender
#
Name ID Gender
John Smith #4567734D Male
Name ID Gender
Kyle Toh #2437882L Male
Name ID Gender
Julianne .T #0324153T Female
By using this command:
awk -F: ' {print "Name:ID:Gender\n"}
{print $1":#"$2":"$3}' data.txt |column -s ":" -t
Print the headers only in the first line:
awk -F: 'NR==1 {print "Name:ID:Gender\n"} {print $1":#"$2":"$3}'
or:
awk -F: 'BEGIN {print "Name:ID:Gender\n"} {print $1":#"$2":"$3}'
Explanation:
An awk program consists of rules of the form:
CONDITION { ACTIONS } [ CONDITION { ACTIONS }, ... ]
... where both CONDITION and ACTIONS are optional. If CONDITION is missing, like:
{ ACTIONS }
... then ACTIONS is applied to every line of input. In the case above we have two rules:
# Only executed if NR==1
NR==1 {print "Name:ID:Gender\n"}
# CONDITION is missing. ACTIONS will be applied to any line of input
{print $1":#"$2":"$3}
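As a runnable sketch of the two rules, here is the BEGIN variant on made-up data. The people.txt file and the make_rows wrapper are hypothetical, and the NF condition (skip empty lines) is an extra guess at where the stray # row in the question might come from, not something the question confirms.

```shell
# Hypothetical colon-separated data in the layout the question implies
printf '%s\n' 'John Smith:4567734D:Male' '' 'Kyle Toh:2437882L:Male' > people.txt

make_rows() {
  awk -F: 'BEGIN { print "Name:ID:Gender" }    # runs once, before any input
           NF    { print $1 ":#" $2 ":" $3 }   # runs for every non-empty line
  ' "$1"
}

make_rows people.txt
```

Piping the result through column -s ":" -t, as in the question, should then align it as a table.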

Adding Double quotes to last value

I have the data in below format.
abc ssg,"-149,684.58","-149,469.05",-215.53
efg sfg,-80.99,-77.46,-3.53
hij sf,"4,341.23","4,131.90",209.33
kilm mm,"2,490,716.13","-180,572.48","9,223.06"
I want to add double quotes to the values at the end of each line that do not already have them, using Perl or Unix tools.
The output should look as below:
abc ssg,"-149,684.58","-149,469.05","-215.53"
efg sfg,-80.99,-77.46,"-3.53"
hij sf,"4,341.23","4,131.90","209.33"
kilm mm,"2,490,716.13","-180,572.48","9,223.06"
This might work for you:
gawk -F ',' -v Q='"' 'BEGIN {OFS=FS} $NF !~ Q {gsub(/.*/,Q $NF Q,$NF); print ; next} 1' INPUTFILE
-F ',' sets the input field separator
-v Q='"' sets the Q variable to ", which helps to avoid some escaping problems
BEGIN {OFS=FS} sets the output field separator to the same value as the input one
$NF !~ Q if the last field does not match Q (i.e. ") then
gsub(/.*/,Q $NF Q,$NF) replaces the last field with its " delimited version
print prints the line
next skips the remaining rule(s) and processes the next line
1 executes the default print action for every other line (where the last field already matches ")
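The same idea can also be sketched without gsub by assigning to $NF directly, which should work in any POSIX awk. Like the command above, it relies on the naive comma split: a quoted last value such as "9,223.06" ends in a piece that still contains a quote, so it is left alone. The amounts.csv file and the quote_last wrapper are hypothetical names for this example.

```shell
# Hypothetical copy of the sample data from the question
cat > amounts.csv <<'EOF'
abc ssg,"-149,684.58","-149,469.05",-215.53
efg sfg,-80.99,-77.46,-3.53
hij sf,"4,341.23","4,131.90",209.33
kilm mm,"2,490,716.13","-180,572.48","9,223.06"
EOF

quote_last() {
  awk -F',' -v Q='"' 'BEGIN { OFS = FS }
    $NF !~ Q { $NF = Q $NF Q }   # wrap the last field only if it has no quote yet
    1                            # print every line, modified or not
  ' "$1"
}

quote_last amounts.csv
```

Assigning to $NF makes awk rebuild the line with OFS, which is why OFS is set to the same comma as FS.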

command to identify distinct field value

Could anyone help me with a command to identify the distinct values in a particular column?
For example, my input looks like this:
Column1 Column2 Column3
a 11 abc
a 22 abc
b 33 edf
c 44 ghi
I require output like this:
Column1
a
b
c
My input file has a header, so I need a command to which we pass Column1 as a parameter.
Run the following command, with an input file:
$ head -1 input.file | awk '{ print $1}'; awk '{ if (NR > 1) print $1 }' input.file | uniq
Column1
a
b
c
Or just:
$ awk '{print $1 }' input.file | uniq
Column1
a
b
c
File distinct.pl:
#!/usr/bin/perl
$_ = <STDIN>;
@F = split;
map $col{$F[$_]}=$_, (0..$#F); # map column names to numbers
while (<STDIN>)
{
@F = split;
$val{$F[$col{$ARGV[0]}]} = undef # implement the set of values
}
$, = "\n";
print $ARGV[0], ""; # output the column parameter
print sort(keys %val), "" # output sorted set of values
Example command: distinct.pl Column2 <input
Note: Non-existent column names yield the values from the first column.
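For an awk-only variant that takes the column name as a parameter, a minimal sketch could look like this. The distinct_col wrapper is a hypothetical name, the input is assumed to be whitespace-separated with the header on the first line, and unlike uniq it does not need equal values to be adjacent.

```shell
# Hypothetical copy of the sample data from the question
cat > input.file <<'EOF'
Column1 Column2 Column3
a 11 abc
a 22 abc
b 33 edf
c 44 ghi
EOF

distinct_col() {   # usage: distinct_col COLUMN_NAME FILE
  awk -v col="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i   # locate the column by its header
      if (!c) { print "no such column: " col > "/dev/stderr"; exit 1 }
      print col
      next
    }
    !seen[$c]++ { print $c }   # first occurrence of each value only
  ' "$2"
}

distinct_col Column1 input.file
```

Here an unknown column name is reported as an error instead of silently falling back to the first column.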

Resources