compare two fields from two different files using awk - unix

I have two files where I want to compare certain fields and produce the output
I have a variable as well
echo ${CURR_SNAP}
123
File1
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|RSCNAME1
DOMAIN2|USER2|LE2|ORG2|ACCES2|RSCTYPE2|RSCNAME2
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|RSCNAME3
DOMAIN4|USER4|LE4|ORG4|ACCES4|RSCTYPE4|RSCNAME4
File2
ORG1|PRGPATH1
ORG3|PRGPATH3
ORG5|PRGPATH5
ORG6|PRGPATH6
ORG7|PRGPATH7
The output I am expecting as below where the last column is CURR_SNAP value and the matching will be 4th column of File1 should be matched with 1st column of File2
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
I tried with the below code piece but looks like I am not doing it correctly
awk -v CURRSNAP="${CURR_SNAP}" '{FS="|"} NR==FNR {x[$0];next} {if(x[$1]==$4) print $1"|"$2"|"$3"|"$4"|"$5"|"$6"|"CURRSNAP}' File2 File1

With awk:
#! /bin/bash
CURR_SNAP="123"
awk -F'|' -v OFS='|' -v curr_snap="$CURR_SNAP" '{
if (FNR == NR)
{
# this stores the ORG* as an index
# here you can store other values if needed
orgs_arr[$1]=1
}
else if (orgs_arr[$4] == 1)
{
# overwrite $7 to contain CURR_SNAP value
$7=curr_snap
print
}
}' file2 file1
As in your expected output, you didn't output RSCNAME*, so I have overwritten $7(which is column for RSCNAME*) with $CURR_SNAP. If you want to display RSCNAME* column aswell, remove $7=curr_snap and change print statement to print $0, curr_snap.

I wouldn't use awk at all. This is what join(1) is meant for (Plus sed to append the extra column:
$ join -14 -21 -t'|' -o 1.1,1.2,1.3,1.4,1.5,1.6 File1 File2 | sed "s/$/|${CURR_SNAP}/"
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
It does require that the files be sorted based on the common field, like your examples are.

You can do this with awk with two-rules. For the first file (where NR==FNR), simply use string concatenation to append the fields 1 - (NF-1) assigning the concatenated result to an array indexed by $4. Then for the second file (where NR>FNR) in rule two test if array[$1] has content and if so, output the array and append "|"CURR_SNAP (with CURR_SNAP shortened to c in the example below and array being a), e.g.
CURR_SNAP=123
awk -F'|' -v c="$CURR_SNAP" '
NR==FNR {
for (i=1;i<NF;i++)
a[$4]=i>1?a[$4]"|"$i:a[$4]$1
}
NR>FNR {
if(a[$1])
print a[$1]"|"c
}
' file1 file2
Example Use/Output
After setting the filenames to match yours, you can simply copy/middle-mouse-paste in your console to test, e.g.
$ awk -F'|' -v c="$CURR_SNAP" '
> NR==FNR {
> for (i=1;i<NF;i++)
> a[$4]=i>1?a[$4]"|"$i:a[$4]$1
> }
> NR>FNR {
> if(a[$1])
> print a[$1]"|"c
> }
> ' file1 file2
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
Look things over and let me know if you have further questions.

Related

if rows are otherwise-identical keep the one with higher value in one field

I have a file that looks like this:
cat f1.csv:
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,4490
AK161599,Gm15417,6915
AK161599,Zbtb7b,1339
AK161599,Zbtb7b,1475
AK161599,Zbtb7b,1514
What I want to do is to keep one of the otherwise-duplicated rows if they have a greater number on col3. So if the col1 and col2 are the same then keep the row if has the greater number on the col3.
So the desired output should be:
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
I used the command below but it does not solve the problem:
cat f1.csv | sort -rnk3 | awk '!x[$3]++'
Any help is appreciated - thanks!
with your shown samples, please try following.
awk '
BEGIN{
FS=OFS=","
}
{ ind = $1 FS $2 }
FNR==1{
print
next
}
{
arr[ind]=(arr[ind]>$NF?arr[ind]:$NF)
}
END{
for(i in arr){
print i,arr[i]
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS, OFS as comma here.
}
{ ind = $1 FS $2 } ##Setting ind as 1st and 2nd field value here.
FNR==1{ ##Checking if its first line.
print ##Then print it.
next ##next will skip all further statements from here.
}
{
arr[ind]=(arr[ind]>$NF?arr[ind]:$NF) ##Creating arr with index of ind and keeping only higher value after each line comparison of last field.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Starting a for loop here.
print i,arr[i] ##Printing index and array arr value here.
}
}
' Input_file ##Mentioning Input_file name here.
$ head -n 1 f1.csv; { tail -n +2 f1.csv | sort -t, -k1,2 -k3rn | awk -F, '!seen[$1,$2]++'; }
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
or to avoid naming the input file twice (e.g. so it'll work if the input is a pipe):
$ awk '{print (NR>1) "," $0}' f1.csv | sort -t, -k1,1n -k1,2 -k3rn | cut -d',' -f2- | awk -F, '!seen[$1,$2]++'
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,4490
AK161599,Zbtb7b,1339
The answers provided seem a little complicated to me. Here's an answer all in awk:
#! /usr/bin/awk -f
NR == 1 {
heading = $0
next
}
{
key = $1 "," $2
if( values[key] < $3 ) {
values[key] = $3
}
}
END {
print heading
for( k in values ) {
print k "," values[k] | "sort -t, -k1,2"
}
}
$ ./max.awk -F, max.dat
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
Using sort, you need
sort -t, -k3,3nr file.csv | sort -t, -su -k1,2
The first sort sorts the input numerically by the 3rd column in the descending order. The second sort is stable -s (not all sort implementations support that) and uniques the output by the first two columns, thus leaving the maximum for each combination.
I ignored the header line.

Is there way to extract all the duplicate records based on a particular column?

I'm trying to extract all (only) the duplicate values from a pipe delimited file.
My data file has 800 thousands rows with multiple columns and I'm particularly interested about column 3. So I need to get the duplicate values of column 3 and extract all the duplicate rows from that file.
I'm, however able to achieve this as shown below..
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and I take the above in loop as shown below..
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have large number of records in the file, it's taking ages to complete. So I'm looking for an easy and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
EDIT2: As per Ed sir's suggestion fine tuned my suggestion with more meaningful names(IMO) of arrays.
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
EDIT: Since OP changed Input_file with |delimited so changing code a bit to as follows, which deals with new Input_file(Thanks to Ed Morton sir for pointing it out).
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Could you please try following, following will give output in same sequence of in which lines are occurring in Input_file.
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation for above code:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using match function of awk which matches regex till first space is coming.
val=substr($0,RSTART+RLENGTH) ##Creating variable val whose value is sub-string is from starting point of RSTART+RLENGTH value to till end of line.
if(!a[val]++){ ##Checking condition if value of array a with index val is NULL then go further and increase its index too.
b[++count]=val ##Creating array b whose index is increment value of variable count and value is val variable.
} ##Closing BLOCK for if condition of array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating array named c whose index is variable val and value is $0 along with keep concatenating its own value each time it comes here.
d[val]++ ##Creating array named d whose index is variable val and its value is keep increasing with 1 each time cursor comes here.
} ##Closing BLOCK for match here.
END{ ##Starting END BLOCK section for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from i=1 to till value of count here.
if(d[b[i]]>1){ ##Checking if value of array d with index b[i] is greater than 1 then go inside block.
print c[b[i]] ##Printing value of array c whose index is b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.
Another in awk:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or -k2,5 or smth. Nope, as the delimiter changed from space to pipe.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
Second step:
Use awk as given in other, better, answers.

How to split and replace strings in columns using awk

I have a tab-delim text file with only 4 columns as shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:2:d:c:a:FAIL
If the string "FAIL" is found in a specific column starting from column2 to columnN (all the strings are separated by ":") then it would need to replace the second element in that column to "-1". Sample output is shown below:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
Any help using awk?
With any awk:
$ awk 'BEGIN{FS=OFS="\t"} {for (i=2;i<=NF;i++) if ($i~/:FAIL$/) sub(/:[^:]+/,":-1",$i)} 1' file
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
In order to split in awk you can use "split".
An example of it would be the following:
split(1,2,"3");
1 is the string you want to split
2 is the array you want to split it into
and 3 is the character that you want to be split on
e.g
string="hello:world"
result=`echo $string | awk '{ split($1,ARR,":"); printf("%s ",ARR[1]);}'`
In this case the result would be equal to hello, because we split the string to the " : " character and we printed the first half of the ARR, if we would print the second half (so printf("%s ",ARR[2])) of the ARR then it would be returned to result the "world".
With gawk:
awk '{$0=gensub(/[^:]*(:[^:]*:[^:]*:[^:]:FAIL)/,"-1\\1", "g" , $0)};1' File
with sed:
sed 's/[^:]*\(:[^:]*:[^:]*:[^:]:FAIL\)/-1\1/g' File
If you are using GNU awk, you can take advantage of the RT feature1 and split the records at tabs and newlines:
awk '$NF == "FAIL" { $2 = "-1"; } { printf "%s", $0 RT }' RS='[\t\n]' FS=':' infile
Output:
GT:CN:CNL:CNP:CNQ:FT .:2:a:b:c:PASS .:2:c:b:a:PASS .:-1:d:c:a:FAIL
1 The record separator that follows the current record.
Your requirements are somewhat vague, but I'm pretty sure this does what you want with bog standard awk (no gnu-awk extensions):
awk '/FAIL/{$2=-1}1' ORS=\\t RS=\\t FS=: OFS=: input

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This is without doubt showing me what values in column 1 have been duplicated, and also displaying the duplicated values of column 1 along with the respective column 2 values. But since I am hardcoding the number of bytes to 2, it displays the duplicated values only for the 2 digit numbers in column one. Is there a way to do this using awk ?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
another awk solution without arrays (but with presort)
sort -n file | awk -F- '
NR==1{p=$1; a=$0; c++; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
for (i in arr) {
if ((n = split(arr[i], a)) < 3) continue
for (j = 2; j <= n; ++j)
print i"-"a[j]
}
}
It collects all numbers along with the different strings attached
in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' #t=#{$kv{$F[0]}}; push(#t,$_); $kv{$F[0]}=[#t]; END { while(($x,$y)=each(%kv)){ print join("\n",#{$y}) if scalar #{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

Remove all lines from file with duplicate value in field, including the first occurrence

I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.
I've sorted by the value in column 2, but can't figure out how to use uniq for just the values in one field as the values are not necessarily of the same length.
Alternately, I can remove lines with the duplicate using an awk one-liner like
awk -F"[,]" '!_[$2]++'
but this retains the line with the first incidence of the repeated value in col 2.
As an example, if my data is
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
I would like to remove ALL lines (including the first) where b occurs in the second column.
Like this:
d,e,f
h,i,j
Thanks for any advice!!
If the order is not important then the following should work:
awk -F, '
!seen[$2]++ {
line[$2] = $0
}
END {
for(val in seen)
if(seen[val]==1)
print line[val]
}' file
Output
h,i,j
d,e,f
Solution with grep:
grep -v -E '\b,b,\b' text.txt
Content of the file:
$ cat text.txt
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f
$ grep -v -E '\b,b,\b' text.txt
d,e,f
h,i,j
a,n,b
b,c,f
Hope it helps
Some different awk:
awk -F, '
BEGIN {f=0}
FNR==NR {_[$2]++;next}
f==0 {
f=1
for(j in _)if(_[j]>1)delete _[j]
}
$2 in _
' file file
Explanation
The awk passes through the file twice - that's why it appears twice at the end. On the first pass (when FNR==NR) I count the number of times each column 2 appears in array _[]. At the end of the first pass, I then delete all elements of _[] where that element has been seen more than once. Then, on the second pass, I print lines whose second field appears in _[].

Resources