Duplicates in an unix text file based on multiple fields

I have a requirement to find duplicates based on three columns in a .txt file in unix which is delimited by ,.
Input:
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
Let's say we are finding duplicates based on fields 1, 2 and 5.
Expected output:
a,b,c,d,e,f,gf,h
a,b,df,d,e,fd,g,h
Can anyone help to write a script for this or is there a command already available?
I tried this, but it did not work:
awk -F, '!x[$1,$2,$3]++' file.txt

One way using awk:
awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' a.txt a.txt
This is simple, but it reads the file twice. On the first pass (FNR==NR), it counts how often each combination of key fields occurs. On the second pass, it prints a line if its key was found more than once.
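As a worked check, here is a small sketch that runs the two-pass command on the sample input from the question. It also runs the '!x[...]++' idiom on fields 1, 2 and 5 to show that it keeps the first line per key rather than printing the duplicates. The temporary directory and the variable names (workdir, firsts, dups) are illustrative assumptions, not part of the original answer.

```shell
# Sample data copied from the question; the temporary directory is illustrative.
workdir=$(mktemp -d)
cat > "$workdir/a.txt" <<'EOF'
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
EOF

# '!x[key]++' is true only the first time a key is seen, so it prints one
# line per distinct key (lines 1, 2 and 4 here), not the duplicated lines.
firsts=$(awk -F, '!x[$1,$2,$5]++' "$workdir/a.txt")

# Two passes: count each key first, then print every line whose key count exceeds 1.
dups=$(awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' \
    "$workdir/a.txt" "$workdir/a.txt")

printf '%s\n' "$dups"
rm -r "$workdir"
```

The two-pass form needs a real file (it cannot re-read a pipe), which is the price for keeping the output in the original file order.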
Another way using awk:
awk -F, '{if (x[$1$2$5]) { y[$1$2$5]++; print $0; if (y[$1$2$5] == 1) { print x[$1$2$5] } } x[$1$2$5] = $0}' a.txt
Explanation:
1 awk -F,
2 '{if (x[$1$2$5])
3 { y[$1$2$5]++; print $0;
4 if (y[$1$2$5] == 1)
5 { print x[$1$2$5] }
6 } x[$1$2$5] = $0
7 }'
Line 2: If x already has an entry for $1$2$5, this key was seen before, so do lines 3-5.
Line 3: Increment the count and print the current line, because it is a duplicate.
Line 4: A count of 1 means we are seeing this key for the second time, so the first line with this key still needs to be printed. When we first saw it, we did not know whether it would turn out to be a duplicate. Line 5 prints that saved line.
Line 6: Store the current line against the key so that line 2 can find it next time.
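The single-pass idea can also be sketched with an explicit key separator and a per-key counter. This is a variant under my own assumptions, not the answer's exact code: joining the fields with SUBSEP avoids treating "ab","c" and "a","bc" as the same key, and counting sightings avoids testing the stored line for truthiness. The names count, first, workdir and dups_single are hypothetical.

```shell
# Sample data copied from the question; the temporary directory is illustrative.
workdir=$(mktemp -d)
cat > "$workdir/a.txt" <<'EOF'
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
EOF

dups_single=$(awk -F, '
{
    # Join the key fields with SUBSEP so that distinct field values
    # cannot run together into the same key.
    key = $1 SUBSEP $2 SUBSEP $5

    # First sighting: save the line; we cannot know yet whether it is a duplicate.
    if (++count[key] == 1) { first[key] = $0; next }

    # Second sighting: the saved first line is now known to be a duplicate.
    if (count[key] == 2) print first[key]

    # Every repeated line is a duplicate.
    print
}' "$workdir/a.txt")

printf '%s\n' "$dups_single"
rm -r "$workdir"
```

As with the original one-liner, the first line of each group is printed only when its second occurrence turns up, so the output order can differ from the file order.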
Another way using sort, uniq and awk
Note: the uniq command has a '-f' option to skip the specified number of fields before it starts comparing.
sort -t, -k1,1 -k2,2 -k5,5 a.txt | awk -F, 'BEGIN { OFS = " "} {print $0, $1, $2, $5}' | sed 's/,/ /g' | uniq -f8 -D | sed 's/ /,/g' | cut -d',' -f 1-8
This sorts on fields 1, 2 and 5. awk prints the original line and appends fields 1, 2 and 5. sed changes the delimiter to a space because uniq has no option to specify a delimiter. uniq skips the first 8 fields (the original line, which has 8 fields in the sample input), compares the rest of the line, and prints all duplicate lines (-D is a GNU extension). The final sed and cut restore the commas and drop the appended key fields. This assumes the data itself contains no spaces.

I had a similar issue
I needed to eliminate duplicate detail records while preserving the flat-file record formatting and the sequence of the records.
The duplication was caused by a time expansion of the date field in column 2 of the detail records only.
Receiving system was reporting duplication on columns 4 and 5.
I cobbled together this quick hack to resolve it.
First read the file data into an array
Then we can read and manipulate the individual records (crudely with a counter) as demonstrated in this snippet integrating a case statement to logically treat the various record types.
Cheers!
infile="[input file name]"   # replace with the actual input file name
readarray inrecs < "$infile"
filebase=$(echo "$infile" | cut -d '.' -f1)
i=1
for inrec in "${inrecs[@]}";do
    # Split the current record into its comma-separated fields
    field1=$(echo "${inrecs[$i-1]}" | cut -d',' -f1)
    field2=$(echo "${inrecs[$i-1]}" | cut -d',' -f2)
    field3=$(echo "${inrecs[$i-1]}" | cut -d',' -f3)
    field4=$(echo "${inrecs[$i-1]}" | cut -d',' -f4)
    field5=$(echo "${inrecs[$i-1]}" | cut -d',' -f5)
    field6=$(echo "${inrecs[$i-1]}" | cut -d',' -f6)
    field7=$(echo "${inrecs[$i-1]}" | cut -d',' -f7)
    field8=$(echo "${inrecs[$i-1]}" | cut -d',' -f8)
    case $field1 in
    'H')
        # Header record: start a fresh output file
        echo "$field1,$field2,$field3">"${filebase}.new"
        ;;
    'D')
        # Detail record: write it only once per field4,field5 combination
        dupecount=0
        dupecount=$(zegrep -c -e "${field4},${field5}" "${infile}")
        if [[ "$dupecount" -gt 1 ]];then
            writtencount=0
            writtencount=$(zegrep -c -e "${field4},${field5}" "${filebase}.new")
            if [[ "${writtencount}" -eq 0 ]];then
                echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8,">>"${filebase}.new"
            fi
        else
            echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8,">>"${filebase}.new"
        fi
        ;;
    'T')
        # Trailer record: recount the detail records actually written
        dcount=$(zegrep -c '^D' "${filebase}.new")
        echo "$field1,$field2,$dcount,$field4">>"${filebase}.new"
        ;;
    esac
    ((i++))
done

Related

awk to sort two fields:

Would like to sort Input.csv file based on fields $1 and $5 and generate country wise A-Z order.
While doing sort need to consider country name either from $1 or $5 if any of the fields are blank.
Input.csv
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
,,,,mno,50,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
Desired Output.csv
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,mno,50,DL,ABC~XYZ,Sep
I have tried below command but not getting desired output. Please suggest..
head -1 Input.csv > Output.csv; sort -t, -k1,1 -k5,5 <(tail -n +2 Input.csv) >> Output.csv

awk to the rescue!
$ awk -F, '{print ($1==""?$5:$1) "\t" $0}' file | sort | cut -f2-
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
def,20,02-Aug-13,Aug,,,,,
def,20,02-Jul-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,mno,50,DL,ABC~XYZ,Sep
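This relies on the header starting with an uppercase letter and the data being lowercase, so that the header sorts first (which also depends on the locale's collation order). If that is not a valid assumption, the header needs special handling, as you did above, or better, within awk: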
$ awk -F, 'NR==1{print; next} {print ($1==""?$5:$1) "\t" $0 | "sort | cut -f2-"}' file
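The desired output in the question keeps rows of the same country in their original input order, which a plain sort on the whole line does not do. One possible way to get that, sketched below under my own assumptions, is to prepend the input position as a second sort key. The file location, the variable names (workdir, sorted) and the use of LC_ALL=C are illustrative choices, not part of the original answers.

```shell
# Sample data copied from the question; the temporary directory is illustrative.
workdir=$(mktemp -d)
cat > "$workdir/Input.csv" <<'EOF'
Country,Amt,Des,Details,Country,Amt,Des,Network,Details
abc,10,03-Apr-14,Aug,abc,10,DL,ABC~XYZ,Sep
,,,,mno,50,DL,ABC~XYZ,Sep
abc,10,22-Jan-07,Aug,abc,10,DL,ABC~XYZ,Sep
jkl,40,11-Sep-13,Aug,,,,,
,,,,ghi,30,AL,DEF~PQZ,Sep
abc,10,03-Apr-14,Aug,abc,10,MN,ABC~XYZ,Sep
abc,10,19-Feb-14,Aug,abc,10,MN,ABC~XYZ,Sep
def,20,02-Jul-13,Aug,,,,,
def,20,02-Aug-13,Aug,,,,,
EOF

# The header is passed through unsorted. Each data row gets two tab-separated
# prefixes: the country (field 1, or field 5 when field 1 is empty) and the
# input position. Sorting on country, then numerically on position, keeps
# rows of one country in their original order; cut then removes both prefixes.
sorted=$(
    head -n 1 "$workdir/Input.csv"
    tail -n +2 "$workdir/Input.csv" |
        awk -F, '{ print ($1 == "" ? $5 : $1) "\t" NR "\t" $0 }' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
        cut -f3-
)

printf '%s\n' "$sorted"
rm -r "$workdir"
```

With GNU or BSD sort, a stable sort (sort -s) on the country key alone would probably achieve the same thing; the explicit position key is used here because it does not depend on that option.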

Is this what you want? (Omitted first line)
cat file_containing_your_lines | awk 'NR != 1' | sed "s/,/\t/g" | sort -k 1 -k 5 | sed "s/\t/,/g"

what is the alternate way to count the occurrence of each word without using 'uniq -c' command?

Is it possible to count the occurrence of each word like using uniq -c but with the count after the word rather than before?
Example scenario
Input file named test1.txt, which contains the following data
Renault:cilo:84563
Renault:cilo:84565
M&M:Thar:84566
Tata:nano:84567
M&M:quanto:84568
M&M:quanto:84569
The fields used in the above data are car_company:car_model:customerID
Desired result
cilo 2
Thar 1
nano 1
quanto 2
(car_model and number of cars sold grouped by car_model)
My code
cat test1.txt | cut -d: -f2 | uniq -c
Actual Result
2 cilo
1 Thar
1 nano
2 quanto
Is it possible to do the above process without using uniq -c ,so that I can swap the order of the fields (columns)?

You can use uniq, and simply post-process its output to swap the columns:
cut -d: -f2 test1.txt | uniq -c | awk '{print $2 "\t" $1 "\n" }'
EDIT: Added \n, as noted in a comment.

Save your command's output into a file "badresult":
cat test1.txt | cut -d: -f2 | uniq -c > badresult
Then cut the seventh field and save it into a file named "counts" (use a space (" ") as the separator; this relies on the single-digit count being padded to a fixed width, as GNU uniq -c does):
cut -d" " -f7 badresult > counts
Then cut the eighth field and save it into a file named "models" (again with a space (" ") as the separator):
cut -d" " -f8 badresult > models
Now you have your counts and models in separate files. All that remains is to show these two files side by side with the "pr" command (-m: one file per column, -T: no page headers or trailers):
pr -m -T models counts

Using awk:
cat test1.txt | cut -d: -f2 | uniq -c | awk '{ t = $1; $1 = $2; $2 = t; print }'
The little awk code exchanges fields 1 and 2 using a temporary.

You just need awk for this:
$ awk -F: '{a[$2]++} END {for (i in a) print i, a[i]}' file
cilo 2
quanto 2
nano 1
Thar 1
This goes through every line keeping track of how many times the second field has appeared. Since everything is stored in the array a, then it is just a matter of looping through it and printing its content.
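Note that the order in which "for (i in a)" visits the array is unspecified in awk, which is why the output above is not in input order. If the order of first appearance matters, one option is to record each new model in a second, numerically indexed array. The sketch below assumes the sample data from the question; the names count, order, workdir and counts are hypothetical.

```shell
# Sample data copied from the question; the temporary directory is illustrative.
workdir=$(mktemp -d)
cat > "$workdir/test1.txt" <<'EOF'
Renault:cilo:84563
Renault:cilo:84565
M&M:Thar:84566
Tata:nano:84567
M&M:quanto:84568
M&M:quanto:84569
EOF

counts=$(awk -F: '
# First sighting of a model: remember its position in the input.
!($2 in count) { order[++n] = $2 }

# Count every occurrence of the model.
{ count[$2]++ }

# Print the models in the order they first appeared, count last.
END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
' "$workdir/test1.txt")

printf '%s\n' "$counts"
rm -r "$workdir"
```

Unlike uniq -c, this does not need equal values to be adjacent, so no prior sort is required.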

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This is without doubt showing me what values in column 1 have been duplicated, and also displaying the duplicated values of column 1 along with the respective column 2 values. But since I am hardcoding the number of bytes to 2, it displays the duplicated values only for the 2 digit numbers in column one. Is there a way to do this using awk ?
Thanks in advance.

See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D

another awk solution without arrays (but with presort)
sort -n file | awk -F- '
NR==1{p=$1; a=$0; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'

This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
    for (i in arr) {
        if ((n = split(arr[i], a)) < 3) continue
        for (j = 2; j <= n; ++j)
            print i"-"a[j]
    }
}
It collects all numbers along with the different strings attached
in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.

I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'

Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' @t=@{$kv{$F[0]}}; push(@t,$_); $kv{$F[0]}=[@t]; END { while(($x,$y)=each(%kv)){ print join("\n",@{$y}) if scalar @{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

How to interleave lines from two text files

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?

paste -d '\n' file1 file2

Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is of greater than or equal length to file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
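If file2 may be the longer file, the commands above silently drop its extra lines. A possible extension, sketched here under my own assumptions, reads file2 into a separate variable and drains whatever is left in the END block. Comparing the getline result with "> 0" also avoids treating a read error (-1) as success. The file contents and the names f2, line, workdir and interleaved are illustrative.

```shell
# Two small illustrative files; file2 is deliberately one line longer.
workdir=$(mktemp -d)
printf '%s\n' line1.1 line1.2 > "$workdir/file1"
printf '%s\n' line2.1 line2.2 line2.3 > "$workdir/file2"

interleaved=$(awk -v f2="$workdir/file2" '
{
    print                                       # current line of file1
    # getline returns 1 on success, 0 at end of file and -1 on error,
    # so only a result greater than 0 means a line was actually read.
    if ((getline line < f2) > 0) print line
}
END {
    # file1 is exhausted: print any lines still left in file2.
    while ((getline line < f2) > 0) print line
}' "$workdir/file1")

printf '%s\n' "$interleaved"
rm -r "$workdir"
```

Reading into the variable line instead of $0 leaves the current record of file1 untouched, which matters if more processing is added after the getline call.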

@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-

With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3

Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.

cat file1 file2 |sort -t. -k 2.1
Here it is specified that the separator is "." and that we are sorting on the first character of the second field.

Advanced grep unix

Usually the grep command is used to display the lines containing the specified pattern. Is there any way to display n lines before and after the line which contains the specified pattern?
Can this be achieved using awk?

Yes, use
grep -B num1 -A num2
to include num1 lines of context before the match, and num2 lines of context after the match.
EDIT:
Seems the OP is using AIX. This has a different set of options which doesn't include -B and -A
this link describes grep on AIX 4.3 (it doesn't look promising)
Matt's perl script might be a better solution.

Here is what I usually do on AIX:
before=2   # the number of lines to be shown before the match
after=2    # the number of lines to be shown after the match
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk "NR<=%+$after && NR>=%-$before" <filename>
If you do not want the extra 2 variables, you can always write it as a one-liner:
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+<<after>> && NR>=%-<<before>>' <filename>
Suppose I have the pattern 'stack' and the filename is flow.txt.
I want 2 lines before and 3 lines after. Then the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%-2' flow.txt
I want only the 2 lines before. Then the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=% && NR>=%-2' flow.txt
I want only the 3 lines after. Then the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%' flow.txt
Multiple files: for this, switch from grep to a pure awk script. It uses the pattern 'stack' on the files flow.*, and the number of context lines is set in the BEGIN block (1 line before and 3 lines after, as written):
awk 'BEGIN {
    before=1; after=3; pattern="stack";
    i=0; hold[before]=""; afterprints=0}
{
    #Print the lines from the previous Match
    if (afterprints > 0)
    {
        print FILENAME ":" FNR ":" $0
        afterprints-- #keep a track of the lines to print after - this can be reset if a match is found
        if (afterprints == 0) print "---"
    }
    #Look for the pattern in current line
    if ( match($0, pattern) > 0 )
    {
        # print the lines in the hold round robin buffer from the current line to line-1
        # if (before >0) => user wants lines before avoid divide by 0 in %
        # and afterprints => 0 - we have not printed the line already
        for(j=i; j < i+before && before > 0 && afterprints == 0 ; j++)
            print hold[j%before]
        if (afterprints == 0) # print the line if we have not printed the line already
            print FILENAME ":" FNR ":" $0
        afterprints=after
    }
    if (before > 0) # Store the lines in the round robin hold buffer
    {   hold[i]=FILENAME ":" FNR ":" $0
        i=(i+1)%before }
}' flow.*

From the tags, it's likely that the system has a grep that may not support providing context (Solaris is one system that doesn't and I can't remember about AIX). If that is the case, there's a perl script that may help at http://www.sun.com/bigadmin/jsp/descFile.jsp?url=descAll/cgrep__context_grep.

If you have sed you could use this shell script
BEFORE=2
AFTER=3
FILE=file.txt
PATTERN=pattern
for i in $(grep -n $PATTERN $FILE | sed -e 's/\:.*//')
do head -n $(($AFTER+$i)) $FILE | tail -n $(($AFTER+$BEFORE+1))
done
Here grep -n prefixes each match with its line number, and sed strips everything except that number. head then outputs the file up to the matching line plus $AFTER further lines, and tail keeps only the last $BEFORE + $AFTER + 1 of those (the matching line plus the lines before and after it).
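The head/tail arithmetic shifts the window when a match sits within $BEFORE lines of the start of the file or within $AFTER lines of its end, because tail always keeps a fixed number of lines. One possible alternative, sketched below under my own assumptions, computes an explicit line range and lets sed print it. The sample file and the names workdir, start and context are illustrative.

```shell
# Small illustrative file with a single match on line 4.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 pattern 5 6 7 8 > "$workdir/file.txt"

BEFORE=2
AFTER=3
FILE=$workdir/file.txt
PATTERN=pattern

context=$(
    # grep -n prints "lineno:text"; cut keeps only the line number.
    grep -n -- "$PATTERN" "$FILE" | cut -d: -f1 |
    while read -r n; do
        # Clamp the start of the range so it never falls below line 1.
        start=$((n - BEFORE))
        if [ "$start" -lt 1 ]; then start=1; fi
        # sed simply stops at end of file if the range runs past it.
        sed -n "${start},$((n + AFTER))p" "$FILE"
    done
)

printf '%s\n' "$context"
rm -r "$workdir"
```

Like the original script, this rereads the file once per match and prints overlapping ranges twice when matches are close together.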

Sure there is (from the grep man page):
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
and if you want the same amount of lines before AND after the match, use:
-C NUM, -NUM, --context=NUM
Print NUM lines of output context. Places a line containing a
group separator (--) between contiguous groups of matches. With
the -o or --only-matching option, this has no effect and a
warning is given.

you can use awk
awk 'BEGIN{t=4}
c--&&c>=0
/pattern/{ c=t; for(i=NR;i<NR+t;i++)print a[i%t] }
{ a[NR%t]=$0}
' file
output
$ more file
1
2
3
4
5
pattern
6
7
8
9
10
11
$ ./shell.sh
2
3
4
5
6
7
8
9
