Finding unique items in two rows in awk - unix

The following script gives me the number of unique elements in the 4th field.
awk -F'\t' '$7 ~ /ECK/ {print $4}' filename.txt | sort | uniq | wc -l
Similarly, I can find the unique elements in the 2nd field. But how do I count the unique items that are in the 4th field but not in the 2nd field? In other words, the unique elements of the 4th field that do not appear in the 2nd field.

You can do it all in awk:
awk -F'\t' '
{
    field_2[$2] = 1
    field_4[$4] = 1
}
END {
    for (item in field_4) {
        if (!(item in field_2))
            print item
    }
}
' filename.txt
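To get a count rather than a list, and to keep the question's $7 ~ /ECK/ filter, the same idea can be sketched as below. The function name count_only_in_4 and the sample rows are made up for illustration, and the tab-separated layout is assumed from the question.

```shell
# count_only_in_4 is a hypothetical helper name. Input is assumed to be
# tab-separated, with the filter column in field 7 as in the question.
count_only_in_4() {
    awk -F'\t' '
        $7 ~ /ECK/ { field_2[$2] = 1; field_4[$4] = 1 }
        END {
            n = 0
            for (item in field_4)
                if (!(item in field_2))
                    n++
            print n
        }
    ' "$@"
}

# Made-up sample rows; the last one is skipped because field 7 lacks "ECK".
printf '1\tx\t3\ta\t5\t6\tECK1\n1\ta\t3\tb\t5\t6\tECK2\n1\ty\t3\tc\t5\t6\tECK3\n1\tz\t3\td\t5\t6\tOTHER\n' |
    count_only_in_4
```

The function reads standard input when no file is given, so it can also be called as count_only_in_4 filename.txt.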

This uses Bash (or ksh or zsh) process substitution; if your shell doesn't support that, you can sort into temporary files instead. Both sorts use the tab delimiter and a single-field key so that their order matches what join expects.
join -t $'\t' -1 4 -2 2 -v 1 -o 1.4 <(sort -t $'\t' -k4,4 inputfile) <(sort -t $'\t' -k2,2 inputfile) | sort -u | wc -l
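For a shell without process substitution, the temporary-file variant might look like the sketch below. It swaps join for comm, which is a different technique from the one-liner above; the helper name count_4_not_2 is hypothetical, and the input is assumed to be tab-separated.

```shell
# count_4_not_2 FILE: count the distinct values of field 4 that never
# occur in field 2. Hypothetical helper; FILE is assumed tab-separated.
count_4_not_2() {
    only4=$(mktemp)
    only2=$(mktemp)
    cut -f4 "$1" | sort -u > "$only4"    # distinct values of field 4
    cut -f2 "$1" | sort -u > "$only2"    # distinct values of field 2
    comm -23 "$only4" "$only2" | wc -l   # lines found only in the first list
    rm -f "$only4" "$only2"
}
```

comm -23 suppresses the lines unique to the second list and the lines common to both, leaving only the values that appear in field 4 alone.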

Related

what is the alternate way to count the occurrence of each word without using 'uniq -c' command?

Is it possible to count the occurrence of each word like using uniq -c but with the count after the word rather than before?
Example scenario
The input file, test1.txt, contains the following data:
Renault:cilo:84563
Renault:cilo:84565
M&M:Thar:84566
Tata:nano:84567
M&M:quanto:84568
M&M:quanto:84569
The fields used in the above data are car_company:car_model:customerID
Desired result
cilo 2
Thar 1
nano 1
quanto 2
(car_model and number of cars sold grouped by car_model)
My code
cat test1.txt | cut -d: -f2 | uniq -c
Actual Result
2 cilo
1 Thar
1 nano
2 quanto
Is it possible to do the above without using uniq -c, so that I can swap the order of the fields (columns)?
You can use uniq, and simply post-process its output to swap the columns:
cut -d: -f2 test1.txt | uniq -c | awk '{printf "%s\t%s\n", $2, $1}'
EDIT: Added \n, as noted in a comment.
Save your command's output into a file named "badresult":
cat test1.txt | cut -d: -f2 | uniq -c > badresult
Then cut the seventh field and save it into a file named "counts", using a space (" ") as the separator. The field numbers here assume that uniq pads the count to a width of seven characters, as GNU uniq does, and that every count is a single digit:
cut -d" " -f7 badresult > counts
Then cut the eighth field and save it into a file named "models", again using a space as the separator:
cut -d" " -f8 badresult > models
Now you have your counts and models in separate files. All that remains is to show the two files side by side with the pr command (-m: one file per column, -T: no header or pagination):
pr -m -T models counts
Using awk:
cat test1.txt | cut -d: -f2 | uniq -c | awk '{ t = $1; $1 = $2; $2 = t; print }'
The little awk program exchanges fields 1 and 2 using a temporary variable.
You just need awk for this:
$ awk -F: '{a[$2]++} END {for (i in a) print i, a[i]}' file
cilo 2
quanto 2
nano 1
Thar 1
This goes through every line keeping track of how many times the second field has appeared. Since everything is stored in the array a, then it is just a matter of looping through it and printing its content.
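The order of a for (i in a) loop is unspecified in awk, which is why the output above is not in input order. If the first-seen order matters, one possible sketch records each model the first time it appears; the helper name count_models and the order array are illustrative, not part of the answer above.

```shell
# count_models is a hypothetical helper name; fields are colon-separated
# as in the question (car_company:car_model:customerID).
count_models() {
    awk -F: '
        !($2 in count) { order[++n] = $2 }   # remember first appearance
        { count[$2]++ }
        END {
            for (i = 1; i <= n; i++)
                print order[i], count[order[i]]
        }
    ' "$@"
}

printf 'Renault:cilo:84563\nRenault:cilo:84565\nM&M:Thar:84566\nTata:nano:84567\nM&M:quanto:84568\nM&M:quanto:84569\n' |
    count_models
```

Unlike uniq -c, this does not need equal values to be adjacent, so no sort is required.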

Duplicates in an unix text file based on multiple fields

I have a requirement to find duplicates based on three columns in a .txt file in unix which is delimited by ,.
Input:
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
Let's say we are finding duplicates based on fields 1, 2 and 5.
Expected output:
a,b,c,d,e,f,gf,h
a,b,df,d,e,fd,g,h
Can anyone help to write a script for this or is there a command already available?
I tried this, but it did not work:
awk -F, '!x[$1,$2,$3]++' file.txt
One way using awk:
awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' a.txt a.txt
This is simple, but it reads the file twice. On the first pass (FNR==NR), it keeps a count for each combination of key fields. On the second pass, it prints a line if its key was seen more than once.
Another way using awk:
awk -F, '{if (x[$1$2$5]) { y[$1$2$5]++; print $0; if (y[$1$2$5] == 1) { print x[$1$2$5] } } x[$1$2$5] = $0}' a.txt
Explanation:
1 awk -F,
2 '{if (x[$1$2$5])
3 { y[$1$2$5]++; print $0;
4 if (y[$1$2$5] == 1)
5 { print x[$1$2$5] }
6 } x[$1$2$5] = $0
7 }'
Line 2: If x has $1$2$5, this key was seen before, do steps 3-5
Line 3: Increment the count and print the line because it is a dup
Line 4: This means, We are seeing this key for the 2nd time, so we need to print the first line with this key. Last time we saw this key we did not know whether it was a dup or not. So we print the first line in step 5.
Line 6: Store the current line against the key so we can use it in step 2
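Concatenating the fields directly, as in $1$2$5, can make different rows collide (for example "a","bc" and "ab","c" give the same key). A possible variation on the same single-pass idea joins the fields with awk's SUBSEP instead. The helper name dups_by_key and the array names first and shown are made up for this sketch.

```shell
# dups_by_key is a hypothetical helper name; input is comma-separated and
# the key is fields 1, 2 and 5, as in the question.
dups_by_key() {
    awk -F, '
        {
            key = $1 SUBSEP $2 SUBSEP $5
            if (key in first) {
                # Second or later occurrence: emit the stored first line once.
                if (!(key in shown)) {
                    print first[key]
                    shown[key] = 1
                }
                print
            } else {
                first[key] = $0
            }
        }
    ' "$@"
}

printf 'a,b,c,d,e,f,gf,h\na,bd,cg,dd,ey,f,g,h\na,b,df,d,e,fd,g,h\na,b,ck,d,eg,f,g,h\n' |
    dups_by_key
```

This keeps only the first line of each key in memory, at the cost of printing each group's first line late, when its second occurrence is found.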
Another way using sort, uniq and awk
Note: uniq command has an option '-f' to skip the specified number of fields before it starts comparison.
sort -t, -k1,1 -k2,2 -k5,5 a.txt | awk -F, 'BEGIN { OFS = " "} {print $0, $1, $2, $5}' | sed 's/,/ /g' | uniq -f8 -D | sed 's/ /,/g' | cut -d',' -f 1-8
This sorts on fields 1, 2 and 5. awk prints the original line and appends fields 1, 2 and 5. sed changes the delimiter because uniq has no option to specify one. uniq skips the first 8 fields (the original line, which has 8 fields in the sample input), compares the rest of the line, and prints the duplicate lines. The final sed and cut restore the commas and drop the appended key.
I had a similar issue.
I needed to eliminate duplicate detail records while preserving the flat-file record formatting and the sequence of the records.
The duplication was caused by a time expansion of the date field in column 2 of the detail records only.
The receiving system was reporting duplication on columns 4 and 5.
I cobbled together this quick hack to resolve it.
First, read the file data into an array.
Then read and manipulate the individual records (crudely, with a counter), as in this snippet, which uses a case statement to treat the various record types differently.
Cheers!
infile="[input file name]"
readarray -t inrecs < "$infile"
filebase=$(echo "$infile" | cut -d '.' -f1)
i=1
for inrec in "${inrecs[@]}"; do
    # Split the current record into its comma-separated fields
    field1=$(echo "${inrecs[$i-1]}" | cut -d',' -f1)
    field2=$(echo "${inrecs[$i-1]}" | cut -d',' -f2)
    field3=$(echo "${inrecs[$i-1]}" | cut -d',' -f3)
    field4=$(echo "${inrecs[$i-1]}" | cut -d',' -f4)
    field5=$(echo "${inrecs[$i-1]}" | cut -d',' -f5)
    field6=$(echo "${inrecs[$i-1]}" | cut -d',' -f6)
    field7=$(echo "${inrecs[$i-1]}" | cut -d',' -f7)
    field8=$(echo "${inrecs[$i-1]}" | cut -d',' -f8)
    case $field1 in
    'H')
        # Header record: start the new file
        echo "$field1,$field2,$field3" > "${filebase}.new"
        ;;
    'D')
        # Detail record: write it only once per (field4, field5) pair
        dupecount=0
        dupecount=`zegrep -c -e "${field4},${field5}" "${infile}"`
        if [[ "$dupecount" -gt 1 ]]; then
            writtencount=0
            writtencount=`zegrep -c -e "${field4},${field5}" "${filebase}.new"`
            if [[ "${writtencount}" -eq 0 ]]; then
                echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8," >> "${filebase}.new"
            fi
        else
            echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8," >> "${filebase}.new"
        fi
        ;;
    'T')
        # Trailer record: recount the detail records actually written
        dcount=`zegrep -c '^D' "${filebase}.new"`
        echo "$field1,$field2,$dcount,$field4" >> "${filebase}.new"
        ;;
    esac
    ((i++))
done

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This is without doubt showing me what values in column 1 have been duplicated, and also displaying the duplicated values of column 1 along with the respective column 2 values. But since I am hardcoding the number of bytes to 2, it displays the duplicated values only for the 2 digit numbers in column one. Is there a way to do this using awk ?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
Another awk solution without arrays (but with a presort):
sort -n file | awk -F- '
NR==1{p=$1; a=$0; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
for (i in arr) {
if ((n = split(arr[i], a)) < 3) continue
for (j = 2; j <= n; ++j)
print i"-"a[j]
}
}
It collects all numbers along with the different strings attached
in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.
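Without gawk, a portable way to avoid the concatenation and splitting is to index a plain array with two subscripts, which awk joins internally with SUBSEP. The sketch below is one such variation; the helper name print_dups and the arrays n, line and order are made up for illustration.

```shell
# print_dups is a hypothetical helper name; each input line is assumed to
# look like NUMBER-STRING, as in the question.
print_dups() {
    awk -F- '
        {
            n[$1]++                           # occurrences of this number so far
            line[$1, n[$1]] = $0              # keep every line, per number
            if (n[$1] == 1) order[++k] = $1   # remember first-seen order
        }
        END {
            for (i = 1; i <= k; i++) {
                key = order[i]
                if (n[key] > 1)
                    for (j = 1; j <= n[key]; j++)
                        print line[key, j]
            }
        }
    ' "$@"
}
```

Groups come out in the order their number first appears in the input, and the lines within a group keep their input order, so the strings may safely contain dashes.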
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' @t=@{$kv{$F[0]}}; push(@t,$_); $kv{$F[0]}=[@t]; END { while(($x,$y)=each(%kv)){ print join("\n",@{$y}) if scalar @{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

Unix Command for counting number of words which contains letter combination (with repeats and letters in between)

How would you count the number of words in a text file which contains all of the letters a, b, and c. These letters may occur more than once in the word and the word may contain other letters as well. (For example, "cabby" should be counted.)
Using sample input which should return 2:
abc abb cabby
I tried both:
grep -E "[abc]" test.txt | wc -l
grep 'abcdef' testCount.txt | wc -l
both of which return 1 instead of 2.
Thanks in advance!
You can use awk and the return value of the sub function. sub returns the number of substitutions it made, which is 1 if the pattern was found and 0 otherwise.
$ echo "abc abb cabby" |
awk '{
for(i=1;i<=NF;i++)
if(sub(/a/,"",$i)>0 && sub(/b/,"",$i)>0 && sub(/c/,"",$i)>0) {
count+=1
}
}
END{print count}'
2
We require the return value to be greater than 0 for all three letters. The for loop iterates over every word of every line and increments the counter when all three letters are found in the word.
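If you would rather not modify the fields while testing them, a similar count can be sketched with awk's index function, which returns 0 when the substring is absent. The helper name count_abc_words is hypothetical.

```shell
# count_abc_words is a hypothetical helper name; words are assumed to be
# separated by whitespace.
count_abc_words() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if (index($i, "a") && index($i, "b") && index($i, "c"))
                    count++
        }
        END { print count + 0 }   # "+ 0" prints 0 instead of an empty line
    ' "$@"
}

echo "abc abb cabby" | count_abc_words
```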
I don't think you can get around using multiple invocations of grep. Thus I would go with (GNU grep):
<file grep -oE '\w+' | grep a | grep b | grep c
Output:
abc
cabby
The first grep puts each word on a line of its own.
Try this, it will work
sed 's/ /\n/g' test.txt |grep a |grep b|grep c
$ cat test.txt
abc abb cabby
$ sed 's/ /\n/g' test.txt |grep a |grep b|grep c
abc
cabby
hope this helps..

Advanced grep unix

Usually the grep command is used to display the lines containing the specified pattern. Is there any way to display n lines before and after the line that contains the pattern?
Can this be achieved using awk?
Yes, use
grep -B num1 -A num2
to include num1 lines of context before the match, and num2 lines of context after the match.
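As a small illustration, the wrapper below passes the two counts and the pattern through to grep. The helper name show_context is made up, and it assumes a grep that supports -B and -A (GNU grep does; as this thread notes, some other systems' grep does not).

```shell
# show_context BEFORE AFTER PATTERN [FILE...]
# Hypothetical wrapper; requires a grep that supports -B and -A.
show_context() {
    before=$1
    after=$2
    pattern=$3
    shift 3
    grep -B "$before" -A "$after" -e "$pattern" "$@"
}

# 2 lines before and 3 lines after the line containing "stack"
printf '%s\n' 1 2 3 stack 5 6 7 8 | show_context 2 3 stack
```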
EDIT:
Seems the OP is using AIX. This has a different set of options which doesn't include -B and -A
this link describes grep on AIX 4.3 (it doesn't look promising)
Matt's perl script might be a better solution.
Here is what I usually do on AIX:
before=2   # number of lines to show before the match
after=2    # number of lines to show after the match
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk "NR<=%+$after && NR>=%-$before" <filename>
If you do not want the two extra variables, you can always write it as a one-liner:
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+<<after>> && NR>=%-<<before>>' <filename>
Suppose the pattern is 'stack' and the filename is flow.txt.
For 2 lines before and 3 lines after, the command is:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%-2' flow.txt
For 2 lines before only, the command is:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=% && NR>=%-2' flow.txt
For 3 lines after only, the command is:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%' flow.txt
Multiple files: use awk on its own instead of the grep pipeline. For the pattern 'stack' in the files matching flow.*, with 2 lines before and 3 lines after, the command is:
awk 'BEGIN {
before=2; after=3; pattern="stack";
i=0; hold[before]=""; afterprints=0}
{
#Print the lines from the previous Match
if (afterprints > 0)
{
print FILENAME ":" FNR ":" $0
afterprints-- #keep a track of the lines to print after - this can be reset if a match is found
if (afterprints == 0) print "---"
}
#Look for the pattern in current line
if ( match($0, pattern) > 0 )
{
# print the lines in the hold round robin buffer from the current line to line-1
# if (before >0) => user wants lines before avoid divide by 0 in %
# and afterprints => 0 - we have not printed the line already
for(j=i; j < i+before && before > 0 && afterprints == 0 ; j++)
print hold[j%before]
if (afterprints == 0) # print the line if we have not printed the line already
print FILENAME ":" FNR ":" $0
afterprints=after
}
if (before > 0) # Store the lines in the round robin hold buffer
{ hold[i]=FILENAME ":" FNR ":" $0
i=(i+1)%before }
}' flow.*
From the tags, it's likely that the system has a grep that may not support providing context (Solaris is one system that doesn't and I can't remember about AIX). If that is the case, there's a perl script that may help at http://www.sun.com/bigadmin/jsp/descFile.jsp?url=descAll/cgrep__context_grep.
If you have sed you could use this shell script
BEFORE=2
AFTER=3
FILE=file.txt
PATTERN=pattern
for i in $(grep -n $PATTERN $FILE | sed -e 's/\:.*//')
do head -n $(($AFTER+$i)) $FILE | tail -n $(($AFTER+$BEFORE+1))
done
grep -n prefixes each match with its line number, and sed strips everything after that number. head then outputs everything up to the matching line plus $AFTER more lines, and tail keeps the last $BEFORE + $AFTER + 1 of those (that is, the matching line plus the requested lines before and after it).
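When a match sits close to the end of the file, head returns fewer lines than expected and tail then reaches further back than $BEFORE lines. One way around that, sketched here with sed -n and an explicit line range rather than head and tail, is to compute the first and last line numbers directly. The helper name context_lines is hypothetical.

```shell
# context_lines BEFORE AFTER PATTERN FILE
# Hypothetical helper; FILE is read once by grep and once per match by sed.
context_lines() {
    before=$1
    after=$2
    pattern=$3
    file=$4
    grep -n -e "$pattern" "$file" | cut -d: -f1 | while read -r n; do
        start=$((n - before))
        [ "$start" -lt 1 ] && start=1   # clamp at the first line
        sed -n "${start},$((n + after))p" "$file"
    done
}
```

An end of range beyond the last line is harmless to sed, so only the start needs clamping.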
Sure there is (from the grep man page):
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
and if you want the same amount of lines before AND after the match, use:
-C NUM, -NUM, --context=NUM
Print NUM lines of output context. Places a line containing a
group separator (--) between contiguous groups of matches. With
the -o or --only-matching option, this has no effect and a
warning is given.
you can use awk
awk 'BEGIN{t=4}
c--&&c>=0
/pattern/{ c=t; for(i=NR;i<NR+t;i++)print a[i%t] }
{ a[NR%t]=$0}
' file
output
$ more file
1
2
3
4
5
pattern
6
7
8
9
10
11
$ ./shell.sh
2
3
4
5
6
7
8
9
