Awk: find duplicates in 3rd field INCLUDING original - dictionary

I figured out the following code to find duplicate UIDs in a passwd file, but it doesn't include the first instance (the one that was later duplicated). Ultimately I wanted a dictionary like UID = [ USER1, USER2 ], but I am not sure how to get that done in Awk.
What I have so far:
awk -F':' '$1 !~ /^#/ && _[$3]++ {print}' /etc/passwd
Explanation (as I understand it): if the regex matches a line not beginning with the comment character '#', increment an array entry keyed on that line's UID; once the entry is non-zero (true), the line is printed.

This may help you do it. First we save the data in an array, and in the END{} block we print all repeated lines from the array (there is also a print while reading). Hope it helps.
awk -F":" '
$1 !~ /^#/ && (counter[$3]>0) {a++;print "REPEATED|UID:"$3"|"$0"|"LastReaded[$3]; repeateds["a"a]=$0; repeateds["b"a]=LastReaded[$3]}
$1 !~ /^#/ { counter[$3]++; LastReaded[$3]=$0}
END {for (i in repeateds)
{
print i"|"repeateds[i]
}
}
' /etc/passwd
REPEATED|UID:229|pepito:*:229:229:pepito:/var/empty:/usr/bin/false|_avbdeviced:*:229:-2:Ethernet AVB Device Daemon:/var/empty:/usr/bin/false
a1|pepito:*:229:229:pepito:/var/empty:/usr/bin/false
b1|_avbdeviced:*:229:-2:Ethernet AVB Device Daemon:/var/empty:/usr/bin/false
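For the dictionary shape asked for (UID = [ USER1, USER2 ]), a minimal sketch is to collect every username per UID while reading and, in the END block, print only the UIDs seen more than once:

```shell
awk -F':' '
$1 !~ /^#/ {
    count[$3]++                                          # occurrences of this UID
    users[$3] = (count[$3] > 1) ? users[$3] ", " $1 : $1 # accumulate usernames
}
END {
    for (uid in users)
        if (count[uid] > 1)
            print uid " = [ " users[uid] " ]"
}' /etc/passwd
```

Note that for (uid in users) iterates in no particular order; pipe through sort -n if you need the UIDs sorted.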


unix ksh how to print $1 and first n characters of $2

I have a file as follows:
$ cat /etc/oratab
hostname01:DBNAME11:/oracle_home/A_19.0.0.0:N
hostname01:DBNAME1_DC:/oracle_home/A_19.0.0.0:N
hostname02:DBNAME21:/oracle_home/B_19.0.0.0:N
hostname02:DBNAME2_DC:/oracle_home/B_19.0.0.0:N
I want to print the unique values of the first column, the first 7 characters of the second column, and the third column, when the third column matches the string "19.0.0".
The output I want to see is:
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
I put together this piece of code, but it looks like it's not the correct way to do it:
cat /etc/oratab|grep "19.0.0"|awk '{print $1}' || awk -F":" '{print subsrt($2,1,8)}
Sorry, I am very new to shell scripting.
1st solution: With your shown samples, please try the following, written and tested with GNU awk.
awk 'BEGIN{FS=OFS=":"} {$2=substr($2,1,7)} !arr[$1,$2]++ && $3~/19\.0\.0/{NF--;print}' Input_file
2nd solution: OR in case your awk doesn't support NF-- then try following.
awk '
BEGIN{
FS=OFS=":"
}
{
$2=substr($2,1,7)
}
!arr[$1,$2]++ && $3~/19\.0\.0/{
$4=""
sub(/:$/,"")
print
}
' Input_file
Explanation: set the input and output field separators to :. In the main program, set the 2nd field to the first 7 characters of its value. Then, if that combination hasn't occurred before and the 3rd field matches 19.0.0, drop the last field and print the line.
You may try this awk:
awk 'BEGIN{FS=OFS=":"} $3 ~ /19\.0\.0/ && !seen[$1]++ {
print $1, substr($2,1,7), $3}' /etc/oratab
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
We check and populate associative array seen only if we find 19.0.0 in $3.
If the lines can be like this, both ending in 19.0.0:
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname01:DBNAME1:/oracle_home/A_19.0.0.1
and only hostname01 is used as the uniqueness key, you might miss a line.
You could match the pattern using sed with 2 capture groups for the parts you want to keep, matching the parts you don't.
Then pipe the output to uniq to get all unique lines, instead of deduplicating on the first column only.
sed -nE 's/^([^:]+:.{7})[^:]*(:[^:]*19\.0\.0[^:]*).*/\1\2/p' file | uniq
Output
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0
$ awk 'BEGIN{FS=OFS=":"} index($3,"19.0.0"){print $1, substr($2,1,7), $3}' file | sort -u
hostname01:DBNAME1:/oracle_home/A_19.0.0.0
hostname02:DBNAME2:/oracle_home/B_19.0.0.0

Adding previous lines to current line unless pattern is found in unix shell script

I am facing an issue with appending previous lines to the current line until a pattern is found. I have a 43 MB file on Unix. A snippet is shown below:
AAA7034 new value and a old value
A
78698 new line and old value
BCA0987 old value and new value
new value
What I want is :
AAA7034 new value and a old value A 78698 new line and old value
BCA0987 old value and new value new value
That is, I have to append all the lines until the next pattern is found (the first pattern is AAA and the next pattern is BCA).
Because of the size of the file, I'm not sure whether awk/sed will cope. Any bash script is appreciated.
You can combine all patterns and perform a regex match. Try something like this (it is just a sketch; you may want to trim the output):
#!/bin/bash
patterns="^(AAA|BCS|BABA|BCA)"
file="$1"
while IFS= read -r line; do
if [[ "$line" =~ $patterns ]] ; then
echo # prints new line
fi
echo -n "$line " # prints the line itself and a space as a separator
done < "$file"
You can redirect the output to a file, of course.
It's not really clear precisely what you want. You've stated that you want to match the patterns 'AAA' and 'BCA', and later expanded that to "the pattern shall be like: AAA, BCS, BABA, BCA". I don't know if that means you only want to match the exact strings 'AAA', 'BCS', 'BABA', and 'BCA', or 3- or 4-character strings containing only 'A', 'B', 'C', and 'S', but it sounds like you are just looking for:
awk '/[A-Z]{3,4}/{printf "\n"} { printf "%s ", $0} END {printf "\n"}' input-file
Change the pattern as needed when your requirements are made more precise.
Based on the comment, it is trivial to convert any awk program to perl. Here is (basically) the output of a2p on the above awk script, with changes to reflect the stated pattern:
#!/usr/bin/env perl
while (<>) {
chomp;
if (/AAA|BCA|BCS|BABA/) {
printf "\n";
}
printf '%s ', $_;
}
printf "\n";
You can simplify that a bit:
perl -ne 'chomp; printf "\n" if /AAA|BCA|BCS|BABA/; printf "%s ", $_' input-file; echo

Finding Records Using awk Command using math score

Write the Unix command to display all fields for students who scored more than 80 in math, where the math score is also the top score among all their subjects; the output should be in ascending order of the students' std (standard).
INPUT:
roll,name,std,science_marks,math_marks,college
1,A,9,60,86,SM
2,B,10,85,80,DAV
3,C,10,95,92,DAV
4,D,9,75,92,DAV
OUTPUT:
1|A|9|60|86|SM
4|D|9|75|92|DAV
myCode:
awk 'BEGIN{FS=”,” ; OFS="|"} {if($4<$5 && $5>80){print $1,$2,$3,$4,$5,$6}}'
but I'm getting an "unexpected token" error. Please help me.
Error Message on my Mac System Terminal:
awk: syntax error at source line 1
context is
BEGIN >>> {FS=, <<<
awk: illegal statement at source line 1
Could you please try the following, written and tested in GNU awk with the shown samples. This answer doesn't hard-code the field number: it finds the column whose header is math_marks and checks the remaining lines accordingly.
awk '
BEGIN{
FS=","
OFS="|"
}
FNR==1{
for(i=1;i<=NF;i++){
if($i=="math_marks"){ field=i }
}
next
}
{
for(i=3;i<=(NF-1);i++){
max=(max>$i?(max?max:$i):$i)
}
if(max==$field && $field>80){ $1=$1; print }
max=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
FS="," ##Setting field separator as comma here.
OFS="|" ##Setting output field separator as | here for all lines.
}
FNR==1{ ##Checking condition if its first line then do following.
for(i=1;i<=NF;i++){ ##Going through all fields here.
if($i=="math_marks"){ field=i } ##Checking if a field value is math_marks; if so, set field to that field number here.
}
next ##next will skip all further statements from here.
}
{
for(i=3;i<=(NF-1);i++){ ##Going through from 3rd field to 2nd last field here.
max=(max>$i?(max?max:$i):$i) ##Creating max variable which checks its value with current field and sets maximum value by comparison here.
}
if(max==$field && $field>80){ $1=$1; print } ##After processing of all fields checking if maximum and field value is equal AND math number field is greater than 80 then print the line.
max="" ##Nullifying max var here.
}
' Input_file ##Mentioning Input_file name here.
Your code has double quotes of the wrong encoding (typographic quotes instead of ASCII quotes):
here
| |
v v
$ busybox awk 'BEGIN{FS=”,” ; OFS="|"} {if($4<$5 && $5>80){print $1,$2,$3,$4,$5,$6}}'
awk: cmd. line:1: Unexpected token
Replace those and your code works fine.
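For reference, the same program with plain ASCII quotes (plus NR>1 to skip the header line) runs cleanly; note that with only two mark columns, $4<$5 is equivalent to requiring math to be the top score. The filename input.csv here is a placeholder:

```shell
awk 'BEGIN{FS=","; OFS="|"} NR>1 && $4<$5 && $5>80 {print $1,$2,$3,$4,$5,$6}' input.csv
```

The shown sample is already in ascending std order; if yours is not, append | sort -t'|' -k3,3n.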

Unix: Using filename from another file

A basic Unix question.
I have a script which counts the number of records in a delta file.
awk '{
n++
} END {
if(n >= 1000) print "${completeFile}"; else print "${deltaFile}";
}' <${deltaFile} >${fileToUse}
Then, depending on the IF condition, I want to process the appropriate file:
cut -c2-11 < ${fileToUse}
But how do I use the contents of the file as the filename itself?
And if there are any tweaks to be made, feel free.
Thanks in advance
Cheers
Simon
To use as a filename the contents of a file which is itself identified by a variable (as asked)
cut -c2-11 <"$( cat $filetouse )"
// or in zsh just
cut -c2-11 <"$( < $filetouse )"
unless the filename in the file ends with one or more newline character(s), which people rarely do because it's quite awkward and inconvenient; in that case, something like:
read -rdX var <$filetouse; cut -c2-11 < "${var%?}"
// where X is a character that doesn't occur in the filename
// maybe something like $'\x1f'
Tweaks: your awk prints the variable reference ${completeFile} or ${deltaFile} (because they're within the single-quoted awk script) not the value of either variable. If you actually want the value, as I'd expect from your description, you should pass the shell vars to awk vars like this
awk -vf="$completeFile" -vd="$deltaFile" '{n++} END{if(n>=1000)print f; else print d}' <"$deltaFile"
# the " around $var can be omitted if the value contains no whitespace and no glob chars
# people _often_ but not always choose filenames that satisfy this
# and they must not contain backslash in any case
or export the shell vars as env vars (if they aren't already) and access them like
awk '{n++} END{if(n>=1000) print ENVIRON["completeFile"]; else print ENVIRON["deltaFile"]}' <"$deltaFile"
Also you don't need your own counter, awk already counts input records
awk -vf=... -vd=... 'END{if(NR>=1000)print f;else print d}' <...
or more briefly
awk -vf=... -vd=... 'END{print (NR>=1000?f:d)}' <...
or using a file argument instead of redirection so the name is available to the script
awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile" # no <
and barring trailing newlines as above you don't need an intermediate file at all, just
cut -c2-11 <"$( awk -vf="$completeFile" 'END{print (NR>=1000?f:FILENAME)}' "$deltaFile")"
Or you don't really need awk, wc can do the counting and any POSIX or classic shell can do the comparison
if [ $(wc -l <"$deltaFile") -ge 1000 ]; then c="$completeFile"; else c="$deltaFile"; fi
cut -c2-11 <"$c"
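Putting the pieces together, an end-to-end sketch of that last approach (the filenames are hypothetical placeholders for the script's variables):

```shell
#!/bin/sh
deltaFile=delta.dat          # placeholder names; substitute your real paths
completeFile=complete.dat

# use the complete file when the delta has at least 1000 records
if [ "$(wc -l <"$deltaFile")" -ge 1000 ]; then
    fileToUse=$completeFile
else
    fileToUse=$deltaFile
fi

cut -c2-11 <"$fileToUse"
```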

How to delete partial duplicate lines with AWK?

I have files with these kinds of duplicate lines, where only the last field differs:
OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,5
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55
I need to remove the first occurrence of the line and leave the second one.
I've tried:
awk '!x[$0]++ {getline; print $0}' file.csv
but it's not working as intended, as it also removes non-duplicate lines.
#!/bin/awk -f
# NB: this keeps the FIRST line of each duplicate group; to keep the last
# instead, reverse the input and output: tac file.csv | awk -f this.awk | tac
{
s = substr($0, 0, match($0, /,[^,]+$/))   # key: everything before the last field
if (!seen[s]) {
print $0
seen[s] = 1
}
}
If your near-duplicates are always adjacent, you can just compare to the previous entry and avoid creating a potentially huge associative array.
#!/bin/awk -f
{
s = substr($0, 0, match($0, /,[^,]*$/))
if (NR > 1 && s != prev) {   # NR > 1 avoids printing an empty first line
print prev0
}
prev = s
prev0 = $0
}
END {
if (NR) print prev0
}
Edit: Changed the script so it prints the last one in a group of near-duplicates (no tac needed).
As a general strategy (I'm not much of an AWK pro despite taking classes with Aho) you might try:
1. Concatenate all the fields except the last.
2. Use this string as a key to a hash.
3. Store the entire line as the value in the hash.
4. When you have processed all lines, loop through the hash, printing out the values.
This isn't AWK specific and I can't easily provide any sample code, but this is what I would first try.
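That strategy is only a few lines of awk; here is a sketch (note that for (k in line) does not preserve input order, so sort the output if order matters):

```shell
awk -F',' '
{
    key = $0
    sub(/,[^,]*$/, "", key)   # the key is everything before the last field
    line[key] = $0            # later occurrences overwrite earlier ones
}
END {
    for (k in line) print line[k]
}' file.csv
```

Because later lines overwrite earlier ones, this keeps the second of each duplicate pair, as the question asks.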
