Unix uniq utility: What is wrong with this code?

What I want to accomplish: print duplicated lines
This is what uniq man says:
SYNOPSIS
uniq [OPTION]... [INPUT [OUTPUT]]
DESCRIPTION
Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).
...
-d, --repeated
only print duplicate lines
This is what I try to execute:
root@laptop:/var/www# cat file.tmp
Foo
Bar
Foo
Baz
Qux
root@laptop:/var/www# cat file.tmp | uniq --repeated
root@laptop:/var/www#
So I was expecting Foo in this example, but it returns nothing.
What is wrong with this snippet?

uniq only checks consecutive lines against each other. So you can only expect to see something printed if there are two or more Foo lines in a row, for example.
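A quick way to see the adjacency requirement, using printf to fake the input:
$ printf 'Foo\nBar\nFoo\n' | uniq -d        # Foos not adjacent: no output
$ printf 'Foo\nFoo\nBar\n' | uniq -d        # Foos adjacent: Foo is printed
Foo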
If you want to get around that, sort the file first with sort.
$ sort file.tmp | uniq -d
Foo
If you really need to have all the non-consecutive duplicate lines printed in the order they occur in the file, you can use awk for that:
$ awk '{ if ($0 in lines) print $0; lines[$0]=1; }' file.tmp
but for a large file, that may be less efficient than sort and uniq. (May be - I haven't tried.)
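If you would rather have each duplicated line reported only once (what uniq -d gives you, but in original file order), a small variation of the same idea works; it prints a line exactly at its second occurrence:
$ awk 'seen[$0]++ == 1' file.tmp
Foo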

cat file.tmp | sort | uniq --repeated
or
sort file.tmp | uniq --repeated

cat file.tmp | sort | uniq --repeated
the lines need to be sorted first

uniq operates on adjacent lines. What you want is
cat file.tmp | sort | uniq --repeated
On OS X, I would actually use
sort file.tmp | uniq -d

I've never tried this myself, but I think the word "successive" is the key.
This would probably work if you sorted the input before running uniq over it.
Something like
sort file.tmp | uniq -d

Related

Get a list of unique sender (from=) domains in a postfix maillog

I am currently trying to extract all the sender domains from the maillog. I am able to do some of that with the command below, but the output is not quite what I want. What would be the best approach to retrieve a unique list of sender domains from the maillog?
grep from= /var/log/maillog | awk '{print $7}' | sort | uniq -c | sort -n
output
1 from=<user@test.com>,
1 from=<apache@app1.com>,
2 from=<bounceld_5BFa-bx0p-P3tQ-67Nn@example.com>,
2 from=<bounceld_19iI-HqaS-usVU-fqe5@example.com>,
12 reject:
666 from=<>,
desired output
test.com
app1.com
example.com
See useless use of grep; if you are using Awk anyway, you don't really need grep at all.
awk '$7 ~ /from=.*@/ { split($7, a, /@/); sub(/>,?$/, "", a[2]); ++count[a[2]] }
     END { for (dom in count) print count[dom], dom }' /var/log/maillog
Collecting the counts in an associative array does away with the need to call sort and uniq, too. Obviously, if you don't care about the count, don't print count[dom] at the end.
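Given the sample log above, this should print something along these lines (the order of a for-in loop in awk is unspecified, and the from=<> and reject: lines are filtered out by the from=.*@ pattern):
4 example.com
1 test.com
1 app1.com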
This should give you the answer:
grep from= /var/log/maillog | awk '{print $7}' | grep -Po '(?=@).{1}\K.*(?=>)' | sort -n | uniq -c
... change last items to "| sort | uniq" to remove the counts.
References:
https://www.baeldung.com/linux/bash-remove-first-characters (the {1}\K trick)
Extract email addresses from log with grep or sed (the grep -Po option)

Can someone explain the following Unix command?

I want to validate the file. As part of the validation, I need to check the length of each column, whether it is null or not null, and the primary key constraint of that file.
cat File_name | awk -F '|' '{print NF}' | sort | uniq
This command splits each line of the file into fields using the pipe | as the delimiter, prints the number of fields on each row (the NF variable), sorts the output (the sort command), and finally keeps only the unique counts (the uniq command).
The pipeline can be simplified by getting rid of the cat command, letting awk read the file directly, and using sort's -u option to get unique records:
awk -F '|' '{print NF}' file_name | sort -u
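As a sanity check, a clean file should print a single number, since every row has the same number of fields. A toy example with made-up data, where one row is short:
$ printf 'a|b|c\nd|e|f\ng|h\n' | awk -F '|' '{print NF}' | sort -u
2
3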

sort and uniq oneliner

Is there a one-liner for sort and uniq given a filename in Unix?
I googled and found the following, but it's not sorting, and I'm not sure what the command below is doing. Are there better ways using awk or any other Unix tool?
cut -d, -f1 file | uniq | xargs -I{} grep -m 1 "{}" file
On a side note, is there one that can be used on both Windows and Unix? This is not important, but just checking.
C:\Users\Chola>sort -t "@" -k2,2 email-list.txt
Input text file:
436485
422636
429228
427041
433414
425810
422636
431526
428808
If your file consists only of numbers, one per line:
sort -n FILENAME | uniq
or
sort -u -n FILENAME
(In all of the following, you can add -u to the sort command instead of piping through uniq.)
If you want to extract just one column of a file, and then sort that column numerically removing duplicates, you could do this:
cut -f7 FILENAME | sort -n | uniq
Cut assumes that there is a single tab between columns. If your file is CSV, you might be able to do this:
cut -f7 -d, FILENAME | sort -n | uniq
but that won't work if there is a , inside a text field (CSV protects such commas with double quotes).
If you want to sort by the column but remove only completely duplicate lines, then you can do this:
sort -k7,7n FILENAME | uniq
sort assumes that columns are separated by whitespace. Again, if your columns are separated by commas, you can use:
sort -k7,7n -t, FILENAME | uniq
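A toy demonstration of the idea (using column 2 rather than 7, with made-up data): the duplicated a,1 row survives only once, and the output is sorted by the numeric column:
$ printf 'a,1\nb,2\na,1\n' | sort -k2,2n -t, | uniq
a,1
b,2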

sorting ls-l owners in Unix

I want to sort the owners in alphabetical order from a call to ls -l and cannot figure out a way to do it. I know something like ls -l | sort would sort by file name, but how do I sort by owner?
The owner is the third field, so use -k 3:
ls -l | sort -k 3
You can extend this idea to sorting based on other fields, and you can have multiple -k options. For instance, maybe you want to sort by owner, and then size in descending order:
ls -l | sort -k 3,3 -k 5rn
I am not sure if you want only the owners or the whole information sorted by owner. In the former case superfo's solution is almost correct.
Additionally, you need to squeeze repeated spaces in ls's output with tr, because otherwise cut, which uses a single space as its delimiter, won't work in all directories.*
So in the end you get this:
ls -l | tr -s ' ' | cut -d ' ' -f 3 | sort | uniq
*Some directories have a two-digit value in the second field, and all other lines with a single digit there get an additional space to preserve the column layout.
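A minimal illustration of why the tr -s step matters: with repeated spaces, cut treats every space as a field boundary, so field 3 is empty until the spaces are squeezed:
$ echo "a   b   c" | cut -d ' ' -f 3

$ echo "a   b   c" | tr -s ' ' | cut -d ' ' -f 3
c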
How about ...
ls -l | cut -d ' ' -f 3 | sort | uniq
Try this:
ls -l | awk '{print $3, $4, $8}' | sort
It will print the user name, the group name and the file name. (This assumes file names contain no spaces.)
ls -l | awk '{print $3, $4, $0}' | sort
This will print the user name, group name and the full ls -l output, sorted by user name first, then group name, then by whatever ls -l prints first.

How to keep a file's format if you use the uniq command (in shell)?

In order to use the uniq command, you have to sort your file first.
But in the file I have, the order of the information is important, so how can I keep the original order of the file but still get rid of the duplicate content?
Another awk version:
awk '!_[$0]++' infile
This awk keeps the first occurrence. Same algorithm as other answers use:
awk '!($0 in lines) { print $0; lines[$0]; }'
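For example, with some made-up input, the first occurrence of each line survives, in the original order:
$ printf 'b\na\nb\nc\na\n' | awk '!_[$0]++'
b
a
c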
Here's one that only needs to store duplicated lines (as opposed to all lines) using awk:
sort file | uniq -d | awk '
    FNR == NR { dups[$0] }                          # first input (stdin): collect the duplicated lines
    FNR != NR && (!($0 in dups) || !lines[$0]++)    # second input (file): print non-dups always, dups once
' - file
There's also the "line-number, double-sort" method.
nl -n ln | sort -u -k 2 | sort -k 1n | cut -f 2-
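For instance (with GNU sort; formally, which of several equal-key lines -u keeps is unspecified, but in practice the first one in input order survives):
$ printf 'b\na\nb\nc\na\n' | nl -n ln | sort -u -k 2 | sort -k 1n | cut -f 2-
b
a
c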
You can run uniq -d on the sorted version of the file to find the duplicate lines, then run some script that says:
if this_line is in duplicate_lines {
    if not i_have_seen[this_line] {
        output this_line
        i_have_seen[this_line] = true
    }
} else {
    output this_line
}
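A minimal awk rendering of that pseudo-code, assuming a hypothetical temporary file dups.txt (this is essentially what the two-pass awk answer above does in one go):
sort file | uniq -d > dups.txt
awk 'NR == FNR { dup[$0]; next }    # first file: collect the duplicated lines
     !($0 in dup) || !seen[$0]++    # second file: print non-dups always, dups only once
' dups.txt file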
Using only uniq and grep:
Create d.sh:
#!/bin/sh
# keep the first occurrence of each line, using only sort, uniq and grep
sort "$1" | uniq > "$1_uniq"                      # the set of lines not yet printed
while IFS= read -r line; do
    grep -m1 -Fx -- "$line" "$1_uniq" >> "$1_out" # print the line if it is still pending
    grep -vFx -- "$line" "$1_uniq" > "$1_uniq2"   # and drop it from the pending set
    mv "$1_uniq2" "$1_uniq"
done < "$1"
rm "$1_uniq"
Example:
./d.sh infile
You could use some horrible O(n^2) thing, like this (Pseudo-code):
file2 = EMPTY_FILE
for each line in file1:
    if not line in file2:
        file2.append(line)
This is potentially rather slow, especially if implemented at the Bash level. But if your files are reasonably short, it will probably work just fine, and would be quick to implement (not line in file2 is then just grep -v, and so on).
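A minimal Bash sketch of that loop, using grep -qFx for an exact whole-line match (rather than grep -v):
: > file2                                   # start with an empty file2
while IFS= read -r line; do
    grep -qFx -- "$line" file2 || printf '%s\n' "$line" >> file2
done < file1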
Otherwise you could of course code up a dedicated program, using some more advanced data structure in memory to speed it up.
sort file1 | uniq | while IFS= read -r line; do
    grep -n -m1 -Fx -- "$line" file1
done > out
sort -n out
First do the sort; then, for each unique value, grep for its first match (-m1), preserving the line numbers (-n); finally, sort the output numerically (-n) by line number. You could then remove the line numbers with sed or awk.
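For example, since grep -n prefixes each line with its number and a colon, something like this strips them off again:
sort -n out | sed 's/^[0-9]*://'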
