Find similar rows in a text file in a Unix system - unix

I have a file named tt.txt and the contents of this file is as follows:
fdgs
jhds
fdgs
I am trying to get the similar rows as the output in a text file.
My expected output is:
fdgs
fdgs
To do so, I used this command:
uniq -u tt.txt > output.txt
but it returns:
fdgs
jhds
fdgs
Do you know how to fix it?

If by similar rows you mean rows with the same content:
According to the uniq manpage, uniq only filters out adjacent matching lines, so you need to sort the input first and use the -D option to print all duplicated lines, as shown below. Note that -D is limited to the GNU implementation, and sorting means the output comes out in a different order than the input.
sort tt.txt | uniq -D
If you want the output to stay in the same order as the input, you need to remember the original line numbers and sort by them again, like this:
cat -n tt.txt | sort -k2 | uniq -f1 -D | sort -n -k1,1 | sed -E 's/^\s+[0-9]+\s+//'
cat -n prefixes each line with its line number
sort -k2 sorts the input starting at the 2nd field (the original content)
uniq -f1 -D skips the first field (the line number) when comparing and prints all duplicated lines
sort -n -k1,1 sorts the output back into the original line order
sed -E 's/^\s+[0-9]+\s+//' deletes the leading line-number field
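As a rough sanity check on the three-line sample above, the intermediate steps would print something like this (the cat -n numbering is condensed in the comments for readability):
cat -n tt.txt | sort -k2                 # 1 fdgs / 3 fdgs / 2 jhds  -- identical content grouped together
cat -n tt.txt | sort -k2 | uniq -f1 -D   # 1 fdgs / 3 fdgs           -- only the duplicated group survives
# the final sort -n -k1,1 and sed then restore the input order and strip the numbers, leaving:
# fdgs
# fdgs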
The uniq -u command outputs only the lines that are not repeated, which is the complete opposite of what you want.

One in awk:
$ awk '++seen[$0]==2;seen[$0]>1' file
fdgs
fdgs
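The one-liner is terse, so here is a minimal expanded sketch of the same logic (assuming the input file is the tt.txt from the question). When a line is seen for the second time it is printed twice, once to stand in for the first occurrence and once for itself, and every later repeat is printed once:
awk '
{
    count[$0]++                 # count how many times this line has appeared so far
    if (count[$0] == 2)         # second sighting: print an extra copy to stand in
        print                   #   for the first occurrence that was not printed
    if (count[$0] > 1)          # print every occurrence from the second one onwards
        print
}' tt.txt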

Related

Can someone explain the following Unix command to me?

I want to validate the file. As per validation, I need to check the length of each column, whether it is null or not null, and the primary key constraint of that file.
cat File_name | awk -F '|' '{print NF}' | sort | uniq
This command splits each line of the file into fields using the pipe | as the delimiter, prints the number of fields on each row (the NF variable), sorts the output (sort command), and finally keeps only the unique counts (uniq command).
The script can be optimised by getting rid of the cat command, passing the file directly to awk, and using sort's -u option to get unique records:
awk -F '|' '{print NF}' file_name | sort -u
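For the per-column checks the question mentions (length of each field and null/not null), a rough awk sketch along the same lines could look like this; the report format is only an illustration and file_name is the placeholder used above:
awk -F '|' '{
    for (i = 1; i <= NF; i++) {
        if ($i == "")
            printf "line %d: column %d is null\n", NR, i
        else
            printf "line %d: column %d has length %d\n", NR, i, length($i)
    }
}' file_name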

Unix command to print every 2nd line of duplicates

I have a text file that has 110132 lines and looks like this:
b3694658:heccc 238622
b3769025:heccc 238622
b3694659:heccc 238623
b3769026:heccc 238623
b3694660:heccc 238624
b3769027:heccc 238624
b3694661:heccc 238625
b3769028:heccc 238625
Notice that every 2nd line has a duplicate entry after heccc; I want output that only has the 2nd occurrence of each duplicate, so it would look like this:
b3769025:heccc 238622
b3769026:heccc 238623
b3769027:heccc 238624
b3769028:heccc 238625
Thanks for your help!
It appears that you are just looking to output unique values. If that is so, just do this:
cat textfile | sort | uniq
uniq -f1 file.txt
should do in this case.
See how the -f and -s options work with the uniq command.
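Since the duplicates in the sample arrive in consecutive pairs, a simpler sketch that keeps every 2nd line would also give the requested output (this assumes the pairing really holds for all 110132 lines; textfile is the name used in the answer above):
awk 'NR % 2 == 0' textfile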

Grep from multiple files and get the first n lines of each output

Let's say I have f files.
From each file I want to grep a pattern.
I just want n pattern matches from each file.
What I have:
strings <files_*> | grep <pattern> | head -<n>
I do need to use strings because I'm dealing with binaries, and with this command I am only getting n matches in total, not n per file.
grep has a -mX option that allows you to specify how many matches. However, adding this to your piped command line is going to stop at the first X matches total... not per file.
To get per-file count, I came up with this:
for FILE in `ls -f <files_*>` ; do strings "$FILE" | grep -m<X> <pattern> ; done
Example (searching for "aa" in the files that match x* and returning up to 3 matches from each) would be:
for FILE in `ls -f x*` ; do strings "$FILE" | grep -m3 aa ; done
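A sketch of the same loop that avoids parsing ls output, using the example values x*, aa and 3 from above; iterating over the glob directly is safer when file names contain spaces:
for file in x*; do
    strings "$file" | grep -m3 aa    # at most 3 matches per file
done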

Appending multiple files into one file

I append multiple data files into a single data file using the cat command. How can I assign that single file's contents to a new file?
I am using the command:
cat file1 file2 file3 > Newfile.txt
AnotherFile=`cat Newfile.txt`
sort $AnotherFile | uniq -c
It shows an error like "can not open AnotherFile".
How can I assign this new file's contents to another file?
Original answer to original question
Well, the easiest way is probably cp:
cat file1 file2 file3 > Newfile.txt
cp Newfile.txt AnotherFile.txt
Failing that, you can use:
cat file1 file2 file3 > Newfile.txt
AnotherFile=$(cat Newfile.txt)
echo "$AnotherFile" > AnotherFile.txt
Revised answer to revised question
The original question had echo "$AnotherFile" as the third line; the revised question has sort $AnotherFile | uniq -c as the third line.
Assuming that sort $AnotherFile is not sorting all the contents of the files mentioned in the list created from concatenating the original files (that is, assuming that file1, file2 and file3 do not contain just lists of file names), then the objective is to sort and count the lines found in the source files.
The whole job can be done in a single command line:
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c
Or (more usually):
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c | sort -n
which lists the lines in increasing order of frequency.
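If you prefer the most frequent lines first, only the final sort flag changes:
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c | sort -rn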
If you really do want to sort the contents of the files listed in file1, file2, file3 but only list the contents of each file once, then:
cat file1 file2 file3 | tee Newfile.txt | sort -u | xargs sort | sort | uniq -c
It looks weird having three sort-related commands in a row, but there is justification for each step. The sort -u ensures each file name is listed once. The xargs sort converts a list of file names on standard input into a list of file names on the sort command line. The output of this is the sorted data from each batch of files that xargs produces. If there are so few files that xargs doesn't need to run sort more than once, then the following plain sort is redundant. However, if xargs has to run sort more than once, then the final sort has to deal with the fact that the first lines from the second batch produced by xargs sort probably come before the last lines produced by the first batch produced by xargs sort.
This becomes a judgement call based on knowledge of the data in the original files. If the files are small enough that xargs won't need to run multiple sort commands, omit the final sort. A heuristic would be "if the sum of the sizes of the source files is smaller than the maximum command line argument list, don't include the extra sort".
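As a rough guide for that heuristic, the system's argument-list limit can be inspected with getconf (the exact value varies by platform):
getconf ARG_MAX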
You can probably do that in one go:
# Write to two files at once. Both files have a constantly varying
# content until cat is finished.
cat file1 file2 file3 | tee Newfile.txt > Anotherfile.txt
# Save the output filename, just in case you need it later
filename="Anotherfile.txt"
# This reads the contents of Newfile into a variable called AnotherText
AnotherText=`cat Newfile.txt`
# This is the same as "cat Newfile.txt"
echo "$AnotherText"
# This saves AnotherText into Anotherfile.txt
echo "$AnotherText" > Anotherfile.txt
# This too, using cp and the saved name above
cp Newfile.txt "$filename"
If you want to create the second file all in one go, this is a common pattern:
# During this process the contents of tmpfile.tmp is constantly changing
{ slow process creating text } > tmpfile.tmp
# Very quickly create a complete Anotherfile.txt
mv tmpfile.tmp Anotherfile.txt
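A slightly more robust sketch of the same pattern uses mktemp for the temporary name; some_slow_command is just a placeholder for whatever generates the text:
tmp=$(mktemp tmpfile.XXXXXX)       # unique temporary name in the current directory
some_slow_command > "$tmp"         # placeholder for the slow process above
mv "$tmp" Anotherfile.txt          # rename into place once the output is complete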
Make the file and redirect into it in append mode:
touch Newfile.txt
cat files* >> Newfile.txt

sort and uniq oneliner

Is there a one-liner for sort and uniq given a filename in Unix?
I googled and found the following, but it's not sorting, and I'm also not sure what the command below is doing. Any better ways using awk or any other Unix tool?
cut -d, -f1 file | uniq | xargs -I{} grep -m 1 "{}" file
On a side note, is there one that can be used on both Windows and Unix? This is not important, just checking.
C:\Users\Chola>sort -t "#" -k2,2 email-list.txt
Input text file:
436485
422636
429228
427041
433414
425810
422636
431526
428808
If your file consists only of numbers, one per line:
sort -n FILENAME | uniq
or
sort -u -n FILENAME
(When sorting whole lines, as in the cut examples below, you can add -u to the sort command instead of piping through uniq. When sorting with -k, however, sort -u compares only the key, so it is not an exact substitute for uniq there.)
If you want to extract just one column of a file, and then sort that column numerically removing duplicates, you could do this:
cut -f7 FILENAME | sort -n | uniq
Cut assumes that there is a single tab between columns. If your file is CSV, you might be able to do this:
cut -f7 -d, FILENAME | sort -n | uniq
but that won't work if there is a , inside a quoted text field (CSV protects such fields with double quotes).
If you want to sort by the column but remove only completely duplicate lines, then you can do this:
sort -k7,7n FILENAME | uniq
sort assumes that columns are separated by whitespace. Again, if your columns are separated by ,, you can use:
sort -k7,7n -t, FILENAME | uniq
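If instead you want one line per distinct value of column 7 (rather than removing only fully identical lines), that key-only behaviour of sort -u is exactly what you need, for example:
sort -t, -k7,7n -u FILENAME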
