I would like to compare two unsorted files, file1 and file2. I would like to compute file2 - file1 (the difference), irrespective of line order. diff alone is not working for me.
I got the solution by using comm:
comm -13 file1 file2
will give you the desired output: the lines of file2 that are not in file1 (equivalently, comm -23 file2 file1).
Note that comm requires both files to be sorted first anyway.
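If you would rather not create sorted copies of the files on disk, a minimal variant using bash process substitution feeds comm sorted input on the fly:
comm -13 <(sort file1) <(sort file2)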
Well, you can just sort the files first, and diff the sorted files.
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted
You can also filter the output to report lines in file2 which are absent from file1:
diff -u file1.sorted file2.sorted | grep "^+" | grep -v "^+++"
(The second grep removes the +++ header line that diff -u prints for the second file.)
As indicated in the comments, you do not in fact need to create sorted copies of the files. Instead, you can use process substitution and say:
diff <(sort file1) <(sort file2)
There are 3 basic commands to compare files in Unix:
cmp: compares two files byte by byte and reports the first mismatch on the screen; if there is no mismatch, it gives no output.
Syntax: cmp file1 file2
comm: finds the lines that are present in one file but not in the other (both files must be sorted).
diff: reports the differences between the two files line by line.
Easiest way: sort the files with sort(1) and then compare them with diff(1).
This might be the worst example ever given on StackOverflow, but my goal is to remove from File2 every line that contains a word from File1, ignoring case and matching anywhere in the line. For example, Cats#123:bob would be removed from File2 because the word Cat appears in File1. So regardless of case, if a matching word is found, the entire line should be removed.
Input (File1):
Cat
Dog
Horse
Wheel
MainFile (File2)
Cats#123:bob
dog#1:truth
Horse-1:fairytale
Wheel:tremendous
Divination:maximus
Desired output
Divination:maximus
As the output shows, only "Divination:maximus" should be printed, as no matching words for it were found in File1. I generally prefer sed or awk since I use Cygwin, but any suggestions are welcome; I can answer any questions you may have, thanks.
Here's what I've tried so far, but it's not working, unfortunately: the wrong lines are being output. I'm fairly inexperienced, so I don't know how to develop this syntax further, and maybe it's completely irrelevant to the job at hand.
grep -avf file1.txt file2.txt > output.txt
The grep command can do that for you:
grep -v -i -f file1 file2
The -f file1 tells grep to read its patterns from file1
The -i flag makes the match case insensitive
The -v flag selects the lines that do not match any of those patterns
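With the sample files above, this produces the desired output (if your words may contain regex metacharacters, adding -F makes grep treat them as literal strings):
$ grep -v -i -f file1 file2
Divination:maximus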
Say, I have two files:
file1:
0
0
-3.44785
-2.15069
5.70183
17.8715
and file2:
31.9812
50.5646
72.361
96.8705
119.893
144.409
To combine them side by side, I use:
paste -d" " file1 file2
or I use an awk command to do the same thing.
If I want to combine these two files one after another, what should I write? I know how to do this using cat, and I have tried various modifications of the paste command, but they don't give the desired output.
Could you please help? Thanks.
cat (short for concatenate) is your friend:
cat file1 file2
That's pretty basic; most people are aware of cat long before they learn to deal with awk, so kudos for mastering the latter!
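With the sample files above, the result is simply the six lines of file1 followed by the six lines of file2:
$ cat file1 file2
0
0
-3.44785
-2.15069
5.70183
17.8715
31.9812
50.5646
72.361
96.8705
119.893
144.409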
Normally I would use cat file1 file2, but you could also do it like this:
awk '{print $0}' file1 file2
or
awk '1' file1 file2
(Note: the '1' is a condition that is always true, so it triggers awk's default action, which is the same as print.)
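As an aside, since the question mentions using awk for the side-by-side case as well: here is a sketch of how awk can emulate paste, assuming both files have the same number of lines:
awk 'NR==FNR{a[FNR]=$0; next} {print a[FNR], $0}' file1 file2
The first pass stores each line of file1 under its line number; the second pass prints the stored line followed by the corresponding line of file2, separated by a space.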
I have a basic CSV that contains key/value pairs: the first two columns are the key and the third column is the value.
Example file1:
12389472,1,136-7402
23247984,1,136-7402
23247984,2,136-7402
34578897,1,136-7402
And in another file I have a list of keys that need their value changed in the first file. I'm trying to change the value to 136-7425
Example file2:
23247984,1
23247984,2
Here's what I'm currently doing:
/usr/xpg4/bin/awk '{FS=",";OFS=","}NR==FNR{a[$1,$2]="136-7425";next}{$3=a[$1,$2]}1' file2 file1 > output
Which is working but it's leaving the value blank for keys not found in file2. I'd like to only change the value for keys present in file2, and leave the current value for keys not found.
Can anyone point out what I'm doing wrong? Or perhaps there's an easier way to accomplish this.
Thanks!
Looks like you're just zapping the third field for keys that don't exist in the first file. Try this:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1,$2]="136-7425";next} ($1,$2) in a{$3=a[$1,$2]} 1' file2 file1 > output
or (see comments below):
awk 'BEGIN{FS=OFS=","} NR==FNR{seen[$1,$2]++;next} seen[$1,$2]{$3="136-7425"} 1' file2 file1 > output
(FS is set in a BEGIN block so that the very first input line is also split on commas.)
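With the sample files from the question, either command produces:
12389472,1,136-7402
23247984,1,136-7425
23247984,2,136-7425
34578897,1,136-7402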
FYI, an array named seen[] is also commonly used in a similar way to remove duplicate lines from input, e.g.:
awk '!seen[$0]++' file
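For example, with some illustrative input, this keeps only the first occurrence of each line:
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c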
This line should work for you (the trailing 7, like 1, is simply a true condition that triggers awk's default print action):
awk -F, -v OFS="," 'NR==FNR{a[$1,$2]=1;next}a[$1,$2]{$3="136-7425"}7' file2 file1
I have two files (let's say a.txt and b.txt), both of which contain a list of names. I have already run sort on both files.
Now I want to find lines from a.txt which are not present in b.txt.
(I spent a lot of time finding the answer to this question, so I'm documenting it here for future reference.)
The command you have to use is not diff but comm
comm -23 a.txt b.txt
By default, comm outputs 3 columns: left-only, right-only, both. The -1, -2 and -3 switches suppress these columns.
So, -23 hides the right-only and both columns, showing the lines that appear only in the first (left) file.
If you want to find lines that appear in both, you can use -12, which hides the left-only and right-only columns, leaving you with just the both column.
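A quick illustration with two small sorted files (illustrative data):
$ printf 'a\nb\nc\n' > a.txt
$ printf 'b\nc\nd\n' > b.txt
$ comm -23 a.txt b.txt
a
$ comm -12 a.txt b.txt
b
c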
The simple answer did not work for me because I didn't realize that comm matches lines occurrence for occurrence, so duplicate lines in one file will be printed as not existing in the other. For example, if file1 contained:
Alex
Bill
Fred
And file2 contained:
Alex
Bill
Bill
Bill
Fred
Then comm -13 file1 file2 would output:
Bill
Bill
In my case, I wanted to know only that every string in file2 existed in file1, regardless of how many times that line occurred in each file.
Solution 1: use the -u (unique) flag to sort:
comm -13 <(sort -u file1) <(sort -u file2)
Solution 2: (the first "working" answer I found) from unix.stackexchange:
fgrep -v -f file1 file2
Note that if file2 contains duplicate lines that don't exist at all in file1, fgrep will output each of the duplicate lines. Also note that my totally non-scientific tests on a single laptop for a single (fairly large) dataset showed Solution 1 (using comm) to be almost 5 times faster than Solution 2 (using fgrep).
I am not sure why it has been said that diff should not be used. I would use it to compare the two files and then output only the lines that are in the left file but not in the right one. diff flags such lines with <, so it suffices to grep for that symbol at the beginning of the line:
diff a.txt b.txt | grep '^<'
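If you also want to strip the '< ' marker from the output, a small variation with sed does it in one step:
diff a.txt b.txt | sed -n 's/^< //p'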
In case the files are not sorted yet, you can use:
comm -23 <(sort a.txt) <(sort b.txt)
I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forum posts suggesting grep, but I think the word lists are too big for that method.
If the two files are sorted (or you can sort them), you can use comm -1 -2 file1 file2 to print out the intersection.
You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2
Otherwise, you may need to give the join(1) command some additional instructions:
JOIN(1) BSD General Commands Manual JOIN(1)
NAME
join -- relational database operator
SYNOPSIS
join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2
DESCRIPTION
The join utility performs an ``equality join'' on the specified files and writes the result to the standard output. The ``join field'' is the field in each file by which the files are compared. The
first field in each line is used by default. There is one line in the output for each pair of lines in file1 and file2 which have identical join fields. Each output line consists of the join field,
the remaining fields from file1 and then the remaining fields from file2.
. . .
. . .
Presuming one word per line, I would use grep:
grep -xFf seta setb
-x matches whole lines only (no partial matches)
-F interprets the given patterns as literal strings (no regular expressions)
-f seta reads the patterns to search for from the file seta
setb is the file to be searched for the contents of seta
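For example, with two small word lists (illustrative data):
$ printf 'apple\nbanana\ncherry\n' > seta
$ printf 'banana\ncherry\ndate\n' > setb
$ grep -xFf seta setb
banana
cherry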
comm will do the same thing, but requires your sets to be pre-sorted:
comm -12 <(sort seta) <(sort setb)
grep -oP '[A-Za-z0-9]+' file1 | xargs -I {} grep -F {} file2 > file3
This extracts each word from file1, checks whether it appears anywhere in file2, and puts every matching line into file3. (It runs one grep per word, so for large lists grep -Ff file1 file2 does the same job in a single pass and is much faster.)
Back in the day I managed to find a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735