Intersection of two large word lists - dictionary

I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forum posts suggesting grep, but I suspect the word lists are too big for that approach.

If the two files are sorted (or you can sort them), you can use comm -1 -2 file1 file2 to print out the intersection.

You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2
Otherwise, you may need to give the join(1) command some additional instructions:
JOIN(1) BSD General Commands Manual JOIN(1)
NAME
join -- relational database operator
SYNOPSIS
join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2
DESCRIPTION
The join utility performs an ``equality join'' on the specified files and writes the result to the standard output. The ``join field'' is the field in each file by which the files are compared. The
first field in each line is used by default. There is one line in the output for each pair of lines in file1 and file2 which have identical join fields. Each output line consists of the join field,
the remaining fields from file1 and then the remaining fields from file2.
. . .
. . .
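To make the sort-then-join recipe concrete, here is a minimal sketch with throwaway file names (list_a and list_b are illustrative, not from the question):

```shell
# Two small unsorted word lists (sample data).
printf 'cherry\napple\nbanana\n' > list_a
printf 'banana\ndate\ncherry\n' > list_b

# join requires sorted input; sort each file in place with -o.
sort -o list_a list_a
sort -o list_b list_b

# Words present in both lists, one per line.
join list_a list_b > common
cat common
# prints:
# banana
# cherry
```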

Presuming one word per line, I would use grep:
grep -xFf seta setb
-x matches the whole lines (no partial matches)
-F interprets the given patterns literally (no regular expressions)
-f seta specifies the patterns to search
setb is the file to search for the contents of seta
comm will do the same thing, but requires your sets to be pre-sorted:
comm -12 <(sort seta) <(sort setb)
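Both commands should produce the same set; a minimal sketch with made-up contents for seta and setb (the process substitution requires bash or a similar shell):

```shell
# Two unsorted word lists (sample data).
printf 'dog\ncat\nbird\n' > seta
printf 'cat\nfish\ndog\n' > setb

# Fixed-string, whole-line matching: no sorting needed, output in setb order.
grep -xFf seta setb > out_grep

# comm needs sorted input; process substitution sorts on the fly.
comm -12 <(sort seta) <(sort setb) > out_comm

# Sorting the grep result shows the two approaches agree.
sort out_grep | diff - out_comm
```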

grep -oE '[A-Za-z0-9]+' file1 | xargs -I {} grep {} file2 > file3
This pulls each word out of file1, then checks whether it appears anywhere in file2, and writes the matching lines to file3. Note that this is a substring match rather than a whole-line match, and running one grep per word is slow on large lists.

Back in the days I managed to find a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735

Remove data in file1 against file2

This might be the worst example ever given on StackOverflow, but my purpose is to remove every line in File2 that matches a word in File1, ignoring case. For example, Cats#123:bob would be removed from File2 because the word Cat appears in File1. So regardless of case sensitivity, if a matching word is found, the entire line should be eradicated.
Input (File1):
Cat
Dog
Horse
Wheel
MainFile (File2)
Cats#123:bob
dog#1:truth
Horse-1:fairytale
Wheel:tremendous
Divination:maximus
Desired output
Divination:maximus
As the output shows, only "Divination:maximus" should be outputted as no matching words were found in File1. I prefer to use Sed or Awk generally as I use Cygwin. But any suggestions are welcomed, I can answer all questions you may have, thanks.
Here's what I've tried so far, but it isn't working: the wrong lines are being output. I'm fairly inexperienced, so I don't know how to develop this syntax further; maybe it's completely irrelevant to the job at hand.
grep -avf file1.txt file2.txt > output.txt
The grep command can do that for you:
grep -v -i -f file1 file2
The -f file1 tells grep to use the patterns in file1
The -i flag means case insensitive
The -v flag means to search lines that do not contain those patterns
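Applied to the sample files from the question (recreated here with printf):

```shell
# Recreate the question's input files.
printf 'Cat\nDog\nHorse\nWheel\n' > file1
printf 'Cats#123:bob\ndog#1:truth\nHorse-1:fairytale\nWheel:tremendous\nDivination:maximus\n' > file2

# Keep only the file2 lines containing none of the file1 words, ignoring case.
grep -v -i -f file1 file2
# prints: Divination:maximus
```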

How to use the output from sed command as a string and use that string in the grep command?

I have a file with only one string per row, "File1", and a file that has two strings per row, "File2". I want to go through File1 and, for every row/string, check in how many of the rows in File2 it appears.
I would like to use something like this:
grep '{sed -n 1p File1}' File2
But it doesn't work. Why and what should I do?
Thanks!
It seems to me that you're asking two completely different questions: one in the title and one in the body. I'll try to summarize them and answer both.
First question:
How to use the output from sed command as a string and use that string in the grep command?
First answer:
The unix way is to pipe a command's output as input to another program. If you instead want to use the first program's output as e.g. a parameter to the subsequent command (not as its input), there's a handy program that allows you to do just that: xargs.
Example of piping the sed output as input to grep:
sed -n 1p File1 | grep 'needle'
Example of using the sed output as a search string for grep:
sed -n 1p File1 | xargs -I{} grep {} File2
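A runnable sketch of the xargs form, with File1 and File2 recreated to match the question's shape (one string per row in File1, two per row in File2):

```shell
printf 'two\none\n' > File1
printf 'one apple\ntwo bananas\ntwo strawberries\n' > File2

# Take the first line of File1 ("two") and hand it to grep as a pattern.
sed -n 1p File1 | xargs -I{} grep {} File2 > matches
cat matches
# prints:
# two bananas
# two strawberries
```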
Second question:
I have a file with only one string per row, "File1", and a file that has two strings per row, "File2". I wan't go through File1 and check for every row/string in how many of the rows in File2 it appear.
Second answer:
awk 'NR==FNR{!a[$0]++;next} $1 in a{a[$1]++} END{for(i in a){print i" "a[i]-1}}' File1 File2
Test files:
==> File1 <==
one
two
three
==> File2 <==
one apple
two bananas
two strawberries
three kiwis
three pomegrenades
three lychees
Test run output:
three 3
two 2
one 1
That works only if the string from File1 has to appear as the first column of File2. For a more general approach, counting every File2 row that contains the string from File1:
awk 'NR==FNR{!a[$0]++;next} {for(i in a){if(match($0,i)){a[i]++}}} END{for(i in a){print i" "a[i]-1}}' File1 File2
Test files:
==> File1 <==
one
two
three
==> File2 <==
one one
two one
two two
three three
three two
three one
Test run output:
three 3
two 3
one 3
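One subtlety worth spelling out: the a[i]-1 in the END block compensates for the increment performed while reading File1, and awk's for (i in a) iterates in unspecified order, which is why the counts above may come out in any order. A runnable sketch that pins the order down by sorting:

```shell
printf 'one\ntwo\nthree\n' > File1
printf 'one apple\ntwo bananas\ntwo strawberries\nthree kiwis\n' > File2

# Count, per File1 string, how many File2 rows start with it; sort for stable output.
awk 'NR==FNR{!a[$0]++;next} $1 in a{a[$1]++} END{for(i in a){print i" "a[i]-1}}' File1 File2 | sort > counts
cat counts
# prints:
# one 1
# three 1
# two 2
```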

One liner required, pref UNIX-based, for variation of JOIN command

I need a one-liner (that I can put in a dos batch file), preferably using a unix command like AWK or JOIN. The function I need is essentially a more elaborate version of the following JOIN command:
join -j 1 -a 1 file1.txt file2.txt -t "^" > output.txt
[walkthrough: field separators are "^", the join key is the 1st field of both files, and I'm not exactly sure what the "-a 1" is doing, but it is sticking the bit-to-be-joined on the end of the row of the other file, which is what I want.]
Now, this one-liner works fine where both files are sorted and there is only one matching line in the 2nd file ... but I need it to try to match up to 4 lines in the 2nd file.
E.g.
file1:
12^blahblah
13^blahblahblahblah
14^blahblahblahblahblahblahblahblah
file2:
12^banana
12^orange
12^apple
13^potato
14^tomato
So I want the output like this:
12^blahblah^banana,orange,apple
13^blahblahblahblah^potato
14^blahblahblahblahblahblahblahblah^tomato
[Doesn't have to be a comma separating the new items]
You can try this awk command:
awk -F'^' 'NR==FNR{if($1 in a){a[$1]=a[$1]","$2} else {a[$1]=$2}} NR>FNR{print $0 "^" a[$1]}' file2 file1
The script fills an array a with the content of file2, then appends the stored value when printing each line of file1.
$ awk -F'^' 'NR==FNR{a[$1]=$0 FS;next} {a[$1] = a[$1] s[$1] $2; s[$1]=","} END{for (i in a) print a[i]}' file1 file2
12^blahblah^banana,orange,apple
13^blahblahblahblah^potato
14^blahblahblahblahblahblahblahblah^tomato
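For completeness, a runnable sketch of the first variant with shortened sample data; note that it preserves file1's line order, because it prints while reading file1:

```shell
# Shortened versions of the question's sample files.
printf '12^blahblah\n13^yadayada\n14^etcetc\n' > file1
printf '12^banana\n12^orange\n12^apple\n13^potato\n14^tomato\n' > file2

# First pass (file2): accumulate a comma-joined list per key.
# Second pass (file1): append that list to each row.
awk -F'^' 'NR==FNR{if($1 in a){a[$1]=a[$1]","$2} else {a[$1]=$2}} NR>FNR{print $0 "^" a[$1]}' file2 file1 > joined
cat joined
# prints:
# 12^blahblah^banana,orange,apple
# 13^yadayada^potato
# 14^etcetc^tomato
```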

Left outer join on two files in unix

I need to join two files on two fields. However, I should retrieve all the values in file 1 even if the join fails; it's like a left outer join.
File 1:
01|a|jack|d
02|b|ron|c
03|d|tom|e
File 2:
01|a|nemesis|f
02|b|brave|d
04|d|gorr|h
output:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||
It's join -t '|' file1 file2 -a1
Options used:
t: Delimiter.
a: Decides the file number from which the unpaired lines have to be printed.
join -t '|' file1 file2 -a2 would do a right outer join.
Sample Run
[aman#aman test]$ cat f1
01|a|jack|d
02|b|ron|c
03|d|tom|e
[aman#aman test]$ cat f2
01|a|nemesis|f
02|b|brave|d
04|d|gorr|h
[aman#aman test]$ join -t '|' f1 f2 -a1
01|a|jack|d|a|nemesis|f
02|b|ron|c|b|brave|d
03|d|tom|e
To do exactly what the question asks is a bit more complicated than the previous answer and requires something like this:
sed 's/|/:/2' file1 | sort -t: >file1.tmp
sed 's/|/:/2' file2 | sort -t: >file2.tmp
join -t':' file1.tmp file2.tmp -a1 -e'|' -o'0,1.2,2.2' | tr ':' '|'
Unix join can only join on a single field AFAIK so you must use files that use a different delimiter to "join two files on two fields", in this case the first two fields. I'll use a colon :, however if : exists in any of the input you would need to use something else, a tab character for example might be a better choice for production use. I also re-sort the output on the new compound field, sort -t:, which for the example input files makes no difference but would for real world data. sed 's/|/:/2' replaces the second occurrence of pipe with colon on each line in file.
file1.tmp
01|a:jack|d
02|b:ron|c
03|d:tom|e
file2.tmp
01|a:nemesis|f
02|b:brave|d
04|d:gorr|h
Now we use join output filtered by tr with a few more advanced options:
-t':' specify the interim colon delimiter
-a1 left outer join
-e'|' specifies the replacement string for failed joins, basically the final output delimiter N-1 times where N is the number of pipe delimited fields joined to the right of the colon in file2.tmp. In this case N=2 so one pipe character.
-o'0,1.2,2.2' specifies the output format:
0 join field
1.2 field 2 of file1.tmp, i.e. everything right of colon
2.2 field 2 of file2.tmp
tr ':' '|' Finally we translate the colons back to pipes for the final output.
The output now matches the question sample output exactly which the previous answer did not do:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||
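The whole pipeline, end to end, as a runnable sketch (with the join options moved before the file operands, which is more portable):

```shell
# Recreate the question's input files.
printf '01|a|jack|d\n02|b|ron|c\n03|d|tom|e\n' > file1
printf '01|a|nemesis|f\n02|b|brave|d\n04|d|gorr|h\n' > file2

# Fuse the first two fields into one join key by turning the 2nd pipe into a colon.
sed 's/|/:/2' file1 | sort -t: > file1.tmp
sed 's/|/:/2' file2 | sort -t: > file2.tmp

# Left outer join on the compound key, then restore the pipe delimiters.
join -t':' -a1 -e'|' -o'0,1.2,2.2' file1.tmp file2.tmp | tr ':' '|' > joined
cat joined
# prints:
# 01|a|jack|d|nemesis|f
# 02|b|ron|c|brave|d
# 03|d|tom|e||
```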
I recently had this issue with a very simple input file , just one field, hence no considerations of delimiters.
cat file1 > k1
cat file2 >> k1
sort k1 | uniq -c | grep -E "^ *1 "
will give you lines that occur in only one file (uniq -c left-pads the count with spaces, so anchoring on the leading spaces avoids also matching counts like 11 or 21, which "^.*1 " would).
This is a special case; it may not be applicable or comparable to the techniques posted above, but I'm putting it out there in case it's useful to someone who's looking for left outer joins (i.e. unmatched cases only). Grepping for "^ *2 " will give you matched cases. Note this assumes neither file contains internal duplicates; run sort -u on each file first if that isn't guaranteed. In case you have a multi-field file (the more common case) but you only care about a single join field, you can use awk to create a key-only file for each input and then proceed as above.
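A runnable sketch of the concatenate-and-count trick (file names and contents are illustrative; each file is assumed free of internal duplicates):

```shell
printf 'alpha\nbeta\n' > file1
printf 'beta\ngamma\n' > file2

# Concatenate, count occurrences; a count of 1 means the line is in only one file.
cat file1 file2 > k1
sort k1 | uniq -c | grep -E '^ *1 ' | awk '{print $2}' > only_once
cat only_once
# prints:
# alpha
# gamma
```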

compare two files in UNIX

I would like to compare two unsorted files, file1 and file2. I would like to compute file2 - file1 [the difference], irrespective of line order.
diff is not working.
I got the solution by using comm:
comm -13 file1 file2
will give you the lines of file2 that do not appear in file1 (the question's file2 - file1; use comm -23 for file1 - file2). The files need to be sorted first anyway.
Well, you can just sort the files first, and diff the sorted files.
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted
You can also filter the output to report lines in file2 which are absent from file1 (the second grep drops the unified-diff "+++" header line):
diff -u file1.sorted file2.sorted | grep "^+" | grep -v "^+++"
As indicated in comments, you in fact do not need to sort the files. Instead, you can use a process substitution and say:
diff <(sort file1) <(sort file2)
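Since the question asks for file2 - file1 specifically, comm combined with the same process substitution gives it directly; a minimal sketch (requires bash or a similar shell):

```shell
printf 'b\na\nc\n' > file1
printf 'c\na\nd\n' > file2

# Lines unique to file2: suppress column 1 (file1-only) and column 3 (common).
comm -13 <(sort file1) <(sort file2) > only_in_2
cat only_in_2
# prints: d
```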
There are 3 basic commands to compare files in unix:
cmp : This command compares two files byte by byte and reports the first mismatch (its byte and line number); if the files are identical, it prints nothing.
syntax: $ cmp file1 file2
comm : This command reports the lines unique to each file and the lines common to both (the input must be sorted).
diff
Easiest way: sort the files with sort(1) and then compare them with diff(1).
