Left outer join on two files in Unix

I need to join two files on two fields. However, I need to retrieve all the values in file 1 even if the join fails; it's like a left outer join.
File 1:
01|a|jack|d
02|b|ron|c
03|d|tom|e
File 2:
01|a|nemesis|f
02|b|brave|d
04|d|gorr|h
output:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||

It's join -t '|' file1 file2 -a1
Options used:
-t: the field delimiter.
-a: the file number whose unpaired lines should also be printed.
join -t '|' file1 file2 -a2 would do a right outer join.
Sample Run
[aman#aman test]$ cat f1
01|a|jack|d
02|b|ron|c
03|d|tom|e
[aman#aman test]$ cat f2
01|a|nemesis|f
02|b|brave|d
04|d|gorr|h
[aman#aman test]$ join -t '|' f1 f2 -a1
01|a|jack|d|a|nemesis|f
02|b|ron|c|b|brave|d
03|d|tom|e
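If joining on the first field alone is acceptable (it happens to be unique in this sample data), the standard -o and -e options can also shape the output into the requested format; a sketch, not a true two-field join:
join -t '|' -a1 -e '' -o '1.1,1.2,1.3,1.4,2.3,2.4' f1 f2
This should print:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||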

Doing exactly what the question asks is a bit more complicated than the previous answer and requires something like this:
sed 's/|/:/2' file1 | sort -t: >file1.tmp
sed 's/|/:/2' file2 | sort -t: >file2.tmp
join -t':' file1.tmp file2.tmp -a1 -e'|' -o'0,1.2,2.2' | tr ':' '|'
Unix join can only join on a single field, as far as I know, so to "join two files on two fields" you must first merge the two join fields into one using a different delimiter, in this case the first two fields. I'll use a colon (:); if : can occur in the input you would need something else, and a tab character might be a better choice for production use. I also re-sort on the new compound field (sort -t:), which makes no difference for the example files but would for real-world data. sed 's/|/:/2' replaces the second occurrence of the pipe with a colon on each line of the file.
file1.tmp
01|a:jack|d
02|b:ron|c
03|d:tom|e
file2.tmp
01|a:nemesis|f
02|b:brave|d
04|d:gorr|h
Now we run join, filtered through tr, with a few more advanced options:
-t':' specify the interim colon delimiter
-a1 left outer join
-e'|' specifies the replacement string for empty output fields from failed joins: essentially the final output delimiter repeated N-1 times, where N is the number of pipe-delimited fields to the right of the colon in file2.tmp. Here N=2, so one pipe character.
-o'0,1.2,2.2' specifies the output format:
0 join field
1.2 field 2 of file1.tmp, i.e. everything right of colon
2.2 field 2 of file2.tmp
tr ':' '|' Finally we translate the colons back to pipes for the final output.
The output now matches the question's sample output exactly, which the previous answer did not:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||
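The same recipe with a tab as the interim delimiter, as suggested above for production use; a sketch assuming GNU sed (for \t in the replacement) and bash (for the $'\t' quoting):
sed 's/|/\t/2' file1 | sort -t$'\t' -k1,1 > file1.tmp
sed 's/|/\t/2' file2 | sort -t$'\t' -k1,1 > file2.tmp
join -t$'\t' -a1 -e'|' -o'0,1.2,2.2' file1.tmp file2.tmp | tr '\t' '|'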

I recently had this issue with a very simple input file: just one field, so no delimiter considerations.
cat file1 > k1
cat file2 >> k1
sort k1 | uniq -c | grep "^.*1 "
will give you lines that occur in only 1 file
This is a special case, and it may not be applicable or comparable to the techniques posted above, but I'm putting it out there in case it's useful to someone looking for the unmatched rows of a left outer join (i.e. unmatched cases only). Grepping for "^.*2 " will give you the matched cases. If you have a multi-field file (the more common case) but only care about a single join field, you can use awk to create a key-only file for each input file and then process as above.
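For example, a sketch assuming '|'-delimited files joined on their first field (k1 and k2 are just placeholder names, and keys are assumed to contain no whitespace):
awk -F'|' '{print $1}' file1 | sort -u > k1
awk -F'|' '{print $1}' file2 | sort -u > k2
sort k1 k2 | uniq -c | awk '$1 == 1 {print $2}'
The sort -u keeps each key at most once per file, so a count of 1 means the key occurs in only one of the two files. If you only want keys present in file1 but missing from file2, comm -23 k1 k2 gives that directly.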

Related

Can sort command be used to sort file based on multiple columns in a csv file

We have a requirement where we have a csv file with the custom delimiter '||' (double pipes). We have 40 columns in the file and the file size is approximately between 400 and 500 MB.
We need to sort the file based on 2 columns, first on column 4 and then by column 17.
We found a command with which we can sort on one column, but we are not able to find a command that sorts on both columns.
Since the delimiter has 2 characters, we are using an awk command for sorting.
Command:
awk -F \|\| '{print $4}' abc.csv | sort > output.csv
Please advise.
If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:
sort -t'|' -k7 -k33 foo.csv
We sort by fields 7 (instead of 4) and then 33 (instead of 17) because of these extra empty fields. The formula that gives the new field number is simply 2*N-1 where N is the original field number.
If you do have | characters inside your fields a simple solution is to substitute them all by one unused character, sort, and restore the original ||. Example with tabs:
sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4 -k17 | sed 's/\t/||/g'
If tab is also used in your fields, choose any unused character instead. Form feed (\f) or the field separator (ASCII code 28, that is, replace the 3 \t with \x1c) are good candidates.
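Spelled out with that ASCII field separator, a sketch assuming GNU sed (for the \x escapes) and bash (for $'...'):
sed 's/||/\x1c/g' foo.csv | sort -t$'\x1c' -k4 -k17 | sed 's/\x1c/||/g'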
Using PROCINFO in gnu-awk, you can sort on a multi-character delimiter like this:
awk -F '\\|\\|' '{a[$4,$17] = $0} END {
PROCINFO["sorted_in"]="#ind_str_asc"; for (i in a) print a[i]}' file.csv
You could try the following awk code, written as per your shown attempt only. Set OFS to | (this puts | as the output field separator; if you want , or something else, change the OFS value accordingly) and also print the 17th field in the awk program, as per your requirement. In sort, use the 1st and 2nd fields as keys (because the 4th and 17th fields of the input have now become the 1st and 2nd fields, respectively, for sort).
awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1.1 -k2.1 > output.csv
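If you need the whole records in the sorted output rather than just the two key columns, a decorate-sort-undecorate sketch could look like this (assuming the data itself contains no tab characters; lexical, not numeric, ordering):
awk -F'\\|\\|' -v OFS='\t' '{print $4, $17, $0}' abc.csv | sort -t$'\t' -k1,1 -k2,2 | cut -f3- > output.csv
The two keys are prepended as tab-separated columns, sort orders on them, and cut strips them off again.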
The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).
If you need to be able to manipulate arbitrary CSV files, probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:
Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV
In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, fields within a record are comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV, which uses tabs instead of commas as delimiters.
Here is a simple Python script which sorts the above file on the second field.
import csv
import sys
with open("test.csv", "r") as csvfile:
csvdata = csv.reader(csvfile)
lines = [line for line in csvdata]
titles = lines.pop(0) # comment out if you don't have a header
writer = csv.writer(sys.stdout)
writer.writerow(titles) # comment out if you don't have a header
writer.writerows(sorted(lines, key=lambda x: x[1]))
Using sys.stdout for output is slightly unconventional; obviously, adapt to suit your needs. The Python csv library documentation is not primarily written for beginners, but it should not be impossible to figure out, and it's not hard to find examples of working code.
In Python, sorted() returns a copy of a list in sorted order. There is also sort() which sorts a list in-place. Both functions accept an optional keyword parameter to specify a custom sort order. To sort on the 4th and 17th fields, use
sorted(lines, key=lambda x: (x[3], x[16]))
(Python's indexing is zero-based, so [3] is the fourth element.)
To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, Python doesn't easily let you use a multi-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.
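For example, a hypothetical preprocessing step could swap || for the ASCII unit separator before the Python step (a sketch assuming GNU sed; \x1f is unlikely to occur in real data, and abc.us.csv is just a placeholder name):
sed 's/||/\x1f/g' abc.csv > abc.us.csv
You would then pass delimiter='\x1f' to the csv.reader() and csv.writer() calls.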

Merging(Joining) 2 huge flat files in Solaris, using an index column(first field)

I have 2 huge flat files in Unix (Solaris), each about 500-600 GB. I need to join and merge the 2 files into a single flat file using the first column, which is a key index column. How could I do it in an optimized way?
Basically it should be an inner join between the 2 flat files. The reason for trying flat files is that we have a huge table that has been split into 2 separate tables and extracted into 2 flat files, and I am trying to join them at the Unix level instead of at the database level.
I did use the below commands :
sort -n file1 > file_temp1;
sort -n file2 > file_temp2;
join -j 1 -t';' file_temp1 file_temp2 > Final
It works fine with sort, as the 1st field is the index column. However, when the join happens, I get hardly 2% of the data in the Final file. So I was just trying to understand what mistake I am making in the join command. Both files contain about .2 million records and all of the records match between the 2 files. I also want to check whether a join done at the Unix level performs better than one done at the database level. Sorry for the incomplete question! The first field is a numeric index field. Is there something like a "-n" switch to tell join that the first field is a numeric index?
You should not sort with -n, since join has no corresponding -n flag: join expects its inputs ordered lexically on the join field, not numerically. Just keep all the leading/trailing whitespace as it is and sort on the first ';'-separated field:
#!/bin/sh
sort -t';' -k 1 file1 > file1.srt
sort -t';' -k 1 file2 > file2.srt
join -t';' -1 1 -2 1 file1.srt file2.srt > both
#cat both
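One more thing worth ruling out when join silently drops most lines: sort and join must agree on collation. Forcing a single locale for both steps is a cheap safeguard on files this large; a sketch along the same lines as the script above:
export LC_ALL=C
sort -t';' -k 1,1 file1 > file1.srt
sort -t';' -k 1,1 file2 > file2.srt
join -t';' -1 1 -2 1 file1.srt file2.srt > both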

awk - Find duplicate entries in 2 columns, keep 1 duplicate and unique entries

I need to find duplicate entries across 2 different columns and keep only one of the duplicates plus all unique entries. For me, if A123 is in the first column and it shows up later in the second column, it's a duplicate. I also know for sure that A123 will always be paired with B123, either as A123,B123 or B123,A123. I only need to keep one and it doesn't matter which one it is.
Ex: My input file would contain:
A123,B123
A234,B234
C123,D123
B123,A123
B234,A234
I'd like the output to be:
A123,B123
A234,B234
C123,D123
The best I can do is to extract the unique entries with:
awk -F',' 'NR==FNR{x[$1]++;next}; !x[$2]' file1 file1
or get only the duplicates with
awk -F',' 'NR==FNR{x[$1]++;next}; x[$2]' file1 file1
Any help would be greatly appreciated.
This can be shorter!
Print the line if its second field has not yet been seen as a first field, then record the first field in the array. Only one pass over the input file is necessary:
awk -F, '!x[$2];{x[$1]++}' file1
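Run against the sample input above, this should print exactly the expected three lines:
$ awk -F, '!x[$2];{x[$1]++}' file1
A123,B123
A234,B234
C123,D123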
This awk one-liner works for your example:
awk -F, '!($2 in a){a[$1]=$0}END{for(x in a)print a[x]}' file
The conventional, idiomatic awk solution:
$ awk -F, '!seen[$1>$2 ? $1 : $2]++' file
A123,B123
A234,B234
C123,D123
By convention we always use seen (rather than x or anything else) as the array name when it represents a set whose indices we check for prior occurrence. Using a ternary expression to produce the larger of the two possible key values as the index ensures that the order in which they appear in the input doesn't matter.
The above doesn't care about your unique situation where every $2 is paired to a specific $1 - it simply prints unique individual occurrences across a pair of fields. If you wanted it to work on the pair of fields combined (and assuming you have more fields so just using $0 as the index wouldn't work) that'd be:
awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file

Need help parsing a file via UNIX commands

I have a file that has lines that look like this
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=ABCD,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID3:FIELD1=XYZ,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=PQRS,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=PQRS,&FIELD7-0&FIELD8-0;
I'm interested in only the lines that begin with LINEID1 and only some elements (FIELD1, FIELD2, FIELD4 and FIELD9) from those lines. The output should look like this (no & signs; they can be replaced with |):
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0;
FIELD1=PQRS|FIELD4-0|FIELD9-0;
If additional information is required, do let me know and I'll post it in an edit. Thanks!!
This is not exactly what you asked for, but no-one else is answering and it is pretty close, so it should get you started!
awk -F'[&:]' '/^LINEID1:/{print $2,$3,$5,$6}' OFS='|' file
Output
FIELD1=ABCD,|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ,|FIELD2-0|FIELD9-0|
FIELD1=PQRS,|FIELD3-1|FIELD9-0;|
The -F sets the Input Field Separator to colon or ampersand. Then it looks for lines starting LINEID1: and prints the fields you need. The OFS sets the Output Field Separator to the pipe symbol |.
Pure awk:
awk -F ":" ' /LINEID1[^0-9]/{gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2); gsub(/,*&+/,"|",$2); print $2} ' file
Updated to give proper formatting and to omit LINEID11, etc...
Output:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
Explanation:
awk -F ":" - split lines into LHS ($1) and RHS ($2) since output only requires RHS
/LINEID1[^0-9]/ - return only lines that match LINEID1, ignoring LINEID11, LINEID100, etc.
gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2) - remove all fields that aren't 1, 4 or 9 on the RHS
gsub(/,*&+/,"|",$2) - clean up the leftover delimiters on the RHS
To select rows from data with Unix command lines, use grep, awk, perl, python, or ruby (in increasing order of power & possible complexity).
To select columns from data, use cut, awk, or one of the previously mentioned scripting languages.
First, let's get only the lines with LINEID1 (assuming the input is in a file called input).
grep '^LINEID1' input
will output all the lines beginning with LINEID1.
Next, extract the columns we care about:
grep '^LINEID1' input | # extract lines with LINEID1 in them
cut -d: -f2 | # extract column 2 (after ':')
tr ',&' '\n\n' | # turn ',' and '&' into newlines
egrep 'FIELD[1249]' | # extract only fields FIELD1, FIELD2, FIELD4, FIELD9
tr '\n' '|' | # turn newlines into '|'
sed -e $'s/\\|\\(FIELD1\\)/\\\n\\1/g' -e 's/\|$//'
The last line inserts newlines in front of the FIELD1 lines, and removes any trailing '|'.
That last sed pattern is a little more challenging because sed does not accept a literal newline directly in its replacement text. To put one there, a bash $'...' escape is used, which then requires doubling the backslashes throughout that string.
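With GNU sed, which does accept \n in the replacement text, that last step can be written without the bash quoting trick; a sketch, since BSD sed would still need the escape dance:
sed -e 's/|\(FIELD1\)/\n\1/g' -e 's/|$//'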
Here's the output from the full pipeline above:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
This command took only a couple of minutes to cobble up.
Even so, it's bordering on the complexity threshold where I would shift to perl or ruby because of their excellent string processing.
The same script in ruby might look like:
#!/usr/bin/env ruby
#
while line = gets do
  if line.chomp =~ /^LINEID1:(.*)$/
    f1, others = $1.split(',')
    fields = others.split('&').map {|f| f if f =~ /FIELD[1249]/}.compact
    puts [f1, fields].flatten.join("|")
  end
end
Run this script on the same input file and the same output as above will occur:
$ ./parse-fields.rb < input
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;

Intersection of two large word lists

I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forums talking about using grep, however I think the word lists are too big for this method.
If the two files are sorted (or you can sort them), you can use comm -1 -2 file1 file2 to print out the intersection.
You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2
Otherwise, you may need to give the join(1) command some additional instructions:
JOIN(1) BSD General Commands Manual JOIN(1)
NAME
join -- relational database operator
SYNOPSIS
join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2
DESCRIPTION
The join utility performs an ``equality join'' on the specified files and writes the result to the standard output. The ``join field'' is the field in each file by which the files are compared. The
first field in each line is used by default. There is one line in the output for each pair of lines in file1 and file2 which have identical join fields. Each output line consists of the join field,
the remaining fields from file1 and then the remaining fields from file2.
. . .
. . .
Presuming one word per line, I would use grep:
grep -xFf seta setb
-x matches the whole lines (no partial matches)
-F interprets the given patterns literally (no regular expressions)
-f seta reads the patterns to search for from the file seta
setb is the file to search for the contents of seta
comm will do the same thing, but requires your sets to be pre-sorted:
comm -12 <(sort seta) <(sort setb)
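An awk alternative that needs no sorting at all, at the cost of holding the first list in memory; a sketch, one word per line assumed:
awk 'NR==FNR{a[$0]; next} $0 in a' seta setb
The first pass (NR==FNR) loads every line of seta as an array key; the second pass prints the lines of setb that are present in the array.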
grep -P '[ A-Za-z0-9]*' file1 | xargs -0 -I {} grep {} file2 > file3
I believe this looks for anything in file1, then checks if what was in file1 is in file2, and puts anything that matches into file3.
Back in the day I found a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735
