Command to perform full outer join with duplicate entries in key/join column - unix

I have three files. I need to join them based on one column and perform some transformations.
file1.dat (Column 1 is used for joining)
123,is1,ric1,col1,smbc1
123,is2,ric1,col1,smbc1
234,is3,ric3,col3,smbc2
345,is4,ric4,,smbc2
345,is4,,col5,smbc2
file2.dat (Column 1 is used for joining)
123,abc
234,bcd
file3.dat (Column 4 is used for joining)
r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8
r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8
r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8
r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8
Expected Output (output.dat)
123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3
123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3
I wrote the following awk command.
awk '
BEGIN {FS=OFS=","}
FILENAME == ARGV[1] { temp_join_one[$1] = $2"|"$3"|"$4"|"$5; next }
FILENAME == ARGV[2] { exchtbunload[$1] = $2; next }
FILENAME == ARGV[3] {
    s_temp_join_one = temp_join_one[$4];
    split(s_temp_join_one, array_temp_join_one, "|");
    v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
    v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
    v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
    v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
    v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
    v11=($2=""?"N/A":$2);
    print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 > "output.dat"
}
' file1.dat file2.dat file3.dat
I need to join all three files.
The final output file should have all the rows from file3 irrespective of whether their keys are in the other two files, and the corresponding columns should be empty (or N/A) when the key is not present in the other two files. (The order of the columns is not a big problem; I can use awk to rearrange them.)
But my problem is that, as the key is not unique, I am not getting the expected output: my output has only three lines.
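For reference, one way around this is to keep every file1 record per key and loop over them while reading file3. A minimal sketch of that fix, assuming ';' never occurs in the data (it also corrects the v11 line, where $2="" should be the comparison $2==""):
awk '
BEGIN { FS = OFS = "," }
FILENAME == ARGV[1] {                    # keep EVERY record per key (";" assumed absent from data)
    f1[$1] = ($1 in f1 ? f1[$1] ";" : "") $2 "|" $3 "|" $4 "|" $5
    next
}
FILENAME == ARGV[2] { f2[$1] = $2; next }
FILENAME == ARGV[3] {
    n = split(($4 in f1 ? f1[$4] : ""), recs, ";")
    if (n == 0) { n = 1; recs[1] = "" }  # key absent from file1: emit a single N/A record
    for (i = 1; i <= n; i++) {
        split(recs[i], a, "|")
        v3  = (a[1] == "" ? "N/A" : a[1])
        v4  = (a[2] == "" ? "N/A" : a[2])
        v5  = (a[4] == "" ? "N/A" : a[4])
        v6  = ($4 in f2 ? f2[$4] : "N/A")
        v9  = (a[3] == "" ? "N/A" : a[3])
        v11 = ($2 == "" ? "N/A" : $2)
        print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 > "output.dat"
    }
}
' file1.dat file2.dat file3.dat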
I tried to apply a previously suggested solution using join. It works with smaller files, but the files I have are close to 3-5 GB in size, and they are in numerical order, not lexicographical order. Sorting them looks like it would take a lot of time.
Any suggestion would be helpful.
Thanks in advance.

With join, assuming the files are sorted on the key:
$ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) \
file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3
123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3

I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:
(edit: Revised version based on new information about the data)
#!/bin/sh
# Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
file1=$1
file2=$2
file3=$3
rm -f scratch.db
sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)\K""(?=,|$)#N/A#g'
CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
CREATE TABLE file2(f2_1 INTEGER, f2_2);
CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
.import $file1 file1
.import $file2 file2
.import $file3 file3
-- Build indexes to speed up joining and sorting gigs of data.
CREATE INDEX file1_idx ON file1(f1_1);
CREATE INDEX file2_idx ON file2(f2_1);
CREATE INDEX file3_idx ON file3(f3_4);
SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
, f3_2, f3_3
FROM file3
LEFT JOIN file1 ON f1_1 = f3_4
LEFT JOIN file2 ON f2_1 = f3_4
ORDER BY f3_4;
EOF
rm -f scratch.db
Note: this will use a temporary database file that will grow to the size of all your data, and then some because of the indexes. If you're space-constrained, I have an idea for doing it without temporary files, given that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.

Related

Comparison of two CSV files

I'm trying to use Awk to compare the content of two large CSV files (more than 5000 rows each) but I can't get what I want.
Here is my problem:
I have a first file (File1) with a list of names ($1) and cities ($2) whose structure looks like this:
john;london
marc;paris
karen;new york
ben;london
vic;dublin
I have a second file (File2) with other information, in which some of the names from File1 appear in the third column ($3):
45456;windows;john;454646
47764;mac;zack;470093
41225;mac;ben;622101
12634;windows;ben;218996
7856;windows;karen;637294
12;mac;finn;878317
2315;windows;beverly;221167
445;windows;lilly;12316
3232;mac;john;601316
4546;mac;fish;305035
487;windows;vic;447421
46464;mac;karen;468154
I would like to extract from File2 all the lines whose names appear in File1, while adding the city associated with each name in File1. Here is an example of the result I am looking for:
45456;windows;john;454646;london
3232;mac;john;601316;london
7856;windows;karen;637294;new york
46464;mac;karen;468154;new york
41225;mac;ben;622101;london
12634;windows;ben;218996;london
487;windows;vic;447421;dublin
Could you help me?
Build an associative array from the first file, making the name the index and the city the value. For the second file, check whether the name appears in the array; if it does, print the line and append the city.
awk -F';' 'NR==FNR{a[$1]=$2}$3 in a{print $0";"a[$3]}' File1 File2
45456;windows;john;454646;london
41225;mac;ben;622101;london
12634;windows;ben;218996;london
7856;windows;karen;637294;new york
3232;mac;john;601316;london
487;windows;vic;447421;dublin
46464;mac;karen;468154;new york
With bash, GNU sort and GNU join:
join -t ';' -1 1 -2 3 <(sort File1) <(sort -t ';' -k 3 File2) -o 2.1,2.2,1.1,2.4,1.2
Output:
12634;windows;ben;218996;london
41225;mac;ben;622101;london
45456;windows;john;454646;london
3232;mac;john;601316;london
46464;mac;karen;468154;new york
7856;windows;karen;637294;new york
487;windows;vic;447421;dublin
Using csvkit
csvjoin -d ';' -H -c 3,1 File2 File1 | csvformat -D ';' | tail -n +2

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files, each with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the distinct (unique) variants,
the number of times each variant is repeated,
and the file(s) where it is located,
i.e.:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort and uniq would work, but I do not know where to start. Or, if RStudio or Python (3) is easier, I could try that.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
    {
        read header
        while read -ra fields; do
            variant=${fields[3]}        # use index "15" for the 16th column
            (( count[$variant] += 1 ))
            files[$variant]+=",$f"
        done
    } < "$f"
done
for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
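If a deterministic order is wanted, one option (a sketch, not part of the original script) is to pipe that final loop through sort, e.g. by descending count and then by variant name:
for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done | sort -t $'\t' -k2,2nr -k1,1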
Pure bash would be hard I think but everyone has some awk lying around :D
awk 'FNR==1 { next }
{
    ++n[$16]
    if ($16 in a) {
        a[$16] = a[$16] "," ARGV[ARGIND]
    } else {
        a[$16] = ARGV[ARGIND]
    }
}
END {
    printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
    for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *

compare two fields from two different files using awk

I have two files in which I want to compare certain fields and produce an output.
I also have a variable:
echo ${CURR_SNAP}
123
File1
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|RSCNAME1
DOMAIN2|USER2|LE2|ORG2|ACCES2|RSCTYPE2|RSCNAME2
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|RSCNAME3
DOMAIN4|USER4|LE4|ORG4|ACCES4|RSCTYPE4|RSCNAME4
File2
ORG1|PRGPATH1
ORG3|PRGPATH3
ORG5|PRGPATH5
ORG6|PRGPATH6
ORG7|PRGPATH7
The output I am expecting is shown below, where the last column is the CURR_SNAP value; the match condition is that the 4th column of File1 should equal the 1st column of File2.
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
I tried the code below, but it looks like I am not doing it correctly:
awk -v CURRSNAP="${CURR_SNAP}" '{FS="|"} NR==FNR {x[$0];next} {if(x[$1]==$4) print $1"|"$2"|"$3"|"$4"|"$5"|"$6"|"CURRSNAP}' File2 File1
With awk:
#! /bin/bash
CURR_SNAP="123"
awk -F'|' -v OFS='|' -v curr_snap="$CURR_SNAP" '{
    if (FNR == NR)
    {
        # this stores the ORG* as an index
        # here you can store other values if needed
        orgs_arr[$1] = 1
    }
    else if (orgs_arr[$4] == 1)
    {
        # overwrite $7 to contain the CURR_SNAP value
        $7 = curr_snap
        print
    }
}' file2 file1
Since your expected output doesn't include RSCNAME*, I have overwritten $7 (which is the RSCNAME* column) with the CURR_SNAP value. If you want to display the RSCNAME* column as well, remove $7=curr_snap and change the print statement to print $0, curr_snap.
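For reference, a sketch of that variant, keeping all seven original columns and appending the snapshot as an extra field:
awk -F'|' -v OFS='|' -v curr_snap="$CURR_SNAP" '
FNR == NR         { orgs_arr[$1] = 1; next }  # remember the ORG* keys from file2
orgs_arr[$4] == 1 { print $0, curr_snap }     # append CURR_SNAP instead of overwriting $7
' file2 file1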
I wouldn't use awk at all. This is what join(1) is meant for (plus sed to append the extra column):
$ join -14 -21 -t'|' -o 1.1,1.2,1.3,1.4,1.5,1.6 File1 File2 | sed "s/$/|${CURR_SNAP}/"
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
It does require that the files be sorted based on the common field, like your examples are.
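If they weren't, a sketch of the same pipeline with the sorting done inline via process substitution (bash assumed):
join -t'|' -1 4 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6 \
    <(sort -t'|' -k4,4 File1) <(sort -t'|' -k1,1 File2) |
    sed "s/$/|${CURR_SNAP}/"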
You can do this with awk using two rules. For the first file (where NR==FNR), simply use string concatenation to append fields 1 through NF-1, assigning the concatenated result to an array indexed by $4. Then, for the second file (where NR>FNR), rule two tests whether array[$1] has content and, if so, outputs it with "|" CURR_SNAP appended (with CURR_SNAP shortened to c in the example below and the array named a), e.g.
CURR_SNAP=123
awk -F'|' -v c="$CURR_SNAP" '
NR==FNR {
for (i=1;i<NF;i++)
a[$4]=i>1?a[$4]"|"$i:a[$4]$1
}
NR>FNR {
if(a[$1])
print a[$1]"|"c
}
' file1 file2
Example Use/Output
After setting the filenames to match yours, you can simply copy/middle-mouse-paste in your console to test, e.g.
$ awk -F'|' -v c="$CURR_SNAP" '
> NR==FNR {
> for (i=1;i<NF;i++)
> a[$4]=i>1?a[$4]"|"$i:a[$4]$1
> }
> NR>FNR {
> if(a[$1])
> print a[$1]"|"c
> }
> ' file1 file2
DOMAIN1|USER1|LE1|ORG1|ACCES1|RSCTYPE1|123
DOMAIN3|USER3|LE3|ORG3|ACCES3|RSCTYPE3|123
Look things over and let me know if you have further questions.

Grep to count occurrences of file A in file B

I have two files; each line of file A may occur in file B, and I would like to count, for each line in file A, how many times it occurs in file B. For example:
File A:
GAGGACAGACTACTAAAGCC
CTTGCCGCAGATTATCAGAG
CCAGCTTGATGTGTCCTGTG
TGATAGGCAGTGGAACACTG
File B:
NTCTTGAGGAAAGGACGAATCTGCGGAGGACAGACTACTAAAGCCGTTTGAGAGCTAGAACGAGCAAGTTAAGAGA
TCTTGAGGAAAGGACGAAACTCCGGAGGACAGACTACTAAAGCCGTTTTAGAGCTAGAAAGCGCAAGTTAAACGAC
NTCTTGAGGAAAGGACGAATCTGCGCTTGCCGCAGATTATCAGAGGTATGAGAGCTAGAACGAGCAAGTTAAGAGC
TCTTGAGGAAAGGACGAAAGTGCGCTTGCCGCAGATTATCAGAGGTTTTAGAGCTAGAAAGAGCAAGTTAAAATAA
GATCTAGTGGAAAGGACGATTCTCCGCTTGCCGCAGATTATCAGAGGTTGTAGAGCTAGAACTAGCAAGTGACAAG
ATCTTGAGGAAAGGACGAATCTGCGCTTGCCGCAGATTATCAGAGGTTTGAGAGCTAGAACTAGCAAGTTAATAGA
CGATCAAGTGGAAGGACGATTCTCCGTGATAGGCAGTGGAACACTGGATGTAGAGCTAGAAATAGCAAGTGAGCAG
ATCTAGAGGAAAGGACGAATCTCCGTGATAGGCAGTGGAACACTGGTATGAGAGCTAGAACTAGCAAGTTAATAGA
TCTTGAGGAAAGGACGAAACTCCGTGATAGGCAGTGGAACACTGGTTTTAGAGCTAGAAAGCGCAAGTTAAAAGAC
And the output should be File C:
2 GAGGACAGACTACTAAAGCC
4 CTTGCCGCAGATTATCAGAG
0 CCAGCTTGATGTGTCCTGTG
3 TGATAGGCAGTGGAACACTG
I would like to do this using grep, and I've tried a few variations of -c, -o and -f, but I can't seem to get the right output.
How can I achieve this?
Try this
for i in `cat a`; do echo "$i `grep $i -c b`"; done
In this case, if a line from file A occurs several times within one line of file B, it is counted as a single occurrence. If you want to count such occurrences too (without overlapping), use this:
for i in `cat a`; do printf $i; grep $i -o b | wc -l; done
And maybe this variant would be quicker
cat b | grep "`cat a`" -o | sort | uniq -c
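A related sketch (not in the original answer): since these patterns are plain strings, -F avoids regex interpretation, and quoting the variable while printing the count explicitly keeps the zero-count lines that the sort | uniq -c variant drops:
while read -r seq; do
    # count non-overlapping fixed-string occurrences of each pattern in b
    printf '%d %s\n' "$(grep -o -F -e "$seq" b | wc -l)" "$seq"
done < a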
#!/usr/bin/perl
open A, "A";                   # open file "A" to handle A
open B, "B";                   # open file "B" to handle B
chomp(@keys = <A>);            # read keys to array, strip line-feeds
@counts{@keys} = (0) x @keys;  # initialize hash counts for keys
while (<B>) {                  # iterate file handle B line by line
    foreach $k (@keys) {       # iterate keys array
        if (/$k/) {            # if key matches line
            $counts{$k}++;     # increase count for key by one
        }
    }
}
print "$counts{$_} $_\n" for (keys %counts);
Linux command to compare files:
comm FileA FileB
comm produces three-column output: column one contains lines unique to FileA, column two contains lines unique to FileB, and column three contains lines common to both files. Note that comm expects both input files to be sorted.

Remove all lines from file with duplicate value in field, including the first occurrence

I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.
I've sorted by the value in column 2, but can't figure out how to use uniq on just one field, as the values are not necessarily the same length.
Alternatively, I can remove lines with duplicates using an awk one-liner like
awk -F"[,]" '!_[$2]++'
but this retains the line containing the first occurrence of the repeated value in column 2.
As an example, if my data is
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
I would like to remove ALL lines (including the first) where b occurs in the second column.
Like this:
d,e,f
h,i,j
Thanks for any advice!!
If the order is not important then the following should work:
awk -F, '
!seen[$2]++ {
    line[$2] = $0
}
END {
    for (val in seen)
        if (seen[val] == 1)
            print line[val]
}' file
Output
h,i,j
d,e,f
Solution with grep:
grep -v -E '\b,b,\b' text.txt
Content of the file:
$ cat text.txt
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f
$ grep -v -E '\b,b,\b' text.txt
d,e,f
h,i,j
a,n,b
b,c,f
Hope it helps
Some different awk:
awk -F, '
BEGIN   { f = 0 }
FNR==NR { _[$2]++; next }
f==0 {
    f = 1
    for (j in _) if (_[j] > 1) delete _[j]
}
$2 in _
' file file
Explanation
The awk script passes through the file twice, which is why the filename appears twice at the end of the command. On the first pass (when FNR==NR), I count the number of times each column-2 value appears in the array _[]. At the end of the first pass, I delete every element of _[] that has been seen more than once. Then, on the second pass, I print the lines whose second field still appears in _[].
