I have a file as follows:
AKT3
ARRB1
ATF2
ATF4
BDNF
BRAF
C00076
C00165
TNF
TNFRSF1A
TP53
TRAF2
TRAF6
To me, it is perfectly sorted. Isn't it? I also have another file whose first column contains AKT3, BRAF, TRAF6, etc. Since that file is too big, I am not posting it here. However, after I type:
LANG=en_EN join -j 1 file2 file1 > output -t $'\t'
output file contains these lines:
TRAF6 0 genome...
TRAF6 0 genome...
TRAF6 0 genome...
TRAF6 0 genome...
I should see rows starting with AKT3, BRAF, etc. in this output as well, but there are only TRAF6 lines. What is the problem? How can I get the proper output? Thanks.
Edit: You can get the big file from this link:
https://www.dropbox.com/s/a2dmsq1tskpb9vg/sorted_mutation_data?dl=0
It is about 25 MB. I am sorry for this.
Edit (2): Let's say...
File1:
ADA
ADAM
BRUCE
GARY
File2:
AB 1
ABA 2
ABB 3
ADA 4
ADA 5
EVE 6
EVE 7
EVE 8
GARY 9
GARY 10
The output should be:
ADA 4
ADA 5
GARY 9
GARY 10
Edit: The problem was caused by non-printable ASCII characters that were hidden in the text. After removing them all, I could use "join".
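For anyone who hits the same thing: a minimal way to see and then strip such characters (assuming the files should contain only plain ASCII text) is something like:
cat -v file1                                      # non-printing bytes show up as ^X / M-X sequences
tr -cd '\11\12\15\40-\176' < file1 > file1.clean
The tr call keeps only tab, newline, carriage return and the printable ASCII range; file1.clean is just an illustrative name.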
So, I don't know what your environment is, but this works for me (I use an explicit sort to be sure it will work, and also to reveal what happens when you sort by the whole line in the default collating order instead of by an explicit field).
Note also that I have no -t $'\t' on the join command. If your second file has tab-separated fields then you'll need to express that correctly with a real tab character, put the option before the filenames, and make sure both files are sorted with the same key and field separator (see the sketch after the example run below).
#! /bin/sh
f1=$(mktemp -t jdata.XXXXXX)
f2=$(mktemp -t jdata.XXXXXX)
trap 'RC=$?; rm -f "$f1" "$f2"; exit $RC' 0 1 2 3 15
sort > $f1 <<__EOF__
ADA
ADAM
BRUCE
GARY
__EOF__
sort > $f2 <<__EOF__
AB 1
ABA 2
ABB 3
ADA 4
ADA 5
EVE 6
EVE 7
EVE 8
GARY 9
GARY 10
__EOF__
join -j 1 $f1 $f2
$ sh ./tjoin-multi.sh
ADA 4
ADA 5
GARY 10
GARY 9
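If the real data is tab-separated, a sketch of a corrected invocation (file1.sorted and file2.sorted are just illustrative names) would be along these lines:
TAB=$(printf '\t')
LC_ALL=C sort -t "$TAB" -k1,1 file1 > file1.sorted
LC_ALL=C sort -t "$TAB" -k1,1 file2 > file2.sorted
LC_ALL=C join -t "$TAB" -j 1 file1.sorted file2.sorted > output
That is: a literal tab passed to both sort and join, the -t option placed before the filenames, and both inputs sorted on the join field with the same separator and the same locale.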
I'm back again with another awk question.
I have multiple large files that contain data I want to dedupe against each other.
Let's say I have the following data for one month:
fruit number rand
apple 12 342
taco 19 264
tortilla 2234 53423
tortillas 2 3431
apricot 13221 23424
apricots 24234 3252
pineapple 2342 2342
radish 1 3
The following month I receive this data:
fruit number rand
pineapple 2 698
apple 34 472
taco 19 234
tortilla 16 58
tortillas 87 25
potato 234 2342
radish 1 55
grapes 9 572 422
apricot 13221 24
What I am trying to do is take the second file, and check the values of the first column to see if there are items that exist in the first file. If yes, I want to remove them from the second file, leaving only items that are unique to the second file with relation to the first one.
The desired outcome would leave me something like this:
fruit number rand DUPLICATE
pineapple 2 698 DUPE
apple 34 472 DUPE
taco 19 234 DUPE
tortilla 16 58 DUPE
tortillas 87 25 DUPE
potato 234 2342
radish 1 55 DUPE
grapes 9 572 422
apricot 13221 24 DUPE
Or, more clearly:
fruit number rand
potato 234 2342
grapes 9 572 422
I was trying to think of a way to do this without having to sort the files. I was trying to modify the answer from @karafka to a related question. Instead of passing the same file twice, I tried inputting the two different files. Obviously I'm doing something wrong.
awk 'BEGIN { FS = OFS = " " }
NR==FNR {a[$1]++; next}
FNR==1 {print $0, "DUPLICATE"; next}
$1 in a{if (a[$1]>1){print $(NF+1)="DUPE";delete a[$1]}}1' file{,}
I'm still learning awk; any help the community can provide is greatly appreciated. I'll try to explain what I think the above program does.
The first line sets the delimiter and the output delimiter to be a tab character.
This line reads the first file and stores an array with a count of how many times an item appears in the list.
This outputs the first line, which is essentially the header, appending "DUPLICATE" after the last item in the row.
(This is where I'm stuck) If the current value is found in the array "a" it should check if the stored value is greater than one. If yes, it should print the word "DUPE" in the last column. Finally it returns the entire line.
In the test files I keep getting everything marked as "DUPE" or nothing at all.
I've also thought of combining the files and deduping that way, but that would leave me with undesired left-over values from the first file.
What am I doing wrong?
I think what you're doing wrong is just trying to use a bunch of scripts that are unrelated to your current problem as your starting point.
It sounds like all you need is:
$ awk '
NR==FNR { file1[$1]; next }
FNR==1 || !($1 in file1)
' file1 file2
fruit number rand
potato 234 2342
grapes 9 572 422
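If you also want the intermediate form with the extra DUPLICATE column from your question, a small variation along the same lines should do it (a sketch, not a tested drop-in):
$ awk '
NR==FNR { file1[$1]; next }                          # 1st pass: remember every key in file1
FNR==1  { print $0, "DUPLICATE"; next }              # 2nd pass: extend the header line
        { print $0 ($1 in file1 ? OFS "DUPE" : "") } # append DUPE when the key exists in file1
' file1 file2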
How are the following sort commands in Unix different?
1) sort -k1,4 < file
2) sort -k1,1 -k4,4 < file
3) sort -k1,1 -k2,2 -k3,3 -k4,4 < file
Especially #1 and #2 are confusing.
For example, the following illustrates my point:
$ cat tmp
1 2 3 t
4 2 4 c
5 4 6 c
7 3 20 r
12 3 5 i
2 45 7 a
11 23 53 b
23 43 53 q
11 6 3 c
0 4 3 z
$ diff <(sort -k1,4 tmp) <(sort -k1,1 -k2,2 -k3,3 -k4,4 tmp)
1a2
> 1 2 3 t
5,6d5
< 1 2 3 t
< 23 43 53 q
7a7
> 23 43 53 q
$ diff <(sort -k1,4 tmp) <(sort -k1,1 -k4,4 tmp)
1a2
> 1 2 3 t
5,6d5
< 1 2 3 t
< 23 43 53 q
7a7
> 23 43 53 q
And I did look at sort's man page. It says:
-k, --key=POS1[,POS2]
start a key at POS1 (origin 1), end it at POS2 (default end of line)
But I don't understand this explanation. If a key starts at POS1 and ends at POS2, then shouldn't commands #1 and #3 above produce the same results?
The difference is that #1 treats everything from field 1 through field 4 (here, effectively the entire line) as a single key and sorts it lexicographically. The other two have multiple keys, and in particular, while #3 uses the same set of fields as #1, it does so in a very different way. It first sorts the lines by the first column (whitespace belongs to the following field and is significant unless you specify -b), and if two or more rows have an identical value in the first column, it uses the second key to sort that subset of rows. If two or more rows are identical in the first two columns, it uses the third key, and so on.
In your first case, depending on your locale, you can get different results (try LC_ALL=C sort -k1,4 < file and compare it to, for example, LC_ALL=en_US.utf8 sort -k1,4 < file).
In your second and third cases, the keys are split on transitions from non-whitespace to whitespace. This means the 2nd and following keys have varying-sized whitespace prefixes, which affect the sort order, since you don't specify -b.
Also, if you have a mix of spaces and tabs for lining up your columns, that could be messing with things.
I got the same results as you when I had LC_ALL=en_US.utf8 in my environment, but your expected results (i.e. no differences) using LC_ALL=C (SuSE Enterprise 11.2).
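One way to take both locale and leading blanks out of the picture, and to see exactly which part of each line a key covers, is something like this (GNU sort; --debug underlines the bytes actually compared for each key):
LC_ALL=C sort -b -k1,1 -k2,2 -k3,3 -k4,4 tmp
LC_ALL=C sort --debug -k1,4 tmp
With --debug it becomes obvious that -k1,4 is one key running from the start of field 1 to the end of field 4, whereas -k1,1 -k2,2 -k3,3 -k4,4 is four separate keys compared in turn.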
Let's say I have some tab-separated data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
And I want to sort it by the number of times a name occurs in the first column (max to min)
So we'd have Peter (3 occurrences) Joe (2 occurrences) and Laura (1 occurrence).
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
It only needs to be sorted by the first column, not the second. I've been reading sort's documentation, and I don't think it has that functionality. Anyone have an easy method?
Not sexy, but it works for your example:
awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' file file|sort -nr|sed -r 's/[0-9]* //'
test with your data:
kent$ cat n.txt
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
kent$ awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' n.txt n.txt|sort -nr|sed -r 's/[0-9]* //'
Peter 8
Peter 7
Peter 5
Joe 8
Joe 4
Laura 3
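Spelled out, that one-liner is doing roughly this (same commands, just split up and commented):
awk 'NR==FNR { a[$1]++; next }   # 1st pass over the file: count each name
     { print a[$1], $0 }         # 2nd pass: prefix every line with that count
' n.txt n.txt |
sort -nr |                       # sort on the count, highest first
sed -r 's/[0-9]* //'             # strip the temporary count column again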
This works:
for person in $(awk '{print $1}' file.txt | sort | uniq -c | sort -dnr | awk '{print $2}');
do grep -e "^$person[[:space:]]" file.txt;
done
Here's one way using GNU awk. Run like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
    FS="\t"
}
{
    # count how many times each name occurs, and collect that name's lines
    c[$1]++
    r[$1] = (r[$1] ? r[$1] ORS : "") $0
}
END {
    # build index strings of the form count SUBSEP name
    for (i in c) {
        a[c[i],i] = i
    }
    # asorti sorts those index strings (string comparison) and returns how many there are
    n = asorti(a)
    # pull the names back out, now ordered by ascending count
    for (i=1;i<=n;i++) {
        split(a[i], b, SUBSEP)
        x[++j] = b[2]
    }
    # print each name's saved lines, most frequent name first
    for (i=n;i>=1;i--) {
        print r[x[i]]
    }
}
Results:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
That's a surprisingly hard sort criterion. This code works, but it is pretty ugly:
data=${1:-data}
awk '{ print $1 }' $data |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'
It assumes bash for 'process substitution' (it is not available in a plain POSIX shell) but doesn't use any sorting built into awk (that's a GNU extension compared with POSIX awk). With an explicit temporary file, it can be made to work in shells without process substitution.
data=${1:-data} # File named on command line, or uses name 'data'
awk '{ print $1 }' $data | # List of names
sort | # Sorted list of names
uniq -c | # Count occurrences of each name
sort -k2 | # Sort in name order
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
# The process substitution numbers each record in sequence and sorts in name order
# The join matches the names (column 2) and outputs the frequency, record number, name, value
sort -k1,1nr -k3,3 -k2n | # Sort on frequency reversed, name, original line number
awk 'BEGIN{OFS="\t"} { print $3, $4 }' # Print name and value
Using GNU awk with a built-in sort, or Perl or Python, is probably better than this.
For the original data, the output is:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
Given this extended version of the data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
Peter 50
Joe 80
Peter 70
Peter 80
Joe 40
Laura 30
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Pater 50
Jae 80
Pater 70
Pater 80
Jae 40
Laura 30
The output is:
Peter 5
Peter 7
Peter 8
Peter 50
Peter 70
Peter 80
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Joe 8
Joe 4
Joe 80
Joe 40
Laura 3
Laura 30
Laura 30
Pater 50
Pater 70
Pater 80
Jae 80
Jae 40
The -k3,3 sort term is necessary for this data set; it sorts Laura's entries before Pater's entries (when omitted, you get those two lists interleaved).
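Since GNU awk can sort array traversal itself, the whole frequency sort can also be sketched directly in gawk with PROCINFO["sorted_in"] (gawk-only; names with equal counts come out in an unspecified order relative to each other):
gawk '
{
    cnt[$1]++                                        # how many times this name has been seen
    lines[$1] = (lines[$1] ? lines[$1] ORS : "") $0  # collect the lines for each name
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"          # traverse cnt by value, biggest count first
    for (name in cnt)
        print lines[name]
}' data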
Here's another one using awk:
awk '{ a[ $1, ++b[$1] ]=$0 ; if(b[$1]>max) max=b[$1] }
END{ for(x=max;x>=1;x--)
for( k in b )
if( a[k,x] )
for(y=1;y<=x;y++) {
print a[k,y]
delete a[k,y]
}
}' filename
It works fine with gawk and POSIX awk. The presence of three loops in the END statement might affect performance with big files.
I am using
join -1 2 -2 2 file1.txt file2.txt > file3.txt
to join my two text files based on their second column and write the result to file3.txt, which works perfectly. However, I do not want file3.txt to contain the common field. Googling and join's man page suggest that the -o format option could help accomplish this, but how exactly should I go about it?
Assuming that each file only has two columns, and you want to join on the second column but show only the first columns of each file in your output, use
join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt > file3.txt
Remember that your two files should be sorted on the second column before joining.
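A sketch of the full sequence (assuming whitespace-separated columns):
sort -k2,2 file1.txt -o file1.txt
sort -k2,2 file2.txt -o file2.txt
join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt > file3.txt
sort -o lets you sort a file in place, so the originals are simply replaced with versions ordered on the second column.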
An example run:
$ cat file1.txt
2 1
3 2
7 2
8 4
2 6
$ cat file2.txt
3 1
5 4
9 9
$ join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt
2 3
8 5
I have 3 tsv files containing different data on my employees. I can join these data with the last name and first name of the employees, which appear in each file.
I would like to gather all the data for each employee in only one spreadsheet.
(I can't just copy/paste the columns because some employees are not in file number 2, for example, but will be in file number 3.)
So I think (I am a beginner) a script could do that: for each employee (a row), gather as much data as possible from the files into a new tsv file.
Edit.
Example of what I have (in reality I have approximately 300 rows in each file; some employees are not in all files).
file 1
john hudson 03/03 male
mary kate 34/04 female
harry loup 01/01 male
file 2
harry loup 1200$
file3
mary kate atlanta
What I want :
column1 column2 column3 column4 column5 column6
john hudson 03/03 male
mary kate 34/04 female atlanta
harry loup 01/01 male 1200$
It would help me a lot!
Use this python script:
import sys, re
r = []
i = 0
res = []
for f in sys.argv[1:]:
    r.append({})
    for l in open(f):
        a, b = re.split('\s+', l.rstrip(), 1)
        r[i][a] = b
        if i == 0:
            res += [a]
    i += 1
for l in res:
    print l, " ".join(r[k].get(l, '-') for k in range(i))
The script loads each file into a dictionary (the first column is used as the key).
Then it iterates over the first-column values of the first file and prints the corresponding values from each file's dictionary ('-' when a file has no entry for that key).
Example of usage:
$ cat 1.txt
user1 100
user2 200
user3 300
$ cat 2.txt
user2 2200
user3 2300
$ cat 3.txt
user1 1
user3 3
$ python 1.py [123].txt
user1 100 - 1
user2 200 2200 -
user3 300 2300 3
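If you would rather stay in the shell, the same idea (key on the first column, keep the order of the first file, print - for missing values) can be sketched with GNU awk; ARGIND is gawk-only, and the file names are just the ones from the example above:
awk '
{
    key = $1
    sub(/^[^ \t]+[ \t]+/, "")            # strip the key off the front of the line
    val[key, ARGIND] = $0                # remember the rest of the line for this key and file
    if (ARGIND == 1) order[++n] = key    # keep the row order of the first file
}
END {
    for (i = 1; i <= n; i++) {
        line = order[i]
        for (f = 1; f <= ARGIND; f++)    # one column per input file
            line = line "\t" ((order[i], f) in val ? val[order[i], f] : "-")
        print line
    }
}' 1.txt 2.txt 3.txt
With the three example files this prints the same rows as the Python script, but tab-separated.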
If you're familiar with SQL, then you can use the Perl DBD::CSV module to do the job easily. But that also depends on whether you're comfortable writing Perl.