print and match rows where two columns match between two files - unix

I want to match columns 1 and 2 of the file for_matching against files named */*.file in different directories under the path, and print the entire row where those two columns match.
*/*.file (example)
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
for_matching
carrot 124555
output
carrot 124555 1 2 6
Right now I can just match column 1 between the two.
for i in */*.file; do awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' $i for_matching > $i.matched; done

Using awk
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
Test Results:
$ cat file
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
$ cat for_matching
carrot 124555
$ awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
carrot 124555 1 2 6
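Broken out with comments, the same one-liner reads like this (identical logic, just spread over lines):
awk '
  # First pass (FNR==NR is true only while reading for_matching):
  # remember every (column1, column2) pair as an array key.
  FNR==NR { arr[$1,$2]; next }
  # Second pass: print any line of the data file whose first two
  # columns form a key that was stored above.
  ($1,$2) in arr
' for_matching file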
The same works with multiple files; there is no need for ls */*.file:
#!/usr/bin/env bash
for i in */*.file; do
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching "$i" >"$i.matched"
done

That is so simple that you could just:
$ grep -F -w -f for_matching file
carrot 124555 1 2 6
For a limitation see @karakfa's comment below: the patterns can match anywhere in the line, not only against the first two columns. That could of course be circumvented with (:
$ cat file
carrot 124555 1 2 6
1 carrot 124555 1 2 6
$ grep -w -f <(sed 's/^/^/g' for_matching) file
carrot 124555 1 2 6

Related

Unix. Change separator between 2 specific columns that have a different separator

I have a huge .txt file A with 6 columns. The first 5 columns are separated by a tab, but columns 5 and 6 are separated by a space. Some columns have missing elements. Using Unix, how can I generate another file B.txt that has columns 5 & 6 separated by a tab?
A.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
B.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
Use column -t inputFile:
$ column -t A.txt > B.txt
If the tabular form using column doesn't work, one can be explicit about the tab using awk:
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' A.txt > B.txt
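If the number of columns varies from row to row, a shorter variant (a sketch using the standard "rebuild the record" awk idiom rather than listing every field) would be:
$ awk -v OFS='\t' '{$1=$1}1' A.txt > B.txt
Assigning $1 to itself forces awk to rebuild the record with the output field separator, so every run of spaces or tabs between fields becomes a single tab.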
You can try with GNU sed:
sed 's/[[:blank:]]/\t/g' A.txt > B.txt

AWK, Unix command: How to match two files using corresponding first column in unix command

I have two files. The first has a single column of IDs (with repeats). The second has three columns; its first column holds the same IDs, but each appears only once. I want to print, for every ID in the first file, the corresponding remaining two columns of the second file.
Example:
First file:
IDs
1
3
6
7
11
13
13
14
18
20
Second file:
IDs Freq Status
1 1 JD611
2 1 QD51
3 2
5
6
7 2
11 2
13 2
14 2
Desired OUTPUT
1 1 JD611
3 2
6
7 2
11 2
13 2
13 2
14 2
18
20
You can use this awk:
awk 'NR==FNR{a[$1]=$2 FS $3; next} {print $1, a[$1]}' f2 f1
To skip the header line:
awk 'FNR==1{next} NR==FNR{a[$1]=$2 FS $3; next} {print $1, a[$1]}' f2 f1
If the second file has multiple columns:
awk 'NR==FNR{c=$1; $1=""; a[c]=$0; next} {print $1, a[$1]}' f2 f1
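If you need both at once (skip the headers and keep every remaining column), the two ideas combine; a sketch, untested here:
awk 'FNR==1{next}                        # skip the header line of each file
     NR==FNR{c=$1; $1=""; a[c]=$0; next} # store the rest of the line keyed by ID
     {print $1 a[$1]}' f2 f1
Printing $1 a[$1] without a comma avoids the doubled space that $1="" would otherwise leave behind.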

How to count words in url

I'm given a data set which contains several URLs like:
http://dbpedia.org/resource/Duncan_Trussell
http://dbpedia.org/resource/Ruth_Mary_Reynolds
http://dbpedia.org/resource/Abbas_Noyan
http://dbpedia.org/resource/John_E._Dowdell
http://dbpedia.org/resource/John_F._Du
Basically, I need to make a script which shows the names of the people; the result should look like this:
First name
John 2
Abbas 1
Duncan 1
Ruth 1
Last Name
Trussell 1
Reynolds 1
Noyan 1
Dowdell 1
Du 1
grep -Po 'resource/\K[^_]*(?=_)' urls.txt | uniq -c | awk '{print $2" "$1}'
Output:
Duncan 1
Ruth 1
Abbas 1
John 2
grep -Po '_\K[^_]*$' urls.txt | uniq -c | awk '{print $2" "$1}'
Output:
Trussell 1
Reynolds 1
Noyan 1
Dowdell 1
Du 1
If you want, you can replace | uniq -c | awk '{print $2" "$1}' with | awk '{count[$1]++}END{for(j in count) print j, count[j]}'.
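Spelled out for the first names, that replacement looks like this:
grep -Po 'resource/\K[^_]*(?=_)' urls.txt | awk '{count[$1]++} END{for (j in count) print j, count[j]}'
Unlike uniq -c, the awk version does not require duplicate names to be adjacent, so no extra sort is needed.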
All in awk
Split the last field on _.
Increment counts for the first and last elements of the split in two associative arrays.
Then at the end just print the names and their respective counts.
awk -F/ '{t=split($NF,a,"_");b[a[1]]++;c[a[t]]++}
END{print "Firstnames";for(i in b)print i,b[i]
print "\nLastnames";for(j in c) print j,c[j]
}' test
Output:
Firstnames
Duncan 1
Ruth 1
John 2
Abbas 1
Lastnames
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
Here is another awk:
awk -F"/|_" '{a[$5]++;b[$NF]++} END {for (i in a) print i,a[i];print "";for (i in b) print i,b[i]}' file
Duncan 1
Ruth 1
John 2
Abbas 1
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
This should be much faster than the multi-command grep | uniq | awk pipeline.
With GNU awk for true 2-D arrays:
$ gawk -F'/' '
{
nr = split($NF,names,/_/)
cnt["First"][names[1]]++
cnt["Last"][names[nr]]++
}
END {
for (pos in cnt) {
print pos, "name"
for (name in cnt[pos]) {
print name, cnt[pos][name]
}
}
}
' file
Last name
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
First name
Duncan 1
Ruth 1
John 2
Abbas 1

Sort a column by number of identical occurrences - using awk, sort, tr or uniq?

Let's say I have some tab-separated data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
And I want to sort it by the number of times a name occurs in the first column (max to min).
So we'd have Peter (3 occurrences), Joe (2 occurrences), and Laura (1 occurrence):
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
It only needs to be sorted by the first column, not the second. I've been reading sort's documentation, and I don't think it has this functionality. Anyone have an easy method?
Not sexy, but it works for your example:
awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' file file|sort -nr|sed -r 's/[0-9]* //'
test with your data:
kent$ cat n.txt
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
kent$ awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' n.txt n.txt|sort -nr|sed -r 's/[0-9]* //'
Peter 8
Peter 7
Peter 5
Joe 8
Joe 4
Laura 3
This works:
for person in $(awk '{print $1}' file.txt | sort | uniq -c | sort -dnr | awk '{print $2}');
do grep -e "^$person[[:space:]]" file.txt;
done
Here's one way using GNU awk. Run like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
FS="\t"
}
{
c[$1]++
r[$1] = (r[$1] ? r[$1] ORS : "") $0
}
END {
for (i in c) {
a[c[i],i] = i
}
n = asorti(a)
for (i=1;i<=n;i++) {
split(a[i], b, SUBSEP)
x[++j] = b[2]
}
for (i=n;i>=1;i--) {
print r[x[i]]
}
}
Results:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
That's a surprisingly hard sort criterion. This code works, but it is pretty ugly:
data=${1:-data}
awk '{ print $1 }' $data |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'
It assumes a shell with 'process substitution' (bash, ksh, zsh) but doesn't use any sorting built into awk (that's a GNU extension compared with POSIX awk). With an explicit temporary file, it can be made to work in shells without process substitution; a sketch follows the commented version below.
data=${1:-data} # File named on command line, or uses name 'data'
awk '{ print $1 }' $data | # List of names
sort | # Sorted list of names
uniq -c | # Count occurrences of each name
sort -k2 | # Sort in name order
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
# The process substitution numbers each record in sequence and sorts in name order
# The join matches the names (column 2) and outputs the frequency, record number, name, value
sort -k1,1nr -k3,3 -k2n | # Sort on frequency reversed, name, original line number
awk 'BEGIN{OFS="\t"} { print $3, $4 }' # Print name and value
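A rough sketch of that temporary-file variant (same pipeline, assuming mktemp is available):
data=${1:-data}
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
awk '{ print NR, $0 }' "$data" | sort -k2 > "$tmp"   # numbered copy of the input, sorted by name
awk '{ print $1 }' "$data" |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - "$tmp" |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'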
Using GNU awk with a built-in sort (see the sketch after the extended example below), or Perl or Python, is probably better than this.
For the original data, the output is:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
Given this extended version of the data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
Peter 50
Joe 80
Peter 70
Peter 80
Joe 40
Laura 30
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Pater 50
Jae 80
Pater 70
Pater 80
Jae 40
Laura 30
The output is:
Peter 5
Peter 7
Peter 8
Peter 50
Peter 70
Peter 80
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Joe 8
Joe 4
Joe 80
Joe 40
Laura 3
Laura 30
Laura 30
Pater 50
Pater 70
Pater 80
Jae 80
Jae 40
The -k3,3 sort term is necessary for this data set; it sorts Laura's entries before Pater's entries (when omitted, you get those two lists interleaved).
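For reference, the GNU awk route mentioned above could look roughly like this; a sketch assuming gawk 4.0+ for PROCINFO["sorted_in"], and note that ties between equally frequent names come out in an arbitrary order, unlike the -k3,3 handling above (data is the input file):
gawk '
{ cnt[$1]++; rows[$1] = ($1 in rows ? rows[$1] ORS : "") $0 }
END {
    PROCINFO["sorted_in"] = "@val_num_desc"   # iterate cnt by value, largest count first
    for (name in cnt)
        print rows[name]
}' data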
Here's another one using awk:
awk '{ a[ $1, ++b[$1] ]=$0 ; if(b[$1]>max) max=b[$1] }
END{ for(x=max;x>=1;x--)
for( k in b )
if( a[k,x] )
for(y=1;y<=x;y++) {
print a[k,y]
delete a[k,y]
}
}' filename
It works fine with gawk and POSIX awk. The presence of three loops in the END statement might affect performance with big files.

Using unix join -o to not print the common field

I am using join -1 2 -2 2 file1.txt file2.txt > file3.txt to join my two text files based on their second column and write the result to file3.txt, which works perfectly. However, I do not want file3.txt to contain the common field. Googling and join's man page suggest that the -o format option could help accomplish this, but how exactly should I go about doing so?
Assuming that each file only has two columns, and you want to join on the second column but show only the first columns of each file in your output, use
join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt > file3.txt
Remember that your two files should be sorted on the second column before joining.
An example run:
$ cat file1.txt
2 1
3 2
7 2
8 4
2 6
$ cat file2.txt
3 1
5 4
9 9
$ join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt
2 3
8 5
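As a side note, each item in the -o list has the form FILENUM.FIELD, and a plain 0 stands for the join field itself, so if you ever do want to keep the key alongside the first column of each file, something like this should work:
$ join -1 2 -2 2 -o 0,1.1,2.1 file1.txt file2.txt
which prints the common second column followed by the first column of file1 and of file2.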
