Fastest way to match column from one file to another - unix

I have two files, each with over a million lines. I want to match the first column of file_1 against the first column of file_2 and print the matching lines of file_2.
file_1
SNP_A-2131660 1 0.0021
SNP_A-1967418 2 0.0005
SNP_A-1969580 2 0.0011
SNP_A-4263484 1 0.0024
SNP_A-1978185 0 0.0014
SNP_A-4264431 0 0.0038
SNP_A-1980898 2 0.0050
SNP_A-1983139 0 0.0011
file_2
"AFFX-SNP_10000979" "rs4147951"
"AFFX-SNP_10009702" "rs2022235"
"SNP_A-2131660" "rs6425720"
"AFFX-SNP_10021569" "rs12997193"
"AFFX-SNP_10026879" "rs9933410"
"AFFX-SNP_10029725" "rs7142489"
"AFFX-SNP_10034687" "rs1350088"
matches.txt
"SNP_A-2131660" "rs6425720"
Right now I am doing the following, but it is very slow. Is there a faster way?
awk '{print $1}' file_1 | while read -r a; do grep -H $a file_2; done >> matches.txt

You could try the following awk:
awk 'FNR==NR{a[$1]=$0;next} {val=$1;gsub(/\"/,"",val)} (val in a)' file_1 file_2
If you want to redirect the above command's output into an output file, you can append > output.txt (or whatever name you prefer) to the end of the command.
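For reference, the same one-liner written out with comments looks roughly like this; it loads the first column of file_1 into an array in a single pass and then scans file_2 once, instead of running one grep over file_2 for every SNP as the original loop does:
awk '
FNR==NR {                # first pass: reading file_1
    a[$1] = $0           # remember every SNP id from column 1
    next
}
{                        # second pass: reading file_2
    val = $1
    gsub(/"/, "", val)   # strip the surrounding double quotes
}
val in a                 # print the file_2 line if the bare id was seen in file_1
' file_1 file_2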

Using awk, and setting the field separator FS='"' after reading the first file:
awk 'FNR==NR{a[$1];next}$2 in a' file1 FS='"' file2
# OR
awk -v q='"' 'FNR==NR{a[q $1 q];next}$1 in a' file1 file2
It should be faster than the other solutions posted so far, since it does not sort the files, replace double quotes, or join them.
Test Results :
$ cat f1
SNP_A-2131660 1 0.0021
SNP_A-1967418 2 0.0005
SNP_A-1969580 2 0.0011
SNP_A-4263484 1 0.0024
SNP_A-1978185 0 0.0014
SNP_A-4264431 0 0.0038
SNP_A-1980898 2 0.0050
SNP_A-1983139 0 0.0011
$ cat f2
"AFFX-SNP_10000979" "rs4147951"
"AFFX-SNP_10009702" "rs2022235"
"SNP_A-2131660" "rs6425720"
"AFFX-SNP_10021569" "rs12997193"
"AFFX-SNP_10026879" "rs9933410"
"AFFX-SNP_10029725" "rs7142489"
"AFFX-SNP_10034687" "rs1350088"
$ awk 'FNR==NR{a[$1];next}$2 in a' f1 FS='"' f2
"SNP_A-2131660" "rs6425720"
# OR
$ awk -v q='"' 'FNR==NR{a[q $1 q];next}$1 in a' f1 f2
"SNP_A-2131660" "rs6425720"

One possible tool is join, but you need some preprocessing and some post-processing:
sort < file1 > file1_sorted
sort < file2 | sed 's/^"\([^"]*\)"/\1/' > file2_sorted
join file1_sorted file2_sorted -1 1 -2 1
Gives:
SNP_A-2131660 1 0.0021 "rs6425720"
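If you also want to drop the quotes that remain around the rs column, one possible post-processing step is a final sed pass over the join output (it simply deletes every double quote):
join file1_sorted file2_sorted -1 1 -2 1 | sed 's/"//g'
which prints:
SNP_A-2131660 1 0.0021 rs6425720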

Related

Unix. Change separator between 2 specific columns that have a different separator

I have a huge .txt file A with 6 columns. The first 5 columns are separated by a tab, but columns 5 and 6 are separated by a white space. Some columns have missing elements. Using Unix, how can I generate another file B.txt that has columns 5 & 6 separated by a tab?
A.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
B.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
Use column -t inputFile
$ column -t A.txt > B.txt
If the tabular form using column doesn't work, one can be explicit about the tab using awk:
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' A.txt > B.txt
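A shorter variant, assuming every run of spaces or tabs should simply become a single tab, is to let awk rebuild the record with a tab as the output field separator:
$ awk -v OFS='\t' '{$1=$1; print}' A.txt > B.txt
Like the explicit print above, this treats any run of blanks as one separator, so genuinely empty fields are collapsed.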
You can also try GNU sed:
sed 's/[[:blank:]]/\t/g' A.txt > B.txt

Unix command to replace exact match

I want to replace an exactly matching string in a tabular file. My file looks somewhat like this:
I 1 10 100
II 2 20 200
III 3 30 300
IX 9 100 1000
I want the output file to look like this:
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
I have tried this sed command with no success:
sed -i -e 's/"III"/chrIII/g' -e 's/"II"/chrII/g' -e 's/"I"/chrI/g' filename
Please identify the mistake. The file is large; can awk be used for the same?
sed approach
sed -E -i -e 's/\b(I|II|III|IX)\b/chr\1/g' input
DEMO OUTPUT:
~$ sed -e 's/\b\(I\|II\|III\|IX\)\b/chr\1/g' input
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
POSIX version (no |, no \b):
sed -e 's/^I\{1,3\}[[:blank:]]/chr&/' -e 's/^IX[[:blank:]]/chr&/' input > output
In your command:
sed -i -e 's/"III"/chrIII/g' -e 's/"II"/chrII/g' -e 's/"I"/chrI/g' filename
"III" searches for the literal text "III" (including the quotes), not III.
The g flag is not an error, but it is unnecessary here (the numeral occurs only once per line in your sample; is it repeated in the real file?).
Be careful: a search for I will also match inside II and III if it is not delimited (hence the ^ anchor at the start and the [[:blank:]] required after it).
awk approach:
awk '$1~/^[IXV]*$/{$1="chr"$1}1' OFS='\t' file > tmp && mv tmp file
file contents after modification:
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000
$1="chr"$1 - prepends the chr prefix to the first field of each line
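If you have GNU awk 4.1 or later (an assumption about your environment), the temporary-file step can be avoided with gawk's inplace extension:
gawk -i inplace -v OFS='\t' '$1~/^[IXV]*$/{$1="chr"$1}1' file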
To put "chr" in front of the first column, either of the following works on the sample input; the first prepends chr to the whole line and preserves the original spacing, while the second rebuilds the record, so runs of whitespace collapse to single spaces.
awk '{print "chr"$0}' file
awk '{$1="chr"$1}1' file
chrI 1 10 100
chrII 2 20 200
chrIII 3 30 300
chrIX 9 100 1000

How to count words in url

I'm given a data set which contains several URLs like:
http://dbpedia.org/resource/Duncan_Trussell
http://dbpedia.org/resource/Ruth_Mary_Reynolds
http://dbpedia.org/resource/Abbas_Noyan
http://dbpedia.org/resource/John_E._Dowdell
http://dbpedia.org/resource/John_F._Du
Basically, I need to make a script which shows the names of the people; the result should look like this:
First name
John 2
Abbas 1
Duncan 1
Ruth 1
Last Name
Trussell 1
Reynolds 1
Noyan 1
Dowdell 1
Du 1
grep -Po 'resource/\K[^_]*(?=_)' urls.txt | uniq -c | awk '{print $2" "$1}'
Output:
Duncan 1
Ruth 1
Abbas 1
John 2
grep -Po '_\K[^_]*$' urls.txt | uniq -c | awk '{print $2" "$1}'
Output:
Trussell 1
Reynolds 1
Noyan 1
Dowdell 1
Du 1
If you want, you can replace | uniq -c | awk '{print $2" "$1}' with | awk '{count[$1]++} END{for(j in count) print j, count[j]}'; the awk version also counts correctly when equal names are not adjacent, since uniq -c only counts consecutive duplicates.
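Alternatively, if you prefer to keep uniq, sorting first makes the counts correct even when the same name appears on non-adjacent lines (a sketch using the same urls.txt input):
grep -Po 'resource/\K[^_]*(?=_)' urls.txt | sort | uniq -c | awk '{print $2" "$1}'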
All in awk:
Split the last field on _.
Count the first and last elements of the split in associative arrays.
At the end, print the names and their respective counts.
awk -F/ '{t=split($NF,a,"_");b[a[1]]++;c[a[t]]++}
END{print "Firstnames";for(i in b)print i,b[i]
print "\nLastnames";for(j in c) print j,c[j]
}' test
output
Firstnames
Duncan 1
Ruth 1
John 2
Abbas 1
Lastnames
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
Here is another awk, splitting on both / and _, so the first name ends up in field 5 and the last name in the last field:
awk -F"/|_" '{a[$5]++;b[$NF]++} END {for (i in a) print i,a[i];print "";for (i in b) print i,b[i]}' file
Duncan 1
Ruth 1
John 2
Abbas 1
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
This should be much faster than the multi-command grep | uniq | awk pipeline.
With GNU awk for true 2-D arrays:
$ gawk -F'/' '
{
    nr = split($NF,names,/_/)
    cnt["First"][names[1]]++
    cnt["Last"][names[nr]]++
}
END {
    for (pos in cnt) {
        print pos, "name"
        for (name in cnt[pos]) {
            print name, cnt[pos][name]
        }
    }
}
' file
Last name
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
First name
Duncan 1
Ruth 1
John 2
Abbas 1

Diff options in Unix

I am comparing two files containing a large array of numbers using 'diff'. The files look identical except some numbers have minus signs instead of plus signs. Is there an option to ignore a character like '-' and then compare the files?
Just to clarify:
If file 1 is :
1 -1 5 -3
and the other is
1 1 5 3
I want the return to say they are identical.
Thanks!
You could use sed to take out the - characters before diffing:
sed s/-//g f1 | diff - f2
Example:
$ cat f1
1 -1 5 -3
$ cat f2
1 1 5 3
$ sed s/-//g f1 | diff - f2
$ echo $?
0
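If both files may contain minus signs, you can strip them from both sides before comparing, for example with bash process substitution and tr (which deletes every - character):
diff <(tr -d '-' < f1) <(tr -d '-' < f2)
An exit status of 0 then means the files are identical apart from the signs.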

How to write a unix filter that outputs only a line every N lines

Suppose the filter is fed these lines on standard input:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
It would be nice if someone could tell me how to write a script that prints only one line out of every 4; for the example input above:
line 1
line 5
line 9
$ yes | cat -n | head -10 | awk 'NR % 4 == 1'
1 y
5 y
9 y
That is, your answer is "awk 'NR % 4 == 1'".
awk '{ if ((NR-1) % 4 == 0) print }'
awk 'NR%4 == 1 {print}' < /etc/hosts
Replace 4 by whatever value you want of course.
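If you need different step sizes regularly, you could wrap the awk in a small shell function (a sketch; the name every is made up):
$ every() { awk -v n="$1" 'NR % n == 1'; }
$ seq 10 | every 4
1
5
9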
sed -ne '1~4p'
(GNU sed. Not tested on OSX, etc.)
