I am using
join -1 2 -2 2 file1.txt file2.txt > file3.txt
to join my two text files on their second column and write the result to file3.txt, which works perfectly. However, I do not want file3.txt to contain the common field. Googling and join's man page suggest that the -o format option could help accomplish this, but how exactly should I go about it?
Assuming that each file has only two columns, and you want to join on the second column but show only the first column of each file in your output, use
join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt > file3.txt
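Each entry in the -o list has the form FILE.FIELD, so 1.1 is the first field of file1.txt and 2.1 is the first field of file2.txt. Remember that both files must be sorted on the second column before joining.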
An example run:
$ cat file1.txt
2 1
3 2
7 2
8 4
2 6
$ cat file2.txt
3 1
5 4
9 9
$ join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt
2 3
8 5
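If the inputs are not already sorted, the sorting step mentioned above can be sketched as follows. This is a minimal sketch, assuming made-up sample data and temporary file names that are not from the question; LC_ALL=C is set so that sort and join agree on the collation order.

```shell
# Sketch: sort both inputs on the join field (column 2) before joining.
# The sample rows and the *.sorted file names are hypothetical.
tmp=$(mktemp -d)
export LC_ALL=C   # keep sort and join on the same collation order

printf '%s\n' '8 4' '2 1' '7 2' > "$tmp/file1.txt"   # deliberately unsorted
printf '%s\n' '5 4' '9 9' '3 1' > "$tmp/file2.txt"

sort -k2,2 "$tmp/file1.txt" > "$tmp/file1.sorted"
sort -k2,2 "$tmp/file2.txt" > "$tmp/file2.sorted"

# -o 1.1,2.1 prints only column 1 of each file, leaving out the join field.
joined=$(join -1 2 -2 2 -o 1.1,2.1 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Only the keys 1 and 4 occur in both sample files, so two lines are printed.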
I have a huge .txt file A with 6 columns. The first 5 columns are separated by tabs, but columns 5 and 6 are separated by a space. Some columns have missing elements. Using Unix tools, how can I generate another file B.txt that has columns 5 and 6 separated by a tab?
A.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
B.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
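Use column -t inputFile, which lines the columns up as a table (note that by default it pads with spaces rather than tabs):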
$ column -t A.txt > B.txt
If the tabular form produced by column doesn't work, you can be explicit about the tabs using awk (the file has six columns, so only $1 through $6 are printed):
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' A.txt > B.txt
You can try GNU sed, which understands \t in the replacement:
sed 's/[[:blank:]]/\t/g' A.txt > B.txt
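Since some columns have missing elements, it may matter that awk's default field splitting collapses runs of whitespace, so an empty tab-delimited column disappears and later columns shift left. A character-for-character translation avoids that. Below is a minimal sketch using tr, assuming the only spaces in the file are the ones between columns 5 and 6; the two sample rows (including the value "note") are made up.

```shell
# Sketch: turn every space into a tab without re-splitting the fields.
# Row 2 has an empty third column to show that empty fields survive.
tmp=$(mktemp -d)
printf '1\t886620\trs201016776\tT\tA 1:886620-TG\n' >  "$tmp/A.txt"
printf '1\t879576\t\tC\tT note\n'                   >> "$tmp/A.txt"

tr ' ' '\t' < "$tmp/A.txt" > "$tmp/B.txt"

# Count tab-separated fields per line; both rows should report 6.
field_counts=$(awk -F'\t' '{ print NF }' "$tmp/B.txt")
first_row=$(head -n 1 "$tmp/B.txt")
printf '%s\n' "$field_counts"
rm -rf "$tmp"
```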
I want to match columns 1 and 2 from the file for_matching against files in different directories, found via the path */*.file, and print each entire row whose first two columns match.
*/*.file (example)
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
for_matching
carrot 124555
output
carrot 124555 1 2 6
Right now I can just match column 1 between the two.
for i in */*.file; do awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' $i for_matching > $i.matched; done
Using awk
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
Test Results:
$ cat file
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
$ cat for_matching
carrot 124555
$ awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
carrot 124555 1 2 6
The same works for multiple files; the glob in the for loop makes ls */*.file unnecessary:
#!/usr/bin/env bash
for i in */*.file; do
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching "$i" >"$i.matched"
done
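For readers unfamiliar with the FNR==NR idiom, here is the same one-liner spread out with comments. It is a sketch on made-up sample data; the array name seen and the extra row "carrot 9922 3 3 3" are illustrative additions that show both columns have to match.

```shell
# Sketch: two-column lookup with awk, commented step by step.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
carrot 124555 1 2 6
hair 9922 2 2 2
carrot 9922 3 3 3
EOF
printf 'carrot 124555\n' > "$tmp/for_matching"

matched=$(awk '
    # While reading the first file, FNR (per-file line number) equals NR
    # (overall line number): remember each (column 1, column 2) pair.
    FNR == NR { seen[$1, $2]; next }
    # For every later file: print rows whose first two columns form a
    # remembered pair (a pattern without an action prints the line).
    ($1, $2) in seen
' "$tmp/for_matching" "$tmp/file")
printf '%s\n' "$matched"
rm -rf "$tmp"
```

Only the first sample row is printed, because the third row shares column 1 but not column 2 with the entry in for_matching.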
That is so simple that you could just:
$ grep -F -w -f for_matching file
carrot 124555 1 2 6
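There is a limitation (see @karakfa's comment below): the patterns can match anywhere in a line, not only in its first two columns.
That could of course be circumvented by anchoring each pattern to the start of the line: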
$ cat file
carrot 124555 1 2 6
1 carrot 124555 1 2 6
$ grep -w -f <(sed 's/^/^/g' for_matching) file
carrot 124555 1 2 6
I have two files. The first has a single column of IDs (with repeats). The second has three columns; its first column holds the same IDs as the first file, but each ID appears only once. For every ID in the first file, I want to print the remaining two columns of the matching row in the second file.
Example:
First file:
IDs
1
3
6
7
11
13
13
14
18
20
Second file:
IDs Freq Status
1 1 JD611
2 1 QD51
3 2
5
6
7 2
11 2
13 2
14 2
Desired OUTPUT
1 1 JD611
3 2
6
7 2
11 2
13 2
13 2
14 2
18
20
You can use this awk:
awk 'NR==FNR{a[$1]=$2 FS $3; next} {print $1, a[$1]}' f2 f1
To skip the header line,
awk 'FNR==1{next} NR==FNR{a[$1]=$2 FS $3; next} {print $1, a[$1]}' f2 f1
If the second file has more columns,
awk 'NR==FNR{c=$1; $1=""; a[c]=$0; next} {print $1, a[$1]}' f2 f1
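The lookup above can be tried on a small sample as follows. This is a sketch with made-up data; the variable line and the trailing-blank cleanup are additions of mine, not part of the answer, and only tidy up rows where the second file has no value to contribute.

```shell
# Sketch: look up each ID of f1 in f2 and print the remaining columns.
tmp=$(mktemp -d)
printf '%s\n' 1 3 13 13 18 > "$tmp/f1"
printf '%s\n' '1 1 JD611' '2 1 QD51' '3 2' '13 2' > "$tmp/f2"

result=$(awk '
    # First file on the command line (f2): store columns 2 and 3 per ID.
    NR == FNR { a[$1] = $2 FS $3; next }
    # Second file (f1): print the ID plus whatever was stored for it,
    # then strip trailing blanks left behind by empty columns.
    { line = $1 FS a[$1]; sub(/ +$/, "", line); print line }
' "$tmp/f2" "$tmp/f1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

IDs that are repeated in f1 (13 here) are printed once per occurrence, and IDs missing from f2 (18 here) are printed on their own.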
I am having problems trying to create a table containing a master list of names that have been matched and counted in two separate groups.
The Input_list.txt contains a master list of names and looks like this:
-5S_rRNA
-7SK
-ABCA8
-AC002480.4
-AC002978.1
-RP11-129B22.2
These names have been grep'd and counted in two separate data groups; group1_data.txt and group2_data.txt and look like this:
group1_data.txt
-5S_rRNA 20
-7SK 25
-AC002480.4 1
-AC002978.1 2
group2_data.txt
-5S_rRNA 1
-ABCA8 1
I would like to create a table that contains the master Input_list.txt and the 2 data.txt files with the matched names and corresponding counts. If there isn't a match, I would like to return a value of 0 and to look like this:
Input group1 group2
5S_rRNA 20 1
7SK 25 0
ABCA8 0 1
AC002480.4 1 0
AC002978.1 2 0
The number of matched names is not the same in Input_list.txt and the two data.txt files.
I've tried sort but I'm really stuck. Any suggestions would be great!
Using join:
join -e 0 -a 1 -o '1.1 2.2' Input_list.txt group1_data.txt | \
join -a 1 -e 0 -o '1.1 1.2 2.2' - group2_data.txt | \
sed '/ 0 0$/d'
Prints:
-5S_rRNA 20 1
-7SK 25 0
-ABCA8 0 1
-AC002480.4 1 0
-AC002978.1 2 0
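Note that join expects all of its inputs to be sorted on the join field. If the files are not sorted yet, the pipeline can be preceded by a sort step, as in this sketch. The sample rows are a shortened, shuffled version of the data in the question, and the *.sorted file names and the header line are assumptions of mine.

```shell
# Sketch: sort every input on column 1, then run the two joins.
tmp=$(mktemp -d)
export LC_ALL=C   # keep sort and join on the same collation order

# Shortened sample data, deliberately not in sorted order.
printf '%s\n' 7SK 5S_rRNA RP11-129B22.2 ABCA8 > "$tmp/Input_list.txt"
printf '%s\n' '7SK 25' '5S_rRNA 20' > "$tmp/group1_data.txt"
printf '%s\n' 'ABCA8 1' '5S_rRNA 1' > "$tmp/group2_data.txt"

for f in Input_list group1_data group2_data; do
    sort -k1,1 "$tmp/$f.txt" > "$tmp/$f.sorted"
done

# -a 1 keeps names without a match, -e 0 fills their missing count with 0,
# and the final sed drops names that are absent from both groups.
table=$(join -e 0 -a 1 -o '1.1 2.2' "$tmp/Input_list.sorted" "$tmp/group1_data.sorted" |
    join -e 0 -a 1 -o '1.1 1.2 2.2' - "$tmp/group2_data.sorted" |
    sed '/ 0 0$/d')
printf 'Input group1 group2\n%s\n' "$table"
rm -rf "$tmp"
```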
I am comparing two files containing a large array of numbers using 'diff'. The files look identical except that some numbers have minus signs instead of plus signs. Is there an option to ignore a character like '-' and then compare the files?
Just to clarify:
If file 1 is :
1 -1 5 -3
and the other is
1 1 5 3
I want the return to say they are identical.
Thanks!
You could use sed to take out the - characters before diffing:
sed s/-//g f1 | diff - f2
Example:
$ cat f1
1 -1 5 -3
$ cat f2
1 1 5 3
$ sed s/-//g f1 | diff - f2
$ echo $?
0
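If minus signs can appear in either file, the same idea can be applied to both sides before comparing. The following is a sketch with made-up contents; the *.unsigned file names and the verdict variable are hypothetical.

```shell
# Sketch: strip '-' from both files, then let diff's exit status decide.
tmp=$(mktemp -d)
printf '1 -1 5 -3\n' > "$tmp/f1"
printf '1 1 -5 3\n'  > "$tmp/f2"

sed 's/-//g' "$tmp/f1" > "$tmp/f1.unsigned"
sed 's/-//g' "$tmp/f2" > "$tmp/f2.unsigned"

# diff exits with status 0 when the two inputs have the same content.
if diff "$tmp/f1.unsigned" "$tmp/f2.unsigned" > /dev/null; then
    verdict=identical
else
    verdict=different
fi
echo "$verdict"
rm -rf "$tmp"
```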