Diff options in Unix

I am comparing two files containing a large array of numbers using 'diff'. The files look identical except that some numbers have minus signs instead of plus signs. Is there an option to ignore a character like '-' when comparing the files?
Just to clarify:
If file 1 is:
1 -1 5 -3
and the other is
1 1 5 3
I want the return to say they are identical.
Thanks!

You could use sed to take out the '-' characters before diffing:
sed 's/-//g' f1 | diff - f2
Example:
$ cat f1
1 -1 5 -3
$ cat f2
1 1 5 3
$ sed 's/-//g' f1 | diff - f2
$ echo $?
0
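If the second file might contain minus signs as well, you can strip them from both files using process substitution (bash/zsh). Note this also deletes hyphens that aren't number signs; the second variant, a sketch, deletes '-' only when it precedes a digit:
diff <(sed 's/-//g' f1) <(sed 's/-//g' f2)
diff <(sed 's/-\([0-9]\)/\1/g' f1) <(sed 's/-\([0-9]\)/\1/g' f2)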

Related

Most efficient way to subset a file by a list of text patterns to match

I have a large, tab-delimited file (technically a VCF of genetic variants), call it file.vcf, with millions of lines that look something like this:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus2 1 10 0 0/1,21,2,2,;0
locus3 1 2 0 0/1,21,2,1,;0
...
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
I'd like to subset this original file to include all lines from loci in another file (search-file.txt). For example, if search-file.txt were:
locus1
locus3
locus123929
Then the final file would be:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus3 1 2 0 0/1,21,2,1,;0
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
What is the most efficient way to subset a file this large using either bash or R? (Note: reading the entire file into memory, as in R, is very slow and often crashes the system.)
I'd use awk:
awk -F'\t' '
NR == FNR { a[$0]; next }  # first file: remember every locus name
$1 in a                    # second file: print lines whose first field was seen
' search-file.txt file.vcf > filtered_file
A pure bash loop would be too slow for this job.
Note: Make sure the file search-file.txt doesn't have DOS line endings.
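If it does, strip the carriage returns first, for example (the output filename here is arbitrary):
tr -d '\r' < search-file.txt > search-file.clean.txt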
As an alternative to the awk-only approach,
LC_ALL=C sort search-file.txt file.vcf |
awk '
NF == 1 { loc = $1; next }  # one-field lines come from search-file.txt
$1 == loc                   # print data lines whose locus matches
' > filtered_file
but this version may disturb the original order of lines.
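A grep-based sketch can also work, but only if a locus name can never occur anywhere else on a line: -w matches whole words in the entire line, not just the first column, so verify that assumption on your data:
grep -wFf search-file.txt file.vcf > filtered_file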

Unix. Change separator between 2 specific columns that have a different separator

I have a huge .txt file A with 6 columns. The first 5 columns are separated by tabs, but columns 5 and 6 are separated by a single space. Some columns have missing elements. Using Unix, how can I generate another file B.txt that has columns 5 and 6 separated by a tab?
A.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
B.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
Use column -t:
$ column -t A.txt > B.txt
If the tabular form using column doesn't work, one can be explicit about the tab using awk:
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' A.txt > B.txt
You can try GNU sed:
sed 's/[[:blank:]]/\t/g' A.txt > B.txt
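If you'd rather convert only the space between columns 5 and 6 and leave everything else alone, here is an awk sketch: because the first five columns are tab-separated, that space lives inside the fifth tab-delimited field, so replacing the first space in $5 with a tab is enough:
awk 'BEGIN{FS=OFS="\t"} {sub(/ /, "\t", $5)} 1' A.txt > B.txt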

Awk command to print from 3rd column to the nth column

How do I print from the 3rd column through the last column using awk, if a file has n columns? I can get this with cut, but I need an awk command. I tried awk -F" " '{for(i=3;i<=NF;i++) print $i}', and I get output, but it is not in the correct format. Can anyone suggest the proper command?
Combining Ed Morton's answers in:
Print all but the first three columns
delete a column with awk or sed
We get something like this:
awk '{sub(/^(\S+\s*){2}/,""); sub(/(\s*\S+){2}$/,"")}1'
# ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^
# remove 2 first cols remove 2 last cols
Which you can adapt to your exact needs in terms of columns.
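For instance, a parameterized sketch (gawk, since \S and \s are GNU extensions; the variable names s and e are my own) where the number of leading and trailing columns to drop is passed in on the command line:
awk -v s=2 -v e=2 '{sub("^(\\S+\\s*){"s"}",""); sub("(\\s*\\S+){"e"}$","")}1'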
See an example given this input:
$ paste -d ' ' <(seq 5) <(seq 2 6) <(seq 3 7) <(seq 4 8) <(seq 5 9)
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
Let's just print the 3rd column:
$ awk '{sub(/^(\S+\s*){2}/,""); sub(/(\s*\S+){2}$/,"")}1' <(paste -d ' ' <(seq 5) <(seq 2 6) <(seq 3 7) <(seq 4 8) <(seq 5 9))
3
4
5
6
7
Your attempt was close, but it prints each column on its own line.
To correct this we create a variable called line and initialize it to an empty string. On the first pass through the loop we just assign the column to line. From that point on we append the field separator and the next column to line. Finally, we print line. This happens for each line in the file.
awk '{line="";for(i=3;i<=NF;i++) if(i==3) line=$i; else line=line FS $i; print line}'
In this example I assume awk's default field separator. Also, any line with fewer than three fields will print as a blank line.
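An equivalent sketch using printf instead of accumulating a string, printing a separator after every field except the last (note that here a line with fewer than three fields produces no output at all, not even a blank line):
awk '{for(i=3;i<=NF;i++) printf "%s%s", $i, (i<NF ? OFS : ORS)}'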
Assuming your fields are space-separated then with GNU awk for gensub():
$ cat file
a b c d e f
g h i j k l
$ awk '{print gensub(/(\S\s){2}/,"",1)}' file
c d e f
i j k l
In general, to print from, say, field 3 to field 5 when the fields are blank-separated, using GNU awk again with gensub():
$ awk '{print gensub(/(\S\s){2}((\S\s){2}\S).*/,"\\2",1)}' file
c d e
i j k
or the 3rd arg to match():
$ awk 'match($0,/(\S\s){2}((\S\s){2}\S)/,a){print a[2]}' file
c d e
i j k
or in general if they are separated by any single character:
$ awk '{print gensub("([^"FS"]"FS"){2}(([^"FS"]"FS"){2}[^"FS"]).*","\\2",1)}' file
c d e
i j k
$ awk 'match($0,"([^"FS"]"FS"){2}(([^"FS"]"FS"){2}[^"FS"])",a){print a[2]}' file
c d e
i j k
If the fields are separated by a string instead of a single character but the RS is a single character, then you can temporarily change the separators from FS to RS (which by definition you KNOW can't be present in the record) so you can negate RS in the bracket expressions:
$ cat file
aSOMESTRINGbSOMESTRINGcSOMESTRINGdSOMESTRINGeSOMESTRINGf
gSOMESTRINGhSOMESTRINGiSOMESTRINGjSOMESTRINGkSOMESTRINGl
$ awk -F'SOMESTRING' '{gsub(FS,RS)} match($0,"([^"RS"]"RS"){2}(([^"RS"]"RS"){2}[^"RS"])",a){gsub(RS,FS,a[2]); print a[2]}' file
cSOMESTRINGdSOMESTRINGe
iSOMESTRINGjSOMESTRINGk
If both the FS and the RS are multi-character then there are various options, but the simplest is to use the NUL character, or some other character you know can't appear in your input file, as the temporary replacement FS:
$ awk -F'SOMESTRING' '{gsub(FS,"\0")} match($0,/([^\0]\0){2}(([^\0]\0){2}[^\0])/,a){gsub("\0",FS,a[2]); print a[2]}' file
cSOMESTRINGdSOMESTRINGe
iSOMESTRINGjSOMESTRINGk
Obviously change FS to OFS in the final gsub()s above if desired.
If the FS is a regexp instead of a string and you want to retain the actual separators in the output, then you need GNU awk's 4th argument to split().
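For example, a sketch of that approach (gawk only): the 4th argument collects the separator strings that were matched, so they can be re-inserted between the fields you keep. Here seps[i] is the separator that appeared after flds[i]:
awk '{
    n = split($0, flds, FS, seps)   # split the record, capturing the separators
    out = flds[3]
    for (i = 4; i <= n; i++)
        out = out seps[i-1] flds[i] # re-join fields 3..n with their original separators
    print out
}' file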
If you don't mind normalizing the space, the most straightforward way is
$ awk '{$1=$2=""}1' | sed -r 's/^ +//'
in action
$ seq 11 40 | pr -6ts' ' | awk '{$1=$2=""}1' | sed -r 's/^ +//'
21 26 31 36
22 27 32 37
23 28 33 38
24 29 34 39
25 30 35 40
for the input
$ seq 11 40 | pr -6ts' '
11 16 21 26 31 36
12 17 22 27 32 37
13 18 23 28 33 38
14 19 24 29 34 39
15 20 25 30 35 40
To print from the third column to the end:
awk '{for(i=1;i<3;i++) $i=""; print $0}' filename
Note that blanking $1 and $2 leaves their field separators behind, so each output line starts with two spaces; pipe through sed 's/^ *//' to remove them, as in the earlier answer.

How do I create/sort a table containing a list of matched terms with their corresponding counts

I am having problems trying to create a table containing a master list of names that have been matched and counted in two separate groups.
The Input_list.txt contains a master list of names and looks like this:
-5S_rRNA
-7SK
-ABCA8
-AC002480.4
-AC002978.1
-RP11-129B22.2
These names have been grep'd and counted in two separate data groups, group1_data.txt and group2_data.txt, which look like this:
group1_data.txt
-5S_rRNA 20
-7SK 25
-AC002480.4 1
-AC002978.1 2
group2_data.txt
-5S_rRNA 1
-ABCA8 1
I would like to create a table that combines the master Input_list.txt with the matched names and corresponding counts from the two data files. Where there is no match, I would like a value of 0, like this:
Input group1 group2
5S_rRNA 20 1
7SK 25 0
ABCA8 0 1
AC002480.4 1 0
AC002978.1 2 0
The number of matched names differs between Input_list.txt and the two data files.
I've tried sort but I'm really stuck. Any suggestions would be great!
Using join (-a 1 keeps every name from the master list even when it has no match, -e 0 fills the missing counts with 0, and the final sed drops names that matched neither group):
join -e 0 -a 1 -o '1.1 2.2' Input_list.txt group1_data.txt | \
join -a 1 -e 0 -o '1.1 1.2 2.2' - group2_data.txt | \
sed '/ 0 0$/d'
Prints:
-5S_rRNA 20 1
-7SK 25 0
-ABCA8 0 1
-AC002480.4 1 0
-AC002978.1 2 0
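Keep in mind that join expects both of its inputs to be sorted on the join field. If your files aren't already sorted, run something like this first:
sort -o Input_list.txt Input_list.txt
sort -o group1_data.txt group1_data.txt
sort -o group2_data.txt group2_data.txt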

Using unix join -o to not print the common field

I am using
join -1 2 -2 2 file1.txt file2.txt > file3.txt
to join my two text files on their second column and write the result to file3.txt, which works perfectly. However, I do not want file3.txt to contain the common field. Googling and join's man page suggest that the -o format option can accomplish this, but how exactly should I use it?
Assuming that each file only has two columns, and you want to join on the second column but show only the first column of each file in your output, use
join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt > file3.txt
Remember that your two files should be sorted on the second column before joining.
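For example, to sort both files on the second column in place (a sketch; adjust the filenames to your case):
sort -k2,2 -o file1.txt file1.txt
sort -k2,2 -o file2.txt file2.txt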
An example run:
$ cat file1.txt
2 1
3 2
7 2
8 4
2 6
$ cat file2.txt
3 1
5 4
9 9
$ join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt
2 3
8 5
