Comparing more than 2 files in Unix

I have 4 files I need to compare, and I want the output to display the numbers in file X that are not present in the other files.
Each file consists of a single column, with one number per row.
I can't use the diff command because it is not available on my system.
I tried comm, but the output was wrong every time.
Thanks for your help.

It's actually not an easy task. You can try this. It uses GNU awk (ENDFILE and ARGIND are gawk extensions).
gawk 'function report() { printf("File \"%s\" has different number of lines: %d\nIt should be %d\n", FILENAME, FNR, o); exit } NR == FNR { a[NR] = $0; ++o; next } FNR > o { report() } a[FNR] != $0 { printf("Line %d of file \"%s\" is different from line in file \"%s\": %s\nIt should be: %s\n", FNR, FILENAME, ARGV[1], $0, a[FNR]); exit } ENDFILE { if (ARGIND > 1 && FNR != o) report() }' file1 file2 file3
Script version:
#!/usr/bin/gawk -f
function report() {
    printf("File \"%s\" has different number of lines: %d\nIt should be %d\n",
           FILENAME, FNR, o)
    exit 1
}
NR == FNR {
    a[NR] = $0
    ++o
    next
}
FNR > o {
    report()
}
a[FNR] != $0 {
    printf("Line %d of file \"%s\" is different from line in file \"%s\": %s\nIt should be: %s\n",
           FNR, FILENAME, ARGV[1], $0, a[FNR])
    exit 1
}
ENDFILE {
    if (ARGIND > 1 && FNR != o)
        report()
}
Usage:
gawk -f script.awk file1 file2 file3 ...

Suppose we have 3 files, N1, N2, and N3, to compare.
Concatenate all the values from the files into a new one called out, removing duplicates. Then compare each file with the main one, out, and write the values missing from that file to a Missing file. Finally, concatenate the Missing values into one file, sort them, and remove duplicates.
cat N1 N2 N3 | sort -u > out
comm -2 -3 out <(sort N1) > Missing1
comm -2 -3 out <(sort N2) > Missing2
comm -2 -3 out <(sort N3) > Missing3
cat Missing1 Missing2 Missing3 | sort -u > Missing.txt
Thanks for the help :)
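If comm keeps misbehaving, a single awk pass can also answer the original question directly: print each number that occurs in exactly one of the files, together with the file it came from. This is only a sketch; the file names and contents below are invented for illustration.

```shell
# Toy data: four single-column files (contents made up for the demo).
printf '1\n2\n3\n' > N1
printf '1\n2\n4\n' > N2
printf '1\n2\n'    > N3
printf '2\n1\n'    > N4

# Count, for each distinct number, how many files contain it (the seen[]
# guard ignores duplicates within one file); a count of 1 means the number
# is unique to the file remembered in owner[].
awk '!seen[FILENAME, $1]++ { cnt[$1]++; owner[$1] = FILENAME }
     END { for (v in cnt) if (cnt[v] == 1) print owner[v], v }' N1 N2 N3 N4 | sort
# N1 3
# N2 4
```

Unlike comm, this needs no pre-sorting and handles any number of files in one pass.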

If you have vim installed, then issue this command:
vim -d file1 file2 file3 ...

Related

Explain the AWK syntax in detail (NR, FNR, NF)

I am learning file comparison using awk.
I found a script like the one below:
awk '
(NR == FNR) {
    s[$0]
    next
}
{
    for (i = 1; i <= NF; i++)
        if ($i in s)
            delete s[$i]
}
END {
    for (i in s)
        print i
}' tests tests2
I couldn't understand the syntax. Can you please explain it in detail?
What exactly does it do?
awk '                         # use awk
(NR == FNR) {                 # process the first file
    s[$0]                     # hash the whole record into array s
    next                      # move on to the next record of the first file
}
{                             # process the second file
    for (i = 1; i <= NF; i++) # for each field in the record
        if ($i in s)          # if the field value is found in the hash
            delete s[$i]      # delete the value from the hash
}
END {                         # after processing both files
    for (i in s)              # all leftover values in s
        print i               # are output
}' tests tests2
For example, for files:
tests:
1
2
3
tests2:
1 2
4 5
program would output:
3
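As a side note on the variables themselves: NR is the record (line) number counted across all input files, FNR restarts at 1 for each file, and NF is the number of fields on the current record. A quick experiment with the same two files makes the NR == FNR trick visible; the condition holds only while the first file is being read:

```shell
# Recreate the sample files and print the bookkeeping variables per record.
printf '1\n2\n3\n' > tests
printf '1 2\n4 5\n' > tests2
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, "NF=" NF }' tests tests2
# tests NR=1 FNR=1 NF=1
# tests NR=2 FNR=2 NF=1
# tests NR=3 FNR=3 NF=1
# tests2 NR=4 FNR=1 NF=2
# tests2 NR=5 FNR=2 NF=2
```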

Compare 2 lines and print column numbers if matching and NULL if not

I need your help with the scenario below.
File 1: Is a config file
a|b|c|d|e|f|g
File 2: Input file
a|c|d|g
I have to compare "File 1" and "File 2" and print the following from file 2
a||c|d|||g
So basically I need to compare both records; for the matching fields I print the value from File 2, and where there is no match I put NULLs.
I think I can interpret what you're asking: you want to "remap" File2 onto the headers of File1.
$ cat File1
first|last|email|phone|address|colour|size
$ cat File2
first|last|phone|size
Glenn|Jackman|555-555-1212|L
$ awk -F '|' '
NR == FNR {
    n = NF
    for (i = 1; i <= NF; i++) head[i] = $i
    print
    next
}
FNR == 1 {
    for (i = 1; i <= NF; i++) f2head[i] = $i
    next
}
{
    for (i = 1; i <= NF; i++) data[f2head[i]] = $i
    for (i = 1; i <= n; i++)
        printf "%s%s", data[head[i]], (i < n ? FS : RS)
}
' File1 File2
first|last|email|phone|address|colour|size
Glenn|Jackman||555-555-1212|||L
From the first block, the head array will be:
{1:"first",2:"last",3:"email",4:"phone",5:"address",6:"colour",7:"size"}
From the second block, the f2head array will be:
{1:"first",2:"last",3:"phone",4:"size"}
In the third block, the data array will be
{"first":"Glenn","last":"Jackman","phone":"555-555-1212","size":"L"}
another awk
$ awk 'BEGIN { FS = OFS = "|" }
NR == 1 {
    n = split($0, ht)
    for (i = 1; i <= n; i++) h[ht[i]] = i
    next
}
NR == 2 { n2 = split($0, h2) }
{
    split($0, t)
    $0 = ""
    for (i = 1; i <= n2; i++) $(h[h2[i]]) = t[i]
}1' file1 file2
It seems cryptic, but in essence it does the following:
find the index of each header2 value (h2) in header1 (h) and set the corresponding field ($(h[h2[i]])) to the value from file2 (t[i]).

How to replace the second occurrence of a pattern in a unix file

I want to replace the second occurrence of a pattern in unix.
Input File:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12
1245|425345|TaskID|dksj|kdjfdsjf|TaskID|12
1234|25345|TaskID|dksj|TaskID|kdjfdsjf|12|TaskID
123425|65345|TaskID|dksj|kdjfdsjf|12|TaskID
123425|15325|TaskID|dksj|kdjfdsjf|12
Sample Output file:-
12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
1245|425345|TaskID2|dksj|kdjfdsjf|TaskID3|12
1234|25345|TaskID|dksj|TaskID1|kdjfdsjf|12|TaskID2
123425|65345|TaskID3|dksj|kdjfdsjf|12|TaskID4
123425|15325|TaskID|dksj|kdjfdsjf|12
Your example does not match your question,
so I'll only show how to replace every second match of the given pattern.
Use awk; it's a very powerful tool for command-line text processing.
replace.sh is as follows:
#!/bin/sh
awk -v search="$1" -v repl="$2" '
BEGIN {
    flag = 0
}
{
    split($0, a, search)
    len = length(a)
    for (f = 1; f < len; f += 1) {
        printf "%s%s", a[f], (flag % 2 == 0 ? search : repl)
        flag += 1
    }
    printf "%s%s", a[len], ORS
}
'
cat input.txt | ./replace.sh TaskID TaskID1
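If all you need is to replace the second occurrence on each line (leaving the others alone), sed's numeric substitution flag is a simpler alternative. A quick sketch using two lines from the sample input:

```shell
# s///2 substitutes only the 2nd match on each line; lines with fewer
# than two matches pass through unchanged.
printf '12345|45345|TaskID|dksj|kdjfdsjf|TaskID|12\n123425|15325|TaskID|dksj|kdjfdsjf|12\n' |
sed 's/TaskID/TaskID1/2'
# 12345|45345|TaskID|dksj|kdjfdsjf|TaskID1|12
# 123425|15325|TaskID|dksj|kdjfdsjf|12
```

Note this numbers matches per line, so it cannot by itself produce the running TaskID1, TaskID2, ... numbering shown in the sample output.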

How to match a list of strings in two different files using a loop structure?

I have a file-processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
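First, a note on the error message itself: read expects a variable name, and matched_sequences.list is not a valid identifier (names can't contain dots). The usual shape for reading a file line by line is a sketch like this (the printf body is just a placeholder):

```shell
# 'read' takes a variable NAME, not a file name; redirect the file instead.
printf 'alpha\nbeta\n' > matched_sequences.list   # toy contents for the demo
while read -r line; do
    printf 'line: %s\n' "$line"
done < matched_sequences.list
# line: alpha
# line: beta
```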
#!/usr/bin/awk -f
function chomp(s) {
    sub(/^[ \t]*/, "", s)
    sub(/[ \t\r]*$/, "", s)
    return s
}
BEGIN {
    # the last argument is multiple_hits.list; read its IDs into array a
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++
    }
    RS = ""          # paragraph mode: each ">ID ..." block is one record
    FS = "\n"
    ORS = "\n\n"
}
{
    id = chomp($1)
    sub(/^.* /, "", id)   # keep the last word of the header line
}
!(id in a)                # print records whose ID is not in the exclude list
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script that first reads the file with the IDs to exclude and then the file containing the sequences. The script is as follows (comments make it long; in fact it's just three useful lines):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
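A quick sanity check of exclude.awk with toy data (the IDs and sequences here are invented, mirroring the formats in the question):

```shell
# The three useful lines of exclude.awk, written out without comments.
cat > exclude.awk <<'EOF'
FNR == NR { hits[$1] = 1 ; next }
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 }
grab_flag == 1 { print }
EOF
printf 'ID2\n' > multiple_hits.list
printf '>P001 ID1\nAAAA\n>P002 ID2\nCCCC\n>P003 ID3\nGGGG\n' > matched_sequences.list
awk -f exclude.awk multiple_hits.list matched_sequences.list
# >P001 ID1
# AAAA
# >P003 ID3
# GGGG
```

The record for ID2 (header plus sequence) is dropped, everything else is kept.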

While read line, awk $line and write to variable

I am trying to split a file into different smaller files depending on the value of the fifth field. A very nice way to do this was already suggested and also here.
However, I am trying to incorporate this into a .sh script for qsub, without much success.
The problem is that in the section where the file to which output the line is specified,
i.e., f = "Alignments_" $5 ".sam" print > f
, I need to pass a variable declared earlier in the script, which specifies the directory where the file should be written. I need to do this with a variable which is built for each task when I send out the array job for multiple files.
So say $output_path = ./Sample1
I need to write something like
f = $output_path "/Alignments_" $5 ".sam" print > f
But it does not seem to like having a $variable that is not a $field belonging to awk. I don't even think it likes having two "strings" before and after the $5.
The error I get back is that it takes the first line of the file to be split (little.sam) and tries to name f like that, followed by /Alignments_" $5 ".sam" (those last three put together correctly). It says, naturally, that it is too big a name.
How can I write this so it works?
Thanks!
awk -F '[:\t]' '       # read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    f = "Alignments_" $5 ".sam"
    print > f
}' Tile_Number_List.txt little.sam
UPDATE, after adding -v to awk and declaring the variable opath:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h $input | awk 'NR >= 18' | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
    num[$1]
    next
}
$5 in num {
    f = newdir"/Alignments_"$5".sam";
    print > f
}' Tile_Number_List.txt -
mkdir: created directory `little_TEST'
awk: cmd. line:10: (FILENAME=- FNR=1) fatal: can't redirect to `/Alignments_1101.sam' (Permission denied)
awk variables are like C variables - just reference them by name to get their value, no need to stick a "$" in front of them like you do with shell variables:
awk -F '[:\t]' '       # read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    output_path = "./Sample1/"
    f = output_path "Alignments_" $5 ".sam"
    print > f
}' Tile_Number_List.txt little.sam
To pass the value of the shell variable such as $output_path to awk you need to use the -v option.
$ output_path=./Sample1/
$ awk -F '[:\t]' -v opath="$output_path" '
# read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    f = opath "Alignments_" $5 ".sam"
    print > f
}' Tile_Number_List.txt little.sam
Also, you still have the error from your previous question left in your script.
EDIT:
The awk variable created with -v is opath, but you use newdir. What you want is:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h "$input" | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
    num[$1]
    next
}
FNR >= 18 && $5 in num {
    f = opath "/Alignments_" $5 ".sam"   # <-- opath is the awk variable, not newdir
    print > f
}' Tile_Number_List.txt -
You should also make sure the NR >= 18 header skip applies to the streamed SAM input (as FNR >= 18 on the second input), not to Tile_Number_List.txt.
